linux/mm
Christoph Lameter dc85da15d4 [PATCH] NUMA policies in the slab allocator V2
This patch fixes a regression in 2.6.14 against 2.6.13 that causes an
imbalance in memory allocation during bootup.

The slab allocator in 2.6.13 is not numa aware and simply calls
alloc_pages().  This means that memory policies may control the behavior of
alloc_pages().  During bootup the memory policy is set to MPOL_INTERLEAVE
resulting in the spreading out of allocations during bootup over all
available nodes.  The slab allocator in 2.6.13 has only a single list of
slab pages.  As a result the per cpu slab cache and the spinlock controlled
page lists may contain slab entries from off node memory.  The slab
allocator in 2.6.13 makes no effort to discern the locality of an entry on
its lists.

The NUMA aware slab allocator in 2.6.14 controls locality of the slab pages
explicitly by calling alloc_pages_node().  The NUMA slab allocator manages
slab entries by having lists of available slab pages for each node.  The
per cpu slab cache can only contain slab entries associated with the node
local to the processor.  This guarantees that the default allocation mode
of the slab allocator always assigns local memory if available.

Setting MPOL_INTERLEAVE as a default policy during bootup has no effect
anymore.  In 2.6.14 all node unspecific slab allocations are performed on
the boot processor.  This means that most of key data structures are
allocated on one node.  Most processors will have to refer to these
structures making the boot node a potential bottleneck.  This may reduce
performance and cause unnecessary memory pressure on the boot node.

This patch implements NUMA policies in the slab layer.  There is the need
of explicit application of NUMA memory policies by the slab allcator itself
since the NUMA slab allocator does no longer let the page_allocator control
locality.

The check for policies is made directly at the beginning of __cache_alloc
using current->mempolicy.  The memory policy is already frequently checked
by the page allocator (alloc_page_vma() and alloc_page_current()).  So it
is highly likely that the cacheline is present.  For MPOL_INTERLEAVE
kmalloc() will spread out each request to one node after another so that an
equal distribution of allocations can be obtained during bootup.

It is not possible to push the policy check to lower layers of the NUMA
slab allocator since the per cpu caches are now only containing slab
entries from the current node.  If the policy says that the local node is
not to be preferred or forbidden then there is no point in checking the
slab cache or local list of slab pages.  The allocation better be directed
immediately to the lists containing slab entries for the allowed set of
nodes.

This way of applying policy also fixes another strange behavior in 2.6.13.
alloc_pages() is controlled by the memory allocation policy of the current
process.  It could therefore be that one process is running with
MPOL_INTERLEAVE and would f.e.  obtain a new page following that policy
since no slab entries are in the lists anymore.  A page can typically be
used for multiple slab entries but lets say that the current process is
only using one.  The other entries are then added to the slab lists.  These
are now non local entries in the slab lists despite of the possible
availability of local pages that would provide faster access and increase
the performance of the application.

Another process without MPOL_INTERLEAVE may now run and expect a local slab
entry from kmalloc().  However, there are still these free slab entries
from the off node page obtained from the other process via MPOL_INTERLEAVE
in the cache.  The process will then get an off node slab entry although
other slab entries may be available that are local to that process.  This
means that the policy if one process may contaminate the locality of the
slab caches for other processes.

This patch in effect insures that a per process policy is followed for the
allocation of slab entries and that there cannot be a memory policy
influence from one process to another.  A process with default policy will
always get a local slab entry if one is available.  And the process using
memory policies will get its memory arranged as requested.  Off-node slab
allocation will require the use of spinlocks and will make the use of per
cpu caches not possible.  A process using memory policies to redirect
allocations offnode will have to cope with additional lock overhead in
addition to the latency added by the need to access a remote slab entry.

Changes V1->V2
- Remove #ifdef CONFIG_NUMA by moving forward declaration into
  prior #ifdef CONFIG_NUMA section.

- Give the function determining the node number to use a saner
  name.

Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-18 19:20:18 -08:00
..
bootmem.c [PATCH] FRV: Clean up bootmem allocator's page freeing algorithm 2006-01-06 08:33:26 -08:00
fadvise.c [PATCH] fadvise: return ESPIPE on FIFO/pipe 2006-01-08 20:14:00 -08:00
filemap_xip.c [PATCH] replace inode_update_time with file_update_time 2006-01-10 08:01:30 -08:00
filemap.c [PATCH] mm: migration page refcounting fix 2006-01-18 19:20:17 -08:00
filemap.h [PATCH] xip: reduce code duplication 2005-06-24 00:06:41 -07:00
fremap.c VM: add common helper function to create the page tables 2005-11-29 14:03:14 -08:00
highmem.c [PATCH] gfp_t: the rest 2005-10-28 08:16:51 -07:00
hugetlb.c [PATCH] mm: make hugepages obey cpusets. 2006-01-08 20:12:43 -08:00
internal.h [PATCH] FRV: Clean up bootmem allocator's page freeing algorithm 2006-01-06 08:33:26 -08:00
Kconfig [PATCH] Swap Migration V5: Add CONFIG_MIGRATION for page migration support 2006-01-08 20:12:41 -08:00
madvise.c [PATCH] madvise(MADV_REMOVE): remove pages from tmpfs shm backing store 2006-01-06 08:33:22 -08:00
Makefile [PATCH] slob: introduce the SLOB allocator 2006-01-08 20:13:41 -08:00
memory_hotplug.c [PATCH] memhotplug: __add_section remove unused pgdat definition 2006-01-06 08:33:21 -08:00
memory.c [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem 2006-01-09 15:59:24 -08:00
mempolicy.c [PATCH] NUMA policies in the slab allocator V2 2006-01-18 19:20:18 -08:00
mempool.c [PATCH] gfp_t: mm/* (easy parts) 2005-10-28 08:16:47 -07:00
mincore.c [PATCH] freepgt: sys_mincore ignore FIRST_USER_PGD_NR 2005-04-19 13:29:20 -07:00
mlock.c [PATCH] move capable() to capability.h 2006-01-11 18:42:13 -08:00
mmap.c [PATCH] move capable() to capability.h 2006-01-11 18:42:13 -08:00
mprotect.c [PATCH] unpaged: private write VM_RESERVED 2005-11-22 09:13:42 -08:00
mremap.c [PATCH] move capable() to capability.h 2006-01-11 18:42:13 -08:00
msync.c [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem 2006-01-09 15:59:24 -08:00
nommu.c [PATCH] NOMMU: Make SYSV IPC SHM use ramfs facilities on NOMMU 2006-01-06 08:33:32 -08:00
oom_kill.c [PATCH] cpuset oom lock fix 2006-01-14 18:27:10 -08:00
page_alloc.c [PATCH] Zone reclaim: Reclaim logic 2006-01-18 19:20:17 -08:00
page_io.c [PATCH] mm: split page table lock 2005-10-29 21:40:42 -07:00
page-writeback.c [PATCH] mm: dirty_exceeded speedup 2006-01-18 19:20:17 -08:00
pdflush.c [PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swap 2006-01-08 20:12:41 -08:00
prio_tree.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
readahead.c [PATCH] add AOP_TRUNCATED_PAGE, prepend AOP_ to WRITEPAGE_ACTIVATE 2006-01-03 11:45:42 -08:00
rmap.c [PATCH] mm: migration page refcounting fix 2006-01-18 19:20:17 -08:00
shmem.c [PATCH] Add tmpfs options for memory placement policies 2006-01-14 18:27:07 -08:00
slab.c [PATCH] NUMA policies in the slab allocator V2 2006-01-18 19:20:18 -08:00
slob.c [PATCH] slob: introduce the SLOB allocator 2006-01-08 20:13:41 -08:00
sparse.c [PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp macros 2006-01-08 20:13:38 -08:00
swap_state.c [PATCH] SwapMig: add_to_swap() avoid atomic allocations 2006-01-08 20:12:42 -08:00
swap.c [PATCH] mm: migration page refcounting fix 2006-01-18 19:20:17 -08:00
swapfile.c [PATCH] sem2mutex: mm/slab.c 2006-01-18 19:20:18 -08:00
thrash.c [PATCH] temporarily disable swap token on memory pressure 2005-11-28 14:42:25 -08:00
tiny-shmem.c [PATCH] do_truncate() call fix in tiny-shmem.c 2006-01-12 09:08:49 -08:00
truncate.c [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem 2006-01-09 15:59:24 -08:00
util.c [PATCH] slob: introduce mm/util.c for shared functions 2006-01-08 20:13:41 -08:00
vmalloc.c [PATCH] kernel-doc: fix warnings in vmalloc.c 2005-11-07 07:53:56 -08:00
vmscan.c [PATCH] Zone reclaim: Reclaim logic 2006-01-18 19:20:17 -08:00