linux/mm
David Gibson 3935baa9bc [PATCH] hugepage: serialize hugepage allocation and instantiation
Currently, no lock or mutex is held between allocating a hugepage and
inserting it into the pagetables / page cache.  When we do go to insert the
page into pagetables or page cache, we recheck and may free the newly
allocated hugepage.  However, since the number of hugepages in the system
is strictly limited, and it's usualy to want to use all of them, this can
still lead to spurious allocation failures.

For example, suppose two processes are both mapping (MAP_SHARED) the same
hugepage file, large enough to consume the entire available hugepage pool.
If they race instantiating the last page in the mapping, they will both
attempt to allocate the last available hugepage.  One will fail, of course,
returning OOM from the fault and thus causing the process to be killed,
despite the fact that the entire mapping can, in fact, be instantiated.

The patch fixes this race by the simple method of adding a (sleeping) mutex
to serialize the hugepage fault path between allocation and insertion into
pagetables and/or page cache.  It would be possible to avoid the
serialization by catching the allocation failures, waiting on some
condition, then rechecking to see if someone else has instantiated the page
for us.  Given the likely frequency of hugepage instantiations, it seems
very doubtful it's worth the extra complexity.

This patch causes no regression on the libhugetlbfs testsuite, and one
test, which can trigger this race now passes where it previously failed.

Actually, the test still sometimes fails, though less often and only as a
shmat() failure, rather processes getting OOM killed by the VM.  The dodgy
heuristic tests in fs/hugetlbfs/inode.c for whether there's enough hugepage
space aren't protected by the new mutex, and would be ugly to do so, so
there's still a race there.  Another patch to replace those tests with
something saner for this reason as well as others coming...

Signed-off-by: David Gibson <dwg@au1.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-22 07:54:03 -08:00
..
bootmem.c [PATCH] FRV: Clean up bootmem allocator's page freeing algorithm 2006-01-06 08:33:26 -08:00
fadvise.c [PATCH] fadvise: return ESPIPE on FIFO/pipe 2006-01-08 20:14:00 -08:00
filemap_xip.c [PATCH] replace inode_update_time with file_update_time 2006-01-10 08:01:30 -08:00
filemap.c [PATCH] mm: make __put_page internal 2006-03-22 07:54:01 -08:00
filemap.h [PATCH] xip: reduce code duplication 2005-06-24 00:06:41 -07:00
fremap.c VM: add common helper function to create the page tables 2005-11-29 14:03:14 -08:00
highmem.c [PATCH] gfp_t: the rest 2005-10-28 08:16:51 -07:00
hugetlb.c [PATCH] hugepage: serialize hugepage allocation and instantiation 2006-03-22 07:54:03 -08:00
internal.h [PATCH] remove set_page_count() outside mm/ 2006-03-22 07:54:02 -08:00
Kconfig [PATCH] Swap Migration V5: Add CONFIG_MIGRATION for page migration support 2006-01-08 20:12:41 -08:00
madvise.c [PATCH] madvise MADV_DONTFORK/MADV_DOFORK 2006-02-14 16:09:34 -08:00
Makefile [PATCH] slob: introduce the SLOB allocator 2006-01-08 20:13:41 -08:00
memory_hotplug.c [PATCH] memory hotadd: pgdat->node_present_pages fix 2006-03-09 19:47:38 -08:00
memory.c [PATCH] mm: more CONFIG_DEBUG_VM 2006-03-22 07:54:02 -08:00
mempolicy.c [PATCH] mm: kill kmem_cache_t usage 2006-03-22 07:53:58 -08:00
mempool.c [PATCH] mm: kill kmem_cache_t usage 2006-03-22 07:53:58 -08:00
mincore.c [PATCH] freepgt: sys_mincore ignore FIRST_USER_PGD_NR 2005-04-19 13:29:20 -07:00
mlock.c [PATCH] move capable() to capability.h 2006-01-11 18:42:13 -08:00
mmap.c [PATCH] remove VM_DONTCOPY bogosities 2006-03-22 07:54:01 -08:00
mprotect.c [PATCH] Enable mprotect on huge pages 2006-03-22 07:54:03 -08:00
mremap.c [PATCH] move capable() to capability.h 2006-01-11 18:42:13 -08:00
msync.c [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem 2006-01-09 15:59:24 -08:00
nommu.c [PATCH] mm: nommu use compound pages 2006-03-22 07:54:01 -08:00
oom_kill.c [PATCH] out_of_memory() locking fix 2006-03-02 08:33:07 -08:00
page_alloc.c [PATCH] mm: prep_zero_page() in irq is a bug 2006-03-22 07:54:02 -08:00
page_io.c [PATCH] mm: split page table lock 2005-10-29 21:40:42 -07:00
page-writeback.c [PATCH] mm: dirty_exceeded speedup 2006-01-18 19:20:17 -08:00
pdflush.c [PATCH] Swap Migration V5: PF_SWAPWRITE to allow writing to swap 2006-01-08 20:12:41 -08:00
prio_tree.c Linux-2.6.12-rc2 2005-04-16 15:20:36 -07:00
readahead.c [PATCH] readahead: fix initial window size calculation 2006-03-22 07:54:03 -08:00
rmap.c [PATCH] mm: more CONFIG_DEBUG_VM 2006-03-22 07:54:02 -08:00
shmem.c [PATCH] shmem: inline to avoid warning 2006-03-22 07:54:02 -08:00
slab.c [PATCH] mm: nommu use compound pages 2006-03-22 07:54:01 -08:00
slob.c [PATCH] SLOB=y && SMP=y fix 2006-02-08 07:52:58 -08:00
sparse.c [PATCH] Change maxaligned_in_smp alignemnt macros to internodealigned_in_smp macros 2006-01-08 20:13:38 -08:00
swap_state.c [PATCH] Direct Migration V9: Avoid writeback / page_migrate() method 2006-02-01 08:53:17 -08:00
swap.c [PATCH] mm: less atomic ops 2006-03-22 07:53:57 -08:00
swapfile.c [PATCH] Direct Migration V9: remove_from_swap() to remove swap ptes 2006-02-01 08:53:16 -08:00
thrash.c [PATCH] temporarily disable swap token on memory pressure 2005-11-28 14:42:25 -08:00
tiny-shmem.c [PATCH] do_truncate() call fix in tiny-shmem.c 2006-01-12 09:08:49 -08:00
truncate.c [PATCH] mutex subsystem, semaphore to mutex: VFS, ->i_sem 2006-01-09 15:59:24 -08:00
util.c [PATCH] slob: introduce mm/util.c for shared functions 2006-01-08 20:13:41 -08:00
vmalloc.c [PATCH] kernel-doc: fix warnings in vmalloc.c 2005-11-07 07:53:56 -08:00
vmscan.c [PATCH] vmscan: emove obsolete checks from shrink_list() and fix unlikely in refill_inactive_zone() 2006-03-22 07:54:02 -08:00