linux/arch/ia64/kernel
Rik van Riel 5beb493052 mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads.  Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes.  However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock.  This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands.  Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA.  At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated.  The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.

This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
 This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations.  This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures.  This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock.  To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag.  This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:

Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time.  The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06 11:26:26 -08:00
..
cpufreq [IA64] improper printk format in acpi-cpufreq 2008-07-17 11:11:17 -07:00
.gitignore [IA64] Cleanup generated file not ignored by .gitignore 2008-08-04 11:06:16 -07:00
acpi-ext.c
acpi.c Merge branch 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-03-03 08:15:37 -08:00
asm-offsets.c ia64/pv_ops/xen: paravirtualize read/write ar.itc and ar.itm 2009-03-26 10:50:32 -07:00
audit.c [IA64] Remove COMPAT_IA32 support 2010-02-08 10:42:17 -08:00
brl_emu.c
crash_dump.c kdump: make elfcorehdr_addr independent of CONFIG_PROC_VMCORE 2008-10-20 08:52:39 -07:00
crash.c sysctl: Drop & in front of every proc_handler. 2009-11-18 08:37:40 -08:00
cyclone.c clocksource: pass clocksource to read() callback 2009-04-21 13:41:47 -07:00
dma-mapping.c [IA64] Fix warning in dma-mapping.c 2009-09-02 09:12:21 -07:00
efi_stub.S
efi.c [IA64] Convert ia64 to use int-ll64.h 2009-06-17 09:33:49 -07:00
entry.h
entry.S [IA64] Remove COMPAT_IA32 support 2010-02-08 10:42:17 -08:00
err_inject.c sysdev: Pass the attribute to the low level sysdev show/store function 2008-07-21 21:55:02 -07:00
esi_stub.S
esi.c tree-wide: fix assorted typos all over the place 2009-12-04 15:39:55 +01:00
fsys.S ia64/pv_ops: paravirtualize mov = ar.itc. 2009-03-26 10:50:22 -07:00
fsyscall_gtod_data.h [IA64] generalize attribute of fsyscall_gtod_data 2008-02-04 15:36:36 -08:00
ftrace.c ftrace, ia64: IA64 dynamic ftrace support 2009-01-14 12:11:31 +01:00
gate-data.S
gate.lds.S ia64/pv_ops: gate page paravirtualization. 2009-03-26 10:51:02 -07:00
gate.S ia64/pv_ops: paravirtualize gate.S. 2009-03-26 11:01:46 -07:00
head.S percpu: make percpu symbols in ia64 unique 2009-10-29 22:34:14 +09:00
ia64_ksyms.c percpu: remove per_cpu__ prefix. 2009-10-29 22:34:15 +09:00
init_task.c Use new __init_task_data macro in arch init_task.c files. 2009-09-21 06:27:08 +02:00
iosapic.c genirq: Convert irq_desc.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
irq_ia64.c Merge branch 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/aegl/linux-2.6 2009-12-16 10:44:43 -08:00
irq_lsapic.c [IA64] remove obsolete hw_interrupt_type 2009-06-15 14:35:10 -07:00
irq.c genirq: Convert irq_desc.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
ivt.S [IA64] Remove COMPAT_IA32 support 2010-02-08 10:42:17 -08:00
jprobes.S [IA64] Move include/asm-ia64 to arch/ia64/include/asm 2008-08-01 10:21:21 -07:00
kprobes.c kprobes: Disable booster when CONFIG_PREEMPT=y 2010-02-04 09:36:18 +01:00
machine_kexec.c [IA64] kexec: Unregister MCA handler before kexec 2009-09-14 16:18:17 -07:00
machvec.c x86, ia64: convert to use generic dma_map_ops struct 2009-01-06 14:06:57 +01:00
Makefile [IA64] build arch/ia64/kernel/acpi-ext.o when CONFIG_ACPI 2010-02-25 15:16:38 -08:00
Makefile.gate kbuild: rename ld-option to cc-ldoption 2009-09-20 12:27:42 +02:00
mca_asm.S percpu: make percpu symbols in ia64 unique 2009-10-29 22:34:14 +09:00
mca_drv_asm.S [IA64] mca style cleanup 2008-02-04 15:42:06 -08:00
mca_drv.c CRED: Wrap task credential accesses in the IA64 arch 2008-11-14 10:38:37 +11:00
mca_drv.h [IA64] mca style cleanup 2008-02-04 15:42:06 -08:00
mca.c [IA64] __per_cpu_idtrs[] is a memory hog 2010-01-07 16:10:57 -08:00
minstate.h [IA64] pvops: paravirtualize minstate.h. 2008-05-27 15:02:17 -07:00
module.c [IA64] Convert ia64 to use int-ll64.h 2009-06-17 09:33:49 -07:00
msi_ia64.c [IA64] msi_ia64.c dmar_msi_type should be static 2009-06-15 14:35:54 -07:00
nr-irqs.c ia64/pv_ops/xen: define the nubmer of irqs which xen needs. 2008-10-17 10:06:59 -07:00
numa.c [IA64] Minimize per_cpu reservations. 2008-04-08 13:51:35 -07:00
pal.S
palinfo.c [IA64] Convert ia64 to use int-ll64.h 2009-06-17 09:33:49 -07:00
paravirt_inst.h ia64/pv_ops: paravirtualized instruction checker. 2008-10-17 10:12:54 -07:00
paravirt_patch.c ia64/pv_op/binarypatch: add helper functions to support binary patching for paravirt_ops. 2009-03-26 11:02:31 -07:00
paravirt_patchlist.c [IA64] Fix build error in paravirt_patchlist.c 2009-06-17 09:04:40 -07:00
paravirt_patchlist.h ia64/pv_ops: gate page paravirtualization. 2009-03-26 10:51:02 -07:00
paravirt.c ia64: remove some warnings. 2009-03-27 11:11:04 -07:00
paravirtentry.S ia64/pv_ops: implement binary patching optimization for native. 2009-03-26 11:02:42 -07:00
patch.c ia64: remove some warnings. 2009-03-27 11:11:04 -07:00
pci-dma.c Bug Fix arch/ia64/kernel/pci-dma.c: fix recursive dma_supported() call in iommu_dma_supported() 2009-08-11 14:52:10 -07:00
pci-swiotlb.c swiotlb: Defer swiotlb init printing, export swiotlb_print_info() 2009-11-10 12:32:00 +01:00
perfmon_default_smpl.c [IA64] remove remaining __FUNCTION__ occurrences 2008-03-06 09:19:27 -08:00
perfmon_generic.h
perfmon_itanium.h
perfmon_mckinley.h [IA64] spelling fixes: arch/ia64/ 2007-05-11 14:55:43 -07:00
perfmon_montecito.h [IA64] sparse cleanups 2006-12-07 10:48:19 -08:00
perfmon.c mm: change anon_vma linking to fix multi-process server scalability issue 2010-03-06 11:26:26 -08:00
process.c Pull alex into release branch 2010-02-26 12:04:22 -08:00
ptrace.c [IA64] Remove COMPAT_IA32 support 2010-02-08 10:42:17 -08:00
relocate_kernel.S percpu: make percpu symbols in ia64 unique 2009-10-29 22:34:14 +09:00
sal.c [IA64] Update check_sal_cache_flush to use platform_send_ipi() 2008-06-11 16:40:33 -07:00
salinfo.c [IA64] address compiler warnings perfmon.c/salinfo.c 2009-06-30 14:26:34 -07:00
setup.c [IA64] Remove COMPAT_IA32 support 2010-02-08 10:42:17 -08:00
sigframe.h [IA64] Add TIF_RESTORE_SIGMASK 2007-05-08 14:51:59 -07:00
signal.c [IA64] Remove COMPAT_IA32 support 2010-02-08 10:42:17 -08:00
smp.c ia64: convert last user of smp_call_function_mask 2009-09-24 09:34:40 +09:30
smpboot.c [IA64] Remove COMPAT_IA32 support 2010-02-08 10:42:17 -08:00
sys_ia64.c Get rid of open-coding in ia64_brk() 2009-12-11 06:44:58 -05:00
time.c clocksource: add argument to resume callback 2010-02-05 14:54:10 +01:00
topology.c ia64/topology.c: exit cache_add_dev when kobject_init_and_add fails 2009-08-11 14:52:11 -07:00
traps.c [IA64] Remove COMPAT_IA32 support 2010-02-08 10:42:17 -08:00
unaligned.c [IA64] use printk_once() unaligned.c/io_common.c 2009-10-13 12:48:25 -07:00
uncached.c Pull for-2.6.31 into release 2009-06-17 09:35:24 -07:00
unwind_decoder.c
unwind_i.h
unwind.c [IA64] Do not go beyond ARRAY_SIZE of unw.hash 2009-02-25 11:48:04 -08:00
vmlinux.lds.S ia64: allocate percpu area for cpu0 like percpu areas for other cpus 2009-10-02 13:28:56 +09:00