linux/kernel
KAMEZAWA Hiroyuki f0c0b2b808 change zonelist order: zonelist order selection logic
Make zonelist creation policy selectable from sysctl/boot option v6.

This patch makes NUMA's zonelist (of pgdat) order selectable.
Available order are Default(automatic)/ Node-based / Zone-based.

[Default Order]
The kernel selects Node-based or Zone-based order automatically.

[Node-based Order]
This policy treats the locality of memory as the most important parameter.
Zonelist order is created by each zone's locality. This means lower zones
(ex. ZONE_DMA) can be used before higher zone (ex. ZONE_NORMAL) exhausion.
IOW. ZONE_DMA will be in the middle of zonelist.
current 2.6.21 kernel uses this.

Pros.
 * A user can expect local memory as much as possible.
Cons.
 * lower zone will be exhansted before higher zone. This may cause OOM_KILL.

Maybe suitable if ZONE_DMA is relatively big and you never see OOM_KILL
because of ZONE_DMA exhaution and you need the best locality.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(0)'s DMA -> node(1)'s NORMAL.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

[Zone-based order]
This policy treats the zone type as the most important parameter.
Zonelist order is created by zone-type order. This means lower zone
never be used bofere higher zone exhaustion.
IOW. ZONE_DMA will be always at the tail of zonelist.

Pros.
 * OOM_KILL(bacause of lower zone) occurs only if the whole zones are exhausted.
Cons.
 * memory locality may not be best.

(example)
assume 2 node NUMA. node(0) has ZONE_DMA/ZONE_NORMAL, node(1) has ZONE_NORMAL.

*node(0)'s memory allocation order:

 node(0)'s NORMAL -> node(1)'s NORMAL -> node(0)'s DMA.

*node(1)'s memory allocation order:

 node(1)'s NORMAL -> node(0)'s NORMAL -> node(0)'s DMA.

bootoption "numa_zonelist_order=" and proc/sysctl is supporetd.

command:
%echo N > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Node-based order.

command:
%echo Z > /proc/sys/vm/numa_zonelist_order

Will rebuild zonelist in Zone-based order.

Thanks to Lee Schermerhorn, he gives me much help and codes.

[Lee.Schermerhorn@hp.com: add check_highest_zone to build_zonelists_in_zone_order]
[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andi Kleen <ak@suse.de>
Cc: "jesse.barnes@intel.com" <jesse.barnes@intel.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-16 09:05:35 -07:00
..
irq Fix crash with irqpoll due to the IRQF_IRQPOLL flag testing 2007-05-24 08:37:14 -07:00
power PM: introduce set_target method in pm_ops 2007-07-01 12:29:44 -07:00
time NTP: remove clock_was_set() call to prevent deadlock 2007-07-03 13:54:27 -07:00
.gitignore
acct.c [PATCH] kernel: change uses of f_{dentry, vfsmnt} to use f_path 2006-12-08 08:28:42 -08:00
audit.c audit: add spaces on either side of case "..." operator. 2007-05-08 11:15:09 -07:00
audit.h [PATCH] audit signal recipients 2007-05-11 05:38:25 -04:00
auditfilter.c audit: fix oops removing watch if audit disabled 2007-06-24 08:59:12 -07:00
auditsc.c [PATCH] Abnormal End of Processes 2007-05-11 05:38:26 -04:00
capability.c [PATCH] pid: replace do/while_each_task_pid with do/while_each_pid_task 2007-02-12 09:48:32 -08:00
compat.c signal/timer/event: timerfd compat code 2007-05-11 08:29:36 -07:00
configs.c use simple_read_from_buffer in kernel/ 2007-05-09 12:30:49 -07:00
cpu.c microcode: use suspend-related CPU hotplug notifications 2007-05-09 12:30:56 -07:00
cpuset.c cpuset: zero malloc - fix for old cpusets 2007-06-16 13:16:15 -07:00
delayacct.c sched: update delay-accounting to use CFS's precise stats 2007-07-09 18:52:00 +02:00
die_notifier.c move die notifier handling to common code 2007-05-08 11:15:04 -07:00
dma.c [PATCH] struct seq_operations and struct file_operations constification 2006-12-07 08:39:46 -08:00
exec_domain.c Remove obsolete #include <linux/config.h> 2006-06-30 19:25:36 +02:00
exit.c sched: update delay-accounting to use CFS's precise stats 2007-07-09 18:52:00 +02:00
extable.c
fork.c sched: update delay-accounting to use CFS's precise stats 2007-07-09 18:52:00 +02:00
futex_compat.c Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
futex.c FUTEX: Restore the dropped ERSCH fix 2007-06-24 12:08:53 -07:00
hrtimer.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
itimer.c The scheduled -EINVAL for invalid timevals in setitimer 2007-05-08 11:15:13 -07:00
kallsyms.c fix possible null ptr deref in kallsyms_lookup 2007-05-30 10:51:38 -07:00
Kconfig.hz [PATCH] HZ: 300Hz support 2006-12-07 08:39:36 -08:00
Kconfig.preempt Fix trivial typos in Kconfig* files 2007-05-09 07:12:20 +02:00
kexec.c kdump/kexec: calculate note size at compile time 2007-05-08 11:15:07 -07:00
kfifo.c [PATCH] Numerous fixes to kernel-doc info in source files. 2007-02-11 10:51:32 -08:00
kmod.c wait_for_helper: remove unneeded do_sigaction() 2007-05-09 12:30:53 -07:00
kprobes.c Kprobes: The ON/OFF knob thru debugfs 2007-05-08 11:15:19 -07:00
ksysfs.c remove "struct subsystem" as it is no longer needed 2007-05-02 18:57:59 -07:00
kthread.c freezer: fix kthread_create vs freezer theoretical race 2007-05-23 20:14:11 -07:00
latency.c [PATCH] severing module.h->sched.h 2006-12-04 02:00:22 -05:00
lockdep_internals.h [PATCH] lockdep: more chains 2006-12-07 08:39:43 -08:00
lockdep_proc.c [PATCH] remove many unneeded #includes of sched.h 2007-02-14 08:09:54 -08:00
lockdep.c lockdep: removed unused ip argument in mark_lock & mark_held_locks 2007-05-08 11:15:13 -07:00
Makefile move die notifier handling to common code 2007-05-08 11:15:04 -07:00
module.c sysfs: kill unnecessary attribute->owner 2007-07-11 16:09:06 -07:00
mutex-debug.c [PATCH] remove many unneeded #includes of sched.h 2007-02-14 08:09:54 -08:00
mutex-debug.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
mutex.c wrap access to thread_info 2007-05-09 12:30:56 -07:00
mutex.h [PATCH] lockdep: prove mutex locking correctness 2006-07-03 15:27:04 -07:00
nsproxy.c fix refcounting of nsproxy object when unshared 2007-06-24 08:59:10 -07:00
panic.c [PATCH] Add TAINT_USER and ability to set taint flags from userspace 2007-02-11 10:51:29 -08:00
params.c sysfs: kill unnecessary attribute->owner 2007-07-11 16:09:06 -07:00
pid.c statically initialize struct pid for swapper 2007-05-11 08:29:35 -07:00
posix-cpu-timers.c sched: make posix-cpu-timers use CFS's accounting information 2007-07-09 18:51:58 +02:00
posix-timers.c posix-timers: Prevent softirq starvation by small intervals and SIG_IGN 2007-06-21 15:57:04 -07:00
printk.c serial: convert early_uart to earlycon for 8250 2007-07-16 09:05:35 -07:00
profile.c Detach sched.h from mm.h 2007-05-21 09:18:19 -07:00
ptrace.c [PATCH] auditing ptrace 2007-05-11 05:38:25 -04:00
rcupdate.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
rcutorture.c rcutorture: Remove redundant assignment to cur_ops in for loop 2007-05-08 11:15:17 -07:00
relay.c relay: fixup kerneldoc comment 2007-07-13 14:14:28 +02:00
resource.c libata/IDE: remove combined mode quirk 2007-04-28 14:15:59 -04:00
rtmutex_common.h Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
rtmutex-debug.c Remove all inclusions of <linux/config.h> 2006-10-04 03:38:54 -04:00
rtmutex-debug.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
rtmutex-tester.c [PATCH] Add include/linux/freezer.h and move definitions from sched.h 2006-12-07 08:39:27 -08:00
rtmutex.c Revert "futex_requeue_pi optimization" 2007-06-18 09:48:41 -07:00
rtmutex.h [PATCH] lockdep: better lock debugging 2006-07-03 15:27:01 -07:00
rwsem.c Lockdep treats down_write_trylock like regular down_write 2007-05-08 11:15:09 -07:00
sched_debug.c [PATCH] sched: remove stale version info from kernel/sched_debug.c 2007-07-13 10:10:41 -07:00
sched_fair.c sched: cfs core, kernel/sched_fair.c 2007-07-09 18:51:58 +02:00
sched_idletask.c sched: cfs core, kernel/sched_idletask.c 2007-07-09 18:51:58 +02:00
sched_rt.c sched: cfs core, kernel/sched_rt.c 2007-07-09 18:51:58 +02:00
sched_stats.h sched: update delay-accounting to use CFS's precise stats 2007-07-09 18:52:00 +02:00
sched.c CFS: Fix missing digit off in wmult table 2007-07-13 16:45:43 -07:00
seccomp.c
signal.c Fix signalfd interaction with thread-private signals 2007-06-18 10:18:32 -07:00
softirq.c sched: do not set softirqs to nice +19 2007-07-09 18:52:00 +02:00
softlockup.c Add suspend-related notifications for CPU hotplug 2007-05-09 12:30:56 -07:00
spinlock.c [PATCH] lockdep: spin_lock_irqsave_nested() 2006-11-25 13:28:34 -08:00
srcu.c [PATCH] SRCU: report out-of-memory errors 2006-10-04 07:55:30 -07:00
stacktrace.c [PATCH] lockdep: stacktrace subsystem, core 2006-07-03 15:27:02 -07:00
stop_machine.c stop_machine() now uses hard_irq_disable 2007-05-11 08:29:34 -07:00
sys_ni.c compat signalfd and timerfd are cond syscalls 2007-05-12 10:55:40 -07:00
sys.c attach_pid() with struct pid parameter 2007-05-11 08:29:35 -07:00
sysctl.c change zonelist order: zonelist order selection logic 2007-07-16 09:05:35 -07:00
taskstats.c KMEM_CACHE(): simplify slab cache creation 2007-05-07 12:12:55 -07:00
time.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
timer.c NOHZ: prevent multiplication overflow - stop timer for huge timeouts 2007-05-29 18:11:10 -07:00
tsacct.c [PATCH] time: x86_64: split x86_64/kernel/time.c up 2007-02-16 08:14:00 -08:00
uid16.c header cleaning: don't include smp_lock.h when not used 2007-05-08 11:15:07 -07:00
user.c [PATCH] slab: remove kmem_cache_t 2006-12-07 08:39:25 -08:00
utsname_sysctl.c [PATCH] sysctl: remove insert_at_head from register_sysctl 2007-02-14 08:09:59 -08:00
utsname.c Merge sys_clone()/sys_unshare() nsproxy and namespace handling 2007-05-08 11:15:00 -07:00
wait.c Fix occurrences of "the the " 2007-05-09 08:57:56 +02:00
workqueue.c simplify cleanup_workqueue_thread() 2007-05-23 20:14:13 -07:00