linux/kernel
David Rientjes a63d83f427 oom: badness heuristic rewrite
This a complete rewrite of the oom killer's badness() heuristic which is
used to determine which task to kill in oom conditions.  The goal is to
make it as simple and predictable as possible so the results are better
understood and we end up killing the task which will lead to the most
memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's
rss and swap space is used instead.  This is a better indication of the
amount of memory that will be freeable if the oom killed task is chosen
and subsequently exits.  This helps specifically in cases where KDE or
GNOME is chosen for oom kill on desktop systems instead of a memory
hogging task.

The baseline for the heuristic is a proportion of memory that each task is
currently using in memory plus swap compared to the amount of "allowable"
memory.  "Allowable," in this sense, means the system-wide resources for
unconstrained oom conditions, the set of mempolicy nodes, the mems
attached to current's cpuset, or a memory controller's limit.  The
proportion is given on a scale of 0 (never kill) to 1000 (always kill),
roughly meaning that if a task has a badness() score of 500 that the task
consumes approximately 50% of allowable memory resident in RAM or in swap
space.

The proportion is always relative to the amount of "allowable" memory and
not the total amount of RAM systemwide so that mempolicies and cpusets may
operate in isolation; they shall not need to know the true size of the
machine on which they are running if they are bound to a specific set of
nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory()
provides in LSMs.  In the event of two tasks consuming similar amounts of
memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also
necessary to introduce a new user interface to tune it.  It's not possible
to redefine the meaning of /proc/pid/oom_adj with a new scale since the
ABI cannot be changed for backward compatability.  Instead, a new tunable,
/proc/pid/oom_score_adj, is added that ranges from -1000 to +1000.  It may
be used to polarize the heuristic such that certain tasks are never
considered for oom kill while others may always be considered.  The value
is added directly into the badness() score so a value of -500, for
example, means to discount 50% of its memory consumption in comparison to
other tasks either on the system, bound to the mempolicy, in the cpuset,
or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the
units used by /proc/pid/oom_score_adj, and vice versa.  Changing one of
these per-task tunables will rescale the value of the other to an
equivalent meaning.  Although /proc/pid/oom_adj was originally defined as
a bitshift on the badness score, it now shares the same linear growth as
/proc/pid/oom_score_adj but with different granularity.  This is required
so the ABI is not broken with userspace applications and allows oom_adj to
be deprecated for future removal.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09 20:45:02 -07:00
..
debug Merge branch 'timers-timekeeping-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 13:18:29 -07:00
gcov microblaze: Enable GCOV_PROFILE_ALL 2009-09-21 14:29:21 +02:00
irq irq: Add new IRQ flag IRQF_NO_SUSPEND 2010-07-29 13:24:57 +02:00
power Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2010-08-07 12:42:58 -07:00
time Merge branch 'timers-timekeeping-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 13:18:29 -07:00
trace Merge branch 'bkl/core' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing 2010-08-07 17:06:54 -07:00
.gitignore
acct.c Merge branch 'next' into for-linus 2010-05-18 08:57:00 +10:00
async.c async: use workqueue for worker pool 2010-07-14 11:29:46 +02:00
audit_tree.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
audit_watch.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
audit.c drop_monitor: convert some kfree_skb call sites to consume_skb 2010-07-20 13:28:05 -07:00
audit.h
auditfilter.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
auditsc.c audit: preface audit printk with audit 2010-04-05 13:19:45 -07:00
backtracetest.c
bounds.c kbuild: move bounds.h to include/generated 2009-12-12 13:08:14 +01:00
capability.c sched: Remove remaining USER_SCHED code 2010-04-02 20:12:00 +02:00
cgroup_freezer.c Freezer / cgroup freezer: Update stale locking comments 2010-05-10 23:18:47 +02:00
cgroup.c cgroupfs: create /sys/fs/cgroup to mount cgroupfs on 2010-08-05 13:53:35 -07:00
compat.c cpumask: fix compat getaffinity 2010-05-19 11:48:18 -07:00
configs.c
cpu.c sched: adjust when cpu_active and cpuset configurations are updated during cpu on/offlining 2010-06-08 21:40:36 +02:00
cpuset.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 09:39:22 -07:00
cred.c CRED: Fix get_task_cred() and task_state() to not resurrect dead credentials 2010-07-29 15:16:17 -07:00
delayacct.c headers: taskstats_kern.h trim 2009-09-18 09:48:52 -07:00
dma.c
early_res.c kmemleak: Add support for NO_BOOTMEM configurations 2010-07-19 11:54:15 +01:00
elfcore.c elf coredump: add extended numbering support 2010-03-06 11:26:46 -08:00
exec_domain.c sys_personality: change sys_personality() to accept "unsigned int" instead of u_long 2010-06-04 15:21:45 -07:00
exit.c proc: turn signal_struct->count into "int nr_threads" 2010-05-27 09:12:47 -07:00
extable.c
fork.c oom: badness heuristic rewrite 2010-08-09 20:45:02 -07:00
freezer.c
futex_compat.c futex: Protect pid lookup in compat code with RCU 2009-12-09 14:22:14 +01:00
futex.c futex: futex_find_get_task remove credentails check 2010-06-30 15:43:44 -07:00
groups.c security: remove dead hook task_setgroups 2010-04-12 12:19:18 +10:00
hrtimer.c Merge branch 'timers-timekeeping-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 13:18:29 -07:00
hung_task.c softlockup: Fix hung_task_check_count sysctl 2009-11-27 06:21:57 +01:00
hw_breakpoint.c Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 09:30:52 -07:00
itimer.c itimers: Fix racy writes to cpu_itimer fields 2009-11-18 16:32:12 +01:00
kallsyms.c kdb: core for kgdb back end (2 of 2) 2010-05-20 21:04:21 -05:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks mutex: Better control mutex adaptive spinning config 2009-12-03 11:50:11 +01:00
Kconfig.preempt
kexec.c kexec: fix Oops in crash_shrink_memory() 2010-06-29 15:29:31 -07:00
kfifo.c kfifo: Don't use integer as NULL pointer 2010-02-16 15:11:08 -08:00
kmod.c call_usermodehelper: UMH_WAIT_EXEC ignores kernel_thread() failure 2010-05-27 09:12:45 -07:00
kprobes.c kprobes: Move enable/disable_kprobe() out from debugfs code 2010-05-08 18:08:30 +02:00
ksysfs.c sysfs: add struct file* to bin_attr callbacks 2010-05-21 09:37:31 -07:00
kthread.c kthread: implement kthread_data() 2010-06-29 10:07:09 +02:00
latencytop.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
lockdep_internals.h lockdep: No need to disable preemption in debug atomic ops 2010-05-04 05:38:16 +02:00
lockdep_proc.c lockstat: Make lockstat counting per cpu 2010-04-06 00:15:37 +02:00
lockdep_states.h
lockdep.c sched_clock: Add local_clock() API and improve documentation 2010-06-09 10:34:49 +02:00
Makefile Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq 2010-08-07 12:42:58 -07:00
module.c module: cleanup comments, remove noinline 2010-08-05 12:59:13 +09:30
mutex-debug.c headers: remove sched.h from interrupt.h 2009-10-11 11:20:58 -07:00
mutex-debug.h locking: Implement new raw_spinlock 2009-12-14 23:55:32 +01:00
mutex.c mutex: Fix optimistic spinning vs. BKL 2010-05-19 08:18:44 +02:00
mutex.h
notifier.c sched: Use lockdep-based checking on rcu_dereference() 2010-02-25 10:34:26 +01:00
ns_cgroup.c cgroups: let ss->can_attach and ss->attach do whole threadgroups at a time 2009-09-24 07:20:58 -07:00
nsproxy.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
padata.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 2010-08-04 15:23:14 -07:00
panic.c panic: call console_verbose() in panic 2010-05-27 09:12:53 -07:00
params.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2010-03-12 16:04:50 -08:00
perf_event.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 09:39:22 -07:00
pid_namespace.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
pid.c pids: increase pid_max based on num_possible_cpus 2010-05-27 09:12:51 -07:00
pm_qos_params.c pm_qos: Get rid of the allocation in pm_qos_add_request() 2010-07-19 02:00:34 +02:00
posix-cpu-timers.c sched: Fix the racy usage of thread_group_cputimer() in fastpath_timer_check() 2010-06-18 10:46:57 +02:00
posix-timers.c posix_timer: Move copy_to_user(created_timer_id) down in timer_create() 2010-07-23 15:08:12 +02:00
printk.c printk: fix delayed messages from CPU hotplug events 2010-08-05 13:25:59 +01:00
profile.c numa: in-kernel profiling: use cpu_to_mem() for per cpu allocations 2010-05-27 09:12:57 -07:00
ptrace.c ptrace: PTRACE_GETFDPIC: fix the unsafe usage of child->mm 2010-05-27 09:12:44 -07:00
range.c x86: Change range end to start+size 2010-02-10 17:47:17 -08:00
rcupdate.c tree/tiny rcu: Add debug RCU head objects 2010-06-14 16:37:26 -07:00
rcutiny_plugin.h rcu: slim down rcutiny by removing rcu_scheduler_active and friends 2010-05-10 11:08:34 -07:00
rcutiny.c tree/tiny rcu: Add debug RCU head objects 2010-06-14 16:37:26 -07:00
rcutorture.c sched_clock: Add local_clock() API and improve documentation 2010-06-09 10:34:49 +02:00
rcutree_plugin.h rcu: remove all rcu head initializations, except on_stack initializations 2010-05-11 16:10:47 -07:00
rcutree_trace.c rcu: reduce the number of spurious RCU_SOFTIRQ invocations 2010-05-10 11:08:35 -07:00
rcutree.c tree/tiny rcu: Add debug RCU head objects 2010-06-14 16:37:26 -07:00
rcutree.h rcu: reduce the number of spurious RCU_SOFTIRQ invocations 2010-05-10 11:08:35 -07:00
relay.c kernel/: convert cpu notifier to return encapsulate errno value 2010-05-27 09:12:48 -07:00
res_counter.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
resource.c resource: shared I/O region support 2010-05-11 12:01:10 -07:00
rtmutex_common.h
rtmutex-debug.c sched: Convert pi_lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex-debug.h
rtmutex-tester.c
rtmutex.c rtmutes: Convert rtmutex.lock to raw_spinlock 2009-12-14 23:55:33 +01:00
rtmutex.h
rwsem.c
sched_clock.c sched_clock: Add local_clock() API and improve documentation 2010-06-09 10:34:49 +02:00
sched_cpupri.c sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_cpupri.h sched: No need for bootmem special cases 2010-07-17 12:06:22 +02:00
sched_debug.c sched: Use correct macro to display sched_child_runs_first in /proc/sched_debug 2010-07-21 21:46:12 +02:00
sched_fair.c Merge branch 'linus' into sched/core 2010-07-21 21:45:08 +02:00
sched_features.h sched: Remove ASYM_GRAN feature 2010-03-11 18:32:53 +01:00
sched_idletask.c sched: Cure load average vs NO_HZ woes 2010-04-23 11:02:02 +02:00
sched_rt.c sched: task_tick_rt: Remove the obsolete ->signal != NULL check 2010-06-18 10:46:56 +02:00
sched_stats.h sched: Remove the obsolete exit_state/signal hacks 2010-06-18 10:46:56 +02:00
sched.c Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 09:39:22 -07:00
seccomp.c
semaphore.c
signal.c CRED: Fix RCU warning due to previous patch fixing __task_cred()'s checks 2010-08-04 11:17:10 -07:00
smp.c kernel/: convert cpu notifier to return encapsulate errno value 2010-05-27 09:12:48 -07:00
softirq.c kernel/: fix BUG_ON checks for cpu notifier callbacks direct call 2010-06-04 15:21:45 -07:00
spinlock.c locking: Cleanup the name space completely 2009-12-14 23:55:33 +01:00
srcu.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
stacktrace.c
stop_machine.c sched: Make sure timers have migrated before killing the migration_thread 2010-05-31 08:37:44 +02:00
sys_ni.c Add generic sys_ipc wrapper 2010-03-12 15:52:32 -08:00
sys.c kmod: add init function to usermodehelper 2010-05-27 09:12:44 -07:00
sysctl_binary.c sysctl: don't use own implementation of hex_to_bin() 2010-05-25 08:07:05 -07:00
sysctl_check.c ipv4 05/05: add sysctl to accept packets with local source addresses 2009-12-03 12:14:38 -08:00
sysctl.c oom: move sysctl declarations to oom.h 2010-08-09 20:44:57 -07:00
taskstats.c include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h 2010-03-30 22:02:32 +09:00
test_kprobes.c
time.c time: Kill off CONFIG_GENERIC_TIME 2010-07-27 12:40:54 +02:00
timeconst.pl
timer.c Merge branch 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2010-08-06 13:12:36 -07:00
tracepoint.c tracing: Let tracepoints have data passed to tracepoint callbacks 2010-05-14 09:50:34 -04:00
tsacct.c mm: clean up mm_counter 2010-03-06 11:26:23 -08:00
uid16.c headers: utsname.h redux 2009-09-23 18:13:10 -07:00
up.c
user_namespace.c user_ns: Introduce user_nsmap_uid and user_ns_map_gid. 2010-06-16 14:55:34 -07:00
user-return-notifier.c core: Clean up user return notifers use of per_cpu 2009-12-02 10:22:59 +01:00
user.c sched: Remove a stale comment 2010-05-10 08:48:39 +02:00
utsname_sysctl.c sysctl kernel: Remove binary sysctl logic 2009-11-12 02:04:55 -08:00
utsname.c
wait.c
watchdog.c kernel/watchdog: Initialize 'result' 2010-07-07 08:46:42 +02:00
workqueue_sched.h workqueue: implement concurrency managed dynamic worker pool 2010-06-29 10:07:14 +02:00
workqueue.c workqueue: workqueue_cpu_callback() should be cpu_notifier instead of hotcpu_notifier 2010-08-09 11:50:34 +02:00