linux/kernel
Michal Hocko 49550b6055 oom: add helpers for setting and clearing TIF_MEMDIE
This patchset addresses a race which was described in the changelog for
5695be142e ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

: PM freezer relies on having all tasks frozen by the time devices are
: getting frozen so that no task will touch them while they are getting
: frozen.  But OOM killer is allowed to kill an already frozen task in order
: to handle OOM situtation.  In order to protect from late wake ups OOM
: killer is disabled after all tasks are frozen.  This, however, still keeps
: a window open when a killed task didn't manage to die by the time
: freeze_processes finishes.

The original patch hasn't closed the race window completely because that
would require a more complex solution as it can be seen by this patchset.

The primary motivation was to close the race condition between OOM killer
and PM freezer _completely_.  As Tejun pointed out, even though the race
condition is unlikely the harder it would be to debug weird bugs deep in
the PM freezer when the debugging options are reduced considerably.  I can
only speculate what might happen when a task is still runnable
unexpectedly.

On a plus side and as a side effect the oom enable/disable has a better
(full barrier) semantic without polluting hot paths.

I have tested the series in KVM with 100M RAM:
- many small tasks (20M anon mmap) which are triggering OOM continually
- s2ram which resumes automatically is triggered in a loop
	echo processors > /sys/power/pm_test
	while true
	do
		echo mem > /sys/power/state
		sleep 1s
	done
- simple module which allocates and frees 20M in 8K chunks. If it sees
  freezing(current) then it tries another round of allocation before calling
  try_to_freeze
- debugging messages of PM stages and OOM killer enable/disable/fail added
  and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
  it wakes up waiters.
- rebased on top of the current mmotm which means some necessary updates
  in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
  I think this should be OK because __thaw_task shouldn't interfere with any
  locking down wake_up_process. Oleg?

As expected there are no OOM killed tasks after oom is disabled and
allocations requested by the kernel thread are failing after all the tasks
are frozen and OOM disabled.  I wasn't able to catch a race where
oom_killer_disable would really have to wait but I kinda expected the race
is really unlikely.

[  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
[  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
[  243.636072] (elapsed 2.837 seconds) done.
[  243.641985] Trying to disable OOM killer
[  243.643032] Waiting for concurent OOM victims
[  243.644342] OOM killer disabled
[  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
[  243.652983] Suspending console(s) (use no_console_suspend to debug)
[  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
[...]
[  243.992600] PM: suspend of devices complete after 336.667 msecs
[  243.993264] PM: late suspend of devices complete after 0.660 msecs
[  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
[  243.994717] ACPI: Preparing to enter system sleep state S3
[  243.994795] PM: Saving platform NVS memory
[  243.994796] Disabling non-boot CPUs ...

The first 2 patches are simple cleanups for OOM.  They should go in
regardless the rest IMO.

Patches 3 and 4 are trivial printk -> pr_info conversion and they should
go in ditto.

The main patch is the last one and I would appreciate acks from Tejun and
Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
task_lock where a look from Oleg would appreciated) but I am not so sure I
haven't screwed anything in the freezer code.  I have found several
surprises there.

This patch (of 5):

This patch is just a preparatory and it doesn't introduce any functional
change.

Note:
I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
wait for the oom victim and to prevent from new killing. This is
just a side effect of the flag. The primary meaning is to give the oom
victim access to the memory reserves and that shouldn't be necessary
here.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-11 17:06:03 -08:00
..
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-01-27 13:55:36 -08:00
configs x86: Add "make tinyconfig" to configure the tiniest possible kernel 2014-08-08 16:30:24 -07:00
debug Surprising number of fixes this merge window :( 2015-01-23 06:40:36 +12:00
events perf: Decouple unthrottling and rotating 2015-02-04 08:07:16 +01:00
gcov gcov: enable GCOV_PROFILE_ALL from ARCH Kconfigs 2014-12-13 12:42:51 -08:00
irq genirq: Prevent proc race against freeing of irq descriptors 2014-12-13 13:33:07 +01:00
livepatch livepatch: add missing newline to error message 2015-02-06 21:28:35 +01:00
locking Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
power ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
printk This code is a fork from the trace-3.19 pull as it needed the trace_seq 2014-12-13 14:04:41 -08:00
rcu Merge branches 'doc.2015.01.07a', 'fixes.2015.01.15a', 'preempt.2015.01.06a', 'srcu.2015.01.06a', 'stall.2015.01.16a' and 'torture.2015.01.11a' into HEAD 2015-01-15 23:34:34 -08:00
sched Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 16:06:06 -08:00
time Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next 2015-02-10 20:01:30 -08:00
trace ACPI and power management updates for v3.20-rc1 2015-02-10 15:09:41 -08:00
.gitignore
acct.c acct: eliminate compile warning 2014-10-09 22:26:04 -04:00
async.c kernel/async.c: switch to pr_foo() 2014-10-09 22:26:04 -04:00
audit_tree.c fsnotify: unify inode and mount marks handling 2014-12-13 12:42:53 -08:00
audit_watch.c audit: invalid op= values for rules 2014-09-23 16:37:53 -04:00
audit.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2014-12-30 10:45:47 -08:00
audit.h audit: reduce scope of audit_log_fcaps 2014-09-23 16:37:51 -04:00
auditfilter.c Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2014-12-23 18:13:16 -08:00
auditsc.c Merge branch 'upstream' of git://git.infradead.org/users/pcmoore/audit 2014-12-31 14:52:18 -08:00
backtracetest.c kernel/backtracetest.c: replace no level printk by pr_info() 2014-06-04 16:54:14 -07:00
bounds.c page-cgroup: get rid of NR_PCG_FLAGS 2014-08-08 15:57:18 -07:00
capability.c CAPABILITIES: remove undefined caps from all processes 2014-07-24 21:53:47 +10:00
cgroup_freezer.c cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes 2014-07-15 11:05:09 -04:00
cgroup.c cgroup: prevent mount hang due to memory controller lifetime 2015-01-22 10:26:43 -05:00
compat.c compat: nanosleep: Clarify error handling 2014-09-06 12:58:18 +02:00
configs.c
context_tracking.c sched: stop the unbound recursion in preempt_schedule_context() 2014-10-28 10:46:05 +01:00
cpu_pm.c
cpu.c hotplugcpu: Avoid deadlocks by waking active_writer 2015-01-06 11:01:14 -08:00
cpuset.c Merge branch 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup 2014-12-11 18:57:19 -08:00
crash_dump.c crash_dump: Make is_kdump_kernel() accessible from modules 2014-08-25 15:42:19 -07:00
cred.c
delayacct.c delayacct: Remove braindamaged type conversions 2014-07-23 10:18:06 -07:00
dma.c
elfcore.c
exec_domain.c kernel/exec_domain.c: code clean-up 2014-06-04 16:54:15 -07:00
exit.c oom: add helpers for setting and clearing TIF_MEMDIE 2015-02-11 17:06:03 -08:00
extable.c ftrace/x86/extable: Add is_ftrace_trampoline() function 2014-11-19 15:25:26 -05:00
fork.c rmap: drop support of non-linear mappings 2015-02-10 14:30:31 -08:00
freezer.c freezer: remove obsolete comments in __thaw_task() 2014-10-21 23:44:20 +02:00
futex_compat.c
futex.c futex: Fix argument handling in futex_lock_pi() calls 2015-01-19 12:05:32 +01:00
groups.c userns: Don't allow setgroups until a gid mapping has been setablished 2014-12-09 16:58:40 -06:00
hung_task.c kernel/hung_task.c: convert simple_strtoul to kstrtouint 2014-06-04 16:54:15 -07:00
irq_work.c percpu: Convert remaining __get_cpu_var uses in 3.18-rcX 2014-10-29 11:18:18 -04:00
jump_label.c
kallsyms.c kernel/kallsyms.c: use __seq_open_private() 2014-10-14 02:18:16 +02:00
kcmp.c kcmp: fix standard comparison bug 2014-09-10 15:42:12 -07:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks locking/mcs: Better differentiate between MCS variants 2015-01-14 15:07:32 +01:00
Kconfig.preempt
kexec.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2015-02-10 18:57:15 -08:00
kmod.c usermodehelper: kill the kmod_thread_locker logic 2014-12-10 17:41:17 -08:00
kprobes.c module: remove mod arg from module_free, rename module_memfree(). 2015-01-20 11:38:33 +10:30
ksysfs.c
kthread.c kernel/kthread.c: partial revert of 81c98869fa ("kthread: ensure locality of task_struct allocations") 2014-10-09 22:25:51 -04:00
latencytop.c kernel/latencytop.c: convert seq_printf to seq_puts 2014-06-04 16:54:15 -07:00
Makefile livepatch: kernel: add support for live patching 2014-12-22 15:40:49 +01:00
module_signing.c
module-internal.h
module.c module: make module_refcount() a signed integer. 2015-01-22 11:15:54 +10:30
notifier.c rcu: Make SRCU optional by using CONFIG_SRCU 2015-01-06 11:04:29 -08:00
nsproxy.c bury struct proc_ns in fs/proc 2014-12-04 14:34:54 -05:00
padata.c
panic.c livepatch: kernel: add TAINT_LIVEPATCH 2014-12-22 15:40:48 +01:00
params.c param: fix uninitialized read with CONFIG_DEBUG_LOCK_ALLOC 2015-01-20 11:38:31 +10:30
pid_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
pid.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2014-12-16 15:53:03 -08:00
profile.c kernel/profile.c: use static const char instead of static char 2014-06-06 16:08:13 -07:00
ptrace.c exit: ptrace: shift "reap dead" code from exit_ptrace() to forget_original_parent() 2014-12-10 17:41:10 -08:00
range.c kernel: avoid overflow in cmp_range 2015-01-17 10:02:23 +13:00
reboot.c kernel: add support for kernel restart handler call chain 2014-09-26 00:00:06 -07:00
relay.c
resource.c resources: Move struct resource_list_entry from ACPI into resource core 2015-02-05 15:09:25 +01:00
seccomp.c Merge branch 'x86-seccomp-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-10-14 02:27:06 +02:00
signal.c Merge branch 'x86-mpx-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2014-12-10 09:34:43 -08:00
smp.c Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2014-10-15 07:48:18 +02:00
smpboot.c smpboot: Add missing get_online_cpus() in smpboot_register_percpu_thread() 2015-01-23 11:33:51 +01:00
smpboot.h
softirq.c Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2015-02-09 15:24:03 -08:00
stacktrace.c stacktrace: introduce snprint_stack_trace for buffer output 2014-12-13 12:42:48 -08:00
stop_machine.c kernel/stop_machine.c: kernel-doc warning fix 2014-06-04 16:54:15 -07:00
sys_ni.c syscalls: implement execveat() system call 2014-12-13 12:42:51 -08:00
sys.c x86, mpx: Strictly enforce empty prctl() args 2015-01-22 21:11:06 +01:00
sysctl_binary.c kernel: add panic_on_warn 2014-12-10 17:41:10 -08:00
sysctl.c mm, hugetlb: remove unnecessary lower bound on sysctl handlers"? 2015-02-10 14:30:34 -08:00
system_certificates.S
system_keyring.c KEYS: validate certificate trust only with builtin keys 2014-07-17 09:35:17 -04:00
task_work.c
taskstats.c netlink: make nlmsg_end() and genlmsg_end() void 2015-01-18 01:03:45 -05:00
test_kprobes.c kernel/test_kprobes.c: use current logging functions 2014-08-08 15:57:18 -07:00
torture.c torture: Address race in module cleanup 2014-09-16 13:41:06 -07:00
tracepoint.c tracing: syscall_regfunc() should not skip kernel threads 2014-06-21 00:15:26 -04:00
tsacct.c sched: Make task->start_time nanoseconds based 2014-07-23 10:18:05 -07:00
uid16.c groups: Consolidate the setgroups permission checks 2014-12-05 17:19:27 -06:00
up.c
user_namespace.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
user-return-notifier.c scheduler: Replace __get_cpu_var with this_cpu_ptr 2014-08-26 13:45:45 -04:00
user.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2014-12-17 12:31:40 -08:00
utsname_sysctl.c sysctl: convert use of typedef ctl_table to struct ctl_table 2014-06-06 16:08:16 -07:00
utsname.c copy address of proc_ns_ops into ns_common 2014-12-04 14:34:47 -05:00
watchdog.c Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2014-10-15 07:48:18 +02:00
workqueue_internal.h
workqueue.c workqueue: fix subtle pool management issue which can stall whole worker_pool 2015-01-16 14:21:16 -05:00