linux

Michal Hocko 79dfdaccd1 memcg: make oom_lock 0 and 1 based rather than counter		2011-07-26 16:49:42 -07:00

Commit `867578cb` ("memcg: fix oom kill behavior") introduced an oom_lock counter which is incremented by mem_cgroup_oom_lock when we are about to handle a memcg OOM situation. mem_cgroup_handle_oom falls back to a sleep if oom_lock > 1 to prevent multiple OOM kills at the same time. The counter is then decremented by mem_cgroup_oom_unlock called from the same function.

This works correctly but it can lead to serious starvation when we have many processes triggering OOM and many CPUs available for them (I have tested with 16 CPUs).

Consider a process (call it A) which gets the oom_lock (the first one that got to mem_cgroup_handle_oom and grabbed memcg_oom_mutex) and other processes that are blocked on the mutex. While A releases the mutex and calls mem_cgroup_out_of_memory, the others will wake up (one after another), increase the counter and fall asleep (memcg_oom_waitq).

Once A finishes mem_cgroup_out_of_memory it takes the mutex again, decreases oom_lock and wakes the other tasks (if releasing memory by somebody else - e.g. a killed process - hasn't done it yet).

A testcase would look like:
  Assume malloc XXX is a program allocating XXX Megabytes of memory
  which touches all allocated pages in a tight loop
  # swapoff SWAP_DEVICE
  # cgcreate -g memory:A
  # cgset -r memory.oom_control=0 A
  # cgset -r memory.limit_in_bytes=200M A
  # for i in `seq 100`
  # do
  #     cgexec -g memory:A malloc 10 &
  # done

The main problem here is that all processes still race for the mutex and there is no guarantee that we will get the counter back to 0 for those that got back to mem_cgroup_handle_oom. In the end the whole convoy increases and decreases the counter but we do not get to 1, which would enable killing, so nothing useful can be done. The time is basically unbounded because it highly depends on scheduling and ordering on the mutex (I have seen this taking hours...).

This patch replaces the counter by a simple {un}lock semantic. As mem_cgroup_oom_{un}lock works on a subtree of a hierarchy we have to make sure that nobody else races with us, which is guaranteed by the memcg_oom_mutex.

We have to be careful while locking subtrees because we can encounter a subtree which is already locked:

hierarchy:
          A
        /   \
       B     \
      /\      \
     C  D     E

The B - C - D tree might be already locked. While we want to enable locking the E subtree because OOM situations cannot influence each other, we definitely do not want to allow locking A.

Therefore we have to refuse the lock if any subtree is already locked and clear the lock for all nodes that have been set up to the failure point.

On the other hand we have to make sure that the rest of the world will recognize that a group is under OOM even though it doesn't have a lock. Therefore we have to introduce an under_oom variable which is incremented and decremented for the whole subtree when we enter and leave mem_cgroup_handle_oom, respectively. under_oom, unlike oom_lock, doesn't need to be updated under memcg_oom_mutex because its users only check a single group and they use atomic operations for that.

This can be checked easily by the following test case:

  # cgcreate -g memory:A
  # cgset -r memory.use_hierarchy=1 A
  # cgset -r memory.oom_control=1 A
  # cgset -r memory.limit_in_bytes=100M A
  # cgset -r memory.memsw.limit_in_bytes=100M A
  # cgcreate -g memory:A/B
  # cgset -r memory.oom_control=1 A/B
  # cgset -r memory.limit_in_bytes=20M A/B
  # cgset -r memory.memsw.limit_in_bytes=20M A/B
  # cgexec -g memory:A/B malloc 30 &	#->this will be blocked by OOM of group B
  # cgexec -g memory:A malloc 80 &	#->this will be blocked by OOM of group A

While B gets oom_lock, A will not get it. Both of them go to sleep and wait for an external action. We can make the limit higher for A to enforce waking it up:

  # cgset -r memory.memsw.limit_in_bytes=300M A
  # cgset -r memory.limit_in_bytes=300M A

malloc in A has to wake up even though it doesn't have oom_lock.

Finally, the unlock path is very easy because we always unlock only the subtree we have locked previously while we always decrement under_oom.

Signed-off-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
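
The locking rule described in the commit message (lock a whole subtree or nothing, roll back up to the failure point, and track under_oom separately) can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the code in memcontrol.c: `struct group`, `oom_mutex`, `lock_walk`, `unlock_walk`, `subtree_oom_trylock`, `subtree_oom_unlock`, `subtree_mark_oom` and `oom_lock_demo` are hypothetical names, a pthread mutex stands in for memcg_oom_mutex, and the pre-order traversal is an assumption about how the subtree might be visited.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* Hypothetical stand-in for a memory cgroup, not the kernel's struct mem_cgroup. */
struct group {
	const char *name;
	struct group *children[MAX_CHILDREN];
	size_t nr_children;
	bool oom_lock;        /* 0/1 flag, only touched under oom_mutex */
	atomic_int under_oom; /* counter, updated without oom_mutex */
};

/* Plays the role the commit message gives to memcg_oom_mutex. */
static pthread_mutex_t oom_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Pre-order walk that sets oom_lock; returns the first node found locked, or NULL. */
static struct group *lock_walk(struct group *g)
{
	if (g->oom_lock)
		return g;
	g->oom_lock = true;
	for (size_t i = 0; i < g->nr_children; i++) {
		struct group *failed = lock_walk(g->children[i]);
		if (failed)
			return failed;
	}
	return NULL;
}

/* Same pre-order walk, clearing oom_lock until `stop` is reached (NULL: whole subtree). */
static bool unlock_walk(struct group *g, const struct group *stop)
{
	if (g == stop)
		return true;
	g->oom_lock = false;
	for (size_t i = 0; i < g->nr_children; i++)
		if (unlock_walk(g->children[i], stop))
			return true;
	return false;
}

/* Lock the whole subtree, or lock nothing if any node in it is already locked. */
static bool subtree_oom_trylock(struct group *root)
{
	pthread_mutex_lock(&oom_mutex);
	struct group *failed = lock_walk(root);
	if (failed)
		unlock_walk(root, failed); /* roll back only what this attempt set */
	pthread_mutex_unlock(&oom_mutex);
	return failed == NULL;
}

/* Unlock a subtree that the caller locked earlier. */
static void subtree_oom_unlock(struct group *root)
{
	pthread_mutex_lock(&oom_mutex);
	unlock_walk(root, NULL);
	pthread_mutex_unlock(&oom_mutex);
}

/* Add delta to under_oom for the whole subtree; atomic, so no mutex is taken. */
static void subtree_mark_oom(struct group *g, int delta)
{
	atomic_fetch_add(&g->under_oom, delta);
	for (size_t i = 0; i < g->nr_children; i++)
		subtree_mark_oom(g->children[i], delta);
}

/* Replays the A / B / C / D / E hierarchy from the commit message. */
void oom_lock_demo(void)
{
	struct group c = { .name = "C" }, d = { .name = "D" }, e = { .name = "E" };
	struct group b = { .name = "B", .children = { &c, &d }, .nr_children = 2 };
	struct group a = { .name = "A", .children = { &b, &e }, .nr_children = 2 };

	assert(subtree_oom_trylock(&b));  /* B, C and D become locked */
	assert(subtree_oom_trylock(&e));  /* independent subtree, allowed */
	subtree_oom_unlock(&e);

	subtree_mark_oom(&a, 1);          /* A enters OOM handling */
	assert(!subtree_oom_trylock(&a)); /* refused because B is locked */
	assert(!a.oom_lock && !e.oom_lock);             /* rollback cleared A */
	assert(b.oom_lock && c.oom_lock && d.oom_lock); /* B's lock is intact */
	assert(atomic_load(&a.under_oom) == 1);         /* A is still seen as under OOM */
	subtree_mark_oom(&a, -1);         /* A leaves OOM handling */
	subtree_oom_unlock(&b);
}
```

Calling `oom_lock_demo()` from a `main` function exercises the scenario: the attempt to lock A fails and leaves B's lock untouched, while A's under_oom counter still shows that the group is under OOM. Returning the failing node from the walk is a design choice of this sketch that keeps the rollback from clearing locks owned by someone else.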
..
backing-dev.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback	2011-07-26 10:39:54 -07:00
bootmem.c	crash_dump: export is_kdump_kernel to modules, consolidate elfcorehdr_addr, setup_elfcorehdr and saved_max_pfn	2011-03-23 19:47:19 -07:00
bounce.c	bounce: call flush_dcache_page() after bounce_copy_vec()	2010-09-09 18:57:25 -07:00
cleancache.c	mm: cleancache core ops functions and config	2011-05-26 10:01:36 -06:00
compaction.c	mm: compaction: abort compaction if too many pages are isolated and caller is asynchronous V2	2011-06-15 20:04:02 -07:00
debug-pagealloc.c
dmapool.c	devres: fix possible use after free	2011-07-25 20:57:14 -07:00
fadvise.c
failslab.c
filemap_xip.c	mm: Convert i_mmap_lock to a mutex	2011-05-25 08:39:18 -07:00
filemap.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback	2011-07-26 10:39:54 -07:00
fremap.c	mm: don't access vm_flags as 'int'	2011-05-26 09:20:31 -07:00
highmem.c	mm,x86: fix kmap_atomic_push vs ioremap_32.c	2010-10-27 18:03:05 -07:00
huge_memory.c	mm/huge_memory.c: minor lock simplification in __khugepaged_exit	2011-07-25 20:57:09 -07:00
hugetlb.c	mm: hugetlb: fix coding style issues	2011-07-25 20:57:09 -07:00
hwpoison-inject.c	Fix common misspellings	2011-03-31 11:26:23 -03:00
init-mm.c	mm: convert mm->cpu_vm_cpumask into cpumask_var_t	2011-05-25 08:39:21 -07:00
internal.h	mm: nommu: sort mm->mmap list properly	2011-05-25 08:39:05 -07:00
Kconfig	mm Kconfig typo: cleancacne -> cleancache	2011-06-10 14:47:52 +02:00
Kconfig.debug	mm: debug-pagealloc: fix kconfig dependency warning	2011-03-22 17:44:02 -07:00
kmemcheck.c
kmemleak-test.c	kmemleak: remove memset by using kzalloc	2011-01-27 18:31:51 +00:00
kmemleak.c	kmemleak: Do not return a pointer to an object that kmemleak did not get	2011-05-19 17:35:28 +01:00
ksm.c	ksm: fix NULL pointer dereference in scan_get_next_rmap_item()	2011-06-15 20:04:02 -07:00
maccess.c	maccess,probe_kernel: Make write/read src const void *	2011-05-25 19:56:23 -04:00
madvise.c	fs: kill i_alloc_sem	2011-07-20 20:47:46 -04:00
Makefile	mm: cleancache core ops functions and config	2011-05-26 10:01:36 -06:00
memblock.c	mm/memblock.c: avoid abuse of RED_INACTIVE	2011-07-25 20:57:09 -07:00
memcontrol.c	memcg: make oom_lock 0 and 1 based rather than counter	2011-07-26 16:49:42 -07:00
memory_hotplug.c	mm: extend memory hotplug API to allow memory hotplug in virtual machines	2011-07-25 20:57:08 -07:00
memory-failure.c	mm/memory-failure.c: fix spinlock vs mutex order	2011-06-27 18:00:13 -07:00
memory.c	mm/futex: fix futex writes on archs with SW tracking of dirty & young	2011-07-25 20:57:11 -07:00
mempolicy.c	mm: proc: move show_numa_map() to fs/proc/task_mmu.c	2011-05-25 08:39:34 -07:00
mempool.c
migrate.c	migrate: don't account swapcache as shmem	2011-06-16 15:01:24 -07:00
mincore.c	thp: mincore transparent hugepage support	2011-01-13 17:32:44 -08:00
mlock.c	mm: don't access vm_flags as 'int'	2011-05-26 09:20:31 -07:00
mm_init.c
mmap.c	mmap: fix and tidy up overcommit page arithmetic	2011-07-25 20:57:09 -07:00
mmu_context.c
mmu_notifier.c	thp: mmu_notifier_test_young	2011-01-13 17:32:46 -08:00
mmzone.c	mm: page allocator: adjust the per-cpu counter threshold when memory is low	2011-01-13 17:32:31 -08:00
mprotect.c	thp: mprotect: transparent huge page support	2011-01-13 17:32:44 -08:00
mremap.c	mm: Convert i_mmap_lock to a mutex	2011-05-25 08:39:18 -07:00
msync.c
nobootmem.c	memblock/nobootmem: remove unneeded code from alloc_bootmem_node_high()	2011-05-25 08:39:31 -07:00
nommu.c	mmap: fix and tidy up overcommit page arithmetic	2011-07-25 20:57:09 -07:00
oom_kill.c	oom: remove references to old badness() function	2011-07-25 20:57:09 -07:00
page_alloc.c	mm: page allocator: reconsider zones for allocation after direct reclaim	2011-07-25 20:57:10 -07:00
page_cgroup.c	mm/page_cgroup.c: simplify code by using SECTION_ALIGN_UP() and SECTION_ALIGN_DOWN() macros	2011-07-25 20:57:09 -07:00
page_io.c	block: kill off REQ_UNPLUG	2011-03-10 08:52:27 +01:00
page_isolation.c	mm: page_isolation: codeclean fix comment and rm unneeded val init	2010-10-26 16:52:11 -07:00
page-writeback.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback	2011-07-26 10:39:54 -07:00
pagewalk.c	pagewalk: fix code comment for THP	2011-07-25 20:57:09 -07:00
percpu-km.c	percpu: clear memory allocated with the km allocator	2010-10-02 10:28:42 +03:00
percpu-vm.c	mm: remove gfp mask from pcpu_get_vm_areas	2011-01-13 17:32:34 -08:00
percpu.c	Merge branch 'for-2.6.40' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu	2011-05-24 11:53:42 -07:00
pgtable-generic.c	mm/pgtable-generic.c: fix CONFIG_SWAP=n build	2011-01-26 10:49:58 +10:00
prio_tree.c	sanitize <linux/prefetch.h> usage	2011-05-20 12:50:29 -07:00
quicklist.c
readahead.c	readahead: readahead page allocations are OK to fail	2011-05-25 08:39:25 -07:00
rmap.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/writeback	2011-07-26 10:39:54 -07:00
shmem.c	Merge 'akpm' patch series	2011-07-25 21:00:19 -07:00
slab.c	Merge branch 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6	2011-07-22 12:44:30 -07:00
slob.c	slob/lockdep: Fix gfp flags passed to lockdep	2011-06-07 21:38:07 +03:00
slub.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial	2011-07-25 13:56:39 -07:00
sparse-vmemmap.c	tree-wide: fix comment/printk typos	2010-11-01 15:38:34 -04:00
sparse.c	mm: make some struct page's const	2011-07-25 20:57:07 -07:00
swap_state.c	block: remove per-queue plugging	2011-03-10 08:52:07 +01:00
swap.c	mm: batch activate_page() to reduce lock contention	2011-05-25 08:39:37 -07:00
swapfile.c	fs: seq_file - add event counter to simplify poll() support	2011-07-20 20:47:50 -04:00
thrash.c	mm: swap-token: add a comment for priority aging	2011-07-25 20:57:08 -07:00
truncate.c	mm: pincer in truncate_inode_pages_range	2011-07-25 20:57:10 -07:00
util.c	mm: nommu: sort mm->mmap list properly	2011-05-25 08:39:05 -07:00
vmalloc.c	vmalloc,rcu: Convert call_rcu(rcu_free_vb) to kfree_rcu()	2011-07-20 14:10:18 -07:00
vmscan.c	memcg: consolidate memory cgroup lru stat functions	2011-07-26 16:49:42 -07:00
vmstat.c	mm, mem-hotplug: update pcp->stat_threshold when memory hotplug occur	2011-05-25 08:39:09 -07:00