linux/mm
Michal Hocko cd04ae1e2d mm, oom: do not rely on TIF_MEMDIE for memory reserves access
For ages we have been relying on TIF_MEMDIE thread flag to mark OOM
victims and then, among other things, to give these threads full access
to memory reserves.  There are few shortcomings of this implementation,
though.

First of all and the most serious one is that the full access to memory
reserves is quite dangerous because we leave no safety room for the
system to operate and potentially do last emergency steps to move on.

Secondly this flag is per task_struct while the OOM killer operates on
mm_struct granularity so all processes sharing the given mm are killed.
Giving the full access to all these task_structs could lead to a quick
memory reserves depletion.  We have tried to reduce this risk by giving
TIF_MEMDIE only to the main thread and the currently allocating task but
that doesn't really solve this problem while it surely opens up a room
for corner cases - e.g.  GFP_NO{FS,IO} requests might loop inside the
allocator without access to memory reserves because a particular thread
was not the group leader.

Now that we have the oom reaper and that all oom victims are reapable
after 1b51e65eab ("oom, oom_reaper: allow to reap mm shared by the
kthreads") we can be more conservative and grant only partial access to
memory reserves because there are reasonable chances of the parallel
memory freeing.  We still want some access to reserves because we do not
want other consumers to eat up the victim's freed memory.  oom victims
will still contend with __GFP_HIGH users but those shouldn't be so
aggressive to starve oom victims completely.

Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
the half of the reserves.  This makes the access to reserves independent
on which task has passed through mark_oom_victim.  Also drop any usage
of TIF_MEMDIE from the page allocator proper and replace it by
tsk_is_oom_victim as well which will make page_alloc.c completely
TIF_MEMDIE free finally.

CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
ALLOC_NO_WATERMARKS approach.

There is a demand to make the oom killer memcg aware which will imply
many tasks killed at once.  This change will allow such a usecase
without worrying about complete memory reserves depletion.

Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: David Rientjes <rientjes@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-06 17:27:30 -07:00
..
kasan Merge branch 'linus' into locking/core, to pick up fixes 2017-08-10 12:20:53 +02:00
backing-dev.c bdi: Drop 'parent' argument from bdi_register[_va]() 2017-04-20 12:09:55 -06:00
balloon_compaction.c mm/balloon_compaction.c: don't zero ballooned pages 2017-08-10 15:54:07 -07:00
bootmem.c mm/bootmem.c: cosmetic improvement of code readability 2017-02-22 16:41:29 -08:00
cleancache.c fs: switch ->s_uuid to uuid_t 2017-06-05 16:59:12 +02:00
cma_debug.c mm/cma_debug.c: fix stack corruption due to sprintf usage 2017-08-18 15:32:02 -07:00
cma.c cma: fix calculation of aligned offset 2017-07-10 16:32:32 -07:00
cma.h cma: Store a name in the cma structure 2017-04-18 20:41:12 +02:00
compaction.c mm, compaction: skip over holes in __reset_isolation_suitable 2017-07-06 16:24:32 -07:00
debug_page_ref.c mm/page_ref: add tracepoint to track down page reference manipulation 2016-03-17 15:09:34 -07:00
debug.c mm: make tlb_flush_pending global 2017-08-10 15:54:07 -07:00
dmapool.c lib/vsprintf.c: remove %Z support 2017-02-27 18:43:47 -08:00
early_ioremap.c x86/mm: Add support to access boot related data in the clear 2017-07-18 11:38:02 +02:00
fadvise.c mm: fadvise: avoid expensive remote LRU cache draining after FADV_DONTNEED 2016-12-20 09:48:46 -08:00
failslab.c
filemap.c mm: use find_get_pages_range() in filemap_range_has_page() 2017-09-06 17:27:27 -07:00
frame_vector.c treewide: use kv[mz]alloc* rather than opencoded variants 2017-05-08 17:15:13 -07:00
frontswap.c mm, frontswap: convert frontswap_enabled to static key 2016-07-26 16:19:19 -07:00
gup.c mm/gup: make __gup_device_* require THP 2017-09-06 17:27:26 -07:00
highmem.c mm/highmem: make nr_free_highpages() handles all highmem zones by itself 2016-05-19 19:12:14 -07:00
huge_memory.c mm, THP, swap: support splitting THP for THP swap out 2017-09-06 17:27:28 -07:00
hugetlb_cgroup.c mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size 2016-05-20 17:58:30 -07:00
hugetlb.c mm, hugetlb: do not allocate non-migrateable gigantic pages from movable zones 2017-09-06 17:27:29 -07:00
hwpoison-inject.c mm: hwpoison: call shake_page() unconditionally 2017-05-03 15:52:12 -07:00
init-mm.c mm: Add a user_ns owner to mm_struct and fix ptrace permission checks 2016-11-22 11:49:48 -06:00
internal.h mm, oom: do not rely on TIF_MEMDIE for memory reserves access 2017-09-06 17:27:30 -07:00
interval_tree.c
Kconfig mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups 2017-09-06 17:27:29 -07:00
Kconfig.debug mm: enable page poisoning early at boot 2017-05-03 15:52:10 -07:00
khugepaged.c mm: make PR_SET_THP_DISABLE immediately active 2017-07-10 16:32:31 -07:00
kmemcheck.c mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU 2017-04-18 11:42:36 -07:00
kmemleak-test.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
kmemleak.c mm: kmemleak: treat vm_struct as alternative reference to vmalloc'ed objects 2017-07-06 16:24:34 -07:00
ksm.c mm/ksm.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
list_lru.c mm/list_lru.c: fix list_lru_count_node() to be race free 2017-07-10 16:32:33 -07:00
maccess.c x86: remove more uaccess_32.h complexity 2016-05-22 17:21:27 -07:00
madvise.c mm, madvise: ensure poisoned pages are removed from per-cpu lists 2017-08-31 16:33:15 -07:00
Makefile percpu: expose statistics about percpu memory via debugfs 2017-06-20 15:31:38 -04:00
memblock.c mm/memblock.c: reversed logic in memblock_discard() 2017-08-25 16:12:46 -07:00
memcontrol.c memcg, THP, swap: make mem_cgroup_swapout() support THP 2017-09-06 17:27:28 -07:00
memory_hotplug.c mm, memory_hotplug: get rid of zonelists_mutex 2017-09-06 17:27:26 -07:00
memory-failure.c x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages 2017-08-17 10:30:49 +02:00
memory.c mm, swap: VMA based swap readahead 2017-09-06 17:27:29 -07:00
mempolicy.c mm/mempolicy: fix use after free when calling get_mempolicy 2017-08-18 15:32:02 -07:00
mempool.c sched/wait: Rename wait_queue_t => wait_queue_entry_t 2017-06-20 12:18:27 +02:00
memtest.c
migrate.c Sanitize 'move_pages()' permission checks 2017-08-20 13:26:27 -07:00
mincore.c mm: remove shmem_mapping() shmem_zero_setup() duplicates 2017-02-24 17:46:56 -08:00
mlock.c mlock: fix mlock count can not decrease in race condition 2017-06-02 15:07:38 -07:00
mm_init.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
mmap.c userfaultfd: call userfaultfd_unmap_prep only if __split_vma succeeds 2017-09-06 17:27:29 -07:00
mmu_context.c sched/headers: Prepare to move the task_lock()/unlock() APIs to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
mmu_notifier.c mm/mmu_notifier: kill invalidate_page 2017-08-31 16:13:00 -07:00
mmzone.c mm/mmzone.c: swap likely to unlikely as code logic is different for next_zones_zonelist() 2017-02-22 16:41:29 -08:00
mprotect.c mm: migrate: prevent racy access to tlb_flush_pending 2017-08-10 15:54:07 -07:00
mremap.c mm/mremap: fail map duplication attempts for private mappings 2017-09-06 17:27:26 -07:00
msync.c
nobootmem.c mm: discard memblock data later 2017-08-18 15:32:01 -07:00
nommu.c mm: rename global_page_state to global_zone_page_state 2017-09-06 17:27:29 -07:00
oom_kill.c mm, oom: do not rely on TIF_MEMDIE for memory reserves access 2017-09-06 17:27:30 -07:00
page_alloc.c mm, oom: do not rely on TIF_MEMDIE for memory reserves access 2017-09-06 17:27:30 -07:00
page_counter.c
page_ext.c mm, page_ext: periodically reschedule during page_ext_init() 2017-09-06 17:27:26 -07:00
page_idle.c mm/page_idle.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
page_io.c mm: test code to write THP to swap device as a whole 2017-09-06 17:27:28 -07:00
page_isolation.c mm: unify new_node_page and alloc_migrate_target 2017-07-10 16:32:31 -07:00
page_owner.c mm, page_owner: don't grab zone->lock for init_pages_in_zone() 2017-09-06 17:27:26 -07:00
page_poison.c mm: enable page poisoning early at boot 2017-05-03 15:52:10 -07:00
page_vma_mapped.c mm/hugetlb: add size parameter to huge_pte_offset() 2017-07-06 16:24:34 -07:00
page-writeback.c mm: rename global_page_state to global_zone_page_state 2017-09-06 17:27:29 -07:00
pagewalk.c mm/hugetlb: add size parameter to huge_pte_offset() 2017-07-06 16:24:34 -07:00
percpu-internal.h percpu: fix early calls for spinlock in pcpu_stats 2017-06-21 13:53:52 -04:00
percpu-km.c percpu: fix static checker warnings in pcpu_destroy_chunk 2017-06-29 11:23:38 -04:00
percpu-stats.c percpu: expose statistics about percpu memory via debugfs 2017-06-20 15:31:38 -04:00
percpu-vm.c percpu: fix static checker warnings in pcpu_destroy_chunk 2017-06-29 11:23:38 -04:00
percpu.c percpu: resolve err may not be initialized in pcpu_alloc 2017-06-21 12:00:45 -04:00
pgtable-generic.c mm: convert generic code to 5-level paging 2017-03-09 11:48:47 -08:00
process_vm_access.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> 2017-03-02 08:42:28 +01:00
quicklist.c fix Christoph's email addresses 2016-03-17 15:09:34 -07:00
readahead.c mm: don't cap request size based on read-ahead setting 2016-12-12 18:55:08 -08:00
rmap.c mm/rmap: update to new mmu_notifier semantic v2 2017-08-31 16:12:59 -07:00
rodata_test.c mm: remove rodata_test_data export, add pr_fmt 2017-05-03 15:52:09 -07:00
shmem.c mm, swap: VMA based swap readahead 2017-09-06 17:27:29 -07:00
slab_common.c mm: allow slab_nomerge to be set at build time 2017-07-06 16:24:31 -07:00
slab.c mm: memcontrol: account slab stats per lruvec 2017-07-06 16:24:35 -07:00
slab.h locking/lockdep: Rework FS_RECLAIM annotation 2017-08-10 12:29:03 +02:00
slob.c locking/lockdep: Rework FS_RECLAIM annotation 2017-08-10 12:29:03 +02:00
slub.c mm/slub.c: constify attribute_group structures 2017-09-06 17:27:27 -07:00
sparse-vmemmap.c mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations 2017-09-06 17:27:26 -07:00
sparse.c mm, sparse, page_ext: drop ugly N_HIGH_MEMORY branches for allocations 2017-09-06 17:27:26 -07:00
swap_cgroup.c mm, THP, swap: delay splitting THP during swap out 2017-07-06 16:24:31 -07:00
swap_slots.c mm/swap_slots.c: don't disable preemption while taking the per-CPU cache 2017-07-10 16:32:32 -07:00
swap_state.c mm, swap: add sysfs interface for VMA based swap readahead 2017-09-06 17:27:29 -07:00
swap.c mm: remove nr_pages argument from pagevec_lookup{,_range}() 2017-09-06 17:27:27 -07:00
swapfile.c mm, swap: don't use VMA based swap readahead if HDD is used as swap 2017-09-06 17:27:30 -07:00
truncate.c mm/truncate.c: fix THP handling in invalidate_mapping_pages() 2017-07-10 16:32:32 -07:00
usercopy.c mm/usercopy: Drop extra is_vmalloc_or_module() check 2017-04-05 12:30:18 -07:00
userfaultfd.c userfaultfd: shmem: wire up shmem_mfill_zeropage_pte 2017-09-06 17:27:28 -07:00
util.c mm: rename global_page_state to global_zone_page_state 2017-09-06 17:27:29 -07:00
vmacache.c sched/headers: Prepare to move 'init_task' and 'init_thread_union' from <linux/sched.h> to <linux/sched/task.h> 2017-03-02 08:42:38 +01:00
vmalloc.c mm/vmalloc.c: don't reinvent the wheel but use existing llist API 2017-09-06 17:27:29 -07:00
vmpressure.c mm, vmpressure: pass-through notification support 2017-07-10 16:32:31 -07:00
vmscan.c mm, THP, swap: add THP swapping out fallback counting 2017-09-06 17:27:28 -07:00
vmstat.c mm, swap: add swap readahead hit statistics 2017-09-06 17:27:29 -07:00
workingset.c mm: memcontrol: per-lruvec stats infrastructure 2017-07-06 16:24:35 -07:00
z3fold.c z3fold: use per-cpu unbuddied lists 2017-09-06 17:27:30 -07:00
zbud.c
zpool.c
zsmalloc.c zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse 2017-09-06 17:27:26 -07:00
zswap.c mm/zswap.c: delete an error message for a failed memory allocation in zswap_dstmem_prepare() 2017-07-06 16:24:35 -07:00