linux/mm
Tejun Heo b23afb93d3 memcg: punt high overage reclaim to return-to-userland path
Currently, try_charge() tries to reclaim memory synchronously when the
high limit is breached; however, if the allocation doesn't have
__GFP_WAIT, synchronous reclaim is skipped.  If a process performs only
speculative allocations, it can blow way past the high limit.  This is
actually easily reproducible by simply doing "find /".  slab/slub
allocator tries speculative allocations first, so as long as there's
memory which can be consumed without blocking, it can keep allocating
memory regardless of the high limit.

This patch makes try_charge() always punt the over-high reclaim to the
return-to-userland path.  If try_charge() detects that high limit is
breached, it adds the overage to current->memcg_nr_pages_over_high and
schedules execution of mem_cgroup_handle_over_high() which performs
synchronous reclaim from the return-to-userland path.

As long as kernel doesn't have a run-away allocation spree, this should
provide enough protection while making kmemcg behave more consistently.
It also has the following benefits.

- All over-high reclaims can use GFP_KERNEL regardless of the specific
  gfp mask in use, e.g. GFP_NOFS, when the limit was breached.

- It copes with prio inversion.  Previously, a low-prio task with
  small memory.high might perform over-high reclaim with a bunch of
  locks held.  If a higher prio task needed any of these locks, it
  would have to wait until the low prio task finished reclaim and
  released the locks.  By handing over-high reclaim to the task exit
  path this issue can be avoided.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Michal Hocko <mhocko@kernel.org>
Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-11-05 19:34:48 -08:00
..
kasan kasan: fix last shadow judgement in memory_is_poisoned_16() 2015-09-17 21:16:07 -07:00
backing-dev.c writeback: remove broken rbtree_postorder_for_each_entry_safe() usage in cgwb_bdi_destroy() 2015-10-21 08:17:29 -06:00
balloon_compaction.c mm/balloon_compaction: fix deflation when compaction is disabled 2014-10-29 16:33:15 -07:00
bootmem.c bootmem: avoid freeing to bootmem after bootmem is done 2015-09-08 15:35:28 -07:00
cleancache.c cleancache: remove limit on the number of cleancache enabled filesystems 2015-04-14 16:49:03 -07:00
cma_debug.c mm/cma_debug: correct size input to bitmap function 2015-07-17 16:39:54 -07:00
cma.c mm: cma: fix incorrect type conversion for size during dma allocation 2015-10-23 17:55:10 +09:00
cma.h mm: cma: mark cma_bitmap_maxno() inline in header 2015-08-14 15:56:32 -07:00
compaction.c mm/compaction: correct to flush migrated pages if pageblock skip happens 2015-09-08 15:35:28 -07:00
debug-pagealloc.c mm/debug-pagealloc: make debug-pagealloc boottime configurable 2014-12-13 12:42:48 -08:00
debug.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
dmapool.c dmapool: fix overflow condition in pool_find_page() 2015-10-01 21:42:35 -04:00
early_ioremap.c mm/early_ioremap: add explicit #include of asm/early_ioremap.h 2015-09-11 15:21:34 -07:00
fadvise.c writeback: implement and use inode_congested() 2015-06-02 08:33:35 -06:00
failslab.c debugfs: Pass bool pointer to debugfs_create_bool() 2015-10-04 11:36:07 +01:00
filemap.c mm: make sendfile(2) killable 2015-10-23 17:55:10 +09:00
frame_vector.c [media] mm: Provide new get_vaddr_frames() helper 2015-08-16 13:02:47 -03:00
frontswap.c frontswap: allow multiple backends 2015-06-24 17:49:45 -07:00
gup.c mm: make GUP handle pfn mapping unless FOLL_GET is requested 2015-09-04 16:54:41 -07:00
highmem.c mm/highmem: make kmap cache coloring aware 2014-08-06 18:01:22 -07:00
huge_memory.c - Support for new MM features in ARCv2 cores (THP, PAE40) 2015-11-03 13:21:09 -08:00
hugetlb_cgroup.c mm: page_counter: pull "-1" handling out of page_counter_memparse() 2015-02-11 17:06:02 -08:00
hugetlb.c mm: hugetlbfs: skip shared VMAs when unmapping private pages to satisfy a fault 2015-10-01 21:42:35 -04:00
hwpoison-inject.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
init-mm.c
internal.h mm/compaction: correct to flush migrated pages if pageblock skip happens 2015-09-08 15:35:28 -07:00
interval_tree.c mm: replace vma->sharead.linear with vma->shared 2015-02-10 14:30:31 -08:00
Kconfig media updates for v4.3-rc1 2015-09-11 16:42:39 -07:00
Kconfig.debug mm/debug_pagealloc: remove obsolete Kconfig options 2015-01-08 15:10:52 -08:00
kmemcheck.c mm/slab_common: move kmem_cache definition to internal header 2014-10-09 22:25:50 -04:00
kmemleak-test.c mm/kmemleak-test.c: use pr_fmt for logging 2014-06-06 16:08:18 -07:00
kmemleak.c mm/kmemleak.c: remove unneeded initialization of object to NULL 2015-11-05 19:34:48 -08:00
ksm.c mm: remove rest of ACCESS_ONCE() usages 2015-04-15 16:35:18 -07:00
list_lru.c list_lru: don't call list_lru_from_kmem if the list_head is empty 2015-09-08 15:35:28 -07:00
maccess.c uaccess: reimplement probe_kernel_address() using probe_kernel_read() 2015-11-05 19:34:48 -08:00
madvise.c mm: madvise allow remove operation for hugetlbfs 2015-09-08 15:35:28 -07:00
Makefile media updates for v4.3-rc1 2015-09-11 16:42:39 -07:00
memblock.c mm/memblock.c: fix comment in __next_mem_range() 2015-09-08 15:35:28 -07:00
memcontrol.c memcg: punt high overage reclaim to return-to-userland path 2015-11-05 19:34:48 -08:00
memory_hotplug.c mm: memory hotplug with an existing resource 2015-10-23 14:19:58 +01:00
memory-failure.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
memory.c mm, dax: fix DAX deadlocks 2015-10-16 11:42:28 -07:00
mempolicy.c mm: rename alloc_pages_exact_node() to __alloc_pages_node() 2015-09-08 15:35:28 -07:00
mempool.c mm/mempool: allow NULL `pool' pointer in mempool_destroy() 2015-09-08 15:35:28 -07:00
memtest.c memtest: remove unused header files 2015-09-08 15:35:28 -07:00
migrate.c memcg: fix dirty page migration 2015-10-01 21:42:35 -04:00
mincore.c mincore: apply page table walker on do_mincore() 2015-02-11 17:06:06 -08:00
mlock.c mm/mlock.c: reorganize mlockall() return values and remove goto-out label 2015-11-05 19:34:48 -08:00
mm_init.c mm: meminit: remove mminit_verify_page_links 2015-06-30 19:44:56 -07:00
mmap.c mm/mmap.c: remove useless statement "vma = NULL" in find_vma() 2015-11-05 19:34:48 -08:00
mmu_context.c
mmu_notifier.c mmu-notifier: add clear_young callback 2015-09-10 13:29:01 -07:00
mmzone.c mm: microoptimize zonelist operations 2015-02-11 17:06:02 -08:00
mprotect.c userfaultfd: teach vma_merge to merge across vma->vm_userfaultfd_ctx 2015-09-04 16:54:41 -07:00
mremap.c mremap: simplify the "overlap" check in mremap_to() 2015-09-04 16:54:41 -07:00
msync.c mm: remove rest usage of VM_NONLINEAR and pte_file() 2015-02-10 14:30:31 -08:00
nobootmem.c mm: page_alloc: pass PFN to __free_pages_bootmem 2015-06-30 19:44:55 -07:00
nommu.c mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff() 2015-09-10 13:29:01 -07:00
oom_kill.c mm, oom: remove unnecessary variable 2015-09-08 15:35:28 -07:00
page_alloc.c debugfs: Pass bool pointer to debugfs_create_bool() 2015-10-04 11:36:07 +01:00
page_counter.c mm: page_counter: pull "-1" handling out of page_counter_memparse() 2015-02-11 17:06:02 -08:00
page_ext.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_idle.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
page_io.c fs: use helper bio_add_page() instead of open coding on bi_io_vec 2015-08-13 12:32:00 -06:00
page_isolation.c mm, page_isolation: make set/unset_migratetype_isolate() file-local 2015-09-08 15:35:28 -07:00
page_owner.c mm/page_owner: set correct gfp_mask on page_owner 2015-07-17 16:39:54 -07:00
page-writeback.c writeback: fix incorrect calculation of available memory for memcg domains 2015-10-12 10:31:13 -06:00
pagewalk.c mm/pagewalk.c: prevent positive return value of walk_page_test() from being passed to callers 2015-03-25 16:20:30 -07:00
percpu-km.c percpu: implmeent pcpu_nr_empty_pop_pages and chunk->nr_populated 2014-09-02 14:46:05 -04:00
percpu-vm.c percpu: move region iterations out of pcpu_[de]populate_chunk() 2014-09-02 14:46:02 -04:00
percpu.c percpu: clean up of schunk->map[] assignment in pcpu_setup_first_chunk 2015-07-21 11:31:00 -04:00
pgtable-generic.c mm,thp: introduce flush_pmd_tlb_range 2015-10-17 17:48:20 +05:30
process_vm_access.c process_vm_access: switch to {compat_,}import_iovec() 2015-04-11 22:27:12 -04:00
quicklist.c
readahead.c mm, fs: obey gfp_mapping for add_to_page_cache() 2015-10-16 11:42:28 -07:00
rmap.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
shmem.c shmem: recalculate file inode when fstat 2015-09-08 15:35:28 -07:00
slab_common.c mm/slab_common.c: initialize kmem_cache pointer to NULL 2015-11-05 19:34:48 -08:00
slab.c mm: slab: only move management objects off-slab for sizes larger than KMALLOC_MIN_SIZE 2015-11-05 19:34:48 -08:00
slab.h mm/slab_common.c: clear pointers to per memcg caches on destroy 2015-11-05 19:34:48 -08:00
slob.c mm: rename alloc_pages_exact_node() to __alloc_pages_node() 2015-09-08 15:35:28 -07:00
slub.c mm/slub: calculate start order with reserved in consideration 2015-11-05 19:34:48 -08:00
sparse-vmemmap.c
sparse.c
swap_cgroup.c mm: page_cgroup: rename file to mm/swap_cgroup.c 2014-12-10 17:41:09 -08:00
swap_state.c mm: swap: zswap: maybe_preload & refactoring 2015-09-08 15:35:28 -07:00
swap.c mm: introduce idle page tracking 2015-09-10 13:29:01 -07:00
swapfile.c mm: /proc/pid/smaps:: show proportional swap share of the mapping 2015-09-08 15:35:28 -07:00
truncate.c memcg: add per cgroup dirty page accounting 2015-06-02 08:33:33 -06:00
userfaultfd.c userfaultfd: avoid mmap_sem read recursion in mcopy_atomic 2015-09-04 16:54:41 -07:00
util.c mm: uninline and cleanup page-mapping related helpers 2015-04-15 16:35:19 -07:00
vmacache.c mm,vmacache: count number of system-wide flushes 2014-12-13 12:42:48 -08:00
vmalloc.c mm: get rid of 'vmalloc_info' from /proc/meminfo 2015-11-01 17:09:15 -08:00
vmpressure.c mm/vmpressure.c: fix race in vmpressure_work_fn() 2014-12-02 17:32:07 -08:00
vmscan.c vmscan: fix sane_reclaim helper for legacy memcg 2015-09-22 15:09:53 -07:00
vmstat.c vmstat: explicitly schedule per-cpu work on the CPU we need it to run on 2015-10-15 13:01:50 -07:00
workingset.c list_lru: add helpers to isolate items 2015-02-12 18:54:10 -08:00
zbud.c mm: zbud: constify the zbud_ops 2015-09-08 15:35:28 -07:00
zpool.c zpool: add zpool_has_pool() 2015-09-10 13:29:01 -07:00
zsmalloc.c mm: zpool: constify the zpool_ops 2015-09-08 15:35:28 -07:00
zswap.c zswap: change zpool/compressor at runtime 2015-09-10 13:29:01 -07:00