linux/mm
Shakeel Butt 90a6f2a8f4 memcg: use ratelimited stats flush in the reclaim
The Meta prod is seeing large amount of stalls in memcg stats flush from
the memcg reclaim code path.  At the moment, this specific callsite is
doing a synchronous memcg stats flush.  The rstat flush is an expensive
and time consuming operation, so concurrent relaimers will busywait on the
lock potentially for a long time.  Actually this issue is not unique to
Meta and has been observed by Cloudflare [1] as well.  For the Cloudflare
case, the stalls were due to contention between kswapd threads running on
their 8 numa node machines which does not make sense as rstat flush is
global and flush from one kswapd thread should be sufficient for all. 
Simply replace the synchronous flush with the ratelimited one.

One may raise a concern on potentially using 2 sec stale (at worst) stats
for heuristics like desirable inactive:active ratio and preferring
inactive file pages over anon pages but these specific heuristics do not
require very precise stats and also are ignored under severe memory
pressure.

More specifically for this code path, the stats are needed for two
specific heuristics:

1. Deactivate LRUs
2. Cache trim mode

The deactivate LRUs heuristic is to maintain a desirable inactive:active
ratio of the LRUs.  The specific stats needed are WORKINGSET_ACTIVATE* and
the hierarchical LRU size.  The WORKINGSET_ACTIVATE* is needed to check if
there is a refault since last snapshot and the LRU size are needed for the
desirable ratio between inactive and active LRUs.  See the table below on
how the desirable ratio is calculated.

/* total     target    max
 * memory    ratio     inactive
 * -------------------------------------
 *   10MB       1         5MB
 *  100MB       1        50MB
 *    1GB       3       250MB
 *   10GB      10       0.9GB
 *  100GB      31         3GB
 *    1TB     101        10GB
 *   10TB     320        32GB
 */

The desirable ratio only changes at the boundary of 1 GiB, 10 GiB, 100
GiB, 1 TiB and 10 TiB.  There is no need for the precise and accurate LRU
size information to calculate this ratio.  In addition, if deactivation is
skipped for some LRU, the kernel will force deactive on the severe memory
pressure situation.

For the cache trim mode, inactive file LRU size is read and the kernel
scales it down based on the reclaim iteration (file >> sc->priority) and
only checks if it is zero or not.  Again precise information is not
needed.

This patch has been running on Meta fleet for several months and we have
not observed any issues.  Please note that MGLRU is not impacted by this
issue at all as it avoids rstat flushing completely.

Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
Link: https://lkml.kernel.org/r/20240813215358.2259750-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-09-01 20:26:13 -07:00
..
damon mm/damon/lru_sort: adjust local variable to dynamic allocation 2024-09-01 20:25:45 -07:00
kasan kasan: fix bad call to unpoison_slab_object 2024-06-24 20:52:09 -07:00
kfence kfence: save freeing stack trace at calling time instead of freeing time 2024-09-01 20:26:12 -07:00
kmsan kmsan: do not pass NULL pointers as 0 2024-07-03 19:30:26 -07:00
backing-dev.c writeback: support retrieving per group debug writeback stats of bdi 2024-05-05 17:53:51 -07:00
balloon_compaction.c mm: remove MIGRATE_SYNC_NO_COPY mode 2024-07-03 19:30:00 -07:00
bootmem_info.c
cma_debug.c
cma_sysfs.c mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
cma.c mm/cma: change the addition of totalcma_pages in the cma_init_reserved_mem 2024-09-01 20:25:56 -07:00
cma.h mm/cma: add sysfs file 'release_pages_success' 2024-02-22 10:24:57 -08:00
compaction.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
debug_page_alloc.c mm: page_alloc: consolidate free page accounting 2024-04-25 20:56:04 -07:00
debug_page_ref.c
debug_vm_pgtable.c mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick 2024-06-15 10:43:08 -07:00
debug.c mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio() 2024-05-05 17:53:31 -07:00
dmapool_test.c mm/dmapool: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
dmapool.c mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs 2023-12-05 11:17:58 +01:00
early_ioremap.c
execmem.c mm/execmem, arch: convert remaining overrides of module_alloc to execmem 2024-05-14 00:31:43 -07:00
fadvise.c
fail_page_alloc.c mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC 2024-07-17 21:05:18 -07:00
failslab.c mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB 2024-07-17 21:05:18 -07:00
filemap.c filemap: add trace events for get_pages, map_pages, and fault 2024-09-01 20:26:10 -07:00
folio-compat.c mm: remove page_mapping() 2024-07-03 19:29:59 -07:00
gup_test.c
gup_test.h
gup.c mm: remove follow_page() 2024-09-01 20:26:01 -07:00
highmem.c mm/highmem: make nr_free_highpages() return "unsigned long" 2024-07-03 19:30:06 -07:00
hmm.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
huge_memory.c mm/mprotect: fix dax pud handlings 2024-09-01 20:26:10 -07:00
hugetlb_cgroup.c mm: memcg: don't call propagate_protected_usage() needlessly 2024-09-01 20:25:50 -07:00
hugetlb_vmemmap.c mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO 2024-09-01 20:25:45 -07:00
hugetlb_vmemmap.h
hugetlb.c mm/hugetlb_vmemmap: batch HVO work when demoting 2024-09-01 20:26:10 -07:00
hwpoison-inject.c mm/hwpoison: add MODULE_DESCRIPTION() 2024-07-03 19:29:58 -07:00
init-mm.c mm: Deprecate pasid field 2023-12-12 10:11:32 +01:00
internal.h mm: add a helper to accept page 2024-09-01 20:26:07 -07:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig options 2024-09-01 20:25:51 -07:00
Kconfig.debug mm/slub: unify all sl[au]b parameters with "slab_$param" 2024-01-22 10:31:08 +01:00
khugepaged.c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
kmemleak.c kmemleak: enable tracking for percpu pointers 2024-09-01 20:25:49 -07:00
ksm.c mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk 2024-09-01 20:26:02 -07:00
list_lru.c mm: list_lru: fix UAF for memory cgroup 2024-08-07 18:33:56 -07:00
maccess.c
madvise.c Random number generator updates for Linux 6.11-rc1. 2024-07-24 10:29:50 -07:00
Makefile mm: move internal core VMA manipulation functions to own file 2024-09-01 20:25:54 -07:00
mapping_dirty_helpers.c
memblock.c mm: rework accept memory helpers 2024-09-01 20:26:07 -07:00
memcontrol-v1.c memcg_write_event_control(): fix a user-triggerable oops 2024-08-12 21:58:44 -04:00
memcontrol-v1.h mm: memcg: gather memcg1-specific fields initialization in memcg1_memcg_init() 2024-07-04 18:05:56 -07:00
memcontrol.c memcg: replace memcg ID idr with xarray 2024-09-01 20:26:05 -07:00
memfd.c mm/gup: introduce memfd_pin_folios() for pinning memfd folios 2024-07-12 15:52:09 -07:00
memory_hotplug.c mm/memory_hotplug: get rid of __ref 2024-09-01 20:25:56 -07:00
memory-failure.c mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu 2024-08-15 22:16:14 -07:00
memory-tiers.c memory tiering: introduce folio_use_access_time() check 2024-09-01 20:25:47 -07:00
memory.c mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep) 2024-09-01 20:26:06 -07:00
mempolicy.c mm: improve code consistency with zonelist_* helper functions 2024-09-01 20:25:55 -07:00
mempool.c mm: fix xyz_noprof functions calling profiled functions 2024-06-05 19:19:26 -07:00
memremap.c mm: convert put_devmap_managed_page_refs() to put_devmap_managed_folio_refs() 2024-05-05 17:53:49 -07:00
memtest.c memtest: use {READ,WRITE}_ONCE in memory scanning 2024-03-13 12:12:21 -07:00
migrate_device.c mm: extend rmap flags arguments for folio_add_new_anon_rmap 2024-07-03 19:30:18 -07:00
migrate.c fs: remove calls to set and clear the folio error flag 2024-09-01 20:26:04 -07:00
mincore.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
mlock.c Random number generator updates for Linux 6.11-rc1. 2024-07-24 10:29:50 -07:00
mm_init.c mm: rework accept memory helpers 2024-09-01 20:26:07 -07:00
mm_slot.h
mmap_lock.c mm: mmap_lock: replace get_memcg_path_buf() with on-stack buffer 2024-07-03 19:30:26 -07:00
mmap.c mm: remove legacy install_special_mapping() code 2024-09-01 20:26:13 -07:00
mmu_gather.c mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing 2024-02-22 15:27:17 -08:00
mmu_notifier.c mm: move internal core VMA manipulation functions to own file 2024-09-01 20:25:54 -07:00
mmzone.c mm: improve code consistency with zonelist_* helper functions 2024-09-01 20:25:55 -07:00
mprotect.c mm/mprotect: fix dax pud handlings 2024-09-01 20:26:10 -07:00
mremap.c mm: remove page_mkclean() 2024-07-03 19:30:17 -07:00
mseal.c mseal: fix is_madv_discard() 2024-08-15 22:16:13 -07:00
msync.c
nommu.c mm: remove follow_page() 2024-09-01 20:26:01 -07:00
oom_kill.c memory: remove the now superfluous sentinel element from ctl_table array 2024-04-25 20:56:32 -07:00
page_alloc.c mm: accept to promo watermark 2024-09-01 20:26:07 -07:00
page_counter.c mm, memcg: cg2 memory{.swap,}.peak write handlers 2024-09-01 20:25:53 -07:00
page_ext.c mm: don't account memmap per-node 2024-08-15 22:16:14 -07:00
page_idle.c
page_io.c fs: remove calls to set and clear the folio error flag 2024-09-01 20:26:04 -07:00
page_isolation.c mm: page_isolation: handle unaccepted memory isolation 2024-09-01 20:26:07 -07:00
page_owner.c mm/page-owner: use gfp_nested_mask() instead of open coded masking 2024-05-19 14:40:44 -07:00
page_poison.c mm/page_poison: replace kmap_atomic() with kmap_local_page() 2023-12-10 16:51:50 -08:00
page_reporting.c mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
page_reporting.h
page_table_check.c mm/page_table_check: fix crash on ZONE_DEVICE 2024-06-15 10:43:04 -07:00
page_vma_mapped.c mm: make page_mapped_in_vma conditional on CONFIG_MEMORY_FAILURE 2024-05-05 17:53:45 -07:00
page-writeback.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
pagewalk.c mm/pagewalk: introduce folio_walk_start() + folio_walk_end() 2024-09-01 20:25:59 -07:00
percpu-internal.h mm: remove CONFIG_MEMCG_KMEM 2024-07-10 12:14:54 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c percpu: clean up all mappings when pcpu_map_pages() fails 2024-04-25 20:55:49 -07:00
percpu.c percpu: remove pcpu_alloc_size() 2024-09-01 20:26:04 -07:00
pgalloc-track.h
pgtable-generic.c mm: fix race between __split_huge_pmd_locked() and GUP-fast 2024-05-07 10:37:00 -07:00
process_vm_access.c mm: fix process_vm_rw page counts 2023-12-10 16:51:39 -08:00
ptdump.c mm: ptdump: add check_wx_pages debugfs attribute 2024-02-22 10:24:47 -08:00
readahead.c Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fix 2024-07-06 11:44:41 -07:00
rmap.c mm/rmap: minimize folio->_nr_pages_mapped updates when batching PTE (un)mapping 2024-09-01 20:26:04 -07:00
rodata_test.c
secretmem.c
shmem_quota.c shmem_quota: build the object file conditionally to the config option 2024-09-01 20:25:45 -07:00
shmem.c mm: shmem: move shmem_huge_global_enabled() into shmem_allowable_huge_orders() 2024-09-01 20:25:44 -07:00
show_mem.c lib: add memory allocations report in show_mem() 2024-04-25 20:55:57 -07:00
shrinker_debug.c
shrinker.c mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info() 2024-01-05 09:58:32 -08:00
shuffle.c
shuffle.h mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER 2024-01-08 15:27:15 -08:00
slab_common.c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
slab.h - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
slub.c mm, slub: do not call do_slab_free for kfence object 2024-07-30 11:50:00 +02:00
sparse-vmemmap.c mm: don't account memmap per-node 2024-08-15 22:16:14 -07:00
sparse.c mm: don't account memmap per-node 2024-08-15 22:16:14 -07:00
swap_cgroup.c
swap_slots.c mm: swap: update get_swap_pages() to take folio order 2024-04-25 20:56:37 -07:00
swap_state.c mm: return the folio from swapin_readahead 2024-09-01 20:26:05 -07:00
swap.c mm/swap: take folio refcount after testing the LRU flag 2024-09-01 20:26:10 -07:00
swap.h mm: return the folio from swapin_readahead 2024-09-01 20:26:05 -07:00
swapfile.c mm: return the folio from swapin_readahead 2024-09-01 20:26:05 -07:00
truncate.c mm: Fix missing folio invalidation calls during truncation 2024-08-24 16:09:16 +02:00
usercopy.c
userfaultfd.c userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c 2024-09-01 20:25:53 -07:00
util.c mm: only enforce minimum stack gap size if it's sensible 2024-09-01 20:26:02 -07:00
vma_internal.h mm: remove duplicated include in vma_internal.h 2024-09-01 20:26:02 -07:00
vma.c mm: remove arch_unmap() 2024-09-01 20:26:13 -07:00
vma.h mm: move internal core VMA manipulation functions to own file 2024-09-01 20:25:54 -07:00
vmalloc.c mm: vmalloc: add optimization hint on page existence check 2024-09-01 20:26:08 -07:00
vmpressure.c
vmscan.c memcg: use ratelimited stats flush in the reclaim 2024-09-01 20:26:13 -07:00
vmstat.c mm: print the promo watermark in zoneinfo 2024-09-01 20:25:59 -07:00
workingset.c cachestat: do not flush stats in recency check 2024-07-03 22:40:37 -07:00
z3fold.c mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool 2024-09-01 20:25:56 -07:00
zbud.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zpool.c mm: zpool: return pool size in pages 2024-04-25 20:55:48 -07:00
zsmalloc.c minmax: make generic MIN() and MAX() macros available everywhere 2024-07-28 15:49:18 -07:00
zswap.c zswap: implement a second chance algorithm for dynamic zswap shrinker 2024-09-01 20:26:02 -07:00