linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-25 21:51:40 +00:00

Shakeel Butt 90a6f2a8f4 memcg: use ratelimited stats flush in the reclaim

The Meta prod is seeing a large number of stalls in memcg stats flush from the memcg reclaim code path. At the moment, this specific callsite is doing a synchronous memcg stats flush. The rstat flush is an expensive and time-consuming operation, so concurrent reclaimers will busy-wait on the lock, potentially for a long time. This issue is not unique to Meta and has been observed by Cloudflare [1] as well. For the Cloudflare case, the stalls were due to contention between kswapd threads running on their 8 NUMA node machines, which does not make sense as rstat flush is global and a flush from one kswapd thread should be sufficient for all.

Simply replace the synchronous flush with the ratelimited one. One may raise a concern about potentially using 2 sec stale (at worst) stats for heuristics like the desirable inactive:active ratio and preferring inactive file pages over anon pages, but these specific heuristics do not require very precise stats and are also ignored under severe memory pressure.

More specifically for this code path, the stats are needed for two specific heuristics:

1. Deactivate LRUs
2. Cache trim mode

The deactivate LRUs heuristic is to maintain a desirable inactive:active ratio of the LRUs. The specific stats needed are WORKINGSET_ACTIVATE* and the hierarchical LRU size. WORKINGSET_ACTIVATE* is needed to check if there has been a refault since the last snapshot, and the LRU sizes are needed for the desirable ratio between inactive and active LRUs. See the table below on how the desirable ratio is calculated.

    /* total     target    max
     * memory    ratio     inactive
     * -------------------------------------
     *   10MB       1         5MB
     *  100MB       1        50MB
     *    1GB       3       250MB
     *   10GB      10       0.9GB
     *  100GB      31         3GB
     *    1TB     101        10GB
     *   10TB     320        32GB
     */

The desirable ratio only changes at the boundary of 1 GiB, 10 GiB, 100 GiB, 1 TiB and 10 TiB. There is no need for precise and accurate LRU size information to calculate this ratio.
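The "target ratio" and "max inactive" columns of the table above can be reproduced with a short user-space sketch. It assumes the ratio is the integer square root of 10 times the LRU size in GiB (and 1 below 1 GiB), which matches every row of the table. The function names `isqrt`, `inactive_ratio` and `max_inactive_bytes` are hypothetical and chosen for illustration; this is not the kernel code itself.

```c
#include <stdint.h>

/* Integer square root: largest r with r * r <= x (simple linear search). */
static uint64_t isqrt(uint64_t x)
{
    uint64_t r = 0;

    while ((r + 1) * (r + 1) <= x)
        r++;
    return r;
}

/*
 * Assumed formula: ratio = isqrt(10 * size_in_GiB), or 1 when the LRU is
 * smaller than 1 GiB. The size is truncated to whole GiB first, so small
 * errors in the size rarely change the result.
 */
static uint64_t inactive_ratio(uint64_t total_bytes)
{
    uint64_t gb = total_bytes >> 30;

    return gb ? isqrt(10 * gb) : 1;
}

/* Largest inactive size that still satisfies inactive * ratio <= active. */
static uint64_t max_inactive_bytes(uint64_t total_bytes)
{
    return total_bytes / (inactive_ratio(total_bytes) + 1);
}
```

Because the size is truncated to whole GiB and then passed through a square root, a stat that is a couple of seconds stale has to be off by a large amount before the ratio moves, which is the argument the commit message makes for tolerating the ratelimited flush.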
In addition, if deactivation is skipped for some LRU, the kernel will force deactivation under severe memory pressure.

For the cache trim mode, the inactive file LRU size is read and the kernel scales it down based on the reclaim iteration (file >> sc->priority) and only checks whether it is zero or not. Again, precise information is not needed.

This patch has been running on the Meta fleet for several months and we have not observed any issues. Please note that MGLRU is not impacted by this issue at all as it avoids rstat flushing completely.

Link: https://lore.kernel.org/all/6ee2518b-81dd-4082-bdf5-322883895ffc@kernel.org [1]
Link: https://lkml.kernel.org/r/20240813215358.2259750-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

2024-09-01 20:26:13 -07:00
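The cache trim check described in the commit message (file >> sc->priority, tested only for zero or nonzero) can be illustrated with a minimal user-space sketch. The helper name `cache_trim_candidate` is hypothetical, the page counts are made-up inputs, and any other conditions the kernel applies before enabling cache trim mode are left out.

```c
#include <stdbool.h>
#include <limits.h>

/*
 * Returns true when the inactive file LRU size, scaled down by the
 * reclaim priority, is still nonzero. Only "zero or not" matters,
 * so the exact page count does not need to be precise.
 */
static bool cache_trim_candidate(unsigned long inactive_file_pages,
                                 unsigned int priority)
{
    /* Shifting by the type width or more is undefined in C; treat as zero. */
    if (priority >= sizeof(unsigned long) * CHAR_BIT)
        return false;

    return (inactive_file_pages >> priority) != 0;
}
```

With a priority of 12, for example, any inactive file size of 4096 pages or more gives the same answer, so a count that lags by a few thousand pages only matters right at that threshold.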
..
damon	mm/damon/lru_sort: adjust local variable to dynamic allocation	2024-09-01 20:25:45 -07:00
kasan	kasan: fix bad call to unpoison_slab_object	2024-06-24 20:52:09 -07:00
kfence	kfence: save freeing stack trace at calling time instead of freeing time	2024-09-01 20:26:12 -07:00
kmsan	kmsan: do not pass NULL pointers as 0	2024-07-03 19:30:26 -07:00
backing-dev.c	writeback: support retrieving per group debug writeback stats of bdi	2024-05-05 17:53:51 -07:00
balloon_compaction.c	mm: remove MIGRATE_SYNC_NO_COPY mode	2024-07-03 19:30:00 -07:00
bootmem_info.c
cma_debug.c
cma_sysfs.c	mm/cma: add sysfs file 'release_pages_success'	2024-02-22 10:24:57 -08:00
cma.c	mm/cma: change the addition of totalcma_pages in the cma_init_reserved_mem	2024-09-01 20:25:56 -07:00
cma.h	mm/cma: add sysfs file 'release_pages_success'	2024-02-22 10:24:57 -08:00
compaction.c	sysctl: treewide: constify the ctl_table argument of proc_handlers	2024-07-24 20:59:29 +02:00
debug_page_alloc.c	mm: page_alloc: consolidate free page accounting	2024-04-25 20:56:04 -07:00
debug_page_ref.c
debug_vm_pgtable.c	mm/debug_vm_pgtable: drop RANDOM_ORVALUE trick	2024-06-15 10:43:08 -07:00
debug.c	mm/debug: print only page mapcount (excluding folio entire mapcount) in __dump_folio()	2024-05-05 17:53:31 -07:00
dmapool_test.c	mm/dmapool: add MODULE_DESCRIPTION()	2024-07-03 19:29:58 -07:00
dmapool.c	mm/mempool/dmapool: remove CONFIG_DEBUG_SLAB ifdefs	2023-12-05 11:17:58 +01:00
early_ioremap.c
execmem.c	mm/execmem, arch: convert remaining overrides of module_alloc to execmem	2024-05-14 00:31:43 -07:00
fadvise.c
fail_page_alloc.c	mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC	2024-07-17 21:05:18 -07:00
failslab.c	mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB	2024-07-17 21:05:18 -07:00
filemap.c	filemap: add trace events for get_pages, map_pages, and fault	2024-09-01 20:26:10 -07:00
folio-compat.c	mm: remove page_mapping()	2024-07-03 19:29:59 -07:00
gup_test.c
gup_test.h
gup.c	mm: remove follow_page()	2024-09-01 20:26:01 -07:00
highmem.c	mm/highmem: make nr_free_highpages() return "unsigned long"	2024-07-03 19:30:06 -07:00
hmm.c	mm: provide mm_struct and address to huge_ptep_get()	2024-07-12 15:52:15 -07:00
huge_memory.c	mm/mprotect: fix dax pud handlings	2024-09-01 20:26:10 -07:00
hugetlb_cgroup.c	mm: memcg: don't call propagate_protected_usage() needlessly	2024-09-01 20:25:50 -07:00
hugetlb_vmemmap.c	mm/hugetlb_vmemmap: don't synchronize_rcu() without HVO	2024-09-01 20:25:45 -07:00
hugetlb_vmemmap.h
hugetlb.c	mm/hugetlb_vmemmap: batch HVO work when demoting	2024-09-01 20:26:10 -07:00
hwpoison-inject.c	mm/hwpoison: add MODULE_DESCRIPTION()	2024-07-03 19:29:58 -07:00
init-mm.c	mm: Deprecate pasid field	2023-12-12 10:11:32 +01:00
internal.h	mm: add a helper to accept page	2024-09-01 20:26:07 -07:00
interval_tree.c
io-mapping.c
ioremap.c
Kconfig	mm: turn USE_SPLIT_PTE_PTLOCKS / USE_SPLIT_PTE_PTLOCKS into Kconfig options	2024-09-01 20:25:51 -07:00
Kconfig.debug	mm/slub: unify all sl[au]b parameters with "slab_$param"	2024-01-22 10:31:08 +01:00
khugepaged.c	- 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN	2024-07-21 17:15:46 -07:00
kmemleak.c	kmemleak: enable tracking for percpu pointers	2024-09-01 20:25:49 -07:00
ksm.c	mm/ksm: convert break_ksm() from walk_page_range_vma() to folio_walk	2024-09-01 20:26:02 -07:00
list_lru.c	mm: list_lru: fix UAF for memory cgroup	2024-08-07 18:33:56 -07:00
maccess.c
madvise.c	Random number generator updates for Linux 6.11-rc1.	2024-07-24 10:29:50 -07:00
Makefile	mm: move internal core VMA manipulation functions to own file	2024-09-01 20:25:54 -07:00
mapping_dirty_helpers.c
memblock.c	mm: rework accept memory helpers	2024-09-01 20:26:07 -07:00
memcontrol-v1.c	memcg_write_event_control(): fix a user-triggerable oops	2024-08-12 21:58:44 -04:00
memcontrol-v1.h	mm: memcg: gather memcg1-specific fields initialization in memcg1_memcg_init()	2024-07-04 18:05:56 -07:00
memcontrol.c	memcg: replace memcg ID idr with xarray	2024-09-01 20:26:05 -07:00
memfd.c	mm/gup: introduce memfd_pin_folios() for pinning memfd folios	2024-07-12 15:52:09 -07:00
memory_hotplug.c	mm/memory_hotplug: get rid of __ref	2024-09-01 20:25:56 -07:00
memory-failure.c	mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu	2024-08-15 22:16:14 -07:00
memory-tiers.c	memory tiering: introduce folio_use_access_time() check	2024-09-01 20:25:47 -07:00
memory.c	mm/migrate: move common code to numa_migrate_check (was numa_migrate_prep)	2024-09-01 20:26:06 -07:00
mempolicy.c	mm: improve code consistency with zonelist_* helper functions	2024-09-01 20:25:55 -07:00
mempool.c	mm: fix xyz_noprof functions calling profiled functions	2024-06-05 19:19:26 -07:00
memremap.c	mm: convert put_devmap_managed_page_refs() to put_devmap_managed_folio_refs()	2024-05-05 17:53:49 -07:00
memtest.c	memtest: use {READ,WRITE}_ONCE in memory scanning	2024-03-13 12:12:21 -07:00
migrate_device.c	mm: extend rmap flags arguments for folio_add_new_anon_rmap	2024-07-03 19:30:18 -07:00
migrate.c	fs: remove calls to set and clear the folio error flag	2024-09-01 20:26:04 -07:00
mincore.c	mm: provide mm_struct and address to huge_ptep_get()	2024-07-12 15:52:15 -07:00
mlock.c	Random number generator updates for Linux 6.11-rc1.	2024-07-24 10:29:50 -07:00
mm_init.c	mm: rework accept memory helpers	2024-09-01 20:26:07 -07:00
mm_slot.h
mmap_lock.c	mm: mmap_lock: replace get_memcg_path_buf() with on-stack buffer	2024-07-03 19:30:26 -07:00
mmap.c	mm: remove legacy install_special_mapping() code	2024-09-01 20:26:13 -07:00
mmu_gather.c	mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing	2024-02-22 15:27:17 -08:00
mmu_notifier.c	mm: move internal core VMA manipulation functions to own file	2024-09-01 20:25:54 -07:00
mmzone.c	mm: improve code consistency with zonelist_* helper functions	2024-09-01 20:25:55 -07:00
mprotect.c	mm/mprotect: fix dax pud handlings	2024-09-01 20:26:10 -07:00
mremap.c	mm: remove page_mkclean()	2024-07-03 19:30:17 -07:00
mseal.c	mseal: fix is_madv_discard()	2024-08-15 22:16:13 -07:00
msync.c
nommu.c	mm: remove follow_page()	2024-09-01 20:26:01 -07:00
oom_kill.c	memory: remove the now superfluous sentinel element from ctl_table array	2024-04-25 20:56:32 -07:00
page_alloc.c	mm: accept to promo watermark	2024-09-01 20:26:07 -07:00
page_counter.c	mm, memcg: cg2 memory{.swap,}.peak write handlers	2024-09-01 20:25:53 -07:00
page_ext.c	mm: don't account memmap per-node	2024-08-15 22:16:14 -07:00
page_idle.c
page_io.c	fs: remove calls to set and clear the folio error flag	2024-09-01 20:26:04 -07:00
page_isolation.c	mm: page_isolation: handle unaccepted memory isolation	2024-09-01 20:26:07 -07:00
page_owner.c	mm/page-owner: use gfp_nested_mask() instead of open coded masking	2024-05-19 14:40:44 -07:00
page_poison.c	mm/page_poison: replace kmap_atomic() with kmap_local_page()	2023-12-10 16:51:50 -08:00
page_reporting.c	mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER	2024-01-08 15:27:15 -08:00
page_reporting.h
page_table_check.c	mm/page_table_check: fix crash on ZONE_DEVICE	2024-06-15 10:43:04 -07:00
page_vma_mapped.c	mm: make page_mapped_in_vma conditional on CONFIG_MEMORY_FAILURE	2024-05-05 17:53:45 -07:00
page-writeback.c	sysctl: treewide: constify the ctl_table argument of proc_handlers	2024-07-24 20:59:29 +02:00
pagewalk.c	mm/pagewalk: introduce folio_walk_start() + folio_walk_end()	2024-09-01 20:25:59 -07:00
percpu-internal.h	mm: remove CONFIG_MEMCG_KMEM	2024-07-10 12:14:54 -07:00
percpu-km.c
percpu-stats.c
percpu-vm.c	percpu: clean up all mappings when pcpu_map_pages() fails	2024-04-25 20:55:49 -07:00
percpu.c	percpu: remove pcpu_alloc_size()	2024-09-01 20:26:04 -07:00
pgalloc-track.h
pgtable-generic.c	mm: fix race between __split_huge_pmd_locked() and GUP-fast	2024-05-07 10:37:00 -07:00
process_vm_access.c	mm: fix process_vm_rw page counts	2023-12-10 16:51:39 -08:00
ptdump.c	mm: ptdump: add check_wx_pages debugfs attribute	2024-02-22 10:24:47 -08:00
readahead.c	Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fix	2024-07-06 11:44:41 -07:00
rmap.c	mm/rmap: minimize folio->_nr_pages_mapped updates when batching PTE (un)mapping	2024-09-01 20:26:04 -07:00
rodata_test.c
secretmem.c
shmem_quota.c	shmem_quota: build the object file conditionally to the config option	2024-09-01 20:25:45 -07:00
shmem.c	mm: shmem: move shmem_huge_global_enabled() into shmem_allowable_huge_orders()	2024-09-01 20:25:44 -07:00
show_mem.c	lib: add memory allocations report in show_mem()	2024-04-25 20:55:57 -07:00
shrinker_debug.c
shrinker.c	mm: shrinker: use kvzalloc_node() from expand_one_shrinker_info()	2024-01-05 09:58:32 -08:00
shuffle.c
shuffle.h	mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER	2024-01-08 15:27:15 -08:00
slab_common.c	- 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN	2024-07-21 17:15:46 -07:00
slab.h	- 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN	2024-07-21 17:15:46 -07:00
slub.c	mm, slub: do not call do_slab_free for kfence object	2024-07-30 11:50:00 +02:00
sparse-vmemmap.c	mm: don't account memmap per-node	2024-08-15 22:16:14 -07:00
sparse.c	mm: don't account memmap per-node	2024-08-15 22:16:14 -07:00
swap_cgroup.c
swap_slots.c	mm: swap: update get_swap_pages() to take folio order	2024-04-25 20:56:37 -07:00
swap_state.c	mm: return the folio from swapin_readahead	2024-09-01 20:26:05 -07:00
swap.c	mm/swap: take folio refcount after testing the LRU flag	2024-09-01 20:26:10 -07:00
swap.h	mm: return the folio from swapin_readahead	2024-09-01 20:26:05 -07:00
swapfile.c	mm: return the folio from swapin_readahead	2024-09-01 20:26:05 -07:00
truncate.c	mm: Fix missing folio invalidation calls during truncation	2024-08-24 16:09:16 +02:00
usercopy.c
userfaultfd.c	userfaultfd: move core VMA manipulation logic to mm/userfaultfd.c	2024-09-01 20:25:53 -07:00
util.c	mm: only enforce minimum stack gap size if it's sensible	2024-09-01 20:26:02 -07:00
vma_internal.h	mm: remove duplicated include in vma_internal.h	2024-09-01 20:26:02 -07:00
vma.c	mm: remove arch_unmap()	2024-09-01 20:26:13 -07:00
vma.h	mm: move internal core VMA manipulation functions to own file	2024-09-01 20:25:54 -07:00
vmalloc.c	mm: vmalloc: add optimization hint on page existence check	2024-09-01 20:26:08 -07:00
vmpressure.c
vmscan.c	memcg: use ratelimited stats flush in the reclaim	2024-09-01 20:26:13 -07:00
vmstat.c	mm: print the promo watermark in zoneinfo	2024-09-01 20:25:59 -07:00
workingset.c	cachestat: do not flush stats in recency check	2024-07-03 22:40:37 -07:00
z3fold.c	mm/z3fold: add __percpu annotation to *unbuddied pointer in struct z3fold_pool	2024-09-01 20:25:56 -07:00
zbud.c	mm: zpool: return pool size in pages	2024-04-25 20:55:48 -07:00
zpool.c	mm: zpool: return pool size in pages	2024-04-25 20:55:48 -07:00
zsmalloc.c	minmax: make generic MIN() and MAX() macros available everywhere	2024-07-28 15:49:18 -07:00
zswap.c	zswap: implement a second chance algorithm for dynamic zswap shrinker	2024-09-01 20:26:02 -07:00