================
Memory Balancing
================

Started Jan 2000 by Kanoj Sarcar <kanoj@sgi.com>

Memory balancing is needed for !__GFP_HIGH and !__GFP_KSWAPD_RECLAIM as
well as for non __GFP_IO allocations.

The first reason why a caller may avoid reclaim is that the caller cannot
sleep because it holds a spinlock or is in interrupt context. The second may
be that the caller is willing to fail the allocation without incurring the
overhead of page reclaim. This may happen for opportunistic high-order
allocation requests that have order-0 fallback options. In such cases,
the caller may also wish to avoid waking kswapd.

Non __GFP_IO allocation requests are made to prevent file system deadlocks.

In the absence of non-sleepable allocation requests, it seems detrimental
to be doing balancing. Page reclamation can be kicked off lazily, only
when needed (when zone free memory is 0), instead of making it
a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
On the other hand, if there are a lot of free dma pages, it is preferable
to satisfy regular memory requests by allocating from the dma pool, instead
of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with the presence of this problem. In 2.3, due to
HIGHMEM, this problem is aggravated.

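The 2.2 trigger described above can be sketched in a few lines of userspace
C. This is only an illustration under stated assumptions: struct zone_counts
and global_needs_balance() are hypothetical names, not kernel code, and the
zone is reduced to two page counters.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone: only the page counts matter here. */
struct zone_counts {
	unsigned long total_pages;
	unsigned long free_pages;
};

/*
 * 2.2-style trigger as described in the text: balance only when the free
 * pages summed over ALL zones fall below 1/64th of the pages summed over
 * all zones. An individual zone can be empty without tripping this check.
 */
bool global_needs_balance(const struct zone_counts *zones, size_t nr_zones)
{
	unsigned long total = 0, free_sum = 0;

	for (size_t i = 0; i < nr_zones; i++) {
		total += zones[i].total_pages;
		free_sum += zones[i].free_pages;
	}
	return free_sum < total / 64;
}
```

As a worked case, assume 4 KiB pages, a 16 MiB dma zone (4096 pages) and a
112 MiB regular zone (28672 pages). The threshold is (4096 + 28672) / 64 = 512
pages. With the dma zone completely empty and 600 free pages in the regular
zone, the check above does not fire, which is the problem the text describes.
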
In 2.3, zone balancing can be done in one of two ways. In the first,
depending on the zone size (and possibly on the size of lower class zones),
we decide at init time how many free pages we should aim for while balancing
any zone. The good part is that, while balancing, we do not need to look at
sizes of lower class zones; the bad part is that we might balance too
frequently because we ignore possibly lower usage in the lower class zones.
Also, with a slight change in the allocation routine, it is possible to reduce
the memclass() macro to be a simple equality.

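A minimal sketch of this first approach follows. It is an assumption-laden
illustration: struct zone_target, zone_target_init() and zone_below_target()
are hypothetical names, and the 1/64 ratio is borrowed from the 2.2 figure
only for the sake of the example, since the text does not say which fraction
this approach would use.

```c
#include <stdbool.h>

/* Assumed ratio, reused from the 2.2 description purely for illustration. */
#define ASSUMED_TARGET_RATIO 64

/* Hypothetical, simplified zone with a free-page target fixed at init. */
struct zone_target {
	unsigned long total_pages;
	unsigned long free_pages;
	unsigned long free_target;
};

/* Decide once, at init time, how many free pages to aim for. */
void zone_target_init(struct zone_target *z, unsigned long total_pages)
{
	z->total_pages = total_pages;
	z->free_pages = total_pages;
	z->free_target = total_pages / ASSUMED_TARGET_RATIO;
}

/* The balancing check looks at this zone only, never at lower zones. */
bool zone_below_target(const struct zone_target *z)
{
	return z->free_pages < z->free_target;
}
```

The check is cheap because it touches a single zone, but it also shows the
drawback named in the text: free pages sitting in lower class zones are never
taken into account.
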
Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.

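The cumulative check of this second solution might look like the sketch
below. The names struct zone_counts and class_needs_balance() are
hypothetical, and the array is assumed to be ordered from the lowest class
zone (for example dma) to the highest (for example highmem).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone: only the page counts matter here. */
struct zone_counts {
	unsigned long total_pages;
	unsigned long free_pages;
};

/*
 * zones[] is assumed to be ordered from lowest class to highest class.
 * Zone idx needs balancing when the free pages of zones[0..idx] fall
 * below 1/64th of the total pages of zones[0..idx].
 */
bool class_needs_balance(const struct zone_counts *zones, size_t idx)
{
	unsigned long total = 0, free_sum = 0;

	for (size_t i = 0; i <= idx; i++) {
		total += zones[i].total_pages;
		free_sum += zones[i].free_pages;
	}
	return free_sum < total / 64;
}
```

With a 4096-page dma zone that is empty and a 28672-page regular zone holding
600 free pages, the dma zone (idx 0) is now flagged, because 0 is below
4096 / 64 = 64, while the regular zone (idx 1) is not, because 600 is not
below 32768 / 64 = 512. The empty dma zone no longer goes unnoticed.
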
Note that if the size of the regular zone is huge compared to the dma zone,
it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into the regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done,
probably because all allocation requests are coming from interrupt context
and all process contexts are sleeping. For 2.3, kswapd does not really
need to balance the highmem zone, since interrupt context does not request
highmem pages. kswapd looks at the zone_wake_kswapd field in the zone
structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
are per-zone fields, used to determine when a zone needs to be balanced. When
the number of free pages falls below watermark[WMARK_MIN], the hysteresis field
low_on_memory gets set. This stays set till the number of free pages reaches
watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
try to free some pages in the zone (providing GFP_WAIT is set in the request).
Orthogonal to this is the decision to poke kswapd to free some zone pages.
That decision is not hysteresis based, and is made when the number of free
pages is below watermark[WMARK_LOW], in which case zone_wake_kswapd is also set.


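The interplay of the two flags can be sketched as a small state update. This
is a userspace illustration, not the kernel's code: struct zone_state and
zone_update_state() are hypothetical, and clearing zone_wake_kswapd as soon as
free pages are no longer below watermark[WMARK_LOW] is an assumption, since
the text only says when the flag is set.

```c
#include <stdbool.h>

enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical, simplified zone holding only the fields discussed above. */
struct zone_state {
	unsigned long free_pages;
	unsigned long watermark[NR_WMARK];
	bool low_on_memory;	/* hysteresis: set below MIN, cleared at HIGH */
	bool zone_wake_kswapd;	/* no hysteresis: follows the LOW watermark */
};

/* Re-evaluate both flags after free_pages has changed. */
void zone_update_state(struct zone_state *z)
{
	if (z->free_pages < z->watermark[WMARK_MIN])
		z->low_on_memory = true;
	else if (z->free_pages >= z->watermark[WMARK_HIGH])
		z->low_on_memory = false;
	/* Between MIN and HIGH, low_on_memory keeps its previous value. */

	/* Assumption: the flag is recomputed on every update. */
	z->zone_wake_kswapd = z->free_pages < z->watermark[WMARK_LOW];
}
```

With watermarks of 10, 20 and 30 pages, dropping to 5 free pages sets both
flags. Recovering to 25 free pages clears zone_wake_kswapd but leaves
low_on_memory set, and only reaching 30 free pages clears low_on_memory.
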
(Good) Ideas that I have heard:

1. Dynamic experience should influence balancing: number of failed requests
   for a zone can be tracked and fed into the balancing scheme (jalvo@mbay.net)
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
   dma pages. (lkd@tantalophile.demon.co.uk)