mirror of
https://github.com/torvalds/linux.git
synced 2024-11-23 12:42:02 +00:00
de120e1d69
2100 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Linus Torvalds
|
902861e34c |
- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames
from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZfJpPQAKCRDdBJ7gKXxA joxeAP9TrcMEuHnLmBlhIXkWbIR4+ki+pA3v+gNTlJiBhnfVSgD9G55t1aBaRplx TMNhHfyiHYDTx/GAV9NXW84tasJSDgA= =TG55 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390". - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" "mm: convert mm counter to take a folio" - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree". - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations. - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest. - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()". - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory. - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that. - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable" "selftests/damon: add more tests for core functionalities and corner cases" "Docs/mm/damon: misc readability improvements" "mm/damon: let DAMOS feeds and tame/auto-tune itself" - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL. - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute". - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests". - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process out selftesting results. - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios. - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice. - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work. - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code. - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims. - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring". - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed. - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on archiecctures which have virtually aliasing data caches. The arm architecture is the main beneficiary. - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations. - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" "page_owner: Fixup and cleanup" - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark. - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items". - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" "mm/zsmalloc: some cleanup for get/set_zspage_mapping()" - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This a step along the path to implementaton of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction". - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator". - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock". - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios". - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup. - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing". - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages. - Matthew Wilcox's series "PageFlags cleanups" does that. - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected. - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()". - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests". - Also, of course, many singleton patches to many things. Please see the individual changelogs for details. * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits) mm/zswap: remove the memcpy if acomp is not sleepable crypto: introduce: acomp_is_async to expose if comp drivers might sleep memtest: use {READ,WRITE}_ONCE in memory scanning mm: prohibit the last subpage from reusing the entire large folio mm: recover pud_leaf() definitions in nopmd case selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements selftests/mm: skip uffd hugetlb tests with insufficient hugepages selftests/mm: dont fail testsuite due to a lack of hugepages mm/huge_memory: skip invalid debugfs new_order input for folio split mm/huge_memory: check new folio order when split a folio mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure mm: add an explicit smp_wmb() to UFFDIO_CONTINUE mm: fix list corruption in put_pages_list mm: remove folio from deferred split list before uncharging it filemap: avoid unnecessary major faults in filemap_fault() mm,page_owner: drop unnecessary check mm,page_owner: check for null stack_record before bumping its refcount mm: swap: fix race between free_swap_and_cache() and swapoff() mm/treewide: align up pXd_leaf() retval across archs mm/treewide: drop pXd_large() ... |
||
Linus Torvalds
|
9187210eee |
Networking changes for 6.9.
Core & protocols ---------------- - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc.) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code -------------------------------------------- - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter --------- - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF --- - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless -------- - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API ---------- - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc ---- - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers ------- - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmXv0mgACgkQMUZtbf5S IrtgMxAAuRd+WJW++SENr4KxIWhYO1q6Xcxnai43wrNkan9swD24icG8TYALt4f3 yoT6idQvWReAb5JNlh9rUQz8R7E0nJXlvEFn5MtJwcthx2C6wFo/XkJlddlRrT+j c2xGILwLjRhW65LaC0MZ2ECbEERkFz8xcGfK2SWzUgh6KYvPjcRfKFxugpM7xOQK P/Wnqhs4fVRS/Mj/bCcXcO+yhwC121Q3qVeQVjGS0AzEC65hAW87a/kc2BfgcegD EyI9R7mf6criQwX+0awubjfoIdr4oW/8oDVNvUDczkJkbaEVaLMQk9P5x/0XnnVS UHUchWXyI80Q8Rj12uN1/I0h3WtwNQnCRBuLSmtm6GLfCAwbLvp2nGWDnaXiqryW DVKUIHGvqPKjkOOMOVfSvfB3LvkS3xsFVVYiQBQCn0YSs/gtu4CoF2Nty9CiLPbK tTuxUnLdPDZDxU//l0VArZmP8p2JM7XQGJ+JH8GFH4SBTyBR23e0iyPSoyaxjnYn RReDnHMVsrS1i7GPhbqDJWn+uqMSs7N149i0XmmyeqwQHUVSJN3J2BApP2nCaDfy H2lTuYly5FfEezt61NvCE4qr/VsWeEjm1fYlFQ9dFn4pGn+HghyCpw+xD1ZN56DN lujemau5B3kk1UTtAT4ypPqvuqjkRFqpNV2LzsJSk/Js+hApw8Y= =oY52 -----END PGP SIGNATURE----- Merge tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Large effort by Eric to lower rtnl_lock pressure and remove locks: - Make commonly used parts of rtnetlink (address, route dumps etc) lockless, protected by RCU instead of rtnl_lock. - Add a netns exit callback which already holds rtnl_lock, allowing netns exit to take rtnl_lock once in the core instead of once for each driver / callback. - Remove locks / serialization in the socket diag interface. - Remove 6 calls to synchronize_rcu() while holding rtnl_lock. - Remove the dev_base_lock, depend on RCU where necessary. - Support busy polling on a per-epoll context basis. Poll length and budget parameters can be set independently of system defaults. - Introduce struct net_hotdata, to make sure read-mostly global config variables fit in as few cache lines as possible. - Add optional per-nexthop statistics to ease monitoring / debug of ECMP imbalance problems. - Support TCP_NOTSENT_LOWAT in MPTCP. - Ensure that IPv6 temporary addresses' preferred lifetimes are long enough, compared to other configured lifetimes, and at least 2 sec. - Support forwarding of ICMP Error messages in IPSec, per RFC 4301. - Add support for the independent control state machine for bonding per IEEE 802.1AX-2008 5.4.15 in addition to the existing coupled control state machine. - Add "network ID" to MCTP socket APIs to support hosts with multiple disjoint MCTP networks. - Re-use the mono_delivery_time skbuff bit for packets which user space wants to be sent at a specified time. Maintain the timing information while traversing veth links, bridge etc. - Take advantage of MSG_SPLICE_PAGES for RxRPC DATA and ACK packets. - Simplify many places iterating over netdevs by using an xarray instead of a hash table walk (hash table remains in place, for use on fastpaths). - Speed up scanning for expired routes by keeping a dedicated list. - Speed up "generic" XDP by trying harder to avoid large allocations. - Support attaching arbitrary metadata to netconsole messages. Things we sprinkled into general kernel code: - Enforce VM_IOREMAP flag and range in ioremap_page_range and introduce VM_SPARSE kind and vm_area_[un]map_pages (used by bpf_arena). - Rework selftest harness to enable the use of the full range of ksft exit code (pass, fail, skip, xfail, xpass). Netfilter: - Allow userspace to define a table that is exclusively owned by a daemon (via netlink socket aliveness) without auto-removing this table when the userspace program exits. Such table gets marked as orphaned and a restarting management daemon can re-attach/regain ownership. - Speed up element insertions to nftables' concatenated-ranges set type. Compact a few related data structures. BPF: - Add BPF token support for delegating a subset of BPF subsystem functionality from privileged system-wide daemons such as systemd through special mount options for userns-bound BPF fs to a trusted & unprivileged application. - Introduce bpf_arena which is sparse shared memory region between BPF program and user space where structures inside the arena can have pointers to other areas of the arena, and pointers work seamlessly for both user-space programs and BPF programs. - Introduce may_goto instruction that is a contract between the verifier and the program. The verifier allows the program to loop assuming it's behaving well, but reserves the right to terminate it. - Extend the BPF verifier to enable static subprog calls in spin lock critical sections. - Support registration of struct_ops types from modules which helps projects like fuse-bpf that seeks to implement a new struct_ops type. - Add support for retrieval of cookies for perf/kprobe multi links. - Support arbitrary TCP SYN cookie generation / validation in the TC layer with BPF to allow creating SYN flood handling in BPF firewalls. - Add code generation to inline the bpf_kptr_xchg() helper which improves performance when stashing/popping the allocated BPF objects. Wireless: - Add SPP (signaling and payload protected) AMSDU support. - Support wider bandwidth OFDMA, as required for EHT operation. Driver API: - Major overhaul of the Energy Efficient Ethernet internals to support new link modes (2.5GE, 5GE), share more code between drivers (especially those using phylib), and encourage more uniform behavior. Convert and clean up drivers. - Define an API for querying per netdev queue statistics from drivers. - IPSec: account in global stats for fully offloaded sessions. - Create a concept of Ethernet PHY Packages at the Device Tree level, to allow parameterizing the existing PHY package code. - Enable Rx hashing (RSS) on GTP protocol fields. Misc: - Improvements and refactoring all over networking selftests. - Create uniform module aliases for TC classifiers, actions, and packet schedulers to simplify creating modprobe policies. - Address all missing MODULE_DESCRIPTION() warnings in networking. - Extend the Netlink descriptions in YAML to cover message encapsulation or "Netlink polymorphism", where interpretation of nested attributes depends on link type, classifier type or some other "class type". Drivers: - Ethernet high-speed NICs: - Add a new driver for Marvell's Octeon PCI Endpoint NIC VF. - Intel (100G, ice, idpf): - support E825-C devices - nVidia/Mellanox: - support devices with one port and multiple PCIe links - Broadcom (bnxt): - support n-tuple filters - support configuring the RSS key - Wangxun (ngbe/txgbe): - implement irq_domain for TXGBE's sub-interrupts - Pensando/AMD: - support XDP - optimize queue submission and wakeup handling (+17% bps) - optimize struct layout, saving 28% of memory on queues - Ethernet NICs embedded and virtual: - Google cloud vNIC: - refactor driver to perform memory allocations for new queue config before stopping and freeing the old queue memory - Synopsys (stmmac): - obey queueMaxSDU and implement counters required by 802.1Qbv - Renesas (ravb): - support packet checksum offload - suspend to RAM and runtime PM support - Ethernet switches: - nVidia/Mellanox: - support for nexthop group statistics - Microchip: - ksz8: implement PHY loopback - add support for KSZ8567, a 7-port 10/100Mbps switch - PTP: - New driver for RENESAS FemtoClock3 Wireless clock generator. - Support OCP PTP cards designed and built by Adva. - CAN: - Support recvmsg() flags for own, local and remote traffic on CAN BCM sockets. - Support for esd GmbH PCIe/402 CAN device family. - m_can: - Rx/Tx submission coalescing - wake on frame Rx - WiFi: - Intel (iwlwifi): - enable signaling and payload protected A-MSDUs - support wider-bandwidth OFDMA - support for new devices - bump FW API to 89 for AX devices; 90 for BZ/SC devices - MediaTek (mt76): - mt7915: newer ADIE version support - mt7925: radio temperature sensor support - Qualcomm (ath11k): - support 6 GHz station power modes: Low Power Indoor (LPI), Standard Power) SP and Very Low Power (VLP) - QCA6390 & WCN6855: support 2 concurrent station interfaces - QCA2066 support - Qualcomm (ath12k): - refactoring in preparation for Multi-Link Operation (MLO) support - 1024 Block Ack window size support - firmware-2.bin support - support having multiple identical PCI devices (firmware needs to have ATH12K_FW_FEATURE_MULTI_QRTR_ID) - QCN9274: support split-PHY devices - WCN7850: enable Power Save Mode in station mode - WCN7850: P2P support - RealTek: - rtw88: support for more rtw8811cu and rtw8821cu devices - rtw89: support SCAN_RANDOM_SN and SET_SCAN_DWELL - rtlwifi: speed up USB firmware initialization - rtwl8xxxu: - RTL8188F: concurrent interface support - Channel Switch Announcement (CSA) support in AP mode - Broadcom (brcmfmac): - per-vendor feature support - per-vendor SAE password setup - DMI nvram filename quirk for ACEPC W5 Pro" * tag 'net-next-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2255 commits) nexthop: Fix splat with CONFIG_DEBUG_PREEMPT=y nexthop: Fix out-of-bounds access during attribute validation nexthop: Only parse NHA_OP_FLAGS for dump messages that require it nexthop: Only parse NHA_OP_FLAGS for get messages that require it bpf: move sleepable flag from bpf_prog_aux to bpf_prog bpf: hardcode BPF_PROG_PACK_SIZE to 2MB * num_possible_nodes() selftests/bpf: Add kprobe multi triggering benchmarks ptp: Move from simple ida to xarray vxlan: Remove generic .ndo_get_stats64 vxlan: Do not alloc tstats manually devlink: Add comments to use netlink gen tool nfp: flower: handle acti_netdevs allocation failure net/packet: Add getsockopt support for PACKET_COPY_THRESH net/netlink: Add getsockopt support for NETLINK_LISTEN_ALL_NSID selftests/bpf: Add bpf_arena_htab test. selftests/bpf: Add bpf_arena_list test. selftests/bpf: Add unit tests for bpf_arena_alloc/free_pages bpf: Add helper macro bpf_addr_space_cast() libbpf: Recognize __arena global variables. bpftool: Recognize arena map type ... |
||
Thorsten Blum
|
72741db683 |
mm: page_alloc: use div64_ul() instead of do_div()
Fixes Coccinelle/coccicheck warning reported by do_div.cocci. Compared to do_div(), div64_ul() does not implicitly cast the divisor and does not unnecessarily calculate the remainder. Link: https://lkml.kernel.org/r/20240228224911.1164-2-thorsten.blum@toblux.com Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yunsheng Lin
|
a0727489ac |
net: introduce page_frag_cache_drain()
When draining a page_frag_cache, most user are doing the similar steps, so introduce an API to avoid code duplication. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Acked-by: Jason Wang <jasowang@redhat.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
Yunsheng Lin
|
4bc0d63a23 |
page_frag: unify gfp bits for order 3 page allocation
Currently there seems to be three page frag implementations which all try to allocate order 3 page, if that fails, it then fail back to allocate order 0 page, and each of them all allow order 3 page allocation to fail under certain condition by using specific gfp bits. The gfp bits for order 3 page allocation are different between different implementation, __GFP_NOMEMALLOC is or'd to forbid access to emergency reserves memory for __page_frag_cache_refill(), but it is not or'd in other implementions, __GFP_DIRECT_RECLAIM is masked off to avoid direct reclaim in vhost_net_page_frag_refill(), but it is not masked off in __page_frag_cache_refill(). This patch unifies the gfp bits used between different implementions by or'ing __GFP_NOMEMALLOC and masking off __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid possible pressure for mm. Leave the gfp unifying for page frag implementation in sock.c for now as suggested by Paolo Abeni. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
Yunsheng Lin
|
411c5f3680 |
mm/page_alloc: modify page_frag_alloc_align() to accept align as an argument
napi_alloc_frag_align() and netdev_alloc_frag_align() accept align as an argument, and they are thin wrappers around the __napi_alloc_frag_align() and __netdev_alloc_frag_align() APIs doing the alignment checking and align mask conversion, in order to call page_frag_alloc_align() directly. The intention here is to keep the alignment checking and the alignmask conversion in in-line wrapper to avoid those kind of operations during execution time since it can usually be handled during compile time. We are going to use page_frag_alloc_align() in vhost_net.c, it need the same kind of alignment checking and alignmask conversion, so split up page_frag_alloc_align into an inline wrapper doing the above operation, and add __page_frag_alloc_align() which is passed with the align mask the original function expected as suggested by Alexander. Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> CC: Alexander Duyck <alexander.duyck@gmail.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> |
||
Richard Chang
|
c8b3600312 |
mm: add alloc_contig_migrate_range allocation statistics
alloc_contig_migrate_range has every information to be able to understand big contiguous allocation latency. For example, how many pages are migrated, how many times they were needed to unmap from page tables. This patch adds the trace event to collect the allocation statistics. In the field, it was quite useful to understand CMA allocation latency. [akpm@linux-foundation.org: a/trace_mm_alloc_config_migrate_range_info_enabled/trace_mm_alloc_contig_migrate_range_info_enabled] Link: https://lkml.kernel.org/r/20240228051127.2859472-1-richardycc@google.com Signed-off-by: Richard Chang <richardycc@google.com> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org. Cc: Martin Liu <liumartin@google.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
8b7b0a5eee |
mm: remove free_unref_page_list()
All callers now use free_unref_folios() so we can delete this function. Link: https://lkml.kernel.org/r/20240227174254.710559-15-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
31b2ff82ae |
mm: handle large folios in free_unref_folios()
Call folio_undo_large_rmappable() if needed. free_unref_page_prepare() destroys the ability to call folio_order(), so stash the order in folio->private for the benefit of the second loop. Link: https://lkml.kernel.org/r/20240227174254.710559-10-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
90491d87dd |
mm: add free_unref_folios()
Iterate over a folio_batch rather than a linked list. This is easier for the CPU to prefetch and has a batch count naturally built in so we don't need to track it. Again, this lowers the maximum lock hold time from 32 folios to 15, but I do not expect this to have a significant effect. Link: https://lkml.kernel.org/r/20240227174254.710559-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
7c76d92253 |
mm: convert free_unref_page_list() to use folios
Most of its callees are not yet ready to accept a folio, but we know all of the pages passed in are actually folios because they're linked through ->lru. Link: https://lkml.kernel.org/r/20240227174254.710559-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
46d44d09d2 |
mm: page_owner: add support for splitting to any order in split page_owner
It adds a new_order parameter to set new page order in page owner. It prepares for upcoming changes to support split huge page to any lower order. Link: https://lkml.kernel.org/r/20240226205534.1603748-7-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
b8791381d7 |
mm: memcg: make memcg huge page split support any order split
It sets memcg information for the pages after the split. A new parameter new_order is added to tell the order of subpages in the new page, always 0 for now. It prepares for upcoming changes to support split huge page to any lower order. Link: https://lkml.kernel.org/r/20240226205534.1603748-6-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
9a581c12cd |
mm/page_owner: use order instead of nr in split_page_owner()
We do not have non power of two pages, using nr is error prone if nr is not power-of-two. Use page order instead. Link: https://lkml.kernel.org/r/20240226205534.1603748-5-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
502003bb76 |
mm/memcg: use order instead of nr in split_page_memcg()
We do not have non power of two pages, using nr is error prone if nr is not power-of-two. Use page order instead. Link: https://lkml.kernel.org/r/20240226205534.1603748-4-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org> Cc: Michal Koutny <mkoutny@suse.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hao Ge
|
77c7a09564 |
mm/page_alloc: make check_new_page() return bool
Make check_new_page() return bool like check_new_pages() Link: https://lkml.kernel.org/r/20240222091932.54799-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vlastimil Babka
|
803de9000f |
mm, vmscan: prevent infinite loop for costly GFP_NOIO | __GFP_RETRY_MAYFAIL allocations
Sven reports an infinite loop in __alloc_pages_slowpath() for costly order
__GFP_RETRY_MAYFAIL allocations that are also GFP_NOIO. Such combination
can happen in a suspend/resume context where a GFP_KERNEL allocation can
have __GFP_IO masked out via gfp_allowed_mask.
Quoting Sven:
1. try to do a "costly" allocation (order > PAGE_ALLOC_COSTLY_ORDER)
with __GFP_RETRY_MAYFAIL set.
2. page alloc's __alloc_pages_slowpath tries to get a page from the
freelist. This fails because there is nothing free of that costly
order.
3. page alloc tries to reclaim by calling __alloc_pages_direct_reclaim,
which bails out because a zone is ready to be compacted; it pretends
to have made a single page of progress.
4. page alloc tries to compact, but this always bails out early because
__GFP_IO is not set (it's not passed by the snd allocator, and even
if it were, we are suspending so the __GFP_IO flag would be cleared
anyway).
5. page alloc believes reclaim progress was made (because of the
pretense in item 3) and so it checks whether it should retry
compaction. The compaction retry logic thinks it should try again,
because:
a) reclaim is needed because of the early bail-out in item 4
b) a zonelist is suitable for compaction
6. goto 2. indefinite stall.
(end quote)
The immediate root cause is confusing the COMPACT_SKIPPED returned from
__alloc_pages_direct_compact() (step 4) due to lack of __GFP_IO to be
indicating a lack of order-0 pages, and in step 5 evaluating that in
should_compact_retry() as a reason to retry, before incrementing and
limiting the number of retries. There are however other places that
wrongly assume that compaction can happen while we lack __GFP_IO.
To fix this, introduce gfp_compaction_allowed() to abstract the __GFP_IO
evaluation and switch the open-coded test in try_to_compact_pages() to use
it.
Also use the new helper in:
- compaction_ready(), which will make reclaim not bail out in step 3, so
there's at least one attempt to actually reclaim, even if chances are
small for a costly order
- in_reclaim_compaction() which will make should_continue_reclaim()
return false and we don't over-reclaim unnecessarily
- in __alloc_pages_slowpath() to set a local variable can_compact,
which is then used to avoid retrying reclaim/compaction for costly
allocations (step 5) if we can't compact and also to skip the early
compaction attempt that we do in some cases
Link: https://lkml.kernel.org/r/20240221114357.13655-2-vbabka@suse.cz
Fixes:
|
||
Hao Ge
|
5bb1421422 |
mm/page_alloc: make bad_range() return bool
bad_range() can return bool, so let us change it. Link: https://lkml.kernel.org/r/20240221073227.276234-1-gehao@kylinos.cn Signed-off-by: Hao Ge <gehao@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
733aea0b3a |
mm/compaction: add support for >0 order folio memory compaction.
Before last commit, memory compaction only migrates order-0 folios and skips >0 order folios. Last commit splits all >0 order folios during compaction. This commit migrates >0 order folios during compaction by keeping isolated free pages at their original size without splitting them into order-0 pages and using them directly during migration process. What is different from the prior implementation: 1. All isolated free pages are kept in a NR_PAGE_ORDERS array of page lists, where each page list stores free pages in the same order. 2. All free pages are not post_alloc_hook() processed nor buddy pages, although their orders are stored in first page's private like buddy pages. 3. During migration, in new page allocation time (i.e., in compaction_alloc()), free pages are then processed by post_alloc_hook(). When migration fails and a new page is returned (i.e., in compaction_free()), free pages are restored by reversing the post_alloc_hook() operations using newly added free_pages_prepare_fpi_none(). Step 3 is done for a latter optimization that splitting and/or merging free pages during compaction becomes easier. Note: without splitting free pages, compaction can end prematurely due to migration will return -ENOMEM even if there is free pages. This happens when no order-0 free page exist and compaction_alloc() return NULL. Link: https://lkml.kernel.org/r/20240220183220.1451315-4-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Yu Zhao <yuzhao@google.com> Cc: Adam Manzanares <a.manzanares@samsung.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
5267fe5d09 |
mm/page_alloc: remove unused fpi_flags in free_pages_prepare()
Patch series "Enable >0 order folio memory compaction", v7.
This patchset enables >0 order folio memory compaction, which is one of
the prerequisitions for large folio support[1].
I am aware of that split free pages is necessary for folio migration in
compaction, since if >0 order free pages are never split and no order-0
free page is scanned, compaction will end prematurely due to migration
returns -ENOMEM. Free page split becomes a must instead of an
optimization.
lkp ncompare results (on a 8-CPU (Intel Xeon E5-2650 v4 @2.20GHz) 16G VM)
for default LRU (-no-mglru) and CONFIG_LRU_GEN are shown at the bottom,
copied from V3[4]. In sum, most of vm-scalability applications do not see
performance change, and the others see ~4% to ~26% performance boost under
default LRU and ~2% to ~6% performance boost under CONFIG_LRU_GEN.
Overview
===
To support >0 order folio compaction, the patchset changes how free pages
used for migration are kept during compaction. Free pages used to be
split into order-0 pages that are post allocation processed (i.e.,
PageBuddy flag cleared, page order stored in page->private is zeroed, and
page reference is set to 1). Now all free pages are kept in a
NR_PAGE_ORDER array of page lists based on their order without post
allocation process. When migrate_pages() asks for a new page, one of the
free pages, based on the requested page order, is then processed and given
out. And THP <2MB would need this feature.
[1] https://lore.kernel.org/linux-mm/f8d47176-03a8-99bf-a813-b5942830fd73@arm.com/
[2] https://lore.kernel.org/linux-mm/20231113170157.280181-1-zi.yan@sent.com/
[3] https://lore.kernel.org/linux-mm/20240123034636.1095672-1-zi.yan@sent.com/
[4] https://lore.kernel.org/linux-mm/20240202161554.565023-1-zi.yan@sent.com/
[5] https://lore.kernel.org/linux-mm/20240212163510.859822-1-zi.yan@sent.com/
[6] https://lore.kernel.org/linux-mm/20240214220420.1229173-1-zi.yan@sent.com/
[7] https://lore.kernel.org/linux-mm/20240216170432.1268753-1-zi.yan@sent.com/
This patch (of 4):
Commit
|
||
Huang Ying
|
5cec4eb7fa |
mm and cache_info: remove unnecessary CPU cache info update
For each CPU hotplug event, we will update per-CPU data slice size and
corresponding PCP configuration for every online CPU to make the
implementation simple. But, Kyle reported that this takes tens seconds
during boot on a machine with 34 zones and 3840 CPUs.
So, in this patch, for each CPU hotplug event, we only update per-CPU data
slice size and corresponding PCP configuration for the CPUs that share
caches with the hotplugged CPU. With the patch, the system boot time
reduces 67 seconds on the machine.
Link: https://lkml.kernel.org/r/20240126081944.414520-1-ying.huang@intel.com
Fixes:
|
||
Linus Torvalds
|
3e7aeb78ab |
Networking changes for 6.8.
Core & protocols ---------------- - Analyze and reorganize core networking structs (socks, netdev, netns, mibs) to optimize cacheline consumption and set up build time warnings to safeguard against future header changes. This improves TCP performances with many concurrent connections up to 40%. - Add page-pool netlink-based introspection, exposing the memory usage and recycling stats. This helps indentify bad PP users and possible leaks. - Refine TCP/DCCP source port selection to no longer favor even source port at connect() time when IP_LOCAL_PORT_RANGE is set. This lowers the time taken by connect() for hosts having many active connections to the same destination. - Refactor the TCP bind conflict code, shrinking related socket structs. - Refactor TCP SYN-Cookie handling, as a preparation step to allow arbitrary SYN-Cookie processing via eBPF. - Tune optmem_max for 0-copy usage, increasing the default value to 128KB and namespecifying it. - Allow coalescing for cloned skbs coming from page pools, improving RX performances with some common configurations. - Reduce extension header parsing overhead at GRO time. - Add bridge MDB bulk deletion support, allowing user-space to request the deletion of matching entries. - Reorder nftables struct members, to keep data accessed by the datapath first. - Introduce TC block ports tracking and use. This allows supporting multicast-like behavior at the TC layer. - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and classifiers (RSVP and tcindex). - More data-race annotations. - Extend the diag interface to dump TCP bound-only sockets. - Conditional notification of events for TC qdisc class and actions. - Support for WPAN dynamic associations with nearby devices, to form a sub-network using a specific PAN ID. - Implement SMCv2.1 virtual ISM device support. - Add support for Batman-avd mulicast packet type. BPF --- - Tons of verifier improvements: - BPF register bounds logic and range support along with a large test suite - log improvements - complete precision tracking support for register spills - track aligned STACK_ZERO cases as imprecise spilled registers. It improves the verifier "instructions processed" metric from single digit to 50-60% for some programs - support for user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience - support tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like - several fixes - Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload. - Fix kCFI bugs in BPF all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y. - Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques. - Support uid/gid options when mounting bpffs. - Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id. - Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext. - Add BPF link_info support for uprobe multi link along with bpftool integration for the latter. - Support for VLAN tag in XDP hints. - Remove deprecated bpfilter kernel leftovers given the project is developed in user-space (https://github.com/facebook/bpfilter). Misc ---- - Support for parellel TC self-tests execution. - Increase MPTCP self-tests coverage. - Updated the bridge documentation, including several so-far undocumented features. - Convert all the net self-tests to run in unique netns, to avoid random failures due to conflict and allow concurrent runs. - Add TCP-AO self-tests. - Add kunit tests for both cfg80211 and mac80211. - Autogenerate Netlink families documentation from YAML spec. - Add yml-gen support for fixed headers and recursive nests, the tool can now generate user-space code for all genetlink families for which we have specs. - A bunch of additional module descriptions fixes. - Catch incorrect freeing of pages belonging to a page pool. Driver API ---------- - Rust abstractions for network PHY drivers; do not cover yet the full C API, but already allow implementing functional PHY drivers in rust. - Introduce queue and NAPI support in the netdev Netlink interface, allowing complete access to the device <> NAPIs <> queues relationship. - Introduce notifications filtering for devlink to allow control application scale to thousands of instances. - Improve PHY validation, requesting rate matching information for each ethtool link mode supported by both the PHY and host. - Add support for ethtool symmetric-xor RSS hash. - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD platform. - Expose pin fractional frequency offset value over new DPLL generic netlink attribute. - Convert older drivers to platform remove callback returning void. - Add support for PHY package MMD read/write. New hardware / drivers ---------------------- - Ethernet: - Octeon CN10K devices - Broadcom 5760X P7 - Qualcomm SM8550 SoC - Texas Instrument DP83TG720S PHY - Bluetooth: - IMC Networks Bluetooth radio Removed ------- - WiFi: - libertas 16-bit PCMCIA support - Atmel at76c50x drivers - HostAP ISA/PCMCIA style 802.11b driver - zd1201 802.11b USB dongles - Orinoco ISA/PCMCIA 802.11b driver - Aviator/Raytheon driver - Planet WL3501 driver - RNDIS USB 802.11b driver Drivers ------- - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - allow one by one port representors creation and removal - add temperature and clock information reporting - add get/set for ethtool's header split ringparam - add again FW logging - adds support switchdev hardware packet mirroring - iavf: implement symmetric-xor RSS hash - igc: add support for concurrent physical and free-running timers - i40e: increase the allowable descriptors - nVidia/Mellanox: - Preparation for Socket-Direct multi-dev netdev. That will allow in future releases combining multiple PFs devices attached to different NUMA nodes under the same netdev - Broadcom (bnxt): - TX completion handling improvements - add basic ntuple filter support - reduce MSIX vectors usage for MQPRIO offload - add VXLAN support, USO offload and TX coalesce completion for P7 - Marvell Octeon EP: - xmit-more support - add PF-VF mailbox support and use it for FW notifications for VFs - Wangxun (ngbe/txgbe): - implement ethtool functions to operate pause param, ring param, coalesce channel number and msglevel - Netronome/Corigine (nfp): - add flow-steering support - support UDP segmentation offload - Ethernet NICs embedded, slower, virtual: - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver - stmmac: add support for HW-accelerated VLAN stripping - TI AM654x sw: add mqprio, frame preemption & coalescing - gve: add support for non-4k page sizes. - virtio-net: support dynamic coalescing moderation - nVidia/Mellanox Ethernet datacenter switches: - allow firmware upgrade without a reboot - more flexible support for bridge flooding via the compressed FID flooding mode - Ethernet embedded switches: - Microchip: - fine-tune flow control and speed configurations in KSZ8xxx - KSZ88X3: enable setting rmii reference - Renesas: - add jumbo frames support - Marvell: - 88E6xxx: add "eth-mac" and "rmon" stats support - Ethernet PHYs: - aquantia: add firmware load support - at803x: refactor the driver to simplify adding support for more chip variants - NXP C45 TJA11xx: Add MACsec offload support - Wifi: - MediaTek (mt76): - NVMEM EEPROM improvements - mt7996 Extremely High Throughput (EHT) improvements - mt7996 Wireless Ethernet Dispatcher (WED) support - mt7996 36-bit DMA support - Qualcomm (ath12k): - support for a single MSI vector - WCN7850: support AP mode - Intel (iwlwifi): - new debugfs file fw_dbg_clear - allow concurrent P2P operation on DFS channels - Bluetooth: - QCA2066: support HFP offload - ISO: more broadcast-related improvements - NXP: better recovery in case receiver/transmitter get out of sync Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmWdamsSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkGC4P/2xjLzdw22ckSssuE9ORbGko9SNjnqHk PQh1E+26BHiCg5KB8VvzMsL78E79MRNXEattSW+1g7dhCvln3oi+Vd0WkdRkgt35 98Iv18zLbbwFAJeyKvmLAPAkQkMLtVj19QILBBRrugF+egEZgVSE3JBcTAiKv2ZQ HzkabA171Ri6LpCcEEtY5XuaKvimGnGzF8YMFf8rX0wtqd2p5kbY9aMe47WAGxvU Vf9548XvH+A5yVH2/4/gujtUOpA/RHuhuCMb+oo0cZ+VCC1x9MGzoXzj6r87OTkf k2W1whNzcGoin92f+9Lk1JYMuiGKBH4QVaDdNXJnYFSJWPTE7RvRsPzYTSD4/GzK yEZbzSJXpy/2vDQm16NoAxl7evRs8Sorzkw4LQRviZHI/5SAkK2ZQiCK5CO8QSYy C1LELcV5kn6Foe24xWnrWLjAGug9oJnYoGPMU5gvPmFJMvUMXqm5rmbBgUWL5Rxw q1M6gVzabCyWUy6z2G2vaqW2ZntNVvCkdsLtIX0XZkcTzNoP0MA+TuhyGz4wbiuo PeyQp/mbGnDgCYggqKIA0YWrTVxkhFrKN520cbO8qXBQytV9oFbM/0/+C0/r/5WX pL1JVzLrh6l5ME7EIQfha8UOF9j8q4ueSwb40P3AR2NaZiDABM0zfUZ6+sx+91WF ucqPEcZB5cRE =1bW6 -----END PGP SIGNATURE----- Merge tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Paolo Abeni: "The most interesting thing is probably the networking structs reorganization and a significant amount of changes is around self-tests. Core & protocols: - Analyze and reorganize core networking structs (socks, netdev, netns, mibs) to optimize cacheline consumption and set up build time warnings to safeguard against future header changes This improves TCP performances with many concurrent connections up to 40% - Add page-pool netlink-based introspection, exposing the memory usage and recycling stats. This helps indentify bad PP users and possible leaks - Refine TCP/DCCP source port selection to no longer favor even source port at connect() time when IP_LOCAL_PORT_RANGE is set. This lowers the time taken by connect() for hosts having many active connections to the same destination - Refactor the TCP bind conflict code, shrinking related socket structs - Refactor TCP SYN-Cookie handling, as a preparation step to allow arbitrary SYN-Cookie processing via eBPF - Tune optmem_max for 0-copy usage, increasing the default value to 128KB and namespecifying it - Allow coalescing for cloned skbs coming from page pools, improving RX performances with some common configurations - Reduce extension header parsing overhead at GRO time - Add bridge MDB bulk deletion support, allowing user-space to request the deletion of matching entries - Reorder nftables struct members, to keep data accessed by the datapath first - Introduce TC block ports tracking and use. This allows supporting multicast-like behavior at the TC layer - Remove UAPI support for retired TC qdiscs (dsmark, CBQ and ATM) and classifiers (RSVP and tcindex) - More data-race annotations - Extend the diag interface to dump TCP bound-only sockets - Conditional notification of events for TC qdisc class and actions - Support for WPAN dynamic associations with nearby devices, to form a sub-network using a specific PAN ID - Implement SMCv2.1 virtual ISM device support - Add support for Batman-avd mulicast packet type BPF: - Tons of verifier improvements: - BPF register bounds logic and range support along with a large test suite - log improvements - complete precision tracking support for register spills - track aligned STACK_ZERO cases as imprecise spilled registers. This improves the verifier "instructions processed" metric from single digit to 50-60% for some programs - support for user's global BPF subprogram arguments with few commonly requested annotations for a better developer experience - support tracking of BPF_JNE which helps cases when the compiler transforms (unsigned) "a > 0" into "if a == 0 goto xxx" and the like - several fixes - Add initial TX metadata implementation for AF_XDP with support in mlx5 and stmmac drivers. Two types of offloads are supported right now, that is, TX timestamp and TX checksum offload - Fix kCFI bugs in BPF all forms of indirect calls from BPF into kernel and from kernel into BPF work with CFI enabled. This allows BPF to work with CONFIG_FINEIBT=y - Change BPF verifier logic to validate global subprograms lazily instead of unconditionally before the main program, so they can be guarded using BPF CO-RE techniques - Support uid/gid options when mounting bpffs - Add a new kfunc which acquires the associated cgroup of a task within a specific cgroup v1 hierarchy where the latter is identified by its id - Extend verifier to allow bpf_refcount_acquire() of a map value field obtained via direct load which is a use-case needed in sched_ext - Add BPF link_info support for uprobe multi link along with bpftool integration for the latter - Support for VLAN tag in XDP hints - Remove deprecated bpfilter kernel leftovers given the project is developed in user-space (https://github.com/facebook/bpfilter) Misc: - Support for parellel TC self-tests execution - Increase MPTCP self-tests coverage - Updated the bridge documentation, including several so-far undocumented features - Convert all the net self-tests to run in unique netns, to avoid random failures due to conflict and allow concurrent runs - Add TCP-AO self-tests - Add kunit tests for both cfg80211 and mac80211 - Autogenerate Netlink families documentation from YAML spec - Add yml-gen support for fixed headers and recursive nests, the tool can now generate user-space code for all genetlink families for which we have specs - A bunch of additional module descriptions fixes - Catch incorrect freeing of pages belonging to a page pool Driver API: - Rust abstractions for network PHY drivers; do not cover yet the full C API, but already allow implementing functional PHY drivers in rust - Introduce queue and NAPI support in the netdev Netlink interface, allowing complete access to the device <> NAPIs <> queues relationship - Introduce notifications filtering for devlink to allow control application scale to thousands of instances - Improve PHY validation, requesting rate matching information for each ethtool link mode supported by both the PHY and host - Add support for ethtool symmetric-xor RSS hash - ACPI based Wifi band RFI (WBRF) mitigation feature for the AMD platform - Expose pin fractional frequency offset value over new DPLL generic netlink attribute - Convert older drivers to platform remove callback returning void - Add support for PHY package MMD read/write New hardware / drivers: - Ethernet: - Octeon CN10K devices - Broadcom 5760X P7 - Qualcomm SM8550 SoC - Texas Instrument DP83TG720S PHY - Bluetooth: - IMC Networks Bluetooth radio Removed: - WiFi: - libertas 16-bit PCMCIA support - Atmel at76c50x drivers - HostAP ISA/PCMCIA style 802.11b driver - zd1201 802.11b USB dongles - Orinoco ISA/PCMCIA 802.11b driver - Aviator/Raytheon driver - Planet WL3501 driver - RNDIS USB 802.11b driver Driver updates: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - allow one by one port representors creation and removal - add temperature and clock information reporting - add get/set for ethtool's header split ringparam - add again FW logging - adds support switchdev hardware packet mirroring - iavf: implement symmetric-xor RSS hash - igc: add support for concurrent physical and free-running timers - i40e: increase the allowable descriptors - nVidia/Mellanox: - Preparation for Socket-Direct multi-dev netdev. That will allow in future releases combining multiple PFs devices attached to different NUMA nodes under the same netdev - Broadcom (bnxt): - TX completion handling improvements - add basic ntuple filter support - reduce MSIX vectors usage for MQPRIO offload - add VXLAN support, USO offload and TX coalesce completion for P7 - Marvell Octeon EP: - xmit-more support - add PF-VF mailbox support and use it for FW notifications for VFs - Wangxun (ngbe/txgbe): - implement ethtool functions to operate pause param, ring param, coalesce channel number and msglevel - Netronome/Corigine (nfp): - add flow-steering support - support UDP segmentation offload - Ethernet NICs embedded, slower, virtual: - Xilinx AXI: remove duplicate DMA code adopting the dma engine driver - stmmac: add support for HW-accelerated VLAN stripping - TI AM654x sw: add mqprio, frame preemption & coalescing - gve: add support for non-4k page sizes. - virtio-net: support dynamic coalescing moderation - nVidia/Mellanox Ethernet datacenter switches: - allow firmware upgrade without a reboot - more flexible support for bridge flooding via the compressed FID flooding mode - Ethernet embedded switches: - Microchip: - fine-tune flow control and speed configurations in KSZ8xxx - KSZ88X3: enable setting rmii reference - Renesas: - add jumbo frames support - Marvell: - 88E6xxx: add "eth-mac" and "rmon" stats support - Ethernet PHYs: - aquantia: add firmware load support - at803x: refactor the driver to simplify adding support for more chip variants - NXP C45 TJA11xx: Add MACsec offload support - Wifi: - MediaTek (mt76): - NVMEM EEPROM improvements - mt7996 Extremely High Throughput (EHT) improvements - mt7996 Wireless Ethernet Dispatcher (WED) support - mt7996 36-bit DMA support - Qualcomm (ath12k): - support for a single MSI vector - WCN7850: support AP mode - Intel (iwlwifi): - new debugfs file fw_dbg_clear - allow concurrent P2P operation on DFS channels - Bluetooth: - QCA2066: support HFP offload - ISO: more broadcast-related improvements - NXP: better recovery in case receiver/transmitter get out of sync" * tag 'net-next-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1714 commits) lan78xx: remove redundant statement in lan78xx_get_eee lan743x: remove redundant statement in lan743x_ethtool_get_eee bnxt_en: Fix RCU locking for ntuple filters in bnxt_rx_flow_steer() bnxt_en: Fix RCU locking for ntuple filters in bnxt_srxclsrldel() bnxt_en: Remove unneeded variable in bnxt_hwrm_clear_vnic_filter() tcp: Revert no longer abort SYN_SENT when receiving some ICMP Revert "mlx5 updates 2023-12-20" Revert "net: stmmac: Enable Per DMA Channel interrupt" ipvlan: Remove usage of the deprecated ida_simple_xx() API ipvlan: Fix a typo in a comment net/sched: Remove ipt action tests net: stmmac: Use interrupt mode INTM=1 for per channel irq net: stmmac: Add support for TX/RX channel interrupt net: stmmac: Make MSI interrupt routine generic dt-bindings: net: snps,dwmac: per channel irq net: phy: at803x: make read_status more generic net: phy: at803x: add support for cdt cross short test for qca808x net: phy: at803x: refactor qca808x cable test get status function net: phy: at803x: generalize cdt fault length function net: ethernet: cortina: Drop TSO support ... |
||
Kirill A. Shutemov
|
5e0a760b44 |
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
commit
|
||
Kirill A. Shutemov
|
fd37721803 |
mm, treewide: introduce NR_PAGE_ORDERS
NR_PAGE_ORDERS defines the number of page orders supported by the page allocator, ranging from 0 to MAX_ORDER, MAX_ORDER + 1 in total. NR_PAGE_ORDERS assists in defining arrays of page orders and allows for more natural iteration over them. [kirill.shutemov@linux.intel.com: fixup for kerneldoc warning] Link: https://lkml.kernel.org/r/20240101111512.7empzyifq7kxtzk3@box Link: https://lkml.kernel.org/r/20231228144704.14033-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
5cb6674b69 |
mm, kasan: use KASAN_TAG_KERNEL instead of 0xff
Use the KASAN_TAG_KERNEL marco instead of open-coding 0xff in the mm code. This macro is provided by include/linux/kasan-tags.h, which does not include any other headers, so it's safe to include it into mm.h without causing circular include dependencies. Link: https://lkml.kernel.org/r/71db9087b0aebb6c4dccbc609cc0cd50621533c7.1703188911.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yajun Deng
|
250ae189d9 |
mm: page_alloc: simplify __free_pages_ok()
There is redundant code in __free_pages_ok(). Use free_one_page() simplify it. Link: https://lkml.kernel.org/r/20231216030503.2126130-1-yajun.deng@linux.dev Signed-off-by: Yajun Deng <yajun.deng@linux.dev> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Charan Teja Kalla
|
ac3f3b0a55 |
mm: page_alloc: unreserve highatomic page blocks before oom
__alloc_pages_direct_reclaim() is called from slowpath allocation where
high atomic reserves can be unreserved after there is a progress in
reclaim and yet no suitable page is found. Later should_reclaim_retry()
gets called from slow path allocation to decide if the reclaim needs to be
retried before OOM kill path is taken.
should_reclaim_retry() checks the available(reclaimable + free pages)
memory against the min wmark levels of a zone and returns:
a) true, if it is above the min wmark so that slow path allocation will
do the reclaim retries.
b) false, thus slowpath allocation takes oom kill path.
should_reclaim_retry() can also unreserves the high atomic reserves **but
only after all the reclaim retries are exhausted.**
In a case where there are almost none reclaimable memory and free pages
contains mostly the high atomic reserves but allocation context can't use
these high atomic reserves, makes the available memory below min wmark
levels hence false is returned from should_reclaim_retry() leading the
allocation request to take OOM kill path. This can turn into a early oom
kill if high atomic reserves are holding lot of free memory and
unreserving of them is not attempted.
(early)OOM is encountered on a VM with the below state:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB
Per above log, the free memory of ~7MB exist in the high atomic reserves
is not freed up before falling back to oom kill path.
Fix it by trying to unreserve the high atomic reserves in
should_reclaim_retry() before __alloc_pages_direct_reclaim() can fallback
to oom kill path.
Link: https://lkml.kernel.org/r/1700823445-27531-1-git-send-email-quic_charante@quicinc.com
Fixes:
|
||
Charan Teja Kalla
|
9cd20f3fe0 |
mm: page_alloc: enforce minimum zone size to do high atomic reserves
Highatomic reserves are set to roughly 1% of zone for maximum and a pageblock size for minimum. Encountered a system with the below configuration: Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB On such systems, even a single pageblock makes highatomic reserves are set to ~8% of the zone memory. This high value can easily exert pressure on the zone. Per discussion with Michal and Mel, it is not much useful to reserve the memory for highatomic allocations on such small systems[1]. Since the minimum size for high atomic reserves is always going to be a pageblock size and if 1% of zone managed pages is going to be below pageblock size, don't reserve memory for high atomic allocations. Thanks Michal for this suggestion[2]. Since no memory is being reserved for high atomic allocations and if respective allocation failures are seen, this patch can be reverted. [1] https://lore.kernel.org/linux-mm/20231117161956.d3yjdxhhm4rhl7h2@techsingularity.net/ [2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/ Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.1700821416.git.quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Acked-by: David Rientjes <rientjes@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Charan Teja Kalla
|
d68e39fc45 |
mm: page_alloc: correct high atomic reserve calculations
Patch series "mm: page_alloc: fixes for high atomic reserve caluculations", v3. The state of the system where the issue exposed shown in oom kill logs: [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kBlocal_pcp:492kB free_cma:0kB [ 295.998656] lowmem_reserve[]: 0 32 [ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH) 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB From the above, it is seen that ~16MB of memory reserved for high atomic reserves against the expectation of 1% reserves which is fixed in the 1st patch. Don't reserve the high atomic page blocks if 1% of zone memory size is below a pageblock size. This patch (of 2): reserve_highatomic_pageblock() aims to reserve the 1% of the managed pages of a zone, which is used for the high order atomic allocations. It uses the below calculation to reserve: static void reserve_highatomic_pageblock(struct page *page, ....) { ....... max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages; if (zone->nr_reserved_highatomic >= max_managed) goto out; zone->nr_reserved_highatomic += pageblock_nr_pages; set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL); out: .... } Since we are always appending the 1% of zone managed pages count to pageblock_nr_pages, the minimum it is turning into 2 pageblocks as the nr_reserved_highatomic is incremented/decremented in pageblock sizes. Encountered a system(actually a VM running on the Linux kernel) with the below zone configuration: Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB The existing calculations making it to reserve the 8MB(with pageblock size of 4MB) i.e. 16% of the zone managed memory. Reserving such high amount of memory can easily exert memory pressure in the system thus may lead into unnecessary reclaims till unreserving of high atomic reserves. Since high atomic reserves are managed in pageblock size granules, as MIGRATE_HIGHATOMIC is set for such pageblock, fix the calculations for high atomic reserves as, minimum is pageblock size , maximum is approximately 1% of the zone managed pages. Link: https://lkml.kernel.org/r/cover.1700821416.git.quic_charante@quicinc.com Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.1700821416.git.quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Brendan Jackman
|
17b46e7beb |
mm/page_alloc: dedupe some memcg uncharging logic
The duplication makes it seem like some work is required before uncharging in the !PageHWPoison case. But it isn't, so we can simplify the code a little. Note the PageMemcgKmem check is redundant, but I've left it in as it avoids an unnecessary function call. Link: https://lkml.kernel.org/r/20231108164920.3401565-1-jackmanb@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jesper Dangaard Brouer
|
dba1b8a7ab |
mm/page_pool: catch page_pool memory leaks
Pages belonging to a page_pool (PP) instance must be freed through the PP APIs in-order to correctly release any DMA mappings and release refcnt on the DMA device when freeing PP instance. When PP release a page (page_pool_release_page) the page->pp_magic value is cleared. This patch detect a leaked PP page in free_page_is_bad() via unexpected state of page->pp_magic value being PP_SIGNATURE. We choose to report and treat it as a bad page. It would be possible to release the page via returning it to the PP instance as the page->pp pointer is likely still valid. Notice this code is only activated when either compiled with CONFIG_DEBUG_VM or boot cmdline debug_pagealloc=on, and CONFIG_PAGE_POOL. Reduced example output of leak with PP_SIGNATURE = dead000000000040: BUG: Bad page state in process swapper/4 pfn:141fa6 page:000000006dbf8062 refcount:0 mapcount:0 mapping:0000000000000000 index:0x141fa6000 pfn:0x141fa6 flags: 0x2fffff80000000(node=0|zone=2|lastcpupid=0x1fffff) page_type: 0xffffffff() raw: 002fffff80000000 dead000000000040 ffff88814888a000 0000000000000000 raw: 0000000141fa6000 0000000000000001 00000000ffffffff 0000000000000000 page dumped because: page_pool leak [...] Call Trace: <IRQ> dump_stack_lvl+0x32/0x50 bad_page+0x70/0xf0 free_unref_page_prepare+0x263/0x430 free_unref_page+0x34/0x130 mlx5e_free_rx_mpwqe+0x190/0x1c0 [mlx5_core] mlx5e_post_rx_mpwqes+0x1ac/0x280 [mlx5_core] mlx5e_napi_poll+0x12b/0x710 [mlx5_core] ? skb_free_head+0x4f/0x90 __napi_poll+0x2b/0x1c0 net_rx_action+0x27b/0x360 The advantage is the Call Trace directly points to the function leaking the PP page, which in this case is an on purpose bug introduced into the mlx5 driver to test this code change. Currently PP will periodically in page_pool_release_retry() printk warning "stalled pool shutdown" which cannot be directly corrolated to leaking and might as well be a false positive due to SKBs being stuck on a socket for an extended period. After this patch we should be able to remove this printk. Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net> |
||
Hugh Dickins
|
23e4883248 |
mm: add page_rmappable_folio() wrapper
folio_prep_large_rmappable() is being used repeatedly along with a conversion from page to folio, a check non-NULL, a check order > 1: wrap it all up into struct folio *page_rmappable_folio(struct page *). Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hyesoo Yu
|
76f26535d1 |
mm: page_alloc: check the order of compound page even when the order is zero
For compound pages, the head sets the PG_head flag and the tail sets the compound_head to indicate the head page. If a user allocates a compound page and frees it with a different order, the compound page information will not be properly initialized. To detect this problem, compound_order(page) and the order argument are compared, but this is not checked when the order argument is zero. That error should be checked regardless of the order. Link: https://lkml.kernel.org/r/20231023083217.1866451-1-hyesoo.yu@samsung.com Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Qi Zheng
|
c2baef394a |
mm: page_alloc: skip memoryless nodes entirely
Patch series "handle memoryless nodes more appropriately", v3. Currently, in the process of initialization or offline memory, memoryless nodes will still be built into the fallback list of itself or other nodes. This is not what we expected, so this patch series removes memoryless nodes from the fallback list entirely. This patch (of 2): In find_next_best_node(), we skipped the memoryless nodes when building the zonelists of other normal nodes (N_NORMAL), but did not skip the memoryless node itself when building the zonelist. This will cause it to be traversed at runtime. For example, say we have node0 and node1, node0 is memoryless node, then the fallback order of node0 and node1 as follows: [ 0.153005] Fallback order for Node 0: 0 1 [ 0.153564] Fallback order for Node 1: 1 After this patch, we skip memoryless node0 entirely, then the fallback order of node0 and node1 as follows: [ 0.155236] Fallback order for Node 0: 1 [ 0.155806] Fallback order for Node 1: 1 So it becomes completely invisible, which will reduce runtime overhead. And in this way, we will not try to allocate pages from memoryless node0, then the panic mentioned in [1] will also be fixed. Even though this problem has been solved by dropping the NODE_MIN_SIZE constrain in x86 [2], it would be better to fix it in core MM as well. [1]. https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/ [2]. https://lore.kernel.org/all/20231017062215.171670-1-rppt@kernel.org/ [zhengqi.arch@bytedance.com: update comment, per Ingo] Link: https://lkml.kernel.org/r/7300fc00a057eefeb9a68c8ad28171c3f0ce66ce.1697799303.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/cover.1697799303.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/cover.1697711415.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/157013e978468241de4a4c05d5337a44638ecb0e.1697711415.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
6ccdcb6d3a |
mm, pcp: reduce detecting time of consecutive high order page freeing
In current PCP auto-tuning design, if the number of pages allocated is much more than that of pages freed on a CPU, the PCP high may become the maximal value even if the allocating/freeing depth is small, for example, in the sender of network workloads. If a CPU was used as sender originally, then it is used as receiver after context switching, we need to fill the whole PCP with maximal high before triggering PCP draining for consecutive high order freeing. This will hurt the performance of some network workloads. To solve the issue, in this patch, we will track the consecutive page freeing with a counter in stead of relying on PCP draining. So, we can detect consecutive page freeing much earlier. On a 2-socket Intel server with 128 logical CPU, we tested SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes. With the patch, the network bandwidth improves 5.0%. This restores the performance drop caused by PCP auto-tuning. Link: https://lkml.kernel.org/r/20231016053002.756205-10-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
57c0419c5f |
mm, pcp: decrease PCP high if free pages < high watermark
One target of PCP is to minimize pages in PCP if the system free pages is too few. To reach that target, when page reclaiming is active for the zone (ZONE_RECLAIM_ACTIVE), we will stop increasing PCP high in allocating path, decrease PCP high and free some pages in freeing path. But this may be too late because the background page reclaiming may introduce latency for some workloads. So, in this patch, during page allocation we will detect whether the number of free pages of the zone is below high watermark. If so, we will stop increasing PCP high in allocating path, decrease PCP high and free some pages in freeing path. With this, we can reduce the possibility of the premature background page reclaiming caused by too large PCP. The high watermark checking is done in allocating path to reduce the overhead in hotter freeing path. Link: https://lkml.kernel.org/r/20231016053002.756205-9-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
51a755c56d |
mm: tune PCP high automatically
The target to tune PCP high automatically is as follows, - Minimize allocation/freeing from/to shared zone - Minimize idle pages in PCP - Minimize pages in PCP if the system free pages is too few To reach these target, a tuning algorithm as follows is designed, - When we refill PCP via allocating from the zone, increase PCP high. Because if we had larger PCP, we could avoid to allocate from the zone. - In periodic vmstat updating kworker (via refresh_cpu_vm_stats()), decrease PCP high to try to free possible idle PCP pages. - When page reclaiming is active for the zone, stop increasing PCP high in allocating path, decrease PCP high and free some pages in freeing path. So, the PCP high can be tuned to the page allocating/freeing depth of workloads eventually. One issue of the algorithm is that if the number of pages allocated is much more than that of pages freed on a CPU, the PCP high may become the maximal value even if the allocating/freeing depth is small. But this isn't a severe issue, because there are no idle pages in this case. One alternative choice is to increase PCP high when we drain PCP via trying to free pages to the zone, but don't increase PCP high during PCP refilling. This can avoid the issue above. But if the number of pages allocated is much less than that of pages freed on a CPU, there will be many idle pages in PCP and it is hard to free these idle pages. 1/8 (>> 3) of PCP high will be decreased periodically. The value 1/8 is kind of arbitrary. Just to make sure that the idle PCP pages will be freed eventually. On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroup. This simulates the kbuild server that is used by 0-Day kbuild service. With the patch, the build time decreases 3.5%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 11.0% to 0.5%. The number of PCP draining for high order pages freeing (free_high) decreases 65.6%. The number of pages allocated from zone (instead of from PCP) decreases 83.9%. Link: https://lkml.kernel.org/r/20231016053002.756205-8-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
90b41691b9 |
mm: add framework for PCP high auto-tuning
The page allocation performance requirements of different workloads are usually different. So, we need to tune PCP (per-CPU pageset) high to optimize the workload page allocation performance. Now, we have a system wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand. But, it's hard to find out the best value by hand. And one global configuration may not work best for the different workloads that run on the same system. One solution to these issues is to tune PCP high of each CPU automatically. This patch adds the framework for PCP high auto-tuning. With it, pcp->high of each CPU will be changed automatically by tuning algorithm at runtime. The minimal high (pcp->high_min) is the original PCP high value calculated based on the low watermark pages. While the maximal high (pcp->high_max) is the PCP high value when percpu_pagelist_high_fraction sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION. That is, the maximal pcp->high that can be set via sysctl knob by hand. It's possible that PCP high auto-tuning doesn't work well for some workloads. So, when PCP high is tuned by hand via the sysctl knob, the auto-tuning will be disabled. The PCP high set by hand will be used instead. This patch only adds the framework, so pcp->high will be set to pcp->high_min (original default) always. We will add actual auto-tuning algorithm in the following patches in the series. Link: https://lkml.kernel.org/r/20231016053002.756205-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
c0a242394c |
mm, page_alloc: scale the number of pages that are batch allocated
When a task is allocating a large number of order-0 pages, it may acquire the zone->lock multiple times allocating pages in batches. This may unnecessarily contend on the zone lock when allocating very large number of pages. This patch adapts the size of the batch based on the recent pattern to scale the batch size for subsequent allocations. On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroup. This simulates the kbuild server that is used by 0-Day kbuild service. With the patch, the cycles% of the spinlock contention (mostly for zone lock) decreases from 12.6% to 11.0% (with PCP size == 367). Link: https://lkml.kernel.org/r/20231016053002.756205-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
52166607ec |
mm: restrict the pcp batch scale factor to avoid too long latency
In page allocator, PCP (Per-CPU Pageset) is refilled and drained in
batches to increase page allocation throughput, reduce page
allocation/freeing latency per page, and reduce zone lock contention. But
too large batch size will cause too long maximal allocation/freeing
latency, which may punish arbitrary users. So the default batch size is
chosen carefully (in zone_batchsize(), the value is 63 for zone > 1GB) to
avoid that.
In commit
|
||
Huang Ying
|
362d37a106 |
mm, pcp: reduce lock contention for draining high-order pages
In commit
|
||
Huang Ying
|
ca71fe1ad9 |
mm, pcp: avoid to drain PCP when process exit
Patch series "mm: PCP high auto-tuning", v3.
The page allocation performance requirements of different workloads are
often different. So, we need to tune the PCP (Per-CPU Pageset) high on
each CPU automatically to optimize the page allocation performance.
The list of patches in series is as follows,
[1/9] mm, pcp: avoid to drain PCP when process exit
[2/9] cacheinfo: calculate per-CPU data cache size
[3/9] mm, pcp: reduce lock contention for draining high-order pages
[4/9] mm: restrict the pcp batch scale factor to avoid too long latency
[5/9] mm, page_alloc: scale the number of pages that are batch allocated
[6/9] mm: add framework for PCP high auto-tuning
[7/9] mm: tune PCP high automatically
[8/9] mm, pcp: decrease PCP high if free pages < high watermark
[9/9] mm, pcp: reduce detecting time of consecutive high order page freeing
Patch [1/9], [2/9], [3/9] optimize the PCP draining for consecutive
high-order pages freeing.
Patch [4/9], [5/9] optimize batch freeing and allocating.
Patch [6/9], [7/9], [8/9] implement and optimize a PCP high
auto-tuning method.
Patch [9/9] optimize the PCP draining for consecutive high order page
freeing based on PCP high auto-tuning.
The test results for patches with performance impact are as follows,
kbuild
======
On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup. This simulates the
kbuild server that is used by 0-Day kbuild service.
build time lock contend% free_high alloc_zone
---------- ---------- --------- ----------
base 100.0 14.0 100.0 100.0
patch1 99.5 12.8 19.5 95.6
patch3 99.4 12.6 7.1 95.6
patch5 98.6 11.0 8.1 97.1
patch7 95.1 0.5 2.8 15.6
patch9 95.0 1.0 8.8 20.0
The PCP draining optimization (patch [1/9], [3/9]) and PCP batch
allocation optimization (patch [5/9]) reduces zone lock contention a
little. The PCP high auto-tuning (patch [7/9], [9/9]) reduces build time
visibly. Where the tuning target: the number of pages allocated from zone
reduces greatly. So, the zone contention cycles% reduces greatly.
With PCP tuning patches (patch [7/9], [9/9]), the average used memory
during test increases up to 18.4% because more pages are cached in PCP.
But at the end of the test, the number of the used memory decreases to the
same level as that of the base patch. That is, the pages cached in PCP
will be released to zone after not being used actively.
netperf SCTP_STREAM_MANY
========================
On a 2-socket Intel server with 128 logical CPU, we tested
SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.
score lock contend% free_high alloc_zone cache miss rate%
----- ---------- --------- ---------- ----------------
base 100.0 2.1 100.0 100.0 1.3
patch1 99.4 2.1 99.4 99.4 1.3
patch3 106.4 1.3 13.3 106.3 1.3
patch5 106.0 1.2 13.2 105.9 1.3
patch7 103.4 1.9 6.7 90.3 7.6
patch9 108.6 1.3 13.7 108.6 1.3
The PCP draining optimization (patch [1/9]+[3/9]) improves performance.
The PCP high auto-tuning (patch [7/9]) reduces performance a little
because PCP draining cannot be triggered in time sometimes. So, the cache
miss rate% increases. The further PCP draining optimization (patch [9/9])
based on PCP tuning restore the performance.
lmbench3 UNIX (AF_UNIX)
=======================
On a 2-socket Intel server with 128 logical CPU, we tested UNIX
(AF_UNIX socket) test case of lmbench3 test suite with 16-pair
processes.
score lock contend% free_high alloc_zone cache miss rate%
----- ---------- --------- ---------- ----------------
base 100.0 51.4 100.0 100.0 0.2
patch1 116.8 46.1 69.5 104.3 0.2
patch3 199.1 21.3 7.0 104.9 0.2
patch5 200.0 20.8 7.1 106.9 0.3
patch7 191.6 19.9 6.8 103.8 2.8
patch9 193.4 21.7 7.0 104.7 2.1
The PCP draining optimization (patch [1/9], [3/9]) improves performance
much. The PCP tuning (patch [7/9]) reduces performance a little because
PCP draining cannot be triggered in time sometimes. The further PCP
draining optimization (patch [9/9]) based on PCP tuning restores the
performance partly.
The patchset adds several fields in struct per_cpu_pages. The struct
layout before/after the patchset is as follows,
base
====
struct per_cpu_pages {
spinlock_t lock; /* 0 4 */
int count; /* 4 4 */
int high; /* 8 4 */
int batch; /* 12 4 */
short int free_factor; /* 16 2 */
short int expire; /* 18 2 */
/* XXX 4 bytes hole, try to pack */
struct list_head lists[13]; /* 24 208 */
/* size: 256, cachelines: 4, members: 7 */
/* sum members: 228, holes: 1, sum holes: 4 */
/* padding: 24 */
} __attribute__((__aligned__(64)));
patched
=======
struct per_cpu_pages {
spinlock_t lock; /* 0 4 */
int count; /* 4 4 */
int high; /* 8 4 */
int high_min; /* 12 4 */
int high_max; /* 16 4 */
int batch; /* 20 4 */
u8 flags; /* 24 1 */
u8 alloc_factor; /* 25 1 */
u8 expire; /* 26 1 */
/* XXX 1 byte hole, try to pack */
short int free_count; /* 28 2 */
/* XXX 2 bytes hole, try to pack */
struct list_head lists[13]; /* 32 208 */
/* size: 256, cachelines: 4, members: 11 */
/* sum members: 237, holes: 2, sum holes: 3 */
/* padding: 16 */
} __attribute__((__aligned__(64)));
The size of the struct doesn't changed with the patchset.
This patch (of 9):
In commit
|
||
Kemeng Shi
|
0dfca313a0 |
mm/page_alloc: remove unnecessary next_page in break_down_buddy_pages
The next_page is only used to forward page in case target is in second half range. Move forward page directly to remove unnecessary next_page. Link: https://lkml.kernel.org/r/20230927103514.98281-3-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kemeng Shi
|
27e0db3c21 |
mm/page_alloc: remove unnecessary check in break_down_buddy_pages
Patch series "Two minor cleanups to break_down_buddy_pages", v2. Two minor cleanups to break_down_buddy_pages. This patch (of 2): 1. We always have target in range started with next_page and full free range started with current_buddy. 2. The last split range size is 1 << low and low should be >= 0, then size >= 1. So page + size != page is always true (because size > 0). As summary, current_page will not equal to target page. Link: https://lkml.kernel.org/r/20230927103514.98281-1-shikemeng@huaweicloud.com Link: https://lkml.kernel.org/r/20230927103514.98281-2-shikemeng@huaweicloud.com Signed-off-by: Kemeng Shi <shikemeng@huaweicloud.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kemeng Shi
|
61e21cf2d2 |
mm/page_alloc: correct start page when guard page debug is enabled
When guard page debug is enabled and set_page_guard returns success, we
miss to forward page to point to start of next split range and we will do
split unexpectedly in page range without target page. Move start page
update before set_page_guard to fix this.
As we split to wrong target page, then splited pages are not able to merge
back to original order when target page is put back and splited pages
except target page is not usable. To be specific:
Consider target page is the third page in buddy page with order 2.
| buddy-2 | Page | Target | Page |
After break down to target page, we will only set first page to Guard
because of bug.
| Guard | Page | Target | Page |
When we try put_page_back_buddy with target page, the buddy page of target
if neither guard nor buddy, Then it's not able to construct original page
with order 2
| Guard | Page | buddy-0 | Page |
All pages except target page is not in free list and is not usable.
Link: https://lkml.kernel.org/r/20230927094401.68205-1-shikemeng@huaweicloud.com
Fixes:
|
||
Johannes Weiner
|
7b086755fb |
mm: page_alloc: fix CMA and HIGHATOMIC landing on the wrong buddy list
Commit |
||
Johannes Weiner
|
f945116e4e |
mm: page_alloc: remove stale CMA guard code
In the past, movable allocations could be disallowed from CMA through PF_MEMALLOC_PIN. As CMA pages are funneled through the MOVABLE pcplist, this required filtering that cornercase during allocations, such that pinnable allocations wouldn't accidentally get a CMA page. However, since |
||
Matthew Wilcox (Oracle)
|
de53c05f2a |
mm: add large_rmappable page flag
Stored in the first tail page's flags, this flag replaces the destructor. That removes the last of the destructors, so remove all references to folio_dtor and compound_dtor. Link: https://lkml.kernel.org/r/20230816151201.3655946-9-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
9c5ccf2db0 |
mm: remove HUGETLB_PAGE_DTOR
We can use a bit in page[1].flags to indicate that this folio belongs to hugetlb instead of using a value in page[1].dtors. That lets folio_test_hugetlb() become an inline function like it should be. We can also get rid of NULL_COMPOUND_DTOR. Link: https://lkml.kernel.org/r/20230816151201.3655946-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
0f2f43fabb |
mm: remove free_compound_page() and the compound_page_dtors array
The only remaining destructor is free_compound_page(). Inline it into destroy_large_folio() and remove the array it used to live in. Link: https://lkml.kernel.org/r/20230816151201.3655946-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |