mirror of
https://github.com/torvalds/linux.git
synced 2024-11-23 20:51:44 +00:00
e99fb98d47
21610 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
SeongJae Park
|
d91beaa505 |
mm/damon/sysfs-schemes: implement a command for scheme quota goals only commit
To update DAMOS quota goals, users need to enter 'commit' command to the 'state' file of the kdamond, which applies not only the goals but entire inputs. It is inefficient. Implement yet another 'state' file input command for reading and committing only the scheme quota goals, namely 'commit_schemes_quota_goals'. Link: https://lkml.kernel.org/r/20231130023652.50284-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
8b549a4fd3 |
mm/damon/sysfs-schemes: commit damos quota goals user input to DAMOS
Make DAMON sysfs interface to read the user inputs for DAMOS quota goals and pass those to DAMOS, so that the users can use the quota auto-tuning feature. It uses the DAMON sysfs interface's user input commit mechanism, which applies all user inputs for initial starting of DAMON and online input updates, which can be done by writing 'on' and 'commit' to the kdamond's 'state' file, respectively. In other words, the user should periodically write appropriate value to 'current_value' files and 'commit' command to the 'state' file. 'target_value' files could also be similarly updated at any time. Note that the interface is supporting multiple goals while the core logic supports only one goal. DAMON sysfs interface passes only best feedback among the given inputs, to avoid making DAMOS too aggressive. Link: https://lkml.kernel.org/r/20231130023652.50284-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
7f262da0a3 |
mm/damon/sysfs-schemes: implement files for scheme quota goals setup
Implement DAMON sysfs directories and files for the goals of DAMOS quota. Those allow users set multiple goals for their aim, with target values. Users can further enter the current score value for each goal as feedback for DAMOS. Note that this commit is implementing only the basic file operations, and not connecting the files with the DAMOS core logic. Hence writing something to the files makes no real effect. The following commit will connect the file operations and the core logic. Link: https://lkml.kernel.org/r/20231130023652.50284-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
9294a037c0 |
mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning
Patch series "mm/damon: let users feed and tame/auto-tune DAMOS".
Introduce Aim-oriented Feedback-driven DAMOS Aggressiveness Auto-tuning.
It makes DAMOS self-tuned with periodic simple user feedback.
Background: DAMOS Control Difficulty
====================================
DAMOS helps users easily implement access pattern aware system operations.
However, controlling DAMOS in the wild is not that easy.
The basic way for DAMOS control is specifying the target access pattern.
In this approach, the user is assumed to well understand the access
pattern and the characteristics of the system and the workloads. Though
there are useful tools for that, it takes time and effort depending on the
complexity and the dynamicity of the system and the workloads. After all,
the access pattern consists of three ranges, namely the size, the access
rate, and the age of the regions. It means users need to tune six
parameters, which is anyway not a simple task.
One of the worst cases would be DAMOS being too aggressive like a
berserker, and therefore consuming too much system resource and making
unwanted radical system operations. To let users avoid such cases, DAMOS
allows users to set the upper-limit of the schemes' aggressiveness, namely
DAMOS quota. DAMOS further provides its best-effort under the limit by
prioritizing regions based on the access pattern of the regions. For
example, users can ask DAMOS to page out up to 100 MiB of memory regions
per second. Then DAMOS pages out regions that are not accessed for a
longer time (colder) first under the limit. This allows users to set the
target access pattern a bit naive with wider ranges, and focus on tuning
only one parameter, the quota. In other words, the number of parameters
to tune can be reduced from six to one.
Still, however, the optimum value for the quota depends on the system and
the workloads' characteristics, so not that simple. The number of
parameters to tune can also increase again if the user needs to run
multiple schemes.
Aim-oriented Feedback-driven DAMOS Aggressiveness Auto Tuning
=============================================================
Users would use DAMOS since they want to achieve something with it. They
will likely have measurable metrics representing the achievement and the
target number of the metric like SLO, and continuously measure that
anyway. While the additional cost of getting the information is nearly
zero, it could be useful for DAMOS to understand how appropriate its
current aggressiveness is set, and adjust it on its own to make the metric
value more close to the target.
Based on this idea, we introduce a new way of tuning DAMOS with nearly
zero additional effort, namely Aim-oriented Feedback-driven DAMOS
Aggressiveness Auto Tuning. It asks users to provide feedback
representing how well DAMOS is doing relative to the users' aim. Then
DAMOS adjusts its aggressiveness, specifically the quota that provides
the best effort result under the limit, based on the current level of
the aggressiveness and the users' feedback.
Implementation
==============
The implementation asks users to represent the feedback with score
numbers. The scores could be anything including user-space specific
metrics including latency and throughput of special user-space workloads,
and system metrics including free memory ratio, memory pressure stall time
(PSI), and active to inactive LRU lists size ratio. The feedback scores
and the aggressiveness of the given DAMOS scheme are assumed to be
positively proportional, though. Selecting metrics of the assumption is
the users' responsibility.
The core logic uses the below simple feedback loop algorithm to calculate
the next aggressiveness level of the scheme from the current
aggressiveness level and the current feedback (target_score and
current_score). It calculates the compensation for next aggressiveness as
a proportion of current aggressiveness and distance to the target score.
As a result, it arrives at the near-goal state in a short time using big
steps when it's far from the goal, but avoids making unnecessarily radical
changes that could turn out to be a bad decision using small steps when
its near to the goal.
f(n) = max(1, f(n - 1) * ((target_score - current_score) / target_score + 1))
Note that the compensation value becomes negative when it's over
achieving the goal. That's why the feedback metric and the
aggressiveness of the scheme should be positively proportional. The
distance-adaptive speed manipulation is simply applied.
Example Use Cases
=================
If users want to reduce the memory footprint of the system as much as
possible as long as the time spent for handling the resulting memory
pressure is within a threshold, they could use DAMOS scheme that reclaims
cold memory regions aiming for a little level of memory pressure stall
time.
If users want the active/inactive LRU lists well balanced to reduce the
performance impact due to possible future memory pressure, they could use
two schemes. The first one would be set to locate hot pages in the active
LRU list, aiming for a specific active-to-inactive LRU list size ratio,
say, 70%. The second one would be to locate cold pages in the inactive
LRU list, aiming for a specific inactive-to-active LRU list size ratio,
say, 30%. Then, DAMOS will balance the two schemes based on the goal and
feedback.
This aim-oriented auto tuning could also be useful for general
balancing-required access aware system operations such as system memory
auto scaling[3] and tiered memory management[4]. These two example usages
are not what current DAMOS implementation is already supporting, but
require additional DAMOS action developments, though.
Evaluation: subtle memory pressure aiming proactive reclamation
===============================================================
To show if the implementation works as expected, we prepare four different
system configurations on AWS i3.metal instances. The first setup
(original) runs the workload without any DAMOS scheme. The second setup
(not-tuned) runs the workload with a virtual address space-based proactive
reclamation scheme that pages out memory regions that are not accessed for
five seconds or more. The third setup (offline-tuned) runs the same
proactive reclamation DAMOS scheme, but after making it tuned for each
workload offline, using our previous user-space driven automatic tuning
approach, namely DAMOOS[1]. The fourth and final setup (AFDAA) runs the
scheme that is the same as that of 'not-tuned' setup, but aims to keep
0.5% of 'some' memory pressure stall time (PSI) for the last 10 seconds
using the aiming-oriented auto tuning.
For each setup, we run realistic workloads from PARSEC3 and SPLASH-2X
benchmark suites. For each run, we measure RSS and runtime of the
workload, and 'some' memory pressure stall time (PSI) of the system. We
repeat the runs five times and use averaged measurements.
For simple comparison of the results, we normalize the measurements to
those of 'original'. In the case of the PSI, though, the measurement for
'original' was zero, so we normalize the value to that of 'not-tuned'
scheme's result. The normalized results are shown below.
Not-tuned Offline-tuned AFDAA
RSS 0.622688178226118 0.787950678944904 0.740093483278979
runtime 1.11767826657912 1.0564674983585 1.0910833880499
PSI 1 0.727521443794069 0.308498846350299
The 'not-tuned' scheme achieves about 38.7% memory saving but incur about
11.7% runtime slowdown. The 'offline-tuned' scheme achieves about 22.2%
memory saving with about 5.5% runtime slowdown. It also achieves about
28.2% memory pressure stall time saving. AFDAA achieves about 26% memory
saving with about 9.1% runtime slowdown. It also achieves about 69.1%
memory pressure stall time saving. We repeat this test multiple times,
and get consistent results. AFDAA is now integrated in our daily DAMON
performance test setup.
Apparently the aggressiveness of 'AFDAA' setup is somewhere between those
of 'not-tuned' and 'offline-tuned' setup, since its memory saving and
runtime overhead are between those of the other two setups. Actually we
set the memory pressure stall time goal aiming for this middle
aggressiveness. The difference in the two metrics are not significant,
though. However, it shows significant saving of the memory pressure stall
time, which was the goal of the auto-tuning, over the two variants.
Hence, we conclude the automatic tuning is working as expected.
Please note that the AFDAA setup is only for the evaluation, and
therefore intentionally set a bit aggressive. It might not be
appropriate for production environments.
The test code is also available[2], so you could reproduce it on your
system and workloads.
Patches Sequence
================
The first four patches implement the core logic and user interfaces for
the auto tuning. The first patch implements the core logic for the auto
tuning, and the API for DAMOS users in the kernel space. The second
patch implements basic file operations of DAMON sysfs directories and
files that will be used for setting the goals and providing the
feedback. The third patch connects the quota goals files inputs to the
DAMOS core logic. Finally the fourth patch implements a dedicated DAMOS
sysfs command for efficiently committing the quota goals feedback.
Two patches for simple tests of the logic and interfaces follow. The
fifth patch implements the core logic unit test. The sixth patch
implements a selftest for the DAMON Sysfs interface for the goals.
Finally, three patches for documentation follows. The seventh patch
documents the design of the feature. The eighth patch updates the API
doc for the new sysfs files. The final eighth patch updates the usage
document for the features.
References
==========
[1] DAOS paper:
https://www.amazon.science/publications/daos-data-access-aware-operating-system
[2] Evaluation code:
|
||
Nhat Pham
|
b5ba474f3f |
zswap: shrink zswap pool based on memory pressure
Currently, we only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages). This patch implements a memcg- and NUMA-aware shrinker for zswap, that is initiated when there is memory pressure. The shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis. Furthermore, to make it more robust for many workloads and prevent overshrinking (i.e evicting warm pages that might be refaulted into memory), we build in the following heuristics: * Estimate the number of warm pages residing in zswap, and attempt to protect this region of the zswap LRU. * Scale the number of freeable objects by an estimate of the memory saving factor. The better zswap compresses the data, the fewer pages we will evict to swap (as we will otherwise incur IO for relatively small memory saving). * During reclaim, if the shrinker encounters a page that is also being brought into memory, the shrinker will cautiously terminate its shrinking action, as this is a sign that it is touching the warmer region of the zswap LRU. As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds. [nphamcs@gmail.com: check shrinker enablement early, use less costly stat flushing] Link: https://lkml.kernel.org/r/20231206194456.3234203-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231130194023.4102148-7-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Domenico Cerasuolo
|
7108cc3f76 |
mm: memcg: add per-memcg zswap writeback stat
Since zswap now writes back pages from memcg-specific LRUs, we now need a new stat to show writebacks count for each memcg. [nphamcs@gmail.com: rename ZSWP_WB to ZSWPWB] Link: https://lkml.kernel.org/r/20231205193307.2432803-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231130194023.4102148-5-nphamcs@gmail.com Suggested-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Domenico Cerasuolo
|
a65b0e7607 |
zswap: make shrinking memcg-aware
Currently, we only have a single global LRU for zswap. This makes it impossible to perform worload-specific shrinking - an memcg cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking: https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u This patch fully resolves the issue by replacing the global zswap LRU with memcg- and NUMA-specific LRUs, and modify the reclaim logic: a) When a store attempt hits an memcg limit, it now triggers a synchronous reclaim attempt that, if successful, allows the new hotter page to be accepted by zswap. b) If the store attempt instead hits the global zswap limit, it will trigger an asynchronous reclaim attempt, in which an memcg is selected for reclaim in a round-robin-like fashion. [nphamcs@gmail.com: use correct function for the onlineness check, use mem_cgroup_iter_break()] Link: https://lkml.kernel.org/r/20231205195419.2563217-1-nphamcs@gmail.com [nphamcs@gmail.com: drop the pool's reference at the end of the writeback step] Link: https://lkml.kernel.org/r/20231206030627.4155634-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231130194023.4102148-4-nphamcs@gmail.com Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Co-developed-by: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Nhat Pham <nphamcs@gmail.com> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Nhat Pham
|
0a97c01cd2 |
list_lru: allow explicit memcg and NUMA node selection
Patch series "workload-specific and memory pressure-driven zswap writeback", v8. There are currently several issues with zswap writeback: 1. There is only a single global LRU for zswap, making it impossible to perform worload-specific shrinking - an memcg under memory pressure cannot determine which pages in the pool it owns, and often ends up writing pages from other memcgs. This issue has been previously observed in practice and mitigated by simply disabling memcg-initiated shrinking: https://lore.kernel.org/all/20230530232435.3097106-1-nphamcs@gmail.com/T/#u But this solution leaves a lot to be desired, as we still do not have an avenue for an memcg to free up its own memory locked up in the zswap pool. 2. We only shrink the zswap pool when the user-defined limit is hit. This means that if we set the limit too high, cold data that are unlikely to be used again will reside in the pool, wasting precious memory. It is hard to predict how much zswap space will be needed ahead of time, as this depends on the workload (specifically, on factors such as memory access patterns and compressibility of the memory pages). This patch series solves these issues by separating the global zswap LRU into per-memcg and per-NUMA LRUs, and performs workload-specific (i.e memcg- and NUMA-aware) zswap writeback under memory pressure. The new shrinker does not have any parameter that must be tuned by the user, and can be opted in or out on a per-memcg basis. As a proof of concept, we ran the following synthetic benchmark: build the linux kernel in a memory-limited cgroup, and allocate some cold data in tmpfs to see if the shrinker could write them out and improved the overall performance. Depending on the amount of cold data generated, we observe from 14% to 35% reduction in kernel CPU time used in the kernel builds. This patch (of 6): The interface of list_lru is based on the assumption that the list node and the data it represents belong to the same allocated on the correct node/memcg. While this assumption is valid for existing slab objects LRU such as dentries and inodes, it is undocumented, and rather inflexible for certain potential list_lru users (such as the upcoming zswap shrinker and the THP shrinker). It has caused us a lot of issues during our development. This patch changes list_lru interface so that the caller must explicitly specify numa node and memcg when adding and removing objects. The old list_lru_add() and list_lru_del() are renamed to list_lru_add_obj() and list_lru_del_obj(), respectively. It also extends the list_lru API with a new function, list_lru_putback, which undoes a previous list_lru_isolate call. Unlike list_lru_add, it does not increment the LRU node count (as list_lru_isolate does not decrement the node count). list_lru_putback also allows for explicit memcg and NUMA node selection. Link: https://lkml.kernel.org/r/20231130194023.4102148-1-nphamcs@gmail.com Link: https://lkml.kernel.org/r/20231130194023.4102148-2-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Tested-by: Bagas Sanjaya <bagasdotme@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
067311d33e |
maple_tree: separate ma_state node from status
The maple tree node is overloaded to keep status as well as the active node. This, unfortunately, results in a re-walk on underflow or overflow. Since the maple state has room, the status can be placed in its own enum in the structure. Once an underflow/overflow is detected, certain modes can restore the status to active and others may need to re-walk just that one node to see the entry. The status being an enum has the benefit of detecting unhandled status in switch statements. [Liam.Howlett@oracle.com: fix comments about MAS_*] Link: https://lkml.kernel.org/r/20231106154124.614247-1-Liam.Howlett@oracle.com [Liam.Howlett@oracle.com: update forking to separate maple state and node] Link: https://lkml.kernel.org/r/20231106154551.615042-1-Liam.Howlett@oracle.com [Liam.Howlett@oracle.com: fix mas_prev() state separation code] Link: https://lkml.kernel.org/r/20231207193319.4025462-1-Liam.Howlett@oracle.com Link: https://lkml.kernel.org/r/20231101171629.3612299-9-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam R. Howlett
|
bf857ddd21 |
maple_tree: move debug check to __mas_set_range()
__mas_set_range() was created to shortcut resetting the maple state and a debug check was added to the caller (the vma iterator) to ensure the internal maple state remains safe to use. Move the debug check from the vma iterator into the maple tree itself so other users do not incorrectly use the advanced maple state modification. Fallout from this change include a large amount of debug setup needed to be moved to earlier in the header, and the maple_tree.h radix-tree test code needed to move the inclusion of the header to after the atomic define. None of those changes have functional changes. Link: https://lkml.kernel.org/r/20231101171629.3612299-4-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Peng Zhang <zhangpeng.00@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Juntong Deng
|
5d4c6ac946 |
kasan: record and report more information
Record and report more information to help us find the cause of the bug and to help us correlate the error with other system events. This patch adds recording and showing CPU number and timestamp at allocation and free (controlled by CONFIG_KASAN_EXTRA_INFO). The timestamps in the report use the same format and source as printk. Error occurrence timestamp is already implicit in the printk log, and CPU number is already shown by dump_stack_lvl, so there is no need to add it. In order to record CPU number and timestamp at allocation and free, corresponding members need to be added to the relevant data structures, which will lead to increased memory consumption. In Generic KASAN, members are added to struct kasan_track. Since in most cases, alloc meta is stored in the redzone and free meta is stored in the object or the redzone, memory consumption will not increase much. In SW_TAGS KASAN and HW_TAGS KASAN, members are added to struct kasan_stack_ring_entry. Memory consumption increases as the size of struct kasan_stack_ring_entry increases (this part of the memory is allocated by memblock), but since this is configurable, it is up to the user to choose. Link: https://lkml.kernel.org/r/VI1P193MB0752BD991325D10E4AB1913599BDA@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Dmitry Rokosov
|
664dc2189d |
mm: memcg: add reminder comment for the memcg v2 events
To maintain the correct state, it is important to ensure that events for the memory cgroup v2 are aligned with the sample cgroup codes. Link: https://lkml.kernel.org/r/20231123071945.25811-4-ddrokosov@salutedevices.com Signed-off-by: Dmitry Rokosov <ddrokosov@salutedevices.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Muchun Song
|
ebc20dcac4 |
mm: hugetlb_vmemmap: convert page to folio
There are still some places where it does not be converted to folio, this patch convert all of them to folio. And this patch also does some trival cleanup to fix the code style problems. Link: https://lkml.kernel.org/r/20231127084645.27017-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Muchun Song
|
be035a2acf |
mm: hugetlb_vmemmap: move PageVmemmapSelfHosted() check to split_vmemmap_huge_pmd()
To check a page whether it is self-hosted needs to traverse the page table (e.g. pmd_off_k()), however, we already have done this in the next calling of vmemmap_remap_range(). Moving PageVmemmapSelfHosted() check to vmemmap_pmd_entry() could simplify the code a bit. Link: https://lkml.kernel.org/r/20231127084645.27017-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Muchun Song
|
fb93ed6334 |
mm: hugetlb_vmemmap: use walk_page_range_novma() to simplify the code
It is unnecessary to implement a series of dedicated page table walking helpers since there is already a general one walk_page_range_novma(). So use it to simplify the code. Link: https://lkml.kernel.org/r/20231127084645.27017-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Muchun Song
|
b123d09304 |
mm: pagewalk: assert write mmap lock only for walking the user page tables
The
|
||
Fabio M. De Francesco
|
829c3151f0 |
mm/swapfile: replace kmap_atomic() with kmap_local_page()
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in swapfile.c. kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In mm/swapfile.c, the blocks of code between the mappings and un-mappings do not depend on the above-mentioned side effects of kmap_atomic(), so that the mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231127155452.586387-1-fabio.maria.de.francesco@linux.intel.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Fabio M. De Francesco
|
003ae2fb0b |
mm/zswap: replace kmap_atomic() with kmap_local_page()
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in zswap.c. kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In mm/zswap.c, the blocks of code between the mappings and un-mappings do not depend on the above-mentioned side effects of kmap_atomic(), so that the mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231127160058.586446-1-fabio.maria.de.francesco@linux.intel.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Reviewed-by: Nhat Pham <nphamcs@gmail.com> Acked-by: Chris Li <chrisl@kernel.org> (Google) Cc: Ira Weiny <ira.weiny@intel.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Yong Wang
|
27873192ac |
mm, oom:dump_tasks add rss detailed information printing
When the system is under oom, it prints out the RSS information of each
process. However, we don't know the size of rss_anon, rss_file, and
rss_shmem.
To distinguish the memory occupied by anonymous or file mappings
or shmem, could help us identify the root cause of the oom.
So this patch adds RSS details, which refers to the /proc/<pid>/status[1].
It can help us know more about process memory usage.
Example of oom including the new rss_* fields:
[ 1630.902466] Tasks state (memory values in pages):
[ 1630.902870] [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
[ 1630.903619] [ 149] 0 149 486 288 0 288 0 36864 0 0 ash
[ 1630.904210] [ 156] 0 156 153531 153345 153345 0 0 1269760 0 0 mm_test
[1] commit
|
||
Peter Xu
|
e9119fb657 |
mm/gup: fix follow_devmap_p[mu]d() on page==NULL handling
This is a bug found not by any report but only by code observations. When GUP sees a devpmd/devpud and if page==NULL is returned, it means a fault is probably required. Here falling through when page==NULL can cause unexpected behavior. Fix both cases by catching the page==NULL cases with no_page_table(). Link: https://lkml.kernel.org/r/20231123180222.1048297-1-peterx@redhat.com Fixes: |
||
Charan Teja Kalla
|
ac3f3b0a55 |
mm: page_alloc: unreserve highatomic page blocks before oom
__alloc_pages_direct_reclaim() is called from slowpath allocation where
high atomic reserves can be unreserved after there is a progress in
reclaim and yet no suitable page is found. Later should_reclaim_retry()
gets called from slow path allocation to decide if the reclaim needs to be
retried before OOM kill path is taken.
should_reclaim_retry() checks the available(reclaimable + free pages)
memory against the min wmark levels of a zone and returns:
a) true, if it is above the min wmark so that slow path allocation will
do the reclaim retries.
b) false, thus slowpath allocation takes oom kill path.
should_reclaim_retry() can also unreserves the high atomic reserves **but
only after all the reclaim retries are exhausted.**
In a case where there are almost none reclaimable memory and free pages
contains mostly the high atomic reserves but allocation context can't use
these high atomic reserves, makes the available memory below min wmark
levels hence false is returned from should_reclaim_retry() leading the
allocation request to take OOM kill path. This can turn into a early oom
kill if high atomic reserves are holding lot of free memory and
unreserving of them is not attempted.
(early)OOM is encountered on a VM with the below state:
[ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB
high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB
active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB
present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kB
local_pcp:492kB free_cma:0kB
[ 295.998656] lowmem_reserve[]: 0 32
[ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH)
33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB
0*4096kB = 7752kB
Per above log, the free memory of ~7MB exist in the high atomic reserves
is not freed up before falling back to oom kill path.
Fix it by trying to unreserve the high atomic reserves in
should_reclaim_retry() before __alloc_pages_direct_reclaim() can fallback
to oom kill path.
Link: https://lkml.kernel.org/r/1700823445-27531-1-git-send-email-quic_charante@quicinc.com
Fixes:
|
||
Charan Teja Kalla
|
9cd20f3fe0 |
mm: page_alloc: enforce minimum zone size to do high atomic reserves
Highatomic reserves are set to roughly 1% of zone for maximum and a pageblock size for minimum. Encountered a system with the below configuration: Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB On such systems, even a single pageblock makes highatomic reserves are set to ~8% of the zone memory. This high value can easily exert pressure on the zone. Per discussion with Michal and Mel, it is not much useful to reserve the memory for highatomic allocations on such small systems[1]. Since the minimum size for high atomic reserves is always going to be a pageblock size and if 1% of zone managed pages is going to be below pageblock size, don't reserve memory for high atomic allocations. Thanks Michal for this suggestion[2]. Since no memory is being reserved for high atomic allocations and if respective allocation failures are seen, this patch can be reverted. [1] https://lore.kernel.org/linux-mm/20231117161956.d3yjdxhhm4rhl7h2@techsingularity.net/ [2] https://lore.kernel.org/linux-mm/ZVYRJMUitykepLRy@tiehlicka/ Link: https://lkml.kernel.org/r/c3a2a48e2cfe08176a80eaf01c110deb9e918055.1700821416.git.quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Acked-by: David Rientjes <rientjes@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Charan Teja Kalla
|
d68e39fc45 |
mm: page_alloc: correct high atomic reserve calculations
Patch series "mm: page_alloc: fixes for high atomic reserve caluculations", v3. The state of the system where the issue exposed shown in oom kill logs: [ 295.998653] Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB active_anon:4kB inactive_anon:0kB active_file:24kB inactive_file:24kB unevictable:1220kB writepending:0kB present:70732kB managed:49224kB mlocked:0kB bounce:0kB free_pcp:688kBlocal_pcp:492kB free_cma:0kB [ 295.998656] lowmem_reserve[]: 0 32 [ 295.998659] Normal: 508*4kB (UMEH) 241*8kB (UMEH) 143*16kB (UMEH) 33*32kB (UH) 7*64kB (UH) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 7752kB From the above, it is seen that ~16MB of memory reserved for high atomic reserves against the expectation of 1% reserves which is fixed in the 1st patch. Don't reserve the high atomic page blocks if 1% of zone memory size is below a pageblock size. This patch (of 2): reserve_highatomic_pageblock() aims to reserve the 1% of the managed pages of a zone, which is used for the high order atomic allocations. It uses the below calculation to reserve: static void reserve_highatomic_pageblock(struct page *page, ....) { ....... max_managed = (zone_managed_pages(zone) / 100) + pageblock_nr_pages; if (zone->nr_reserved_highatomic >= max_managed) goto out; zone->nr_reserved_highatomic += pageblock_nr_pages; set_pageblock_migratetype(page, MIGRATE_HIGHATOMIC); move_freepages_block(zone, page, MIGRATE_HIGHATOMIC, NULL); out: .... } Since we are always appending the 1% of zone managed pages count to pageblock_nr_pages, the minimum it is turning into 2 pageblocks as the nr_reserved_highatomic is incremented/decremented in pageblock sizes. Encountered a system(actually a VM running on the Linux kernel) with the below zone configuration: Normal free:7728kB boost:0kB min:804kB low:1004kB high:1204kB reserved_highatomic:8192KB managed:49224kB The existing calculations making it to reserve the 8MB(with pageblock size of 4MB) i.e. 16% of the zone managed memory. Reserving such high amount of memory can easily exert memory pressure in the system thus may lead into unnecessary reclaims till unreserving of high atomic reserves. Since high atomic reserves are managed in pageblock size granules, as MIGRATE_HIGHATOMIC is set for such pageblock, fix the calculations for high atomic reserves as, minimum is pageblock size , maximum is approximately 1% of the zone managed pages. Link: https://lkml.kernel.org/r/cover.1700821416.git.quic_charante@quicinc.com Link: https://lkml.kernel.org/r/1660034138397b82a0a8b6ae51cbe96bd583d89e.1700821416.git.quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Rientjes <rientjes@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavankumar Kondeti <quic_pkondeti@quicinc.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Serge Semin
|
01846c6c70 |
mm/mm_init.c: append newline to the unavailable ranges log-message
Based on the init_unavailable_range() method and it's callee semantics no multi-line info messages are intended to be printed to the console. Thus append the '\n' symbol to the respective info string. Link: https://lkml.kernel.org/r/20231122182419.30633-7-fancer.lancer@gmail.com Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Serge Semin
|
ecf5dd1ffe |
mm/mm_init.c: extend init unavailable range doc info
Besides of the already described reasons the pages backended memory holes might be persistent due to having memory mapped IO spaces behind those ranges in the framework of flatmem kernel config. Add such note to the init_unavailable_range() method kdoc in order to point out to one more reason of having the function executed for such regions. [fancer.lancer@gmail.com: update per Mike] Link: https://lkml.kernel.org/r/20231202111855.18392-1-fancer.lancer@gmail.com Link: https://lkml.kernel.org/r/20231122182419.30633-6-fancer.lancer@gmail.com Signed-off-by: Serge Semin <fancer.lancer@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
50668b53f8 |
mm/damon/core-test: test damon_split_region_at()'s access rate copying
damon_split_region_at() should set access rate related fields of the resulting regions same. It may forgotten, and actually there was the mistake before. Test it with the unit test case for the function. Link: https://lkml.kernel.org/r/20231119171529.66863-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Juntong Deng
|
a5989d4ed4 |
kasan: improve free meta storage in Generic KASAN
Currently free meta can only be stored in object if the object is not smaller than free meta. After the improvement, when the object is smaller than free meta and SLUB DEBUG is not enabled, it is possible to store part of the free meta in the object, reducing the increased size of the red zone. Example: free meta size: 16 bytes alloc meta size: 16 bytes object size: 8 bytes optimal redzone size (object_size <= 64): 16 bytes Before improvement: actual redzone size = alloc meta size + free meta size = 32 bytes After improvement: actual redzone size = alloc meta size + (free meta size - object size) = 24 bytes [juntong.deng@outlook.com: make kasan_metadata_size() adapt to the improved free meta storage] Link: https://lkml.kernel.org/r/VI1P193MB0752675D6E0A2D16CE656F8299BAA@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM Link: https://lkml.kernel.org/r/VI1P193MB0752DE2CCD9046B5FED0AA8E99B5A@VI1P193MB0752.EURP193.PROD.OUTLOOK.COM Signed-off-by: Juntong Deng <juntong.deng@outlook.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Fabio M. De Francesco
|
f542b8e582 |
mm/page_poison: replace kmap_atomic() with kmap_local_page()
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. The code blocks between the mappings and un-mappings do not rely on the above-mentioned side effects of kmap_atomic(), so that mere replacements of the old API with the new one is all that they require (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120142836.7219-1-fabio.maria.de.francesco@linux.intel.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Fabio M. De Francesco
|
f2bcc99a5e |
mm/mempool: replace kmap_atomic() with kmap_local_page()
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. The code blocks between the mappings and un-mappings don't rely on the above-mentioned side effects of kmap_atomic(), so that mere replacements of the old API with the new one is all that they require (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120142640.7077-1-fabio.maria.de.francesco@linux.intel.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Fabio M. De Francesco
|
24d2613a63 |
mm/memory: use kmap_local_page() in __wp_page_copy_user()
kmap_atomic() has been deprecated in favor of kmap_local_{folio,page}. Therefore, replace kmap_atomic() with kmap_local_page in __wp_page_copy_user(). kmap_atomic() disables preemption in !PREEMPT_RT kernels and unconditionally disables also page-faults. My limited knowledge of the implementation of __wp_page_copy_user() makes me think that the latter side effect is still needed here, but kmap_local_page() is implemented not to disable page-faults. So, in addition to the conversion to local mapping, add explicit pagefault_disable() / pagefault_enable() between mapping and un-mapping. Link: https://lkml.kernel.org/r/20231120142418.6977-1-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Fabio M. De Francesco
|
b335198966 |
mm/ksm: use kmap_local_page() in calc_checksum()
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in calc_checksum(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In calc_checksum(), the block of code between the mapping and un-mapping does not depend on the above-mentioned side effects of kmap_aatomic(), so that a mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120141855.6761-1-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Fabio De Francesco
|
2f7537620f |
mm/util: use kmap_local_page() in memcmp_pages()
kmap_atomic() has been deprecated in favor of kmap_local_page(). Therefore, replace kmap_atomic() with kmap_local_page() in memcmp_pages(). kmap_atomic() is implemented like a kmap_local_page() which also disables page-faults and preemption (the latter only in !PREEMPT_RT kernels). The kernel virtual addresses returned by these two API are only valid in the context of the callers (i.e., they cannot be handed to other threads). With kmap_local_page() the mappings are per thread and CPU local like in kmap_atomic(); however, they can handle page-faults and can be called from any context (including interrupts). The tasks that call kmap_local_page() can be preempted and, when they are scheduled to run again, the kernel virtual addresses are restored and are still valid. In memcmp_pages(), the block of code between the mapping and un-mapping does not depend on the above-mentioned side effects of kmap_aatomic(), so that mere replacements of the old API with the new one is all that is required (i.e., there is no need to explicitly call pagefault_disable() and/or preempt_disable()). Link: https://lkml.kernel.org/r/20231120141554.6612-1-fmdefrancesco@gmail.com Signed-off-by: Fabio M. De Francesco <fabio.maria.de.francesco@linux.intel.com> Cc: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Sumanth Korikkar
|
95a2ac9370 |
mm: use vmem_altmap code without CONFIG_ZONE_DEVICE
vmem_altmap_free() and vmem_altmap_offset() could be utlized without CONFIG_ZONE_DEVICE enabled. For example, mm/memory_hotplug.c:__add_pages() relies on that. The altmap is no longer restricted to ZONE_DEVICE handling, but instead depends on CONFIG_SPARSEMEM_VMEMMAP. When CONFIG_SPARSEMEM_VMEMMAP is disabled, these functions are defined as inline stubs, ensuring compatibility with configurations that do not use sparsemem vmemmap. Without it, lkp reported the following: ld: arch/x86/mm/init_64.o: in function `remove_pagetable': init_64.c:(.meminit.text+0xfc7): undefined reference to `vmem_altmap_free' Link: https://lkml.kernel.org/r/20231120145354.308999-4-sumanthk@linux.ibm.com Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com> Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202311180545.VeyRXEDq-lkp@intel.com/ Reviewed-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
773688a6cb |
kasan: use stack_depot_put for Generic mode
Evict alloc/free stack traces from the stack depot for Generic KASAN once they are evicted from the quaratine. For auxiliary stack traces, evict the oldest stack trace once a new one is saved (KASAN only keeps references to the last two). Also evict all saved stack traces on krealloc. To avoid double-evicting and mis-evicting stack traces (in case KASAN's metadata was corrupted), reset KASAN's per-object metadata that stores stack depot handles when the object is initialized and when it's evicted from the quarantine. Note that stack_depot_put is no-op if the handle is 0. Link: https://lkml.kernel.org/r/5cef104d9b842899489b4054fe8d1339a71acee0.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
2d5524635b |
slub, kasan: improve interaction of KASAN and slub_debug poisoning
When both KASAN and slub_debug are enabled, when a free object is being prepared in setup_object, slub_debug poisons the object data before KASAN initializes its per-object metadata. Right now, in setup_object, KASAN only initializes the alloc metadata, which is always stored outside of the object. slub_debug is aware of this and it skips poisoning and checking that memory area. However, with the following patch in this series, KASAN also starts initializing its free medata in setup_object. As this metadata might be stored within the object, this initialization might overwrite the slub_debug poisoning. This leads to slub_debug reports. Thus, skip checking slub_debug poisoning of the object data area that overlaps with the in-object KASAN free metadata. Also make slub_debug poisoning of tail kmalloc redzones more precise when KASAN is enabled: slub_debug can still poison and check the tail kmalloc allocation area that comes after the KASAN free metadata. Link: https://lkml.kernel.org/r/20231122231202.121277-1-andrey.konovalov@linux.dev Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
f816938bff |
kasan: use stack_depot_put for tag-based modes
Make tag-based KASAN modes evict stack traces from the stack depot once they are evicted from the stack ring. Internally, pass STACK_DEPOT_FLAG_GET to stack_depot_save_flags (via kasan_save_stack) to increment the refcount when saving a new entry to stack ring and call stack_depot_put when removing an entry from stack ring. Link: https://lkml.kernel.org/r/b4773e5c1b0b9df6826ec0b65c1923feadfa78e5.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
7d88e4f768 |
kasan: check object_size in kasan_complete_mode_report_info
Check the object size when looking up entries in the stack ring. If the size of the object for which a report is being printed does not match the size of the object for which a stack trace has been saved in the stack ring, the saved stack trace is irrelevant. Link: https://lkml.kernel.org/r/68c6948175aadd7e7e7deea61725103d64a4528f.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
f3b5979862 |
kasan: remove atomic accesses to stack ring entries
Remove the atomic accesses to entry fields in save_stack_info and kasan_complete_mode_report_info for tag-based KASAN modes. These atomics are not required, as the read/write lock prevents the entries from being read (in kasan_complete_mode_report_info) while being written (in save_stack_info) and the try_cmpxchg prevents the same entry from being rewritten (in save_stack_info) in the unlikely case of wrapping during writing. Link: https://lkml.kernel.org/r/29f59126d9845c5257b6c29cd7ad113b16f19f47.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
022012dcf4 |
lib/stackdepot, kasan: add flags to __stack_depot_save and rename
Change the bool can_alloc argument of __stack_depot_save to a u32 argument that accepts a set of flags. The following patch will add another flag to stack_depot_save_flags besides the existing STACK_DEPOT_FLAG_CAN_ALLOC. Also rename the function to stack_depot_save_flags, as __stack_depot_save is a cryptic name, Link: https://lkml.kernel.org/r/645fa15239621eebbd3a10331e5864b718839512.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Andrey Konovalov
|
3bddc3100c |
kmsan: use stack_depot_save instead of __stack_depot_save
Make KMSAN use stack_depot_save instead of __stack_depot_save, as it always passes true to __stack_depot_save as the last argument. Link: https://lkml.kernel.org/r/18092240699efdc6acd78b51e41ea782953e6c8d.1700502145.git.andreyknvl@google.com Signed-off-by: Andrey Konovalov <andreyknvl@google.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Marco Elver <elver@google.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jim Cromie
|
52c5d2bc32 |
kmemleak: add checksum to backtrace report
Change /sys/kernel/debug/kmemleak report format slightly, adding "(extra info)" to the backtrace header: from: " backtrace:" to: " backtrace (crc <cksum>):" The <cksum> allows a user to see recurring backtraces without detailed/careful reading of multiline stacks. So after cycling kmemleak-test a few times, I know some leaks are repeating. bash-5.2# grep backtrace /sys/kernel/debug/kmemleak | wc 62 186 1792 bash-5.2# grep backtrace /sys/kernel/debug/kmemleak | sort -u | wc 37 111 1067 syzkaller parses kmemleak for "unreferenced object" only, so is unaffected by this change. Other github repos are moribund. Link: https://lkml.kernel.org/r/20231116224318.124209-3-jim.cromie@gmail.com Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jim Cromie
|
88f9ee2b30 |
kmemleak: drop (age <increasing>) from leak record
Patch series "tweak kmemleak report format". These 2 patches make minor changes to the report: 1st strips "age <increasing>" from output. This makes the output idempotent; unchanging until a new leak is reported. 2nd adds the backtrace.checksum to the "backtrace:" line. This lets a user see repeats without actually reading the whole backtrace. So now the backtrace line looks like this: backtrace (crc 603070071): I surveyed for un-wanted effects upon users: Syzkaller parses kmemleak in executor/common_linux.h: static void check_leaks(char** frames, int nframes) It just counts occurrences of "unreferenced object", specifically it does not look for "age", nor would it choke on "crc" being added. github has 3 repos with "kmemleak" mentioned, all are moribund. gitlab has 0 hits on "kmemleak". This patch (of 2): Displaying age is pretty, but counter-productive; it changes with current-time, so it surrenders idempotency of the output, which breaks simple hash-based cataloging of the records by the user. The trouble: sequential reads, wo new leaks, get new results: :#> sum /sys/kernel/debug/kmemleak 53439 74 /sys/kernel/debug/kmemleak :#> sum /sys/kernel/debug/kmemleak 59066 74 /sys/kernel/debug/kmemleak and age is why (nothing else changes): :#> grep -v age /sys/kernel/debug/kmemleak | sum 58894 67 :#> grep -v age /sys/kernel/debug/kmemleak | sum 58894 67 Since jiffies is already printed in the "comm" line, age adds nothing. Notably, syzkaller reads kmemleak only for "unreferenced object", and won't care about this reform of age-ism. A few moribund github repos mention it, but don't compile. Link: https://lkml.kernel.org/r/20231116224318.124209-1-jim.cromie@gmail.com Link: https://lkml.kernel.org/r/20231116224318.124209-2-jim.cromie@gmail.com Signed-off-by: Jim Cromie <jim.cromie@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
af7628d6ec |
fs: convert error_remove_page to error_remove_folio
There were already assertions that we were not passing a tail page to error_remove_page(), so make the compiler enforce that by converting everything to pass and use a folio. Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
e130b6514e |
memory-failure: convert truncate_error_page to truncate_error_folio
Both callers now have a folio, so pass it in. Nothing downstream was expecting a tail page; that's asserted in generic_error_remove_page(), for example. Link: https://lkml.kernel.org/r/20231117161447.2461643-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
b6fd410c32 |
memory-failure: use a folio in me_huge_page()
This function was already explicitly calling compound_head(); unfortunately the compiler can't know that and elide the redundant calls to compound_head() buried in page_mapping(), unlock_page(), etc. Switch to using a folio, which does let us elide these calls. Link: https://lkml.kernel.org/r/20231117161447.2461643-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
f709239357 |
memory-failure: convert delete_from_lru_cache() to take a folio
All three callers now have a folio; pass it in instead of the page. Saves five calls to compound_head(). Link: https://lkml.kernel.org/r/20231117161447.2461643-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
6304b531cd |
memory-failure: use a folio in me_pagecache_dirty()
Replaces three hidden calls to compound_head() with one visible one. Link: https://lkml.kernel.org/r/20231117161447.2461643-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
3d47e31790 |
memory-failure: use a folio in me_pagecache_clean()
Patch series "Convert aops->error_remove_page to ->error_remove_folio". This is a memory-failure patch series which converts a lot of uses of page APIs into folio APIs with the usual benefits. This patch (of 6): Replaces three hidden calls to compound_head() with one visible one. Fix up a few comments while I'm modifying this function. Link: https://lkml.kernel.org/r/20231117161447.2461643-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231117161447.2461643-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Barry Song
|
1b5c65b64c |
mm/page_owner: record and dump free_pid and free_tgid
While investigating some complex memory allocation and free bugs especially in multi-processes and multi-threads cases, from time to time, I feel the free stack isn't sufficient as a page can be freed by processes or threads other than the one allocating it. And other processes and threads which free the page often have the exactly same free stack with the one allocating the page. We can't know who free the page only through the free stack though the current page_owner does tell us the pid and tgid of the one allocating the page. This makes the bug investigation often hard. So this patch adds free pid and tgid in page_owner, so that we can easily figure out if the freeing is crossing processes or threads. Link: https://lkml.kernel.org/r/20231114034202.73098-1-v-songbaohua@oppo.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Cc: Audra Mitchell <audra@redhat.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kassey Li <quic_yingangl@quicinc.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
York Jasper Niebuhr
|
932b59e3be |
mm: fix process_vm_rw page counts
1. There is a "-1" missing in the page number calculation in process_vm_rw_core. While this can't break anything, it can cause unnecessary allocations in certain cases: Consider handling an iovec ranging over PVM_MAX_PP_ARRAY_COUNT pages that is also aligned to a page boundary. While pp_stack could hold references to such an amount of pinned pages, nr_pages yields (PVM_MAX_PP_ARRAY + 1) in process_vm_rw_core. Consequently, a larger buffer is allocated with kmalloc for no reason. For any page boundary aligned iovec that is a multiple of PAGE_SIZE and larger than PVM_MAX_PP_ARRAY_COUNT pages, nr_pages will be too big by 1 and thus kmalloc allocates excess space for one more pointer. 2. max_pages_per_loop is constant and there is no reason to have it as a variable. A macro does the job just fine and saves memory. 3. Replaced "sizeof(struct pages *)" with "sizeof(struct page *)" to have matching types for allocation and prevent confusion. Link: https://lkml.kernel.org/r/20231111184859.44264-1-yjnworkstation@gmail.com Signed-off-by: York Jasper Niebuhr <yjnworkstation@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lukas Bulwahn
|
69e583eaca |
mmap: remove the IA64-specific vma expansion implementation
With commit
|
||
Brendan Jackman
|
17b46e7beb |
mm/page_alloc: dedupe some memcg uncharging logic
The duplication makes it seem like some work is required before uncharging in the !PageHWPoison case. But it isn't, so we can simplify the code a little. Note the PageMemcgKmem check is redundant, but I've left it in as it avoids an unnecessary function call. Link: https://lkml.kernel.org/r/20231108164920.3401565-1-jackmanb@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
2033c98cce |
mm: remove invalidate_inode_page()
All callers are now converted to call mapping_evict_folio(). Link: https://lkml.kernel.org/r/20231108182809.602073-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
761d79fbad |
mm: convert isolate_page() to mf_isolate_folio()
The only caller now has a folio, so pass it in and operate on it. Saves many page->folio conversions and introduces only one folio->page conversion when calling isolate_movable_page(). Link: https://lkml.kernel.org/r/20231108182809.602073-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
049b26048d |
mm: convert soft_offline_in_use_page() to use a folio
Replace the existing head-page logic with folio logic. Link: https://lkml.kernel.org/r/20231108182809.602073-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
19369d866a |
mm: use mapping_evict_folio() in truncate_error_page()
We already have the folio and the mapping, so replace the call to invalidate_inode_page() with mapping_evict_folio(). Link: https://lkml.kernel.org/r/20231108182809.602073-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
01d1e0e6b7 |
mm: convert __do_fault() to use a folio
Convert vmf->page to a folio as soon as we're going to use it. This fixes
a bug if the fault handler returns a tail page with hardware poison; tail
pages have an invalid page->index, so we would fail to unmap the page from
the page tables. We actually have to unmap the entire folio (or
mapping_evict_folio() will fail), so use unmap_mapping_folio() instead.
This also saves various calls to compound_head() hidden in lock_page(),
put_page(), etc.
Link: https://lkml.kernel.org/r/20231108182809.602073-3-willy@infradead.org
Fixes:
|
||
Matthew Wilcox (Oracle)
|
1e12cbb9f6 |
mm: make mapping_evict_folio() the preferred way to evict clean folios
Patch series "Fix fault handler's handling of poisoned tail pages". Since introducing the ability to have large folios in the page cache, it's been possible to have a hwpoisoned tail page returned from the fault handler. We handle this situation poorly; failing to remove the affected page from use. This isn't a minimal patch to fix it, it's a full conversion of all the code surrounding it. This patch (of 6): invalidate_inode_page() does very little beyond calling mapping_evict_folio(). Move the check for mapping being NULL into mapping_evict_folio() and make it available to the rest of the MM for use in the next few patches. Link: https://lkml.kernel.org/r/20231108182809.602073-1-willy@infradead.org Link: https://lkml.kernel.org/r/20231108182809.602073-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
b5612c3686 |
mm: return void from folio_start_writeback() and related functions
Nobody now checks the return value from any of these functions, so add an assertion at the beginning of the function and return void. Link: https://lkml.kernel.org/r/20231108204605.745109-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Cc: David Howells <dhowells@redhat.com> Cc: Steve French <sfrench@samba.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Minjie Du
|
8ff252663d |
mm/filemap: increase usage of folio_next_index() helper
Simplify code pattern of 'folio->index + folio_nr_pages(folio)' by using the existing helper folio_next_index() in filemap_get_folios_contig(). Link: https://lkml.kernel.org/r/20231107024635.4512-1-duminjie@vivo.com Signed-off-by: Minjie Du <duminjie@vivo.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vishal Verma
|
6b8f0798b8 |
mm/memory_hotplug: split memmap_on_memory requests across memblocks
The MHP_MEMMAP_ON_MEMORY flag for hotplugged memory is restricted to 'memblock_size' chunks of memory being added. Adding a larger span of memory precludes memmap_on_memory semantics. For users of hotplug such as kmem, large amounts of memory might get added from the CXL subsystem. In some cases, this amount may exceed the available 'main memory' to store the memmap for the memory being added. In this case, it is useful to have a way to place the memmap on the memory being added, even if it means splitting the addition into memblock-sized chunks. Change add_memory_resource() to loop over memblock-sized chunks of memory if caller requested memmap_on_memory, and if other conditions for it are met. Teach try_remove_memory() to also expect that a memory range being removed might have been split up into memblock sized chunks, and to loop through those as needed. This does preclude being able to use PUD mappings in the direct map; a proposal to how this could be optimized in the future is laid out here[1]. [1]: https://lore.kernel.org/linux-mm/b6753402-2de9-25b2-36e9-eacd49752b19@redhat.com/ Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-2-1253ec050ed0@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Suggested-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Fan Ni <fan.ni@samsung.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vishal Verma
|
82b8a3b49e |
mm/memory_hotplug: replace an open-coded kmemdup() in add_memory_resource()
Patch series "mm: use memmap_on_memory semantics for dax/kmem", v10. The dax/kmem driver can potentially hot-add large amounts of memory originating from CXL memory expanders, or NVDIMMs, or other 'device memories'. There is a chance there isn't enough regular system memory available to fit the memmap for this new memory. It's therefore desirable, if all other conditions are met, for the kmem managed memory to place its memmap on the newly added memory itself. The main hurdle for accomplishing this for kmem is that memmap_on_memory can only be done if the memory being added is equal to the size of one memblock. To overcome this, allow the hotplug code to split an add_memory() request into memblock-sized chunks, and try_remove_memory() to also expect and handle such a scenario. Patch 1 replaces an open-coded kmemdup() Patch 2 teaches the memory_hotplug code to allow for splitting add_memory() and remove_memory() requests over memblock sized chunks. Patch 3 allows the dax region drivers to request memmap_on_memory semantics. CXL dax regions default this to 'on', all others default to off to keep existing behavior unchanged. This patch (of 3): A review of the memmap_on_memory modifications to add_memory_resource() revealed an instance of an open-coded kmemdup(). Replace it with kmemdup(). Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-0-1253ec050ed0@intel.com Link: https://lkml.kernel.org/r/20231107-vv-kmem_memmap-v10-1-1253ec050ed0@intel.com Signed-off-by: Vishal Verma <vishal.l.verma@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fan Ni <fan.ni@samsung.com> Reported-by: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liam Ni
|
ff6c3d81f2 |
NUMA: optimize detection of memory with no node id assigned by firmware
Sanity check that makes sure the nodes cover all memory loops over numa_meminfo to count the pages that have node id assigned by the firmware, then loops again over memblock.memory to find the total amount of memory and in the end checks that the difference between the total memory and memory that covered by nodes is less than some threshold. Worse, the loop over numa_meminfo calls __absent_pages_in_range() that also partially traverses memblock.memory. It's much simpler and more efficient to have a single traversal of memblock.memory that verifies that amount of memory not covered by nodes is less than a threshold. Introduce memblock_validate_numa_coverage() that does exactly that and use it instead of numa_meminfo_cover_memory(). Link: https://lkml.kernel.org/r/20231026020329.327329-1-zhiguangni01@gmail.com Signed-off-by: Liam Ni <zhiguangni01@gmail.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Bibo Mao <maobibo@loongson.cn> Cc: Binbin Zhou <zhoubinbin@loongson.cn> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Feiyang Chen <chenfeiyang@loongson.cn> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: WANG Xuerui <kernel@xen0n.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Baolin Wang
|
3027c6f8eb |
mm: huge_memory: batch tlb flush when splitting a pte-mapped THP
I can observe an obvious tlb flush hotspot when splitting a pte-mapped THP on my ARM64 server, and the distribution of this hotspot is as follows: - 16.85% split_huge_page_to_list + 7.80% down_write - 7.49% try_to_migrate - 7.48% rmap_walk_anon 7.23% ptep_clear_flush + 1.52% __split_huge_page The reason is that the split_huge_page_to_list() will build migration entries for each subpage of a pte-mapped Anon THP by try_to_migrate(), or unmap for file THP, and it will clear and tlb flush for each subpage's pte. Moreover, the split_huge_page_to_list() will set TTU_SPLIT_HUGE_PMD flag to ensure the THP is already a pte-mapped THP before splitting it to some normal pages. Actually, there is no need to flush tlb for each subpage immediately, instead we can batch tlb flush for the pte-mapped THP to improve the performance. After this patch, we can see the batch tlb flush can improve the latency obviously when running thpscale. k6.5-base patched Amean fault-both-1 1071.17 ( 0.00%) 901.83 * 15.81%* Amean fault-both-3 2386.08 ( 0.00%) 1865.32 * 21.82%* Amean fault-both-5 2851.10 ( 0.00%) 2273.84 * 20.25%* Amean fault-both-7 3679.91 ( 0.00%) 2881.66 * 21.69%* Amean fault-both-12 5916.66 ( 0.00%) 4369.55 * 26.15%* Amean fault-both-18 7981.36 ( 0.00%) 6303.57 * 21.02%* Amean fault-both-24 10950.79 ( 0.00%) 8752.56 * 20.07%* Amean fault-both-30 14077.35 ( 0.00%) 10170.01 * 27.76%* Amean fault-both-32 13061.57 ( 0.00%) 11630.08 * 10.96%* Link: https://lkml.kernel.org/r/431d9fb6823036369dcb1d3b2f63732f01df21a7.1698488264.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Peng Zhang
|
d240629148 |
fork: use __mt_dup() to duplicate maple tree in dup_mmap()
In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then directly replacing the entries of VMAs in the new maple tree can result in better performance. __mt_dup() uses DFS pre-order to duplicate the maple tree, so it is efficient. The average time complexity of __mt_dup() is O(n), where n is the number of VMAs. The proof of the time complexity is provided in the commit log that introduces __mt_dup(). After duplicating the maple tree, each element is traversed and replaced (ignoring the cases of deletion, which are rare). Since it is only a replacement operation for each element, this process is also O(n). Analyzing the exact time complexity of the previous algorithm is challenging because each insertion can involve appending to a node, pushing data to adjacent nodes, or even splitting nodes. The frequency of each action is difficult to calculate. The worst-case scenario for a single insertion is when the tree undergoes splitting at every level. If we consider each insertion as the worst-case scenario, we can determine that the upper bound of the time complexity is O(n*log(n)), although this is a loose upper bound. However, based on the test data, it appears that the actual time complexity is likely to be O(n). As the entire maple tree is duplicated using __mt_dup(), if dup_mmap() fails, there will be a portion of VMAs that have not been duplicated in the maple tree. To handle this, we mark the failure point with XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop releasing VMAs that have not been duplicated after this point. There is a "spawn" in byte-unixbench[1], which can be used to test the performance of fork(). I modified it slightly to make it work with different number of VMAs. Below are the test results. The first row shows the number of VMAs. The second and third rows show the number of fork() calls per ten seconds, corresponding to next-20231006 and the this patchset, respectively. The test results were obtained with CPU binding to avoid scheduler load balancing that could cause unstable results. There are still some fluctuations in the test results, but at least they are better than the original performance. 21 121 221 421 821 1621 3221 6421 12821 25621 51221 112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393 114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599 2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42% [1] https://github.com/kdlucas/byte-unixbench/tree/master Link: https://lkml.kernel.org/r/20231027033845.90608-11-zhangpeng.00@bytedance.com Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com> Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mateusz Guzik <mjguzik@gmail.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Mike Christie <michael.christie@oracle.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Li Zhijian
|
23e9f01389 |
mm/vmstat: move pgdemote_* to per-node stats
Demotion will migrate pages across nodes. Previously, only the global demotion statistics were accounted for. Changed them to per-node statistics, making it easier to observe where demotion occurs on each node. This will help to identify which nodes are under pressure. This patch also make pgdemote_* behind CONFIG_NUMA_BALANCING, since demotion is not available for !CONFIG_NUMA_BALANCING With this patch, here is a sample where node0 node1 are DRAM, node3 is PMEM: Global stats: $ grep demote /proc/vmstat pgdemote_kswapd 254288 pgdemote_direct 113497 pgdemote_khugepaged 0 Per-node stats: $ grep demote /sys/devices/system/node/node0/vmstat # demotion source pgdemote_kswapd 68454 pgdemote_direct 83431 pgdemote_khugepaged 0 $ grep demote /sys/devices/system/node/node1/vmstat # demotion source pgdemote_kswapd 185834 pgdemote_direct 30066 pgdemote_khugepaged 0 $ grep demote /sys/devices/system/node/node3/vmstat # demotion target pgdemote_kswapd 0 pgdemote_direct 0 pgdemote_khugepaged 0 Link: https://lkml.kernel.org/r/20231103031450.1456523-1-lizhijian@fujitsu.com Signed-off-by: Li Zhijian <lizhijian@fujitsu.com> Acked-by: "Huang, Ying" <ying.huang@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Jiexun Wang
|
b2f557a21b |
mm/madvise: add cond_resched() in madvise_cold_or_pageout_pte_range()
I conducted real-time testing and observed that madvise_cold_or_pageout_pte_range() causes significant latency under memory pressure, which can be effectively reduced by adding cond_resched() within the loop. I tested on the LicheePi 4A board using Cylictest for latency testing and Ftrace for latency tracing. The board uses TH1520 processor and has a memory size of 8GB. The kernel version is 6.5.0 with the PREEMPT_RT patch applied. The script I tested is as follows: echo wakeup_rt > /sys/kernel/tracing/current_tracer echo 1 > /sys/kernel/tracing/tracing_on echo 0 > /sys/kernel/tracing/tracing_max_latency stress-ng --vm 8 --vm-bytes 2G & cyclictest --mlockall --smp --priority=99 --distance=0 --duration=30m echo 0 > /sys/kernel/tracing/tracing_on cat /sys/kernel/tracing/trace The tracing results before modification are as follows: # tracer: wakeup_rt # # wakeup_rt latency trace v1.1.5 on 6.5.0-rt6-r1208-00003-g999d221864bf # -------------------------------------------------------------------- # latency: 2552 us, #6/6, CPU#3 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:4) # ----------------- # | task: cyclictest-196 (uid:0 nice:0 policy:1 rt_prio:99) # ----------------- # # _--------=> CPU# # / _-------=> irqs-off/BH-disabled # | / _------=> need-resched # || / _-----=> need-resched-lazy # ||| / _----=> hardirq/softirq # |||| / _---=> preempt-depth # ||||| / _--=> preempt-lazy-depth # |||||| / _-=> migrate-disable # ||||||| / delay # cmd pid |||||||| time | caller # \ / |||||||| \ | / stress-n-206 3dn.h512 2us : 206:120:R + [003] 196: 0:R cyclictest stress-n-206 3dn.h512 7us : <stack trace> => __ftrace_trace_stack => __trace_stack => probe_wakeup => ttwu_do_activate => try_to_wake_up => wake_up_process => hrtimer_wakeup => __hrtimer_run_queues => hrtimer_interrupt => riscv_timer_interrupt => handle_percpu_devid_irq => generic_handle_domain_irq => riscv_intc_irq => handle_riscv_irq => do_irq stress-n-206 3dn.h512 9us#: 0 stress-n-206 3d...3.. 2544us : __schedule stress-n-206 3d...3.. 2545us : 206:120:R ==> [003] 196: 0:R cyclictest stress-n-206 3d...3.. 2551us : <stack trace> => __ftrace_trace_stack => __trace_stack => probe_wakeup_sched_switch => __schedule => preempt_schedule => migrate_enable => rt_spin_unlock => madvise_cold_or_pageout_pte_range => walk_pgd_range => __walk_page_range => walk_page_range => madvise_pageout => madvise_vma_behavior => do_madvise => sys_madvise => do_trap_ecall_u => ret_from_exception The tracing results after modification are as follows: # tracer: wakeup_rt # # wakeup_rt latency trace v1.1.5 on 6.5.0-rt6-r1208-00004-gca3876fc69a6-dirty # -------------------------------------------------------------------- # latency: 1689 us, #6/6, CPU#0 | (M:preempt_rt VP:0, KP:0, SP:0 HP:0 #P:4) # ----------------- # | task: cyclictest-217 (uid:0 nice:0 policy:1 rt_prio:99) # ----------------- # # _--------=> CPU# # / _-------=> irqs-off/BH-disabled # | / _------=> need-resched # || / _-----=> need-resched-lazy # ||| / _----=> hardirq/softirq # |||| / _---=> preempt-depth # ||||| / _--=> preempt-lazy-depth # |||||| / _-=> migrate-disable # ||||||| / delay # cmd pid |||||||| time | caller # \ / |||||||| \ | / stress-n-232 0dn.h413 1us+: 232:120:R + [000] 217: 0:R cyclictest stress-n-232 0dn.h413 12us : <stack trace> => __ftrace_trace_stack => __trace_stack => probe_wakeup => ttwu_do_activate => try_to_wake_up => wake_up_process => hrtimer_wakeup => __hrtimer_run_queues => hrtimer_interrupt => riscv_timer_interrupt => handle_percpu_devid_irq => generic_handle_domain_irq => riscv_intc_irq => handle_riscv_irq => do_irq stress-n-232 0dn.h413 19us#: 0 stress-n-232 0d...3.. 1671us : __schedule stress-n-232 0d...3.. 1676us+: 232:120:R ==> [000] 217: 0:R cyclictest stress-n-232 0d...3.. 1687us : <stack trace> => __ftrace_trace_stack => __trace_stack => probe_wakeup_sched_switch => __schedule => preempt_schedule => migrate_enable => free_unref_page_list => release_pages => free_pages_and_swap_cache => tlb_batch_pages_flush => tlb_flush_mmu => unmap_page_range => unmap_vmas => unmap_region => do_vmi_align_munmap.constprop.0 => do_vmi_munmap => __vm_munmap => sys_munmap => do_trap_ecall_u => ret_from_exception After the modification, the cause of maximum latency is no longer madvise_cold_or_pageout_pte_range(), so this modification can reduce the latency caused by madvise_cold_or_pageout_pte_range(). Currently the madvise_cold_or_pageout_pte_range() function exhibits significant latency under memory pressure, which can be effectively reduced by adding cond_resched() within the loop. When the batch_count reaches SWAP_CLUSTER_MAX, we reschedule the task to ensure fairness and avoid long lock holding times. Link: https://lkml.kernel.org/r/85363861af65fac66c7a98c251906afc0d9c8098.1695291046.git.wangjiexun@tinylab.org Signed-off-by: Jiexun Wang <wangjiexun@tinylab.org> Cc: Zhangjin Wu <falcon@tinylab.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
7d6fa31a2f |
mm/damon/sysfs-schemes: add timeout for update_schemes_tried_regions
If a scheme is set to not applied to any monitoring target region for any
reasons including the target access pattern, quota, filters, or
watermarks, writing 'update_schemes_tried_regions' to 'state' DAMON sysfs
file can indefinitely hang. Fix the case by implementing a timeout for
the operation. The time limit is two apply intervals of each scheme.
Link: https://lkml.kernel.org/r/20231124213840.39157-1-sj@kernel.org
Fixes:
|
||
Peter Xu
|
97219cc358 |
mm/Kconfig: make userfaultfd a menuconfig
PTE_MARKER_UFFD_WP is a subconfig for userfaultfd. To make it clear, switch to use menuconfig for userfaultfd. Link: https://lkml.kernel.org/r/20231123224204.1060152-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
1f3730fd9e |
mm/damon/core: copy nr_accesses when splitting region
Regions split function ('damon_split_region_at()') is called at the beginning of an aggregation interval, and when DAMOS applying the actions and charging quota. Because 'nr_accesses' fields of all regions are reset at the beginning of each aggregation interval, and DAMOS was applying the action at the end of each aggregation interval, there was no need to copy the 'nr_accesses' field to the split-out region. However, commit |
||
Sumanth Korikkar
|
f42ce5f087 |
mm/memory_hotplug: fix error handling in add_memory_resource()
In add_memory_resource(), creation of memory block devices occurs after
successful call to arch_add_memory(). However, creation of memory block
devices could fail. In that case, arch_remove_memory() is called to
perform necessary cleanup.
Currently with or without altmap support, arch_remove_memory() is always
passed with altmap set to NULL during error handling. This leads to
freeing of struct pages using free_pages(), eventhough the allocation
might have been performed with altmap support via
altmap_alloc_block_buf().
Fix the error handling by passing altmap in arch_remove_memory(). This
ensures the following:
* When altmap is disabled, deallocation of the struct pages array occurs
via free_pages().
* When altmap is enabled, deallocation occurs via vmem_altmap_free().
Link: https://lkml.kernel.org/r/20231120145354.308999-3-sumanthk@linux.ibm.com
Fixes:
|
||
Sumanth Korikkar
|
001002e737 |
mm/memory_hotplug: add missing mem_hotplug_lock
From Documentation/core-api/memory-hotplug.rst:
When adding/removing/onlining/offlining memory or adding/removing
heterogeneous/device memory, we should always hold the mem_hotplug_lock
in write mode to serialise memory hotplug (e.g. access to global/zone
variables).
mhp_(de)init_memmap_on_memory() functions can change zone stats and
struct page content, but they are currently called w/o the
mem_hotplug_lock.
When memory block is being offlined and when kmemleak goes through each
populated zone, the following theoretical race conditions could occur:
CPU 0: | CPU 1:
memory_offline() |
-> offline_pages() |
-> mem_hotplug_begin() |
... |
-> mem_hotplug_done() |
| kmemleak_scan()
| -> get_online_mems()
| ...
-> mhp_deinit_memmap_on_memory() |
[not protected by mem_hotplug_begin/done()]|
Marks memory section as offline, | Retrieves zone_start_pfn
poisons vmemmap struct pages and updates | and struct page members.
the zone related data |
| ...
| -> put_online_mems()
Fix this by ensuring mem_hotplug_lock is taken before performing
mhp_init_memmap_on_memory(). Also ensure that
mhp_deinit_memmap_on_memory() holds the lock.
online/offline_pages() are currently only called from
memory_block_online/offline(), so it is safe to move the locking there.
Link: https://lkml.kernel.org/r/20231120145354.308999-2-sumanthk@linux.ibm.com
Fixes:
|
||
Hugh Dickins
|
9aa1345d66 |
mm: fix oops when filemap_map_pmd() without prealloc_pte
syzbot reports oops in lockdep's __lock_acquire(), called from
__pte_offset_map_lock() called from filemap_map_pages(); or when I run the
repro, the oops comes in pmd_install(), called from filemap_map_pmd()
called from filemap_map_pages(), just before the __pte_offset_map_lock().
The problem is that filemap_map_pmd() has been assuming that when it finds
pmd_none(), a page table has already been prepared in prealloc_pte; and
indeed do_fault_around() has been careful to preallocate one there, when
it finds pmd_none(): but what if *pmd became none in between?
My 6.6 mods in mm/khugepaged.c, avoiding mmap_lock for write, have made it
easy for *pmd to be cleared while servicing a page fault; but even before
those, a huge *pmd might be zapped while a fault is serviced.
The difference in symptomatic stack traces comes from the "memory model"
in use: pmd_install() uses pmd_populate() uses page_to_pfn(): in some
models that is strict, and will oops on the NULL prealloc_pte; in other
models, it will construct a bogus value to be populated into *pmd, then
__pte_offset_map_lock() oops when trying to access split ptlock pointer
(or some other symptom in normal case of ptlock embedded not pointer).
Link: https://lore.kernel.org/linux-mm/20231115065506.19780-1-jose.pekkarinen@foxhound.fi/
Link: https://lkml.kernel.org/r/6ed0c50c-78ef-0719-b3c5-60c0c010431c@google.com
Fixes:
|
||
Roman Gushchin
|
5f79489a73 |
mm: kmem: properly initialize local objcg variable in current_obj_cgroup()
Erhard reported that the 6.7-rc1 kernel panics on boot if being
built with clang-16. The problem was not reproducible with gcc.
[ 5.975049] general protection fault, probably for non-canonical address 0xf555515555555557: 0000 [#1] SMP KASAN PTI
[ 5.976422] KASAN: maybe wild-memory-access in range [0xaaaaaaaaaaaaaab8-0xaaaaaaaaaaaaaabf]
[ 5.977475] CPU: 3 PID: 1 Comm: systemd Not tainted 6.7.0-rc1-Zen3 #77
[ 5.977860] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 5.977860] RIP: 0010:obj_cgroup_charge_pages+0x27/0x2d5
[ 5.977860] Code: 90 90 90 55 41 57 41 56 41 55 41 54 53 89 d5 41 89 f6 49 89 ff 48 b8 00 00 00 00 00 fc ff df 49 83 c7 10 4d3
[ 5.977860] RSP: 0018:ffffc9000001fb18 EFLAGS: 00010a02
[ 5.977860] RAX: dffffc0000000000 RBX: aaaaaaaaaaaaaaaa RCX: ffff8883eb9a8b08
[ 5.977860] RDX: 0000000000000005 RSI: 0000000000400cc0 RDI: aaaaaaaaaaaaaaaa
[ 5.977860] RBP: 0000000000000005 R08: 3333333333333333 R09: 0000000000000000
[ 5.977860] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8883eb9a8b18
[ 5.977860] R13: 1555555555555557 R14: 0000000000400cc0 R15: aaaaaaaaaaaaaaba
[ 5.977860] FS: 00007f2976438b40(0000) GS:ffff8883eb980000(0000) knlGS:0000000000000000
[ 5.977860] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5.977860] CR2: 00007f29769e0060 CR3: 0000000107222003 CR4: 0000000000370eb0
[ 5.977860] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5.977860] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5.977860] Call Trace:
[ 5.977860] <TASK>
[ 5.977860] ? __die_body+0x16/0x75
[ 5.977860] ? die_addr+0x4a/0x70
[ 5.977860] ? exc_general_protection+0x1c9/0x2d0
[ 5.977860] ? cgroup_mkdir+0x455/0x9fb
[ 5.977860] ? __x64_sys_mkdir+0x69/0x80
[ 5.977860] ? asm_exc_general_protection+0x26/0x30
[ 5.977860] ? obj_cgroup_charge_pages+0x27/0x2d5
[ 5.977860] obj_cgroup_charge+0x114/0x1ab
[ 5.977860] pcpu_alloc+0x1a6/0xa65
[ 5.977860] ? mem_cgroup_css_alloc+0x1eb/0x1140
[ 5.977860] ? cgroup_apply_control_enable+0x26b/0x7c0
[ 5.977860] mem_cgroup_css_alloc+0x23f/0x1140
[ 5.977860] cgroup_apply_control_enable+0x26b/0x7c0
[ 5.977860] ? cgroup_kn_set_ugid+0x2d/0x1a0
[ 5.977860] cgroup_mkdir+0x455/0x9fb
[ 5.977860] ? __cfi_cgroup_mkdir+0x10/0x10
[ 5.977860] kernfs_iop_mkdir+0x130/0x170
[ 5.977860] vfs_mkdir+0x405/0x530
[ 5.977860] do_mkdirat+0x188/0x1f0
[ 5.977860] __x64_sys_mkdir+0x69/0x80
[ 5.977860] do_syscall_64+0x7d/0x100
[ 5.977860] ? do_syscall_64+0x89/0x100
[ 5.977860] ? do_syscall_64+0x89/0x100
[ 5.977860] ? do_syscall_64+0x89/0x100
[ 5.977860] ? do_syscall_64+0x89/0x100
[ 5.977860] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 5.977860] RIP: 0033:0x7f297671defb
[ 5.977860] Code: 8b 05 39 7f 0d 00 bb ff ff ff ff 64 c7 00 16 00 00 00 e9 61 ff ff ff e8 23 0c 02 00 0f 1f 00 f3 0f 1e fa b88
[ 5.977860] RSP: 002b:00007ffee6242bb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000053
[ 5.977860] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f297671defb
[ 5.977860] RDX: 0000000000000000 RSI: 00000000000001ed RDI: 000055c6b449f0e0
[ 5.977860] RBP: 00007ffee6242bf0 R08: 000000000000000e R09: 0000000000000000
[ 5.977860] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c6b445db80
[ 5.977860] R13: 00000000000003a0 R14: 00007f2976a68651 R15: 00000000000003a0
[ 5.977860] </TASK>
[ 5.977860] Modules linked in:
[ 6.014095] ---[ end trace 0000000000000000 ]---
[ 6.014701] RIP: 0010:obj_cgroup_charge_pages+0x27/0x2d5
[ 6.015348] Code: 90 90 90 55 41 57 41 56 41 55 41 54 53 89 d5 41 89 f6 49 89 ff 48 b8 00 00 00 00 00 fc ff df 49 83 c7 10 4d3
[ 6.017575] RSP: 0018:ffffc9000001fb18 EFLAGS: 00010a02
[ 6.018255] RAX: dffffc0000000000 RBX: aaaaaaaaaaaaaaaa RCX: ffff8883eb9a8b08
[ 6.019120] RDX: 0000000000000005 RSI: 0000000000400cc0 RDI: aaaaaaaaaaaaaaaa
[ 6.019983] RBP: 0000000000000005 R08: 3333333333333333 R09: 0000000000000000
[ 6.020849] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8883eb9a8b18
[ 6.021747] R13: 1555555555555557 R14: 0000000000400cc0 R15: aaaaaaaaaaaaaaba
[ 6.022609] FS: 00007f2976438b40(0000) GS:ffff8883eb980000(0000) knlGS:0000000000000000
[ 6.023593] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.024296] CR2: 00007f29769e0060 CR3: 0000000107222003 CR4: 0000000000370eb0
[ 6.025279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.026139] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.027000] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Actually the problem is caused by uninitialized local variable in
current_obj_cgroup(). If the root memory cgroup is set as an active
memory cgroup for a charging scope (as in the trace, where systemd tries
to create the first non-root cgroup, so the parent cgroup is the root
cgroup), the "for" loop is skipped and uninitialized objcg is returned,
causing a panic down the accounting stack.
The fix is trivial: initialize the objcg variable to NULL unconditionally
before the "for" loop.
[vbabka@suse.cz: remove redundant assignment]
Link: https://lkml.kernel.org/r/4bd106d5-c3e3-6731-9a74-cff81e2392de@suse.cz
Link: https://lkml.kernel.org/r/20231116025109.3775055-1-roman.gushchin@linux.dev
Fixes:
|
||
Liu Shixin
|
d63385a7d3 |
mm/kmemleak: move set_track_prepare() outside raw_spinlocks
set_track_prepare() will call __alloc_pages() which attempts to acquire zone->lock(spinlocks), so move it outside object->lock(raw_spinlocks) because it's not right to acquire spinlocks while holding raw_spinlocks in RT mode. Link: https://lkml.kernel.org/r/20231115082138.2649870-3-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liu Shixin
|
4eff7d62ab |
Revert "mm/kmemleak: move the initialisation of object to __link_object"
Patch series "Fix invalid wait context of set_track_prepare()". Geert reported an invalid wait context[1] which is resulted by moving set_track_prepare() inside kmemleak_lock. This is not allowed because in RT mode, the spinlocks can be preempted but raw_spinlocks can not, so it is not allowd to acquire spinlocks while holding raw_spinlocks. The second patch fix same problem in kmemleak_update_trace(). This patch (of 2): Move the initialisation of object back to__alloc_object() because set_track_prepare() attempt to acquire zone->lock(spinlocks) while __link_object is holding kmemleak_lock(raw_spinlocks). This is not right for RT mode. This reverts commit |
||
Andrew Morton
|
727d16f199 |
mm/memory.c:zap_pte_range() print bad swap entry
We have a report of this WARN() triggering. Let's print the offending swp_entry_t to help diagnosis. Link: https://lkml.kernel.org/r/000000000000b0e576060a30ee3b@google.com Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
187da0f825 |
hugetlb: fix null-ptr-deref in hugetlb_vma_lock_write
The routine __vma_private_lock tests for the existence of a reserve map
associated with a private hugetlb mapping. A pointer to the reserve map
is in vma->vm_private_data. __vma_private_lock was checking the pointer
for NULL. However, it is possible that the low bits of the pointer could
be used as flags. In such instances, vm_private_data is not NULL and not
a valid pointer. This results in the null-ptr-deref reported by syzbot:
general protection fault, probably for non-canonical address 0xdffffc000000001d:
0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000000e8-0x00000000000000ef]
CPU: 0 PID: 5048 Comm: syz-executor139 Not tainted 6.6.0-rc7-syzkaller-00142-g88
8cf78c29e2 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 1
0/09/2023
RIP: 0010:__lock_acquire+0x109/0x5de0 kernel/locking/lockdep.c:5004
...
Call Trace:
<TASK>
lock_acquire kernel/locking/lockdep.c:5753 [inline]
lock_acquire+0x1ae/0x510 kernel/locking/lockdep.c:5718
down_write+0x93/0x200 kernel/locking/rwsem.c:1573
hugetlb_vma_lock_write mm/hugetlb.c:300 [inline]
hugetlb_vma_lock_write+0xae/0x100 mm/hugetlb.c:291
__hugetlb_zap_begin+0x1e9/0x2b0 mm/hugetlb.c:5447
hugetlb_zap_begin include/linux/hugetlb.h:258 [inline]
unmap_vmas+0x2f4/0x470 mm/memory.c:1733
exit_mmap+0x1ad/0xa60 mm/mmap.c:3230
__mmput+0x12a/0x4d0 kernel/fork.c:1349
mmput+0x62/0x70 kernel/fork.c:1371
exit_mm kernel/exit.c:567 [inline]
do_exit+0x9ad/0x2a20 kernel/exit.c:861
__do_sys_exit kernel/exit.c:991 [inline]
__se_sys_exit kernel/exit.c:989 [inline]
__x64_sys_exit+0x42/0x50 kernel/exit.c:989
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Mask off low bit flags before checking for NULL pointer. In addition, the
reserve map only 'belongs' to the OWNER (parent in parent/child
relationships) so also check for the OWNER flag.
Link: https://lkml.kernel.org/r/20231114012033.259600-1-mike.kravetz@oracle.com
Reported-by: syzbot+6ada951e7c0f7bc8a71e@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/00000000000078d1e00608d7878b@google.com/
Fixes:
|
||
Linus Torvalds
|
fa2b906f51 |
vfs-6.7-rc3.fixes
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZWBq0gAKCRCRxhvAZXjc ot4EAP48O5ExMtQ3/AIkNDo+/9/Iz4g7bE1HYmdyiMPO3Ou/uwEAySwBXRJrFAsS 9omvkEdqrfyguW0xgoYwcxBdATVHnAE= =ScR3 -----END PGP SIGNATURE----- Merge tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Avoid calling back into LSMs from vfs_getattr_nosec() calls. IMA used to query inode properties accessing raw inode fields without dedicated helpers. That was finally fixed a few releases ago by forcing IMA to use vfs_getattr_nosec() helpers. The goal of the vfs_getattr_nosec() helper is to query for attributes without calling into the LSM layer which would be quite problematic because incredibly IMA is called from __fput()... __fput() -> ima_file_free() What it does is to call back into the filesystem to update the file's IMA xattr. Querying the inode without using vfs_getattr_nosec() meant that IMA didn't handle stacking filesystems such as overlayfs correctly. So the switch to vfs_getattr_nosec() is quite correct. But the switch to vfs_getattr_nosec() revealed another bug when used on stacking filesystems: __fput() -> ima_file_free() -> vfs_getattr_nosec() -> i_op->getattr::ovl_getattr() -> vfs_getattr() -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr() -> security_inode_getattr() # calls back into LSMs Now, if that __fput() happens from task_work_run() of an exiting task current->fs and various other pointer could already be NULL. So anything in the LSM layer relying on that not being NULL would be quite surprised. Fix that by passing the information that this is a security request through to the stacking filesystem by adding a new internal ATT_GETATTR_NOSEC flag. Now the callchain becomes: __fput() -> ima_file_free() -> vfs_getattr_nosec() -> i_op->getattr::ovl_getattr() -> if (AT_GETATTR_NOSEC) vfs_getattr_nosec() else vfs_getattr() -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr() - Fix a bug introduced with the iov_iter rework from last cycle. This broke /proc/kcore by copying too much and without the correct offset. - Add a missing NULL check when allocating the root inode in autofs_fill_super(). - Fix stable writes for multi-device filesystems (xfs, btrfs etc) and the block device pseudo filesystem. Stable writes used to be a superblock flag only, making it a per filesystem property. Add an additional AS_STABLE_WRITES mapping flag to allow for fine-grained control. - Ensure that offset_iterate_dir() returns 0 after reaching the end of a directory so it adheres to getdents() convention. * tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: libfs: getdents() should return 0 after reaching EOD xfs: respect the stable writes flag on the RT device xfs: clean up FS_XFLAG_REALTIME handling in xfs_ioctl_setattr_xflags block: update the stable_writes flag in bdev_add filemap: add a per-mapping stable writes flag autofs: add: new_inode check in autofs_fill_super() iov_iter: fix copy_page_to_iter_nofault() fs: Pass AT_GETATTR_NOSEC flag to getattr interface function |
||
Christoph Hellwig
|
762321dab9 |
filemap: add a per-mapping stable writes flag
folio_wait_stable waits for writeback to finish before modifying the contents of a folio again, e.g. to support check summing of the data in the block integrity code. Currently this behavior is controlled by the SB_I_STABLE_WRITES flag on the super_block, which means it is uniform for the entire file system. This is wrong for the block device pseudofs which is shared by all block devices, or file systems that can use multiple devices like XFS witht the RT subvolume or btrfs (although btrfs currently reimplements folio_wait_stable anyway). Add a per-address_space AS_STABLE_WRITES flag to control the behavior in a more fine grained way. The existing SB_I_STABLE_WRITES is kept to initialize AS_STABLE_WRITES to the existing default which covers most cases. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20231025141020.192413-2-hch@lst.de Tested-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
Ryan Roberts
|
afccb0804f |
mm: more ptep_get() conversion
Commit
|
||
Helge Deller
|
5f74f820f6 |
parisc: fix mmap_base calculation when stack grows upwards
Matoro reported various userspace crashes on the parisc platform with kernel 6.6 and bisected it to commit |
||
Hyeongtak Ji
|
13b2a4b22e |
mm/damon/core.c: avoid unintentional filtering out of schemes
The function '__damos_filter_out()' causes DAMON to always filter out
schemes whose filter type is anon or memcg if its matching value is set
to false.
This commit addresses the issue by ensuring that '__damos_filter_out()'
no longer applies to filters whose type is 'anon' or 'memcg'.
Link: https://lkml.kernel.org/r/1699594629-3816-1-git-send-email-hyeongtak.ji@gmail.com
Fixes:
|
||
Roman Gushchin
|
24948e3b7b |
mm: kmem: drop __GFP_NOFAIL when allocating objcg vectors
Objcg vectors attached to slab pages to store slab object ownership information are allocated using gfp flags for the original slab allocation. Depending on slab page order and the size of slab objects, objcg vector can take several pages. If the original allocation was done with the __GFP_NOFAIL flag, it triggered a warning in the page allocation code. Indeed, order > 1 pages should not been allocated with the __GFP_NOFAIL flag. Fix this by simply dropping the __GFP_NOFAIL flag when allocating the objcg vector. It effectively allows to skip the accounting of a single slab object under a heavy memory pressure. An alternative would be to implement the mechanism to fallback to order-0 allocations for accounting metadata, which is also not perfect because it will increase performance penalty and memory footprint of the kernel memory accounting under memory pressure. Link: https://lkml.kernel.org/r/ZUp8ZFGxwmCx4ZFr@P9FQF9L96D.corp.robot.car Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev> Reported-by: Christoph Lameter <cl@linux.com> Closes: https://lkml.kernel.org/r/6b42243e-f197-600a-5d22-56bd728a5ad8@gentwo.org Acked-by: Shakeel Butt <shakeelb@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
ae636ae2bb |
mm/damon/sysfs-schemes: handle tried region directory allocation failure
DAMON sysfs interface's before_damos_apply callback
(damon_sysfs_before_damos_apply()), which creates the DAMOS tried regions
for each DAMOS action applied region, is not handling the allocation
failure for the sysfs directory data. As a result, NULL pointer
derefeence is possible. Fix it by handling the case.
Link: https://lkml.kernel.org/r/20231106233408.51159-4-sj@kernel.org
Fixes:
|
||
SeongJae Park
|
84055688b6 |
mm/damon/sysfs-schemes: handle tried regions sysfs directory allocation failure
DAMOS tried regions sysfs directory allocation function
(damon_sysfs_scheme_regions_alloc()) is not handling the memory allocation
failure. In the case, the code will dereference NULL pointer. Handle the
failure to avoid such invalid access.
Link: https://lkml.kernel.org/r/20231106233408.51159-3-sj@kernel.org
Fixes:
|
||
SeongJae Park
|
b4936b544b |
mm/damon/sysfs: check error from damon_sysfs_update_target()
Patch series "mm/damon/sysfs: fix unhandled return values".
Some of DAMON sysfs interface code is not handling return values from some
functions. As a result, confusing user input handling or NULL-dereference
is possible. Check those properly.
This patch (of 3):
damon_sysfs_update_target() returns error code for failures, but its
caller, damon_sysfs_set_targets() is ignoring that. The update function
seems making no critical change in case of such failures, but the behavior
will look like DAMON sysfs is silently ignoring or only partially
accepting the user input. Fix it.
Link: https://lkml.kernel.org/r/20231106233408.51159-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20231106233408.51159-2-sj@kernel.org
Fixes:
|
||
Stefan Roesch
|
a48d5bdc87 |
mm: fix for negative counter: nr_file_hugepages
While qualifiying the 6.4 release, the following warning was detected in messages: vmstat_refresh: nr_file_hugepages -15664 The warning is caused by the incorrect updating of the NR_FILE_THPS counter in the function split_huge_page_to_list. The if case is checking for folio_test_swapbacked, but the else case is missing the check for folio_test_pmd_mappable. The other functions that manipulate the counter like __filemap_add_folio and filemap_unaccount_folio have the corresponding check. I have a test case, which reproduces the problem. It can be found here: https://github.com/sroeschus/testcase/blob/main/vmstat_refresh/madv.c The test case reproduces on an XFS filesystem. Running the same test case on a BTRFS filesystem does not reproduce the problem. AFAIK version 6.1 until 6.6 are affected by this problem. [akpm@linux-foundation.org: whitespace fix] [shr@devkernel.io: test for folio_test_pmd_mappable()] Link: https://lkml.kernel.org/r/20231108171517.2436103-1-shr@devkernel.io Link: https://lkml.kernel.org/r/20231106181918.1091043-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Co-debugged-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Dan Carpenter
|
85c2ceaafb |
mm/damon/sysfs: eliminate potential uninitialized variable warning
The "err" variable is not initialized if damon_target_has_pid(ctx) is false and sys_target->regions->nr is zero. Link: https://lkml.kernel.org/r/739e6aaf-a634-4e33-98a8-16546379ec9f@moroto.mountain Fixes: 0bcd216c4741 ("mm/damon/sysfs: update monitoring target regions for online input commit") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
447cec034b |
memblock: report failures when memblock_can_resize is not set
Numerous memblock reservations at early boot may exhaust static memblock.reserved array and it is unnoticed because most of the callers don't check memblock_reserve() return value. In this case the system will crash later, but the reason is hard to identify. Replace return of an error with panic() when memblock.reserved is exhausted before it can be resized. -----BEGIN PGP SIGNATURE----- iQFEBAABCgAuFiEEeOVYVaWZL5900a/pOQOGJssO/ZEFAmVJMC0QHHJwcHRAa2Vy bmVsLm9yZwAKCRA5A4Ymyw79kZ3CCACgaHmlV5YimQ3le0yGkyoVfA1ZYMTZJiq3 umN9z/Vs/WQLz5yiCriEjG/a9dkaceMtrqON+/Lxu5U28/XdYe13lKAmpmPXQvc4 5iqfXHL3KZSBvzYNKIKZioLdWQWNvIQyGnYn1LBqqtcOCVLscbqatE9JmN2w39ny TwbTBG7HcalxJ9suauwF+jqQWg2f+yJEOzqiNalNJO0GRxuO7qJdqoLB1zsvXJ9F 3StCBq2/QyeklX5UtDT5oBIldx0ZGccg20hFhNDzN0pipF8lDL7Io2wI5zjhuHWO JmTCY3GAZ+yl16BKe2StryVxs28PtagLykEq/CdGk56jiyWGUuLI =xf1W -----END PGP SIGNATURE----- Merge tag 'memblock-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock update from Mike Rapoport: "Report failures when memblock_can_resize is not set. Numerous memblock reservations at early boot may exhaust static memblock.reserved array and it is unnoticed because most of the callers don't check memblock_reserve() return value. In this case the system will crash later, but the reason is hard to identify. Replace return of an error with panic() when memblock.reserved is exhausted before it can be resized" * tag 'memblock-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: memblock: report failures when memblock_can_resize is not set |
||
Linus Torvalds
|
8f6f76a6a2 |
As usual, lots of singleton and doubleton patches all over the tree and
there's little I can say which isn't in the individual changelogs. The lengthier patch series are - "kdump: use generic functions to simplify crashkernel reservation in arch", from Baoquan He. This is mainly cleanups and consolidation of the "crashkernel=" kernel parameter handling. - After much discussion, David Laight's "minmax: Relax type checks in min() and max()" is here. Hopefully reduces some typecasting and the use of min_t() and max_t(). - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.therad_group. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZUQP9wAKCRDdBJ7gKXxA jmOAAQDh8sxagQYocoVsSm28ICqXFeaY9Co1jzBIDdNesAvYVwD/c2DHRqJHEiS4 63BNcG3+hM9nwGJHb5lyh5m79nBMRg0= =On4u -----END PGP SIGNATURE----- Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "As usual, lots of singleton and doubleton patches all over the tree and there's little I can say which isn't in the individual changelogs. The lengthier patch series are - 'kdump: use generic functions to simplify crashkernel reservation in arch', from Baoquan He. This is mainly cleanups and consolidation of the 'crashkernel=' kernel parameter handling - After much discussion, David Laight's 'minmax: Relax type checks in min() and max()' is here. Hopefully reduces some typecasting and the use of min_t() and max_t() - A group of patches from Oleg Nesterov which clean up and slightly fix our handling of reads from /proc/PID/task/... and which remove task_struct.thread_group" * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits) scripts/gdb/vmalloc: disable on no-MMU scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n .mailmap: add address mapping for Tomeu Vizoso mailmap: update email address for Claudiu Beznea tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions .mailmap: map Benjamin Poirier's address scripts/gdb: add lx_current support for riscv ocfs2: fix a spelling typo in comment proc: test ProtectionKey in proc-empty-vm test proc: fix proc-empty-vm test with vsyscall fs/proc/base.c: remove unneeded semicolon do_io_accounting: use sig->stats_lock do_io_accounting: use __for_each_thread() ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error() ocfs2: fix a typo in a comment scripts/show_delta: add __main__ judgement before main code treewide: mark stuff as __ro_after_init fs: ocfs2: check status values proc: test /proc/${pid}/statm compiler.h: move __is_constexpr() to compiler.h ... |
||
Linus Torvalds
|
ecae0bd517 |
Many singleton patches against the MM code. The patch series which are
included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series "Fixes and cleanups to compaction". - Joel Fernandes has a patchset ("Optimize mremap during mutual alignment within PMD") which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested. - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series "Do not try to access unaccepted memory" Adrian Hunter provides some fixups for the recently-added "unaccepted memory' feature. To increase the feature's checking coverage. "Plug a few gaps where RAM is exposed without checking if it is unaccepted memory". - In the series "cleanups for lockless slab shrink" Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code. - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series "use refcount+RCU method to implement lockless slab shrink". - David Hildenbrand contributes some maintenance work for the rmap code in the series "Anon rmap cleanups". - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series "mm: migrate: more folio conversion and unification". - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series "Add and use bdev_getblk()". - In the series "Use nth_page() in place of direct struct page manipulation" Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames. - In the series "mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO" has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use. - Matthew Wilcox has sent the series "Small hugetlb cleanups" - code rationalization and folio conversions in the hugetlb code. - Yin Fengwei has improved mlock()'s handling of large folios in the series "support large folio for mlock" - In the series "Expose swapcache stat for memcg v1" Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2. - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named "MDWE without inheritance". - Kefeng Wang has provided the series "mm: convert numa balancing functions to use a folio" which does what it says. - In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch makes is possible for a process to propagate KSM treatment across exec(). - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use "high bandwidth memory" in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named "memory tiering: calculate abstract distance based on ACPI HMAT" - In the series "Smart scanning mode for KSM" Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans. - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series "mm: memcg: fix tracking of pending stats updates values". - In the series "Implement IOCTL to get and optionally clear info about PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU. - Hugh Dickins contributed the series "shmem,tmpfs: general maintenance" - a bunch of relatively minor maintenance tweaks to this code. - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series "Handle more faults under the VMA lock". Some rationalizations of the fault path became possible as a result. - In the series "mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups and folio conversions. - In the series "various improvements to the GUP interface" Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements. - Andrey Konovalov has sent along the series "kasan: assorted fixes and improvements" which does those things. - Some page allocator maintenance work from Kemeng Shi in the series "Two minor cleanups to break_down_buddy_pages". - In thes series "New selftest for mm" Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults. - In the series "Add folio_end_read" Matthew Wilcox provides cleanups and an optimization to the core pagecache code. - Nhat Pham has added memcg accounting for hugetlb memory in the series "hugetlb memcg accounting". - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series "Abstract vma_merge() and split_vma()". - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series "Fix page_owner's use of free timestamps". - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series "permit write-sealed memfd read-only shared mappings". - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series "Batch hugetlb vmemmap modification operations". - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series "Finish the create_empty_buffers() transition". - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series "mm: PCP high auto-tuning". - Roman Gushchin has contributed the patchset "mm: improve performance of accounted kernel memory allocations" which improves their performance by ~30% as measured by a micro-benchmark. - folio conversions from Kefeng Wang in the series "mm: convert page cpupid functions to folios". - Some kmemleak fixups in Liu Shixin's series "Some bugfix about kmemleak". - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series "handle memoryless nodes more appropriately". - khugepaged conversions from Vishal Moola in the series "Some khugepaged folio conversions". -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZULEMwAKCRDdBJ7gKXxA jhQHAQCYpD3g849x69DmHnHWHm/EHQLvQmRMDeYZI+nx/sCJOwEAw4AKg0Oemv9y FgeUPAD1oasg6CP+INZvCj34waNxwAc= =E+Y4 -----END PGP SIGNATURE----- Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ... |
||
Linus Torvalds
|
1e0c505e13 |
asm-generic updates for v6.7
The ia64 architecture gets its well-earned retirement as planned, now that there is one last (mostly) working release that will be maintained as an LTS kernel. The architecture specific system call tables are updated for the added map_shadow_stack() syscall and to remove references to the long-gone sys_lookup_dcookie() syscall. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEiK/NIGsWEZVxh/FrYKtH/8kJUicFAmVC40IACgkQYKtH/8kJ Uidhmw/9EX+aWSXGoObJ3fngaNSMw+PmrEuP8qEKBHxfKHcCdX3hc451Oh4GlhaQ tru91pPwgNvN2/rfoKusxT+V4PemGIzfNni/04rp+P0kvmdw5otQ2yNhsQNsfVmq XGWvkxF4P2GO6bkjjfR/1dDq7GtlyXtwwPDKeLbYb6TnJOZjtx+EAN27kkfSn1Ms R4Sa3zJ+DfHUmHL5S9g+7UD/CZ5GfKNmIskI4Mz5GsfoUz/0iiU+Bge/9sdcdSJQ kmbLy5YnVzfooLZ3TQmBFsO3iAMWb0s/mDdtyhqhTVmTUshLolkPYyKnPFvdupyv shXcpEST2XJNeaDRnL2K4zSCdxdbnCZHDpjfl9wfioBg7I8NfhXKpf1jYZHH1de4 LXq8ndEFEOVQw/zSpYWfQq1sux8Jiqr+UK/ukbVeFWiGGIUs91gEWtPAf8T0AZo9 ujkJvaWGl98O1g5wmBu0/dAR6QcFJMDfVwbmlIFpU8O+MEaz6X8mM+O5/T0IyTcD eMbAUjj4uYcU7ihKzHEv/0SS9Of38kzff67CLN5k8wOP/9NlaGZ78o1bVle9b52A BdhrsAefFiWHp1jT6Y9Rg4HOO/TguQ9e6EWSKOYFulsiLH9LEFaB9RwZLeLytV0W vlAgY9rUW77g1OJcb7DoNv33nRFuxsKqsnz3DEIXtgozo9CzbYI= =H1vH -----END PGP SIGNATURE----- Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic Pull ia64 removal and asm-generic updates from Arnd Bergmann: - The ia64 architecture gets its well-earned retirement as planned, now that there is one last (mostly) working release that will be maintained as an LTS kernel. - The architecture specific system call tables are updated for the added map_shadow_stack() syscall and to remove references to the long-gone sys_lookup_dcookie() syscall. * tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic: hexagon: Remove unusable symbols from the ptrace.h uapi asm-generic: Fix spelling of architecture arch: Reserve map_shadow_stack() syscall number for all architectures syscalls: Cleanup references to sys_lookup_dcookie() Documentation: Drop or replace remaining mentions of IA64 lib/raid6: Drop IA64 support Documentation: Drop IA64 from feature descriptions kernel: Drop IA64 support from sig_fault handlers arch: Remove Itanium (IA-64) architecture |
||
SeongJae Park
|
9732336006 |
mm/damon/sysfs: update monitoring target regions for online input commit
When user input is committed online, DAMON sysfs interface is ignoring the
user input for the monitoring target regions. Such request is valid and
useful for fixed monitoring target regions-based monitoring ops like
'paddr' or 'fvaddr'.
Update the region boundaries as user specified, too. Note that the
monitoring results of the regions that overlap between the latest
monitoring target regions and the new target regions are preserved.
Treat empty monitoring target regions user request as a request to just
make no change to the monitoring target regions. Otherwise, users should
set the monitoring target regions same to current one for every online
input commit, and it could be challenging for dynamic monitoring target
regions update DAMON ops like 'vaddr'. If the user really need to remove
all monitoring target regions, they can simply remove the target and then
create the target again with empty target regions.
Link: https://lkml.kernel.org/r/20231031170131.46972-1-sj@kernel.org
Fixes:
|
||
SeongJae Park
|
19467a950b |
mm/damon/sysfs: remove requested targets when online-commit inputs
damon_sysfs_set_targets(), which updates the targets of the context for
online commitment, do not remove targets that removed from the
corresponding sysfs files. As a result, more than intended targets of the
context can exist and hence consume memory and monitoring CPU resource
more than expected.
Fix it by removing all targets of the context and fill up again using the
user input. This could cause unnecessary memory dealloc and realloc
operations, but this is not a hot code path. Also, note that damon_target
is stateless, and hence no data is lost.
[sj@kernel.org: fix unnecessary monitoring results removal]
Link: https://lkml.kernel.org/r/20231028213353.45397-1-sj@kernel.org
Link: https://lkml.kernel.org/r/20231022210735.46409-2-sj@kernel.org
Fixes:
|
||
Baoquan He
|
ca6c2ce1b4 |
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
LKP reported smatch warning as below: =================== smatch warnings: mm/vmalloc.c:3689 vread_iter() error: we previously assumed 'vm' could be null (see line 3667) ...... |
||
Nhat Pham
|
cb61dad80f |
zswap: export compression failure stats
During a zswap store attempt, the compression algorithm could fail (for e.g due to the page containing incompressible random data). This is not tracked in any of existing zswap counters, making it hard to monitor for and investigate. We have run into this problem several times in our internal investigations on zswap store failures. This patch adds a dedicated debugfs counter for compression algorithm failures. Link: https://lkml.kernel.org/r/20231024234509.2680539-1-nphamcs@gmail.com Signed-off-by: Nhat Pham <nphamcs@gmail.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Seth Jennings <sjenning@redhat.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Linus Torvalds
|
89ed67ef12 |
Networking changes for 6.7.
Core & protocols ---------------- - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF --- - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure. https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code ---------------------- - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API ---------- - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc ---- - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed ------- - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers ------- - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmU8XsYACgkQMUZtbf5S Irv19RAAnud/24OOF5XMEJkIcYlnfqximh4XO6PujRSYkSkOUJdZTF6iJPgf3pSP YpwoHYbYKHYfeOf8+3bTNESiQNSnoVmvmvwiS6/7lZ3behHUrGLQzW9Htc3EZyWH 2h6QkDZ5OOjfg0bwYSfp3vXkmMH2k8WE9Y0NvCkhcohqZi13Rmp14RnyPmNb2d1V yZRYDMSM133KqE6gnBr1Ct65IEvnKeGlCUN2mTGqOJgdn6DZMsyxvtt0y4rmN7Ab 41+CgPU5SfxfbYpW+Dl2HJpgfte3WrC57KC6AM0PAPJzPmQWgeB/m9mjz/apj6Bg bhsEIo7FdvbCnQm3yWPhK2OgCAcSwLr8jfGMU+Q+W4VnL5SRRR3Rm0zjsze+kHNP OfqJgxzl3DpvoJqVBy1h5FGcZt0XHwhksm4cTxWqIahsF+veY0ECBXbuBBQx9XTF Y7INfI8ulg7wISJs+CJfIClYkgOibTw2u8taBS5ikbtgxNqp5D4QqODn7UefQap1 PR/IDYODF+zRgmMJLeBqSa6fij6BkfOEDiOWak5kggBoZdtbtmeKI6tzze06CNdW lWv1WEhRufxnwK+IuWsEkjhiMbs2WGLvkJ5JbgQV9BfqHfIfiqBCrcWtT/WbQnGt lmU46CXh1t/FZEqbmK9h+8vsIIfrcDl6jb5npEiKPRG00vDKRTM= =46nS -----END PGP SIGNATURE----- Merge tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Support usec resolution of TCP timestamps, enabled selectively by a route attribute. - Defer regular TCP ACK while processing socket backlog, try to send a cumulative ACK at the end. Increase single TCP flow performance on a 200Gbit NIC by 20% (100Gbit -> 120Gbit). - The Fair Queuing (FQ) packet scheduler: - add built-in 3 band prio / WRR scheduling - support bypass if the qdisc is mostly idle (5% speed up for TCP RR) - improve inactive flow reporting - optimize the layout of structures for better cache locality - Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern replacement for the old MD5 option. - Add more retransmission timeout (RTO) related statistics to TCP_INFO. - Support sending fragmented skbs over vsock sockets. - Make sure we send SIGPIPE for vsock sockets if socket was shutdown(). - Add sysctl for ignoring lower limit on lifetime in Router Advertisement PIO, based on an in-progress IETF draft. - Add sysctl to control activation of TCP ping-pong mode. - Add sysctl to make connection timeout in MPTCP configurable. - Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps limit the number of wakeups. - Support netlink GET for MDB (multicast forwarding), allowing user space to request a single MDB entry instead of dumping the entire table. - Support selective FDB flushing in the VXLAN tunnel driver. - Allow limiting learned FDB entries in bridges, prevent OOM attacks. - Allow controlling via configfs netconsole targets which were created via the kernel cmdline at boot, rather than via configfs at runtime. - Support multiple PTP timestamp event queue readers with different filters. - MCTP over I3C. BPF: - Add new veth-like netdevice where BPF program defines the logic of the xmit routine. It can operate in L3 and L2 mode. - Support exceptions - allow asserting conditions which should never be true but are hard for the verifier to infer. With some extra flexibility around handling of the exit / failure: https://lwn.net/Articles/938435/ - Add support for local per-cpu kptr, allow allocating and storing per-cpu objects in maps. Access to those objects operates on the value for the current CPU. This allows to deprecate local one-off implementations of per-CPU storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps. - Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is for systemd to re-implement the LogNamespace feature which allows running multiple instances of systemd-journald to process the logs of different services. - Enable open-coded task_vma iteration, after maple tree conversion made it hard to directly walk VMAs in tracing programs. - Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF. - Allow source address selection with bpf_*_fib_lookup(). - Add ability to pin BPF timer to the current CPU. - Prevent creation of infinite loops by combining tail calls and fentry/fexit programs. - Add missed stats for kprobes to retrieve the number of missed kprobe executions and subsequent executions of BPF programs. - Inherit system settings for CPU security mitigations. - Add BPF v4 CPU instruction support for arm32 and s390x. Changes to common code: - overflow: add DEFINE_FLEX() for on-stack definition of structs with flexible array members. - Process doc update with more guidance for reviewers. Driver API: - Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy mutex in most places and remove a lot of smaller locks. - Create a common DPLL configuration API. Allow configuring and querying state of PLL circuits used for clock syntonization, in network time distribution. - Unify fragmented and full page allocation APIs in page pool code. Let drivers be ignorant of PAGE_SIZE. - Rework PHY state machine to avoid races with calls to phy_stop(). - Notify DSA drivers of MAC address changes on user ports, improve correctness of offloads which depend on matching port MAC addresses. - Allow antenna control on injected WiFi frames. - Reduce the number of variants of napi_schedule(). - Simplify error handling when composing devlink health messages. Misc: - A lot of KCSAN data race "fixes", from Eric. - A lot of __counted_by() annotations, from Kees. - A lot of strncpy -> strscpy and printf format fixes. - Replace master/slave terminology with conduit/user in DSA drivers. - Handful of KUnit tests for netdev and WiFi core. Removed: - AppleTalk COPS. - AppleTalk ipddp. - TI AR7 CPMAC Ethernet driver. Drivers: - Ethernet high-speed NICs: - Intel (100G, ice, idpf): - add a driver for the Intel E2000 IPUs - make CRC/FCS stripping configurable - cross-timestamping for E823 devices - basic support for E830 devices - use aux-bus for managing client drivers - i40e: report firmware versions via devlink - nVidia/Mellanox: - support 4-port NICs - increase max number of channels to 256 - optimize / parallelize SF creation flow - Broadcom (bnxt): - enhance NIC temperature reporting - support PAM4 speeds and lane configuration - Marvell OcteonTX2: - PTP pulse-per-second output support - enable hardware timestamping for VFs - Solarflare/AMD: - conntrack NAT offload and offload for tunnels - Wangxun (ngbe/txgbe): - expose HW statistics - Pensando/AMD: - support PCI level reset - narrow down the condition under which skbs are linearized - Netronome/Corigine (nfp): - support CHACHA20-POLY1305 crypto in IPsec offload - Ethernet NICs embedded, slower, virtual: - Synopsys (stmmac): - add Loongson-1 SoC support - enable use of HW queues with no offload capabilities - enable PPS input support on all 5 channels - increase TX coalesce timer to 5ms - RealTek USB (r8152): improve efficiency of Rx by using GRO frags - xen: support SW packet timestamping - add drivers for implementations based on TI's PRUSS (AM64x EVM) - nVidia/Mellanox Ethernet datacenter switches: - avoid poor HW resource use on Spectrum-4 by better block selection for IPv6 multicast forwarding and ordering of blocks in ACL region - Ethernet embedded switches: - Microchip: - support configuring the drive strength for EMI compliance - ksz9477: partial ACL support - ksz9477: HSR offload - ksz9477: Wake on LAN - Realtek: - rtl8366rb: respect device tree config of the CPU port - Ethernet PHYs: - support Broadcom BCM5221 PHYs - TI dp83867: support hardware LED blinking - CAN: - add support for Linux-PHY based CAN transceivers - at91_can: clean up and use rx-offload helpers - WiFi: - MediaTek (mt76): - new sub-driver for mt7925 USB/PCIe devices - HW wireless <> Ethernet bridging in MT7988 chips - mt7603/mt7628 stability improvements - Qualcomm (ath12k): - WCN7850: - enable 320 MHz channels in 6 GHz band - hardware rfkill support - enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to make scan faster - read board data variant name from SMBIOS - QCN9274: mesh support - RealTek (rtw89): - TDMA-based multi-channel concurrency (MCC) - Silicon Labs (wfx): - Remain-On-Channel (ROC) support - Bluetooth: - ISO: many improvements for broadcast support - mark BCM4378/BCM4387 as BROKEN_LE_CODED - add support for QCA2066 - btmtksdio: enable Bluetooth wakeup from suspend" * tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits) net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos() net: mana: Use xdp_set_features_flag instead of direct assignment vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size() iavf: delete the iavf client interface iavf: add a common function for undoing the interrupt scheme iavf: use unregister_netdev iavf: rely on netdev's own registered state iavf: fix the waiting time for initial reset iavf: in iavf_down, don't queue watchdog_task if comms failed iavf: simplify mutex_trylock+sleep loops iavf: fix comments about old bit locks doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name tools: ynl: introduce option to process unknown attributes or types ipvlan: properly track tx_errors netdevsim: Block until all devices are released nfp: using napi_build_skb() to replace build_skb() net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy" net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN net: dsa: microchip: Refactor switch shutdown routine for WoL preparation ... |
||
Linus Torvalds
|
d82c0a37d4 |
execve updates for v6.7-rc1
- Support non-BSS ELF segments with 0 filesz (Eric W. Biederman, Kees Cook) - Enable namespaced binfmt_misc (Christian Brauner) - Remove struct tag 'dynamic' from ELF UAPI (Alejandro Colomar) - Clean up binfmt_elf_fdpic debug output (Greg Ungerer) -----BEGIN PGP SIGNATURE----- iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmU/40AWHGtlZXNjb29r QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJh1rD/9g8iQ77KvC/l/GUjt9WNCsbwMR 4Ro15U6PP9TbafxEUTgYGrwpVPcWrOTz3zrLZ/NsR5GtgKolxLry94oeUPCpFRUP v+4cQWpcQQtkAiRw+vc4/XfUivWmZuNGiLOt2egZUP6tQhJocmRNW7XbGF1ZDrSA ASZknaA7qVLx/hnghm+bCXjNOx6hN/Md35QBPuyAclpD3sbUJDPSODZZb9Bcz+4w 0qD3acA3Nulug9k/5j7Ed0MzV8I/WfgZQQhGMl4K7yBQv06vcrRV6Eon4D9KvJVm bjK3zFE/zILkY1BHIUZZT3h2DjdUwHrGr82u5y6u3buj88IcNyFfSaGyYYBqn3Ux P7Y+dD9zZXQuMbqmhWbdK8UoSYiJ9isOB02lt0oHipONR5PqRocTsA6gseMsO9cv TwvGL279WlfZIj+2pvn0VJv/7DOCKGjZfc2AXhgPSkjICSO9mEOlVcFv1v3ZuXAn Cb/6/BMZyNqh/UIGWdPRyDVHEdswpJVcecewnJwmrG1vmvYyfyP8U+VoRE4ItELz fMpZskAb7SKV+McHLDauV+9eCgqaF5DIM3/zgws5iayRcGZQfengXqIajL7Ujlwf RKlnfhtRxkfgpF8vEmQDs0y5AVsU/l48dOSrb/0Vg9oXKBdBa9ozyhr1Ok5kwkiN LfDZDjSMyERgO/UZHQ== =hxk8 -----END PGP SIGNATURE----- Merge tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull execve updates from Kees Cook: - Support non-BSS ELF segments with zero filesz Eric Biederman and I refactored ELF segment loading to handle the case where a segment has a smaller filesz than memsz. Traditionally linkers only did this for .bss and it was always the last segment. As a result, the kernel only handled this case when it was the last segment. We've had two recent cases where linkers were trying to use these kinds of segments for other reasons, and the were in the middle of the segment list. There was no good reason for the kernel not to support this, and the refactor actually ends up making things more readable too. - Enable namespaced binfmt_misc Christian Brauner has made it possible to use binfmt_misc with mount namespaces. This means some traditionally root-only interfaces (for adding/removing formats) are now more exposed (but believed to be safe). - Remove struct tag 'dynamic' from ELF UAPI Alejandro Colomar noticed that the ELF UAPI has been polluting the struct namespace with an unused and overly generic tag named "dynamic" for no discernible reason for many many years. After double-checking various distro source repositories, it has been removed. - Clean up binfmt_elf_fdpic debug output (Greg Ungerer) * tag 'execve-v6.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: binfmt_misc: enable sandboxed mounts binfmt_misc: cleanup on filesystem umount binfmt_elf_fdpic: clean up debug warnings mm: Remove unused vm_brk() binfmt_elf: Only report padzero() errors when PROT_WRITE binfmt_elf: Use elf_load() for library binfmt_elf: Use elf_load() for interpreter binfmt_elf: elf_bss no longer used by load_elf_binary() binfmt_elf: Support segments with 0 filesz and misaligned starts elf, uapi: Remove struct tag 'dynamic' |
||
Linus Torvalds
|
fdce8bd380 |
slab updates for 6.7
-----BEGIN PGP SIGNATURE----- iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmU7xhwACgkQu+CwddJF iJr60gf8ChEzZnP6JM6OtvcbL1AtuoIn/B0iOOCbZYM2e/EPVL19KI3IXxOVb9j3 bOIogGT/PSa6cfzPwEzxu1AEv3A2xu9TgXMIG3vXzBU7QqPlp2rFD8+xvLe8Trcl VWaPZNUCKL9Hrq5gPM8B12N/UuAUA5Pjf9A9Hn1ZNjS4p2seEIc+CfdqtEAB7Th3 ++Tc9MRNP5bEmswf8iivIsuDDSkRS3GxshfBbcVzSE2l+JvxekKjmPjykQvsGnJm sV3Z1blej9Dq/d04ZRaKtLSrA7kW27cyHWzTsZb2IXs+JOmmGZGgK0CFl9OOVtVD IxpsYLw6oZKf8WXetkyzplsQWERXUw== =TGgo -----END PGP SIGNATURE----- Merge tag 'slab-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab Pull slab updates from Vlastimil Babka: - SLUB: slab order calculation refactoring (Vlastimil Babka, Feng Tang) Recent proposals to tune the slab order calculations have prompted us to look at the current code and refactor it to make it easier to follow and eliminate some odd corner cases. The refactoring is mostly non-functional changes, but should make the actual tuning easier to implement and review. * tag 'slab-for-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab: mm/slub: refactor calculate_order() and calc_slab_order() mm/slub: attempt to find layouts up to 1/2 waste in calculate_order() mm/slub: remove min_objects loop from calculate_order() mm/slub: simplify the last resort slab order calculation mm/slub: add sanity check for slub_min/max_order cmdline setup |
||
Linus Torvalds
|
2656821f1f |
RCU pull request for v6.7
This pull request contains the following branches: rcu/torture: RCU torture, locktorture and generic torture infrastructure updates that include various fixes, cleanups and consolidations. Among the user visible things, ftrace dumps can now be found into their own file, and module parameters get better documented and reported on dumps. rcu/fixes: Generic and misc fixes all over the place. Some highlights: * Hotplug handling has seen some light cleanups and comments. * An RCU barrier can now be triggered through sysfs to serialize memory stress testing and avoid OOM. * Object information is now dumped in case of invalid callback invocation. * Also various SRCU issues, too hard to trigger to deserve urgent pull requests, have been fixed. rcu/docs: RCU documentation updates rcu/refscale: RCU reference scalability test minor fixes and doc improvements. rcu/tasks: RCU tasks minor fixes rcu/stall: Stall detection updates. Introduce RCU CPU Stall notifiers that allows a subsystem to provide informations to help debugging. Also cure some false positive stalls. -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEd76+gtGM8MbftQlOhSRUR1COjHcFAmU21h0ACgkQhSRUR1CO jHdUgA/+Myy5K5OxNrqlF/gIK+flOSg635RyZ0DBx8OMXZ/fAg9qRI+PKt5I4Lha eXAg6EtmwSgHmIbjcg8WzsvwniEsqqjOF+n1qil447fHUI2Qqw6c7fIm/MXQkeHJ qA7CODDRtsAnwnjmTteasmMeGV0bmXDENxhNrAZBFnVkRgTqfyDbFcn+nxOaPK6b fmbKvnB07WUg1KOV8/MbEtAZPb8QgHo58bXSZRKjKkiqRQWB/D3On+tShFK7SYJi wIqQ96MLyUXLaIWQ47v6xEO4PZO+3o1wAryvP1DRdb5UrPjO6yKFfQaoo5Mza92G zhBJhnXkVvCoNoCU7GKJIDV54SgDHaB6Sf1GN5cjwfujOkLuGCyg0CpKktCGm7uH n3X66PVep608Uj2Y/pAo/hv3Hbv7lCu4nfrERvVLG9YoxUvTJDsKmBv+SF/g2mxF rHqFa39HUPr1yHA5WjqOQS3lLdqCXEGKvNi6zXCvOceiDbHbiJFkBo6p8TVrbSMX FCOWZ3LoE+6uiLu/lLOEroTjeBd8GhDh1LgWgyVK7o0LhP1018DSBolrpcSwnmOo Q/E4G2x+aPWs+5NTOmMGOIPY70khKQIM3c8YZelSRffJBo6O3yV68h6X45NQxYvx keLvrDaza8h4hKwaof/QaX4ZJgTOZ0xjpawr1vR0hbK8LNtPrUw= =cVD7 -----END PGP SIGNATURE----- Merge tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks Pull RCU updates from Frederic Weisbecker: - RCU torture, locktorture and generic torture infrastructure updates that include various fixes, cleanups and consolidations. Among the user visible things, ftrace dumps can now be found into their own file, and module parameters get better documented and reported on dumps. - Generic and misc fixes all over the place. Some highlights: * Hotplug handling has seen some light cleanups and comments * An RCU barrier can now be triggered through sysfs to serialize memory stress testing and avoid OOM * Object information is now dumped in case of invalid callback invocation * Also various SRCU issues, too hard to trigger to deserve urgent pull requests, have been fixed - RCU documentation updates - RCU reference scalability test minor fixes and doc improvements. - RCU tasks minor fixes - Stall detection updates. Introduce RCU CPU Stall notifiers that allows a subsystem to provide informations to help debugging. Also cure some false positive stalls. * tag 'rcu-next-v6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks: (56 commits) srcu: Only accelerate on enqueue time locktorture: Check the correct variable for allocation failure srcu: Fix callbacks acceleration mishandling rcu: Comment why callbacks migration can't wait for CPUHP_RCUTREE_PREP rcu: Standardize explicit CPU-hotplug calls rcu: Conditionally build CPU-hotplug teardown callbacks rcu: Remove references to rcu_migrate_callbacks() from diagrams rcu: Assume rcu_report_dead() is always called locally rcu: Assume IRQS disabled from rcu_report_dead() rcu: Use rcu_segcblist_segempty() instead of open coding it rcu: kmemleak: Ignore kmemleak false positives when RCU-freeing objects srcu: Fix srcu_struct node grpmask overflow on 64-bit systems torture: Convert parse-console.sh to mktemp rcutorture: Traverse possible cpu to set maxcpu in rcu_nocb_toggle() rcutorture: Replace schedule_timeout*() 1-jiffy waits with HZ/20 torture: Add kvm.sh --debug-info argument locktorture: Rename readers_bind/writers_bind to bind_readers/bind_writers doc: Catch-up update for locktorture module parameters locktorture: Add call_rcu_chains module parameter locktorture: Add new module parameters to lock_torture_print_module_parms() ... |
||
Linus Torvalds
|
63ce50fff9 |
Scheduler changes for v6.7 are:
- Fair scheduler (SCHED_OTHER) improvements: - Remove the old and now unused SIS_PROP code & option - Scan cluster before LLC in the wake-up path - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup - NUMA scheduling improvements: - Improve the VMA access-PID code to better skip/scan VMAs - Extend tracing to cover VMA-skipping decisions - Improve/fix the recently introduced sched_numa_find_nth_cpu() code - Generalize numa_map_to_online_node() - Energy scheduling improvements: - Remove the EM_MAX_COMPLEXITY limit - Add tracepoints to track energy computation - Make the behavior of the 'sched_energy_aware' sysctl more consistent - Consolidate and clean up access to a CPU's max compute capacity - Fix uclamp code corner cases - RT scheduling improvements: - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates - Drive the ->rto_mask with rt_rq->pushable_tasks updates - Scheduler scalability improvements: - Rate-limit updates to tg->load_avg - On x86 disable IBRS when CPU is offline to improve single-threaded performance - Micro-optimize in_task() and in_interrupt() - Micro-optimize the PSI code - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes - Core scheduler infrastructure improvements: - Use saved_state to reduce some spurious freezer wakeups - Bring in a handful of fast-headers improvements to scheduler headers - Make the scheduler UAPI headers more widely usable by user-space - Simplify the control flow of scheduler syscalls by using lock guards - Fix sched_setaffinity() vs. CPU hotplug race - Scheduler debuggability improvements: - Disallow writing invalid values to sched_rt_period_us - Fix a race in the rq-clock debugging code triggering warnings - Fix a warning in the bandwidth distribution code - Micro-optimize in_atomic_preempt_off() checks - Enforce that the tasklist_lock is held in for_each_thread() - Print the TGID in sched_show_task() - Remove the /proc/sys/kernel/sched_child_runs_first sysctl - Misc cleanups & fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU8/NoRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1gN+xAAvKGYNZBCBG4jowxccgqAbCx81KOhhsy/ KUaOmdLPg9WaXuqjZ5sggXQCMT0wUqBYAmqV7ts53VhWcma2I1ap4dCM6Jj+RLrc vNwkeNetsikiZtarMoCJs5NahL8ULh3liBaoAkkToPjQ5r43aZ/eKwDovEdIKc+g +Vgn7jUY8ssIrAOKT1midSwY1y8kAU2AzWOSFDTgedkJP4PgOu9/lBl9jSJ2sYaX N4XqONYPXTwOHUtvmzkYILxLz0k0GgJ7hmt78E8Xy2rC4taGCRwCfCMBYxREuwiP huo3O1P/iIe5svm4/EBUvcpvf44eAWTV+CD0dnJPwOc9IvFhpSzqSZZAsyy/JQKt Lnzmc/xmyc1PnXCYJfHuXrw2/m+MyUHaegPzh5iLJFrlqa79GavOElj0jNTAMzbZ 39fybzPtuFP+64faRfu0BBlQZfORPBNc/oWMpPKqgP58YGuveKTWaUF5rl5lM7Ne nm07uOmq02JVR8YzPl/FcfhU2dPMawWuMwUjEr2eU+lAunY3PF88vu0FALj7iOBd 66F8qrtpDHJanOxrdEUwSJ7hgw79qY1iw66Db7cQYjMazFKZONxArQPqFUZ0ngLI n9hVa7brg1bAQKrQflqjcIAIbpVu3SjPEl15cKpAJTB/gn5H66TQgw8uQ6HfG+h2 GtOsn1nlvuk= =GDqb -----END PGP SIGNATURE----- Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "Fair scheduler (SCHED_OTHER) improvements: - Remove the old and now unused SIS_PROP code & option - Scan cluster before LLC in the wake-up path - Use candidate prev/recent_used CPU if scanning failed for cluster wakeup NUMA scheduling improvements: - Improve the VMA access-PID code to better skip/scan VMAs - Extend tracing to cover VMA-skipping decisions - Improve/fix the recently introduced sched_numa_find_nth_cpu() code - Generalize numa_map_to_online_node() Energy scheduling improvements: - Remove the EM_MAX_COMPLEXITY limit - Add tracepoints to track energy computation - Make the behavior of the 'sched_energy_aware' sysctl more consistent - Consolidate and clean up access to a CPU's max compute capacity - Fix uclamp code corner cases RT scheduling improvements: - Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates - Drive the ->rto_mask with rt_rq->pushable_tasks updates Scheduler scalability improvements: - Rate-limit updates to tg->load_avg - On x86 disable IBRS when CPU is offline to improve single-threaded performance - Micro-optimize in_task() and in_interrupt() - Micro-optimize the PSI code - Avoid updating PSI triggers and ->rtpoll_total when there are no state changes Core scheduler infrastructure improvements: - Use saved_state to reduce some spurious freezer wakeups - Bring in a handful of fast-headers improvements to scheduler headers - Make the scheduler UAPI headers more widely usable by user-space - Simplify the control flow of scheduler syscalls by using lock guards - Fix sched_setaffinity() vs. CPU hotplug race Scheduler debuggability improvements: - Disallow writing invalid values to sched_rt_period_us - Fix a race in the rq-clock debugging code triggering warnings - Fix a warning in the bandwidth distribution code - Micro-optimize in_atomic_preempt_off() checks - Enforce that the tasklist_lock is held in for_each_thread() - Print the TGID in sched_show_task() - Remove the /proc/sys/kernel/sched_child_runs_first sysctl ... and misc cleanups & fixes" * tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits) sched/fair: Remove SIS_PROP sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup sched/fair: Scan cluster before scanning LLC in wake-up path sched: Add cpus_share_resources API sched/core: Fix RQCF_ACT_SKIP leak sched/fair: Remove unused 'curr' argument from pick_next_entity() sched/nohz: Update comments about NEWILB_KICK sched/fair: Remove duplicate #include sched/psi: Update poll => rtpoll in relevant comments sched: Make PELT acronym definition searchable sched: Fix stop_one_cpu_nowait() vs hotplug sched/psi: Bail out early from irq time accounting sched/topology: Rename 'DIE' domain to 'PKG' sched/psi: Delete the 'update_total' function parameter from update_triggers() sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes sched/headers: Remove comment referring to rq::cpu_load, since this has been removed sched/numa: Complete scanning of inactive VMAs when there is no alternative sched/numa: Complete scanning of partial VMAs regardless of PID activity sched/numa: Move up the access pid reset logic sched/numa: Trace decisions related to skipping VMAs ... |
||
Linus Torvalds
|
14ab6d425e |
vfs-6.7.ctime
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppYgAKCRCRxhvAZXjc okIHAP9anLz1QDyMLH12ASuHjgBc0Of3jcB6NB97IWGpL4O21gEA46ohaD+vcJuC YkBLU3lXqQ87nfu28ExFAzh10hG2jwM= =m4pB -----END PGP SIGNATURE----- Merge tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs inode time accessor updates from Christian Brauner: "This finishes the conversion of all inode time fields to accessor functions as discussed on list. Changing timestamps manually as we used to do before is error prone. Using accessors function makes this robust. It does not contain the switch of the time fields to discrete 64 bit integers to replace struct timespec and free up space in struct inode. But after this, the switch can be trivially made and the patch should only affect the vfs if we decide to do it" * tag 'vfs-6.7.ctime' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (86 commits) fs: rename inode i_atime and i_mtime fields security: convert to new timestamp accessors selinux: convert to new timestamp accessors apparmor: convert to new timestamp accessors sunrpc: convert to new timestamp accessors mm: convert to new timestamp accessors bpf: convert to new timestamp accessors ipc: convert to new timestamp accessors linux: convert to new timestamp accessors zonefs: convert to new timestamp accessors xfs: convert to new timestamp accessors vboxsf: convert to new timestamp accessors ufs: convert to new timestamp accessors udf: convert to new timestamp accessors ubifs: convert to new timestamp accessors tracefs: convert to new timestamp accessors sysv: convert to new timestamp accessors squashfs: convert to new timestamp accessors server: convert to new timestamp accessors client: convert to new timestamp accessors ... |
||
Linus Torvalds
|
7352a6765c |
vfs-6.7.xattr
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTppWAAKCRCRxhvAZXjc okB2AP4jjoRErJBwj245OIDJqzoj4m4UVOVd0MH2AkiSpANczwD/TToChdpusY2y qAYg1fQoGMbDVlb7Txaj9qI9ieCf9w0= =2PXg -----END PGP SIGNATURE----- Merge tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs xattr updates from Christian Brauner: "The 's_xattr' field of 'struct super_block' currently requires a mutable table of 'struct xattr_handler' entries (although each handler itself is const). However, no code in vfs actually modifies the tables. This changes the type of 's_xattr' to allow const tables, and modifies existing file systems to move their tables to .rodata. This is desirable because these tables contain entries with function pointers in them; moving them to .rodata makes it considerably less likely to be modified accidentally or maliciously at runtime" * tag 'vfs-6.7.xattr' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (30 commits) const_structs.checkpatch: add xattr_handler net: move sockfs_xattr_handlers to .rodata shmem: move shmem_xattr_handlers to .rodata overlayfs: move xattr tables to .rodata xfs: move xfs_xattr_handlers to .rodata ubifs: move ubifs_xattr_handlers to .rodata squashfs: move squashfs_xattr_handlers to .rodata smb: move cifs_xattr_handlers to .rodata reiserfs: move reiserfs_xattr_handlers to .rodata orangefs: move orangefs_xattr_handlers to .rodata ocfs2: move ocfs2_xattr_handlers and ocfs2_xattr_handler_map to .rodata ntfs3: move ntfs_xattr_handlers to .rodata nfs: move nfs4_xattr_handlers to .rodata kernfs: move kernfs_xattr_handlers to .rodata jfs: move jfs_xattr_handlers to .rodata jffs2: move jffs2_xattr_handlers to .rodata hfsplus: move hfsplus_xattr_handlers to .rodata hfs: move hfs_xattr_handlers to .rodata gfs2: move gfs2_xattr_handlers_max to .rodata fuse: move fuse_xattr_handlers to .rodata ... |
||
Linus Torvalds
|
3b3f874cc1 |
vfs-6.7.misc
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZTpoQAAKCRCRxhvAZXjc
ovFNAQDgIRjXfZ1Ku+USxsRRdqp8geJVaNc3PuMmYhOYhUenqgEAmC1m+p0y31dS
P6+HlL16Mqgu0tpLCcJK9BibpDZ0Ew4=
=7yD1
-----END PGP SIGNATURE-----
Merge tag 'vfs-6.7.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs
Pull misc vfs updates from Christian Brauner:
"This contains the usual miscellaneous features, cleanups, and fixes
for vfs and individual fses.
Features:
- Rename and export helpers that get write access to a mount. They
are used in overlayfs to get write access to the upper mount.
- Print the pretty name of the root device on boot failure. This
helps in scenarios where we would usually only print
"unknown-block(1,2)".
- Add an internal SB_I_NOUMASK flag. This is another part in the
endless POSIX ACL saga in a way.
When POSIX ACLs are enabled via SB_POSIXACL the vfs cannot strip
the umask because if the relevant inode has POSIX ACLs set it might
take the umask from there. But if the inode doesn't have any POSIX
ACLs set then we apply the umask in the filesytem itself. So we end
up with:
(1) no SB_POSIXACL -> strip umask in vfs
(2) SB_POSIXACL -> strip umask in filesystem
The umask semantics associated with SB_POSIXACL allowed filesystems
that don't even support POSIX ACLs at all to raise SB_POSIXACL
purely to avoid umask stripping. That specifically means NFS v4 and
Overlayfs. NFS v4 does it because it delegates this to the server
and Overlayfs because it needs to delegate umask stripping to the
upper filesystem, i.e., the filesystem used as the writable layer.
This went so far that SB_POSIXACL is raised eve on kernels that
don't even have POSIX ACL support at all.
Stop this blatant abuse and add SB_I_NOUMASK which is an internal
superblock flag that filesystems can raise to opt out of umask
handling. That should really only be the two mentioned above. It's
not that we want any filesystems to do this. Ideally we have all
umask handling always in the vfs.
- Make overlayfs use SB_I_NOUMASK too.
- Now that we have SB_I_NOUMASK, stop checking for SB_POSIXACL in
IS_POSIXACL() if the kernel doesn't have support for it. This is a
very old patch but it's only possible to do this now with the wider
cleanup that was done.
- Follow-up work on fake path handling from last cycle. Citing mostly
from Amir:
When overlayfs was first merged, overlayfs files of regular files
and directories, the ones that are installed in file table, had a
"fake" path, namely, f_path is the overlayfs path and f_inode is
the "real" inode on the underlying filesystem.
In v6.5, we took another small step by introducing of the
backing_file container and the file_real_path() helper. This change
allowed vfs and filesystem code to get the "real" path of an
overlayfs backing file. With this change, we were able to make
fsnotify work correctly and report events on the "real" filesystem
objects that were accessed via overlayfs.
This method works fine, but it still leaves the vfs vulnerable to
new code that is not aware of files with fake path. A recent
example is commit
|
||
Linus Torvalds
|
d4e175f2c4 |
vfs-6.7.super
-----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZT0C2gAKCRCRxhvAZXjc otV8AQCK5F9ONoQ7ISpdrKyUJiswySGXx0CYPfXbSg5gHH87zgEAua3vwVKeGXXF 5iVsdiNzIIQDwGDx7FyxufL4ggcN6gQ= =E1kV -----END PGP SIGNATURE----- Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs superblock updates from Christian Brauner: "This contains the work to make block device opening functions return a struct bdev_handle instead of just a struct block_device. The same struct bdev_handle is then also passed to block device closing functions. This allows us to propagate context from opening to closing a block device without having to modify all users everytime. Sidenote, in the future we might even want to try and have block device opening functions return a struct file directly but that's a series on top of this. These are further preparatory changes to be able to count writable opens and blocking writes to mounted block devices. That's a separate piece of work for next cycle and for that we absolutely need the changes to btrfs that have been quietly dropped somehow. Originally the series contained a patch that removed the old blkdev_*() helpers. But since this would've caused needles churn in -next for bcachefs we ended up delaying it. The second piece of work addresses one of the major annoyances about the work last cycle, namely that we required dropping s_umount whenever we used the superblock and fs_holder_ops for a block device. The reason for that requirement had been that in some codepaths s_umount could've been taken under disk->open_mutex (that's always been the case, at least theoretically). For example, on surprise block device removal or media change. And opening and closing block devices required grabbing disk->open_mutex as well. So we did the work and went through the block layer and fixed all those places so that s_umount is never taken under disk->open_mutex. This means no more brittle games where we yield and reacquire s_umount during block device opening and closing and no more requirements where block devices need to be closed. Filesystems don't need to care about this. There's a bunch of other follow-up work such as moving block device freezing and thawing to holder operations which makes it work for all block devices and not just the main block device just as we did for surprise removal. But that is for next cycle. Tested with fstests for all major fses, blktests, LTP" * tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits) porting: update locking requirements fs: assert that open_mutex isn't held over holder ops block: assert that we're not holding open_mutex over blk_report_disk_dead block: move bdev_mark_dead out of disk_check_media_change block: WARN_ON_ONCE() when we remove active partitions block: simplify bdev_del_partition() fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock jfs: fix log->bdev_handle null ptr deref in lbmStartIO bcache: Fixup error handling in register_cache() xfs: Convert to bdev_open_by_path() reiserfs: Convert to bdev_open_by_dev/path() ocfs2: Convert to use bdev_open_by_dev() nfs/blocklayout: Convert to use bdev_open_by_dev/path() jfs: Convert to bdev_open_by_dev() f2fs: Convert to bdev_open_by_dev/path() ext4: Convert to bdev_open_by_dev() erofs: Convert to use bdev_open_by_path() btrfs: Convert to bdev_open_by_path() fs: Convert to bdev_open_by_dev() mm/swap: Convert to use bdev_open_by_dev() ... |
||
Jan Kara
|
4c6bca43c5
|
mm/swap: Convert to use bdev_open_by_dev()
Convert swapping code to use bdev_open_by_dev() and pass the handle around. CC: linux-mm@kvack.org CC: Andrew Morton <akpm@linux-foundation.org> Acked-by: Christoph Hellwig <hch@lst.de> Acked-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20230927093442.25915-18-jack@suse.cz Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
Jakub Kicinski
|
c6f9b7138b |
bpf-next-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZTp12QAKCRDbK58LschI g8BrAQDifqp5liEEdXV8jdReBwJtqInjrL5tzy5LcyHUMQbTaAEA6Ph3Ct3B+3oA mFnIW/y6UJiJrby0Xz4+vV5BXI/5WQg= =pLCV -----END PGP SIGNATURE----- Merge tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2023-10-26 We've added 51 non-merge commits during the last 10 day(s) which contain a total of 75 files changed, 5037 insertions(+), 200 deletions(-). The main changes are: 1) Add open-coded task, css_task and css iterator support. One of the use cases is customizable OOM victim selection via BPF, from Chuyi Zhou. 2) Fix BPF verifier's iterator convergence logic to use exact states comparison for convergence checks, from Eduard Zingerman, Andrii Nakryiko and Alexei Starovoitov. 3) Add BPF programmable net device where bpf_mprog defines the logic of its xmit routine. It can operate in L3 and L2 mode, from Daniel Borkmann and Nikolay Aleksandrov. 4) Batch of fixes for BPF per-CPU kptr and re-enable unit_size checking for global per-CPU allocator, from Hou Tao. 5) Fix libbpf which eagerly assumed that SHT_GNU_verdef ELF section was going to be present whenever a binary has SHT_GNU_versym section, from Andrii Nakryiko. 6) Fix BPF ringbuf correctness to fold smp_mb__before_atomic() into atomic_set_release(), from Paul E. McKenney. 7) Add a warning if NAPI callback missed xdp_do_flush() under CONFIG_DEBUG_NET which helps checking if drivers were missing the former, from Sebastian Andrzej Siewior. 8) Fix missed RCU read-lock in bpf_task_under_cgroup() which was throwing a warning under sleepable programs, from Yafang Shao. 9) Avoid unnecessary -EBUSY from htab_lock_bucket by disabling IRQ before checking map_locked, from Song Liu. 10) Make BPF CI linked_list failure test more robust, from Kumar Kartikeya Dwivedi. 11) Enable samples/bpf to be built as PIE in Fedora, from Viktor Malik. 12) Fix xsk starving when multiple xsk sockets were associated with a single xsk_buff_pool, from Albert Huang. 13) Clarify the signed modulo implementation for the BPF ISA standardization document that it uses truncated division, from Dave Thaler. 14) Improve BPF verifier's JEQ/JNE branch taken logic to also consider signed bounds knowledge, from Andrii Nakryiko. 15) Add an option to XDP selftests to use multi-buffer AF_XDP xdp_hw_metadata and mark used XDP programs as capable to use frags, from Larysa Zaremba. 16) Fix bpftool's BTF dumper wrt printing a pointer value and another one to fix struct_ops dump in an array, from Manu Bretelle. * tag 'for-netdev' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (51 commits) netkit: Remove explicit active/peer ptr initialization selftests/bpf: Fix selftests broken by mitigations=off samples/bpf: Allow building with custom bpftool samples/bpf: Fix passing LDFLAGS to libbpf samples/bpf: Allow building with custom CFLAGS/LDFLAGS bpf: Add more WARN_ON_ONCE checks for mismatched alloc and free selftests/bpf: Add selftests for netkit selftests/bpf: Add netlink helper library bpftool: Extend net dump with netkit progs bpftool: Implement link show support for netkit libbpf: Add link-based API for netkit tools: Sync if_link uapi header netkit, bpf: Add bpf programmable net device bpf: Improve JEQ/JNE branch taken logic bpf: Fold smp_mb__before_atomic() into atomic_set_release() bpf: Fix unnecessary -EBUSY from htab_lock_bucket xsk: Avoid starving the xsk further down the list bpf: print full verifier states on infinite loop detection selftests/bpf: test if state loops are detected in a tricky case bpf: correct loop detection for iterators convergence ... ==================== Link: https://lore.kernel.org/r/20231026150509.2824-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Jakub Kicinski
|
ec4c20ca09 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. Conflicts: net/mac80211/rx.c |
||
Hugh Dickins
|
88c91dc585 |
mempolicy: migration attempt to match interleave nodes
Improve alloc_migration_target_by_mpol()'s treatment of MPOL_INTERLEAVE. Make an effort in do_mbind(), to identify the correct interleave index for the first page to be migrated, so that it and all subsequent pages from the same vma will be targeted to precisely their intended nodes. Pages from following vmas will still be interleaved from the requested nodemask, but perhaps starting from a different base. Whether this is worth doing at all, or worth improving further, is arguable: queue_folio_required() is right not to care about the precise placement on interleaved nodes; but this little effort seems appropriate. [hughd@google.com: do vma_iter search under mmap_write_unlock()] Link: https://lkml.kernel.org/r/3311d544-fb05-a7f1-1b74-16aa0f6cd4fe@google.com Link: https://lkml.kernel.org/r/77954a5-9c9b-1c11-7d5c-3262c01b895f@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
72e315f7a7 |
mempolicy: mmap_lock is not needed while migrating folios
mbind(2) holds down_write of current task's mmap_lock throughout (exclusive because it needs to set the new mempolicy on the vmas); migrate_pages(2) holds down_read of pid's mmap_lock throughout. They both hold mmap_lock across the internal migrate_pages(), under which all new page allocations (huge or small) are made. I'm nervous about it; and migrate_pages() certainly does not need mmap_lock itself. It's done this way for mbind(2), because its page allocator is vma_alloc_folio() or alloc_hugetlb_folio_vma(), both of which depend on vma and address. Now that we have alloc_pages_mpol(), depending on (refcounted) memory policy and interleave index, mbind(2) can be modified to use that or alloc_hugetlb_folio_nodemask(), and then not need mmap_lock across the internal migrate_pages() at all: add alloc_migration_target_by_mpol() to replace mbind's new_page(). (After that change, alloc_hugetlb_folio_vma() is used by nothing but a userfaultfd function: move it out of hugetlb.h and into the #ifdef.) migrate_pages(2) has chosen its target node before migrating, so can continue to use the standard alloc_migration_target(); but let it take and drop mmap_lock just around migrate_to_node()'s queue_pages_range(): neither the node-to-node calculations nor the page migrations need it. It seems unlikely, but it is conceivable that some userspace depends on the kernel's mmap_lock exclusion here, instead of doing its own locking: more likely in a testsuite than in real life. It is also possible, of course, that some pages on the list will be munmapped by another thread before they are migrated, or a newer memory policy applied to the range by that time: but such races could happen before, as soon as mmap_lock was dropped, so it does not appear to be a concern. Link: https://lkml.kernel.org/r/21e564e8-269f-6a89-7ee2-fd612831c289@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
ddc1a5cbc0 |
mempolicy: alloc_pages_mpol() for NUMA policy without vma
Shrink shmem's stack usage by eliminating the pseudo-vma from its folio
allocation. alloc_pages_mpol(gfp, order, pol, ilx, nid) becomes the
principal actor for passing mempolicy choice down to __alloc_pages(),
rather than vma_alloc_folio(gfp, order, vma, addr, hugepage).
vma_alloc_folio() and alloc_pages() remain, but as wrappers around
alloc_pages_mpol(). alloc_pages_bulk_*() untouched, except to provide the
additional args to policy_nodemask(), which subsumes policy_node().
Cleanup throughout, cutting out some unhelpful "helpers".
It would all be much simpler without MPOL_INTERLEAVE, but that adds a
dynamic to the constant mpol: complicated by v3.6 commit
|
||
Hugh Dickins
|
23e4883248 |
mm: add page_rmappable_folio() wrapper
folio_prep_large_rmappable() is being used repeatedly along with a conversion from page to folio, a check non-NULL, a check order > 1: wrap it all up into struct folio *page_rmappable_folio(struct page *). Link: https://lkml.kernel.org/r/8d92c6cf-eebe-748-e29c-c8ab224c741@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
2cafb58217 |
mempolicy: remove confusing MPOL_MF_LAZY dead code
v3.8 commit |
||
Hugh Dickins
|
35ec8fa020 |
mempolicy: mpol_shared_policy_init() without pseudo-vma
mpol_shared_policy_init() does not need to use a pseudo-vma: it can use sp_alloc() and sp_insert() directly, since the object's shared policy tree is empty and inaccessible (needing no lock) at get_inode() time. Link: https://lkml.kernel.org/r/3bef62d8-ae78-4c2-533-56a44ae425c@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
93397c3b76 |
mempolicy trivia: use pgoff_t in shared mempolicy tree
Prefer the more explicit "pgoff_t" to "unsigned long" when dealing with a shared mempolicy tree. Delete confusing comment about pseudo mm vmas. Link: https://lkml.kernel.org/r/5451157-3818-4af5-fd2c-5d26a5d1dc53@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
c36f6e6dff |
mempolicy trivia: slightly more consistent naming
Before getting down to work, do a little cleanup, mainly of inconsistent variable naming. I gave up trying to rationalize mpol versus pol versus policy, and node versus nid, but let's avoid p and nd. Remove a few superfluous blank lines, but add one; and here prefer vma->vm_policy to vma_policy(vma) - the latter being appropriate in other sources, which have to allow for !CONFIG_NUMA. That intriguing line about KERNEL_DS? should have gone in v2.6.15, when numa_policy_init() stopped using set_mempolicy(2)'s system call handler. Link: https://lkml.kernel.org/r/68287974-b6ae-7df-4ba-d19ddd69cbf@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
7f1ee4e207 |
mempolicy trivia: delete those ancient pr_debug()s
Delete those ancient pr_debug()s - PDprintk()s in Andi Kleen's original submission of core NUMA API, and useful when debugging shared mempolicy lifetime back then, but not used recently. Link: https://lkml.kernel.org/r/f25135-ffb2-40d8-9577-720772b333@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Hugh Dickins
|
1cb5d11a37 |
mempolicy: fix migrate_pages(2) syscall return nr_failed
"man 2 migrate_pages" says "On success migrate_pages() returns the number of pages that could not be moved". Although 5.3 and 5.4 commits fixed mbind(MPOL_MF_STRICT|MPOL_MF_MOVE*) to fail with EIO when not all pages could be moved (because some could not be isolated for migration), migrate_pages(2) was left still reporting only those pages failing at the migration stage, forgetting those failing at the earlier isolation stage. Fix that by accumulating a long nr_failed count in struct queue_pages, returned by queue_pages_range() when it's not returning an error, for adding on to the nr_failed count from migrate_pages() in mm/migrate.c. A count of pages? It's more a count of folios, but changing it to pages would entail more work (also in mm/migrate.c): does not seem justified. queue_pages_range() itself should only return -EIO in the "strictly unmovable" case (STRICT without any MOVEs): in that case it's best to break out as soon as nr_failed gets set; but otherwise it should continue to isolate pages for MOVing even when nr_failed - as the mbind(2) manpage promises. There's a case when nr_failed should be incremented when it was missed: queue_folios_pte_range() and queue_folios_hugetlb() count the transient migration entries, like queue_folios_pmd() already did. And there's a case when nr_failed should not be incremented when it would have been: in meeting later PTEs of the same large folio, which can only be isolated once: fixed by recording the current large folio in struct queue_pages. Clean up the affected functions, fixing or updating many comments. Bool migrate_folio_add(), without -EIO: true if adding, or if skipping shared (but its arguable folio_estimated_sharers() heuristic left unchanged). Use MPOL_MF_WRLOCK flag to queue_pages_range(), instead of bool lock_vma. Use explicit STRICT|MOVE* flags where queue_pages_test_walk() checks for skipping, instead of hiding them behind MPOL_MF_VALID. Link: https://lkml.kernel.org/r/9a6b0b9-3bb-dbef-8adf-efab4397b8d@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun heo <tj@kernel.org> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
b8ee5575f7 |
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
damon_sysfs_set_targets() had a bug that can result in unexpected memory usage and monitoring overhead increase. The bug has fixed by a previous commit. Add a unit test for avoiding a similar bug of future. Link: https://lkml.kernel.org/r/20231022210735.46409-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
62f76a7b53 |
mm/damon/core: avoid divide-by-zero from pseudo-moving window length calculation
When calculating the pseudo-moving access rate, DAMON divides some values
by the maximum nr_accesses. However, due to the type of the related
variables, simple division-based calculation of the divisor can return
zero. As a result, divide-by-zero is possible. Fix it by using
damon_max_nr_accesses(), which handles the case.
Note that this is a fix for a commit that not in the mainline but mm
tree.
Link: https://lkml.kernel.org/r/20231019194924.100347-6-sj@kernel.org
Fixes:
|
||
SeongJae Park
|
44063f125a |
mm/damon/lru_sort: avoid divide-by-zero in hot threshold calculation
When calculating the hotness threshold for lru_prio scheme of
DAMON_LRU_SORT, the module divides some values by the maximum nr_accesses.
However, due to the type of the related variables, simple division-based
calculation of the divisor can return zero. As a result, divide-by-zero
is possible. Fix it by using damon_max_nr_accesses(), which handles the
case.
Link: https://lkml.kernel.org/r/20231019194924.100347-5-sj@kernel.org
Fixes:
|
||
SeongJae Park
|
3bafc47d3c |
mm/damon/ops-common: avoid divide-by-zero during region hotness calculation
When calculating the hotness of each region for the under-quota regions
prioritization, DAMON divides some values by the maximum nr_accesses.
However, due to the type of the related variables, simple division-based
calculation of the divisor can return zero. As a result, divide-by-zero
is possible. Fix it by using damon_max_nr_accesses(), which handles the
case.
Link: https://lkml.kernel.org/r/20231019194924.100347-4-sj@kernel.org
Fixes:
|
||
SeongJae Park
|
d35963bfb0 |
mm/damon/core: avoid divide-by-zero during monitoring results update
When monitoring attributes are changed, DAMON updates access rate of the
monitoring results accordingly. For that, it divides some values by the
maximum nr_accesses. However, due to the type of the related variables,
simple division-based calculation of the divisor can return zero. As a
result, divide-by-zero is possible. Fix it by using
damon_max_nr_accesses(), which handles the case.
Link: https://lkml.kernel.org/r/20231019194924.100347-3-sj@kernel.org
Fixes:
|
||
Hugh Dickins
|
b1454b463c |
mm: mlock: avoid folio_within_range() on KSM pages
Since commit |
||
Baolin Wang
|
eebb3dabbb |
mm: migrate: record the mlocked page status to remove unnecessary lru drain
When doing compaction, I found the lru_add_drain() is an obvious hotspot
when migrating pages. The distribution of this hotspot is as follows:
- 18.75% compact_zone
- 17.39% migrate_pages
- 13.79% migrate_pages_batch
- 11.66% migrate_folio_move
- 7.02% lru_add_drain
+ 7.02% lru_add_drain_cpu
+ 3.00% move_to_new_folio
1.23% rmap_walk
+ 1.92% migrate_folio_unmap
+ 3.20% migrate_pages_sync
+ 0.90% isolate_migratepages
The lru_add_drain() was added by commit
|
||
Vegard Nossum
|
e5b16c8628 |
mm: hugetlb_vmemmap: fix reference to nonexistent file
The directory this file is in was renamed but the reference didn't get
updated. Fix it.
Link: https://lkml.kernel.org/r/20231022185619.919397-1-vegard.nossum@oracle.com
Fixes:
|
||
Hyesoo Yu
|
76f26535d1 |
mm: page_alloc: check the order of compound page even when the order is zero
For compound pages, the head sets the PG_head flag and the tail sets the compound_head to indicate the head page. If a user allocates a compound page and frees it with a different order, the compound page information will not be properly initialized. To detect this problem, compound_order(page) and the order argument are compared, but this is not checked when the order argument is zero. That error should be checked regardless of the order. Link: https://lkml.kernel.org/r/20231023083217.1866451-1-hyesoo.yu@samsung.com Signed-off-by: Hyesoo Yu <hyesoo.yu@samsung.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Muhammad Muzammil
|
be16dd764a |
mm: fix multiple typos in multiple files
Link: https://lkml.kernel.org/r/20231023124405.36981-1-m.muzzammilashraf@gmail.com Signed-off-by: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Muzammil <m.muzzammilashraf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vishal Moola (Oracle)
|
98b32d296d |
mm/khugepaged: convert collapse_pte_mapped_thp() to use folios
This removes 2 calls to compound_head() and helps convert khugepaged to use folios throughout. Previously, if the address passed to collapse_pte_mapped_thp() corresponded to a tail page, the scan would fail immediately. Using filemap_lock_folio() we get the corresponding folio back and try to operate on the folio instead. Link: https://lkml.kernel.org/r/20231020183331.10770-6-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vishal Moola (Oracle)
|
b455f39d22 |
mm/khugepaged: convert alloc_charge_hpage() to use folios
Also remove count_memcg_page_event now that its last caller no longer uses it and reword hpage_collapse_alloc_page() to hpage_collapse_alloc_folio(). This removes 1 call to compound_head() and helps convert khugepaged to use folios throughout. Link: https://lkml.kernel.org/r/20231020183331.10770-5-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vishal Moola (Oracle)
|
dbf85c21e4 |
mm/khugepaged: convert is_refcount_suitable() to use folios
Both callers of is_refcount_suitable() have been converted to use folios, so convert it to take in a folio. Both callers only operate on head pages of folios so mapcount/refcount conversions here are trivial. Removes 3 calls to compound head, and removes 315 bytes of kernel text. Link: https://lkml.kernel.org/r/20231020183331.10770-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vishal Moola (Oracle)
|
5c07ebb372 |
mm/khugepaged: convert hpage_collapse_scan_pmd() to use folios
Replaces 5 calls to compound_head(), and removes 1385 bytes of kernel text. Link: https://lkml.kernel.org/r/20231020183331.10770-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Vishal Moola (Oracle)
|
8dd1e89673 |
mm/khugepaged: convert __collapse_huge_page_isolate() to use folios
Patch series "Some khugepaged folio conversions", v3. This patchset converts a number of functions to use folios. This cleans up some khugepaged code and removes a large number of hidden compound_head() calls. This patch (of 5): Replaces 11 calls to compound_head() with 1, and removes 1348 bytes of kernel text. Link: https://lkml.kernel.org/r/20231020183331.10770-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20231020183331.10770-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Qi Zheng
|
b7812c86c7 |
mm: memory_hotplug: drop memoryless node from fallback lists
In offline_pages(), if a node becomes memoryless, we will clear its N_MEMORY state by calling node_states_clear_node(). But we do this after rebuilding the zonelists by calling build_all_zonelists(), which will cause this memoryless node to still be in the fallback nodes (node_order[]) of other nodes. To drop memoryless nodes from fallback nodes in this case, just call node_states_clear_node() before calling build_all_zonelists(). In this way, we will not try to allocate pages from memoryless node0, then the panic mentioned in [1] will also be fixed. Even though this problem has been solved by dropping the NODE_MIN_SIZE constrain in x86 [2], it would be better to fix it in the core MM as well. https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/ [1] https://lore.kernel.org/all/20231017062215.171670-1-rppt@kernel.org/ [2] Link: https://lkml.kernel.org/r/9f1dbe7ee1301c7163b2770e32954ff5e3ecf2c4.1697711415.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Qi Zheng
|
c2baef394a |
mm: page_alloc: skip memoryless nodes entirely
Patch series "handle memoryless nodes more appropriately", v3. Currently, in the process of initialization or offline memory, memoryless nodes will still be built into the fallback list of itself or other nodes. This is not what we expected, so this patch series removes memoryless nodes from the fallback list entirely. This patch (of 2): In find_next_best_node(), we skipped the memoryless nodes when building the zonelists of other normal nodes (N_NORMAL), but did not skip the memoryless node itself when building the zonelist. This will cause it to be traversed at runtime. For example, say we have node0 and node1, node0 is memoryless node, then the fallback order of node0 and node1 as follows: [ 0.153005] Fallback order for Node 0: 0 1 [ 0.153564] Fallback order for Node 1: 1 After this patch, we skip memoryless node0 entirely, then the fallback order of node0 and node1 as follows: [ 0.155236] Fallback order for Node 0: 1 [ 0.155806] Fallback order for Node 1: 1 So it becomes completely invisible, which will reduce runtime overhead. And in this way, we will not try to allocate pages from memoryless node0, then the panic mentioned in [1] will also be fixed. Even though this problem has been solved by dropping the NODE_MIN_SIZE constrain in x86 [2], it would be better to fix it in core MM as well. [1]. https://lore.kernel.org/all/20230212110305.93670-1-zhengqi.arch@bytedance.com/ [2]. https://lore.kernel.org/all/20231017062215.171670-1-rppt@kernel.org/ [zhengqi.arch@bytedance.com: update comment, per Ingo] Link: https://lkml.kernel.org/r/7300fc00a057eefeb9a68c8ad28171c3f0ce66ce.1697799303.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/cover.1697799303.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/cover.1697711415.git.zhengqi.arch@bytedance.com Link: https://lkml.kernel.org/r/157013e978468241de4a4c05d5337a44638ecb0e.1697711415.git.zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Ingo Molnar <mingo@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
49cac03a8f |
mm/migrate: add nr_split to trace_mm_migrate_pages stats.
Add nr_split to trace_mm_migrate_pages for large folio (including THP) split events. [akpm@linux-foundation.org: cleanup per Huang, Ying] Link: https://lkml.kernel.org/r/20231017163129.2025214-2-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Zi Yan
|
a259945efe |
mm/migrate: correct nr_failed in migrate_pages_sync()
nr_failed was missing the large folio splits from migrate_pages_batch()
and can cause a mismatch between migrate_pages() return value and the
number of not migrated pages, i.e., when the return value of
migrate_pages() is 0, there are still pages left in the from page list.
It will happen when a non-PMD THP large folio fails to migrate due to
-ENOMEM and is split successfully but not all the split pages are not
migrated, migrate_pages_batch() would return non-zero, but
astats.nr_thp_split = 0. nr_failed would be 0 and returned to the caller
of migrate_pages(), but the not migrated pages are left in the from page
list without being added back to LRU lists.
Fix it by adding a new nr_split counter for large folio splits and adding
it to nr_failed in migrate_page_sync() after migrate_pages_batch() is
done.
Link: https://lkml.kernel.org/r/20231017163129.2025214-1-zi.yan@sent.com
Fixes:
|
||
Liu Shixin
|
245245c2ff |
mm/kmemleak: move the initialisation of object to __link_object
In patch (mm: kmemleak: split __create_object into two functions), the initialisation of object has been splited in two places. Catalin said it feels a bit weird and error prone. So leave __alloc_object() to just do the actual allocation and let __link_object() do the full initialisation. Link: https://lkml.kernel.org/r/20231023025125.90972-1-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liu Shixin
|
5e4fc577db |
mm/kmemleak: fix partially freeing unknown object warning
delete_object_part() can be called by multiple callers in the same time.
If an object is found and removed by a caller, and then another caller try
to find it too, it failed and return directly. It still be recorded by
kmemleak even if it has already been freed to buddy. With DEBUG on,
kmemleak will report the following warning,
kmemleak: Partially freeing unknown object at 0xa1af86000 (size 4096)
CPU: 0 PID: 742 Comm: test_huge Not tainted 6.6.0-rc3kmemleak+ #54
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x37/0x50
kmemleak_free_part_phys+0x50/0x60
hugetlb_vmemmap_optimize+0x172/0x290
? __pfx_vmemmap_remap_pte+0x10/0x10
__prep_new_hugetlb_folio+0xe/0x30
prep_new_hugetlb_folio.isra.0+0xe/0x40
alloc_fresh_hugetlb_folio+0xc3/0xd0
alloc_surplus_hugetlb_folio.constprop.0+0x6e/0xd0
hugetlb_acct_memory.part.0+0xe6/0x2a0
hugetlb_reserve_pages+0x110/0x2c0
hugetlbfs_file_mmap+0x11d/0x1b0
mmap_region+0x248/0x9a0
? hugetlb_get_unmapped_area+0x15c/0x2d0
do_mmap+0x38b/0x580
vm_mmap_pgoff+0xe6/0x190
ksys_mmap_pgoff+0x18a/0x1f0
do_syscall_64+0x3f/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
Expand __create_object() and move __alloc_object() to the beginning. Then
use kmemleak_lock to protect __find_and_remove_object() and
__link_object() as a whole, which can guarantee all objects are processed
sequentialally.
Link: https://lkml.kernel.org/r/20231018102952.3339837-8-liushixin2@huawei.com
Fixes:
|
||
Liu Shixin
|
858a195b93 |
mm: kmemleak: add __find_and_remove_object()
Add new __find_and_remove_object() without kmemleak_lock protect, it is in preparation for the next patch. Link: https://lkml.kernel.org/r/20231018102952.3339837-7-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liu Shixin
|
2e1d47385f |
mm: kmemleak: use mem_pool_free() to free object
The kmemleak object is allocated by mem_pool_alloc(), which could be from
slab or mem_pool[], so it's not suitable using __kmem_cache_free() to free
the object, use __mem_pool_free() instead.
Link: https://lkml.kernel.org/r/20231018102952.3339837-6-liushixin2@huawei.com
Fixes:
|
||
Liu Shixin
|
0edd7b5829 |
mm: kmemleak: split __create_object into two functions
__create_object() consists of two part, the first part allocate a kmemleak object and initialize it, the second part insert it into object tree. This function need kmemleak_lock but actually only the second part need lock. Split it into two functions, the first function __alloc_object only allocate a kmemleak object, and the second function __link_object() will initialize the object and insert it into object tree, use the kmemleak_lock to protect __link_object() only. [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20231018102952.3339837-5-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liu Shixin
|
62047e0f3e |
mm/kmemleak: fix print format of pointer in pr_debug()
With 0x%p, the pointer will be hashed and print (____ptrval____) instead. And with 0x%pa, the pointer can be successfully printed but with duplicate prefixes, which looks like: kmemleak: kmemleak_free(0x(____ptrval____)) kmemleak: kmemleak_free_percpu(0x(____ptrval____)) kmemleak: kmemleak_free_part_phys(0x0x0000000a1af86000) Use 0x%px instead of 0x%p or 0x%pa to print the pointer. Then the print will be like: kmemleak: kmemleak_free(0xffff9111c145b020) kmemleak: kmemleak_free_percpu(0x00000000000333b0) kmemleak: kmemleak_free_part_phys(0x0000000a1af80000) Link: https://lkml.kernel.org/r/20231018102952.3339837-4-liushixin2@huawei.com Signed-off-by: Liu Shixin <liushixin2@huawei.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Patrick Wang <patrick.wang.shcn@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Liu Shixin
|
6d4e2cda62 |
bootmem: use kmemleak_free_part_phys in put_page_bootmem
Patch series "Some bugfix about kmemleak", v3.
Some bugfixes for kmemleak and the printed info from debug mode.
This patch (of 7):
Since kmemleak_alloc_phys() rather than kmemleak_alloc() was called from
memblock_alloc_range_nid(), kmemleak_free_part_phys() should be used to
delete kmemleak object in put_page_bootmem(). In debug mode, there are
following warning:
kmemleak: Partially freeing unknown object at 0xffff97345aff7000 (size 4096)
Link: https://lkml.kernel.org/r/20231018102952.3339837-1-liushixin2@huawei.com
Link: https://lkml.kernel.org/r/20231018102952.3339837-2-liushixin2@huawei.com
Fixes:
|
||
Kefeng Wang
|
8f0f4788b1 |
mm: remove page_cpupid_xchg_last()
Since all calls use folio_xchg_last_cpupid(), remove page_cpupid_xchg_last(). Link: https://lkml.kernel.org/r/20231018140806.2783514-20-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
c2c3b51480 |
mm: use folio_xchg_last_cpupid() in wp_page_reuse()
Convert to use folio_xchg_last_cpupid() in wp_page_reuse(), and remove page variable. Since now only normal and PMD-mapped page is handled by numa balancing, it's enough to only update the entire folio's last cpupid. Link: https://lkml.kernel.org/r/20231018140806.2783514-19-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
a86bc96b77 |
mm: convert wp_page_reuse() and finish_mkwrite_fault() to take a folio
Saves one compound_head() call, also in preparation for page_cpupid_xchg_last() conversion. Link: https://lkml.kernel.org/r/20231018140806.2783514-18-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
c08b7e3830 |
mm: make finish_mkwrite_fault() static
Make finish_mkwrite_fault static since it is not used outside of memory.c. Link: https://lkml.kernel.org/r/20231018140806.2783514-17-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
c825301134 |
mm: huge_memory: use folio_xchg_last_cpupid() in __split_huge_page_tail()
Convert to use folio_xchg_last_cpupid() in __split_huge_page_tail(). Link: https://lkml.kernel.org/r/20231018140806.2783514-16-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
4e694fe4d2 |
mm: migrate: use folio_xchg_last_cpupid() in folio_migrate_flags()
Convert to use folio_xchg_last_cpupid() in folio_migrate_flags(), also directly use folio_nid() instead of page_to_nid(&folio->page). Link: https://lkml.kernel.org/r/20231018140806.2783514-15-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
d986ba2b19 |
mm: huge_memory: use a folio in change_huge_pmd()
Use a folio in change_huge_pmd(), which helps to remove last xchg_page_access_time() caller. Link: https://lkml.kernel.org/r/20231018140806.2783514-11-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
ec1778807a |
mm: mprotect: use a folio in change_pte_range()
Use a folio in change_pte_range() to save three compound_head() calls. Since now only normal and PMD-mapped page is handled by numa balancing, it is enough to only update the entire folio's access time. Link: https://lkml.kernel.org/r/20231018140806.2783514-10-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
19c1ac02ce |
mm: huge_memory: use folio_last_cpupid() in __split_huge_page_tail()
Convert to use folio_last_cpupid() in __split_huge_page_tail(). Link: https://lkml.kernel.org/r/20231018140806.2783514-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
c4a8d2faab |
mm: huge_memory: use folio_last_cpupid() in do_huge_pmd_numa_page()
Convert to use folio_last_cpupid() in do_huge_pmd_numa_page(). Link: https://lkml.kernel.org/r/20231018140806.2783514-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kefeng Wang
|
67b33e3ff5 |
mm: memory: use folio_last_cpupid() in do_numa_page()
Convert to use folio_last_cpupid() in do_numa_page(). Link: https://lkml.kernel.org/r/20231018140806.2783514-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Huang Ying <ying.huang@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Kairui Song
|
e5b306a082 |
mm/swap: avoid a xa load for swapout path
A variable is never used for swapout path (shadowp is NULL) and compiler
is unable to optimize out the unneeded load since it's a function call.
The was introduced by
|
||
Roman Gushchin
|
e56808fef8 |
mm: kmem: reimplement get_obj_cgroup_from_current()
Reimplement get_obj_cgroup_from_current() using current_obj_cgroup(). get_obj_cgroup_from_current() and current_obj_cgroup() share 80% of the code, so the new implementation is almost trivial. get_obj_cgroup_from_current() is a convenient function used by the bpf subsystem, so there is no reason to get rid of it completely. Link: https://lkml.kernel.org/r/20231019225346.1822282-7-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Shakeel Butt <shakeelb@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naresh Kamboju <naresh.kamboju@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Roman Gushchin
|
c63b835d0e |
percpu: scoped objcg protection
Similar to slab and kmem, switch to a scope-based protection of the objcg pointer to avoid. Link: https://lkml.kernel.org/r/20231019225346.1822282-6-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Roman Gushchin
|
e86828e544 |
mm: kmem: scoped objcg protection
Switch to a scope-based protection of the objcg pointer on slab/kmem allocation paths. Instead of using the get_() semantics in the pre-allocation hook and put the reference afterwards, let's rely on the fact that objcg is pinned by the scope. It's possible because: 1) if the objcg is received from the current task struct, the task is keeping a reference to the objcg. 2) if the objcg is received from an active memcg (remote charging), the memcg is pinned by the scope and has a reference to the corresponding objcg. Link: https://lkml.kernel.org/r/20231019225346.1822282-5-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Roman Gushchin
|
675d6c9b59 |
mm: kmem: make memcg keep a reference to the original objcg
Keep a reference to the original objcg object for the entire life of a memcg structure. This allows to simplify the synchronization on the kernel memory allocation paths: pinning a (live) memcg will also pin the corresponding objcg. The memory overhead of this change is minimal because object cgroups usually outlive their corresponding memory cgroups even without this change, so it's only an additional pointer per memcg. Link: https://lkml.kernel.org/r/20231019225346.1822282-4-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Roman Gushchin
|
1aacbd3543 |
mm: kmem: add direct objcg pointer to task_struct
To charge a freshly allocated kernel object to a memory cgroup, the kernel needs to obtain an objcg pointer. Currently it does it indirectly by obtaining the memcg pointer first and then calling to __get_obj_cgroup_from_memcg(). Usually tasks spend their entire life belonging to the same object cgroup. So it makes sense to save the objcg pointer on task_struct directly, so it can be obtained faster. It requires some work on fork, exit and cgroup migrate paths, but these paths are way colder. To avoid any costly synchronization the following rules are applied: 1) A task sets it's objcg pointer itself. 2) If a task is being migrated to another cgroup, the least significant bit of the objcg pointer is set atomically. 3) On the allocation path the objcg pointer is obtained locklessly using the READ_ONCE() macro and the least significant bit is checked. If it's set, the following procedure is used to update it locklessly: - task->objcg is zeroed using cmpxcg - new objcg pointer is obtained - task->objcg is updated using try_cmpxchg - operation is repeated if try_cmpxcg fails It guarantees that no updates will be lost if task migration is racing against objcg pointer update. It also allows to keep both read and write paths fully lockless. Because the task is keeping a reference to the objcg, it can't go away while the task is alive. This commit doesn't change the way the remote memcg charging works. Link: https://lkml.kernel.org/r/20231019225346.1822282-3-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Roman Gushchin
|
7d0715d0d6 |
mm: kmem: optimize get_obj_cgroup_from_current()
Patch series "mm: improve performance of accounted kernel memory allocations", v5. This patchset improves the performance of accounted kernel memory allocations by ~30% as measured by a micro-benchmark [1]. The benchmark is very straightforward: 1M of 64 bytes-large kmalloc() allocations. Below are results with the disabled kernel memory accounting, the original state and with this patchset applied. | | Kmem disabled | Original | Patched | Delta | |-------------+---------------+----------+---------+--------| | User cgroup | 29764 | 84548 | 59078 | -30.0% | | Root cgroup | 29742 | 48342 | 31501 | -34.8% | As we can see, the patchset removes the majority of the overhead when there is no actual accounting (a task belongs to the root memory cgroup) and almost halves the accounting overhead otherwise. The main idea is to get rid of unnecessary memcg to objcg conversions and switch to a scope-based protection of objcgs, which eliminates extra operations with objcg reference counters under a rcu read lock. More details are provided in individual commit descriptions. This patch (of 5): Manually inline memcg_kmem_bypass() and active_memcg() to speed up get_obj_cgroup_from_current() by avoiding duplicate in_task() checks and active_memcg() readings. Also add a likely() macro to __get_obj_cgroup_from_memcg(): obj_cgroup_tryget() should succeed at almost all times except a very unlikely race with the memcg deletion path. Link: https://lkml.kernel.org/r/20231019225346.1822282-1-roman.gushchin@linux.dev Link: https://lkml.kernel.org/r/20231019225346.1822282-2-roman.gushchin@linux.dev Signed-off-by: Roman Gushchin (Cruise) <roman.gushchin@linux.dev> Tested-by: Naresh Kamboju <naresh.kamboju@linaro.org> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
6ccdcb6d3a |
mm, pcp: reduce detecting time of consecutive high order page freeing
In current PCP auto-tuning design, if the number of pages allocated is much more than that of pages freed on a CPU, the PCP high may become the maximal value even if the allocating/freeing depth is small, for example, in the sender of network workloads. If a CPU was used as sender originally, then it is used as receiver after context switching, we need to fill the whole PCP with maximal high before triggering PCP draining for consecutive high order freeing. This will hurt the performance of some network workloads. To solve the issue, in this patch, we will track the consecutive page freeing with a counter in stead of relying on PCP draining. So, we can detect consecutive page freeing much earlier. On a 2-socket Intel server with 128 logical CPU, we tested SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes. With the patch, the network bandwidth improves 5.0%. This restores the performance drop caused by PCP auto-tuning. Link: https://lkml.kernel.org/r/20231016053002.756205-10-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
57c0419c5f |
mm, pcp: decrease PCP high if free pages < high watermark
One target of PCP is to minimize pages in PCP if the system free pages is too few. To reach that target, when page reclaiming is active for the zone (ZONE_RECLAIM_ACTIVE), we will stop increasing PCP high in allocating path, decrease PCP high and free some pages in freeing path. But this may be too late because the background page reclaiming may introduce latency for some workloads. So, in this patch, during page allocation we will detect whether the number of free pages of the zone is below high watermark. If so, we will stop increasing PCP high in allocating path, decrease PCP high and free some pages in freeing path. With this, we can reduce the possibility of the premature background page reclaiming caused by too large PCP. The high watermark checking is done in allocating path to reduce the overhead in hotter freeing path. Link: https://lkml.kernel.org/r/20231016053002.756205-9-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
51a755c56d |
mm: tune PCP high automatically
The target to tune PCP high automatically is as follows, - Minimize allocation/freeing from/to shared zone - Minimize idle pages in PCP - Minimize pages in PCP if the system free pages is too few To reach these target, a tuning algorithm as follows is designed, - When we refill PCP via allocating from the zone, increase PCP high. Because if we had larger PCP, we could avoid to allocate from the zone. - In periodic vmstat updating kworker (via refresh_cpu_vm_stats()), decrease PCP high to try to free possible idle PCP pages. - When page reclaiming is active for the zone, stop increasing PCP high in allocating path, decrease PCP high and free some pages in freeing path. So, the PCP high can be tuned to the page allocating/freeing depth of workloads eventually. One issue of the algorithm is that if the number of pages allocated is much more than that of pages freed on a CPU, the PCP high may become the maximal value even if the allocating/freeing depth is small. But this isn't a severe issue, because there are no idle pages in this case. One alternative choice is to increase PCP high when we drain PCP via trying to free pages to the zone, but don't increase PCP high during PCP refilling. This can avoid the issue above. But if the number of pages allocated is much less than that of pages freed on a CPU, there will be many idle pages in PCP and it is hard to free these idle pages. 1/8 (>> 3) of PCP high will be decreased periodically. The value 1/8 is kind of arbitrary. Just to make sure that the idle PCP pages will be freed eventually. On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroup. This simulates the kbuild server that is used by 0-Day kbuild service. With the patch, the build time decreases 3.5%. The cycles% of the spinlock contention (mostly for zone lock) decreases from 11.0% to 0.5%. The number of PCP draining for high order pages freeing (free_high) decreases 65.6%. The number of pages allocated from zone (instead of from PCP) decreases 83.9%. Link: https://lkml.kernel.org/r/20231016053002.756205-8-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Suggested-by: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
90b41691b9 |
mm: add framework for PCP high auto-tuning
The page allocation performance requirements of different workloads are usually different. So, we need to tune PCP (per-CPU pageset) high to optimize the workload page allocation performance. Now, we have a system wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP high by hand. But, it's hard to find out the best value by hand. And one global configuration may not work best for the different workloads that run on the same system. One solution to these issues is to tune PCP high of each CPU automatically. This patch adds the framework for PCP high auto-tuning. With it, pcp->high of each CPU will be changed automatically by tuning algorithm at runtime. The minimal high (pcp->high_min) is the original PCP high value calculated based on the low watermark pages. While the maximal high (pcp->high_max) is the PCP high value when percpu_pagelist_high_fraction sysctl knob is set to MIN_PERCPU_PAGELIST_HIGH_FRACTION. That is, the maximal pcp->high that can be set via sysctl knob by hand. It's possible that PCP high auto-tuning doesn't work well for some workloads. So, when PCP high is tuned by hand via the sysctl knob, the auto-tuning will be disabled. The PCP high set by hand will be used instead. This patch only adds the framework, so pcp->high will be set to pcp->high_min (original default) always. We will add actual auto-tuning algorithm in the following patches in the series. Link: https://lkml.kernel.org/r/20231016053002.756205-7-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
c0a242394c |
mm, page_alloc: scale the number of pages that are batch allocated
When a task is allocating a large number of order-0 pages, it may acquire the zone->lock multiple times allocating pages in batches. This may unnecessarily contend on the zone lock when allocating very large number of pages. This patch adapts the size of the batch based on the recent pattern to scale the batch size for subsequent allocations. On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances in parallel (each with `make -j 28`) in 8 cgroup. This simulates the kbuild server that is used by 0-Day kbuild service. With the patch, the cycles% of the spinlock contention (mostly for zone lock) decreases from 12.6% to 11.0% (with PCP size == 367). Link: https://lkml.kernel.org/r/20231016053002.756205-6-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: Arjan van de Ven <arjan@linux.intel.com> Cc: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Huang Ying
|
52166607ec |
mm: restrict the pcp batch scale factor to avoid too long latency
In page allocator, PCP (Per-CPU Pageset) is refilled and drained in
batches to increase page allocation throughput, reduce page
allocation/freeing latency per page, and reduce zone lock contention. But
too large batch size will cause too long maximal allocation/freeing
latency, which may punish arbitrary users. So the default batch size is
chosen carefully (in zone_batchsize(), the value is 63 for zone > 1GB) to
avoid that.
In commit
|
||
Huang Ying
|
362d37a106 |
mm, pcp: reduce lock contention for draining high-order pages
In commit
|
||
Huang Ying
|
ca71fe1ad9 |
mm, pcp: avoid to drain PCP when process exit
Patch series "mm: PCP high auto-tuning", v3.
The page allocation performance requirements of different workloads are
often different. So, we need to tune the PCP (Per-CPU Pageset) high on
each CPU automatically to optimize the page allocation performance.
The list of patches in series is as follows,
[1/9] mm, pcp: avoid to drain PCP when process exit
[2/9] cacheinfo: calculate per-CPU data cache size
[3/9] mm, pcp: reduce lock contention for draining high-order pages
[4/9] mm: restrict the pcp batch scale factor to avoid too long latency
[5/9] mm, page_alloc: scale the number of pages that are batch allocated
[6/9] mm: add framework for PCP high auto-tuning
[7/9] mm: tune PCP high automatically
[8/9] mm, pcp: decrease PCP high if free pages < high watermark
[9/9] mm, pcp: reduce detecting time of consecutive high order page freeing
Patch [1/9], [2/9], [3/9] optimize the PCP draining for consecutive
high-order pages freeing.
Patch [4/9], [5/9] optimize batch freeing and allocating.
Patch [6/9], [7/9], [8/9] implement and optimize a PCP high
auto-tuning method.
Patch [9/9] optimize the PCP draining for consecutive high order page
freeing based on PCP high auto-tuning.
The test results for patches with performance impact are as follows,
kbuild
======
On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild instances
in parallel (each with `make -j 28`) in 8 cgroup. This simulates the
kbuild server that is used by 0-Day kbuild service.
build time lock contend% free_high alloc_zone
---------- ---------- --------- ----------
base 100.0 14.0 100.0 100.0
patch1 99.5 12.8 19.5 95.6
patch3 99.4 12.6 7.1 95.6
patch5 98.6 11.0 8.1 97.1
patch7 95.1 0.5 2.8 15.6
patch9 95.0 1.0 8.8 20.0
The PCP draining optimization (patch [1/9], [3/9]) and PCP batch
allocation optimization (patch [5/9]) reduces zone lock contention a
little. The PCP high auto-tuning (patch [7/9], [9/9]) reduces build time
visibly. Where the tuning target: the number of pages allocated from zone
reduces greatly. So, the zone contention cycles% reduces greatly.
With PCP tuning patches (patch [7/9], [9/9]), the average used memory
during test increases up to 18.4% because more pages are cached in PCP.
But at the end of the test, the number of the used memory decreases to the
same level as that of the base patch. That is, the pages cached in PCP
will be released to zone after not being used actively.
netperf SCTP_STREAM_MANY
========================
On a 2-socket Intel server with 128 logical CPU, we tested
SCTP_STREAM_MANY test case of netperf test suite with 64-pair processes.
score lock contend% free_high alloc_zone cache miss rate%
----- ---------- --------- ---------- ----------------
base 100.0 2.1 100.0 100.0 1.3
patch1 99.4 2.1 99.4 99.4 1.3
patch3 106.4 1.3 13.3 106.3 1.3
patch5 106.0 1.2 13.2 105.9 1.3
patch7 103.4 1.9 6.7 90.3 7.6
patch9 108.6 1.3 13.7 108.6 1.3
The PCP draining optimization (patch [1/9]+[3/9]) improves performance.
The PCP high auto-tuning (patch [7/9]) reduces performance a little
because PCP draining cannot be triggered in time sometimes. So, the cache
miss rate% increases. The further PCP draining optimization (patch [9/9])
based on PCP tuning restore the performance.
lmbench3 UNIX (AF_UNIX)
=======================
On a 2-socket Intel server with 128 logical CPU, we tested UNIX
(AF_UNIX socket) test case of lmbench3 test suite with 16-pair
processes.
score lock contend% free_high alloc_zone cache miss rate%
----- ---------- --------- ---------- ----------------
base 100.0 51.4 100.0 100.0 0.2
patch1 116.8 46.1 69.5 104.3 0.2
patch3 199.1 21.3 7.0 104.9 0.2
patch5 200.0 20.8 7.1 106.9 0.3
patch7 191.6 19.9 6.8 103.8 2.8
patch9 193.4 21.7 7.0 104.7 2.1
The PCP draining optimization (patch [1/9], [3/9]) improves performance
much. The PCP tuning (patch [7/9]) reduces performance a little because
PCP draining cannot be triggered in time sometimes. The further PCP
draining optimization (patch [9/9]) based on PCP tuning restores the
performance partly.
The patchset adds several fields in struct per_cpu_pages. The struct
layout before/after the patchset is as follows,
base
====
struct per_cpu_pages {
spinlock_t lock; /* 0 4 */
int count; /* 4 4 */
int high; /* 8 4 */
int batch; /* 12 4 */
short int free_factor; /* 16 2 */
short int expire; /* 18 2 */
/* XXX 4 bytes hole, try to pack */
struct list_head lists[13]; /* 24 208 */
/* size: 256, cachelines: 4, members: 7 */
/* sum members: 228, holes: 1, sum holes: 4 */
/* padding: 24 */
} __attribute__((__aligned__(64)));
patched
=======
struct per_cpu_pages {
spinlock_t lock; /* 0 4 */
int count; /* 4 4 */
int high; /* 8 4 */
int high_min; /* 12 4 */
int high_max; /* 16 4 */
int batch; /* 20 4 */
u8 flags; /* 24 1 */
u8 alloc_factor; /* 25 1 */
u8 expire; /* 26 1 */
/* XXX 1 byte hole, try to pack */
short int free_count; /* 28 2 */
/* XXX 2 bytes hole, try to pack */
struct list_head lists[13]; /* 32 208 */
/* size: 256, cachelines: 4, members: 11 */
/* sum members: 237, holes: 2, sum holes: 3 */
/* padding: 16 */
} __attribute__((__aligned__(64)));
The size of the struct doesn't changed with the patchset.
This patch (of 9):
In commit
|
||
Kairui Song
|
1f4f7f0f88 |
mm/oom_killer: simplify OOM killer info dump helper
There is only one caller wants to dump the kill victim info, so just let it call the standalone helper, no need to make the generic info dump helper take an extra argument for that. Result of bloat-o-meter: ./scripts/bloat-o-meter ./mm/oom_kill.old.o ./mm/oom_kill.o add/remove: 0/0 grow/shrink: 1/2 up/down: 131/-142 (-11) Function old new delta oom_kill_process 412 543 +131 out_of_memory 1422 1418 -4 dump_header 562 424 -138 Total: Before=21514, After=21503, chg -0.05% Link: https://lkml.kernel.org/r/20231016113103.86477-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Pedro Falcato
|
09aec5f9b2 |
mm: kmsan: panic on failure to allocate early boot metadata
Given large enough allocations and a machine with low enough memory (i.e a default QEMU VM), it's entirely possible that kmsan_init_alloc_meta_for_range's shadow+origin allocation fails. Instead of eating a NULL deref kernel oops, check explicitly for memblock_alloc() failure and panic with a nice error message. Alexander Potapenko said: For posterity, it is generally quite important for the allocated shadow and origin to be contiguous, otherwise an unaligned memory write may result in memory corruption (the corresponding unaligned shadow write will be assuming that shadow pages are adjacent). So instead of panicking we could have split the range into smaller ones until the allocation succeeds, but that would've led to hard-to-debug problems in the future. Link: https://lkml.kernel.org/r/20231016153446.132763-1-pedro.falcato@gmail.com Signed-off-by: Pedro Falcato <pedro.falcato@gmail.com> Reviewed-by: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Matthew Wilcox (Oracle)
|
4093602d6b |
nilfs2: convert nilfs_copy_page() to nilfs_copy_folio()
Both callers already have a folio, so pass it in and use it directly. Removes a lot of hidden calls to compound_head(). Link: https://lkml.kernel.org/r/20231016201114.1928083-13-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Pankaj Raghav <p.raghav@samsung.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Usama Arif
|
c5ad3233ea |
hugetlb_vmemmap: use folio argument for hugetlb_vmemmap_* functions
Most function calls in hugetlb.c are made with folio arguments. This brings hugetlb_vmemmap calls inline with them by using folio instead of head struct page. Head struct page is still needed within these functions. The set/clear/test functions for hugepages are also changed to folio versions. Link: https://lkml.kernel.org/r/20231011144557.1720481-2-usama.arif@bytedance.com Signed-off-by: Usama Arif <usama.arif@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Punit Agrawal <punit.agrawal@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
c24f188b22 |
hugetlb: batch TLB flushes when restoring vmemmap
Update the internal hugetlb restore vmemmap code path such that TLB flushing can be batched. Use the existing mechanism of passing the VMEMMAP_REMAP_NO_TLB_FLUSH flag to indicate flushing should not be performed for individual pages. The routine hugetlb_vmemmap_restore_folios is the only user of this new mechanism, and it will perform a global flush after all vmemmap is restored. Link: https://lkml.kernel.org/r/20231019023113.345257-9-mike.kravetz@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Konrad Dybcio <konradybcio@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Usama Arif <usama.arif@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Joao Martins
|
f13b83fdd9 |
hugetlb: batch TLB flushes when freeing vmemmap
Now that a list of pages is deduplicated at once, the TLB flush can be batched for all vmemmap pages that got remapped. Expand the flags field value to pass whether to skip the TLB flush on remap of the PTE. The TLB flush is global as we don't have guarantees from caller that the set of folios is contiguous, or to add complexity in composing a list of kVAs to flush. Modified by Mike Kravetz to perform TLB flush on single folio if an error is encountered. Link: https://lkml.kernel.org/r/20231019023113.345257-8-mike.kravetz@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Konrad Dybcio <konradybcio@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Usama Arif <usama.arif@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Joao Martins
|
f4b7e3efad |
hugetlb: batch PMD split for bulk vmemmap dedup
In an effort to minimize amount of TLB flushes, batch all PMD splits belonging to a range of pages in order to perform only 1 (global) TLB flush. Add a flags field to the walker and pass whether it's a bulk allocation or just a single page to decide to remap. First value (VMEMMAP_SPLIT_NO_TLB_FLUSH) designates the request to not do the TLB flush when we split the PMD. Rebased and updated by Mike Kravetz Link: https://lkml.kernel.org/r/20231019023113.345257-7-mike.kravetz@oracle.com Signed-off-by: Joao Martins <joao.m.martins@oracle.com> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Konrad Dybcio <konradybcio@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Usama Arif <usama.arif@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
91f386bf07 |
hugetlb: batch freeing of vmemmap pages
Now that batching of hugetlb vmemmap optimization processing is possible, batch the freeing of vmemmap pages. When freeing vmemmap pages for a hugetlb page, we add them to a list that is freed after the entire batch has been processed. This enhances the ability to return contiguous ranges of memory to the low level allocators. Link: https://lkml.kernel.org/r/20231019023113.345257-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Konrad Dybcio <konradybcio@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Usama Arif <usama.arif@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
cfb8c75099 |
hugetlb: perform vmemmap restoration on a list of pages
The routine update_and_free_pages_bulk already performs vmemmap restoration on the list of hugetlb pages in a separate step. In preparation for more functionality to be added in this step, create a new routine hugetlb_vmemmap_restore_folios() that will restore vmemmap for a list of folios. This new routine must provide sufficient feedback about errors and actual restoration performed so that update_and_free_pages_bulk can perform optimally. Special care must be taken when encountering an error from hugetlb_vmemmap_restore_folios. We want to continue making as much forward progress as possible. A new routine bulk_vmemmap_restore_error handles this specific situation. Link: https://lkml.kernel.org/r/20231019023113.345257-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Konrad Dybcio <konradybcio@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Usama Arif <usama.arif@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
79359d6d24 |
hugetlb: perform vmemmap optimization on a list of pages
When adding hugetlb pages to the pool, we first create a list of the allocated pages before adding to the pool. Pass this list of pages to a new routine hugetlb_vmemmap_optimize_folios() for vmemmap optimization. Due to significant differences in vmemmmap initialization for bootmem allocated hugetlb pages, a new routine prep_and_add_bootmem_folios is created. We also modify the routine vmemmap_should_optimize() to check for pages that are already optimized. There are code paths that might request vmemmap optimization twice and we want to make sure this is not attempted. Link: https://lkml.kernel.org/r/20231019023113.345257-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Konrad Dybcio <konradybcio@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Usama Arif <usama.arif@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
d67e32f267 |
hugetlb: restructure pool allocations
Allocation of a hugetlb page for the hugetlb pool is done by the routine alloc_pool_huge_page. This routine will allocate contiguous pages from a low level allocator, prep the pages for usage as a hugetlb page and then add the resulting hugetlb page to the pool. In the 'prep' stage, optional vmemmap optimization is done. For performance reasons we want to perform vmemmap optimization on multiple hugetlb pages at once. To do this, restructure the hugetlb pool allocation code such that vmemmap optimization can be isolated and later batched. The code to allocate hugetlb pages from bootmem was also modified to allow batching. No functional changes, only code restructure. Link: https://lkml.kernel.org/r/20231019023113.345257-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Tested-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <21cnbao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: James Houghton <jthoughton@google.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Konrad Dybcio <konradybcio@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Usama Arif <usama.arif@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Mike Kravetz
|
d2cf88c27f |
hugetlb: optimize update_and_free_pages_bulk to avoid lock cycles
Patch series "Batch hugetlb vmemmap modification operations", v8.
When hugetlb vmemmap optimization was introduced, the overhead of enabling
the option was measured as described in commit
|
||
Huang Ying
|
fa8c4f9a66 |
mm: fix draining remote pageset
If there is no memory allocation/freeing in the PCP (Per-CPU Pageset) of a remote zone (zone in remote NUMA node) after some time (3 seconds for now), the pages of the PCP of the remote zone will be drained to avoid memory wastage. This behavior was introduced in the commit |
||
Linus Torvalds
|
4f82870119 |
20 hotfixes. 12 are cc:stable and the remainder address post-6.5 issues
or aren't considered necessary for earlier kernel versions. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZTfz/QAKCRDdBJ7gKXxA joMyAP99hLaLYeJbjlf+4tLJzhlpbVoFra1ieun2D+ZgFE78xQD/T4T3PYrZhYqD WdrxGT9fiKOykXM5pmQRH9Zr4EvJBA0= =Obbk -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "20 hotfixes. 12 are cc:stable and the remainder address post-6.5 issues or aren't considered necessary for earlier kernel versions" * tag 'mm-hotfixes-stable-2023-10-24-09-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: maple_tree: add GFP_KERNEL to allocations in mas_expected_entries() selftests/mm: include mman header to access MREMAP_DONTUNMAP identifier mailmap: correct email aliasing for Oleksij Rempel mailmap: map Bartosz's old address to the current one mm/damon/sysfs: check DAMOS regions update progress from before_terminate() MAINTAINERS: Ondrej has moved kasan: disable kasan_non_canonical_hook() for HW tags kasan: print the original fault addr when access invalid shadow hugetlbfs: close race between MADV_DONTNEED and page fault hugetlbfs: extend hugetlb_vma_lock to private VMAs hugetlbfs: clear resv_map pointer if mmap fails mm: zswap: fix pool refcount bug around shrink_worker() mm/migrate: fix do_pages_move for compat pointers riscv: fix set_huge_pte_at() for NAPOT mappings when a swap entry is set riscv: handle VM_FAULT_[HWPOISON|HWPOISON_LARGE] faults instead of panicking mmap: fix error paths with dup_anon_vma() mmap: fix vma_iterator in error path of vma_merge() mm: fix vm_brk_flags() to not bail out while holding lock mm/mempolicy: fix set_mempolicy_home_node() previous VMA pointer mm/page_alloc: correct start page when guard page debug is enabled |
||
Ingo Molnar
|
4e5b65a22b |
Linux 6.6-rc7
-----BEGIN PGP SIGNATURE----- iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmU1ngkeHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGrsIH/0k/+gdBBYFFdEym foRhKir9WV3ZX4oIozJjA1f7T+qVYclKs6kaYm3gNepRBb6AoG8pdgv4MMAqhYsf QMe2XHi0MrO/qKBgfNfivxEa9jq+0QK5uvTbqCRqCAB8LfwVyDqapCmg3EuiZcPW UbMITmnwLIfXgPxvp9rabmCsTqO6FLbf0GDOVIkNSAIDBXMpcO1iffjrWUbhRa7n oIoiJmWJLcXLxPWDsRKbpJwzw2cIG08YhfQYAiQnC3YaeRm1FKLDIICRBsmfYzja rWv9r4dn4TDfV4/AnjggQnsZvz2yPCxNaFSQIT88nIeiLvyuUTJ9j8aidsSfMZQf xZAbzbA= =NoQv -----END PGP SIGNATURE----- Merge tag 'v6.6-rc7' into sched/core, to pick up fixes Pick up recent sched/urgent fixes merged upstream. Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Hou Tao
|
b460bc8302 |
mm/percpu.c: introduce pcpu_alloc_size()
Introduce pcpu_alloc_size() to get the size of the dynamic per-cpu area. It will be used by bpf memory allocator in the following patches. BPF memory allocator maintains per-cpu area caches for multiple area sizes and its free API only has the to-be-freed per-cpu pointer, so it needs the size of dynamic per-cpu area to select the corresponding cache when bpf program frees the dynamic per-cpu pointer. Acked-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Hou Tao
|
394e6869f0 |
mm/percpu.c: don't acquire pcpu_lock for pcpu_chunk_addr_search()
There is no need to acquire pcpu_lock for pcpu_chunk_addr_search(): 1) both pcpu_first_chunk & pcpu_reserved_chunk must have been initialized before the invocation of free_percpu(). 2) The dynamically-created chunk must be valid before the per-cpu pointers allocated from it are freed. So acquire pcpu_lock() after the invocation of pcpu_chunk_addr_search(). Acked-by: Dennis Zhou <dennis@kernel.org> Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20231020133202.4043247-2-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Jakub Kicinski
|
041c3466f3 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR. net/mac80211/key.c |
||
Reuben Hawkins
|
7116c0af4b
|
vfs: fix readahead(2) on block devices
Readahead was factored to call generic_fadvise. That refactor added an
S_ISREG restriction which broke readahead on block devices.
In addition to S_ISREG, this change checks S_ISBLK to fix block device
readahead. There is no change in behavior with any file type besides block
devices in this change.
Fixes:
|
||
Alexey Dobriyan
|
68279f9c9f |
treewide: mark stuff as __ro_after_init
__read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1. Also, mark some stuff as "const" and "__init" while I'm at it. [akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning] [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183 Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
158978945f |
mm: perform the mapping_map_writable() check after call_mmap()
In order for a F_SEAL_WRITE sealed memfd mapping to have an opportunity to clear VM_MAYWRITE, we must be able to invoke the appropriate vm_ops->mmap() handler to do so. We would otherwise fail the mapping_map_writable() check before we had the opportunity to avoid it. This patch moves this check after the call_mmap() invocation. Only memfd actively denies write access causing a potential failure here (in memfd_add_seals()), so there should be no impact on non-memfd cases. This patch makes the userland-visible change that MAP_SHARED, PROT_READ mappings of an F_SEAL_WRITE sealed memfd mapping will now succeed. There is a delicate situation with cleanup paths assuming that a writable mapping must have occurred in circumstances where it may now not have. In order to ensure we do not accidentally mark a writable file unwritable by mistake, we explicitly track whether we have a writable mapping and unmap only if we do. [lstoakes@gmail.com: do not set writable_file_mapping in inappropriate case] Link: https://lkml.kernel.org/r/c9eb4cc6-7db4-4c2b-838d-43a0b319a4f0@lucifer.local Link: https://bugzilla.kernel.org/show_bug.cgi?id=217238 Link: https://lkml.kernel.org/r/55e413d20678a1bb4c7cce889062bbb07b0df892.1697116581.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
28464bbb2d |
mm: update memfd seal write check to include F_SEAL_WRITE
The seal_check_future_write() function is called by shmem_mmap() or hugetlbfs_file_mmap() to disallow any future writable mappings of an memfd sealed this way. The F_SEAL_WRITE flag is not checked here, as that is handled via the mapping->i_mmap_writable mechanism and so any attempt at a mapping would fail before this could be run. However we intend to change this, meaning this check can be performed for F_SEAL_WRITE mappings also. The logic here is equally applicable to both flags, so update this function to accommodate both and rename it accordingly. Link: https://lkml.kernel.org/r/913628168ce6cce77df7d13a63970bae06a526e0.1697116581.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
e8e17ee90e |
mm: drop the assumption that VM_SHARED always implies writable
Patch series "permit write-sealed memfd read-only shared mappings", v4. The man page for fcntl() describing memfd file seals states the following about F_SEAL_WRITE:- Furthermore, trying to create new shared, writable memory-mappings via mmap(2) will also fail with EPERM. With emphasis on 'writable'. In turns out in fact that currently the kernel simply disallows all new shared memory mappings for a memfd with F_SEAL_WRITE applied, rendering this documentation inaccurate. This matters because users are therefore unable to obtain a shared mapping to a memfd after write sealing altogether, which limits their usefulness. This was reported in the discussion thread [1] originating from a bug report [2]. This is a product of both using the struct address_space->i_mmap_writable atomic counter to determine whether writing may be permitted, and the kernel adjusting this counter when any VM_SHARED mapping is performed and more generally implicitly assuming VM_SHARED implies writable. It seems sensible that we should only update this mapping if VM_MAYWRITE is specified, i.e. whether it is possible that this mapping could at any point be written to. If we do so then all we need to do to permit write seals to function as documented is to clear VM_MAYWRITE when mapping read-only. It turns out this functionality already exists for F_SEAL_FUTURE_WRITE - we can therefore simply adapt this logic to do the same for F_SEAL_WRITE. We then hit a chicken and egg situation in mmap_region() where the check for VM_MAYWRITE occurs before we are able to clear this flag. To work around this, perform this check after we invoke call_mmap(), with careful consideration of error paths. Thanks to Andy Lutomirski for the suggestion! [1]:https://lore.kernel.org/all/20230324133646.16101dfa666f253c4715d965@linux-foundation.org/ [2]:https://bugzilla.kernel.org/show_bug.cgi?id=217238 This patch (of 3): There is a general assumption that VMAs with the VM_SHARED flag set are writable. If the VM_MAYWRITE flag is not set, then this is simply not the case. Update those checks which affect the struct address_space->i_mmap_writable field to explicitly test for this by introducing [vma_]is_shared_maywrite() helper functions. This remains entirely conservative, as the lack of VM_MAYWRITE guarantees that the VMA cannot be written to. Link: https://lkml.kernel.org/r/cover.1697116581.git.lstoakes@gmail.com Link: https://lkml.kernel.org/r/d978aefefa83ec42d18dfa964ad180dbcde34795.1697116581.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Suggested-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
76126332c7 |
mm/damon/sysfs: avoid empty scheme tried regions for large apply interval
DAMON_SYSFS assumes all schemes will be applied for at least one DAMON monitoring results snapshot within one aggregation interval, or makes no sense to wait for it while DAMON is deactivated by the watermarks. That for deactivated status still makes sense, but the aggregation interval based assumption is invalid now because each scheme can has its own apply interval. For schemes having larger than the aggregation or watermarks check interval, DAMOS tried regions update request can be finished without the update. Avoid the case by explicitly checking the status of the schemes tried regions update and watermarks based DAMON deactivation. Link: https://lkml.kernel.org/r/20231012192256.33556-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
SeongJae Park
|
4d4e41b682 |
mm/damon/sysfs-schemes: do not update tried regions more than one DAMON snapshot
Patch series "mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval". DAMOS tried regions update feature of DAMON sysfs interface is doing the update for one aggregation interval after the request is made. Since the per-scheme apply interval is supported, that behavior makes no much sense. That is, the tried regions directory will have regions from multiple DAMON monitoring results snapshots, or no region for apply intervals that much shorter than, or longer than the aggregation interval, respectively. Update the behavior to update the regions for each scheme for only its apply interval, and update the document. Since DAMOS apply interval is the aggregation by default, this change makes no visible behavioral difference to old users who don't explicitly set the apply intervals. Patches Sequence ---------------- The first two patches makes schemes of apply intervals that much shorter or longer than the aggregation interval to keep the maximum and minimum times for continuing the update. After the two patches, the update aligns with the each scheme's apply interval. Finally, the third patch updates the document to reflect the behavior. This patch (of 3): DAMON_SYSFS exposes every DAMON-found region that eligible for applying the scheme action for one aggregation interval. However, each DAMON-based operation scheme has its own apply interval. Hence, for a scheme that having its apply interval much smaller than the aggregation interval, DAMON_SYSFS will expose the scheme regions that applied to more than one DAMON monitoring results snapshots. Since the purpose of DAMON tried regions is exposing single snapshot, this makes no much sense. Track progress of each scheme's tried regions update and avoid the case. Link: https://lkml.kernel.org/r/20231012192256.33556-1-sj@kernel.org Link: https://lkml.kernel.org/r/20231012192256.33556-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Audra Mitchell
|
b459f0905e |
mm/page_owner: remove free_ts from page_owner output
Patch series "Fix page_owner's use of free timestamps".
While page ower output is used to investigate memory utilization,
typically the allocation pathway, the introduction of timestamps to the
page owner records caused each record to become unique due to the
granularity of the nanosecond timestamp (for example):
Page allocated via order 0 ... ts 5206196026 ns, free_ts 5187156703 ns
Page allocated via order 0 ... ts 5206198540 ns, free_ts 5187162702 ns
Furthermore, the page_owner output only dumps the currently allocated
records, so having the free timestamps is nonsensical for the typical use
case.
In addition, the introduction of timestamps was not properly handled in
the page_owner_sort tool causing most use cases to be broken. This series
is meant to remove the free timestamps from the page_owner output and fix
the page_owner_sort tool so proper collation can occur.
This patch (of 5):
When printing page_owner data via the sysfs interface, no free pages will
ever be dumped due to the series of checks in read_page_owner():
/*
* Although we do have the info about past allocation of free
* pages, it's not relevant for current memory usage.
*/
if (!test_bit(PAGE_EXT_OWNER_ALLOCATED, &page_ext->flags))
The free_ts values are still used when dump_page_owner() is called, so
keeping the field for other use cases but removing them for the typical
page_owner case.
Link: https://lkml.kernel.org/r/20231013190350.579407-1-audra@redhat.com
Link: https://lkml.kernel.org/r/20231013190350.579407-2-audra@redhat.com
Fixes:
|
||
Lorenzo Stoakes
|
93bf5d4aa2 |
mm: abstract VMA merge and extend into vma_merge_extend() helper
mremap uses vma_merge() in the case where a VMA needs to be extended. This can be significantly simplified and abstracted. This makes it far easier to understand what the actual function is doing, avoids future mistakes in use of the confusing vma_merge() function and importantly allows us to make future changes to how vma_merge() is implemented by knowing explicitly which merge cases each invocation uses. Note that in the mremap() extend case, we perform this merge only when old_len == vma->vm_end - addr. The extension_start, i.e. the start of the extended portion of the VMA is equal to addr + old_len, i.e. vma->vm_end. With this refactoring, vma_merge() is no longer required anywhere except mm/mmap.c, so mark it static. Link: https://lkml.kernel.org/r/f16cbdc2e72d37a1a097c39dc7d1fee8919a1c93.1697043508.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
4b5f2d2016 |
mm: abstract merge for new VMAs into vma_merge_new_vma()
Only in mmap_region() and copy_vma() do we attempt to merge VMAs which occupy entirely new regions of virtual memory. We can abstract this logic and make the intent of this invocations of it completely explicit, rather than invoking vma_merge() with an inscrutable wall of parameters. This also paves the way for a simplification of the core vma_merge() implementation, as we seek to make it entirely an implementation detail. The VMA merge call in mmap_region() occurs only for file-backed mappings, where each of the parameters previously specified as NULL are defaulted to NULL in vma_init() (called by vm_area_alloc()). This matches the previous behaviour of specifying NULL for a number of fields, however note that prior to this call we pass the VMA to the file system driver via call_mmap(), which may in theory adjust fields that we pass in to vma_merge_new_vma(). Therefore we actually resolve an oversight here by allowing for the fact that the driver may have done this. Link: https://lkml.kernel.org/r/3dc71d17e307756a54781d4a4ce7315cf8b18bea.1697043508.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
||
Lorenzo Stoakes
|
adb20b0c78 |
mm: make vma_merge() and split_vma() internal
Now the common pattern of - attempting a merge via vma_merge() and should this fail splitting VMAs via split_vma() - has been abstracted, the former can be placed into mm/internal.h and the latter made static. In addition, the split_vma() nommu variant also need not be exported. Link: https://lkml.kernel.org/r/405f2be10e20c4e9fbcc9fe6b2dfea105f6642e0.1697043508.git.lstoakes@gmail.com Signed-off-by: Lorenzo Stoakes <lstoakes@gmail.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |