linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-27 14:41:39 +00:00

Author	SHA1	Message	Date
Jingyu Wang	5da1a8687a	mm/gup.c: fix typo in comments Link: https://lkml.kernel.org/r/20230309104813.170309-1-jingyuwang_vip@163.com Signed-off-by: Jingyu Wang <jingyuwang_vip@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:14 -07:00
Danilo Krummrich	5c63a7c32a	maple_tree: export symbol mas_preallocate() Fix missing EXPORT_SYMBOL_GPL() statement for mas_preallocate(). It isn't actually used by anything yet, but mas_preallocate() is part of the maple tree's 'Advanced API'. All other functions of this API are exported already. Link: https://lkml.kernel.org/r/20230302011035.4928-1-dakr@redhat.com Signed-off-by: Danilo Krummrich <dakr@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:14 -07:00
Christoph Hellwig	452a8f4072	mm,jfs: move write_one_page/folio_write_one to jfs The last remaining user of folio_write_one through the write_one_page wrapper is jfs, so move the functionality there and hard code the call to metapage_writepage. Note that the use of the pagecache by the JFS 'metapage' buffer cache is a bit odd, and we could probably do without VM-level dirty tracking at all, but that's a change for another time. Link: https://lkml.kernel.org/r/20230307143125.27778-4-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Gang He <ghe@suse.com> Cc: Jan Kara <jack@suse.cz> Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:14 -07:00
Christoph Hellwig	a0d50b11bf	ocfs2: don't use write_one_page in ocfs2_duplicate_clusters_by_page Use filemap_write_and_wait_range to write back the range of the dirty page instead of write_one_page in preparation of removing write_one_page and eventually ->writepage. Link: https://lkml.kernel.org/r/20230307143125.27778-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Gang He <ghe@suse.com> Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:14 -07:00
Christoph Hellwig	8b8d9a2d32	ufs: don't flush page immediately for DIRSYNC directories Patch series "remove most callers of write_one_page", v4. This series removes most users of the write_one_page API. These helpers internally call ->writepage which we are gradually removing from the kernel. This patch (of 3): We do not need to writeout modified directory blocks immediately when modifying them while the page is locked. It is enough to do the flush somewhat later which has the added benefit that inode times can be flushed as well. It also allows us to stop depending on write_one_page() function. Ported from an ext2 patch by Jan Kara. Link: https://lkml.kernel.org/r/20230307143125.27778-1-hch@lst.de Link: https://lkml.kernel.org/r/20230307143125.27778-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Dave Kleikamp <dave.kleikamp@oracle.com> Cc: Evgeniy Dushistov <dushistov@mail.ru> Cc: Jan Kara via Ocfs2-devel <ocfs2-devel@oss.oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Gang He <ghe@suse.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Jan Kara <jack@suse.cz> Cc: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:14 -07:00
Alexander Potapenko	6204c9ab4a	kmsan: add test_stackdepot_roundtrip Ensure that KMSAN does not report false positives in instrumented callers of stack_depot_save(), stack_depot_print(), and stack_depot_fetch(). Link: https://lkml.kernel.org/r/20230306111322.205724-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Marco Elver <elver@google.com> Cc: syzbot <syzkaller@googlegroups.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:14 -07:00
Alexander Potapenko	8e00b2dffd	lib/stackdepot: kmsan: mark API outputs as initialized KMSAN does not instrument stackdepot and may treat memory allocated by it as uninitialized. This is not a problem for KMSAN itself, because its functions calling stackdepot API are also not instrumented. But other kernel features (e.g. netdev tracker) may access stack depot from instrumented code, which will lead to false positives, unless we explicitly mark stackdepot outputs as initialized. Link: https://lkml.kernel.org/r/20230306111322.205724-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Suggested-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:13 -07:00
Yue Zhao	2178e20c24	mm, memcg: Prevent memory.soft_limit_in_bytes load/store tearing The knob for cgroup v1 memory controller: memory.soft_limit_in_bytes is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ\|WRITE]_ONCE to prevent compiler from doing anything funky. The access of memcg->soft_limit is lockless, so it can be concurrently set at the same time as we are trying to read it. All occurrences of memcg->soft_limit are updated with [READ\|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-5-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-5-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:13 -07:00
Yue Zhao	17c56de6a8	mm, memcg: Prevent memory.oom_control load/store tearing The knob for cgroup v1 memory controller: memory.oom_control is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ\|WRITE]_ONCE to prevent compiler from doing anything funky. The access of memcg->oom_kill_disable is lockless, so it can be concurrently set at the same time as we are trying to read it. All occurrences of memcg->oom_kill_disable are updated with [READ\|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-4-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.377-4-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:13 -07:00
Yue Zhao	82b3aa2681	mm, memcg: Prevent memory.swappiness load/store tearing The knob for cgroup v1 memory controller: memory.swappiness is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely. But it is better to use [READ\|WRITE]_ONCE to prevent compiler from doing anything funky. The access of memcg->swappiness and vm_swappiness is lockless, so both of them can be concurrently set at the same time as we are trying to read them. All occurrences of memcg->swappiness and vm_swappiness are updated with [READ\|WRITE]_ONCE. [findns94@gmail.com: v3] Link: https://lkml.kernel.org/r/20230308162555.14195-3-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-3-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:13 -07:00
Yue Zhao	eaf7b66b76	mm, memcg: Prevent memory.oom.group load/store tearing Patch series "mm, memcg: cgroup v1 and v2 tunable load/store tearing fixes", v2. This patch series helps to prevent load/store tearing in several cgroup knobs. As kindly pointed out by Michal Hocko and Roman Gushchin , the changelog has been rephrased. Besides, more knobs were checked, according to kind suggestions from Shakeel Butt and Muchun Song. This patch (of 4): The knob for cgroup v2 memory controller: memory.oom.group is not protected by any locking so it can be modified while it is used. This is not an actual problem because races are unlikely (the knob is usually configured long before any workloads hits actual memcg oom) but it is better to use READ_ONCE/WRITE_ONCE to prevent compiler from doing anything funky. The access of memcg->oom_group is lockless, so it can be concurrently set at the same time as we are trying to read it. Link: https://lkml.kernel.org/r/20230306154138.3775-1-findns94@gmail.com Link: https://lkml.kernel.org/r/20230306154138.3775-2-findns94@gmail.com Signed-off-by: Yue Zhao <findns94@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tang Yizhou <tangyeechou@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:13 -07:00
Zi Yan	dd63bd7df4	selftests/mm: fix split huge page tests Fix two inputs to check_anon_huge() and one if condition, so the tests work as expected. Link: https://lkml.kernel.org/r/20230306160907.16804-1-zi.yan@sent.com Fixes: `c07c343cda` ("selftests/vm: dedup THP helpers") Signed-off-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Tested-by: Zach O'Keefe <zokeefe@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:13 -07:00
Gerald Schaefer	99c2913363	mm: add PTE pointer parameter to flush_tlb_fix_spurious_fault() s390 can do more fine-grained handling of spurious TLB protection faults, when there also is the PTE pointer available. Therefore, pass on the PTE pointer to flush_tlb_fix_spurious_fault() as an additional parameter. This will add no functional change to other architectures, but those with private flush_tlb_fix_spurious_fault() implementations need to be made aware of the new parameter. Link: https://lkml.kernel.org/r/20230306161548.661740-1-gerald.schaefer@linux.ibm.com Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> [arm64] Acked-by: Michael Ellerman <mpe@ellerman.id.au> [powerpc] Acked-by: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:12 -07:00
Sergey Senozhatsky	e1807d5d27	zsmalloc: show per fullness group class stats We keep the old fullness (3/4 threshold) reporting in zs_stats_size_show(). Switch from allmost full/empty stats to fine-grained per inuse ratio (fullness group) reporting, which gives signicantly more data on classes fragmentation. Link: https://lkml.kernel.org/r/20230304034835.2082479-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:12 -07:00
Sergey Senozhatsky	5a845e9f2d	zsmalloc: rework compaction algorithm The zsmalloc compaction algorithm has the potential to waste some CPU cycles, particularly when compacting pages within the same fullness group. This is due to the way it selects the head page of the fullness list for source and destination pages, and how it reinserts those pages during each iteration. The algorithm may first use a page as a migration destination and then as a migration source, leading to an unnecessary back-and-forth movement of objects. Consider the following fullness list: PageA PageB PageC PageD PageE During the first iteration, the compaction algorithm will select PageA as the source and PageB as the destination. All of PageA's objects will be moved to PageB, and then PageA will be released while PageB is reinserted into the fullness list. PageB PageC PageD PageE During the next iteration, the compaction algorithm will again select the head of the list as the source and destination, meaning that PageB will now serve as the source and PageC as the destination. This will result in the objects being moved away from PageB, the same objects that were just moved to PageB in the previous iteration. To prevent this avalanche effect, the compaction algorithm should not reinsert the destination page between iterations. By doing so, the most optimal page will continue to be used and its usage ratio will increase, reducing internal fragmentation. The destination page should only be reinserted into the fullness list if: - It becomes full - No source page is available. TEST ==== It's very challenging to reliably test this series. I ended up developing my own synthetic test that has 100% reproducibility. The test generates significan fragmentation (for each size class) and then performs compaction for each class individually and tracks the number of memcpy() in zs_object_copy(), so that we can compare the amount work compaction does on per-class basis. Total amount of work (zram mm_stat objs_moved) ---------------------------------------------- Old fullness grouping, old compaction algorithm: 323977 memcpy() in zs_object_copy(). Old fullness grouping, new compaction algorithm: 262944 memcpy() in zs_object_copy(). New fullness grouping, new compaction algorithm: 213978 memcpy() in zs_object_copy(). Per-class compaction memcpy() comparison (T-test) ------------------------------------------------- x Old fullness grouping, old compaction algorithm + Old fullness grouping, new compaction algorithm N Min Max Median Avg Stddev x 140 349 3513 2461 2314.1214 806.03271 + 140 289 2778 2006 1878.1714 641.02073 Difference at 95.0% confidence -435.95 +/- 170.595 -18.8387% +/- 7.37193% (Student's t, pooled s = 728.216) x Old fullness grouping, old compaction algorithm + New fullness grouping, new compaction algorithm N Min Max Median Avg Stddev x 140 349 3513 2461 2314.1214 806.03271 + 140 226 2279 1644 1528.4143 524.85268 Difference at 95.0% confidence -785.707 +/- 159.331 -33.9527% +/- 6.88516% (Student's t, pooled s = 680.132) Link: https://lkml.kernel.org/r/20230304034835.2082479-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:12 -07:00
Sergey Senozhatsky	4c7ac97285	zsmalloc: fine-grained inuse ratio based fullness grouping Each zspage maintains ->inuse counter which keeps track of the number of objects stored in the zspage. The ->inuse counter also determines the zspage's "fullness group" which is calculated as the ratio of the "inuse" objects to the total number of objects the zspage can hold (objs_per_zspage). The closer the ->inuse counter is to objs_per_zspage, the better. Each size class maintains several fullness lists, that keep track of zspages of particular "fullness". Pages within each fullness list are stored in random order with regard to the ->inuse counter. This is because sorting the zspages by ->inuse counter each time obj_malloc() or obj_free() is called would be too expensive. However, the ->inuse counter is still a crucial factor in many situations. For the two major zsmalloc operations, zs_malloc() and zs_compact(), we typically select the head zspage from the corresponding fullness list as the best candidate zspage. However, this assumption is not always accurate. For the zs_malloc() operation, the optimal candidate zspage should have the highest ->inuse counter. This is because the goal is to maximize the number of ZS_FULL zspages and make full use of all allocated memory. For the zs_compact() operation, the optimal source zspage should have the lowest ->inuse counter. This is because compaction needs to move objects in use to another page before it can release the zspage and return its physical pages to the buddy allocator. The fewer objects in use, the quicker compaction can release the zspage. Additionally, compaction is measured by the number of pages it releases. This patch reworks the fullness grouping mechanism. Instead of having two groups - ZS_ALMOST_EMPTY (usage ratio below 3/4) and ZS_ALMOST_FULL (usage ration above 3/4) - that result in too many zspages being included in the ALMOST_EMPTY group for specific classes, size classes maintain a larger number of fullness lists that give strict guarantees on the minimum and maximum ->inuse values within each group. Each group represents a 10% change in the ->inuse ratio compared to neighboring groups. In essence, there are groups for zspages with 0%, 10%, 20% usage ratios, and so on, up to 100%. This enhances the selection of candidate zspages for both zs_malloc() and zs_compact(). A printout of the ->inuse counters of the first 7 zspages per (random) class fullness group: class-768 objs_per_zspage 16: fullness 100%: empty fullness 99%: empty fullness 90%: empty fullness 80%: empty fullness 70%: empty fullness 60%: 8 8 9 9 8 8 8 fullness 50%: empty fullness 40%: 5 5 6 5 5 5 5 fullness 30%: 4 4 4 4 4 4 4 fullness 20%: 2 3 2 3 3 2 2 fullness 10%: 1 1 1 1 1 1 1 fullness 0%: empty The zs_malloc() function searches through the groups of pages starting with the one having the highest usage ratio. This means that it always selects a zspage from the group with the least internal fragmentation (highest usage ratio) and makes it even less fragmented by increasing its usage ratio. The zs_compact() function, on the other hand, begins by scanning the group with the highest fragmentation (lowest usage ratio) to locate the source page. The first available zspage is selected, and then the function moves downward to find a destination zspage in the group with the lowest internal fragmentation (highest usage ratio). Link: https://lkml.kernel.org/r/20230304034835.2082479-3-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:12 -07:00
Sergey Senozhatsky	a40a71e834	zsmalloc: remove insert_zspage() ->inuse optimization Patch series "zsmalloc: fine-grained fullness and new compaction algorithm", v4. Existing zsmalloc page fullness grouping leads to suboptimal page selection for both zs_malloc() and zs_compact(). This patchset reworks zsmalloc fullness grouping/classification. Additinally it also implements new compaction algorithm that is expected to use less CPU-cycles (as it potentially does fewer memcpy-s in zs_object_copy()). Test (synthetic) results can be seen in patch 0003. This patch (of 4): This optimization has no effect. It only ensures that when a zspage was added to its corresponding fullness list, its "inuse" counter was higher or lower than the "inuse" counter of the zspage at the head of the list. The intention was to keep busy zspages at the head, so they could be filled up and moved to the ZS_FULL fullness group more quickly. However, this doesn't work as the "inuse" counter of a zspage can be modified by obj_free() but the zspage may still belong to the same fullness list. So, fix_fullness_group() won't change the zspage's position in relation to the head's "inuse" counter, leading to a largely random order of zspages within the fullness list. For instance, consider a printout of the "inuse" counters of the first 10 zspages in a class that holds 93 objects per zspage: ZS_ALMOST_EMPTY: 36 67 68 64 35 54 63 52 As we can see the zspage with the lowest "inuse" counter is actually the head of the fullness list. Remove this pointless "optimisation". Link: https://lkml.kernel.org/r/20230304034835.2082479-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20230304034835.2082479-2-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:12 -07:00
Jaewon Kim	3ccefdea22	dma-buf: system_heap: avoid reclaim for order 4 Using order 4 pages would be helpful for IOMMUs mapping, but trying to get order 4 pages could spend quite much time in the page allocation. From the perspective of responsiveness, the deterministic memory allocation speed, I think, is quite important. The order 4 allocation with __GFP_RECLAIM may spend much time in reclaim and compation logic. __GFP_NORETRY also may affect. These cause unpredictable delay. To get reasonable allocation speed from dma-buf system heap, use HIGH_ORDER_GFP for order 4 to avoid reclaim. And let me remove meaningless __GFP_COMP for order 0. According to my tests, order 4 with MID_ORDER_GFP could get more number of order 4 pages but the elapsed times could be very slow. time order 8 order 4 order 0 584 usec 0 160 0 28,428 usec 0 160 0 100,701 usec 0 160 0 76,645 usec 0 160 0 25,522 usec 0 160 0 38,798 usec 0 160 0 89,012 usec 0 160 0 23,015 usec 0 160 0 73,360 usec 0 160 0 76,953 usec 0 160 0 31,492 usec 0 160 0 75,889 usec 0 160 0 84,551 usec 0 160 0 84,352 usec 0 160 0 57,103 usec 0 160 0 93,452 usec 0 160 0 If HIGH_ORDER_GFP is used for order 4, the number of order 4 could be decreased but the elapsed time results were quite stable and fast enough. time order 8 order 4 order 0 1,356 usec 0 155 80 1,901 usec 0 11 2384 1,912 usec 0 0 2560 1,911 usec 0 0 2560 1,884 usec 0 0 2560 1,577 usec 0 0 2560 1,366 usec 0 0 2560 1,711 usec 0 0 2560 1,635 usec 0 28 2112 544 usec 10 0 0 633 usec 2 128 0 848 usec 0 160 0 729 usec 0 160 0 1,000 usec 0 160 0 1,358 usec 0 160 0 2,638 usec 0 31 2064 Link: https://lkml.kernel.org/r/20230303050332.10138-1-jaewon31.kim@samsung.com Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com> Reviewed-by: John Stultz <jstultz@google.com> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Sumit Semwal <sumit.semwal@linaro.org> Cc: T.J. Mercier <tjmercier@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:12 -07:00
Alexander Potapenko	78c74aeee5	kmsan: add memsetXX tests Add tests ensuring that memset16()/memset32()/memset64() are instrumented by KMSAN and correctly initialize the memory. Link: https://lkml.kernel.org/r/20230303141433.3422671-4-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Alexander Potapenko	27f644dc5a	x86: kmsan: use C versions of memset16/memset32/memset64 KMSAN must see as many memory accesses as possible to prevent false positive reports. Fall back to versions of memset16()/memset32()/memset64() implemented in lib/string.c instead of those written in assembly. Link: https://lkml.kernel.org/r/20230303141433.3422671-3-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reviewed-by: Marco Elver <elver@google.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Alexander Potapenko	d340292553	kmsan: another take at fixing memcpy tests commit `5478afc55a` ("kmsan: fix memcpy tests") uses OPTIMIZER_HIDE_VAR() to hide the uninitialized var from the compiler optimizations. However OPTIMIZER_HIDE_VAR(uninit) enforces an immediate check of @uninit, so memcpy tests did not actually check the behavior of memcpy(), because they always contained a KMSAN report. Replace OPTIMIZER_HIDE_VAR() with a file-local macro that just clobbers the memory with a barrier(), and add a test case for memcpy() that does not expect an error report. Also reflow kmsan_test.c with clang-format. Link: https://lkml.kernel.org/r/20230303141433.3422671-2-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Kees Cook <keescook@chromium.org> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Alexander Potapenko	6dc4bd4e2f	x86: kmsan: don't rename memintrinsics in uninstrumented files clang -fsanitize=kernel-memory already replaces calls to memset/memcpy/memmove and their __builtin_ versions with __msan_memset/__msan_memcpy/__msan_memmove in instrumented files, so there is no need to override them. In non-instrumented versions we are now required to leave memset() and friends intact, so we cannot replace them with __msan_XXX() functions. Link: https://lkml.kernel.org/r/20230303141433.3422671-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Helge Deller <deller@gmx.de> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Peter Xu	7cb1d7ef66	mm/khugepaged: cleanup memcg uncharge for failure path Explicit memcg uncharging is not needed when the memcg accounting has the same lifespan of the page/folio. That becomes the case for khugepaged after Yang & Zach's recent rework so the hpage will be allocated for each collapse rather than being cached. Cleanup the explicit memcg uncharge in khugepaged failure path and leave that for put_page(). Link: https://lkml.kernel.org/r/20230303151218.311015-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Zach O'Keefe <zokeefe@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: David Stevens <stevensd@chromium.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Anshuman Khandual	9dabf6e137	mm/debug_vm_pgtable: replace pte_mkhuge() with arch_make_huge_pte() Since the following commit arch_make_huge_pte() should be used directly in generic memory subsystem as a platform provided page table helper, instead of pte_mkhuge(). Change hugetlb_basic_tests() to call arch_make_huge_pte() directly, and update its relevant documentation entry as required. 'commit `16785bd774` ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")' Link: https://lkml.kernel.org/r/20230302114845.421674-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/ Cc: Jonathan Corbet <corbet@lwn.net> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Anshuman Khandual	1da28f1b5a	mm/migrate: drop pte_mkhuge() in remove_migration_pte() Since the following commit, arch_make_huge_pte() should be used directly in generic memory subsystem as a platform provided page table helper, instead of pte_mkhuge(). This just drops pte_mkhuge() from remove_migration_pte(), which has now become redundant. 'commit `16785bd774` ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")' Link: https://lkml.kernel.org/r/20230302025349.358341-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Link: https://lore.kernel.org/all/1ea45095-0926-a56a-a273-816709e9075e@csgroup.eu/ Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:11 -07:00
Kefeng Wang	3e4fb13ac3	mm: swap: remove unneeded cgroup_throttle_swaprate() All the callers of cgroup_throttle_swaprate() are converted to folio_throttle_swaprate(), so make __cgroup_throttle_swaprate() to take a folio, and rename it to __folio_throttle_swaprate(), also rename gfp_mask to gfp and drop redundant extern keyword. finally, drop unused cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-8-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:10 -07:00
Kefeng Wang	68fa572b50	mm: memory: use folio_throttle_swaprate() in do_cow_fault() Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:10 -07:00
Kefeng Wang	e2bf3e2caa	mm: memory: use folio_throttle_swaprate() in do_anonymous_page() Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:10 -07:00
Kefeng Wang	4d4f75bf32	mm: memory: use folio_throttle_swaprate() in wp_page_copy() Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:10 -07:00
Kefeng Wang	e601ded424	mm: memory: use folio_throttle_swaprate() in page_copy_prealloc() Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:10 -07:00
Kefeng Wang	4231f84258	mm: memory: use folio_throttle_swaprate() in do_swap_page() Directly use folio_throttle_swaprate() instead of cgroup_throttle_swaprate(). Link: https://lkml.kernel.org/r/20230302115835.105364-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:10 -07:00
Kefeng Wang	cfe3236d32	mm: huge_memory: convert __do_huge_pmd_anonymous_page() to use a folio Patch series "mm: remove cgroup_throttle_swaprate() completely", v2. Convert all the caller functions of cgroup_throttle_swaprate() to use folios, and use folio_throttle_swaprate(), which allows us to remove cgroup_throttle_swaprate() completely. This patch (of 7): Convert from page to folio within __do_huge_pmd_anonymous_page(), as we need the precise page which is to be stored at this PTE in the folio, the function still keep a page as the parameter. Link: https://lkml.kernel.org/r/20230302115835.105364-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20230302115835.105364-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:09 -07:00
Peter Collingbourne	16d91faf09	kasan: call clear_page with a match-all tag instead of changing page tag Instead of changing the page's tag solely in order to obtain a pointer with a match-all tag and then changing it back again, just convert the pointer that we get from kmap_atomic() into one with a match-all tag before passing it to clear_page(). On a certain microarchitecture, this has been observed to cause a measurable improvement in microbenchmark performance, presumably as a result of being able to avoid the atomic operations on the page tag. Link: https://lkml.kernel.org/r/20230216195924.3287772-1-pcc@google.com Signed-off-by: Peter Collingbourne <pcc@google.com> Link: https://linux-review.googlesource.com/id/I0249822cc29097ca7a04ad48e8eb14871f80e711 Reviewed-by: Andrey Konovalov <andreyknvl@gmail.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Evgenii Stepanov <eugenis@google.com> Cc: Peter Collingbourne <pcc@google.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:09 -07:00
Ivan Orlov	af7df1c986	selftests: cgroup: add 'malloc' failures checks in test_memcontrol There are several 'malloc' calls in test_memcontrol, which can be unsuccessful. This patch will add 'malloc' failures checking to give more details about test's fail reasons and avoid possible undefined behavior during the future null dereference (like the one in alloc_anon_50M_check_swap function). Link: https://lkml.kernel.org/r/20230226131634.34366-1-ivan.orlov0322@gmail.com Signed-off-by: Ivan Orlov <ivan.orlov0322@gmail.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:09 -07:00
Uros Bizjak	bdeb918810	mm/rmap: use atomic_try_cmpxchg in set_tlb_ubc_flush_pending Use atomic_try_cmpxchg instead of atomic_cmpxchg (ptr, old, new) == old in set_tlb_ubc_flush_pending. 86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg (and related move instruction in front of cmpxchg). Also, try_cmpxchg implicitly assigns old ptr value to "old" when cmpxchg fails. No functional change intended. Link: https://lkml.kernel.org/r/20230227214228.3533299-1-ubizjak@gmail.com Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:09 -07:00
Hyeonggon Yoo	f2421a16f4	mm/debug: use %pGt to display page_type in dump_page() Some page flags are stored in page_type rather than ->flags field. Use newly introduced page type %pGt in dump_page(). Below are some examples: page:00000000da7184dd refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x101cb3 flags: 0x2ffff0000000000(node=0\|zone=2\|lastcpupid=0xffff) page_type: 0xffffffff() raw: 02ffff0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: newly allocated page page:00000000da7184dd refcount:0 mapcount:-128 mapping:0000000000000000 index:0x0 pfn:0x101cb3 flags: 0x2ffff0000000000(node=0\|zone=2\|lastcpupid=0xffff) page_type: 0xffffff7f(buddy) raw: 02ffff0000000000 ffff88813fff8e80 ffff88813fff8e80 0000000000000000 raw: 0000000000000000 0000000000000000 00000000ffffff7f 0000000000000000 page dumped because: freed page page:0000000042202316 refcount:3 mapcount:2 mapping:0000000000000000 index:0x7f634722a pfn:0x11994e memcg:ffff888100135000 anon flags: 0x2ffff0000080024(uptodate\|active\|swapbacked\|node=0\|zone=2\|lastcpupid=0xffff) page_type: 0x1() raw: 02ffff0000080024 0000000000000000 dead000000000122 ffff8881193398f1 raw: 00000007f634722a 0000000000000000 0000000300000001 ffff888100135000 page dumped because: user-mapped page Link: https://lkml.kernel.org/r/20230130042514.2418-4-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:09 -07:00
Hyeonggon Yoo	4c85c0be3d	mm, printk: introduce new format %pGt for page_type %pGp format is used to display 'flags' field of a struct page. However, some page flags (i.e. PG_buddy, see page-flags.h for more details) are stored in page_type field. To display human-readable output of page_type, introduce %pGt format. It is important to note the meaning of bits are different in page_type. if page_type is 0xffffffff, no flags are set. Setting PG_buddy (0x00000080) flag results in a page_type of 0xffffff7f. Clearing a bit actually means setting a flag. Bits in page_type are inverted when displaying type names. Only values for which page_type_has_type() returns true are considered as page_type, to avoid confusion with mapcount values. if it returns false, only raw values are displayed and not page type names. Link: https://lkml.kernel.org/r/20230130042514.2418-3-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: Petr Mladek <pmladek@suse.com> [vsprintf part] Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:09 -07:00
Hyeonggon Yoo	e26fcc02c7	mmflags.h: use less error prone method to define pageflag_names Patch series "mm, printk: introduce new format for page_type", v4. This series moves PG_slab page flag to page_type, freeing one bit in page->flags and introduces %pGt format that prints human-readable page_type like %pGp for printing page flags. See changelog of patch 2 for more implementation details. Thanks everyone that gave valuable comments. This patch (of 3): Use helper macro to decrease chances of typo when defining pageflag_names. Link: https://lkml.kernel.org/r/20230130042514.2418-1-42.hyeyoo@gmail.com Link: https://lore.kernel.org/lkml/Y6AycLbpjVzXM5I9@smile.fi.intel.com Link: https://lkml.kernel.org/r/20230130042514.2418-2-42.hyeyoo@gmail.com Signed-off-by: Hyeonggon Yoo <42.hyeyoo@gmail.com> Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: John Ogness <john.ogness@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:08 -07:00
Stefan Roesch	739100c88f	mm: add tracepoints to ksm This adds the following tracepoints to ksm: - start / stop scan - ksm enter / exit - merge a page - merge a page with ksm - remove a page - remove a rmap item This patch has been split off from the RFC patch series "mm: process/cgroup ksm support". Link: https://lkml.kernel.org/r/20230210214645.2720847-1-shr@devkernel.io Signed-off-by: Stefan Roesch <shr@devkernel.io> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:08 -07:00
Nicholas Piggin	77f68ebeee	powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN On a 16-socket 192-core POWER8 system, the context_switch1_threads benchmark from will-it-scale (see earlier changelog), upstream can achieve a rate of about 1 million context switches per second, due to contention on the mm refcount. 64s meets the prerequisites for CONFIG_MMU_LAZY_TLB_SHOOTDOWN, so enable the option. This increases the above benchmark to 118 million context switches per second. This generates 314 additional IPI interrupts on a 144 CPU system doing a kernel compile, which is in the noise in terms of kernel cycles. Link: https://lkml.kernel.org/r/20230203071837.1136453-6-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:08 -07:00
Nicholas Piggin	2655421ae6	lazy tlb: shoot lazies, non-refcounting lazy tlb mm reference handling scheme On big systems, the mm refcount can become highly contented when doing a lot of context switching with threaded applications. user<->idle switch is one of the important cases. Abandoning lazy tlb entirely slows this switching down quite a bit in the common uncontended case, so that is not viable. Implement a scheme where lazy tlb mm references do not contribute to the refcount, instead they get explicitly removed when the refcount reaches zero. The final mmdrop() sends IPIs to all CPUs in the mm_cpumask and they switch away from this mm to init_mm if it was being used as the lazy tlb mm. Enabling the shoot lazies option therefore requires that the arch ensures that mm_cpumask contains all CPUs that could possibly be using mm. A DEBUG_VM option IPIs every CPU in the system after this to ensure there are no references remaining before the mm is freed. Shootdown IPIs cost could be an issue, but they have not been observed to be a serious problem with this scheme, because short-lived processes tend not to migrate CPUs much, therefore they don't get much chance to leave lazy tlb mm references on remote CPUs. There are a lot of options to reduce them if necessary, described in comments. The near-worst-case can be benchmarked with will-it-scale: context_switch1_threads -t $(($(nproc) / 2)) This will create nproc threads (nproc / 2 switching pairs) all sharing the same mm that spread over all CPUs so each CPU does thread->idle->thread switching. [ Rik came up with basically the same idea a few years ago, so credit to him for that. ] Link: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/ Link: https://lore.kernel.org/all/20180728215357.3249-11-riel@surriel.com/ Link: https://lkml.kernel.org/r/20230203071837.1136453-5-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:08 -07:00
Nicholas Piggin	88e3009b52	lazy tlb: allow lazy tlb mm refcounting to be configurable Add CONFIG_MMU_TLB_REFCOUNT which enables refcounting of the lazy tlb mm when it is context switched. This can be disabled by architectures that don't require this refcounting if they clean up lazy tlb mms when the last refcount is dropped. Currently this is always enabled, so the patch introduces no functional change. Link: https://lkml.kernel.org/r/20230203071837.1136453-4-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:08 -07:00
Nicholas Piggin	aa464ba9a1	lazy tlb: introduce lazy tlb mm refcount helper functions Add explicit _lazy_tlb annotated functions for lazy tlb mm refcounting. This makes the lazy tlb mm references more obvious, and allows the refcounting scheme to be modified in later changes. There is no functional change with this patch. Link: https://lkml.kernel.org/r/20230203071837.1136453-3-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:08 -07:00
Nicholas Piggin	6cad87b0d2	kthread: simplify kthread_use_mm refcounting Patch series "shoot lazy tlbs (lazy tlb refcount scalability improvement)", v7. This series improves scalability of context switching between user and kernel threads on large systems with a threaded process spread across a lot of CPUs. Discussion of v6 here: https://lore.kernel.org/linux-mm/20230118080011.2258375-1-npiggin@gmail.com/ This patch (of 5): Remove the special case avoiding refcounting when the mm to be used is the same as the kernel thread's active (lazy tlb) mm. kthread_use_mm() should not be such a performance critical path that this matters much. This simplifies a later change to lazy tlb mm refcounting. Link: https://lkml.kernel.org/r/20230203071837.1136453-1-npiggin@gmail.com Link: https://lkml.kernel.org/r/20230203071837.1136453-2-npiggin@gmail.com Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Nadav Amit <nadav.amit@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:07 -07:00
Taejoon Song	62bf1258ec	mm/zswap: try to avoid worst-case scenario on same element pages The worst-case scenario on finding same element pages is that almost all elements are same at the first glance but only last few elements are different. Since the same element tends to be grouped from the beginning of the pages, if we check the first element with the last element before looping through all elements, we might have some chances to quickly detect non-same element pages. 1. Test is done under LG webOS TV (64-bit arch) 2. Dump the swap-out pages (~819200 pages) 3. Analyze the pages with simple test script which counts the iteration number and measures the speed at off-line Under 64-bit arch, the worst iteration count is PAGE_SIZE / 8 bytes = 512. The speed is based on the time to consume page_same_filled() function only. The result, on average, is listed as below: Num of Iter Speed(MB/s) Looping-Forward (Orig) 38 99265 Looping-Backward 36 102725 Last-element-check (This Patch) 33 125072 The result shows that the average iteration count decreases by 13% and the speed increases by 25% with this patch. This patch does not increase the overall time complexity, though. I also ran simpler version which uses backward loop. Just looping backward also makes some improvement, but less than this patch. A similar change has already been made to zram in `90f82cbfe5` ("zram: try to avoid worst-case scenario on same element pages"). Link: https://lkml.kernel.org/r/20230205190036.1730134-1-taejoon.song@lge.com Signed-off-by: Taejoon Song <taejoon.song@lge.com> Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Seth Jennings <sjenning@redhat.com> Cc: Taejoon Song <taejoon.song@lge.com> Cc: Vitaly Wool <vitaly.wool@konsulko.com> Cc: Minchan Kim <minchan@kernel.org> Cc: <yjay.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:07 -07:00
T.J. Alumbaugh	32d32ef140	mm: multi-gen LRU: improve design doc This patch improves the design doc. Specifically, 1. add a section for the per-memcg mm_struct list, and 2. add a section for the PID controller. Link: https://lkml.kernel.org/r/20230214035445.1250139-2-talumbau@google.com Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:07 -07:00
T.J. Alumbaugh	9a52b2f32a	mm: multi-gen LRU: clean up sysfs code This patch cleans up the sysfs code. Specifically, 1. use sysfs_emit(), 2. use __ATTR_RW(), and 3. constify multi-gen LRU struct attribute_group. Link: https://lkml.kernel.org/r/20230214035445.1250139-1-talumbau@google.com Signed-off-by: T.J. Alumbaugh <talumbau@google.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:07 -07:00
Ma Wupeng	d155df53f3	x86/mm/pat: clear VM_PAT if copy_p4d_range failed Syzbot reports a warning in untrack_pfn(). Digging into the root we found that this is due to memory allocation failure in pmd_alloc_one. And this failure is produced due to failslab. In copy_page_range(), memory alloaction for pmd failed. During the error handling process in copy_page_range(), mmput() is called to remove all vmas. While untrack_pfn this empty pfn, warning happens. Here's a simplified flow: dup_mm dup_mmap copy_page_range copy_p4d_range copy_pud_range copy_pmd_range pmd_alloc __pmd_alloc pmd_alloc_one page = alloc_pages(gfp, 0); if (!page) return NULL; mmput exit_mmap unmap_vmas unmap_single_vma untrack_pfn follow_phys WARN_ON_ONCE(1); Since this vma is not generate successfully, we can clear flag VM_PAT. In this case, untrack_pfn() will not be called while cleaning this vma. Function untrack_pfn_moved() has also been renamed to fit the new logic. Link: https://lkml.kernel.org/r/20230217025615.1595558-1-mawupeng1@huawei.com Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reported-by: <syzbot+5f488e922d047d8f00cc@syzkaller.appspotmail.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Toshi Kani <toshi.kani@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:07 -07:00
Muhammad Usama Anjum	a1b92a3f14	mm/userfaultfd: support WP on multiple VMAs mwriteprotect_range() errors out if [start, end) doesn't fall in one VMA. We are facing a use case where multiple VMAs are present in one range of interest. For example, the following pseudocode reproduces the error which we are trying to fix: - Allocate memory of size 16 pages with PROT_NONE with mmap - Register userfaultfd - Change protection of the first half (1 to 8 pages) of memory to PROT_READ \| PROT_WRITE. This breaks the memory area in two VMAs. - Now UFFDIO_WRITEPROTECT_MODE_WP on the whole memory of 16 pages errors out. This is a simple use case where user may or may not know if the memory area has been divided into multiple VMAs. We need an implementation which doesn't disrupt the already present users. So keeping things simple, stop going over all the VMAs if any one of the VMA hasn't been registered in WP mode. While at it, remove the un-needed error check as well. [akpm@linux-foundation.org: s/VM_WARN_ON_ONCE/VM_WARN_ONCE/ to fix build] Link: https://lkml.kernel.org/r/20230217105558.832710-1-usama.anjum@collabora.com Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reported-by: Paul Gofman <pgofman@codeweavers.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:07 -07:00
Vlastimil Babka	700d2e9a36	mm, page_alloc: reduce page alloc/free sanity checks Historically, we have performed sanity checks on all struct pages being allocated or freed, making sure they have no unexpected page flags or certain field values. This can detect insufficient cleanup and some cases of use-after-free, although on its own it can't always identify the culprit. The result is a warning and the "bad page" being leaked. The checks do need some cpu cycles, so in 4.7 with commits `479f854a20` ("mm, page_alloc: defer debugging checks of pages allocated from the PCP") and `4db7548ccb` ("mm, page_alloc: defer debugging checks of freed pages until a PCP drain") they were no longer performed in the hot paths when allocating and freeing from pcplists, but only when pcplists are bypassed, refilled or drained. For debugging purposes, with CONFIG_DEBUG_VM enabled the checks were instead still done in the hot paths and not when refilling or draining pcplists. With `4462b32c92` ("mm, page_alloc: more extensive free page checking with debug_pagealloc"), enabling debug_pagealloc also moved the sanity checks back to hot pahs. When both debug_pagealloc and CONFIG_DEBUG_VM are enabled, the checks are done both in hotpaths and pcplist refill/drain. Even though the non-debug default today might seem to be a sensible tradeoff between overhead and ability to detect bad pages, on closer look it's arguably not. As most allocations go through the pcplists, catching any bad pages when refilling or draining pcplists has only a small chance, insufficient for debugging or serious hardening purposes. On the other hand the cost of the checks is concentrated in the already expensive drain/refill batching operations, and those are done under the often contended zone lock. That was recently identified as an issue for page allocation and the zone lock contention reduced by moving the checks outside of the locked section with a patch "mm: reduce lock contention of pcp buffer refill", but the cost of the checks is still visible compared to their removal [1]. In the pcplist draining path free_pcppages_bulk() the checks are still done under zone->lock. Thus, remove the checks from pcplist refill and drain paths completely. Introduce a static key check_pages_enabled to control checks during page allocation a freeing (whether pcplist is used or bypassed). The static key is enabled if either is true: - kernel is built with CONFIG_DEBUG_VM=y (debugging) - debug_pagealloc or page poisoning is boot-time enabled (debugging) - init_on_alloc or init_on_free is boot-time enabled (hardening) The resulting user visible changes: - no checks when draining/refilling pcplists - less overhead, with likely no practical reduction of ability to catch bad pages - no checks when bypassing pcplists in default config (no debugging/hardening) - less overhead etc. as above - on typical hardened kernels [2], checks are now performed on each page allocation/free (previously only when bypassing/draining/refilling pcplists) - the init_on_alloc/init_on_free enabled should be sufficient indication for preferring more costly alloc/free operations for hardening purposes and we shouldn't need to introduce another toggle - code (various wrappers) removal and simplification [1] https://lore.kernel.org/all/68ba44d8-6899-c018-dcb3-36f3a96e6bea@sra.uni-hannover.de/ [2] https://lore.kernel.org/all/63ebc499.a70a0220.9ac51.29ea@mx.google.com/ [akpm@linux-foundation.org: coding-style cleanups] [akpm@linux-foundation.org: make check_pages_enabled static] Link: https://lkml.kernel.org/r/20230216095131.17336-1-vbabka@suse.cz Reported-by: Alexander Halbuer <halbuer@sra.uni-hannover.de> Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Kees Cook <keescook@chromium.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>	2023-03-28 16:20:06 -07:00

1 2 3 4 5 ...

1170410 Commits