Commit Graph

1265941 Commits

Author SHA1 Message Date
Peter Xu
24334e78e8 mm/hugetlb: declare hugetlbfs_pagecache_present() non-static
It will be used outside hugetlb.c soon.

Link: https://lkml.kernel.org/r/20240327152332.950956-3-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Jones <andrew.jones@linux.dev>
Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:20 -07:00
Peter Xu
ac3830c3b2 mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES
Patch series "mm/gup: Unify hugetlb, part 2", v4.

The series removes the hugetlb slow gup path after a previous refactor
work [1], so that slow gup now uses the exact same path to process all
kinds of memory including hugetlb.

For the long term, we may want to remove most, if not all, call sites of
huge_pte_offset().  It'll be ideal if that API can be completely dropped
from arch hugetlb API.  This series is one small step towards merging
hugetlb specific codes into generic mm paths.  From that POV, this series
removes one reference to huge_pte_offset() out of many others.

One goal of such a route is that we can reconsider merging hugetlb
features like High Granularity Mapping (HGM).  It was not accepted in the
past because it may add lots of hugetlb specific codes and make the mm
code even harder to maintain.  With a merged codeset, features like HGM
can hopefully share some code with THP, legacy (PMD+) or modern
(continuous PTEs).

To make it work, the generic slow gup code will need to at least
understand hugepd, which is already done like so in fast-gup.  Due to the
specialty of hugepd to be software-only solution (no hardware recognizes
the hugepd format, so it's purely artificial structures), there's chance
we can merge some or all hugepd formats with cont_pte in the future.  That
question is yet unsettled from Power side to have an acknowledgement.  As
of now for this series, I kept the hugepd handling because we may still
need to do so before getting a clearer picture of the future of hugepd. 
The other reason is simply that we did it already for fast-gup and most
codes are still around to be reused.  It'll make more sense to keep
slow/fast gup behave the same before a decision is made to remove hugepd.

There's one major difference for slow-gup on cont_pte / cont_pmd handling,
currently supported on three architectures (aarch64, riscv, ppc).  Before
the series, slow gup will be able to recognize e.g.  cont_pte entries with
the help of huge_pte_offset() when hstate is around.  Now it's gone but
still working, by looking up pgtable entries one by one.

It's not ideal, but hopefully this change should not affect yet on major
workloads.  There's some more information in the commit message of the
last patch.  If this would be a concern, we can consider teaching slow gup
to recognize cont pte/pmd entries, and that should recover the lost
performance.  But I doubt its necessity for now, so I kept it as simple as
it can be.

Patch layout
=============

Patch 1-8:    Preparation works, or cleanups in relevant code paths
Patch 9-11:   Teach slow gup with all kinds of huge entries (pXd, hugepd)
Patch 12:     Drop hugetlb_follow_page_mask()

More information can be found in the commit messages of each patch.

[1] https://lore.kernel.org/all/20230628215310.73782-1-peterx@redhat.com
[2] https://lore.kernel.org/r/20240321215047.678172-1-peterx@redhat.com




Introduce a config option that will be selected as long as huge leaves are
involved in pgtable (thp or hugetlbfs).  It would be useful to mark any
code with this new config that can process either hugetlb or thp pages in
any level that is higher than pte level.

Link: https://lkml.kernel.org/r/20240327152332.950956-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20240327152332.950956-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
Tested-by: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrew Jones <andrew.jones@linux.dev>
Cc: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: James Houghton <jthoughton@google.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:20 -07:00
Matthew Wilcox (Oracle)
632230ff19 mm: rename mm_put_huge_zero_page to mm_put_huge_zero_folio
Also remove mm_get_huge_zero_page() now it has no users.

Link: https://lkml.kernel.org/r/20240326202833.523759-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:20 -07:00
Matthew Wilcox (Oracle)
c93012d849 dax: use huge_zero_folio
Convert from huge_zero_page to huge_zero_folio.

Link: https://lkml.kernel.org/r/20240326202833.523759-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:20 -07:00
Matthew Wilcox (Oracle)
e28833bc4a mm: convert do_huge_pmd_anonymous_page to huge_zero_folio
Use folios more widely.

Link: https://lkml.kernel.org/r/20240326202833.523759-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:19 -07:00
Matthew Wilcox (Oracle)
5691753d73 mm: convert huge_zero_page to huge_zero_folio
With all callers of is_huge_zero_page() converted, we can now switch the
huge_zero_page itself from being a compound page to a folio.

Link: https://lkml.kernel.org/r/20240326202833.523759-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:19 -07:00
Matthew Wilcox (Oracle)
b002a7b0a5 mm: convert migrate_vma_collect_pmd to use a folio
Convert the pmd directly to a folio and use it.  Turns four calls to
compound_head() into one.

Link: https://lkml.kernel.org/r/20240326202833.523759-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:19 -07:00
Matthew Wilcox (Oracle)
e06d03d559 mm: add pmd_folio()
Convert directly from a pmd to a folio without going through another
representation first.  For now this is just a slightly shorter way to
write it, but it might end up being more efficient later.

Link: https://lkml.kernel.org/r/20240326202833.523759-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:19 -07:00
Matthew Wilcox (Oracle)
5beaee54a3 mm: add is_huge_zero_folio()
This is the folio equivalent of is_huge_zero_page().  It doesn't add any
efficiency, but it does prevent the caller from passing a tail page and
getting confused when the predicate returns false.

Link: https://lkml.kernel.org/r/20240326202833.523759-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:18 -07:00
Matthew Wilcox (Oracle)
4d30eac374 sparc: use is_huge_zero_pmd()
Patch series "Convert huge_zero_page to huge_zero_folio".

Almost all the callers of is_huge_zero_page() already have a folio.  And
they should -- is_huge_zero_page() will return false for tail pages, even
if they're tail pages of the huge zero page.  That's confusing, and one of
the benefits of the folio conversion is to get rid of this confusion.


This patch (of 8):

There's no need to convert to a page, much less a folio.  We can tell from
the pmd whether it is a huge zero page or not.  Saves 60 bytes of text.

Link: https://lkml.kernel.org/r/20240326202833.523759-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20240326202833.523759-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:18 -07:00
Chris Li
796c2c23e1 zswap: replace RB tree with xarray
Very deep RB tree requires rebalance at times.  That contributes to the
zswap fault latencies.  Xarray does not need to perform tree rebalance. 
Replacing RB tree to xarray can have some small performance gain.

One small difference is that xarray insert might fail with ENOMEM, while
RB tree insert does not allocate additional memory.

The zswap_entry size will reduce a bit due to removing the RB node, which
has two pointers and a color field.  Xarray store the pointer in the
xarray tree rather than the zswap_entry.  Every entry has one pointer from
the xarray tree.  Overall, switching to xarray should save some memory, if
the swap entries are densely packed.

Notice the zswap_rb_search and zswap_rb_insert often followed by
zswap_rb_erase.  Use xa_erase and xa_store directly.  That saves one tree
lookup as well.

Remove zswap_invalidate_entry due to no need to call zswap_rb_erase any
more.  Use zswap_free_entry instead.

The "struct zswap_tree" has been replaced by "struct xarray".  The tree
spin lock has transferred to the xarray lock.

Run the kernel build testing 5 times for each version, averages:
(memory.max=2GB, zswap shrinker and writeback enabled, one 50GB swapfile,
24 HT core, 32 jobs)

           mm-unstable-4aaccadb5c04     xarray v9
user       3548.902 			3534.375
sys        522.232                      520.976
real       202.796                      200.864

[chrisl@kernel.org: restore original comment "erase" to "invalidate"]
  Link: https://lkml.kernel.org/r/20240326-zswap-xarray-v10-1-bf698417c968@kernel.org
Link: https://lkml.kernel.org/r/20240326-zswap-xarray-v9-1-d2891a65dfc7@kernel.org
Signed-off-by: Chris Li <chrisl@kernel.org>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: Barry Song <v-songbaohua@oppo.com>
Cc: Chengming Zhou <zhouchengming@bytedance.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:18 -07:00
Baoquan He
0aac45663a mm/page_alloc.c: change the array-length to MIGRATE_PCPTYPES
Earlier, in commit 1dd214b8f2 ("mm: page_alloc: avoid merging
non-fallbackable pageblocks with others"), migrate type MIGRATE_CMA and
MIGRATE_ISOLATE are removed from fallbacks list since they are never used.

Later on, in commit ("aa02d3c174ab mm/page_alloc: reduce fallbacks to
(MIGRATE_PCPTYPES - 1)"), the array column size is reduced to
'MIGRATE_PCPTYPES - 1'. In fact, the array row size need be reduced to
MIGRATE_PCPTYPES too since it's only covering rows of the number
MIGRATE_PCPTYPES. Even though the current code has handled cases
when the migratetype is CMA, HIGHATOMIC and MEMORY_ISOLATION, making
the row size right is still good to avoid future error and confusion.

Link: https://lkml.kernel.org/r/20240326061134.1055295-8-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:18 -07:00
Baoquan He
96a5c186ef mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone
On one node, for lower zone's ->lowmem_reserve[], it will show how much
memory is reserved in this lower zone to avoid excessive page allocation
from the relevant higher zone's fallback allocation.

However, currently lower zone's lowmem_reserve[] element will be filled
even though the relevant higher zone is empty.  That doesnt' make sense
and can cause confusion.

E.g on node 0 of one system as below, it has zone
DMA/DMA32/NORMAL/MOVABLE/DEVICE, among them zone MOVABLE/DEVICE are the
highest and both are empty.  In zone DMA/DMA32's protection array, we can
see that it has value for zone MOVABLE and DEVICE.

Node 0, zone      DMA
  ......
  pages free     2816
        boost    0
        min      7
        low      10
        high     13
        spanned  4095
        present  3998
        managed  3840
        cma      0
        protection: (0, 1582, 23716, 23716, 23716)
   ......
Node 0, zone    DMA32
  pages free     403269
        boost    0
        min      753
        low      1158
        high     1563
        spanned  1044480
        present  487039
        managed  405070
        cma      0
        protection: (0, 0, 22134, 22134, 22134)
   ......
Node 0, zone   Normal
  pages free     5423879
        boost    0
        min      10539
        low      16205
        high     21871
        spanned  5767168
        present  5767168
        managed  5666438
        cma      0
        protection: (0, 0, 0, 0, 0)
   ......
Node 0, zone  Movable
  pages free     0
        boost    0
        min      32
        low      32
        high     32
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0, 0)
Node 0, zone   Device
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0, 0)

Here, clear out the element value in lower zone's ->lowmem_reserve[] if the
relevant higher zone is empty.

And also replace space with tab in _deferred_grow_zone()

Link: https://lkml.kernel.org/r/20240326061134.1055295-7-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:17 -07:00
Baoquan He
f55d3471b7 mm/mm_init.c: remove the outdated code comment above deferred_grow_zone()
The noinline attribute has been taken off in commit 9420f89db2 ("mm:
move most of core MM initialization to mm/mm_init.c").  So remove the
unneeded code comment above deferred_grow_zone().

And also remove the unneeded bracket in deferred_init_pages().

Link: https://lkml.kernel.org/r/20240326061134.1055295-6-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:17 -07:00
Baoquan He
bb8ea62daa mm/page_alloc.c: remove unneeded codes in !NUMA version of build_zonelists()
When CONFIG_NUMA=n, MAX_NUMNODES is always 1 because Kconfig item
NODES_SHIFT depends on NUMA.  So in !NUMA version of build_zonelists(), no
need to bother with the two for loop because code execution won't enter
them ever.

Here, remove those unneeded codes in !NUMA version of build_zonelists().

[bhe@redhat.com: remove unused locals]
  Link: https://lkml.kernel.org/r/ZgQL1WOf9K88nLpQ@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20240326061134.1055295-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:17 -07:00
Baoquan He
b6dd94596f mm: make __absent_pages_in_range() as static
It's only called in mm/mm_init.c now.

Link: https://lkml.kernel.org/r/20240326061134.1055295-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:17 -07:00
Baoquan He
c091dd963f mm/init: remove the unnecessary special treatment for memory-less node
Because memory-less node's ->node_present_pages and its zone's
->present_pages are all 0, the judgement before calling node_set_state()
to set N_MEMORY, N_HIGH_MEMORY, N_NORMAL_MEMORY for node is enough to skip
memory-less node.  The 'continue;' statement inside for_each_node() loop
of free_area_init() is gilding the lily.

Here, remove the special handling to make memory-less node share the same
code flow as normal node.

And also rephrase the code comments above the 'continue' statement
and move them above above line 'if (pgdat->node_present_pages)'.

[bhe@redhat.com: redo code comments, per Mike]
  Link: https://lkml.kernel.org/r/ZhYJAVQRYJSTKZng@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20240326061134.1055295-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:16 -07:00
Baoquan He
850ed20539 mm: move array mem_section init code out of memory_present()
Patch series "mm/init: minor clean up and improvement".

These are all observed when going through code flow during mm init.


This patch (of 7):

When CONFIG_SPARSEMEM_EXTREME is enabled, mem_section need be initialized
to point at a two-dimensional array, and its 1st dimension of length
NR_SECTION_ROOTS will be dynamically allocated.  Once the allocation is
done, it's available for all nodes.

So take the 1st dimension of mem_section initialization out of
memory_present()(), and put it into memblocks_present() which is a more
appripriate place.

Link: https://lkml.kernel.org/r/20240326061134.1055295-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20240326061134.1055295-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: "Mike Rapoport (IBM)" <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:16 -07:00
Vlastimil Babka
e6100a4590 mm, slab: move slab_memcg hooks to mm/memcontrol.c
The hooks make multiple calls to functions in mm/memcontrol.c, including
to th current_obj_cgroup() marked __always_inline.  It might be faster to
make a single call to the hook in mm/memcontrol.c instead.  The hooks also
don't use almost anything from mm/slub.c.  obj_full_size() can move with
the hooks and cache_vmstat_idx() to the internal mm/slab.h

Link: https://lkml.kernel.org/r/20240326-slab-memcg-v3-2-d85d2563287a@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:16 -07:00
Vlastimil Babka
9f9796b413 mm, slab: move memcg charging to post-alloc hook
Patch series "memcg_kmem hooks refactoring", v3.


This patch (of 2):

The MEMCG_KMEM integration with slab currently relies on two hooks during
allocation.  memcg_slab_pre_alloc_hook() determines the objcg and charges
it, and memcg_slab_post_alloc_hook() assigns the objcg pointer to the
allocated object(s).

As Linus pointed out, this is unnecessarily complex.  Failing to charge
due to memcg limits should be rare, so we can optimistically allocate the
object(s) and do the charging together with assigning the objcg pointer in
a single post_alloc hook.  In the rare case the charging fails, we can
free the object(s) back.

This simplifies the code (no need to pass around the objcg pointer) and
potentially allows to separate charging from allocation in cases where
it's common that the allocation would be immediately freed, and the memcg
handling overhead could be saved.

[vbabka@suse.cz: fix call to memcg_alloc_abort_single()]
  Link: https://lkml.kernel.org/r/4af50be2-4109-45e5-8a36-2136252a635e@suse.cz
[roman.gushchin@linux.dev: comment fixup]
  Link: https://lkml.kernel.org/r/Zg2LsNm6twOmG69l@P9FQF9L96D.corp.robot.car
Link: https://lkml.kernel.org/r/20240326-slab-memcg-v3-0-d85d2563287a@suse.cz
Link: https://lkml.kernel.org/r/20240326-slab-memcg-v3-1-d85d2563287a@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Roman Gushchin <roman.gushchin@linux.dev>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/all/CAHk-=whYOOdM7jWy5jdrAm8LxcgCMFyk2bt8fYYvZzM4U-zAQA@mail.gmail.com/
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Aishwarya TCV <aishwarya.tcv@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:16 -07:00
Matthew Wilcox (Oracle)
dee3d0bef2 proc: rewrite stable_page_flags()
Reduce the usage of PageFlag tests and reduce the number of
compound_head() calls.

For multi-page folios, we'll now show all pages as having the flags that
apply to them, e.g.  if it's dirty, all pages will have the dirty flag set
instead of just the head page.  The mapped flag is still per page, as is
the hwpoison flag.

[willy@infradead.org: fix up some bits vs masks]
  Link: https://lkml.kernel.org/r/20240403173112.1450721-1-willy@infradead.org
[willy@infradead.org: fix warnings]
  Link: https://lkml.kernel.org/r/ZhBPtCYfSuFuUMEz@casper.infradead.org
Link: https://lkml.kernel.org/r/20240326171045.410737-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Svetly Todorov <svetly.todorov@memverge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:15 -07:00
Matthew Wilcox (Oracle)
4dc7d37370 remove references to page->flags in documentation
Mostly rewording, but remove entirely the copy of page_fixed_fake_head()
in the documentation; we can refer people to the actual source if
necessary.

Link: https://lkml.kernel.org/r/20240326171045.410737-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:15 -07:00
Matthew Wilcox (Oracle)
5e0debe012 slub: remove use of page->flags
Use slub->__page_flags instead.  We can also remove the assertion that
it's not a tail page as struct slab never points to a tail page.

Link: https://lkml.kernel.org/r/20240326171045.410737-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:15 -07:00
Matthew Wilcox (Oracle)
51718e25c5 mm: convert arch_clear_hugepage_flags to take a folio
All implementations that aren't no-ops just set a bit in the flags, and we
want to use the folio flags rather than the page flags for that.  Rename
it to arch_clear_hugetlb_flags() while we're touching it so nobody thinks
it's used for THP.

[willy@infradead.org: fix arm64 build]
  Link: https://lkml.kernel.org/r/ZgQvNKGdlDkwhQEX@casper.infradead.org
Link: https://lkml.kernel.org/r/20240326171045.410737-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:15 -07:00
Matthew Wilcox (Oracle)
b84fd2835c mm: make page_mapped() take a const argument
None of the functions called by page_mapped() modify the page or folio, so
mark them all as const.

Link: https://lkml.kernel.org/r/20240326171045.410737-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:14 -07:00
Matthew Wilcox (Oracle)
2ace5a670e mm: make is_free_buddy_page() take a const argument
This function does not modify its argument; let the callers know that so
they can make better optimisation decisions.

Link: https://lkml.kernel.org/r/20240326171045.410737-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:14 -07:00
Matthew Wilcox (Oracle)
e3089fd0b0 mm: make folio_test_idle and folio_test_young take a const argument
If these functions are defined in page-flags.h, they already take a const
argument; make it true for these alternate definitions too.

Link: https://lkml.kernel.org/r/20240326171045.410737-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:14 -07:00
Matthew Wilcox (Oracle)
6e65aa55cd mm: make page_ext_get() take a const argument
In order to constify other functions, we need page_ext_get() to be const. 
This is no problem as lookup_page_ext() already takes a const argument.

Link: https://lkml.kernel.org/r/20240326171045.410737-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:14 -07:00
Matthew Wilcox (Oracle)
fa92722e38 xtensa: remove uses of PG_arch_1 on individual pages
Since switching to the new page table range API, we disregard the
PG_arch_1 (aka dcache dirty) flag on tail pages, and only pay attention to
it on the folio.  Fix these two missed spots where we were setting it on
arbitrary pages.

Link: https://lkml.kernel.org/r/20240326171045.410737-3-willy@infradead.org
Reported-by: Svetly Todorov <svetly.todorov@memverge.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Svetly Todorov <svetly.todorov@memverge.com>	[xtensa]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:13 -07:00
Matthew Wilcox (Oracle)
f4b6680973 sh: remove use of PG_arch_1 on individual pages
Patch series "Various page->flags cleanups".

The first two patches are bug fixes, although I'm not sure that either
architecture will have noticed.  There aren't a lot of uses of page->flags
left!  The big build-up here is to reworking stable_page_flags(), which
will definitely be a user-visible change.  I think a welcome one, given
the special case we had to spread the Slab flag into all tail pages.


This patch (of 10):

Since switching to the new page table range API, we do not set the
PG_arch_1 (aka dcache clean) flag on tail pages, only on the folio.  Test
it on the folio.  Also use page_mapped() instead of page_mapcount() as it
is more efficient.

[akpm@linux-foundation.org: fix folio_flags call]
Link: https://lkml.kernel.org/r/20240326171045.410737-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20240326171045.410737-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:13 -07:00
David Hildenbrand
f002882ca3 mm: merge folio_is_secretmem() and folio_fast_pin_allowed() into gup_fast_folio_allowed()
folio_is_secretmem() is currently only used during GUP-fast.  Nowadays,
folio_fast_pin_allowed() performs similar checks during GUP-fast and
contains a lot of careful handling -- READ_ONCE() -- , sanity checks --
lockdep_assert_irqs_disabled() -- and helpful comments on how this
handling is safe and correct.

So let's merge folio_is_secretmem() into folio_fast_pin_allowed().  Rename
folio_fast_pin_allowed() to gup_fast_folio_allowed(), to better match the
new semantics.

Link: https://lkml.kernel.org/r/20240326143210.291116-4-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: xingwei lee <xrivendell7@gmail.com>
Cc: yue sun <samsun1006219@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:13 -07:00
David Hildenbrand
c139ca42f5 selftests/memfd_secret: add vmsplice() test
Let's add a simple reproducer for a scenario where GUP-fast could succeed
on secretmem folios, making vmsplice() succeed instead of failing.  The
reproducer is based on a reproducer [1] by Miklos Szeredi.

We want to perform two tests: vmsplice() when a fresh page was just
faulted in, and vmsplice() on an existing page after munmap() that would
drain certain LRU caches/batches in the kernel.

In an ideal world, we could use fallocate(FALLOC_FL_PUNCH_HOLE) /
MADV_REMOVE to remove any existing page.  As that is currently not
possible, run the test before any other tests that would allocate memory
in the secretmem fd.

Perform the ftruncate() only once, and check the return value.

[1] https://lkml.kernel.org/r/CAJfpegt3UCsMmxd0taOY11Uaw5U=eS1fE5dn0wZX3HF0oy8-oQ@mail.gmail.com

Link: https://lkml.kernel.org/r/20240326143210.291116-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: xingwei lee <xrivendell7@gmail.com>
Cc: yue sun <samsun1006219@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:13 -07:00
Christoph Hellwig
5b34b76cb0 mm: move follow_phys to arch/x86/mm/pat/memtype.c
follow_phys is only used by two callers in arch/x86/mm/pat/memtype.c. 
Move it there and hardcode the two arguments that get the same values
passed by both callers.

[david@redhat.com: conflict resolutions]
Link: https://lkml.kernel.org/r/20240403212131.929421-4-david@redhat.com
Link: https://lkml.kernel.org/r/20240324234542.2038726-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fei Li <fei1.li@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:12 -07:00
Christoph Hellwig
cb10c28ac8 mm: remove follow_pfn
Remove follow_pfn now that the last user is gone.

Link: https://lkml.kernel.org/r/20240324234542.2038726-3-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fei Li <fei1.li@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:12 -07:00
Christoph Hellwig
1b265da7ea virt: acrn: stop using follow_pfn
Patch series "remove follow_pfn".

This series open codes follow_pfn in the only remaining caller, although
the code there remains questionable.  It then also moves follow_phys into
the only user and simplifies it a bit.


This patch (of 3):

Switch from follow_pfn to follow_pte so that we can get rid of follow_pfn.
Note that this doesn't fix any of the pre-existing raciness and lack of
permission checking in the code.

Link: https://lkml.kernel.org/r/20240324234542.2038726-1-hch@lst.de
Link: https://lkml.kernel.org/r/20240324234542.2038726-2-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fei Li <fei1.li@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:12 -07:00
Kefeng Wang
85109a8a9a mm: backing-dev: use group allocation/free of per-cpu counters API
Use group allocation/free of per-cpu counters api to accelerate
wb_init/exit() and simplify code.

Link: https://lkml.kernel.org/r/20240325035635.49342-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Dennis Zhou <dennis@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:12 -07:00
John Hubbard
a8353dc98f huge_memory.c: document huge page splitting rules more thoroughly
1. Add information about the behavior of huge page splitting, with
   respect to page/folio refcounts, and gup/pup pins.

2. Update and clarify the existing documentation, to compensate for the
   ravages of time and code change.

Link: https://lkml.kernel.org/r/20240325044452.217463-1-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:11 -07:00
Yajun Deng
d4e6b397be mm/mmap: convert all mas except mas_detach to vma iterator
There are two types of iterators mas and vmi in the current code.  If the
maple tree comes from the mm structure, we can use the vma iterator. 
Avoid using mas directly as possible.

Keep using mas for the mt_detach tree, since it doesn't come from the mm
structure.

Remove as many uses of mas as possible, but we will still have a few that
must be passed through in unmap_vmas() and free_pgtables().

Also introduce vma_iter_reset, vma_iter_{prev, next}_range_limit and
vma_iter_area_{lowest, highest} helper functions for using the vma
interator.

Link: https://lkml.kernel.org/r/20240325063258.1437618-1-yajun.deng@linux.dev
Signed-off-by: Yajun Deng <yajun.deng@linux.dev>
Tested-by: Helge Deller <deller@gmx.de>	[parisc]
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:11 -07:00
Baoquan He
0b52663f75 mm/mm_init.c: remove arch_reserved_kernel_pages()
Since the current calculation of calc_nr_kernel_pages() has taken into
consideration of kernel reserved memory, no need to have
arch_reserved_kernel_pages() any more.

Link: https://lkml.kernel.org/r/20240325145646.1044760-7-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:11 -07:00
Baoquan He
90e796e22e mm/mm_init.c: remove unneeded calc_memmap_size()
Nobody calls calc_memmap_size() now.

Link: https://lkml.kernel.org/r/20240325145646.1044760-6-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:11 -07:00
Baoquan He
0ac5e785dc mm/mm_init.c: remove meaningless calculation of zone->managed_pages in free_area_init_core()
Currently, in free_area_init_core(), when initialize zone's field, a rough
value is set to zone->managed_pages.  That value is calculated by
(zone->present_pages - memmap_pages).

In the meantime, add the value to nr_all_pages and nr_kernel_pages which
represent all free pages of system (only low memory or including HIGHMEM
memory separately).  Both of them are gonna be used in
alloc_large_system_hash().

However, the rough calculation and setting of zone->managed_pages is
meaningless because
  a) memmap pages are allocated on units of node in sparse_init() or
     alloc_node_mem_map(pgdat); The simple (zone->present_pages -
     memmap_pages) is too rough to make sense for zone;
  b) the set zone->managed_pages will be zeroed out and reset with
     acutal value in mem_init() via memblock_free_all(). Before the
     resetting, no buddy allocation request is issued.

Here, remove the meaningless and complicated calculation of
(zone->present_pages - memmap_pages), directly set zone->managed_pages as
zone->present_pages for now.  It will be adjusted in mem_init().

And also remove the assignment of nr_all_pages and nr_kernel_pages in
free_area_init_core().  Instead, call the newly added
calc_nr_kernel_pages() to count up all free but not reserved memory in
memblock and assign to nr_all_pages and nr_kernel_pages.  The counting
excludes memmap_pages, and other kernel used data, which is more accurate
than old way and simpler, and can also cover the ppc required
arch_reserved_kernel_pages() case.

And also clean up the outdated code comment above free_area_init_core(). 
And free_area_init_core() is easy to understand now, no need to add words
to explain.

[bhe@redhat.com: initialize zone->managed_pages as zone->present_pages for now]
  Link: https://lkml.kernel.org/r/ZgU0bsJ2FEjykvju@MiWiFi-R3L-srv
Link: https://lkml.kernel.org/r/20240325145646.1044760-5-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:10 -07:00
Baoquan He
8ad4184985 mm/mm_init.c: add new function calc_nr_all_pages()
This is a preparation to calculate nr_kernel_pages and nr_all_pages, both
of which will be used later in alloc_large_system_hash().

nr_all_pages counts up all free but not reserved memory in memblock
allocator, including HIGHMEM memory.  While nr_kernel_pages counts up all
free but not reserved low memory in memblock allocator, excluding HIGHMEM
memory.

Link: https://lkml.kernel.org/r/20240325145646.1044760-4-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:10 -07:00
Baoquan He
6600a6b10c mm/mm_init.c: remove the useless dma_reserve
Now nobody calls set_dma_reserve() to set value for dma_reserve, remove
set_dma_reserve(), global variable dma_reserve and the codes using it.

Link: https://lkml.kernel.org/r/20240325145646.1044760-3-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:10 -07:00
Baoquan He
fdb022f6e9 x86: remove unneeded memblock_find_dma_reserve()
Patch series "mm/mm_init.c: refactor free_area_init_core()".

In function free_area_init_core(), the code calculating
zone->managed_pages and the subtracting dma_reserve from DMA zone looks
very confusing.

From git history, the code calculating zone->managed_pages was for
zone->present_pages originally.  The early rough assignment is for
optimize zone's pcp and water mark setting.  Later, managed_pages was
introduced into zone to represent the number of managed pages by buddy. 
Now, zone->managed_pages is zeroed out and reset in mem_init() when
calling memblock_free_all().  zone's pcp and wmark setting relying on
actual zone->managed_pages are done later than mem_init() invocation.  So
we don't need rush to early calculate and set zone->managed_pages, just
set it as zone->present_pages, will adjust it in mem_init().

And also add a new function calc_nr_kernel_pages() to count up free but
not reserved pages in memblock, then assign it to nr_all_pages and
nr_kernel_pages after memmap pages are allocated.


This patch (of 6):

Variable dma_reserve and its usage was introduced in commit 0e0b864e06
("[PATCH] Account for memmap and optionally the kernel image as holes"). 
Its original purpose was to accounting for the reserved pages in DMA zone
to make DMA zone's watermarks calculation more accurate on x86.

However, currently there's zone->managed_pages to account for all
available pages for buddy, zone->present_pages to account for all present
physical pages in zone.  What is more important, on x86, calculating and
setting the zone->managed_pages is a temporary move, all zone's
managed_pages will be zeroed out and reset to the actual value according
to how many pages are added to buddy allocator in mem_init().  Before
mem_init(), no buddy alloction is requested.  And zone's pcp and watermark
setting are all done after mem_init().  So, no need to worry about the DMA
zone's setting accuracy during free_area_init().

Hence, remove memblock_find_dma_reserve() to stop calculating and
setting dma_reserve.

Link: https://lkml.kernel.org/r/20240325145646.1044760-1-bhe@redhat.com
Link: https://lkml.kernel.org/r/20240325145646.1044760-2-bhe@redhat.com
Signed-off-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:10 -07:00
Kairui Song
6758c1128c mm/filemap: optimize filemap folio adding
Instead of doing multiple tree walks, do one optimism range check with
lock hold, and exit if raced with another insertion.  If a shadow exists,
check it with a new xas_get_order helper before releasing the lock to
avoid redundant tree walks for getting its order.

Drop the lock and do the allocation only if a split is needed.

In the best case, it only need to walk the tree once.  If it needs to
alloc and split, 3 walks are issued (One for first ranged conflict check
and order retrieving, one for the second check after allocation, one for
the insert after split).

Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:

  echo 3 > /proc/sys/vm/drop_caches

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
    --buffered=1 --ioengine=mmap --rw=randread --time_based \
    --ramp_time=30s --runtime=5m --group_reporting

Before:
bw (  MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops        : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691

After (+7.3%):
bw (  MiB/s): min=  493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops        : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651

Test result with THP (do a THP randread then switch to 4K page in hope it
issues a lot of splitting):

  echo 3 > /proc/sys/vm/drop_caches

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
      --buffered=1 --ioengine=mmap -thp=1 --readonly \
      --rw=randread --time_based --ramp_time=30s --runtime=10m \
      --group_reporting

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
      --buffered=1 --ioengine=mmap \
      --rw=randread --time_based --runtime=5s --group_reporting

Before:
bw (  KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops        : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976·

READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec

After (+12.5%):
bw (  KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops        : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146

READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec

The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.

Link: https://lkml.kernel.org/r/20240415171857.19244-5-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:09 -07:00
Kairui Song
a4864671ca lib/xarray: introduce a new helper xas_get_order
It can be used after xas_load to check the order of loaded entries. 
Compared to xa_get_order, it saves an XA_STATE and avoid a rewalk.

Added new test for xas_get_order, to make the test work, we have to export
xas_get_order with EXPORT_SYMBOL_GPL.

Also fix a sparse warning by checking the slot value with xa_entry instead
of accessing it directly, as suggested by Matthew Wilcox.

[kasong@tencent.com: simplify comment, sparse warning fix, per Matthew Wilcox]
  Link: https://lkml.kernel.org/r/20240416071722.45997-4-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-4-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:09 -07:00
Kairui Song
b2ebcf9d3d mm/filemap: clean up hugetlb exclusion code
__filemap_add_folio only has two callers, one never passes hugetlb folio
and one always passes in hugetlb folio.  So move the hugetlb related
cgroup charging out of it to make the code cleaner.

Link: https://lkml.kernel.org/r/20240415171857.19244-3-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:09 -07:00
Kairui Song
de60fd8dde mm/filemap: return early if failed to allocate memory for split
Patch series "mm/filemap: optimize folio adding and splitting", v4.

Currently, at least 3 tree walks are needed for filemap folio adding if
the folio is previously evicted.  One for getting the order of current
slot, one for ranged conflict check, and one for another order retrieving.
If a split is needed, more walks are needed.

This series is trying to merge these walks, and speed up
filemap_add_folio, I see a 7.5% - 12.5% performance gain for fio stress
test.

So instead of doing multiple tree walks, do one optimism range check with
lock hold, and exit if raced with another insertion.  If a shadow exists,
check it with a new xas_get_order helper before releasing the lock to
avoid redundant tree walks for getting its order.

Drop the lock and do the allocation only if a split is needed.

In the best case, it only need to walk the tree once.  If it needs to
alloc and split, 3 walks are issued (One for first ranged conflict check
and order retrieving, one for the second check after allocation, one for
the insert after split).

Testing with 4K pages, in an 8G cgroup, with 16G brd as block device:

  echo 3 > /proc/sys/vm/drop_caches

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
    --buffered=1 --ioengine=mmap --rw=randread --time_based \
    --ramp_time=30s --runtime=5m --group_reporting

Before:
bw (  MiB/s): min= 1027, max= 3520, per=100.00%, avg=2445.02, stdev=18.90, samples=8691
iops        : min=263001, max=901288, avg=625924.36, stdev=4837.28, samples=8691

After (+7.3%):
bw (  MiB/s): min=  493, max= 3947, per=100.00%, avg=2625.56, stdev=25.74, samples=8651
iops        : min=126454, max=1010681, avg=672142.61, stdev=6590.48, samples=8651

Test result with THP (do a THP randread then switch to 4K page in hope it
issues a lot of splitting):

  echo 3 > /proc/sys/vm/drop_caches

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
      --buffered=1 --ioengine=mmap -thp=1 --readonly \
      --rw=randread --time_based --ramp_time=30s --runtime=10m \
      --group_reporting

  fio -name=cached --numjobs=16 --filename=/mnt/test.img \
      --buffered=1 --ioengine=mmap \
      --rw=randread --time_based --runtime=5s --group_reporting

Before:
bw (  KiB/s): min= 4141, max=14202, per=100.00%, avg=7935.51, stdev=96.85, samples=18976
iops        : min= 1029, max= 3548, avg=1979.52, stdev=24.23, samples=18976·

READ: bw=4545B/s (4545B/s), 4545B/s-4545B/s (4545B/s-4545B/s), io=64.0KiB (65.5kB), run=14419-14419msec

After (+10.4%):
bw (  KiB/s): min= 4611, max=15370, per=100.00%, avg=8928.74, stdev=105.17, samples=19146
iops        : min= 1151, max= 3842, avg=2231.27, stdev=26.29, samples=19146

READ: bw=4635B/s (4635B/s), 4635B/s-4635B/s (4635B/s-4635B/s), io=64.0KiB (65.5kB), run=14137-14137msec

The performance is better for both 4K (+7.5%) and THP (+12.5%) cached read.


This patch (of 4):

xas_split_alloc could fail with NOMEM, and in such case, it should abort
early instead of keep going and fail the xas_split below.

Link: https://lkml.kernel.org/r/20240416071722.45997-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-1-ryncsn@gmail.com
Link: https://lkml.kernel.org/r/20240415171857.19244-2-ryncsn@gmail.com
Signed-off-by: Kairui Song <kasong@tencent.com>
Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:09 -07:00
David Hildenbrand
ebb34f78d7 mm: convert folio_estimated_sharers() to folio_likely_mapped_shared()
Callers of folio_estimated_sharers() only care about "mapped shared vs. 
mapped exclusively", not the exact estimate of sharers.  Let's consolidate
and unify the condition users are checking.  While at it clarify the
semantics and extend the discussion on the fuzziness.

Use the "likely mapped shared" terminology to better express what the
(adjusted) function actually checks.

Whether a partially-mappable folio is more likely to not be partially
mapped than partially mapped is debatable.  In the future, we might be
able to improve our estimate for partially-mappable folios, though.

Note that we will now consistently detect "mapped shared" only if the
first subpage is actually mapped multiple times.  When the first subpage
is not mapped, we will consistently detect it as "mapped exclusively". 
This change should currently only affect the usage in
madvise_free_pte_range() and queue_folios_pte_range() for large folios: if
the first page was already unmapped, we would have skipped the folio.

[david@redhat.com: folio_likely_mapped_shared() kerneldoc fixup]
  Link: https://lkml.kernel.org/r/dd0ad9f2-2d7a-45f3-9ba3-979488c7dd27@redhat.com
Link: https://lkml.kernel.org/r/20240227201548.857831-1-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
Acked-by: Barry Song <v-songbaohua@oppo.com>
Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:08 -07:00
Zi Yan
7262f208ca mm/migrate: split source folio if it is on deferred split list
If the source folio is on deferred split list, it is likely some subpages
are not used.  Split it before migration to avoid migrating unused
subpages.

Commit 616b837153 ("mm: thp: enable thp migration in generic path") did
not check if a THP is on deferred split list before migration, thus, the
destination THP is never put on deferred split list even if the source THP
might be.  The opportunity of reclaiming free pages in a partially mapped
THP during deferred list scanning is lost, but no other harmful
consequence is present[1].

[1]: https://lore.kernel.org/linux-mm/03CE3A00-917C-48CC-8E1C-6A98713C817C@nvidia.com/

[zi.yan@sent.com: fix an error in migrate_misplaced_folio()]
  Link: https://lkml.kernel.org/r/20240326150031.569387-1-zi.yan@sent.com
Link: https://lkml.kernel.org/r/20240322193304.522496-1-zi.yan@sent.com
Fixes: 616b837153 ("mm: thp: enable thp migration in generic path")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-25 20:56:08 -07:00