Commit Graph

18411 Commits

Author SHA1 Message Date
Jackie Liu
83d7d04f9d mm/kfence: print disabling or re-enabling message
By printing information, we can friendly prompt the status change
information of kfence by dmesg and record by syslog.

Also, set kfence_enabled to false only when needed.

Link: https://lkml.kernel.org/r/20220518073105.3160335-1-liu.yun@linux.dev
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Co-developed-by: Marco Elver <elver@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Reviewed-by: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-25 10:47:48 -07:00
Zi Yan
88ee134320 mm: fix a potential infinite loop in start_isolate_page_range()
In isolate_single_pageblock() called by start_isolate_page_range(), there
are some pageblock isolation issues causing a potential infinite loop when
isolating a page range.  This is reported by Qian Cai.

1. the pageblock was isolated by just changing pageblock migratetype
   without checking unmovable pages. Calling set_migratetype_isolate() to
   isolate pageblock properly.
2. an off-by-one error caused migrating pages unnecessarily, since the page
   is not crossing pageblock boundary.
3. migrating a compound page across pageblock boundary then splitting the
   free page later has a small race window that the free page might be
   allocated again, so that the code will try again, causing an potential
   infinite loop. Temporarily set the to-be-migrated page's pageblock to
   MIGRATE_ISOLATE to prevent that and bail out early if no free page is
   found after page migration.

An additional fix to split_free_page() aims to avoid crashing in
__free_one_page().  When the free page is split at the specified
split_pfn_offset, free_page_order should check both the first bit of
free_page_pfn and the last bit of split_pfn_offset and use the smaller
one.  For example, if free_page_pfn=0x10000, split_pfn_offset=0xc000,
free_page_order should first be 0x8000 then 0x4000, instead of 0x4000 then
0x8000, which the original algorithm did.

[akpm@linux-foundation.org: suppress min() warning]
Link: https://lkml.kernel.org/r/20220524194756.1698351-1-zi.yan@sent.com
Fixes: b2c9e2fbba ("mm: make alloc_contig_range work at pageblock granularity")
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: David Hildenbrand <david@redhat.com>
Cc: Eric Ren <renzhengeek@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-25 10:47:47 -07:00
Hugh Dickins
e384200e70 mm/shmem: fix shmem folio swapoff hang
Shmem swapoff makes no progress: the index to indices is not incremented. 
But "ret" is no longer a return value, so use folio_batch_count() instead.

Link: https://lkml.kernel.org/r/c32bee8a-f0aa-245-f94e-24dd271924fa@google.com
Fixes: da08e9b793 ("mm/shmem: convert shmem_swapin_page() to shmem_swapin_folio()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Tested-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-25 10:47:47 -07:00
Linus Torvalds
2e17ce1106 slab changes for 5.19
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEjUuTAak14xi+SF7M4CHKc/GJqRAFAmKLUYoACgkQ4CHKc/GJ
 qRCMFwf/Tm1cf2JLUANrT58rjkrrj15EtKhnJdm5/yvmsWKps7WKPP4jeUHe+NTO
 NovAGt67lG1l6LMLczZkWckOkWlyYjC42CPDLdxRUkk+zQRb3nRA8Nbt6VTNBOfQ
 0wTLOqXgsNXdSPSVUsKGL8kIAHNQTMX+7TjO6s7CXy/5Qag6r1iZX2HZxASOHxLa
 yYzaJ9pJRZBAMGnzV6L6v0J8KPnjYO0fB68S1qYQTbhoRxchtFF+0AIr1JydGgBI
 9RFUowTrSpJkZtcSjabopvZz4JfCRDP+eAxkyw13feji7MG1FMX74HgDdw+HhzTv
 R2/6iA5WcsmzcXopsfMx8lUP/KIfPw==
 =gnSc
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab updates from Vlastimil Babka:

 - Conversion of slub_debug stack traces to stackdepot, allowing more
   useful debugfs-based inspection for e.g. memory leak debugging.
   Allocation and free debugfs info now includes full traces and is
   sorted by the unique trace frequency.

   The stackdepot conversion was already attempted last year but
   reverted by ae14c63a9f. The memory overhead (while not actually
   enabled on boot) has been meanwhile solved by making the large
   stackdepot allocation dynamic. The xfstest issues haven't been
   reproduced on current kernel locally nor in -next, so the slab cache
   layout changes that originally made that bug manifest were probably
   not the root cause.

 - Refactoring of dma-kmalloc caches creation.

 - Trivial cleanups such as removal of unused parameters, fixes and
   clarifications of comments.

 - Hyeonggon Yoo joins as a reviewer.

* tag 'slab-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  MAINTAINERS: add myself as reviewer for slab
  mm/slub: remove unused kmem_cache_order_objects max
  mm: slab: fix comment for __assume_kmalloc_alignment
  mm: slab: fix comment for ARCH_KMALLOC_MINALIGN
  mm/slub: remove unneeded return value of slab_pad_check
  mm/slab_common: move dma-kmalloc caches creation into new_kmalloc_cache()
  mm/slub: remove meaningless node check in ___slab_alloc()
  mm/slub: remove duplicate flag in allocate_slab()
  mm/slub: remove unused parameter in setup_object*()
  mm/slab.c: fix comments
  slab, documentation: add description of debugfs files for SLUB caches
  mm/slub: sort debugfs output by frequency of stack traces
  mm/slub: distinguish and print stack traces in debugfs files
  mm/slub: use stackdepot to save stack trace in objects
  mm/slub: move struct track init out of set_track()
  lib/stackdepot: allow requesting early initialization dynamically
  mm/slub, kunit: Make slub_kunit unaffected by user specified flags
  mm/slab: remove some unused functions
2022-05-25 10:24:04 -07:00
Linus Torvalds
fdaf9a5840 Page cache changes for 5.19
- Appoint myself page cache maintainer
 
  - Fix how scsicam uses the page cache
 
  - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS
 
  - Remove the AOP flags entirely
 
  - Remove pagecache_write_begin() and pagecache_write_end()
 
  - Documentation updates
 
  - Convert several address_space operations to use folios:
    - is_dirty_writeback
    - readpage becomes read_folio
    - releasepage becomes release_folio
    - freepage becomes free_folio
 
  - Change filler_t to require a struct file pointer be the first argument
    like ->read_folio
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmKNMDUACgkQDpNsjXcp
 gj4/mwf/bpHhXH4ZoNIvtUpTF6rZbqeffmc0VrbxCZDZ6igRnRPglxZ9H9v6L53O
 7B0FBQIfxgNKHZpdqGdOkv8cjg/GMe/HJUbEy5wOakYPo4L9fZpHbDZ9HM2Eankj
 xBqLIBgBJ7doKr+Y62DAN19TVD8jfRfVtli5mqXJoNKf65J7BkxljoTH1L3EXD9d
 nhLAgyQjR67JQrT/39KMW+17GqLhGefLQ4YnAMONtB6TVwX/lZmigKpzVaCi4r26
 bnk5vaR/3PdjtNxIoYvxdc71y2Eg05n2jEq9Wcy1AaDv/5vbyZUlZ2aBSaIVbtKX
 WfrhN9O3L0bU5qS7p9PoyfLc9wpq8A==
 =djLv
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache

Pull page cache updates from Matthew Wilcox:

 - Appoint myself page cache maintainer

 - Fix how scsicam uses the page cache

 - Use the memalloc_nofs_save() API to replace AOP_FLAG_NOFS

 - Remove the AOP flags entirely

 - Remove pagecache_write_begin() and pagecache_write_end()

 - Documentation updates

 - Convert several address_space operations to use folios:
     - is_dirty_writeback
     - readpage becomes read_folio
     - releasepage becomes release_folio
     - freepage becomes free_folio

 - Change filler_t to require a struct file pointer be the first
   argument like ->read_folio

* tag 'folio-5.19' of git://git.infradead.org/users/willy/pagecache: (107 commits)
  nilfs2: Fix some kernel-doc comments
  Appoint myself page cache maintainer
  fs: Remove aops->freepage
  secretmem: Convert to free_folio
  nfs: Convert to free_folio
  orangefs: Convert to free_folio
  fs: Add free_folio address space operation
  fs: Convert drop_buffers() to use a folio
  fs: Change try_to_free_buffers() to take a folio
  jbd2: Convert release_buffer_page() to use a folio
  jbd2: Convert jbd2_journal_try_to_free_buffers to take a folio
  reiserfs: Convert release_buffer_page() to use a folio
  fs: Remove last vestiges of releasepage
  ubifs: Convert to release_folio
  reiserfs: Convert to release_folio
  orangefs: Convert to release_folio
  ocfs2: Convert to release_folio
  nilfs2: Remove comment about releasepage
  nfs: Convert to release_folio
  jfs: Convert to release_folio
  ...
2022-05-24 19:55:07 -07:00
Linus Torvalds
0bf13a8436 kernel-hardening updates for v5.19-rc1
- usercopy hardening expanded to check other allocation types
   (Matthew Wilcox, Yuanzheng Song)
 
 - arm64 stackleak behavioral improvements (Mark Rutland)
 
 - arm64 CFI code gen improvement (Sami Tolvanen)
 
 - LoadPin LSM block dev API adjustment (Christoph Hellwig)
 
 - Clang randstruct support (Bill Wendling, Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmKL1kMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJlz6D/9lYEwDQYwKVK6fsXdgcs/eUkqc
 P06KGm7jDiYiua34LMpgu35wkRcxVDzB92kzQmt7yaVqhlIGjO9wnP+uZrq8q/LS
 X9FSb457fREg0XLPX5XC60abHYyikvgJMf06dSLaBcRq1Wzqwp5JZPpLZJUAM2ab
 rM1Vq0brfF1+lPAPECx1sYYNksP9XTw0dtzUu8D9tlTQDFAhKYhV6Io5yRFkA4JH
 ELSHjJHlNgLYeZE5IfWHRQBb+yofjnt61IwoVkqa5lSfoyvKpBPF5G+3gOgtdkyv
 A8So2aG/bMNUUY80Th5ojiZ6V7z5SYjUmHRil6I/swAdkc825n2wM+AQqsxv6U4I
 VvGz3cxaKklERw5N+EJw4amivcgm1jEppZ7qCx9ysLwVg/LI050qhv/T10TYPmOX
 0sQEpZvbKuqGb6nzWo6DME8OpZ27yIa/oRzBHdkIkfkEefYlKWS+dfvWb/73cltj
 jx066Znk1hHZWGT48EsRmxdGAHn4kfIMcMgIs1ki1OO2II6LoXyaFJ0wSAYItxpz
 5gCmDMjkGFRrtXXPEhi6kfKKpOuQux+BmpbVfEzox7Gnrf45sp92cYLncmpAsFB3
 91nPa4/utqb/9ijFCIinazLdcUBPO8I1C8FOHDWSFCnNt4d3j2ozpLbrKWyQsm7+
 RCGdcy+NU/FH1FwZlg==
 =nxsC
 -----END PGP SIGNATURE-----

Merge tag 'kernel-hardening-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull kernel hardening updates from Kees Cook:

 - usercopy hardening expanded to check other allocation types (Matthew
   Wilcox, Yuanzheng Song)

 - arm64 stackleak behavioral improvements (Mark Rutland)

 - arm64 CFI code gen improvement (Sami Tolvanen)

 - LoadPin LSM block dev API adjustment (Christoph Hellwig)

 - Clang randstruct support (Bill Wendling, Kees Cook)

* tag 'kernel-hardening-v5.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (34 commits)
  loadpin: stop using bdevname
  mm: usercopy: move the virt_addr_valid() below the is_vmalloc_addr()
  gcc-plugins: randstruct: Remove cast exception handling
  af_unix: Silence randstruct GCC plugin warning
  niu: Silence randstruct warnings
  big_keys: Use struct for internal payload
  gcc-plugins: Change all version strings match kernel
  randomize_kstack: Improve docs on requirements/rationale
  lkdtm/stackleak: fix CONFIG_GCC_PLUGIN_STACKLEAK=n
  arm64: entry: use stackleak_erase_on_task_stack()
  stackleak: add on/off stack variants
  lkdtm/stackleak: check stack boundaries
  lkdtm/stackleak: prevent unexpected stack usage
  lkdtm/stackleak: rework boundary management
  lkdtm/stackleak: avoid spurious failure
  stackleak: rework poison scanning
  stackleak: rework stack high bound handling
  stackleak: clarify variable names
  stackleak: rework stack low bound handling
  stackleak: remove redundant check
  ...
2022-05-24 12:27:09 -07:00
Linus Torvalds
ac2ab99072 Random number generator updates for Linux 5.19-rc1.
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEq5lC5tSkz8NBJiCnSfxwEqXeA64FAmKKpM8ACgkQSfxwEqXe
 A6726w/+OJimGd4arvpSmdn+vxepSyDLgKfwM0x5zprRVd16xg8CjJr4eMonTesq
 YvtJRqpetb53MB+sMhutlvQqQzrjtf2MBkgPwF4I2gUrk7vLD45Q+AGdGhi/rUwz
 wHGA7xg1FHLHia2M/9idSqi8QlZmUP4u4l5ZnMyTUHiwvRD6XOrWKfqvUSawNzyh
 hCWlTUxDrjizsW5YpsJX/MkRadSC8loJEk5ByZebow6nRPfurJvqfrcOMgHyNrbY
 pOZ/CGPxcetMqotL2TuuJt5wKmenqYhIWGAp3YM2SWWgU2ueBZekW8AYeMfgUcvh
 LWV93RpSuAnE5wsdjIULvjFnEDJBf8ihfMnMrd9G5QjQu44tuKWfY2MghLSpYzaR
 V6UFbRmhrqhqiStHQXOvk1oqxtpbHlc9zzJLmvPmDJcbvzXQ9Opk5GVXAmdtnHnj
 M/ty3wGWxucY6mHqT8MkCShSSslbgEtc1pEIWHdrUgnaiSVoCVBEO+9LqLbjvOTm
 XA/6YtoiCE5FasK51pir1zVb2GORQn0v8HnuAOsusD/iPAlRQ/G5jZkaXbwRQI6j
 atYL1svqvSKn5POnzqAlMUXfMUr19K5xqJdp7i6qmlO1Vq6Z+tWbCQgD1JV+Wjkb
 CMyvXomFCFu4aYKGRE2SBRnWLRghG3kYHqEQ15yTPMQerxbUDNg=
 =SUr3
 -----END PGP SIGNATURE-----

Merge tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random

Pull random number generator updates from Jason Donenfeld:
 "These updates continue to refine the work began in 5.17 and 5.18 of
  modernizing the RNG's crypto and streamlining and documenting its
  code.

  New for 5.19, the updates aim to improve entropy collection methods
  and make some initial decisions regarding the "premature next" problem
  and our threat model. The cloc utility now reports that random.c is
  931 lines of code and 466 lines of comments, not that basic metrics
  like that mean all that much, but at the very least it tells you that
  this is very much a manageable driver now.

  Here's a summary of the various updates:

   - The random_get_entropy() function now always returns something at
     least minimally useful. This is the primary entropy source in most
     collectors, which in the best case expands to something like RDTSC,
     but prior to this change, in the worst case it would just return 0,
     contributing nothing. For 5.19, additional architectures are wired
     up, and architectures that are entirely missing a cycle counter now
     have a generic fallback path, which uses the highest resolution
     clock available from the timekeeping subsystem.

     Some of those clocks can actually be quite good, despite the CPU
     not having a cycle counter of its own, and going off-core for a
     stamp is generally thought to increase jitter, something positive
     from the perspective of entropy gathering. Done very early on in
     the development cycle, this has been sitting in next getting some
     testing for a while now and has relevant acks from the archs, so it
     should be pretty well tested and fine, but is nonetheless the thing
     I'll be keeping my eye on most closely.

   - Of particular note with the random_get_entropy() improvements is
     MIPS, which, on CPUs that lack the c0 count register, will now
     combine the high-speed but short-cycle c0 random register with the
     lower-speed but long-cycle generic fallback path.

   - With random_get_entropy() now always returning something useful,
     the interrupt handler now collects entropy in a consistent
     construction.

   - Rather than comparing two samples of random_get_entropy() for the
     jitter dance, the algorithm now tests many samples, and uses the
     amount of differing ones to determine whether or not jitter entropy
     is usable and how laborious it must be. The problem with comparing
     only two samples was that if the cycle counter was extremely slow,
     but just so happened to be on the cusp of a change, the slowness
     wouldn't be detected. Taking many samples fixes that to some
     degree.

     This, combined with the other improvements to random_get_entropy(),
     should make future unification of /dev/random and /dev/urandom
     maybe more possible. At the very least, were we to attempt it again
     today (we're not), it wouldn't break any of Guenter's test rigs
     that broke when we tried it with 5.18. So, not today, but perhaps
     down the road, that's something we can revisit.

   - We attempt to reseed the RNG immediately upon waking up from system
     suspend or hibernation, making use of the various timestamps about
     suspend time and such available, as well as the usual inputs such
     as RDRAND when available.

   - Batched randomness now falls back to ordinary randomness before the
     RNG is initialized. This provides more consistent guarantees to the
     types of random numbers being returned by the various accessors.

   - The "pre-init injection" code is now gone for good. I suspect you
     in particular will be happy to read that, as I recall you
     expressing your distaste for it a few months ago. Instead, to avoid
     a "premature first" issue, while still allowing for maximal amount
     of entropy availability during system boot, the first 128 bits of
     estimated entropy are used immediately as it arrives, with the next
     128 bits being buffered. And, as before, after the RNG has been
     fully initialized, it winds up reseeding anyway a few seconds later
     in most cases. This resulted in a pretty big simplification of the
     initialization code and let us remove various ad-hoc mechanisms
     like the ugly crng_pre_init_inject().

   - The RNG no longer pretends to handle the "premature next" security
     model, something that various academics and other RNG designs have
     tried to care about in the past. After an interesting mailing list
     thread, these issues are thought to be a) mainly academic and not
     practical at all, and b) actively harming the real security of the
     RNG by delaying new entropy additions after a potential compromise,
     making a potentially bad situation even worse. As well, in the
     first place, our RNG never even properly handled the premature next
     issue, so removing an incomplete solution to a fake problem was
     particularly nice.

     This allowed for numerous other simplifications in the code, which
     is a lot cleaner as a consequence. If you didn't see it before,
     https://lore.kernel.org/lkml/YmlMGx6+uigkGiZ0@zx2c4.com/ may be a
     thread worth skimming through.

   - While the interrupt handler received a separate code path years ago
     that avoids locks by using per-cpu data structures and a faster
     mixing algorithm, in order to reduce interrupt latency, input and
     disk events that are triggered in hardirq handlers were still
     hitting locks and more expensive algorithms. Those are now
     redirected to use the faster per-cpu data structures.

   - Rather than having the fake-crypto almost-siphash-based random32
     implementation be used right and left, and in many places where
     cryptographically secure randomness is desirable, the batched
     entropy code is now fast enough to replace that.

   - As usual, numerous code quality and documentation cleanups. For
     example, the initialization state machine now uses enum symbolic
     constants instead of just hard coding numbers everywhere.

   - Since the RNG initializes once, and then is always initialized
     thereafter, a pretty heavy amount of code used during that
     initialization is never used again. It is now completely cordoned
     off using static branches and it winds up in the .text.unlikely
     section so that it doesn't reduce cache compactness after the RNG
     is ready.

   - A variety of functions meant for waiting on the RNG to be
     initialized were only used by vsprintf, and in not a particularly
     optimal way. Replacing that usage with a more ordinary setup made
     it possible to remove those functions.

   - A cleanup of how we warn userspace about the use of uninitialized
     /dev/urandom and uninitialized get_random_bytes() usage.
     Interestingly, with the change you merged for 5.18 that attempts to
     use jitter (but does not block if it can't), the majority of users
     should never see those warnings for /dev/urandom at all now, and
     the one for in-kernel usage is mainly a debug thing.

   - The file_operations struct for /dev/[u]random now implements
     .read_iter and .write_iter instead of .read and .write, allowing it
     to also implement .splice_read and .splice_write, which makes
     splice(2) work again after it was broken here (and in many other
     places in the tree) during the set_fs() removal. This was a bit of
     a last minute arrival from Jens that hasn't had as much time to
     bake, so I'll be keeping my eye on this as well, but it seems
     fairly ordinary. Unfortunately, read_iter() is around 3% slower
     than read() in my tests, which I'm not thrilled about. But Jens and
     Al, spurred by this observation, seem to be making progress in
     removing the bottlenecks on the iter paths in the VFS layer in
     general, which should remove the performance gap for all drivers.

   - Assorted other bug fixes, cleanups, and optimizations.

   - A small SipHash cleanup"

* tag 'random-5.19-rc1-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/crng/random: (49 commits)
  random: check for signals after page of pool writes
  random: wire up fops->splice_{read,write}_iter()
  random: convert to using fops->write_iter()
  random: convert to using fops->read_iter()
  random: unify batched entropy implementations
  random: move randomize_page() into mm where it belongs
  random: remove mostly unused async readiness notifier
  random: remove get_random_bytes_arch() and add rng_has_arch_random()
  random: move initialization functions out of hot pages
  random: make consistent use of buf and len
  random: use proper return types on get_random_{int,long}_wait()
  random: remove extern from functions in header
  random: use static branch for crng_ready()
  random: credit architectural init the exact amount
  random: handle latent entropy and command line from random_init()
  random: use proper jiffies comparison macro
  random: remove ratelimiting for in-kernel unseeded randomness
  random: move initialization out of reseeding hot path
  random: avoid initializing twice in credit race
  random: use symbolic constants for crng_init states
  ...
2022-05-24 11:58:10 -07:00
Linus Torvalds
143a6252e1 arm64 updates for 5.19:
- Initial support for the ARMv9 Scalable Matrix Extension (SME). SME
   takes the approach used for vectors in SVE and extends this to provide
   architectural support for matrix operations. No KVM support yet, SME
   is disabled in guests.
 
 - Support for crashkernel reservations above ZONE_DMA via the
   'crashkernel=X,high' command line option.
 
 - btrfs search_ioctl() fix for live-lock with sub-page faults.
 
 - arm64 perf updates: support for the Hisilicon "CPA" PMU for monitoring
   coherent I/O traffic, support for Arm's CMN-650 and CMN-700
   interconnect PMUs, minor driver fixes, kerneldoc cleanup.
 
 - Kselftest updates for SME, BTI, MTE.
 
 - Automatic generation of the system register macros from a 'sysreg'
   file describing the register bitfields.
 
 - Update the type of the function argument holding the ESR_ELx register
   value to unsigned long to match the architecture register size
   (originally 32-bit but extended since ARMv8.0).
 
 - stacktrace cleanups.
 
 - ftrace cleanups.
 
 - Miscellaneous updates, most notably: arm64-specific huge_ptep_get(),
   avoid executable mappings in kexec/hibernate code, drop TLB flushing
   from get_clear_flush() (and rename it to get_clear_contig()),
   ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAmKH19IACgkQa9axLQDI
 XvEFWg//bf0p6zjeNaOJmBbyVFsXsVyYiEaLUpFPUs3oB+81s2YZ+9i1rgMrNCft
 EIDQ9+/HgScKxJxnzWf68heMdcBDbk76VJtLALExbge6owFsjByQDyfb/b3v/bLd
 ezAcGzc6G5/FlI1IP7ct4Z9MnQry4v5AG8lMNAHjnf6GlBS/tYNAqpmj8HpQfgRQ
 ZbhfZ8Ayu3TRSLWL39NHVevpmxQm/bGcpP3Q9TtjUqg0r1FQ5sK/LCqOksueIAzT
 UOgUVYWSFwTpLEqbYitVqgERQp9LiLoK5RmNYCIEydfGM7+qmgoxofSq5e2hQtH2
 SZM1XilzsZctRbBbhMit1qDBqMlr/XAy/R5FO0GauETVKTaBhgtj6mZGyeC9nU/+
 RGDljaArbrOzRwMtSuXF+Fp6uVo5spyRn1m8UT/k19lUTdrV9z6EX5Fzuc4Mnhed
 oz4iokbl/n8pDObXKauQspPA46QpxUYhrAs10B/ELc3yyp/Qj3jOfzYHKDNFCUOq
 HC9mU+YiO9g2TbYgCrrFM6Dah2E8fU6/cR0ZPMeMgWK4tKa+6JMEINYEwak9e7M+
 8lZnvu3ntxiJLN+PrPkiPyG+XBh2sux1UfvNQ+nw4Oi9xaydeX7PCbQVWmzTFmHD
 q7UPQ8220e2JNCha9pULS8cxDLxiSksce06DQrGXwnHc1Ir7T04=
 =0DjE
 -----END PGP SIGNATURE-----

Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 updates from Catalin Marinas:

 - Initial support for the ARMv9 Scalable Matrix Extension (SME).

   SME takes the approach used for vectors in SVE and extends this to
   provide architectural support for matrix operations. No KVM support
   yet, SME is disabled in guests.

 - Support for crashkernel reservations above ZONE_DMA via the
   'crashkernel=X,high' command line option.

 - btrfs search_ioctl() fix for live-lock with sub-page faults.

 - arm64 perf updates: support for the Hisilicon "CPA" PMU for
   monitoring coherent I/O traffic, support for Arm's CMN-650 and
   CMN-700 interconnect PMUs, minor driver fixes, kerneldoc cleanup.

 - Kselftest updates for SME, BTI, MTE.

 - Automatic generation of the system register macros from a 'sysreg'
   file describing the register bitfields.

 - Update the type of the function argument holding the ESR_ELx register
   value to unsigned long to match the architecture register size
   (originally 32-bit but extended since ARMv8.0).

 - stacktrace cleanups.

 - ftrace cleanups.

 - Miscellaneous updates, most notably: arm64-specific huge_ptep_get(),
   avoid executable mappings in kexec/hibernate code, drop TLB flushing
   from get_clear_flush() (and rename it to get_clear_contig()),
   ARCH_NR_GPIO bumped to 2048 for ARCH_APPLE.

* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (145 commits)
  arm64/sysreg: Generate definitions for FAR_ELx
  arm64/sysreg: Generate definitions for DACR32_EL2
  arm64/sysreg: Generate definitions for CSSELR_EL1
  arm64/sysreg: Generate definitions for CPACR_ELx
  arm64/sysreg: Generate definitions for CONTEXTIDR_ELx
  arm64/sysreg: Generate definitions for CLIDR_EL1
  arm64/sve: Move sve_free() into SVE code section
  arm64: Kconfig.platforms: Add comments
  arm64: Kconfig: Fix indentation and add comments
  arm64: mm: avoid writable executable mappings in kexec/hibernate code
  arm64: lds: move special code sections out of kernel exec segment
  arm64/hugetlb: Implement arm64 specific huge_ptep_get()
  arm64/hugetlb: Use ptep_get() to get the pte value of a huge page
  arm64: kdump: Do not allocate crash low memory if not needed
  arm64/sve: Generate ZCR definitions
  arm64/sme: Generate defintions for SVCR
  arm64/sme: Generate SMPRI_EL1 definitions
  arm64/sme: Automatically generate SMPRIMAP_EL2 definitions
  arm64/sme: Automatically generate SMIDR_EL1 defines
  arm64/sme: Automatically generate defines for SMCR
  ...
2022-05-23 21:06:11 -07:00
Linus Torvalds
115cd47132 for-5.19/block-2022-05-22
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmKKrUsQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpgDjD/44hY9h0JsOLoRH1IvFtuaH6n718JXuqG17
 hHCfmnAUVqj2jT00IUbVlUTd905bCGpfrodBL3PAmPev1zZHOUd/MnJKrSynJ+/s
 NJEMZQaHxLmocNDpJ1sZo7UbAFErsZXB0gVYUO8cH2bFYNu84H1mhRCOReYyqmvQ
 aIAASX5qRB/ciBQCivzAJl2jTdn4WOn5hWi9RLidQB7kSbaXGPmgKAuN88WI4H7A
 zQgAkEl2EEquyMI5tV1uquS7engJaC/4PsenF0S9iTyrhJLjneczJBJZKMLeMR8d
 sOm6sKJdpkrfYDyaA4PIkgmLoEGTtwGpqGHl4iXTyinUAxJoca5tmPvBb3wp66GE
 2Mr7pumxc1yJID2VHbsERXlOAX3aZNCowx2gum2MTRIO8g11Eu3aaVn2kv37MBJ2
 4R2a/cJFl5zj9M8536cG+Yqpy0DDVCCQKUIqEupgEu1dyfpznyWH5BTAHXi1E8td
 nxUin7uXdD0AJkaR0m04McjS/Bcmc1dc6I8xvkdUFYBqYCZWpKOTiEpIBlHg0XJA
 sxdngyz5lSYTGVA4o4QCrdR0Tx1n36A1IYFuQj0wzxBJYZ02jEZuII/A3dd+8hiv
 EY+VeUQeVIXFFuOcY+e0ScPpn7Nr17hAd1en/j2Hcoe4ZE8plqG2QTcnwgflcbis
 iomvJ4yk0Q==
 =0Rw1
 -----END PGP SIGNATURE-----

Merge tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:
 "Here are the core block changes for 5.19. This contains:

   - blk-throttle accounting fix (Laibin)

   - Series removing redundant assignments (Michal)

   - Expose bio cache via the bio_set, so that DM can use it (Mike)

   - Finish off the bio allocation interface cleanups by dealing with
     the weirdest member of the family. bio_kmalloc combines a kmalloc
     for the bio and bio_vecs with a hidden bio_init call and magic
     cleanup semantics (Christoph)

   - Clean up the block layer API so that APIs consumed by file systems
     are (almost) only struct block_device based, so that file systems
     don't have to poke into block layer internals like the
     request_queue (Christoph)

   - Clean up the blk_execute_rq* API (Christoph)

   - Clean up various lose end in the blk-cgroup code to make it easier
     to follow in preparation of reworking the blkcg assignment for bios
     (Christoph)

   - Fix use-after-free issues in BFQ when processes with merged queues
     get moved to different cgroups (Jan)

   - BFQ fixes (Jan)

   - Various fixes and cleanups (Bart, Chengming, Fanjun, Julia, Ming,
     Wolfgang, me)"

* tag 'for-5.19/block-2022-05-22' of git://git.kernel.dk/linux-block: (83 commits)
  blk-mq: fix typo in comment
  bfq: Remove bfq_requeue_request_body()
  bfq: Remove superfluous conversion from RQ_BIC()
  bfq: Allow current waker to defend against a tentative one
  bfq: Relax waker detection for shared queues
  blk-cgroup: delete rcu_read_lock_held() WARN_ON_ONCE()
  blk-throttle: Set BIO_THROTTLED when bio has been throttled
  blk-cgroup: Remove unnecessary rcu_read_lock/unlock()
  blk-cgroup: always terminate io.stat lines
  block, bfq: make bfq_has_work() more accurate
  block, bfq: protect 'bfqd->queued' by 'bfqd->lock'
  block: cleanup the VM accounting in submit_bio
  block: Fix the bio.bi_opf comment
  block: reorder the REQ_ flags
  blk-iocost: combine local_stat and desc_stat to stat
  block: improve the error message from bio_check_eod
  block: allow passing a NULL bdev to bio_alloc_clone/bio_init_clone
  block: remove superfluous calls to blkcg_bio_issue_init
  kthread: unexport kthread_blkcg
  blk-cgroup: cleanup blkcg_maybe_throttle_current
  ...
2022-05-23 13:56:39 -07:00
Vlastimil Babka
e001897da6 Merge branches 'slab/for-5.19/stackdepot' and 'slab/for-5.19/refactor' into slab/for-linus 2022-05-23 11:14:32 +02:00
Kefeng Wang
02e34fff19 mm: damon: use HPAGE_PMD_SIZE
Use HPAGE_PMD_SIZE instead of open coding.

Link: https://lkml.kernel.org/r/20220517145120.118523-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:55 -07:00
Qi Zheng
3f913fc5f9 mm: fix missing handler for __GFP_NOWARN
We expect no warnings to be issued when we specify __GFP_NOWARN, but
currently in paths like alloc_pages() and kmalloc(), there are still some
warnings printed, fix it.

But for some warnings that report usage problems, we don't deal with them.
If such warnings are printed, then we should fix the usage problems. 
Such as the following case:

	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

[zhengqi.arch@bytedance.com: v2]
 Link: https://lkml.kernel.org/r/20220511061951.1114-1-zhengqi.arch@bytedance.com
Link: https://lkml.kernel.org/r/20220510113809.80626-1-zhengqi.arch@bytedance.com
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jiri Slaby <jirislaby@kernel.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:55 -07:00
Wonhyuk Yang
10e0f75302 mm/page_alloc: fix tracepoint mm_page_alloc_zone_locked()
Currently, trace point mm_page_alloc_zone_locked() doesn't show correct
information.

First, when alloc_flag has ALLOC_HARDER/ALLOC_CMA, page can be allocated
from MIGRATE_HIGHATOMIC/MIGRATE_CMA.  Nevertheless, tracepoint use
requested migration type not MIGRATE_HIGHATOMIC and MIGRATE_CMA.

Second, after commit 44042b4498 ("mm/page_alloc: allow high-order pages
to be stored on the per-cpu lists") percpu-list can store high order
pages.  But trace point determine whether it is a refiil of percpu-list by
comparing requested order and 0.

To handle these problems, make mm_page_alloc_zone_locked() only be called
by __rmqueue_smallest with correct migration type.  With a new argument
called percpu_refill, it can show roughly whether it is a refill of
percpu-list.

Link: https://lkml.kernel.org/r/20220512025307.57924-1-vvghjk1234@gmail.com
Signed-off-by: Wonhyuk Yang <vvghjk1234@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Baik Song An <bsahn@etri.re.kr>
Cc: Hong Yeon Kim <kimhy@etri.re.kr>
Cc: Taeung Song <taeung@reallinux.co.kr>
Cc: <linuxgeek@linuxgeek.io>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:54 -07:00
Fanjun Kong
3645b5ec0a mm/page_owner.c: add missing __initdata attribute
This patch fixes two issues:
1. Add __initdata attribute according to include/linux/init.h:
	For initialized data:
	You should insert __initdata between the variable name and equal
	sign followed by value

2. Fix below error reported by checkpatch.pl:
	ERROR: do not initialise statics to false

Special thanks to Muchun Song :)

Link: https://lkml.kernel.org/r/20220516030039.1487005-1-bh1scw@gmail.com
Signed-off-by: Fanjun Kong <bh1scw@gmail.com>
Suggested-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:54 -07:00
Luo Meng
d14f5efadd tmpfs: fix undefined-behaviour in shmem_reconfigure()
When shmem_reconfigure() calls __percpu_counter_compare(), the second
parameter is unsigned long long.  But in the definition of
__percpu_counter_compare(), the second parameter is s64.  So when
__percpu_counter_compare() executes abs(count - rhs), UBSAN shows the
following warning:

================================================================================
UBSAN: Undefined behaviour in lib/percpu_counter.c:209:6
signed integer overflow:
0 - -9223372036854775808 cannot be represented in type 'long long int'
CPU: 1 PID: 9636 Comm: syz-executor.2 Tainted: G                 ---------r-  - 4.18.0 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
 __dump_stack home/install/linux-rh-3-10/lib/dump_stack.c:77 [inline]
 dump_stack+0x125/0x1ae home/install/linux-rh-3-10/lib/dump_stack.c:117
 ubsan_epilogue+0xe/0x81 home/install/linux-rh-3-10/lib/ubsan.c:159
 handle_overflow+0x19d/0x1ec home/install/linux-rh-3-10/lib/ubsan.c:190
 __percpu_counter_compare+0x124/0x140 home/install/linux-rh-3-10/lib/percpu_counter.c:209
 percpu_counter_compare home/install/linux-rh-3-10/./include/linux/percpu_counter.h:50 [inline]
 shmem_remount_fs+0x1ce/0x6b0 home/install/linux-rh-3-10/mm/shmem.c:3530
 do_remount_sb+0x11b/0x530 home/install/linux-rh-3-10/fs/super.c:888
 do_remount home/install/linux-rh-3-10/fs/namespace.c:2344 [inline]
 do_mount+0xf8d/0x26b0 home/install/linux-rh-3-10/fs/namespace.c:2844
 ksys_mount+0xad/0x120 home/install/linux-rh-3-10/fs/namespace.c:3075
 __do_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3089 [inline]
 __se_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3086 [inline]
 __x64_sys_mount+0xbf/0x160 home/install/linux-rh-3-10/fs/namespace.c:3086
 do_syscall_64+0xca/0x5c0 home/install/linux-rh-3-10/arch/x86/entry/common.c:298
 entry_SYSCALL_64_after_hwframe+0x6a/0xdf
RIP: 0033:0x46b5e9
Code: 5d db fa ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b db fa ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f54d5f22c68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 000000000077bf60 RCX: 000000000046b5e9
RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000000
RBP: 000000000077bf60 R08: 0000000020000140 R09: 0000000000000000
R10: 00000000026740a4 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd1fb1592f R14: 00007f54d5f239c0 R15: 000000000077bf6c
================================================================================

[akpm@linux-foundation.org: tweak error message text]
Link: https://lkml.kernel.org/r/20220513025225.2678727-1-luomeng12@huawei.com
Signed-off-by: Luo Meng <luomeng12@huawei.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:54 -07:00
Wang Cheng
018160ad31 mm/mempolicy: fix uninit-value in mpol_rebind_policy()
mpol_set_nodemask()(mm/mempolicy.c) does not set up nodemask when
pol->mode is MPOL_LOCAL.  Check pol->mode before access
pol->w.cpuset_mems_allowed in mpol_rebind_policy()(mm/mempolicy.c).

BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline]
BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
 mpol_rebind_policy mm/mempolicy.c:352 [inline]
 mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
 cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline]
 cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278
 cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515
 cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline]
 cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804
 __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520
 cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539
 cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852
 kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296
 call_write_iter include/linux/fs.h:2162 [inline]
 new_sync_write fs/read_write.c:503 [inline]
 vfs_write+0x1318/0x2030 fs/read_write.c:590
 ksys_write+0x28b/0x510 fs/read_write.c:643
 __do_sys_write fs/read_write.c:655 [inline]
 __se_sys_write fs/read_write.c:652 [inline]
 __x64_sys_write+0xdb/0x120 fs/read_write.c:652
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

Uninit was created at:
 slab_post_alloc_hook mm/slab.h:524 [inline]
 slab_alloc_node mm/slub.c:3251 [inline]
 slab_alloc mm/slub.c:3259 [inline]
 kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264
 mpol_new mm/mempolicy.c:293 [inline]
 do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853
 kernel_set_mempolicy mm/mempolicy.c:1504 [inline]
 __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline]
 __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507
 __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507
 do_syscall_x64 arch/x86/entry/common.c:51 [inline]
 do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
 entry_SYSCALL_64_after_hwframe+0x44/0xae

KMSAN: uninit-value in mpol_rebind_task (2)
https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc

This patch seems to fix below bug too.
KMSAN: uninit-value in mpol_rebind_mm (2)
https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b

The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy().
When syzkaller reproducer runs to the beginning of mpol_new(),

	    mpol_new() mm/mempolicy.c
	  do_mbind() mm/mempolicy.c
	kernel_mbind() mm/mempolicy.c

`mode` is 1(MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags`
is 0. Then

	mode = MPOL_LOCAL;
	...
	policy->mode = mode;
	policy->flags = flags;

will be executed. So in mpol_set_nodemask(),

	    mpol_set_nodemask() mm/mempolicy.c
	  do_mbind()
	kernel_mbind()

pol->mode is 4 (MPOL_LOCAL), that `nodemask` in `pol` is not initialized,
which will be accessed in mpol_rebind_policy().

Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain
Signed-off-by: Wang Cheng <wanngchenng@gmail.com>
Reported-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com>
Tested-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:54 -07:00
Minchan Kim
6d4675e601 mm: don't be stuck to rmap lock on reclaim path
The rmap locks(i_mmap_rwsem and anon_vma->root->rwsem) could be contended
under memory pressure if processes keep working on their vmas(e.g., fork,
mmap, munmap).  It makes reclaim path stuck.  In our real workload traces,
we see kswapd is waiting the lock for 300ms+(worst case, a sec) and it
makes other processes entering direct reclaim, which were also stuck on
the lock.

This patch makes lru aging path try_lock mode like shink_page_list so the
reclaim context will keep working with next lru pages without being stuck.
if it found the rmap lock contended, it rotates the page back to head of
lru in both active/inactive lrus to make them consistent behavior, which
is basic starting point rather than adding more heristic.

Since this patch introduces a new "contended" field as out-param along
with try_lock in-param in rmap_walk_control, it's not immutable any longer
if the try_lock is set so remove const keywords on rmap related functions.
Since rmap walking is already expensive operation, I doubt the const
would help sizable benefit( And we didn't have it until 5.17).

In a heavy app workload in Android, trace shows following statistics.  It
almost removes rmap lock contention from reclaim path.

Martin Liu reported:

Before:

   max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
         1632            0            1631   151.542173        31672    209  page_lock_anon_vma_read
          601            0             601   145.544681        28817    198  rmap_walk_file

After:

   max_dur(ms)  min_dur(ms)  max-min(dur)ms  avg_dur(ms)  sum_dur(ms)  count blocked_function
          NaN          NaN              NaN          NaN          NaN    0.0             NaN
            0            0                0     0.127645            1     12  rmap_walk_file

[minchan@kernel.org: add comment, per Matthew]
  Link: https://lkml.kernel.org/r/YnNqeB5tUf6LZ57b@google.com
Link: https://lkml.kernel.org/r/20220510215423.164547-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: John Dias <joaodias@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Martin Liu <liumartin@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:54 -07:00
Johannes Weiner
f4840ccfca zswap: memcg accounting
Applications can currently escape their cgroup memory containment when
zswap is enabled.  This patch adds per-cgroup tracking and limiting of
zswap backend memory to rectify this.

The existing cgroup2 memory.stat file is extended to show zswap statistics
analogous to what's in meminfo and vmstat.  Furthermore, two new control
files, memory.zswap.current and memory.zswap.max, are added to allow
tuning zswap usage on a per-workload basis.  This is important since not
all workloads benefit from zswap equally; some even suffer compared to
disk swap when memory contents don't compress well.  The optimal size of
the zswap pool, and the threshold for writeback, also depends on the size
of the workload's warm set.

The implementation doesn't use a traditional page_counter transaction. 
zswap is unconventional as a memory consumer in that we only know the
amount of memory to charge once expensive compression has occurred.  If
zwap is disabled or the limit is already exceeded we obviously don't want
to compress page upon page only to reject them all.  Instead, the limit is
checked against current usage, then we compress and charge.  This allows
some limit overrun, but not enough to matter in practice.

[hannes@cmpxchg.org: fix for CONFIG_SLOB builds]
  Link: https://lkml.kernel.org/r/YnwD14zxYjUJPc2w@cmpxchg.org
[hannes@cmpxchg.org: opt out of cgroups v1]
  Link: https://lkml.kernel.org/r/Yn6it9mBYFA+/lTb@cmpxchg.org
Link: https://lkml.kernel.org/r/20220510152847.230957-7-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:53 -07:00
Johannes Weiner
f6498b776d mm: zswap: add basic meminfo and vmstat coverage
Currently it requires poking at debugfs to figure out the size and
population of the zswap cache on a host.  There are no counters for reads
and writes against the cache.  As a result, it's difficult to understand
zswap behavior on production systems.

Print zswap memory consumption and how many pages are zswapped out in
/proc/meminfo.  Count zswapouts and zswapins in /proc/vmstat.

Link: https://lkml.kernel.org/r/20220510152847.230957-6-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:53 -07:00
Johannes Weiner
b3fbd58fcb mm: Kconfig: simplify zswap configuration
- CONFIG_ZRAM: Zram is a user-facing feature, whereas zsmalloc is
  not. Don't make the user chase down a technical dependency like
  that, just select it in automatically when zram is requested. The
  CONFIG_CRYPTO dependency is redundant due to more specific deps.

- CONFIG_ZPOOL: This is not a user-facing feature. Hide the symbol and
  have it selected in as needed.

- CONFIG_ZSWAP: Select CRYPTO instead of depend. Common pattern.

- Make the ZSWAP suboptions and their descriptions (compression,
  allocation backend) a bit more straight-forward for the user.

Link: https://lkml.kernel.org/r/20220510152847.230957-5-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:53 -07:00
Johannes Weiner
519bcb7979 mm: Kconfig: group swap, slab, hotplug and thp options into submenus
There are several clusters of related config options spread throughout the
mostly flat MM submenu.  Group them together and put specialization
options into further subdirectories to make the MM submenu a bit more
organized and easier to navigate.

[hannes@cmpxchg.org: fix kbuild warnings]
  Link: https://lkml.kernel.org/r/YnvkSVivfnT57Vwh@cmpxchg.org
[hannes@cmpxchg.org: fix more kbuild warnings]
  Link: https://lkml.kernel.org/r/Ynz8NusTdEGcCnJN@cmpxchg.org
Link: https://lkml.kernel.org/r/20220510152847.230957-4-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:53 -07:00
Johannes Weiner
7b42f1041c mm: Kconfig: move swap and slab config options to the MM section
These are currently under General Setup. MM seems like a better fit.

Link: https://lkml.kernel.org/r/20220510152847.230957-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:53 -07:00
Miaohe Lin
ff351f4bb9 mm/swap: fix comment about swap extent
Since commit 4efaceb1c5 ("mm, swap: use rbtree for swap_extent"), rbtree
is used for swap extent.  Also curr_swap_extent is removed at that time. 
Update the corresponding comment.

Link: https://lkml.kernel.org/r/20220509131416.17553-16-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:52 -07:00
Miaohe Lin
133d2743ef mm/swap: fix the comment of get_kernel_pages
If no pages were pinned, 0 is returned in fact.  Fix the corresponding
comment.

[akpm@linux-foundation.org: s/nr_pages/nr_segs/ also, per David, reflow comment]
Link: https://lkml.kernel.org/r/20220509131416.17553-15-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:52 -07:00
Miaohe Lin
3c3115ad6b mm/swap: clean up the comment of find_next_to_unuse
Since commit 10a9c49678 ("mm: simplify try_to_unuse"), frontswap
parameter is removed.  Update the corresponding comment.

Link: https://lkml.kernel.org/r/20220509131416.17553-14-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:52 -07:00
Miaohe Lin
4b9ae8426c mm/swap: add helper swap_offset_available()
Add helper swap_offset_available() to remove some duplicated codes.  Minor
readability improvement.

[akpm@linux-foundation.org: s/swap_offset_available/swap_offset_available_and_locked/, per Neil]
Link: https://lkml.kernel.org/r/20220509131416.17553-12-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:52 -07:00
Miaohe Lin
eacde32757 mm/swap: avoid calling swp_swap_info when try to check SWP_STABLE_WRITES
Use flags of si directly to check SWP_STABLE_WRITES to avoid possible
READ_ONCE and thus save some cpu cycles.

[akpm@linux-foundation.org: use data_race() on si->flags, per Neil]
Link: https://lkml.kernel.org/r/20220509131416.17553-10-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:51 -07:00
Miaohe Lin
3db3264d8a mm/swap: make page_swapcount and __lru_add_drain_all static
Make page_swapcount and __lru_add_drain_all static.  They are only used
within the file now.

Link: https://lkml.kernel.org/r/20220509131416.17553-9-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:51 -07:00
Miaohe Lin
dab8dfff49 mm/swap: remove unneeded p != NULL check in __swap_duplicate
If p is NULL, __swap_duplicate will already return -EINVAL.  So if we
reach here, p must be non-NULL.

Link: https://lkml.kernel.org/r/20220509131416.17553-8-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:51 -07:00
Miaohe Lin
f19c25684c mm/swap: remove buggy cache->nr check in refill_swap_slots_cache
refill_swap_slots_cache is always called when cache->nr is 0.  So remove
such buggy and confusing check.

Link: https://lkml.kernel.org/r/20220509131416.17553-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:51 -07:00
Miaohe Lin
23b230ba8a mm/swap: print bad swap offset entry in get_swap_device
If offset exceeds the si->max, print bad swap offset entry to help debug
the unexpected case.

Link: https://lkml.kernel.org/r/20220509131416.17553-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:51 -07:00
Miaohe Lin
bc4a68adb1 mm/swap: remove unneeded return value of free_swap_slot
The return value of free_swap_slot is always 0 and also ignored now. 
Remove it to clean up the code.

Link: https://lkml.kernel.org/r/20220509131416.17553-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:50 -07:00
Miaohe Lin
afba72b171 mm/swap: fold __swap_info_get() into its sole caller
Fold __swap_info_get() into its sole caller to make code more clear. 
Minor readability improvement.

Link: https://lkml.kernel.org/r/20220509131416.17553-4-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:50 -07:00
Miaohe Lin
6106b93efa mm/swap: use helper macro __ATTR_RW
Use helper macro __ATTR_RW to define vma_ra_enabled_attr to make code more
clear.  Minor readability improvement.

Link: https://lkml.kernel.org/r/20220509131416.17553-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:50 -07:00
Miaohe Lin
92bafb20b2 mm/swap: use helper is_swap_pte() in swap_vma_readahead
Patch series "A few cleanup patches for swap".

This series contains a few patches to fix the comment, remove unneeded
return value, use some helpers and so on.  More details can be found in
the respective changelogs.


This patch (of 14):

Use helper is_swap_pte() to check whether pte is swap entry to make code
more clear.  Minor readability improvement.

Link: https://lkml.kernel.org/r/20220509131416.17553-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220509131416.17553-2-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Howells <dhowells@redhat.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:50 -07:00
Yang Shi
613bec092f mm: mmap: register suitable readonly file vmas for khugepaged
The readonly FS THP relies on khugepaged to collapse THP for suitable
vmas.  But the behavior is inconsistent for "always" mode
(https://lore.kernel.org/linux-mm/00f195d4-d039-3cf2-d3a1-a2c88de397a0@suse.cz/).

The "always" mode means THP allocation should be tried all the time and
khugepaged should try to collapse THP all the time.  Of course the
allocation and collapse may fail due to other factors and conditions.

Currently file THP may not be collapsed by khugepaged even though all the
conditions are met.  That does break the semantics of "always" mode.

So make sure readonly FS vmas are registered to khugepaged to fix the
break.

Register suitable vmas in common mmap path, that could cover both readonly
FS vmas and shmem vmas, so remove the khugepaged calls in shmem.c.

Still need to keep the khugepaged call in vma_merge() since vma_merge() is
called in a lot of places, for example, madvise, mprotect, etc.

Link: https://lkml.kernel.org/r/20220510203222.24246-9-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reported-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Vlastmil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Song Liu <song@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:50 -07:00
Yang Shi
c791576c60 mm: khugepaged: introduce khugepaged_enter_vma() helper
The khugepaged_enter_vma_merge() actually does as the same thing as the
khugepaged_enter() section called by shmem_mmap(), so consolidate them
into one helper and rename it to khugepaged_enter_vma().

Link: https://lkml.kernel.org/r/20220510203222.24246-8-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastmil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <song@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:50 -07:00
Yang Shi
2647d11b9e mm: khugepaged: make hugepage_vma_check() non-static
The hugepage_vma_check() could be reused by khugepaged_enter() and
khugepaged_enter_vma_merge(), but it is static in khugepaged.c.  Make it
non-static and declare it in khugepaged.h.

Link: https://lkml.kernel.org/r/20220510203222.24246-7-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <song@kernel.org>
Cc: Song Liu <songliubraving@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:49 -07:00
Yang Shi
d2081b2bf8 mm: khugepaged: make khugepaged_enter() void function
The most callers of khugepaged_enter() don't care about the return value. 
Only dup_mmap(), anonymous THP page fault and MADV_HUGEPAGE handle the
error by returning -ENOMEM.  Actually it is not harmful for them to ignore
the error case either.  It also sounds overkilling to fail fork() and page
fault early due to khugepaged_enter() error, and MADV_HUGEPAGE does set
VM_HUGEPAGE flag regardless of the error.

Link: https://lkml.kernel.org/r/20220510203222.24246-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Vlastmil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:49 -07:00
Yang Shi
78d12c19e0 mm: thp: only regular file could be THP eligible
Since commit a4aeaa06d4 ("mm: khugepaged: skip huge page collapse for
special files"), khugepaged just collapses THP for regular file which is
the intended usecase for readonly fs THP.  Only show regular file as THP
eligible accordingly.

And make file_thp_enabled() available for khugepaged too in order to
remove duplicate code.

Link: https://lkml.kernel.org/r/20220510203222.24246-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Vlastmil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:49 -07:00
Yang Shi
52b52bf15b mm: khugepaged: skip DAX vma
The DAX vma may be seen by khugepaged when the mm has other khugepaged
suitable vmas.  So khugepaged may try to collapse THP for DAX vma, but it
will fail due to page sanity check, for example, page is not on LRU.

So it is not harmful, but it is definitely pointless to run khugepaged
against DAX vma, so skip it in early check.

Link: https://lkml.kernel.org/r/20220510203222.24246-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Vlastmil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:49 -07:00
Yang Shi
cb648754a1 mm: khugepaged: remove redundant check for VM_NO_KHUGEPAGED
The hugepage_vma_check() called by khugepaged_enter_vma_merge() does check
VM_NO_KHUGEPAGED.  Remove the check from caller and move the check in
hugepage_vma_check() up.

More checks may be run for VM_NO_KHUGEPAGED vmas, but MADV_HUGEPAGE is
definitely not a hot path, so cleaner code does outweigh.

Link: https://lkml.kernel.org/r/20220510203222.24246-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Vlastmil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-19 14:08:49 -07:00
Jason A. Donenfeld
5ad7dd882e random: move randomize_page() into mm where it belongs
randomize_page is an mm function. It is documented like one. It contains
the history of one. It has the naming convention of one. It looks
just like another very similar function in mm, randomize_stack_top().
And it has always been maintained and updated by mm people. There is no
need for it to be in random.c. In the "which shape does not look like
the other ones" test, pointing to randomize_page() is correct.

So move randomize_page() into mm/util.c, right next to the similar
randomize_stack_top() function.

This commit contains no actual code changes.

Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
2022-05-19 16:54:15 +02:00
Yuanzheng Song
a5f4d9df1f mm: usercopy: move the virt_addr_valid() below the is_vmalloc_addr()
The is_kmap_addr() and the is_vmalloc_addr() in the check_heap_object()
will not work, because the virt_addr_valid() will exclude the kmap and
vmalloc regions. So let's move the virt_addr_valid() below
the is_vmalloc_addr().

Signed-off-by: Yuanzheng Song <songyuanzheng@huawei.com>
Fixes: 4e140f59d2 ("mm/usercopy: Check kmap addresses properly")
Fixes: 0aef499f31 ("mm/usercopy: Detect vmalloc overruns")
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20220505071037.4121100-1-songyuanzheng@huawei.com
2022-05-16 16:02:21 -07:00
Rei Yamamoto
bbe832b9db mm, compaction: fast_find_migrateblock() should return pfn in the target zone
At present, pages not in the target zone are added to cc->migratepages
list in isolate_migratepages_block().  As a result, pages may migrate
between nodes unintentionally.

This would be a serious problem for older kernels without commit
a984226f45 ("mm: memcontrol: remove the pgdata parameter of
mem_cgroup_page_lruvec"), because it can corrupt the lru list by
handling pages in list without holding proper lru_lock.

Avoid returning a pfn outside the target zone in the case that it is
not aligned with a pageblock boundary.  Otherwise
isolate_migratepages_block() will handle pages not in the target zone.

Link: https://lkml.kernel.org/r/20220511044300.4069-1-yamamoto.rei@jp.fujitsu.com
Fixes: 70b44595ea ("mm, compaction: use free lists to quickly locate a migration source")
Signed-off-by: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Don Dutile <ddutile@redhat.com>
Cc: Wonhyuk Yang <vvghjk1234@gmail.com>
Cc: Rei Yamamoto <yamamoto.rei@jp.fujitsu.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:48:57 -07:00
Ganesan Rajagopal
8e20d4b332 mm/memcontrol: export memcg->watermark via sysfs for v2 memcg
We run a lot of automated tests when building our software and run into
OOM scenarios when the tests run unbounded.  v1 memcg exports
memcg->watermark as "memory.max_usage_in_bytes" in sysfs.  We use this
metric to heuristically limit the number of tests that can run in parallel
based on per test historical data.

This metric is currently not exported for v2 memcg and there is no other
easy way of getting this information.  getrusage() syscall returns
"ru_maxrss" which can be used as an approximation but that's the max RSS
of a single child process across all children instead of the aggregated
max for all child processes.  The only work around is to periodically poll
"memory.current" but that's not practical for short-lived one-off cgroups.

Hence, expose memcg->watermark as "memory.peak" for v2 memcg.

Link: https://lkml.kernel.org/r/20220507050916.GA13577@us192.sjc.aristanetworks.com
Signed-off-by: Ganesan Rajagopal <rganesan@arista.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:48:57 -07:00
Muchun Song
78f39084b4 mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl
We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
reboot the server to enable or disable the feature of optimizing vmemmap
pages associated with HugeTLB pages.  However, rebooting usually takes a
long time.  So add a sysctl to enable or disable the feature at runtime
without rebooting.  Why we need this?  There are 3 use cases.

1) The feature of minimizing overhead of struct page associated with
   each HugeTLB is disabled by default without passing
   "hugetlb_free_vmemmap=on" to the boot cmdline.  When we (ByteDance)
   deliver the servers to the users who want to enable this feature, they
   have to configure the grub (change boot cmdline) and reboot the
   servers, whereas rebooting usually takes a long time (we have thousands
   of servers).  It's a very bad experience for the users.  So we need a
   approach to enable this feature after rebooting.  This is a use case in
   our practical environment.

2) Some use cases are that HugeTLB pages are allocated 'on the fly'
   instead of being pulled from the HugeTLB pool, those workloads would be
   affected with this feature enabled.  Those workloads could be
   identified by the characteristics of they never explicitly allocating
   huge pages with 'nr_hugepages' but only set 'nr_overcommit_hugepages'
   and then let the pages be allocated from the buddy allocator at fault
   time.  We can confirm it is a real use case from the commit
   099730d674.  For those workloads, the page fault time could be ~2x
   slower than before.  We suspect those users want to disable this
   feature if the system has enabled this before and they don't think the
   memory savings benefit is enough to make up for the performance drop.

3) If the workload which wants vmemmap pages to be optimized and the
   workload which wants to set 'nr_overcommit_hugepages' and does not want
   the extera overhead at fault time when the overcommitted pages be
   allocated from the buddy allocator are deployed in the same server. 
   The user could enable this feature and set 'nr_hugepages' and
   'nr_overcommit_hugepages', then disable the feature.  In this case, the
   overcommited HugeTLB pages will not encounter the extra overhead at
   fault time.

Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:48:56 -07:00
Muchun Song
9c54c522bb mm: hugetlb_vmemmap: use kstrtobool for hugetlb_vmemmap param parsing
Use kstrtobool rather than open coding "on" and "off" parsing in
mm/hugetlb_vmemmap.c, which is more powerful to handle all kinds of
parameters like 'Yy1Nn0' or [oO][NnFf] for "on" and "off".

Link: https://lkml.kernel.org/r/20220512041142.39501-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:48:56 -07:00
Muchun Song
6e02c46b4d mm: memory_hotplug: override memmap_on_memory when hugetlb_free_vmemmap=on
Optimizing HugeTLB vmemmap pages is not compatible with allocating memmap
on hot added memory.  If "hugetlb_free_vmemmap=on" and
memory_hotplug.memmap_on_memory" are both passed on the kernel command
line, optimizing hugetlb pages takes precedence.  However, the global
variable memmap_on_memory will still be set to 1, even though we will not
try to allocate memmap on hot added memory.

Also introduce mhp_memmap_on_memory() helper to move the definition of
"memmap_on_memory" to the scope of CONFIG_MHP_MEMMAP_ON_MEMORY.  In the
next patch, mhp_memmap_on_memory() will also be exported to be used in
hugetlb_vmemmap.c.

Link: https://lkml.kernel.org/r/20220512041142.39501-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:48:56 -07:00
Muchun Song
0effdf461c mm: hugetlb_vmemmap: disable hugetlb_optimize_vmemmap when struct page crosses page boundaries
Patch series "add hugetlb_optimize_vmemmap sysctl", v11.

This series aims to add hugetlb_optimize_vmemmap sysctl to enable or
disable the feature of optimizing vmemmap pages associated with HugeTLB
pages.


This patch (of 4):

If the size of "struct page" is not the power of two but with the feature
of minimizing overhead of struct page associated with each HugeTLB is
enabled, then the vmemmap pages of HugeTLB will be corrupted after
remapping (panic is about to happen in theory).  But this only exists when
!CONFIG_MEMCG && !CONFIG_SLUB on x86_64.  However, it is not a
conventional configuration nowadays.  So it is not a real word issue, just
the result of a code review.

But we cannot prevent anyone from configuring that combined configure. 
This hugetlb_optimize_vmemmap should be disable in this case to fix this
issue.

Link: https://lkml.kernel.org/r/20220512041142.39501-1-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220512041142.39501-2-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:48:56 -07:00
Baolin Wang
a00a875925 mm: rmap: fix CONT-PTE/PMD size hugetlb issue when unmapping
On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and
1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified.

When unmapping a hugetlb page, we will get the relevant page table entry
by huge_pte_offset() only once to nuke it.  This is correct for PMD or PUD
size hugetlb, since they always contain only one pmd entry or pud entry in
the page table.

However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since
they can contain several continuous pte or pmd entry with same page table
attributes, so we will nuke only one pte or pmd entry for this
CONT-PTE/PMD size hugetlb page.

And now try_to_unmap() is only passed a hugetlb page in the case where the
hugetlb page is poisoned.  Which means now we will unmap only one pte
entry for a CONT-PTE or CONT-PMD size poisoned hugetlb page, and we can
still access other subpages of a CONT-PTE or CONT-PMD size poisoned
hugetlb page, which will cause serious issues possibly.

So we should change to use huge_ptep_clear_flush() to nuke the hugetlb
page table to fix this issue, which already considered CONT-PTE and
CONT-PMD size hugetlb.

We've already used set_huge_swap_pte_at() to set a poisoned swap entry for
a poisoned hugetlb page.  Meanwhile adding a VM_BUG_ON() to make sure the
passed hugetlb page is poisoned in try_to_unmap().

Link: https://lkml.kernel.org/r/0a2e547238cad5bc153a85c3e9658cb9d55f9cac.1652270205.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/730ea4b6d292f32fb10b7a4e87dad49b0eb30474.1652147571.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:48:56 -07:00
Baolin Wang
5d4af6195c mm: rmap: fix CONT-PTE/PMD size hugetlb issue when migration
On some architectures (like ARM64), it can support CONT-PTE/PMD size
hugetlb, which means it can support not only PMD/PUD size hugetlb: 2M and
1G, but also CONT-PTE/PMD size: 64K and 32M if a 4K page size specified.

When migrating a hugetlb page, we will get the relevant page table entry
by huge_pte_offset() only once to nuke it and remap it with a migration
pte entry.  This is correct for PMD or PUD size hugetlb, since they always
contain only one pmd entry or pud entry in the page table.

However this is incorrect for CONT-PTE and CONT-PMD size hugetlb, since
they can contain several continuous pte or pmd entry with same page table
attributes.  So we will nuke or remap only one pte or pmd entry for this
CONT-PTE/PMD size hugetlb page, which is not expected for hugetlb
migration.  The problem is we can still continue to modify the subpages'
data of a hugetlb page during migrating a hugetlb page, which can cause a
serious data consistent issue, since we did not nuke the page table entry
and set a migration pte for the subpages of a hugetlb page.

To fix this issue, we should change to use huge_ptep_clear_flush() to nuke
a hugetlb page table, and remap it with set_huge_pte_at() and
set_huge_swap_pte_at() when migrating a hugetlb page, which already
considered the CONT-PTE or CONT-PMD size hugetlb.

[akpm@linux-foundation.org: fix nommu build]
[baolin.wang@linux.alibaba.com: fix build errors for !CONFIG_MMU]
  Link: https://lkml.kernel.org/r/a4baca670aca637e7198d9ae4543b8873cb224dc.1652270205.git.baolin.wang@linux.alibaba.com
Link: https://lkml.kernel.org/r/ea5abf529f0997b5430961012bfda6166c1efc8c.1652147571.git.baolin.wang@linux.alibaba.com
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 16:48:55 -07:00
Sultan Alsawaf
2505a98111 zsmalloc: fix races between asynchronous zspage free and page migration
The asynchronous zspage free worker tries to lock a zspage's entire page
list without defending against page migration.  Since pages which haven't
yet been locked can concurrently migrate off the zspage page list while
lock_zspage() churns away, lock_zspage() can suffer from a few different
lethal races.

It can lock a page which no longer belongs to the zspage and unsafely
dereference page_private(), it can unsafely dereference a torn pointer to
the next page (since there's a data race), and it can observe a spurious
NULL pointer to the next page and thus not lock all of the zspage's pages
(since a single page migration will reconstruct the entire page list, and
create_page_chain() unconditionally zeroes out each list pointer in the
process).

Fix the races by using migrate_read_lock() in lock_zspage() to synchronize
with page migration.

Link: https://lkml.kernel.org/r/20220509024703.243847-1-sultan@kerneltoast.com
Fixes: 77ff465799 ("zsmalloc: zs_page_migrate: skip unnecessary loops but not return -EBUSY if zspage is not inuse")
Signed-off-by: Sultan Alsawaf <sultan@kerneltoast.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 15:11:26 -07:00
Dong Aisheng
60a60e32cf Revert "mm/cma.c: remove redundant cma_mutex lock"
This reverts commit a4efc174b3 which introduced a regression issue
that when there're multiple processes allocating dma memory in parallel by
calling dma_alloc_coherent(), it may fail sometimes as follows:

Error log:
cma: cma_alloc: linux,cma: alloc failed, req-size: 148 pages, ret: -16
cma: number of available pages:
3@125+20@172+12@236+4@380+32@736+17@2287+23@2473+20@36076+99@40477+108@40852+44@41108+20@41196+108@41364+108@41620+
108@42900+108@43156+483@44061+1763@45341+1440@47712+20@49324+20@49388+5076@49452+2304@55040+35@58141+20@58220+20@58284+
7188@58348+84@66220+7276@66452+227@74525+6371@75549=> 33161 free of 81920 total pages

When issue happened, we saw there were still 33161 pages (129M) free CMA
memory and a lot available free slots for 148 pages in CMA bitmap that we
want to allocate.

When dumping memory info, we found that there was also ~342M normal
memory, but only 1352K CMA memory left in buddy system while a lot of
pageblocks were isolated.

Memory info log:
Normal free:351096kB min:30000kB low:37500kB high:45000kB reserved_highatomic:0KB
	    active_anon:98060kB inactive_anon:98948kB active_file:60864kB inactive_file:31776kB
	    unevictable:0kB writepending:0kB present:1048576kB managed:1018328kB mlocked:0kB
	    bounce:0kB free_pcp:220kB local_pcp:192kB free_cma:1352kB lowmem_reserve[]: 0 0 0
Normal: 78*4kB (UECI) 1772*8kB (UMECI) 1335*16kB (UMECI) 360*32kB (UMECI) 65*64kB (UMCI)
	36*128kB (UMECI) 16*256kB (UMCI) 6*512kB (EI) 8*1024kB (UEI) 4*2048kB (MI) 8*4096kB (EI)
	8*8192kB (UI) 3*16384kB (EI) 8*32768kB (M) = 489288kB

The root cause of this issue is that since commit a4efc174b3 ("mm/cma.c:
remove redundant cma_mutex lock"), CMA supports concurrent memory
allocation.  It's possible that the memory range process A trying to alloc
has already been isolated by the allocation of process B during memory
migration.

The problem here is that the memory range isolated during one allocation
by start_isolate_page_range() could be much bigger than the real size we
want to alloc due to the range is aligned to MAX_ORDER_NR_PAGES.

Taking an ARMv7 platform with 1G memory as an example, when
MAX_ORDER_NR_PAGES is big (e.g.  32M with max_order 14) and CMA memory is
relatively small (e.g.  128M), there're only 4 MAX_ORDER slot, then it's
very easy that all CMA memory may have already been isolated by other
processes when one trying to allocate memory using dma_alloc_coherent(). 
Since current CMA code will only scan one time of whole available CMA
memory, then dma_alloc_coherent() may easy fail due to contention with
other processes.

This patch simply falls back to the original method that using cma_mutex
to make alloc_contig_range() run sequentially to avoid the issue.

Link: https://lkml.kernel.org/r/20220509094551.3596244-1-aisheng.dong@nxp.com
Link: https://lore.kernel.org/all/20220315144521.3810298-2-aisheng.dong@nxp.com/
Fixes: a4efc174b3 ("mm/cma.c: remove redundant cma_mutex lock")
Signed-off-by: Dong Aisheng <aisheng.dong@nxp.com>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Marek Szyprowski <m.szyprowski@samsung.com>
Cc: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>	[5.11+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 15:11:26 -07:00
Linus Torvalds
364a453ab9 hotfixes for 5.18-rc7
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCYnvwxgAKCRDdBJ7gKXxA
 jhymAQDvHnFT3F5ydvBqApbzrQRUk/+fkkQSrF/xYawknZNgkAEA6Tnh9XqYplJN
 bbmml6HTVvDjprEOCGakY/Kyz7qmdQ0=
 =SMJQ
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Seven MM fixes, three of which address issues added in the most recent
  merge window, four of which are cc:stable.

  Three non-MM fixes, none very serious"

[ And yes, that's a real pull request from Andrew, not me creating a
  branch from emailed patches. Woo-hoo! ]

* tag 'mm-hotfixes-stable-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add a mailing list for DAMON development
  selftests: vm: Makefile: rename TARGETS to VMTARGETS
  mm/kfence: reset PG_slab and memcg_data before freeing __kfence_pool
  mailmap: add entry for martyna.szapar-mudlaw@intel.com
  arm[64]/memremap: don't abuse pfn_valid() to ensure presence of linear map
  procfs: prevent unprivileged processes accessing fdinfo dir
  mm: mremap: fix sign for EFAULT error return value
  mm/hwpoison: use pr_err() instead of dump_page() in get_any_page()
  mm/huge_memory: do not overkill when splitting huge_zero_page
  Revert "mm/memory-failure.c: skip huge_zero_page in memory_failure()"
2022-05-13 10:22:37 -07:00
zhenwei pi
e240ac52f7 mm/memory-failure.c: simplify num_poisoned_pages_inc/dec
Originally, do num_poisoned_pages_inc() in memory failure routine, use
num_poisoned_pages_dec() to rollback the number if filtered/ cancelled.

Suggested by Naoya, do num_poisoned_pages_inc() only in action_result(),
this make this clear and simple.

Link: https://lkml.kernel.org/r/20220509105641.491313-6-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:19 -07:00
zhenwei pi
f0696cb406 mm/hwpoison: disable hwpoison filter during removing
hwpoison filter is enabled by hwpoison-inject module, after removing this
module, hwpoison filter still works.  What is worse, user can not find the
debugfs entries to know this.

Disable the hwpoison filter during removing hwpoison-inject module.

Link: https://lkml.kernel.org/r/20220509105641.491313-5-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:19 -07:00
zhenwei pi
9113eaf331 mm/memory-failure.c: add hwpoison_filter for soft offline
hwpoison_filter is missing in the soft offline path, this leads an issue:
after enabling the corrupt filter, the user process still has a chance to
inject hwpoison fault by madvise(addr, len, MADV_SOFT_OFFLINE) at PFN
which is expected to reject.

Also do a minor change in comment of memory_failure().

Link: https://lkml.kernel.org/r/20220509105641.491313-4-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:19 -07:00
zhenwei pi
c8bd84f73f mm/memory-failure.c: simplify num_poisoned_pages_dec
Don't decrease the number of poisoned pages in page_alloc.c, let the
memory-failure.c do inc/dec poisoned pages only.

Also simplify unpoison_memory(), only decrease the number of
poisoned pages when:
 - TestClearPageHWPoison() succeed
 - put_page_back_buddy succeed

After decreasing, print necessary log.

Finally, remove clear_page_hwpoison() and unpoison_taken_off_page().

Link: https://lkml.kernel.org/r/20220509105641.491313-3-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:19 -07:00
zhenwei pi
60f272f6b0 mm/memory-failure.c: move clear_hwpoisoned_pages
Patch series "memory-failure: fix hwpoison_filter", v2.

As well known, the memory failure mechanism handles memory corrupted
event, and try to send SIGBUS to the user process which uses this
corrupted page.

For the virtualization case, QEMU catches SIGBUS and tries to inject MCE
into the guest, and the guest handles memory failure again.  Thus the
guest gets the minimal effect from hardware memory corruption.

The further step I'm working on:

1, try to modify code to decrease poisoned pages in a single place
   (mm/memofy-failure.c: simplify num_poisoned_pages_dec in this series).

2, try to use page_handle_poison() to handle SetPageHWPoison() and
   num_poisoned_pages_inc() together.  It would be best to call
   num_poisoned_pages_inc() in a single place too.

3, introduce memory failure notifier list in memory-failure.c: notify
   the corrupted PFN to someone who registers this list.  If I can
   complete [1] and [2] part, [3] will be quite easy(just call notifier
   list after increasing poisoned page).

4, introduce memory recover VQ for memory balloon device, and registers
   memory failure notifier list.  During the guest kernel handles memory
   failure, balloon device gets notified by memory failure notifier list,
   and tells the host to recover the corrupted PFN(GPA) by the new VQ.

5, host side remaps the corrupted page(HVA), and tells the guest side
   to unpoison the PFN(GPA).  Then the guest fixes the corrupted page(GPA)
   dynamically.


This patch (of 5):

clear_hwpoisoned_pages() clears HWPoison flag and decreases the number of
poisoned pages, this actually works as part of memory failure.

Move this function from sparse.c to memory-failure.c, finally there is no
CONFIG_MEMORY_FAILURE in sparse.c.

Link: https://lkml.kernel.org/r/20220509105641.491313-1-pizhenwei@bytedance.com
Link: https://lkml.kernel.org/r/20220509105641.491313-2-pizhenwei@bytedance.com
Signed-off-by: zhenwei pi <pizhenwei@bytedance.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:19 -07:00
Eric Dumazet
cd8c1fd8cd mm/page_owner: use strscpy() instead of strlcpy()
current->comm[] is not a string (no guarantee for a zero byte in it).

strlcpy(s1, s2, l) is calling strlen(s2), potentially
causing out-of-bound access, as reported by syzbot:

detected buffer overflow in __fortify_strlen
------------[ cut here ]------------
kernel BUG at lib/string_helpers.c:980!
invalid opcode: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 4087 Comm: dhcpcd-run-hooks Not tainted 5.18.0-rc3-syzkaller-01537-g20b87e7c29df #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:fortify_panic+0x18/0x1a lib/string_helpers.c:980
Code: 8c e8 c5 ba e1 fa e9 23 0f bf fa e8 0b 5d 8c f8 eb db 55 48 89 fd e8 e0 49 40 f8 48 89 ee 48 c7 c7 80 f5 26 8a e8 99 09 f1 ff <0f> 0b e8 ca 49 40 f8 48 8b 54 24 18 4c 89 f1 48 c7 c7 00 00 27 8a
RSP: 0018:ffffc900000074a8 EFLAGS: 00010286

RAX: 000000000000002c RBX: ffff88801226b728 RCX: 0000000000000000
RDX: ffff8880198e0000 RSI: ffffffff81600458 RDI: fffff52000000e87
RBP: ffffffff89da2aa0 R08: 000000000000002c R09: 0000000000000000
R10: ffffffff815fae2e R11: 0000000000000000 R12: ffff88801226b700
R13: ffff8880198e0830 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f5876ad6ff8 CR3: 000000001a48c000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
Call Trace:
 <IRQ>
 __fortify_strlen include/linux/fortify-string.h:128 [inline]
 strlcpy include/linux/fortify-string.h:143 [inline]
 __set_page_owner_handle+0x2b1/0x3e0 mm/page_owner.c:171
 __set_page_owner+0x3e/0x50 mm/page_owner.c:190
 prep_new_page mm/page_alloc.c:2441 [inline]
 get_page_from_freelist+0xba2/0x3e00 mm/page_alloc.c:4182
 __alloc_pages+0x1b2/0x500 mm/page_alloc.c:5408
 alloc_pages+0x1aa/0x310 mm/mempolicy.c:2272
 alloc_slab_page mm/slub.c:1799 [inline]
 allocate_slab+0x26c/0x3c0 mm/slub.c:1944
 new_slab mm/slub.c:2004 [inline]
 ___slab_alloc+0x8df/0xf20 mm/slub.c:3005
 __slab_alloc.constprop.0+0x4d/0xa0 mm/slub.c:3092
 slab_alloc_node mm/slub.c:3183 [inline]
 slab_alloc mm/slub.c:3225 [inline]
 __kmem_cache_alloc_lru mm/slub.c:3232 [inline]
 kmem_cache_alloc+0x360/0x3b0 mm/slub.c:3242
 dst_alloc+0x146/0x1f0 net/core/dst.c:92

Link: https://lkml.kernel.org/r/20220509145949.265184-1-eric.dumazet@gmail.com
Fixes: 865ed6a327 ("mm/page_owner: record task command name")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:19 -07:00
Andrey Konovalov
06bc4cf6cd kasan: give better names to shadow values
Rename KASAN_KMALLOC_* shadow values to KASAN_SLAB_*, as they are used for
all slab allocations, not only for kmalloc.

Also rename KASAN_FREE_PAGE to KASAN_PAGE_FREE to be consistent with
KASAN_PAGE_REDZONE and KASAN_SLAB_FREE.

Link: https://lkml.kernel.org/r/bebcaf4eafdb0cabae0401a69c0af956aa87fcaa.1652111464.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:19 -07:00
Andrey Konovalov
83f8e4a8b4 kasan: use tabs to align shadow values
Consistently use tabs instead of spaces to shadow value definitions.

Link: https://lkml.kernel.org/r/00e7e66b5fc375d58200dc1489949b3edcd096b7.1652111464.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Marco Elver <elver@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:18 -07:00
Andrey Konovalov
fc0e5b91df kasan: clean up comments in internal kasan.h
Clean up comments in mm/kasan/kasan.h: clarify, unify styles, fix
punctuation, etc.

Link: https://lkml.kernel.org/r/a0680ff30035b56cb7bdd5f59fd400e71712ceb5.1652111464.git.andreyknvl@google.com
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Reviewed-by: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:18 -07:00
Sebastian Andrzej Siewior
3f80492001 mm/vmalloc: use raw_cpu_ptr() for vmap_block_queue access
The per-CPU resource vmap_block_queue is accessed via get_cpu_var().  That
macro disables preemption and then loads the pointer from the current CPU.

This doesn't work on PREEMPT_RT because a spinlock_t is later accessed
within the preempt-disable section.

There is no need to disable preemption while accessing the per-CPU struct
vmap_block_queue because the list is protected with a spinlock_t.  The
per-CPU struct is also accessed cross-CPU in purge_fragmented_blocks().

It is possible that by using raw_cpu_ptr() the code migrates to another
CPU and uses struct from another CPU.  This is fine because the list is
locked and the locked section is very short.

Use raw_cpu_ptr() to access vmap_block_queue.

Link: https://lkml.kernel.org/r/YnKx3duAB53P7ojN@linutronix.de
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:18 -07:00
Vasily Averin
fe573327ff tracing: incorrect gfp_t conversion
Fixes the following sparse warnings:

include/trace/events/*: sparse: cast to restricted gfp_t
include/trace/events/*: sparse: restricted gfp_t degrades to integer

gfp_t type is bitwise and requires __force attributes for any casts.

Link: https://lkml.kernel.org/r/331d88fe-f4f7-657c-02a2-d977f15fbff6@openvz.org
Signed-off-by: Vasily Averin <vvs@openvz.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:18 -07:00
Vasily Averin
f67bed134a percpu: improve percpu_alloc_percpu event trace
Add call_site, bytes_alloc and gfp_flags fields to the output of the
percpu_alloc_percpu ftrace event:

mkdir-4393  [001]   169.334788: percpu_alloc_percpu:
 call_site=mem_cgroup_css_alloc+0xa6 reserved=0 is_atomic=0 size=2408 align=8
  base_addr=0xffffc7117fc00000 off=402176 ptr=0x3dc867a62300 bytes_alloc=14448
   gfp_flags=GFP_KERNEL_ACCOUNT

This is required to track memcg-accounted percpu allocations.

Link: https://lkml.kernel.org/r/a07be858-c8a3-7851-9086-e3262cbcf707@openvz.org
Signed-off-by: Vasily Averin <vvs@openvz.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:18 -07:00
Jiapeng Chong
2e14a8d3bb mm/damon/reclaim: use resource_size function on resource object
Fix the following coccicheck warnings:

./mm/damon/reclaim.c:241:30-33: WARNING: Suspicious code. resource_size is maybe missing with res.

Link: https://lkml.kernel.org/r/20220507032512.129598-1-jiapeng.chong@linux.alibaba.com
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: "Boehme, Markus" <markubo@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:18 -07:00
Li kunyu
c8db8c2628 mm: functions may simplify the use of return values
p4d_clear_huge may be optimized for void return type and function usage. 
vunmap_p4d_range function saves a few steps here.

Link: https://lkml.kernel.org/r/20220507150630.90399-1-kunyu@nfschina.com
Signed-off-by: Li kunyu <kunyu@nfschina.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:18 -07:00
Kefeng Wang
e5a5540146 mm: page_table_check: move pxx_user_accessible_page into x86
The pxx_user_accessible_page() checks the PTE bit, it's
architecture-specific code, move them into x86's pgtable.h.

These helpers are being moved out to make the page table check framework
platform independent.

Link: https://lkml.kernel.org/r/20220507110114.4128854-3-tongtiangen@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:17 -07:00
Tong Tiangen
92fb05242a mm: page_table_check: using PxD_SIZE instead of PxD_PAGE_SIZE
Patch series "mm: page_table_check: add support on arm64 and riscv", v7.

Page table check performs extra verifications at the time when new pages
become accessible from the userspace by getting their page table entries
(PTEs PMDs etc.) added into the table.  It is supported on X86[1].

This patchset made some simple changes and make it easier to support new
architecture, then we support this feature on ARM64 and RISCV.

[1]https://lore.kernel.org/lkml/20211123214814.3756047-1-pasha.tatashin@soleen.com/


This patch (of 6):

Compared with PxD_PAGE_SIZE, which is defined and used only on X86,
PxD_SIZE is more common in each architecture.  Therefore, it is more
reasonable to use PxD_SIZE instead of PxD_PAGE_SIZE in page_table_check.c.
At the same time, it is easier to support page table check in other
architectures.  The substitution has no functional impact on the x86.

Link: https://lkml.kernel.org/r/20220507110114.4128854-1-tongtiangen@huawei.com
Link: https://lkml.kernel.org/r/20220507110114.4128854-2-tongtiangen@huawei.com
Signed-off-by: Tong Tiangen <tongtiangen@huawei.com>
Suggested-by: Anshuman Khandual <anshuman.khandual@arm.com>
Acked-by: Pasha Tatashin <pasha.tatashin@soleen.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:17 -07:00
Matthew Wilcox (Oracle)
e7e3ffeb27 mm/migrate: convert move_to_new_page() into move_to_new_folio()
Pass in the folios that we already have in each caller.  Saves a
lot of calls to compound_head().

Link: https://lkml.kernel.org/r/20220504182857.4013401-27-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:17 -07:00
Matthew Wilcox (Oracle)
da08e9b793 mm/shmem: convert shmem_swapin_page() to shmem_swapin_folio()
shmem_swapin_page() only brings in order-0 pages, which are folios
by definition.

Link: https://lkml.kernel.org/r/20220504182857.4013401-24-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:17 -07:00
Matthew Wilcox (Oracle)
b1d0ec3a9a mm/shmem: convert shmem_getpage_gfp to use a folio
Rename shmem_alloc_and_acct_page() to shmem_alloc_and_acct_folio() and
have it return a folio, then use a folio throuughout shmem_getpage_gfp(). 
It continues to return a struct page.

Link: https://lkml.kernel.org/r/20220504182857.4013401-23-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:16 -07:00
Matthew Wilcox (Oracle)
72827e5c2b mm/shmem: convert shmem_alloc_and_acct_page to use a folio
Convert shmem_alloc_hugepage() to return the folio that it uses and use a
folio throughout shmem_alloc_and_acct_page().  Continue to return a page
from shmem_alloc_and_acct_page() for now.

Link: https://lkml.kernel.org/r/20220504182857.4013401-22-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:16 -07:00
Matthew Wilcox (Oracle)
0c023ef52d mm/shmem: add shmem_alloc_folio()
Call vma_alloc_folio() directly instead of alloc_page_vma().  Add a
shmem_alloc_page() wrapper to avoid changing the callers.

Link: https://lkml.kernel.org/r/20220504182857.4013401-21-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:16 -07:00
Matthew Wilcox (Oracle)
069d849cde mm/shmem: turn shmem_should_replace_page into shmem_should_replace_folio
This is a straightforward conversion.

Link: https://lkml.kernel.org/r/20220504182857.4013401-20-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:16 -07:00
Matthew Wilcox (Oracle)
b7dd44a12c mm/shmem: convert shmem_add_to_page_cache to take a folio
Shrinks shmem_add_to_page_cache() by 16 bytes.  All the callers grow,
but this is temporary as they will all be converted to folios soon.

Link: https://lkml.kernel.org/r/20220504182857.4013401-19-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:16 -07:00
Matthew Wilcox (Oracle)
0562457186 mm/shmem: use a folio in shmem_unused_huge_shrink
When calling split_huge_page() we usually have to find the precise page,
but that's not necessary here because we only need to unlock and put the
folio afterwards.  Saves 231 bytes of text (20% of this function).

Link: https://lkml.kernel.org/r/20220504182857.4013401-17-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:16 -07:00
Matthew Wilcox (Oracle)
c28a0e9695 vmscan: remove remaining uses of page in shrink_page_list
These are all straightforward conversions to the folio API.

Link: https://lkml.kernel.org/r/20220504182857.4013401-16-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:16 -07:00
Matthew Wilcox (Oracle)
246b648038 vmscan: convert the activate_locked portion of shrink_page_list to folios
This accounts the number of pages activated correctly for large folios.

Link: https://lkml.kernel.org/r/20220504182857.4013401-14-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
5441d4902f vmscan: move initialisation of mapping down
Now that we don't interrogate the BDI for congestion, we can delay looking
up the folio's mapping until we've got further through the function,
reducing register pressure and saving a call to folio_mapping for folios
we're adding to the swap cache.

Link: https://lkml.kernel.org/r/20220504182857.4013401-13-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
64daa5d818 vmscan: convert lazy freeing to folios
Remove a hidden call to compound_head(), and account nr_pages instead of a
single page.  This matches the code in lru_lazyfree_fn() that accounts
nr_pages to PGLAZYFREE.

Link: https://lkml.kernel.org/r/20220504182857.4013401-12-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
0a36111c8c vmscan: convert page buffer handling to use folios
This mostly just removes calls to compound_head() although nr_reclaimed
should be incremented by the number of pages, not just 1.

Link: https://lkml.kernel.org/r/20220504182857.4013401-11-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
49bd2bf967 vmscan: convert dirty page handling to folios
Mostly this just eliminates calls to compound_head(), but
NR_VMSCAN_IMMEDIATE was being incremented by 1 instead of by nr_pages.

Link: https://lkml.kernel.org/r/20220504182857.4013401-10-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
09c02e5632 swap: convert add_to_swap() to take a folio
The only caller already has a folio available, so this saves a conversion.
Also convert the return type to boolean.

Link: https://lkml.kernel.org/r/20220504182857.4013401-9-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
e2e3fdc7d4 swap: turn get_swap_page() into folio_alloc_swap()
This removes an assumption that a large folio is HPAGE_PMD_NR pages
in size.

Link: https://lkml.kernel.org/r/20220504182857.4013401-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
d33e4e1412 vmscan: convert the writeback handling in shrink_page_list() to folios
Slightly more efficient due to fewer calls to compound_head().

Link: https://lkml.kernel.org/r/20220504182857.4013401-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
1bee2c1677 vmscan: use folio_mapped() in shrink_page_list()
Remove some legacy function calls.

Link: https://lkml.kernel.org/r/20220504182857.4013401-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
adf88aa8ea mm: remove alloc_pages_vma()
All callers have now been converted to use vma_alloc_folio(), so convert
the body of alloc_pages_vma() to allocate folios instead.

Link: https://lkml.kernel.org/r/20220504182857.4013401-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:15 -07:00
Matthew Wilcox (Oracle)
cb196ee1ef mm/huge_memory: convert do_huge_pmd_anonymous_page() to use vma_alloc_folio()
Remove the use of this old API, eliminating a call to
prep_transhuge_page().

Link: https://lkml.kernel.org/r/20220504182857.4013401-3-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:14 -07:00
Matthew Wilcox (Oracle)
dfe98499ef shmem: convert shmem_alloc_hugepage() to use vma_alloc_folio()
Patch series "Folio patches for 5.19", v2.


This patch (of 26):

For now, return the head page of the folio, but remove use of the old
alloc_pages_vma() API.

Link: https://lkml.kernel.org/r/20220504182857.4013401-1-willy@infradead.org
Link: https://lkml.kernel.org/r/20220504182857.4013401-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:14 -07:00
Wan Jiabing
54943a1a4d mm/shmem: remove duplicate include in memory.c
Fix following checkincludes.pl warning:
mm/memory.c: linux/mm_inline.h is included more than once.

The include is in line 44. Remove the duplicated here.

Link: https://lkml.kernel.org/r/20220427064717.803019-1-wanjiabing@vivo.com
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:14 -07:00
Wei Yang
ed657e5568 mm/vmscan: don't use NUMA_NO_NODE as indicator of page on different node
Now we are sure there is at least one page on page_list, so it is safe to
get the nid of it.  This means it is not necessary to use NUMA_NO_NODE as
an indicator for the beginning of iteration or a page on different node.

Link: https://lkml.kernel.org/r/20220429014426.29223-2-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:14 -07:00
Wei Yang
1ae65e2749 mm/vmscan: filter empty page_list at the beginning
node_page_list would always be !empty on finishing the loop, except
page_list is empty.

Let's handle empty page_list before doing any real work including touching
PF_MEMALLOC flag.

Link: https://lkml.kernel.org/r/20220429014426.29223-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:14 -07:00
Miaohe Lin
f19a27e399 mm/vmscan: use helper folio_is_file_lru()
Use helper folio_is_file_lru() to check whether folio is file lru.  Minor
readability improvement.

[linmiaohe@huawei.com: use folio_is_file_lru()]
  Link: https://lkml.kernel.org/r/20220428105802.21389-1-linmiaohe@huawei.com
Link: https://lkml.kernel.org/r/20220425111232.23182-7-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:14 -07:00
Miaohe Lin
4355e4b265 mm/vmscan: remove obsolete comment in kswapd_run
Since commit 6b700b5b3c ("mm/vmscan.c: remove cpu online notification
for now"), cpu online notification is removed.  So kswapd won't move to
proper cpus if cpus are hot-added.  Remove this obsolete comment.

Link: https://lkml.kernel.org/r/20220425111232.23182-6-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:14 -07:00
Miaohe Lin
9aafcffc18 mm/vmscan: take all base pages of THP into account when race with speculative reference
If the page has buffers, shrink_page_list will try to free the buffer
mappings associated with the page and try to free the page as well.  In
the rare race with speculative reference, the page will be freed shortly
by speculative reference.  But nr_reclaimed is not incremented correctly
when we come across the THP.  We need to account all the base pages in
this case.

Link: https://lkml.kernel.org/r/20220425111232.23182-5-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:14 -07:00
Miaohe Lin
1fe47c0beb mm/vmscan: introduce helper function reclaim_page_list()
Introduce helper function reclaim_page_list() to eliminate the duplicated
code of doing shrink_page_list() and putback_lru_page.  Also we can
separate node reclaim from node page list operation this way.  No
functional change intended.

Link: https://lkml.kernel.org/r/20220425111232.23182-3-linmiaohe@huawei.com
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Huang, Ying <ying.huang@intel.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:13 -07:00
Miaohe Lin
32a331a72f mm/vmscan: add a comment about MADV_FREE pages check in folio_check_dirty_writeback
Patch series "A few cleanup and fixup patches for vmscan

This series contains a few patches to remove obsolete comment, introduce
helper to remove duplicated code and so no.  Also we take all base pages
of THP into account in rare race condition.  More details can be found in
the respective changelogs.


This patch (of 6):

The MADV_FREE pages check in folio_check_dirty_writeback is a bit hard to
follow.  Add a comment to make the code clear.

Link: https://lkml.kernel.org/r/20220425111232.23182-2-linmiaohe@huawei.com
Suggested-by: Huang, Ying <ying.huang@intel.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2022-05-13 07:20:13 -07:00