Commit Graph

1264597 Commits

Author SHA1 Message Date
Eric Van Hensbergen
824f06ff81
fs/9p: Revert "fs/9p: fix dups even in uncached mode"
This reverts commit be57855f50.

It caused a regression involving duplicate inode numbers in
some tester trees.  The bad behavior seems to be dependent on inode
reuse policy in underlying file system, so it did not trigger in my
test setup.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-04-11 23:36:33 +00:00
Eric Van Hensbergen
6e45a30fe5
fs/9p: remove erroneous nlink init from legacy stat2inode
In 9p2000 legacy mode, stat2inode initializes nlink to 1,
which is redundant with what alloc_inode should have already set.
9p2000.u overrides this with extensions if present in the stat
structure, and 9p2000.L incorporates nlink into its stat structure.

At the very least this probably messes with directory nlink
accounting in legacy mode.

Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-04-09 23:53:00 +00:00
Jeff Layton
7a84602297
9p: explicitly deny setlease attempts
9p is a remote network protocol, and it doesn't support asynchronous
notifications from the server. Ensure that we don't hand out any leases
since we can't guarantee they'll be broken when a file's contents
change.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 19:52:55 +00:00
Joakim Sindholt
4e5d208cc9
fs/9p: fix the cache always being enabled on files with qid flags
I'm not sure why this check was ever here. After updating to 6.6 I
suddenly found caching had been turned on by default and neither
cache=none nor the new directio would turn it off. After walking through
the new code very manually I realized that it's because the caching has
to be, in effect, turned off explicitly by setting P9L_DIRECT and
whenever a file has a flag, in my case QTAPPEND, it doesn't get set.

Setting aside QTDIR which seems to ignore the new fid->mode entirely,
the rest of these either should be subject to the same cache rules as
every other QTFILE or perhaps very explicitly not cached in the case of
QTAUTH.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 15:10:29 +00:00
Joakim Sindholt
87de39e705
fs/9p: translate O_TRUNC into OTRUNC
This one hits both 9P2000 and .u as it appears v9fs has never translated
the O_TRUNC flag.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 15:10:28 +00:00
Joakim Sindholt
cd25e15e57
fs/9p: only translate RWX permissions for plain 9P2000
Garbage in plain 9P2000's perm bits is allowed through, which causes it
to be able to set (among others) the suid bit. This was presumably not
the intent since the unix extended bits are handled explicitly and
conditionally on .u.

Signed-off-by: Joakim Sindholt <opensource@zhasha.com>
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-28 13:59:23 +00:00
Linus Torvalds
8d025e2092 Changes since last update:
- Add a new reviewer Sandeep Dhavale to build a healthier community;
 
  - Drop experimental warning for FSDAX.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmYE1ucRHHhpYW5nQGtl
 cm5lbC5vcmcACgkQUXZn5Zlu5qoSnw//Q35Sc7fIQb6YfuOOQ2VQX7TVE4jTrLej
 +3hIQUz3BA8wRwrBYdeJLcvjda4y+0gNxlpw2ycNS6S4vxAiVEAuwPJYO6oAC0Il
 ETa6opc1vpTMeXgwwtbXC7ACnWas9EQC61Z4E8W5zeVcNqQZnZyInMJ9Rkjqs/iJ
 VjJH2wWR5MIgWJdEPqchPx/28nqbBOcztc7ARJqpujyZEvu+OBVIPv8P/7N0a5yG
 bpHkDrzoelBMkpktpuvrkv84ymyCDC7LH9mq+Wk6dY5wRFxa2BtoZVh6YcpcjrpR
 75GZVg3BN3Ph41JCwYHqxyRGpoLO11dSYi6DNDVngxOGtkTNRVGrJ1FYEapWV+o8
 1MnEbl0vZSHUkjrIFbfZTFSqpvW2XSfEOa3heNDFknmzT/ISobSGENULp9cggcYI
 jhS6wtVG4bl5bCiCKCZluByr8/J8TCQc/5t5f5bQLy2MWqlyjaSx82uWuDpzO1Hh
 +q+p+MB+ketMUwxaIUAuTNgzhFPFT4Na/ni9WqP7Ri3GJY6pdjDUUtrtoIBK4oPQ
 ajUWhPlOk5zMwLq9Jl4MiG1ostBI9P37ZerjsdaLDZYElGhTjwjPu/xlh7p26Inq
 Ufq3QaQH2wai+oAVS6Sli3dJfb399XVJhmT2WFMH+0DmQW6JzvsGTdUX2fMgv5sb
 I7dVfuceTs0=
 =+ZW9
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:

 - Add a new reviewer Sandeep Dhavale to build a healthier community

 - Drop experimental warning for FSDAX

* tag 'erofs-for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  MAINTAINERS: erofs: add myself as reviewer
  erofs: drop experimental warning for FSDAX
2024-03-27 20:24:09 -07:00
Linus Torvalds
4076fa1612 fs/9p: fix errors found in 6.9-rc1
Two of these fix syzbot reported issues, and
 the other fixes a unused variable in some configurations.
 
 Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEElpbw0ZalkJikytFRiP/V+0pf/5gFAmYEjyAACgkQiP/V+0pf
 /5jrDhAAkrEkFry9Zp7txMNG7fyRVRW8BPkvCUbORBdmLm01lswTnvHDT5pbSYcf
 Z3KxSpo0DvO2G3XEHQHQN24Sbx78Xg8XssXGFc0j1/hQbGpwIyXM/NTa3VVnfmwH
 lV2ysLa5zCR81k0hu8QzGe8DSOFsyE9oz3ABE7nCjmQsOI0zgyOvn7Uy1rwT5B11
 lBYN4rotl2Ie1wPoVTf8PNbeAWtuVR3FN9GTSvZuNDFbYjYEv0zFangxXIBFhn2g
 pUJ22C6CYenhLzTKBRVW/CcSxPVVS4jFok1nws07MmlXozrbklPYHuN6XQYFetLt
 wtnZxmUmy1u4pmpG2xPziUOGb7wdm6cDC1aIrcYPlbk/9U2iYHuVgQz0WOcEIbA3
 g5LrgCpUZO8UPxEvtFYDq9oUAhxhjf6BI7/rR0fNWp8JgJ0scz3d/e+Q/srWWr23
 Pej+10N5JijTEuD39BUnWy/yTz0WVBg1nP/C44buB0zSWJbxiPOJd0imSW6ipoIk
 Onr3FmhnRSiIfuUKEH/jI+QsqKzZTdBe9e3+SYBEg4KXvM3e1ltJgDs+BIdWbo00
 ih/jkZL7iX+uzkOdVR3o/1KagkwHTcU8P+4i8wELD7+CquQXvsedfqeTmz5JvNIR
 9Df1Gu1xCeINn6jnlSaVAonnCi1awJoaxFW+azCb+q5MP8lP6IU=
 =1ev6
 -----END PGP SIGNATURE-----

Merge tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs

Pull 9p fixes from Eric Van Hensbergen:
 "Two of these fix syzbot reported issues, and the other fixes a unused
  variable in some configurations"

* tag '9p-fixes-for-6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ericvh/v9fs:
  fs/9p: fix uninitialized values during inode evict
  fs/9p: remove redundant pointer v9ses
  fs/9p: fix uaf in in v9fs_stat2inode_dotl
2024-03-27 14:53:56 -07:00
Linus Torvalds
400dd456bd for-6.9-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmYEfJYACgkQxWXV+ddt
 WDucIg/+IupuqdLKnj6bepxX/VufnFAjecD3sRgZQQLIMfm3MQX3TzbNoPYEiAGU
 tNG6jxYgkGRoyhN3aIQnsJmRFje5epYjNA5+ueUNT2/KfyKonnS2TIKQt6u7XBls
 fl4SCTSNRX7w/QUNUWwyY5/86yzV4F8w19X5nVOKcp7Nz3hUBdeDZWAmMlYyHuFW
 N2YRyNdCxB4Y0U9g1vgI63wFjOac0F+7RTHGsDH7ueOZ2dbtDM38lHBoCbdX3jmy
 5nG7wVJZp2H/zCmzrVQJ897CMfr3h9r9Kxx8EE3JDJaJ5sMMaRh361rgsZTaGsjz
 SwUzT6Z0u0hsBANSTOUZixhfX5sqArmemG+XpFu6Rq+732DqS+c4vWRSu7c8Rc8i
 +4HIQNsjJqm/d1u2IyxXfuqSbaULLnyYQ8rdEx3o2AM37JnuTvOWoB+v/JqPb9TI
 aG+bOPvg7GM9Sl3IoM5sR+j3bEebranZbUF+UiDEujZJiY+uiw3vMbFvyOBRWaUU
 ODTpNoyCmz94mWg79hyosOjM9A/NCEkRH4oSc+YeqOvzTIBG3V+D3HxN/DX4FVTy
 VDxMdptu0aPIEkUQ3nvsj4t3OKgj1w9rxZFpRYH33zJVvRqZ8VrgtC9V6zgPv3h7
 suQL4s4i4EiIgAk2Z0OR23wDwg1TwXVWLGErfQVHslvhl/a2Qb4=
 =Lhhp
 -----END PGP SIGNATURE-----

Merge tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - fix race when reading extent buffer and 'uptodate' status is missed
   by one thread (introduced in 6.5)

 - do additional validation of devices using major:minor numbers

 - zoned mode fixes:
     - use zone-aware super block access during scrub
     - fix use-after-free during device replace (found by KASAN)
     - also delete zones that are 100% unusable to reclaim space

 - extent unpinning fixes:
     - fix extent map leak after error handling
     - print correct range in error message

 - error code and message updates

* tag 'for-6.9-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix race in read_extent_buffer_pages()
  btrfs: return accurate error code on open failure in open_fs_devices()
  btrfs: zoned: don't skip block groups with 100% zone unusable
  btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
  btrfs: fix message not properly printing interval when adding extent map
  btrfs: fix warning messages not printing interval at unpin_extent_range()
  btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
  btrfs: validate device maj:min during open
  btrfs: zoned: fix use-after-free in do_zone_finish()
  btrfs: zoned: use zone aware sb location for scrub
2024-03-27 13:56:41 -07:00
Linus Torvalds
dc189b8e6a 21 hotfixes. 11 are cc:stable and the remainder address post-6.8 issues
or aren't considered suitable for backporting.
 
 zswap figures prominently in the post-6.8 issues - folloup against the
 large amount of changes we have just made to that code.
 
 Apart from that, all over the map.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZgRltQAKCRDdBJ7gKXxA
 jvxhAP0SCuKnuzs/K8BTnO4CHXJPPXMhdWFbFjUOKoNToAwRJQEA0+5NvpAEtbov
 ljCkRZ3hXHJ6D5MH2vXSJIbkjekz/gg=
 =0j71
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-03-27-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "Various hotfixes. About half are cc:stable and the remainder address
  post-6.8 issues or aren't considered suitable for backporting.

  zswap figures prominently in the post-6.8 issues - folloup against the
  large amount of changes we have just made to that code.

  Apart from that, all over the map"

* tag 'mm-hotfixes-stable-2024-03-27-11-25' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits)
  crash: use macro to add crashk_res into iomem early for specific arch
  mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
  selftests/mm: fix ARM related issue with fork after pthread_create
  hexagon: vmlinux.lds.S: handle attributes section
  userfaultfd: fix deadlock warning when locking src and dst VMAs
  tmpfs: fix race on handling dquot rbtree
  selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM
  mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion
  ARM: prctl: reject PR_SET_MDWE on pre-ARMv6
  prctl: generalize PR_SET_MDWE support check to be per-arch
  MAINTAINERS: remove incorrect M: tag for dm-devel@lists.linux.dev
  mm: zswap: fix kernel BUG in sg_init_one
  selftests: mm: restore settings from only parent process
  tools/Makefile: remove cgroup target
  mm: cachestat: fix two shmem bugs
  mm: increase folio batch size
  mm,page_owner: fix recursion
  mailmap: update entry for Leonard Crestez
  init: open /initrd.image with O_LARGEFILE
  selftests/mm: Fix build with _FORTIFY_SOURCE
  ...
2024-03-27 13:30:48 -07:00
Linus Torvalds
962490525c Probes fixes for v6.9-rc1:
- tracing/probes: initialize a 'val' local variable with zero. This variable
    is read by FETCH_OP_ST_EDATA in a loop, and is expected to be initialized
    by FETCH_OP_ARG in the same loop. Since this expectation is not obvious,
    thus smatch warns it. Initializing 'val' with zero fixes this warning.
 -----BEGIN PGP SIGNATURE-----
 
 iQFPBAABCgA5FiEEh7BulGwFlgAOi5DV2/sHvwUrPxsFAmYEKw4bHG1hc2FtaS5o
 aXJhbWF0c3VAZ21haWwuY29tAAoJENv7B78FKz8bZmkH/2TNnrQivrNEyihXaZNC
 i4kyOGiNpIKi1BvE2nBFItqMlKbry+bR74tgk9+9BZiwjnTSz+pVDsWuOrSvrLNy
 j9yOjqZMdEnx7YdSZX+HUxpXzklSyAd1PQ0kEOYHRgqcXgcUEYSaHPpZZJqBtIah
 X/4rmLyMTFE2CdbdJ+dfPn9GLn0C9TP+jGz0DC/efEXdXJ3bVY2d3DJXHNiVpW58
 4Yy0UWwUu/QAk2JGUf1r3GEwoklKNNjvGLaCMm4mAoY3haDzfbGecstAB4cc/gD/
 FzN2q+7Z7RgJloICJo1Bh91AWi06+EeFFk6MZ5oHzjsp7EH/aGCy5umEsIPGjHr4
 Y0Q=
 =b4J5
 -----END PGP SIGNATURE-----

Merge tag 'probes-fixes-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull probes fixlet from Masami Hiramatsu:

 - tracing/probes: initialize a 'val' local variable with zero.

   This variable is read by FETCH_OP_ST_EDATA in a loop, and is
   initialized by FETCH_OP_ARG in the same loop. Since this
   initialization is not obvious, smatch warns about it.

   Explicitly initializing 'val' with zero fixes this warning.

* tag 'probes-fixes-v6.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: probes: Fix to zero initialize a local variable
2024-03-27 10:01:24 -07:00
Linus Torvalds
f4a432914a execve fixes for v6.9-rc2
- Fix selftests to conform to the TAP output format (Muhammad Usama Anjum)
 
 - Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)
 
 - Replace deprecated strncpy usage (Justin Stitt)
 
 - Replace another /bin/sh instance in selftests
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmYDT3sWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJjxlD/49PYpA4hMReEJ/01UkMn7IT2DP
 QWV9IfaPTodj9tjQngalhcF7r6O5guRR7MRfZxyaXriq4aJNzOLm2STmwSG1cOgP
 hP9D0HnMSc5CrqMJ2kSTr3ETK0a2mTivWl375TUgGdW+QJo7YYInHYaH2THhme1Z
 MkLHqSkruHw6YVvSvzoWiwZ4taiia7op8HbAEvJQiwnJdiVeCLIYbf2AxXNop2xv
 xcmoGkSh6KSiQ0XQ7VXs4LC3v/ElHBINSbChoXPBDY5kBWZybyxRwYCVt8mJftgF
 mVGXBFFpnaLU/gDayPg/Pyq9sW1bLpi8w0BBu419BVfAQ475K+YZ/V8nj4fm95e3
 gIWm3x1O48r0OxdzmPb5re/s7lG5uNLzzFEWIus18NmqgA8S1CyFveRB3Zh8LlXB
 9UEt4mlcgp/CLAo1Zv6IBe6UDcAf4AR4Tq+d+etmORTqHmM7n399XivNuft9myyB
 9ObLCfKvOa71uF0n714XLHc5STk2KTK70Me2L/H5gitSqjIEKFNQ5SOaSbsGImDv
 i4YPnptCJFTQumE0Tu5hna8uyjOXFIxq/zkfDmzc1wP8FcijwRx3UPoO6WlQsdfx
 5cmJSaIX1bhFC+4gxAoEHUDWPh/f4kLeDpIXX6NPH28Do1wxLnri3ryvkfgkw5Vj
 /1E03LXfcnnSbjQAPQ==
 =Siss
 -----END PGP SIGNATURE-----

Merge tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull execve fixes from Kees Cook:

 - Fix selftests to conform to the TAP output format (Muhammad Usama
   Anjum)

 - Fix NOMMU linux_binprm::exec pointer in auxv (Max Filippov)

 - Replace deprecated strncpy usage (Justin Stitt)

 - Replace another /bin/sh instance in selftests

* tag 'execve-v6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  binfmt: replace deprecated strncpy
  exec: Fix NOMMU linux_binprm::exec in transfer_args_to_stack()
  selftests/exec: Convert remaining /bin/sh to /bin/bash
  selftests/exec: execveat: Improve debug reporting
  selftests/exec: recursion-depth: conform test to TAP format output
  selftests/exec: load_address: conform test to TAP format output
  selftests/exec: binfmt_script: Add the overall result line according to TAP
2024-03-27 09:57:30 -07:00
Linus Torvalds
498e47cd1d Fix build errors due to new UIO_MEM_DMA_COHERENT mess
Commit 576882ef5e ("uio: introduce UIO_MEM_DMA_COHERENT type")
introduced a new use-case for 'struct uio_mem' where the 'mem' field now
contains a kernel virtual address when 'memtype' is set to
UIO_MEM_DMA_COHERENT.

That in turn causes build errors, because 'mem' is of type
'phys_addr_t', and a virtual address is a pointer type.  When the code
just blindly uses cast to mix the two, it caused problems when
phys_addr_t isn't the same size as a pointer - notably on 32-bit
architectures with PHYS_ADDR_T_64BIT.

The proper thing to do would probably be to use a union member, and not
have any casts, and make the 'mem' member be a union of 'mem.physaddr'
and 'mem.vaddr', based on 'memtype'.

This is not that proper thing.  This is just fixing the ugly casts to be
even uglier, but at least not cause build errors on 32-bit platforms
with 64-bit physical addresses.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 576882ef5e ("uio: introduce UIO_MEM_DMA_COHERENT type")
Fixes: 7722151e46 ("uio_pruss: UIO_MEM_DMA_COHERENT conversion")
Fixes: 019947805a ("uio_dmem_genirq: UIO_MEM_DMA_COHERENT conversion")
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Chris Leech <cleech@redhat.com>
Cc: Nilesh Javali <njavali@marvell.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linuxfoundation.org>
2024-03-27 09:48:47 -07:00
Linus Torvalds
5b4cdd9c56 Fix memory leak in posix_clock_open()
If the clk ops.open() function returns an error, we don't release the
pccontext we allocated for this clock.

Re-organize the code slightly to make it all more obvious.

Reported-by: Rohit Keshri <rkeshri@redhat.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Fixes: 60c6946675 ("posix-clock: introduce posix_clock_context concept")
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linuxfoundation.org>
2024-03-27 09:03:22 -07:00
Baoquan He
32fbe52465 crash: use macro to add crashk_res into iomem early for specific arch
There are regression reports[1][2] that crashkernel region on x86_64 can't
be added into iomem tree sometime.  This causes the later failure of kdump
loading.

This happened after commit 4a693ce65b ("kdump: defer the insertion of
crashkernel resources") was merged.

Even though, these reported issues are proved to be related to other
component, they are just exposed after above commmit applied, I still
would like to keep crashk_res and crashk_low_res being added into iomem
early as before because the early adding has been always there on x86_64
and working very well.  For safety of kdump, Let's change it back.

Here, add a macro HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY to limit that
only ARCH defining the macro can have the early adding
crashk_res/_low_res into iomem. Then define
HAVE_ARCH_ADD_CRASH_RES_TO_IOMEM_EARLY on x86 to enable it.

Note: In reserve_crashkernel_low(), there's a remnant of crashk_low_res
handling which was mistakenly added back in commit 85fcde402d ("kexec:
split crashkernel reservation code out from crash_core.c").

[1]
[PATCH V2] x86/kexec: do not update E820 kexec table for setup_data
https://lore.kernel.org/all/Zfv8iCL6CT2JqLIC@darkstar.users.ipa.redhat.com/T/#u

[2]
Question about Address Range Validation in Crash Kernel Allocation
https://lore.kernel.org/all/4eeac1f733584855965a2ea62fa4da58@huawei.com/T/#u

Link: https://lkml.kernel.org/r/ZgDYemRQ2jxjLkq+@MiWiFi-R3L-srv
Fixes: 4a693ce65b ("kdump: defer the insertion of crashkernel resources")
Signed-off-by: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Huacai Chen <chenhuacai@loongson.cn>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Bohac <jbohac@suse.cz>
Cc: Li Huafei <lihuafei1@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:14:12 -07:00
Johannes Weiner
25cd241408 mm: zswap: fix data loss on SWP_SYNCHRONOUS_IO devices
Zhongkun He reports data corruption when combining zswap with zram.

The issue is the exclusive loads we're doing in zswap. They assume
that all reads are going into the swapcache, which can assume
authoritative ownership of the data and so the zswap copy can go.

However, zram files are marked SWP_SYNCHRONOUS_IO, and faults will try to
bypass the swapcache.  This results in an optimistic read of the swap data
into a page that will be dismissed if the fault fails due to races.  In
this case, zswap mustn't drop its authoritative copy.

Link: https://lore.kernel.org/all/CACSyD1N+dUvsu8=zV9P691B9bVq33erwOXNTmEaUbi9DrDeJzw@mail.gmail.com/
Fixes: b9c91c4341 ("mm: zswap: support exclusive loads")
Link: https://lkml.kernel.org/r/20240324210447.956973-1-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Tested-by: Zhongkun He <hezhongkun.hzk@bytedance.com>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Acked-by: Barry Song <baohua@kernel.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: <stable@vger.kernel.org>	[6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:14:12 -07:00
Edward Liaw
8c864371b2 selftests/mm: fix ARM related issue with fork after pthread_create
Following issue was observed while running the uffd-unit-tests selftest
on ARM devices. On x86_64 no issues were detected:

pthread_create followed by fork caused deadlock in certain cases wherein
fork required some work to be completed by the created thread.  Used
synchronization to ensure that created thread's start function has started
before invoking fork.

[edliaw@google.com: refactored to use atomic_bool]
Link: https://lkml.kernel.org/r/20240325194100.775052-1-edliaw@google.com
Fixes: 760aee0b71 ("selftests/mm: add tests for RO pinning vs fork()")
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Signed-off-by: Edward Liaw <edliaw@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:14:12 -07:00
Nathan Chancellor
549aa9678a hexagon: vmlinux.lds.S: handle attributes section
After the linked LLVM change, the build fails with
CONFIG_LD_ORPHAN_WARN_LEVEL="error", which happens with allmodconfig:

  ld.lld: error: vmlinux.a(init/main.o):(.hexagon.attributes) is being placed in '.hexagon.attributes'

Handle the attributes section in a similar manner as arm and riscv by
adding it after the primary ELF_DETAILS grouping in vmlinux.lds.S, which
fixes the error.

Link: https://lkml.kernel.org/r/20240319-hexagon-handle-attributes-section-vmlinux-lds-s-v1-1-59855dab8872@kernel.org
Fixes: 113616ec5b ("hexagon: select ARCH_WANT_LD_ORPHAN_WARN")
Link: 31f4b329c8
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: Brian Cain <bcain@quicinc.com>
Cc: Bill Wendling <morbo@google.com>
Cc: Justin Stitt <justinstitt@google.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:23 -07:00
Lokesh Gidra
30af24facf userfaultfd: fix deadlock warning when locking src and dst VMAs
Use down_read_nested() to avoid the warning.

Link: https://lkml.kernel.org/r/20240321235818.125118-1-lokeshgidra@google.com
Fixes: 867a43a34f ("userfaultfd: use per-vma locks in userfaultfd operations")
Reported-by: syzbot+49056626fe41e01f2ba7@syzkaller.appspotmail.com
Signed-off-by: Lokesh Gidra <lokeshgidra@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Brian Geffon <bgeffon@google.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jann Horn <jannh@google.com> [Bug #2]
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Lokesh Gidra <lokeshgidra@google.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Nicolas Geoffray <ngeoffray@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:23 -07:00
Carlos Maiolino
0a69b6b3a0 tmpfs: fix race on handling dquot rbtree
A syzkaller reproducer found a race while attempting to remove dquot
information from the rb tree.

Fetching the rb_tree root node must also be protected by the
dqopt->dqio_sem, otherwise, giving the right timing, shmem_release_dquot()
will trigger a warning because it couldn't find a node in the tree, when
the real reason was the root node changing before the search starts:

Thread 1				Thread 2
- shmem_release_dquot()			- shmem_{acquire,release}_dquot()

- fetch ROOT				- Fetch ROOT

					- acquire dqio_sem
- wait dqio_sem

					- do something, triger a tree rebalance
					- release dqio_sem

- acquire dqio_sem
- start searching for the node, but
  from the wrong location, missing
  the node, and triggering a warning.

Link: https://lkml.kernel.org/r/20240320124011.398847-1-cem@kernel.org
Fixes: eafc474e20 ("shmem: prepare shmem quota infrastructure")
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Reported-by: Ubisectech Sirius <bugreport@ubisectech.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:23 -07:00
Edward Liaw
105840ebd7 selftests/mm: sigbus-wp test requires UFFD_FEATURE_WP_HUGETLBFS_SHMEM
The sigbus-wp test requires the UFFD_FEATURE_WP_HUGETLBFS_SHMEM flag for
shmem and hugetlb targets.  Otherwise it is not backwards compatible with
kernels <5.19 and fails with EINVAL.

Link: https://lkml.kernel.org/r/20240321232023.2064975-1-edliaw@google.com
Fixes: 73c1ea939b ("selftests/mm: move uffd sig/events tests into uffd unit tests")
Signed-off-by: Edward Liaw <edliaw@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Peter Xu <peterx@redhat.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:22 -07:00
Johannes Weiner
30fb6a8d9e mm: zswap: fix writeback shinker GFP_NOIO/GFP_NOFS recursion
Kent forwards this bug report of zswap re-entering the block layer
from an IO request allocation and locking up:

[10264.128242] sysrq: Show Blocked State
[10264.128268] task:kworker/20:0H   state:D stack:0     pid:143   tgid:143   ppid:2      flags:0x00004000
[10264.128271] Workqueue: bcachefs_io btree_write_submit [bcachefs]
[10264.128295] Call Trace:
[10264.128295]  <TASK>
[10264.128297]  __schedule+0x3e6/0x1520
[10264.128303]  schedule+0x32/0xd0
[10264.128304]  schedule_timeout+0x98/0x160
[10264.128308]  io_schedule_timeout+0x50/0x80
[10264.128309]  wait_for_completion_io_timeout+0x7f/0x180
[10264.128310]  submit_bio_wait+0x78/0xb0
[10264.128313]  swap_writepage_bdev_sync+0xf6/0x150
[10264.128317]  zswap_writeback_entry+0xf2/0x180
[10264.128319]  shrink_memcg_cb+0xe7/0x2f0
[10264.128322]  __list_lru_walk_one+0xb9/0x1d0
[10264.128325]  list_lru_walk_one+0x5d/0x90
[10264.128326]  zswap_shrinker_scan+0xc4/0x130
[10264.128327]  do_shrink_slab+0x13f/0x360
[10264.128328]  shrink_slab+0x28e/0x3c0
[10264.128329]  shrink_one+0x123/0x1b0
[10264.128331]  shrink_node+0x97e/0xbc0
[10264.128332]  do_try_to_free_pages+0xe7/0x5b0
[10264.128333]  try_to_free_pages+0xe1/0x200
[10264.128334]  __alloc_pages_slowpath.constprop.0+0x343/0xde0
[10264.128337]  __alloc_pages+0x32d/0x350
[10264.128338]  allocate_slab+0x400/0x460
[10264.128339]  ___slab_alloc+0x40d/0xa40
[10264.128345]  kmem_cache_alloc+0x2e7/0x330
[10264.128348]  mempool_alloc+0x86/0x1b0
[10264.128349]  bio_alloc_bioset+0x200/0x4f0
[10264.128352]  bio_alloc_clone+0x23/0x60
[10264.128354]  alloc_io+0x26/0xf0 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
[10264.128361]  dm_submit_bio+0xb8/0x580 [dm_mod 7e9e6b44df4927f93fb3e4b5c782767396f58382]
[10264.128366]  __submit_bio+0xb0/0x170
[10264.128367]  submit_bio_noacct_nocheck+0x159/0x370
[10264.128368]  bch2_submit_wbio_replicas+0x21c/0x3a0 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
[10264.128391]  btree_write_submit+0x1cf/0x220 [bcachefs 85f1b9a7a824f272eff794653a06dde1a94439f2]
[10264.128406]  process_one_work+0x178/0x350
[10264.128408]  worker_thread+0x30f/0x450
[10264.128409]  kthread+0xe5/0x120

The zswap shrinker resumes the swap_writepage()s that were intercepted
by the zswap store. This will enter the block layer, and may even
enter the filesystem depending on the swap backing file.

Make it respect GFP_NOIO and GFP_NOFS.

Link: https://lore.kernel.org/linux-mm/rc4pk2r42oyvjo4dc62z6sovquyllq56i5cdgcaqbd7wy3hfzr@n4nbxido3fme/
Link: https://lkml.kernel.org/r/20240321182532.60000-1-hannes@cmpxchg.org
Fixes: b5ba474f3f ("zswap: shrink zswap pool based on memory pressure")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reported-by: Jérôme Poulin <jeromepoulin@gmail.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Cc: stable@vger.kernel.org	[v6.8]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:22 -07:00
Zev Weiss
166ce846dc ARM: prctl: reject PR_SET_MDWE on pre-ARMv6
On v5 and lower CPUs we can't provide MDWE protection, so ensure we fail
any attempt to enable it via prctl(PR_SET_MDWE).

Previously such an attempt would misleadingly succeed, leading to any
subsequent mmap(PROT_READ|PROT_WRITE) or execve() failing unconditionally
(the latter somewhat violently via force_fatal_sig(SIGSEGV) due to
READ_IMPLIES_EXEC).

Link: https://lkml.kernel.org/r/20240227013546.15769-6-zev@bewilderbeest.net
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Cc: <stable@vger.kernel.org>	[6.3+]
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: Helge Deller <deller@gmx.de>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Russell King (Oracle) <linux@armlinux.org.uk>
Cc: Sam James <sam@gentoo.org>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:22 -07:00
Zev Weiss
d5aad4c2ca prctl: generalize PR_SET_MDWE support check to be per-arch
Patch series "ARM: prctl: Reject PR_SET_MDWE where not supported".

I noticed after a recent kernel update that my ARM926 system started
segfaulting on any execve() after calling prctl(PR_SET_MDWE).  After some
investigation it appears that ARMv5 is incapable of providing the
appropriate protections for MDWE, since any readable memory is also
implicitly executable.

The prctl_set_mdwe() function already had some special-case logic added
disabling it on PARISC (commit 793838138c, "prctl: Disable
prctl(PR_SET_MDWE) on parisc"); this patch series (1) generalizes that
check to use an arch_*() function, and (2) adds a corresponding override
for ARM to disable MDWE on pre-ARMv6 CPUs.

With the series applied, prctl(PR_SET_MDWE) is rejected on ARMv5 and
subsequent execve() calls (as well as mmap(PROT_READ|PROT_WRITE)) can
succeed instead of unconditionally failing; on ARMv6 the prctl works as it
did previously.

[0] https://lore.kernel.org/all/2023112456-linked-nape-bf19@gregkh/


This patch (of 2):

There exist systems other than PARISC where MDWE may not be feasible to
support; rather than cluttering up the generic code with additional
arch-specific logic let's add a generic function for checking MDWE support
and allow each arch to override it as needed.

Link: https://lkml.kernel.org/r/20240227013546.15769-4-zev@bewilderbeest.net
Link: https://lkml.kernel.org/r/20240227013546.15769-5-zev@bewilderbeest.net
Signed-off-by: Zev Weiss <zev@bewilderbeest.net>
Acked-by: Helge Deller <deller@gmx.de>	[parisc]
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: Florent Revest <revest@chromium.org>
Cc: "James E.J. Bottomley" <James.Bottomley@HansenPartnership.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Ondrej Mosnacek <omosnace@redhat.com>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Russell King (Oracle) <linux@armlinux.org.uk>
Cc: Sam James <sam@gentoo.org>
Cc: Stefan Roesch <shr@devkernel.io>
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Yin Fengwei <fengwei.yin@intel.com>
Cc: <stable@vger.kernel.org>	[6.3+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:22 -07:00
Kuan-Wei Chiu
db09f2df91 MAINTAINERS: remove incorrect M: tag for dm-devel@lists.linux.dev
The dm-devel@lists.linux.dev mailing list should only be listed under the
L: (List) tag in the MAINTAINERS file.  However, it was incorrectly listed
under both L: and M: (Maintainers) tags, which is not accurate.  Remove
the M: tag for dm-devel@lists.linux.dev in the MAINTAINERS file to reflect
the correct categorization.

Link: https://lkml.kernel.org/r/20240319181842.249547-1-visitorckw@gmail.com
Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw>
Cc: Matthew Sakai <msakai@redhat.com>
Cc: Michael Sclafani <dm-devel@lists.linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:21 -07:00
Barry Song
9c500835f2 mm: zswap: fix kernel BUG in sg_init_one
sg_init_one() relies on linearly mapped low memory for the safe
utilization of virt_to_page().  Otherwise, we trigger a kernel BUG,

kernel BUG at include/linux/scatterlist.h:187!
Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
Modules linked in:
CPU: 0 PID: 2997 Comm: syz-executor198 Not tainted 6.8.0-syzkaller #0
Hardware name: ARM-Versatile Express
PC is at sg_set_buf include/linux/scatterlist.h:187 [inline]
PC is at sg_init_one+0x9c/0xa8 lib/scatterlist.c:143
LR is at sg_init_table+0x2c/0x40 lib/scatterlist.c:128
Backtrace:
[<807e16ac>] (sg_init_one) from [<804c1824>] (zswap_decompress+0xbc/0x208 mm/zswap.c:1089)
 r7:83471c80 r6:def6d08c r5:844847d0 r4:ff7e7ef4
[<804c1768>] (zswap_decompress) from [<804c4468>] (zswap_load+0x15c/0x198 mm/zswap.c:1637)
 r9:8446eb80 r8:8446eb80 r7:8446eb84 r6:def6d08c r5:00000001 r4:844847d0
[<804c430c>] (zswap_load) from [<804b9644>] (swap_read_folio+0xa8/0x498 mm/page_io.c:518)
 r9:844ac800 r8:835e6c00 r7:00000000 r6:df955d4c r5:00000001 r4:def6d08c
[<804b959c>] (swap_read_folio) from [<804bb064>] (swap_cluster_readahead+0x1c4/0x34c mm/swap_state.c:684)
 r10:00000000 r9:00000007 r8:df955d4b r7:00000000 r6:00000000 r5:00100cca
 r4:00000001
[<804baea0>] (swap_cluster_readahead) from [<804bb3b8>] (swapin_readahead+0x68/0x4a8 mm/swap_state.c:904)
 r10:df955eb8 r9:00000000 r8:00100cca r7:84476480 r6:00000001 r5:00000000
 r4:00000001
[<804bb350>] (swapin_readahead) from [<8047cde0>] (do_swap_page+0x200/0xcc4 mm/memory.c:4046)
 r10:00000040 r9:00000000 r8:844ac800 r7:84476480 r6:00000001 r5:00000000
 r4:df955eb8
[<8047cbe0>] (do_swap_page) from [<8047e6c4>] (handle_pte_fault mm/memory.c:5301 [inline])
[<8047cbe0>] (do_swap_page) from [<8047e6c4>] (__handle_mm_fault mm/memory.c:5439 [inline])
[<8047cbe0>] (do_swap_page) from [<8047e6c4>] (handle_mm_fault+0x3d8/0x12b8 mm/memory.c:5604)
 r10:00000040 r9:842b3900 r8:7eb0d000 r7:84476480 r6:7eb0d000 r5:835e6c00
 r4:00000254
[<8047e2ec>] (handle_mm_fault) from [<80215d28>] (do_page_fault+0x148/0x3a8 arch/arm/mm/fault.c:326)
 r10:00000007 r9:842b3900 r8:7eb0d000 r7:00000207 r6:00000254 r5:7eb0d9b4
 r4:df955fb0
[<80215be0>] (do_page_fault) from [<80216170>] (do_DataAbort+0x38/0xa8 arch/arm/mm/fault.c:558)
 r10:7eb0da7c r9:00000000 r8:80215be0 r7:df955fb0 r6:7eb0d9b4 r5:00000207
 r4:8261d0e0
[<80216138>] (do_DataAbort) from [<80200e3c>] (__dabt_usr+0x5c/0x60 arch/arm/kernel/entry-armv.S:427)
Exception stack(0xdf955fb0 to 0xdf955ff8)
5fa0:                                     00000000 00000000 22d5f800 0008d158
5fc0: 00000000 7eb0d9a4 00000000 00000109 00000000 00000000 7eb0da7c 7eb0da3c
5fe0: 00000000 7eb0d9a0 00000001 00066bd4 00000010 ffffffff
 r8:824a9044 r7:835e6c00 r6:ffffffff r5:00000010 r4:00066bd4
Code: 1a000004 e1822003 e8860094 e89da8f0 (e7f001f2)
---[ end trace 0000000000000000 ]---
----------------
Code disassembly (best guess):
   0:	1a000004 	bne	0x18
   4:	e1822003 	orr	r2, r2, r3
   8:	e8860094 	stm	r6, {r2, r4, r7}
   c:	e89da8f0 	ldm	sp, {r4, r5, r6, r7, fp, sp, pc}
* 10:	e7f001f2 	udf	#18 <-- trapping instruction

Consequently, we have two choices: either employ kmap_to_page() alongside
sg_set_page(), or resort to copying high memory contents to a temporary
buffer residing in low memory.  However, considering the introduction of
the WARN_ON_ONCE in commit ef6e06b2ef ("highmem: fix kmap_to_page() for
kmap_local_page() addresses"), which specifically addresses high memory
concerns, it appears that memcpy remains the sole viable option.

Link: https://lkml.kernel.org/r/20240318234706.95347-1-21cnbao@gmail.com
Fixes: 270700dd06 ("mm/zswap: remove the memcpy if acomp is not sleepable")
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
Reported-by: syzbot+adbc983a1588b7805de3@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/000000000000bbb3d80613f243a6@google.com/
Tested-by: syzbot+adbc983a1588b7805de3@syzkaller.appspotmail.com
Acked-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Chris Li <chrisl@kernel.org>
Cc: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:21 -07:00
Muhammad Usama Anjum
c52eb6db7b selftests: mm: restore settings from only parent process
The atexit() is called from parent process as well as forked processes. 
Hence the child restores the settings at exit while the parent is still
executing.  Fix this by checking pid of atexit() calling process and only
restore THP number from parent process.

Link: https://lkml.kernel.org/r/20240314094045.157149-1-usama.anjum@collabora.com
Fixes: c23ea61726 ("selftests/mm: protection_keys: save/restore nr_hugepages settings")
Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
Tested-by: Joey Gouly <joey.gouly@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:21 -07:00
Cong Liu
950bf45d3b tools/Makefile: remove cgroup target
The tools/cgroup directory no longer contains a Makefile.  This patch
updates the top-level tools/Makefile to remove references to building and
installing cgroup components.  This change reflects the current structure
of the tools directory and fixes the build failure when building tools in
the top-level directory.

linux/tools$ make cgroup
  DESCEND cgroup
make[1]: *** No targets specified and no makefile found.  Stop.
make: *** [Makefile:73: cgroup] Error 2

Link: https://lkml.kernel.org/r/20240315012249.439639-1-liucong2@kylinos.cn
Signed-off-by: Cong Liu <liucong2@kylinos.cn>
Acked-by: Stanislav Fomichev <sdf@google.com>
Reviewed-by: Dmitry Rokosov <ddrokosov@salutedevices.com>
Cc: Cong Liu <liucong2@kylinos.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:21 -07:00
Johannes Weiner
d5d39c707a mm: cachestat: fix two shmem bugs
When cachestat on shmem races with swapping and invalidation, there
are two possible bugs:

1) A swapin error can have resulted in a poisoned swap entry in the
   shmem inode's xarray. Calling get_shadow_from_swap_cache() on it
   will result in an out-of-bounds access to swapper_spaces[].

   Validate the entry with non_swap_entry() before going further.

2) When we find a valid swap entry in the shmem's inode, the shadow
   entry in the swapcache might not exist yet: swap IO is still in
   progress and we're before __remove_mapping; swapin, invalidation,
   or swapoff have removed the shadow from swapcache after we saw the
   shmem swap entry.

   This will send a NULL to workingset_test_recent(). The latter
   purely operates on pointer bits, so it won't crash - node 0, memcg
   ID 0, eviction timestamp 0, etc. are all valid inputs - but it's a
   bogus test. In theory that could result in a false "recently
   evicted" count.

   Such a false positive wouldn't be the end of the world. But for
   code clarity and (future) robustness, be explicit about this case.

   Bail on get_shadow_from_swap_cache() returning NULL.

Link: https://lkml.kernel.org/r/20240315095556.GC581298@cmpxchg.org
Fixes: cf264e1329 ("cachestat: implement cachestat syscall")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Chengming Zhou <chengming.zhou@linux.dev>	[Bug #1]
Reported-by: Jann Horn <jannh@google.com>		[Bug #2]
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Reviewed-by: Nhat Pham <nphamcs@gmail.com>
Cc: <stable@vger.kernel.org>				[v6.5+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:20 -07:00
Matthew Wilcox (Oracle)
9cecde80aa mm: increase folio batch size
On a 104 thread, 2 socket Skylake system, Intel report a 4.7% performance
reduction with will-it-scale page_fault2.  This was due to reducing the
size of the batch from 32 to 15.  Increasing the folio batch size from 15
to 31 gives a performance increase of 12.5% relative to the original, or
17.2% relative to the reduced performance commit.

The penalty of this commit is an additional 128 bytes of stack usage.  Six
folio_batches are also allocated from percpu memory in cpu_fbatches so
that will be an additional 768 bytes of percpu memory (per CPU).  Tim Chen
originally submitted a patch like this in 2020:
https://lore.kernel.org/linux-mm/d1cc9f12a8ad6c2a52cb600d93b06b064f2bbc57.1593205965.git.tim.c.chen@linux.intel.com/

Link: https://lkml.kernel.org/r/20240315140823.2478146-1-willy@infradead.org
Fixes: 99fbb6bfc1 ("mm: make folios_put() the basis of release_pages()")
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Yujie Liu <yujie.liu@intel.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202403151058.7048f6a8-oliver.sang@intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:20 -07:00
Oscar Salvador
7844c01472 mm,page_owner: fix recursion
Prior to 217b2119b9 ("mm,page_owner: implement the tracking of the
stacks count") the only place where page_owner could potentially go into
recursion due to its need of allocating more memory was in save_stack(),
which ends up calling into stackdepot code with the possibility of
allocating memory.

We made sure to guard against that by signaling that the current task was
already in page_owner code, so in case a recursion attempt was made, we
could catch that and return dummy_handle.

After above commit, a new place in page_owner code was introduced where we
could allocate memory, meaning we could go into recursion would we take
that path.

Make sure to signal that we are in page_owner in that codepath as well. 
Move the guard code into two helpers {un}set_current_in_page_owner() and
use them prior to calling in the two functions that might allocate memory.

Link: https://lkml.kernel.org/r/20240315222610.6870-1-osalvador@suse.de
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Fixes: 217b2119b9 ("mm,page_owner: implement the tracking of the stacks count")
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:20 -07:00
Leonard Crestez
3290032466 mailmap: update entry for Leonard Crestez
Put my personal email first because NXP employment ended some time ago.
Also add my old intel email address.

Link: https://lkml.kernel.org/r/f568faa0-2380-4e93-a312-b80c1e367645@gmail.com
Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:20 -07:00
John Sperbeck
4624b346cf init: open /initrd.image with O_LARGEFILE
If initrd data is larger than 2Gb, we'll eventually fail to write to the
/initrd.image file when we hit that limit, unless O_LARGEFILE is set.

Link: https://lkml.kernel.org/r/20240317221522.896040-1-jsperbeck@google.com
Signed-off-by: John Sperbeck <jsperbeck@google.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:19 -07:00
Vitaly Chikunov
8b65ef5ad4 selftests/mm: Fix build with _FORTIFY_SOURCE
Add missing flags argument to open(2) call with O_CREAT.

Some tests fail to compile if _FORTIFY_SOURCE is defined (to any valid
value) (together with -O), resulting in similar error messages such as:

  In file included from /usr/include/fcntl.h:342,
                   from gup_test.c:1:
  In function 'open',
      inlined from 'main' at gup_test.c:206:10:
  /usr/include/bits/fcntl2.h:50:11: error: call to '__open_missing_mode' declared with attribute error: open with O_CREAT or O_TMPFILE in second argument needs 3 arguments
     50 |           __open_missing_mode ();
        |           ^~~~~~~~~~~~~~~~~~~~~~

_FORTIFY_SOURCE is enabled by default in some distributions, so the
tests are not built by default and are skipped.

open(2) man-page warns about missing flags argument: "if it is not
supplied, some arbitrary bytes from the stack will be applied as the
file mode."

Link: https://lkml.kernel.org/r/20240318023445.3192922-1-vt@altlinux.org
Fixes: aeb85ed4f4 ("tools/testing/selftests/vm/gup_benchmark.c: allow user specified file")
Fixes: fbe37501b2 ("mm: huge_memory: debugfs for file-backed THP split")
Fixes: c942f5bd17 ("selftests: soft-dirty: add test for mprotect")
Signed-off-by: Vitaly Chikunov <vt@altlinux.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:19 -07:00
Peter Xu
f8572367ea mm/memory: fix missing pte marker for !page on pte zaps
Commit 0cf18e839f of large folio zap work broke uffd-wp.  Now mm's uffd
unit test "wp-unpopulated" will trigger this WARN_ON_ONCE().

The WARN_ON_ONCE() asserts that an VMA cannot be registered with
userfaultfd-wp if it contains a !normal page, but it's actually possible. 
One example is an anonymous vma, register with uffd-wp, read anything will
install a zero page.  Then when zap on it, this should trigger.

What's more, removing that WARN_ON_ONCE may not be enough either, because
we should also not rely on "whether it's a normal page" to decide whether
pte marker is needed.  For example, one can register wr-protect over some
DAX regions to track writes when UFFD_FEATURE_WP_ASYNC enabled, in which
case it can have page==NULL for a devmap but we may want to keep the
marker around.

Link: https://lkml.kernel.org/r/20240313213107.235067-1-peterx@redhat.com
Fixes: 0cf18e839f ("mm/memory: handle !page case in zap_present_pte() separately")
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Muhammad Usama Anjum <usama.anjum@collabora.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-03-26 11:07:19 -07:00
Linus Torvalds
7033999ecd printk changes for 6.9-rc2
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmYBhi4ACgkQUqAMR0iA
 lPKy0A/+Pl9ysymbBzQYZEMhiIEPz1Aakh56hcDz89g9Axqn9hPgDgZaA/AKmN3Y
 aRo3X/aZVk9uQfUQ1VCYX/738/9fATIt6P8WzukSID7PpYMd5wnjg2rdqtb6zErz
 wGDPDJabMMUWg0IT0LEXnERUI31TS5VCc8MlHkZFjnT1j7oKMGebC+kX1YNK2Jfr
 hk/4eaidDsFlXR9m+2P8DcmSlloXIvG326Ke7aLDs46FIRL+pzoQhjIIN/hn1Mi3
 FPNkYMvOWdDKA1s55EqOno/i7MpQRf2tjGjfQd0mzJgUxk9hXo0YYpHUpnY3MKDE
 +RO3P0MheNJHSatoTEY5/r5aclk1Kg4QnYdDhCWm57flq68iM0aM7i0ACfuVNr5z
 fXz5Uv4lRAd1993yK2nRrczljGN2OpNium4kTifHGhfMZd7UuY9Yh3CPjMIfDm3e
 iZ7wp6L7pBTv9px4cK3U3pM5+w1Rr8pCZPJRK36WxhyZ9ivNU9tzTxO7dwLvqalN
 3q2y3jjgBupTZVawCtodFl6XHY+2LMihR+4tVWXWKblAdQ2H9UkFLIeJqjlAYZeW
 O9OV/gDLB8VfKWnVkkouKcFm6GXngFpTSc7APPDHLJrfbUcZ7FyMjtjyDVSeeaz+
 +lcvDwf2n2nl3UM5kmm6tNma/TDprKQzxk3m2JgPXjDcDgmA24w=
 =7UMV
 -----END PGP SIGNATURE-----

Merge tag 'printk-for-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux

Pull printk fix from Petr Mladek:

 - Prevent scheduling in an atomic context when printk() takes over the
   console flushing duty

* tag 'printk-for-6.9-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
  printk: Update @console_may_schedule in console_trylock_spinning()
2024-03-26 09:25:57 -07:00
Linus Torvalds
576bb2d8e3 pwm: A fix targeting v6.9-rc2
This contains a single fix for a regression introduced in v5.18-rc1
 which made the img pwm driver fail to bind.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEP4GsaTp6HlmJrf7Tj4D7WH0S/k4FAmYCf3sACgkQj4D7WH0S
 /k59xggAn8fUj0xLQ3ho+rhH0uJkoTJlIYYQCX5CDkE5VM/a0JbyVd8q2oH708Z9
 KOKcWixUG6gGm8RXTlA1Hn6xpKnjUCSpC/37BcqtnTBhp5rqq2HHukZ331yFFOGw
 mf63QElYTFnWh3TzfVMJOa/tzVeJQ2nzPpm28VoJEl9lWZs845VwUaKCMtZJ6cpd
 gP6STcDJkUsY1jinN4nMQfS9iBalzvaHNVUMGPwxbnvVvexM/qjOULiSUmc7dKKY
 K7WPwFp3yNT4GtRaJFwV6sAJQ/R86XQOwYHBGnutUY5u0eOp2PGjYnFHTfl0hxth
 KTR5PpveQpx7v2EdYLGn2/WrNcvS8A==
 =Hq7u
 -----END PGP SIGNATURE-----

Merge tag 'pwm/for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux

Pull pwm fix from Uwe Kleine-König:
 "This contains a single fix for a regression introduced in v5.18-rc1
  which made the img pwm driver fail to bind"

* tag 'pwm/for-6.9-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/ukleinek/linux:
  pwm: img: fix pwm clock lookup
2024-03-26 09:20:56 -07:00
Tavian Barnes
ef1e68236b btrfs: fix race in read_extent_buffer_pages()
There are reports from tree-checker that detects corrupted nodes,
without any obvious pattern so possibly an overwrite in memory.
After some debugging it turns out there's a race when reading an extent
buffer the uptodate status can be missed.

To prevent concurrent reads for the same extent buffer,
read_extent_buffer_pages() performs these checks:

    /* (1) */
    if (test_bit(EXTENT_BUFFER_UPTODATE, &eb->bflags))
        return 0;

    /* (2) */
    if (test_and_set_bit(EXTENT_BUFFER_READING, &eb->bflags))
        goto done;

At this point, it seems safe to start the actual read operation. Once
that completes, end_bbio_meta_read() does

    /* (3) */
    set_extent_buffer_uptodate(eb);

    /* (4) */
    clear_bit(EXTENT_BUFFER_READING, &eb->bflags);

Normally, this is enough to ensure only one read happens, and all other
callers wait for it to finish before returning.  Unfortunately, there is
a racey interleaving:

    Thread A | Thread B | Thread C
    ---------+----------+---------
       (1)   |          |
             |    (1)   |
       (2)   |          |
       (3)   |          |
       (4)   |          |
             |    (2)   |
             |          |    (1)

When this happens, thread B kicks of an unnecessary read. Worse, thread
C will see UPTODATE set and return immediately, while the read from
thread B is still in progress.  This race could result in tree-checker
errors like this as the extent buffer is concurrently modified:

    BTRFS critical (device dm-0): corrupted node, root=256
    block=8550954455682405139 owner mismatch, have 11858205567642294356
    expect [256, 18446744073709551360]

Fix it by testing UPTODATE again after setting the READING bit, and if
it's been set, skip the unnecessary read.

Fixes: d7172f52e9 ("btrfs: use per-buffer locking for extent_buffer reading")
Link: https://lore.kernel.org/linux-btrfs/CAHk-=whNdMaN9ntZ47XRKP6DBes2E5w7fi-0U3H2+PS18p+Pzw@mail.gmail.com/
Link: https://lore.kernel.org/linux-btrfs/f51a6d5d7432455a6a858d51b49ecac183e0bbc9.1706312914.git.wqu@suse.com/
Link: https://lore.kernel.org/linux-btrfs/c7241ea4-fcc6-48d2-98c8-b5ea790d6c89@gmx.com/
CC: stable@vger.kernel.org # 6.5+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Tavian Barnes <tavianator@tavianator.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor update of changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Anand Jain
2f1aeab9fc btrfs: return accurate error code on open failure in open_fs_devices()
When attempting to exclusive open a device which has no exclusive open
permission, such as a physical device associated with the flakey dm
device, the open operation will fail, resulting in a mount failure.

In this particular scenario, we erroneously return -EINVAL instead of the
correct error code provided by the bdev_open_by_path() function, which is
-EBUSY.

Fix this, by returning error code from the bdev_open_by_path() function.
With this correction, the mount error message will align with that of
ext4 and xfs.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Johannes Thumshirn
a8b70c7f86 btrfs: zoned: don't skip block groups with 100% zone unusable
Commit f4a9f21941 ("btrfs: do not delete unused block group if it may be
used soon") changed the behaviour of deleting unused block-groups on zoned
filesystems. Starting with this commit, we're using
btrfs_space_info_used() to calculate the number of used bytes in a
space_info. But btrfs_space_info_used() also accounts
btrfs_space_info::bytes_zone_unusable as used bytes.

So if a block group is 100% zone_unusable it is skipped from the deletion
step.

In order not to skip fully zone_unusable block-groups, also check if the
block-group has bytes left that can be used on a zoned filesystem.

Fixes: f4a9f21941 ("btrfs: do not delete unused block group if it may be used soon")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
2133460061 btrfs: use btrfs_warn() to log message at btrfs_add_extent_mapping()
At btrfs_add_extent_mapping(), if we failed to merge the extent map, which
is unexpected and theoretically should never happen, we use WARN_ONCE() to
log a message which is not great because we don't get information about
which filesystem it relates to in case we have multiple btrfs filesystems
mounted. So change this to use btrfs_warn() and surround the error check
with WARN_ON() so we always get a useful stack trace and the condition is
flagged as "unlikely" since it's not expected to ever happen.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
379c872393 btrfs: fix message not properly printing interval when adding extent map
At btrfs_add_extent_mapping(), if we are unable to merge the existing
extent map, we print a warning message that suggests interval ranges in
the form "[X, Y)", where the first element is the inclusive start offset
of a range and the second element is the exclusive end offset. However
we end up printing the length of the ranges instead of the exclusive end
offsets. So fix this by printing the range end offsets.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:39 +01:00
Filipe Manana
4dc1d69c2b btrfs: fix warning messages not printing interval at unpin_extent_range()
At unpin_extent_range() we print warning messages that are supposed to
print an interval in the form "[X, Y)", with the first element being an
inclusive start offset and the second element being the exclusive end
offset of a range. However we end up printing the range's length instead
of the range's exclusive end offset, so fix that to avoid having confusing
and non-sense messages in case we hit one of these unexpected scenarios.

Fixes: 00deaf04df ("btrfs: log messages at unpin_extent_range() during unexpected cases")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Filipe Manana
8a565ec04d btrfs: fix extent map leak in unexpected scenario at unpin_extent_cache()
At unpin_extent_cache() if we happen to find an extent map with an
unexpected start offset, we jump to the 'out' label and never release the
reference we added to the extent map through the call to
lookup_extent_mapping(), therefore resulting in a leak. So fix this by
moving the free_extent_map() under the 'out' label.

Fixes: c03c89f821 ("btrfs: handle errors returned from unpin_extent_cache()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Anand Jain
9f7eb8405d btrfs: validate device maj:min during open
Boris managed to create a device capable of changing its maj:min without
altering its device path.

Only multi-devices can be scanned. A device that gets scanned and remains
in the btrfs kernel cache might end up with an incorrect maj:min.

Despite the temp-fsid feature patch did not introduce this bug, it could
lead to issues if the above multi-device is converted to a single device
with a stale maj:min. Subsequently, attempting to mount the same device
with the correct maj:min might mistake it for another device with the same
fsid, potentially resulting in wrongly auto-enabling the temp-fsid feature.

To address this, this patch validates the device's maj:min at the time of
device open and updates it if it has changed since the last scan.

CC: stable@vger.kernel.org # 6.7+
Fixes: a5b8a5f9f8 ("btrfs: support cloned-device mount capability")
Reported-by: Boris Burkov <boris@bur.io>
Co-developed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>#
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:42:38 +01:00
Johannes Thumshirn
1ec17ef591 btrfs: zoned: fix use-after-free in do_zone_finish()
Shinichiro reported the following use-after-free triggered by the device
replace operation in fstests btrfs/070.

 BTRFS info (device nullb1): scrub: finished on devid 1 with status: 0
 ==================================================================
 BUG: KASAN: slab-use-after-free in do_zone_finish+0x91a/0xb90 [btrfs]
 Read of size 8 at addr ffff8881543c8060 by task btrfs-cleaner/3494007

 CPU: 0 PID: 3494007 Comm: btrfs-cleaner Tainted: G        W          6.8.0-rc5-kts #1
 Hardware name: Supermicro Super Server/X11SPi-TF, BIOS 3.3 02/21/2020
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x90
  print_report+0xcf/0x670
  ? __virt_addr_valid+0x200/0x3e0
  kasan_report+0xd8/0x110
  ? do_zone_finish+0x91a/0xb90 [btrfs]
  ? do_zone_finish+0x91a/0xb90 [btrfs]
  do_zone_finish+0x91a/0xb90 [btrfs]
  btrfs_delete_unused_bgs+0x5e1/0x1750 [btrfs]
  ? __pfx_btrfs_delete_unused_bgs+0x10/0x10 [btrfs]
  ? btrfs_put_root+0x2d/0x220 [btrfs]
  ? btrfs_clean_one_deleted_snapshot+0x299/0x430 [btrfs]
  cleaner_kthread+0x21e/0x380 [btrfs]
  ? __pfx_cleaner_kthread+0x10/0x10 [btrfs]
  kthread+0x2e3/0x3c0
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x31/0x70
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1b/0x30
  </TASK>

 Allocated by task 3493983:
  kasan_save_stack+0x33/0x60
  kasan_save_track+0x14/0x30
  __kasan_kmalloc+0xaa/0xb0
  btrfs_alloc_device+0xb3/0x4e0 [btrfs]
  device_list_add.constprop.0+0x993/0x1630 [btrfs]
  btrfs_scan_one_device+0x219/0x3d0 [btrfs]
  btrfs_control_ioctl+0x26e/0x310 [btrfs]
  __x64_sys_ioctl+0x134/0x1b0
  do_syscall_64+0x99/0x190
  entry_SYSCALL_64_after_hwframe+0x6e/0x76

 Freed by task 3494056:
  kasan_save_stack+0x33/0x60
  kasan_save_track+0x14/0x30
  kasan_save_free_info+0x3f/0x60
  poison_slab_object+0x102/0x170
  __kasan_slab_free+0x32/0x70
  kfree+0x11b/0x320
  btrfs_rm_dev_replace_free_srcdev+0xca/0x280 [btrfs]
  btrfs_dev_replace_finishing+0xd7e/0x14f0 [btrfs]
  btrfs_dev_replace_by_ioctl+0x1286/0x25a0 [btrfs]
  btrfs_ioctl+0xb27/0x57d0 [btrfs]
  __x64_sys_ioctl+0x134/0x1b0
  do_syscall_64+0x99/0x190
  entry_SYSCALL_64_after_hwframe+0x6e/0x76

 The buggy address belongs to the object at ffff8881543c8000
  which belongs to the cache kmalloc-1k of size 1024
 The buggy address is located 96 bytes inside of
  freed 1024-byte region [ffff8881543c8000, ffff8881543c8400)

 The buggy address belongs to the physical page:
 page:00000000fe2c1285 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1543c8
 head:00000000fe2c1285 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
 flags: 0x17ffffc0000840(slab|head|node=0|zone=2|lastcpupid=0x1fffff)
 page_type: 0xffffffff()
 raw: 0017ffffc0000840 ffff888100042dc0 ffffea0019e8f200 dead000000000002
 raw: 0000000000000000 0000000000100010 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff8881543c7f00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  ffff8881543c7f80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 >ffff8881543c8000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                        ^
  ffff8881543c8080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8881543c8100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

This UAF happens because we're accessing stale zone information of a
already removed btrfs_device in do_zone_finish().

The sequence of events is as follows:

btrfs_dev_replace_start
  btrfs_scrub_dev
   btrfs_dev_replace_finishing
    btrfs_dev_replace_update_device_in_mapping_tree <-- devices replaced
    btrfs_rm_dev_replace_free_srcdev
     btrfs_free_device                              <-- device freed

cleaner_kthread
 btrfs_delete_unused_bgs
  btrfs_zone_finish
   do_zone_finish              <-- refers the freed device

The reason for this is that we're using a cached pointer to the chunk_map
from the block group, but on device replace this cached pointer can
contain stale device entries.

The staleness comes from the fact, that btrfs_block_group::physical_map is
not a pointer to a btrfs_chunk_map but a memory copy of it.

Also take the fs_info::dev_replace::rwsem to prevent
btrfs_dev_replace_update_device_in_mapping_tree() from changing the device
underneath us again.

Note: btrfs_dev_replace_update_device_in_mapping_tree() is holding
fs_info::mapping_tree_lock, but as this is a spinning read/write lock we
cannot take it as the call to blkdev_zone_mgmt() requires a memory
allocation which may not sleep.
But btrfs_dev_replace_update_device_in_mapping_tree() is always called with
the fs_info::dev_replace::rwsem held in write mode.

Many thanks to Shinichiro for analyzing the bug.

Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com>
CC: stable@vger.kernel.org # 6.8
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-03-26 16:41:01 +01:00
Linus Torvalds
928a87efa4 gfs2 fix
- Fix boundary check in punch_hole
 -----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEJZs3krPW0xkhLMTc1b+f6wMTZToFAmYBaM0UHGFncnVlbmJh
 QHJlZGhhdC5jb20ACgkQ1b+f6wMTZTrzqw/9GpK71h1dIA8vYqInumdrUabksLKy
 jRMR2ZxfzBKLdAfgn9AS3nrWNos72vjAxbjYCi/fbY9uvIK1/zzq7Ef7601kCetM
 NzxShY8AwLJa9mO8O5yReLL7O/61gjlcdD6rSjkYwphWuobd5vpudKkibgpdJyH8
 bn6U1/2K5ASFtWyTRbudOIsz4AqPUE6ZB4KxSuCDx7uFiQjnuh6sk8wfg48pdig7
 GAsNPmBFfWAQXClPnI/WFG0hpkuRIK1hk9ITWx1ybu2JqaNeVXRBqGoRZbEkPYju
 qEkp4oT3j/1siBz1sMOjC5tfmAzhLvAeL61pD2EOcm5Bpd3iKJibYt/uCIpYFHM0
 WfRcUmqEduN1zhDuSR4KSe49JQ5dFXVf83YqUgbtrHFiHHXNBYYqFNUVfcDAB1p7
 IH9AlNd82zyxJ3fsBX7VpEbGC2qNa3K8hYO7px8DNVrPGzW7AhPF1Lsh0OE9GlZU
 H5f70Nryi98iwadbePBUchTrx0S3iYjk2TQgLGf5L/lAl6J/MRNG31kittDtehri
 cct/JBr8sUAK014TS5NxPbpxqDnVot3UsYk7h6s7WdmM1svfs7j5f1mo3ovMEGqX
 io5Z6pFEE7n1ce5hbieDKr3JFh6LxP1ArUSY8oz5rR0shE2XHMcIdq3J26Vfi0Q0
 4VjdBic/7rUUBXI=
 =QXEK
 -----END PGP SIGNATURE-----

Merge tag 'gfs2-v6.8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2

Pull gfs2 fix from Andreas Gruenbacher:

 - Fix boundary check in punch_hole

* tag 'gfs2-v6.8-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2:
  gfs2: Fix invalid metadata access in punch_hole
2024-03-25 10:53:39 -07:00
Linus Torvalds
174fdc93a2 This push fixes a regression that broke iwd as well as a divide by
zero in iaa.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEn51F/lCuNhUwmDeSxycdCkmxi6cFAmX9buQACgkQxycdCkmx
 i6dw1w/+IiwZX9A62NLDEBJbaXavQEHqoI1mhXq4Zwqeor5DhUlUGbmAhdZ5bg88
 oA60zjroLgt+Wke3phXFMjavSdwyLMd8zkDU3BjhqHAD6ASZz4ml51MCi48ROKB8
 lKSSAvjJX3VAHlYSxctUB/KWfzZZaGGUWQdtkC/90Tqp6OGeVyBbQCiQPyF4I55P
 e1B8ADY5Ey2d+Rgos1whMuLkKa077yoWdiDgX5PFjfGalRdNh4jNsojrCSx6AEiI
 KijWUtaGBNj9Exc04NeCa0JQNff4vynhn21ygMbgPMEMTde0SlHHdf6FWKrZcm6h
 JlNYYVGXjcobMC62dQdTosTLxuAqSd4kr5UaiCjO52QdM8txfBJCm/PDARS2z9Gl
 xROvBpOfRSn0Z8GpuHaBMNot4DlKv6y/puwmef6o3qYlUJH+CnuHUFaLOHRYVt6d
 B0tWi7PWyx2jorj0VHUCpqiGGYlbUq6lWhGYt8XzPHeQ2ZewzH6EbbF5qKpEWGiB
 3X8Dxl1PpO8sSwOQRabBpoWoXxJLL6G9uvyJCjJeftL9ZDEeTBDvdA9qis7t7kR5
 Ckv4q/5rOq4MK3NsXsL6lJZY7ckQavLhCoZlRU9kcQt3MG06i6KUDoo83L/49rUL
 zq5adU8sI3lP/2VplfA0XEeYFuLlBWH4TwNAmwJBRoWFiG8j8XM=
 =Mau0
 -----END PGP SIGNATURE-----

Merge tag 'v6.9-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fixes from Herbert Xu:
 "This fixes a regression that broke iwd as well as a divide by zero in
  iaa"

* tag 'v6.9-p2' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: iaa - Fix nr_cpus < nr_iaa case
  Revert "crypto: pkcs7 - remove sha1 support"
2024-03-25 10:48:23 -07:00
Eric Van Hensbergen
6630036b7c
fs/9p: fix uninitialized values during inode evict
If an iget fails due to not being able to retrieve information
from the server then the inode structure is only partially
initialized.  When the inode gets evicted, references to
uninitialized structures (like fscache cookies) were being
made.

This patch checks for a bad_inode before doing anything other
than clearing the inode from the cache.  Since the inode is
bad, it shouldn't have any state associated with it that needs
to be written back (and there really isn't a way to complete
those anyways).

Reported-by: syzbot+eb83fe1cce5833cd66a0@syzkaller.appspotmail.com
Signed-off-by: Eric Van Hensbergen <ericvh@kernel.org>
2024-03-25 14:16:06 +00:00
Masami Hiramatsu (Google)
0add699ad0 tracing: probes: Fix to zero initialize a local variable
Fix to initialize 'val' local variable with zero.
Dan reported that Smatch static code checker reports an error that a local
'val' variable needs to be initialized. Actually, the 'val' is expected to
be initialized by FETCH_OP_ARG in the same loop, but it is not obvious. So
initialize it with zero.

Link: https://lore.kernel.org/all/171092223833.237219.17304490075697026697.stgit@devnote2/

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/b010488e-68aa-407c-add0-3e059254aaa0@moroto.mountain/
Fixes: 25f00e40ce ("tracing/probes: Support $argN in return probe (kprobe and fprobe)")
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-03-25 16:24:31 +09:00