Commit Graph

94943 Commits

Author SHA1 Message Date
Qu Wenruo
a8706d0271 btrfs: wait for writeback if sector size is smaller than page size
[PROBLEM]
If sector perfect compression is enabled for sector size < page size
case, the following case can lead dirty ranges not being written back:

     0     32K     64K     96K     128K
     |     |///////||//////|     |/|
                                 124K

In above example, the page size is 64K, and we need to write back above
two pages.

- Submit for page 0 (main thread)
  We found delalloc range [32K, 96K), which can be compressed.
  So we queue an async range for [32K, 96K).
  This means, the page unlock/clearing dirty/setting writeback will
  all happen in a workqueue context.

- The compression is done, and compressed range is submitted (workqueue)
  Since the compression is done in asynchronously, the compression can
  be done before the main thread to submit for page 64K.

  Now the whole range [32K, 96K), involving two pages, will be marked
  writeback.

- Submit for page 64K (main thread)
  extent_write_cache_pages() got its wbc->sync_mode is WB_SYNC_NONE,
  so it skips the writeback wait.

  And unlock the page and exit. This means the dirty range [124K, 128K)
  will never be submitted, until next writeback happens for page 64K.

This will never happen for previous kernels because:

- For sector size == page size case
  Since one page is one sector, if a page is marked writeback it will
  not have dirty flags.
  So this corner case will never hit.

- For sector size < page size case
  We never do subpage compression, a range can only be submitted for
  compression if the range is fully page aligned.
  This change makes the subpage behavior mostly the same as non-subpage
  cases.

[ENHANCEMENT]
Instead of relying WB_SYNC_NONE check only, if it's a subpage case, then
always wait for writeback flags.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Qu Wenruo
dd5e276254 btrfs: compression: add an ASSERT() to ensure the read-in length is sane
There are already two bugs (one in zlib, one in zstd) that involved
compression path is not handling sector size < page size cases well.

So it makes more sense to make sure that btrfs_compress_folios() returns

Since we already have two bugs (one in zlib, one in zstd) in the
compression path resulting the @total_in be to larger than the
to-be-compressed range length, there is enough reason to add an ASSERT()
to make sure the total read-in length doesn't exceed the input length.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Qu Wenruo
90275a7762 btrfs: zstd: make the compression path to handle sector size < page size
Inside zstd_compress_folios(), after exhausted one input page, we need
to switch to the next page as input.

However when counting the total input bytes (@tot_in), we always increase
it by PAGE_SIZE.

For the following case, it can cause incorrect value:

        0          32K         64K          96K
        |          |///////////||///////////|

After compressing range [32K, 64K), we switch to the next page, and
increasing @tot_in by 64K, while we only read 32K.

This will cause the @total_in to return a value larger than the input
length.

Fix it by only increase @tot_in by the input size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Qu Wenruo
f6ebedb09b btrfs: zlib: make the compression path to handle sector size < page size
Inside zlib_compress_folios(), each time we switch the input page cache,
the @start is increased by PAGE_SIZE.

But for the incoming compression support for sector size < page size
(previously we support compression only when the range is fully page
aligned), this is not going to handle the following case:

    0          32K         64K          96K
    |          |///////////||///////////|

@start has the initial value 32K, indicating the start filepos of the
to-be-compressed range.

And when grabbing the first page as input, we always call "start +=
PAGE_SIZE;".

But since @start is starting at 32K, it will be increased by 64K,
resulting it to be 96K for the next range, causing incorrect input range
and corruption for the future subpage compression.

Fix it by only increase @start by the input size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Qu Wenruo
67cd3f2217 btrfs: split out CONFIG_BTRFS_EXPERIMENTAL from CONFIG_BTRFS_DEBUG
Currently CONFIG_BTRFS_EXPERIMENTAL is not only for the extra debugging
output, but also for experimental features.

This is not ideal to distinguish planned but not yet stable features
from those purely designed for debugging.

This patch splits the following features into CONFIG_BTRFS_EXPERIMENTAL:

- Extent map shrinker
  This seems to be the first one to exit experimental.

- Extent tree v2
  This seems to be the last one to graduate from experimental.

- Raid stripe tree
- Csum offload mode
- Send protocol v3

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Qu Wenruo
c186345a6b btrfs: make assert_rbio() to only check CONFIG_BTRFS_ASSERT
According to the description, CONFIG_BTRFS_DEBUG is only for extra
debug info, meanwhile sanity checks should be managed by
CONFIG_BTRFS_ASSERT.

There is no need to check both to enable assert_rbio().

Just remove the check for CONFIG_BTRFS_DEBUG.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Johannes Thumshirn
8cca35cb29 btrfs: don't take dev_replace rwsem on task already holding it
Running fstests btrfs/011 with MKFS_OPTIONS="-O rst" to force the usage of
the RAID stripe-tree, we get the following splat from lockdep:

 BTRFS info (device sdd): dev_replace from /dev/sdd (devid 1) to /dev/sdb started

 ============================================
 WARNING: possible recursive locking detected
 6.11.0-rc3-btrfs-for-next #599 Not tainted
 --------------------------------------------
 btrfs/2326 is trying to acquire lock:
 ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250

 but task is already holding lock:
 ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250

 other info that might help us debug this:
  Possible unsafe locking scenario:

        CPU0
        ----
   lock(&fs_info->dev_replace.rwsem);
   lock(&fs_info->dev_replace.rwsem);

  *** DEADLOCK ***

  May be due to missing lock nesting notation

 1 lock held by btrfs/2326:
  #0: ffff88810f215c98 (&fs_info->dev_replace.rwsem){++++}-{3:3}, at: btrfs_map_block+0x39f/0x2250

 stack backtrace:
 CPU: 1 UID: 0 PID: 2326 Comm: btrfs Not tainted 6.11.0-rc3-btrfs-for-next #599
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 Call Trace:
  <TASK>
  dump_stack_lvl+0x5b/0x80
  __lock_acquire+0x2798/0x69d0
  ? __pfx___lock_acquire+0x10/0x10
  ? __pfx___lock_acquire+0x10/0x10
  lock_acquire+0x19d/0x4a0
  ? btrfs_map_block+0x39f/0x2250
  ? __pfx_lock_acquire+0x10/0x10
  ? find_held_lock+0x2d/0x110
  ? lock_is_held_type+0x8f/0x100
  down_read+0x8e/0x440
  ? btrfs_map_block+0x39f/0x2250
  ? __pfx_down_read+0x10/0x10
  ? do_raw_read_unlock+0x44/0x70
  ? _raw_read_unlock+0x23/0x40
  btrfs_map_block+0x39f/0x2250
  ? btrfs_dev_replace_by_ioctl+0xd69/0x1d00
  ? btrfs_bio_counter_inc_blocked+0xd9/0x2e0
  ? __kasan_slab_alloc+0x6e/0x70
  ? __pfx_btrfs_map_block+0x10/0x10
  ? __pfx_btrfs_bio_counter_inc_blocked+0x10/0x10
  ? kmem_cache_alloc_noprof+0x1f2/0x300
  ? mempool_alloc_noprof+0xed/0x2b0
  btrfs_submit_chunk+0x28d/0x17e0
  ? __pfx_btrfs_submit_chunk+0x10/0x10
  ? bvec_alloc+0xd7/0x1b0
  ? bio_add_folio+0x171/0x270
  ? __pfx_bio_add_folio+0x10/0x10
  ? __kasan_check_read+0x20/0x20
  btrfs_submit_bio+0x37/0x80
  read_extent_buffer_pages+0x3df/0x6c0
  btrfs_read_extent_buffer+0x13e/0x5f0
  read_tree_block+0x81/0xe0
  read_block_for_search+0x4bd/0x7a0
  ? __pfx_read_block_for_search+0x10/0x10
  btrfs_search_slot+0x78d/0x2720
  ? __pfx_btrfs_search_slot+0x10/0x10
  ? lock_is_held_type+0x8f/0x100
  ? kasan_save_track+0x14/0x30
  ? __kasan_slab_alloc+0x6e/0x70
  ? kmem_cache_alloc_noprof+0x1f2/0x300
  btrfs_get_raid_extent_offset+0x181/0x820
  ? __pfx_lock_acquire+0x10/0x10
  ? __pfx_btrfs_get_raid_extent_offset+0x10/0x10
  ? down_read+0x194/0x440
  ? __pfx_down_read+0x10/0x10
  ? do_raw_read_unlock+0x44/0x70
  ? _raw_read_unlock+0x23/0x40
  btrfs_map_block+0x5b5/0x2250
  ? __pfx_btrfs_map_block+0x10/0x10
  scrub_submit_initial_read+0x8fe/0x11b0
  ? __pfx_scrub_submit_initial_read+0x10/0x10
  submit_initial_group_read+0x161/0x3a0
  ? lock_release+0x20e/0x710
  ? __pfx_submit_initial_group_read+0x10/0x10
  ? __pfx_lock_release+0x10/0x10
  scrub_simple_mirror.isra.0+0x3eb/0x580
  scrub_stripe+0xe4d/0x1440
  ? lock_release+0x20e/0x710
  ? __pfx_scrub_stripe+0x10/0x10
  ? __pfx_lock_release+0x10/0x10
  ? do_raw_read_unlock+0x44/0x70
  ? _raw_read_unlock+0x23/0x40
  scrub_chunk+0x257/0x4a0
  scrub_enumerate_chunks+0x64c/0xf70
  ? __mutex_unlock_slowpath+0x147/0x5f0
  ? __pfx_scrub_enumerate_chunks+0x10/0x10
  ? bit_wait_timeout+0xb0/0x170
  ? __up_read+0x189/0x700
  ? scrub_workers_get+0x231/0x300
  ? up_write+0x490/0x4f0
  btrfs_scrub_dev+0x52e/0xcd0
  ? create_pending_snapshots+0x230/0x250
  ? __pfx_btrfs_scrub_dev+0x10/0x10
  btrfs_dev_replace_by_ioctl+0xd69/0x1d00
  ? lock_acquire+0x19d/0x4a0
  ? __pfx_btrfs_dev_replace_by_ioctl+0x10/0x10
  ? lock_release+0x20e/0x710
  ? btrfs_ioctl+0xa09/0x74f0
  ? __pfx_lock_release+0x10/0x10
  ? do_raw_spin_lock+0x11e/0x240
  ? __pfx_do_raw_spin_lock+0x10/0x10
  btrfs_ioctl+0xa14/0x74f0
  ? lock_acquire+0x19d/0x4a0
  ? find_held_lock+0x2d/0x110
  ? __pfx_btrfs_ioctl+0x10/0x10
  ? lock_release+0x20e/0x710
  ? do_sigaction+0x3f0/0x860
  ? __pfx_do_vfs_ioctl+0x10/0x10
  ? do_raw_spin_lock+0x11e/0x240
  ? lockdep_hardirqs_on_prepare+0x270/0x3e0
  ? _raw_spin_unlock_irq+0x28/0x50
  ? do_sigaction+0x3f0/0x860
  ? __pfx_do_sigaction+0x10/0x10
  ? __x64_sys_rt_sigaction+0x18e/0x1e0
  ? __pfx___x64_sys_rt_sigaction+0x10/0x10
  ? __x64_sys_close+0x7c/0xd0
  __x64_sys_ioctl+0x137/0x190
  do_syscall_64+0x71/0x140
  entry_SYSCALL_64_after_hwframe+0x76/0x7e
 RIP: 0033:0x7f0bd1114f9b
 Code: Unable to access opcode bytes at 0x7f0bd1114f71.
 RSP: 002b:00007ffc8a8c3130 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f0bd1114f9b
 RDX: 00007ffc8a8c35e0 RSI: 00000000ca289435 RDI: 0000000000000003
 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000007
 R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffc8a8c6c85
 R13: 00000000398e72a0 R14: 0000000000004361 R15: 0000000000000004
  </TASK>

This happens because on RAID stripe-tree filesystems we recurse back into
btrfs_map_block() on scrub to perform the logical to device physical
mapping.

But as the device replace task is already holding the dev_replace::rwsem
we deadlock.

So don't take the dev_replace::rwsem in case our task is the task performing
the device replace.

Suggested-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-11 14:34:12 +01:00
Kent Overstreet
2642084f26 bcachefs: Allow for unknown key types in backpointers fsck
We can't assume that btrees only contain keys of a given type - even if
they only have a single key type listed in the allowed key types for
that btree; this is a forwards compatibility issue.

Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-11 00:37:19 -05:00
Kent Overstreet
0b6ec0c5ac bcachefs: Fix assertion pop in topology repair
Fixes: baefd3f849 ("bcachefs: btree_cache.freeable list fixes")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-11 00:37:18 -05:00
Linus Torvalds
28e43197c4 20 hotfixes, 14 of which are cc:stable.
Three affect DAMON.  Lorenzo's five-patch series to address the
 mmap_region error handling is here also.
 
 Apart from that, various singletons.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzBVmAAKCRDdBJ7gKXxA
 ju42AQD0EEnzW+zFyI+E7x5FwCmLL6ofmzM8Sw9YrKjaeShdZgEAhcyS2Rc/AaJq
 Uty2ZvVMDF2a9p9gqHfKKARBXEbN2w0=
 =n+lO
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "20 hotfixes, 14 of which are cc:stable.

  Three affect DAMON. Lorenzo's five-patch series to address the
  mmap_region error handling is here also.

  Apart from that, various singletons"

* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add entry for Thorsten Blum
  ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
  signal: restore the override_rlimit logic
  fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
  ucounts: fix counter leak in inc_rlimit_get_ucounts()
  selftests: hugetlb_dio: check for initial conditions to skip in the start
  mm: fix docs for the kernel parameter ``thp_anon=``
  mm/damon/core: avoid overflow in damon_feed_loop_next_input()
  mm/damon/core: handle zero schemes apply interval
  mm/damon/core: handle zero {aggregation,ops_update} intervals
  mm/mlock: set the correct prev on failure
  objpool: fix to make percpu slot allocation more robust
  mm/page_alloc: keep track of free highatomic
  mm: resolve faulty mmap_region() error path behaviour
  mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
  mm: refactor map_deny_write_exec()
  mm: unconditionally close VMAs on error
  mm: avoid unsafe VMA hook invocation when error arises on mmap hook
  mm/thp: fix deferred split unqueue naming and locking
  mm/thp: fix deferred split queue not partially_mapped
2024-11-10 09:04:27 -08:00
Linus Torvalds
de2f378f2b nfsd-6.12 fixes:
- Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcvm+cACgkQM2qzM29m
 f5f27xAAwdRav9cbwNfLEIC2cMqS4stZH6DsjJ0VSUa87eJeO0BggqnEGxgK38a6
 jONYrjD5eWdTIuv3P07UF/dAcANT4drr6VgZURukNwnX6fIV0N2pOGH8z3y5Mban
 JByqkGNs2lXegdTn8DLJb8xhU0ZJlLQ01908IKc4E1YecAy8vq2yDeksEIBi3HOB
 y3WViieDdD1rYqh/OGCl/hQ72g+Sq+C6ionl+YsRiE4lfmcWCbnOUSrF5jLbU4Xa
 K8pTMTE1JoLvBwTe4+Dzg48PNVlqq8x386O7Lu+rK8sG+C4LX7I+BX9PlJzWDMBU
 KWN04lNxtxVqNfKN7A3ADw5bIPmcsrc5Lh3FXIomp1deuQ3dFOqsADoVSca4Z3Am
 Sci5MrNqDMG74n6IYkhDMaySpSFs4Gx7Si2G0zgBMRyaw7EdO6b8rx8s5OdPwvFj
 Xv5x+Fya5cXuJ28a8FPrYpKe/ImjNV5A2x/5cJNiOdqY7i7QBcQzu0QfFHzBcMkL
 lL9svfTFzAo9B/jj3VsBr8UudDU14QUibQNtoTpKv9gwWCIJPLsU+KqdCkTBcgrZ
 X5Cv0JSOrw8fOd9GAOksjrFHkc8p6vBIGZC34tmVLLfOSRPyeG93roT5cCMkQAEQ
 XfWvC/epKjGoLRZepA5GG/Y9EjnpiQvUWJvyznWpY5hFYdRfBu0=
 =ltss
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Fix a v6.12-rc regression when exporting ext4 filesystems with NFSD

* tag 'nfsd-6.12-4' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
2024-11-09 13:18:07 -08:00
Linus Torvalds
bceea66799 fix net namespace refcount issue
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcuWXMACgkQiiy9cAdy
 T1Eu8gv+LUAmrvvv8PDoLUT50QZb6aAY2SeulgTdeG8OzImXH5VUSjptRYwP46Dk
 KNLh85A4C39w/guxm3FX2qjeesZZD5DDubSJNATLy75jorq7z+1uTNg8oUZGpvJS
 airmcv/0mcDZqVayCmiT7wPyhUSYa+VTvHrkFpsI20BrlyDybe5HGps77iCOJ5K0
 uTRgM6VNxkKx+Z5NietpDyaUl2A5b6Yx/9J8vMq4ytBfEcSGi+ndpZNvG7kKg8gQ
 3i/ND4O2+eScwvYclVP5mJbF71LW0Z/ljS4mEVH5UuRgLH2Ji35B9xaDFDSixI3x
 EHFwnAX0QeGHIlIuFhRDdtR2gFqREAJOYxkDxfo7PXO5gOXLWZXru9F7v6lWsydN
 varqSseBBucHOLn8NylvgJWwqYs+sIKQycYKsX3ZUnQfejaUwfV2H/ADJzccjFF8
 PUzVQFyOZtUK3fdkoqvULr/zvwninhtLJYLIsPcUgSPCcxGxMApvtkCaJVV3JGfB
 2acZPdMu
 =ZzcZ
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fix from Steve French:
 "Fix net namespace refcount use after free issue"

* tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
  smb: client: Fix use-after-free of network namespace.
2024-11-09 12:58:23 -08:00
Linus Torvalds
1eb714c660 four fixes, also for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcsXSUACgkQiiy9cAdy
 T1FyOgv+Ks1lfl+6D/G89zFl5XOtCm8njsedJu9y3jR7hzophX2osfmodACMVX6B
 0VLu0jzquvUo18VNlL+wF7YFH+Mc6zrevEnjBay9Xa05YyRqK5c7qjpiWEgXPN7/
 ROQfC2slCAFjymhw+9qY+PGZYg3x0fyGdJC/gBNSFnu2ufag367Li+0fTKQTXFwz
 F24S5eI+M9OWNgMnMYoNt+77f0n0JkKbQznq9nTEvUsbTWZFSEfmVczfSY0ltdOH
 RER9zoyTU3zbPuMZqK+Jb7c2247ahsLzDEBAUG0Wn77wSaiWXU5dmVD5bWsDTp25
 5p9uLpkr3irDWwJGkCrkpm2Tva/50IHPEFQ4kllVlm6ffoao/dxBCwFf/MEvJXzI
 OgU+HpXyZdq6NF1hcB4xUlcbHvGCa6pEcYkcM7PwLml+6SKIwEsEGpnJ23kxGR3+
 MGYMCITatRuvZstfEDolNyrO2+gPMd3ODnLhfjfjT47Kh38e7yxrLr4cmxbPAA+s
 EVdm2N08
 =zTn6
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:
 "Four fixes, all also marked for stable:

   - fix two potential use after free issues

   - fix OOM issue with many simultaneous requests

   - fix missing error check in RPC pipe handling"

* tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: check outstanding simultaneous SMB operations
  ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp
  ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create
  ksmbd: Fix the missing xa_store error check
2024-11-08 13:03:29 -10:00
Kent Overstreet
bcf77a05fb bcachefs: Fix hidden btree errors when reading roots
We silence btree errors in btree_node_scan, since it's probing and
errors are expected: add a fake pass so that btree_node_scan is no
longer recovery pass 0, and we don't think we're in btree node scan when
reading btree roots.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-08 14:07:12 -05:00
Kent Overstreet
dc537189b5 bcachefs: Fix validate_bset() repair path
When we truncate a bset (due to it extending past the end of the btree
node), we can't skip the rest of the validation for e.g. the packed
format (if it's the first bset in the node).

Reported-by: syzbot+4d722d3c539d77c7bc82@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-08 14:07:11 -05:00
Linus Torvalds
9183e033ec for-6.12-rc6-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmct+40ACgkQxWXV+ddt
 WDvCtRAAp0rheEu14hpVvWE2//+6u9Gx7Wfjzbj0+o4zBRWdg7BigFxfeb6JsH/E
 2TjuWdcoP/OMV9ghCBQAQxySAPtsxH7skkyNy2UcMk5byBIrNvhw9auP5GXXlrhK
 jSKDD4yfOMb++8LhrLevgTrijNyjLqaKXruw9a1Pmc3gxpdNmnMEySsQaF62o2Sm
 YC3jwi0KpNAhu2qyJ6TnPgd5zf3BTM0JAeuB019IZW4WoeRTOdcPe7S7gqqJwZ+e
 lL0D2/lfIE1lKvLE266Fab4FAQiJV07rozYj25XHiDpqThCxnJVOZCEHasOQ1PRy
 d6j3RmGPqJYAYfQL1L+FH2hsS1BVZfVyCV1V7A/cN+lAffBfnROnf13C3gJ15Nbx
 3lTyjBPQQw2WpfdmeyF3ikbrjZ8AfahChQO+mMnLN7oAWdIwWX5MRB+cwfWTxzA/
 P8upz6HSTpSwy8nXdq264q1KkyCjx0Wv+8iyU7LirN2fCcEchA12HAIaOBeHedgh
 PrGZDqrkZccQQxAvU5H7hQv0hZkGK8qba381oYHO09g72VM6ysuBU7tGrPZrlZYB
 CvYTCwNZ/lqI8ikrcHOyUO1SPR9SaaWej1mWgBJ69ZIfg+ZuMtOMl171DU4S/i2V
 iYgYoN8eCqTQWdaX5kk+3LWmK8fSU7F/KSDtJtT1KxkaSwCacfY=
 =TQzP
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more one-liners that fix some user visible problems:

   - use correct range when clearing qgroup reservations after COW

   - properly reset freed delayed ref list head

   - fix ro/rw subvolume mounts to be backward compatible with old and
     new mount API"

* tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix the length of reserved qgroup to free
  btrfs: reinitialize delayed ref list after deleting it from the list
  btrfs: fix per-subvolume RO/RW flags with new mount API
2024-11-08 07:31:03 -10:00
Linus Torvalds
b5f1b48800 bcachefs fixes for 6.12-rc7
Some trivial syzbot fixes, two more serious btree fixes found by looping
 single_devices.ktest small_nodes:
 
 - Topology error on split after merge, where we accidentaly picked the
   node being deleted for the pivot, resulting in an assertion pop
 
 - New nodes being preallocated were left on the freedlist, unlocked,
   resulting in them sometimes being accidentally freed: this dated from
   pre-cycle detector, when we could leave them locked. This should have
   resulted in more explosions and fireworks, but turned out to be
   surprisingly hard to hit because the preallocated nodes were being
   used right away.
 
   the fix for this is bigger than we'd like - reworking btree list
   handling was a bit invasive - but we've now got more assertions and
   it's well tested.
 
 - Also another mishandled transaction restart fix (in
   btree_node_prefetch) - we're almost done with those.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmctRYoACgkQE6szbY3K
 bnbkxRAArtqV9/qsKbSYAaa/+GaL7YdapuYbi/pmC9X96F9qbTdEJzW5rs66iiGE
 zbkfFqo2I85nacSTk3b12E3QUXj+CEmSIWOPQtamYw/0AkmVsKepgGsXLazZ0rYi
 X8UDVc6fuFkoO1aC/9V2NJEFG9QXIj8ru0m2kyUE9ZM6rgskugVN/ec9ipNQNZhY
 4L8U7Z6Y9AX4vs/BeV3i6cLrTaMroUFYSM0hJalBJ24KZsZ1bWflC39C0dXSvy/O
 gCmBCobZTT5aDEQai1kdyFr4GZZUgCJg4YEUDfyOdpPmhbcP4iwX/cJqJHXqxXVt
 nMyLz5nLs0nYO791UlLHZuUUUe99nl+tC09b034n20peLnQwWW/obTrhn86SDDka
 2eQv1Rk5C5i8r5b0k8UYjy5ogfiVlC/X1OwmLKkarKnC/wd0eFQI71Qq9s8KpXbo
 VVASENYFV3hrIV8ZcxiqiJ18g6o7++jtTAmIfRljQrO6B8tU5g5uWCTZli+wciii
 qWnt1k7P92er8lBzUnQGh9CEwLVbe9ZyBJv+fYVwTOxPES/TbJS7n5fb+1f1rF9w
 j5llXVUiaLucXoCpBjEDflvhBTRQHEkKk3gJgy86NKgRjEjPhQT8D2dksT4kgyHb
 RqgOSUN+oVqi/i+7RKf9x/jG4id0uvMH5xT7qiXTUiQXtUD+J9g=
 =cn3u
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Some trivial syzbot fixes, two more serious btree fixes found by
  looping single_devices.ktest small_nodes:

   - Topology error on split after merge, where we accidentaly picked
     the node being deleted for the pivot, resulting in an assertion pop

   - New nodes being preallocated were left on the freedlist, unlocked,
     resulting in them sometimes being accidentally freed: this dated
     from pre-cycle detector, when we could leave them locked. This
     should have resulted in more explosions and fireworks, but turned
     out to be surprisingly hard to hit because the preallocated nodes
     were being used right away.

     The fix for this is bigger than we'd like - reworking btree list
     handling was a bit invasive - but we've now got more assertions and
     it's well tested.

   - Also another mishandled transaction restart fix (in
     btree_node_prefetch) - we're almost done with those"

* tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix UAF in __promote_alloc() error path
  bcachefs: Change OPT_STR max to be 1 less than the size of choices array
  bcachefs: btree_cache.freeable list fixes
  bcachefs: check the invalid parameter for perf test
  bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket
  bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery
  bcachefs: Fix topology errors on split after merge
  bcachefs: Ancient versions with bad bkey_formats are no longer supported
  bcachefs: Fix error handling in bch2_btree_node_prefetch()
  bcachefs: Fix null ptr deref in bucket_gen_get()
2024-11-08 07:27:14 -10:00
Kent Overstreet
f8f1dde686 bcachefs: Fix missing validation for bch_backpointer.level
This fixes an assertion pop where we try to navigate to the target of
the backpointer, and the path level isn't what we expect.

Reported-by: syzbot+b17df21b4d370f2dc330@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-08 00:05:53 -05:00
Kent Overstreet
27a036a0c3 bcachefs: Fix bch_member.btree_bitmap_shift validation
Needs to match the assert later when we resize...

Reported-by: syzbot+e8eff054face85d7ea41@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 23:31:11 -05:00
Kent Overstreet
ca43f73cd1 bcachefs: bch2_btree_write_buffer_flush_going_ro()
The write buffer needs to be specifically flushed when going RO: keys in
the journal that haven't yet been moved to the write buffer don't have a
journal pin yet.

This fixes numerous syzbot bugs, all with symptoms of still doing writes
after we've got RO.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 23:31:11 -05:00
Andrew Kanner
0b63c0e01f ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
Syzkaller is able to provoke null-ptr-dereference in ocfs2_xa_remove():

[   57.319872] (a.out,1161,7):ocfs2_xa_remove:2028 ERROR: status = -12
[   57.320420] (a.out,1161,7):ocfs2_xa_cleanup_value_truncate:1999 ERROR: Partial truncate while removing xattr overlay.upper.  Leaking 1 clusters and removing the entry
[   57.321727] BUG: kernel NULL pointer dereference, address: 0000000000000004
[...]
[   57.325727] RIP: 0010:ocfs2_xa_block_wipe_namevalue+0x2a/0xc0
[...]
[   57.331328] Call Trace:
[   57.331477]  <TASK>
[...]
[   57.333511]  ? do_user_addr_fault+0x3e5/0x740
[   57.333778]  ? exc_page_fault+0x70/0x170
[   57.334016]  ? asm_exc_page_fault+0x2b/0x30
[   57.334263]  ? __pfx_ocfs2_xa_block_wipe_namevalue+0x10/0x10
[   57.334596]  ? ocfs2_xa_block_wipe_namevalue+0x2a/0xc0
[   57.334913]  ocfs2_xa_remove_entry+0x23/0xc0
[   57.335164]  ocfs2_xa_set+0x704/0xcf0
[   57.335381]  ? _raw_spin_unlock+0x1a/0x40
[   57.335620]  ? ocfs2_inode_cache_unlock+0x16/0x20
[   57.335915]  ? trace_preempt_on+0x1e/0x70
[   57.336153]  ? start_this_handle+0x16c/0x500
[   57.336410]  ? preempt_count_sub+0x50/0x80
[   57.336656]  ? _raw_read_unlock+0x20/0x40
[   57.336906]  ? start_this_handle+0x16c/0x500
[   57.337162]  ocfs2_xattr_block_set+0xa6/0x1e0
[   57.337424]  __ocfs2_xattr_set_handle+0x1fd/0x5d0
[   57.337706]  ? ocfs2_start_trans+0x13d/0x290
[   57.337971]  ocfs2_xattr_set+0xb13/0xfb0
[   57.338207]  ? dput+0x46/0x1c0
[   57.338393]  ocfs2_xattr_trusted_set+0x28/0x30
[   57.338665]  ? ocfs2_xattr_trusted_set+0x28/0x30
[   57.338948]  __vfs_removexattr+0x92/0xc0
[   57.339182]  __vfs_removexattr_locked+0xd5/0x190
[   57.339456]  ? preempt_count_sub+0x50/0x80
[   57.339705]  vfs_removexattr+0x5f/0x100
[...]

Reproducer uses faultinject facility to fail ocfs2_xa_remove() ->
ocfs2_xa_value_truncate() with -ENOMEM.

In this case the comment mentions that we can return 0 if
ocfs2_xa_cleanup_value_truncate() is going to wipe the entry
anyway. But the following 'rc' check is wrong and execution flow do
'ocfs2_xa_remove_entry(loc);' twice:
* 1st: in ocfs2_xa_cleanup_value_truncate();
* 2nd: returning back to ocfs2_xa_remove() instead of going to 'out'.

Fix this by skipping the 2nd removal of the same entry and making
syzkaller repro happy.

Link: https://lkml.kernel.org/r/20241103193845.2940988-1-andrew.kanner@gmail.com
Fixes: 399ff3a748 ("ocfs2: Handle errors while setting external xattr values.")
Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com>
Reported-by: syzbot+386ce9e60fa1b18aac5b@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/all/671e13ab.050a0220.2b8c0f.01d0.GAE@google.com/T/
Tested-by: syzbot+386ce9e60fa1b18aac5b@syzkaller.appspotmail.com
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:14:59 -08:00
Qi Xi
b8ee299855 fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
When build with !CONFIG_MMU, the variable 'vmcore_mmap_ops'
is defined but not used:

>> fs/proc/vmcore.c:458:42: warning: unused variable 'vmcore_mmap_ops'
     458 | static const struct vm_operations_struct vmcore_mmap_ops = {

Fix this by only defining it when CONFIG_MMU is enabled.

Link: https://lkml.kernel.org/r/20241101034803.9298-1-xiqi2@huawei.com
Fixes: 9cb218131d ("vmcore: introduce remap_oldmem_pfn_range()")
Signed-off-by: Qi Xi <xiqi2@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/lkml/202410301936.GcE8yUos-lkp@intel.com/
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-11-07 14:14:59 -08:00
Kent Overstreet
8440da9331 bcachefs: Fix UAF in __promote_alloc() error path
If we error in data_update_init() after adding to the rhashtable of
outstanding promotes, kfree_rcu() is required.

Reported-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:21 -05:00
Piotr Zalewski
f9f0a5390d bcachefs: Change OPT_STR max to be 1 less than the size of choices array
Change OPT_STR max value to be 1 less than the "ARRAY_SIZE" of "_choices"
array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text.

The "_choices" array is a null-terminated array, so computing the maximum
using "ARRAY_SIZE" without subtracting 1 yields an incorrect result. Since
bch2_opt_validate don't subtract 1, as bch2_opt_to_text does, values
bigger than the actual maximum would pass through option validation.

Reported-by: syzbot+bee87a0c3291c06aa8c6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6
Fixes: 63c4b25453 ("bcachefs: Better superblock opt validation")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:21 -05:00
Kent Overstreet
baefd3f849 bcachefs: btree_cache.freeable list fixes
When allocating new btree nodes, we were leaving them on the freeable
list - unlocked - allowing them to be reclaimed: ouch.

Additionally, bch2_btree_node_free_never_used() ->
bch2_btree_node_hash_remove was putting it on the freelist, while
bch2_btree_node_free_never_used() was putting it back on the btree
update reserve list - ouch.

Originally, the code was written to always keep btree nodes on a list -
live or freeable - and this worked when new nodes were kept locked.

But now with the cycle detector, we can't keep nodes locked that aren't
tracked by the cycle detector; and this is fine as long as they're not
reachable.

We also have better and more robust leak detection now, with memory
allocation profiling, so the original justification no longer applies.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:21 -05:00
Hongbo Li
9bb33852f5 bcachefs: check the invalid parameter for perf test
The perf_test does not check the number of iterations and threads
when it is zero. If nr_thread is 0, the perf test will keep
waiting for wakekup. If iteration is 0, it will cause exception
of division by zero. This can be reproduced by:
  echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test
or
  echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test

Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:21 -05:00
Pei Xiao
93d53f1caf bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket
bio_kmalloc may return NULL, will cause NULL pointer dereference.
Add check NULL return for bio_kmalloc in journal_read_bucket.

Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Fixes: ac10a9611d ("bcachefs: Some fixes for building in userspace")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:21 -05:00
Kent Overstreet
ef4f6c322b bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery
If BCH_FS_may_go_rw is not yet set, it indicates to the transaction
commit path that updates should be done via the list of journal replay
keys.

This must be set before multithreaded use commences.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:21 -05:00
Kent Overstreet
cec136d348 bcachefs: Fix topology errors on split after merge
If a btree split picks a pivot that's being deleted by a btree node
merge, we're going to have problems.

Fix this by checking if the pivot is being deleted, the same as we check
for deletions in journal replay keys.

Found by single_devic.ktest small_nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:21 -05:00
Kent Overstreet
d335bb3fd3 bcachefs: Ancient versions with bad bkey_formats are no longer supported
Syzbot found an assertion pop, by generating an ancient filesystem
version with an invalid bkey_format (with fields that can overflow) as
well as packed keys that aren't representable unpacked.

This breaks key comparisons in all sorts of painful ways.

Filesystems have been automatically rewriting nodes with such invalid
formats for years; we can safely drop support for them.

Reported-by: syzbot+8a0109511de9d4b61217@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:20 -05:00
Kent Overstreet
72acab3a7c bcachefs: Fix error handling in bch2_btree_node_prefetch()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:20 -05:00
Kent Overstreet
fd00045f38 bcachefs: Fix null ptr deref in bucket_gen_get()
bucket_gen() checks if we're lookup up a valid bucket and returns NULL
otherwise, but bucket_gen_get() was failing to check; other callers were
correct.

Also do a bit of cleanup on callers.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-11-07 16:48:17 -05:00
David Wang
84b9749a3a proc/softirqs: replace seq_printf with seq_put_decimal_ull_width
seq_printf is costy, on a system with n CPUs, reading /proc/softirqs
would yield 10*n decimal values, and the extra cost parsing format string
grows linearly with number of cpus. Replace seq_printf with
seq_put_decimal_ull_width have significant performance improvement.
On an 8CPUs system, reading /proc/softirqs show ~40% performance
gain with this patch.

Signed-off-by: David Wang <00107082@163.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-11-07 07:40:14 -10:00
Chuck Lever
bb1fb40f8b NFSD: Fix READDIR on NFSv3 mounts of ext4 exports
I noticed that recently, simple operations like "make" started
failing on NFSv3 mounts of ext4 exports. Network capture shows that
READDIRPLUS operated correctly but READDIR failed with
NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is
passed a non-zero starting directory cookie.

I bisected to commit c689bdd3bf ("nfsd: further centralize
protocol version checks.").

Turns out that nfsd3_proc_readdir() does not call fh_verify() before
it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is
not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when
the directory is opened.

For ext4, this causes the wrong "max file size" value to be used
when sanity checking the incoming directory cookie (which is a seek
offset value).

The fhp->fh_64bit_cookies boolean is /always/ properly initialized
after nfsd_open() returns. There doesn't seem to be a reason for the
generic NFSD open helper to handle the f_mode fix-up for
directories, so just move that to the one caller that tries to open
an S_IFDIR with NFSD_MAY_64BIT_COOKIE.

Suggested-by: NeilBrown <neilb@suse.de>
Fixes: c689bdd3bf ("nfsd: further centralize protocol version checks.")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-11-07 09:11:37 -05:00
Nam Cao
28e70352b8 fs/aio: Switch to use hrtimer_setup_sleeper_on_stack()
hrtimer_setup_sleeper_on_stack() replaces hrtimer_init_sleeper_on_stack()
to keep the naming convention consistent.

Convert the usage site over to it. The conversion was done with Coccinelle.

Signed-off-by: Nam Cao <namcao@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/5f10c259fa43ba2fe774de5b2cedc22f5e9cfd2d.1730386209.git.namcao@linutronix.de
2024-11-07 02:47:06 +01:00
Thomas Gleixner
2634303f87 alarmtimers: Remove return value from alarm functions
Now that the SIG_IGN problem is solved in the core code, the alarmtimer
callbacks do not require a return value anymore.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064214.318837272@linutronix.de
2024-11-07 02:14:46 +01:00
Thomas Gleixner
6017a158be posix-timers: Embed sigqueue in struct k_itimer
To cure the SIG_IGN handling for posix interval timers, the preallocated
sigqueue needs to be embedded into struct k_itimer to prevent life time
races of all sorts.

Now that the prerequisites are in place, embed the sigqueue into struct
k_itimer and fixup the relevant usage sites.

Aside of preparing for proper SIG_IGN handling, this spares an extra
allocation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.719695194@linutronix.de
2024-11-07 02:14:44 +01:00
Haisu Wang
2b084d8205 btrfs: fix the length of reserved qgroup to free
The dealloc flag may be cleared and the extent won't reach the disk in
cow_file_range when errors path. The reserved qgroup space is freed in
commit 30479f31d4 ("btrfs: fix qgroup reserve leaks in
cow_file_range"). However, the length of untouched region to free needs
to be adjusted with the correct remaining region size.

Fixes: 30479f31d4 ("btrfs: fix qgroup reserve leaks in cow_file_range")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07 02:08:29 +01:00
Filipe Manana
c9a75ec45f btrfs: reinitialize delayed ref list after deleting it from the list
At insert_delayed_ref() if we need to update the action of an existing
ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's
ref_add_list using list_del(), which leaves the ref's add_list member
not reinitialized, as list_del() sets the next and prev members of the
list to LIST_POISON1 and LIST_POISON2, respectively.

If later we end up calling drop_delayed_ref() against the ref, which can
happen during merging or when destroying delayed refs due to a transaction
abort, we can trigger a crash since at drop_delayed_ref() we call
list_empty() against the ref's add_list, which returns false since
the list was not reinitialized after the list_del() and as a consequence
we call list_del() again at drop_delayed_ref(). This results in an
invalid list access since the next and prev members are set to poison
pointers, resulting in a splat if CONFIG_LIST_HARDENED and
CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences
otherwise.

So fix this by deleting from the list with list_del_init() instead.

Fixes: 1d57ee9416 ("btrfs: improve delayed refs iterations")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07 02:07:53 +01:00
Qu Wenruo
cda7163d4e btrfs: fix per-subvolume RO/RW flags with new mount API
[BUG]
With util-linux 2.40.2, the 'mount' utility is already utilizing the new
mount API. e.g:

  # strace  mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/
  ...
  fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
  fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
  fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0
  fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
  fsmount(3, FSMOUNT_CLOEXEC, 0)          = 4
  mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
  move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0

But this leads to a new problem, that per-subvolume RO/RW mount no
longer works, if the initial mount is RO:

  # mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test
  # mount -o rw,subvol=subv2 /dev/test/scratch1  /mnt/scratch
  # mount | grep mnt
  /dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1)
  /dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2)
  # touch /mnt/scratch/foobar
  touch: cannot touch '/mnt/scratch/foobar': Read-only file system

This is a common use cases on distros.

[CAUSE]
We have a workaround for remount to handle the RO->RW change, but if the
mount is using the new mount API, we do not do that, and rely on the
mount tool NOT to set the ro flag.

But that's not how the mount tool is doing for the new API:

  fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
  fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
  fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0       <<<< Setting RO flag for super block
  fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
  fsmount(3, FSMOUNT_CLOEXEC, 0)          = 4
  mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
  move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0

This means we will set the super block RO at the first mount.

Later RW mount will not try to reconfigure the fs to RW because the
mount tool is already using the new API.

This totally breaks the per-subvolume RO/RW mount behavior.

[FIX]
Do not skip the reconfiguration even if using the new API.  The old
comments are just expecting any mount tool to properly skip the RO flag
set even if we specify "ro", which is not the reality.

Update the comments regarding the backward compatibility on the kernel
level so it works with old and new mount utilities.

CC: stable@vger.kernel.org # 6.8+
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-11-07 02:07:45 +01:00
Linus Torvalds
ff7afaeca1 More NFS Client Bugfixes for Linux 6.12-rc
Stable Fixes:
 * Fix KMSAN warning in decode_getfattr_attrs()
 
 Other Bugfixes:
 * Handle -ENOTCONN in xs_tcp_setup_socked()
 * NFSv3: only use NFS timeout for MOUNT when protocols are compatible
 * Fix attribute delegation behavior on exclusive create and a/mtime changes
 * Fix localio to cope with racing nfs_local_probe()
 * Avoid i_lock contention in fs_clear_invalid_mapping()
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmcr1HUACgkQ18tUv7Cl
 QOtdYBAA0YohWDHflcHPbltJu0UyCyDDtowvpVacSDJwZwVEXnLQRTTqrdUnWVxx
 Bc2Ae8tGsfcwo10yZ6LUIPjcyEqLQeYvKoKv2Awf0j7eubjRYZrQVypIKtmy8aC2
 H5ETCyrbIubE06jX8EPO8LFxQ+T6nGD7kC8qJZL8z/aNVXGA2nRRCi7AzdE4o6Ht
 0t6fC+W5vxJ4hQHYKb59nGvREMwpKSLg2U4wo1lyFvkDxEJ06DobGOKEtD333cI8
 Mou/1UlSZ6RzgfwJNIPMMpCepIp2spaDeet0XVN+zqzxg55Jmk7LqpxP5pswTjLb
 WsxErV9ZRXtwutCCf+IDoMCv/YS4g4ZG7CLKXQ4felKJVYIuiS4z0n659xRqLyyi
 nW71vrRUdOBE3rCXUW6crZYwX/fHDvl6bsq9/h7cy2ZPnbGkVvXx+LIm0dJRenfb
 MaxVM3CyrMnzL3UUk/caK/rVCOHrDD5q/dAtSNfizMWnqoX+gXby3ho6Zwn0Wj89
 NiUZJIRI/s4V1WzMw4g+Daz7LUUwGblODTtphH2nnKRDfTiYXeT/r/waU6zUOVcS
 7Jd285DF/tkQp2SJ3nvsM/ni7TD2UuG2BsKA3Urlht9i32lwyENeS3nNcx6aHo3i
 blNpD+9mp3vZfWWZNVvLM/JldcIqEvd30+P6GWwS/Td8Zz4PYIM=
 =9mwu
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client fixes from Anna Schumaker:
 "These are mostly fixes that came up during the nfs bakeathon the other
  week.

  Stable Fixes:
   - Fix KMSAN warning in decode_getfattr_attrs()

  Other Bugfixes:
   - Handle -ENOTCONN in xs_tcp_setup_socked()
   - NFSv3: only use NFS timeout for MOUNT when protocols are compatible
   - Fix attribute delegation behavior on exclusive create and a/mtime
     changes
   - Fix localio to cope with racing nfs_local_probe()
   - Avoid i_lock contention in fs_clear_invalid_mapping()"

* tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  nfs: avoid i_lock contention in nfs_clear_invalid_mapping
  nfs_common: fix localio to cope with racing nfs_local_probe()
  NFS: Further fixes to attribute delegation a/mtime changes
  NFS: Fix attribute delegation behaviour on exclusive create
  nfs: Fix KMSAN warning in decode_getfattr_attrs()
  NFSv3: only use NFS timeout for MOUNT when protocols are compatible
  sunrpc: handle -ENOTCONN in xs_tcp_setup_socket()
2024-11-06 13:09:22 -10:00
Linus Torvalds
7758b20611 Fix tracefs mount options:
The commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
 broke the gid setting when set by fstab or other mount utility.
 It is ignored when it is set. Fix the code so that it recognises the
 option again and will honor the settings on mount at boot up.
 
 Update the internal documentation and create a selftest to make sure
 it doesn't break again in the future.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZyuidRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsgQAQDuV0x4RLpCrrowDS/ITQw/eb/WjhR7
 lhkXVROLN6RK6wD+JWmbaCP82q2S4A2Vx0Rjc72gUMmTzDb1HQflhQiLhwU=
 =0dZF
 -----END PGP SIGNATURE-----

Merge tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracefs fixes from Steven Rostedt:
 "Fix tracefs mount options.

  Commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
  broke the gid setting when set by fstab or other mount utility. It is
  ignored when it is set. Fix the code so that it recognises the option
  again and will honor the settings on mount at boot up.

  Update the internal documentation and create a selftest to make sure
  it doesn't break again in the future"

* tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/selftests: Add tracefs mount options test
  tracing: Document tracefs gid mount option
  tracing: Fix tracefs mount options
2024-11-06 08:08:39 -10:00
Colin Ian King
46a7fcec09 xattr: remove redundant check on variable err
Curretly in function generic_listxattr the for_each_xattr_handler loop
checks err and will return out of the function if err is non-zero.
It's impossible for err to be non-zero at the end of the function where
err is checked again for a non-zero value. The final non-zero check is
therefore redundant and can be removed. Also move the declaration of
err into the loop.

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 13:00:01 -05:00
Christian Göttsche
6140be90ec fs/xattr: add *at family syscalls
Add the four syscalls setxattrat(), getxattrat(), listxattrat() and
removexattrat().  Those can be used to operate on extended attributes,
especially security related ones, either relative to a pinned directory
or on a file descriptor without read access, avoiding a
/proc/<pid>/fd/<fd> detour, requiring a mounted procfs.

One use case will be setfiles(8) setting SELinux file contexts
("security.selinux") without race conditions and without a file
descriptor opened with read access requiring SELinux read permission.

Use the do_{name}at() pattern from fs/open.c.

Pass the value of the extended attribute, its length, and for
setxattrat(2) the command (XATTR_CREATE or XATTR_REPLACE) via an added
struct xattr_args to not exceed six syscall arguments and not
merging the AT_* and XATTR_* flags.

[AV: fixes by Christian Brauner folded in, the entire thing rebased on
top of {filename,file}_...xattr() primitives, treatment of empty
pathnames regularized.  As the result, AT_EMPTY_PATH+NULL handling
is cheap, so f...(2) can use it]

Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://lore.kernel.org/r/20240426162042.191916-1-cgoettsche@seltendoof.de
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
CC: x86@kernel.org
CC: linux-alpha@vger.kernel.org
CC: linux-kernel@vger.kernel.org
CC: linux-arm-kernel@lists.infradead.org
CC: linux-ia64@vger.kernel.org
CC: linux-m68k@lists.linux-m68k.org
CC: linux-mips@vger.kernel.org
CC: linux-parisc@vger.kernel.org
CC: linuxppc-dev@lists.ozlabs.org
CC: linux-s390@vger.kernel.org
CC: linux-sh@vger.kernel.org
CC: sparclinux@vger.kernel.org
CC: linux-fsdevel@vger.kernel.org
CC: audit@vger.kernel.org
CC: linux-arch@vger.kernel.org
CC: linux-api@vger.kernel.org
CC: linux-security-module@vger.kernel.org
CC: selinux@vger.kernel.org
[brauner: slight tweaks]
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 12:59:44 -05:00
Al Viro
22a4d1954c new helpers: file_removexattr(), filename_removexattr()
switch path_removexattrat() and fremovexattr(2) to those

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 12:59:39 -05:00
Al Viro
60ad149cf3 new helpers: file_listxattr(), filename_listxattr()
switch path_listxattr() and flistxattr(2) to those

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 12:59:39 -05:00
Al Viro
0158005aaa replace do_getxattr() with saner helpers.
similar to do_setxattr() in the previous commit...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 12:59:39 -05:00
Al Viro
66d7ac6bdb replace do_setxattr() with saner helpers.
io_uring setxattr logics duplicates stuff from fs/xattr.c; provide
saner helpers (filename_setxattr() and file_setxattr() resp.) and
use them.

NB: putname(ERR_PTR()) is a no-op

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 12:59:39 -05:00
Al Viro
a10c4c5e01 new helper: import_xattr_name()
common logics for marshalling xattr names.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 12:59:39 -05:00
Christian Göttsche
537c76629d fs: rename struct xattr_ctx to kernel_xattr_ctx
Rename the struct xattr_ctx to increase distinction with the about to be
added user API struct xattr_args.

No functional change.

Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://lore.kernel.org/r/20240426162042.191916-2-cgoettsche@seltendoof.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-06 12:59:21 -05:00
Thorsten Blum
fdfa4c02e6
freevxfs: Replace one-element array with flexible array member
Replace the deprecated one-element array with a modern flexible array
member in the struct vxfs_dirblk.

Link: https://github.com/KSPP/linux/issues/79
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20241103121707.102838-3-thorsten.blum@linux.dev
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-06 10:42:06 +01:00
Ritesh Harjani (IBM)
299537e9df ext4: Do not fallback to buffered-io for DIO atomic write
atomic writes is currently only supported for single fsblock and only
for direct-io. We should not return -ENOTBLK for atomic writes since we
want the atomic write request to either complete fully or fail
otherwise. Hence, we should never fallback to buffered-io in case of
DIO atomic write requests.
Let's also catch if this ever happens by adding some WARN_ON_ONCE before
buffered-io handling for direct-io atomic writes. More details of the
discussion [1].

While at it let's add an inline helper ext4_want_directio_fallback() which
simplifies the logic checks and inherently fixes condition on when to return
-ENOTBLK which otherwise was always returning true for any write or directio in
ext4_iomap_end(). It was ok since ext4 only supports direct-io via iomap.

[1]: https://lore.kernel.org/linux-xfs/cover.1729825985.git.ritesh.list@gmail.com/T/#m9dbecc11bed713ed0d7a486432c56b105b555f04
Suggested-by: Darrick J. Wong <djwong@kernel.org> # inline helper
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05 16:20:40 -08:00
Ritesh Harjani (IBM)
b7987a7d69 ext4: Support setting FMODE_CAN_ATOMIC_WRITE
FS needs to add the fmode capability in order to support atomic writes
during file open (refer kiocb_set_rw_flags()). Set this capability on
a regular file if ext4 can do atomic write.

Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05 16:20:40 -08:00
Ritesh Harjani (IBM)
43c696f9d0 ext4: Check for atomic writes support in write iter
Let's validate the given constraints for atomic write request.
Otherwise it will fail with -EINVAL. Currently atomic write is only
supported on DIO, so for buffered-io it will return -EOPNOTSUPP.

Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05 16:20:40 -08:00
Ritesh Harjani (IBM)
6dfc1c1d59 ext4: Add statx support for atomic writes
This patch adds base support for atomic writes via statx getattr.
On bs < ps systems, we can create FS with say bs of 16k. That means
both atomic write min and max unit can be set to 16k for supporting
atomic writes.

Co-developed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
2024-11-05 16:20:40 -08:00
Matthew Wilcox (Oracle)
9b4bb82244
ecryptfs: Pass the folio index to crypt_extent()
We need to pass pages, not folios, to crypt_extent() as we may be
working with a plain page rather than a folio.  But we need to know the
index in the file, so pass it in from the caller.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-11-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:20:00 +01:00
Matthew Wilcox (Oracle)
bf64913dfe
ecryptfs: Convert lower_offset_for_page() to take a folio
Both callers have a folio, so pass it in and use folio->index instead of
page->index.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-10-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:20:00 +01:00
Matthew Wilcox (Oracle)
c15b81461d
ecryptfs: Convert ecryptfs_decrypt_page() to take a folio
Both callers have a folio, so pass it in and use it throughout.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-9-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:20:00 +01:00
Matthew Wilcox (Oracle)
6b9c0e8137
ecryptfs: Convert ecryptfs_encrypt_page() to take a folio
All three callers have a folio, so pass it in and use it throughout.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-8-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:19:59 +01:00
Matthew Wilcox (Oracle)
de5ced2721
ecryptfs: Convert ecryptfs_write_lower_page_segment() to take a folio
Both callers now have a folio, so pass it in and use it throughout.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-7-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:19:59 +01:00
Matthew Wilcox (Oracle)
4d3727fd06
ecryptfs: Convert ecryptfs_write() to use a folio
Remove ecryptfs_get_locked_page() and call read_mapping_folio()
directly.  Use the folio throught this function.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-6-willy@infradead.org
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:19:59 +01:00
Matthew Wilcox (Oracle)
890d477a0f
ecryptfs: Convert ecryptfs_read_lower_page_segment() to take a folio
All callers have a folio, so pass it in and use it directly.  This will
not work for large folios, but I doubt anybody wants to use large folios
with ecryptfs.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-5-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:19:59 +01:00
Matthew Wilcox (Oracle)
497eb79c31
ecryptfs: Convert ecryptfs_copy_up_encrypted_with_header() to take a folio
Both callers have a folio, so pass it in and use it throughout.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-4-willy@infradead.org
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:19:58 +01:00
Matthew Wilcox (Oracle)
064fe6b475
ecryptfs: Use a folio throughout ecryptfs_read_folio()
Remove the conversion to a struct page.  Removes a few hidden calls to
compound_head().  Use 'err' instead of 'rc' for clarity.

Also remove the unnecessary call to ClearPageUptodate(); the uptodate
flag is already clear if this function is being called.  That lets us
switch to folio_end_read() which does one atomic flag operation instead
of two.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-3-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:19:58 +01:00
Matthew Wilcox (Oracle)
807a11dab9
ecryptfs: Convert ecryptfs_writepage() to ecryptfs_writepages()
By adding a ->migrate_folio implementation, theree is no need to keep
the ->writepage implementation.  The new writepages removes the
unnecessary call to SetPageUptodate(); the folio should already be
uptodate at this point.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Link: https://lore.kernel.org/r/20241025190822.1319162-2-willy@infradead.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-05 17:19:58 +01:00
Namjae Jeon
0a77d947f5 ksmbd: check outstanding simultaneous SMB operations
If Client send simultaneous SMB operations to ksmbd, It exhausts too much
memory through the "ksmbd_work_cache”. It will cause OOM issue.
ksmbd has a credit mechanism but it can't handle this problem. This patch
add the check if it exceeds max credits to prevent this problem by assuming
that one smb request consumes at least one credit.

Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-05 09:26:38 +09:00
Namjae Jeon
b8fc56fbca ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp
ksmbd_user_session_put should be called under smb3_preauth_hash_rsp().
It will avoid freeing session before calling smb3_preauth_hash_rsp().

Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-05 09:26:37 +09:00
Namjae Jeon
0a77715db2 ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create
There is a race condition between ksmbd_smb2_session_create and
ksmbd_expire_session. This patch add missing sessions_table_lock
while adding/deleting session from global session table.

Cc: stable@vger.kernel.org # v5.15+
Reported-by: Norbert Szetei <norbert@doyensec.com>
Tested-by: Norbert Szetei <norbert@doyensec.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-05 09:26:35 +09:00
John Garry
3af5298ce9 xfs: Support setting FMODE_CAN_ATOMIC_WRITE
Set FMODE_CAN_ATOMIC_WRITE flag if we can atomic write for that inode.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Tested-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>	 #On ppc64
2024-11-04 16:22:11 -08:00
John Garry
f096207d32 xfs: Validate atomic writes
Validate that an atomic write adheres to length/offset rules. Currently
we can only write a single FS block.

For an IOCB with IOCB_ATOMIC set to get as far as xfs_file_write_iter(),
FMODE_CAN_ATOMIC_WRITE will need to be set for the file; for this,
ATOMICWRITES flags would also need to be set for the inode.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-04 16:22:10 -08:00
John Garry
6432c6e723 xfs: Support atomic write for statx
Support providing info on atomic write unit min and max for an inode.

For simplicity, currently we limit the min at the FS block size. As for
max, we limit also at FS block size, as there is no current method to
guarantee extent alignment or granularity for regular files.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-04 16:22:10 -08:00
John Garry
9e0933c21c fs: iomap: Atomic write support
Support direct I/O atomic writes by producing a single bio with REQ_ATOMIC
flag set.

Initially FSes (XFS) should only support writing a single FS block
atomically.

As with any atomic write, we should produce a single bio which covers the
complete write length.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
[djwong: clarify a couple of things in the docs]
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-04 16:14:02 -08:00
John Garry
a570bad16b fs: Export generic_atomic_write_valid()
The XFS code will need this.

Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2024-11-04 16:14:02 -08:00
Mike Snitzer
867da60d46 nfs: avoid i_lock contention in nfs_clear_invalid_mapping
Multi-threaded buffered reads to the same file exposed significant
inode spinlock contention in nfs_clear_invalid_mapping().

Eliminate this spinlock contention by checking flags without locking,
instead using smp_rmb and smp_load_acquire accordingly, but then take
spinlock and double-check these inode flags.

Also refactor nfs_set_cache_invalid() slightly to use
smp_store_release() to pair with nfs_clear_invalid_mapping()'s
smp_load_acquire().

While this fix is beneficial for all multi-threaded buffered reads
issued by an NFS client, this issue was identified in the context of
surprisingly low LOCALIO performance with 4K multi-threaded buffered
read IO.  This fix dramatically speeds up LOCALIO performance:

before: read: IOPS=1583k, BW=6182MiB/s (6482MB/s)(121GiB/20002msec)
after:  read: IOPS=3046k, BW=11.6GiB/s (12.5GB/s)(232GiB/20001msec)

Fixes: 17dfeb9113 ("NFS: Fix races in nfs_revalidate_mapping")
Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04 10:24:19 -05:00
Mike Snitzer
bc29408695 nfs_common: fix localio to cope with racing nfs_local_probe()
Fix the possibility of racing nfs_local_probe() resulting in:
  list_add double add: new=ffff8b99707f9f58, prev=ffff8b99707f9f58, next=ffffffffc0f30000.
  ------------[ cut here ]------------
  kernel BUG at lib/list_debug.c:35!

Add nfs_uuid_init() to properly initialize all nfs_uuid_t members
(particularly its list_head).

Switch to returning bool from nfs_uuid_begin(), returns false if
nfs_uuid_t is already in-use (its list_head is on a list). Update
nfs_local_probe() to return early if the nfs_client's cl_uuid
(nfs_uuid_t) is in-use.

Also, switch nfs_uuid_begin() from using list_add_tail_rcu() to
list_add_tail() -- rculist was used in an earlier version of the
localio code that had a lockless nfs_uuid_lookup interface.

Signed-off-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04 10:24:19 -05:00
Trond Myklebust
40f45ab381 NFS: Further fixes to attribute delegation a/mtime changes
When asked to set both an atime and an mtime to the current system time,
ensure that the setting is atomic by calling inode_update_timestamps()
only once with the appropriate flags.

Fixes: e12912d941 ("NFSv4: Add support for delegated atime and mtime attributes")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04 10:24:19 -05:00
Trond Myklebust
d054c5eb28 NFS: Fix attribute delegation behaviour on exclusive create
When the client does an exclusive create and the server decides to store
the verifier in the timestamps, a SETATTR is subsequently sent to fix up
those timestamps. When that is the case, suppress the exceptions for
attribute delegations in nfs4_bitmap_copy_adjust().

Fixes: 32215c1f89 ("NFSv4: Don't request atime/mtime/size if they are delegated to us")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04 10:24:19 -05:00
Roberto Sassu
dc270d7159 nfs: Fix KMSAN warning in decode_getfattr_attrs()
Fix the following KMSAN warning:

CPU: 1 UID: 0 PID: 7651 Comm: cp Tainted: G    B
Tainted: [B]=BAD_PAGE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009)
=====================================================
=====================================================
BUG: KMSAN: uninit-value in decode_getfattr_attrs+0x2d6d/0x2f90
 decode_getfattr_attrs+0x2d6d/0x2f90
 decode_getfattr_generic+0x806/0xb00
 nfs4_xdr_dec_getattr+0x1de/0x240
 rpcauth_unwrap_resp_decode+0xab/0x100
 rpcauth_unwrap_resp+0x95/0xc0
 call_decode+0x4ff/0xb50
 __rpc_execute+0x57b/0x19d0
 rpc_execute+0x368/0x5e0
 rpc_run_task+0xcfe/0xee0
 nfs4_proc_getattr+0x5b5/0x990
 __nfs_revalidate_inode+0x477/0xd00
 nfs_access_get_cached+0x1021/0x1cc0
 nfs_do_access+0x9f/0xae0
 nfs_permission+0x1e4/0x8c0
 inode_permission+0x356/0x6c0
 link_path_walk+0x958/0x1330
 path_lookupat+0xce/0x6b0
 filename_lookup+0x23e/0x770
 vfs_statx+0xe7/0x970
 vfs_fstatat+0x1f2/0x2c0
 __se_sys_newfstatat+0x67/0x880
 __x64_sys_newfstatat+0xbd/0x120
 x64_sys_call+0x1826/0x3cf0
 do_syscall_64+0xd0/0x1b0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

The KMSAN warning is triggered in decode_getfattr_attrs(), when calling
decode_attr_mdsthreshold(). It appears that fattr->mdsthreshold is not
initialized.

Fix the issue by initializing fattr->mdsthreshold to NULL in
nfs_fattr_init().

Cc: stable@vger.kernel.org # v3.5.x
Fixes: 88034c3d88 ("NFSv4.1 mdsthreshold attribute xdr")
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04 10:24:18 -05:00
NeilBrown
6e2a10343e NFSv3: only use NFS timeout for MOUNT when protocols are compatible
If a timeout is specified in the mount options, it currently applies to
both the NFS protocol and (with v3) the MOUNT protocol.  This is
sensible when they both use the same underlying protocol, or those
protocols are compatible w.r.t timeouts as RDMA and TCP are.

However if, for example, NFS is using TCP and MOUNT is using UDP then
using the same timeout doesn't make much sense.

If you
   mount -o vers=3,proto=tcp,mountproto=udp,timeo=600,retrans=5 \
      server:/path /mountpoint

then the timeo=600 which was intended for the NFS/TCP request will
apply to the MOUNT/UDP requests with the result that there will only be
one request sent (because UDP has a maximum timeout of 60 seconds).
This is not what a reasonable person might expect.

This patch disables the sharing of timeout information in cases where
the underlying protocols are not compatible.

Fixes: c9301cb35b ("nfs: hornor timeo and retrans option when mounting NFSv3")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Anna Schumaker <anna.schumaker@oracle.com>
2024-11-04 10:24:18 -05:00
Kuniyuki Iwashima
ef7134c7fc smb: client: Fix use-after-free of network namespace.
Recently, we got a customer report that CIFS triggers oops while
reconnecting to a server.  [0]

The workload runs on Kubernetes, and some pods mount CIFS servers
in non-root network namespaces.  The problem rarely happened, but
it was always while the pod was dying.

The root cause is wrong reference counting for network namespace.

CIFS uses kernel sockets, which do not hold refcnt of the netns that
the socket belongs to.  That means CIFS must ensure the socket is
always freed before its netns; otherwise, use-after-free happens.

The repro steps are roughly:

  1. mount CIFS in a non-root netns
  2. drop packets from the netns
  3. destroy the netns
  4. unmount CIFS

We can reproduce the issue quickly with the script [1] below and see
the splat [2] if CONFIG_NET_NS_REFCNT_TRACKER is enabled.

When the socket is TCP, it is hard to guarantee the netns lifetime
without holding refcnt due to async timers.

Let's hold netns refcnt for each socket as done for SMC in commit
9744d2bf19 ("smc: Fix use-after-free in tcp_write_timer_handler().").

Note that we need to move put_net() from cifs_put_tcp_session() to
clean_demultiplex_info(); otherwise, __sock_create() still could touch a
freed netns while cifsd tries to reconnect from cifs_demultiplex_thread().

Also, maybe_get_net() cannot be put just before __sock_create() because
the code is not under RCU and there is a small chance that the same
address happened to be reallocated to another netns.

[0]:
CIFS: VFS: \\XXXXXXXXXXX has not responded in 15 seconds. Reconnecting...
CIFS: Serverclose failed 4 times, giving up
Unable to handle kernel paging request at virtual address 14de99e461f84a07
Mem abort info:
  ESR = 0x0000000096000004
  EC = 0x25: DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
  FSC = 0x04: level 0 translation fault
Data abort info:
  ISV = 0, ISS = 0x00000004
  CM = 0, WnR = 0
[14de99e461f84a07] address between user and kernel address ranges
Internal error: Oops: 0000000096000004 [#1] SMP
Modules linked in: cls_bpf sch_ingress nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver tcp_diag inet_diag veth xt_state xt_connmark nf_conntrack_netlink xt_nat xt_statistic xt_MASQUERADE xt_mark xt_addrtype ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_comment nft_compat nf_tables nfnetlink overlay nls_ascii nls_cp437 sunrpc vfat fat aes_ce_blk aes_ce_cipher ghash_ce sm4_ce_cipher sm4 sm3_ce sm3 sha3_ce sha512_ce sha512_arm64 sha1_ce ena button sch_fq_codel loop fuse configfs dmi_sysfs sha2_ce sha256_arm64 dm_mirror dm_region_hash dm_log dm_mod dax efivarfs
CPU: 5 PID: 2690970 Comm: cifsd Not tainted 6.1.103-109.184.amzn2023.aarch64 #1
Hardware name: Amazon EC2 r7g.4xlarge/, BIOS 1.0 11/1/2018
pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : fib_rules_lookup+0x44/0x238
lr : __fib_lookup+0x64/0xbc
sp : ffff8000265db790
x29: ffff8000265db790 x28: 0000000000000000 x27: 000000000000bd01
x26: 0000000000000000 x25: ffff000b4baf8000 x24: ffff00047b5e4580
x23: ffff8000265db7e0 x22: 0000000000000000 x21: ffff00047b5e4500
x20: ffff0010e3f694f8 x19: 14de99e461f849f7 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000000000 x12: 3f92800abd010002
x11: 0000000000000001 x10: ffff0010e3f69420 x9 : ffff800008a6f294
x8 : 0000000000000000 x7 : 0000000000000006 x6 : 0000000000000000
x5 : 0000000000000001 x4 : ffff001924354280 x3 : ffff8000265db7e0
x2 : 0000000000000000 x1 : ffff0010e3f694f8 x0 : ffff00047b5e4500
Call trace:
 fib_rules_lookup+0x44/0x238
 __fib_lookup+0x64/0xbc
 ip_route_output_key_hash_rcu+0x2c4/0x398
 ip_route_output_key_hash+0x60/0x8c
 tcp_v4_connect+0x290/0x488
 __inet_stream_connect+0x108/0x3d0
 inet_stream_connect+0x50/0x78
 kernel_connect+0x6c/0xac
 generic_ip_connect+0x10c/0x6c8 [cifs]
 __reconnect_target_unlocked+0xa0/0x214 [cifs]
 reconnect_dfs_server+0x144/0x460 [cifs]
 cifs_reconnect+0x88/0x148 [cifs]
 cifs_readv_from_socket+0x230/0x430 [cifs]
 cifs_read_from_socket+0x74/0xa8 [cifs]
 cifs_demultiplex_thread+0xf8/0x704 [cifs]
 kthread+0xd0/0xd4
Code: aa0003f8 f8480f13 eb18027f 540006c0 (b9401264)

[1]:
CIFS_CRED="/root/cred.cifs"
CIFS_USER="Administrator"
CIFS_PASS="Password"
CIFS_IP="X.X.X.X"
CIFS_PATH="//${CIFS_IP}/Users/Administrator/Desktop/CIFS_TEST"
CIFS_MNT="/mnt/smb"
DEV="enp0s3"

cat <<EOF > ${CIFS_CRED}
username=${CIFS_USER}
password=${CIFS_PASS}
domain=EXAMPLE.COM
EOF

unshare -n bash -c "
mkdir -p ${CIFS_MNT}
ip netns attach root 1
ip link add eth0 type veth peer veth0 netns root
ip link set eth0 up
ip -n root link set veth0 up
ip addr add 192.168.0.2/24 dev eth0
ip -n root addr add 192.168.0.1/24 dev veth0
ip route add default via 192.168.0.1 dev eth0
ip netns exec root sysctl net.ipv4.ip_forward=1
ip netns exec root iptables -t nat -A POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
mount -t cifs ${CIFS_PATH} ${CIFS_MNT} -o vers=3.0,sec=ntlmssp,credentials=${CIFS_CRED},rsize=65536,wsize=65536,cache=none,echo_interval=1
touch ${CIFS_MNT}/a.txt
ip netns exec root iptables -t nat -D POSTROUTING -s 192.168.0.2 -o ${DEV} -j MASQUERADE
"

umount ${CIFS_MNT}

[2]:
ref_tracker: net notrefcnt@000000004bbc008d has 1/1 users at
     sk_alloc (./include/net/net_namespace.h:339 net/core/sock.c:2227)
     inet_create (net/ipv4/af_inet.c:326 net/ipv4/af_inet.c:252)
     __sock_create (net/socket.c:1576)
     generic_ip_connect (fs/smb/client/connect.c:3075)
     cifs_get_tcp_session.part.0 (fs/smb/client/connect.c:3160 fs/smb/client/connect.c:1798)
     cifs_mount_get_session (fs/smb/client/trace.h:959 fs/smb/client/connect.c:3366)
     dfs_mount_share (fs/smb/client/dfs.c:63 fs/smb/client/dfs.c:285)
     cifs_mount (fs/smb/client/connect.c:3622)
     cifs_smb3_do_mount (fs/smb/client/cifsfs.c:949)
     smb3_get_tree (fs/smb/client/fs_context.c:784 fs/smb/client/fs_context.c:802 fs/smb/client/fs_context.c:794)
     vfs_get_tree (fs/super.c:1800)
     path_mount (fs/namespace.c:3508 fs/namespace.c:3834)
     __x64_sys_mount (fs/namespace.c:3848 fs/namespace.c:4057 fs/namespace.c:4034 fs/namespace.c:4034)
     do_syscall_64 (arch/x86/entry/common.c:52 arch/x86/entry/common.c:83)
     entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Fixes: 26abe14379 ("net: Modify sk_alloc to not reference count the netns of kernel sockets.")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Acked-by: Tom Talpey <tom@talpey.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-11-03 19:28:31 -06:00
Linus Torvalds
a8cc743272 17 hotfixes. 9 are cc:stable. 13 are MM and 4 are non-MM.
The usual collection of singletons - please see the changelogs.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZyfGDAAKCRDdBJ7gKXxA
 jr19AQD6bfDF/6L2Alq1QG26pgrgccEbKzDSzR6pBajwCbdrNQD/XPhiv3zRJfGf
 lgt0Qkqwe/ApBhVYUnL8y1CePv3EDgA=
 =W5W0
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "17 hotfixes.  9 are cc:stable.  13 are MM and 4 are non-MM.

  The usual collection of singletons - please see the changelogs"

* tag 'mm-hotfixes-stable-2024-11-03-10-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: multi-gen LRU: use {ptep,pmdp}_clear_young_notify()
  mm: multi-gen LRU: remove MM_LEAF_OLD and MM_NONLEAF_TOTAL stats
  mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes
  mm: shrinker: avoid memleak in alloc_shrinker_info
  .mailmap: update e-mail address for Eugen Hristev
  vmscan,migrate: fix page count imbalance on node stats when demoting pages
  mailmap: update Jarkko's email addresses
  mm: allow set/clear page_type again
  nilfs2: fix potential deadlock with newly created symlinks
  Squashfs: fix variable overflow in squashfs_readpage_block
  kasan: remove vmalloc_percpu test
  tools/mm: -Werror fixes in page-types/slabinfo
  mm, swap: avoid over reclaim of full clusters
  mm: fix PSWPIN counter for large folios swap-in
  mm: avoid VM_BUG_ON when try to map an anon large folio to zero page.
  mm/codetag: fix null pointer check logic for ref and tag
  mm/gup: stop leaking pinned pages in low memory conditions
2024-11-03 10:25:05 -10:00
Al Viro
a71874379e xattr: switch to CLASS(fd)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 13:29:31 -05:00
Al Viro
8935989798 do_pollfd(): convert to CLASS(fd)
lift setting ->revents into the caller, so that failure exits (including
the early one) would be plain returns.

We need the scope of our struct fd to end before the store to ->revents,
since that's shared with the failure exits prior to the point where we
can do fdget().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:07 -05:00
Al Viro
d000e073ca convert do_select()
take the logics from fdget() to fdput() into an inlined helper - with existing
wait_key_set() subsumed into that.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:07 -05:00
Al Viro
6b1a5ae9b5 convert vfs_dedupe_file_range().
fdput() is followed by checking fatal_signal_pending() (and aborting
the loop in such case).  fdput() is transposable with that check.
Yes, it'll probably end up with slightly fatter code (call after the
check has returned false + call on the almost never taken out-of-line path
instead of one call before the check), but it's not worth bothering with
explicit extra scope there (or dragging the check into the loop condition,
for that matter).

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:07 -05:00
Al Viro
9bd812744d convert cifs_ioctl_copychunk()
fdput() moved past mnt_drop_file_write(); harmless, if somewhat cringeworthy.
Reordering could be avoided either by adding an explicit scope or by making
mnt_drop_file_write() called via __cleanup.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:07 -05:00
Al Viro
20d9eb3b87 convert do_preadv()/do_pwritev()
fdput() can be transposed with add_{r,w}char() and inc_sysc{r,w}();
it's the same story as with do_readv()/do_writev(), only with
fdput() instead of fdput_pos().

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
8152f82010 fdget(), more trivial conversions
all failure exits prior to fdget() leave the scope, all matching fdput()
are immediately followed by leaving the scope.

[xfs_ioc_commit_range() chunk moved here as well]

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
6348be02ee fdget(), trivial conversions
fdget() is the first thing done in scope, all matching fdput() are
immediately followed by leaving the scope.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
554ceb7a5e o2hb_region_dev_store(): avoid goto around fdget()/fdput()
Preparation for CLASS(fd) conversion.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
d7a9616ce0 introduce "fd_pos" class, convert fdget_pos() users to it.
fdget_pos() for constructor, fdput_pos() for cleanup, all users of
fd..._pos() converted trivially.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
048181992c fdget_raw() users: switch to CLASS(fd_raw)
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
a6f46579d7 convert vmsplice() to CLASS(fd)
Irregularity here is fdput() not in the same scope as fdget(); we could
just lift it out vmsplice_type() in vmsplice(2), but there's no much
point keeping vmsplice_type() separate after that...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
0d113fcbc2 simplify xfs_find_handle() a bit
XFS_IOC_FD_TO_HANDLE can grab a reference to copied ->f_path and
let the file go; results in simpler control flow - cleanup is
the same for both "by descriptor" and "by pathname" cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
919a7a1aac timerfd: switch to CLASS(fd)
Fold timerfd_fget() into both callers to have fdget() and fdput() in
the same scope.  Could be done in different ways, but this is probably
the smallest solution.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Al Viro
05e555642c regularize emptiness checks in fini_module(2) and vfs_dedupe_file_range()
With few exceptions emptiness checks are done as fd_file(...) in boolean
context (usually something like if (!fd_file(f))...); those will be
taken care of later.

However, there's a couple of places where we do those checks as
'store fd_file(...) into a variable, then check if this variable is
NULL' and those are harder to spot.

Get rid of those now.

use fd_empty() instead of extracting file and then checking it for NULL.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-11-03 01:28:06 -05:00
Linus Torvalds
3e5e6c9900 nfsd-6.12 fixes:
- Fix two async COPY bugs found during NFS bake-a-thon
 - Fix an svcrdma memory leak
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcmT5wACgkQM2qzM29m
 f5coUg/9FQMf1IeXhyGlDtV0ELUxkHoXaZ2T6zhfXFoLmyl/DU4AWMH4YbSAqk2M
 NnR47sVsM08tIwE3KYQgYzLbbFF41nQUa1sckeel+nYGtpcN6IwOlU5LYYpNNFeQ
 vsECxV78BA6FGdjaXwQ07r4G6lpVhCCqM/RZpDrwNSyoIWVLo77KBUVCSoQb5wzG
 z7OBvO9M7HfVJVOHcPd+tVcZaGAF0fhW812fibZQKV2mrWdhOOe+gWVs8ro3tmm1
 GocbTTQW2hlYcLCZPe1przTI9flfwon6Lk8TmIZuU5IrzcaAB+U3P140aKgl9427
 v4WdLuKYlKi+xISBdRG3omyaLroNUs8IHW4KoBXAW3FinyLzNsAyoPxb02m7SEge
 sOJ/gbeLtb2u+ur4wAp4gDmVKfg3TGyh05Hdt96LXsbQUuWIlwEcPurl+nY93Eoq
 vrPLIdPOXrOD5jBIaVQkBYlaCn04mDg+VTNbG9hW1wjorVFpWKS7MwCjVloXSVIn
 uE++cVpQtIKp8aTYBbFVXqtVREatczl++f+Npnlm8xlcquDbaORkk4ZBOs4vQuHo
 pNuZcWO0rIBR6hakr44OjTLnJBIwChPBYvBVgtq1E6oAbwHvC2SVXbiJB8IcCPOx
 nB2jnF0/tpTs2LnrHAxdAGdU9Om6RGmahfz/uwh/8djdbCVH+Ik=
 =NiQh
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix two async COPY bugs found during NFS bake-a-thon

 - Fix an svcrdma memory leak

* tag 'nfsd-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  rpcrdma: Always release the rpcrdma_device's xa_array
  NFSD: Never decrement pending_async_copies on error
  NFSD: Initialize struct nfsd4_copy earlier
2024-11-02 09:27:11 -10:00
Linus Torvalds
f6a7b4ec74 XFS bug fies for 6.12-rc6
* fix a sysbot reported crash on filestreams
 * Reduce cpu time spent searching for extents in
   a very fragmented FS
 * Check for delayed allocations before setting extsize
 
 Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZyIMDwAKCRBGdaER5Qtf
 pllxAYCkk+mtDTD5xBfOVGZWO5MMFz8HqYcro5wrSCzgL8HDmW29kXTBYFviGn3R
 3l/H6BEBgOk0EkI5qGOzijpzbsWyJeLzPzZtxQFPD8zFBdxSERCtbpqFDLLvLQWG
 M+TLhUNkPQ==
 =kKX4
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - fix a sysbot reported crash on filestreams

 - Reduce cpu time spent searching for extents in a very fragmented FS

 - Check for delayed allocations before setting extsize

* tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: streamline xfs_filestream_pick_ag
  xfs: fix finding a last resort AG in xfs_filestream_pick_ag
  xfs: Reduce unnecessary searches when searching for the best extents
  xfs: Check for delayed allocations before setting extsize
2024-11-02 09:22:16 -10:00
Linus Torvalds
17fa6a5f93 vfs-6.12-rc6.iomap
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGVAAKCRCRxhvAZXjc
 oltEAP9r8cWa3Tdv8DzMNWu/jezTUXoW/mX5Qe+c1L6faqj0WQD/dIVtBtG37Tfq
 3Ci9F/GEWjKijtCQ5lwMGUq27jQJ1gk=
 =/0iA
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull iomap fixes from Christian Brauner:
 "Fixes for iomap to prevent data corruption bugs in the fallocate
  unshare range implementation of fsdax and a small cleanup to turn
  iomap_want_unshare_iter() into an inline function"

* tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  iomap: turn iomap_want_unshare_iter into an inline function
  fsdax: dax_unshare_iter needs to copy entire blocks
  fsdax: remove zeroing code from dax_unshare_iter
  iomap: share iomap_unshare_iter predicate code with fsdax
  xfs: don't allocate COW extents when unsharing a hole
2024-11-01 07:45:00 -10:00
Linus Torvalds
d56239a82e vfs-6.12-rc6.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZyTGAQAKCRCRxhvAZXjc
 opd6AQCal4omyfS8FYe4VRRZ/0XHouagq99I0U0TAmKkvoKAsgD/XrdE+pSTEkPX
 Pv4T9phh1cZRxcyKVu77UoYkuHJEDAg=
 =Lu9R
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs

Pull filesystem fixes from Christian Brauner:
 "VFS:

   - Fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP=y is set

   - Add a get_tree_bdev_flags() helper that allows to modify e.g.,
     whether errors are logged into the filesystem context during
     superblock creation. This is used by erofs to fix a userspace
     regression where an error is currently logged when its used on a
     regular file which is an new allowed mode in erofs.

  netfs:

   - Fix the sysfs debug path in the documentation.

   - Fix iov_iter_get_pages*() for folio queues by skipping the page
     extracation if we're at the end of a folio.

  afs:

   - Fix moving subdirectories to different parent directory.

  autofs:

   - Fix handling of AUTOFS_DEV_IOCTL_TIMEOUT_CMD ioctl in
     validate_dev_ioctl(). The actual ioctl number, not the ioctl
     command needs to be checked for autofs"

* tag 'vfs-6.12-rc6.fixes' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs:
  iov_iter: fix copy_page_from_iter_atomic() if KMAP_LOCAL_FORCE_MAP
  autofs: fix thinko in validate_dev_ioctl()
  iov_iter: Fix iov_iter_get_pages*() for folio_queue
  afs: Fix missing subdir edit when renamed between parent dirs
  doc: correcting the debug path for cachefiles
  erofs: use get_tree_bdev_flags() to avoid misleading messages
  fs/super.c: introduce get_tree_bdev_flags()
2024-11-01 07:37:10 -10:00
Linus Torvalds
6b4926494e for-6.12-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmck8eQACgkQxWXV+ddt
 WDu05g/6AwrnvPkivC4iVOv4Wkzrpk4gm76smx91Y9B8tSDLI1pHaS27CvJz9iWl
 vBKXPN3PQVQHwo6SPn+NjsFOSMkXlbBOVKpPU+MlZwH9Tuw66qcC+EnUCK2wEuAy
 3TN7cUGIA4r/j+SkhgIz+Irlr5pjdb1KkPIMBEVGcVFqDIuvDaTEGBqTn2i/V5aa
 dMn+gK+9rfngTOJ68t/pEFaX7SEWCvgMIcBpBB4/vs1gHm3ve2bcc1sBAdMxb1Se
 SrxgZfq+Rc5tkMn540JaWGwkb0rLzwXlurK6ygTKDKCpH0IMX+pBvDkexh9Zj0ux
 jejlRxiuDzTx3z2a7FjHDyp2sdZWMpq3sPsowpJ1Dsgi5EtSxTy4irmQuSAZY1Uj
 /uo6YwV9aTGeiNDwZeKqKc/wOuAttaMZLr14s37pro9KxndFJ/XZBxeyB+euUCOw
 B8AvAQVVIJAYQLyWINWruNKppqlgiO2RaN15RvvT2pX01d0TOx1KX1XFQku7YFxb
 M/8ZNXzJ96XtkeyHL3euo3zj7N5jWtnCvPINugUG1ADQa+bc8aX336gld1neD6fs
 QqIFIgzZG0l4N95viJilACrI6tW9zFnBqMyNFRhucKiX9aP9glOvhSfxfjcpDuQ/
 i/LIyxVLwp8M3hPNvv8tC345+1C2ug9AD0OyhWjjIYPuiOxtTWs=
 =alpB
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more stability fixes. There's one patch adding export of MIPS
  cmpxchg helper, used in the error propagation fix.

   - fix error propagation from split bios to the original btrfs bio

   - fix merging of adjacent extents (normal operation, defragmentation)

   - fix potential use after free after freeing btrfs device structures"

* tag 'for-6.12-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix defrag not merging contiguous extents due to merged extent maps
  btrfs: fix extent map merging not happening for adjacent extents
  btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids()
  btrfs: fix error propagation of split bios
  MIPS: export __cmpxchg_small()
2024-11-01 07:31:47 -10:00
Linus Torvalds
7b83601da4 bcachefs fixes for 6.12-rc6
Various syzbot fixes, and the more notable ones:
 
 - Fix for pointers in an extent overflowing the max (16) on a filesystem
   with many devices: we were creating too many cached copies when moving
   data around. Now, we only create at most one cached copy if there's a
   promote target set.
 
   Caching will be a bit broken for reflinked data until 6.13: I have
   larger series queued up which significantly improves the plumbing for
   data options down into the extent (bch_extent_rebalance) to fix this.
 
 - Fix for deadlock on -ENOSPC on tiny filesystems
 
   Allocation from the partial open_bucket list wasn't correctly
   accounting partial open_buckets as free: this fixes the main cause of
   tests timing out in the automated tests.
 -----BEGIN PGP SIGNATURE-----
 
 iQJOBAABCAA4FiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmckUrIaHGtlbnQub3Zl
 cnN0cmVldEBsaW51eC5kZXYACgkQE6szbY3KbnaFYA//Qd9SD8v+ypavnaogWhqk
 3bufCO8YJDV5DRQVuX/z36fia8zOKzWQGRAYvq0vF0mmgagwBE+AcBh6vvfDCxqZ
 m1937IXcv/hHh2FFQau9gWEItTH9dGwyQjeDjB3xaTL5ZTGsAdA9558ygf8GAVOe
 wD+W8Z8Qj09hAErnNS7y50t/PGbZDuG7AV2Dy2unp+fp6U0FVrZ3Z0bhFuhxcR7/
 e3j49DoW4EZL7Gu1svn7nzehjWK4wx1wX7QhynPgSOVIhdj2Fc3XG76b3mBsuZF6
 A/cBRKmSZsYL9MBK0vferqizqeuwlIJsvwpo/6zzukpyf8QOl+0IqPuAXFoz8vg3
 vrdp9cdvzWvQNexTD2+7PYosCKoUswOvo0oIy8Iopkg4VGSreZib1sZeCPzw2FBK
 AZcQaQSBLKojWpYsn9Dl2AlqEHHTvnopjr5wRXiimqKe/OcA3ugIvebUw2UE2ACp
 /Z2ZQu615BtRYQM+dRIJJQ2CAy0F3EZxIXEXwc/yrH7kL2VBay8QCKp/k/9YYy4e
 Nlxxw7alb/XGgT8GQgu24tho3yMKT621dLFOaAZ7x2HtLP8T56zL/L/wKWsocW/V
 R8Kwqot6F1EVb3Q0LECUJottYQ+5I1Et7ZpVyOPxfqF1y7KOsuxKOmZFLO7i3Spc
 fg0gOt/fyKrAF3zuSmWXne8=
 =hzm/
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Various syzbot fixes, and the more notable ones:

   - Fix for pointers in an extent overflowing the max (16) on a
     filesystem with many devices: we were creating too many cached
     copies when moving data around. Now, we only create at most one
     cached copy if there's a promote target set.

     Caching will be a bit broken for reflinked data until 6.13: I have
     larger series queued up which significantly improves the plumbing
     for data options down into the extent (bch_extent_rebalance) to fix
     this.

   - Fix for deadlock on -ENOSPC on tiny filesystems

     Allocation from the partial open_bucket list wasn't correctly
     accounting partial open_buckets as free: this fixes the main cause
     of tests timing out in the automated tests"

* tag 'bcachefs-2024-10-31' of git://evilpiepirate.org/bcachefs:
  bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek
  bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get()
  bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets
  bcachefs: Don't filter partial list buckets in open_buckets_to_text()
  bcachefs: Don't keep tons of cached pointers around
  bcachefs: init freespace inited bits to 0 in bch2_fs_initialize
  bcachefs: Fix unhandled transaction restart in fallocate
  bcachefs: Fix UAF in bch2_reconstruct_alloc()
  bcachefs: fix null-ptr-deref in have_stripes()
  bcachefs: fix shift oob in alloc_lru_idx_fragmentation
  bcachefs: Fix invalid shift in validate_sb_layout()
2024-11-01 07:21:03 -10:00
Linus Torvalds
cb80d9074f
fs: optimize acl_permission_check()
generic_permission() turned out to be costlier than expected. The reason
is that acl_permission_check() performs owner checks that involve costly
pointer dereferences.

There's already code that skips expensive group checks if the group and
other permission bits are the same wrt to the mask that is checked. This
logic can be extended to also shortcut permission checking in
acl_permission_check().

If the inode doesn't have POSIX ACLs than ownership doesn't matter. If
there are no missing UGO permissions the permission check can be
shortcut.

Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/all/CAHk-=wgXEoAOFRkDg+grxs+p1U+QjWXLixRGmYEfd=vG+OBuFw@mail.gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-11-01 14:12:34 +01:00
Kalesh Singh
e4d32142d1 tracing: Fix tracefs mount options
Commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
converted tracefs to use the new mount APIs caused mount options
(e.g. gid=<gid>) to not take effect.

The tracefs superblock can be updated from multiple paths:
    - on fs_initcall() to init_trace_printk_function_export()
    - from a work queue to initialize eventfs
      tracer_init_tracefs_work_func()
    - fsconfig() syscall to mount or remount of tracefs

The tracefs superblock root inode gets created early on in
init_trace_printk_function_export().

With the new mount API, tracefs effectively uses get_tree_single() instead
of the old API mount_single().

Previously, mount_single() ensured that the options are always applied to
the superblock root inode:
    (1) If the root inode didn't exist, call fill_super() to create it
        and apply the options.
    (2) If the root inode exists, call reconfigure_single() which
        effectively calls tracefs_apply_options() to parse and apply
        options to the subperblock's fs_info and inode and remount
        eventfs (if necessary)

On the other hand, get_tree_single() effectively calls vfs_get_super()
which:
    (3) If the root inode doesn't exists, calls fill_super() to create it
        and apply the options.
    (4) If the root inode already exists, updates the fs_context root
        with the superblock's root inode.

(4) above is always the case for tracefs mounts, since the super block's
root inode will already be created by init_trace_printk_function_export().

This means that the mount options get ignored:
    - Since it isn't applied to the superblock's root inode, it doesn't
      get inherited by the children.
    - Since eventfs is initialized from a separate work queue and
      before call to mount with the options, and it doesn't get remounted
      for mount.

Ensure that the mount options are applied to the super block and eventfs
is remounted to respect the mount options.

To understand this better, if fstab has the following:

 tracefs  /sys/kernel/tracing  tracefs   nosuid,nodev,noexec,gid=tracing 0  0

On boot up, permissions look like:

 # ls -l /sys/kernel/tracing/trace
 -rw-r----- 1 root root 0 Nov  1 08:37 /sys/kernel/tracing/trace

When it should look like:

 # ls -l /sys/kernel/tracing/trace
 -rw-r----- 1 root tracing 0 Nov  1 08:37 /sys/kernel/tracing/trace

Link: https://lore.kernel.org/r/536e99d3-345c-448b-adee-a21389d7ab4b@redhat.com/

Cc: Eric Sandeen <sandeen@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Ali Zahraee <ahzahraee@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 78ff640819 ("vfs: Convert tracefs to use the new mount API")
Link: https://lore.kernel.org/20241030171928.4168869-2-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-11-01 08:38:14 -04:00
Filipe Manana
77b0d113ee btrfs: fix defrag not merging contiguous extents due to merged extent maps
When running defrag (manual defrag) against a file that has extents that
are contiguous and we already have the respective extent maps loaded and
merged, we end up not defragging the range covered by those contiguous
extents. This happens when we have an extent map that was the result of
merging multiple extent maps for contiguous extents and the length of the
merged extent map is greater than or equals to the defrag threshold
length.

The script below reproduces this scenario:

   $ cat test.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi

   mkfs.btrfs -f $DEV
   mount $DEV $MNT

   # Create a 256K file with 4 extents of 64K each.
   xfs_io -f -c "falloc 0 64K" \
             -c "pwrite 0 64K" \
             -c "falloc 64K 64K" \
             -c "pwrite 64K 64K" \
             -c "falloc 128K 64K" \
             -c "pwrite 128K 64K" \
             -c "falloc 192K 64K" \
             -c "pwrite 192K 64K" \
             $MNT/foo

   umount $MNT
   echo -n "Initial number of file extent items: "
   btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l

   mount $DEV $MNT
   # Read the whole file in order to load and merge extent maps.
   cat $MNT/foo > /dev/null

   btrfs filesystem defragment -t 128K $MNT/foo
   umount $MNT
   echo -n "Number of file extent items after defrag with 128K threshold: "
   btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l

   mount $DEV $MNT
   # Read the whole file in order to load and merge extent maps.
   cat $MNT/foo > /dev/null

   btrfs filesystem defragment -t 256K $MNT/foo
   umount $MNT
   echo -n "Number of file extent items after defrag with 256K threshold: "
   btrfs inspect-internal dump-tree -t 5 $DEV | grep EXTENT_DATA | wc -l

Running it:

   $ ./test.sh
   Initial number of file extent items: 4
   Number of file extent items after defrag with 128K threshold: 4
   Number of file extent items after defrag with 256K threshold: 4

The 4 extents don't get merged because we have an extent map with a size
of 256K that is the result of merging the individual extent maps for each
of the four 64K extents and at defrag_lookup_extent() we have a value of
zero for the generation threshold ('newer_than' argument) since this is a
manual defrag. As a consequence we don't call defrag_get_extent() to get
an extent map representing a single file extent item in the inode's
subvolume tree, so we end up using the merged extent map at
defrag_collect_targets() and decide not to defrag.

Fix this by updating defrag_lookup_extent() to always discard extent maps
that were merged and call defrag_get_extent() regardless of the minimum
generation threshold ('newer_than' argument).

A test case for fstests will be sent along soon.

CC: stable@vger.kernel.org # 6.1+
Fixes: 199257a78b ("btrfs: defrag: don't use merged extent map for their generation check")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-31 16:46:41 +01:00
Filipe Manana
a0f0625390 btrfs: fix extent map merging not happening for adjacent extents
If we have 3 or more adjacent extents in a file, that is, consecutive file
extent items pointing to adjacent extents, within a contiguous file range
and compatible flags, we end up not merging all the extents into a single
extent map.

For example:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt/sdc

  $ xfs_io -f -d -c "pwrite -b 64K 0 64K" \
                 -c "pwrite -b 64K 64K 64K" \
                 -c "pwrite -b 64K 128K 64K" \
                 -c "pwrite -b 64K 192K 64K" \
                 /mnt/sdc/foo

After all the ordered extents complete we unpin the extent maps and try
to merge them, but instead of getting a single extent map we get two
because:

1) When the first ordered extent completes (file range [0, 64K)) we
   unpin its extent map and attempt to merge it with the extent map for
   the range [64K, 128K), but we can't because that extent map is still
   pinned;

2) When the second ordered extent completes (file range [64K, 128K)), we
   unpin its extent map and merge it with the previous extent map, for
   file range [0, 64K), but we can't merge with the next extent map, for
   the file range [128K, 192K), because this one is still pinned.

   The merged extent map for the file range [0, 128K) gets the flag
   EXTENT_MAP_MERGED set;

3) When the third ordered extent completes (file range [128K, 192K)), we
   unpin its extent map and attempt to merge it with the previous extent
   map, for file range [0, 128K), but we can't because that extent map
   has the flag EXTENT_MAP_MERGED set (mergeable_maps() returns false
   due to different flags) while the extent map for the range [128K, 192K)
   doesn't have that flag set.

   We also can't merge it with the next extent map, for file range
   [192K, 256K), because that one is still pinned.

   At this moment we have 3 extent maps:

   One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set.
   One for file range [128K, 192K).
   One for file range [192K, 256K) which is still pinned;

4) When the fourth and final extent completes (file range [192K, 256K)),
   we unpin its extent map and attempt to merge it with the previous
   extent map, for file range [128K, 192K), which succeeds since none
   of these extent maps have the EXTENT_MAP_MERGED flag set.

   So we end up with 2 extent maps:

   One for file range [0, 128K), with the flag EXTENT_MAP_MERGED set.
   One for file range [128K, 256K), with the flag EXTENT_MAP_MERGED set.

   Since after merging extent maps we don't attempt to merge again, that
   is, merge the resulting extent map with the one that is now preceding
   it (and the one following it), we end up with those two extent maps,
   when we could have had a single extent map to represent the whole file.

Fix this by making mergeable_maps() ignore the EXTENT_MAP_MERGED flag.
While this doesn't present any functional issue, it prevents the merging
of extent maps which allows to save memory, and can make defrag not
merging extents too (that will be addressed in the next patch).

Fixes: 199257a78b ("btrfs: defrag: don't use merged extent map for their generation check")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-31 16:45:16 +01:00
Ryusuke Konishi
b3a033e3ec nilfs2: fix potential deadlock with newly created symlinks
Syzbot reported that page_symlink(), called by nilfs_symlink(), triggers
memory reclamation involving the filesystem layer, which can result in
circular lock dependencies among the reader/writer semaphore
nilfs->ns_segctor_sem, s_writers percpu_rwsem (intwrite) and the
fs_reclaim pseudo lock.

This is because after commit 21fc61c73c ("don't put symlink bodies in
pagecache into highmem"), the gfp flags of the page cache for symbolic
links are overwritten to GFP_KERNEL via inode_nohighmem().

This is not a problem for symlinks read from the backing device, because
the __GFP_FS flag is dropped after inode_nohighmem() is called.  However,
when a new symlink is created with nilfs_symlink(), the gfp flags remain
overwritten to GFP_KERNEL.  Then, memory allocation called from
page_symlink() etc.  triggers memory reclamation including the FS layer,
which may call nilfs_evict_inode() or nilfs_dirty_inode().  And these can
cause a deadlock if they are called while nilfs->ns_segctor_sem is held:

Fix this issue by dropping the __GFP_FS flag from the page cache GFP flags
of newly created symlinks in the same way that nilfs_new_inode() and
__nilfs_read_inode() do, as a workaround until we adopt nofs allocation
scope consistently or improve the locking constraints.

Link: https://lkml.kernel.org/r/20241020050003.4308-1-konishi.ryusuke@gmail.com
Fixes: 21fc61c73c ("don't put symlink bodies in pagecache into highmem")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=9ef37ac20608f4836256
Tested-by: syzbot+9ef37ac20608f4836256@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-30 20:14:12 -07:00
Phillip Lougher
d31638ff6c Squashfs: fix variable overflow in squashfs_readpage_block
Syzbot reports a slab out of bounds access in squashfs_readpage_block().

This is caused by an attempt to read page index 0x2000000000.  This value
(start_index) is stored in an integer loop variable which overflows
producing a value of 0.  This causes a loop which iterates over pages
start_index -> end_index to iterate over 0 -> end_index, which ultimately
causes an out of bounds page array access.

Fix by changing variable to a loff_t, and rename to index to make it
clearer it is a page index, and not a loop count.

Link: https://lkml.kernel.org/r/20241020232200.837231-1-phillip@squashfs.org.uk
Signed-off-by: Phillip Lougher <phillip@squashfs.org.uk>
Reported-by: "Lai, Yi" <yi1.lai@linux.intel.com>
Closes: https://lore.kernel.org/all/ZwzcnCAosIPqQ9Ie@ly-workstation/
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-30 20:14:12 -07:00
Jan Kara
76486b1041 ext4: avoid remount errors with 'abort' mount option
When we remount filesystem with 'abort' mount option while changing
other mount options as well (as is LTP test doing), we can return error
from the system call after commit d3476f3dad ("ext4: don't set
SB_RDONLY after filesystem errors") because the application of mount
option changes detects shutdown filesystem and refuses to do anything.
The behavior of application of other mount options in presence of
'abort' mount option is currently rather arbitary as some mount option
changes are handled before 'abort' and some after it.

Move aborting of the filesystem to the end of remount handling so all
requested changes are properly applied before the filesystem is shutdown
to have a reasonably consistent behavior.

Fixes: d3476f3dad ("ext4: don't set SB_RDONLY after filesystem errors")
Reported-by: Jan Stancek <jstancek@redhat.com>
Link: https://lore.kernel.org/all/Zvp6L+oFnfASaoHl@t14s
Signed-off-by: Jan Kara <jack@suse.cz>
Tested-by: Jan Stancek <jstancek@redhat.com>
Link: https://patch.msgid.link/20241004221556.19222-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-30 17:42:44 -04:00
Jeongjun Park
902cc179c9 ext4: supress data-race warnings in ext4_free_inodes_{count,set}()
find_group_other() and find_group_orlov() read *_lo, *_hi with
ext4_free_inodes_count without additional locking. This can cause
data-race warning, but since the lock is held for most writes and free
inodes value is generally not a problem even if it is incorrect, it is
more appropriate to use READ_ONCE()/WRITE_ONCE() than to add locking.

==================================================================
BUG: KCSAN: data-race in ext4_free_inodes_count / ext4_free_inodes_set

write to 0xffff88810404300e of 2 bytes by task 6254 on cpu 1:
 ext4_free_inodes_set+0x1f/0x80 fs/ext4/super.c:405
 __ext4_new_inode+0x15ca/0x2200 fs/ext4/ialloc.c:1216
 ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391
 vfs_symlink+0xca/0x1d0 fs/namei.c:4615
 do_symlinkat+0xe3/0x340 fs/namei.c:4641
 __do_sys_symlinkat fs/namei.c:4657 [inline]
 __se_sys_symlinkat fs/namei.c:4654 [inline]
 __x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654
 x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

read to 0xffff88810404300e of 2 bytes by task 6257 on cpu 0:
 ext4_free_inodes_count+0x1c/0x80 fs/ext4/super.c:349
 find_group_other fs/ext4/ialloc.c:594 [inline]
 __ext4_new_inode+0x6ec/0x2200 fs/ext4/ialloc.c:1017
 ext4_symlink+0x242/0x5a0 fs/ext4/namei.c:3391
 vfs_symlink+0xca/0x1d0 fs/namei.c:4615
 do_symlinkat+0xe3/0x340 fs/namei.c:4641
 __do_sys_symlinkat fs/namei.c:4657 [inline]
 __se_sys_symlinkat fs/namei.c:4654 [inline]
 __x64_sys_symlinkat+0x5e/0x70 fs/namei.c:4654
 x64_sys_call+0x1dda/0x2d60 arch/x86/include/generated/asm/syscalls_64.h:267
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0x54/0x120 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Cc: stable@vger.kernel.org
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://patch.msgid.link/20241003125337.47283-1-aha310510@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-30 17:42:44 -04:00
Markus Elfring
d431a2cd28 ext4: Call ext4_journal_stop(handle) only once in ext4_dio_write_iter()
An ext4_journal_stop(handle) call was immediately used after a return value
check for a ext4_orphan_add() call in this function implementation.
Thus call such a function only once instead directly before the check.

This issue was transformed by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/cf895072-43cf-412c-bced-8268498ad13e@web.de
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2024-10-30 17:42:44 -04:00
Chuck Lever
8286f8b622 NFSD: Never decrement pending_async_copies on error
The error flow in nfsd4_copy() calls cleanup_async_copy(), which
already decrements nn->pending_async_copies.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-30 14:12:16 -04:00
Christoph Hellwig
81a1e1c32e xfs: streamline xfs_filestream_pick_ag
Directly return the error from xfs_bmap_longest_free_extent instead
of breaking from the loop and handling it there, and use a done
label to directly jump to the exist when we found a suitable perag
structure to reduce the indentation level and pag/max_pag check
complexity in the tail of the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-30 11:27:18 +01:00
Christoph Hellwig
dc60992ce7 xfs: fix finding a last resort AG in xfs_filestream_pick_ag
When the main loop in xfs_filestream_pick_ag fails to find a suitable
AG it tries to just pick the online AG.  But the loop for that uses
args->pag as loop iterator while the later code expects pag to be
set.  Fix this by reusing the max_pag case for this last resort, and
also add a check for impossible case of no AG just to make sure that
the uninitialized pag doesn't even escape in theory.

Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com
Fixes: f8f1ed1ab3 ("xfs: return a referenced perag from filestreams allocator")
Cc: <stable@vger.kernel.org> # v6.3
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-30 11:27:18 +01:00
Chi Zhiling
3ef2268403 xfs: Reduce unnecessary searches when searching for the best extents
Recently, we found that the CPU spent a lot of time in
xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented
spaces.

The reason is that we conducted much extra searching for extents that
could not yield a better result, and these searches would cost a lot of
time when there were millions of extents to search through. Even if we
get the same result length, we don't switch our choice to the new one,
so we can definitely terminate the search early.

Since the result length cannot exceed the found length, when the found
length equals the best result length we already have, we can conclude
the search.

We did a test in that filesystem:
[root@localhost ~]# xfs_db -c freesp /dev/vdb
   from      to extents  blocks    pct
      1       1     215     215   0.01
      2       3  994476 1988952  99.99

Before this patch:
 0)               |  xfs_alloc_ag_vextent_size [xfs]() {
 0) * 15597.94 us |  }

After this patch:
 0)               |  xfs_alloc_ag_vextent_size [xfs]() {
 0)   19.176 us    |  }

Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-30 11:27:18 +01:00
Ojaswin Mujoo
2a492ff666 xfs: Check for delayed allocations before setting extsize
Extsize should only be allowed to be set on files with no data in it.
For this, we check if the files have extents but miss to check if
delayed extents are present. This patch adds that check.

While we are at it, also refactor this check into a helper since
it's used in some other places as well like xfs_inactive() or
xfs_ioctl_setattr_xflags()

**Without the patch (SUCCEEDS)**

$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)

**With the patch (FAILS as expected)**

$ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536'

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec)
xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument

Fixes: e94af02a9c ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-30 11:27:18 +01:00
Christian Brauner
2ec67bb4f9
Merge branch 'work.fdtable' into vfs.file
Bring in the fdtable changes for this cycle.

Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-30 09:58:02 +01:00
Christian Brauner
90ee6ed776
fs: port files to file_ref
Port files to rely on file_ref reference to improve scaling and gain
overflow protection.

- We continue to WARN during get_file() in case a file that is already
  marked dead is revived as get_file() is only valid if the caller
  already holds a reference to the file. This hasn't changed just the
  check changes.

- The semantics for epoll and ttm's dmabuf usage have changed. Both
  epoll and ttm synchronize with __fput() to prevent the underlying file
  from beeing freed.

  (1) epoll

      Explaining epoll is straightforward using a simple diagram.
      Essentially, the mutex of the epoll instance needs to be taken in both
      __fput() and around epi_fget() preventing the file from being freed
      while it is polled or preventing the file from being resurrected.

          CPU1                                   CPU2
          fput(file)
          -> __fput(file)
             -> eventpoll_release(file)
                -> eventpoll_release_file(file)
                                                 mutex_lock(&ep->mtx)
                                                 epi_item_poll()
                                                 -> epi_fget()
                                                    -> file_ref_get(file)
                                                 mutex_unlock(&ep->mtx)
                   mutex_lock(&ep->mtx);
                   __ep_remove()
                   mutex_unlock(&ep->mtx);
             -> kmem_cache_free(file)

  (2) ttm dmabuf

      This explanation is a bit more involved. A regular dmabuf file stashed
      the dmabuf in file->private_data and the file in dmabuf->file:

          file->private_data = dmabuf;
          dmabuf->file = file;

      The generic release method of a dmabuf file handles file specific
      things:

          f_op->release::dma_buf_file_release()

      while the generic dentry release method of a dmabuf handles dmabuf
      freeing including driver specific things:

          dentry->d_release::dma_buf_release()

      During ttm dmabuf initialization in ttm_object_device_init() the ttm
      driver copies the provided struct dma_buf_ops into a private location:

          struct ttm_object_device {
                  spinlock_t object_lock;
                  struct dma_buf_ops ops;
                  void (*dmabuf_release)(struct dma_buf *dma_buf);
                  struct idr idr;
          };

          ttm_object_device_init(const struct dma_buf_ops *ops)
          {
                  // copy original dma_buf_ops in private location
                  tdev->ops = *ops;

                  // stash the release method of the original struct dma_buf_ops
                  tdev->dmabuf_release = tdev->ops.release;

                  // override the release method in the copy of the struct dma_buf_ops
                  // with ttm's own dmabuf release method
                  tdev->ops.release = ttm_prime_dmabuf_release;
          }

      When a new dmabuf is created the struct dma_buf_ops with the overriden
      release method set to ttm_prime_dmabuf_release is passed in exp_info.ops:

          DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
          exp_info.ops = &tdev->ops;
          exp_info.size = prime->size;
          exp_info.flags = flags;
          exp_info.priv = prime;

      The call to dma_buf_export() then sets

          mutex_lock_interruptible(&prime->mutex);
          dma_buf = dma_buf_export(&exp_info)
          {
                  dmabuf->ops = exp_info->ops;
          }
          mutex_unlock(&prime->mutex);

      which creates a new dmabuf file and then install a file descriptor to
      it in the callers file descriptor table:

          ret = dma_buf_fd(dma_buf, flags);

      When that dmabuf file is closed we now get:

          fput(file)
          -> __fput(file)
             -> f_op->release::dma_buf_file_release()
             -> dput()
                -> d_op->d_release::dma_buf_release()
                   -> dmabuf->ops->release::ttm_prime_dmabuf_release()
                      mutex_lock(&prime->mutex);
                      if (prime->dma_buf == dma_buf)
                            prime->dma_buf = NULL;
                      mutex_unlock(&prime->mutex);

      Where we can see that prime->dma_buf is set to NULL. So when we have
      the following diagram:

          CPU1                                                          CPU2
          fput(file)
          -> __fput(file)
             -> f_op->release::dma_buf_file_release()
             -> dput()
                -> d_op->d_release::dma_buf_release()
                   -> dmabuf->ops->release::ttm_prime_dmabuf_release()
                                                                        ttm_prime_handle_to_fd()
                                                                        mutex_lock_interruptible(&prime->mutex)
                                                                        dma_buf = prime->dma_buf
                                                                        dma_buf && get_dma_buf_unless_doomed(dma_buf)
                                                                        -> file_ref_get(dma_buf->file)
                                                                        mutex_unlock(&prime->mutex);

                      mutex_lock(&prime->mutex);
                      if (prime->dma_buf == dma_buf)
                            prime->dma_buf = NULL;
                      mutex_unlock(&prime->mutex);
             -> kmem_cache_free(file)

      The logic of the mechanism is the same as for epoll: sync with
      __fput() preventing the file from being freed. Here the
      synchronization happens through the ttm instance's prime->mutex.
      Basically, the lifetime of the dma_buf and the file are tighly
      coupled.

  Both (1) and (2) used to call atomic_inc_not_zero() to check whether
  the file has already been marked dead and then refuse to revive it.

  This is only safe because both (1) and (2) sync with __fput() and thus
  prevent kmem_cache_free() on the file being called and thus prevent
  the file from being immediately recycled due to SLAB_TYPESAFE_BY_RCU.

  Both (1) and (2) have been ported from atomic_inc_not_zero() to
  file_ref_get(). That means a file that is already in the process of
  being marked as FILE_REF_DEAD:

  file_ref_put()
  cnt = atomic_long_dec_return()
  -> __file_ref_put(cnt)
     if (cnt == FIlE_REF_NOREF)
             atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  can be revived again:

  CPU1                                                             CPU2
  file_ref_put()
  cnt = atomic_long_dec_return()
  -> __file_ref_put(cnt)
     if (cnt == FIlE_REF_NOREF)
                                                                   file_ref_get()
                                                                   // Brings reference back to FILE_REF_ONEREF
                                                                   atomic_long_add_negative()
             atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  This is fine and inherent to the file_ref_get()/file_ref_put()
  semantics. For both (1) and (2) this is safe because __fput() is
  prevented from making progress if file_ref_get() fails due to the
  aforementioned synchronization mechanisms.

  Two cases need to be considered that affect both (1) epoll and (2) ttm
  dmabuf:

   (i) fput()'s file_ref_put() and marks the file as FILE_REF_NOREF but
       before that fput() can mark the file as FILE_REF_DEAD someone
       manages to sneak in a file_ref_get() and brings the refcount back
       from FILE_REF_NOREF to FILE_REF_ONEREF. In that case the original
       fput() doesn't call __fput(). For epoll the poll will finish and
       for ttm dmabuf the file can be used again. For ttm dambuf this is
       actually an advantage because it avoids immediately allocating
       a new dmabuf object.

       CPU1                                                             CPU2
       file_ref_put()
       cnt = atomic_long_dec_return()
       -> __file_ref_put(cnt)
          if (cnt == FIlE_REF_NOREF)
                                                                        file_ref_get()
                                                                        // Brings reference back to FILE_REF_ONEREF
                                                                        atomic_long_add_negative()
                  atomic_long_try_cmpxchg_release(cnt, FILE_REF_DEAD)

  (ii) fput()'s file_ref_put() marks the file FILE_REF_NOREF and
       also suceeds in actually marking it FILE_REF_DEAD and then calls
       into __fput() to free the file.

       When either (1) or (2) call file_ref_get() they fail as
       atomic_long_add_negative() will return true.

       At the same time, both (1) and (2) all file_ref_get() under
       mutexes that __fput() must also acquire preventing
       kmem_cache_free() from freeing the file.

  So while this might be treated as a change in semantics for (1) and
  (2) it really isn't. It if should end up causing issues this can be
  fixed by adding a helper that does something like:

  long cnt = atomic_long_read(&ref->refcnt);
  do {
          if (cnt < 0)
                  return false;
  } while (!atomic_long_try_cmpxchg(&ref->refcnt, &cnt, cnt + 1));
  return true;

  which would block FILE_REF_NOREF to FILE_REF_ONEREF transitions.

- Jann correctly pointed out that kmem_cache_zalloc() cannot be used
  anymore once files have been ported to file_ref_t.

  The kmem_cache_zalloc() call will memset() the whole struct file to
  zero when it is reallocated. This will also set file->f_ref to zero
  which mens that a concurrent file_ref_get() can return true:

  CPU1                            CPU2
                                  __get_file_rcu()
                                    rcu_dereference_raw()
  close()
    [frees file]
  alloc_empty_file()
    kmem_cache_zalloc()
      [reallocates same file]
      memset(..., 0, ...)
                                    file_ref_get()
                                      [increments 0->1, returns true]
    init_file()
      file_ref_init(..., 1)
        [sets to 0]
                                    rcu_dereference_raw()
                                    fput()
                                      file_ref_put()
                                        [decrements 0->FILE_REF_NOREF, frees file]
    [UAF]

   causing a concurrent __get_file_rcu() call to acquire a reference to
   the file that is about to be reallocated and immediately freeing it
   on realizing that it has been recycled. This causes a UAF for the
   task that reallocated/recycled the file.

   This is prevented by switching from kmem_cache_zalloc() to
   kmem_cache_alloc() and initializing the fields manually. With
   file->f_ref initialized last.

   Note that a memset() also isn't guaranteed to atomically update an
   unsigned long so it's theoretically possible to see torn and
   therefore bogus counter values.

Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-3-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-30 09:57:43 +01:00
Zhihao Cheng
aec8e6bf83 btrfs: fix use-after-free of block device file in __btrfs_free_extra_devids()
Mounting btrfs from two images (which have the same one fsid and two
different dev_uuids) in certain executing order may trigger an UAF for
variable 'device->bdev_file' in __btrfs_free_extra_devids(). And
following are the details:

1. Attach image_1 to loop0, attach image_2 to loop1, and scan btrfs
   devices by ioctl(BTRFS_IOC_SCAN_DEV):

             /  btrfs_device_1 → loop0
   fs_device
             \  btrfs_device_2 → loop1
2. mount /dev/loop0 /mnt
   btrfs_open_devices
    btrfs_device_1->bdev_file = btrfs_get_bdev_and_sb(loop0)
    btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
   btrfs_fill_super
    open_ctree
     fail: btrfs_close_devices // -ENOMEM
	    btrfs_close_bdev(btrfs_device_1)
             fput(btrfs_device_1->bdev_file)
	      // btrfs_device_1->bdev_file is freed
	    btrfs_close_bdev(btrfs_device_2)
             fput(btrfs_device_2->bdev_file)

3. mount /dev/loop1 /mnt
   btrfs_open_devices
    btrfs_get_bdev_and_sb(&bdev_file)
     // EIO, btrfs_device_1->bdev_file is not assigned,
     // which points to a freed memory area
    btrfs_device_2->bdev_file = btrfs_get_bdev_and_sb(loop1)
   btrfs_fill_super
    open_ctree
     btrfs_free_extra_devids
      if (btrfs_device_1->bdev_file)
       fput(btrfs_device_1->bdev_file) // UAF !

Fix it by setting 'device->bdev_file' as 'NULL' after closing the
btrfs_device in btrfs_close_one_device().

Fixes: 1423881941 ("btrfs: do not background blkdev_put()")
CC: stable@vger.kernel.org # 4.19+
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219408
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-29 21:59:25 +01:00
Chuck Lever
63fab04cbd NFSD: Initialize struct nfsd4_copy earlier
Ensure the refcount and async_copies fields are initialized early.
cleanup_async_copy() will reference these fields if an error occurs
in nfsd4_copy(). If they are not correctly initialized, at the very
least, a refcount underflow occurs.

Reported-by: Olga Kornievskaia <okorniev@redhat.com>
Fixes: aadc3bbea1 ("NFSD: Limit the number of concurrent async COPY operations")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Tested-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-29 15:31:18 -04:00
Christoph Hellwig
2f5a65ef30 block: add a bdev_limits helper
Add a helper to get the queue_limits from the bdev without having to
poke into the request_queue.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241029141937.249920-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-29 09:15:00 -06:00
Piotr Zalewski
3726a1970b bcachefs: Fix NULL ptr dereference in btree_node_iter_and_journal_peek
Add NULL check for key returned from bch2_btree_and_journal_iter_peek in
btree_node_iter_and_journal_peek to avoid NULL ptr dereference in
bch2_bkey_buf_reassemble.

When key returned from bch2_btree_and_journal_iter_peek is NULL it means
that btree topology needs repair. Print topology error message with
position at which node wasn't found, its parent node information and
btree_id with level.

Return error code returned by bch2_topology_error to ensure that topology
error is handled properly by recovery.

Reported-by: syzbot+005ef9aa519f30d97657@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=005ef9aa519f30d97657
Fixes: 5222a4607c ("bcachefs: BTREE_ITER_WITH_JOURNAL")
Suggested-by: Alan Huang <mmpgouride@gmail.com>
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:11 -04:00
Gaosheng Cui
ca959e328b bcachefs: fix possible null-ptr-deref in __bch2_ec_stripe_head_get()
The function ec_new_stripe_head_alloc() returns nullptr if kzalloc()
fails. It is crucial to verify its return value before dereferencing
it to avoid a potential nullptr dereference.

Fixes: 035d72f72c ("bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices")
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:10 -04:00
Kent Overstreet
778ac324cc bcachefs: Fix deadlock on -ENOSPC w.r.t. partial open buckets
Open buckets on the partial list should not count as allocated when
we're trying to allocate from the partial list.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:10 -04:00
Kent Overstreet
e0fafac5c4 bcachefs: Don't filter partial list buckets in open_buckets_to_text()
these are an important source of stranded buckets we need to be able to
watch

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:10 -04:00
Kent Overstreet
a34eef6dd1 bcachefs: Don't keep tons of cached pointers around
We had a bug report where the data update path was creating an extent
that failed to validate because it had too many pointers; almost all of
them were cached.

To fix this, we have:
- want_cached_ptr(), a new helper that checks if we even want a cached
  pointer (is on appropriate target, device is readable).

- bch2_extent_set_ptr_cached() now only sets a pointer cached if we want
  it.

- bch2_extent_normalize_by_opts() now ensures that we only have a single
  cached pointer that we want.

While working on this, it was noticed that this doesn't work well with
reflinked data and per-file options. Another patch series is coming that
plumbs through additional io path options through bch_extent_rebalance,
with improved option handling.

Reported-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:10 -04:00
Piotr Zalewski
3fd27e9c57 bcachefs: init freespace inited bits to 0 in bch2_fs_initialize
Initialize freespace_initialized bits to 0 in member's flags and update
member's cached version for each device in bch2_fs_initialize.

It's possible for the bits to be set to 1 before fs is initialized and if
call to bch2_trans_mark_dev_sbs (just before bch2_fs_freespace_init) fails
bits remain to be 1 which can later indirectly trigger BUG condition in
bch2_bucket_alloc_freelist during shutdown.

Reported-by: syzbot+2b6a17991a6af64f9489@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=2b6a17991a6af64f9489
Fixes: bbe682c767 ("bcachefs: Ensure devices are always correctly initialized")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:10 -04:00
Kent Overstreet
c1fa854acc bcachefs: Fix unhandled transaction restart in fallocate
This used to not matter, but now we're being more strict.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-29 06:34:10 -04:00
Ryusuke Konishi
41e192ad27 nilfs2: fix kernel bug due to missing clearing of checked flag
Syzbot reported that in directory operations after nilfs2 detects
filesystem corruption and degrades to read-only,
__block_write_begin_int(), which is called to prepare block writes, may
fail the BUG_ON check for accesses exceeding the folio/page size,
triggering a kernel bug.

This was found to be because the "checked" flag of a page/folio was not
cleared when it was discarded by nilfs2's own routine, which causes the
sanity check of directory entries to be skipped when the directory
page/folio is reloaded.  So, fix that.

This was necessary when the use of nilfs2's own page discard routine was
applied to more than just metadata files.

Link: https://lkml.kernel.org/r/20241017193359.5051-1-konishi.ryusuke@gmail.com
Fixes: 8c26c4e269 ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: syzbot+d6ca2daf692c7a82f959@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=d6ca2daf692c7a82f959
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28 21:40:40 -07:00
Edward Adam Davis
bc0a2f3a73 ocfs2: pass u64 to ocfs2_truncate_inline maybe overflow
Syzbot reported a kernel BUG in ocfs2_truncate_inline.  There are two
reasons for this: first, the parameter value passed is greater than
ocfs2_max_inline_data_with_xattr, second, the start and end parameters of
ocfs2_truncate_inline are "unsigned int".

So, we need to add a sanity check for byte_start and byte_len right before
ocfs2_truncate_inline() in ocfs2_remove_inode_range(), if they are greater
than ocfs2_max_inline_data_with_xattr return -EINVAL.

Link: https://lkml.kernel.org/r/tencent_D48DB5122ADDAEDDD11918CFB68D93258C07@qq.com
Fixes: 1afc32b952 ("ocfs2: Write support for inline data")
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Reported-by: syzbot+81092778aac03460d6b7@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=81092778aac03460d6b7
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Gang He <ghe@suse.com>
Cc: Jun Piao <piaojun@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28 21:40:40 -07:00
Lorenzo Stoakes
f64e67e5d3 fork: do not invoke uffd on fork if error occurs
Patch series "fork: do not expose incomplete mm on fork".

During fork we may place the virtual memory address space into an
inconsistent state before the fork operation is complete.

In addition, we may encounter an error during the fork operation that
indicates that the virtual memory address space is invalidated.

As a result, we should not be exposing it in any way to external machinery
that might interact with the mm or VMAs, machinery that is not designed to
deal with incomplete state.

We specifically update the fork logic to defer khugepaged and ksm to the
end of the operation and only to be invoked if no error arose, and
disallow uffd from observing fork events should an error have occurred.


This patch (of 2):

Currently on fork we expose the virtual address space of a process to
userland unconditionally if uffd is registered in VMAs, regardless of
whether an error arose in the fork.

This is performed in dup_userfaultfd_complete() which is invoked
unconditionally, and performs two duties - invoking registered handlers
for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
userfaultfd_fork_ctx objects established in dup_userfaultfd().

This is problematic, because the virtual address space may not yet be
correctly initialised if an error arose.

The change in commit d240629148 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.

We address this by, on fork error, ensuring that we roll back state that
we would otherwise expect to clean up through the event being handled by
userland and perform the memory freeing duty otherwise performed by
dup_userfaultfd_complete().

We do this by implementing a new function, dup_userfaultfd_fail(), which
performs the same loop, only decrementing reference counts.

Note that we perform mmgrab() on the parent and child mm's, however
userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
zero, so we will avoid memory leaks correctly here.

Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28 21:40:38 -07:00
André Almeida
58e55efd6c
tmpfs: Add casefold lookup support
Enable casefold lookup in tmpfs, based on the encoding defined by
userspace. That means that instead of comparing byte per byte a file
name, it compares to a case-insensitive equivalent of the Unicode
string.

* Dcache handling

There's a special need when dealing with case-insensitive dentries.
First of all, we currently invalidated every negative casefold dentries.
That happens because currently VFS code has no proper support to deal
with that, giving that it could incorrectly reuse a previous filename
for a new file that has a casefold match. For instance, this could
happen:

$ mkdir DIR
$ rm -r DIR
$ mkdir dir
$ ls
DIR/

And would be perceived as inconsistency from userspace point of view,
because even that we match files in a case-insensitive manner, we still
honor whatever is the initial filename.

Along with that, tmpfs stores only the first equivalent name dentry used
in the dcache, preventing duplications of dentries in the dcache. The
d_compare() version for casefold files uses a normalized string, so the
filename under lookup will be compared to another normalized string for
the existing file, achieving a casefolded lookup.

* Enabling casefold via mount options

Most filesystems have their data stored in disk, so casefold option need
to be enabled when building a filesystem on a device (via mkfs).
However, as tmpfs is a RAM backed filesystem, there's no disk
information and thus no mkfs to store information about casefold.

For tmpfs, create casefold options for mounting. Userspace can then
enable casefold support for a mount point using:

$ mount -t tmpfs -o casefold=utf8-12.1.0 fs_name mount_dir/

Userspace must set what Unicode standard is aiming to. The available
options depends on what the kernel Unicode subsystem supports.

And for strict encoding:

$ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name mount_dir/

Strict encoding means that tmpfs will refuse to create invalid UTF-8
sequences. When this option is not enabled, any invalid sequence will be
treated as an opaque byte sequence, ignoring the encoding thus not being
able to be looked up in a case-insensitive way.

* Check for casefold dirs on simple_lookup()

On simple_lookup(), do not create dentries for casefold directories.
Currently, VFS does not support case-insensitive negative dentries and
can create inconsistencies in the filesystem. Prevent such dentries to
being created in the first place.

Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-6-f443d5814194@igalia.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:36:55 +01:00
André Almeida
458532c8df
libfs: Export generic_ci_ dentry functions
Export generic_ci_ dentry functions so they can be used by
case-insensitive filesystems that need something more custom than the
default one set by `struct generic_ci_dentry_ops`.

Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-5-f443d5814194@igalia.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:36:54 +01:00
André Almeida
142fa60f61
unicode: Recreate utf8_parse_version()
All filesystems that currently support UTF-8 casefold can fetch the
UTF-8 version from the filesystem metadata stored on disk. They can get
the data stored and directly match it to a integer, so they can skip the
string parsing step, which motivated the removal of this function in the
first place.

However, for tmpfs, the only way to tell the kernel which UTF-8 version
we are about to use is via mount options, using a string. Re-introduce
utf8_parse_version() to be used by tmpfs.

This version differs from the original by skipping the intermediate step
of copying the version string to an auxiliary string before calling
match_token(). This versions calls match_token() in the argument string.
The paramenters are simpler now as well.

utf8_parse_version() was created by 9d53690f0d ("unicode: implement
higher level API for string handling") and later removed by 49bd03cc7e
("unicode: pass a UNICODE_AGE() tripple to utf8_load").

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-4-f443d5814194@igalia.com
Reviewed-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:36:54 +01:00
André Almeida
04dad6c6d3
unicode: Export latest available UTF-8 version number
Export latest available UTF-8 version number so filesystems can easily
load the newest one.

Signed-off-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-3-f443d5814194@igalia.com
Acked-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:36:54 +01:00
André Almeida
3f5ad0d21d
ext4: Use generic_ci_validate_strict_name helper
Use the helper function to check the requirements for casefold
directories using strict encoding.

Suggested-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: André Almeida <andrealmeid@igalia.com>
Link: https://lore.kernel.org/r/20241021-tonyk-tmpfs-v8-2-f443d5814194@igalia.com
Acked-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Gabriel Krisman Bertazi <krisman@suse.de>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:36:53 +01:00
Pankaj Raghav
30dac24e14
fs/writeback: convert wbc_account_cgroup_owner to take a folio
Most of the callers of wbc_account_cgroup_owner() are converting a folio
to page before calling the function. wbc_account_cgroup_owner() is
converting the page back to a folio to call mem_cgroup_css_from_folio().

Convert wbc_account_cgroup_owner() to take a folio instead of a page,
and convert all callers to pass a folio directly except f2fs.

Convert the page to folio for all the callers from f2fs as they were the
only callers calling wbc_account_cgroup_owner() with a page. As f2fs is
already in the process of converting to folios, these call sites might
also soon be calling wbc_account_cgroup_owner() with a folio directly in
the future.

No functional changes. Only compile tested.

Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20240926140121.203821-1-kernel@pankajraghav.com
Acked-by: David Sterba <dsterba@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:26:54 +01:00
Ian Kent
f19910006e
autofs: fix thinko in validate_dev_ioctl()
I was so sure the per-dentry expire timeout patch worked ok but my
testing was flawed.

In validate_dev_ioctl() the check for ioctl AUTOFS_DEV_IOCTL_TIMEOUT_CMD
should use the ioctl number not the passed in ioctl command.

Fixes: 433f9d76a0 ("autofs: add per dentry expire timeout")
Cc: <stable@vger.kernel.org> # mainline only
Signed-off-by: Ian Kent <raven@themaw.net>
Link: https://lore.kernel.org/r/20241027224732.5507-1-raven@themaw.net
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-28 13:16:56 +01:00
Jinjie Ruan
3abab905b1 ksmbd: Fix the missing xa_store error check
xa_store() can fail, it return xa_err(-EINVAL) if the entry cannot
be stored in an XArray, or xa_err(-ENOMEM) if memory allocation failed,
so check error for xa_store() to fix it.

Cc: stable@vger.kernel.org
Fixes: b685757c7b ("ksmbd: Implements sess->rpc_handle_list as xarray")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-28 08:30:05 +09:00
Linus Torvalds
a8b3be2617 XFS bug fixes for 6.12-rc5
* fix recovery of allocator ops after a growfs
 * Do not fail repairs on metadata files with no attr fork
 
 Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZxjo7gAKCRBGdaER5Qtf
 pr26AYCUc9+Vlg5iReesrghYHJgeCaMYZm2i4WdNdI+BO8d+5+AA1oUO55ib3xWd
 fX8A0MEBf32eeMR0E+K0NeKsmHnbGHXyWRg/27IlNRniL4/yldssEFB8X3b7Gkw5
 /geUVdz99A==
 =+NGs
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - Fix recovery of allocator ops after a growfs

 - Do not fail repairs on metadata files with no attr fork

* tag 'xfs-6.12-fixes-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: update the pag for the last AG at recovery time
  xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag
  xfs: error out when a superblock buffer update reduces the agcount
  xfs: update the file system geometry after recoverying superblock buffers
  xfs: merge the perag freeing helpers
  xfs: pass the exact range to initialize to xfs_initialize_perag
  xfs: don't fail repairs on metadata files with no attr fork
2024-10-27 08:23:49 -10:00
Linus Torvalds
850925a813 Revert patches causing inode collision problems
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmccC7UACgkQq06b7GqY
 5nBYbA/7BP/Me80+ofClszxd0F3nlXCGJPe31vC36EeFsap5p1TrIM7jvj0iR0Zr
 HXLxyTwocrubFgSE28rUrSS2nhmk016g3EUEVeoaO+hYSbvtdrRhoTSCubEwb4N5
 68vUz1ebGqZ35oT9oVZeAd+xdwxpKPktqrN6zCJsyWbSiZevE8YMkV+dgYrGde+Y
 j7BMypcxtQ2+u8donJ1PnGz04k0RNhqcqGGED0VuNw1Sp6quY8n0EWvHEeYjrb+N
 hFYB1lSeA245fVm8bmnUFHLJ2w47bxfJZ8hANbO9C0zLI98vlmH6xItXQnGofc4P
 1VtCHxoZnIFMMjK/iTPjvi4JsanNr0mS1Aa5w8HAHqyHYeIoBCpn78XFf79IhFAl
 pO3RV6nxE8fELXBM39Yyq7I3rQfyFmNBeob0UWIbazQ7cC54xBHpu0RDMj2gD6xy
 TBbZL2euRfGEswKWfrdR32U1jzSCv+jTAWew9sdx+z0RX9inRyaNBHFVsjOnVFZS
 iLJn20cotgw0WU2OhghWvgsV4o/Ckx4eJAC9oNxZ9pT1s7BqSvW1qg/Zlpn9wYgf
 UGQGXrCdXGY6XZ3W53c2joS+QWtmU6N8mAC6jCeqQm8pCQQeiTHE1QycyNc0xPGd
 ELhQYzCA1hEiNDki3e/vqzbPScpAer9dc/6hMKUT55SVfjcCBoc=
 =O6o1
 -----END PGP SIGNATURE-----

Merge tag '9p-for-6.12-rc5' of https://github.com/martinetd/linux

Pull more 9p reverts from Dominique Martinet:
 "Revert patches causing inode collision problems.

  The code simplification introduced significant regressions on servers
  that do not remap inode numbers when exporting multiple underlying
  filesystems with colliding inodes. See the top-most revert (commit
  be2ca38253) for details.

  This problem had been ignored for too long and the reverts will also
  head to stable (6.9+).

  I'm confident this set of patches gets us back to previous behaviour
  (another related patch had already been reverted back in April and
  we're almost back to square 1, and the rest didn't touch inode
  lifecycle)"

* tag '9p-for-6.12-rc5' of https://github.com/martinetd/linux:
  Revert "fs/9p: simplify iget to remove unnecessary paths"
  Revert "fs/9p: fix uaf in in v9fs_stat2inode_dotl"
  Revert "fs/9p: remove redundant pointer v9ses"
  Revert " fs/9p: mitigate inode collisions"
2024-10-25 15:25:02 -07:00
Linus Torvalds
c71f8fb4dc two fixes for stable
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcblTgACgkQiiy9cAdy
 T1HlwgwAm1aDntHOUht5HqhmZpNT89hVPWqbH8KuAUTUewNE2g5GEgqtX2Fqbl+i
 Nh09ipjFVqQzTL17t0lxlE2dTann5HoyZFCplGYo0KQUNgDPKpi6oeEDXufLOTkj
 SXE4CTw0fZWSNRwZKffbRfPRkMLr8/Gn1BBiPKU+fd9G2+sszuc+h6ovy8pXNM+W
 w49agufwENOVmMoJBpDBOvDlorzpWCoIV8NKHsbxBR4dijR6oRQWbb1Dk/YtivAn
 vC9rHbgyOVVR5IOJKUAF4PjNwXxDpRlitUyY2GF4gGxITy/gHK8054i4mfT79IAa
 RVL0wMoxE1IZfv6fDiP0cXvN3w7X935Z6ggjKgcvP3zOajcs11PCOPSIX4LEeTI5
 B2Tn8i+4+uB7INcFUBxyoFb3lM2v8L/ejWii0cTsy3HOnaU8D36NZJLLQT0g2PIk
 WcGvzz/+l63tSBWZY9uEcD21H50WJbZpK4DKnkG8gClmC2QJBw5nR4F/XvbO3Z/7
 k3yE1Gfp
 =agOS
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Fix init module error caseb

 - Fix memory allocation error path (for passwords) in mount

* tag 'v6.12-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix warning when destroy 'cifs_io_request_pool'
  smb: client: Handle kstrdup failures for passwords
2024-10-25 11:45:22 -07:00
Linus Torvalds
81dcc79758 fuse fixes for 6.12-rc5
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCZxu02AAKCRDh3BK/laaZ
 PLDHAPwMzz4c+wbqz8Qo2IEo3lxvgPjgzMNXQetCgZFKvxKRlwD+PaIeRixGwwmB
 ON3IsScZjROphzb+ofroUpj7lLEM7Ag=
 =9rYN
 -----END PGP SIGNATURE-----

Merge tag 'fuse-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse

Pull fuse fixes from Miklos Szeredi:

 - Fix cached size after passthrough writes

   This fix needed a trivial change in the backing-file API, which
   resulted in some non-fuse files being touched.

 - Revert a commit meant as a cleanup but which triggered a WARNING

 - Remove a stray debug line left-over

* tag 'fuse-fixes-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
  fuse: remove stray debug line
  Revert "fuse: move initialization of fuse_file to fuse_writepages() instead of in callback"
  fuse: update inode size after extending passthrough write
  fs: pass offset and result to backing_file end_write() callback
2024-10-25 11:41:18 -07:00
Linus Torvalds
f647053312 nfsd-6.12 fixes:
- Fix a couple of use-after-free bugs
 -----BEGIN PGP SIGNATURE-----
 
 iQIyBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmcbp9YACgkQM2qzM29m
 f5dzAg/4lCFRbsebia7qktW88ZDBbAoZyKolk2DHfjZXCuq5DHb/5I/Hk1rGyTYs
 VaJmCU59ZdpyBFSdQhOYKf2xvgNPvJG02U8il5KWtMAY5cStXFjeU0FSDoC5O4Dl
 9IoaVbtAWGMCjxWJ1WEGpU82JoM7moSVB4G718LlxF+4cUS7idq5se0uK31WQvft
 DmJsOfmnch1Y/7+RRcDwbBu0HwP2ZQHS8zMYMQ2JGXPDZJFenTibezVb36YyzyZn
 WQfvaW6MmdiVL9omZxvURL9WuBKA2L+Ly/92PyHaflcAXSngcpfu28IIQbzp/m7K
 JT/3ad32lB7F3LrDP5I4gwVh8oGLYiI5r7RBWo0e98LPvAR/89gBVdZHhjasstCh
 nAL7Kk6P/jQdbM/KR9T+yS7xTVScI5Wp4Xcitz2mlHgU4br67GO9gpo1e/tqXenm
 Gasapkg5qCduz+ksj2vwpeFXKQi+qwJgfVGKMxELoV8qTazyr09Dfgouqe045ztl
 /0khkOrLkw0bYLDNJKhj/XG0ZEV5V10c/0PEnivC2BHVQioBIDRQAaH2S2G/8vQH
 MdWayGhNlTV0g4DdtMVPxf5uN+uQmfMsj5BIe17NIUYxiJnw0CSM/bzHUcJ15+DH
 nTblaek6sa5BFpZ2fed8by4il4/tme3NOJEeHnSqe/3QKyZgOQ==
 =1YOr
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix a couple of use-after-free bugs

* tag 'nfsd-6.12-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
  nfsd: fix race between laundromat and free_stateid
2024-10-25 11:38:15 -07:00
Kent Overstreet
8e910ca20e bcachefs: Fix UAF in bch2_reconstruct_alloc()
write_super() -> sb_counters_from_cpu() may reallocate the superblock

Reported-by: syzbot+9fc4dac4775d07bcfe34@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-25 13:17:23 -04:00
Jeongjun Park
a25a83de45 bcachefs: fix null-ptr-deref in have_stripes()
c->btree_roots_known[i].b can be NULL. In this case, a NULL pointer dereference
occurs, so you need to add code to check the variable.

Reported-by: syzbot+b468b9fef56949c3b528@syzkaller.appspotmail.com
Fixes: 7773df19c3 ("bcachefs: metadata version bucket_stripe_sectors")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-25 13:17:06 -04:00
Miklos Szeredi
d34a5575e6 fuse: remove stray debug line
It wasn't there when the patch was posted for review, but somehow made it
into the pull.

Link: https://lore.kernel.org/all/20240913104703.1673180-1-mszeredi@redhat.com/
Fixes: efad7153bf ("fuse: allow O_PATH fd for FUSE_DEV_IOC_BACKING_OPEN")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-10-25 17:05:49 +02:00
Jeongjun Park
5c41f75d1b bcachefs: fix shift oob in alloc_lru_idx_fragmentation
The size of a.data_type is set abnormally large, causing shift-out-of-bounds.
To fix this, we need to add validation on a.data_type in
alloc_lru_idx_fragmentation().

Reported-by: syzbot+7f45fa9805c40db3f108@syzkaller.appspotmail.com
Fixes: 260af1562e ("bcachefs: Kill alloc_v4.fragmentation_lru")
Signed-off-by: Jeongjun Park <aha310510@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-24 17:41:43 -04:00
Gianfranco Trad
2045fc4295 bcachefs: Fix invalid shift in validate_sb_layout()
Add check on layout->sb_max_size_bits against BCH_SB_LAYOUT_SIZE_BITS_MAX
to prevent UBSAN shift-out-of-bounds in validate_sb_layout().

Reported-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=089fad5a3a5e77825426
Fixes: 03ef80b469 ("bcachefs: Ignore unknown mount options")
Tested-by: syzbot+089fad5a3a5e77825426@syzkaller.appspotmail.com
Signed-off-by: Gianfranco Trad <gianf.trad@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-24 17:41:43 -04:00
Dominique Martinet
be2ca38253 Revert "fs/9p: simplify iget to remove unnecessary paths"
This reverts commit 724a08450f.

This code simplification introduced significant regressions on servers
that do not remap inode numbers when exporting multiple underlying
filesystems with colliding inodes, as can be illustrated with simple
tmpfs exports in qemu with remapping disabled:
```
# host side
cd /tmp/linux-test
mkdir m1 m2
mount -t tmpfs tmpfs m1
mount -t tmpfs tmpfs m2
mkdir m1/dir m2/dir
echo foo > m1/dir/foo
echo bar > m2/dir/bar

# guest side
# started with -virtfs local,path=/tmp/linux-test,mount_tag=tmp,security_model=mapped-file
mount -t 9p -o trans=virtio,debug=1 tmp /mnt/t

ls /mnt/t/m1/dir
# foo
ls /mnt/t/m2/dir
# bar (works ok if directry isn't open)

# cd to keep first dir's inode alive
cd /mnt/t/m1/dir
ls /mnt/t/m2/dir
# foo (should be bar)
```
Other examples can be crafted with regular files with fscache enabled,
in which case I/Os just happen to the wrong file leading to
corruptions, or guest failing to boot with:
  | VFS: Lookup of 'com.android.runtime' in 9p 9p would have caused loop

In theory, we'd want the servers to be smart enough and ensure they
never send us two different files with the same 'qid.path', but while
qemu has an option to remap that is recommended (and qemu prints a
warning if this case happens), there are many other servers which do
not (kvmtool, nfs-ganesha, probably diod...), we should at least ensure
we don't cause regressions on this:
- assume servers can't be trusted and operations that should get a 'new'
inode properly do so. commit d05dcfdf5e (" fs/9p: mitigate inode
collisions") attempted to do this, but v9fs_fid_iget_dotl() was not
called so some higher level of caching got in the way; this needs to be
fixed properly before we can re-apply the patches.
- if we ever want to really simplify this code, we will need to add some
negotiation with the server at mount time where the server could claim
they handle this properly, at which point we could optimize this out.
(but that might not be needed at all if we properly handle the 'new'
check?)

Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/all/20240408141436.GA17022@redhat.com/
Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck
Cc: stable@vger.kernel.org # v6.9+
Message-ID: <20241024-revert_iget-v1-4-4cac63d25f72@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2024-10-25 06:26:09 +09:00
Dominique Martinet
26f8dd2dde Revert "fs/9p: fix uaf in in v9fs_stat2inode_dotl"
This reverts commit 11763a8598.

This is a requirement to revert commit 724a08450f ("fs/9p: simplify
iget to remove unnecessary paths"), see that revert for details.

Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck
Cc: stable@vger.kernel.org # v6.9+
Message-ID: <20241024-revert_iget-v1-3-4cac63d25f72@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2024-10-25 06:26:09 +09:00
Dominique Martinet
fedd06210b Revert "fs/9p: remove redundant pointer v9ses"
This reverts commit 10211b4a23.

This is a requirement to revert commit 724a08450f ("fs/9p: simplify
iget to remove unnecessary paths"), see that revert for details.

Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck
Cc: stable@vger.kernel.org # v6.9+
Message-ID: <20241024-revert_iget-v1-2-4cac63d25f72@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2024-10-25 06:26:09 +09:00
Dominique Martinet
f69999b5f9 Revert " fs/9p: mitigate inode collisions"
This reverts commit d05dcfdf5e.

This is a requirement to revert commit 724a08450f ("fs/9p: simplify
iget to remove unnecessary paths"), see that revert for details.

Fixes: 724a08450f ("fs/9p: simplify iget to remove unnecessary paths")
Reported-by: Will Deacon <will@kernel.org>
Link: https://lkml.kernel.org/r/20240923100508.GA32066@willie-the-truck
Cc: stable@vger.kernel.org # v6.9+
Message-ID: <20241024-revert_iget-v1-1-4cac63d25f72@codewreck.org>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
2024-10-25 06:26:08 +09:00
Linus Torvalds
4e46774408 for-6.12-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcZGqsACgkQxWXV+ddt
 WDsZTg//dSLAAswT3dTAvWDt7rGxPhJC1V+gnzWdj/0Q3CUimNaw7zHp1QHtJDFT
 S+3TVvJJ2JDvOnbi3u24s/bL5YdkWcvIyy87oVE4trJMbQPc/E45pkYFqF3TRQAH
 wEYAVEnO9f9WUY+ekxX2XBzhmKp3xol93j9BBHDXOF9kHsDC5lI0D5YVYEhRY8Qu
 D4GcqAhqUj5Hj6I6ppiqO47NCJBNzw1Se9QsgruPpmItRbB0/LYJFhUfevwosTKg
 xVpkRVntFqQjkIdIdBfBv/ZGWfxJyM7K4M49QwLUqUQfxugu7BiGYuEjkBkiy07a
 pZDEOF9s8wUZsxvRVohqKlhL0zHBF+/pAANowYuhKNW1sqKt4GVCdN3V34AbH8ST
 JIbPvC2g1tUzIc2wckE8GO/NnsNR4r3k6iPB53MdCHrIo3jnENOeb9wF0GiVDb6s
 OrhCa3ph2ps80YC1aCnc4Jr/yV2ONebSivnqvHCIUQEZfpjtAc07atm0i946/7nx
 eiBE+9zSVJZB0LoooIVz2I3HX3lRnm3Wwi4nj8U/sNL/IbGaHNCurIETR23NivYP
 yWVql2njwE+yMc8q9YZXs4MBdKsSGP6eGJW3ZwKi6ru2PJdf5eIib+ffvR4bLqXD
 UUDfMyC1esGsQB24sc8wppk97wmmMrdqQUj9WnmNlTP8FUFggvQ=
 =sYxX
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - mount option fixes:
     - fix handling of compression mount options on remount
     - reject rw remount in case there are options that don't work
       in read-write mode (like rescue options)

 - fix zone accounting of unusable space

 - fix in-memory corruption when merging extent maps

 - fix delalloc range locking for sector < page

 - use more convenient default value of drop subtree threshold, clean
   more subvolumes without the fallback to marking quotas inconsistent

 - fix smatch warning about incorrect value passed to ERR_PTR

* tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
  btrfs: reject ro->rw reconfiguration if there are hard ro requirements
  btrfs: fix read corruption due to race with extent map merging
  btrfs: fix the delalloc range locking if sector size < page size
  btrfs: qgroup: set a more sane default value for subtree drop threshold
  btrfs: clear force-compress on remount when compress mount option is given
  btrfs: zoned: fix zone unusable accounting for freed reserved extent
2024-10-24 13:04:15 -07:00
Linus Torvalds
6cc65abee8 Fix a regression introduced in 6.12-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEIodevzQLVs53l6BhNqiEXrVAjGQFAmcY+6QACgkQNqiEXrVA
 jGRitw//TWl2UsPGzHCJXrdQTDmceGpCXoH+2C04eRe1Kk0ILqkJdSobbM1e10My
 0eb/leiOqP90d+6WnrwWIpqcBOriaXs1NN7H6qdZpyE/xmmEPLfLKMU0SB6jMDaz
 A9BNoRGvKmxy960kid0OsDd79uOKyu95PT1VsRET5bm10LYUSM1E/qwMxaLydX3s
 g8kVLakgKPKq0iKXNpOuFgTKxhNJACaIm1a1+Lq55lXXtg8HN3QuLKwM/rXD+yOD
 0+imvhTKcQBQGqHeRJtssk93BxKqA42FXsn36Zi+zBmLY75WshKZd0USihpCYlwE
 zL7JUSPmtFHtxkbjfr3pK6HzDlSg7rzp78BPIMeWOymZh4tQ9W8G6UtPHTb6KYoP
 WLY/LCe4IVQDnpyeKDcCPcLEe4/Jgk1QbCGtqjoOzrSuIEGReAP6LSxs9A9Kc7VD
 AVa2OLMXoyD+oLyHk7uuIuZch51bD34mhaGhOB+fyX5xz6ZocSOei+1/yiTWHwdD
 3AAmbIvVc2rNxlg/O7t+UvrAACWLZhZ/uqQLqcAmdT5CNDZZHdNRZGy3pe5ELa+F
 PIzERoDlrALPAASOGqvmFsbVEZSG6z89uTREhFxAErryWBESHK4nPJ8WqHlybSwm
 ISQeTJ/h9+vgw7poeHtYDmghh0Eh8H6/iy+DTLCqZ16lxytfvRE=
 =sZ6X
 -----END PGP SIGNATURE-----

Merge tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggy

Pull jfs fix from David Kleikamp:
 "Fix a regression introduced in 6.12-rc1"

* tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggy:
  jfs: Fix sanity check in dbMount
2024-10-24 12:47:01 -07:00
Linus Torvalds
c1e822754c bcachefs fixes for 6.12-rc5
Lots of hotfixes:
 - transaction restart injection has been shaking out a few things
 
 - fix a data corruption in the buffered write path on -ENOSPC, found by
   xfstests generic/299
 
 - Some small show_options fixes
 
 - Repair mismatches in inode hash type, seed: different snapshot
   versions of an inode must have the same hash/type seed, used for
   directory entries and xattrs. We were checking the hash seed, but not
   the type, and a user contributed a filesystem where the hash type on
   one inode had somehow been flipped; these fixes allow his filesystem
   to repair.
 
   Additionally, the hash type flip made some directory entries
   invisible, which were then recreated by userspace; so the hash check
   code now checks for duplicate non dangling dirents, and renames one of
   them if necessary.
 
 - Don't use wait_event_interruptible() in recovery: this fixes some
   filesystems failing to mount with -ERESTARTSYS
 
 - Workaround for kvmalloc not supporting > INT_MAX allocations, causing
   an -ENOMEM when allocating the sorted array of journal keys: this
   allows a 75 TB filesystem to mount
 
 - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
   compat path: this alllows Marcin's filesystem (in use since before
   6.7) to repair and mount.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcX4vYACgkQE6szbY3K
 bnbywxAArBfIJfshWq5Wk9WztenzUmyUmV2HIgntT/iN4ty4eIpZ26VSvHcGvgkU
 j3wx+OuxMTPBGc3fjUS+gALf/BGcQEgh6oPZCV+6M3kasTzNzG2jYOCkLqKbpcO1
 V5n/Le/SM1X2grkgTm/H+TulGHNgG9gJ2U4kjihroJrTbTesZhzcW/qlz6RWo7U1
 02NvLop4WE9M6WaW9RzsHK2llRUAl2Z3oRMuwNz3IIijCpm98STGD4gyvGoMV2b8
 qNsXjy7b2lkYObKI29yWF0caRzWK1LRz79afRlnNVSJb6DK1QB83ms5Qa8rprCU4
 uOq0wsGWyg6lzwQ19X+2TvUYABopVk2HXLlzTO/lJrWeMTuYJVPZ7KZi3l6ubw5T
 GIsAD5qMdCm8E5nXX8hG//0rOIl6QK288+zMQyRCvAkCL+iN2k0TU8qKAEEC44de
 vj6ZyNqbuLR39LLz9K09ZhzIZGk09ELpxOJ2Wwwj4ZFriwphWDtFgBtBUpNo/KWA
 inBfq2lZJsmNjfns9vCqOmNOStOJxXnyMOR25sTv7wM69QPGkl41dPY3oeuG8lRk
 cU/qJQKlpTKJbFeXiEKWKDnMzWxOnovqLFC0tKu2qAYM6vAz+AtwTXgthVFGh21U
 QoUDbsnQCCixMkS2AksCo7nivLrxmV/EeYm5pgeiU38VdA5ofBM=
 =OpYN
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Lots of hotfixes:

   - transaction restart injection has been shaking out a few things

   - fix a data corruption in the buffered write path on -ENOSPC, found
     by xfstests generic/299

   - Some small show_options fixes

   - Repair mismatches in inode hash type, seed: different snapshot
     versions of an inode must have the same hash/type seed, used for
     directory entries and xattrs. We were checking the hash seed, but
     not the type, and a user contributed a filesystem where the hash
     type on one inode had somehow been flipped; these fixes allow his
     filesystem to repair.

     Additionally, the hash type flip made some directory entries
     invisible, which were then recreated by userspace; so the hash
     check code now checks for duplicate non dangling dirents, and
     renames one of them if necessary.

   - Don't use wait_event_interruptible() in recovery: this fixes some
     filesystems failing to mount with -ERESTARTSYS

   - Workaround for kvmalloc not supporting > INT_MAX allocations,
     causing an -ENOMEM when allocating the sorted array of journal
     keys: this allows a 75 TB filesystem to mount

   - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
     compat path: this alllows Marcin's filesystem (in use since before
     6.7) to repair and mount"

* tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits)
  bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path
  bcachefs: Mark more errors as AUTOFIX
  bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations
  bcachefs: Don't use wait_event_interruptible() in recovery
  bcachefs: Fix __bch2_fsck_err() warning
  bcachefs: fsck: Improve hash_check_key()
  bcachefs: bch2_hash_set_or_get_in_snapshot()
  bcachefs: Repair mismatches in inode hash seed, type
  bcachefs: Add hash seed, type to inode_to_text()
  bcachefs: INODE_STR_HASH() for bch_inode_unpacked
  bcachefs: Run in-kernel offline fsck without ratelimit errors
  bcachefs: skip mount option handle for empty string.
  bcachefs: fix incorrect show_options results
  bcachefs: Fix data corruption on -ENOSPC in buffered write path
  bcachefs: bch2_folio_reservation_get_partial() is now better behaved
  bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()
  bcachefS: ec: fix data type on stripe deletion
  bcachefs: Don't use commit_do() unnecessarily
  bcachefs: handle restarts in bch2_bucket_io_time_reset()
  bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()
  ...
2024-10-24 12:38:59 -07:00
Dominique Martinet
f009e946c1 Revert "9p: Enable multipage folios"
This reverts commit 1325e4a91a.

using multipage folios apparently break some madvise operations like
MADV_PAGEOUT which do not reliably unload the specified page anymore,

Revert the patch until that is figured out.

Reported-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: 1325e4a91a ("9p: Enable multipage folios")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-24 11:24:05 -07:00
Luca Boccassi
cdda1f26e7
pidfd: add ioctl to retrieve pid info
A common pattern when using pid fds is having to get information
about the process, which currently requires /proc being mounted,
resolving the fd to a pid, and then do manual string parsing of
/proc/N/status and friends. This needs to be reimplemented over
and over in all userspace projects (e.g.: I have reimplemented
resolving in systemd, dbus, dbus-daemon, polkit so far), and
requires additional care in checking that the fd is still valid
after having parsed the data, to avoid races.

Having a programmatic API that can be used directly removes all
these requirements, including having /proc mounted.

As discussed at LPC24, add an ioctl with an extensible struct
so that more parameters can be added later if needed. Start with
returning pid/tgid/ppid and creds unconditionally, and cgroupid
optionally.

Signed-off-by: Luca Boccassi <luca.boccassi@gmail.com>
Link: https://lore.kernel.org/r/20241010155401.2268522-1-luca.boccassi@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-24 13:54:51 +02:00
David Howells
247d65fb12
afs: Fix missing subdir edit when renamed between parent dirs
When rename moves an AFS subdirectory between parent directories, the
subdir also needs a bit of editing: the ".." entry needs updating to point
to the new parent (though I don't make use of the info) and the DV needs
incrementing by 1 to reflect the change of content.  The server also sends
a callback break notification on the subdirectory if we have one, but we
can take care of recovering the promise next time we access the subdir.

This can be triggered by something like:

    mount -t afs %example.com:xfstest.test20 /xfstest.test/
    mkdir /xfstest.test/{aaa,bbb,aaa/ccc}
    touch /xfstest.test/bbb/ccc/d
    mv /xfstest.test/{aaa/ccc,bbb/ccc}
    touch /xfstest.test/bbb/ccc/e

When the pathwalk for the second touch hits "ccc", kafs spots that the DV
is incorrect and downloads it again (so the fix is not critical).

Fix this, if the rename target is a directory and the old and new
parents are different, by:

 (1) Incrementing the DV number of the target locally.

 (2) Editing the ".." entry in the target to refer to its new parent's
     vnode ID and uniquifier.

Link: https://lore.kernel.org/r/3340431.1729680010@warthog.procyon.org.uk
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
cc: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-24 13:50:27 +02:00
Naohiro Aota
d48e1dea39 btrfs: fix error propagation of split bios
The purpose of btrfs_bbio_propagate_error() shall be propagating an error
of split bio to its original btrfs_bio, and tell the error to the upper
layer. However, it's not working well on some cases.

* Case 1. Immediate (or quick) end_bio with an error

When btrfs sends btrfs_bio to mirrored devices, btrfs calls
btrfs_bio_end_io() when all the mirroring bios are completed. If that
btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function
is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error()
accesses the orig_bbio's bio context to increase the error count.

That works well in most cases. However, if the end_io is called enough
fast, orig_bbio's (remaining part after split) bio context may not be
properly set at that time. Since the bio context is set when the orig_bbio
(the last btrfs_bio) is sent to devices, that might be too late for earlier
split btrfs_bio's completion.  That will result in NULL pointer
dereference.

That bug is easily reproducible by running btrfs/146 on zoned devices [1]
and it shows the following trace.

[1] You need raid-stripe-tree feature as it create "-d raid0 -m raid1" FS.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-btrfs-5)
  RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
  BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
  RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
  RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
  RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
  R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
  R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
  FS:  0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ? __die_body.cold+0x19/0x26
   ? page_fault_oops+0x13e/0x2b0
   ? _printk+0x58/0x73
   ? do_user_addr_fault+0x5f/0x750
   ? exc_page_fault+0x76/0x240
   ? asm_exc_page_fault+0x22/0x30
   ? btrfs_bio_end_io+0xae/0xc0 [btrfs]
   ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
   btrfs_orig_write_end_io+0x51/0x90 [btrfs]
   dm_submit_bio+0x5c2/0xa50 [dm_mod]
   ? find_held_lock+0x2b/0x80
   ? blk_try_enter_queue+0x90/0x1e0
   __submit_bio+0xe0/0x130
   ? ktime_get+0x10a/0x160
   ? lockdep_hardirqs_on+0x74/0x100
   submit_bio_noacct_nocheck+0x199/0x410
   btrfs_submit_bio+0x7d/0x150 [btrfs]
   btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
   ? lockdep_hardirqs_on+0x74/0x100
   ? __folio_start_writeback+0x10/0x2c0
   btrfs_submit_bbio+0x1c/0x40 [btrfs]
   submit_one_bio+0x44/0x60 [btrfs]
   submit_extent_folio+0x13f/0x330 [btrfs]
   ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
   extent_writepage_io+0x18b/0x360 [btrfs]
   extent_write_locked_range+0x17c/0x340 [btrfs]
   ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   run_delalloc_cow+0x71/0xd0 [btrfs]
   btrfs_run_delalloc_range+0x176/0x500 [btrfs]
   ? find_lock_delalloc_range+0x119/0x260 [btrfs]
   writepage_delalloc+0x2ab/0x480 [btrfs]
   extent_write_cache_pages+0x236/0x7d0 [btrfs]
   btrfs_writepages+0x72/0x130 [btrfs]
   do_writepages+0xd4/0x240
   ? find_held_lock+0x2b/0x80
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   __writeback_single_inode+0x5c/0x4c0
   ? do_raw_spin_unlock+0x49/0xb0
   writeback_sb_inodes+0x22c/0x560
   __writeback_inodes_wb+0x4c/0xe0
   wb_writeback+0x1d6/0x3f0
   wb_workfn+0x334/0x520
   process_one_work+0x1ee/0x570
   ? lock_is_held_type+0xc6/0x130
   worker_thread+0x1d1/0x3b0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xee/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
  CR2: 0000000000000020

* Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios

btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is
called last among split bios. In that case, btrfs_orig_write_end_io() sets
the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2].
Otherwise, the increased orig_bio's bioc->error is not checked by anyone
and return BLK_STS_OK to the upper layer.

[2] Actually, this is not true. Because we only increases orig_bioc->errors
by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors"
is still not met if only one split btrfs_bio fails.

* Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios

In contrast to the above case, btrfs_bbio_propagate_error() is not working
well if un-mirrored orig_bbio is completed last. It sets
orig_bbio->bio.bi_status to the btrfs_bio's error. But, that is easily
over-written by orig_bbio's completion status. If the status is BLK_STS_OK,
the upper layer would not know the failure.

* Solution

Considering the above cases, we can only save the error status in the
orig_bbio (remaining part after split) itself as it is always
available. Also, the saved error status should be propagated when all the
split btrfs_bios are finished (i.e, bbio->pending_ios == 0).

This commit introduces "status" to btrfs_bbio and saves the first error of
split bios to original btrfs_bio's "status" variable. When all the split
bios are finished, the saved status is loaded into original btrfs_bio's
status.

With this commit, btrfs/146 on zoned devices does not hit the NULL pointer
dereference anymore.

Fixes: 852eee62d3 ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-23 18:17:43 +02:00
Ye Bin
2ce1007f42 cifs: fix warning when destroy 'cifs_io_request_pool'
There's a issue as follows:
WARNING: CPU: 1 PID: 27826 at mm/slub.c:4698 free_large_kmalloc+0xac/0xe0
RIP: 0010:free_large_kmalloc+0xac/0xe0
Call Trace:
 <TASK>
 ? __warn+0xea/0x330
 mempool_destroy+0x13f/0x1d0
 init_cifs+0xa50/0xff0 [cifs]
 do_one_initcall+0xdc/0x550
 do_init_module+0x22d/0x6b0
 load_module+0x4e96/0x5ff0
 init_module_from_file+0xcd/0x130
 idempotent_init_module+0x330/0x620
 __x64_sys_finit_module+0xb3/0x110
 do_syscall_64+0xc1/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Obviously, 'cifs_io_request_pool' is not created by mempool_create().
So just use mempool_exit() to revert 'cifs_io_request_pool'.

Fixes: edea94a697 ("cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: David Howells <dhowells@redhat.com
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-23 07:42:44 -05:00
Henrique Carvalho
9a5dd61151 smb: client: Handle kstrdup failures for passwords
In smb3_reconfigure(), after duplicating ctx->password and
ctx->password2 with kstrdup(), we need to check for allocation
failures.

If ses->password allocation fails, return -ENOMEM.
If ses->password2 allocation fails, free ses->password, set it
to NULL, and return -ENOMEM.

Fixes: c1eb537bf4 ("cifs: allow changing password during remount")
Reviewed-by: David Howells <dhowells@redhat.com
Signed-off-by: Haoxiang Li <make24@iscas.ac.cn>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-23 07:42:22 -05:00
Dave Kleikamp
67373ca840 jfs: Fix sanity check in dbMount
MAXAG is a legitimate value for bmp->db_numag

Fixes: e63866a475 ("jfs: fix out-of-bounds in dbNextAG() and diAlloc()")

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2024-10-22 09:40:37 -05:00
Jens Axboe
fdad1a20cd Merge branch 'for-6.13/block-atomic' into for-6.13/block
Merge in block/fs prep patches for the atomic write support.

* for-6.13/block-atomic:
  block: Add bdev atomic write limits helpers
  fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
  block/fs: Pass an iocb to generic_atomic_write_valid()
2024-10-22 08:21:51 -06:00
Yue Haibing
75f49c3dc7 btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
The ret may be zero in btrfs_search_dir_index_item() and should not
passed to ERR_PTR(). Now btrfs_unlink_subvol() is the only caller to
this, reconstructed it to check ERR_PTR(-ENOENT) while ret >= 0.

This fixes smatch warnings:

fs/btrfs/dir-item.c:353
  btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR'

Fixes: 9dcbe16fcc ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:10:55 +02:00
Qu Wenruo
3c36a72c1d btrfs: reject ro->rw reconfiguration if there are hard ro requirements
[BUG]
Syzbot reports the following crash:

  BTRFS info (device loop0 state MCS): disabling free space tree
  BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
  BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline]
  RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041
  Call Trace:
   <TASK>
   btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530
   btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312
   btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012
   btrfs_remount_rw fs/btrfs/super.c:1309 [inline]
   btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534
   btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline]
   btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline]
   btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115
   vfs_get_tree+0x90/0x2b0 fs/super.c:1800
   do_new_mount+0x2be/0xb40 fs/namespace.c:3472
   do_mount fs/namespace.c:3812 [inline]
   __do_sys_mount fs/namespace.c:4020 [inline]
   __se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

[CAUSE]
To support mounting different subvolume with different RO/RW flags for
the new mount APIs, btrfs introduced two workaround to support this feature:

- Skip mount option/feature checks if we are mounting a different
  subvolume

- Reconfigure the fs to RW if the initial mount is RO

Combining these two, we can have the following sequence:

- Mount the fs ro,rescue=all,clear_cache,space_cache=v1
  rescue=all will mark the fs as hard read-only, so no v2 cache clearing
  will happen.

- Mount a subvolume rw of the same fs.
  We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY
  because our new fc is RW, different from the original fs.

  Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag
  first so that we can grab the existing fs_info.
  Then we reconfigure the fs to RW.

- During reconfiguration, option/features check is skipped
  This means we will restart the v2 cache clearing, and convert back to
  v1 cache.
  This will trigger fs writes, and since the original fs has "rescue=all"
  option, it skips the csum tree read.

  And eventually causing NULL pointer dereference in super block
  writeback.

[FIX]
For reconfiguration caused by different subvolume RO/RW flags, ensure we
always run btrfs_check_options() to ensure we have proper hard RO
requirements met.

In fact the function btrfs_check_options() doesn't really do many
complex checks, but hard RO requirement and some feature dependency
checks, thus there is no special reason not to do the check for mount
reconfiguration.

Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:10:51 +02:00
Boris Burkov
7a2339058e btrfs: fix read corruption due to race with extent map merging
In debugging some corrupt squashfs files, we observed symptoms of
corrupt page cache pages but correct on-disk contents. Further
investigation revealed that the exact symptom was a correct page
followed by an incorrect, duplicate, page. This got us thinking about
extent maps.

commit ac05ca913e ("Btrfs: fix race between using extent maps and merging them")
enforces a reference count on the primary `em` extent_map being merged,
as that one gets modified.

However, since,
commit 3d2ac99224 ("btrfs: introduce new members for extent_map")
both 'em' and 'merge' get modified, which started modifying 'merge'
and thus introduced the same race.

We were able to reproduce this by looping the affected squashfs workload
in parallel on a bunch of separate btrfs-es while also dropping caches.
We are still working on a simple enough reproducer to make into an fstest.

The simplest fix is to stop modifying 'merge', which is not essential,
as it is dropped immediately after the merge. This behavior is simply
a consequence of the order of the two extent maps being important in
computing the new values. Modify merge_ondisk_extents to take prev and
next by const* and also take a third merged parameter that it puts the
results in. Note that this introduces the rather odd behavior of passing
'em' to merge_ondisk_extents as a const * and as a regular ptr.

Fixes: 3d2ac99224 ("btrfs: introduce new members for extent_map")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:10:13 +02:00
Qu Wenruo
f10f59f91a btrfs: fix the delalloc range locking if sector size < page size
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:

- Set the writer locks without checking if the folio is still valid
  We call btrfs_folio_start_writer_lock() just like it's folio_lock().
  But since the folio may not even be the folio of the current mapping,
  we can easily screw up the folio->private.

- The range is not clamped inside the page
  This means we can over write other bitmaps if the start/len is not
  properly handled, and trigger the btrfs_subpage_assert().

- @processed_end is always rounded up to page end
  If the delalloc range is not page aligned, and we need to retry
  (returning -EAGAIN), then we will unlock to the page end.

  Thankfully this is not a huge problem, as now
  btrfs_folio_end_writer_lock() can handle range larger than the locked
  range, and only unlock what is already locked.

Fix all these problems by:

- Lock and check the folio first, then call
  btrfs_folio_set_writer_lock()
  So that if we got a folio not belonging to the inode, we won't
  touch folio->private.

- Properly truncate the range inside the page

- Update @processed_end to the locked range end

Fixes: 1e1de38792 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:09:44 +02:00
Qu Wenruo
5f9062a48d btrfs: qgroup: set a more sane default value for subtree drop threshold
Since commit 011b46c304 ("btrfs: skip subtree scan if it's too high to
avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
automatically skip large subtree scan at the cost of marking qgroup
inconsistent.

It's designed to address the final performance problem of snapshot drop
with qgroup enabled, but to be safe the default value is
BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
to make it work.

I'd say it's not a good idea to rely on user space tool to set this
default value, especially when some operations (snapshot dropping) can
be triggered immediately after mount, leaving a very small window to
that that sysfs interface.

So instead of disabling this new feature by default, enable it with a
low threshold (3), so that large subvolume tree drop at mount time won't
cause huge qgroup workload.

CC: stable@vger.kernel.org # 6.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:09:11 +02:00
Filipe Manana
3510e684b8 btrfs: clear force-compress on remount when compress mount option is given
After the migration to use fs context for processing mount options we had
a slight change in the semantics for remounting a filesystem that was
mounted with compress-force. Before we could clear compress-force by
passing only "-o compress[=algo]" during a remount, but after that change
that does not work anymore, force-compress is still present and one needs
to pass "-o compress-force=no,compress[=algo]" to the mount command.

Example, when running on a kernel 6.8+:

  $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
  $ mount | grep sdi
  /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)

  $ mount -o remount,compress=zlib:5 /mnt/sdi
  $ mount | grep sdi
  /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)

On a 6.7 kernel (or older):

  $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
  $ mount | grep sdi
  /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)

  $ mount -o remount,compress=zlib:5 /mnt/sdi
  $ mount | grep sdi
  /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)

So update btrfs_parse_param() to clear "compress-force" when "compress" is
given, providing the same semantics as kernel 6.7 and older.

Reported-by: Roman Mamedov <rm@romanrm.net>
Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/
CC: stable@vger.kernel.org # 6.8+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:07:53 +02:00
Christoph Hellwig
4a201dcfa1 xfs: update the pag for the last AG at recovery time
Currently log recovery never updates the in-core perag values for the
last allocation group when they were grown by growfs.  This leads to
btree record validation failures for the alloc, ialloc or finotbt
trees if a transaction references this new space.

Found by Brian's new growfs recovery stress test.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:19 +02:00
Christoph Hellwig
069cf5e32b xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag
__GFP_RETRY_MAYFAIL increases the likelyhood of allocations to fail,
which isn't really helpful during log recovery.  Remove the flag and
stick to the default GFP_KERNEL policies.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
b882b0f813 xfs: error out when a superblock buffer update reduces the agcount
XFS currently does not support reducing the agcount, so error out if
a logged sb buffer tries to shrink the agcount.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
6a18765b54 xfs: update the file system geometry after recoverying superblock buffers
Primary superblock buffers that change the file system geometry after a
growfs operation can affect the operation of later CIL checkpoints that
make use of the newly added space and allocation groups.

Apply the changes to the in-memory structures as part of recovery pass 2,
to ensure recovery works fine for such cases.

In the future we should apply the logic to other updates such as features
bits as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
aa67ec6a25 xfs: merge the perag freeing helpers
There is no good reason to have two different routines for freeing perag
structures for the unmount and error cases.  Add two arguments to specify
the range of AGs to free to xfs_free_perag, and use that to replace
xfs_free_unused_perag_range.

The addition RCU grace period for the error case is harmless, and the
extra check for the AG to actually exist is not required now that the
callers pass the exact known allocated range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
82742f8c3f xfs: pass the exact range to initialize to xfs_initialize_perag
Currently only the new agcount is passed to xfs_initialize_perag, which
requires lookups of existing AGs to skip them and complicates error
handling.  Also pass the previous agcount so that the range that
xfs_initialize_perag operates on is exactly defined.  That way the
extra lookups can be avoided, and error handling can clean up the
exact range from the old count to the last added perag structure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Darrick J. Wong
af8512c527 xfs: don't fail repairs on metadata files with no attr fork
Fix a minor bug where we fail repairs on metadata files that do not have
attr forks because xrep_metadata_inode_subtype doesn't filter ENOENT.

Cc: stable@vger.kernel.org # v6.8
Fixes: 5a8e07e799 ("xfs: repair the inode core and forks of a metadata inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Thorsten Blum
8c6e03ffed
acl: Annotate struct posix_acl with __counted_by()
Add the __counted_by compiler attribute to the flexible array member
a_entries to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
CONFIG_FORTIFY_SOURCE.

Use struct_size() to calculate the number of bytes to allocate for new
and cloned acls and remove the local size variables.

Change the posix_acl_alloc() function parameter count from int to
unsigned int to match posix_acl's a_count data type. Add identifier
names to the function definition to silence two checkpatch warnings.

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20241018121426.155247-2-thorsten.blum@linux.dev
Cc: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:59 +02:00
Xuewen Yan
900bbaae67
epoll: Add synchronous wakeup support for ep_poll_callback
Now, the epoll only use wake_up() interface to wake up task.
However, sometimes, there are epoll users which want to use
the synchronous wakeup flag to hint the scheduler, such as
Android binder driver.
So add a wake_up_sync() define, and use the wake_up_sync()
when the sync is true in ep_poll_callback().

Co-developed-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Jing Xia <jing.xia@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Link: https://lore.kernel.org/r/20240426080548.8203-1-xuewen.yan@unisoc.com
Tested-by: Brian Geffon <bgeffon@google.com>
Reviewed-by: Brian Geffon <bgeffon@google.com>
Reported-by: Benoit Lize <lizeb@google.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:59 +02:00
Rik van Riel
0dfcb72d33
coredump: add cond_resched() to dump_user_range
The loop between elf_core_dump() and dump_user_range() can run for
so long that the system shows softlockup messages, with side effects
like workqueues and RCU getting stuck on the core dumping CPU.

Add a cond_resched() in dump_user_range() to avoid that softlockup.

Signed-off-by: Rik van Riel <riel@surriel.com>
Link: https://lore.kernel.org/r/20241010113651.50cb0366@imladris.surriel.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:58 +02:00
Andrew Kreimer
80d3ab2227
fs/inode: Fix a typo
Fix a typo in comments: wether v-> whether.

Signed-off-by: Andrew Kreimer <algonell@gmail.com>
Link: https://lore.kernel.org/r/20241008121602.16778-1-algonell@gmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:58 +02:00
Christian Brauner
2714b0d1f3
fcntl: make F_DUPFD_QUERY associative
Currently when passing a closed file descriptor to
fcntl(fd, F_DUPFD_QUERY, fd_dup) the order matters:

    fd = open("/dev/null");
    fd_dup = dup(fd);

When we now close one of the file descriptors we get:

    (1) fcntl(fd, fd_dup) // -EBADF
    (2) fcntl(fd_dup, fd) // 0 aka not equal

depending on which file descriptor is passed first. That's not a huge
deal but it gives the api I slightly weird feel. Make it so that the
order doesn't matter by requiring that both file descriptors are valid:

(1') fcntl(fd, fd_dup) // -EBADF
(2') fcntl(fd_dup, fd) // -EBADF

Link: https://lore.kernel.org/r/20241008-duften-formel-251f967602d5@brauner
Fixes: c62b758bae ("fcntl: add F_DUPFD_QUERY fcntl()")
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Reviewed-By: Lennart Poettering <lennart@poettering.net>
Cc: stable@vger.kernel.org
Reported-by: Lennart Poettering <lennart@poettering.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:58 +02:00
Andreas Gruenbacher
c298638743
vfs: inode insertion kdoc corrections
Some minor corrections to the inode_insert5 and iget5_locked kernel
documentation.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
Link: https://lore.kernel.org/r/20241004115151.44834-1-agruenba@redhat.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:57 +02:00
Uros Bizjak
0cb9c994e7
namespace: Use atomic64_inc_return() in alloc_mnt_ns()
Use atomic64_inc_return(&ref) instead of atomic64_add_return(1, &ref)
to use optimized implementation and ease register pressure around
the primitive for targets that implement optimized variant.

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20241007085303.48312-1-ubizjak@gmail.com
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:57 +02:00
Julia Lawall
1e756248be
fs: Reorganize kerneldoc parameter names
Reorganize kerneldoc parameter names to match the parameter
order in the function header.

Problems identified using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Link: https://lore.kernel.org/r/20240930112121.95324-9-Julia.Lawall@inria.fr
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:57 +02:00
Yafang Shao
e6957c99dc
vfs: Add a sysctl for automated deletion of dentry
Commit 681ce86235 ("vfs: Delete the associated dentry when deleting a
file") introduced an unconditional deletion of the associated dentry when a
file is removed. However, this led to performance regressions in specific
benchmarks, such as ilebench.sum_operations/s [0], prompting a revert in
commit 4a4be1ad3a ("Revert "vfs: Delete the associated dentry when
deleting a file"").

This patch seeks to reintroduce the concept conditionally, where the
associated dentry is deleted only when the user explicitly opts for it
during file removal. A new sysctl fs.automated_deletion_of_dentry is
added for this purpose. Its default value is set to 0.

There are practical use cases for this proactive dentry reclamation.
Besides the Elasticsearch use case mentioned in commit 681ce86235,
additional examples have surfaced in our production environment. For
instance, in video rendering services that continuously generate temporary
files, upload them to persistent storage servers, and then delete them, a
large number of negative dentries—serving no useful purpose—accumulate.
Users in such cases would benefit from proactively reclaiming these
negative dentries.

Link: https://lore.kernel.org/linux-fsdevel/202405291318.4dfbb352-oliver.sang@intel.com [0]
Link: https://lore.kernel.org/all/20240912-programm-umgibt-a1145fa73bb6@brauner/
Suggested-by: Christian Brauner <brauner@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20240929122831.92515-1-laoar.shao@gmail.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:57 +02:00
Christian Brauner
6474353a5e
epoll: annotate racy check
Epoll relies on a racy fastpath check during __fput() in
eventpoll_release() to avoid the hit of pointlessly acquiring a
semaphore. Annotate that race by using WRITE_ONCE() and READ_ONCE().

Link: https://lore.kernel.org/r/66edfb3c.050a0220.3195df.001a.GAE@google.com
Link: https://lore.kernel.org/r/20240925-fungieren-anbauen-79b334b00542@brauner
Reviewed-by: Jan Kara <jack@suse.cz>
Reported-by: syzbot+3b6b32dc50537a49bb4a@syzkaller.appspotmail.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-22 11:16:56 +02:00
Linus Torvalds
7166c32651 vfs-6.12-rc5.fixes
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCZxY6XAAKCRCRxhvAZXjc
 opmUAQCu4KhzBBdZmFw3AfZFNJvYb1onT4FiU0pnyGgfvzEdEwD6AlnlgQ7DL3ZN
 WBqBzUl+DpGYJfzhkqoEGH89Fagx7QM=
 =mm68
 -----END PGP SIGNATURE-----

Merge tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "afs:
   - Fix a lock recursion in afs_wake_up_async_call() on ->notify_lock

 netfs:
   - Drop the references to a folio immediately after the folio has been
     extracted to prevent races with future I/O collection

   - Fix a documenation build error

   - Downgrade the i_rwsem for buffered writes to fix a cifs reported
     performance regression when switching to netfslib

  vfs:
   - Explicitly return -E2BIG from openat2() if the specified size is
     unexpectedly large. This aligns openat2() with other extensible
     struct based system calls

   - When copying a mount namespace ensure that we only try to remove
     the new copy from the mount namespace rbtree if it has already been
     added to it

  nilfs:
   - Clear the buffer delay flag when clearing the buffer state clags
     when a buffer head is discarded to prevent a kernel OOPs

  ocfs2:
   - Fix an unitialized value warning in ocfs2_setattr()

  proc:
   - Fix a kernel doc warning"

* tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  proc: Fix W=1 build kernel-doc warning
  afs: Fix lock recursion
  fs: Fix uninitialized value issue in from_kuid and from_kgid
  fs: don't try and remove empty rbtree node
  netfs: Downgrade i_rwsem for a buffered write
  nilfs2: fix kernel bug due to missing clearing of buffer delay flag
  openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
  netfs: fix documentation build error
  netfs: In readahead, put the folio refs as soon extracted
2024-10-21 10:48:24 -07:00
Christoph Hellwig
6db388585e
iomap: turn iomap_want_unshare_iter into an inline function
iomap_want_unshare_iter currently sits in fs/iomap/buffered-io.c, which
depends on CONFIG_BLOCK.  It is also in used in fs/dax.c whіch has no
such dependency.  Given that it is a trivial check turn it into an inline
in include/linux/iomap.h to fix the DAX && !BLOCK build.

Fixes: 6ef6a0e821 ("iomap: share iomap_unshare_iter predicate code with fsdax")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241015041350.118403-1-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21 17:01:01 +02:00
Yang Erkun
d5ff2fb2e7 nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
In the normal case, when we excute `echo 0 > /proc/fs/nfsd/threads`, the
function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will
release all resources related to the hashed `nfs4_client`. If the
`nfsd_client_shrinker` is running concurrently, the `expire_client`
function will first unhash this client and then destroy it. This can
lead to the following warning. Additionally, numerous use-after-free
errors may occur as well.

nfsd_client_shrinker         echo 0 > /proc/fs/nfsd/threads

expire_client                nfsd_shutdown_net
  unhash_client                ...
                               nfs4_state_shutdown_net
                                 /* won't wait shrinker exit */
  /*                             cancel_work(&nn->nfsd_shrinker_work)
   * nfsd_file for this          /* won't destroy unhashed client1 */
   * client1 still alive         nfs4_state_destroy_net
   */

                               nfsd_file_cache_shutdown
                                 /* trigger warning */
                                 kmem_cache_destroy(nfsd_file_slab)
                                 kmem_cache_destroy(nfsd_file_mark_slab)
  /* release nfsd_file and mark */
  __destroy_client

====================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
--------------------------------------------------------------------
CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ #1

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

====================================================================
BUG nfsd_file_mark (Tainted: G    B   W         ): Objects remaining
nfsd_file_mark on __kmem_cache_shutdown()
--------------------------------------------------------------------

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xc8/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

To resolve this issue, cancel `nfsd_shrinker_work` using synchronous
mode in nfs4_state_shutdown_net.

Fixes: 7c24fa2250 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker")
Signed-off-by: Yang Erkun <yangerkun@huaweicloud.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-21 10:27:36 -04:00
Gao Xiang
14c2d97265
erofs: use get_tree_bdev_flags() to avoid misleading messages
Users can pass in an arbitrary source path for the proper type of
a mount then without "Can't lookup blockdev" error message.

Reported-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Closes: https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241009033151.2334888-2-hsiangkao@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21 14:30:27 +02:00
Gao Xiang
4021e68513
fs/super.c: introduce get_tree_bdev_flags()
As Allison reported [1], currently get_tree_bdev() will store
"Can't lookup blockdev" error message.  Although it makes sense for
pure bdev-based fses, this message may mislead users who try to use
EROFS file-backed mounts since get_tree_nodev() is used as a fallback
then.

Add get_tree_bdev_flags() to specify extensible flags [2] and
GET_TREE_BDEV_QUIET_LOOKUP to silence "Can't lookup blockdev" message
since it's misleading to EROFS file-backed mounts now.

[1] https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com
[2] https://lore.kernel.org/r/ZwUkJEtwIpUA4qMz@infradead.org

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241009033151.2334888-1-hsiangkao@linux.alibaba.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21 14:30:26 +02:00
Miklos Szeredi
184429a17f Revert "fuse: move initialization of fuse_file to fuse_writepages() instead of in callback"
This reverts commit 672c3b7457.

fuse_writepages() might be called with no dirty pages after all writable
opens were closed.  In this case __fuse_write_file_get() will return NULL
which will trigger the WARNING.

The exact conditions under which this is triggered is unclear and syzbot
didn't find a reproducer yet.

Reported-by: syzbot+217a976dc26ef2fa8711@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/CAJnrk1aQwfvb51wQ5rUSf9N8j1hArTFeSkHqC_3T-mU6_BCD=A@mail.gmail.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-10-21 10:02:51 +02:00
Ingo Molnar
d1fb8a78b2 Linux 6.12-rc4
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmcVgfoeHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGhCYH/0Sdfp3cIq3JWLRv
 HCkWhPkPbEvR5XQlYQsAvTPVrEc0ZG9PKlXCaYaa8Tvt8xQ7WT/VDTjKgaWEhr8s
 qa6bNTx1zggiNBTP/3jYsNliOyAYfw5qjxA7fpEmueAeuT5y1XKZFKPHEXE/1qbR
 8zeISKTkE0qwUmLqCdXe2qBWFnCC5i+78RcI6IN7uErnuNWk7ssapldgU4DB+dEl
 DDRxi1FTvARGPQGl8T+jPkfJiugv87ksG7l4WsqcYgoW+045K76C7I6vQjkDOrsd
 wqtPIow/yPmGQbbdRhWLxNU+wDmselYQ6xp7aMxppNF45HoHtzNm+X+T2ZU3bPoP
 iT2Mkbg=
 =+GXK
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc4' into sched/core, to resolve conflict

Overlapping fixes solving the same bug slightly differently:

  7266f0a6d3 fs/bcachefs: Fix __wait_on_freeing_inode() definition of waitqueue entry
  3b80552e70 bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry

Use the upstream version.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-10-21 08:14:15 +02:00
Kent Overstreet
a069f01479 bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path
This fixes a fsck bug on a very old filesystem (pre mainline merge).

Fixes: 72350ee0ea ("bcachefs: Kill snapshot arg to fsck_write_inode()")
Reported-by: Marcin Mirosław <marcin@mejor.pl>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 18:09:09 -04:00
Kent Overstreet
e04ee86089 bcachefs: Mark more errors as AUTOFIX
Reported-by: Marcin Mirosław <marcin@mejor.pl>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 18:08:53 -04:00
Kent Overstreet
f0d3302073 bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations
kvmalloc() doesn't support allocations > INT_MAX, but vmalloc() does -
the limit should be lifted, but we can work around this for now.

A user with a 75 TB filesystem reported the following journal replay
error:
https://github.com/koverstreet/bcachefs/issues/769

In journal replay we have to sort and dedup all the keys from the
journal, which means we need a large contiguous allocation. Given that
the user has 128GB of ram, the 2GB limit on allocation size has become
far too small.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 16:50:14 -04:00
Kent Overstreet
3956ff8bc2 bcachefs: Don't use wait_event_interruptible() in recovery
Fix a bug where mount was failing with -ERESTARTSYS:
https://github.com/koverstreet/bcachefs/issues/741

We only want the interruptible wait when called from fsync.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 16:50:14 -04:00
Kent Overstreet
eb5db64c45 bcachefs: Fix __bch2_fsck_err() warning
We only warn about having a btree_trans that wasn't passed in if we'll
be prompting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 16:50:14 -04:00
Al Viro
e896474fe4 getname_maybe_null() - the third variant of pathname copy-in
Semantics used by statx(2) (and later *xattrat(2)): without AT_EMPTY_PATH
it's standard getname() (i.e. ERR_PTR(-ENOENT) on empty string,
ERR_PTR(-EFAULT) on NULL), with AT_EMPTY_PATH both empty string and
NULL are accepted.

Calling conventions: getname_maybe_null(user_pointer, flags) returns
	* pointer to struct filename when non-empty string had been
successfully read
	* ERR_PTR(...) on error
	* NULL if an empty string or NULL pointer had been given
with AT_EMPTY_PATH in the flags argument.

It tries to avoid allocation in the last case; it's not always
able to do so, in which case the temporary struct filename instance
is freed and NULL returned anyway.

Fast path is inlined.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-19 20:33:34 -04:00
Al Viro
5b313bcb6e teach filename_lookup() to treat NULL filename as ""
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-19 20:32:39 -04:00
John Garry
c3be7ebbbc fs/block: Check for IOCB_DIRECT in generic_atomic_write_valid()
Currently FMODE_CAN_ATOMIC_WRITE is set if the bdev can atomic write and
the file is open for direct IO. This does not work if the file is not
opened for direct IO, yet fcntl(O_DIRECT) is used on the fd later.

Change to check for direct IO on a per-IO basis in
generic_atomic_write_valid(). Since we want to report -EOPNOTSUPP for
non-direct IO for an atomic write, change to return an error code.

Relocate the block fops atomic write checks to the common write path, as to
catch non-direct IO.

Fixes: c34fc6f26a ("fs: Initial atomic write support")
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241019125113.369994-3-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-19 16:48:22 -06:00
John Garry
9a8dbdadae block/fs: Pass an iocb to generic_atomic_write_valid()
Darrick and Hannes both thought it better that generic_atomic_write_valid()
should be passed a struct iocb, and not just the member of that struct
which is referenced; see [0] and [1].

I think that makes a more generic and clean API, so make that change.

[0] https://lore.kernel.org/linux-block/680ce641-729b-4150-b875-531a98657682@suse.de/
[1] https://lore.kernel.org/linux-xfs/20240620212401.GA3058325@frogsfrogsfrogs/

Fixes: c34fc6f26a ("fs: Initial atomic write support")
Suggested-by: Darrick J. Wong <djwong@kernel.org>
Suggested-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20241019125113.369994-2-john.g.garry@oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-10-19 16:48:21 -06:00
Linus Torvalds
9197b73fd7 Mashed-up update that I sat on too long:
- fix for multiple slabs created with the same name
 - enable multipage folios
 - theorical fix to also look for opened fids by inode if none
 was found by dentry
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmcS81AACgkQq06b7GqY
 5nACpBAAtXOGRjg+dushCwUVKBlnI3oTwE2G+ywnphNZg2A0emlMOxos7x1OTiM3
 Fu0b10MCUWHIXo4jD6ALVPWITJTfjiXR8s90Q/ozypcIXXhkDDShhV31b2h6Iplr
 YyKyjEehDFRiS7rqWC2a9mce99sOpwdQRmnssnWbjYvpJ4imFbl+50Z1I5Nc/Omu
 j2y02eMuikiWF/shKj0Dx1mmpZ4InSv3kvlM+V2D2YdWKNonGZe/xFZhid95LXmr
 Upt55R8k9qR2pn4VU22eKP6c34DIZGDlrcQdPUCNP5QuaAdGZov3TjNQdjE1bJmF
 E2QdxvUNfvvHqlvaRrlWa27uMgXMcy7QV3LEKwmo3tmaYVw2PDMRbFXc9zQdxy91
 zqXjjGasnwzE8ca36y79vZjFTHAyY5VK/3cHCL3ai+ysu4UL3k2QgmVegREG/xKk
 G8Nz4UO/R6s8Wc2VqxKJdZS5NMLlADS+Aes0PG+9AxQz7iR9Ktgwrw39KDxMi+Lm
 PeH3Gz2rP9+EPoa3usoBQtvvvmJKM/Wb9qdPW9vTtRbRJ7bVclJoizFoLMA/TiW1
 Jru+HYGBO75s8RynwEDLMiJhkjZWHfVgDjPsY6YsGVH8W2gOcJ7egQ2J2EsuurN3
 tzKz4uQilV+VeDuWs8pWKrX/c3Y3KpSYV+oayg7Je7LoTlQBmU8=
 =VG4t
 -----END PGP SIGNATURE-----

Merge tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux

Pull 9p fixes from Dominique Martinet:
 "Mashed-up update that I sat on too long:

   - fix for multiple slabs created with the same name

   - enable multipage folios

   - theorical fix to also look for opened fids by inode if none was
     found by dentry"

[ Enabling multi-page folios should have been done during the merge
  window, but it's a one-liner, and the actual meat of the enablement
  is in netfs and already in use for other filesystems...  - Linus ]

* tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux:
  9p: Avoid creating multiple slab caches with the same name
  9p: Enable multipage folios
  9p: v9fs_fid_find: also lookup by inode if not found dentry
2024-10-19 08:44:10 -07:00
Christian Brauner
08ef26ea9a
fs: add file_ref
As atomic_inc_not_zero() is implemented with a try_cmpxchg() loop it has
O(N^2) behaviour under contention with N concurrent operations and it is
in a hot path in __fget_files_rcu().

The rcuref infrastructures remedies this problem by using an
unconditional increment relying on safe- and dead zones to make this
work and requiring rcu protection for the data structure in question.
This not just scales better it also introduces overflow protection.

However, in contrast to generic rcuref, files require a memory barrier
and thus cannot rely on *_relaxed() atomic operations and also require
to be built on atomic_long_t as having massive amounts of reference
isn't unheard of even if it is just an attack.

As suggested by Linus, add a file specific variant instead of making
this a generic library.

Files are SLAB_TYPESAFE_BY_RCU and thus don't have "regular" rcu
protection. In short, freeing of files isn't delayed until a grace
period has elapsed. Instead, they are freed immediately and thus can be
reused (multiple times) within the same grace period.

So when picking a file from the file descriptor table via its file
descriptor number it is thus possible to see an elevated reference count
on file->f_count even though the file has already been recycled possibly
multiple times by another task.

To guard against this the vfs will pick the file from the file
descriptor table twice. Once before the refcount increment and once
after to compare the pointers (grossly simplified). If they match then
the file is still valid. If not the caller needs to fput() it.

The unconditional increment makes the following race possible as
illustrated by rcuref:

> Deconstruction race
> ===================
>
> The release operation must be protected by prohibiting a grace period in
> order to prevent a possible use after free:
>
>      T1                              T2
>      put()                           get()
>      // ref->refcnt = ONEREF
>      if (!atomic_add_negative(-1, &ref->refcnt))
>              return false;                           <- Not taken
>
>      // ref->refcnt == NOREF
>      --> preemption
>                                      // Elevates ref->refcnt to ONEREF
>                                      if (!atomic_add_negative(1, &ref->refcnt))
>                                              return true;                    <- taken
>
>                                      if (put(&p->ref)) { <-- Succeeds
>                                              remove_pointer(p);
>                                              kfree_rcu(p, rcu);
>                                      }
>
>              RCU grace period ends, object is freed
>
>      atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);      <- UAF
>
> [...] it prevents the grace period which keeps the object alive until
> all put() operations complete.

Having files by SLAB_TYPESAFE_BY_RCU shouldn't cause any problems for
this deconstruction race. Afaict, the only interesting case would be
someone freeing the file and someone immediately recycling it within the
same grace period and reinitializing file->f_count to ONEREF while a
concurrent fput() is doing atomic_cmpxchg(&ref->refcnt, NOREF, DEAD) as
in the race above.

But this is safe from SLAB_TYPESAFE_BY_RCU's perspective and it should
be safe from rcuref's perspective.

      T1                              T2                                                    T3
      fput()                          fget()
      // f_count->refcnt = ONEREF
      if (!atomic_add_negative(-1, &f_count->refcnt))
              return false;                           <- Not taken

      // f_count->refcnt == NOREF
      --> preemption
                                      // Elevates f_count->refcnt to ONEREF
                                      if (!atomic_add_negative(1, &f_count->refcnt))
                                              return true;                    <- taken

                                      if (put(&f_count)) { <-- Succeeds
                                              remove_pointer(p);
                                              /*
                                               * Cache is SLAB_TYPESAFE_BY_RCU
                                               * so this is freed without a grace period.
                                               */
                                              kmem_cache_free(p);
                                      }

                                                                                             kmem_cache_alloc()
                                                                                             init_file() {
                                                                                                     // Sets f_count->refcnt to ONEREF
                                                                                                     rcuref_long_init(&f->f_count, 1);
                                                                                             }

                        Object has been reused within the same grace period
                        via kmem_cache_alloc()'s SLAB_TYPESAFE_BY_RCU.

      /*
       * With SLAB_TYPESAFE_BY_RCU this would be a safe UAF access and
       * it would work correctly because the atomic_cmpxchg()
       * will fail because the refcount has been reset to ONEREF by T3.
       */
      atomic_cmpxchg(&ref->refcnt, NOREF, DEAD);      <- UAF

However, there are other cases to consider:

(1) Benign race due to multiple atomic_long_read()

    CPU1                                                    CPU2

    file_ref_put()
    // last reference
    // => count goes negative/FILE_REF_NOREF
    atomic_long_add_negative_release(-1, &ref->refcnt)
    -> __file_ref_put()
                                                        file_ref_get()
                                                        // goes back from negative/FILE_REF_NOREF to 0
                                                        // and file_ref_get() succeeds
                                                        atomic_long_add_negative(1, &ref->refcnt)

                                                        // This is immediately followed by file_ref_put()
                                                        // managing to set FILE_REF_DEAD
                                                        file_ref_put()

       // __file_ref_put() continues and sees
       // cnt > FILE_REF_RELEASED // and splats with
       // "imbalanced put on file reference count"
       cnt = atomic_long_read(&ref->refcnt);

    The race however is benign and the problem is the
    atomic_long_read(). Instead of performing a separate read this uses
    atomic_long_dec_return() and pass the value to __file_ref_put().
    Thanks to Linus for pointing out that braino.

(2) SLAB_TYPESAFE_BY_RCU may cause recycled files to be marked dead

    When a file is recycled the following race exists:

    CPU1                                                       CPU2
    // @file is already dead and thus
    // cnt >= FILE_REF_RELEASED.
    file_ref_get(file)
    atomic_long_add_negative(1, &ref->refcnt)
       // We thus call into __file_ref_get()
    -> __file_ref_get()

       // which sees cnt >= FILE_REF_RELEASED
       cnt = atomic_long_read(&ref->refcnt);
                                                               // In the meantime @file gets freed
                                                               kmem_cache_free()

                                                               // and is immediately recycled
                                                               file = kmem_cache_zalloc()
                                                               // and the reference count is reinitialized
                                                               // and the file alive again in someone
                                                               // else's file descriptor table
                                                               file_ref_init(&ref->refcnt, 1);

       // the __file_ref_get() slowpath now continues
       // and as it saw earlier that cnt >= FILE_REF_RELEASED
       // it wants to ensure that we're staying in the middle
       // of the deadzone and unconditionally sets
       // FILE_REF_DEAD.
       // This marks @file dead for CPU2...
       atomic_long_set(&ref->refcnt, FILE_REF_DEAD);

                                                               // Caller issues a close() system call to close @file
                                                               close(fd)
                                                               file = file_close_fd_locked()
                                                               filp_flush()
                                                               // The caller sees that cnt >= FILE_REF_RELEASED
                                                               // and warns the first time...
                                                               CHECK_DATA_CORRUPTION(file_count(file) == 0)

                                                               // and then splats a second time because
                                                               // __file_ref_put() sees cnt >= FILE_REF_RELEASED
                                                               file_ref_put(&ref->refcnt);
                                                               -> __file_ref_put()

    My initial inclination was to replace the unconditional
    atomic_long_set() with an atomic_long_try_cmpxchg() but Linus
    pointed out that:

    > I think we should just make file_ref_get() do a simple
    >
    >        return !atomic_long_add_negative(1, &ref->refcnt));
    >
    > and nothing else. Yes, multiple CPU's can race, and you can increment
    > more than once, but the gap - even on 32-bit - between DEAD and
    > becoming close to REF_RELEASED is so big that we simply don't care.
    > That's the point of having a gap.

I've been testing this with will-it-scale using fstat() on a machine
that Jens gave me access (thank you very much!):

processor       : 511
vendor_id       : AuthenticAMD
cpu family      : 25
model           : 160
model name      : AMD EPYC 9754 128-Core Processor

and I consistently get a 3-5% improvement on 256+ threads.

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202410151043.5d224a27-oliver.sang@intel.com
Closes: https://lore.kernel.org/all/202410151611.f4cd71f2-oliver.sang@intel.com
Link: https://lore.kernel.org/r/20241007-brauner-file-rcuref-v2-2-387e24dc9163@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-19 14:16:45 +02:00
Matthew Wilcox (Oracle)
b57c010e70 ufs: Convert ufs_change_blocknr() to take a folio
Now that ufs_new_fragments() has a folio, pass it to ufs_change_blocknr()
as a folio instead of converting it from folio to page to folio.
This removes the last use of struct page in UFS.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Matthew Wilcox (Oracle)
14bcb7bb68 ufs: Pass a folio to ufs_new_fragments()
All callers now have a folio, pass it to ufs_new_fragments() instead
of converting back to a page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Matthew Wilcox (Oracle)
24a87a0adb ufs: Convert ufs_inode_getfrag() to take a folio
Pass bh->b_folio instead of bh->b_page.  They're in a union, so no
code change expected.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Matthew Wilcox (Oracle)
b6250a013d ufs: Convert ufs_extend_tail() to take a folio
Pass bh->b_folio instead of bh->b_page.  They're in a union, so no
code change expected.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Matthew Wilcox (Oracle)
d9036c488c ufs: Convert ufs_inode_getblock() to take a folio
Pass bh->b_folio instead of bh->b_page.  They're in a union, so no
code change expected.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
6b103cc0ba ufs: take the handling of free block counters into a helper
There are 3 places where those counters (many and varied...) are
adjusted - when we are freeing fragments and get an entire block
freed, when we are freeing blocks and (in opposite direction) when
we are grabbing a block.  The logics is identical (modulo the
sign of adjustment) in all three; better take it into a helper -
less duplication and less clutter in the callers that way.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
64f30e80d6 clean ufs_trunc_direct() up a bit...
For short files (== no indirect blocks needed) UFS allows the last
block to be a partial one.  That creates some complications for
truncation down to "short file" lengths.  ufs_trunc_direct() is
called when we'd already made sure that new EOF is not in a hole;
nothing needs to be done if we are extending the file and in
case we are shrinking the file it needs to
	* shrink or free the old final block.
	* free all full direct blocks between the new and old EOF.
	* possibly shrink the new final block.

The logics is needlessly complicated by trying to keep all cases
handled by the same sequence of operations.
	if not shrinking
		nothing to do
	else if number of full blocks unchanged
		free the tail of possibly partial last block
	else
		free the tail of (currently full) new last block
		free all present (full) blocks in between
		free the (possibly partial) old last block

is easier to follow than the result of trying to unify these
cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
db57044217 ufs: get rid of ubh_{ubhcpymem,memcpyubh}()
used only in ufs_read_cylinder_structures()/ufs_put_super_internal()
and there we can just as well avoid bothering with ufs_buffer_head
and just deal with it fragment-by-fragment.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
ae79ce9d06 ufs_inode_getfrag(): remove junk comment
It used to be a stubbed out beginning of ufs2 support, which had
been implemented differently quite a while ago.  Remove the
commented-out (pseudo-)code.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
426f07ad3e ufs_free_fragments(): fix the braino in sanity check
The function expects that all fragments it's been asked to free will
be within the same block.  And it even has a sanity check verifying
that - it takes the fragment number modulo the number of fragments
per block, adds the count and checks if that's too high.

Unfortunately, it misspells the upper limit - instead of ->s_fpb
(fragments per block) it says ->s_fpg (fragments per cylinder group).
So "too high" ends up being insanely lenient.

Had been that way since 2.1.112, when UFS write support had been
added.  27 years to spot a typo...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
c5df105f7d ufs_clusteracct(): switch to passing fragment number
Currently all callers pass it a block number.  All of them have it derived
from a fragment number (both fragment and block numbers are within a cylinder
group, and thus 32bit).  Pass it the fragment number instead; none of the
callers has other uses for the block number, so that ends up with cleaner
code.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
dce3e8d33a ufs: untangle ubh_...block...(), part 3
Pass fragment number instead of a block one.  It's available in all
callers and it makes the logics inside those helpers much simpler.
The bitmap they operate upon is with bit per fragment, block being
an aligned group of 1, 2, 4 or 8 adjacent fragments.  We still
need a switch by the number of fragments in block (== number of
bits to check/set/clear), but finding the byte we need to work
with becomes uniform and that makes the things easier to follow.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
8bec0618a4 ufs: untangle ubh_...block...(), part 2
pass cylinder group descriptor instead of its buffer head (ubh,
always UCPI_UBH(ucpi)) and its ->c_freeoff.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
65136e46a0 ufs: untangle ubh_...block...() macros, part 1
passing implicit argument to a macro by having it in a variable
with special name is Not Nice(tm); just pass it explicitly.

kill an unused macro, while we are at it...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
0bfd3e1078 ufs: fix ufs_read_cylinder() failure handling
1) ufs_load_cylinder() should return NULL on ufs_read_cylinder() failures.
ufs_error() is not enough.  As it is, IO failure on attempt to read a part
of cylinder group metadata is likely to end up with an oops.

2) we drop the wrong buffer heads when undoing sb_bread() on IO failure
halfway through the read - we need to brelse() what we've got from
sb_bread(), TYVM...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
7f71d6e346 ufs: missing ->splice_write()
normal ->write_iter()-based ->splice_write() works here just fine...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:31 -04:00
Al Viro
6a1c4c4688 ufs: fix handling of delete_entry and set_link failures
similar to minixfs series - make ufs_set_link() report failures,
lift folio_release_kmap() into the callers of ufs_set_link()
and ufs_delete_entry(), make ufs_rename() handle failures in both.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-18 17:35:30 -04:00
Olga Kornievskaia
8dd91e8d31 nfsd: fix race between laundromat and free_stateid
There is a race between laundromat handling of revoked delegations
and a client sending free_stateid operation. Laundromat thread
finds that delegation has expired and needs to be revoked so it
marks the delegation stid revoked and it puts it on a reaper list
but then it unlock the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked a racing
free_stateid processing thread does the following (1) it calls
list_del_init() which removes it from the reaper list and (2) frees
the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation
but that means it will no release the lock lease that exists on
the file.

Now, a new open for this file comes in and ends up finding that
lease list isn't empty and calls nfsd_breaker_owns_lease() which ends
up trying to derefence a freed delegation stateid. Leading to the
followint use-after-free KASAN warning:

kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20

This patch proposes a fixed that's based on adding 2 new additional
stid's sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).

First to make sure, that once the stid is marked revoked, it is not
removed by the nfsd4_free_stateid(), the laundromat take a reference
on the stateid. Then, coordinating whether the stid has been put
on the cl_revoked list or we are processing FREE_STATEID and need to
make sure to remove it from the list, each check that state and act
accordingly. If laundromat has added to the cl_revoke list before
the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to remove
it from the list. If nfsd4_free_stateid() finds that operations arrived
before laundromat has placed it on cl_revoke list, it marks the state
freed and then laundromat will no longer add it to the list.

Also, for nfsd4_delegreturn() when looking for the specified stid,
we need to access stid that are marked removed or freeable, it means
the laundromat has started processing it but hasn't finished and this
delegreturn needs to return nfserr_deleg_revoked and not
nfserr_bad_stateid. The latter will not trigger a FREE_STATEID and the
lack of it will leave this stid on the cl_revoked list indefinitely.

Fixes: 2d4a532d38 ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-18 16:40:37 -04:00
Linus Torvalds
b04ae0f451 two fixes for stable, and two small cleanup fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcSdmYACgkQiiy9cAdy
 T1EnnAwAoNbY+odLB9atHIuaBftpyINrhzRrzpwTfYNtPKUPGxxGk2fiP29YqMLb
 OF4jnC87E3P/xhydoZHXXe3kKBQFVMAkJZKHiZBvJd+brk/EadfQnNmIio1pwOGh
 zFNxSujFtsM/1HU/ZoI2kaHzrqj5KxWKWFytZ6umd8C3NyKK9Lo/lcqUBKv8MpJy
 XXkMBh+7HGKRfDQlU+n6NQ5+dqFL5xDjTXlm9dM8LXuInKy5oKTGnRhLA7OA8lt7
 EenFo8joy0IpXUByHt+ksQ8P88NCnU2h9kGp1UrGrBPh90+MokRr9GAcH8twK8jt
 /bpL4yzAwuk1TAg+L9mSLT2OtWYsDpsQZmsBMbxBZGr2qmtjwgbxSgjf6DNiJZgn
 jz15nFsuEsU5AbX4EAE67fwRWAo9AmQFyOOcYgkiIWOFHaRU6D/2NzCxCDZ+mfpy
 Z5f7dF/sA158iY4wmB5BrQpFamxzpLADz6Qy4NA9hXjEKsbyFAuf22EjE64ruxZ4
 8nMB3buh
 =peum
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Fix possible double free setting xattrs

 - Fix slab out of bounds with large ioctl payload

 - Remove three unused functions, and an unused variable that could be
   confusing

* tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Remove unused functions
  smb/client: Fix logically dead code
  smb: client: fix OOBs when building SMB2_IOCTL request
  smb: client: fix possible double free in smb2_set_ea()
2024-10-18 11:37:12 -07:00
Linus Torvalds
568570fdf2 XFS Bug fixes for 6.12-rc4
* Fix integer overflow in xrep_bmap
 * Fix stale dealloc punching for COW IO
 
 Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZw5LIwAKCRBGdaER5Qtf
 puRlAYDezbvs1dDSkKIGOt3inGdLptNAu4qniXBUkbYI9BzmtIVDueWP4Wo0dV3d
 gu3xrWQBfjFXdmEuBlwLuAFrp07AN18BVMj+DWCiEShsPHSoSPcF/IrDiz4BHvGv
 MKYq9CywFw==
 =Gj9b
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - Fix integer overflow in xrep_bmap

 - Fix stale dealloc punching for COW IO

* tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: punch delalloc extents from the COW fork for COW writes
  xfs: set IOMAP_F_SHARED for all COW fork allocations
  xfs: share more code in xfs_buffered_write_iomap_begin
  xfs: support the COW fork in xfs_bmap_punch_delalloc_range
  xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock
  xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
  xfs: factor out a xfs_file_write_zero_eof helper
  iomap: move locking out of iomap_write_delalloc_release
  iomap: remove iomap_file_buffered_write_punch_delalloc
  iomap: factor out a iomap_last_written_block helper
  xfs: fix integer overflow in xrep_bmap
2024-10-18 11:28:39 -07:00
Thorsten Blum
197231da7f
proc: Fix W=1 build kernel-doc warning
Building the kernel with W=1 generates the following warning:

  fs/proc/fd.c:81: warning: This comment starts with '/**',
                   but isn't a kernel-doc comment.

Use a normal comment for the helper function proc_fdinfo_permission().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20241018102705.92237-2-thorsten.blum@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-18 13:02:47 +02:00
Kent Overstreet
bc6d2d1041 bcachefs: fsck: Improve hash_check_key()
hash_check_key() checks and repairs the hash table btrees: dirents and
xattrs are open addressing hash tables.

We recently had a corruption reported where the hash type on an inode
somehow got flipped, which made the existing dirents invisible and
allowed new ones to be created with the same name.

Now, hash_check_key() can repair duplicates: it will delete one of them,
if it has an xattr or dangling dirent, but if it has two valid dirents
one of them gets renamed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
dc96656b20 bcachefs: bch2_hash_set_or_get_in_snapshot()
Add a variant of bch2_hash_set_in_snapshot() that returns the existing
key on -EEXIST.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
15a3836c8e bcachefs: Repair mismatches in inode hash seed, type
Different versions of the same inode (same inode number, different
snapshot ID) must have the same hash seed and type - lookups require
this, since they see keys from different snapshots simultaneously.

To repair we only need to make the inodes consistent, hash_check_key()
will do the rest.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
d8e879377f bcachefs: Add hash seed, type to inode_to_text()
This helped with discovering some filesystem corruption fsck has having
trouble with: the str_hash type had gotten flipped on one snapshot's
version of an inode.

All versions of a given inode number have the same hash seed and hash
type, since lookups will be done with a single hash/seed and type and
see dirents/xattrs from multiple snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
78cf0ae636 bcachefs: INODE_STR_HASH() for bch_inode_unpacked
Trivial cleanup - add a normal BITMASK() helper for bch_inode_unpacked.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
b96f8cd387 bcachefs: Run in-kernel offline fsck without ratelimit errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Hongbo Li
489ecc4cfd bcachefs: skip mount option handle for empty string.
The options parse in get_tree will split the options buffer, it will
get the empty string for last one by strsep(). After commit
ea0eeb89b1d5 ("bcachefs: reject unknown mount options") is merged,
unknown mount options is not allowed (here is empty string), and this
causes this errors. This can be reproduced just by the following steps:

    bcachefs format /dev/loop
    mount -t bcachefs -o metadata_target=loop1 /dev/loop1 /mnt/bcachefs/

Fixes: ea0eeb89b1d5 ("bcachefs: reject unknown mount options")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Hongbo Li
07cf8bac2d bcachefs: fix incorrect show_options results
When call show_options in bcachefs, the options buffer is appeneded
to the seq variable. In fact, it requires an additional comma to be
appended first. This will affect the remount process when reading
existing mount options.

Fixes: 9305cf91d05e ("bcachefs: bch2_opts_to_text()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
97535cd84f bcachefs: Fix data corruption on -ENOSPC in buffered write path
Found by generic/299: When we have to truncate a write due to -ENOSPC,
we may have to read in the folio we're writing to if we're now no longer
doing a complete write to a !uptodate folio.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
335d318ef5 bcachefs: bch2_folio_reservation_get_partial() is now better behaved
bch2_folio_reservation_get_partial(), on partial success, will now
return a reservation that's aligned to the filesystem blocksize.

This is a partial fix for fstests generic/299 - fio verify is badly
behaved in the presence of short writes that aren't aligned to its
blocksize.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
81e0b6c7c1 bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()
bch2_disk_reservation_put() zeroes out the reservation - oops.

This fixes a disk reservation leak when getting a quota reservation
returned an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
4007bbb203 bcachefS: ec: fix data type on stripe deletion
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
a0d11feefb bcachefs: Don't use commit_do() unnecessarily
Using commit_do() to call alloc_sectors_start_trans() breaks when we're
randomly injecting transaction restarts - the restart in the commit
causes us to leak the lock that alloc_sectorS_start_trans() takes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
6bee2a04c5 bcachefs: handle restarts in bch2_bucket_io_time_reset()
bch2_bucket_io_time_reset() doesn't need to succeed, which is why it
didn't previously retry on transaction restart - but we're now treating
these as errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
29fd10a36a bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
d8b5059774 bcachefs: fix restart handling in bch2_alloc_write_key()
This is ugly:

We may discover in alloc_write_key that the data type we calculated is
wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the
disk accounting counters we calculated need to be updated.

But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe
w.r.t. transaction restarts, so we need to propagate the fixup back to
our gc state in case we take a transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
7ee4be9c62 bcachefs: fix restart handling in bch2_do_invalidates_work()
this one is fairly harmless since the invalidate worker will just run
again later if it needs to, but still worth fixing

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
028f3c1d9b bcachefs: fix missing restart handling in bch2_read_retry_nodecode()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
e1c4d2f082 bcachefs: fix restart handling in bch2_fiemap()
We were leaking transaction restart errors to userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
94bdeec8f5 bcachefs: fix bch2_hash_delete() error path
we were exiting an iterator that hadn't been initialized

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
74ec2f3024 bcachefs: fix restart handling in bch2_rename2()
This should be impossible to hit in practice; the first lookup within a
transaction won't return a restart due to lock ordering, but we're
adding fault injection for transaction restarts and shaking out bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Linus Torvalds
4d939780b7 28 hotfixes. 13 are cc:stable. 23 are MM.
It is the usual shower of unrelated singletons - please see the individual
 changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZxGY5wAKCRDdBJ7gKXxA
 js6RAQC16zQ7WRV091i79cEi1C5648NbZjMCU626hZjuyfbzKgEA2v8PYtjj9w2e
 UGLxMY+PYZki2XNEh75Sikdkiyl9Vgg=
 =xcWT
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "28 hotfixes. 13 are cc:stable. 23 are MM.

  It is the usual shower of unrelated singletons - please see the
  individual changelogs for details"

* tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits)
  maple_tree: add regression test for spanning store bug
  maple_tree: correct tree corruption on spanning store
  mm/mglru: only clear kswapd_failures if reclaimable
  mm/swapfile: skip HugeTLB pages for unuse_vma
  selftests: mm: fix the incorrect usage() info of khugepaged
  MAINTAINERS: add Jann as memory mapping/VMA reviewer
  mm: swap: prevent possible data-race in __try_to_reclaim_swap
  mm: khugepaged: fix the incorrect statistics when collapsing large file folios
  MAINTAINERS: kasan, kcov: add bugzilla links
  mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()
  Docs/damon/maintainer-profile: update deprecated awslabs GitHub URLs
  Docs/damon/maintainer-profile: add missing '_' suffixes for external web links
  maple_tree: check for MA_STATE_BULK on setting wr_rebalance
  mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point
  mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets()
  mm: remove unused stub for can_swapin_thp()
  mailmap: add an entry for Andy Chiu
  MAINTAINERS: add memory mapping/VMA co-maintainers
  fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization
  ...
2024-10-17 16:33:06 -07:00
Mark Brown
4e6e8c2b75 binfmt_elf: Wire up AT_HWCAP3 at AT_HWCAP4
AT_HWCAP3 and AT_HWCAP4 were recently defined for use on PowerPC in commit
3281366a8e ("uapi/auxvec: Define AT_HWCAP3 and AT_HWCAP4 aux vector,
entries"). Since we want to start using AT_HWCAP3 on arm64 add support for
exposing both these new hwcaps via binfmt_elf.

Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Kees Cook <kees@kernel.org>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
Link: https://lore.kernel.org/r/20241004-arm64-elf-hwcap3-v2-1-799d1daad8b0@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-17 18:38:49 +01:00
Naohiro Aota
bf9821ba47 btrfs: zoned: fix zone unusable accounting for freed reserved extent
When btrfs reserves an extent and does not use it (e.g, by an error), it
calls btrfs_free_reserved_extent() to free the reserved extent. In the
process, it calls btrfs_add_free_space() and then it accounts the region
bytes as block_group->zone_unusable.

However, it leaves the space_info->bytes_zone_unusable side not updated. As
a result, ENOSPC can happen while a space_info reservation succeeded. The
reservation is fine because the freed region is not added in
space_info->bytes_zone_unusable, leaving that space as "free". OTOH,
corresponding block group counts it as zone_unusable and its allocation
pointer is not rewound, we cannot allocate an extent from that block group.
That will also negate space_info's async/sync reclaim process, and cause an
ENOSPC error from the extent allocation process.

Fix that by returning the space to space_info->bytes_zone_unusable.
Ideally, since a bio is not submitted for this reserved region, we should
return the space to free space and rewind the allocation pointer. But, it
needs rework on extent allocation handling, so let it work in this way for
now.

Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-17 16:16:46 +02:00