Commit Graph

94546 Commits

Author SHA1 Message Date
Linus Torvalds
c1e822754c bcachefs fixes for 6.12-rc5

Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs

Pull bcachefs fixes from Kent Overstreet:
 "Lots of hotfixes:

   - transaction restart injection has been shaking out a few things

   - fix a data corruption in the buffered write path on -ENOSPC, found
     by xfstests generic/299

   - Some small show_options fixes

   - Repair mismatches in inode hash type, seed: different snapshot
     versions of an inode must have the same hash/type seed, used for
     directory entries and xattrs. We were checking the hash seed, but
     not the type, and a user contributed a filesystem where the hash
     type on one inode had somehow been flipped; these fixes allow his
     filesystem to repair.

     Additionally, the hash type flip made some directory entries
     invisible, which were then recreated by userspace; so the hash
     check code now checks for duplicate non dangling dirents, and
     renames one of them if necessary.

   - Don't use wait_event_interruptible() in recovery: this fixes some
     filesystems failing to mount with -ERESTARTSYS

   - Workaround for kvmalloc not supporting > INT_MAX allocations,
     causing an -ENOMEM when allocating the sorted array of journal
     keys: this allows a 75 TB filesystem to mount

   - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
     compat path: this allows Marcin's filesystem (in use since before
     6.7) to repair and mount"

* tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits)
  bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path
  bcachefs: Mark more errors as AUTOFIX
  bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations
  bcachefs: Don't use wait_event_interruptible() in recovery
  bcachefs: Fix __bch2_fsck_err() warning
  bcachefs: fsck: Improve hash_check_key()
  bcachefs: bch2_hash_set_or_get_in_snapshot()
  bcachefs: Repair mismatches in inode hash seed, type
  bcachefs: Add hash seed, type to inode_to_text()
  bcachefs: INODE_STR_HASH() for bch_inode_unpacked
  bcachefs: Run in-kernel offline fsck without ratelimit errors
  bcachefs: skip mount option handle for empty string.
  bcachefs: fix incorrect show_options results
  bcachefs: Fix data corruption on -ENOSPC in buffered write path
  bcachefs: bch2_folio_reservation_get_partial() is now better behaved
  bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()
  bcachefS: ec: fix data type on stripe deletion
  bcachefs: Don't use commit_do() unnecessarily
  bcachefs: handle restarts in bch2_bucket_io_time_reset()
  bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()
  ...
2024-10-24 12:38:59 -07:00
Dominique Martinet
f009e946c1 Revert "9p: Enable multipage folios"
This reverts commit 1325e4a91a.

Using multipage folios apparently breaks some madvise operations like
MADV_PAGEOUT, which no longer reliably unload the specified page.

Revert the patch until that is figured out.

Reported-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: 1325e4a91a ("9p: Enable multipage folios")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-24 11:24:05 -07:00
David Howells
247d65fb12 afs: Fix missing subdir edit when renamed between parent dirs
When rename moves an AFS subdirectory between parent directories, the
subdir also needs a bit of editing: the ".." entry needs updating to point
to the new parent (though I don't make use of the info) and the DV needs
incrementing by 1 to reflect the change of content.  The server also sends
a callback break notification on the subdirectory if we have one, but we
can take care of recovering the promise next time we access the subdir.

This can be triggered by something like:

    mount -t afs %example.com:xfstest.test20 /xfstest.test/
    mkdir /xfstest.test/{aaa,bbb,aaa/ccc}
    touch /xfstest.test/bbb/ccc/d
    mv /xfstest.test/{aaa/ccc,bbb/ccc}
    touch /xfstest.test/bbb/ccc/e

When the pathwalk for the second touch hits "ccc", kafs spots that the DV
is incorrect and downloads it again (so the fix is not critical).

Fix this, if the rename target is a directory and the old and new
parents are different, by:

 (1) Incrementing the DV number of the target locally.

 (2) Editing the ".." entry in the target to refer to its new parent's
     vnode ID and uniquifier.

Link: https://lore.kernel.org/r/3340431.1729680010@warthog.procyon.org.uk
Fixes: 63a4681ff3 ("afs: Locally edit directory data for mkdir/create/unlink/...")
cc: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-24 13:50:27 +02:00
Naohiro Aota
d48e1dea39 btrfs: fix error propagation of split bios
The purpose of btrfs_bbio_propagate_error() is to propagate an error
from a split bio to its original btrfs_bio and report it to the upper
layer. However, it does not work well in some cases.

* Case 1. Immediate (or quick) end_bio with an error

When btrfs sends btrfs_bio to mirrored devices, btrfs calls
btrfs_bio_end_io() when all the mirroring bios are completed. If that
btrfs_bio was split, it is from btrfs_clone_bioset and its end_io function
is btrfs_orig_write_end_io. For this case, btrfs_bbio_propagate_error()
accesses the orig_bbio's bio context to increase the error count.

That works well in most cases. However, if the end_io is called fast
enough, orig_bbio's (remaining part after split) bio context may not be
properly set at that time. Since the bio context is set when the orig_bbio
(the last btrfs_bio) is sent to devices, that might be too late for earlier
split btrfs_bio's completion.  That will result in a NULL pointer
dereference.

That bug is easily reproducible by running btrfs/146 on zoned devices [1]
and it shows the following trace.

[1] You need the raid-stripe-tree feature, as the test creates a "-d raid0 -m raid1" FS.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 1 UID: 0 PID: 13 Comm: kworker/u32:1 Not tainted 6.11.0-rc7-BTRFS-ZNS+ #474
  Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  Workqueue: writeback wb_workfn (flush-btrfs-5)
  RIP: 0010:btrfs_bio_end_io+0xae/0xc0 [btrfs]
  BTRFS error (device dm-0): bdev /dev/mapper/error-test errs: wr 2, rd 0, flush 0, corrupt 0, gen 0
  RSP: 0018:ffffc9000006f248 EFLAGS: 00010246
  RAX: 0000000000000000 RBX: ffff888005a7f080 RCX: ffffc9000006f1dc
  RDX: 0000000000000000 RSI: 000000000000000a RDI: ffff888005a7f080
  RBP: ffff888011dfc540 R08: 0000000000000000 R09: 0000000000000001
  R10: ffffffff82e508e0 R11: 0000000000000005 R12: ffff88800ddfbe58
  R13: ffff888005a7f080 R14: ffff888005a7f158 R15: ffff888005a7f158
  FS:  0000000000000000(0000) GS:ffff88803ea80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000020 CR3: 0000000002e22006 CR4: 0000000000370ef0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   ? __die_body.cold+0x19/0x26
   ? page_fault_oops+0x13e/0x2b0
   ? _printk+0x58/0x73
   ? do_user_addr_fault+0x5f/0x750
   ? exc_page_fault+0x76/0x240
   ? asm_exc_page_fault+0x22/0x30
   ? btrfs_bio_end_io+0xae/0xc0 [btrfs]
   ? btrfs_log_dev_io_error+0x7f/0x90 [btrfs]
   btrfs_orig_write_end_io+0x51/0x90 [btrfs]
   dm_submit_bio+0x5c2/0xa50 [dm_mod]
   ? find_held_lock+0x2b/0x80
   ? blk_try_enter_queue+0x90/0x1e0
   __submit_bio+0xe0/0x130
   ? ktime_get+0x10a/0x160
   ? lockdep_hardirqs_on+0x74/0x100
   submit_bio_noacct_nocheck+0x199/0x410
   btrfs_submit_bio+0x7d/0x150 [btrfs]
   btrfs_submit_chunk+0x1a1/0x6d0 [btrfs]
   ? lockdep_hardirqs_on+0x74/0x100
   ? __folio_start_writeback+0x10/0x2c0
   btrfs_submit_bbio+0x1c/0x40 [btrfs]
   submit_one_bio+0x44/0x60 [btrfs]
   submit_extent_folio+0x13f/0x330 [btrfs]
   ? btrfs_set_range_writeback+0xa3/0xd0 [btrfs]
   extent_writepage_io+0x18b/0x360 [btrfs]
   extent_write_locked_range+0x17c/0x340 [btrfs]
   ? __pfx_end_bbio_data_write+0x10/0x10 [btrfs]
   run_delalloc_cow+0x71/0xd0 [btrfs]
   btrfs_run_delalloc_range+0x176/0x500 [btrfs]
   ? find_lock_delalloc_range+0x119/0x260 [btrfs]
   writepage_delalloc+0x2ab/0x480 [btrfs]
   extent_write_cache_pages+0x236/0x7d0 [btrfs]
   btrfs_writepages+0x72/0x130 [btrfs]
   do_writepages+0xd4/0x240
   ? find_held_lock+0x2b/0x80
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   ? wbc_attach_and_unlock_inode+0x12c/0x290
   __writeback_single_inode+0x5c/0x4c0
   ? do_raw_spin_unlock+0x49/0xb0
   writeback_sb_inodes+0x22c/0x560
   __writeback_inodes_wb+0x4c/0xe0
   wb_writeback+0x1d6/0x3f0
   wb_workfn+0x334/0x520
   process_one_work+0x1ee/0x570
   ? lock_is_held_type+0xc6/0x130
   worker_thread+0x1d1/0x3b0
   ? __pfx_worker_thread+0x10/0x10
   kthread+0xee/0x120
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x30/0x50
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Modules linked in: dm_mod btrfs blake2b_generic xor raid6_pq rapl
  CR2: 0000000000000020

* Case 2. Earlier completion of orig_bbio for mirrored btrfs_bios

btrfs_bbio_propagate_error() assumes the end_io function for orig_bbio is
called last among split bios. In that case, btrfs_orig_write_end_io() sets
the bio->bi_status to BLK_STS_IOERR by seeing the bioc->error [2].
Otherwise, the increased bioc->error of the orig_bio is not checked by
anyone, and BLK_STS_OK is returned to the upper layer.

[2] Actually, this is not true. Because we only increase orig_bioc->errors
by max_errors, the condition "atomic_read(&bioc->error) > bioc->max_errors"
is still not met if only one split btrfs_bio fails.

* Case 3. Later completion of orig_bbio for un-mirrored btrfs_bios

In contrast to the above case, btrfs_bbio_propagate_error() does not work
well if an un-mirrored orig_bbio is completed last. It sets
orig_bbio->bio.bi_status to the btrfs_bio's error, but that is easily
overwritten by orig_bbio's completion status. If the status is BLK_STS_OK,
the upper layer would not know about the failure.

* Solution

Considering the above cases, we can only save the error status in the
orig_bbio (remaining part after split) itself as it is always
available. Also, the saved error status should be propagated when all the
split btrfs_bios are finished (i.e., bbio->pending_ios == 0).

This commit introduces "status" to btrfs_bbio and saves the first error of
split bios to original btrfs_bio's "status" variable. When all the split
bios are finished, the saved status is loaded into original btrfs_bio's
status.
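
The idea of saving the first error and publishing it only when the last
split part finishes can be sketched in userspace C. This is a simplified
model under stated assumptions, not the btrfs code: struct parent_io,
parent_io_init() and part_end_io() are hypothetical names, and C11 atomics
stand in for the kernel's atomic helpers.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for an original I/O that was split into parts. */
struct parent_io {
	atomic_int pending;   /* parts still in flight (models pending_ios) */
	atomic_int status;    /* first error seen, 0 while everything is OK */
	int final_status;     /* value reported upward once pending hits 0 */
	int completed;        /* set when the upward report has happened */
};

static void parent_io_init(struct parent_io *p, int nr_parts)
{
	assert(nr_parts > 0);
	atomic_init(&p->pending, nr_parts);
	atomic_init(&p->status, 0);
	p->final_status = 0;
	p->completed = 0;
}

/* Called once per part, in any order, with that part's own status. */
static void part_end_io(struct parent_io *p, int part_status)
{
	if (part_status != 0) {
		int expected = 0;

		/* Only the first error is kept; later errors do not replace it. */
		atomic_compare_exchange_strong(&p->status, &expected, part_status);
	}

	/* The last part to finish publishes the saved status. */
	if (atomic_fetch_sub(&p->pending, 1) == 1) {
		p->final_status = atomic_load(&p->status);
		p->completed = 1;
	}
}
```

Because the saved status lives in the parent object itself, it is available
no matter which part completes first or last, which is the property the
three cases above lacked.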

With this commit, btrfs/146 on zoned devices does not hit the NULL pointer
dereference anymore.

Fixes: 852eee62d3 ("btrfs: allow btrfs_submit_bio to split bios")
CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-23 18:17:43 +02:00
Ye Bin
2ce1007f42 cifs: fix warning when destroy 'cifs_io_request_pool'
There's an issue as follows:
WARNING: CPU: 1 PID: 27826 at mm/slub.c:4698 free_large_kmalloc+0xac/0xe0
RIP: 0010:free_large_kmalloc+0xac/0xe0
Call Trace:
 <TASK>
 ? __warn+0xea/0x330
 mempool_destroy+0x13f/0x1d0
 init_cifs+0xa50/0xff0 [cifs]
 do_one_initcall+0xdc/0x550
 do_init_module+0x22d/0x6b0
 load_module+0x4e96/0x5ff0
 init_module_from_file+0xcd/0x130
 idempotent_init_module+0x330/0x620
 __x64_sys_finit_module+0xb3/0x110
 do_syscall_64+0xc1/0x1d0
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Obviously, 'cifs_io_request_pool' is not created by mempool_create().
So just use mempool_exit() to revert 'cifs_io_request_pool'.

Fixes: edea94a697 ("cifs: Add mempools for cifs_io_request and cifs_io_subrequest structs")
Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-23 07:42:44 -05:00
Henrique Carvalho
9a5dd61151 smb: client: Handle kstrdup failures for passwords
In smb3_reconfigure(), after duplicating ctx->password and
ctx->password2 with kstrdup(), we need to check for allocation
failures.

If ses->password allocation fails, return -ENOMEM.
If ses->password2 allocation fails, free ses->password, set it
to NULL, and return -ENOMEM.
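
The two-step check described above can be sketched in userspace C. This is
an illustration only, assuming a hypothetical struct session and a
caller-supplied allocator (flaky_alloc, a test hook) so that failures can be
injected; the kernel code uses kstrdup() and its own structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical session object; only the two password fields matter here. */
struct session {
	char *password;
	char *password2;
};

typedef void *(*alloc_fn)(size_t);

/* Test hook (assumption): fail on the Nth allocation, 0 means never fail. */
static int fail_on_call;
static int alloc_calls;

static void *flaky_alloc(size_t n)
{
	if (++alloc_calls == fail_on_call)
		return NULL;
	return malloc(n);
}

/* Duplicate a string through the given allocator; NULL on failure. */
static char *dup_with(alloc_fn alloc, const char *s)
{
	size_t n = strlen(s) + 1;
	char *copy = alloc(n);

	if (copy)
		memcpy(copy, s, n);
	return copy;
}

static int session_set_passwords(struct session *ses, const char *pw,
				 const char *pw2, alloc_fn alloc)
{
	ses->password = dup_with(alloc, pw);
	if (!ses->password)
		return -ENOMEM;

	ses->password2 = dup_with(alloc, pw2);
	if (!ses->password2) {
		/* Undo the first duplication so no half-set state remains. */
		free(ses->password);
		ses->password = NULL;
		return -ENOMEM;
	}
	return 0;
}
```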

Fixes: c1eb537bf4 ("cifs: allow changing password during remount")
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Haoxiang Li <make24@iscas.ac.cn>
Signed-off-by: Henrique Carvalho <henrique.carvalho@suse.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-23 07:42:22 -05:00
Dave Kleikamp
67373ca840 jfs: Fix sanity check in dbMount
MAXAG is a legitimate value for bmp->db_numag
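
The boundary condition can be illustrated with a small sketch. The names
below are hypothetical and MAX_GROUPS is only a stand-in for the
filesystem's own limit; the point is that an inclusive upper bound accepts
a fully populated map, while an exclusive one wrongly rejects it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical upper bound standing in for the filesystem's own limit. */
#define MAX_GROUPS 128

/* A group count is valid from 1 up to and including the maximum. */
static bool group_count_valid(int numag)
{
	return numag >= 1 && numag <= MAX_GROUPS;
}

/* The stricter form, which rejects a count equal to the maximum. */
static bool group_count_valid_too_strict(int numag)
{
	return numag >= 1 && numag < MAX_GROUPS;
}
```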

Fixes: e63866a475 ("jfs: fix out-of-bounds in dbNextAG() and diAlloc()")

Signed-off-by: Dave Kleikamp <dave.kleikamp@oracle.com>
2024-10-22 09:40:37 -05:00
Yue Haibing
75f49c3dc7 btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
The ret may be zero in btrfs_search_dir_index_item() and should not be
passed to ERR_PTR(). btrfs_unlink_subvol() is now the only caller, so
restructure it to check for ERR_PTR(-ENOENT), which is returned when ret >= 0.

This fixes smatch warnings:

fs/btrfs/dir-item.c:353
  btrfs_search_dir_index_item() warn: passing zero to 'ERR_PTR'
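
Why ERR_PTR(0) is a problem can be shown with a userspace re-creation of
the error-pointer convention. The helpers below mimic the kernel macros for
illustration; search_miss() is a hypothetical function that models the end
of a search that found nothing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Userspace re-creation of the error-pointer helpers (illustration only). */
static inline void *ERR_PTR(long error)
{
	return (void *)(intptr_t)error;
}

static inline long PTR_ERR(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline bool IS_ERR(const void *ptr)
{
	/* Only the top MAX_ERRNO addresses encode an error. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Models a search that found no match: ret >= 0 means "not found". */
static void *search_miss(int ret)
{
	if (ret >= 0)
		ret = -ENOENT;
	return ERR_PTR(ret);
}
```

ERR_PTR(0) is simply NULL, which IS_ERR() does not treat as an error, so a
caller that only tests IS_ERR() would go on to use a NULL pointer. Mapping
every non-negative ret to -ENOENT avoids that.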

Fixes: 9dcbe16fcc ("btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:10:55 +02:00
Qu Wenruo
3c36a72c1d btrfs: reject ro->rw reconfiguration if there are hard ro requirements
[BUG]
Syzbot reports the following crash:

  BTRFS info (device loop0 state MCS): disabling free space tree
  BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE (0x1)
  BTRFS info (device loop0 state MCS): clearing compat-ro feature flag for FREE_SPACE_TREE_VALID (0x2)
  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN NOPTI
  KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
  RIP: 0010:backup_super_roots fs/btrfs/disk-io.c:1691 [inline]
  RIP: 0010:write_all_supers+0x97a/0x40f0 fs/btrfs/disk-io.c:4041
  Call Trace:
   <TASK>
   btrfs_commit_transaction+0x1eae/0x3740 fs/btrfs/transaction.c:2530
   btrfs_delete_free_space_tree+0x383/0x730 fs/btrfs/free-space-tree.c:1312
   btrfs_start_pre_rw_mount+0xf28/0x1300 fs/btrfs/disk-io.c:3012
   btrfs_remount_rw fs/btrfs/super.c:1309 [inline]
   btrfs_reconfigure+0xae6/0x2d40 fs/btrfs/super.c:1534
   btrfs_reconfigure_for_mount fs/btrfs/super.c:2020 [inline]
   btrfs_get_tree_subvol fs/btrfs/super.c:2079 [inline]
   btrfs_get_tree+0x918/0x1920 fs/btrfs/super.c:2115
   vfs_get_tree+0x90/0x2b0 fs/super.c:1800
   do_new_mount+0x2be/0xb40 fs/namespace.c:3472
   do_mount fs/namespace.c:3812 [inline]
   __do_sys_mount fs/namespace.c:4020 [inline]
   __se_sys_mount+0x2d6/0x3c0 fs/namespace.c:3997
   do_syscall_x64 arch/x86/entry/common.c:52 [inline]
   do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
   entry_SYSCALL_64_after_hwframe+0x77/0x7f

[CAUSE]
To support mounting different subvolumes with different RO/RW flags under
the new mount APIs, btrfs introduced two workarounds:

- Skip mount option/feature checks if we are mounting a different
  subvolume

- Reconfigure the fs to RW if the initial mount is RO

Combining these two, we can have the following sequence:

- Mount the fs ro,rescue=all,clear_cache,space_cache=v1
  rescue=all will mark the fs as hard read-only, so no v2 cache clearing
  will happen.

- Mount a subvolume rw of the same fs.
  We go into btrfs_get_tree_subvol(), but fc_mount() returns EBUSY
  because our new fc is RW, different from the original fs.

  Now we enter btrfs_reconfigure_for_mount(), which switches the RO flag
  first so that we can grab the existing fs_info.
  Then we reconfigure the fs to RW.

- During reconfiguration, option/features check is skipped
  This means we will restart the v2 cache clearing, and convert back to
  v1 cache.
  This will trigger fs writes, and since the original fs has "rescue=all"
  option, it skips the csum tree read.

  This eventually causes a NULL pointer dereference in super block
  writeback.

[FIX]
For reconfiguration caused by different subvolume RO/RW flags, always
run btrfs_check_options() so that the hard RO requirements are met.

In fact the function btrfs_check_options() does not do many complex
checks, only the hard RO requirement and some feature dependency
checks, so there is no special reason to skip it for mount
reconfiguration.

Reported-by: syzbot+56360f93efa90ff15870@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/0000000000008c5d090621cb2770@google.com/
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:10:51 +02:00
Boris Burkov
7a2339058e btrfs: fix read corruption due to race with extent map merging
In debugging some corrupt squashfs files, we observed symptoms of
corrupt page cache pages but correct on-disk contents. Further
investigation revealed that the exact symptom was a correct page
followed by an incorrect, duplicate, page. This got us thinking about
extent maps.

commit ac05ca913e ("Btrfs: fix race between using extent maps and merging them")
enforces a reference count on the primary `em` extent_map being merged,
as that one gets modified.

However, since
commit 3d2ac99224 ("btrfs: introduce new members for extent_map")
both 'em' and 'merge' get modified, which reintroduced the same race
for 'merge'.

We were able to reproduce this by looping the affected squashfs workload
in parallel on a bunch of separate btrfs-es while also dropping caches.
We are still working on a simple enough reproducer to make into an fstest.

The simplest fix is to stop modifying 'merge', which is not essential,
as it is dropped immediately after the merge. This behavior is simply
a consequence of the order of the two extent maps being important in
computing the new values. Modify merge_ondisk_extents to take prev and
next by const* and also take a third merged parameter that it puts the
results in. Note that this introduces the rather odd behavior of passing
'em' to merge_ondisk_extents as a const * and as a regular ptr.

Fixes: 3d2ac99224 ("btrfs: introduce new members for extent_map")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:10:13 +02:00
Qu Wenruo
f10f59f91a btrfs: fix the delalloc range locking if sector size < page size
Inside lock_delalloc_folios(), there are several problems related to
sector size < page size handling:

- Set the writer locks without checking if the folio is still valid
  We call btrfs_folio_start_writer_lock() just like it's folio_lock().
  But since the folio may not even be the folio of the current mapping,
  we can easily screw up the folio->private.

- The range is not clamped inside the page
  This means we can overwrite other bitmaps if the start/len is not
  properly handled, and trigger the btrfs_subpage_assert().

- @processed_end is always rounded up to page end
  If the delalloc range is not page aligned, and we need to retry
  (returning -EAGAIN), then we will unlock to the page end.

  Thankfully this is not a huge problem, as now
  btrfs_folio_end_writer_lock() can handle range larger than the locked
  range, and only unlock what is already locked.

Fix all these problems by:

- Lock and check the folio first, then call
  btrfs_folio_set_writer_lock()
  So that if we got a folio not belonging to the inode, we won't
  touch folio->private.

- Properly truncate the range inside the page

- Update @processed_end to the locked range end
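
Truncating a byte range to a single page can be sketched as a plain range
intersection. This is a minimal illustration, not the btrfs helper: struct
range and clamp_to_page() are hypothetical names, and the page size is
passed in explicitly.

```c
#include <assert.h>
#include <stdint.h>

struct range {
	uint64_t start;
	uint64_t len;   /* 0 means the range does not touch the page */
};

/* Intersect [start, start + len) with the page starting at page_start. */
static struct range clamp_to_page(uint64_t page_start, uint64_t page_size,
				  uint64_t start, uint64_t len)
{
	uint64_t page_end = page_start + page_size;
	uint64_t end = start + len;
	uint64_t s = start > page_start ? start : page_start;
	uint64_t e = end < page_end ? end : page_end;
	struct range r = { 0, 0 };

	assert(page_size > 0);
	if (s < e) {
		r.start = s;
		r.len = e - s;
	}
	return r;
}
```

With a 64K page at offset 65536 and a range starting at 61440 with length
16384, only the 12288 bytes that fall inside the page are kept, so the
per-page bitmap is never touched outside its own bounds.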

Fixes: 1e1de38792 ("btrfs: make process_one_page() to handle subpage locking")
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:09:44 +02:00
Qu Wenruo
5f9062a48d btrfs: qgroup: set a more sane default value for subtree drop threshold
Since commit 011b46c304 ("btrfs: skip subtree scan if it's too high to
avoid low stall in btrfs_commit_transaction()"), btrfs qgroup can
automatically skip large subtree scan at the cost of marking qgroup
inconsistent.

It's designed to address the final performance problem of snapshot drop
with qgroup enabled, but to be safe the default value is
BTRFS_MAX_LEVEL, requiring a user space daemon to set a different value
to make it work.

I'd say it's not a good idea to rely on a user space tool to set this
default value, especially when some operations (snapshot dropping) can
be triggered immediately after mount, leaving a very small window to
use that sysfs interface.

So instead of disabling this new feature by default, enable it with a
low threshold (3), so that large subvolume tree drop at mount time won't
cause huge qgroup workload.

CC: stable@vger.kernel.org # 6.1
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:09:11 +02:00
Filipe Manana
3510e684b8 btrfs: clear force-compress on remount when compress mount option is given
After the migration to use fs context for processing mount options we had
a slight change in the semantics for remounting a filesystem that was
mounted with compress-force. Before we could clear compress-force by
passing only "-o compress[=algo]" during a remount, but after that change
that does not work anymore, force-compress is still present and one needs
to pass "-o compress-force=no,compress[=algo]" to the mount command.

Example, when running on a kernel 6.8+:

  $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
  $ mount | grep sdi
  /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)

  $ mount -o remount,compress=zlib:5 /mnt/sdi
  $ mount | grep sdi
  /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)

On a 6.7 kernel (or older):

  $ mount -o compress-force=zlib:9 /dev/sdi /mnt/sdi
  $ mount | grep sdi
  /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress-force=zlib:9,discard=async,space_cache=v2,subvolid=5,subvol=/)

  $ mount -o remount,compress=zlib:5 /mnt/sdi
  $ mount | grep sdi
  /dev/sdi on /mnt/sdi type btrfs (rw,relatime,compress=zlib:5,discard=async,space_cache=v2,subvolid=5,subvol=/)

So update btrfs_parse_param() to clear "compress-force" when "compress" is
given, providing the same semantics as kernel 6.7 and older.
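
The restored semantics can be modeled with a small option parser. This is a
sketch under stated assumptions, not btrfs_parse_param(): struct mount_opts
and apply_option() are hypothetical, and only the two compression options
are handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical option state; only compression settings are modeled. */
struct mount_opts {
	bool compress;
	bool force_compress;
	char algo[16];
};

/* Apply one key/value option; returns 0 on success, -1 for unknown keys. */
static int apply_option(struct mount_opts *o, const char *key,
			const char *value)
{
	if (strcmp(key, "compress") == 0) {
		o->compress = true;
		/* Plain "compress" drops any earlier force flag. */
		o->force_compress = false;
	} else if (strcmp(key, "compress-force") == 0) {
		o->compress = true;
		o->force_compress = true;
	} else {
		return -1;
	}
	snprintf(o->algo, sizeof(o->algo), "%s", value);
	return 0;
}
```

Applying "compress-force=zlib:9" followed by "compress=zlib:5" leaves the
force flag cleared and the algorithm set to "zlib:5", matching the 6.7
output shown above.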

Reported-by: Roman Mamedov <rm@romanrm.net>
Link: https://lore.kernel.org/linux-btrfs/20241014182416.13d0f8b0@nvm/
CC: stable@vger.kernel.org # 6.8+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-22 16:07:53 +02:00
Christoph Hellwig
4a201dcfa1 xfs: update the pag for the last AG at recovery time
Currently log recovery never updates the in-core perag values for the
last allocation group when they were grown by growfs.  This leads to
btree record validation failures for the alloc, ialloc or finotbt
trees if a transaction references this new space.

Found by Brian's new growfs recovery stress test.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:19 +02:00
Christoph Hellwig
069cf5e32b xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_perag
__GFP_RETRY_MAYFAIL increases the likelihood of allocations failing,
which isn't really helpful during log recovery.  Remove the flag and
stick to the default GFP_KERNEL policies.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
b882b0f813 xfs: error out when a superblock buffer update reduces the agcount
XFS currently does not support reducing the agcount, so error out if
a logged sb buffer tries to shrink the agcount.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
6a18765b54 xfs: update the file system geometry after recoverying superblock buffers
Primary superblock buffers that change the file system geometry after a
growfs operation can affect the operation of later CIL checkpoints that
make use of the newly added space and allocation groups.

Apply the changes to the in-memory structures as part of recovery pass 2,
to ensure recovery works fine for such cases.

In the future we should apply the logic to other updates such as feature
bits as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
aa67ec6a25 xfs: merge the perag freeing helpers
There is no good reason to have two different routines for freeing perag
structures for the unmount and error cases.  Add two arguments to specify
the range of AGs to free to xfs_free_perag, and use that to replace
xfs_free_unused_perag_range.

The additional RCU grace period for the error case is harmless, and the
extra check for the AG to actually exist is not required now that the
callers pass the exact known allocated range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Christoph Hellwig
82742f8c3f xfs: pass the exact range to initialize to xfs_initialize_perag
Currently only the new agcount is passed to xfs_initialize_perag, which
requires lookups of existing AGs to skip them and complicates error
handling.  Also pass the previous agcount so that the range that
xfs_initialize_perag operates on is exactly defined.  That way the
extra lookups can be avoided, and error handling can clean up the
exact range from the old count to the last added perag structure.
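
Passing both ends of the range makes initialization and cleanup symmetric,
as the following userspace sketch shows. It is not the XFS code: the group
table, init_groups() and free_groups() are hypothetical, and fail_at is a
test hook that simulates an allocation failure at a given index.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define MAX_GROUPS 8

struct group {
	unsigned int index;
};

static struct group *groups[MAX_GROUPS];

/* Free exactly the slots in [first, end). */
static void free_groups(unsigned int first, unsigned int end)
{
	for (unsigned int i = first; i < end; i++) {
		free(groups[i]);
		groups[i] = NULL;
	}
}

/*
 * Initialize exactly the slots in [old_count, new_count).
 * fail_at (assumption, test hook): allocation for that index is treated
 * as failed; pass -1 to never fail.
 */
static int init_groups(unsigned int old_count, unsigned int new_count,
		       int fail_at)
{
	assert(old_count <= new_count && new_count <= MAX_GROUPS);

	for (unsigned int i = old_count; i < new_count; i++) {
		struct group *g = ((int)i == fail_at) ? NULL : malloc(sizeof(*g));

		if (!g) {
			/* Undo only what this call added; older slots stay. */
			free_groups(old_count, i);
			return -ENOMEM;
		}
		g->index = i;
		groups[i] = g;
	}
	return 0;
}
```

No lookup of existing entries is needed, and a failure partway through
leaves the previously existing groups untouched.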

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Darrick J. Wong
af8512c527 xfs: don't fail repairs on metadata files with no attr fork
Fix a minor bug where we fail repairs on metadata files that do not have
attr forks because xrep_metadata_inode_subtype doesn't filter ENOENT.

Cc: stable@vger.kernel.org # v6.8
Fixes: 5a8e07e799 ("xfs: repair the inode core and forks of a metadata inode")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-22 13:37:18 +02:00
Linus Torvalds
7166c32651 vfs-6.12-rc5.fixes

Merge tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

Pull vfs fixes from Christian Brauner:
 "afs:
   - Fix a lock recursion in afs_wake_up_async_call() on ->notify_lock

 netfs:
   - Drop the references to a folio immediately after the folio has been
     extracted to prevent races with future I/O collection

   - Fix a documentation build error

   - Downgrade the i_rwsem for buffered writes to fix a cifs reported
     performance regression when switching to netfslib

  vfs:
   - Explicitly return -E2BIG from openat2() if the specified size is
     unexpectedly large. This aligns openat2() with other extensible
     struct based system calls

   - When copying a mount namespace ensure that we only try to remove
     the new copy from the mount namespace rbtree if it has already been
     added to it

  nilfs:
   - Clear the buffer delay flag when clearing the buffer state flags
     when a buffer head is discarded to prevent a kernel oops

  ocfs2:
   - Fix an uninitialized value warning in ocfs2_setattr()

  proc:
   - Fix a kernel doc warning"

* tag 'vfs-6.12-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  proc: Fix W=1 build kernel-doc warning
  afs: Fix lock recursion
  fs: Fix uninitialized value issue in from_kuid and from_kgid
  fs: don't try and remove empty rbtree node
  netfs: Downgrade i_rwsem for a buffered write
  nilfs2: fix kernel bug due to missing clearing of buffer delay flag
  openat2: explicitly return -E2BIG for (usize > PAGE_SIZE)
  netfs: fix documentation build error
  netfs: In readahead, put the folio refs as soon extracted
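
The openat2() size check described above can be sketched in user space as
follows. This is only an illustration: check_open_how_size() and the
SKETCH_* constants are hypothetical names, a 4 KiB page size is assumed,
and 24 bytes is the size of the first version of struct open_how (three
64-bit fields).

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumption: 4 KiB pages */
#define SKETCH_HOW_SIZE_VER0	24u	/* three 64-bit fields in struct open_how v0 */

/* Hypothetical helper: validate the user-supplied size of an extensible struct. */
static int check_open_how_size(size_t usize)
{
	if (usize < SKETCH_HOW_SIZE_VER0)
		return -EINVAL;		/* smaller than the oldest known layout */
	if (usize > SKETCH_PAGE_SIZE)
		return -E2BIG;		/* unexpectedly large: reject explicitly */
	return 0;
}
```

A caller passing a size between the two bounds gets 0 and proceeds to copy
the struct; anything larger than a page is refused before any copy is
attempted.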
2024-10-21 10:48:24 -07:00
Christoph Hellwig
6db388585e
iomap: turn iomap_want_unshare_iter into an inline function
iomap_want_unshare_iter currently sits in fs/iomap/buffered-io.c, which
depends on CONFIG_BLOCK.  It is also used in fs/dax.c, which has no
such dependency.  Given that it is a trivial check, turn it into an inline
in include/linux/iomap.h to fix the DAX && !BLOCK build.
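
A minimal sketch of why a header inline removes the build dependency: the
predicate needs nothing from the CONFIG_BLOCK-only object file, so any
includer can use it. The struct layouts and constant values below are
simplified stand-ins, and the predicate body is an assumption about what
such a check looks like, not a copy of the kernel source.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel definitions; values are illustrative. */
enum { IOMAP_HOLE = 0, IOMAP_MAPPED = 2 };
#define IOMAP_F_SHARED	(1U << 2)

struct iomap {
	unsigned int type;
	unsigned int flags;
};

struct iomap_iter {
	struct iomap iomap;	/* destination mapping */
	struct iomap srcmap;	/* source mapping for COW */
};

/*
 * Header-style inline: unsharing is only wanted when the destination is
 * shared and the source is actually mapped (assumed predicate).
 */
static inline bool iomap_want_unshare_iter(const struct iomap_iter *iter)
{
	return (iter->iomap.flags & IOMAP_F_SHARED) &&
	       iter->srcmap.type == IOMAP_MAPPED;
}
```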

Fixes: 6ef6a0e821 ("iomap: share iomap_unshare_iter predicate code with fsdax")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20241015041350.118403-1-hch@lst.de
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21 17:01:01 +02:00
Yang Erkun
d5ff2fb2e7 nfsd: cancel nfsd_shrinker_work using sync mode in nfs4_state_shutdown_net
In the normal case, when we execute `echo 0 > /proc/fs/nfsd/threads`, the
function `nfs4_state_destroy_net` in `nfs4_state_shutdown_net` will
release all resources related to the hashed `nfs4_client`. If the
`nfsd_client_shrinker` is running concurrently, the `expire_client`
function will first unhash this client and then destroy it. This can
lead to the following warning. Additionally, numerous use-after-free
errors may occur as well.

nfsd_client_shrinker         echo 0 > /proc/fs/nfsd/threads

expire_client                nfsd_shutdown_net
  unhash_client                ...
                               nfs4_state_shutdown_net
                                 /* won't wait shrinker exit */
  /*                             cancel_work(&nn->nfsd_shrinker_work)
   * nfsd_file for this          /* won't destroy unhashed client1 */
   * client1 still alive         nfs4_state_destroy_net
   */

                               nfsd_file_cache_shutdown
                                 /* trigger warning */
                                 kmem_cache_destroy(nfsd_file_slab)
                                 kmem_cache_destroy(nfsd_file_mark_slab)
  /* release nfsd_file and mark */
  __destroy_client

====================================================================
BUG nfsd_file (Not tainted): Objects remaining in nfsd_file on
__kmem_cache_shutdown()
--------------------------------------------------------------------
CPU: 4 UID: 0 PID: 764 Comm: sh Not tainted 6.12.0-rc3+ #1

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xac/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

====================================================================
BUG nfsd_file_mark (Tainted: G    B   W         ): Objects remaining
nfsd_file_mark on __kmem_cache_shutdown()
--------------------------------------------------------------------

 dump_stack_lvl+0x53/0x70
 slab_err+0xb0/0xf0
 __kmem_cache_shutdown+0x15c/0x310
 kmem_cache_destroy+0x66/0x160
 nfsd_file_cache_shutdown+0xc8/0x210 [nfsd]
 nfsd_destroy_serv+0x251/0x2a0 [nfsd]
 nfsd_svc+0x125/0x1e0 [nfsd]
 write_threads+0x16a/0x2a0 [nfsd]
 nfsctl_transaction_write+0x74/0xa0 [nfsd]
 vfs_write+0x1a5/0x6d0
 ksys_write+0xc1/0x160
 do_syscall_64+0x5f/0x170
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

To resolve this issue, cancel `nfsd_shrinker_work` using synchronous
mode in nfs4_state_shutdown_net.

Fixes: 7c24fa2250 ("NFSD: replace delayed_work with work_struct for nfsd_client_shrinker")
Signed-off-by: Yang Erkun <yangerkun@huaweicloud.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-21 10:27:36 -04:00
Gao Xiang
14c2d97265
erofs: use get_tree_bdev_flags() to avoid misleading messages
Users can then pass in an arbitrary source path for the proper type of
mount without getting the "Can't lookup blockdev" error message.

Reported-by: Allison Karlitskaya <allison.karlitskaya@redhat.com>
Closes: https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241009033151.2334888-2-hsiangkao@linux.alibaba.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21 14:30:27 +02:00
Gao Xiang
4021e68513
fs/super.c: introduce get_tree_bdev_flags()
As Allison reported [1], get_tree_bdev() currently stores a
"Can't lookup blockdev" error message.  Although that makes sense for
pure bdev-based filesystems, the message may mislead users who try to use
EROFS file-backed mounts, since get_tree_nodev() is then used as a
fallback.

Add get_tree_bdev_flags() to specify extensible flags [2] and
GET_TREE_BDEV_QUIET_LOOKUP to silence "Can't lookup blockdev" message
since it's misleading to EROFS file-backed mounts now.
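
The effect of the flag can be sketched in user space like this. All
sketch_* names are hypothetical, the flag value is illustrative, and the
"lookup" is a stand-in that merely treats paths under /dev/ as block
devices; only the flag name and the message text come from the
description above.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

#define GET_TREE_BDEV_QUIET_LOOKUP	0x0001	/* value is illustrative */

/* Hypothetical stand-in for a block device lookup. */
static int sketch_lookup_bdev(const char *source)
{
	return strncmp(source, "/dev/", 5) == 0 ? 0 : -ENOTBLK;
}

/* Record the error message only when the caller did not ask for quiet lookup. */
static int sketch_get_tree_bdev_flags(const char *source, unsigned int flags,
				      char *errbuf, size_t len)
{
	int err = sketch_lookup_bdev(source);

	errbuf[0] = '\0';
	if (err && !(flags & GET_TREE_BDEV_QUIET_LOOKUP))
		snprintf(errbuf, len, "Can't lookup blockdev");
	return err;
}
```

A filesystem that can fall back to a file-backed mount would pass the
quiet flag, see the error code, and try its fallback without leaving a
misleading message behind.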

[1] https://lore.kernel.org/r/CAOYeF9VQ8jKVmpy5Zy9DNhO6xmWSKMB-DO8yvBB0XvBE7=3Ugg@mail.gmail.com
[2] https://lore.kernel.org/r/ZwUkJEtwIpUA4qMz@infradead.org

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20241009033151.2334888-1-hsiangkao@linux.alibaba.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21 14:30:26 +02:00
Miklos Szeredi
184429a17f Revert "fuse: move initialization of fuse_file to fuse_writepages() instead of in callback"
This reverts commit 672c3b7457.

fuse_writepages() might be called with no dirty pages after all writable
opens were closed.  In this case __fuse_write_file_get() will return NULL
which will trigger the WARNING.

The exact conditions under which this is triggered are unclear, and syzbot
hasn't found a reproducer yet.

Reported-by: syzbot+217a976dc26ef2fa8711@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/CAJnrk1aQwfvb51wQ5rUSf9N8j1hArTFeSkHqC_3T-mU6_BCD=A@mail.gmail.com/
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-10-21 10:02:51 +02:00
Kent Overstreet
a069f01479 bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path
This fixes a fsck bug on a very old filesystem (pre mainline merge).

Fixes: 72350ee0ea ("bcachefs: Kill snapshot arg to fsck_write_inode()")
Reported-by: Marcin Mirosław <marcin@mejor.pl>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 18:09:09 -04:00
Kent Overstreet
e04ee86089 bcachefs: Mark more errors as AUTOFIX
Reported-by: Marcin Mirosław <marcin@mejor.pl>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 18:08:53 -04:00
Kent Overstreet
f0d3302073 bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations
kvmalloc() doesn't support allocations > INT_MAX, but vmalloc() does -
the limit should be lifted, but we can work around this for now.

A user with a 75 TB filesystem reported the following journal replay
error:
https://github.com/koverstreet/bcachefs/issues/769

In journal replay we have to sort and dedup all the keys from the
journal, which means we need a large contiguous allocation. Given that
the user has 128GB of ram, the 2GB limit on allocation size has become
far too small.
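
The size arithmetic behind the workaround can be sketched as below. The
sketch_* names are hypothetical, and the function only decides which
allocator would be needed rather than allocating anything; the element
size in the usage note is likewise an assumption.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

enum sketch_allocator {
	SKETCH_KVMALLOC,	/* request fits under the INT_MAX limit */
	SKETCH_VMALLOC,		/* request is too large for kvmalloc() */
	SKETCH_OVERFLOW,	/* nr * elem_size does not fit in size_t */
};

/* Decide how an array of nr elements of elem_size bytes could be allocated. */
static enum sketch_allocator sketch_pick_allocator(size_t nr, size_t elem_size)
{
	if (elem_size && nr > SIZE_MAX / elem_size)
		return SKETCH_OVERFLOW;
	return nr * elem_size > (size_t)INT_MAX ? SKETCH_VMALLOC : SKETCH_KVMALLOC;
}
```

For example, 100 million keys at an assumed 32 bytes each need about
3.2 GB, which is past the 2 GB limit and so falls into the vmalloc case.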

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 16:50:14 -04:00
Kent Overstreet
3956ff8bc2 bcachefs: Don't use wait_event_interruptible() in recovery
Fix a bug where mount was failing with -ERESTARTSYS:
https://github.com/koverstreet/bcachefs/issues/741

We only want the interruptible wait when called from fsync.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 16:50:14 -04:00
Kent Overstreet
eb5db64c45 bcachefs: Fix __bch2_fsck_err() warning
We only warn about having a btree_trans that wasn't passed in if we'll
be prompting.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-20 16:50:14 -04:00
Linus Torvalds
9197b73fd7 Mashed-up update that I sat on too long:
- fix for multiple slabs created with the same name
 - enable multipage folios
 - theoretical fix to also look for opened fids by inode if none
 was found by dentry
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAmcS81AACgkQq06b7GqY
 5nACpBAAtXOGRjg+dushCwUVKBlnI3oTwE2G+ywnphNZg2A0emlMOxos7x1OTiM3
 Fu0b10MCUWHIXo4jD6ALVPWITJTfjiXR8s90Q/ozypcIXXhkDDShhV31b2h6Iplr
 YyKyjEehDFRiS7rqWC2a9mce99sOpwdQRmnssnWbjYvpJ4imFbl+50Z1I5Nc/Omu
 j2y02eMuikiWF/shKj0Dx1mmpZ4InSv3kvlM+V2D2YdWKNonGZe/xFZhid95LXmr
 Upt55R8k9qR2pn4VU22eKP6c34DIZGDlrcQdPUCNP5QuaAdGZov3TjNQdjE1bJmF
 E2QdxvUNfvvHqlvaRrlWa27uMgXMcy7QV3LEKwmo3tmaYVw2PDMRbFXc9zQdxy91
 zqXjjGasnwzE8ca36y79vZjFTHAyY5VK/3cHCL3ai+ysu4UL3k2QgmVegREG/xKk
 G8Nz4UO/R6s8Wc2VqxKJdZS5NMLlADS+Aes0PG+9AxQz7iR9Ktgwrw39KDxMi+Lm
 PeH3Gz2rP9+EPoa3usoBQtvvvmJKM/Wb9qdPW9vTtRbRJ7bVclJoizFoLMA/TiW1
 Jru+HYGBO75s8RynwEDLMiJhkjZWHfVgDjPsY6YsGVH8W2gOcJ7egQ2J2EsuurN3
 tzKz4uQilV+VeDuWs8pWKrX/c3Y3KpSYV+oayg7Je7LoTlQBmU8=
 =VG4t
 -----END PGP SIGNATURE-----

Merge tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux

Pull 9p fixes from Dominique Martinet:
 "Mashed-up update that I sat on too long:

   - fix for multiple slabs created with the same name

   - enable multipage folios

   - theoretical fix to also look for opened fids by inode if none was
     found by dentry"

[ Enabling multi-page folios should have been done during the merge
  window, but it's a one-liner, and the actual meat of the enablement
  is in netfs and already in use for other filesystems...  - Linus ]

* tag '9p-for-6.12-rc4' of https://github.com/martinetd/linux:
  9p: Avoid creating multiple slab caches with the same name
  9p: Enable multipage folios
  9p: v9fs_fid_find: also lookup by inode if not found dentry
2024-10-19 08:44:10 -07:00
Olga Kornievskaia
8dd91e8d31 nfsd: fix race between laundromat and free_stateid
There is a race between the laundromat's handling of revoked delegations
and a client sending a free_stateid operation. The laundromat thread
finds that a delegation has expired and needs to be revoked, so it
marks the delegation stid revoked and puts it on a reaper list,
but then it unlocks the state lock and the actual delegation revocation
happens without the lock. Once the stid is marked revoked, a racing
free_stateid processing thread does the following: (1) it calls
list_del_init(), which removes the stid from the reaper list, and (2) it
frees the delegation stid structure. The laundromat thread ends up not
calling the revoke_delegation() function for this particular delegation,
which means it will not release the lock lease that exists on
the file.

Now, a new open for this file comes in, finds that the
lease list isn't empty, and calls nfsd_breaker_owns_lease(), which ends
up trying to dereference a freed delegation stateid, leading to the
following use-after-free KASAN warning:

kernel: ==================================================================
kernel: BUG: KASAN: slab-use-after-free in nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: Read of size 8 at addr ffff0000e73cd0c8 by task nfsd/6205
kernel:
kernel: CPU: 2 UID: 0 PID: 6205 Comm: nfsd Kdump: loaded Not tainted 6.11.0-rc7+ #9
kernel: Hardware name: Apple Inc. Apple Virtualization Generic Platform, BIOS 2069.0.0.0.0 08/03/2024
kernel: Call trace:
kernel: dump_backtrace+0x98/0x120
kernel: show_stack+0x1c/0x30
kernel: dump_stack_lvl+0x80/0xe8
kernel: print_address_description.constprop.0+0x84/0x390
kernel: print_report+0xa4/0x268
kernel: kasan_report+0xb4/0xf8
kernel: __asan_report_load8_noabort+0x1c/0x28
kernel: nfsd_breaker_owns_lease+0x140/0x160 [nfsd]
kernel: nfsd_file_do_acquire+0xb3c/0x11d0 [nfsd]
kernel: nfsd_file_acquire_opened+0x84/0x110 [nfsd]
kernel: nfs4_get_vfs_file+0x634/0x958 [nfsd]
kernel: nfsd4_process_open2+0xa40/0x1a40 [nfsd]
kernel: nfsd4_open+0xa08/0xe80 [nfsd]
kernel: nfsd4_proc_compound+0xb8c/0x2130 [nfsd]
kernel: nfsd_dispatch+0x22c/0x718 [nfsd]
kernel: svc_process_common+0x8e8/0x1960 [sunrpc]
kernel: svc_process+0x3d4/0x7e0 [sunrpc]
kernel: svc_handle_xprt+0x828/0xe10 [sunrpc]
kernel: svc_recv+0x2cc/0x6a8 [sunrpc]
kernel: nfsd+0x270/0x400 [nfsd]
kernel: kthread+0x288/0x310
kernel: ret_from_fork+0x10/0x20

This patch proposes a fix based on adding two new stid
sc_status values that help coordinate between the laundromat
and other operations (nfsd4_free_stateid() and nfsd4_delegreturn()).

First, to make sure that the stid is not removed by
nfsd4_free_stateid() once it is marked revoked, the laundromat takes a
reference on the stateid. Then, to coordinate whether the stid has been
put on the cl_revoked list or FREE_STATEID is being processed and must
remove it from the list, each side checks that state and acts
accordingly. If the laundromat has added the stid to the cl_revoked list
before the arrival of FREE_STATEID, then nfsd4_free_stateid() knows to
remove it from the list. If nfsd4_free_stateid() finds that the operation
arrived before the laundromat has placed the stid on the cl_revoked list,
it marks the state freed and the laundromat will then no longer add it to
the list.

Also, when nfsd4_delegreturn() looks for the specified stid, it needs
to be able to access stids that are marked removed or freeable: such a
status means the laundromat has started processing the stid but hasn't
finished, and this delegreturn needs to return nfserr_deleg_revoked and
not nfserr_bad_stateid. The latter will not trigger a FREE_STATEID, and
the lack of it will leave this stid on the cl_revoked list indefinitely.

Fixes: 2d4a532d38 ("nfsd: ensure that clp->cl_revoked list is protected by clp->cl_lock")
CC: stable@vger.kernel.org
Signed-off-by: Olga Kornievskaia <okorniev@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-10-18 16:40:37 -04:00
Linus Torvalds
b04ae0f451 two fixes for stable, and two small cleanup fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcSdmYACgkQiiy9cAdy
 T1EnnAwAoNbY+odLB9atHIuaBftpyINrhzRrzpwTfYNtPKUPGxxGk2fiP29YqMLb
 OF4jnC87E3P/xhydoZHXXe3kKBQFVMAkJZKHiZBvJd+brk/EadfQnNmIio1pwOGh
 zFNxSujFtsM/1HU/ZoI2kaHzrqj5KxWKWFytZ6umd8C3NyKK9Lo/lcqUBKv8MpJy
 XXkMBh+7HGKRfDQlU+n6NQ5+dqFL5xDjTXlm9dM8LXuInKy5oKTGnRhLA7OA8lt7
 EenFo8joy0IpXUByHt+ksQ8P88NCnU2h9kGp1UrGrBPh90+MokRr9GAcH8twK8jt
 /bpL4yzAwuk1TAg+L9mSLT2OtWYsDpsQZmsBMbxBZGr2qmtjwgbxSgjf6DNiJZgn
 jz15nFsuEsU5AbX4EAE67fwRWAo9AmQFyOOcYgkiIWOFHaRU6D/2NzCxCDZ+mfpy
 Z5f7dF/sA158iY4wmB5BrQpFamxzpLADz6Qy4NA9hXjEKsbyFAuf22EjE64ruxZ4
 8nMB3buh
 =peum
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

 - Fix possible double free setting xattrs

 - Fix slab out of bounds with large ioctl payload

 - Remove three unused functions, and an unused variable that could be
   confusing

* tag 'v6.12-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Remove unused functions
  smb/client: Fix logically dead code
  smb: client: fix OOBs when building SMB2_IOCTL request
  smb: client: fix possible double free in smb2_set_ea()
2024-10-18 11:37:12 -07:00
Linus Torvalds
568570fdf2 XFS Bug fixes for 6.12-rc4
* Fix integer overflow in xrep_bmap
 * Fix stale delalloc punching for COW IO
 
 Signed-off-by: Carlos Maiolino <cem@kernel.org>
 -----BEGIN PGP SIGNATURE-----
 
 iJUEABMJAB0WIQQMHYkcUKcy4GgPe2RGdaER5QtfpgUCZw5LIwAKCRBGdaER5Qtf
 puRlAYDezbvs1dDSkKIGOt3inGdLptNAu4qniXBUkbYI9BzmtIVDueWP4Wo0dV3d
 gu3xrWQBfjFXdmEuBlwLuAFrp07AN18BVMj+DWCiEShsPHSoSPcF/IrDiz4BHvGv
 MKYq9CywFw==
 =Gj9b
 -----END PGP SIGNATURE-----

Merge tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux

Pull xfs fixes from Carlos Maiolino:

 - Fix integer overflow in xrep_bmap

 - Fix stale delalloc punching for COW IO

* tag 'xfs-6.12-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
  xfs: punch delalloc extents from the COW fork for COW writes
  xfs: set IOMAP_F_SHARED for all COW fork allocations
  xfs: share more code in xfs_buffered_write_iomap_begin
  xfs: support the COW fork in xfs_bmap_punch_delalloc_range
  xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock
  xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
  xfs: factor out a xfs_file_write_zero_eof helper
  iomap: move locking out of iomap_write_delalloc_release
  iomap: remove iomap_file_buffered_write_punch_delalloc
  iomap: factor out a iomap_last_written_block helper
  xfs: fix integer overflow in xrep_bmap
2024-10-18 11:28:39 -07:00
Thorsten Blum
197231da7f
proc: Fix W=1 build kernel-doc warning
Building the kernel with W=1 generates the following warning:

  fs/proc/fd.c:81: warning: This comment starts with '/**',
                   but isn't a kernel-doc comment.

Use a normal comment for the helper function proc_fdinfo_permission().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Link: https://lore.kernel.org/r/20241018102705.92237-2-thorsten.blum@linux.dev
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-18 13:02:47 +02:00
Kent Overstreet
bc6d2d1041 bcachefs: fsck: Improve hash_check_key()
hash_check_key() checks and repairs the hash table btrees: dirents and
xattrs are open addressing hash tables.

We recently had a corruption reported where the hash type on an inode
somehow got flipped, which made the existing dirents invisible and
allowed new ones to be created with the same name.

Now, hash_check_key() can repair duplicates: it will delete one of them
if it is an xattr or a dangling dirent, but if there are two valid
dirents, one of them gets renamed.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
dc96656b20 bcachefs: bch2_hash_set_or_get_in_snapshot()
Add a variant of bch2_hash_set_in_snapshot() that returns the existing
key on -EEXIST.
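
The "set or get" idea can be sketched on a toy open addressing table with
linear probing. This is not the bcachefs btree-backed implementation; the
sketch_* names, the fixed table size and the integer keys are all
hypothetical and only illustrate returning the existing entry together
with -EEXIST.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_SLOTS 8

struct sketch_entry {
	int used;
	unsigned int key;
	int value;
};

struct sketch_table {
	struct sketch_entry slots[SKETCH_SLOTS];
};

/*
 * Insert (key, value) using linear probing. If the key is already present,
 * leave the table unchanged, point *existing at the stored entry and
 * return -EEXIST so the caller can inspect the conflicting entry.
 */
static int sketch_set_or_get(struct sketch_table *t, unsigned int key,
			     int value, struct sketch_entry **existing)
{
	*existing = NULL;
	for (size_t i = 0; i < SKETCH_SLOTS; i++) {
		struct sketch_entry *e = &t->slots[(key + i) % SKETCH_SLOTS];

		if (!e->used) {
			e->used = 1;
			e->key = key;
			e->value = value;
			return 0;
		}
		if (e->key == key) {
			*existing = e;
			return -EEXIST;
		}
	}
	return -ENOSPC;		/* table full */
}
```

Handing back the existing entry saves the caller a second lookup when it
needs to compare the two conflicting keys, which is the situation a
duplicate-repair pass is in.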

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
15a3836c8e bcachefs: Repair mismatches in inode hash seed, type
Different versions of the same inode (same inode number, different
snapshot ID) must have the same hash seed and type - lookups require
this, since they see keys from different snapshots simultaneously.

To repair we only need to make the inodes consistent, hash_check_key()
will do the rest.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
d8e879377f bcachefs: Add hash seed, type to inode_to_text()
This helped with discovering some filesystem corruption fsck was having
trouble with: the str_hash type had gotten flipped on one snapshot's
version of an inode.

All versions of a given inode number have the same hash seed and hash
type, since lookups will be done with a single hash/seed and type and
see dirents/xattrs from multiple snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
78cf0ae636 bcachefs: INODE_STR_HASH() for bch_inode_unpacked
Trivial cleanup - add a normal BITMASK() helper for bch_inode_unpacked.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
b96f8cd387 bcachefs: Run in-kernel offline fsck without ratelimit errors
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Hongbo Li
489ecc4cfd bcachefs: skip mount option handle for empty string.
The option parsing in get_tree splits the options buffer with strsep(),
which yields an empty string as the last token. Since commit
ea0eeb89b1d5 ("bcachefs: reject unknown mount options") was merged,
unknown mount options (here, the empty string) are not allowed, and this
causes an error. This can be reproduced with the following steps:

    bcachefs format /dev/loop
    mount -t bcachefs -o metadata_target=loop1 /dev/loop1 /mnt/bcachefs/
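
A user-space sketch of the tokenizing behaviour and the skip, assuming the
buffer ends with a separator so that the last token is empty.
sketch_next_opt() is a hypothetical, portable stand-in with strsep()-like
semantics for a single delimiter, and sketch_count_opts() merely counts
the tokens that would reach the option parser.

```c
#include <stddef.h>
#include <string.h>

/* Return the next token and advance *cursor; NULL once the input is used up. */
static char *sketch_next_opt(char **cursor, char delim)
{
	char *start = *cursor, *end;

	if (!start)
		return NULL;
	end = strchr(start, delim);
	if (end) {
		*end = '\0';
		*cursor = end + 1;
	} else {
		*cursor = NULL;
	}
	return start;
}

/* Count the options that would be handed to the parser, skipping empty ones. */
static int sketch_count_opts(char *buf)
{
	char *opt;
	int n = 0;

	while ((opt = sketch_next_opt(&buf, ',')) != NULL) {
		if (!*opt)	/* empty token: not an unknown option, just skip it */
			continue;
		n++;
	}
	return n;
}
```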

Fixes: ea0eeb89b1d5 ("bcachefs: reject unknown mount options")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Hongbo Li
07cf8bac2d bcachefs: fix incorrect show_options results
When show_options is called in bcachefs, the options buffer is appended
to the seq variable. In fact, an additional comma needs to be
appended first. This affects the remount process when reading
existing mount options.
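
A minimal sketch of the separator handling, with a plain character buffer
standing in for the seq variable. sketch_show_options() is a hypothetical
helper, and it assumes the buffer already holds the generic options that
precede the filesystem-specific ones.

```c
#include <stdio.h>
#include <string.h>

/* Append filesystem options to an existing option string, comma first. */
static void sketch_show_options(char *seq, size_t len, const char *fs_opts)
{
	size_t used = strlen(seq);

	if (!*fs_opts || used >= len)
		return;		/* nothing to add, or no room left */
	snprintf(seq + used, len - used, ",%s", fs_opts);
}
```

Without the leading comma the last generic option and the first
filesystem option would run together into one unparseable token.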

Fixes: 9305cf91d05e ("bcachefs: bch2_opts_to_text()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
97535cd84f bcachefs: Fix data corruption on -ENOSPC in buffered write path
Found by generic/299: When we have to truncate a write due to -ENOSPC,
we may have to read in the folio we're writing to if we're now no longer
doing a complete write to a !uptodate folio.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
335d318ef5 bcachefs: bch2_folio_reservation_get_partial() is now better behaved
bch2_folio_reservation_get_partial(), on partial success, will now
return a reservation that's aligned to the filesystem blocksize.

This is a partial fix for fstests generic/299 - fio verify is badly
behaved in the presence of short writes that aren't aligned to its
blocksize.
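
The alignment rule can be sketched with plain arithmetic. The function
below is hypothetical and works in bytes; it assumes block_bytes is
non-zero and that a fully satisfied request is returned unchanged.

```c
#include <stdint.h>

/*
 * Grant as much of the request as is available. On partial success, round
 * the grant down to a whole number of filesystem blocks.
 */
static uint64_t sketch_reserve_partial(uint64_t requested, uint64_t available,
				       uint64_t block_bytes)
{
	uint64_t granted = requested < available ? requested : available;

	if (granted < requested)
		granted -= granted % block_bytes;	/* keep whole blocks only */
	return granted;
}
```

With 4096-byte blocks, a 10000-byte request against 6000 available bytes
is cut to 4096, so the resulting short write ends on a block boundary.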

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
81e0b6c7c1 bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()
bch2_disk_reservation_put() zeroes out the reservation - oops.

This fixes a disk reservation leak when getting a quota reservation
returned an error.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
4007bbb203 bcachefS: ec: fix data type on stripe deletion
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
a0d11feefb bcachefs: Don't use commit_do() unnecessarily
Using commit_do() to call alloc_sectors_start_trans() breaks when we're
randomly injecting transaction restarts - the restart in the commit
causes us to leak the lock that alloc_sectors_start_trans() takes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
6bee2a04c5 bcachefs: handle restarts in bch2_bucket_io_time_reset()
bch2_bucket_io_time_reset() doesn't need to succeed, which is why it
didn't previously retry on transaction restart - but we're now treating
these as errors.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
29fd10a36a bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:48 -04:00
Kent Overstreet
d8b5059774 bcachefs: fix restart handling in bch2_alloc_write_key()
This is ugly:

We may discover in alloc_write_key that the data type we calculated is
wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the
disk accounting counters we calculated need to be updated.

But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe
w.r.t. transaction restarts, so we need to propagate the fixup back to
our gc state in case we take a transaction restart.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
7ee4be9c62 bcachefs: fix restart handling in bch2_do_invalidates_work()
This one is fairly harmless, since the invalidate worker will just run
again later if it needs to, but it is still worth fixing.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
028f3c1d9b bcachefs: fix missing restart handling in bch2_read_retry_nodecode()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
e1c4d2f082 bcachefs: fix restart handling in bch2_fiemap()
We were leaking transaction restart errors to userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
94bdeec8f5 bcachefs: fix bch2_hash_delete() error path
we were exiting an iterator that hadn't been initialized

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Kent Overstreet
74ec2f3024 bcachefs: fix restart handling in bch2_rename2()
This should be impossible to hit in practice; the first lookup within a
transaction won't return a restart due to lock ordering, but we're
adding fault injection for transaction restarts and shaking out bugs.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-18 00:49:47 -04:00
Linus Torvalds
4d939780b7 28 hotfixes. 13 are cc:stable. 23 are MM.
It is the usual shower of unrelated singletons - please see the individual
 changelogs for details.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZxGY5wAKCRDdBJ7gKXxA
 js6RAQC16zQ7WRV091i79cEi1C5648NbZjMCU626hZjuyfbzKgEA2v8PYtjj9w2e
 UGLxMY+PYZki2XNEh75Sikdkiyl9Vgg=
 =xcWT
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "28 hotfixes. 13 are cc:stable. 23 are MM.

  It is the usual shower of unrelated singletons - please see the
  individual changelogs for details"

* tag 'mm-hotfixes-stable-2024-10-17-16-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits)
  maple_tree: add regression test for spanning store bug
  maple_tree: correct tree corruption on spanning store
  mm/mglru: only clear kswapd_failures if reclaimable
  mm/swapfile: skip HugeTLB pages for unuse_vma
  selftests: mm: fix the incorrect usage() info of khugepaged
  MAINTAINERS: add Jann as memory mapping/VMA reviewer
  mm: swap: prevent possible data-race in __try_to_reclaim_swap
  mm: khugepaged: fix the incorrect statistics when collapsing large file folios
  MAINTAINERS: kasan, kcov: add bugzilla links
  mm: don't install PMD mappings when THPs are disabled by the hw/process/vma
  mm: huge_memory: add vma_thp_disabled() and thp_disabled_by_hw()
  Docs/damon/maintainer-profile: update deprecated awslabs GitHub URLs
  Docs/damon/maintainer-profile: add missing '_' suffixes for external web links
  maple_tree: check for MA_STATE_BULK on setting wr_rebalance
  mm: khugepaged: fix the arguments order in khugepaged_collapse_file trace point
  mm/damon/tests/sysfs-kunit.h: fix memory leak in damon_sysfs_test_add_targets()
  mm: remove unused stub for can_swapin_thp()
  mailmap: add an entry for Andy Chiu
  MAINTAINERS: add memory mapping/VMA co-maintainers
  fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization
  ...
2024-10-17 16:33:06 -07:00
Naohiro Aota
bf9821ba47 btrfs: zoned: fix zone unusable accounting for freed reserved extent
When btrfs reserves an extent and does not use it (e.g., due to an error), it
calls btrfs_free_reserved_extent() to free the reserved extent. In the
process, it calls btrfs_add_free_space() and then it accounts the region
bytes as block_group->zone_unusable.

However, it leaves the space_info->bytes_zone_unusable side not updated. As
a result, ENOSPC can happen even though a space_info reservation succeeded.
The reservation succeeds because the freed region is not added to
space_info->bytes_zone_unusable, leaving that space as "free". OTOH, the
corresponding block group counts it as zone_unusable and its allocation
pointer is not rewound, so we cannot allocate an extent from that block group.
That also negates space_info's async/sync reclaim process, and causes an
ENOSPC error from the extent allocation process.

Fix that by returning the space to space_info->bytes_zone_unusable.
Ideally, since a bio is not submitted for this reserved region, we should
return the space to free space and rewind the allocation pointer. But, it
needs rework on extent allocation handling, so let it work in this way for
now.
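
The accounting invariant can be sketched with two toy structures. The
sketch_* types and functions are hypothetical and model a single block
group per space_info; they only show that the per-group and per-space_info
zone_unusable counters have to move together.

```c
#include <stdbool.h>
#include <stdint.h>

struct sketch_space_info {
	uint64_t bytes_reserved;
	uint64_t bytes_zone_unusable;
};

struct sketch_block_group {
	struct sketch_space_info *space_info;
	uint64_t reserved;
	uint64_t zone_unusable;
};

/* Free an unused reservation; on a zoned device the region becomes unusable. */
static void sketch_free_reserved_extent(struct sketch_block_group *bg, uint64_t len)
{
	bg->reserved -= len;
	bg->space_info->bytes_reserved -= len;
	bg->zone_unusable += len;
	/* Without this update the space_info would still see the region as free. */
	bg->space_info->bytes_zone_unusable += len;
}

/* With one block group, both views of the unusable bytes must match. */
static bool sketch_counters_agree(const struct sketch_block_group *bg)
{
	return bg->zone_unusable == bg->space_info->bytes_zone_unusable;
}
```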

Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-10-17 16:16:46 +02:00
David Howells
610a79ffea
afs: Fix lock recursion
afs_wake_up_async_call() can incur lock recursion.  The problem is that it
is called from AF_RXRPC whilst holding the ->notify_lock, but it tries to
take a ref on the afs_call struct in order to pass it to a work queue - but
if the afs_call is already queued, we then have an extraneous ref that must
be put... calling afs_put_call() may call back down into AF_RXRPC through
rxrpc_kernel_shutdown_call(), however, which might try taking the
->notify_lock again.

This case isn't very common, however, so defer it to a workqueue.  The oops
looks something like:

  BUG: spinlock recursion on CPU#0, krxrpcio/7001/1646
   lock: 0xffff888141399b30, .magic: dead4ead, .owner: krxrpcio/7001/1646, .owner_cpu: 0
  CPU: 0 UID: 0 PID: 1646 Comm: krxrpcio/7001 Not tainted 6.12.0-rc2-build3+ #4351
  Hardware name: ASUS All Series/H97-PLUS, BIOS 2306 10/09/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x47/0x70
   do_raw_spin_lock+0x3c/0x90
   rxrpc_kernel_shutdown_call+0x83/0xb0
   afs_put_call+0xd7/0x180
   rxrpc_notify_socket+0xa0/0x190
   rxrpc_input_split_jumbo+0x198/0x1d0
   rxrpc_input_data+0x14b/0x1e0
   ? rxrpc_input_call_packet+0xc2/0x1f0
   rxrpc_input_call_event+0xad/0x6b0
   rxrpc_input_packet_on_conn+0x1e1/0x210
   rxrpc_input_packet+0x3f2/0x4d0
   rxrpc_io_thread+0x243/0x410
   ? __pfx_rxrpc_io_thread+0x10/0x10
   kthread+0xcf/0xe0
   ? __pfx_kthread+0x10/0x10
   ret_from_fork+0x24/0x40
   ? __pfx_kthread+0x10/0x10
   ret_from_fork_asm+0x1a/0x30
   </TASK>

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1394602.1729162732@warthog.procyon.org.uk
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-17 15:33:46 +02:00
Alessandro Zanni
15f3434748
fs: Fix uninitialized value issue in from_kuid and from_kgid
ocfs2_setattr() uses attr->ia_mode, attr->ia_uid and attr->ia_gid in
a trace point even though ATTR_MODE, ATTR_UID and ATTR_GID aren't set.

Initialize all fields of newattrs to avoid uninitialized variables: use
the attr value if ATTR_MODE, ATTR_UID or ATTR_GID is set, otherwise 0.
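
A user-space sketch of the guarded initialization. The sketch_* structs
are simplified, hypothetical stand-ins (plain integers instead of kuid_t
and kgid_t); the point is only that a field is read when its ATTR_* bit is
set in ia_valid and replaced by 0 otherwise.

```c
#define ATTR_MODE	(1U << 0)
#define ATTR_UID	(1U << 1)
#define ATTR_GID	(1U << 2)

struct sketch_iattr {
	unsigned int ia_valid;	/* which of the fields below are meaningful */
	unsigned int ia_mode;
	unsigned int ia_uid;
	unsigned int ia_gid;
};

struct sketch_trace_args {
	unsigned int mode, uid, gid;
};

/* Build trace arguments without ever reading a field that was not set. */
static struct sketch_trace_args sketch_trace_fields(const struct sketch_iattr *attr)
{
	struct sketch_trace_args t;

	t.mode = (attr->ia_valid & ATTR_MODE) ? attr->ia_mode : 0;
	t.uid = (attr->ia_valid & ATTR_UID) ? attr->ia_uid : 0;
	t.gid = (attr->ia_valid & ATTR_GID) ? attr->ia_gid : 0;
	return t;
}
```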

Reported-by: syzbot+6c55f725d1bdc8c52058@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=6c55f725d1bdc8c52058
Signed-off-by: Alessandro Zanni <alessandro.zanni87@gmail.com>
Link: https://lore.kernel.org/r/20241017120553.55331-1-alessandro.zanni87@gmail.com
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-17 15:33:43 +02:00
Christian Brauner
229fd15908
fs: don't try and remove empty rbtree node
When copying a namespace we won't have added the new copy into the
namespace rbtree until after the copy succeeded. Calling free_mnt_ns()
will try to remove the copy from the rbtree which is invalid. Simply
free the namespace skeleton directly.

Link: https://lore.kernel.org/r/20241016-adapter-seilwinde-83c508a7bde1@brauner
Fixes: 1901c92497 ("fs: keep an index of current mount namespaces")
Tested-by: Brad Spengler <spender@grsecurity.net>
Cc: stable@vger.kernel.org # v6.11+
Reported-by: Brad Spengler <spender@grsecurity.net>
Suggested-by: Brad Spengler <spender@grsecurity.net>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-17 15:33:43 +02:00
David Howells
d6a77668a7
netfs: Downgrade i_rwsem for a buffered write
In the I/O locking code borrowed from NFS into netfslib, i_rwsem is held
locked across a buffered write - but this causes a performance regression
in cifs as it excludes buffered reads for the duration (cifs didn't use any
locking for buffered reads).

Mitigate this somewhat by downgrading the i_rwsem to a read lock across the
buffered write.  This at least allows parallel reads to occur whilst
excluding other writes, DIO, truncate and setattr.

Note that this shouldn't be a problem for a buffered write as a read
through an mmap can circumvent i_rwsem anyway.

Also note that we might want to make this change in NFS also.

Signed-off-by: David Howells <dhowells@redhat.com>
Link: https://lore.kernel.org/r/1317958.1729096113@warthog.procyon.org.uk
cc: Steve French <sfrench@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: Trond Myklebust <trondmy@kernel.org>
cc: Jeff Layton <jlayton@kernel.org>
cc: netfs@lists.linux.dev
cc: linux-cifs@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-17 15:33:42 +02:00
Brahmajit Das
5778ace04e fs/proc: fix build with GCC 15 due to -Werror=unterminated-string-initialization
show_smap_vma_flags() has been using a misspelled initializer in
mnemonics[] - it needed to initialize 2 element array of char and it used
NUL-padded 2 character string literals (i.e.  3-element initializer).

This has been spotted by gcc-15[*]; prior to that gcc quietly dropped the
3rd element of initializers.  To fix this we are increasing the size of
mnemonics[] (from mnemonics[BITS_PER_LONG][2] to
mnemonics[BITS_PER_LONG][3]) to accommodate the NUL-padded string literals.

This also helps us in simplifying the logic for printing of the flags as
instead of printing each character from the mnemonics[], we can just print
the mnemonics[] using seq_printf.
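
A small userspace sketch of the fixed layout follows. It is not the code from fs/proc/task_mmu.c: the table contents, `SK_NBITS` and `sk_format_flags()` are assumptions made for illustration, and `snprintf` stands in for `seq_printf`. The point is that a `[3]`-wide row holds both characters plus the terminating NUL, so each row can be printed with `%s` directly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SK_NBITS 8  /* illustrative; the kernel table has BITS_PER_LONG rows */

/*
 * Each row has room for two characters plus the terminating NUL.
 * With rows declared as [2], the literal "rd" would lose its NUL byte.
 */
static const char sk_mnemonics[SK_NBITS][3] = {
    [0] = "rd",
    [1] = "wr",
    [2] = "ex",
};

/* Append "<mnemonic> " for every set bit; return the number of bytes written. */
static size_t sk_format_flags(unsigned long flags, char *out, size_t outsz)
{
    size_t used = 0;

    if (outsz == 0)
        return 0;
    out[0] = '\0';

    for (int i = 0; i < SK_NBITS; i++) {
        if (!(flags & (1UL << i)))
            continue;

        /* Rows without an explicit initializer are all zero bytes. */
        const char *name = sk_mnemonics[i][0] ? sk_mnemonics[i] : "??";
        int n = snprintf(out + used, outsz - used, "%s ", name);

        if (n < 0 || (size_t)n >= outsz - used)
            break;  /* output truncated: stop appending */
        used += (size_t)n;
    }
    return used;
}
```

Because every row is a proper C string, no per-character printing loop is needed.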

[*]: fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
  917 |                 [0 ... (BITS_PER_LONG-1)] = "??",
      |                                                 ^~~~
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
fs/proc/task_mmu.c:917:49: error: initializer-string for array of `char' is too long [-Werror=unterminated-string-initialization]
...


Stephen pointed out:

: The C standard explicitly allows for a string initializer to be too long
: due to the NUL byte at the end ...  so this warning may be overzealous.

but let's make the warning go away anyway.

Link: https://lkml.kernel.org/r/20241005063700.2241027-1-brahmajit.xyz@gmail.com
Link: https://lkml.kernel.org/r/20241003093040.47c08382@canb.auug.org.au
Signed-off-by: Brahmajit Das <brahmajit.xyz@gmail.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17 00:28:07 -07:00
OGAWA Hirofumi
963a7f4d3b fat: fix uninitialized variable
syzbot produced this with a corrupted fs image.  In theory, however, an I/O
error would trigger this as well.

This affects just an error report, so should not be a serious error.

Link: https://lkml.kernel.org/r/87r08wjsnh.fsf@mail.parknet.co.jp
Link: https://lkml.kernel.org/r/66ff2c95.050a0220.49194.03e9.GAE@google.com
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Reported-by: syzbot+ef0d7bc412553291aa86@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17 00:28:06 -07:00
Ryusuke Konishi
08cfa12adf nilfs2: propagate directory read errors from nilfs_find_entry()
Syzbot reported that a task hang occurs in vcs_open() during a fuzzing
test for nilfs2.

The root cause of this problem is that nilfs_find_entry(), which
searches for directory entries, ignores errors when loading a directory
page/folio via nilfs_get_folio() fails.

If the filesystem image is corrupted, and the i_size of the directory
inode is large, and the directory page/folio is successfully read but
fails the sanity check, for example when it is zero-filled,
nilfs_check_folio() may continue to spit out error messages in bursts.

Fix this issue by propagating the error to the callers when loading a
page/folio fails in nilfs_find_entry().

The current interface of nilfs_find_entry() and its callers is outdated
and cannot propagate error codes such as -EIO and -ENOMEM returned via
nilfs_find_entry(), so fix it together.
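
The interface change described above can be illustrated with a userspace sketch. None of these names come from nilfs2: `struct sk_dir`, `struct sk_page` and `sk_find_entry()` are hypothetical, and a per-page `io_error` flag stands in for a failing page/folio load. The lookup returns an error code and reports the match through out-parameters, so callers can tell "not found" apart from "could not read the directory".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SK_PAGES 2
#define SK_SLOTS 2

/* Hypothetical directory page: either unreadable or a few entry names. */
struct sk_page {
    int io_error;                 /* nonzero: loading this page fails */
    const char *names[SK_SLOTS];  /* NULL marks an empty slot */
};

struct sk_dir {
    struct sk_page pages[SK_PAGES];
};

/*
 * Returns 0 and fills *page_out / *slot_out when the name is found,
 * -ENOENT when every page was read and the name is absent,
 * -EIO when a page could not be loaded.
 */
static int sk_find_entry(const struct sk_dir *dir, const char *name,
                         int *page_out, int *slot_out)
{
    for (int p = 0; p < SK_PAGES; p++) {
        const struct sk_page *pg = &dir->pages[p];

        if (pg->io_error)
            return -EIO;  /* propagate instead of silently skipping the page */

        for (int s = 0; s < SK_SLOTS; s++) {
            if (pg->names[s] && strcmp(pg->names[s], name) == 0) {
                *page_out = p;
                *slot_out = s;
                return 0;
            }
        }
    }
    return -ENOENT;
}
```

A lookup that simply returned NULL on any failure would collapse the `-EIO` and `-ENOENT` cases into one, which is the ambiguity the commit message describes.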

Link: https://lkml.kernel.org/r/20241004033640.6841-1-konishi.ryusuke@gmail.com
Fixes: 2ba466d74e ("nilfs2: directory entry operations")
Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Reported-by: Lizhi Xu <lizhi.xu@windriver.com>
Closes: https://lkml.kernel.org/r/20240927013806.3577931-1-lizhi.xu@windriver.com
Reported-by: syzbot+8a192e8d090fa9a31135@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8a192e8d090fa9a31135
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-17 00:28:06 -07:00
Linus Torvalds
667b1d41b2 for-6.12-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmcPxtAACgkQxWXV+ddt
 WDu9lA//WfB88fwEKnqBYDRo6aiSMIAzLDuXkJ9i8d7rcjZO1OIZkEnMOsxhvTcZ
 KxgjNjkgzoTyUwoAUlG+ZpvMeSNMhBdr2NFkXmYzN9oanFE4zplpZiWx6tGSApRU
 0ilngjXBsr8p03HmB88Yb05DVYQ2elMP6Jx3VETDBa0CNyp4//tGKzusNhZdA7KM
 XLZmkKRk3ZKabNo+p2J5t8UGJCl2L18U0o/EphfSkODKadUnsBbAPZUt2EGQCZwv
 uZhDFAUkgTFBkeRO7JwTfDrNi51M4zwmh+kEduzg4Ny4TdFb1UapU7K1N330WMru
 4Qa953Met9I4NB/kKI+fZP1lN4NGuD2qEU6yoZVSy4UiqRp1gEg8kOUfVGFbNJa1
 VFYcwdrBad0I4PjnQc5bpZVjzqJT5wWiZxjlWrB7VyIfdmnvQxe5h4DBwBhN5FJr
 +MEtuY2QNFygjDAZ5z0Ss8hegqI+FYi562Cjy9QRLhb3qGD8STF2BIChaILIn3oA
 UVJUlUP6CUmCu1RZRMFB4/WkeHO46FmZxJErGfFeXqJInThf0/rdSZOQgIP0JsUq
 N8FINEgXFAMCkK1PT7MNvAYkSP0tR7B0JjGKcSlGS3v3F0URCNGvHSiqbLedAtXT
 lc1MdXTZxub8h6xhIvgY1j7HRAFGrunn7LD6MIKRWX1SZPWwAGI=
 =DEUA
 -----END PGP SIGNATURE-----

Merge tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - regression fix: dirty extents tracked in xarray for qgroups must be
   adjusted for 32bit platforms

 - fix potentially freeing uninitialized name in fscrypt structure

 - fix warning about unneeded variable in a send callback
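
The 32-bit issue in the first item can be shown with a little arithmetic. This sketch is not btrfs code: the helper names are hypothetical, and a 4096-byte sector (12 bits) is assumed for the example. An index of type unsigned long is 32 bits wide on 32-bit platforms, so a 64-bit byte offset used directly as a key loses its high bits, while a sector number stays distinct for much larger offsets.

```c
#include <assert.h>
#include <stdint.h>

/* Key as seen on a 32-bit platform when the byte offset is used directly. */
static uint32_t sk_key_from_bytenr_32bit(uint64_t bytenr)
{
    return (uint32_t)bytenr;  /* bits 32 and above are lost */
}

/* Key when the offset is first converted to a sector number. */
static uint32_t sk_key_from_sector_32bit(uint64_t bytenr,
                                         unsigned int sectorsize_bits)
{
    return (uint32_t)(bytenr >> sectorsize_bits);
}
```

For example, byte offsets of 1 GiB and 5 GiB truncate to the same 32-bit key, but with 4096-byte sectors they map to the distinct keys 262144 and 1310720.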

* tag 'for-6.12-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix uninitialized pointer free on read_alloc_one_name() error
  btrfs: send: cleanup unneeded return variable in changed_verity()
  btrfs: fix uninitialized pointer free in add_inode_ref()
  btrfs: use sector numbers as keys for the dirty extents xarray
2024-10-16 09:30:20 -07:00
Linus Torvalds
9f635d44d7 two ksmbd server fixes
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcPUXEACgkQiiy9cAdy
 T1Gm7QwAlPW5//Cb4B0gpjzRcUws51IZ4yFhp4IQWmsd0RqdjZ4TxSCOPF3u3HR3
 0OPxyLdbUn6h5g0S2ayzqomHx2VBOQTjgyuMtaTWzokToMNu8kqvxK1MTslkBior
 9YEHUz9+5f0OJ+JBGNUzjfy4Plygr5y09udaLfqIknuY8+SeuooxNNUNfkIvrP7C
 JsSAWJznN9VMpKJmszYc4ntyTiz1XVXyyjJmjhRQ27ah8LUghqZ0mamgigTS5UFa
 U7eYBDfs6+9i5Lvkd4bJPdGyov9g/EPViLURZMfNaz3+p0TfosN8s2UZuhHC+zuv
 BDQ+wHGRqzmteZspLanrGBt9y9svHXp1CD7MwqWeGR3GhKsfsxCMJpE931fBhsxM
 vlJdd/xCs128fv48AvNyHA9abN0U1FpskOJhOzjDgvhKqDoIQ4TCC7QFDEttsPRv
 ZiQmyOCPyZZY28EmfoltU4CFcMIwKQ81nPUSOJFgKmHBbSpc+Qtnv5QgRHZCzj7n
 StJfaIMv
 =WhJj
 -----END PGP SIGNATURE-----

Merge tag 'v6.12-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd

Pull smb server fixes from Steve French:

 - fix race between session setup and session logoff

 - add supplementary group support

* tag 'v6.12-rc3-ksmbd-fixes' of git://git.samba.org/ksmbd:
  ksmbd: add support for supplementary groups
  ksmbd: fix user-after-free from session log off
2024-10-16 09:15:43 -07:00
Ryusuke Konishi
6ed469df0b
nilfs2: fix kernel bug due to missing clearing of buffer delay flag
Syzbot reported that after nilfs2 reads a corrupted file system image
and degrades to read-only, the BUG_ON check for the buffer delay flag
in submit_bh_wbc() may fail, causing a kernel bug.

This is because the buffer delay flag is not cleared when clearing the
buffer state flags to discard a page/folio or a buffer head. So, fix
this.
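
As a rough illustration of the state inconsistency, consider the following userspace sketch. The flag names, the mask and both helpers are hypothetical and do not mirror the nilfs2 or buffer-head definitions. A discard path clears a fixed set of state bits, and a later submit path refuses any buffer that still has the delay bit set.

```c
#include <assert.h>

/* Hypothetical buffer state bits. */
#define SK_BH_UPTODATE (1u << 0)
#define SK_BH_DIRTY    (1u << 1)
#define SK_BH_MAPPED   (1u << 2)
#define SK_BH_DELAY    (1u << 3)

/*
 * Bits cleared when a buffer is discarded.  Leaving SK_BH_DELAY out of
 * this mask would let a discarded buffer keep a stale delay bit.
 */
#define SK_BH_CLEAR_MASK \
    (SK_BH_UPTODATE | SK_BH_DIRTY | SK_BH_MAPPED | SK_BH_DELAY)

static unsigned int sk_discard_buffer_state(unsigned int state)
{
    return state & ~SK_BH_CLEAR_MASK;
}

/* Stand-in for the sanity check performed on the submit path. */
static int sk_submit_allowed(unsigned int state)
{
    return !(state & SK_BH_DELAY);
}
```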

This became necessary when the use of nilfs2's own page clear routine
was expanded.  This state inconsistency does not occur if the buffer
is written normally by log writing.

Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Link: https://lore.kernel.org/r/20241015213300.7114-1-konishi.ryusuke@gmail.com
Fixes: 8c26c4e269 ("nilfs2: fix issue with flush kernel thread after remount in RO mode because of driver's internal error or metadata corruption")
Reported-by: syzbot+985ada84bf055a575c07@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=985ada84bf055a575c07
Cc: stable@vger.kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-16 15:05:32 +02:00
Amir Goldstein
20121d3f58 fuse: update inode size after extending passthrough write
yangyun reported that libfuse test test_copy_file_range() copies zero
bytes from a newly written file when fuse passthrough is enabled.

The reason is that extending passthrough write is not updating the fuse
inode size and when vfs_copy_file_range() observes a zero size inode,
it returns without calling the filesystem copy_file_range() method.

Fix this by adjusting the fuse inode size after an extending passthrough
write.

This does not provide cache coherency of fuse inode attributes and
backing inode attributes, but it should prevent situations where fuse
inode size is too small, causing read/copy to be wrongly shortened.
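
The size adjustment can be sketched as an end-of-write hook that receives the write offset and the result. This is a simplified userspace model, not the fuse code: `struct sk_inode` and `sk_end_write()` are hypothetical, and locking is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory inode: only the cached size matters here. */
struct sk_inode {
    int64_t size;
};

/*
 * Called after a passthrough write.
 * pos: file offset where the write started
 * ret: number of bytes written, or a negative error code
 */
static void sk_end_write(struct sk_inode *inode, int64_t pos, int64_t ret)
{
    if (ret <= 0)
        return;  /* failed or empty write: size is unchanged */

    if (pos + ret > inode->size)
        inode->size = pos + ret;  /* extending write: grow the cached size */
}
```

The hook only ever grows the cached size, which matches the stated goal of preventing a too-small size rather than providing full attribute coherency.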

Reported-by: yangyun <yangyun50@huawei.com>
Closes: https://github.com/libfuse/libfuse/issues/1048
Fixes: 57e1176e60 ("fuse: implement read/write passthrough")
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-10-16 13:18:21 +02:00
Amir Goldstein
f03b296e8b fs: pass offset and result to backing_file end_write() callback
This is needed for extending fuse inode size after fuse passthrough write.

Suggested-by: Miklos Szeredi <miklos@szeredi.hu>
Link: https://lore.kernel.org/linux-fsdevel/CAJfpegs=cvZ_NYy6Q_D42XhYS=Sjj5poM1b5TzXzOVvX=R36aA@mail.gmail.com/
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2024-10-16 13:17:45 +02:00
Dr. David Alan Gilbert
6aca91c416 cifs: Remove unused functions
cifs_ses_find_chan() has been unused since commit
f486ef8e20 ("cifs: use the chans_need_reconnect bitmap for reconnect status")

cifs_read_page_from_socket() has been unused since commit
d08089f649 ("cifs: Change the I/O paths to use an iterator rather than a page list")

cifs_chan_in_reconnect() has been unused since commit
bc962159e8 ("cifs: avoid race conditions with parallel reconnects")

Remove them.

Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-16 00:30:52 -05:00
Advait Dhamorikar
3dfea293f4 smb/client: Fix logically dead code
The if condition in collect_sample: can never be satisfied
because of a logical contradiction. The indicated dead code
may have performed some action; that action will never occur.

Fixes: 94ae8c3fee ("smb: client: compress: LZ77 code improvements cleanup")
Signed-off-by: Advait Dhamorikar <advaitdhamorikar@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-16 00:30:52 -05:00
Paulo Alcantara
1ab60323c5 smb: client: fix OOBs when building SMB2_IOCTL request
When using encryption, either enforced by the server or when using
'seal' mount option, the client will squash all compound request buffers
down for encryption into a single iov in smb2_set_next_command().

SMB2_ioctl_init() allocates a small buffer (448 bytes) to hold the
SMB2_IOCTL request in the first iov, and if the user passes an input
buffer that is greater than 328 bytes, smb2_set_next_command() will
end up writing off the end of @rqst->iov[0].iov_base as shown below:

  mount.cifs //srv/share /mnt -o ...,seal
  ln -s $(perl -e "print('a')for 1..1024") /mnt/link

  BUG: KASAN: slab-out-of-bounds in
  smb2_set_next_command.cold+0x1d6/0x24c [cifs]
  Write of size 4116 at addr ffff8881148fcab8 by task ln/859

  CPU: 1 UID: 0 PID: 859 Comm: ln Not tainted 6.12.0-rc3 #1
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
  1.16.3-2.fc40 04/01/2014
  Call Trace:
   <TASK>
   dump_stack_lvl+0x5d/0x80
   ? smb2_set_next_command.cold+0x1d6/0x24c [cifs]
   print_report+0x156/0x4d9
   ? smb2_set_next_command.cold+0x1d6/0x24c [cifs]
   ? __virt_addr_valid+0x145/0x310
   ? __phys_addr+0x46/0x90
   ? smb2_set_next_command.cold+0x1d6/0x24c [cifs]
   kasan_report+0xda/0x110
   ? smb2_set_next_command.cold+0x1d6/0x24c [cifs]
   kasan_check_range+0x10f/0x1f0
   __asan_memcpy+0x3c/0x60
   smb2_set_next_command.cold+0x1d6/0x24c [cifs]
   smb2_compound_op+0x238c/0x3840 [cifs]
   ? kasan_save_track+0x14/0x30
   ? kasan_save_free_info+0x3b/0x70
   ? vfs_symlink+0x1a1/0x2c0
   ? do_symlinkat+0x108/0x1c0
   ? __pfx_smb2_compound_op+0x10/0x10 [cifs]
   ? kmem_cache_free+0x118/0x3e0
   ? cifs_get_writable_path+0xeb/0x1a0 [cifs]
   smb2_get_reparse_inode+0x423/0x540 [cifs]
   ? __pfx_smb2_get_reparse_inode+0x10/0x10 [cifs]
   ? rcu_is_watching+0x20/0x50
   ? __kmalloc_noprof+0x37c/0x480
   ? smb2_create_reparse_symlink+0x257/0x490 [cifs]
   ? smb2_create_reparse_symlink+0x38f/0x490 [cifs]
   smb2_create_reparse_symlink+0x38f/0x490 [cifs]
   ? __pfx_smb2_create_reparse_symlink+0x10/0x10 [cifs]
   ? find_held_lock+0x8a/0xa0
   ? hlock_class+0x32/0xb0
   ? __build_path_from_dentry_optional_prefix+0x19d/0x2e0 [cifs]
   cifs_symlink+0x24f/0x960 [cifs]
   ? __pfx_make_vfsuid+0x10/0x10
   ? __pfx_cifs_symlink+0x10/0x10 [cifs]
   ? make_vfsgid+0x6b/0xc0
   ? generic_permission+0x96/0x2d0
   vfs_symlink+0x1a1/0x2c0
   do_symlinkat+0x108/0x1c0
   ? __pfx_do_symlinkat+0x10/0x10
   ? strncpy_from_user+0xaa/0x160
   __x64_sys_symlinkat+0xb9/0xf0
   do_syscall_64+0xbb/0x1d0
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
  RIP: 0033:0x7f08d75c13bb
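
The figures quoted above (a 448-byte buffer that overflows once the input exceeds 328 bytes) imply that 120 bytes of the buffer are already occupied before the input is copied in. The sketch below is a generic bounds-checked append in userspace C. It is not the fix applied by this commit, and `sk_append_checked()` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SK_SMALL_BUF_SIZE 448u  /* buffer size quoted in the message */
#define SK_FIXED_PART     120u  /* 448 - 328, inferred from the message */

/*
 * Copy len bytes to buf + *used only if they fit within cap.
 * Returns 0 on success, -1 if the copy would run past the buffer.
 * Assumes *used <= cap on entry.
 */
static int sk_append_checked(unsigned char *buf, size_t cap, size_t *used,
                             const void *src, size_t len)
{
    if (len > cap - *used)
        return -1;

    memcpy(buf + *used, src, len);
    *used += len;
    return 0;
}
```

With 120 bytes in use, a 328-byte input still fits exactly, while the 1024-byte symlink target from the reproducer is rejected instead of being written past the end.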

Reported-by: David Howells <dhowells@redhat.com>
Fixes: e77fe73c7e ("cifs: we can not use small padding iovs together with encryption")
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-16 00:30:52 -05:00
Su Hui
19ebc1e6ca smb: client: fix possible double free in smb2_set_ea()
Clang static checker (scan-build) warning:
fs/smb/client/smb2ops.c:1304:2: Attempt to free released memory.
 1304 |         kfree(ea);
      |         ^~~~~~~~~

There is a double free in such case:
'ea is initialized to NULL' -> 'first successful memory allocation for
ea' -> 'something failed, goto sea_exit' -> 'first memory release for ea'
-> 'goto replay_again' -> 'second goto sea_exit before allocate memory
for ea' -> 'second memory release for ea resulted in double free'.

Re-initialize 'ea' to NULL near the replay_again label to fix this
double free problem.
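
The control flow described above can be reproduced in a few lines of userspace C. This is a sketch of the retry pattern only, not smb2_set_ea(): `sk_set_ea()` and its `fail_early_on_retry` parameter are hypothetical, and `malloc`/`free` stand in for the kernel allocator. The first pass always ends with a retryable error to force a replay.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * fail_early_on_retry: when nonzero, the second pass fails before it
 * allocates anything, which is the path that used to free a stale pointer.
 */
static int sk_set_ea(int fail_early_on_retry)
{
    char *ea;
    int attempt = 0;
    int rc;

replay_again:
    ea = NULL;  /* reset on every pass so the exit path never sees a stale pointer */
    rc = 0;
    attempt++;

    if (attempt == 2 && fail_early_on_retry) {
        rc = -EINVAL;
        goto sea_exit;  /* nothing allocated on this pass */
    }

    ea = malloc(16);
    if (!ea) {
        rc = -ENOMEM;
        goto sea_exit;
    }

    if (attempt == 1)
        rc = -EAGAIN;  /* pretend the first attempt must be replayed */

sea_exit:
    free(ea);  /* free(NULL) is a no-op */
    if (rc == -EAGAIN)
        goto replay_again;
    return rc;
}
```

If the `ea = NULL` assignment sat above the label instead of below it, the second pass in the early-failure case would call `free()` on the pointer already released by the first pass.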

Fixes: 4f1fffa237 ("cifs: commands that are retried should have replay flag set")
Reviewed-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Su Hui <suhui@nfschina.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2024-10-16 00:25:54 -05:00
Linus Torvalds
bdc7276512 bcachefs fixes for 6.12-rc4
- New metadata version inode_has_child_snapshots
   This fixes bugs with handling of unlinked inodes + snapshots, in
   particular when an inode is reattached after taking a snapshot;
   deleted inodes now get correctly cleaned up across snapshots.
 
 - Disk accounting rewrite fixes
   - validation fixes for when a device has been removed
   - fix journal replay failing with "journal_reclaim_would_deadlock"
 
 - Some more small fixes for erasure coding + device removal
 
 - Assorted small syzbot fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmcNw4UACgkQE6szbY3K
 bnbSzBAAmSCCQCqRwnFSp4OdNSlBK9q1e5WsbKOqHgtoXZU/mOUBe/5bnPPqm6Mg
 GkTc7FqVOs/95/rEDKXw2LneFgxRrt8MriJCUdXZvV5fC2R4Kdl0TkwABtMtm2Ae
 wp37n6iQO81j4uZHfOj67RzC2NRo7dMdun5HnQPRBTKzyuDaZXqwjMmF2LmaeODh
 oiBFUvD5nFBo5XvXPABBin6xpdquHO+6ZWf6SFD4+iRe11NrJAOAIS/crJvxsFfr
 I/X152Z+gzKPE+NhANKMxlHyNnVGo7iHUqhUjVuI4SSaXb9Ap6k4sXgfoIzncR17
 GA5qWtaNS1W72+awT3R2EaF9Tqi+Vng2RVfxxQ04giImnBq0eziOjlZ26enOE0LU
 0ZZrBFzqpItqYbNnzPissHuKb1mAQGPWy6kxoGIrqDKbichA7lzyWDz2lgEE85Sx
 E1mvHwYbKhUuLC4c4460hueGVUgMWmjqM3E8oex+oNDpauPB+/bnYkcgZEG2RBla
 +ZlDL28fg4fxtqlUrOQeonQ1RecGNdRMJz7xiGnkYU9rQpUuv8QwFiBZGAbLP6zn
 6fbFZGxS/pO95sY7GmAtKz7ZgKxJQCzII4s+Oht5AgOvoBlPjAiol1UbwYadYQxz
 HKF+WBaPC9z/L6JjP+gx+uUzTWRIfBmhHylhWbKr4vLGfx3Jc1g=
 =Rkq2
 -----END PGP SIGNATURE-----

Merge tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:

 - New metadata version inode_has_child_snapshots

   This fixes bugs with handling of unlinked inodes + snapshots, in
   particular when an inode is reattached after taking a snapshot;
   deleted inodes now get correctly cleaned up across snapshots.

 - Disk accounting rewrite fixes
     - validation fixes for when a device has been removed
     - fix journal replay failing with "journal_reclaim_would_deadlock"

 - Some more small fixes for erasure coding + device removal

 - Assorted small syzbot fixes

* tag 'bcachefs-2024-10-14' of git://evilpiepirate.org/bcachefs: (27 commits)
  bcachefs: Fix sysfs warning in fstests generic/730,731
  bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev
  bcachefs: Fix kasan splat in new_stripe_alloc_buckets()
  bcachefs: Add missing validation for bch_stripe.csum_granularity_bits
  bcachefs: Fix missing bounds checks in bch2_alloc_read()
  bcachefs: fix uaf in bch2_dio_write_done()
  bcachefs: Improve check_snapshot_exists()
  bcachefs: Fix bkey_nocow_lock()
  bcachefs: Fix accounting replay flags
  bcachefs: Fix invalid shift in member_to_text()
  bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID
  bcachefs: __wait_for_freeing_inode: Switch to wait_bit_queue_entry
  bcachefs: Check if stuck in journal_res_get()
  closures: Add closure_wait_event_timeout()
  bcachefs: Fix state lock involved deadlock
  bcachefs: Fix NULL pointer dereference in bch2_opt_to_text
  bcachefs: Release transaction before wake up
  bcachefs: add check for btree id against max in try read node
  bcachefs: Disk accounting device validation fixes
  bcachefs: bch2_inode_or_descendents_is_open()
  ...
2024-10-15 11:06:45 -07:00
Christoph Hellwig
f6f91d290c xfs: punch delalloc extents from the COW fork for COW writes
When ->iomap_end is called on a short write to the COW fork it needs to
punch stale delalloc data from the COW fork and not the data fork.

Ensure that IOMAP_F_NEW is set for new COW fork allocations in
xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag
in xfs_buffered_write_delalloc_punch to decide which fork to punch.
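
The fork selection can be summarized in a short sketch. The flag values and `sk_fork_to_punch()` are illustrative and not taken from the XFS or iomap headers. The sketch also assumes that only mappings flagged as new are candidates for punching, which is an interpretation of the description above rather than something it states outright.

```c
#include <assert.h>

/* Illustrative flag bits, standing in for IOMAP_F_NEW and IOMAP_F_SHARED. */
#define SK_IOMAP_F_NEW    (1u << 0)
#define SK_IOMAP_F_SHARED (1u << 2)

enum sk_fork {
    SK_NO_PUNCH,   /* mapping was not newly allocated: nothing to clean up */
    SK_DATA_FORK,
    SK_COW_FORK,
};

/* Decide which fork holds the stale delalloc blocks after a short write. */
static enum sk_fork sk_fork_to_punch(unsigned int iomap_flags)
{
    if (!(iomap_flags & SK_IOMAP_F_NEW))
        return SK_NO_PUNCH;

    return (iomap_flags & SK_IOMAP_F_SHARED) ? SK_COW_FORK : SK_DATA_FORK;
}
```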

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
7d6fe5c586 xfs: set IOMAP_F_SHARED for all COW fork allocations
Change xfs_buffered_write_iomap_begin to always set IOMAP_F_SHARED for
COW fork allocations even if they don't overlap existing data fork
extents, which will allow the iomap_end callback to detect if it has to
punch stale delalloc blocks from the COW fork instead of the data fork.
It also means we sample the sequence counter for both the data and the
COW fork when writing to the COW fork, which ensures we properly
revalidate when only COW fork changes happen.

This is essentially a revert of commit 72a048c105 ("xfs: only set
IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because
the problem that the commit fixed has now been dealt with in iomap by
only looking at the actual srcmap and not the fallback to the write
iomap.

Note that the direct I/O path was never changed and has always set
IOMAP_F_SHARED for all COW fork allocations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
c29440ff66 xfs: share more code in xfs_buffered_write_iomap_begin
Introduce a local iomap_flags variable so that the code allocating new
delalloc blocks in the data fork can fall through to the found_imap
label and reuse the code to unlock and fill the iomap.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
8fe3b21efa xfs: support the COW fork in xfs_bmap_punch_delalloc_range
xfs_buffered_write_iomap_begin can also create delalloc reservations
that need cleaning up.  Prepare for that by adding support for the COW
fork in xfs_bmap_punch_delalloc_range.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
abd7d651ad xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lock
All XFS callers of iomap_zero_range and iomap_file_unshare already hold
invalidate_lock, so we can't take it again in
iomap_file_buffered_write_punch_delalloc.

Use the passed in flags argument to detect if we're called from a zero
or unshare operation and don't take the lock again in this case.
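
A toy model of the conditional locking follows. All names here are hypothetical and the lock is a plain flag that asserts on recursive acquisition. It only mimics the rule that a non-recursive lock must not be taken twice by the same caller.

```c
#include <assert.h>

/* Illustrative operation flags, standing in for IOMAP_WRITE/ZERO/UNSHARE. */
#define SK_IOMAP_WRITE   (1u << 0)
#define SK_IOMAP_ZERO    (1u << 1)
#define SK_IOMAP_UNSHARE (1u << 2)

/* Toy non-recursive lock: taking it twice trips an assertion. */
struct sk_lock {
    int held;
};

static void sk_lock_take(struct sk_lock *l)
{
    assert(!l->held);  /* a second acquisition would deadlock a real lock */
    l->held = 1;
}

static void sk_lock_release(struct sk_lock *l)
{
    assert(l->held);
    l->held = 0;
}

/* Punch stale delalloc blocks, locking only if the caller has not already. */
static void sk_punch_delalloc(struct sk_lock *invalidate_lock,
                              unsigned int flags, int *punched)
{
    int caller_holds_lock = (flags & (SK_IOMAP_ZERO | SK_IOMAP_UNSHARE)) != 0;

    if (!caller_holds_lock)
        sk_lock_take(invalidate_lock);

    assert(invalidate_lock->held);  /* the work always runs under the lock */
    (*punched)++;

    if (!caller_holds_lock)
        sk_lock_release(invalidate_lock);
}
```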

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
acfbac7764 xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eof
xfs_file_write_zero_eof is the only caller of xfs_zero_range that does
not take XFS_MMAPLOCK_EXCL (aka the invalidate lock).  Currently that
is actually the right thing, as an error in the iomap zeroing code will
also take the invalidate_lock to clean up, but to fix that deadlock we
need a consistent locking pattern first.

The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read
pagefaults, which isn't really needed here, but also not actively
harmful.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
3c399374af xfs: factor out a xfs_file_write_zero_eof helper
Split a helper from xfs_file_write_checks that just deals with the
post-EOF zeroing to keep the code readable.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
b784951662 iomap: move locking out of iomap_write_delalloc_release
XFS (which currently is the only user of iomap_write_delalloc_release)
already holds invalidate_lock for most zeroing operations.  To be able
to avoid a deadlock it needs to stop taking the lock, but doing so
in iomap would leak XFS locking details into iomap.

To avoid this, require the caller to hold invalidate_lock when calling
iomap_write_delalloc_release instead of taking it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
caf0ea451d iomap: remove iomap_file_buffered_write_punch_delalloc
Currently iomap_file_buffered_write_punch_delalloc can be called from
XFS either with the invalidate lock held or not.  To fix this while
keeping the locking in the file system and not the iomap library
code we'll need to lift the locking up into the file system.

To prepare for that, open code iomap_file_buffered_write_punch_delalloc
in the only caller, and instead export iomap_write_delalloc_release.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:42 +02:00
Christoph Hellwig
c0adf8c3a9 iomap: factor out a iomap_last_written_block helper
Split out a piece of logic from iomap_file_buffered_write_punch_delalloc
that is useful for all iomap_end implementations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Carlos Maiolino <cem@kernel.org>
2024-10-15 11:37:41 +02:00
Linus Torvalds
eca631b8fe f2fs fix for 6.12-rc4
This includes an urgent fix to resolve DIO read performance regression caused by
 0cac51185e ("f2fs: fix to avoid racing in between read and OPU dio write").
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE00UqedjCtOrGVvQiQBSofoJIUNIFAmcNSyoACgkQQBSofoJI
 UNJBKg/6AgfkxWeY4Vb5h43nnmwnv/eJZZRIhsFMSz+vWEI1wfKeCkUys0Q1GGZ3
 RPZl+ZT4vJ2FDjLRwZFfCi31nT/qIAlOnm0GtlDczsccBh91RrzFFKDHcwvfOd0/
 NLpdqsDt2tuurf6zjbhtx5Paoyr7KvMg4+sSfRqV+7nvXmLImMH7ahRGiB5Eh4HP
 gDNpQ7tk+D2+ZHBU40PUSYXooikFYznGuHk5JjpKnVCAsK8F0u9nA35ZeSlkkUCM
 8MGS+zHEpEqD/wZWlrwUWhmXmHLuNUbJh6X3pPNYxe0s/+ymHo99zKRw5HHKKybK
 FFZYbSWXrTNS+2SS2NUhUxp3CpPV0N6IGM+i7UkYo4DMu/MG7skVrOjkLx7NQvSf
 8/8B8g6LQt32JzlNvCrZcjStEgdxbXzFaJZH71952C7dp8mnhc4LkgkYkADnjsa3
 /+L+nazVgX6YaXZ9Ny2TY3gMF/gyJHp7LzZGeOdeqKNWYTclnrkEzGJ3Eg9aE2vz
 yymcz2P7nWYFklIfnRUPAYnvUwleBysYkHsw5z3wrqX6TjW5IW8fcozN9dIIdOTC
 2AVBGhi931xnJSj4AU9in+p+s9qfPP6bHN/C/5PuL04UQ7+y+pjUhmwJxUzugVoq
 T4xpp1BFj/3SQ8SUQX4nVTdHqR5uKfHvFKyAe6/0345w6bQCQtI=
 =7KMu
 -----END PGP SIGNATURE-----

Merge tag 'f2fs-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs

Pull f2fs fix from Jaegeuk Kim:
 "An urgent fix to resolve DIO read performance regression caused by
  'f2fs: fix to avoid racing in between read and OPU dio write'"

* tag 'f2fs-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs:
  f2fs: allow parallel DIO reads
2024-10-14 11:19:19 -07:00
Linus Torvalds
63fa605041 Changes since last update:
- Make sure only regular inodes can be used for file-backed mounts;
 
  - Two minor codebase cleanups.
 -----BEGIN PGP SIGNATURE-----
 
 iQJFBAABCgAvFiEEQ0A6bDUS9Y+83NPFUXZn5Zlu5qoFAmcNHqARHHhpYW5nQGtl
 cm5lbC5vcmcACgkQUXZn5Zlu5qo1XA/+MFbobJ4bWxJQKnouLlCiFQ5C1xEFbVn2
 HasHLfrMIcdz/n/S3Ib4Ayi+9W0zM2Ekq9EG+fuOBqjZP17+EOj3e7OPtVVPNwx0
 u2GbD9zNCliZg9PigCfPO+6oImt6l/Mytmx+7bELqbMywAy7JNCNesJuyycsTcja
 o1I3dNNUZdppilohXPIENTRLjBlOuGBaZdUXDih0LqB+Pb0jgXTP6JfD88h1MLFw
 xBbhqQ1A/GgyESfsMpZFn2xvFIocLBCIAdAehi9M1AiEwCTjGkTZ66WW3H6Es/Zp
 vcC9KjHJoGGCXxZf8mnoQHQo/WqQuNUPc2BVf9iExzCo0nwRArcTbAu5Bskqg0LF
 c+a7FrrxhODz8ioxOOiMUqG4b3/qGkzlk6w5a/t7IRrmFtmcXmPWZ14aI8qpy7o/
 CW3iPUoF/zEsmmFvOgJtHwy3g+bC8KhDvz3fqFIDSSMjSKjqb4cPYSe/L5MyhwED
 wmLgp1uYjEyR0uuqqUp93FEYIbHuO5HpPRT5crLczRIoYn7bXRhjNWLCTmzlLqrj
 yDAQsrngK99BQ7g0FTQ/OV9si/HRRGsusZmCkeCb6KnRNIvml4X9/WXKc1ioOFk/
 3MSaxlQlTXzCCctjVCDNn9GfD/yR1cXu2sUpGSEnP1ssLG4ARyXGVfoeSw7gJ4xn
 C5lm9SOmkzU=
 =0gFG
 -----END PGP SIGNATURE-----

Merge tag 'erofs-for-6.12-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs

Pull erofs fixes from Gao Xiang:
 "The main one fixes a syzbot issue due to the invalid inode type out of
  file-backed mounts. The others are minor cleanups without actual logic
  changes.

  Summary:

   - Make sure only regular inodes can be used for file-backed mounts

   - Two minor codebase cleanups"

* tag 'erofs-for-6.12-rc4-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
  erofs: get rid of kaddr in `struct z_erofs_maprecorder`
  erofs: get rid of z_erofs_try_to_claim_pcluster()
  erofs: ensure regular inodes for file-backed mounts
2024-10-14 11:12:09 -07:00
Kent Overstreet
5e3b72324d bcachefs: Fix sysfs warning in fstests generic/730,731
sysfs warns if we're removing a symlink from a directory that's no
longer in sysfs; this is triggered by fstests generic/730, which
simulates hot removal of a block device.

This patch is however not a correct fix, since checking
kobj->state_in_sysfs on a kobj owned by another subsystem is racy.

A better fix would be to add the appropriate check to
sysfs_remove_link() - and sysfs_create_link() as well.

But kobject_add_internal()/kobject_del() do not as of today have locking
that would support that.

Note that the block/holder.c code appears to be subject to this race as
well.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc:  Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-14 05:43:01 -04:00
Kent Overstreet
cb6055e66f bcachefs: Handle race between stripe reuse, invalidate_stripe_to_dev
When creating a new stripe, we may reuse an existing stripe that has
some empty and some nonempty blocks.

Generally, the existing stripe won't change underneath us - except for
block sector counts, which we copy to the new key in
ec_stripe_key_update.

But the device removal path can now invalidate stripe pointers to a
device, and that can race with stripe reuse.

Change ec_stripe_key_update() to check for and resolve this
inconsistency.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-13 22:03:03 -04:00
Kent Overstreet
b1e562265e bcachefs: Fix kasan splat in new_stripe_alloc_buckets()
Update for BCH_SB_MEMBER_INVALID.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-13 22:03:01 -04:00
Kent Overstreet
9f25dbe0bf bcachefs: Add missing validation for bch_stripe.csum_granularity_bits
Reported-by: syzbot+f8c98a50c323635be65d@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-13 17:55:33 -04:00
Kent Overstreet
a319aeaebb bcachefs: Fix missing bounds checks in bch2_alloc_read()
We were checking that the alloc key was for a valid device, but not a
valid bucket.

This is the upgrade path from versions prior to bcachefs being mainlined.
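
The missing check can be pictured as a two-step validation: first the device index, then the bucket range on that device. The structures and `sk_bucket_valid()` below are hypothetical and far simpler than the bcachefs ones. In particular, the assumption that valid buckets lie in the half-open range from `first_bucket` to `nbuckets` is made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_MAX_DEVS 4

/* Hypothetical per-device geometry. */
struct sk_dev {
    uint64_t first_bucket;  /* first usable bucket */
    uint64_t nbuckets;      /* one past the last bucket */
};

struct sk_fs {
    const struct sk_dev *devs[SK_MAX_DEVS];  /* NULL: no such device */
};

/* A key is acceptable only if both its device and its bucket are in range. */
static bool sk_bucket_valid(const struct sk_fs *fs, unsigned int dev,
                            uint64_t bucket)
{
    if (dev >= SK_MAX_DEVS || fs->devs[dev] == NULL)
        return false;  /* the device check that already existed */

    const struct sk_dev *d = fs->devs[dev];

    return bucket >= d->first_bucket && bucket < d->nbuckets;
}
```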

Reported-by: syzbot+a1b59c8e1a3f022fd301@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-13 17:55:33 -04:00
Kent Overstreet
573ddcdc56 bcachefs: fix uaf in bch2_dio_write_done()
Reported-by: syzbot+19ad84d5133871207377@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-13 17:55:33 -04:00
Linus Torvalds
cfea70e835 two fixes for Windows symlink handling
-----BEGIN PGP SIGNATURE-----
 
 iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcMBf4ACgkQiiy9cAdy
 T1Hf7Qv/f/TEXZisWIGshUpIerxOAWmN70bTw4sNID9ge8mVWwtVJBs57rlSjPTc
 97Jj95urqnKEAGk/KC8qntp5QCMBQAeBFILigZph2c7vqEXPQy0dpbDUEUFuRN2G
 mq0wn7IcJZcPJmhZGx9JJeteHk/24drJRSM+jyklwI2Rmev6Y6dlsv4JyMuvP7iI
 YuCdbN7rYXsRBkpnK5AbiWCRdxwQMiMuGsppNQyBVSZKkt/g+8R16Z6WKxSbkaZf
 XajVsywhlP5Bg9HRAk/YTPK4enKVi8ISp9qfS9EuinwM/VFzEnXnYrec/fiD0Ukg
 rEemM7iF/YQdQq/2q8gm5KpoOjnLbaew+Zb+OoWyXMK7RJygD79+uMHn3v1cdi7B
 BWCgbQQ7KiRi6rOo0Xzz8Rmw3L4+DHjTvIbh46jz90qQyuumR2hUSa7cPl2ATO4l
 lxA50Q8xPE1i0Cfob1w/XHlrfmWMyovtHSKDvaeOMclp/VAHDfS6nB0x/ngyY8UH
 ii2czaDd
 =uI8y
 -----END PGP SIGNATURE-----

Merge tag '6.12-rc2-cifs-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:
 "Two fixes for Windows symlink handling"

* tag '6.12-rc2-cifs-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix creating native symlinks pointing to current or parent directory
  cifs: Improve creating native symlinks pointing to directory
2024-10-13 10:52:39 -07:00
Kent Overstreet
c986dd7ecb bcachefs: Improve check_snapshot_exists()
Check if we have snapshot_trees or subvolumes that refer to the snapshot
node being reconstructed, and use them.

With this, the kill_btree_root test that blows away the snapshots btree
now passes, and we're able to successfully reconstruct.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-12 05:02:48 -04:00
Kent Overstreet
9183c2b11e bcachefs: Fix bkey_nocow_lock()
This fixes an assertion pop in nocow_locking.c:

00243 kernel BUG at fs/bcachefs/nocow_locking.c:41!
00243 Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
00243 Modules linked in:
00243 Hardware name: linux,dummy-virt (DT)
00243 pstate: 60001005 (nZCv daif -PAN -UAO -TCO -DIT +SSBS BTYPE=--)
00244 pc : bch2_bucket_nocow_unlock (/home/testdashboard/linux-7/fs/bcachefs/nocow_locking.c:41)
00244 lr : bkey_nocow_lock (/home/testdashboard/linux-7/fs/bcachefs/data_update.c:79)
00244 sp : ffffff80c82373b0
00244 x29: ffffff80c82373b0 x28: ffffff80e08958c0 x27: ffffff80e0880000
00244 x26: ffffff80c8237a98 x25: 00000000000000a0 x24: ffffff80c8237ab0
00244 x23: 00000000000000c0 x22: 0000000000000008 x21: 0000000000000000
00244 x20: ffffff80c8237a98 x19: 0000000000000018 x18: 0000000000000000
00244 x17: 0000000000000000 x16: 000000000000003f x15: 0000000000000000
00244 x14: 0000000000000008 x13: 0000000000000018 x12: 0000000000000000
00244 x11: 0000000000000000 x10: ffffff80e0880000 x9 : ffffffc0803ac1a4
00244 x8 : 0000000000000018 x7 : ffffff80c8237a88 x6 : ffffff80c8237ab0
00244 x5 : ffffff80e08988d0 x4 : 00000000ffffffff x3 : 0000000000000000
00244 x2 : 0000000000000004 x1 : 0003000000000d1e x0 : ffffff80e08988c0
00244 Call trace:
00244 bch2_bucket_nocow_unlock (/home/testdashboard/linux-7/fs/bcachefs/nocow_locking.c:41)
00245 bch2_data_update_init (/home/testdashboard/linux-7/fs/bcachefs/data_update.c:627 (discriminator 1))
00245 promote_alloc.isra.0 (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:242 /home/testdashboard/linux-7/fs/bcachefs/io_read.c:304)
00245 __bch2_read_extent (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:949)
00246 __bch2_read (/home/testdashboard/linux-7/fs/bcachefs/io_read.c:1215)
00246 bch2_direct_IO_read (/home/testdashboard/linux-7/fs/bcachefs/fs-io-direct.c:132)
00246 bch2_read_iter (/home/testdashboard/linux-7/fs/bcachefs/fs-io-direct.c:201)
00247 aio_read.constprop.0 (/home/testdashboard/linux-7/fs/aio.c:1602)
00247 io_submit_one.constprop.0 (/home/testdashboard/linux-7/fs/aio.c:2003 /home/testdashboard/linux-7/fs/aio.c:2052)
00248 __arm64_sys_io_submit (/home/testdashboard/linux-7/fs/aio.c:2111 /home/testdashboard/linux-7/fs/aio.c:2081 /home/testdashboard/linux-7/fs/aio.c:2081)
00248 invoke_syscall.constprop.0 (/home/testdashboard/linux-7/arch/arm64/include/asm/syscall.h:61 /home/testdashboard/linux-7/arch/arm64/kernel/syscall.c:54)
00248 ========= FAILED TIMEOUT tiering_variable_buckets_replicas in 1200s

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-12 05:01:52 -04:00
Kent Overstreet
672f75238e bcachefs: Fix accounting replay flags
BCH_TRANS_COMMIT_journal_reclaim without BCH_WATERMARK_reclaim means
"return an error if low on journal space" - but accounting replay must
succeed.

Fixes https://github.com/koverstreet/bcachefs/issues/656

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-12 03:02:16 -04:00
Kent Overstreet
c1bd21bb65 bcachefs: Fix invalid shift in member_to_text()
Reported-by: syzbot+064ce437a1ad63d3f6ef@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-12 03:02:16 -04:00
Kent Overstreet
7d84d9f449 bcachefs: Fix bch2_have_enough_devs() for BCH_SB_MEMBER_INVALID
This fixes a kasan splat in the ec device removal tests.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2024-10-11 22:20:51 -04:00