linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-04 01:51:34 +00:00

Author	SHA1	Message	Date
Kent Overstreet	d50d7a5fa4	bcachefs: Check for logged ops when clean If we shut down successfully, there shouldn't be any logged ops to resume. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:22 -04:00
Kent Overstreet	1c0ee43b2c	bcachefs: BCH_FS_clean_recovery Add a filesystem flag to indicate whether we did a clean recovery - using c->sb.clean after we've got rw is incorrect, since c->sb is updated whenever we write the superblock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:22 -04:00
Kent Overstreet	9773547b16	bcachefs: Convert disk accounting BUG_ON() to WARN_ON() We had a bug where disk accounting keys didn't always have their version field set in journal replay; change the BUG_ON() to a WARN(), and exclude this case since it's now checked for elsewhere (in the bkey validate function). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:22 -04:00
Kent Overstreet	a3581ca35d	bcachefs: Fix BCH_TRANS_COMMIT_skip_accounting_apply This was added to avoid double-counting accounting keys in journal replay. But applied incorrectly (easily done since it applies to the transaction commit, not a particular update), it leads to skipping in-mem accounting for real accounting updates, and failure to give them a version number - which leads to journal replay becoming very confused the next time around. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 22:32:20 -04:00
Kent Overstreet	f8911ad88d	bcachefs: Check for accounting keys with bversion=0 Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	cf49f8a8c2	bcachefs: rename version -> bversion give bversions a more distinct name, to aid in grepping Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	fd65378db9	bcachefs: Don't delete unlinked inodes before logged op resume Previously, check_inode() would delete unlinked inodes if they weren't on the deleted list - this code dating from before there was a deleted list. But, if we crash during a logged op (truncate or finsert/fcollapse) of an unlinked file, logged op resume will get confused if the inode has already been deleted - instead, just add it to the deleted list if it needs to be there; delete_dead_inodes runs after logged op resume. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	8d65b15f8d	bcachefs: Fix BCH_SB_ERRS() so we can reorder BCH_SB_ERRS() has a field for the actual enum val so that we can reorder to reorganize, but the way BCH_SB_ERR_MAX was defined didn't allow for this. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	5612daafb7	bcachefs: Fix fsck warnings from bkey validation __bch2_fsck_err() warns if the current task has a btree_trans object and it wasn't passed in, because if it has to prompt for user input it has to be able to unlock it. But plumbing the btree_trans through bkey_validate(), as well as transaction restarts, is problematic - so instead make bkey fsck errors FSCK_AUTOFIX, which doesn't need to warn. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	7c980a43e9	bcachefs: Move transaction commit path validation to as late as possible In order to check for accounting keys with version=0, we need to run validation after they've been assigned version numbers. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	431312b59c	bcachefs: Fix disk accounting attempting to mark invalid replicas entry This fixes the following bug, where a disk accounting key has an invalid replicas entry, and we attempt to add it to the superblock: bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): starting version 1.12: rebalance_work_acct_fix opts=metadata_replicas=2,data_replicas=2,foreground_target=ssd,background_target=hdd,nopromote_whole_extents,verbose,fsck,fix_errors=yes bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): recovering from clean shutdown, journal seq 15211644 bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): accounting_read... accounting not marked in superblock replicas replicas cached: 1/1 [0], fixing bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 0 in entry cached: 1/1 [0] replicas_v0 (size 88): user: 2 [3 5] user: 2 [1 4] cached: 1 [2] btree: 2 [1 2] user: 2 [2 5] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 2 [1 2] user: 2 [2 3] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 3] user: 2 [1 5] user: 2 [2 4] bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): inconsistency detected - emergency read only at journal seq 15211644 accounting not marked in superblock replicas replicas user: 1/1 [3], fixing bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 0 in entry cached: 1/1 [0] replicas_v0 (size 96): user: 2 [3 5] user: 2 [1 3] cached: 1 [2] btree: 2 [1 2] user: 2 [2 4] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 1 [3] user: 2 [1 5] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 2] user: 2 [1 4] user: 2 [2 3] user: 2 [2 5] accounting not marked in superblock replicas replicas user: 1/2 [3 7], fixing bcachefs 
(3c0860e8-07ca-4276-8954-11c1774be868): sb invalid before write: Invalid superblock section replicas_v0: invalid device 7 in entry user: 1/2 [3 7] replicas_v0 (size 96): user: 2 [3 7] user: 2 [1 3] cached: 1 [2] btree: 2 [1 2] user: 2 [2 4] cached: 1 [0] cached: 1 [4] journal: 2 [1 5] user: 1 [3] user: 2 [1 5] user: 2 [3 4] user: 2 [4 5] cached: 1 [1] cached: 1 [3] cached: 1 [5] journal: 2 [1 2] journal: 2 [2 5] btree: 2 [2 5] user: 2 [1 2] user: 2 [1 4] user: 2 [2 3] user: 2 [2 5] user: 2 [3 5] done bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): alloc_read... done bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): stripes_read... done bcachefs (3c0860e8-07ca-4276-8954-11c1774be868): snapshots_read... done Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	49fd90b2cc	bcachefs: Fix unlocked access to c->disk_sb.sb in bch2_replicas_entry_validate() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	9104fc1928	bcachefs: Fix accounting read + device removal accounting read was checking if accounting replicas entries were marked in the superblock prior to applying accounting from the journal, which meant that a recently removed device could spuriously trigger a "not marked in superblock" error (when journal entries zero out the offending counter). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	1e0272ef47	bcachefs: bch_accounting_mode Minor refactoring - replace multiple bool arguments with an enum; prep work for fixing a bug in accounting read. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	3672bda8f5	bcachefs: fix transaction restart handling in check_extents(), check_dirents() Dealing with outside state within a btree transaction is always tricky. check_extents() and check_dirents() have to accumulate counters for i_sectors and i_nlink (for subdirectories). There were two bugs: - transaction commit may return a restart; therefore we have to commit before accumulating to those counters - get_inode_all_snapshots() may return a transaction restart, before updating w->last_pos; then, on the restart, check_i_sectors()/check_subdir_count() would see inodes that were not for w->last_pos Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:35 -04:00
Kent Overstreet	22a507d68e	bcachefs: kill inode_walker_entry.seen_this_pos dead code Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	b29c30ab48	bcachefs: Fix incorrect IS_ERR_OR_NULL usage Returning a positive integer instead of an error code causes error paths to become very confused. Closes: syzbot+c0360e8367d6d8d04a66@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Hongbo Li	dc5bfdf8ea	bcachefs: fix the memory leak in exception case The pointer clean points to memory allocated by kmemdup. When the return value of bch2_sb_clean_validate_late is not zero, the memory pointed to by clean is leaked, so we should free it in this case. Fixes: `a37ad1a3ab` ("bcachefs: sb-clean.c") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
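The leak described in dc5bfdf8ea has a common shape: a buffer is duplicated, a later validation step fails, and the error path returns without freeing the copy. The following is a minimal userspace sketch of the corrected shape, not the bcachefs code. `validate_late()`, `dup_and_validate()` and `leak_demo()` are hypothetical names, and `malloc()`/`memcpy()` stand in for `kmemdup()`.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical validation step: rejects buffers whose first byte is 0. */
static int validate_late(const unsigned char *buf, size_t len)
{
	return (len == 0 || buf[0] == 0) ? -EINVAL : 0;
}

/*
 * Duplicate src and validate the copy. If validation fails, the copy is
 * freed before returning, so the error path does not leak it.
 * Returns 0 and stores the copy in *out on success, a negative errno otherwise.
 */
static int dup_and_validate(const unsigned char *src, size_t len,
			    unsigned char **out)
{
	unsigned char *copy;
	int ret;

	*out = NULL;
	if (len == 0)
		return -EINVAL;

	copy = malloc(len);		/* userspace stand-in for kmemdup() */
	if (!copy)
		return -ENOMEM;
	memcpy(copy, src, len);

	ret = validate_late(copy, len);
	if (ret) {
		free(copy);		/* the step a leaking version would omit */
		return ret;
	}

	*out = copy;
	return 0;
}

/* Usage: the caller owns *out only when 0 is returned. */
void leak_demo(void)
{
	const unsigned char good[] = { 1, 2, 3 };
	const unsigned char bad[] = { 0, 2, 3 };
	unsigned char *out;

	assert(dup_and_validate(good, sizeof(good), &out) == 0);
	free(out);
	assert(dup_and_validate(bad, sizeof(bad), &out) == -EINVAL);
	assert(out == NULL);
}
```

Keeping ownership with the callee until validation succeeds means callers never have to free anything on failure.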
Hongbo Li	3125c95ea6	bcachefs: fast exit when darray_make_room failed In downgrade_table_extra, the return value of darray_make_room needs to be checked. When it fails, we should exit immediately. Fixes: `7773df19c3` ("bcachefs: metadata version bucket_stripe_sectors") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	951dd86e7c	bcachefs: Fix iterator leak in check_subvol() A couple small error handling fixes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	2a1df87346	bcachefs: Add snapshot to bch_inode_unpacked this allows for various cleanups in fsck Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Diogo Jahchan Koike	40d40c6bea	bcachefs: assign return error when iterating through layout syzbot reported a null ptr deref in __copy_user [0] In __bch2_read_super, when a corrupt backup superblock matches the default opts offset, no error is assigned to ret and the freed superblock gets through, possibly being assigned as the best sb in bch2_fs_open and being later dereferenced, causing a fault. Assign EINVALID to ret when iterating through layout. [0]: https://syzkaller.appspot.com/bug?extid=18a5c5e8a9c856944876 Reported-by: syzbot+18a5c5e8a9c856944876@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=18a5c5e8a9c856944876 Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	c6040447c5	bcachefs: Fix srcu warning in check_topology check_topology doesn't need the srcu lock and doesn't use normal btree transactions - we can just drop the srcu lock. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	18c520f408	bcachefs: Fix error path in check_dirent_inode_dirent() fsck_err() jumps to the fsck_err label when bailing out; need to make sure bp_iter was initialized... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Piotr Zalewski	0696a18a8c	bcachefs: memset bounce buffer portion to 0 after key_sort_fix_overlapping Zero-initialize part of allocated bounce buffer which wasn't touched by subsequent bch2_key_sort_fix_overlapping to mitigate later uninit-value use KMSAN bug[1]. After applying the patch reproducer still triggers stack overflow[2] but it seems unrelated to the uninit-value use warning. After further investigation it was found that stack overflow occurs because KMSAN adds too many function calls[3]. Backtrace of where the stack magic number gets smashed was added as a reply to syzkaller thread[3]. It was confirmed that task's stack magic number gets smashed after the code path where KMSAN detects uninit-value use is executed, so it can be assumed that it doesn't contribute in any way to uninit-value use detection. [1] https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718 [2] https://lore.kernel.org/lkml/66e57e46.050a0220.115905.0002.GAE@google.com [3] https://lore.kernel.org/all/rVaWgPULej8K7HqMPNIu8kVNyXNjjCiTB-QBtItLFBmk0alH6fV2tk4joVPk97Evnuv4ZRDd8HB5uDCkiFG6u81xKdzDj-KrtIMJSlF6Kt8=@proton.me Reported-by: syzbot+6f655a60d3244d0c6718@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=6f655a60d3244d0c6718 Fixes: `ec4edd7b9d` ("bcachefs: Prep work for variable size btree node buffers") Suggested-by: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	51b7cc7c0f	bcachefs: Improve bch2_is_inode_open() warning message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Kent Overstreet	4a8f8fafbd	bcachefs: Add extra padding in bkey_make_mut_noupdate() This fixes a kasan splat in propagate_key_to_snapshot_leaves() - varint_decode_fast() does reads (that it never uses) up to 7 bytes past the end of the integer. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
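Commit 4a8f8fafbd adds padding because a fast decoder may read bytes past the end of the value that it never uses. The sketch below illustrates the general idea in userspace and is not bcachefs's varint format. `decode_fast()`, `dup_padded()`, `decode_demo()` and `DECODE_PAD` are hypothetical names, and the fixed-width little-endian encoding is an assumption chosen to keep the example small.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DECODE_PAD 7	/* the decoder may read up to 7 bytes past the value */

/*
 * Illustrative decoder (NOT bcachefs's varint format): always reads
 * 8 bytes starting at p, then masks the result down to the nbytes
 * (1..8) that belong to the value. The extra bytes are read but unused.
 */
static uint64_t decode_fast(const unsigned char *p, unsigned nbytes)
{
	uint64_t v = 0;
	unsigned i;

	for (i = 0; i < 8; i++)			/* unconditional 8-byte read */
		v |= (uint64_t)p[i] << (8 * i);
	if (nbytes < 8)
		v &= (UINT64_C(1) << (8 * nbytes)) - 1;
	return v;
}

/* Copy len bytes into a buffer with DECODE_PAD zeroed bytes after the payload. */
static unsigned char *dup_padded(const unsigned char *src, size_t len)
{
	unsigned char *buf = calloc(1, len + DECODE_PAD);

	if (buf)
		memcpy(buf, src, len);
	return buf;
}

void decode_demo(void)
{
	const unsigned char raw[] = { 0x34, 0x12 };	/* 0x1234 in 2 bytes */
	unsigned char *buf = dup_padded(raw, sizeof(raw));

	assert(buf);
	/* The 8-byte read at buf stays in bounds only because of the padding. */
	assert(decode_fast(buf, 2) == 0x1234);
	free(buf);
}
```

Without the padding, the same call would read six bytes beyond a two-byte allocation, which is the kind of access a sanitizer such as KASAN reports even though the bytes are masked off.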
Kent Overstreet	f890c8513f	bcachefs: Mark inode errors as autofix Most or all errors will be autofix in the future, we're currently just doing the ones that we know are well tested. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-27 21:46:34 -04:00
Al Viro	cb787f4ac0	[tree-wide] finally take no_llseek out no_llseek had been defined to NULL two years ago, in commit `868941b144` ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i; done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2024-09-27 08:18:43 -07:00
Kent Overstreet	7eb4a319db	bcachefs: Fix infinite loop in propagate_key_to_snapshot_leaves() As we iterate we need to mark that we no longer need iterators - otherwise we'll infinite loop via the "too many iters" check when there's many snapshots. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-23 18:46:58 -04:00
Kent Overstreet	6d12d7ace9	bcachefs: Ensure BCH_FS_accounting_replay_done is always set if it doesn't get set we'll never be able to flush the btree write buffer; this only happens in fake rw mode, but prevents us from shutting down. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-23 18:46:58 -04:00
Linus Torvalds	b3f391fddf	bcachefs changes for 6.12-rc1 rcu_pending, btree key cache rework: this solves lock contention in the key cache, eliminating the biggest source of the srcu lock hold time warnings, and drastically improving performance on some metadata heavy workloads - on multithreaded creates we're now 3-4x faster than xfs. We're now using an rhashtable instead of the system inode hash table; this is another significant performance improvement on multithreaded metadata workloads, eliminating more lock contention. for_each_btree_key_in_subvolume_upto(): new helper for iterating over keys within a specific subvolume, eliminating a lot of open coded "subvolume_get_snapshot()" and also fixing another source of srcu lock time warnings, by running each loop iteration in its own transaction (as the existing for_each_btree_key() does). More work on btree_trans locking asserts; we now assert that we don't hold btree node locks when trans->locked is false, which is important because we don't use lockdep for tracking individual btree node locks. Some cleanups and improvements in the bset.c btree node lookup code, from Alan. Rework of btree node pinning, which we use in backpointers fsck. The old hacky implementation, where the shrinker just skipped over nodes in the pinned range, was causing OOMs; instead we now use another shrinker with a much higher seeks number for pinned nodes. Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue where rebalance would sometimes fall back to allocating from the full filesystem, which is not what we want when it's trying to move data to a specific target. Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache allocations. Idmap mounts are now supported - Hongbo. Rename whiteouts are now supported - Hongbo. Erasure coding can now handle devices being marked as failed, or forcibly removed. We still need the evacuate path for erasure coding, but it's getting very close to ready for people to start using. 
Status, and when will we be taking off experimental: Going by critical, user facing bugs getting found and fixed, we're nearly there. There are a couple key items that need to be finished before we can take off the experimental label: - The end-user experience is still pretty painful when the root filesystem needs a fsck; we need some form of limited self healing so that necessary repair gets run automatically. Errors (by type) are recorded in the superblock, so what we need to do next is convert remaining inconsistent() errors to fsck() errors (so that all runtime inconsistencies are logged in the superblock), and we need to go through the list of fsck errors and classify them by which fsck passes are needed to repair them. - We need comprehensive torture testing for all our repair paths, to shake out remaining bugs there. Thomas has been working on the tooling for this, so this is coming soonish. Slightly less critical items: - We need to improve the end-user experience for degraded mounts: right now, a degraded root filesystem means dropping to an initramfs shell or somehow inputting mount options manually (we don't want to allow degraded mounts without some form of user input, except on unattended servers) - we need the mount helper to prompt the user to allow mounting degraded, and make sure this works with systemd. - Scalability: we have users running 100TB+ filesystems, and that's effectively the limit right now due to fsck times. We have some reworks in the pipeline to address this, we're aiming to make petabyte sized filesystems practical. 
Merge tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs Pull bcachefs updates from Kent Overstreet: - rcu_pending, btree key cache rework: this solves lock contention in the key cache, eliminating the biggest source of the srcu lock hold time warnings, and drastically improving performance on some metadata heavy workloads - on multithreaded creates we're now 3-4x faster than xfs. - We're now using an rhashtable instead of the system inode hash table; this is another significant performance improvement on multithreaded metadata workloads, eliminating more lock contention. - for_each_btree_key_in_subvolume_upto(): new helper for iterating over keys within a specific subvolume, eliminating a lot of open coded "subvolume_get_snapshot()" and also fixing another source of srcu lock time warnings, by running each loop iteration in its own transaction (as the existing for_each_btree_key() does). - More work on btree_trans locking asserts; we now assert that we don't hold btree node locks when trans->locked is false, which is important because we don't use lockdep for tracking individual btree node locks. 
- Some cleanups and improvements in the bset.c btree node lookup code, from Alan. - Rework of btree node pinning, which we use in backpointers fsck. The old hacky implementation, where the shrinker just skipped over nodes in the pinned range, was causing OOMs; instead we now use another shrinker with a much higher seeks number for pinned nodes. - Rebalance now uses BCH_WRITE_ONLY_SPECIFIED_DEVS; this fixes an issue where rebalance would sometimes fall back to allocating from the full filesystem, which is not what we want when it's trying to move data to a specific target. - Use __GFP_ACCOUNT, GFP_RECLAIMABLE for btree node, key cache allocations. - Idmap mounts are now supported (Hongbo Li) - Rename whiteouts are now supported (Hongbo Li) - Erasure coding can now handle devices being marked as failed, or forcibly removed. We still need the evacuate path for erasure coding, but it's getting very close to ready for people to start using. * tag 'bcachefs-2024-09-21' of git://evilpiepirate.org/bcachefs: (99 commits) bcachefs: return err ptr instead of null in read sb clean bcachefs: Remove duplicated include in backpointers.c bcachefs: Don't drop devices with stripe pointers bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices bcachefs: bch_fs.rw_devs_change_count bcachefs: bch2_dev_remove_stripes() bcachefs: bch2_trigger_ptr() calculates sectors even when no device bcachefs: improve error messages in bch2_ec_read_extent() bcachefs: improve error message on too few devices for ec bcachefs: improve bch2_new_stripe_to_text() bcachefs: ec_stripe_head.nr_created bcachefs: bch_stripe.disk_label bcachefs: stripe_to_mem() bcachefs: EIO errcode cleanup bcachefs: Rework btree node pinning bcachefs: split up btree cache counters for live, freeable bcachefs: btree cache counters should be size_t bcachefs: Don't count "skipped access bit" as touched in btree cache scan bcachefs: Failed devices no longer require mounting in degraded mode bcachefs: 
bch2_dev_rcu_noerror() ...	2024-09-23 10:05:41 -07:00
Ahmed Ehab	39c3aad43f	bcachefs: Hold read lock in bch2_snapshot_tree_oldest_subvol() Syzbot reports a problem that a warning is triggered due to suspicious use of rcu_dereference_check(). That is triggered by a call of bch2_snapshot_tree_oldest_subvol(). The cause of the warning is that inside bch2_snapshot_tree_oldest_subvol(), snapshot_t() is called which calls rcu_dereference() that requires a read lock to be held. Also, the call of bch2_snapshot_tree_next() eventually calls snapshot_t(). To fix this, call rcu_read_lock() before calling snapshot_t(). Then, release the lock after the termination of the while loop. Reported-by: <syzbot+f7c41a878676b72c16a6@syzkaller.appspotmail.com> Signed-off-by: Ahmed Ehab <bottaawesome633@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 14:54:18 -04:00
Diogo Jahchan Koike	025c55a4c7	bcachefs: return err ptr instead of null in read sb clean syzbot reported a null-ptr-deref in bch2_fs_start. [0] When a sb is marked clean but doesn't have a clean section bch2_read_superblock_clean returns NULL which PTR_ERR_OR_ZERO lets through, eventually leading to a null ptr dereference down the line. Adjust read sb clean to return an ERR_PTR indicating the invalid clean section. [0] https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543 Reported-by: syzbot+1cecc37d87c4286e5543@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=1cecc37d87c4286e5543 Signed-off-by: Diogo Jahchan Koike <djahchankoike@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
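The failure mode in 025c55a4c7 comes from how error pointers are tested: `PTR_ERR_OR_ZERO()` reports an error only for pointers in the reserved errno range, so a plain NULL counts as success. The following userspace sketch re-creates the helpers from the kernel's `include/linux/err.h` to show the difference. `struct clean_section`, `read_clean_null()`, `read_clean_errptr()` and `errptr_demo()` are hypothetical names for illustration, not the bcachefs functions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace re-creations of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline int IS_ERR(const void *ptr)
{
	/* Error pointers occupy the top MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}
static inline int PTR_ERR_OR_ZERO(const void *ptr)
{
	return IS_ERR(ptr) ? (int)PTR_ERR(ptr) : 0;
}

struct clean_section { int journal_seq; };	/* hypothetical payload */

static struct clean_section the_section = { .journal_seq = 42 };

/* Problematic shape: a missing section is reported as NULL. */
static struct clean_section *read_clean_null(int have_section)
{
	return have_section ? &the_section : NULL;
}

/* Corrected shape: a missing section is reported as an error pointer. */
static struct clean_section *read_clean_errptr(int have_section)
{
	return have_section ? &the_section : ERR_PTR(-EINVAL);
}

void errptr_demo(void)
{
	/* NULL is not an error pointer, so this check reports success. */
	assert(PTR_ERR_OR_ZERO(read_clean_null(0)) == 0);
	/* The ERR_PTR variant surfaces the failure to the caller. */
	assert(PTR_ERR_OR_ZERO(read_clean_errptr(0)) == -EINVAL);
	assert(PTR_ERR_OR_ZERO(read_clean_errptr(1)) == 0);
}
```

A caller that checks only `PTR_ERR_OR_ZERO()` and then dereferences the result is safe with the second function but not with the first.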
Yang Li	abb43dd677	bcachefs: Remove duplicated include in backpointers.c The header file bbpos.h is included twice in backpointers.c, so one of the inclusions can be removed. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=10783 Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	d5c5b337f8	bcachefs: Don't drop devices with stripe pointers Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	035d72f72c	bcachefs: bch2_ec_stripe_head_get() now checks for change in rw devices This factors out ec_stripe_head_devs_update(), which initializes the bitmap of devices we're allocating from, and runs it every time c->rw_devs_change_count changes. We also cancel pending, not allocated stripes, since they may refer to devices that are no longer available. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	83ccd9b31d	bcachefs: bch_fs.rw_devs_change_count Add a counter that's incremented whenever rw devices change; this will be used for erasure coding so that it can keep ec_stripe_head in sync and not deadlock on a new stripe when a device it wants goes away. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	ad8d1f77fc	bcachefs: bch2_dev_remove_stripes() We can now correctly force-remove a device that has stripes on it; this uses the new BCH_SB_MEMBER_INVALID sentinel value. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	934137b0c0	bcachefs: bch2_trigger_ptr() calculates sectors even when no device This is necessary for erasure coded pointers to devices that have been removed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	2aee59eb21	bcachefs: improve error messages in bch2_ec_read_extent() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	cb771fe891	bcachefs: improve error message on too few devices for ec Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:49 -04:00
Kent Overstreet	c9cabfb215	bcachefs: improve bch2_new_stripe_to_text() also print out the new stripe key Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	a4b7a0c037	bcachefs: ec_stripe_head.nr_created additional debug stat Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	fa85c47397	bcachefs: bch_stripe.disk_label When reshaping existing stripes, we should keep them on the same target that they were allocated on; to do this, we need to add a field to the btree stripe type. This is a tad awkward, because we only have 8 bits left, and targets are 16 bits - but we only need to store a label, not a full target. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	1b11c4d365	bcachefs: stripe_to_mem() factor out a common helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	54a12984a9	bcachefs: EIO errcode cleanup We want to be using private errcodes whenever possible, for better error messages. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	7a51608d01	bcachefs: Rework btree node pinning In backpointers fsck, we do a sequential scan of one btree, and check references to another: extents <-> backpointers Checking references generates random lookups, so we want to pin that btree in memory (or only a range, if it doesn't fit in ram). Previously, this was done with a simple check in the shrinker - "if btree node is in range being pinned, don't free it" - but this generated OOMs, as our shrinker wasn't well behaved if there was less memory available than expected. Instead, we now have two different shrinkers and lru lists; the second shrinker being for pinned nodes, with seeks set much higher than normal - so they can still be freed if necessary, but we'll prefer not to. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	91ddd71510	bcachefs: split up btree cache counters for live, freeable this is prep for introducing a second live list and shrinker for pinned nodes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	691f2cba22	bcachefs: btree cache counters should be size_t 32 bits won't overflow any time soon, but size_t is the correct type for counting objects in memory. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	ad5dbe3ce5	bcachefs: Don't count "skipped access bit" as touched in btree cache scan Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	e92e5056e4	bcachefs: Failed devices no longer require mounting in degraded mode Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	805ddc2042	bcachefs: bch2_dev_rcu_noerror() bch2_dev_rcu() now properly errors if the device is invalid Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	b99a94fd7a	bcachefs: Progress indicator for extents_to_backpointers Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	3621ecc10f	bcachefs: bch2_opts_to_text() Factor out bch2_show_options() into a generic helper, for debugging option passing issues. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	bf611567b7	bcachefs: improve "no device to read from" message Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Hongbo Li	b161ca8096	bcachefs: Fix compilation error for bch2_sb_member_alloc Fix the following compilation error: ``` fs/bcachefs/sb-members.c: In function ‘bch2_sb_member_alloc’: fs/bcachefs/sb-members.c:508:2: error: a label can only be part of a statement and a declaration is not a statement 508 | unsigned nr_devices = max_t(unsigned, dev_idx + 1, c->sb.nr_devices); ``` Fixes: a7d364a133c7 ("bcachefs: bch2_sb_member_alloc()") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
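The compiler error quoted in b161ca8096 arises because, under pre-C23 rules, a label must be followed by a statement and a declaration is not a statement. The sketch below shows two common ways to restructure such code. It does not reproduce what the commit itself changed, and the function names, the `have_slot` label and `label_demo()` are hypothetical.

```c
#include <assert.h>

/*
 * Rejected by compilers that follow pre-C23 rules:
 *
 *	have_slot:
 *		unsigned nr = ...;	(a declaration directly after a label)
 *
 * Two portable alternatives follow.
 */

/* Option 1: declare the variable before the label, assign after it. */
static unsigned nr_devices_hoisted(unsigned idx, unsigned cur)
{
	unsigned nr;

	if (idx < cur)
		goto have_slot;
	/* (a real caller would grow some table here) */
have_slot:
	nr = idx + 1 > cur ? idx + 1 : cur;
	return nr;
}

/* Option 2: follow the label with an empty statement. */
static unsigned nr_devices_empty_stmt(unsigned idx, unsigned cur)
{
	if (idx < cur)
		goto have_slot;
	/* (a real caller would grow some table here) */
have_slot:
	;
	unsigned nr = idx + 1 > cur ? idx + 1 : cur;
	return nr;
}

void label_demo(void)
{
	assert(nr_devices_hoisted(3, 2) == 4);
	assert(nr_devices_empty_stmt(3, 2) == 4);
	assert(nr_devices_hoisted(0, 5) == 5);
	assert(nr_devices_empty_stmt(0, 5) == 5);
}
```

Both variants compute the same value. Hoisting the declaration tends to match kernel style more closely, while the empty statement is the smaller textual change.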
Kent Overstreet	17405279e8	bcachefs: bch2_sb_member_alloc() refactoring Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	6b812f1dce	bcachefs: bch2_dev_remove_alloc() -> alloc_background.c Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	8ed4ba3663	bcachefs: Move tabstop setup to bch2_dev_usage_to_text() No reason for it not to be where it's needed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	4f19a60c32	bcachefs: Options for recovery_passes, recovery_passes_exclude This adds mount options for specifying recovery passes to run, or exclude; the immediate need for this is that backpointers fsck is having trouble completing, so we need a way to skip it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	ff7f756f2b	bcachefs: Use mm_account_reclaimed_pages() when freeing btree nodes When freeing in a shrinker callback, we need to notify memory reclaim, so it knows forward progress has been made. Normally this is done in e.g. slab code, but we're not freeing through slab - or rather we are, but these allocations are big, and use the kmalloc_large() path. This is really a bug in the slub code, but we're working around it here for now. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:48 -04:00
Kent Overstreet	895fbf1cf0	bcachefs: Use __GFP_ACCOUNT for reclaimable memory Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:39:46 -04:00
Sasha Finkelstein	4645855df0	bcachefs: Hook up RENAME_WHITEOUT in rename. This is needed for overlayfs, which is used by container managers. Signed-off-by: Sasha Finkelstein <fnkl.kernel@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	d90c8acd35	bcachefs: rebalance writes use BCH_WRITE_ONLY_SPECIFIED_DEVS this was an oversight: rebalance is moving data to a specific device, so we don't want it falling back to the full filesystem Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	a977f3e162	bcachefs: BCH_WRITE_ALLOC_NOWAIT no longer applies to open bucket allocation rebalance writes must be BCH_WRITE_ALLOC_NOWAIT because they don't allocate from the full filesystem - but we don't want spurious allocation failures due to open buckets. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	2e95497e81	bcachefs: fix prototype to bch2_alloc_sectors_start_trans() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	da2d20c98d	bcachefs: kill redundant is_vmalloc_addr() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	af05633d40	bcachefs: convert __bch2_encrypt_bio() to darray like the previous patch, kill use of bare arrays; the encryption code likes to work in big batches, so this is a small performance improvement. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	b7d8092a1b	bcachefs: do_encrypt() now handles allocation failures convert to darray, and add a fallback when allocation fails Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:20 -04:00
Kent Overstreet	3340dee235	bcachefs: Add pinned to btree cache not freed counters Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-21 11:35:19 -04:00
Linus Torvalds	2004cef11e	In the v6.12 scheduler development cycle we had 63 commits from 18 contributors. Signed-off-by: Ingo Molnar <mingo@kernel.org> Merge tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: - Implement the SCHED_DEADLINE server infrastructure - Daniel Bristot de Oliveira's last major contribution to the kernel: "SCHED_DEADLINE servers can help fixing starvation issues of low priority tasks (e.g., SCHED_OTHER) when higher priority tasks monopolize CPU cycles. Today we have RT Throttling; DEADLINE servers should be able to replace and improve that." (Daniel Bristot de Oliveira, Peter Zijlstra, Joel Fernandes, Youssef Esmat, Huang Shijie) - Preparatory changes for sched_ext integration: - Use set_next_task(.first) where required - Fix up set_next_task() implementations - Clean up DL server vs. core sched - Split up put_prev_task_balance() - Rework pick_next_task() - Combine the last put_prev_task() and the first set_next_task() - Rework dl_server - Add put_prev_task(.next) (Peter Zijlstra, with a fix by Tejun Heo) - Complete the EEVDF transition and refine EEVDF scheduling: - Implement delayed dequeue - Allow shorter slices to wakeup-preempt - Use sched_attr::sched_runtime to set request/slice suggestion - Document the new feature flags - Remove unused and duplicate-functionality fields - Simplify & unify pick_next_task_fair() - Misc debuggability enhancements (Peter Zijlstra, with fixes/cleanups by Dietmar Eggemann, Valentin Schneider and Chuyi Zhou) - Initialize the vruntime of a new task when it is first enqueued, resulting in significant decrease in latency of newly woken tasks (Zhang Qiao) - Introduce SM_IDLE and an idle re-entry fast-path in __schedule() (K Prateek Nayak, Peter Zijlstra) - Clean up and clarify the usage of rt_task() (Qais Yousef) - Preempt SCHED_IDLE entities in strict cgroup hierarchies (Tianchen Ding) - Clarify the documentation of time units for deadline scheduler parameters (Christian Loehle) - Remove the HZ_BW chicken-bit feature flag introduced a year ago, the original change seems to be working fine (Phil Auld) - Misc fixes and cleanups (Chen Yu, Dan Carpenter, Huang Shijie, Peilin He, Qais Yousef and Vincent Guittot) * tag 'sched-core-2024-09-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (64 commits) sched/cpufreq: Use NSEC_PER_MSEC for deadline task cpufreq/cppc: Use NSEC_PER_MSEC for deadline task sched/deadline: Clarify nanoseconds in uapi sched/deadline: Convert schedtool example to chrt sched/debug: Fix the runnable tasks output sched: Fix sched_delayed vs sched_core kernel/sched: Fix util_est accounting for DELAY_DEQUEUE kthread: Fix task state in kthread worker if being frozen sched/pelt: Use rq_clock_task() for hw_pressure sched/fair: Move effective_cpu_util() and effective_cpu_util() in fair.c sched/core: Introduce SM_IDLE and an idle re-entry fast-path in __schedule() sched: Add put_prev_task(.next) sched: Rework dl_server sched: Combine the last put_prev_task() and the first set_next_task() sched: Rework pick_next_task() sched: Split up put_prev_task_balance() sched: Clean up DL server vs core sched sched: Fixup set_next_task() implementations sched: Use set_next_task(.first) where required sched/fair: Properly deactivate sched_delayed task upon class change ...	2024-09-19 15:55:58 +02:00
Linus Torvalds	2775df6e5e	Merge tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs folio updates from Christian Brauner: "This contains work to port write_begin and write_end to rely on folios for various filesystems. This converts ocfs2, vboxsf, orangefs, jffs2, hostfs, fuse, f2fs, ecryptfs, ntfs3, nilfs2, reiserfs, minixfs, qnx6, sysv, ufs, and squashfs. After this series lands a bunch of the filesystems in this list do not mention struct page anymore" * tag 'vfs-6.12.folio' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (61 commits) Squashfs: Ensure all readahead pages have been used Squashfs: Rewrite and update squashfs_readahead_fragment() to not use page->index Squashfs: Update squashfs_readpage_block() to not use page->index Squashfs: Update squashfs_readahead() to not use page->index Squashfs: Update page_actor to not use page->index jffs2: Use a folio in jffs2_garbage_collect_dnode() jffs2: Convert jffs2_do_readpage_nolock to take a folio buffer: Convert __block_write_begin() to take a folio ocfs2: Convert ocfs2_write_zero_page to use a folio fs: Convert aops->write_begin to take a folio fs: Convert aops->write_end to take a folio vboxsf: Use a folio in vboxsf_write_end() orangefs: Convert orangefs_write_begin() to use a folio orangefs: Convert orangefs_write_end() to use a folio jffs2: Convert jffs2_write_begin() to use a folio jffs2: Convert jffs2_write_end() to use a folio hostfs: Convert hostfs_write_end() to use a folio fuse: Convert fuse_write_begin() to use a folio fuse: Convert fuse_write_end() to use a folio f2fs: Convert f2fs_write_begin() to use a folio ...	2024-09-16 08:54:30 +02:00
Linus Torvalds	8f72c31f45	Merge tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull misc vfs updates from Christian Brauner: "This contains the usual pile of misc updates: Features: - Add F_CREATED_QUERY fcntl() that allows userspace to query whether a file was actually created. Often userspace wants to know whether an O_CREAT request did actually create a file without using O_EXCL. The current logic is to first attempt to open the file without O_CREAT | O_EXCL and if ENOENT is returned userspace tries again with both flags. If that succeeds all is well. If it now reports EEXIST it retries. That works fairly well but some corner cases make this more involved. If this operates on a dangling symlink the first openat() without O_CREAT | O_EXCL will return ENOENT but the second openat() with O_CREAT | O_EXCL will fail with EEXIST. The reason is that openat() without O_CREAT | O_EXCL follows the symlink while O_CREAT | O_EXCL doesn't for security reasons. So it's not something we can really change unless we add an explicit opt-in via O_FOLLOW which seems really ugly. All available workarounds are really nasty (fanotify, bpf lsm etc) so add a simple fcntl(). - Try an opportunistic lookup for O_CREAT. Today, when opening a file we'll typically do a fast lookup, but if O_CREAT is set, the kernel always takes the exclusive inode lock. This was likely done with the expectation that O_CREAT means that we always expect to do the create, but that's often not the case. Many programs set O_CREAT even in scenarios where the file already exists (see related F_CREATED_QUERY patch motivation above).
The series contained in the PR rearranges the pathwalk-for-open code to also attempt a fast_lookup in certain O_CREAT cases. If a positive dentry is found, the inode_lock can be avoided altogether and it can stay in rcuwalk mode for the last step_into. - Expose the 64 bit mount id via name_to_handle_at() Now that we provide a unique 64-bit mount ID interface in statx(2), we can now provide a race-free way for name_to_handle_at(2) to provide a file handle and corresponding mount without needing to worry about racing with /proc/mountinfo parsing or having to open a file just to do statx(2). While this is not necessary if you are using AT_EMPTY_PATH and don't care about an extra statx(2) call, users that pass full paths into name_to_handle_at(2) need to know which mount the file handle comes from (to make sure they don't try to open_by_handle_at a file handle from a different filesystem) and switching to AT_EMPTY_PATH would require allocating a file for every name_to_handle_at(2) call - Add a per dentry expire timeout to autofs There are two fairly well known automounter map formats, the autofs format and the amd format (more or less System V and Berkeley). Some time ago Linux autofs added an amd map format parser that implemented a fair amount of the amd functionality. This was done within the autofs infrastructure and some functionality wasn't implemented because it either didn't make sense or required extra kernel changes. The idea was to restrict changes to be within the existing autofs functionality as much as possible and leave changes with a wider scope to be considered later. One of these changes is implementing the amd options: 1) "unmount", expire this mount according to a timeout (same as the current autofs default). 2) "nounmount", don't expire this mount (same as setting the autofs timeout to 0 except only for this specific mount).
3) "utimeout=<seconds>", expire this mount using the specified timeout (again same as setting the autofs timeout but only for this mount) To implement these options per-dentry expire timeouts need to be implemented for autofs indirect mounts. This is because all map keys (mounts) for autofs indirect mounts use an expire timeout stored in the autofs mount super block info structure and all indirect mounts use the same expire timeout. Fixes: - Fix missing fput for FSCONFIG_SET_FD in autofs - Use param->file for FSCONFIG_SET_FD in coda - Delete the 'fs/netfs' proc subtree when netfs module exits - Make sure that struct uid_gid_map fits into a single cacheline - Don't flush in-flight wb switches for superblocks without cgroup writeback - Correct the idmapping mount example in the idmapping documentation - Fix a race between evict_inodes() and find_inode() and iput() - Refine the show_inode_state() macro definition in writeback code - Prevent dump_mapping() from accessing invalid dentry.d_name.name - Show actual source for debugfs in /proc/mounts - Annotate data-race of busy_poll_usecs in eventpoll - Don't WARN for racy path_noexec check in exec code - Handle OOM on mnt_warn_timestamp_expiry() - Fix some spelling in the iomap design documentation - Fix typo in procfs comment - Fix typo in fs/namespace.c comment Cleanups: - Add the VFS git tree to the MAINTAINERS file - Move FMODE_UNSIGNED_OFFSET to fop_flags freeing up another f_mode bit in struct file bringing us to 5 free f_mode bits - Remove the __I_DIO_WAKEUP bit from i_state flags as we can simplify the wait mechanism - Remove the unused path_put_init() helper - Replace a __u32 with u32 for s_fsnotify_mask as __u32 is uapi specific - Replace the unsigned long i_state member with a u32 i_state member in struct inode freeing up 4 bytes in struct inode.
Instead of using the bit based wait apis we're now using the var event apis and using the individual bytes of the i_state member to wait on state changes - Explain how per-syscall AT_* flags should be allocated - Use in_group_or_capable() helper to simplify the posix acl mode update code - Switch to LIST_HEAD() in fsync_buffers_list() to simplify the code - Removed comment about d_rcu_to_refcount() as that function doesn't exist anymore - Add kernel documentation for lookup_fast() - Don't re-zero eventpoll fields - Remove outdated comment after close_fd() - Fix imprecise wording in comment about the pipe filesystem - Drop GFP_NOFAIL mode from alloc_page_buffers - Missing blank line warnings and struct declaration improved in file_table - Annotate struct poll_list with __counted_by() - Remove the unused read parameter in percpu-rwsem - Remove linux/prefetch.h include from direct-io code - Use kmemdup_array instead of kmemdup for multiple allocation in mnt_idmapping code - Remove unused mnt_cursor_del() declaration Performance tweaks: - Dodge smp_mb in break_lease and break_deleg in the common case - Only read fops once in fops_{get,put}() - Use RCU in ilookup() - Elide smp_mb in iversion handling in the common case - Drop one lock trip in evict()" * tag 'vfs-6.12.misc' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (58 commits) uidgid: make sure we fit into one cacheline proc: Fix typo in the comment fs/pipe: Correct imprecise wording in comment fhandle: expose u64 mount id to name_to_handle_at(2) uapi: explain how per-syscall AT_* flags should be allocated fs: drop GFP_NOFAIL mode from alloc_page_buffers writeback: Refine the show_inode_state() macro definition fs/inode: Prevent dump_mapping() accessing invalid dentry.d_name.name mnt_idmapping: Use kmemdup_array instead of kmemdup for multiple allocation netfs: Delete subtree of 'fs/netfs' when netfs module exits fs: use LIST_HEAD() to simplify code inode: make i_state a u32 inode: port __I_LRU_ISOLATING 
to var event vfs: fix race between evice_inodes() and find_inode()&iput() inode: port __I_NEW to var event inode: port __I_SYNC to var event fs: reorder i_state bits fs: add i_state helpers MAINTAINERS: add the VFS git tree fs: s/__u32/u32/ for s_fsnotify_mask ...	2024-09-16 08:35:09 +02:00
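The open-then-retry logic that the F_CREATED_QUERY entry above describes as the existing userspace approach can be sketched roughly as follows. This is an illustrative userspace sketch only, not code from the series: the function name `open_report_created` and the retry bound of 8 are assumptions, and the new fcntl() itself is deliberately not used here.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical helper: open `path`, creating it if needed, and report
 * through `created` whether this call created the file.
 * Returns a file descriptor, or -1 with errno set.
 */
static int open_report_created(const char *path, mode_t mode, bool *created)
{
	/* Bounded retries (the bound is an assumption): on a dangling
	 * symlink the two opens keep answering ENOENT and EEXIST in turn. */
	for (int attempt = 0; attempt < 8; attempt++) {
		/* Step 1: try without O_CREAT | O_EXCL. */
		int fd = open(path, O_RDWR);
		if (fd >= 0) {
			*created = false;
			return fd;
		}
		if (errno != ENOENT)
			return -1;

		/* Step 2: the file was missing, so try to create it exclusively. */
		fd = open(path, O_RDWR | O_CREAT | O_EXCL, mode);
		if (fd >= 0) {
			*created = true;
			return fd;
		}
		if (errno != EEXIST)
			return -1;
		/* Someone else created it in between (or it is a dangling
		 * symlink): go around again. */
	}
	return -1; /* errno is still EEXIST from the last attempt */
}
```

The dangling-symlink corner case shows up here as the loop running out of attempts, which is the kind of awkwardness the commit message gives as motivation for a dedicated fcntl().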
Thorsten Blum	fa1ab1b466	bcachefs: Annotate bch_replicas_entry_{v0,v1} with __counted_by() Add the __counted_by compiler attribute to the flexible array members devs to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Increment nr_devs before adding a new device to the devs array and adjust the array indexes accordingly. Add a helper macro for adding a new device. In bch2_journal_read(), explicitly set nr_devs to 0. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
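Why the count is incremented before the array write, as the entry above describes, can be illustrated with a small userspace sketch. The struct and helper names below (`replicas_sketch`, `replicas_sketch_add_dev`) and the `COUNTED_BY` wrapper macro are hypothetical stand-ins, not the bcachefs definitions; the attribute is only applied when the compiler reports support for it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Apply counted_by only where the compiler supports it (assumption:
 * a no-op fallback is acceptable for this illustration). */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

/* Hypothetical entry with a flexible array bounded by nr_devs. */
struct replicas_sketch {
	uint8_t nr_devs;
	uint8_t devs[] COUNTED_BY(nr_devs);
};

/*
 * Bump the count first, then store into the last slot: a bounds checker
 * that trusts nr_devs would otherwise see a write one past the end.
 */
static void replicas_sketch_add_dev(struct replicas_sketch *e, uint8_t dev)
{
	e->nr_devs++;
	e->devs[e->nr_devs - 1] = dev;
}
```

The caller is still responsible for allocating enough room for the flexible array; the annotation only tells the checker which field holds the current element count.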
Hongbo Li	c24adfa0df	bcachefs: support idmap mounts We enable idmapped mounts for bcachefs. Here, we just pass down the user_namespace argument from the VFS methods to the relevant helpers. The idmap test in bcachefs is as follows: ``` 1. losetup /dev/loop1 bcachefs.img 2. ./bcachefs format /dev/loop1 3. mount -t bcachefs /dev/loop1 /mnt/bcachefs/ 4. ./mount-idmapped --map-mount b:0:1000:1 /mnt/bcachefs /mnt/idmapped1/ ll /mnt/bcachefs total 2 drwx------. 2 root root 0 Jun 14 14:10 lost+found -rw-r--r--. 1 root root 1945 Jun 14 14:12 profile ll /mnt/idmapped1/ total 2 drwx------. 2 1000 1000 0 Jun 14 14:10 lost+found -rw-r--r--. 1 1000 1000 1945 Jun 14 14:12 profile ``` Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Thorsten Blum	86e92eeeb2	bcachefs: Annotate struct bch_xattr with __counted_by() Add the __counted_by compiler attribute to the flexible array member x_name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	2c6a7bff2a	bcachefs: Switch gc bucket array to a genradix A user with a 30 tb device is overflowing the INT_MAX limit on vmalloc allocations... Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	a803fa551d	bcachefs: darray: convert to alloc_hooks() better memory allocation profiling support Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Chen Yufan	848c3ff882	bcachefs: Convert to use jiffies macros Use jiffies macros instead of using jiffies directly to handle wraparound. Signed-off-by: Chen Yufan <chenyufan@vivo.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
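The wraparound problem that the jiffies macros address can be shown with a small userspace sketch. `sketch_time_after` below is a hypothetical stand-in that mirrors the signed-difference idea behind the kernel's `time_after()`; it is not the kernel macro itself, and the conversion to `long` assumes the usual two's-complement behaviour of GCC and Clang.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Wraparound-safe "a is after b": take the unsigned difference and
 * interpret it as signed, so a counter that has wrapped past zero
 * still compares as later.
 */
static bool sketch_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Naive comparison on the raw counter values; wrong across a wrap. */
static bool naive_after(unsigned long a, unsigned long b)
{
	return a > b;
}
```

With `b = ULONG_MAX - 5` and `a = b + 10` (which wraps to 4), the naive comparison says `a` is not after `b`, while the signed-difference form says it is.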
Alan Huang	94932a0842	bcachefs: Refactor bch2_bset_fix_lookup_table bch2_bset_fix_lookup_table is too complicated to be easily understood, and the comment "l now > where" there is also incorrect when where == t->end_offset. This patch therefore refactors the function; the idea is that when where >= rw_aux_tree(b, t)[t->size - 1].offset, we don't need to adjust the rw aux tree. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	f1625637b8	bcachefs: Assert that we don't lock nodes when !trans->locked We rely on the trans->locked to know if a trans has nodes locked for assertions about deadlocks; there can't be more than one trans in the same process that is locked. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Matthew Wilcox (Oracle)	a8cdf0ff46	bcachefs: Do not check folio_has_private() folio_has_private() is an attractive nuisance; filesystem authors generally don't realise that it actually checks two flags (one of which is never set by bcachefs). There's no need to check the private flag at all; for folios owned by bcachefs, we know that folio->private is NULL when the private flag is clear and non-NULL when the private flag is set. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	fdbc9c390a	bcachefs: bch2_time_stats_reset() Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	b36f679c99	bcachefs: Drop memalloc_nofs_save() in bch2_btree_node_mem_alloc() It's really not needed: the only locks used here are the btree cache lock, which we drop for GFP_WAIT allocations, and btree node locks - but we also drop those for GFP_WAIT allocations. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Youling Tang	42386fbaee	bcachefs: Simplify bch2_xattr_emit() implementation Use helper functions to make code more readable. Similar to commit `a5488f2983` ("fs: simplify ->listxattr() implementation") Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Youling Tang	d3f30f1629	bcachefs: drop unused posix acl handlers Remove struct nop_posix_acl_{access,default} for bcachefs, which doesn't depend on the xattr handler in its inode->i_op->listxattr() method in any way. There's nothing more to do than to simply remove the handler. It's been effectively unused ever since we introduced the new posix acl api. See [1] for details. Link [1]: https://patchwork.kernel.org/project/linux-fsdevel/cover/20230125-fs-acl-remove-generic-xattr-handlers-v3-0-f760cc58967d@kernel.org/ Signed-off-by: Youling Tang <tangyouling@kylinos.cn> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Alan Huang	5935bf3341	bcachefs: Remove unused parameter iter here is unused, remove it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Alan Huang	3130303bd9	bcachefs: Remove the prev array stuff After reducing the search range when building the aux tree, the prev array stuff is no longer useful, so remove it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Alan Huang	5d01101284	bcachefs: Minimize the search range used to calculate the mantissa When the search key's mantissa is larger than the node i's, we know that the search key is larger than the first key of the cacheline corresponding to node i, so that when we are calculating the mantissa of right side nodes of node i, the left side of the search range can be the first key of node i. Once the search range is minimized, the mantissa we are calculating can have more useful bits, thus reducing slow-path comparisons. Besides, we can now remove all the prev array stuff. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Alan Huang	288a6690eb	bcachefs: Convert open-coded extra computation to helper This patch replaces the open-coded extra computation with eytzinger1_extra. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Alan Huang	6cca8319e0	bcachefs: Remove dead code in __build_ro_aux_tree This logic is no longer useful since commit `3ce8b463e3` ("bcachefs: kill bset_tree->max_key"), so remove it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Alan Huang	89ae9a04b2	bcachefs: Remove unused parameter of bkey_mantissa_bits_dropped The idx parameter of bkey_mantissa_bits_dropped is unused, remove it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Alan Huang	2a463e948a	bcachefs: Remove unused parameter of bkey_mantissa The idx parameter of bkey_mantissa became unused since commit `b904a79918` ("bcachefs: Go back to 16 bit mantissa bkey floats"), so remove it. Signed-off-by: Alan Huang <mmpgouride@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	59a1a62a42	bcachefs: bch2_sb_nr_devices() factoring out a helper Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:49 -04:00
Kent Overstreet	11827dba08	bcachefs: trivial open_bucket_add_buckets() cleanup Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:48 -04:00
Kent Overstreet	c7652f253a	bcachefs: promote_whole_extents is now a normal option Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:48 -04:00
Kent Overstreet	cfd273f1ae	bcachefs: Move rebalance_status out of sysfs/internal Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:48 -04:00
Julian Sun	26c0900d85	bcachefs: remove the unused parameter in macro bkey_crc_next In the macro definition of bkey_crc_next, five parameters were accepted, but only four of them were used. Let's remove the unused one. The patch has only passed compilation tests, but it should be fine. Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:48 -04:00
Julian Sun	4d05a083b3	bcachefs: fix macro definition allocate_dropping_locks The macro allocate_dropping_locks accepts a parameter _trans, but it was not used, rather the variable trans was directly used, which may be a local variable inside a function that calls the macros. Signed-off-by: Julian Sun <sunjunchao2870@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>	2024-09-09 09:41:48 -04:00
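The macro bug described in the entry above (a macro that accepts `_trans` but silently uses whatever `trans` is in scope at the call site) can be reproduced in a few lines. The struct and macro names here are hypothetical and unrelated to the real allocate_dropping_locks definition; they only illustrate the capture.

```c
/* Hypothetical context object standing in for a transaction. */
struct ctx_sketch {
	int restarts;
};

/* Buggy: the _trans parameter is ignored, and the macro picks up
 * whichever variable named `trans` exists where it is expanded. */
#define bump_buggy(_trans)	((trans)->restarts++)

/* Fixed: the macro only touches the object it was handed. */
#define bump_fixed(_trans)	((_trans)->restarts++)
```

If a caller has a local `trans` and passes a different pointer, `bump_buggy(other)` still compiles but updates `trans`, which is why this class of bug is easy to miss until a call site uses a differently named variable.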

4167 Commits