Different versions of the same inode (same inode number, different
snapshot ID) must have the same hash seed and type - lookups require
this, since they see keys from different snapshots simultaneously.
To repair we only need to make the inodes consistent, hash_check_key()
will do the rest.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This helped with discovering some filesystem corruption fsck has having
trouble with: the str_hash type had gotten flipped on one snapshot's
version of an inode.
All versions of a given inode number have the same hash seed and hash
type, since lookups will be done with a single hash/seed and type and
see dirents/xattrs from multiple snapshots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The options parse in get_tree will split the options buffer, it will
get the empty string for last one by strsep(). After commit
ea0eeb89b1d5 ("bcachefs: reject unknown mount options") is merged,
unknown mount options is not allowed (here is empty string), and this
causes this errors. This can be reproduced just by the following steps:
bcachefs format /dev/loop
mount -t bcachefs -o metadata_target=loop1 /dev/loop1 /mnt/bcachefs/
Fixes: ea0eeb89b1d5 ("bcachefs: reject unknown mount options")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When call show_options in bcachefs, the options buffer is appeneded
to the seq variable. In fact, it requires an additional comma to be
appended first. This will affect the remount process when reading
existing mount options.
Fixes: 9305cf91d05e ("bcachefs: bch2_opts_to_text()")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Found by generic/299: When we have to truncate a write due to -ENOSPC,
we may have to read in the folio we're writing to if we're now no longer
doing a complete write to a !uptodate folio.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_folio_reservation_get_partial(), on partial success, will now
return a reservation that's aligned to the filesystem blocksize.
This is a partial fix for fstests generic/299 - fio verify is badly
behaved in the presence of short writes that aren't aligned to its
blocksize.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_disk_reservation_put() zeroes out the reservation - oops.
This fixes a disk reservation leak when getting a quota reservation
returned an error.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Using commit_do() to call alloc_sectors_start_trans() breaks when we're
randomly injecting transaction restarts - the restart in the commit
causes us to leak the lock that alloc_sectorS_start_trans() takes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bch2_bucket_io_time_reset() doesn't need to succeed, which is why it
didn't previously retry on transaction restart - but we're now treating
these as errors.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is ugly:
We may discover in alloc_write_key that the data type we calculated is
wrong, because BCH_DATA_need_discard is checked/set elsewhere, and the
disk accounting counters we calculated need to be updated.
But bch2_alloc_key_to_dev_counters(..., BTREE_TRIGGER_gc) is not safe
w.r.t. transaction restarts, so we need to propagate the fixup back to
our gc state in case we take a transaction restart.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
this one is fairly harmless since the invalidate worker will just run
again later if it needs to, but still worth fixing
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This should be impossible to hit in practice; the first lookup within a
transaction won't return a restart due to lock ordering, but we're
adding fault injection for transaction restarts and shaking out bugs.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
sysfs warns if we're removing a symlink from a directory that's no
longer in sysfs; this is triggered by fstests generic/730, which
simulates hot removal of a block device.
This patch is however not a correct fix, since checking
kobj->state_in_sysfs on a kobj owned by another subsystem is racy.
A better fix would be to add the appropriate check to
sysfs_remove_link() - and sysfs_create_link() as well.
But kobject_add_internal()/kobject_del() do not as of today have locking
that would support that.
Note that the block/holder.c code appears to be subject to this race as
well.
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When creating a new stripe, we may reuse an existing stripe that has
some empty and some nonempty blocks.
Generally, the existing stripe won't change underneath us - except for
block sector counts, which we copy to the new key in
ec_stripe_key_update.
But the device removal path can now invalidate stripe pointers to a
device, and that can race with stripe reuse.
Change ec_stripe_key_update() to check for and resolve this
inconsistency.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We were checking that the alloc key was for a valid device, but not a
valid bucket.
This is the upgrade path from versions prior to bcachefs being mainlined.
Reported-by: syzbot+a1b59c8e1a3f022fd301@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Check if we have snapshot_trees or subvolumes that refer to the snapshot
node being reconstructed, and use them.
With this, the kill_btree_root test that blows away the snapshots btree
now passes, and we're able to successfully reconstruct.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_TRANS_COMMIT_journal_reclaim without BCH_WATERMARK_reclaim means
"return an error if low on journal space" - but accounting replay must
succeed.
Fixes https://github.com/koverstreet/bcachefs/issues/656
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Like how we already do when the allocator seems to be stuck, check if
we're waiting too long for a journal reservation and print some debug
info.
This is specifically to track down
https://github.com/koverstreet/bcachefs/issues/656
which is showing up in userspace where we don't have sysfs/debugfs to
get the journal debug info.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Add a closure version of wait_event_timeout(), with the same semantics.
The closure version is useful because unlike wait_event(), it allows
blocking code to run in the conditional expression.
Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds a bounds check to the bch2_opt_to_text function to prevent
NULL pointer dereferences when accessing the opt->choices array. This
ensures that the index used is within valid bounds before dereferencing.
The new version enhances the readability.
Reported-and-tested-by: syzbot+37186860aa7812b331d5@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=37186860aa7812b331d5
Signed-off-by: Mohammed Anees <pvmohammedanees2003@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We will get this if we wake up first:
Kernel panic - not syncing: btree_node_write_done leaked btree_trans
since there are still transactions waiting for cycle detectors after
BTREE_NODE_write_in_flight is cleared.
Signed-off-by: Alan Huang <mmpgouride@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
- Fix failure to validate that accounting replicas entries point to
valid devices: this wasn't a real bug since they'd be cleaned up by
GC, but is still something we should know about
- Fix failure to validate that dev_data_type entries point to valid
devices: this does fix a real bug, since bch2_accounting_read() would
then try to copy the counters to that device and pop an inconsistent
error when the device didn't exist
- Remove accounting entries that are zeroed or invalid: if we're not
validating them we need to get rid of them: they might not exist in
the superblock, so we need the to trigger the superblock mark path
when they're readded.
This fixes the replication.ktest rereplicate test, which was failing
with "superblock not marked for replicas..."
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
fsck can now correctly check if inodes in interior snapshot nodes are
open/in use.
- Tweak the vfs inode rhashtable so that the subvolume ID isn't hashed,
meaning inums in different subvolumes will hash to the same slot. Note
that this is a hack, and will cause problems if anyone ever has the
same file in many different snapshots open all at the same time.
- Then check if any of those subvolumes is a descendent of the snapshot
ID being checked
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
There's an inherent race in taking a snapshot while an unlinked file is
open, and then reattaching it in the child snapshot.
In the interior snapshot node the file will appear unlinked, as though
it should be deleted - it's not referenced by anything in that snapshot
- but we can't delete it, because the file data is referenced by the
child snapshot.
This was being handled incorrectly with
propagate_key_to_snapshot_leaves() - but that doesn't resolve the
fundamental inconsistency of "this file looks like it should be deleted
according to normal rules, but - ".
To fix this, we need to fix the rule for when an inode is deleted. The
previous rule, ignoring snapshots (there was no well-defined rule
for with snapshots) was:
Unlinked, non open files are deleted, either at recovery time or
during online fsck
The new rule is:
Unlinked, non open files, that do not exist in child snapshots, are
deleted.
To make this work transactionally, we add a new inode flag,
BCH_INODE_has_child_snapshot; it overrides BCH_INODE_unlinked when
considering whether to delete an inode, or put it on the deleted list.
For transactional consistency, clearing it handled by the inode trigger:
when deleting an inode we check if there are parent inodes which can now
have the BCH_INODE_has_child_snapshot flag cleared.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
BCH_INODE_i_size_dirty dates from before we had logged operations for
truncate (as well as finsert) - it hasn't been needed since before
bcachefs was mainlined.
BCH_INODE_i_sectors_dirty hasn't been needed since we started always
updating i_sectors transactionally - it's been unused for even longer.
BCH_INODE_backptr_untrusted also hasn't been used since prior to
mainlining; when unlinking a hardling, we zero out the backpointer
fields if they're for the dirent being removed.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we find an unreachable inode, we now reattach it in the oldest
version that needs to be reattached (thus avoiding redundant work
reattaching every single version), and we now fix up inode -> dirent
backpointers in newer versions as needed - or white out the reattaching
dirent in newer versions, if the newer version isn't supposed to be
reattached.
This results in the second verify fsck now passing cleanly after
repairing on a user-provided filesystem image with thousands of
different snapshots.
Reported-by: Christopher Snowhill <chris@kode54.net>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
With inode backpointers, we can write a very simple
check_unreachable_inodes() pass that only looks for non-unlinked inodes
that are missing backpointers, and reattaches them.
This simplifies check_directory_structure() so that it's now only
checking for directory structure loops,
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Initially it was thought that we just wanted to ignore errors from
logged op replay, but it turns out we do need to catch -EROFS, or we'll
go into an infinite loop.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
These shouldn't always be fatal errors - logged op resume, in
particular, and we want it as a parameter there.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
It was initially believed that it would be better to be explicit about
the snapshot we're updating when writing inodes in fsck; however, it
turns out that passing around the snapshot separately is more error
prone and we're usually updating the inode in the same snapshow we read
it from.
This is different from normal filesystem paths, where we do the update
in the snapshot of the subvolume we're in.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We want to check for this early so it can be reattached if necessary in
check_unreachable_inodes(); better than letting it be deleted and having
the children reattached, losing their filenames.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>