Commit Graph

317 Commits

Author SHA1 Message Date
Kent Overstreet
cd7c2d3d8d bcachefs: Make sure BCH_FS_FSCK_DONE gets set
If we're not running fsck we still want to set BCH_FS_FSCK_DONE, so that
bch2_fsck_err() calls are interpreted as bch2_inconsistent_error()
calls().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:21 -04:00
Kent Overstreet
dfd41fb9f2 bcachefs: Fix race between btree updates & journal replay
Add a flag to indicate whether a journal replay key has been
overwritten, and set/test it with appropriate btree locks held.

This fixes a race between the allocator - invalidating buckets, and
doing btree updates - and journal replay, which before this patch could
clobber the allocator thread's update with an older version of the key
from the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:21 -04:00
Kent Overstreet
528b18e6d1 bcachefs: bch2_journal_entry_to_text()
This adds a _to_text() pretty printer for journal entries - including
every subtype - which will shortly be used by the 'bcachefs
list_journal' subcommand.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:21 -04:00
Kent Overstreet
5ba2fd1145 bcachefs: Journal replay does't resort main list of keys
The upcoming BTREE_ITER_WITH_JOURNAL patch will require journal keys to
stay in sorted order, so the btree iterator code can overlay them over
btree keys.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
d93cf6858f bcachefs: Run scan_old_btree_nodes after version upgrade
In the recovery path, we scan for old btree nodes if we don't have
certain compat bits set. If we do this, we should be doing it after we
upgraded to the newest on disk format.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
04f0f77df2 bcachefs: Delete some obsolete journal_seq_blacklist code
Since metadata version bcachefs_metadata_version_btree_ptr_sectors_written,
we haven't needed the journal seq blacklist mechanism for ignoring
blacklisted btree node writes - we now only need it for ignoring journal
entries that were written after the newest flush journal entry, and then
we only need to keep those blacklist entries around until journal replay
is finished.

That means we can delete the code for scanning btree nodes to GC
journal_seq_blacklist entries.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:20 -04:00
Kent Overstreet
e75b2d4c1c bcachefs: bch2_journal_key_insert() no longer transfers ownership
bch2_journal_key_insert() used to assume that the key passed to it was
allocated with kmalloc(), and on success took ownership. This patch
deletes that behaviour, making it more similar to
bch2_trans_update()/bch2_trans_commit().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
c64740ef27 bcachefs: Don't start allocator threads too early
If the allocator threads start before journal replay has finished
replaying alloc keys, journal replay might overwrite the allocator's
btree updates.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
77170d0dd7 bcachefs: bch2_bucket_alloc_new_fs() no longer depends on bucket marks
Now that bch2_bucket_alloc_new_fs() isn't looking at bucket marks to
decide what buckets are eligible to allocate, we can clean up the
filesystem initialization and device add paths. Previously, we had to
use ancient code to mark superblock/journal buckets in the in memory
bucket marks as we allocated them, and then zero that out and re-do that
marking using the newer transational bucket mark paths. Now, we can
simply delete the in-memory bucket marking.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
09943313d7 bcachefs: Rewrite bch2_bucket_alloc_new_fs()
This changes bch2_bucket_alloc_new_fs() to a simple bump allocator that
doesn't need to use the in memory bucket array, part of a larger patch
series to entirely get rid of the in memory bucket array, except for
gc/fsck.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:19 -04:00
Kent Overstreet
fb0e480872 bcachefs: bch2_alloc_write()
This adds a new helper that much like the one we have for inode updates,
that allocates the packed alloc key, packs it and calls
bch2_trans_update.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:18 -04:00
Kent Overstreet
990d42d187 bcachefs: Split out struct gc_stripe from struct stripe
We have two radix trees of stripes - one that mirrors some information
from the stripes btree in normal operation, and another that GC uses to
recalculate block usage counts.

The normal one is now only used for finding partially empty stripes in
order to reuse them - the normal stripes radix tree and the GC stripes
radix tree are used significantly differently, so this patch splits them
into separate types.

In an upcoming patch we'll be replacing c->stripes with a btree that
indexes stripes by the order we want to reuse them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:18 -04:00
Kent Overstreet
9be1efe9c5 bcachefs: Fix error reporting from bch2_journal_flush_seq
- bch2_journal_halt() was unconditionally overwriting j->err_seq, the
  sequence number that we failed to write
- journal_write_done was updating seq_ondisk and flushed_seq_ondisk even
  for writes that errored, which broke the way bch2_journal_flush_seq_async()
  locklessly checked for completions.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:17 -04:00
Kent Overstreet
0a84a066f9 bcachefs: Also log device name in userspace
Change log messages in userspace to be closer to what they are in kernel
space, and include the device name - it's also useful in userspace.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:17 -04:00
Kent Overstreet
697e546fb3 bcachefs: Refactor journal replay code
This consolidates duplicated code in journal replay - it's only a few
flags that are different for replaying alloc keys.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:17 -04:00
Kent Overstreet
3e52c22255 bcachefs: Add journal_seq to inode & alloc keys
Add fields to inode & alloc keys that record the journal sequence number
when they were most recently modified.

For alloc keys, this is needed to know what journal sequence number we
have to flush before the bucket can be reused. Currently this is tracked
in memory, but we'll be getting rid of the in memory bucket array.

For inodes, this is needed for fsync when the inode has been evicted
from the vfs cache. Currently we use a bloom filter per outstanding
journal buf - but that mechanism has been broken since we added the
ability to not issue a flush/fua for every journal write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:16 -04:00
Kent Overstreet
904823de49 bcachefs: Convert bch2_mark_key() to take a btree_trans *
This helps to unify the interface between bch2_mark_key() and
bch2_trans_mark_key() - and it also gives access to the journal
reservation and journal seq in the mark_key path.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:15 -04:00
Kent Overstreet
8325cd1ed4 bcachefs: Don't do upgrades in nochanges mode
nochanges mode is often used for getting data off of otherwise
nonrecoverable filesystems, which is often because of errors hit during
fsck.

Don't force version upgrade & fsck in nochanges mode, so that it's more
likely to mount.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:15 -04:00
Kent Overstreet
4db650277d bcachefs: Subvol dirents are now only visible in parent subvol
This changes the on disk format for dirents that point to subvols so
that they also record the subvolid of the parent subvol, so that we can
filter them out in other subvolumes.

This also updates the dirent code to do that filtering, and in
particular tweaks the rename code - we need to ensure that there's only
ever one dirent (counting multiplicities in different snapshots) that
point to a subvolume.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:14 -04:00
Kent Overstreet
bfe88863cf bcachefs: New on disk format to fix reflink_p pointers
We had a bug where reflink_p pointers weren't being initialized to 0,
and when we started using the second word, things broke badly.

This patch revs the on disk format version and adds cleanup code to zero
out the second word of reflink_p pointers before we start using it.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:14 -04:00
Kent Overstreet
0476fa948e bcachefs: Rev the on disk format version for snapshots
This will cause the compat code to be run that creates entries in the
subvolumes and snapshots btrees.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:13 -04:00
Kent Overstreet
42d237320e bcachefs: Snapshot creation, deletion
This is the final patch in the patch series implementing snapshots.
This patch implements two new ioctls that work like creation and
deletion of directories, but fancier.

 - BCH_IOCTL_SUBVOLUME_CREATE, for creating new subvolumes and snaphots
 - BCH_IOCTL_SUBVOLUME_DESTROY, for deleting subvolumes and snapshots

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:13 -04:00
Kent Overstreet
6fed42bb77 bcachefs: Plumb through subvolume id
To implement snapshots, we need every filesystem btree operation (every
btree operation without a subvolume) to start by looking up the
subvolume and getting the current snapshot ID, with
bch2_subvolume_get_snapshot() - then, that snapshot ID is used for doing
btree lookups in BTREE_ITER_FILTER_SNAPSHOTS mode.

This patch adds those bch2_subvolume_get_snapshot() calls, and also
switches to passing around a subvol_inum instead of just an inode
number.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:12 -04:00
Kent Overstreet
14b393ee76 bcachefs: Subvolumes, snapshots
This patch adds subvolume.c - support for the subvolumes and snapshots
btrees and related data types and on disk data structures. The next
patches will start hooking up this new code to existing code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:12 -04:00
Kent Overstreet
67e0dd8f0d bcachefs: btree_path
This splits btree_iter into two components: btree_iter is now the
externally visible componont, and it points to a btree_path which is now
reference counted.

This means we no longer have to clone iterators up front if they might
be mutated - btree_path can be shared by multiple iterators, and cloned
if an iterator would mutate a shared btree_path. This will help us use
iterators more efficiently, as well as slimming down the main long lived
state in btree_trans, and significantly cleans up the logic for iterator
lifetimes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:11 -04:00
Kent Overstreet
78cf784eaa bcachefs: Further reduce iter->trans usage
This is prep work for splitting btree_path out from btree_iter -
btree_path will not have a pointer to btree_trans.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:11 -04:00
Brett Holman
8dd6ed9451 bcachefs: add progress stats to sysfs
This adds progress stats to sysfs for copygc, rebalance, recovery, and the
cmd_job ioctls.

Signed-off-by: Brett Holman <bholman.devel@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:10 -04:00
Kent Overstreet
877da05ffb bcachefs: Zero out mem_ptr field in btree ptr keys from journal replay
This fixes a bad ptr deref on recovery from unclean shutdown in
bch2_btree_node_get_noiter().

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:10 -04:00
Kent Overstreet
9f1833cadd bcachefs: Update btree ptrs after every write
This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have lead to some significant
metadata corruption on multi device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.

But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:08 -04:00
Kent Overstreet
8c3f6da9fc bcachefs: Improve iter->should_be_locked
Adding iter->should_be_locked introduced a regression where it ended up
not being set on the iterator passed to bch2_btree_update_start(), which
is definitely not what we want.

This patch requires it to be set when calling bch2_trans_update(), and
adds various fixups to make that happen.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:06 -04:00
Kent Overstreet
90d22a660a bcachefs: Fix overflow in journal_replay_entry_early
If filesystem on disk was used by a version with a larger BCH_DATA_NR
thas the currently running version, we don't want this to cause a buffer
overrun.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:06 -04:00
Kent Overstreet
c0ebe3e48c bcachefs: Assorted endianness fixes
Found by sparse

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
2023-10-22 17:09:05 -04:00
Kent Overstreet
3a402c8dab bcachefs: Fix some refcounting bugs
We really need debug mode assertions that ca->ref and ca->io_ref are
used correctly.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:03 -04:00
Kent Overstreet
ac1019d32b bcachefs: Clean up bch2_btree_and_journal_walk()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
aae15aafcd bcachefs: New and improved topology repair code
This splits out btree topology repair into a separate pass, and makes
some improvements:
 - When we have to pick which of two overlapping nodes to drop keys
   from, we use the btree node header sequence number to preserve the
   newer node

 - the gc code has been changed so that it doesn't bail out if we're
   continuing/ignoring on fsck error - this way the dump tool can skip
   running the repair pass but still walk all reachable metadata

 - add a new superblock flag indicating when a filesystem is known to
   have btree topology issues, and the topology repair pass should be
   run

 - changing the start/end of a node might mean keys in that node have to
   be deleted: this patch handles that better by splitting it out into a
   separate function and running it explicitly in the topology repair
   code, previously those keys were only being dropped when the btree
   node was read in.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
4932e07ea0 bcachefs: Fix key cache assertion
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:02 -04:00
Kent Overstreet
d62ab355d7 bcachefs: Fix bch2_trans_mark_dev_sb()
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:00 -04:00
Kent Overstreet
423300e8fe bcachefs: BCH_BEATURE_atomic_nlink is obsolete
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:09:00 -04:00
Kent Overstreet
8a85b20cd7 bcachefs: Inode backpointers are now required
This lets us simplify fsck quite a bit, which we need for making fsck
snapshot aware.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:59 -04:00
Kent Overstreet
3a14d58e7b bcachefs: Drop bch2_fsck_inode_nlink()
We've had BCH_FEATURE_atomic_nlink for quite some time, we can drop this
now.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:59 -04:00
Kent Overstreet
e751c01a8e bcachefs: Start using bpos.snapshot field
This patch starts treating the bpos.snapshot field like part of the key
in the btree code:

* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
  and xattrs) now always have their snapshot field set to U32_MAX

The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controlls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.

We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and alsways U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).

This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:57 -04:00
Kent Overstreet
4cf91b0270 bcachefs: Split out bpos_cmp() and bkey_cmp()
With snapshots, we're going to need to differentiate between comparisons
that should and shouldn't include the snapshot field. bpos_cmp is now
the comparison function that does include the snapshot field, used by
core btree code.

Upper level filesystem code generally does _not_ want to compare against
the snapshot field - that code wants keys to compare as equal even when
one of them is in an ancestor snapshot.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:57 -04:00
Kent Overstreet
73590619ec bcachefs: Don't unconditially version_upgrade in initialize
This is mkfs's job. Also, clean up the handling of feature bits some.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:56 -04:00
Kent Overstreet
c7bb769c81 bcachefs: __bch2_trans_get_iter() refactoring, BTREE_ITER_NOT_EXTENTS
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
7d6f07edc2 bcachefs: Fix compat code for superblock
The bkey compat code wasn't being run for btree roots in the superblock
clean section - this patch fixes it to use the journal entry validate
code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
41f8b09edc bcachefs: Rename BTREE_ID enums for consistency with other enums
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
f2785955bb bcachefs: Kill support for !BTREE_NODE_NEW_EXTENT_OVERWRITE()
bcachefs has been aggressively migrating filesystems and btree nodes to
the new format for quite some time - this shouldn't affect anyone
anymore, and lets us delete a _lot_ of code. Also, it frees up
KEY_TYPE_discard for a new whiteout key type for snapshots.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:55 -04:00
Kent Overstreet
41e3778636 bcachefs: Bring back metadata only gc
This is useful for the filesystem dump debugging tool - when we're
hitting bugs we want to skip as much of the recovery process as
possible, and the dump tool only needs to know where metadata lives.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:54 -04:00
Kent Overstreet
19dd3172b0 bcachefs: Use x-macros for compat feature bits
This is to generate strings for them, so that we can print them out.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:54 -04:00
Kent Overstreet
e01dacf76c bcachefs: Fix bkey format generation for 32 bit fields
Having a packed format that can represent a field larger than the
unpacked type breaks bkey_packed_successor() assertions - we need to fix this to start using the snapshot filed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:54 -04:00
Kent Overstreet
a4805d6672 bcachefs: Scan for old btree nodes if necessary on mount
We dropped support for !BTREE_NODE_NEW_EXTENT_OVERWRITE but it turned
out there were people who still had filesystems with btree nodes in that
format in the wild. This adds a new compat feature that indicates we've
scanned for and rewritten nodes in the old format, and does that scan at
mount time if the option isn't set.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:54 -04:00
Kent Overstreet
dab9ef0d27 bcachefs: Add error message for some allocation failures
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:53 -04:00
Kent Overstreet
8042b5b715 bcachefs: Extents may now cross btree node boundaries
When snapshots arrive, we won't necessarily be able to arbitrarily split
existis - when we need to split an existing extent, we'll have to check
if the extent was overwritten in child snapshots and if so emit a
whiteout for the split in the child snapshot.

Because extents couldn't span btree nodes previously, journal replay
would sometimes have to split existing extents. That's no good anymore,
but fortunately since extent handling has already been lifted above most
of the btree code there's no real need for that rule anymore.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:53 -04:00
Kent Overstreet
5d428c7c64 bcachefs: Run fsck if BCH_FEATURE_alloc_v2 isn't set
We're using BCH_FEATURE_alloc_v2 to also gate journalling updates to dev
usage - we don't have the code for reconstructing this from buckets
anymore, so we need to run fsck if it's not set.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:53 -04:00
Kent Overstreet
180fb49dea bcachefs: Journal updates to dev usage
This eliminates the need to scan every bucket to regenerate dev_usage at
mount time.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:52 -04:00
Kent Overstreet
2abe542087 bcachefs: Persist 64 bit io clocks
Originally, bcachefs - going back to bcache - stored, for each bucket, a
16 bit counter corresponding to how long it had been since the bucket
was read from. But, this required periodically rescaling counters on
every bucket to avoid wraparound. That wasn't an issue in bcache, where
we'd perodically rewrite the per bucket metadata all at once, but in
bcachefs we're trying to avoid having to walk every single bucket.

This patch switches to persisting 64 bit io clocks, corresponding to the
64 bit bucket timestaps introduced in the previous patch with
KEY_TYPE_alloc_v2.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:52 -04:00
Kent Overstreet
a0b73c1c53 bcachefs: Add (partial) support for fixing btree topology
When we walk the btrees during recovery, part of that is checking that
btree topology is correct: for every interior btree node, its child
nodes should exactly span the range the parent node covers.

Previously, we had checks for this, but not repair code. Now that we
have the ability to do btree updates during initial GC, this patch adds
that repair code.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:52 -04:00
Kent Overstreet
5b593ee172 bcachefs: Add support for doing btree updates prior to journal replay
Some errors may need to be fixed in order for GC to successfully run -
walk and mark all metadata. But we can't start the allocators and do
normal btree updates until after GC has completed, and allocation
information is known to be consistent, so we need a different method of
doing btree updates.

Fortunately, we already have code for walking the btree while overlaying
keys from the journal to be replayed. This patch adds an update path
that adds keys to the list of keys to be replayed by journal replay, and
also fixes up iterators.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:52 -04:00
Kent Overstreet
079663d8ed bcachefs: Kill metadata only gc
This was useful before we had transactional updates to interior btree
nodes - but now, it's just extra unneeded complexity.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:52 -04:00
Kent Overstreet
ac95800629 bcachefs: Factor out bch2_ec_stripes_heap_start()
This fixes a bug where mark and sweep gc incorrectly was clearing out
the stripes heap and causing assertions to fire later - simpler to just
create the stripes heap after gc has finished.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
edfbba58e3 bcachefs: Add btree node prefetching to bch2_btree_and_journal_walk()
bch2_btree_and_journal_walk() walks the btree overlaying keys from the
journal; it was introduced so that we could read in the alloc btree
prior to journal replay being done, when journalling of updates to
interior btree nodes was introduced.

But it didn't have btree node prefetching, which introduced a severe
regression with mount times, particularly on spinning rust. This patch
implements btree node prefetching for the btree + journal walk,
hopefully fixing that.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:51 -04:00
Kent Overstreet
4291a3317f bcachefs: bch2_alloc_write() should be writing for all devices
Alloc info isn't stored on a particular device, it makes no sense to
only be writing it out for rw members - this was causing fsck to not fix
alloc info errors, oops.

Also, make sure we write out alloc info in other repair paths.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
Kent Overstreet
07a1006ae8 bcachefs: Reduce/kill BKEY_PADDED use
With various newer key types - stripe keys, inline data extents - the
old approach of calculating the maximum size of the value is becoming
more and more error prone. Better to switch to bkey_on_stack, which can
dynamically allocate if necessary to handle any size bkey.

In particular we also want to get rid of BKEY_EXTENT_VAL_U64s_MAX.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:50 -04:00
Kent Overstreet
719fe7fb55 bcachefs: Update transactional triggers interface to pass old & new keys
This is needed to fix a bug where we're overflowing iterators within a
btree transaction, because we're updating the stripes btree (to update
block counts) and the stripes btree trigger is unnecessarily updating
the alloc btree - it doesn't need to update the alloc btree when the
pointers within a stripe aren't changing.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:49 -04:00
Kent Overstreet
adbcada43f bcachefs: Don't require flush/fua on every journal write
This patch adds a flag to journal entries which, if set, indicates that
they weren't done as flush/fua writes.

 - non flush/fua journal writes don't update last_seq (i.e. they don't
   free up space in the journal), thus the journal free space
   calculations now check whether nonflush journal writes are currently
   allowed (i.e. are we low on free space, or would doing a flush write
   free up a lot of space in the journal)

 - write_delay_ms, the user configurable option for when open journal
   entries are automatically written, is now interpreted as the max
   delay between flush journal writes (default 1 second).

 - bch2_journal_flush_seq_async is changed to ensure a flush write >=
   the requested sequence number has happened

 - journal read/replay must now ignore, and blacklist, any journal
   entries newer than the most recent flush entry in the journal. Also,
   the way the read_entire_journal option is handled has been improved;
   struct journal_replay now has an entry, 'ignore', for entries that
   were read but should not be used.

 - assorted refactoring and improvements related to journal read in
   journal_io.c and recovery.c

Previously, we'd have to issue a flush/fua write every time we
accumulated a full journal entry - typically the bucket size. Now we
need to issue them much less frequently: when an fsync is requested, or
it's been more than write_delay_ms since the last flush, or when we need
to free up space in the journal. This is a significant performance
improvement on many write heavy workloads.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:49 -04:00
Kent Overstreet
ebb84d0941 bcachefs: Increase journal pipelining
This patch increases the maximum journal buffers in flight from 2 to 4 -
this will be particularly helpful when in the future we stop requiring
flush+fua for every journal write.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:49 -04:00
Kent Overstreet
3eb26d0157 bcachefs: bch2_trans_get_iter() no longer returns errors
Since we now always preallocate the maximum number of iterators when we
initialize a btree transaction, getting an iterator never fails - we can
delete a fair amount of error path code.

This patch also simplifies the iterator allocation code a bit.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:48 -04:00
Kent Overstreet
2e9f3b8885 bcachefs: Use BTREE_ITER_PREFETCH in journal+btree iter
Introducing the journal+btree iter introduced a regression where we
stopped using BTREE_ITER_PREFETCH - this is a performance regression on
rotating disks.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:48 -04:00
Kent Overstreet
5731cf0156 bcachefs: Fix journal reclaim spinning in recovery
We can't run journal reclaim until we've finished replaying updates to
interior btree nodes - the check for this was in the wrong place though,
leading to journal reclaim spinning before it was allowed to proceed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:48 -04:00
Kent Overstreet
6d758368f1 bcachefs: Fix a btree transaction iter overflow
extent_replay_key dates from before putting iterators was required -
fixed.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:47 -04:00
Kent Overstreet
a3e7226268 bcachefs: New varints
Previous varint implementation used by the inode code was not nearly as
fast as it could have been; partly because it was attempting to encode
integers up to 96 bits (for timestamps) but this meant that encoding and
decoding the length required a table lookup.

Instead, we'll just encode timestamps greater than 64 bits as two
separate varints; this will make decoding/encoding of inodes
significantly faster overall.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:46 -04:00
Kent Overstreet
33114c2d89 bcachefs: Drop alloc keys from journal when -o reconstruct_alloc
This fixes a bug where we'd pop an assertion due to replaying a key for
an interior btree node when that node no longer exists.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:45 -04:00
Kent Overstreet
8d6b6222bf bcachefs: Improvements to writing alloc info
Now that we've got transactional alloc info updates (and have for
awhile), we don't need to write it out on shutdown, and we don't need to
write it out on startup except when GC found errors - this is a big
improvement to mount/unmount performance.

This patch also fixes a few bugs where we weren't writing out alloc
info (on new filesystems, and new devices) and should have been.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:44 -04:00
Kent Overstreet
9f115ce9e9 bcachefs: Fix a bug with the journal_seq_blacklist mechanism
Previously, we would start doing btree updates before writing the first
journal entry; if this was after an unclean shutdown, this could cause
those btree updates to not be blacklisted.

Also, move some code to headers for userspace debug tools.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:43 -04:00
Kent Overstreet
f621e1521c bcachefs: Add an option for rebuilding the replicas section
There is a bug where we cnan end up clearing the data_has field in the
superblock members section, which causes us to skip reading the journal
and thus journal replay fails. This option tells the recovery path to
not trust those fields.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:42 -04:00
Kent Overstreet
697e45b230 bcachefs: Kill BTREE_TRIGGER_NOOVERWRITES
This is prep work for reworking the triggers machinery - we have
triggers that need to know both the old and the new key.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:42 -04:00
Kent Overstreet
5d20ba48f0 bcachefs: Use cached iterators for alloc btree
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:41 -04:00
Kent Overstreet
7fffc85baf bcachefs: Add an internal option for reading entire journal
To be used the debug tool that dumps the contents of the journal.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:41 -04:00
Kent Overstreet
61fc3c9610 bcachefs: Set filesystem features earlier in fs init path
Before we were setting features after allocating btree nodes, which
meant we were using the old btree pointer format.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:40 -04:00
Kent Overstreet
00b8ccf707 bcachefs: Interior btree updates are now fully transactional
We now update the alloc info (bucket sector counts) atomically with
journalling the update to the interior btree nodes, and we also set new
btree roots atomically with the journalled part of the btree update.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:40 -04:00
Kent Overstreet
b29303966b bcachefs: Fix reading of alloc info after unclean shutdown
When updates to interior nodes started being journalled, that meant that
after an unclean shutdown, until journal replay is done we can't walk
the btree without overlaying the updates from the journal.

The initial btree gc was changed to walk the btree overlaying keys from
the journal - but bch2_alloc_read() and bch2_stripes_read() were missed.
Major whoops...

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:40 -04:00
Kent Overstreet
b58a181d5c bcachefs: Fix iterating of journal keys within a btree node
Extent btrees no longer have weird special behaviour for min_key.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:38 -04:00
Kent Overstreet
5a655f06c9 bcachefs: Read journal when keep_journal on
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:37 -04:00
Kent Overstreet
f1d786a0db bcachefs: Add an option for keeping journal entries after startup
This will be used by the userspace debug tools.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:37 -04:00
Kent Overstreet
2f194e1697 bcachefs: Fix an assertion when nothing to replay
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:37 -04:00
Kent Overstreet
f44a6a7134 bcachefs: Replay interior node keys
This slightly modifies the journal replay code so that it can replay
updates to interior nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:37 -04:00
Kent Overstreet
e62d65f2fb bcachefs: trans_commit() path can now insert to interior nodes
This will be needed for the upcoming patches to journal updates to
interior btree nodes.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:37 -04:00
Kent Overstreet
5d548743bd bcachefs: Clear BCH_FEATURE_extents_above_btree_updates on clean shutdown
This is needed so that users can roll back to before "d9bb516b2d
bcachefs: Move extent overwrite handling out of core btree code", which
it appears may still be buggy.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:36 -04:00
Kent Overstreet
e3e464ac6d bcachefs: Move extent overwrite handling out of core btree code
Ever since the btree code was first written, handling of overwriting
existing extents - including partially overwriting and splittin existing
extents - was handled as part of the core btree insert path. The modern
transaction and iterator infrastructure didn't exist then, so that was
the only way for it to be done.

This patch moves that outside of the core btree code to a pass that runs
at transaction commit time.

This is a significant simplification to the btree code and overall
reduction in code size, but more importantly it gets us much closer to
the core btree code being completely independent of extents and is
important prep work for snapshots.

This introduces a new feature bit; the old and new extent update models
are incompatible when the filesystem needs journal replay.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:36 -04:00
Kent Overstreet
3186c80fe9 bcachefs: Skip 0 size deleted extents in journal replay
These are created by the new extent update path, but not used yet by the
recovery code and they break the existing recovery code, so we can just
skip them.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:36 -04:00
Kent Overstreet
f6d0368e06 bcachefs: Traverse iterator in journal replay
This fixes a bug where we end up spinning in journal replay - in theory
this shouldn't be necessary though, transaction reset should be
re-traversing all iterators.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:36 -04:00
Kent Overstreet
27beb81023 bcachefs: Fix another iterator leak
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:36 -04:00
Kent Overstreet
b807a0c8ba bcachefs: BCH_SB_FEATURES_ALL
BCH_FEATURE_btree_ptr_v2 wasn't getting set on new filesystems, oops

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:35 -04:00
Kent Overstreet
548b3d209f bcachefs: btree_ptr_v2
Add a new btree ptr type which contains the sequence number (random 64
bit cookie, actually) for that btree node - this lets us verify that
when we read in a btree node it really is the btree node we wanted.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:35 -04:00
Kent Overstreet
5c4a5cd5b3 bcachefs: btree_and_journal_iter
Introduce a new iterator that iterates over keys in the btree with keys
from the journal overlaid on top. This factors out what the erasure
coding init code was doing manually.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:35 -04:00
Kent Overstreet
2d594dfb53 bcachefs: Split out btree_trigger_flags
The trigger flags really belong with individual btree_insert_entries,
not the transaction commit flags - this splits out those  flags and
unifies them with the BCH_BUCKET_MARK flags. Todo - split out
btree_trigger.c from buckets.c

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:33 -04:00
Kent Overstreet
bcd6f3e06f bcachefs: Use KEY_TYPE_deleted whitouts for extents
Previously, partial overwrites of existing extents were handled
implicitly by the btree code; when reading in a btree node, we'd do a
mergesort of the different bsets and detect and fix partially
overlapping extents during that mergesort.

That approach won't work with snapshots: this changes extents to work
like regular keys as far as the btree code is concerned, where a 0 size
KEY_TYPE_deleted whiteout will completely overwrite an existing extent.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:33 -04:00
Kent Overstreet
1c3ff72c0f bcachefs: Convert some enums to x-macros
Helps for preventing things from getting out of sync.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:33 -04:00
Kent Overstreet
e731d466d2 bcachefs: Don't export __bch2_fs_read_write
BTREE_INSERT_LAZY_RW was added for this since this code was written; use
it instead.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:33 -04:00
Kent Overstreet
8b3bbe2c34 bcachefs: Don't reexecute triggers when retrying transaction commit
This was causing a bug with transaction iterators overflowing; now, if
triggers have to be reexecuted we always return -EINTR and retry from
the start of the transaction.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:33 -04:00
Kent Overstreet
58e2388f9e bcachefs: Kill BTREE_INSERT_ATOMIC
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:33 -04:00
Kent Overstreet
b1fd23df1d bcachefs: Convert all bch2_trans_commit() users to BTREE_INSERT_ATOMIC
BTREE_INSERT_ATOMIC should really be the default mode, and there's not
that much code that doesn't need it - so this is prep work for getting
rid of the flag.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:33 -04:00
Kent Overstreet
ba239c954e bcachefs: bch2_check_set_feature()
New helper function for setting incompatible feature bits

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:32 -04:00
Kent Overstreet
4de774952b bcachefs: Reorganize extents.c
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:32 -04:00
Kent Overstreet
4be1a412ea bcachefs: Inline data extents
This implements extents that have their data inline, in the value,
instead of the bkey value being pointers to the data - and the read and
write paths are updated to read from these new extent types and write
them out, when the write size is small enough.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:32 -04:00
Kent Overstreet
085ab69357 bcachefs: Rework of cut_front & cut_back
This changes bch2_cut_front and bch2_cut_back so that they're able to
shorten the size of the value, and it also changes the extent update
path to update the accounting in the btree node when this happens.

When the size of the value is shortened, they zero out the space that's
no longer used, so it's interpreted as noops (as implemented in the last
patch).

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:32 -04:00
Justin Husted
b627c7d8f4 bcachefs: Set lost+found mode to 0700
For security and conformance with other filesystems, the lost+found
directory should not be world or group accessible.

Signed-off-by: Justin Husted <sigstop@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:31 -04:00
Kent Overstreet
ff929515cc bcachefs: Trust btree alloc info at runtime
This lets us avoid a cache miss in the write path.

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:30 -04:00
Kent Overstreet
05240ba6b8 bcachefs: Fix creation of lost+found
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:29 -04:00
Kent Overstreet
a40d97a771 bcachefs: Fix incorrect use of bch2_extent_atomic_end()
bch2_extent_atomic_end counts the number of iterators requried for
marking overwrites - but journal replay never marks overwrites, so that
part was incorrect. And counting iterators for the key being inserted
should be unnecessary because we did that prior to the key being
inserted before it was first journalled.

This should fix an iterator overflow bug - the iterators for walking
overwrites were totally unneeded.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:28 -04:00
Kent Overstreet
137b0ed907 bcachefs: bch2_extent_atomic_end() now traverses iter
This fixes a bug in io.c bch2_write_index_default() - it was missing the
traverse call, but bch2_extent_atomic_end returns an error now and can
just call it itself.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:28 -04:00
Kent Overstreet
9638574229 bcachefs: Factor out fs-common.c
This refactoring makes the code easier to understand by separating the
bcachefs btree transactional code from the linux VFS code - but more
importantly, it's also to share code with the fuse port.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:28 -04:00
Kent Overstreet
a7199432c3 bcachefs: Kill deferred btree updates
Will be replaced by cached btree iterators

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:28 -04:00
Kent Overstreet
f9c5519336 bcachefs: Drop trans arg to bch2_extent_atomic_end()
Just for consistency

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:26 -04:00
Kent Overstreet
89b0511826 bcachefs: Flush fsck errors when looping in btree gc
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:26 -04:00
Kent Overstreet
ad7e137ebc bcachefs: Switch reconstruct_alloc to a mount option
Right now this is the only way of repairing bucket gens in the future

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:26 -04:00
Kent Overstreet
06f6c3ec0b bcachefs: Reflink pointers also have to be remarked if split in journal replay
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:26 -04:00
Kent Overstreet
76426098e4 bcachefs: Reflink
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:25 -04:00
Kent Overstreet
3c7f3b7aeb bcachefs: Refactor bch2_extent_trim_atomic() for reflink
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:25 -04:00
Kent Overstreet
2cbe5cfe27 bcachefs: Rework calling convention for marking overwrites
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:25 -04:00
Kent Overstreet
e222d206f2 bcachefs: Fix ec_stripes_read()
Change it to not mark keys that will be overwritten by keys in the
journal - this fixes a bug where we pop an assertion in
bucket_set_stripe() because of a stale pointer - because the stripe that
has the stale pointer has been deleted.

This code could be factored out and used elsewhere, at some point.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:24 -04:00
Kent Overstreet
2ded276b7e bcachefs: Fix array overrun with unknown btree roots
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:23 -04:00
Kent Overstreet
f707e3d8f4 bcachefs: fix kasan splat
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:23 -04:00
Kent Overstreet
6e738539cd bcachefs: Improve key marking interface
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:22 -04:00
Kent Overstreet
3838be7841 bcachefs: Don't use a fixed size buffer for fs_usage_deltas
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:22 -04:00
Kent Overstreet
20bceecb31 bcachefs: More work to avoid transaction restarts
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:22 -04:00
Kent Overstreet
619f5bee86 bcachefs: some improvements to startup messages and options
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:21 -04:00
Kent Overstreet
932aa83745 bcachefs: bch2_trans_mark_update()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:21 -04:00
Kent Overstreet
5e82a9a1f4 bcachefs: Write out fs usage consistently
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:21 -04:00
Kent Overstreet
c6dd04f8f5 bcachefs: Mark overwrites from journal replay in initial gc
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:20 -04:00
Kent Overstreet
d07343561e bcachefs: Deduplicate keys in the journal before replay
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:20 -04:00
Kent Overstreet
644d180b05 bcachefs: Journal replay refactoring
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:20 -04:00
Kent Overstreet
478259b749 bcachefs: delete duplicated code
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:20 -04:00
Kent Overstreet
1dd7f9d98d bcachefs: Rewrite journal_seq_blacklist machinery
Now, we store blacklisted journal sequence numbers in the superblock,
not the journal: this helps to greatly simplify the code, and more
importantly it's now implemented in a way that doesn't require all btree
nodes to be visited before starting the journal - instead, we
unconditionally blacklist the next 4 journal sequence numbers after an
unclean shutdown.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:20 -04:00
Kent Overstreet
a1d58243f9 bcachefs: add ability to run gc on metadata only
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:20 -04:00
Kent Overstreet
7b512638e0 bcachefs: Refactor bch2_fs_recovery()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:19 -04:00
Kent Overstreet
0bc166ff56 bcachefs: Track whether filesystem has errors in superblock
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:19 -04:00
Kent Overstreet
03e183cb5d bcachefs: Verify fs hasn't been modified before going rw
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:18 -04:00
Kent Overstreet
134915f3d3 bcachefs: Go rw lazily
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:18 -04:00
Kent Overstreet
6122ab639c bcachefs: More debug params for testing of recovery paths
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:18 -04:00
Kent Overstreet
05235e9928 bcachefs: Run gc if failed to read alloc btree
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:18 -04:00
Kent Overstreet
082f0801e0 bcachefs: Fix error handling in bch2_fs_recovery()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:18 -04:00
Kent Overstreet
1633e492ce bcachefs: improved flush_held_btree_writes()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:16 -04:00
Kent Overstreet
768ac63924 bcachefs: Add a mechanism for blocking the journal
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:16 -04:00
Kent Overstreet
f7e76361c4 bcachefs: no need to run gc when initializing new fs
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:15 -04:00
Kent Overstreet
1df42b5715 bcachefs: don't do initial gc if have alloc info feature
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:15 -04:00
Kent Overstreet
3577df5f7f bcachefs: serialize persistent_reserved
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:15 -04:00
Kent Overstreet
3e0745e283 bcachefs: initialize fs usage summary in recovery
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:15 -04:00
Kent Overstreet
42b72e0ba2 bcachefs: journal_replay_early()
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:15 -04:00
Kent Overstreet
2c5af169f7 bcachefs: reserve space in journal for fs usage entries
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:15 -04:00
Kent Overstreet
94cd106f8d bcachefs: delete a debug printk
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:15 -04:00
Kent Overstreet
61c8d7c8eb bcachefs: Persist stripe blocks_used
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:15 -04:00
Kent Overstreet
73e6ab9564 bcachefs: Switch replicas to mark_lock
Prep work for upcoming disk accounting changes

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:13 -04:00
Kent Overstreet
f0cfb963ec bcachefs: Track nr_inodes with the key marking machinery
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:12 -04:00
Kent Overstreet
26609b619f bcachefs: Make bkey types globally unique
this lets us get rid of a lot of extra switch statements - in a lot of
places we dispatch on the btree node type, and then the key type, so
this is a nice cleanup across a lot of code.

Also improve the on disk format versioning stuff.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:12 -04:00
Kent Overstreet
dfe9bfb32e bcachefs: Stripes now properly subject to gc
gc now verifies the contents of the stripes radix tree, important for
persistent alloc info

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:12 -04:00
Kent Overstreet
4e65431c85 Revert "bcachefs: start erasure coding after journal replay"
This reverts commit 36f389604294dfc953e6f5624ceb683818d32f28.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:12 -04:00
Kent Overstreet
129550c4d0 bcachefs: start erasure coding after journal replay
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:11 -04:00
Kent Overstreet
cd575ddf57 bcachefs: Erasure coding
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:11 -04:00
Kent Overstreet
af9d3bc203 bcachefs: stripe support for replicas tracking
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:11 -04:00
Kent Overstreet
8b335baef2 bcachefs: Assorted fixes for running on very small devices
It's now possible to create and use a filesystem on a 512k device with
4k buckets (though at that size we still waste almost half to internal
reserves)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:11 -04:00
Kent Overstreet
72644db153 bcachefs: Fix an assertion when rebuilding replicas
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:11 -04:00
Kent Overstreet
02f1a96c13 bcachefs: Rename nofsck opt to fsck
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:11 -04:00
Kent Overstreet
7b3f84ea7d bcachefs: Split out alloc_background.c
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:10 -04:00
Kent Overstreet
fc3268c13c bcachefs: kill extent_insert_hook
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:09 -04:00
Kent Overstreet
88c07f7397 bcachefs: Only check inode i_nlink during full fsck
Now that all filesystem operatinos that manipulate the filesystem
heirachy and i_nlink are fully atomic, we can add a feature bit to
indicate i_nlink doesn't need to be checked.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:07 -04:00
Kent Overstreet
1c6fdbd8f2 bcachefs: Initial commit
Initially forked from drivers/md/bcache, bcachefs is a new copy-on-write
filesystem with every feature you could possibly want.

Website: https://bcachefs.org

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
2023-10-22 17:08:07 -04:00