These utility functions are for managing btree node state within a
btree_trans - rename them for consistency, and drop some unneeded
arguments.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This is prep work for splitting btree_path out from btree_iter -
btree_path will not have a pointer to btree_trans.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
BTREE_ITER_SET_POS_AFTER_COMMIT is used internally to automagically
advance extent btree iterators on successful commit.
But with the upcoming btree_path patch it's getting more awkward to
support, and it adds overhead to core data structures that's only used
in a few places, and can be easily done by the caller instead.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This consolidates the code for doing extent updates, and makes the btree
iterator usage a bit cleaner and more efficient.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This factors out bch2_dump_trans_iters_updates() from the iter alloc
overflow path, and makes some small improvements to what it prints.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Inode creation is done with non-cached btree iterators, but then in the
same transaction the inode may be updated again with a cached iterator -
it makes cache coherency easier if new inodes always land in the
underlying btree.
This patch adds a check to bch2_trans_update() - if the same key is
updated multiple times in the same transaction with both cached and
non-cached iterators, use the non-cached iterator.
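
A minimal sketch of that check, not the actual bch2_trans_update() code.
All names here (struct sketch_update, sketch_queue_update() and so on)
are hypothetical, and a key is reduced to a single integer:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_UPDATES	8

/* Hypothetical, simplified stand-in for one pending update. */
struct sketch_update {
	uint64_t	key;	/* position being updated */
	uint64_t	val;	/* new value */
	bool		cached;	/* queued through a cached iterator? */
};

struct sketch_update_list {
	struct sketch_update	u[SKETCH_MAX_UPDATES];
	size_t			nr;
};

/*
 * Queue an update. If the key is already queued, the newer value wins,
 * but the entry stays non-cached if either update came from a
 * non-cached iterator.
 */
static void sketch_queue_update(struct sketch_update_list *l, uint64_t key,
				uint64_t val, bool cached)
{
	for (size_t i = 0; i < l->nr; i++) {
		if (l->u[i].key == key) {
			l->u[i].val	= val;
			l->u[i].cached	= l->u[i].cached && cached;
			return;
		}
	}

	assert(l->nr < SKETCH_MAX_UPDATES);
	l->u[l->nr++] = (struct sketch_update) { key, val, cached };
}
```

With this rule, an inode created through a non-cached iterator and then
updated through a cached one in the same transaction still lands in the
underlying btree.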
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With the recent transaction restart changes, it's no longer needed - all
transaction commits have BTREE_INSERT_NOUNLOCK semantics.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Start tracking when btree transactions have been restarted - and assert
that we're always calling bch2_trans_begin() immediately after
transaction restart.
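
A rough sketch of the idea, with hypothetical names and a made-up
restart code (the real btree_trans carries far more state than this):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical restart code for this sketch; not the kernel's value. */
#define SKETCH_TRANS_RESTART	(-1)

/* Hypothetical, simplified transaction state. */
struct sketch_trans {
	bool		restarted;	/* set on restart, cleared by begin */
	unsigned	nr_begins;
};

static void sketch_trans_begin(struct sketch_trans *trans)
{
	trans->restarted = false;
	trans->nr_begins++;
}

/* Every path that restarts the transaction records it here. */
static int sketch_trans_restart(struct sketch_trans *trans)
{
	trans->restarted = true;
	return SKETCH_TRANS_RESTART;
}

/*
 * Any other use of the transaction asserts that the caller went back
 * through sketch_trans_begin() after the last restart.
 */
static int sketch_trans_op(struct sketch_trans *trans, bool lock_contended)
{
	assert(!trans->restarted);

	return lock_contended ? sketch_trans_restart(trans) : 0;
}
```

A caller that ignores a restart and keeps using the transaction trips
the assertion on its next operation.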
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
They should already be traversed, and we're asserting that since the
introduction of iter->should_be_locked.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This closes a significant hole (and last known hole) in our ability to
verify metadata. Previously, since btree nodes are log structured, we
couldn't detect lost btree writes that weren't the first write to a
given node. Additionally, this seems to have led to some significant
metadata corruption on multi device filesystems with metadata
replication: since a write may have made it to one device and not
another, if we read that btree node back from the replica that did have
that write and started appending after that point, the other replica
would have a gap in the bset entries and reading from that replica
wouldn't find the rest of the bsets.
But, since updates to interior btree nodes are now journalled, we can
close this hole by updating pointers to btree nodes after every write
with the currently written number of sectors, without negatively
affecting performance. This means we will always detect lost or corrupt
metadata - it also means that our btree is now a curious hybrid of COW
and non COW btrees, with all the benefits of both (excluding
complexity).
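
As an illustration only, the read side check this enables might look
roughly like the following. The struct and function names are
hypothetical, and ignoring data past the recorded count is an assumption
of this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pointer to a btree node, as stored in the parent. */
struct sketch_node_ptr {
	uint16_t	sectors_written;	/* updated after every write */
};

/*
 * Compare what the pointer says was written with what was actually
 * found when reading the node from one replica.
 *
 * Returns the number of sectors to trust, or -1 if this replica is
 * missing a write.
 */
static int sketch_node_sectors_to_read(const struct sketch_node_ptr *ptr,
				       unsigned sectors_found,
				       unsigned node_sectors)
{
	/* A pointer can never claim more than the node's size. */
	assert(ptr->sectors_written <= node_sectors);

	if (sectors_found < ptr->sectors_written)
		return -1;	/* lost write: try another replica */

	/* Assumption: anything past the recorded count is ignored. */
	return ptr->sectors_written;
}
```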
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
btree_trans should always be passed when we have one - iter->trans is
disfavoured. This mainly updates old code in btree_update_interior.c,
some of which predates btree_trans.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Add a new flag to control assertions about updates to internal snapshot
nodes, which normally should not be written to - to be used in an
upcoming patch.
Also do some renaming - trigger_flags is now update_flags.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
- We no longer mark subsets of extents; they're marked like regular
  keys now - which means we can drop the offset & sectors arguments
  to trigger functions
- Drop other arguments that are no longer needed in various places -
  fs_usage
- Drop the logic for handling extents in bch2_mark_update() that isn't
  needed anymore, to match bch2_trans_mark_update()
- Better logic for handling the BTREE_ITER_CACHED_NOFILL case, where we
  don't have an old key to mark
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Adding iter->should_be_locked introduced a regression where it ended up
not being set on the iterator passed to bch2_btree_update_start(), which
is definitely not what we want.
This patch requires it to be set when calling bch2_trans_update(), and
adds various fixups to make that happen.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
With trans->updates2 gone, we can now drop this helper and use
bch2_btree_delete_at() instead.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
We haven't had extent merging in quite some time. It used to be done by
the btree code when sorting btree nodes, but that was eliminated as part
of the work to separate extent handling from core btree code.
This patch re-implements extent merging in the transaction commit path.
We don't currently have the ability to merge reflink pointers; we need
to do some work on the triggers code to be able to do that without
ending up with incorrect refcounts.
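
For illustration, the basic merge condition can be sketched as below.
This is not the code from the patch: the struct and function names are
hypothetical, and a real extent carries checksums, compression state and
multiple pointers that must also agree:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified extent with a single data pointer. */
struct sketch_extent {
	uint64_t	start;		/* logical start, in sectors */
	uint64_t	end;		/* logical end, exclusive */
	uint64_t	disk_offset;	/* where the data starts on disk */
};

/*
 * Try to merge r into l. Succeeds only if r starts where l ends, both
 * logically and on disk. On success l is extended to cover r.
 */
static bool sketch_extent_merge(struct sketch_extent *l,
				const struct sketch_extent *r)
{
	assert(l->start < l->end && r->start < r->end);

	if (l->end != r->start)
		return false;

	if (l->disk_offset + (l->end - l->start) != r->disk_offset)
		return false;

	l->end = r->end;
	return true;
}
```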
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Now that extent handling has been lifted to bch2_trans_update(), we
don't need to keep two different lists of updates.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This lifts handling of overlapping extents out of __bch2_trans_commit()
and moves it to where we first do the update - which means that
BTREE_ITER_WITH_UPDATES can now work correctly in extents mode.
Also, this patch reworks how extent triggers work: previously, on
partial extent overwrite we would pass this information to the trigger,
telling it what part of the extent was being overwritten. But, this
approach has had too many subtle corner cases - now, we only mark whole
extents, meaning on partial extent overwrite we unmark the old extent
and mark the new extent.
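
A small sketch of the new scheme, using hypothetical names and plain
sector counting in place of the real triggers. It assumes the new extent
lies entirely within the old one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical extent range: [start, end) in sectors. */
struct sketch_range {
	uint64_t	start;
	uint64_t	end;
};

struct sketch_accounting {
	int64_t		sectors;	/* sectors currently marked */
	unsigned	nr_triggers;	/* how many times the trigger ran */
};

/* The trigger only ever sees whole extents: insert marks, delete unmarks. */
static void sketch_trigger(struct sketch_accounting *acc,
			   struct sketch_range e, bool insert)
{
	int64_t n;

	assert(e.start <= e.end);

	n = (int64_t) (e.end - e.start);
	acc->sectors += insert ? n : -n;
	acc->nr_triggers++;
}

/*
 * Partial overwrite: unmark the old extent as a whole, then mark
 * whatever is left of it plus the new extent.
 */
static void sketch_overwrite(struct sketch_accounting *acc,
			     struct sketch_range old_e,
			     struct sketch_range new_e)
{
	assert(old_e.start <= new_e.start && new_e.end <= old_e.end);

	sketch_trigger(acc, old_e, false);

	if (old_e.start < new_e.start)
		sketch_trigger(acc, (struct sketch_range) {
			old_e.start, new_e.start }, true);
	if (new_e.end < old_e.end)
		sketch_trigger(acc, (struct sketch_range) {
			new_e.end, old_e.end }, true);

	sketch_trigger(acc, new_e, true);
}
```

The trigger never needs to know which part of an extent is being
overwritten, which is what removes the offset and length corner cases.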
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
This drops bch2_btree_iter_peek_with_updates() and replaces it with a
new flag, BTREE_ITER_WITH_UPDATES, and also reworks
bch2_btree_iter_peek_slot() to respect it too.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch adds some new tracepoints to the btree iterator code, and
adds new fields to the existing tracepoints - primarily for the iterator
position.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Upcoming refactoring is going to change bch2_trans_update() to start
returning transaction restarts.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Buffered writes may have to increase their disk reservation at btree
update time, due to compression and erasure coding being unpredictable:
O_DIRECT writes should be checking for -ENOSPC, but buffered writes have
already been accepted and should not.
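
Roughly, the distinction looks like this. The flag, struct and function
names are hypothetical, and tracking the overshoot in a separate counter
is an assumption of this sketch, not a description of the real
accounting:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag: the caller cannot handle -ENOSPC. */
#define SKETCH_RES_NOFAIL	(1U << 0)

struct sketch_fs {
	uint64_t	free_sectors;
	uint64_t	overcommitted;	/* assumption: overshoot is tracked */
};

/*
 * Grow a disk reservation by @sectors. O_DIRECT style callers pass no
 * flags and must handle -ENOSPC; buffered write style callers pass
 * SKETCH_RES_NOFAIL because the write has already been accepted.
 */
static int sketch_reservation_add(struct sketch_fs *fs, uint64_t *res,
				  uint64_t sectors, unsigned flags)
{
	assert(!(flags & ~SKETCH_RES_NOFAIL));

	if (sectors > fs->free_sectors) {
		if (!(flags & SKETCH_RES_NOFAIL))
			return -ENOSPC;

		fs->overcommitted += sectors - fs->free_sectors;
		fs->free_sectors = 0;
	} else {
		fs->free_sectors -= sectors;
	}

	*res += sectors;
	return 0;
}
```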
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Currently, we handle multiple overlapping extents in the same
transaction commit by doing fixups in bch2_trans_update() - this patch
extends that to split updates when necessary. The next patch that
changes the reflink code to not fragment extents when making them
indirect will require this.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
It was being skipped when hole punching, leading to problems when
splitting compressed extents.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
bch2_varint_decode() can read up to 7 bytes past the end of the buffer,
which means we need to allocate slightly larger key cache buffers.
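
To illustrate why padding is needed, here is a sketch that assumes the
overread comes from a single unaligned 8 byte load starting at the
varint's first byte. The helper names are hypothetical and this is not
the real decoder:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* An 8 byte load starting at the last valid byte reads 7 bytes too far. */
#define SKETCH_OVERREAD_BYTES	(sizeof(uint64_t) - 1)

/* Allocate a zeroed buffer with room for the overread. */
static uint8_t *sketch_alloc_key_buf(size_t bytes)
{
	return calloc(1, bytes + SKETCH_OVERREAD_BYTES);
}

/*
 * Load 8 bytes starting at @off. Only the first byte has to be inside
 * the logical buffer; the rest may fall in the padding.
 */
static uint64_t sketch_load_word(const uint8_t *buf, size_t bytes, size_t off)
{
	uint64_t v;

	assert(off < bytes);

	memcpy(&v, buf + off, sizeof(v));
	return v;
}
```

Without the extra 7 bytes, a load at the last byte of the buffer would
run off the end of the allocation.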
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Using the normal transaction commit path to insert and journal updates
to interior nodes hadn't been done before this repair code was written,
so it's not surprising that there was a bug.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
We check for this prior to metadata being written, but we're seeing some
strange bugs lately, and this will help catch those closer to where they
occur.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Can't run arbitrary code inside a wait_event() conditional, due to
task state being weird...
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If we let bch2_trans_commit() do it, it'll traverse iterators in sorted
order which means we'll get fewer lock restarts.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is an important cleanup, eliminating an unnecessary copy in the
transaction commit path.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The bug was that we were trying to find a replicas entry that wasn't
sorted - but, we can also simplify the code by not using
bch2_mark_bkey_replicas and instead ensuring the list of replicas
entries exists directly.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
JOURNAL_RES_GET_RESERVED should only be used for updates that need to be
done to free up space in the journal. In particular, when we're flushing
keys from the key cache, if we're flushing them out of order we
shouldn't be using it, since we're using up our remaining space in the
journal without dropping a pin that will let us make forward progress.
With this patch, BTREE_INSERT_JOURNAL_RECLAIM without
BTREE_INSERT_JOURNAL_RESERVED may return -EAGAIN - we can't wait on
journal reclaim if we're already in journal reclaim.
This means we need to propagate these errors up to journal reclaim,
indicating that flushing a journal pin should be retried in the future.
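
A simplified sketch of that error propagation, with hypothetical names.
Here only pins flushed in order are treated as allowed to use the
reserve, and the journal state is reduced to a single boolean:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified journal pin. */
struct sketch_pin {
	bool	flushed;
	bool	in_order;	/* flushing it lets journal reclaim advance */
};

/*
 * Flush one pin. When the journal is full, only in order flushes may
 * proceed; anything else gets -EAGAIN rather than blocking, since the
 * caller is journal reclaim itself.
 */
static int sketch_flush_pin(struct sketch_pin *pin, bool journal_full)
{
	assert(!pin->flushed);

	if (journal_full && !pin->in_order)
		return -EAGAIN;

	pin->flushed = true;
	return 0;
}

/* One reclaim pass; returns how many pins must be retried later. */
static size_t sketch_reclaim_pass(struct sketch_pin *pins, size_t nr,
				  bool journal_full)
{
	size_t deferred = 0;

	for (size_t i = 0; i < nr; i++) {
		if (pins[i].flushed)
			continue;

		if (sketch_flush_pin(&pins[i], journal_full) == -EAGAIN)
			deferred++;
	}

	return deferred;
}
```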
This is prep work for a patch to change the way journal reclaim works,
separating flushing of key cache keys done because the btree key cache
is too dirty from journal reclaim done because we need space in the
journal.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
After we get a journal reservation, we need to use it - if we error out
of a transaction commit, we'll be eating into space in the journal and
if our transaction needs to make forward progress in order to reclaim
space in the journal, we'll deadlock.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Since we're no longer doing btree node merging post commit, we can now
delete a bunch of code.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Currently, BTREE_INSERT_NOUNLOCK makes it hard to ensure btree node
merging happens reliably - since btree node merging happens after
transaction commit, we can't drop btree locks and block when starting
the btree update.
This patch moves it to before transaction commit - and failure to do a
merge that we wanted to do just restarts the transaction.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This is specifically to speed up bch2_inode_rm(), so that we're not
traversing iterators we're done with.
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
This patch starts treating the bpos.snapshot field like part of the key
in the btree code:
* bpos_successor() and bpos_predecessor() now include the snapshot field
* Keys in btrees that will be using snapshots (extents, inodes, dirents
  and xattrs) now always have their snapshot field set to U32_MAX
The btree iterator code gets a new flag, BTREE_ITER_ALL_SNAPSHOTS, that
determines whether we're iterating over keys in all snapshots or not -
internally, this controls whether bkey_(successor|predecessor)
increment/decrement the snapshot field, or only the higher bits of the
key.
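
A sketch of the two successor variants, using a hypothetical struct
sketch_pos in place of struct bpos. The field order and the way the
iterator's snapshot is reapplied are assumptions of this sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical position: inode is most significant, snapshot least. */
struct sketch_pos {
	uint64_t	inode;
	uint64_t	offset;
	uint32_t	snapshot;
};

/* Successor with the snapshot field treated as part of the key. */
static struct sketch_pos sketch_pos_successor(struct sketch_pos p)
{
	/* Increment with carry: snapshot, then offset, then inode. */
	if (!++p.snapshot &&
	    !++p.offset &&
	    !++p.inode)
		assert(0 && "no successor for the maximum position");

	return p;
}

/* Successor of only the higher bits of the key. */
static struct sketch_pos sketch_pos_nosnap_successor(struct sketch_pos p)
{
	p.snapshot = 0;

	if (!++p.offset &&
	    !++p.inode)
		assert(0 && "no successor for the maximum position");

	return p;
}

/* What an iterator would use, depending on the all-snapshots flag. */
static struct sketch_pos sketch_iter_successor(struct sketch_pos p,
					       bool all_snapshots,
					       uint32_t iter_snapshot)
{
	if (all_snapshots)
		return sketch_pos_successor(p);

	p = sketch_pos_nosnap_successor(p);
	p.snapshot = iter_snapshot;
	return p;
}
```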
We add a new member to struct btree_iter, iter->snapshot: when
BTREE_ITER_ALL_SNAPSHOTS is not set, iter->pos.snapshot should always
equal iter->snapshot, which will be 0 for btrees that don't use
snapshots, and always U32_MAX for btrees that will use snapshots
(until we enable snapshot creation).
This patch also introduces a new metadata version number, and compat
code for reading from/writing to older versions - this isn't a forced
upgrade (yet).
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>