Remove conditional path allocation from btrfs_read_locked_inode(). Add
an ASSERT(path) to indicate it should never be called with a NULL path.
Call btrfs_read_locked_inode() directly from btrfs_iget(). This causes
code duplication between btrfs_iget() and btrfs_iget_path(), but I
think this is justifiable as it removes the need for conditionally
allocating the path inside of btrfs_read_locked_inode(). This makes the
code easier to reason about and makes it clear who has the
responsibility of allocating and freeing the path.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move btrfs_add_inode_to_root() so it can be called from
btrfs_read_locked_inode(), no changes were made to the function.
Move cleanup code from btrfs_iget_path() to btrfs_read_locked_inode.
This improves readability and improves a leaky abstraction. Previously
btrfs_iget_path() had to handle a positive error case as a result of a
call to btrfs_search_slot(), but it makes more sense to handle this
closer to the source of the call.
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add struct io_btrfs_cmd as a wrapper type for io_uring_cmd_to_pdu(),
rather than using a raw pointer.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add an io_uring command for encoded reads, using the same interface as
the existing BTRFS_IOC_ENCODED_READ ioctl.
btrfs_uring_encoded_read() is an io_uring version of
btrfs_ioctl_encoded_read(), which validates the user input and calls
btrfs_encoded_read() to read the appropriate metadata. If we determine
that we need to read an extent from disk, we call
btrfs_encoded_read_regular_fill_pages() through
btrfs_uring_read_extent() to prepare the bio.
The existing btrfs_encoded_read_regular_fill_pages() is changed so that
if it is passed a valid uring_ctx, rather than waking up any waiting
threads it calls btrfs_uring_read_extent_endio(). This in turn copies
the read data back to userspace, and calls io_uring_cmd_done() to
complete the io_uring command.
Because we're potentially doing a non-blocking read,
btrfs_uring_read_extent() doesn't clean up after itself if it returns
-EIOCBQUEUED. Instead, it allocates a priv struct, populates the fields
there that we will need to unlock the inode and free our allocations,
and defers this to the btrfs_uring_read_finished() that gets called when
the bio completes.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_encoded_read_regular_fill_pages() so that the priv struct
is allocated rather than stored on the stack, in preparation for adding
an asynchronous mode to the function.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change btrfs_encoded_read() so that it returns -EAGAIN rather than sleeps
if IOCB_NOWAIT is set in iocb->ki_flags. The conditions that require
sleeping are: inode lock, writeback, extent lock, ordered range.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Change the behaviour of btrfs_encoded_read() so that if it needs to read
an extent from disk, it leaves the extent and inode locked and returns
-EIOCBQUEUED. The caller is then responsible for doing the I/O via
btrfs_encoded_read_regular() and unlocking the extent and inode.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
iocb->ki_pos isn't used after this function, so there's no point in
changing its value.
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After the previous patch, which converted the rb-tree used to track
delayed ref heads into an xarray, the find_ref_head() function is now
used only by one caller which always passes false to the 'return_bigger'
argument. So remove the 'return_bigger' logic, simplifying the function,
and move all the function code to the single caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use a red black tree (rb-tree) to track the delayed ref
heads (in struct btrfs_delayed_ref_root::href_root). This however is not
very efficient when the number of delayed ref heads is large (and it's
very common to be at least in the order of thousands) since rb-trees are
binary trees. For example for 10K delayed ref heads, the tree has a depth
of 13. Besides that, inserting into the tree requires navigating through
it and pulling useless cache lines in the process since the red black tree
nodes are embedded within the delayed ref head structure - on the other
hand, by being embedded, it requires no extra memory allocations.
We can improve this by using an xarray instead which has a much higher
branching factor than a red black tree (binary balanced tree) and is more
cache friendly and behaves like a resizable array, with a much better
search and insertion complexity than a red black tree. This only has one
small disadvantage which is that insertion will sometimes require
allocating memory for the xarray - which may fail (not that often since
it uses a kmem_cache) - but on the other hand we can reduce the delayed
ref head structure size by 24 bytes (from 152 down to 128 bytes) after
removing the embedded red black tree node, meaning than we can now fit
32 delayed ref heads per 4K page instead of 26, and that gain compensates
for the occasional memory allocations needed for the xarray nodes. We
also end up using only 2 cache lines instead of 3 per delayed ref head.
Running the following fs_mark test showed some improvements:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
MOUNT_OPTIONS="-o ssd"
FILES=100000
THREADS=$(nproc --all)
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $DEV
mount $MOUNT_OPTIONS $DEV $MNT
OPTS="-S 0 -L 5 -n $FILES -s 0 -t $THREADS -k"
for ((i = 1; i <= $THREADS; i++)); do
OPTS="$OPTS -d $MNT/d$i"
done
fs_mark $OPTS
umount $MNT
Before this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 171845.7 12253839
16 2400000 0 230898.7 12308254
23 3600000 0 212292.9 12467768
30 4800000 0 195737.8 12627554
46 6000000 0 171055.2 12783329
After this patch:
FSUse% Count Size Files/sec App Overhead
10 1200000 0 173835.0 12246131
16 2400000 0 233537.8 12271746
23 3600000 0 220398.7 12307737
30 4800000 0 204483.6 12392318
40 6000000 0 182923.3 12771843
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add some comments to struct btrfs_delayed_ref_root's fields to mention
what its spinlock protects.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The delayed refs lock must be held when calling add_delayed_ref_head(),
so assert that it's being held.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The delayed refs lock must be held when calling find_first_ref_head(), so
assert that it's being held.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have 3 callers for find_ref_head() so assert at find_ref_head() that we
have the delayed refs lock held, removing the assertion from one of its
callers (btrfs_find_delayed_ref_head()).
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the following patches in the series will need to access fs_info at
btrfs_delete_ref_head(), so pass a fs_info argument to it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the following patches in the series will need to access fs_info in
the function find_ref_head(), so pass a fs_info argument to it as well as
to the functions btrfs_select_ref_head() and btrfs_find_delayed_ref_head()
which call find_ref_head().
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The unselect_delayed_ref_head() at extent-tree.c doesn't really belong in
that file as it's a delayed refs specific detail and therefore should be
at delayed-ref.c. Further its inverse, btrfs_select_ref_head(), is at
delayed-ref.c, so it only makes sense to have it there too.
So move unselect_delayed_ref_head() into delayed-ref.c and rename it to
btrfs_unselect_ref_head() so that its name closely matches its inverse
(btrfs_select_ref_head()).
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of doing it in two steps outside of delayed-ref.c, leaking low
level details such as locking, move the logic entirely to delayed-ref.c
under btrfs_select_ref_head(), reducing code and making things simpler
for the caller.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function only returns 0, meaning it was able to lock the delayed ref
head, or -EAGAIN in case it wasn't able to lock it. So simplify this and
use a boolean return type instead, returning true if it was able to lock
and false otherwise.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The atomic counter 'num_entries' is not used anymore, we increment it
and decrement it but then we don't ever read it to use for any logic.
Its last use was removed with commit 61a56a992f ("btrfs: delayed refs
pre-flushing should only run the heads we have"). So remove it.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of open coding it, use the find_first_ref_head() helper at
btrfs_destroy_delayed_refs(). This avoids duplicating the logic,
specially with the upcoming changes in subsequent patches.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When destroying delayed refs during a transaction abort, we have open
coded the removal of a delayed ref, which is also done by the static
helper function drop_delayed_ref(). So remove that duplicated code and
use drop_delayed_ref() instead.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info parameter is redundant because it can be extracted from the
transaction given as another parameter. So remove it and use the fs_info
accessible from the transaction.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's better suited at delayed-ref.c since it's about delayed refs and
contains logic to iterate over them (using the red black tree, doing all
the locking, freeing, etc), so move it from disk-io.c, which is pretty
big, into delayed-ref.c, hiding implementation details of how delayed
refs are tracked and managed. This also facilitates the next patches in
the series.
This change moves the code between files but also does the following
simple cleanups:
1) Rename the 'cache' variable to 'bg', since it's a block group
(the 'cache' logic comes from old days where the block group
structure was named 'btrfs_block_group_cache');
2) Move the 'ref' variable declaration to the scope of the inner
while loop, since it's not used outside that loop.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_destroy_delayed_refs() it's unexpected to not find the block
group to which a delayed reference's extent belongs to, so we have this
BUG_ON(), not just because it's highly unexpected but also because we
don't know what to do there.
Since we are in the transaction abort path, there's nothing we can do
other than proceed and cleanup all used resources we can. So remove
the BUG_ON() and deal with a missing block group by logging an error
message and continuing to cleanup all we can related to the current
delayed ref head and moving to other delayed refs.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inserting extent backref, in order to check whether refs other than
inline refs are used, we always use path keep locks for tree search, which
will increase the lock contention of extent tree.
We do not need the parent node every time to determine whether normal
refs are used. It is only needed when the extent item is the last item
in a leaf.
Therefore, we change it to first use keep_locks=0 for search. If the
extent item happens to be the last item in the leaf, we then change to
keep_locks=1 for the second search to reduce lock contention.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement self-tests for partial deletion of RAID stripe-tree entries.
These two new tests cover both the deletion of the front of a RAID
stripe-tree stripe extent as well as truncation of an item to make it
smaller.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In our CI system, the RAID stripe tree configuration sometimes fails with
the following ASSERT():
assertion failed: found_start >= start && found_end <= end, in fs/btrfs/raid-stripe-tree.c:64
This ASSERT()ion triggers, because for the initial design of RAID
stripe-tree, I had the "one ordered-extent equals one bio" rule of zoned
btrfs in mind.
But for a RAID stripe-tree based system, that is not hosted on a zoned
storage device, but on a regular device this rule doesn't apply.
So in case the range we want to delete starts in the middle of the
previous item, grab the item and "truncate" it's length. That is, clone
the item, subtract the deleted portion from the key's offset, delete the
old item and insert the new one.
In case the range to delete ends in the middle of an item, we have to
adjust both the item's key as well as the stripe extents and then
re-insert the modified clone into the tree after deleting the old stripe
extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When fgp_flags and gfp_flags are zero, use filemap_get_folio(A, B)
instead of __filemap_get_folio(A, B, 0, 0)—no need for the extra
arguments 0, 0.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The buffered write path is still heavily utilizing the page interface.
Since we have converted it to do a page-by-page copying, it's much easier
to convert all involved functions to folio interface, this involves:
- btrfs_copy_from_user()
- btrfs_drop_folio()
- prepare_uptodate_page()
- prepare_one_page()
- lock_and_cleanup_extent_if_need()
- btrfs_dirty_page()
All function are changed to accept a folio parameter, and if the word
"page" is in the function name, change that to "folio" too.
The function btrfs_dirty_page() is exported for v1 space cache, convert
v1 cache call site to convert its page to folio for the new interface.
And there is a small enhancement for prepare_one_folio(), instead of
manually waiting for the page writeback, let __filemap_get_folio() to
handle that by using FGP_WRITEBEGIN, which implies
(FGP_LOCK | FGP_WRITE | FGP_CREAT | FGP_STABLE).
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the btrfs_buffered_write() is preparing multiple page a time,
allowing a better performance.
But the current trend is to support larger folio as an optimization,
instead of implementing own multi-page optimization.
This is inspired by generic_perform_write(), which is copying one folio
a time.
Such change will prepare us to migrate to implement the write_begin()
and write_end() callbacks, and make every involved function a little
easier.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_do_encoded_write() was converted to use folios in 400b172b8c,
but we're still allocating based on sizeof(struct page *) rather than
sizeof(struct folio *). There's no functional change.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <maharmstone@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove hard-coded strings by using the str_yes_no() and str_no_yes()
helper functions.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since there is no user of reader locks, rename the writer locks into a
more generic name, by removing the "_writer" part from the name.
And also rename btrfs_subpage::writer into btrfs_subpage::locked.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since commit d7172f52e9 ("btrfs: use per-buffer locking for
extent_buffer reading"), metadata read no longer relies on the subpage
reader locking.
This means we do not need to maintain a different metadata/data split
for locking, so we can convert the existing reader lock users by:
- add_ra_bio_pages()
Convert to btrfs_folio_set_writer_lock()
- end_folio_read()
Convert to btrfs_folio_end_writer_lock()
- begin_folio_read()
Convert to btrfs_folio_set_writer_lock()
- folio_range_has_eb()
Remove the subpage->readers checks, since it is always 0.
- Remove btrfs_subpage_start_reader() and btrfs_subpage_end_reader()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is not really suitable to lock a folio, as it lacks the
proper mapping checks, thus the locked folio may not even belong to
btrfs.
And due to the above reason, the last user inside lock_delalloc_folios()
is already removed, and we can remove this function.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If you follow the seed/sprout wiki, it suggests the following workflow:
btrfstune -S 1 seed_dev
mount seed_dev mnt
btrfs device add sprout_dev
mount -o remount,rw mnt
The first mount mounts the FS readonly, which results in not setting
BTRFS_FS_OPEN, and setting the readonly bit on the sb. The device add
somewhat surprisingly clears the readonly bit on the sb (though the
mount is still practically readonly, from the users perspective...).
Finally, the remount checks the readonly bit on the sb against the flag
and sees no change, so it does not run the code intended to run on
ro->rw transitions, leaving BTRFS_FS_OPEN unset.
As a result, when the cleaner_kthread runs, it sees no BTRFS_FS_OPEN and
does no work. This results in leaking deleted snapshots until we run out
of space.
I propose fixing it at the first departure from what feels reasonable:
when we clear the readonly bit on the sb during device add.
A new fstest I have written reproduces the bug and confirms the fix.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
It's redundant to have the 'gen' variable since we already have the same
value in the local btrfs_tree_parent_check structure. So remove it and
instead use the structure's field.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to initialize the has_first_key field of the stack local
btrfs_tree_parent_check structure at btrfs_tree_parent_check() and at
btrfs_qgroup_trace_subtree() since all fields not explicitly initialized
are zeroed out. In the case of the first function it's a bit odd because
we are assigning 0 and the field is of type bool, however not incorrect
since a 0 is converted to false.
Just remove the explicit initializations due to their redundancy.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller of btrfs_verify_level_key() is read_block_for_search() and
it's passing 3 arguments to it that can be extracted from its on stack
variable of type struct btrfs_tree_parent_check.
So change btrfs_verify_level_key() to accept an argument of type
struct btrfs_tree_parent_check instead of level, first key and parent
transid arguments.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The level parameter passed to read_block_for_search() always matches the
level of the extent buffer passed in the "eb_ret" parameter, which we are
also extracting into the "parent_level" local variable.
So remove the level parameter and instead use the "parent_level" variable
which in fact has a better name (it's the level of the parent node from
which we are reading a child node/leaf).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the extent map shrinker can only be run by a single task and runs
asynchronously as a work queue job, enable it as it can no longer cause
stalls on tasks allocating memory and entering the extent map shrinker
through the fs shrinker (implemented by btrfs_free_cached_objects()).
This is crucial to prevent exhaustion of memory due to unbounded extent
map creation, primarily with direct IO but also for buffered IO on files
with holes. This problem, for the direct IO case, was first reported in
the Link tag below. That report was added to a Link tag of the first patch
that introduced the extent map shrinker, commit 956a17d9d0 ("btrfs: add
a shrinker for extent maps"), however the Link tag disappeared somehow
from the committed patch (but was included in the submitted patch to the
mailing list), so adding it below for future reference.
Link: https://lore.kernel.org/linux-btrfs/13f94633dcf04d29aaf1f0a43d42c55e@amazon.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The names for the members of struct btrfs_fs_info related to the extent
map shrinker are a bit too long, so rename them to be shorter by replacing
the "extent_map_" prefix with the "em_" prefix.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that the extent map shrinker can only be run by a single task (as a
work queue item) there is no need to keep the progress of the shrinker
protected by a spinlock and passing the progress to trace events as
parameters. So remove the lock and simplify the arguments for the trace
events.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the extent map shrinker is run synchronously for kswapd tasks
that end up calling the fs shrinker (fs/super.c:super_cache_scan()).
This has some disadvantages and for some heavy workloads with memory
pressure it can cause some delays and stalls that make a machine
unresponsive for some periods. This happens because:
1) We can have several kswapd tasks on machines with multiple NUMA zones,
and running the extent map shrinker concurrently can cause high
contention on some spin locks, namely the spin locks that protect
the radix tree that tracks roots, the per root xarray that tracks
open inodes and the list of delayed iputs. This not only delays the
shrinker but also causes high CPU consumption and makes the task
running the shrinker monopolize a core, resulting in the symptoms
of an unresponsive system. This was noted in previous commits such as
commit ae1e766f62 ("btrfs: only run the extent map shrinker from
kswapd tasks");
2) The extent map shrinker's iteration over inodes can often be slow, even
after changing the data structure that tracks open inodes for a root
from a red black tree (up to kernel 6.10) to an xarray (kernel 6.10+).
The transition to the xarray while it made things a bit faster, it's
still somewhat slow - for example in a test scenario with 10000 inodes
that have no extent maps loaded, the extent map shrinker took between
5ms to 8ms, using a release, non-debug kernel. Iterating over the
extent maps of an inode can also be slow if have an inode with many
thousands of extent maps, since we use a red black tree to track and
search extent maps. So having the extent map shrinker run synchronously
adds extra delay for other things a kswapd task does.
So make the extent map shrinker run asynchronously as a job for the
system unbounded workqueue, just like what we do for data and metadata
space reclaim jobs.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the common code to remove an extent map from its inode's tree into a
helper function and use it, reducing duplicated code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When crawling btree, if an eb cache miss occurs, we change to use the eb
read lock and release all previous locks (including the parent lock) to
reduce lock contention.
If an eb cache miss occurs in a leaf and needs to execute IO, before this
change we released locks only from level 2 and up and we read a leaf's
content from disk while holding a lock on its parent (level 1), causing
the unnecessary lock contention on the parent, after this change we
release locks from level 1 and up, but we lock level 0, and read leaf's
content from disk.
Because we have prepared the check parameters and the read lock of eb we
hold, we can ensure that no race will occur during the check and cause
unexpected errors.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compression heuristic pass does not need a level, so we can drop the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Cascaded removal of fs_info that is not needed in several functions.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function got split in commit 6ab6ebb760 ("btrfs: split
alloc_log_tree()") and since then transaction parameter has been unused.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller passes NULL, we can drop the parameter. This is since
the new mount option parser done in 3bb17a25bc ("btrfs: add get_tree
callback for new mount API").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the new mount option parser in commit ad21f15b0f ("btrfs:
switch to the new mount API") we don't pass the options like that
anymore.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter was added in 8ff8466d29 ("btrfs: support subpage for
extent buffer page release") for page but hasn't been used since, so we
can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The mask parameter used for allocations got unified to GFP_NOFS and
removed from relevant functions in 1d12680044 ("btrfs: drop gfp from
parameter extent state helpers").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter duplicates what can be effectively obtained from
wc->refs[level - 1] and this is what's actually used inside. Added in
commit 2b73c7e761 ("btrfs: unify logic to decide if we need to walk
down into a node during snapshot delete").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter 'from' has never been used since commit b8d8e1fd57
("btrfs: introduce btrfs_write_check()"), this is for buffered write.
Direct io write needs it so it was probably an interface thing, but we
can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The file_offset parameter used to be passed to encoded read struct but
was removed in commit b665affe93 ("btrfs: remove unused members from
struct btrfs_encoded_read_private").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need offset for inline extents, they always start from 0.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need the inode pointer to read inline extent, it's all
accessible from the path pointer.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need the user passed parameter, rescan is a filesystem
operation so fs_info is sufficient.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The LZO compression has only one level, we don't need to pass the
parameter.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The caller replace_path() runs under transaction but we don't need it in
btrfs_qgroup_add_swapped_blocks().
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need fs_info here, everything is reachable from qgroup.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter map used to be passed to scrub_extent() until
e02ee89baa ("btrfs: scrub: switch scrub_simple_mirror() to
scrub_stripe infrastructure"), where the scrub implementation was
completely reworked.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is unused and we can reach sctx from scrub stripe if
needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
None of the ref iteration callbacks needs the index parameter (this is
for the directory item iteration), so we can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
None of the ref iteration callbacks needs the num parameter (this is for
the directory item iteration), so we can drop it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is unused and we can get it from space info if needed.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter is not used, we can also reach it from the space info if
needed in the future.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The path parameter was used for our own locking, that got converted to
rwsem eventually. Last usage in ac5887c8e0 ("btrfs: locking: remove
all the blocking helpers").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Make sure we got the right timer struct for the zstd workspace reclaim
work.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_set_range_writeback() was originally a callback for
metadata and data, to mark a range with writeback flag.
Then it was converted into a common function call for both metadata and
data.
From the very beginning, the function had been only called on a full page,
later converted to handle range inside a page.
But it never needed to handle multiple pages, and since commit
8189197425 ("btrfs: refactor __extent_writepage_io() to do
sector-by-sector submission") the function was only called on a
sector-by-sector basis.
This makes the function unnecessary, and can be converted to a simple
btrfs_folio_set_writeback() call instead.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When trying to flush qgroups in order to release space we run delayed
iputs in order to release space from recently deleted files (their link
counted reached zero), and then we start delalloc and wait for any
existing ordered extents to complete.
However there's a time window here where we end up not doing the final
iput on a deleted file which could release necessary space:
1) An unlink operation starts;
2) During the unlink, or right before it completes, delalloc is flushed
and an ordered extent is created;
3) When the ordered extent is created, the inode's ref count is
incremented (with igrab() at alloc_ordered_extent());
4) When the unlink finishes it doesn't drop the last reference on the
inode and so it doesn't trigger inode eviction to delete all of
the inode's items in its root and drop all references on its data
extents;
5) Another task enters try_flush_qgroup() to try to release space,
it runs all delayed iputs, but there's no delayed iput yet for that
deleted file because the ordered extent hasn't completed yet;
6) Then at try_flush_qgroup() we wait for the ordered extent to complete
and that results in adding a delayed iput at btrfs_put_ordered_extent()
when called from btrfs_finish_one_ordered();
7) Adding the delayed iput results in waking the cleaner kthread if it's
not running already. However it may take some time for it to be
scheduled, or it may be running but busy running auto defrag, dropping
deleted snapshots or doing other work, so by the time we return from
try_flush_qgroup() the space for deleted file isn't released.
Improve on this by running delayed iputs only after flushing delalloc
and waiting for ordered extent completion.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Performing the initial extent sector read on a RAID stripe-tree backed
filesystem with pre-allocated extents will cause the RAID stripe-tree
lookup code to return ENODATA, as pre-allocated extents do not have any
on-disk bytes and thus no RAID stripe-tree entries.
But the current scrub read code marks these extents as errors, because
the lookup fails.
If btrfs_map_block() returns -ENODATA, it means that the call to
btrfs_get_raid_extent_offset() returned -ENODATA, because there is no
entry for the corresponding range in the RAID stripe-tree. But as this
range is in the extent tree it means we've hit a pre-allocated extent. In
this case, don't mark the sector in the stripe's error bitmaps as faulty
and carry on to the next.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In case a lookup in the RAID stripe-tree fails, return ENODATA instead of
ENOENT to better distinguish stripe-tree lookups from other code paths
where we return ENOENT.
Suggested-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we BUG_ON() in btrfs_finish_one_ordered() if we are finishing
an ordered extent that is flagged as NOCOW, but it's checksum list is
not empty.
This is clearly a logic error which we can recover from by aborting the
transaction.
For developer builds which enable CONFIG_BTRFS_ASSERT, also ASSERT()
that the list is empty.
Suggested-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently inside prepare_pages(), we handle the leading and tailing page
differently, and skip the middle pages (if any). This is to avoid
reading pages which are fully covered by the dirty range.
Refactor the code by moving all checks (alignment check, range check,
force read check) into prepare_uptodate_page().
So that prepare_pages() only needs to iterate all the pages
unconditionally.
And since we're here, also update prepare_uptodate_page() to use
folio API other than the old page API.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside btrfs_buffered_write(), we have a local variable @dirty_pages,
recording the number of pages we dirtied in the current iteration.
However we do not really need that variable, since it can be calculated
from @pos and @copied.
In fact there is already a problem inside the short copy path, where we
use @dirty_pages to calculate the range we need to release.
But that usage assumes sectorsize == PAGE_SIZE, which is no longer true.
Instead of keeping @dirty_pages and cause incorrect usage, just
calculate the number of dirtied pages inside btrfs_dirty_pages().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_try_tree_write_lock() has been unused since commit
50b21d7a06 ("btrfs: submit a writeback bio per extent_buffer").
Remove it as we don't need it anymore.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_is_parity_mirror() has been unused since commit 4886ff7b50
("btrfs: introduce a new helper to submit write bio for repair").
Remove it as the code was refactored and we don't need the helper
anymore.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_free_squota_rsv() was added in commit
e85a0adacf ("btrfs: ensure releasing squota reserve on head refs")
but has remained unused since then.
Remove it as we don't seem to need it and was probably a leftover.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add first stash of very basic self tests for the RAID stripe-tree.
More test cases will follow exercising the tree.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This macro is no longer used after the "btrfs: Cleaned up folio->page
conversion" series patch [1] was applied, so remove it.
[1]: https://patchwork.kernel.org/project/linux-btrfs/cover/20240828182908.3735344-1-lizetao1@huawei.com/
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Youling Tang <tangyouling@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable stop_loop was originally introduced in commit 625f1c8dc6
("Btrfs: improve the loop of scrub_stripe"). It was initialized to 0 in
commit 3b080b2564 ("Btrfs: scrub raid56 stripes in the right way").
However, in a later commit 18d30ab961 ("btrfs: scrub: use
scrub_simple_mirror() to handle RAID56 data stripe scrub"), the code
that modified stop_loop was removed, making the variable redundant.
Currently, stop_loop is only initialized with 0 and is never used or
modified within the scrub_stripe() function. As a result, this patch
removes the stop_loop variable to clean up the code and eliminate
unnecessary redundancy.
This change has no impact on functionality, as stop_loop was never
utilized in any meaningful way in the final version of the code.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Riyan Dhiman <riyandhiman14@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The qgroup record was allocated with kzalloc(), so it's pointless to set
its old_roots member to NULL. Remove the assignment.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of dereferencing the delayed refs from the transaction multiple
times, store it early in the local variable and then always use the
variable.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's no need to hold the delayed refs spinlock when calling
btrfs_qgroup_trace_extent_nolock() from btrfs_qgroup_trace_extent(), since
it doesn't change anything in delayed refs and it only changes the xarray
used to track qgroup extent records, which is protected by the xarray's
lock.
Holding the lock is only adding unnecessary lock contention with other
tasks that actually need to take the lock to add/remove/change delayed
references. So remove the locking.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of extracting fs_info from the transaction multiples times, store
it in a local variable and use it.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we track qgroup extent records in a xarray we don't need to have
a "bytenr" field in struct btrfs_qgroup_extent_record, since we can get
it from the index of the record in the xarray.
So remove the field and grab the bytenr from either the index key or any
other place where it's available (delayed refs). This reduces the size of
struct btrfs_qgroup_extent_record from 40 bytes down to 32 bytes, meaning
that we now can store 128 instances of this structure instead of 102 per
4K page.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the duplicated transaction joining, block reserve setting and raid
extent inserting in btrfs_finish_ordered_extent().
While at it, also abort the transaction in case inserting a RAID
stripe-tree entry fails.
Suggested-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
Currently btrfs accepts any file path for its device, resulting some
weird situation:
# ./mount_by_fd /dev/test/scratch1 /mnt/btrfs/
The program has the following source code:
#include <fcntl.h>
#include <stdio.h>
#include <sys/mount.h>
int main(int argc, char *argv[]) {
int fd = open(argv[1], O_RDWR);
char path[256];
snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
return mount(path, argv[2], "btrfs", 0, NULL);
}
Then we can have the following weird device path:
BTRFS: device fsid 2378be81-fe12-46d2-a9e8-68cf08dd98d5 devid 1 transid 7 /proc/self/fd/3 (253:2) scanned by mount_by_fd (18440)
Normally it's not a big deal, and later udev can trigger a device path
rename. But if udev didn't trigger, the device path "/proc/self/fd/3"
will show up in mtab.
[CAUSE]
For filename "/proc/self/fd/3", it means the opened file descriptor 3.
In above case, it's exactly the device we want to open, aka points to
"/dev/test/scratch1" which is another symlink pointing to "/dev/dm-2".
Inside kernel we solve the mount source using LOOKUP_FOLLOW, which
follows the symbolic link and grab the proper block device.
But inside btrfs we also save the filename into btrfs_device::name, and
utilize that member to report our mount source, which leads to the above
situation.
[FIX]
Instead of unconditionally trust the path, check if the original file
(not following the symbolic link) is inside "/dev/", if not, then
manually lookup the path to its final destination, and use that as our
device path.
This allows us to still use symbolic links, like
"/dev/mapper/test-scratch" from LVM2, which is required for fstests runs
with LVM2 setup.
And for really weird names, like the above case, we solve it to
"/dev/dm-2" instead.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641
Reported-by: Fabian Vogt <fvogt@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
It is very common for udev to trigger device scan, and every time a
mounted btrfs device got re-scan from different soft links, we will get
some of unnecessary device path updates, this is especially common
for LVM based storage:
# lvs
scratch1 test -wi-ao---- 10.00g
scratch2 test -wi-a----- 10.00g
scratch3 test -wi-a----- 10.00g
scratch4 test -wi-a----- 10.00g
scratch5 test -wi-a----- 10.00g
test test -wi-a----- 10.00g
# mkfs.btrfs -f /dev/test/scratch1
# mount /dev/test/scratch1 /mnt/btrfs
# dmesg -c
[ 205.705234] BTRFS: device fsid 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9 devid 1 transid 6 /dev/mapper/test-scratch1 (253:4) scanned by mount (1154)
[ 205.710864] BTRFS info (device dm-4): first mount of filesystem 7be2602f-9e35-4ecf-a6ff-9e91d2c182c9
[ 205.711923] BTRFS info (device dm-4): using crc32c (crc32c-intel) checksum algorithm
[ 205.713856] BTRFS info (device dm-4): using free-space-tree
[ 205.722324] BTRFS info (device dm-4): checking UUID tree
So far so good, but even if we just touched any soft link of
"dm-4", we will get quite some unnecessary device path updates.
# touch /dev/mapper/test-scratch1
# dmesg -c
[ 469.295796] BTRFS info: devid 1 device path /dev/mapper/test-scratch1 changed to /dev/dm-4 scanned by (udev-worker) (1221)
[ 469.300494] BTRFS info: devid 1 device path /dev/dm-4 changed to /dev/mapper/test-scratch1 scanned by (udev-worker) (1221)
Such device path rename is unnecessary and can lead to random path
change due to the udev race.
[CAUSE]
Inside device_list_add(), we are using a very primitive way checking if
the device has changed, strcmp().
Which can never handle links well, no matter if it's hard or soft links.
So every different link of the same device will be treated as a different
device, causing the unnecessary device path update.
[FIX]
Introduce a helper, is_same_device(), and use path_equal() to properly
detect the same block device.
So that the different soft links won't trigger the rename race.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Link: https://bugzilla.suse.com/show_bug.cgi?id=1230641
Reported-by: Fabian Vogt <fvogt@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously for btrfs with sector size smaller than page size (subpage),
we only allow compression if the range is fully page aligned.
This is to work around the asynchronous submission of compressed range,
which delayed the page unlock and writeback into a workqueue,
furthermore asynchronous submission can lock multiple sector range
across page boundary.
Such asynchronous submission makes it very hard to co-operate with other
regular writes.
With the recent changes to the subpage folio unlock path, now
asynchronous submission of compressed pages can co-operate with regular
submission, so enable sector perfect compression if it's an experimental
build.
The ETA for moving this feature out of experimental is 6.15, and I hope
all remaining corner cases can be exposed before that.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we only mark sectors as locked if there is a *NEW* delalloc
range for it.
But NEW delalloc range is not the same as dirty sectors we want to
submit, e.g:
0 32K 64K 96K 128K
| |////////||///////| |////|
120K
For above 64K page size case, writepage_delalloc() for page 0 will find
and lock the delalloc range [32K, 96K), which is beyond the page
boundary.
Then when writepage_delalloc() is called for the page 64K, since [64K,
96K) is already locked, only [120K, 128K) will be locked.
This means, although range [64K, 96K) is dirty and will be submitted
later by extent_writepage_io(), it will not be marked as locked.
This is fine for now, as we call btrfs_folio_end_writer_lock_bitmap() to
free every non-compressed sector, and compression is only allowed for
full page range.
But this is not safe for future sector perfect compression support, as
this can lead to double folio unlock:
Thread A | Thread B
---------------------------------------+--------------------------------
| submit_one_async_extent()
| |- extent_clear_unlock_delalloc()
extent_writepage() | |- btrfs_folio_end_writer_lock()
|- btrfs_folio_end_writer_lock_bitmap()| |- btrfs_subpage_end_and_test_writer()
| | | |- atomic_sub_and_test()
| | | /* Now the atomic value is 0 */
|- if (atomic_read() == 0) | |
|- folio_unlock() | |- folio_unlock()
The root cause is the above range [64K, 96K) is dirtied and should also
be locked but it isn't.
So to make everything more consistent and prepare for the incoming
sector perfect compression, mark all dirty sectors as locked.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently for subpage (sector size < page size) cases, we reuse subpage
locked bitmap to find out all delalloc ranges we have locked, and run
all those found ranges.
However such reuse is not perfect, e.g.:
0 32K 64K 96K 128K
| |////////||///////| |////|
120K
For above range, writepage_delalloc() for page 0 will handle the range
[32K, 96k), note delalloc range can be beyond the page boundary.
But writepage_delalloc() for page 64K will only handle range [120K,
128K), as the previous run on page 0 has already handled range [64K,
96K).
Meanwhile for the writeback we should expect range [64K, 96K) to also be
locked, this leads to the mismatch from locked bitmap and delalloc
range.
This is not causing problems yet, but it's still an inconsistent
behavior.
So instead of relying on the subpage locked bitmap, move the delalloc
range search using local @delalloc_bitmap, so that we can remove the
existing btrfs_folio_find_writer_locked().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function extent_writepage_io() will submit the dirty sectors inside
the page for the write.
But recently to co-operate with the incoming subpage compression
enhancement, a new bitmap is introduced to
btrfs_bio_ctrl::submit_bitmap, to only avoid a subset of the dirty
range.
This is because we can have the following cases with 64K page size:
0 16K 32K 48K 64K
| |/////////| |/|
52K
For range [16K, 32K), we queue the dirty range for compression, which is
ran in a delayed workqueue.
Then for range [48K, 52K), we go through the regular submission path.
In that case, our btrfs_bio_ctrl::submit_bitmap will exclude the range
[16K, 32K).
The dirty flags for the range [16K, 32K) is only cleared when the
compression is done, by the extent_clear_unlock_delalloc() call inside
submit_one_async_extent().
This patch fix the false alert by removing the
btrfs_folio_assert_not_dirty() check, since it's no longer correct for
subpage compression cases.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For btrfs with sector size < page size (e.g. 4K sector size, 64K page
size), and enable the sector perfect compression support, then the
following dirty range can lead to problems:
0 32K 64K 96K 128K
| |///////||//////| |/|
124K
In above case, if we start writeback for that inode, the last dirty
range [124K, 128K) will not be submitted and cause reserved space
leakage:
- Start writeback for page 0
We find the range [32K, 96K) is suitable for compression, and queue it
into a workqueue to do the delayed compression and submission.
- Compression happens for range [32K, 96K)
Function extent_range_clear_dirty_for_io() is called, however it is
only doing full page handling, not considering any the extra bitmaps
for subpage cases.
That function will clear page dirty for both page 0 and page 64K.
- Writeback for the inode is done
Because page 64K has its dirty flag cleared, it will not be considered
as a writeback target.
This means the range [124K, 128K) will not be submitted, and reserved
space for it will be leaked.
Fix this problem by using the subpage helper to clear the dirty flag.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[PROBLEM]
If sector perfect compression is enabled for sector size < page size
case, the following case can lead dirty ranges not being written back:
0 32K 64K 96K 128K
| |///////||//////| |/|
124K
In above example, the page size is 64K, and we need to write back above
two pages.
- Submit for page 0 (main thread)
We found delalloc range [32K, 96K), which can be compressed.
So we queue an async range for [32K, 96K).
This means, the page unlock/clearing dirty/setting writeback will
all happen in a workqueue context.
- The compression is done, and compressed range is submitted (workqueue)
Since the compression is done in asynchronously, the compression can
be done before the main thread to submit for page 64K.
Now the whole range [32K, 96K), involving two pages, will be marked
writeback.
- Submit for page 64K (main thread)
extent_write_cache_pages() got its wbc->sync_mode is WB_SYNC_NONE,
so it skips the writeback wait.
And unlock the page and exit. This means the dirty range [124K, 128K)
will never be submitted, until next writeback happens for page 64K.
This will never happen for previous kernels because:
- For sector size == page size case
Since one page is one sector, if a page is marked writeback it will
not have dirty flags.
So this corner case will never hit.
- For sector size < page size case
We never do subpage compression, a range can only be submitted for
compression if the range is fully page aligned.
This change makes the subpage behavior mostly the same as non-subpage
cases.
[ENHANCEMENT]
Instead of relying WB_SYNC_NONE check only, if it's a subpage case, then
always wait for writeback flags.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are already two bugs (one in zlib, one in zstd) that involved
compression path is not handling sector size < page size cases well.
So it makes more sense to make sure that btrfs_compress_folios() returns
Since we already have two bugs (one in zlib, one in zstd) in the
compression path resulting the @total_in be to larger than the
to-be-compressed range length, there is enough reason to add an ASSERT()
to make sure the total read-in length doesn't exceed the input length.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside zstd_compress_folios(), after exhausted one input page, we need
to switch to the next page as input.
However when counting the total input bytes (@tot_in), we always increase
it by PAGE_SIZE.
For the following case, it can cause incorrect value:
0 32K 64K 96K
| |///////////||///////////|
After compressing range [32K, 64K), we switch to the next page, and
increasing @tot_in by 64K, while we only read 32K.
This will cause the @total_in to return a value larger than the input
length.
Fix it by only increase @tot_in by the input size.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside zlib_compress_folios(), each time we switch the input page cache,
the @start is increased by PAGE_SIZE.
But for the incoming compression support for sector size < page size
(previously we support compression only when the range is fully page
aligned), this is not going to handle the following case:
0 32K 64K 96K
| |///////////||///////////|
@start has the initial value 32K, indicating the start filepos of the
to-be-compressed range.
And when grabbing the first page as input, we always call "start +=
PAGE_SIZE;".
But since @start is starting at 32K, it will be increased by 64K,
resulting it to be 96K for the next range, causing incorrect input range
and corruption for the future subpage compression.
Fix it by only increase @start by the input size.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently CONFIG_BTRFS_EXPERIMENTAL is not only for the extra debugging
output, but also for experimental features.
This is not ideal to distinguish planned but not yet stable features
from those purely designed for debugging.
This patch splits the following features into CONFIG_BTRFS_EXPERIMENTAL:
- Extent map shrinker
This seems to be the first one to exit experimental.
- Extent tree v2
This seems to be the last one to graduate from experimental.
- Raid stripe tree
- Csum offload mode
- Send protocol v3
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
According to the description, CONFIG_BTRFS_DEBUG is only for extra
debug info, meanwhile sanity checks should be managed by
CONFIG_BTRFS_ASSERT.
There is no need to check both to enable assert_rbio().
Just remove the check for CONFIG_BTRFS_DEBUG.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can't assume that btrees only contain keys of a given type - even if
they only have a single key type listed in the allowed key types for
that btree; this is a forwards compatibility issue.
Reported-by: syzbot+a27c3aaa3640dd3e1dfb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Three affect DAMON. Lorenzo's five-patch series to address the
mmap_region error handling is here also.
Apart from that, various singletons.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZzBVmAAKCRDdBJ7gKXxA
ju42AQD0EEnzW+zFyI+E7x5FwCmLL6ofmzM8Sw9YrKjaeShdZgEAhcyS2Rc/AaJq
Uty2ZvVMDF2a9p9gqHfKKARBXEbN2w0=
=n+lO
-----END PGP SIGNATURE-----
Merge tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes, 14 of which are cc:stable.
Three affect DAMON. Lorenzo's five-patch series to address the
mmap_region error handling is here also.
Apart from that, various singletons"
* tag 'mm-hotfixes-stable-2024-11-09-22-40' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: add entry for Thorsten Blum
ocfs2: remove entry once instead of null-ptr-dereference in ocfs2_xa_remove()
signal: restore the override_rlimit logic
fs/proc: fix compile warning about variable 'vmcore_mmap_ops'
ucounts: fix counter leak in inc_rlimit_get_ucounts()
selftests: hugetlb_dio: check for initial conditions to skip in the start
mm: fix docs for the kernel parameter ``thp_anon=``
mm/damon/core: avoid overflow in damon_feed_loop_next_input()
mm/damon/core: handle zero schemes apply interval
mm/damon/core: handle zero {aggregation,ops_update} intervals
mm/mlock: set the correct prev on failure
objpool: fix to make percpu slot allocation more robust
mm/page_alloc: keep track of free highatomic
mm: resolve faulty mmap_region() error path behaviour
mm: refactor arch_calc_vm_flag_bits() and arm64 MTE handling
mm: refactor map_deny_write_exec()
mm: unconditionally close VMAs on error
mm: avoid unsafe VMA hook invocation when error arises on mmap hook
mm/thp: fix deferred split unqueue naming and locking
mm/thp: fix deferred split queue not partially_mapped
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcuWXMACgkQiiy9cAdy
T1Eu8gv+LUAmrvvv8PDoLUT50QZb6aAY2SeulgTdeG8OzImXH5VUSjptRYwP46Dk
KNLh85A4C39w/guxm3FX2qjeesZZD5DDubSJNATLy75jorq7z+1uTNg8oUZGpvJS
airmcv/0mcDZqVayCmiT7wPyhUSYa+VTvHrkFpsI20BrlyDybe5HGps77iCOJ5K0
uTRgM6VNxkKx+Z5NietpDyaUl2A5b6Yx/9J8vMq4ytBfEcSGi+ndpZNvG7kKg8gQ
3i/ND4O2+eScwvYclVP5mJbF71LW0Z/ljS4mEVH5UuRgLH2Ji35B9xaDFDSixI3x
EHFwnAX0QeGHIlIuFhRDdtR2gFqREAJOYxkDxfo7PXO5gOXLWZXru9F7v6lWsydN
varqSseBBucHOLn8NylvgJWwqYs+sIKQycYKsX3ZUnQfejaUwfV2H/ADJzccjFF8
PUzVQFyOZtUK3fdkoqvULr/zvwninhtLJYLIsPcUgSPCcxGxMApvtkCaJVV3JGfB
2acZPdMu
=ZzcZ
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fix from Steve French:
"Fix net namespace refcount use after free issue"
* tag 'v6.12-rc6-smb3-client-fix' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: Fix use-after-free of network namespace.
-----BEGIN PGP SIGNATURE-----
iQGzBAABCgAdFiEE6fsu8pdIjtWE/DpLiiy9cAdyT1EFAmcsXSUACgkQiiy9cAdy
T1FyOgv+Ks1lfl+6D/G89zFl5XOtCm8njsedJu9y3jR7hzophX2osfmodACMVX6B
0VLu0jzquvUo18VNlL+wF7YFH+Mc6zrevEnjBay9Xa05YyRqK5c7qjpiWEgXPN7/
ROQfC2slCAFjymhw+9qY+PGZYg3x0fyGdJC/gBNSFnu2ufag367Li+0fTKQTXFwz
F24S5eI+M9OWNgMnMYoNt+77f0n0JkKbQznq9nTEvUsbTWZFSEfmVczfSY0ltdOH
RER9zoyTU3zbPuMZqK+Jb7c2247ahsLzDEBAUG0Wn77wSaiWXU5dmVD5bWsDTp25
5p9uLpkr3irDWwJGkCrkpm2Tva/50IHPEFQ4kllVlm6ffoao/dxBCwFf/MEvJXzI
OgU+HpXyZdq6NF1hcB4xUlcbHvGCa6pEcYkcM7PwLml+6SKIwEsEGpnJ23kxGR3+
MGYMCITatRuvZstfEDolNyrO2+gPMd3ODnLhfjfjT47Kh38e7yxrLr4cmxbPAA+s
EVdm2N08
=zTn6
-----END PGP SIGNATURE-----
Merge tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Four fixes, all also marked for stable:
- fix two potential use after free issues
- fix OOM issue with many simultaneous requests
- fix missing error check in RPC pipe handling"
* tag 'v6.12-rc6-ksmbd-fixes' of git://git.samba.org/ksmbd:
ksmbd: check outstanding simultaneous SMB operations
ksmbd: fix slab-use-after-free in smb3_preauth_hash_rsp
ksmbd: fix slab-use-after-free in ksmbd_smb2_session_create
ksmbd: Fix the missing xa_store error check
We silence btree errors in btree_node_scan, since it's probing and
errors are expected: add a fake pass so that btree_node_scan is no
longer recovery pass 0, and we don't think we're in btree node scan when
reading btree roots.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When we truncate a bset (due to it extending past the end of the btree
node), we can't skip the rest of the validation for e.g. the packed
format (if it's the first bset in the node).
Reported-by: syzbot+4d722d3c539d77c7bc82@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmct+40ACgkQxWXV+ddt
WDvCtRAAp0rheEu14hpVvWE2//+6u9Gx7Wfjzbj0+o4zBRWdg7BigFxfeb6JsH/E
2TjuWdcoP/OMV9ghCBQAQxySAPtsxH7skkyNy2UcMk5byBIrNvhw9auP5GXXlrhK
jSKDD4yfOMb++8LhrLevgTrijNyjLqaKXruw9a1Pmc3gxpdNmnMEySsQaF62o2Sm
YC3jwi0KpNAhu2qyJ6TnPgd5zf3BTM0JAeuB019IZW4WoeRTOdcPe7S7gqqJwZ+e
lL0D2/lfIE1lKvLE266Fab4FAQiJV07rozYj25XHiDpqThCxnJVOZCEHasOQ1PRy
d6j3RmGPqJYAYfQL1L+FH2hsS1BVZfVyCV1V7A/cN+lAffBfnROnf13C3gJ15Nbx
3lTyjBPQQw2WpfdmeyF3ikbrjZ8AfahChQO+mMnLN7oAWdIwWX5MRB+cwfWTxzA/
P8upz6HSTpSwy8nXdq264q1KkyCjx0Wv+8iyU7LirN2fCcEchA12HAIaOBeHedgh
PrGZDqrkZccQQxAvU5H7hQv0hZkGK8qba381oYHO09g72VM6ysuBU7tGrPZrlZYB
CvYTCwNZ/lqI8ikrcHOyUO1SPR9SaaWej1mWgBJ69ZIfg+ZuMtOMl171DU4S/i2V
iYgYoN8eCqTQWdaX5kk+3LWmK8fSU7F/KSDtJtT1KxkaSwCacfY=
=TQzP
-----END PGP SIGNATURE-----
Merge tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more one-liners that fix some user visible problems:
- use correct range when clearing qgroup reservations after COW
- properly reset freed delayed ref list head
- fix ro/rw subvolume mounts to be backward compatible with old and
new mount API"
* tag 'for-6.12-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix the length of reserved qgroup to free
btrfs: reinitialize delayed ref list after deleting it from the list
btrfs: fix per-subvolume RO/RW flags with new mount API
Some trivial syzbot fixes, two more serious btree fixes found by looping
single_devices.ktest small_nodes:
- Topology error on split after merge, where we accidentaly picked the
node being deleted for the pivot, resulting in an assertion pop
- New nodes being preallocated were left on the freedlist, unlocked,
resulting in them sometimes being accidentally freed: this dated from
pre-cycle detector, when we could leave them locked. This should have
resulted in more explosions and fireworks, but turned out to be
surprisingly hard to hit because the preallocated nodes were being
used right away.
the fix for this is bigger than we'd like - reworking btree list
handling was a bit invasive - but we've now got more assertions and
it's well tested.
- Also another mishandled transaction restart fix (in
btree_node_prefetch) - we're almost done with those.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKnAFLkS8Qha+jvQrE6szbY3KbnYFAmctRYoACgkQE6szbY3K
bnbkxRAArtqV9/qsKbSYAaa/+GaL7YdapuYbi/pmC9X96F9qbTdEJzW5rs66iiGE
zbkfFqo2I85nacSTk3b12E3QUXj+CEmSIWOPQtamYw/0AkmVsKepgGsXLazZ0rYi
X8UDVc6fuFkoO1aC/9V2NJEFG9QXIj8ru0m2kyUE9ZM6rgskugVN/ec9ipNQNZhY
4L8U7Z6Y9AX4vs/BeV3i6cLrTaMroUFYSM0hJalBJ24KZsZ1bWflC39C0dXSvy/O
gCmBCobZTT5aDEQai1kdyFr4GZZUgCJg4YEUDfyOdpPmhbcP4iwX/cJqJHXqxXVt
nMyLz5nLs0nYO791UlLHZuUUUe99nl+tC09b034n20peLnQwWW/obTrhn86SDDka
2eQv1Rk5C5i8r5b0k8UYjy5ogfiVlC/X1OwmLKkarKnC/wd0eFQI71Qq9s8KpXbo
VVASENYFV3hrIV8ZcxiqiJ18g6o7++jtTAmIfRljQrO6B8tU5g5uWCTZli+wciii
qWnt1k7P92er8lBzUnQGh9CEwLVbe9ZyBJv+fYVwTOxPES/TbJS7n5fb+1f1rF9w
j5llXVUiaLucXoCpBjEDflvhBTRQHEkKk3gJgy86NKgRjEjPhQT8D2dksT4kgyHb
RqgOSUN+oVqi/i+7RKf9x/jG4id0uvMH5xT7qiXTUiQXtUD+J9g=
=cn3u
-----END PGP SIGNATURE-----
Merge tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"Some trivial syzbot fixes, two more serious btree fixes found by
looping single_devices.ktest small_nodes:
- Topology error on split after merge, where we accidentaly picked
the node being deleted for the pivot, resulting in an assertion pop
- New nodes being preallocated were left on the freedlist, unlocked,
resulting in them sometimes being accidentally freed: this dated
from pre-cycle detector, when we could leave them locked. This
should have resulted in more explosions and fireworks, but turned
out to be surprisingly hard to hit because the preallocated nodes
were being used right away.
The fix for this is bigger than we'd like - reworking btree list
handling was a bit invasive - but we've now got more assertions and
it's well tested.
- Also another mishandled transaction restart fix (in
btree_node_prefetch) - we're almost done with those"
* tag 'bcachefs-2024-11-07' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix UAF in __promote_alloc() error path
bcachefs: Change OPT_STR max to be 1 less than the size of choices array
bcachefs: btree_cache.freeable list fixes
bcachefs: check the invalid parameter for perf test
bcachefs: add check NULL return of bio_kmalloc in journal_read_bucket
bcachefs: Ensure BCH_FS_may_go_rw is set before exiting recovery
bcachefs: Fix topology errors on split after merge
bcachefs: Ancient versions with bad bkey_formats are no longer supported
bcachefs: Fix error handling in bch2_btree_node_prefetch()
bcachefs: Fix null ptr deref in bucket_gen_get()
This fixes an assertion pop where we try to navigate to the target of
the backpointer, and the path level isn't what we expect.
Reported-by: syzbot+b17df21b4d370f2dc330@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The write buffer needs to be specifically flushed when going RO: keys in
the journal that haven't yet been moved to the write buffer don't have a
journal pin yet.
This fixes numerous syzbot bugs, all with symptoms of still doing writes
after we've got RO.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When build with !CONFIG_MMU, the variable 'vmcore_mmap_ops'
is defined but not used:
>> fs/proc/vmcore.c:458:42: warning: unused variable 'vmcore_mmap_ops'
458 | static const struct vm_operations_struct vmcore_mmap_ops = {
Fix this by only defining it when CONFIG_MMU is enabled.
Link: https://lkml.kernel.org/r/20241101034803.9298-1-xiqi2@huawei.com
Fixes: 9cb218131d ("vmcore: introduce remap_oldmem_pfn_range()")
Signed-off-by: Qi Xi <xiqi2@huawei.com>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/lkml/202410301936.GcE8yUos-lkp@intel.com/
Cc: Baoquan He <bhe@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Michael Holzheu <holzheu@linux.vnet.ibm.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Wang ShaoBo <bobo.shaobowang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
If we error in data_update_init() after adding to the rhashtable of
outstanding promotes, kfree_rcu() is required.
Reported-by: Reed Riley <reed@riley.engineer>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Change OPT_STR max value to be 1 less than the "ARRAY_SIZE" of "_choices"
array. As a result, remove -1 from (opt->max-1) in bch2_opt_to_text.
The "_choices" array is a null-terminated array, so computing the maximum
using "ARRAY_SIZE" without subtracting 1 yields an incorrect result. Since
bch2_opt_validate don't subtract 1, as bch2_opt_to_text does, values
bigger than the actual maximum would pass through option validation.
Reported-by: syzbot+bee87a0c3291c06aa8c6@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=bee87a0c3291c06aa8c6
Fixes: 63c4b25453 ("bcachefs: Better superblock opt validation")
Suggested-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Piotr Zalewski <pZ010001011111@proton.me>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
When allocating new btree nodes, we were leaving them on the freeable
list - unlocked - allowing them to be reclaimed: ouch.
Additionally, bch2_btree_node_free_never_used() ->
bch2_btree_node_hash_remove was putting it on the freelist, while
bch2_btree_node_free_never_used() was putting it back on the btree
update reserve list - ouch.
Originally, the code was written to always keep btree nodes on a list -
live or freeable - and this worked when new nodes were kept locked.
But now with the cycle detector, we can't keep nodes locked that aren't
tracked by the cycle detector; and this is fine as long as they're not
reachable.
We also have better and more robust leak detection now, with memory
allocation profiling, so the original justification no longer applies.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
The perf_test does not check the number of iterations and threads
when it is zero. If nr_thread is 0, the perf test will keep
waiting for wakekup. If iteration is 0, it will cause exception
of division by zero. This can be reproduced by:
echo "rand_insert 0 1" > /sys/fs/bcachefs/${uuid}/perf_test
or
echo "rand_insert 1 0" > /sys/fs/bcachefs/${uuid}/perf_test
Fixes: 1c6fdbd8f2 ("bcachefs: Initial commit")
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bio_kmalloc may return NULL, will cause NULL pointer dereference.
Add check NULL return for bio_kmalloc in journal_read_bucket.
Signed-off-by: Pei Xiao <xiaopei01@kylinos.cn>
Fixes: ac10a9611d ("bcachefs: Some fixes for building in userspace")
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If BCH_FS_may_go_rw is not yet set, it indicates to the transaction
commit path that updates should be done via the list of journal replay
keys.
This must be set before multithreaded use commences.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
If a btree split picks a pivot that's being deleted by a btree node
merge, we're going to have problems.
Fix this by checking if the pivot is being deleted, the same as we check
for deletions in journal replay keys.
Found by single_devic.ktest small_nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Syzbot found an assertion pop, by generating an ancient filesystem
version with an invalid bkey_format (with fields that can overflow) as
well as packed keys that aren't representable unpacked.
This breaks key comparisons in all sorts of painful ways.
Filesystems have been automatically rewriting nodes with such invalid
formats for years; we can safely drop support for them.
Reported-by: syzbot+8a0109511de9d4b61217@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
bucket_gen() checks if we're lookup up a valid bucket and returns NULL
otherwise, but bucket_gen_get() was failing to check; other callers were
correct.
Also do a bit of cleanup on callers.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
seq_printf is costy, on a system with n CPUs, reading /proc/softirqs
would yield 10*n decimal values, and the extra cost parsing format string
grows linearly with number of cpus. Replace seq_printf with
seq_put_decimal_ull_width have significant performance improvement.
On an 8CPUs system, reading /proc/softirqs show ~40% performance
gain with this patch.
Signed-off-by: David Wang <00107082@163.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
I noticed that recently, simple operations like "make" started
failing on NFSv3 mounts of ext4 exports. Network capture shows that
READDIRPLUS operated correctly but READDIR failed with
NFS3ERR_INVAL. The vfs_llseek() call returned EINVAL when it is
passed a non-zero starting directory cookie.
I bisected to commit c689bdd3bf ("nfsd: further centralize
protocol version checks.").
Turns out that nfsd3_proc_readdir() does not call fh_verify() before
it calls nfsd_readdir(), so the new fhp->fh_64bit_cookies boolean is
not set properly. This leaves the NFSD_MAY_64BIT_COOKIE unset when
the directory is opened.
For ext4, this causes the wrong "max file size" value to be used
when sanity checking the incoming directory cookie (which is a seek
offset value).
The fhp->fh_64bit_cookies boolean is /always/ properly initialized
after nfsd_open() returns. There doesn't seem to be a reason for the
generic NFSD open helper to handle the f_mode fix-up for
directories, so just move that to the one caller that tries to open
an S_IFDIR with NFSD_MAY_64BIT_COOKIE.
Suggested-by: NeilBrown <neilb@suse.de>
Fixes: c689bdd3bf ("nfsd: further centralize protocol version checks.")
Reviewed-by: NeilBrown <neilb@suse.de>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Now that the SIG_IGN problem is solved in the core code, the alarmtimer
callbacks do not require a return value anymore.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/20241105064214.318837272@linutronix.de
To cure the SIG_IGN handling for posix interval timers, the preallocated
sigqueue needs to be embedded into struct k_itimer to prevent life time
races of all sorts.
Now that the prerequisites are in place, embed the sigqueue into struct
k_itimer and fixup the relevant usage sites.
Aside of preparing for proper SIG_IGN handling, this spares an extra
allocation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/all/20241105064213.719695194@linutronix.de
The dealloc flag may be cleared and the extent won't reach the disk in
cow_file_range when errors path. The reserved qgroup space is freed in
commit 30479f31d4 ("btrfs: fix qgroup reserve leaks in
cow_file_range"). However, the length of untouched region to free needs
to be adjusted with the correct remaining region size.
Fixes: 30479f31d4 ("btrfs: fix qgroup reserve leaks in cow_file_range")
CC: stable@vger.kernel.org # 6.11+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Haisu Wang <haisuwang@tencent.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At insert_delayed_ref() if we need to update the action of an existing
ref to BTRFS_DROP_DELAYED_REF, we delete the ref from its ref head's
ref_add_list using list_del(), which leaves the ref's add_list member
not reinitialized, as list_del() sets the next and prev members of the
list to LIST_POISON1 and LIST_POISON2, respectively.
If later we end up calling drop_delayed_ref() against the ref, which can
happen during merging or when destroying delayed refs due to a transaction
abort, we can trigger a crash since at drop_delayed_ref() we call
list_empty() against the ref's add_list, which returns false since
the list was not reinitialized after the list_del() and as a consequence
we call list_del() again at drop_delayed_ref(). This results in an
invalid list access since the next and prev members are set to poison
pointers, resulting in a splat if CONFIG_LIST_HARDENED and
CONFIG_DEBUG_LIST are set or invalid poison pointer dereferences
otherwise.
So fix this by deleting from the list with list_del_init() instead.
Fixes: 1d57ee9416 ("btrfs: improve delayed refs iterations")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
With util-linux 2.40.2, the 'mount' utility is already utilizing the new
mount API. e.g:
# strace mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test/
...
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0
But this leads to a new problem, that per-subvolume RO/RW mount no
longer works, if the initial mount is RO:
# mount -o subvol=subv1,ro /dev/test/scratch1 /mnt/test
# mount -o rw,subvol=subv2 /dev/test/scratch1 /mnt/scratch
# mount | grep mnt
/dev/mapper/test-scratch1 on /mnt/test type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=256,subvol=/subv1)
/dev/mapper/test-scratch1 on /mnt/scratch type btrfs (ro,relatime,discard=async,space_cache=v2,subvolid=257,subvol=/subv2)
# touch /mnt/scratch/foobar
touch: cannot touch '/mnt/scratch/foobar': Read-only file system
This is a common use cases on distros.
[CAUSE]
We have a workaround for remount to handle the RO->RW change, but if the
mount is using the new mount API, we do not do that, and rely on the
mount tool NOT to set the ro flag.
But that's not how the mount tool is doing for the new API:
fsconfig(3, FSCONFIG_SET_STRING, "source", "/dev/mapper/test-scratch1", 0) = 0
fsconfig(3, FSCONFIG_SET_STRING, "subvol", "subv1", 0) = 0
fsconfig(3, FSCONFIG_SET_FLAG, "ro", NULL, 0) = 0 <<<< Setting RO flag for super block
fsconfig(3, FSCONFIG_CMD_CREATE, NULL, NULL, 0) = 0
fsmount(3, FSMOUNT_CLOEXEC, 0) = 4
mount_setattr(4, "", AT_EMPTY_PATH, {attr_set=MOUNT_ATTR_RDONLY, attr_clr=0, propagation=0 /* MS_??? */, userns_fd=0}, 32) = 0
move_mount(4, "", AT_FDCWD, "/mnt/test", MOVE_MOUNT_F_EMPTY_PATH) = 0
This means we will set the super block RO at the first mount.
Later RW mount will not try to reconfigure the fs to RW because the
mount tool is already using the new API.
This totally breaks the per-subvolume RO/RW mount behavior.
[FIX]
Do not skip the reconfiguration even if using the new API. The old
comments are just expecting any mount tool to properly skip the RO flag
set even if we specify "ro", which is not the reality.
Update the comments regarding the backward compatibility on the kernel
level so it works with old and new mount utilities.
CC: stable@vger.kernel.org # 6.8+
Fixes: f044b31867 ("btrfs: handle the ro->rw transition for mounting different subvolumes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Stable Fixes:
* Fix KMSAN warning in decode_getfattr_attrs()
Other Bugfixes:
* Handle -ENOTCONN in xs_tcp_setup_socked()
* NFSv3: only use NFS timeout for MOUNT when protocols are compatible
* Fix attribute delegation behavior on exclusive create and a/mtime changes
* Fix localio to cope with racing nfs_local_probe()
* Avoid i_lock contention in fs_clear_invalid_mapping()
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAmcr1HUACgkQ18tUv7Cl
QOtdYBAA0YohWDHflcHPbltJu0UyCyDDtowvpVacSDJwZwVEXnLQRTTqrdUnWVxx
Bc2Ae8tGsfcwo10yZ6LUIPjcyEqLQeYvKoKv2Awf0j7eubjRYZrQVypIKtmy8aC2
H5ETCyrbIubE06jX8EPO8LFxQ+T6nGD7kC8qJZL8z/aNVXGA2nRRCi7AzdE4o6Ht
0t6fC+W5vxJ4hQHYKb59nGvREMwpKSLg2U4wo1lyFvkDxEJ06DobGOKEtD333cI8
Mou/1UlSZ6RzgfwJNIPMMpCepIp2spaDeet0XVN+zqzxg55Jmk7LqpxP5pswTjLb
WsxErV9ZRXtwutCCf+IDoMCv/YS4g4ZG7CLKXQ4felKJVYIuiS4z0n659xRqLyyi
nW71vrRUdOBE3rCXUW6crZYwX/fHDvl6bsq9/h7cy2ZPnbGkVvXx+LIm0dJRenfb
MaxVM3CyrMnzL3UUk/caK/rVCOHrDD5q/dAtSNfizMWnqoX+gXby3ho6Zwn0Wj89
NiUZJIRI/s4V1WzMw4g+Daz7LUUwGblODTtphH2nnKRDfTiYXeT/r/waU6zUOVcS
7Jd285DF/tkQp2SJ3nvsM/ni7TD2UuG2BsKA3Urlht9i32lwyENeS3nNcx6aHo3i
blNpD+9mp3vZfWWZNVvLM/JldcIqEvd30+P6GWwS/Td8Zz4PYIM=
=9mwu
-----END PGP SIGNATURE-----
Merge tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"These are mostly fixes that came up during the nfs bakeathon the other
week.
Stable Fixes:
- Fix KMSAN warning in decode_getfattr_attrs()
Other Bugfixes:
- Handle -ENOTCONN in xs_tcp_setup_socked()
- NFSv3: only use NFS timeout for MOUNT when protocols are compatible
- Fix attribute delegation behavior on exclusive create and a/mtime
changes
- Fix localio to cope with racing nfs_local_probe()
- Avoid i_lock contention in fs_clear_invalid_mapping()"
* tag 'nfs-for-6.12-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
nfs: avoid i_lock contention in nfs_clear_invalid_mapping
nfs_common: fix localio to cope with racing nfs_local_probe()
NFS: Further fixes to attribute delegation a/mtime changes
NFS: Fix attribute delegation behaviour on exclusive create
nfs: Fix KMSAN warning in decode_getfattr_attrs()
NFSv3: only use NFS timeout for MOUNT when protocols are compatible
sunrpc: handle -ENOTCONN in xs_tcp_setup_socket()
The commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
broke the gid setting when set by fstab or other mount utility.
It is ignored when it is set. Fix the code so that it recognises the
option again and will honor the settings on mount at boot up.
Update the internal documentation and create a selftest to make sure
it doesn't break again in the future.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZyuidRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsgQAQDuV0x4RLpCrrowDS/ITQw/eb/WjhR7
lhkXVROLN6RK6wD+JWmbaCP82q2S4A2Vx0Rjc72gUMmTzDb1HQflhQiLhwU=
=0dZF
-----END PGP SIGNATURE-----
Merge tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracefs fixes from Steven Rostedt:
"Fix tracefs mount options.
Commit 78ff640819 ("vfs: Convert tracefs to use the new mount API")
broke the gid setting when set by fstab or other mount utility. It is
ignored when it is set. Fix the code so that it recognises the option
again and will honor the settings on mount at boot up.
Update the internal documentation and create a selftest to make sure
it doesn't break again in the future"
* tag 'tracefs-v6.12-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing/selftests: Add tracefs mount options test
tracing: Document tracefs gid mount option
tracing: Fix tracefs mount options
Curretly in function generic_listxattr the for_each_xattr_handler loop
checks err and will return out of the function if err is non-zero.
It's impossible for err to be non-zero at the end of the function where
err is checked again for a non-zero value. The final non-zero check is
therefore redundant and can be removed. Also move the declaration of
err into the loop.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
switch path_removexattrat() and fremovexattr(2) to those
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
switch path_listxattr() and flistxattr(2) to those
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
similar to do_setxattr() in the previous commit...
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
io_uring setxattr logics duplicates stuff from fs/xattr.c; provide
saner helpers (filename_setxattr() and file_setxattr() resp.) and
use them.
NB: putname(ERR_PTR()) is a no-op
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Rename the struct xattr_ctx to increase distinction with the about to be
added user API struct xattr_args.
No functional change.
Suggested-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Göttsche <cgzones@googlemail.com>
Link: https://lore.kernel.org/r/20240426162042.191916-2-cgoettsche@seltendoof.de
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>