The path cache used during fiemap used to determine the sharedness of
extent buffers in a path from a leaf containing a file extent item
pointing to our data extent up to the root node of the tree, is meant to
be used for a single path. Having a single path is by far the most common
case, and therefore worth to optimize for, but it's possible to actually
have multiple paths because we have 2 or more leaves.
If we have multiple leaves, the 'level' variable keeps getting incremented
in each iteration of the while loop at btrfs_is_data_extent_shared(),
which means we will treat the second leaf in the 'tmp' ulist as a level 1
node, and so forth. In the worst case this can lead to getting a level
greater than or equals to BTRFS_MAX_LEVEL (8), which will trigger a
WARN_ON_ONCE() in the functions to lookup from or store in the path cache
(lookup_backref_shared_cache() and store_backref_shared_cache()). If the
current level never goes beyond 8, due to shared nodes in the paths and
a fs tree height smaller than 8, it can still result in incorrectly
marking one leaf as shared because some other leaf is shared and is stored
one level below that other leaf, as when storing a true sharedness value
in the cache results in updating the sharedness to true of all entries in
the cache below the current level.
Having multiple leaves happens in a case like the following:
- We have a file extent item point to data extent at bytenr X, for
a file range [0, 1M[ for example;
- At this moment we have an extent data ref for the extent, with
an offset of 0 and a count of 1;
- A write into the middle of the extent happens, file range [64K, 128K)
so the file extent item is split into two (at btrfs_drop_extents()):
1) One for file range [0, 64K), with a length (num_bytes field) of
64K and an extent offset of 0;
2) Another one for file range [128K, 1M), with a length of 896K
(1M - 128K) and an extent offset of 128K.
- At this moment the two file extent items are located in the same
leaf;
- A new file extent item for the range [64K, 128K), pointing to a new
data extent, is inserted in the leaf. This results in a leaf split
and now those two file extent items pointing to data extent X end
up located in different leaves;
- Once delayed refs are run, we still have a single extent data ref
item for our data extent at bytenr X, for offset 0, but now with a
count of 2 instead of 1;
- So during fiemap, at btrfs_is_data_extent_shared(), after we call
find_parent_nodes() for the data extent, we get two leaves, since
we have two file extent items point to data extent at bytenr X that
are located in two different leaves.
So skip the use of the path cache when we get more than one leaf.
Fixes: 12a824dc67 ("btrfs: speedup checking for extent sharedness during fiemap")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During backref walking, when processing a delayed reference with a type of
BTRFS_TREE_BLOCK_REF_KEY, we have two bugs there:
1) We are accessing the delayed references extent_op, and its key, without
the protection of the delayed ref head's lock;
2) If there's no extent op for the delayed ref head, we end up with an
uninitialized key in the stack, variable 'tmp_op_key', and then pass
it to add_indirect_ref(), which adds the reference to the indirect
refs rb tree.
This is wrong, because indirect references should have a NULL key
when we don't have access to the key, and in that case they should be
added to the indirect_missing_keys rb tree and not to the indirect rb
tree.
This means that if have BTRFS_TREE_BLOCK_REF_KEY delayed ref resulting
from freeing an extent buffer, therefore with a count of -1, it will
not cancel out the corresponding reference we have in the extent tree
(with a count of 1), since both references end up in different rb
trees.
When using fiemap, where we often need to check if extents are shared
through shared subtrees resulting from snapshots, it means we can
incorrectly report an extent as shared when it's no longer shared.
However this is temporary because after the transaction is committed
the extent is no longer reported as shared, as running the delayed
reference results in deleting the tree block reference from the extent
tree.
Outside the fiemap context, the result is unpredictable, as the key was
not initialized but it's used when navigating the rb trees to insert
and search for references (prelim_ref_compare()), and we expect all
references in the indirect rb tree to have valid keys.
The following reproducer triggers the second bug:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount -o compress $DEV $MNT
# With a compressed 128M file we get a tree height of 2 (level 1 root).
xfs_io -f -c "pwrite -b 1M 0 128M" $MNT/foo
btrfs subvolume snapshot $MNT $MNT/snap
# Fiemap should output 0x2008 in the flags column.
# 0x2000 means shared extent
# 0x8 means encoded extent (because it's compressed)
echo
echo "fiemap after snapshot, range [120M, 120M + 128K):"
xfs_io -c "fiemap -v 120M 128K" $MNT/foo
echo
# Overwrite one extent and fsync to flush delalloc and COW a new path
# in the snapshot's tree.
#
# After this we have a BTRFS_DROP_DELAYED_REF delayed ref of type
# BTRFS_TREE_BLOCK_REF_KEY with a count of -1 for every COWed extent
# buffer in the path.
#
# In the extent tree we have inline references of type
# BTRFS_TREE_BLOCK_REF_KEY, with a count of 1, for the same extent
# buffers, so they should cancel each other, and the extent buffers in
# the fs tree should no longer be considered as shared.
#
echo "Overwriting file range [120M, 120M + 128K)..."
xfs_io -c "pwrite -b 128K 120M 128K" $MNT/snap/foo
xfs_io -c "fsync" $MNT/snap/foo
# Fiemap should output 0x8 in the flags column. The extent in the range
# [120M, 120M + 128K) is no longer shared, it's now exclusive to the fs
# tree.
echo
echo "fiemap after overwrite range [120M, 120M + 128K):"
xfs_io -c "fiemap -v 120M 128K" $MNT/foo
echo
umount $MNT
Running it before this patch:
$ ./test.sh
(...)
wrote 134217728/134217728 bytes at offset 0
128 MiB, 128 ops; 0.1152 sec (1.085 GiB/sec and 1110.5809 ops/sec)
Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
fiemap after snapshot, range [120M, 120M + 128K):
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [245760..246015]: 34304..34559 256 0x2008
Overwriting file range [120M, 120M + 128K)...
wrote 131072/131072 bytes at offset 125829120
128 KiB, 1 ops; 0.0001 sec (683.060 MiB/sec and 5464.4809 ops/sec)
fiemap after overwrite range [120M, 120M + 128K):
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [245760..246015]: 34304..34559 256 0x2008
The extent in the range [120M, 120M + 128K) is still reported as shared
(0x2000 bit set) after overwriting that range and flushing delalloc, which
is not correct - an entire path was COWed in the snapshot's tree and the
extent is now only referenced by the original fs tree.
Running it after this patch:
$ ./test.sh
(...)
wrote 134217728/134217728 bytes at offset 0
128 MiB, 128 ops; 0.1198 sec (1.043 GiB/sec and 1068.2067 ops/sec)
Create a snapshot of '/mnt/sdj' in '/mnt/sdj/snap'
fiemap after snapshot, range [120M, 120M + 128K):
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [245760..246015]: 34304..34559 256 0x2008
Overwriting file range [120M, 120M + 128K)...
wrote 131072/131072 bytes at offset 125829120
128 KiB, 1 ops; 0.0001 sec (694.444 MiB/sec and 5555.5556 ops/sec)
fiemap after overwrite range [120M, 120M + 128K):
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [245760..246015]: 34304..34559 256 0x8
Now the extent is not reported as shared anymore.
So fix this by passing a NULL key pointer to add_indirect_ref() when
processing a delayed reference for a tree block if there's no extent op
for our delayed ref head with a defined key. Also access the extent op
only after locking the delayed ref head's lock.
The reproducer will be converted later to a test case for fstests.
Fixes: 86d5f99442 ("btrfs: convert prelimary reference tracking to use rbtrees")
Fixes: a6dbceafb9 ("btrfs: Remove unused op_key var from add_delayed_refs")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When processing delayed data references during backref walking and we are
using a share context (we are being called through fiemap), whenever we
find a delayed data reference for an inode different from the one we are
interested in, then we immediately exit and consider the data extent as
shared. This is wrong, because:
1) This might be a DROP reference that will cancel out a reference in the
extent tree;
2) Even if it's an ADD reference, it may be followed by a DROP reference
that cancels it out.
In either case we should not exit immediately.
Fix this by never exiting when we find a delayed data reference for
another inode - instead add the reference and if it does not cancel out
other delayed reference, we will exit early when we call
extent_is_shared() after processing all delayed references. If we find
a drop reference, then signal the code that processes references from
the extent tree (add_inline_refs() and add_keyed_refs()) to not exit
immediately if it finds there a reference for another inode, since we
have delayed drop references that may cancel it out. In this later case
we exit once we don't have references in the rb trees that cancel out
each other and have two references for different inodes.
Example reproducer for case 1):
$ cat test-1.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
xfs_io -f -c "pwrite 0 64K" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
echo
echo "fiemap after cloning:"
xfs_io -c "fiemap -v" $MNT/foo
rm -f $MNT/bar
echo
echo "fiemap after removing file bar:"
xfs_io -c "fiemap -v" $MNT/foo
umount $MNT
Running it before this patch, the extent is still listed as shared, it has
the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
$ ./test-1.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
Example reproducer for case 2):
$ cat test-2.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
xfs_io -f -c "pwrite 0 64K" $MNT/foo
cp --reflink=always $MNT/foo $MNT/bar
# Flush delayed references to the extent tree and commit current
# transaction.
sync
echo
echo "fiemap after cloning:"
xfs_io -c "fiemap -v" $MNT/foo
rm -f $MNT/bar
echo
echo "fiemap after removing file bar:"
xfs_io -c "fiemap -v" $MNT/foo
umount $MNT
Running it before this patch, the extent is still listed as shared, it has
the flag 0x2000 (FIEMAP_EXTENT_SHARED) set:
$ ./test-2.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
After this patch, after deleting bar in both tests, the extent is not
reported with the 0x2000 flag anymore, it gets only the flag 0x1
(which is FIEMAP_EXTENT_LAST):
$ ./test-1.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x1
$ ./test-2.sh
fiemap after cloning:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x2001
fiemap after removing file bar:
/mnt/sdj/foo:
EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS
0: [0..127]: 26624..26751 128 0x1
These tests will later be converted to a test case for fstests.
Fixes: dc046b10c8 ("Btrfs: make fiemap not blow when you have lots of snapshots")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When looking the stored result for a cached path node, if the stored
result is valid and has a value of true, we must update all the nodes for
all levels below it with a result of true as well. This is necessary when
moving from one leaf in the fs tree to the next one, as well as when
moving from a node at any level to the next node at the same level.
Currently this logic is missing as it was somehow forgotten by a recent
patch with the subject: "btrfs: speedup checking for extent sharedness
during fiemap".
This adds the missing logic, which is the counter part to what we do
when adding a shared node to the cache at store_backref_shared_cache().
Fixes: 12a824dc67 ("btrfs: speedup checking for extent sharedness during fiemap")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During fiemap, for each file extent we find, we must check if it's shared
or not. The sharedness check starts by verifying if the extent is directly
shared (its refcount in the extent tree is > 1), and if it is not directly
shared, then we will check if every node in the subvolume b+tree leading
from the root to the leaf that has the file extent item (in reverse order),
is shared (through snapshots).
However this second step is not needed if our extent was created in a
transaction more recent than the last transaction where a snapshot of the
inode's root happened, because it can't be shared indirectly (through
shared subtrees) without a snapshot created in a more recent transaction.
So grab the generation of the extent from the extent map and pass it to
btrfs_is_data_extent_shared(), which will skip this second phase when the
generation is more recent than the root's last snapshot value. Note that
we skip this optimization if the extent map is the result of merging 2
or more extent maps, because in this case its generation is the maximum
of the generations of all merged extent maps.
The fact the we use extent maps and they can be merged despite the
underlying extents being distinct (different file extent items in the
subvolume b+tree and different extent items in the extent b+tree), can
result in some bugs when reporting shared extents. But this is a problem
of the current implementation of fiemap relying on extent maps.
One example where we get incorrect results is:
$ cat fiemap-bug.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
mkfs.btrfs -f $DEV
mount $DEV $MNT
# Create a file with two 256K extents.
# Since there is no other write activity, they will be contiguous,
# and their extent maps merged, despite having two distinct extents.
xfs_io -f -c "pwrite -S 0xab 0 256K" \
-c "fsync" \
-c "pwrite -S 0xcd 256K 256K" \
-c "fsync" \
$MNT/foo
# Now clone only the second extent into another file.
xfs_io -f -c "reflink $MNT/foo 256K 0 256K" $MNT/bar
# Filefrag will report a single 512K extent, and say it's not shared.
echo
filefrag -v $MNT/foo
umount $MNT
Running the reproducer:
$ ./fiemap-bug.sh
wrote 262144/262144 bytes at offset 0
256 KiB, 64 ops; 0.0038 sec (65.479 MiB/sec and 16762.7030 ops/sec)
wrote 262144/262144 bytes at offset 262144
256 KiB, 64 ops; 0.0040 sec (61.125 MiB/sec and 15647.9218 ops/sec)
linked 262144/262144 bytes at offset 0
256 KiB, 1 ops; 0.0002 sec (1.034 GiB/sec and 4237.2881 ops/sec)
Filesystem type is: 9123683e
File size of /mnt/sdj/foo is 524288 (128 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 127: 3328.. 3455: 128: last,eof
/mnt/sdj/foo: 1 extent found
We end up reporting that we have a single 512K that is not shared, however
we have two 256K extents, and the second one is shared. Changing the
reproducer to clone instead the first extent into file 'bar', makes us
report a single 512K extent that is shared, which is algo incorrect since
we have two 256K extents and only the first one is shared.
This is z problem that existed before this change, and remains after this
change, as it can't be easily fixed. The next patch in the series reworks
fiemap to primarily use file extent items instead of extent maps (except
for checking for delalloc ranges), with the goal of improving its
scalability and performance, but it also ends up fixing this particular
bug caused by extent map merging.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
One of the most expensive tasks performed during fiemap is to check if
an extent is shared. This task has two major steps:
1) Check if the data extent is shared. This implies checking the extent
item in the extent tree, checking delayed references, etc. If we
find the data extent is directly shared, we terminate immediately;
2) If the data extent is not directly shared (its extent item has a
refcount of 1), then it may be shared if we have snapshots that share
subtrees of the inode's subvolume b+tree. So we check if the leaf
containing the file extent item is shared, then its parent node, then
the parent node of the parent node, etc, until we reach the root node
or we find one of them is shared - in which case we stop immediately.
During fiemap we process the extents of a file from left to right, from
file offset 0 to EOF. This means that we iterate b+tree leaves from left
to right, and has the implication that we keep repeating that second step
above several times for the same b+tree path of the inode's subvolume
b+tree.
For example, if we have two file extent items in leaf X, and the path to
leaf X is A -> B -> C -> X, then when we try to determine if the data
extent referenced by the first extent item is shared, we check if the data
extent is shared - if it's not, then we check if leaf X is shared, if not,
then we check if node C is shared, if not, then check if node B is shared,
if not than check if node A is shared. When we move to the next file
extent item, after determining the data extent is not shared, we repeat
the checks for X, C, B and A - doing all the expensive searches in the
extent tree, delayed refs, etc. If we have thousands of tile extents, then
we keep repeating the sharedness checks for the same paths over and over.
On a file that has no shared extents or only a small portion, it's easy
to see that this scales terribly with the number of extents in the file
and the sizes of the extent and subvolume b+trees.
This change eliminates the repeated sharedness check on extent buffers
by caching the results of the last path used. The results can be used as
long as no snapshots were created since they were cached (for not shared
extent buffers) or no roots were dropped since they were cached (for
shared extent buffers). This greatly reduces the time spent by fiemap for
files with thousands of extents and/or large extent and subvolume b+trees.
Example performance test:
$ cat fiemap-perf-test.sh
#!/bin/bash
DEV=/dev/sdi
MNT=/mnt/sdi
mkfs.btrfs -f $DEV
mount -o compress=lzo $DEV $MNT
# 40G gives 327680 128K file extents (due to compression).
xfs_io -f -c "pwrite -S 0xab -b 1M 0 40G" $MNT/foobar
umount $MNT
mount -o compress=lzo $DEV $MNT
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata not cached)"
start=$(date +%s%N)
filefrag $MNT/foobar
end=$(date +%s%N)
dur=$(( (end - start) / 1000000 ))
echo "fiemap took $dur milliseconds (metadata cached)"
umount $MNT
Before this patch:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 3597 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 2107 milliseconds (metadata cached)
After this patch:
$ ./fiemap-perf-test.sh
(...)
/mnt/sdi/foobar: 327680 extents found
fiemap took 1646 milliseconds (metadata not cached)
/mnt/sdi/foobar: 327680 extents found
fiemap took 698 milliseconds (metadata cached)
That's about 2.2x faster when no metadata is cached, and about 3x faster
when all metadata is cached. On a real filesystem with many other files,
data, directories, etc, the b+trees will be 2 or 3 levels higher,
therefore this optimization will have a higher impact.
Several reports of a slow fiemap show up often, the two Link tags below
refer to two recent reports of such slowness. This patch, together with
the next ones in the series, is meant to address that.
Link: https://lore.kernel.org/linux-btrfs/21dd32c6-f1f9-f44a-466a-e18fdc6788a7@virtuozzo.com/
Link: https://lore.kernel.org/linux-btrfs/Ysace25wh5BbLd5f@atmark-techno.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_check_shared() is supposed to be used to check if a
data extent is shared, but its name is too generic, may easily cause
confusion in the sense that it may be used for metadata extents.
So rename it to btrfs_is_data_extent_shared(), which will also make it
less confusing after the next change that adds a backref lookup cache for
the b+tree nodes that lead to the leaf that contains the file extent item
that points to the target data extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's only one function we pass to iterate_inodes_from_logical as
iterator, so we can drop the indirection and call it directly, after
moving the function to backref.c
Signed-off-by: David Sterba <dsterba@suse.com>
The inode reference iterator interface takes parameters that are derived
from the context parameter, but as it's a void* type the values are
passed individually.
Change the ctx type to inode_fs_path as it's the only thing we pass and
drop any parameters that are derived from that.
Signed-off-by: David Sterba <dsterba@suse.com>
The functions for iterating inode reference take a function parameter
but there's only one value, inode_to_path(). Remove the indirection and
call the function. As paths_from_inode would become just an alias for
iterate_irefs(), merge the two into one function.
Signed-off-by: David Sterba <dsterba@suse.com>
We had an error handling pattern for read_tree_block() like this:
eb = read_tree_block();
if (IS_ERR(eb)) {
/*
* Handling error here
* Normally ended up with return or goto out.
*/
} else if (!extent_buffer_uptodate(eb)) {
/*
* Different error handling here
* Normally also ended up with return or goto out;
*/
}
This is fine, but if we want to add extra check for each
read_tree_block(), the existing if-else-if is not that expandable and
will take reader some seconds to figure out there is no extra branch.
Here we change it to a more common way, without the extra else:
eb = read_tree_block();
if (IS_ERR(eb)) {
/*
* Handling error here
*/
return eb or goto out;
}
if (!extent_buffer_uptodate(eb)) {
/*
* Different error handling here
*/
return eb or goto out;
}
This also removes some oddball call sites which uses some creative way
to check error.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we start having multiple extent roots we'll need to use a helper to
get to the correct extent_root. Rename fs_info->extent_root to
_extent_root and convert all of the users of the extent root to using
the btrfs_extent_root() helper. This will allow us to easily clean up
the remaining direct accesses in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are going to have many extent_roots soon, and we don't need a root
here necessarily as we're not modifying anything, we're just getting the
trans handle so we can have an accurate view of references, so use the
tree_root here.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we're looking for leafs that point to a data extent we want to record
the extent items that point at our bytenr. At this point we have the
reference and we know for a fact that this leaf should have a reference
to our bytenr. However if there's some sort of corruption we may not
find any references to our leaf, and thus could end up with eie == NULL.
Replace this BUG_ON() with an ASSERT() and then return -EUCLEAN for the
mortals.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We search for an extent entry with .offset = -1, which shouldn't be a
thing, but corruption happens. Add an ASSERT() for the developers,
return -EUCLEAN for mortals.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We define __TRANS_DUMMY always, so this extra ifdef stuff is not needed.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This comment was much closer to the related code when it was originally
added, but has slowly migrated north far from its ancestral lands. Move
it back down with its people.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We pass in the path, but use btrfs_next_item() using the root we
searched with. Pass the root down to add_keyed_refs() instead of the
fs_info so we can continue to use the same root we searched with.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that all call sites are using the slot number to modify item values,
rename the SETGET helpers to raw_item_*(), and then rework the _nr()
helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then
rename all of the callers to the new helpers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this pattern in a lot of places
item = btrfs_item_nr(slot);
btrfs_item_size(leaf, item);
when we could simply use
btrfs_item_size(leaf, slot);
Fix all callers of btrfs_item_size() and btrfs_item_offset() to use the
_nr variation of the helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently all the callers of btrfs_find_all_roots() pass a value of false
for its ignore_offset argument. This makes the argument pointless and we
can remove it and make btrfs_find_all_roots() always pass false as the
ignore_offset argument for btrfs_find_all_roots_safe(). So just do that.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Using a transaction in btrfs_search_slot is only useful when we are
searching to add or modify the tree. When the function is used for
searching, insert length and mod arguments are 0, there is no need to
use a transaction.
No functional changes, changing for consistency.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_qgroup_trace_extent_post() we call btrfs_find_all_roots() with a
NULL value as the transaction handle argument, which makes that function
take the commit_root_sem semaphore, which is necessary when we don't hold
a transaction handle or any other mechanism to prevent a transaction
commit from wiping out commit roots.
However btrfs_qgroup_trace_extent_post() can be called in a context where
we are holding a write lock on an extent buffer from a subvolume tree,
namely from btrfs_truncate_inode_items(), called either during truncate
or unlink operations. In this case we end up with a lock inversion problem
because the commit_root_sem is a higher level lock, always supposed to be
acquired before locking any extent buffer.
Lockdep detects this lock inversion problem since we switched the extent
buffer locks from custom locks to semaphores, and when running btrfs/158
from fstests, it reported the following trace:
[ 9057.626435] ======================================================
[ 9057.627541] WARNING: possible circular locking dependency detected
[ 9057.628334] 5.14.0-rc2-btrfs-next-93 #1 Not tainted
[ 9057.628961] ------------------------------------------------------
[ 9057.629867] kworker/u16:4/30781 is trying to acquire lock:
[ 9057.630824] ffff8e2590f58760 (btrfs-tree-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.632542]
but task is already holding lock:
[ 9057.633551] ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.635255]
which lock already depends on the new lock.
[ 9057.636292]
the existing dependency chain (in reverse order) is:
[ 9057.637240]
-> #1 (&fs_info->commit_root_sem){++++}-{3:3}:
[ 9057.638138] down_read+0x46/0x140
[ 9057.638648] btrfs_find_all_roots+0x41/0x80 [btrfs]
[ 9057.639398] btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
[ 9057.640283] btrfs_add_delayed_data_ref+0x418/0x490 [btrfs]
[ 9057.641114] btrfs_free_extent+0x35/0xb0 [btrfs]
[ 9057.641819] btrfs_truncate_inode_items+0x424/0xf70 [btrfs]
[ 9057.642643] btrfs_evict_inode+0x454/0x4f0 [btrfs]
[ 9057.643418] evict+0xcf/0x1d0
[ 9057.643895] do_unlinkat+0x1e9/0x300
[ 9057.644525] do_syscall_64+0x3b/0xc0
[ 9057.645110] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 9057.645835]
-> #0 (btrfs-tree-00){++++}-{3:3}:
[ 9057.646600] __lock_acquire+0x130e/0x2210
[ 9057.647248] lock_acquire+0xd7/0x310
[ 9057.647773] down_read_nested+0x4b/0x140
[ 9057.648350] __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.649175] btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 9057.650010] btrfs_search_slot+0x537/0xc00 [btrfs]
[ 9057.650849] scrub_print_warning_inode+0x89/0x370 [btrfs]
[ 9057.651733] iterate_extent_inodes+0x1e3/0x280 [btrfs]
[ 9057.652501] scrub_print_warning+0x15d/0x2f0 [btrfs]
[ 9057.653264] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
[ 9057.654295] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
[ 9057.655111] btrfs_work_helper+0xf8/0x400 [btrfs]
[ 9057.655831] process_one_work+0x247/0x5a0
[ 9057.656425] worker_thread+0x55/0x3c0
[ 9057.656993] kthread+0x155/0x180
[ 9057.657494] ret_from_fork+0x22/0x30
[ 9057.658030]
other info that might help us debug this:
[ 9057.659064] Possible unsafe locking scenario:
[ 9057.659824] CPU0 CPU1
[ 9057.660402] ---- ----
[ 9057.660988] lock(&fs_info->commit_root_sem);
[ 9057.661581] lock(btrfs-tree-00);
[ 9057.662348] lock(&fs_info->commit_root_sem);
[ 9057.663254] lock(btrfs-tree-00);
[ 9057.663690]
*** DEADLOCK ***
[ 9057.664437] 4 locks held by kworker/u16:4/30781:
[ 9057.665023] #0: ffff8e25922a1148 ((wq_completion)btrfs-scrub){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
[ 9057.666260] #1: ffffabb3451ffe70 ((work_completion)(&work->normal_work)){+.+.}-{0:0}, at: process_one_work+0x1c7/0x5a0
[ 9057.667639] #2: ffff8e25922da198 (&ret->mutex){+.+.}-{3:3}, at: scrub_handle_errored_block.isra.0+0x5d2/0x1640 [btrfs]
[ 9057.669017] #3: ffff8e25582d4b70 (&fs_info->commit_root_sem){++++}-{3:3}, at: iterate_extent_inodes+0x10b/0x280 [btrfs]
[ 9057.670408]
stack backtrace:
[ 9057.670976] CPU: 7 PID: 30781 Comm: kworker/u16:4 Not tainted 5.14.0-rc2-btrfs-next-93 #1
[ 9057.672030] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 9057.673492] Workqueue: btrfs-scrub btrfs_work_helper [btrfs]
[ 9057.674258] Call Trace:
[ 9057.674588] dump_stack_lvl+0x57/0x72
[ 9057.675083] check_noncircular+0xf3/0x110
[ 9057.675611] __lock_acquire+0x130e/0x2210
[ 9057.676132] lock_acquire+0xd7/0x310
[ 9057.676605] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.677313] ? lock_is_held_type+0xe8/0x140
[ 9057.677849] down_read_nested+0x4b/0x140
[ 9057.678349] ? __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.679068] __btrfs_tree_read_lock+0x24/0x110 [btrfs]
[ 9057.679760] btrfs_read_lock_root_node+0x31/0x40 [btrfs]
[ 9057.680458] btrfs_search_slot+0x537/0xc00 [btrfs]
[ 9057.681083] ? _raw_spin_unlock+0x29/0x40
[ 9057.681594] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
[ 9057.682336] scrub_print_warning_inode+0x89/0x370 [btrfs]
[ 9057.683058] ? btrfs_find_all_roots_safe+0x11f/0x140 [btrfs]
[ 9057.683834] ? scrub_write_block_to_dev_replace+0xb0/0xb0 [btrfs]
[ 9057.684632] iterate_extent_inodes+0x1e3/0x280 [btrfs]
[ 9057.685316] scrub_print_warning+0x15d/0x2f0 [btrfs]
[ 9057.685977] ? ___ratelimit+0xa4/0x110
[ 9057.686460] scrub_handle_errored_block.isra.0+0x135f/0x1640 [btrfs]
[ 9057.687316] scrub_bio_end_io_worker+0x101/0x2e0 [btrfs]
[ 9057.688021] btrfs_work_helper+0xf8/0x400 [btrfs]
[ 9057.688649] ? lock_is_held_type+0xe8/0x140
[ 9057.689180] process_one_work+0x247/0x5a0
[ 9057.689696] worker_thread+0x55/0x3c0
[ 9057.690175] ? process_one_work+0x5a0/0x5a0
[ 9057.690731] kthread+0x155/0x180
[ 9057.691158] ? set_kthread_struct+0x40/0x40
[ 9057.691697] ret_from_fork+0x22/0x30
Fix this by making btrfs_find_all_roots() never attempt to lock the
commit_root_sem when it is called from btrfs_qgroup_trace_extent_post().
We can't just pass a non-NULL transaction handle to btrfs_find_all_roots()
from btrfs_qgroup_trace_extent_post(), because that would make backref
lookup not use commit roots and acquire read locks on extent buffers, and
therefore could deadlock when btrfs_qgroup_trace_extent_post() is called
from the btrfs_truncate_inode_items() code path which has acquired a write
lock on an extent buffer of the subvolume btree.
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree modification log, which records modifications done to btrees, is
quite large and currently spread all over ctree.c, which is a huge file
already.
To make things better organized, move all that code into its own separate
source and header files. Functions and definitions that are used outside
of the module (mostly by ctree.c) are renamed so that they start with a
"btrfs_" prefix. Everything else remains unchanged.
This makes it easier to go over the tree modification log code every
time I need to go read it to fix a bug.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
The backref code is looking for a reloc_root that corresponds to the
given fs root. However any number of things could have gone wrong while
initializing that reloc_root, like ENOMEM while trying to allocate the
root itself, or EIO while trying to write the root item. This would
result in no corresponding reloc_root being in the reloc root cache, and
thus would return NULL when we do the find_reloc_root() call.
Because of this we do not want to WARN_ON(). This presumably was meant
to catch developer errors, cases where we messed up adding the reloc
root. However we can easily hit this case with error injection, and
thus should not do a WARN_ON().
CC: stable@vger.kernel.org # 5.10+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zygo reported the following panic when testing my error handling patches
for relocation:
kernel BUG at fs/btrfs/backref.c:2545!
invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 3 PID: 8472 Comm: btrfs Tainted: G W 14
Hardware name: QEMU Standard PC (i440FX + PIIX,
Call Trace:
btrfs_backref_error_cleanup+0x4df/0x530
build_backref_tree+0x1a5/0x700
? _raw_spin_unlock+0x22/0x30
? release_extent_buffer+0x225/0x280
? free_extent_buffer.part.52+0xd7/0x140
relocate_tree_blocks+0x2a6/0xb60
? kasan_unpoison_shadow+0x35/0x50
? do_relocation+0xc10/0xc10
? kasan_kmalloc+0x9/0x10
? kmem_cache_alloc_trace+0x6a3/0xcb0
? free_extent_buffer.part.52+0xd7/0x140
? rb_insert_color+0x342/0x360
? add_tree_block.isra.36+0x236/0x2b0
relocate_block_group+0x2eb/0x780
? merge_reloc_roots+0x470/0x470
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x18f0
? pvclock_clocksource_read+0xeb/0x190
? btrfs_relocate_chunk+0x120/0x120
? lock_contended+0x620/0x6e0
? do_raw_spin_lock+0x1e0/0x1e0
? do_raw_spin_unlock+0xa8/0x140
btrfs_ioctl_balance+0x1f9/0x460
btrfs_ioctl+0x24c8/0x4380
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurs because of this check
if (RB_EMPTY_NODE(&upper->rb_node))
BUG_ON(!list_empty(&node->upper));
As we are dropping the backref node, if we discover that our upper node
in the edge we just cleaned up isn't linked into the cache that we are
now done with this node, thus the BUG_ON().
However this is an erroneous assumption, as we will look up all the
references for a node first, and then process the pending edges. All of
the 'upper' nodes in our pending edges won't be in the cache's rb_tree
yet, because they haven't been processed. We could very well have many
edges still left to cleanup on this node.
The fact is we simply do not need this check, we can just process all of
the edges only for this node, because below this check we do the
following
if (list_empty(&upper->lower)) {
list_add_tail(&upper->lower, &cache->leaves);
upper->lowest = 1;
}
If the upper node truly isn't used yet, then we add it to the
cache->leaves list to be cleaned up later. If it is still used then the
last child node that has it linked into its node will add it to the
leaves list and then it will be cleaned up.
Fix this problem by dropping this logic altogether. With this fix I no
longer see the panic when testing with error injection in the backref
code.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Zygo reported the following KASAN splat:
BUG: KASAN: use-after-free in btrfs_backref_cleanup_node+0x18a/0x420
Read of size 8 at addr ffff888112402950 by task btrfs/28836
CPU: 0 PID: 28836 Comm: btrfs Tainted: G W 5.10.0-e35f27394290-for-next+ #23
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
Call Trace:
dump_stack+0xbc/0xf9
? btrfs_backref_cleanup_node+0x18a/0x420
print_address_description.constprop.8+0x21/0x210
? record_print_text.cold.34+0x11/0x11
? btrfs_backref_cleanup_node+0x18a/0x420
? btrfs_backref_cleanup_node+0x18a/0x420
kasan_report.cold.10+0x20/0x37
? btrfs_backref_cleanup_node+0x18a/0x420
__asan_load8+0x69/0x90
btrfs_backref_cleanup_node+0x18a/0x420
btrfs_backref_release_cache+0x83/0x1b0
relocate_block_group+0x394/0x780
? merge_reloc_roots+0x4a0/0x4a0
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
? check_flags.part.50+0x6c/0x1e0
? btrfs_relocate_chunk+0x120/0x120
? kmem_cache_alloc_trace+0xa06/0xcb0
? _copy_from_user+0x83/0xc0
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
? __kasan_check_read+0x11/0x20
? check_chain_key+0x1f4/0x2f0
? __asan_loadN+0xf/0x20
? btrfs_ioctl_get_supported_features+0x30/0x30
? kvm_sched_clock_read+0x18/0x30
? check_chain_key+0x1f4/0x2f0
? lock_downgrade+0x3f0/0x3f0
? handle_mm_fault+0xad6/0x2150
? do_vfs_ioctl+0xfc/0x9d0
? ioctl_file_clone+0xe0/0xe0
? check_flags.part.50+0x6c/0x1e0
? check_flags.part.50+0x6c/0x1e0
? check_flags+0x26/0x30
? lock_is_held_type+0xc3/0xf0
? syscall_enter_from_user_mode+0x1b/0x60
? do_syscall_64+0x13/0x80
? rcu_read_lock_sched_held+0xa1/0xd0
? __kasan_check_read+0x11/0x20
? __fget_light+0xae/0x110
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f4c4bdfe427
Allocated by task 28836:
kasan_save_stack+0x21/0x50
__kasan_kmalloc.constprop.18+0xbe/0xd0
kasan_kmalloc+0x9/0x10
kmem_cache_alloc_trace+0x410/0xcb0
btrfs_backref_alloc_node+0x46/0xf0
btrfs_backref_add_tree_node+0x60d/0x11d0
build_backref_tree+0xc5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Freed by task 28836:
kasan_save_stack+0x21/0x50
kasan_set_track+0x20/0x30
kasan_set_free_info+0x1f/0x30
__kasan_slab_free+0xf3/0x140
kasan_slab_free+0xe/0x10
kfree+0xde/0x200
btrfs_backref_error_cleanup+0x452/0x530
build_backref_tree+0x1a5/0x700
relocate_tree_blocks+0x2be/0xb90
relocate_block_group+0x2eb/0x780
btrfs_relocate_block_group+0x26e/0x4c0
btrfs_relocate_chunk+0x52/0x120
btrfs_balance+0xe2e/0x1900
btrfs_ioctl_balance+0x3a7/0x460
btrfs_ioctl+0x24c8/0x4360
__x64_sys_ioctl+0xc3/0x100
do_syscall_64+0x37/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This occurred because we freed our backref node in
btrfs_backref_error_cleanup(), but then tried to free it again in
btrfs_backref_release_cache(). This is because
btrfs_backref_release_cache() will cycle through all of the
cache->leaves nodes and free them up. However
btrfs_backref_error_cleanup() freed the backref node with
btrfs_backref_free_node(), which simply kfree()d the backref node
without unlinking it from the cache. Change this to a
btrfs_backref_drop_node(), which does the appropriate cleanup and
removes the node from the cache->leaves list, so when we go to free the
remaining cache we don't trip over items we've already dropped.
Fixes: 75bfb9aff4 ("Btrfs: cleanup error handling in build_backref_tree")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In order to properly set the lockdep class of a newly allocated block we
need to know the owner of the block. For non-refcounted trees this is
straightforward, we always know in advance what tree we're reading from.
For refcounted trees we don't necessarily know, however all refcounted
trees share the same lockdep class name, tree-<level>.
Fix all the callers of read_tree_block() to pass in the root objectid
we're using. In places like relocation and backref we could probably
unconditionally use 0, but just in case use the root when we have it,
otherwise use 0 in the cases we don't have the root as it's going to be
a refcounted tree anyway.
This is a preparation patch for further changes.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We no longer distinguish between blocking and spinning, so rip out all
this code.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that we're using a rw_semaphore we no longer need to indicate if a
lock is blocking or not, nor do we need to flip the entire path from
blocking to spinning. Remove these helpers and all the places they are
called.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
I got the following lockdep splat with tree locks converted to rwsem
patches on btrfs/104:
======================================================
WARNING: possible circular locking dependency detected
5.9.0+ #102 Not tainted
------------------------------------------------------
btrfs-cleaner/903 is trying to acquire lock:
ffff8e7fab6ffe30 (btrfs-root-00){++++}-{3:3}, at: __btrfs_tree_read_lock+0x32/0x170
but task is already holding lock:
ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (&fs_info->commit_root_sem){++++}-{3:3}:
down_read+0x40/0x130
caching_thread+0x53/0x5a0
btrfs_work_helper+0xfa/0x520
process_one_work+0x238/0x540
worker_thread+0x55/0x3c0
kthread+0x13a/0x150
ret_from_fork+0x1f/0x30
-> #2 (&caching_ctl->mutex){+.+.}-{3:3}:
__mutex_lock+0x7e/0x7b0
btrfs_cache_block_group+0x1e0/0x510
find_free_extent+0xb6e/0x12f0
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb1/0x330
alloc_tree_block_no_bg_flush+0x4f/0x60
__btrfs_cow_block+0x11d/0x580
btrfs_cow_block+0x10c/0x220
commit_cowonly_roots+0x47/0x2e0
btrfs_commit_transaction+0x595/0xbd0
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x36/0xa0
cleanup_mnt+0x12d/0x190
task_work_run+0x5c/0xa0
exit_to_user_mode_prepare+0x1df/0x200
syscall_exit_to_user_mode+0x54/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #1 (&space_info->groups_sem){++++}-{3:3}:
down_read+0x40/0x130
find_free_extent+0x2ed/0x12f0
btrfs_reserve_extent+0xb3/0x1b0
btrfs_alloc_tree_block+0xb1/0x330
alloc_tree_block_no_bg_flush+0x4f/0x60
__btrfs_cow_block+0x11d/0x580
btrfs_cow_block+0x10c/0x220
commit_cowonly_roots+0x47/0x2e0
btrfs_commit_transaction+0x595/0xbd0
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20
deactivate_locked_super+0x36/0xa0
cleanup_mnt+0x12d/0x190
task_work_run+0x5c/0xa0
exit_to_user_mode_prepare+0x1df/0x200
syscall_exit_to_user_mode+0x54/0x280
entry_SYSCALL_64_after_hwframe+0x44/0xa9
-> #0 (btrfs-root-00){++++}-{3:3}:
__lock_acquire+0x1167/0x2150
lock_acquire+0xb9/0x3d0
down_read_nested+0x43/0x130
__btrfs_tree_read_lock+0x32/0x170
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x614/0x9d0
btrfs_find_root+0x35/0x1b0
btrfs_read_tree_root+0x61/0x120
btrfs_get_root_ref+0x14b/0x600
find_parent_nodes+0x3e6/0x1b30
btrfs_find_all_roots_safe+0xb4/0x130
btrfs_find_all_roots+0x60/0x80
btrfs_qgroup_trace_extent_post+0x27/0x40
btrfs_add_delayed_data_ref+0x3fd/0x460
btrfs_free_extent+0x42/0x100
__btrfs_mod_ref+0x1d7/0x2f0
walk_up_proc+0x11c/0x400
walk_up_tree+0xf0/0x180
btrfs_drop_snapshot+0x1c7/0x780
btrfs_clean_one_deleted_snapshot+0xfb/0x110
cleaner_kthread+0xd4/0x140
kthread+0x13a/0x150
ret_from_fork+0x1f/0x30
other info that might help us debug this:
Chain exists of:
btrfs-root-00 --> &caching_ctl->mutex --> &fs_info->commit_root_sem
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->commit_root_sem);
lock(&caching_ctl->mutex);
lock(&fs_info->commit_root_sem);
lock(btrfs-root-00);
*** DEADLOCK ***
3 locks held by btrfs-cleaner/903:
#0: ffff8e7fab628838 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: cleaner_kthread+0x6e/0x140
#1: ffff8e7faadac640 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x40b/0x5c0
#2: ffff8e7fab628a88 (&fs_info->commit_root_sem){++++}-{3:3}, at: btrfs_find_all_roots+0x41/0x80
stack backtrace:
CPU: 0 PID: 903 Comm: btrfs-cleaner Not tainted 5.9.0+ #102
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
Call Trace:
dump_stack+0x8b/0xb0
check_noncircular+0xcf/0xf0
__lock_acquire+0x1167/0x2150
? __bfs+0x42/0x210
lock_acquire+0xb9/0x3d0
? __btrfs_tree_read_lock+0x32/0x170
down_read_nested+0x43/0x130
? __btrfs_tree_read_lock+0x32/0x170
__btrfs_tree_read_lock+0x32/0x170
__btrfs_read_lock_root_node+0x3a/0x50
btrfs_search_slot+0x614/0x9d0
? find_held_lock+0x2b/0x80
btrfs_find_root+0x35/0x1b0
? do_raw_spin_unlock+0x4b/0xa0
btrfs_read_tree_root+0x61/0x120
btrfs_get_root_ref+0x14b/0x600
find_parent_nodes+0x3e6/0x1b30
btrfs_find_all_roots_safe+0xb4/0x130
btrfs_find_all_roots+0x60/0x80
btrfs_qgroup_trace_extent_post+0x27/0x40
btrfs_add_delayed_data_ref+0x3fd/0x460
btrfs_free_extent+0x42/0x100
__btrfs_mod_ref+0x1d7/0x2f0
walk_up_proc+0x11c/0x400
walk_up_tree+0xf0/0x180
btrfs_drop_snapshot+0x1c7/0x780
? btrfs_clean_one_deleted_snapshot+0x73/0x110
btrfs_clean_one_deleted_snapshot+0xfb/0x110
cleaner_kthread+0xd4/0x140
? btrfs_alloc_root+0x50/0x50
kthread+0x13a/0x150
? kthread_create_worker_on_cpu+0x40/0x40
ret_from_fork+0x1f/0x30
BTRFS info (device sdb): disk space caching is enabled
BTRFS info (device sdb): has skinny extents
This happens because qgroups does a backref lookup when we create a
delayed ref. From here it may have to look up a root from an indirect
ref, which does a normal lookup on the tree_root, which takes the read
lock on the tree_root nodes.
To fix this we need to add a variant for looking up roots that searches
the commit root of the tree_root. Then when we do the backref search
using the commit root we are sure to not take any locks on the tree_root
nodes. This gets rid of the lockdep splat when running btrfs/104.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_orphan_cleanup, there's another instance of fs_info, but it's
the same as the one we already have.
In btrfs_backref_finish_upper_links, rb_node is same type and used
as temporary cursor to the tree.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The `if (!ret)` check will always be false and it may result in
ret->path being dereferenced while it is a NULL pointer.
Fixes: a37f232b7b ("btrfs: backref: introduce the skeleton of btrfs_backref_iter")
CC: stable@vger.kernel.org # 5.8+
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boleyn Su <boleynsu@google.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The main function to lookup a root by its id btrfs_get_fs_root takes the
whole key, while only using the objectid. The value of offset is preset
to (u64)-1 but not actually used until btrfs_find_root that does the
actual search.
Switch btrfs_get_fs_root to use only objectid and remove all local
variables that existed just for the lookup. The actual key for search is
set up in btrfs_get_fs_root, reusing another key variable.
Signed-off-by: David Sterba <dsterba@suse.com>
The name BTRFS_ROOT_REF_COWS is not very clear about the meaning.
In fact, that bit can only be set to those trees:
- Subvolume roots
- Data reloc root
- Reloc roots for above roots
All other trees won't get this bit set. So just by the result, it is
obvious that, roots with this bit set can have tree blocks shared with
other trees. Either shared by snapshots, or by reloc roots (an special
snapshot created by relocation).
This patch will rename BTRFS_ROOT_REF_COWS to BTRFS_ROOT_SHAREABLE to
make it easier to understand, and update all comment mentioning
"reference counted" to follow the rename.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For relocation tree detection, relocation backref cache uses
btrfs_should_ignore_reloc_root() which uses relocation-specific checks
like checking the DEAD_RELOC_ROOT bit.
However for general purpose backref cache, we can rely on that check, as
it's possible that relocation is also running.
For generic purposed backref cache, we detect reloc root by
SHARED_BLOCK_REF item. Only reloc root node has its parent bytenr
pointing back to itself.
And in that case, backref cache will mark the reloc root node useless,
dropping any child orphan nodes.
So only call btrfs_should_ignore_reloc_root() if the backref cache is
for relocation.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The error cleanup will be extracted as a new function,
btrfs_backref_error_cleanup(), and moved to backref.c and exported for
later usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This the the 2nd major part of generic backref cache. Move it to
backref.c so we can reuse it.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function is the major part of backref cache build process, move it
to backref.c so we can reuse it later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since we're releasing all existing nodes/edges, other than cleanup the
mess after error, "release" is a more proper naming here.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Also add comment explaining the cleanup progress, to differ it from
btrfs_backref_drop_node().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function will go to the next inline/keyed backref for
btrfs_backref_iter infrastructure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Due to the complex nature of btrfs extent tree, when we want to iterate
all backrefs of one extent, this involves quite a lot of work, like
searching the EXTENT_ITEM/METADATA_ITEM, iteration through inline and keyed
backrefs.
Normally this would result in a complex code, something like:
btrfs_search_slot()
/* Ensure we are at EXTENT_ITEM/METADATA_ITEM */
while (1) { /* Loop for extent tree items */
while (ptr < end) { /* Loop for inlined items */
/* Real work here */
}
next:
ret = btrfs_next_item()
/* Ensure we're still at keyed item for specified bytenr */
}
The idea of btrfs_backref_iter is to avoid such complex and hard to
read code structure, but something like the following:
iter = btrfs_backref_iter_alloc();
ret = btrfs_backref_iter_start(iter, bytenr);
if (ret < 0)
goto out;
for (; ; ret = btrfs_backref_iter_next(iter)) {
/* Real work here */
}
out:
btrfs_backref_iter_free(iter);
This patch is just the skeleton + btrfs_backref_iter_start() code.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some older compilers like gcc-4.8 warn about mismatched curly braces in
a initializer:
fs/btrfs/backref.c: In function 'is_shared_data_backref':
fs/btrfs/backref.c:394:9: error: missing braces around
initializer [-Werror=missing-braces]
struct prelim_ref target = {0};
^
fs/btrfs/backref.c:394:9: error: (near initialization for
'target.rbnode') [-Werror=missing-braces]
Use the GNU empty initializer extension to avoid this.
Fixes: ed58f2e66e ("btrfs: backref, don't add refs from shared block when resolving normal backref")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>