forked from Minki/linux
2505a98111
604 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Filipe Manana
|
50ff57888d |
btrfs: fix leaked plug after failure syncing log on zoned filesystems
On a zoned filesystem, if we fail to allocate the root node for the log
root tree while syncing the log, we end up returning without finishing
the IO plug we started before, resulting in leaking resources as we
have started writeback for extent buffers of a log tree before. That
allocation failure, which typically is either -ENOMEM or -ENOSPC, is not
fatal and the fsync can safely fallback to a full transaction commit.
So release the IO plug if we fail to allocate the extent buffer for the
root of the log root tree when syncing the log on a zoned filesystem.
Fixes:
|
||
Filipe Manana
|
313ab75399 |
btrfs: add and use helper for unlinking inode during log replay
During log replay there is this pattern of running delayed items after every inode unlink. To avoid repeating this several times, move the logic into an helper function and use it instead of calling btrfs_unlink_inode() followed by btrfs_run_delayed_items(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
23e3337faf |
btrfs: reset last_reflink_trans after fsyncing inode
When an inode has a last_reflink_trans matching the current transaction, we have to take special care when logging its checksums in order to avoid getting checksum items with overlapping ranges in a log tree, which could result in missing checksums after log replay (more on that in the changelogs of commit |
||
Filipe Manana
|
96acb3753e |
btrfs: voluntarily relinquish cpu when doing a full fsync
Doing a full fsync may require processing many leaves of metadata, which can take some time and result in a task monopolizing a cpu for too long. So add a cond_resched() after processing a leaf when doing a full fsync, while not holding any locks on any tree (a subvolume or a log tree). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
5b7ce5e287 |
btrfs: hold on to less memory when logging checksums during full fsync
When doing a full fsync, at copy_items(), we iterate over all extents and then collect their checksums into a list. After copying all the extents to the log tree, we then log all the previously collected checksums. Before the previous patch in the series (subject "btrfs: stop copying old file extents when doing a full fsync"), we had to do it this way, because while we were iterating over the items in the leaf of the subvolume tree, we were holding a write lock on a leaf of the log tree, so logging the checksums for an extent right after we collected them could result in a deadlock, in case the checksum items ended up in the same leaf. However after the previous patch in the series we now do a first iteration over all the items in the leaf of the subvolume tree before locking a path in the log tree, so we can now log the checksums right after we have obtained them. This avoids holding in memory all checksums for all extents in the leaf while copying items from the source leaf to the log tree. The amount of memory used to hold all checksums of the extents in a leaf can be significant. For example if a leaf has 200 file extent items referring to 1M extents, using the default crc32c checksums, would result in using over 200K of memory (not accounting for the extra overhead of struct btrfs_ordered_sum), with smaller or less extents it would be less, but it could be much more with more extents per leaf and/or much larger extents. So change copy_items() to log the checksums for an extent after looking them up, and then free their memory, as they are no longer necessary. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
7f30c07288 |
btrfs: stop copying old file extents when doing a full fsync
When logging an inode in full sync mode, we go over every leaf that was modified in the current transaction and has items associated to our inode, and then copy all those items into the log tree. This includes copying file extent items that were created and added to the inode in past transactions, which is useless and only makes use more leaf space in the log tree. It's common to have a file with many file extent items spanning many leaves where only a few file extent items are new and need to be logged, and in such case we log all the file extent items we find in the modified leaves. So change the full sync behaviour to skip over file extent items that are not needed. Those are the ones that match the following criteria: 1) Have a generation older than the current transaction and the inode was not a target of a reflink operation, as that can copy file extent items from a past generation from some other inode into our inode, so we have to log them; 2) Start at an offset within i_size - we must log anything at or beyond i_size, otherwise we would lose prealloc extents after log replay. The following script exercises a scenario where this happens, and it's somehow close enough to what happened often on a SQL Server workload which I had to debug sometime ago to fix an issue where a pattern of writes to prealloc extents and fsync resulted in fsync failing with -EIO (that was commit |
||
Filipe Manana
|
e1f53ed874 |
btrfs: prepare extents to be logged before locking a log tree path
When we want to log an extent, in the fast fsync path, we obtain a path to the leaf that will hold the file extent item either through a deletion search, via btrfs_drop_extents(), or through an insertion search using btrfs_insert_empty_item(). After that we fill the file extent item's fields one by one directly on the leaf. Instead of doing that, we could prepare the file extent item before obtaining a btree path, and then copy the prepared extent item with a single operation once we get the path. This helps avoid some contention on the log tree, since we are holding write locks for longer than necessary, especially in the case where the path is obtained via btrfs_drop_extents() through a deletion search, which always keeps a write lock on the nodes at levels 1 and 2 (besides the leaf). This change does that, we prepare the file extent item that is going to be inserted before acquiring a path, and then copy it into a leaf using a single copy operation once we get a path. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The following test was run to measure the impact of the whole patchset: $ cat test.sh #!/bin/bash DEV=/dev/sdi MNT=/mnt/sdi MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-R free-space-tree -O no-holes" NUM_JOBS=8 FILE_SIZE=128M RUN_TIME=200 cat <<EOF > /tmp/fio-job.ini [writers] rw=randwrite fsync=1 fallocate=none group_reporting=1 direct=0 bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5 ioengine=sync filesize=$FILE_SIZE runtime=$RUN_TIME time_based directory=$MNT numjobs=$NUM_JOBS thread EOF echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor echo echo "Using config:" echo cat /tmp/fio-job.ini echo umount $MNT &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test ran inside a VM (8 cores, 32G of RAM) with the target disk mapping to a raw NVMe device, and using a non-debug kernel config (Debian's default config). Before the patchset: WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec After the patchset: WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec A 7.8% gain on throughput and +7.0% more IO done in the same period of time (200 seconds). Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
d845753170 |
btrfs: remove useless path release in the fast fsync path
There's no point in calling btrfs_release_path() after finishing the loop that logs the modified extents, since log_one_extent() returns with the path released. In case the list of extents is empty, the path is already released, so there's no need for that case as well. So just remove that unnecessary btrfs_release_path() call. This change if part of a patchset that is comprised of the following patches: 1/6 btrfs: remove unnecessary leaf free space checks when pushing items 2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf 3/6 btrfs: avoid unnecessary computation when deleting items from a leaf 4/6 btrfs: remove constraint on number of visited leaves when replacing extents 5/6 btrfs: remove useless path release in the fast fsync path 6/6 btrfs: prepare extents to be logged before locking a log tree path The last patch in the series has some performance test result in its changelog. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
65faced5b9 |
btrfs: use single variable to track return value at btrfs_log_inode()
At btrfs_log_inode(), we have two variables to track errors and the return value of the function, named 'ret' and 'err'. In some places we use 'ret' and if gets a non-zero value we assign its value to 'err' and then jump to the 'out' label, while in other places we use 'err' directly without 'ret' as an intermediary. This is inconsistent, error prone and not necessary. So change that to use only the 'ret' variable, making this consistent with most functions in btrfs. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
0f8ce49821 |
btrfs: avoid inode logging during rename and link when possible
During a rename or link operation, we need to determine if an inode was previously logged or not, and if it was, do some update to the logged inode. We used to rely exclusively on the logged_trans field of struct btrfs_inode to determine that, but that was not reliable because the value of that field is not persisted in the inode item, so it's lost when an inode is evicted and loaded back again. That led to several issues in the past, such as not persisting deletions (such as the case fixed by commit |
||
Filipe Manana
|
259c4b96d7 |
btrfs: stop doing unnecessary log updates during a rename
During a rename, we call __btrfs_unlink_inode(), which will call btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order to remove an inode reference and a directory entry from the log. These are necessary when __btrfs_unlink_inode() is called from the unlink path, but not necessary when it's called from a rename context, because: 1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the inode reference related to the old name, because later in the rename path we call btrfs_log_new_name(), which will drop all inode references from the log and copy all inode references from the subvolume tree to the log tree. So we are doing one unnecessary btree operation which adds additional latency and lock contention in case there are other tasks accessing the log tree; 2) For the btrfs_del_dir_entries_in_log() call, we are now doing the equivalent at btrfs_log_new_name() since the previous patch in the series, that has the subject "btrfs: avoid logging all directory changes during renames". In fact, having __btrfs_unlink_inode() call this function not only adds additional latency and lock contention due to the extra btree operation, but also can make btrfs_log_new_name() unnecessarily log a range item to track the deletion of the old name, since it has no way to known that the directory entry related to the old name was previously logged and already deleted by __btrfs_unlink_inode() through its call to btrfs_del_dir_entries_in_log(). So skip those calls at __btrfs_unlink_inode() when we are doing a rename. Skipping them also allows us now to reduce the duration of time we are pinning a log transaction during renames, which is always beneficial as it's not delaying so much other tasks trying to sync the log tree, in particular we end up not holding the log transaction pinned while adding the new name (adding inode ref, directory entry, etc). This change is part of a patchset comprised of the following patches: 1/5 btrfs: add helper to delete a dir entry from a log tree 2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode 3/5 btrfs: avoid logging all directory changes during renames 4/5 btrfs: stop doing unnecessary log updates during a rename 5/5 btrfs: avoid inode logging during rename and link when possible Just like the previous patch in the series, "btrfs: avoid logging all directory changes during renames", the following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box a using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before patchset: 27399 ms NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%) NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%) NUM_FILES=5000, before patchset: 9241 ms NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%) NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%) NUM_FILES=2000, before patchset: 2550 ms NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%) NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%) NUM_FILES=1000, before patchset: 1088 ms NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%) NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%) The next patch in the series (5/5), also contains dbench results after applying to whole patchset. Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
88d2beec7e |
btrfs: avoid logging all directory changes during renames
When doing a rename of a file, if the file or its old parent directory were logged before, we log the new name of the file and then make sure we log the old parent directory, to ensure that after a log replay the old name of the file is deleted and the new name added. The logging of the old parent directory can take some time, because it will scan all leaves modified in the current transaction, check which directory entries were already logged, copy the ones that were not logged before, etc. In this rename context all we need to do is make sure that the old name of the file is deleted on log replay, so instead of triggering a directory log operation, we can just delete the old directory entry from the log if it's there, or in case it isn't there, just log a range item to signal log replay that the old name must be deleted. So change btrfs_log_new_name() to do that. This scenario is actually not uncommon to trigger, and recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package installations and upgrades, with the zypper tool, were often taking a long time to complete, much more than usual. With strace it could be observed that zypper was spending over 99% of its time on rename operations, and then with further analysis we checked that directory logging was happening too frequently and causing high latencies for the rename operations. Taking into account that installation/upgrade of some of these packages needed about a few thousand file renames, the slowdown was very noticeable for the user. The issue was caused indirectly due to an excessive number of inode evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure, in an efficient way, if an inode was previously logged in the current transaction, so we are pessimistic and assume it was, because in case it was we need to update the logged inode. More details on that in one of the patches in the same series (subject "btrfs: avoid inode logging during rename and link when possible"). Either way, in case the parent directory was logged before, we currently do more work then necessary during a rename, and this change minimizes that amount of work. The following script mimics part of what a package installation/upgrade with zypper does, which is basically renaming a lot of files, in some directory under /usr, to a name with a suffix of "-RPMDELETE": $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=10000 mkfs.btrfs -f $DEV mount $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync # Do some change to testdir and fsync it. echo -n > $MNT/testdir/file_$((NUM_FILES + 1)) xfs_io -c "fsync" $MNT/testdir echo "Renaming $NUM_FILES files..." start=$(date +%s%N) for ((i = 1; i <= $NUM_FILES; i++)); do mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE done end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "Renames took $dur milliseconds" umount $MNT Testing this change on box using a non-debug kernel (Debian's default kernel config) gave the following results: NUM_FILES=10000, before this patch: 27399 ms NUM_FILES=10000, after this patch: 9093 ms (-66.8%) NUM_FILES=5000, before this patch: 9241 ms NUM_FILES=5000, after this patch: 4642 ms (-49.8%) NUM_FILES=2000, before this patch: 2550 ms NUM_FILES=2000, after this patch: 1788 ms (-29.9%) NUM_FILES=1000, before this patch: 1088 ms NUM_FILES=1000, after this patch: 905 ms (-16.9%) Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1193549 Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
d5f5bd5465 |
btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
In the next patch in the series, there will be the need to access the old name, and its length, of an inode when logging the inode during a rename. So instead of passing the inode to btrfs_log_new_name() pass the dentry, because from the dentry we can get the inode, the name and its length. This will avoid passing 3 new parameters to btrfs_log_new_name() in the next patch - the name, its length and an index number. This way we end up passing only 1 new parameter, the index number. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
839061fe88 |
btrfs: add helper to delete a dir entry from a log tree
Move the code that finds and deletes a logged dir entry out of btrfs_del_dir_entries_in_log() into a helper function. This new helper function will be used by another patch in the same series, and serves to avoid having duplicated logic. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
de6bc7f598 |
btrfs: stop trying to log subdirectories created in past transactions
When logging a directory we are trying to log subdirectories that were
changed in the current transaction and created in a past transaction.
This type of behaviour was introduced by commit
|
||
Filipe Manana
|
732d591a5d |
btrfs: stop copying old dir items when logging a directory
When logging a directory, we go over every leaf of the subvolume tree that was changed in the current transaction and copy all its dir index keys to the log tree. That includes copying dir index keys created in past transactions. This is done mostly for simplicity, as after logging the keys we log an item that specifies the start and end ranges of the keys we logged. That item is then used during log replay to figure out which keys need to be deleted - every key in that range that we find in the subvolume tree and is not in the log tree, needs to be deleted. Now that we log only dir index keys, and not dir item keys anymore, when we remove dentries from a directory (due to unlink and rename operations), we can get entire leaves that we changed only for deleting old dir index keys, or that have few dir index keys that are new - this is due to the fact that the offset for new index keys comes from a monotonically increasing counter. We can avoid logging dir index keys from past transactions, and in order to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key type) when we find gaps between consecutive index keys. This massively reduces the amount of logged metadata when we have deleted directory entries, even if it's a small percentage of the total number of entries. The reduction comes from both less items that are logged and instead of logging many dir index items (struct btrfs_dir_item), which have a size of 30 bytes plus a file name, we typically log just a few range items (struct btrfs_dir_log_item), which take only 8 bytes each. Even if no entries were deleted from a directory and only new entries were added, we typically still get a reduction on the amount of logged metadata, because it's very likely the first leaf that got the new dir index entries also has several old dir index entries. So change the logging logic to not log dir index keys created in past transactions and log a range item for every gap it finds between each pair of consecutive index keys, to ensure deletions are tracked and replayed on log replay. This patch is part of a patchset comprised of the following patches: 1/4 btrfs: don't log unnecessary boundary keys when logging directory 2/4 btrfs: put initial index value of a directory in a constant 3/4 btrfs: stop copying old dir items when logging a directory 4/4 btrfs: stop trying to log subdirectories created in past transactions The following test was run on a branch without this patchset and on a branch with the first three patches applied: $ cat test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_FILES=1000000 NUM_FILE_DELETES=10000 MKFS_OPTIONS="-O no-holes -R free-space-tree" MOUNT_OPTIONS="-o ssd" mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $MNT/testdir/file_$i done sync del_inc=$(( $NUM_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The test was run on a non-debug kernel (Debian's default kernel config), and the results were the following for various values of NUM_FILES and NUM_FILE_DELETES: ** before, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 ** dir fsync took 585 ms after deleting 10000 files ** after, NUM_FILES = 1 000 000, NUM_FILE_DELETES = 10 000 ** dir fsync took 34 ms after deleting 10000 files (-94.2%) ** before, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 ** dir fsync took 50 ms after deleting 1000 files ** after, NUM_FILES = 100 000, NUM_FILE_DELETES = 1 000 ** dir fsync took 7 ms after deleting 1000 files (-86.0%) ** before, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 ** dir fsync took 9 ms after deleting 100 files ** after, NUM_FILES = 10 000, NUM_FILE_DELETES = 100 ** dir fsync took 5 ms after deleting 100 files (-44.4%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
a450a4af74 |
btrfs: don't log unnecessary boundary keys when logging directory
Before we start to log dir index keys from a leaf, we check if there is a previous index key, which normally is at the end of a leaf that was not changed in the current transaction. Then we log that key and set the start of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of that key. This is to ensure that if there were deleted index keys between that key and the first key we are going to log, those deletions are replayed in case we need to replay to the log after a power failure. However we really don't need to log that previous key, we can just set the start of the logged range to that key's offset plus 1. This achieves the same and avoids logging one dir index key. The same logic is performed when we finish logging the index keys of a leaf and we find that the next leaf has index keys and was not changed in the current transaction. We are logging the first key of that next leaf and use its offset as the end of range we log. This is just to ensure that if there were deleted index keys between the last index key we logged and the first key of that next leaf, those index keys are deleted if we end up replaying the log. However that is not necessary, we can avoid logging that first index key of the next leaf and instead set the end of the logged range to match the offset of that index key minus 1. So avoid logging those index keys at the boundaries and adjust the start and end offsets of the logged ranges as described above. This patch is part of a patchset comprised of the following patches: 1/4 btrfs: don't log unnecessary boundary keys when logging directory 2/4 btrfs: put initial index value of a directory in a constant 3/4 btrfs: stop copying old dir items when logging a directory 4/4 btrfs: stop trying to log subdirectories created in past transactions Performance test results are listed in the changelog of patch 3/4. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
c816d705b9 |
btrfs: remove write and wait of struct walk_control
The ->write and ->wait fields of struct walk_control, used for log trees, are not used since 2008, more specifically since commit |
||
Filipe Manana
|
4751dc9962 |
btrfs: add missing run of delayed items after unlink during log replay
During log replay, whenever we need to check if a name (dentry) exists in a directory we do searches on the subvolume tree for inode references or or directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY keys as well, before kernel 5.17). However when during log replay we unlink a name, through btrfs_unlink_inode(), we may not delete inode references and dir index keys from a subvolume tree and instead just add the deletions to the delayed inode's delayed items, which will only be run when we commit the transaction used for log replay. This means that after an unlink operation during log replay, if we attempt to search for the same name during log replay, we will not see that the name was already deleted, since the deletion is recorded only on the delayed items. We run delayed items after every unlink operation during log replay, except at unlink_old_inode_refs() and at add_inode_ref(). This was due to an overlook, as delayed items should be run after evert unlink, for the reasons stated above. So fix those two cases. Fixes: |
||
Filipe Manana
|
d994788743 |
btrfs: fix lost prealloc extents beyond eof after full fsync
When doing a full fsync, if we have prealloc extents beyond (or at) eof, and the leaves that contain them were not modified in the current transaction, we end up not logging them. This results in losing those extents when we replay the log after a power failure, since the inode is truncated to the current value of the logged i_size. Just like for the fast fsync path, we need to always log all prealloc extents starting at or beyond i_size. The fast fsync case was fixed in commit |
||
Filipe Manana
|
40cdc50987 |
btrfs: skip reserved bytes warning on unmount after log cleanup failure
After the recent changes made by commit |
||
Josef Bacik
|
8697b8f88e |
btrfs: do not check -EAGAIN when truncating inodes in the log root
We only throttle the btrfs_truncate_inode_items if the root is SHAREABLE, which isn't set on the log root, which means this loop is unnecessary. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
71d18b5354 |
btrfs: add inode to truncate control
In the future we're going to want to use btrfs_truncate_inode_items without looking up the associated inode. In order to accommodate this add the inode to btrfs_truncate_control and handle the case where control->inode is NULL appropriately. This is fairly straightforward, we simply need to add a helper for the trace points, as the file extent map update is controlled by a flag on btrfs_truncate_control. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
487e81d2a4 |
btrfs: pass the ino via truncate control
In the future we are going to want to truncate inode items without needing to have an btrfs_inode to pass in, so add ino to the btrfs_truncate_control and use that to look up the inode items to truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
5caa490ed8 |
btrfs: control extent reference updates with a control flag for truncate
We've had weird bugs in the past where we forgot to adjust the truncate path to deal with the fact that we can be called by the tree log path. Instead of checking if our root is a LOG_ROOT use a flag on the btrfs_truncate_control to indicate that we don't want to do extent reference updates during this truncate. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
d9ac19c380 |
btrfs: add truncate control struct
I'm going to be adding more arguments and counters to btrfs_truncate_inode_items, so add a control struct to handle all of the extra arguments to make it easier to follow. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
26c2c4540d |
btrfs: add an inode-item.h
We have a few helpers in inode-item.c, and I'm going to make a few changes to how we do truncate in the future, so break out these definitions into their own header file to trim down ctree.h some and make it easier to do the work on truncate in the future. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
fc28b25e1f |
btrfs: stop accessing ->csum_root directly
We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
3212fa14e7 |
btrfs: drop the _nr from the item helpers
Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
ccae4a19c9 |
btrfs: remove no longer needed logic for replaying directory deletes
Now that we log only dir index keys when logging a directory, we no longer need to deal with dir item keys in the log replay code for replaying directory deletes. This is also true for the case when we replay a log tree created by a kernel that still logs dir items. So remove the remaining code of the replay of directory deletes algorithm that deals with dir item keys. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
339d035424 |
btrfs: only copy dir index keys when logging a directory
Currently, when logging a directory, we copy both dir items and dir index items from the fs/subvolume tree to the log tree. Both items have exactly the same data (same struct btrfs_dir_item), the difference lies in the key values, where a dir index key contains the index number of a directory entry while the dir item key does not, as it's used for doing fast lookups of an entry by name, while the former is used for sorting entries when listing a directory. We can exploit that and log only the dir index items, since they contain all the information needed to correctly add, replace and delete directory entries when replaying a log tree. Logging only the dir index items is also backward and forward compatible: an unpatched kernel (without this change) can correctly replay a log tree generated by a patched kernel (with this patch), and a patched kernel can correctly replay a log tree generated by an unpatched kernel. The backward compatibility is ensured because: 1) For inserting a new dentry: a dentry is only inserted when we find a new dir index key - we can only insert if we know the dir index offset, which is encoded in the dir index key's offset; 2) For deleting dentries: during log replay, before adding or replacing dentries, we first replay dentry deletions. Whenever we find a dir item key or a dir index key in the subvolume/fs tree that is not logged in a range for which the log tree is authoritative, we do the unlink of the dentry, which removes both the existing dir item key and the dir index key. Therefore logging just dir index keys is enough to ensure dentry deletions are correctly replayed; 3) For dentry replacements: they work when we log only dir index keys and this is mostly due to a combination of 1) and 2). If we replace a dentry with name "foobar" to point from inode A to inode B, then we know the dir index key for the new dentry is different from the old one, as it has an index number (key offset) larger than the old one. This results in replaying a deletion, through replay_dir_deletes(), that causes the old dentry to be removed, both the dir item key and the dir index key, as mentioned at 2). Then when processing the new dir index key, we add the new dentry, adding both a new dir item key and a new index key pointing to inode B, as stated in 1). The forward compatibility, the ability for a patched kernel to replay a log created by an older, unpatched kernel, comes from the changes required for making sure we are able to replay a log that only contains dir index keys - we simply ignore every dir item key we find. So modify directory logging to log only dir index items, and modify the log replay process to ignore dir item keys, from log trees created by an unpatched kernel, and process only with dir index keys. This reduces the amount of logged metadata by about half, and therefore the time spent logging or fsyncing large directories (less CPU time and less IO). The following test script was used to measure this change: #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=10000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" echo umount $MNT The tests were run on a physical machine, with a non-debug kernel (Debian's default kernel config), for different values of $NUM_NEW_FILES and $NUM_FILE_DELETES, and the results were the following: ** Before patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 8412 ms after adding 1000000 files dir fsync took 500 ms after deleting 10000 files ** After patch, NUM_NEW_FILES = 1 000 000, NUM_DELETE_FILES = 10 000 ** dir fsync took 4252 ms after adding 1000000 files (-49.5%) dir fsync took 269 ms after deleting 10000 files (-46.2%) ** Before patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 745 ms after adding 100000 files dir fsync took 59 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 100 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 404 ms after adding 100000 files (-45.8%) dir fsync took 31 ms after deleting 1000 files (-47.5%) ** Before patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 67 ms after adding 10000 files dir fsync took 9 ms after deleting 1000 files ** After patch, NUM_NEW_FILES = 10 000, NUM_DELETE_FILES = 1 000 ** dir fsync took 36 ms after adding 10000 files (-46.3%) dir fsync took 5 ms after deleting 1000 files (-44.4%) ** Before patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 9 ms after adding 1000 files dir fsync took 4 ms after deleting 100 files ** After patch, NUM_NEW_FILES = 1 000, NUM_DELETE_FILES = 100 ** dir fsync took 7 ms after adding 1000 files (-22.2%) dir fsync took 3 ms after deleting 100 files (-25.0%) Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
1b2e5e5c7f |
btrfs: fix missing last dir item offset update when logging directory
When logging a directory, once we finish processing a leaf that is full
of dir items, if we find the next leaf was not modified in the current
transaction, we grab the first key of that next leaf and log it as to
mark the end of a key range boundary.
However we did not update the value of ctx->last_dir_item_offset, which
tracks the offset of the last logged key. This can result in subsequent
logging of the same directory in the current transaction to not realize
that key was already logged, and then add it to the middle of a batch
that starts with a lower key, resulting later in a leaf with one key
that is duplicated and at non-consecutive slots. When that happens we get
an error later when writing out the leaf, reporting that there is a pair
of keys in wrong order. The report is something like the following:
Dec 13 21:44:50 kernel: BTRFS critical (device dm-0): corrupt leaf:
root=18446744073709551610 block=118444032 slot=21, bad key order, prev
(704687 84 4146773349) current (704687 84 1063561078)
Dec 13 21:44:50 kernel: BTRFS info (device dm-0): leaf 118444032 gen
91449 total ptrs 39 free space 546 owner 18446744073709551610
Dec 13 21:44:50 kernel: item 0 key (704687 1 0) itemoff 3835
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35532 size
1026 mode 40755
Dec 13 21:44:50 kernel: item 1 key (704687 12 704685) itemoff
3822 itemsize 13
Dec 13 21:44:50 kernel: item 2 key (704687 24 3817753667)
itemoff 3736 itemsize 86
Dec 13 21:44:50 kernel: item 3 key (704687 60 0) itemoff 3728 itemsize 8
Dec 13 21:44:50 kernel: item 4 key (704687 72 0) itemoff 3720 itemsize 8
Dec 13 21:44:50 kernel: item 5 key (704687 84 140445108)
itemoff 3666 itemsize 54
Dec 13 21:44:50 kernel: dir oid 704793 type 1
Dec 13 21:44:50 kernel: item 6 key (704687 84 298800632)
itemoff 3599 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707849 type 2
Dec 13 21:44:50 kernel: item 7 key (704687 84 476147658)
itemoff 3532 itemsize 67
Dec 13 21:44:50 kernel: dir oid 707901 type 2
Dec 13 21:44:50 kernel: item 8 key (704687 84 633818382)
itemoff 3471 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704694 type 2
Dec 13 21:44:50 kernel: item 9 key (704687 84 654256665)
itemoff 3403 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707841 type 1
Dec 13 21:44:50 kernel: item 10 key (704687 84 995843418)
itemoff 3331 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167736 type 1
Dec 13 21:44:50 kernel: item 11 key (704687 84 1063561078)
itemoff 3278 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 12 key (704687 84 1101156010)
itemoff 3225 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704696 type 1
Dec 13 21:44:50 kernel: item 13 key (704687 84 2521936574)
itemoff 3173 itemsize 52
Dec 13 21:44:50 kernel: dir oid 704704 type 2
Dec 13 21:44:50 kernel: item 14 key (704687 84 2618368432)
itemoff 3112 itemsize 61
Dec 13 21:44:50 kernel: dir oid 704738 type 1
Dec 13 21:44:50 kernel: item 15 key (704687 84 2676316190)
itemoff 3046 itemsize 66
Dec 13 21:44:50 kernel: dir oid 2167729 type 1
Dec 13 21:44:50 kernel: item 16 key (704687 84 3319104192)
itemoff 2986 itemsize 60
Dec 13 21:44:50 kernel: dir oid 704745 type 2
Dec 13 21:44:50 kernel: item 17 key (704687 84 3908046265)
itemoff 2929 itemsize 57
Dec 13 21:44:50 kernel: dir oid 2167734 type 1
Dec 13 21:44:50 kernel: item 18 key (704687 84 3945713089)
itemoff 2857 itemsize 72
Dec 13 21:44:50 kernel: dir oid 2167730 type 1
Dec 13 21:44:50 kernel: item 19 key (704687 84 4077169308)
itemoff 2795 itemsize 62
Dec 13 21:44:50 kernel: dir oid 704688 type 1
Dec 13 21:44:50 kernel: item 20 key (704687 84 4146773349)
itemoff 2727 itemsize 68
Dec 13 21:44:50 kernel: dir oid 707892 type 1
Dec 13 21:44:50 kernel: item 21 key (704687 84 1063561078)
itemoff 2674 itemsize 53
Dec 13 21:44:50 kernel: dir oid 704799 type 2
Dec 13 21:44:50 kernel: item 22 key (704687 96 2) itemoff 2612
itemsize 62
Dec 13 21:44:50 kernel: item 23 key (704687 96 6) itemoff 2551
itemsize 61
Dec 13 21:44:50 kernel: item 24 key (704687 96 7) itemoff 2498
itemsize 53
Dec 13 21:44:50 kernel: item 25 key (704687 96 12) itemoff
2446 itemsize 52
Dec 13 21:44:50 kernel: item 26 key (704687 96 14) itemoff
2385 itemsize 61
Dec 13 21:44:50 kernel: item 27 key (704687 96 18) itemoff
2325 itemsize 60
Dec 13 21:44:50 kernel: item 28 key (704687 96 24) itemoff
2271 itemsize 54
Dec 13 21:44:50 kernel: item 29 key (704687 96 28) itemoff
2218 itemsize 53
Dec 13 21:44:50 kernel: item 30 key (704687 96 62) itemoff
2150 itemsize 68
Dec 13 21:44:50 kernel: item 31 key (704687 96 66) itemoff
2083 itemsize 67
Dec 13 21:44:50 kernel: item 32 key (704687 96 75) itemoff
2015 itemsize 68
Dec 13 21:44:50 kernel: item 33 key (704687 96 79) itemoff
1948 itemsize 67
Dec 13 21:44:50 kernel: item 34 key (704687 96 82) itemoff
1882 itemsize 66
Dec 13 21:44:50 kernel: item 35 key (704687 96 83) itemoff
1810 itemsize 72
Dec 13 21:44:50 kernel: item 36 key (704687 96 85) itemoff
1753 itemsize 57
Dec 13 21:44:50 kernel: item 37 key (704687 96 87) itemoff
1681 itemsize 72
Dec 13 21:44:50 kernel: item 38 key (704694 1 0) itemoff 1521
itemsize 160
Dec 13 21:44:50 kernel: inode generation 35534 size 30
mode 40755
Dec 13 21:44:50 kernel: BTRFS error (device dm-0): block=118444032
write time tree block corruption detected
So fix that by adding the missing update of ctx->last_dir_item_offset with
the offset of the boundary key.
Reported-by: Chris Murphy <lists@colorremedies.com>
Link: https://lore.kernel.org/linux-btrfs/CAJCQCtT+RSzpUjbMq+UfzNUMe1X5+1G+DnAGbHC=OZ=iRS24jg@mail.gmail.com/
Fixes:
|
||
Jianglei Nie
|
f35838a693 |
btrfs: fix memory leak in __add_inode_ref()
Line 1169 (#3) allocates a memory chunk for victim_name by kmalloc(),
but when the function returns in line 1184 (#4) victim_name allocated
by line 1169 (#3) is not freed, which will lead to a memory leak.
There is a similar snippet of code in this function as allocating a memory
chunk for victim_name in line 1104 (#1) as well as releasing the memory
in line 1116 (#2).
We should kfree() victim_name when the return value of backref_in_log()
is less than zero and before the function returns in line 1184 (#4).
1057 static inline int __add_inode_ref(struct btrfs_trans_handle *trans,
1058 struct btrfs_root *root,
1059 struct btrfs_path *path,
1060 struct btrfs_root *log_root,
1061 struct btrfs_inode *dir,
1062 struct btrfs_inode *inode,
1063 u64 inode_objectid, u64 parent_objectid,
1064 u64 ref_index, char *name, int namelen,
1065 int *search_done)
1066 {
1104 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #1: kmalloc (victim_name-1)
1105 if (!victim_name)
1106 return -ENOMEM;
1112 ret = backref_in_log(log_root, &search_key,
1113 parent_objectid, victim_name,
1114 victim_name_len);
1115 if (ret < 0) {
1116 kfree(victim_name); // #2: kfree (victim_name-1)
1117 return ret;
1118 } else if (!ret) {
1169 victim_name = kmalloc(victim_name_len, GFP_NOFS);
// #3: kmalloc (victim_name-2)
1170 if (!victim_name)
1171 return -ENOMEM;
1180 ret = backref_in_log(log_root, &search_key,
1181 parent_objectid, victim_name,
1182 victim_name_len);
1183 if (ret < 0) {
1184 return ret; // #4: missing kfree (victim_name-2)
1185 } else if (!ret) {
1241 return 0;
1242 }
Fixes:
|
||
Naohiro Aota
|
84c2544892 |
btrfs: fix re-dirty process of tree-log nodes
There is a report of a transaction abort of -EAGAIN with the following
script.
#!/bin/sh
for d in sda sdb; do
mkfs.btrfs -d single -m single -f /dev/\${d}
done
mount /dev/sda /mnt/test
mount /dev/sdb /mnt/scratch
for dir in test scratch; do
echo 3 >/proc/sys/vm/drop_caches
fio --directory=/mnt/\${dir} --name=fio.\${dir} --rw=read --size=50G --bs=64m \
--numjobs=$(nproc) --time_based --ramp_time=5 --runtime=480 \
--group_reporting |& tee /dev/shm/fio.\${dir}
echo 3 >/proc/sys/vm/drop_caches
done
for d in sda sdb; do
umount /dev/\${d}
done
The stack trace is shown in below.
[3310.967991] BTRFS: error (device sda) in btrfs_commit_transaction:2341: errno=-11 unknown (Error while writing out transaction)
[3310.968060] BTRFS info (device sda): forced readonly
[3310.968064] BTRFS warning (device sda): Skipping commit of aborted transaction.
[3310.968065] ------------[ cut here ]------------
[3310.968066] BTRFS: Transaction aborted (error -11)
[3310.968074] WARNING: CPU: 14 PID: 1684 at fs/btrfs/transaction.c:1946 btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968131] CPU: 14 PID: 1684 Comm: fio Not tainted 5.14.10-300.fc35.x86_64 #1
[3310.968135] Hardware name: DIAWAY Tartu/Tartu, BIOS V2.01.B10 04/08/2021
[3310.968137] RIP: 0010:btrfs_commit_transaction.cold+0x209/0x2c8
[3310.968144] RSP: 0018:ffffb284ce393e10 EFLAGS: 00010282
[3310.968147] RAX: 0000000000000026 RBX: ffff973f147b0f60 RCX: 0000000000000027
[3310.968149] RDX: ffff974ecf098a08 RSI: 0000000000000001 RDI: ffff974ecf098a00
[3310.968150] RBP: ffff973f147b0f08 R08: 0000000000000000 R09: ffffb284ce393c48
[3310.968151] R10: ffffb284ce393c40 R11: ffffffff84f47468 R12: ffff973f101bfc00
[3310.968153] R13: ffff971f20cf2000 R14: 00000000fffffff5 R15: ffff973f147b0e58
[3310.968154] FS: 00007efe65468740(0000) GS:ffff974ecf080000(0000) knlGS:0000000000000000
[3310.968157] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3310.968158] CR2: 000055691bcbe260 CR3: 000000105cfa4001 CR4: 0000000000770ee0
[3310.968160] PKRU: 55555554
[3310.968161] Call Trace:
[3310.968167] ? dput+0xd4/0x300
[3310.968174] btrfs_sync_file+0x3f1/0x490
[3310.968180] __x64_sys_fsync+0x33/0x60
[3310.968185] do_syscall_64+0x3b/0x90
[3310.968190] entry_SYSCALL_64_after_hwframe+0x44/0xae
[3310.968194] RIP: 0033:0x7efe6557329b
[3310.968200] RSP: 002b:00007ffe0236ebc0 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[3310.968203] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007efe6557329b
[3310.968204] RDX: 0000000000000000 RSI: 00007efe58d77010 RDI: 0000000000000006
[3310.968205] RBP: 0000000004000000 R08: 0000000000000000 R09: 00007efe58d77010
[3310.968207] R10: 0000000016cacc0c R11: 0000000000000293 R12: 00007efe5ce95980
[3310.968208] R13: 0000000000000000 R14: 00007efe6447c790 R15: 0000000c80000000
[3310.968212] ---[ end trace 1a346f4d3c0d96ba ]---
[3310.968214] BTRFS: error (device sda) in cleanup_transaction:1946: errno=-11 unknown
The abort occurs because of a write hole while writing out freeing tree
nodes of a tree-log tree. For zoned btrfs, we re-dirty a freed tree
node to ensure btrfs can write the region and does not leave a hole on
write on a zoned device. The current code fails to re-dirty a node
when the tree-log tree's depth is greater or equal to 2. That leads to
a transaction abort with -EAGAIN.
Fix the issue by properly re-dirtying a node on walking up the tree.
Fixes:
|
||
Filipe Manana
|
d1ed82f355 |
btrfs: remove root argument from check_item_in_log()
The root argument passed to check_item_in_log() always matches the root of the given directory, so it can be eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
6d9cc07215 |
btrfs: remove root argument from add_link()
The root argument for tree-log.c:add_link() always matches the root of the given directory and the given inode, so it can eliminated. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
4467af8809 |
btrfs: remove root argument from btrfs_unlink_inode()
The root argument passed to btrfs_unlink_inode() and its callee, __btrfs_unlink_inode(), always matches the root of the given directory and the given inode. So remove the argument and make __btrfs_unlink_inode() use the root of the directory. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
9798ba24cb |
btrfs: remove root argument from drop_one_dir_item()
The root argument for drop_one_dir_item() always matches the root of the given directory inode, since each log tree is associated to one and only one subvolume/root, so remove the argument. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
10adb1152d |
btrfs: fix lost error handling when replaying directory deletes
At replay_dir_deletes(), if find_dir_range() returns an error we break out of the main while loop and then assign a value of 0 (success) to the 'ret' variable, resulting in completely ignoring that an error happened. Fix that by jumping to the 'out' label when find_dir_range() returns an error (negative value). CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Nikolay Borisov
|
f42c5da6c1 |
btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_ref
In order to make 'real_root' used only in ref-verify it's required to have the necessary context to perform the same checks that this member is used for. So add 'mod_root' which will contain the root on behalf of which a delayed ref was created and a 'skip_group' parameter which will contain callsite-specific override of skip_qgroup. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
8496153945 |
btrfs: add a BTRFS_FS_ERROR helper
We have a few flags that are inconsistently used to describe the fs in different states of failure. As of |
||
Josef Bacik
|
9a35fc9542 |
btrfs: change error handling for btrfs_delete_*_in_log
Currently we will abort the transaction if we get a random error (like -EIO) while trying to remove the directory entries from the root log during rename. However since these are simply log tree related errors, we can mark the trans as needing a full commit. Then if the error was truly catastrophic we'll hit it during the normal commit and abort as appropriate. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Josef Bacik
|
ba51e2a11e |
btrfs: change handle_fs_error in recover_log_trees to aborts
During inspection of the return path for replay I noticed that we don't actually abort the transaction if we get a failure during replay. This isn't a problem necessarily, as we properly return the error and will fail to mount. However we still leave this dangling transaction that could conceivably be committed without thinking there was an error. We were using btrfs_handle_fs_error() here, but that pre-dates the transaction abort code. Simply replace the btrfs_handle_fs_error() calls with transaction aborts, so we still know where exactly things went wrong, and add a few in some other un-handled error cases. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
da1b811fcd |
btrfs: use single bulk copy operations when logging directories
When logging a directory and inserting a batch of directory items, we are copying the data of each item from a leaf in the fs/subvolume tree to a leaf in a log tree, separately. This is not really needed, since we are copying from a contiguous memory area into another one, so we can use a single copy operation to copy all items at once. This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 3/3. The following test was used to compare performance of a branch without the patchset versus one branch that has the whole patchset applied: $ cat dir-fsync-test.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=1000000 NUM_FILE_DELETES=1000 LEAF_SIZE=16K mkfs.btrfs -f -n $LEAF_SIZE $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # Fsync the directory, this will log the new dir items and the inodes # they point to, because these are new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done # Fsync the directory, this will only log dir items, there are no # dentries pointing to new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" umount $MNT The tests were run on a non-debug kernel (Debian's default kernel config) and were the following: *** with a leaf size of 16K, before patchset *** dir fsync took 8482 ms after adding 1000000 files dir fsync took 166 ms after deleting 1000 files *** with a leaf size of 16K, after patchset *** dir fsync took 8196 ms after adding 1000000 files (-3.4%) dir fsync took 143 ms after deleting 1000 files (-14.9%) *** with a leaf size of 64K, before patchset *** dir fsync took 12851 ms after adding 1000000 files dir fsync took 466 ms after deleting 1000 files *** with a leaf size of 64K, after patchset *** dir fsync took 12287 ms after adding 1000000 files (-4.5%) dir fsync took 414 ms after deleting 1000 files (-11.8%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
b7ef5f3a6f |
btrfs: loop only once over data sizes array when inserting an item batch
When inserting a batch of items into a btree, we end up looping over the data sizes array 3 times: 1) Once in the caller of btrfs_insert_empty_items(), when it populates the array with the data sizes for each item; 2) Once at btrfs_insert_empty_items() to sum the elements of the data sizes array and compute the total data size; 3) And then once again at setup_items_for_insert(), where we do exactly the same as what we do at btrfs_insert_empty_items(), to compute the total data size. That is not bad for small arrays, but when the arrays have hundreds of elements, the time spent on looping is not negligible. For example when doing batch inserts of delayed items for dir index items or when logging a directory, it's common to have 200 to 260 dir index items in a single batch when using a leaf size of 16K and using file names between 8 and 12 characters. For a 64K leaf size, multiply that by 4. Taking into account that during directory logging or when flushing delayed dir index items we can have many of those large batches, the time spent on the looping adds up quickly. It's also more important to avoid it at setup_items_for_insert(), since we are holding a write lock on a leaf and, in some cases, on upper nodes of the btree, which causes us to block other tasks that want to access the leaf and nodes for longer than necessary. So change the code so that setup_items_for_insert() and btrfs_insert_empty_items() no longer compute the total data size, and instead rely on the caller to supply it. This makes us loop over the array only once, where we can both populate the data size array and compute the total data size, taking advantage of spatial and temporal locality. To make this more manageable, use a structure to contain all the relevant details for a batch of items (keys array, data sizes array, total data size, number of items), and use it as an argument for btrfs_insert_empty_items() and setup_items_for_insert(). This patch is part of a small patchset that is comprised of the following patches: btrfs: loop only once over data sizes array when inserting an item batch btrfs: unexport setup_items_for_insert() btrfs: use single bulk copy operations when logging directories This is patch 1/3 and performance results, and the specific tests, are included in the changelog of patch 3/3. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
dc2872247e |
btrfs: keep track of the last logged keys when logging a directory
After the first time we log a directory in the current transaction, for each directory item in a changed leaf of the subvolume tree, we have to check if we previously logged the item, in order to overwrite it in case its data changed or skip it in case its data hasn't changed. Checking if we have logged each item before not only wastes times, but it also adds lock contention on the log tree. So in order to minimize the number of times we do such checks, keep track of the offset of the last key we logged for a directory and, on the next time we log the directory, skip the checks for any new keys that have an offset greater than the offset we have previously saved. This is specially effective for index keys, because the offset for these keys comes from a monotonically increasing counter. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 5/5. The following test was used on a non-debug kernel to measure the impact it has on a directory fsync: $ cat test-dir-fsync.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 NUM_NEW_FILES=100000 NUM_FILE_DELETES=1000 mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT mkdir $MNT/testdir for ((i = 1; i <= $NUM_NEW_FILES; i++)); do echo -n > $MNT/testdir/file_$i done # fsync the directory, this will log the new dir items and the inodes # they point to, because these are new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after adding $NUM_NEW_FILES files" # sync to force transaction commit and wipeout the log. sync del_inc=$(( $NUM_NEW_FILES / $NUM_FILE_DELETES )) for ((i = 1; i <= $NUM_NEW_FILES; i += $del_inc)); do rm -f $MNT/testdir/file_$i done # fsync the directory, this will only log dir items, there are no # dentries pointing to new inodes. start=$(date +%s%N) xfs_io -c "fsync" $MNT/testdir end=$(date +%s%N) dur=$(( (end - start) / 1000000 )) echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files" umount $MNT Test results with NUM_NEW_FILES set to 100 000 and 1 000 000: **** before patchset, 100 000 files, 1000 deletes **** dir fsync took 848 ms after adding 100000 files dir fsync took 175 ms after deleting 1000 files **** after patchset, 100 000 files, 1000 deletes **** dir fsync took 758 ms after adding 100000 files (-11.2%) dir fsync took 63 ms after deleting 1000 files (-94.1%) **** before patchset, 1 000 000 files, 1000 deletes **** dir fsync took 9945 ms after adding 1000000 files dir fsync took 473 ms after deleting 1000 files **** after patchset, 1 000 000 files, 1000 deletes **** dir fsync took 8677 ms after adding 1000000 files (-13.6%) dir fsync took 146 ms after deleting 1000 files (-105.6%) Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
086dcbfa50 |
btrfs: insert items in batches when logging a directory when possible
When logging a directory, we scan its directory items from the subvolume tree and then copy one by one into the log tree. This is not efficient since we generally are able to insert several items in a batch, using a single btree operation for adding several items at once. The reason we copy items one by one is that we must check if each item was previously logged in the current transaction, and if it was we either overwrite it or skip it in case its content did not change in the subvolume tree (this can happen only for dir item keys, but not for dir index keys), and doing such check makes it a bit cumbersome to attempt batch insertions. However the chances for doing batch insertions are very frequent and always happen when: 1) Logging the directory for the first time in the current transaction, as none of the items exist in the log tree yet; 2) Logging new dir index keys, because the offset for new dir index keys comes from a monotonically increasing counter. This means if we keep adding dentries to a directory, through creation of new files and sub-directories or by adding new links or renaming from some other directory into the one we are logging, all the new dir index keys have a new offset that is greater than the offset of any previously logged index keys, so we can insert them in batches into the log tree. For dir item keys, since their offset depends on the result of an hash function against the dentry's name, unless the directory is being logged for the first time in the current transaction, the chances being able to insert the items in the log using batches is pretty much random and not predictable, as it depends on the names of the dentries, but still happens often enough. So change directory logging to keep track of consecutive directory items that don't exist yet in the log and batch insert them. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 4/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
eb10d85ee7 |
btrfs: factor out the copying loop of dir items from log_dir_items()
In preparation for the next change, move the loop that processes a leaf and copies its directory items to the log, into a separate helper function. This makes the next change simpler and it also helps making log_dir_items() a bit shorter (specially after the next change). This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 3/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
d46fb845af |
btrfs: remove redundant log root assignment from log_dir_items()
At log_dir_items() we are assigning the exact same value to the local variable 'log', once when it's declared and once again shortly after. Remove the later assignment as it's pointless. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 2/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |
||
Filipe Manana
|
90d04510a7 |
btrfs: remove root argument from btrfs_log_inode() and its callees
The root argument passed to btrfs_log_inode() is unncessary, as it is always the root of the inode we are going to log. This root also gets unnecessarily propagated to several functions called by btrfs_log_inode(), and all of them take the inode as an argument as well. So just remove the root argument from these functions and have them get the root from the inode where needed. This patch is part of a patchset comprised of the following 5 patches: btrfs: remove root argument from btrfs_log_inode() and its callees btrfs: remove redundant log root assignment from log_dir_items() btrfs: factor out the copying loop of dir items from log_dir_items() btrfs: insert items in batches when logging a directory when possible btrfs: keep track of the last logged keys when logging a directory This is patch 1/5. The change log of the last patch (5/5) has performance results. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com> |