Function gfs2_trim_blocks is not referenced. Eliminate it.
Signed-off-by: Bob Peterson <rpeterso@redhat.com>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
We maintain PagePrivate, page_private and the page reference
with {set,clear}_page_private_*, so there is no need to call
folio_detach_private() at the end of .invalidate_folio and
.release_folio. Remove it and use f2fs_bug_on instead.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Convert to use remove_proc_subtree() and drop the direct kobject_del()
call. kobject_put() already covers kobject removal automatically, which
gives single-stage removal.
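As a hedged illustration of why the explicit kobject_del() is redundant
(sbi->s_kobj is only illustrative of the f2fs sysfs teardown, not the
exact hunk):

  /* before: explicit two-stage removal */
  kobject_del(&sbi->s_kobj);
  kobject_put(&sbi->s_kobj);

  /* after: kobject_put() alone is enough; kobject_cleanup() calls
   * kobject_del() for a kobject that is still registered in sysfs
   */
  kobject_put(&sbi->s_kobj);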
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
There are some warnings on older compilers (gcc 10, 7) or non-x86_64
architectures (aarch64). As btrfs wants to enable -Wmaybe-uninitialized
by default, fix the warnings even though it's not necessary on recent
compilers (gcc 12+).
../fs/btrfs/volumes.c: In function ‘btrfs_init_new_device’:
../fs/btrfs/volumes.c:2703:3: error: ‘seed_devices’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
2703 | btrfs_setup_sprout(fs_info, seed_devices);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
../fs/btrfs/send.c: In function ‘get_cur_inode_state’:
../include/linux/compiler.h:70:32: error: ‘right_gen’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
70 | (__if_trace.miss_hit[1]++,1) : \
| ^
../fs/btrfs/send.c:1878:6: note: ‘right_gen’ was declared here
1878 | u64 right_gen;
| ^~~~~~~~~
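A hedged sketch of the kind of change that silences these warnings (the
exact hunks in the patch may differ): give the flagged variables an
explicit initial value so older compilers cannot see an uninitialized
path.

  struct btrfs_fs_devices *seed_devices = NULL;	/* volumes.c */
  u64 right_gen = 0;				/* send.c */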
Reported-by: k2ci <kernel-bot@kylinos.cn>
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
When logging dir dentries of a directory, we iterate over the subvolume
tree to find dir index keys on leaves modified in the current transaction.
This however is heavy on locking, since btrfs_search_forward() may often
keep locks on extent buffers for quite a while when walking the tree to
find a suitable leaf modified in the current transaction and with a key
not smaller than the provided minimum key. That means it will block
other tasks trying to access the subvolume tree, which may be common fs
operations like creating, renaming, linking, unlinking, reflinking files,
etc.
A better solution is to iterate the log tree instead, since it's much
smaller than a subvolume tree and only contains the dir index keys added
in the current transaction, and to just use plain btrfs_search_slot() (or
the btrfs_for_each_slot() wrapper).
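A minimal sketch of the new lookup pattern, with illustrative names
(min_offset, log_root, inode) rather than the actual hunk:

  struct btrfs_key key = {
          .objectid = btrfs_ino(inode),
          .type = BTRFS_DIR_INDEX_KEY,
          .offset = min_offset,
  };

  /* Plain read-only search in the (small) log tree. */
  ret = btrfs_search_slot(NULL, log_root, &key, path, 0, 0);
  /* ... walk forward with btrfs_next_leaf()/btrfs_next_item() ... */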
The following bonnie++ test on a non-debug kernel (with Debian's default
kernel config) on a 20G null block device, was used to measure the impact:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
NR_DIRECTORIES=20
NR_FILES=20480 # must be a multiple of 1024
DATASET_SIZE=$(( (8 * 1024 * 1024 * 1024) / 1048576 )) # 8 GiB as megabytes
DIRECTORY_SIZE=$(( DATASET_SIZE / NR_FILES ))
NR_FILES=$(( NR_FILES / 1024 ))
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount $DEV $MNT
bonnie++ -u root -d $MNT \
    -n $NR_FILES:$DIRECTORY_SIZE:$DIRECTORY_SIZE:$NR_DIRECTORIES \
    -r 0 -s $DATASET_SIZE -b
umount $MNT
Before patchset:
Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 8G 376k 99 1.1g 98 939m 92 1527k 99 3.2g 99 9060 256
Latency 24920us 207us 680ms 5594us 171us 2891us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20/20 20480 96 +++++ +++ 20480 95 20480 99 +++++ +++ 20480 97
Latency 8708us 137us 5128us 6743us 60us 19712us
After patchset:
Version 2.00a ------Sequential Output------ --Sequential Input- --Random-
-Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Name:Size etc /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
debian0 8G 384k 99 1.2g 99 971m 91 1533k 99 3.3g 99 9180 309
Latency 24930us 125us 661ms 5587us 46us 2020us
Version 2.00a ------Sequential Create------ --------Random Create--------
debian0 -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
files /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP /sec %CP
20/20 20480 90 +++++ +++ 20480 99 20480 99 +++++ +++ 20480 97
Latency 7030us 61us 1246us 4942us 56us 16855us
The patchset consists of this patch plus a previous one that has the
following subject:
"btrfs: avoid iterating over all indexes when logging directory"
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging a directory, after copying all directory index items from the
subvolume tree to the log tree, we iterate over the subvolume tree to find
all dir index items that are located in leaves COWed (or created) in the
current transaction. If we keep logging a directory several times during
the same transaction, we end up iterating over the same dir index items
every time we log the directory, wasting time and adding extra lock
contention on the subvolume tree.
So just keep track of the last logged dir index offset in order to start
the search at that index + 1 the next time the directory is logged, as
dir index values (key offsets) come from a monotonically increasing
counter.
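A minimal sketch of the idea, with an illustrative field name (the actual
member in the patch may be named differently):

  key.objectid = btrfs_ino(inode);
  key.type = BTRFS_DIR_INDEX_KEY;
  /* Resume right after the last dir index offset we already logged. */
  key.offset = inode->last_dir_index_offset + 1;
  /* ... copy dir index items from here on, then store the last logged
   * offset back into inode->last_dir_index_offset for the next fsync ...
   */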
The following test measures the difference before and after this change:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/nullb0
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
mount -o ssd $DEV $MNT
# Time values in milliseconds.
declare -a fsync_times
# Total number of files added to the test directory.
num_files=1000000
# Fsync directory after every N files are added.
fsync_period=100
mkdir $MNT/testdir
fsync_total_time=0
for ((i = 1; i <= $num_files; i++)); do
    echo -n > $MNT/testdir/file_$i
    if [ $((i % fsync_period)) -eq 0 ]; then
        start=$(date +%s%N)
        xfs_io -c "fsync" $MNT/testdir
        end=$(date +%s%N)
        fsync_total_time=$((fsync_total_time + (end - start)))
        fsync_times[i]=$(( (end - start) / 1000000 ))
        echo -n -e "Progress $i / $num_files\r"
    fi
done
echo -e "\nHistogram of directory fsync duration in ms:\n"
printf '%s\n' "${fsync_times[@]}" | \
perl -MStatistics::Histogram -e '@d = <>; print get_histogram(\@d);'
fsync_total_time=$((fsync_total_time / 1000000))
echo -e "\nTotal time spent in fsync: $fsync_total_time ms\n"
echo
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config)
against a 15G null block device.
Result before this change:
Histogram of directory fsync duration in ms:
Count: 10000
Range: 3.000 - 362.000; Mean: 34.556; Median: 31.000; Stddev: 25.751
Percentiles: 90th: 71.000; 95th: 77.000; 99th: 81.000
3.000 - 5.278: 1423 #################################
5.278 - 8.854: 1173 ###########################
8.854 - 14.467: 591 ##############
14.467 - 23.277: 1025 #######################
23.277 - 37.105: 1422 #################################
37.105 - 58.809: 2036 ###############################################
58.809 - 92.876: 2316 #####################################################
92.876 - 146.346: 6 |
146.346 - 230.271: 6 |
230.271 - 362.000: 2 |
Total time spent in fsync: 350527 ms
Result after this change:
Histogram of directory fsync duration in ms:
Count: 10000
Range: 3.000 - 1088.000; Mean: 8.704; Median: 8.000; Stddev: 12.576
Percentiles: 90th: 12.000; 95th: 14.000; 99th: 17.000
3.000 - 6.007: 3222 #################################
6.007 - 11.276: 5197 #####################################################
11.276 - 20.506: 1551 ################
20.506 - 36.674: 24 |
36.674 - 201.552: 1 |
201.552 - 353.841: 4 |
353.841 - 1088.000: 1 |
Total time spent in fsync: 92114 ms
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Even before the scrub rework, if we have some corrupted metadata failed
to be repaired during replace, we still continue replacing and let it
finish just as there is nothing wrong:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 0 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): checksum error at logical 5578752 on dev /dev/mapper/test-scratch1, physical 5578752: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): bdev /dev/mapper/test-scratch1 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad bytenr, has 0 want 5578752
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5578752 on dev /dev/mapper/test-scratch1
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 finished
This can lead to unexpected problems for the resulting filesystem.
[CAUSE]
Btrfs reuses scrub code path for dev-replace to iterate all dev extents.
But unlike scrub, dev-replace doesn't really bother to check the scrub
progress, which records all the errors found during the replace.
And even if we checked the progress, we cannot really determine which
errors are minor and which are critical just from the plain numbers
(remember we don't treat metadata/data checksum errors differently).
This behavior has been there from the very beginning.
[FIX]
Instead of continuing the replace, just error out if we hit an
unrepaired metadata sector.
Now the dev-replace is rejected with -EIO, to let the user know.
Although this also means the filesystem has some metadata errors which
cannot be repaired, the user would be upset anyway.
The new dmesg would look like this:
BTRFS info (device dm-4): dev_replace from /dev/mapper/test-scratch1 (devid 1) to /dev/mapper/test-scratch2 started
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS warning (device dm-4): tree block 5578752 mirror 1 has bad csum, has 0x00000000 want 0xade80ca1
BTRFS error (device dm-4): unable to fixup (regular) error at logical 5570560 on dev /dev/mapper/test-scratch1 physical 5570560
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS warning (device dm-4): header error at logical 5570560 on dev /dev/mapper/test-scratch1, physical 5570560: metadata leaf (level 0) in tree 5
BTRFS error (device dm-4): stripe 5570560 has unrepaired metadata sector at 5578752
BTRFS error (device dm-4): btrfs_scrub_dev(/dev/mapper/test-scratch1, 1, /dev/mapper/test-scratch2) failed -5
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's pointless to have a while loop at btrfs_get_next_valid_item(), as if
the slot on the current leaf is beyond the last item, we call
btrfs_next_leaf(), which leaves us at a valid slot of the next leaf (or
a valid slot in the current leaf if after releasing the path an item gets
pushed from the next leaf to the current leaf).
So just call btrfs_next_leaf() if the current slot on the current leaf is
beyond the last item.
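A hedged sketch of the simplified flow (close to, but not necessarily
identical to, the resulting helper):

  int btrfs_get_next_valid_item(struct btrfs_root *root, struct btrfs_key *key,
                                struct btrfs_path *path)
  {
          if (path->slots[0] >= btrfs_header_nritems(path->nodes[0])) {
                  int ret = btrfs_next_leaf(root, path);

                  if (ret)
                          return ret;
          }
          btrfs_item_key_to_cpu(path->nodes[0], key, path->slots[0]);
          return 0;
  }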
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the introduction of the scrub interface, the only flag that we
support is BTRFS_SCRUB_READONLY. Thus there are no sanity checks; if
some undefined flags are passed in, we just ignore them.
This is problematic if we want to introduce new scrub flags, as we have
no way to determine if such flags are supported.
Address the problem by introducing a check for the flags, and if
unsupported flags are set, return -EOPNOTSUPP to inform the user space.
This check should be backported for all supported kernels before any new
scrub flags are introduced.
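A minimal sketch of the intended check (the mask name and the sa variable
are assumptions; for now the mask would only cover BTRFS_SCRUB_READONLY):

  if (sa->flags & ~BTRFS_SCRUB_SUPPORTED_FLAGS) {
          ret = -EOPNOTSUPP;
          goto out;
  }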
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, a limit of 0 results in a hard coded metering over 6 hours.
Since the default is a set limit, I suspect no one truly depends on this
rather arbitrary setting. Repurpose it for an arguably more useful
"unlimited" mode, where the delay is 0.
Note that if block groups are too new, or go fully empty, there is still
a delay associated with those conditions. Those delays implement
heuristics for not trimming a region we are relatively likely to fully
overwrite soon.
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Previously, the default was a relatively conservative 10. This results
in a 100ms delay, so with ~300 discards in a commit, it takes the full
30s till the next commit to finish the discards. On a workstation, this
results in the disk never going idle, wasting power/battery, etc.
Set the default to 1000, which results in using the smallest possible
delay, currently 1ms. The original reporter has confirmed that this does
not pathologically keep the disk busy.
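A hedged sketch of how the limit maps to the delay (illustrative only;
the real code also clamps the delay to minimum/maximum bounds):

  /* iops_limit = 10   -> 1000ms / 10   = 100ms between discards
   * iops_limit = 1000 -> 1000ms / 1000 =   1ms between discards
   */
  u32 delay_ms = MSEC_PER_SEC / iops_limit;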
Link: https://lore.kernel.org/linux-btrfs/Y%2F+n1wS%2F4XAH7X1p@nz/
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2182228
CC: stable@vger.kernel.org # 6.2+
Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the scrub rework, the following RAID56 functions are no longer
called:
- raid56_add_scrub_pages()
- raid56_alloc_missing_rbio()
- raid56_submit_missing_rbio()
Those functions are all utilized by scrub to handle missing device cases
for RAID56.
However the new scrub code handles them in a completely different way:
- If it's a data stripe, go through the recovery path via btrfs_submit_bio()
- If it's a P/Q stripe, it is handled through
raid56_parity_submit_scrub_rbio()
That function handles dev-replace and repair properly.
Thus we can safely remove those functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the scrub path has been fully moved to the scrub_stripe based
facilities, no scrub_bio is submitted anymore.
Thus we can remove it completely, which involves:
- SCRUB_SECTORS_PER_BIO macro
- SCRUB_BIOS_PER_SCTX macro
- SCRUB_MAX_PAGES macro
- BTRFS_MAX_MIRRORS macro
- scrub_bio structure
- scrub_ctx::bios member
- scrub_ctx::curr member
- scrub_ctx::bios_in_flight member
- scrub_ctx::workers_pending member
- scrub_ctx::list_lock member
- scrub_ctx::list_wait member
- function scrub_bio_end_io_worker()
- function scrub_pending_bio_inc()
- function scrub_pending_bio_dec()
- function scrub_throttle()
- function scrub_submit()
- function scrub_find_csum()
- function drop_csum_range()
- Some unnecessary flush and scrub pauses
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Those two structures are used to represent a bunch of sectors for scrub,
but now they are fully replaced by scrub_stripe in one go, so we can
remove them. This involves:
- structure scrub_block
- structure scrub_sector
- structure scrub_page_private
- function attach_scrub_page_private()
- function detach_scrub_page_private()
Now we no longer need to use page::private to handle subpage.
- function alloc_scrub_block()
- function alloc_scrub_sector()
- function scrub_sector_get_page()
- function scrub_sector_get_page_offset()
- function scrub_sector_get_kaddr()
- function bio_add_scrub_sector()
- function scrub_checksum_data()
- function scrub_checksum_tree_block()
- function scrub_checksum_super()
- function scrub_check_fsid()
- function scrub_block_get()
- function scrub_block_put()
- function scrub_sector_get()
- function scrub_sector_put()
- function scrub_bio_end_io()
- function scrub_block_complete()
- function scrub_add_sector_to_rd_bio()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The old scrub code has a different entry point to verify the content, and
since we have removed the writeback path, we can now start removing the
re-check part, including:
- scrub_recover structure
- scrub_sector::recover member
- function scrub_setup_recheck_block()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_group_good_copy()
- function scrub_repair_sector_from_good_copy()
- function scrub_is_page_on_raid56()
- function full_stripe_lock()
- function search_full_stripe_lock()
- function get_full_stripe_logical()
- function insert_full_stripe_lock()
- function lock_full_stripe()
- function unlock_full_stripe()
- btrfs_block_group::full_stripe_locks_root member
- btrfs_full_stripe_locks_tree structure
This infrastructure was there to ensure RAID56 scrub handled recovery
and P/Q scrub correctly.
It is no longer needed: before the P/Q scrub we now wait for all the
involved data stripes to be scrubbed first, and the RAID56 code has an
internal lock to ensure there is no race within the same full stripe.
- function scrub_print_warning()
- function scrub_get_recover()
- function scrub_put_recover()
- function scrub_handle_errored_block()
- function scrub_setup_recheck_block()
- function scrub_bio_wait_endio()
- function scrub_submit_raid56_bio_wait()
- function scrub_recheck_block_on_raid56()
- function scrub_recheck_block()
- function scrub_recheck_block_checksum()
- function scrub_repair_block_from_good_copy()
- function scrub_repair_sector_from_good_copy()
And two more functions exported temporarily for later cleanup:
- alloc_scrub_sector()
- alloc_scrub_block()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the whole scrub path has been switched to scrub_stripe based
solution, the old writeback path can be removed completely, which
involves:
- scrub_ctx::wr_curr_bio member
- scrub_ctx::flush_all_writes member
- function scrub_write_block_to_dev_replace()
- function scrub_write_sector_to_dev_replace()
- function scrub_add_sector_to_wr_bio()
- function scrub_wr_submit()
- function scrub_wr_bio_end_io()
- function scrub_wr_bio_end_io_worker()
And one more function needs to be exported temporarily:
- scrub_sector_get()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The structure scrub_parity is used to indicate that some extents are
scrubbed for the purpose of RAID56 P/Q scrubbing.
Since the whole RAID56 P/Q scrubbing path has been replaced with new
scrub_stripe infrastructure, and we no longer need to use scrub_parity
to modify the behavior of data stripes, we can remove it completely.
This removal involves:
- scrub_parity_workers
Now only one worker, scrub_workers, is utilized to do the read
and repair.
All writeback happens in the main scrub thread.
- scrub_block::sparity member
- scrub_parity structure
- function scrub_parity_get()
- function scrub_parity_put()
- function scrub_free_parity()
- function __scrub_mark_bitmap()
- function scrub_parity_mark_sectors_error()
- function scrub_parity_mark_sectors_data()
These helpers are no longer needed; scrub_stripe has its own bitmaps and
we can use bitmap helpers to get the error/data status.
- scrub_parity_bio_endio()
- scrub_parity_check_and_repair()
- function scrub_sectors_for_parity()
- function scrub_extent_for_parity()
- function scrub_raid56_data_stripe_for_parity()
- function scrub_raid56_parity()
The new code reuses the scrub read-repair and writeback path, just
skipping the dev-replace phase.
And the scrub_stripe infrastructure allows us to submit and wait for those
data stripes before scrubbing P/Q, without extra infrastructure.
The following two functions are temporarily exported for later cleanup:
- scrub_find_csum()
- scrub_add_sector_to_rd_bio()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement the only missing part for scrub: RAID56 P/Q stripe scrub.
The workflow is pretty straightforward for the new function,
scrub_raid56_parity_stripe():
- Go through the regular scrub path for each data stripe
- Wait for the verification and repair to finish
- Writeback the repaired sectors to data stripes
- Make sure all stripes are properly repaired
If we have sectors unrepaired, we cannot continue, or we could further
corrupt the P/Q stripe.
- Submit the rbio for P/Q stripe
The dev-replace case is handled inside the
raid56_parity_submit_scrub_rbio() path.
- Wait for the above bio to finish
Although the old code is no longer used, we still keep the declaration,
as the cleanup can be several times larger than this patch itself.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Switch scrub_simple_mirror() to the new scrub_stripe infrastructure.
Since scrub_simple_mirror() is the core part of scrub (only RAID56
P/Q stripes don't utilize it), we can get rid of a big chunk of code,
mostly scrub_extent(), scrub_sectors() and directly called functions.
There is a functionality change:
- Scrub speed throttling now only affects reads on the device being scrubbed
Writes (for repair and replace) and reads from other mirrors won't
be limited by the set limits.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, queue_scrub_stripe(), tries to queue a stripe for
scrub. If all stripes are already in use, we submit all the
existing ones and wait for them to finish.
Currently we queue up to 8 stripes, to enlarge the effective block size to
512KiB and improve the performance. Sectors repaired on zoned filesystems
need to be relocated instead of being fixed in place.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_stripe_report_errors(), reports the result of
the scrub to the system log.
The main reporting is done by introducing a new helper,
scrub_print_common_warning(), which has mostly the same content as
scrub_print_warning(), but without the need for a scrub_block.
Since we're reporting the errors, it's the perfect time to update the
scrub stats too.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new helper, scrub_write_sectors(), to submit write bios for
specified sectors to the target disk.
There are several differences compared to the read path:
- Utilize btrfs_submit_scrub_write()
Now we still rely on the @mirror_num based writeback, but the
requirement is a little different from regular writeback or read,
thus we have to call btrfs_submit_scrub_write().
- We cannot write the full stripe back
We can only write the sectors we have. There will be two call sites
later, one for repaired sectors, one for all utilized sectors of
dev-replace.
Thus the callers should specify their own write_bitmap.
This function only submits the bios and does not wait for them, except
in the zoned case.
The caller must explicitly wait for the IO to finish.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_stripe_read_repair_worker(), handles the
read-repair part:
- Wait for the previous submitted read IO to finish
- Verify the contents of the stripe
- Go through the remaining mirrors, using as large a block size as possible
At this stage, we just read out all the failed sectors from each
mirror and re-verify.
If there are no more failed sectors, we can exit.
- Go through all mirrors again, sector-by-sector
This time we read sector by sector, to address cases where
one bad sector mismatches the drive's internal checksum and causes the
whole read range to fail.
We put this recovery method as the last resort, as sector-by-sector
reading is slow, and reading from other mirrors may have already fixed
the errors.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_verify_stripe(), shares the same main workflow as
the old scrub code.
The major differences are:
- How pages/page_offset is grabbed
Everything can be grabbed from scrub_stripe easily.
- When error reporting happens
Currently the helper only verifies the sectors, without doing any
error reporting.
The error reporting will be done after we have done the repair.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper, scrub_verify_one_metadata(), is almost the same as
scrub_checksum_tree_block().
The difference is in how we grab the pages from other structures.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new helper will search the extent tree to find the first extent of a
logical range, then fill the sectors array by two loops:
- Loop 1 to fill common bits and metadata generation
- Loop 2 to fill csum data (only for data bgs)
This loop will use the new btrfs_lookup_csums_bitmap() to fill
the full csum buffer, and set scrub_sector_verification::csum.
With all the needed info filled by this function, later we only need to
submit and verify the stripe.
Here we temporarily export the helper to avoid a warning about an unused
static function.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces the following structures:
- scrub_sector_verification
Contains all the needed info to verify one sector (data or metadata).
- scrub_stripe
Contains all needed members (mostly bitmap based) to scrub one stripe
(with a length of BTRFS_STRIPE_LEN).
The basic idea is that we keep the existing per-device scrub behavior, but
merge all the scrub_bio based structures into one generic structure, and
read the full BTRFS_STRIPE_LEN stripe on the first try.
This means we will read some sectors which are not scrub targets, but
that's fine. At dev-replace time we only writeback the utilized and good
sectors, and for read-repair we only writeback the repaired sectors.
With every read submitted as a full BTRFS_STRIPE_LEN stripe, the need for
complex bio shaping is gone.
Although to get the same performance as the old scrub behavior, we need
to submit the initial read for two stripes at once.
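A hedged sketch of what the two structures roughly carry (member names
are illustrative, not the exact definitions from the patch):

  struct scrub_sector_verification {
          bool is_metadata;
          union {
                  u8 *csum;       /* data: expected checksum, NULL if none */
                  u64 generation; /* metadata: expected transid */
          };
  };

  struct scrub_stripe {
          u64 logical;            /* start of the BTRFS_STRIPE_LEN range */
          int mirror_num;
          /* Bitmaps indexed by sector number inside the stripe. */
          unsigned long extent_sector_bitmap;
          unsigned long error_bitmap;
          struct page *pages[BTRFS_STRIPE_LEN / PAGE_SIZE];
          struct scrub_sector_verification *sectors;
  };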
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Both scrub and read-repair utilize a special kind of repair write that:
- Only writes back to a single device
Even for read-repair on RAID56, we only update the corrupted data
stripe itself, not triggering the full RMW path.
- Requires a valid @mirror_num
For RAID56 case, only @mirror_num == 1 is valid.
For non-RAID56 cases, we need @mirror_num to locate our stripe.
- No data csum generation needed
These two call sites still have some differences though:
- Read-repair uses a plain bio
It doesn't need a full btrfs_bio, and uses submit_bio_wait().
- The new scrub repair uses btrfs_bio
To simplify both the read and write paths.
So this patch does the following:
- Introduce a common helper, btrfs_map_repair_block()
Due to the single device nature, we can use an on-stack
btrfs_io_stripe to pass the device and its physical bytenr.
- Introduce a new interface, btrfs_submit_repair_bio()
This is for the incoming scrub code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we're doing a lot of work for btrfs_bio:
- Checksum verification for data read bios
- Bio splits if it crosses stripe boundary
- Read repair for data read bios
However for the incoming scrub patches, we don't want this extra
functionality at all, just plain logical + mirror -> physical mapping
ability.
Thus here we do the following changes:
- Introduce btrfs_bio::fs_info
This is for the new scrub specific btrfs_bio, which would not populate
btrfs_bio::inode.
Thus we need such a new member to grab an fs_info.
This new member will always be populated.
- Replace @inode argument with @fs_info for btrfs_bio_init() and its
caller
Since @inode is no longer a mandatory member, replace it with
@fs_info, and let involved users populate @inode.
- Skip checksum verification and generation if @bbio->inode is NULL
- Add extra ASSERT()s
To make sure:
* bbio->inode is properly set for the involved read repair paths
* if @file_offset is set, bbio->inode is also populated
- Grab @fs_info from @bbio directly
We can no longer use @bbio->inode->root->fs_info, as bbio->inode can be
NULL. This involves:
* btrfs_simple_end_io()
* should_async_write()
* btrfs_wq_submit_bio()
* btrfs_use_zone_append()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is really no need to go through the super complex scrub_sectors()
to just handle super blocks. Introduce a dedicated function to handle
super block scrubbing.
This new function will introduce a behavior change, instead of using the
complex but concurrent scrub_bio system, here we just go submit-and-wait.
There is really not much sense in caring about the performance of super
block scrubbing. There are only 3 super blocks at most per device, and
they are all scattered around the device already.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 321f69f86a ("btrfs: reset device back to allocation state when
removing") included adding extent_io_tree_release(&device->alloc_state)
to btrfs_close_one_device(), which had already been called in
btrfs_free_device().
The alloc_state tree (IO_TREE_DEVICE_ALLOC_STATE), is created in
btrfs_alloc_device() and released in btrfs_close_one_device(). Therefore,
the additional call to extent_io_tree_release(&device->alloc_state) in
btrfs_free_device() is unnecessary and can be removed.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During my recent search for the root cause of a reported bug, I realized
that it's a good idea to issue a warning for missed cleanup instead of
using debug-only assertions. Since most installations run with debug off,
missed cleanups and premature calls to close could go unnoticed. However,
these issues are serious enough to warrant reporting and fixing.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs can use various different checksumming algorithms, and prints
the one used for a given file system at mount time. Don't bother
printing the crc32c implementation at module load time, the information
is available in /sys/fs/btrfs/FSID/checksum.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree-log code has three almost identical copies of the accounting on
an extent_buffer that doesn't need to be written any more. The only
difference is that walk_down_log_tree passes the bytenr used to find the
buffer instead of extent_buffer.start and calculates the length using the
nodesize, while the other two callers look at the extent_buffer.len
field that must always be equivalent to the nodesize.
Factor the code into a common helper.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Guard all the code to punt bios to a per-cgroup submission helper by a
new CONFIG_BLK_CGROUP_PUNT_BIO symbol that is selected by btrfs.
This way non-btrfs kernel builds don't need to have this code.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
REQ_CGROUP_PUNT is a bit annoying as it is hard to follow and adds
a branch to the bio submission hot path. To fix this, export
blkcg_punt_bio_submit and let btrfs call it directly. Add a new
REQ_FS_PRIVATE flag for btrfs to indicate to its own low-level
bio submission code that a punt to the cgroup submission helper
is required.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
punt_to_cgroup is only used by extent_write_locked_range, but that
function also directly controls the bio flags for the actual submission.
Remove the punt_to_cgroup field, and just set REQ_CGROUP_PUNT directly
in extent_write_locked_range.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
submit_one_async_extent needs to use kthread_associate_blkcg no matter
if the range it handles ends up being compressed or not, as the deadlock
risk due to cgroup throttling is the same. Call kthread_associate_blkcg
earlier to cover the submit_uncompressed_range case as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Let submit_one_async_extent, which is the only caller of
submit_uncompressed_range handle freeing of the async_extent in one
central place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_compressed_write should not have to care if it is called
from a helper thread or not. Move the kthread_associate_blkcg handling
into submit_one_async_extent, as that is the one caller that needs it.
Also move the assignment of REQ_CGROUP_PUNT into cow_file_range_async,
as that is the routine that sets up the helper thread offload.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When starting a transaction, we are assuming the number of bytes used for
each delayed ref update matches the number of bytes used for each item
update, that is the return value of:
btrfs_calc_insert_metadata_size(fs_info, num_items)
However that is not correct when we are using the free space tree, as we
need to multiply that value by 2, since delayed ref updates need to modify
the free space tree besides the extent tree.
So fix this by using btrfs_calc_delayed_ref_bytes() to get the correct
number of bytes used for delayed ref updates.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When starting a transaction we are comparing the result of a call to
btrfs_block_rsv_full() with 0, but the function returns a boolean. While
in practice it is not incorrect, as 0 is equivalent to false, it makes it
a bit odd and less readable. So update the check to not compare against 0
and instead use the logical not (!) operator.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If an application is doing direct io to a btrfs file and experiences a
page fault reading from the write buffer, iomap will issue a partial
bio, and allow the fs to keep going. However, there was a subtle bug in
this code path in the btrfs dio iomap implementation that led to the
partial write ending up as a gap in the file's extents and to be read
back as zeros.
The sequence of events in a partial write, lightly summarized and
trimmed down for brevity is as follows:
==== WRITING TASK ====
btrfs_direct_write
__iomap_dio_write
iomap_iter
btrfs_dio_iomap_begin # create full ordered extent
iomap_dio_bio_iter
bio_iov_iter_get_pages # page fault; partial read
submit_bio # partial bio
iomap_iter
btrfs_dio_iomap_end
btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
# submit to finish_ordered_fn wq
fault_in_iov_iter_readable # btrfs_direct_write detects partial write
__iomap_dio_write
iomap_iter
btrfs_dio_iomap_begin # create second partial ordered extent
iomap_dio_bio_iter
bio_iov_iter_get_pages # read all of remainder
submit_bio # partial bio with all of remainder
iomap_iter
btrfs_dio_iomap_end # nothing exciting to do with ordered io
==== DIO ENDIO ====
== FIRST PARTIAL BIO ==
btrfs_dio_end_io
btrfs_mark_ordered_io_finished # bytes_left > 0
# don't submit to finish_ordered_fn wq
== SECOND PARTIAL BIO ==
btrfs_dio_end_io
btrfs_mark_ordered_io_finished # bytes_left == 0
# submit to finish_ordered_fn wq
==== BTRFS FINISH ORDERED WQ ====
== FIRST PARTIAL BIO ==
btrfs_finish_ordered_io # called by dio_iomap_end_io, sees
# BTRFS_ORDERED_IOERR, just drops the
# ordered_extent
==SECOND PARTIAL BIO==
btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
# extents, csums, etc...
The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is completing an ordered_extent that results in crucial
functionality like writing out a file extent for the range.
More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it.
Thus, the finish io work doesn't result in file extents, csums, etc.
In the aftermath, such a file behaves as though it has a hole in it,
instead of the purportedly written data.
We considered a few options for fixing the bug:
1. treat the partial bio as if we had truncated the file, which would
result in properly finishing it.
2. split the ordered extent when submitting a partial bio.
3. cache the ordered extent across calls to __iomap_dio_rw in
iter->private, so that we could reuse it and correctly apply
several bios to it.
I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the benefit of also not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, that turned out to re-introduce a
deadlock which this code discarding the ordered_extent between faults
was meant to fix in the first place. (Link to an explanation of the
deadlock below.)
Therefore, go with fix 2, which requires a bit more setup work but fixes
the corruption without introducing the deadlock, which is fundamentally
caused by the ordered extent existing when we attempt to fault in a
range that overlaps with it.
Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if it
is, extract the ordered_extent that matches the bio exactly out of the
larger ordered_extent. Keep the remaining ordered_extent around in dio_data
for cancellation in iomap_end.
Thanks to Josef, Christoph, and Filipe for their help figuring out the
bug and the fix.
Fixes: 51bd9563b6 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
Link: https://pastebin.com/3SDaH8C6
Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: refactored the ordered_extent extraction ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
NOCOW writes just overwrite an existing extent map, which thus should
not be split in btrfs_extract_ordered_extent. The NOCOW case can't
currently happen as btrfs_extract_ordered_extent is only used on zoned
devices that do not support NOCOW writes, but this will change soon.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: split from a larger patch, wrote a commit log ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
To prepare for a new caller that already has the ordered_extent
available, change btrfs_extract_ordered_extent to take an argument
for it. Add a wrapper for the bio case that still has to do the
lookup (for now).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
split_zoned_em is only ever asked to split out the beginning of an extent
map. Change it to only take a len to split out instead of a pre and post
region.
Also rename the function to split_extent_map as there is nothing zoned
device specific about it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_clone_ordered_extent is very specific to the usage in
btrfs_split_ordered_extent. Now that only a single call to
btrfs_clone_ordered_extent is left, just fold it into
btrfs_split_ordered_extent to make the operation more clear.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_split_ordered_extent is only ever asked to split out the beginning
of an ordered_extent (i.e. post == 0). Change it to only take a len to
split out, and switch it to allocate the new extent for the beginning,
as that helps with callers that want to keep a pointer to the
ordered_extent that they are stealing from.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_extract_ordered_extent is always used to split an ordered_extent
and extent_map into two parts, so it doesn't need to deal with a three
way split.
Simplify it by only allowing for a single split point and always splitting
out the beginning of the extent, as that is what we'll later need to be
able to hold on to a reference to the original ordered_extent from which
the first part is split off for submission.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the three checks that are about ordered extent internal sanity
checking into btrfs_split_ordered_extent instead of doing them in the
higher level btrfs_extract_ordered_extent routine.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While it is not feasible for an ordered extent to survive across the
calls btrfs_direct_write makes into __iomap_dio_rw, it is still helpful
to stash it on the dio_data in between creating it in iomap_begin and
finishing it in either end_io or iomap_end.
The specific use I have in mind is that we can check if a particular bio
is partial in submit_io without unconditionally looking up the ordered
extent. This is a preparatory patch for a later patch which does just
that.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The ordered_extent flags are declared as unsigned long, so pass them as
such to btrfs_add_ordered_extent.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
[ hch: split from a larger patch ]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds
it to the rb_tree, but doesn't return a referenced pointer to the
caller. There are cases where it is useful for the creator of a new
ordered_extent to hang on to such a pointer, so add a new function
btrfs_alloc_ordered_extent which is the same as
btrfs_add_ordered_extent, except it takes an additional reference count
and returns a pointer to the ordered_extent. Implement
btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by
dropping the new reference and handling the IS_ERR case.
The type of flags in btrfs_alloc_ordered_extent and
btrfs_add_ordered_extent is changed from unsigned int to unsigned long
so it's unified with the other ordered extent functions.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs raid56 sector submission code uses bio_add_page() to add a
page to a newly created bio. bio_add_page() can fail, but the return
value is never checked.
Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
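A hedged sketch of the pattern (not the actual raid56 hunk; bdev, page,
len and offset are illustrative): the bio was just allocated with room
for this page, so the unchecked append cannot fail.

  bio = bio_alloc(bdev, 1, REQ_OP_WRITE, GFP_NOFS);
  __bio_add_page(bio, page, len, offset);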
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The btrfs repair bio submission code uses bio_add_page() to add a page
to a newly created bio. bio_add_page() can fail, but the return value is
never checked.
Use __bio_add_page() as adding a single page to a newly created bio is
guaranteed to succeed.
This brings us a step closer to marking bio_add_page() as __must_check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function wait_dev_flush() tests for the BTRFS_DEV_STATE_FLUSH_SENT
bit and then clears it separately. Instead, use test_and_clear_bit().
Though we don't strictly need the atomic test-and-clear, it follows a
common pattern.
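A minimal sketch of the pattern (the surrounding logic is omitted):

  if (!test_and_clear_bit(BTRFS_DEV_STATE_FLUSH_SENT, &device->dev_state))
          return;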
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The flush error code is maintained in btrfs_device::last_flush_error, so
there is no point in returning it in wait_dev_flush() when it is not being
used. Instead, we can return a boolean value.
Note that even though btrfs_device::last_flush_error may not be used, we
will keep it for now.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
check_barrier_error() is almost a single line function that just calls
btrfs_check_rw_degradable(), so open code it instead.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We parallelize the flush command across devices using our own code:
write_dev_flush() sends the flush command to each device and
wait_dev_flush() waits for the flush to complete on all devices. Errors
from each device are recorded at device->last_flush_error and reset to
BLK_STS_OK in write_dev_flush() and to the error, if any, in
wait_dev_flush(). These functions are called from barrier_all_devices().
This patch consolidates the use of device->last_flush_error in
write_dev_flush() and wait_dev_flush() to remove it from
barrier_all_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of using two labels at btrfs_evict_inode() for exiting depending
on whether we need to delete the inode items and orphan or some error
happened, we can use a single exit label if we initialize the block
reserve to NULL, since btrfs_free_block_rsv() ignores a NULL block reserve
pointer. So just do that. It will also make an upcoming change simpler by
avoiding one extra error label.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When updating the global block reserve, we account for the 6 items needed
by an unlink operation and the 6 delayed references for each one of those
items. However the calculation for the delayed references is not correct
in case we have the free space tree enabled, as in that case we need to
touch the free space tree as well and therefore need twice the number of
bytes. So use the btrfs_calc_delayed_ref_bytes() helper to calculate the
number of bytes needed for the delayed references at
btrfs_update_global_block_rsv().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of hard coding the number of metadata units for an unlink operation
in a couple places, define a macro and use it instead. This eliminates the
problem of one place getting out of sync with the other, such as recently
fixed by the previous patch in the series ("btrfs: fix calculation of the
global block reserve's size").
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_update_global_block_rsv(), we are assuming an unlink operation
uses 5 metadata units, but that's not true anymore, it uses 6 since the
commit bca4ad7c0b ("btrfs: reserve correct number of items for unlink
and rmdir"). So update the code and comments to consider 6 units.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When evicting an inode, we are incorrectly calculating the amount of space
required for a single delayed reference in case the free space tree is
enabled. We have to multiply by 2 the result of
btrfs_calc_insert_metadata_size(). We should be calculating according to
the size update and space release of the delayed block reserve logic at
btrfs_update_delayed_refs_rsv() and btrfs_delayed_refs_rsv_release().
Fix this by using the btrfs_calc_delayed_ref_bytes() helper at
evict_refill_and_join() instead of btrfs_calc_insert_metadata_size().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of duplicating the logic for calculating how much space is
required for a given number of delayed references, add an inline helper
to encapsulate that logic and use it everywhere we are calculating the
space required.
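A hedged sketch of such a helper (the actual signature and body in the
patch may differ slightly):

  static inline u64 btrfs_calc_delayed_ref_bytes(const struct btrfs_fs_info *fs_info,
                                                 int num_delayed_refs)
  {
          u64 num_bytes;

          num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_delayed_refs);
          /*
           * With the free space tree enabled each delayed ref also has to
           * modify the free space tree, so double the reservation.
           */
          if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
                  num_bytes <<= 1;
          return num_bytes;
  }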
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now that btrfs_calc_insert_metadata_size() can take a const fs_info
argument, make the fs_info argument of calc_reclaim_items_nr() and of
calc_delayed_refs_nr() const as well.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The fs_info argument of the helpers btrfs_calc_insert_metadata_size() and
btrfs_calc_metadata_size() is not modified so it can be const. This will
also allow a new helper function in one of the next patches to have its
fs_info argument as const.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When flushing a limited number of delayed references (FLUSH_DELAYED_REFS_NR
state), we are assuming each delayed reference is holding a number of bytes
matching the needed space for inserting for a single metadata item (the
result of btrfs_calc_insert_metadata_size()). That is not correct when
using the free space tree, as in that case we have to multiply that value
by 2 since we need to touch the free space tree as well. This is the same
computation as we do at btrfs_update_delayed_refs_rsv() and at
btrfs_delayed_refs_rsv_release().
So correct the computation for the amount of delayed references we need to
flush in case we have the free space tree. This does not fix a functional
issue, instead it makes the flush code flush less delayed references, only
the minimum necessary to satisfy a ticket.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When refilling the delayed block reserve we are incorrectly computing the
amount of bytes for a single delayed reference if the free space tree is
being used. In that case we should double the calculated amount.
Everywhere else we compute the correct amount, like when updating the
delayed block reserve, at btrfs_update_delayed_refs_rsv(), or when
releasing space from the delayed block reserve, at
btrfs_delayed_refs_rsv_release().
So fix btrfs_delayed_refs_rsv_refill() to multiply the amount of bytes for
a single delayed reference by two in case the free space tree is used.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
During inode eviction, if we are truncating a deleted inode, we don't add
delayed items for our inode, so there's no need to throttle on delayed
items on each iteration of the loop that truncates inode items from its
subvolume tree. But we dirty extent buffers from its subvolume tree, so
we only need to throttle on btree inode dirty pages.
So use btrfs_btree_balance_dirty_nodelay() in the loop that truncates
inode items.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have this logic encapsulated in btrfs_should_throttle_delayed_refs()
where we try to estimate if running the current amount of delayed
references we have will take more than half a second, and if so, the
caller of btrfs_should_throttle_delayed_refs() should do something to
prevent more and more delayed refs from being accumulated.
This logic was added in commit 0a2b2a844a ("Btrfs: throttle delayed
refs better") and then further refined in commit a79b7d4b3e ("Btrfs:
async delayed refs"). The idea back then was that the caller of
btrfs_should_throttle_delayed_refs() would release its transaction
handle (by calling btrfs_end_transaction()) when that function returned
true, then btrfs_end_transaction() would trigger an async job to run
delayed references in a workqueue, and later start/join a transaction
again and do more work.
However we don't run delayed references asynchronously anymore, that
was removed in commit db2462a6ad ("btrfs: don't run delayed refs in
the end transaction logic"). That makes the logic that tries to estimate
how long we will take to run our current delayed references, at
btrfs_should_throttle_delayed_refs(), pointless as we don't take any
action to run delayed references anymore. We do have other types of
throttling, which consist of checking the size and reserved space of
the delayed and global block reserves, as well as whether flushing
delayed references for the current transaction has already started,
etc - this
is all done by btrfs_should_end_transaction(), and the only user of
btrfs_should_throttle_delayed_refs() does periodically call
btrfs_should_end_transaction().
So remove btrfs_should_throttle_delayed_refs() and the infrastructure
that keeps track of the average time used for running delayed references,
as well as adapting btrfs_truncate_inode_items() to call
btrfs_check_space_for_delayed_refs() instead.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_block_rsv_refill(), there's no point in initializing the
'num_bytes' variable to 0 and then, after taking the block reserve's
spinlock, initializing it to the value of the 'min_reserved' parameter.
So just get rid of the 'num_bytes' local variable and rename the
'min_reserved' parameter to 'num_bytes'.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_truncate_inode_items(), in the while loop when we decide that we
are going to delete an item, it's pointless to check that 'pending_del_nr'
is non-zero in an else clause because the corresponding if statement is
checking if 'pending_del_nr' has a value of zero. So just remove that
condition from the else clause.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When reserving metadata space for delalloc (and direct IO too), at
btrfs_delalloc_reserve_metadata(), there's no need to count the number of
extents while holding the inode's spinlock, since that does not require
access to any field of the inode.
This section of code can be called concurrently, when we have direct IO
writes against different file ranges that don't increase the inode's
i_size, so it's beneficial to shorten the critical section by counting
the number of extents before taking the inode's spinlock.
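A sketch of the reordering (assuming a helper like count_max_extents()
that only depends on the write length, as described above; the names of
the locals are illustrative):
    /* Safe to do before locking: depends only on the number of bytes. */
    nr_extents = count_max_extents(fs_info, num_bytes);
    spin_lock(&inode->lock);
    btrfs_mod_outstanding_extents(inode, nr_extents);
    /* ... other per-inode accounting under the lock ... */
    spin_unlock(&inode->lock);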
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The only caller of btrfs_make_block_group() always passes 0 as the value
for the bytes_used argument, so remove it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function should_end_transaction() is very short and only has one
caller, which is btrfs_should_end_transaction(). So move the code from
should_end_transaction() into btrfs_should_end_transaction().
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_should_throttle_delayed_refs() returns 1 or 2 in case the
delayed refs should be throttled, however the only caller (inode eviction
and truncation path) does not care about those two different conditions,
it treats the return value as a boolean. This allows us to remove one of
the conditions in btrfs_should_throttle_delayed_refs() and change its
return value from 'int' to 'bool'. So just do that.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At space-info.c:__reserve_bytes(), instead of initializing 'ret' to 0 when
it's declared and then shortly after set it to -ENOSPC under the space
info's spinlock, initialize it to -ENOSPC when declaring it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When reserving space, at space-info.c:__reserve_bytes(), we assert that
either the current task is not holding a transaction handle, or, if it is,
that the flush method is not BTRFS_RESERVE_FLUSH_ALL. This is because that
flush method can trigger transaction commits, and therefore could lead to
a deadlock.
However there are 2 other flush methods that can trigger transaction
commits:
1) BTRFS_RESERVE_FLUSH_ALL_STEAL
2) BTRFS_RESERVE_FLUSH_EVICT
So update the assertion to check that the flush method is also not one of
those two methods if the current task is holding a transaction handle.
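The updated assertion is roughly the following (sketch; btrfs tracks an
attached transaction handle in current->journal_info):
    ASSERT(!current->journal_info ||
           (flush != BTRFS_RESERVE_FLUSH_ALL &&
            flush != BTRFS_RESERVE_FLUSH_ALL_STEAL &&
            flush != BTRFS_RESERVE_FLUSH_EVICT));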
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The BTRFS_RESERVE_FLUSH_EVICT flush method can also commit transactions,
see the definition of the evict_flush_states const array at space-info.c,
but the documentation for it at space-info.h does not mention it.
So update the documentation.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The block reserve passed to btrfs_block_rsv_check() is never NULL, so
remove the check. In case it can ever become NULL in the future, then
we'll get a pretty obvious and clear NULL pointer dereference crash and
stack trace.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_delayed_refs_rsv_refill(), we are passing a value of 0 to the
'update_size' argument of btrfs_block_rsv_add_bytes(), which is defined
as a boolean. Functionally this is fine because a 0 is, implicitly,
converted to a boolean false value. However it's easier to read an
explicit 'false' value, so just pass 'false' instead of 0.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The last argument of btrfs_block_rsv_migrate() is a boolean, but we are
passing an integer, with a value of 1, to it at evict_refill_and_join().
While this is not a bug, due to type conversion, it's a lot more clear to
simply pass the boolean true value instead. So just do that.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's not used anywhere at the moment, but it was used in an earlier
version of a patch, and that use was dropped in the second version.
So just remove btrfs_lru_cache_is_full().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_add_compressed_bio_pages is needlessly complicated. Instead
of iterating over the logical disk offset just to add pages to the bio,
use a simple offset starting at 0, which also removes most of the
related calculations. Additionally __bio_add_page already takes care of
asserting that the bio is always properly sized, and btrfs_submit_bio
called right after asserts that the bio size is non-zero.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Adding pages to a bio has nothing to do with the sector. Move the
assignment to the two callers in preparation for cleaning up
btrfs_add_compressed_bio_pages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently, /sys/fs/btrfs/<UUID>/bg_reclaim_threshold is limited to 0
(disable) or [50 .. 100]%, so we need to fill 50% of a device to start the
auto reclaim process. It is cumbersome to do so when we want to shake out
possible race issues of normal write vs reclaim.
Relax the threshold check under the BTRFS_DEBUG option.
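A sketch of what the relaxed check in the sysfs store handler could look
like (illustrative only, the exact code may differ):
    /* thresh was parsed from the value written to the sysfs file */
#ifdef CONFIG_BTRFS_DEBUG
    /* Debug builds: accept any percentage in [0, 100]. */
    if (thresh < 0 || thresh > 100)
        return -EINVAL;
#else
    /* Regular builds: only 0 (disabled) or [50, 100]. */
    if (thresh != 0 && (thresh < 50 || thresh > 100))
        return -EINVAL;
#endif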
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_split_bio expects a btrfs_bio as argument and always allocates one.
Type both the orig_bio argument and the return value as struct btrfs_bio
to improve type safety.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Return the containing struct btrfs_bio instead of the less type safe
struct bio from btrfs_bio_alloc.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio in struct btrfs_bio_ctrl must be a btrfs_bio, so store a pointer
to the btrfs_bio for better type checking.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_bio now has an always valid inode pointer that can be used
to find the inode in submit_one_bio, so use that and initialize all
variables for which it is possible at declaration time.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The original bio must be a btrfs_bio, so store a pointer to the
btrfs_bio for better type checking.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_compressed_read expects the bio passed to it to be embedded
into a btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_submit_bio expects the bio passed to it to be embedded into a
btrfs_bio structure. Pass the btrfs_bio directly to increase type
safety and make the code self-documenting.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
All algorithms have to fill the remainder of the orig_bio with zeroes,
so do it in common code.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_encoded_read_regular_fill_pages has a pretty odd control flow.
Unwind it so that there is a single loop over the pages array.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The inode and file_offset members in struct btrfs_encoded_read_private
are unused, so remove them.
Last used in commit 7959bd4411 ("btrfs: remove the start argument to
check_data_csum and export") and commit 7609afac67 ("btrfs: handle
checksum validation and repair at the storage layer").
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
The DREW lock uses a percpu variable to track the lock counters and for
that it needs to allocate the structure. In btrfs_read_tree_root() or
btrfs_init_fs_root() this may add another error case or require the
NOFS scope protection.
One way is to preallocate the structure as was suggested in
https://lore.kernel.org/linux-btrfs/20221214021125.28289-1-robbieko@synology.com/
We may avoid the allocation altogether if we don't use the percpu
variables but an atomic for the writer counter. This should not make any
difference, the DREW lock is used for truncate and NOCOW writes along
with other IO operations.
The percpu counter for writers has been there since the original commit
8257b2dc3c "Btrfs: introduce btrfs_{start, end}_nocow_write() for
each subvolume". The reason could be to avoid hammering the same
cacheline from all the readers but then the writers do that anyway.
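Roughly, the structure goes from a percpu counter to a plain atomic for
the writer side (sketch of the shape only, the exact field order may
differ):
    struct btrfs_drew_lock {
        atomic_t readers;
        atomic_t writers;    /* was: struct percpu_counter writers */
        wait_queue_head_t pending_writers;
        wait_queue_head_t pending_readers;
    };
With an atomic_t there is nothing left to allocate, so the lock can be
initialized without a failure path.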
Signed-off-by: David Sterba <dsterba@suse.com>
If no discard mount option is specified (including the NODISCARD
option), we make async discard the default option, so there is no need
to call clear_opt again to clear the NODISCARD flag. Though this makes
no functional difference, the redundant call has been pointed out
several times, so we better remove it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
BTRFS_FEATURE_INCOMPAT_SUPP is defined twice, once under
CONFIG_BTRFS_DEBUG and once without it, resulting in repetitive code. The
reason for this is to add experimental features under CONFIG_BTRFS_DEBUG.
To avoid repetitive code, add a common list BTRFS_FEATURE_INCOMPAT_SUPP_STABLE,
and append experimental features only under CONFIG_BTRFS_DEBUG.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We don't need to pass the roots as arguments, reading them from the
rb-tree is cheap. Thus there is really not much need to pre-fetch them
and pass them all the way from scrub_stripe().
And we already have more than enough arguments in scrub_simple_mirror()
and scrub_simple_stripe(), it's better to remove them and only grab
those roots in scrub_simple_mirror().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The variable @path is no longer passed into any call sites after commit
18d30ab961 ("btrfs: scrub: use scrub_simple_mirror() to handle RAID56
data stripe scrub"), thus we can remove the variable completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Currently btrfs can use the dev-replace target device as an extra mirror for
read-repair. But it can lead to NODATASUM corruption in the following
case:
There is a RAID1 data chunk, and dev-replace is running from
dev2 to dev0.
  |//| = Replaced data
             X        X+1MB    X+2MB
  Dev 2:     |        |        |        <- Source dev
  Dev 0:     |////////|        |        <- Target dev
Then a read on dev 2 X+2MB happens.
And something wrong happened inside devid 2, causing an -EIO.
In that case, read-repair would try the next mirror, and since we can
use target device as an extra mirror, we will use that mirror instead.
But unfortunately, since the read is beyond the current replace cursor,
we should not trust it at all; what we get would be just uninitialized
garbage.
And if this read is for a NODATASUM range, then we just trust it and
cause data corruption.
[CAUSE]
We used to have some checks to make sure we only return such extra
mirror when the range is before our left cursor.
The first commit introducing this behavior is ad6d620e2a ("Btrfs:
allow repair code to include target disk when searching mirrors").
But later a fix, 22ab04e814 ("Btrfs: fix race between device replace
and chunk allocation") changed the behavior, to always let
btrfs_map_block() include the extra mirror to address a race in
dev-replace which can cause missing writes to target device.
This means we lose the cursor tracking for the extra mirror, which can
lead to the above corruption.
[FIX]
The extra mirror is never a reliable one: at the beginning of
dev-replace its reliability is zero, and only at the end of the replace
is it a fully reliable mirror.
We either do the complex cursor tracking, or never trust it.
IMHO it's much easier to maintain if we don't trust it at all, since the
extra mirror can only be of benefit for a limited period of time (during
the replace).
Thus this patch completely removes the ability to use the target device
as an extra mirror.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently open_ctree() still uses two variables for error handling, err
and ret. This can be confusing, can miss some errors, and does not
conform to the current coding style.
This patch will fix the problems by:
- Use only ret for error handling
- Add proper ret assignment
Originally we relied on the default value (-EINVAL) of err to handle
errors, but that doesn't really reflect the error.
This will change it to use the correct error number for the following
call sites:
* subpage_info allocation
* btrfs_free_extra_devids()
* btrfs_check_rw_degradable()
* cleaner_kthread allocation
* transaction_kthread allocation
- Add an extra ASSERT()
To make sure we error out instead of returning 0.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a bio_offset variable for the current offset into the bio
instead of recalculating it over and over. Remove the now only used
once search_len and sector_offset variables, and reduce the scope for
count and cur_disk_bytenr.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no need to search for a file offset in a bio, it is now always
provided in bbio->file_offset (set at bio allocation time since
0d495430db ("btrfs: set bbio->file_offset in alloc_new_bio")). Just
use that with the offset into the bio.
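In other words, for a given offset into the bio, the wanted file offset
is simply (sketch; the variable names are illustrative):
    /* No need to search the bio for the file offset anymore. */
    file_offset = bbio->file_offset + bio_offset;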
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nowadays calc_bio_boundaries() is a relatively simple function that only
guarantees the "one bio equals one ordered extent" rule for uncompressed
Zone Append bios.
Sink it into its only caller alloc_new_bio().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
bio_add_page always adds either the entire range passed to it or nothing.
Based on that btrfs_bio_add_page can only return a length smaller than
the passed in one when hitting the ordered extent limit, which can only
happen for writes. Given that compressed writes never even use this code
path, this means that all the special cases for compressed extent offset
handling are dead code.
Reflow submit_extent_page to take advantage of this by inlining
btrfs_bio_add_page and handling the ordered extent limit by decrementing
it for each added range and thus significantly simplifying the loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Different loop iterations in btrfs_bio_add_page not only have the same
contiguity parameters, but also any non-initial operation operates on a
fresh bio anyway.
Factor out the contiguity check into a new btrfs_bio_is_contig and only
call it once in submit_extent_page before descending into the
bio_add_page loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove the has_error and saved_ret variables, and just jump to a goto
label for error handling from the only place returning an error from the
main loop.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
submit_extent_page always returns 0 since commit d5e4377d50 ("btrfs:
split zone append bios in btrfs_submit_bio"). Change it to a void return
type and remove all the unreachable error handling code in the callers.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Update the compress_type in the btrfs_bio_ctrl after forcing out the
previous bio in btrfs_do_readpage, so that alloc_new_bio can just use
the compress_type member in struct btrfs_bio_ctrl instead of passing the
same information redundantly as a function argument.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Rename this_bio_flag to compress_type to match the surrounding code
and better document the intent. Also use the proper enum type instead
of unsigned long.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The compress_type can only change on a per-extent basis. So instead of
checking it for every page in btrfs_bio_add_page, do the check once in
btrfs_do_readpage, which is the only caller of btrfs_bio_add_page and
submit_extent_page that deals with compressed extents.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of passing down the wbc pointer the deep call chain, just
add it to the btrfs_bio_ctrl structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The sync_io flag is equivalent to wbc->sync_mode == WB_SYNC_ALL, so
just check for that and remove the separate flag.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The bio op and flags never change over the life time of a bio_ctrl,
so move it in there instead of passing it down the deep call chain
all the way down to alloc_new_bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If force_bio_submit, submit_extent_page simply calls submit_one_bio as
the first thing. This can just be moved to the only caller that sets
force_bio_submit to true.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When read_extent_buffer_subpage calls submit_extent_page, it does
so on a freshly initialized btrfs_bio_ctrl structure that can't have
a valid bio to submit. Clear the force_bio_submit parameter to false
as there is nothing to submit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_bin_search() is a simple wrapper that searches the whole range of
slots by calling btrfs_generic_bin_search() with the starting slot
(first_slot) preset to 0.
This simple wrapper can be open coded as btrfs_bin_search().
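For reference, the wrapper that gets folded away is essentially this
(sketch):
    int btrfs_bin_search(struct extent_buffer *eb,
                         const struct btrfs_key *key, int *slot)
    {
        /* Search the whole node/leaf, i.e. start at slot 0. */
        return btrfs_generic_bin_search(eb, 0, key, slot);
    }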
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
Although the dev replace ioctl has a way to specify the read mode, i.e.
whether we should read from the source device, it's not properly
followed.
# mkfs.btrfs -f -d raid1 -m raid1 $dev1 $dev2
# mount $dev1 $mnt
# xfs_io -f -c "pwrite 0 32M" $mnt/file
# sync
# btrfs replace start -r -f 1 $dev3 $mnt
And one extra trace is added to scrub_submit(), showing the detail about
the bio:
btrfs-11569 [005] ... 37.0270: scrub_submit.part.0: devid=1 logical=22036480 phy=22036480 len=16384
btrfs-11569 [005] ... 37.0273: scrub_submit.part.0: devid=1 logical=30457856 phy=30457856 len=32768
btrfs-11569 [005] ... 37.0274: scrub_submit.part.0: devid=1 logical=30507008 phy=30507008 len=49152
btrfs-11569 [005] ... 37.0274: scrub_submit.part.0: devid=1 logical=30605312 phy=30605312 len=32768
btrfs-11569 [005] ... 37.0275: scrub_submit.part.0: devid=1 logical=30703616 phy=30703616 len=65536
btrfs-11569 [005] ... 37.0281: scrub_submit.part.0: devid=1 logical=298844160 phy=298844160 len=131072
...
btrfs-11569 [005] ... 37.0762: scrub_submit.part.0: devid=1 logical=322961408 phy=322961408 len=131072
btrfs-11569 [005] ... 37.0762: scrub_submit.part.0: devid=1 logical=323092480 phy=323092480 len=131072
One can see that all the reads are submitted to devid 1, even though we
have specified the "-r" option to avoid reading from the source device.
[CAUSE]
The dev-replace read mode is only set but not followed by scrub code at
all. In fact, only the common read path properly follows the read mode;
scrub itself has its own read path, which does not follow the mode.
[FIX]
Here we enhance scrub_find_good_copy() to also follow the read mode.
The idea is pretty simple: in the first loop, we avoid the following
devices:
- Missing devices (this is the existing condition)
- The source device, if the replace wants to avoid it
And if the above loop finds no candidate (e.g. when replacing a single
device), then we drop the second condition and try again (see the
sketch below).
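A sketch of that two-pass selection (the per-stripe checks here are
placeholders for the actual conditions, not real helper names):
    for (i = 0; i < nr_mirrors; i++) {
        if (stripe_device_missing(i))    /* existing condition */
            continue;
        if (avoid_src_dev && stripe_is_replace_src(i))
            continue;
        good_mirror = i;
        break;
    }
    if (good_mirror < 0) {
        /* e.g. replacing a single device: allow the source dev. */
        for (i = 0; i < nr_mirrors; i++) {
            if (stripe_device_missing(i))
                continue;
            good_mirror = i;
            break;
        }
    }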
Since we're here, also enhance the function scrub_find_good_copy() by:
- Removing the forward declaration
- Making it return int, to indicate errors (e.g. no good mirror found)
- Adding extra error messages
Now with the same trace, "btrfs replace start -r" works as expected:
btrfs-1213 [000] ... 991.9059: scrub_submit.part.0: devid=2 logical=22036480 phy=1064960 len=16384
btrfs-1213 [000] ... 991.9062: scrub_submit.part.0: devid=2 logical=30457856 phy=9486336 len=32768
btrfs-1213 [000] ... 991.9063: scrub_submit.part.0: devid=2 logical=30507008 phy=9535488 len=49152
btrfs-1213 [000] ... 991.9064: scrub_submit.part.0: devid=2 logical=30605312 phy=9633792 len=32768
btrfs-1213 [000] ... 991.9065: scrub_submit.part.0: devid=2 logical=30703616 phy=9732096 len=65536
btrfs-1213 [000] ... 991.9073: scrub_submit.part.0: devid=2 logical=298844160 phy=277872640 len=131072
btrfs-1213 [000] ... 991.9075: scrub_submit.part.0: devid=2 logical=298975232 phy=278003712 len=131072
btrfs-1213 [000] ... 991.9078: scrub_submit.part.0: devid=2 logical=299106304 phy=278134784 len=131072
...
btrfs-1213 [000] ... 991.9474: scrub_submit.part.0: devid=2 logical=318504960 phy=297533440 len=131072
btrfs-1213 [000] ... 991.9476: scrub_submit.part.0: devid=2 logical=318636032 phy=297664512 len=131072
btrfs-1213 [000] ... 991.9479: scrub_submit.part.0: devid=2 logical=318767104 phy=297795584 len=131072
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fold finish_compressed_bio_write into its only caller as there is no
reason to keep them separate.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No one ever set ->mapping on these pages, so don't bother clearing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Share the code to free the compressed pages and the array to hold them
into a common helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Factor out a common helper to add the compressed_bio pages to the
bio that is shared by the compressed read and write path.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_bio now has a file_offset field set up by all submitters.
Use that value combined with the bio size in add_ra_bio_pages to
calculate the last offset in the bio.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
struct btrfs_bio now has a file_offset field set up by all submitters.
Use that in btrfs_submit_compressed_read instead of recalculating the
value.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
em can't be non-NULL after the free_extent_map label. Also remove
the now pointless clearing of em to NULL after freeing it.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Embed a btrfs_bio into struct compressed_bio. This avoids potential
(so far theoretical) deadlocks due to nesting of btrfs_bioset allocations
for the original read bio and the compressed bio, and avoids an extra
memory allocation in the I/O path.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_io_context structure, we have a pointer raid_map, which
indicates the logical bytenr for each stripe.
But considering we always call sort_parity_stripes(), the result
raid_map[] is always sorted, thus raid_map[0] is always the logical
bytenr of the full stripe.
So why do we waste the space and time (for sorting) on raid_map?
This patch will replace btrfs_io_context::raid_map with a single u64
number, full_stripe_start, by:
- Replace btrfs_io_context::raid_map with full_stripe_start
- Replace call sites using raid_map[0] to use full_stripe_start
- Replace call sites using raid_map[i] to compare with nr_data_stripes.
The benefits are:
- Less memory wasted on raid_map
It's sizeof(u64) * num_stripes vs sizeof(u64).
It'll always save at least one u64, and the benefit grows larger with
num_stripes.
- No more weird alloc_btrfs_io_context() behavior
As there is only one fixed size + one variable length array.
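A sketch of how call sites change (assuming the parity markers used to
be stored in raid_map[] and nr_data_stripes is known from the raid
profile; illustrative only):
    /* Before: */
    full_stripe_start = bioc->raid_map[0];
    is_parity = (bioc->raid_map[i] == RAID5_P_STRIPE ||
                 bioc->raid_map[i] == RAID6_Q_STRIPE);
    /* After: */
    full_stripe_start = bioc->full_stripe_start;
    is_parity = (i >= nr_data_stripes);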
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For btrfs dev-replace, we have to duplicate writes to the source
device into the target device.
For non-RAID56, all writes into the same mapped ranges are sharing the
same content, thus they don't really need to bother anything.
(E.g. in btrfs_submit_bio() for non-RAID56 range we just submit the
same write to all involved devices).
But for RAID56, all stripes contain different content, thus we must
have a clear mapping of which stripe is duplicated from which original
stripe.
Currently we use a complex way using tgtdev_map[] array, e.g:
  num_tgtdevs = 1
  tgtdev_map[0] = 0    <- Means stripes[0] is not involved in replace.
  tgtdev_map[1] = 3    <- Means stripes[1] is involved in replace,
                          and it's duplicated to stripes[3].
  tgtdev_map[2] = 0    <- Means stripes[2] is not involved in replace.
But this is wasting some space, and ignores one important thing about
dev-replace: there is at most one running replace.
Thus we can change it to a fixed array to represent the mapping:
  replace_nr_stripes = 1
  replace_stripe_src = 1    <- Means stripes[1] is involved in replace,
                               thus the extra stripe is a copy of
                               stripes[1].
By this we can save some space for bioc on RAID56 chunks with many
devices. And we get rid of one variable sized array from bioc.
Thus the patch involves the following changes:
- Replace @num_tgtdevs and @tgtdev_map[] with @replace_nr_stripes
and @replace_stripe_src.
@num_tgtdevs is just renamed to @replace_nr_stripes.
While the mapping is completely changed.
- Add extra ASSERT()s for RAID56 code
- Only add two more extra stripes for dev-replace cases.
As we have an upper limit on how many dev-replace stripes we can have.
- Unify the behavior of handle_ops_on_dev_replace()
Previously handle_ops_on_dev_replace() went down two different paths
for WRITE and GET_READ_MIRRORS.
Now unify them by always going the WRITE path first (with at most 2
replace stripes), then if we're doing GET_READ_MIRRORS and we have 2
extra stripes, just drop one stripe.
- Remove the @real_stripes argument from alloc_btrfs_io_context()
As we don't need the old variable length array any more.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
That structure is our ultimate object for all __btrfs_map_block()
related functions. We have some hard to understand members, like
tgtdev_map, but without any comments.
This patch will improve the situation:
- Add extra comments for num_stripes, mirror_num, num_tgtdevs and
tgtdev_map[]
Especially for the last two members, add a dedicated (thus very long)
comment for them, with an example to explain it.
- Shrink those int members to u16.
In fact our on-disk format is only using u16 for num_stripes, thus
no need to use int at all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is no memory re-allocation for handle_ops_on_dev_replace(), thus
we don't need to pass a btrfs_io_context pointer.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are quite a few div64 calls inside btrfs_map_block() and its
variants.
Such calls are for @stripe_nr, where @stripe_nr is the number of
stripes before our logical bytenr inside a chunk.
However we can eliminate such div64 calls by just reducing the width of
@stripe_nr from 64 to 32.
This can be done because our chunk size limit is already 10G, with fixed
stripe length 64K.
Thus a U32 is definitely enough to contain the number of stripes.
With such width reduction, we can get rid of the slower div64 calls and
the extra warnings on certain 32-bit arches.
This patch would do:
- Add a new tree-checker chunk validation on chunk length
Make sure no chunk can reach 256G, which can also act as a bitflip
checker.
- Reduce the width from u64 to u32 for @stripe_nr variables
- Replace unnecessary div64 calls with regular modulo and division
32bit division and modulo are much faster than 64bit operations, and
we are finally free of the div64 fear at least in those involved
functions.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs doesn't support stripe lengths other than 64KiB.
This is already set in the tree-checker.
There is really no point in recording that fixed value in map_lookup
for now; it can all be replaced with BTRFS_STRIPE_LEN.
Furthermore we can use the fixed stripe length to do the following
optimizations (sketched below):
- Use BTRFS_STRIPE_LEN_SHIFT to replace some 64bit division
Now we only need to do a right shift.
And the value of BTRFS_STRIPE_LEN itself is already too large to be used
as a shift count, thus if we accidentally use BTRFS_STRIPE_LEN to do a
bit shift, a compiler warning would be triggered.
Thus this bit shift optimization is safe.
- Use BTRFS_STRIPE_LEN_MASK to calculate the offset inside a stripe
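A sketch of the kind of change this (together with the u32 stripe_nr
above) enables, illustrative rather than the exact call sites:
    /* Before: 64-bit division and multiplication per mapping. */
    stripe_nr = div64_u64(offset, map->stripe_len);
    stripe_offset = offset - stripe_nr * map->stripe_len;
    /* After: the stripe length is fixed at BTRFS_STRIPE_LEN (64K). */
    stripe_nr = offset >> BTRFS_STRIPE_LEN_SHIFT;
    stripe_offset = offset & BTRFS_STRIPE_LEN_MASK;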
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move the remaining code that deals with initializing the btree
inode into btrfs_init_btree_inode instead of splitting it between
that helper and its only caller.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function search_file_offset_in_bio() returns the file offset through the
file_offset_ret argument, and uses the return value to indicate whether
the search was successful, so make the return type bool.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The function btrfs_lookup_bio_sums() and a nested if statement declare
ret respectively as blk_status_t and int.
There is no need to store the return value of
search_file_offset_in_bio() in ret as this is a one-time call.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Remove btrfs_csum_ptr() and fold it into its only caller.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These days all the operations that take locks in the raid56.c code are
run from user context (mostly workqueues). Drop all the irqsafe locking
that is not required any more.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were seeing weird errors when we were testing our btrfs backports
before we had the incorrect level check fix. These errors appeared to
come from improper error handling, but error injection testing uncovered
that the errors were a result of corruption caused by improper error
handling during snapshot delete.
With snapshot delete if we encounter any errors during walk_down or
walk_up we'll simply return an error, we won't abort the transaction.
This is problematic because we will be dropping references for nodes and
leaves along the way, and if we fail in the middle we will leave the
file system corrupt because we don't know where we left off in the drop.
Fix this by making sure we abort if we hit any errors during the walk
down or walk up operations, as we have no idea what operations could
have been left half done at this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can get errors in walk_down_proc as we try to look up the extent info
that the snapshot drop needs to act on.
return 1 which indicates we're done with walking down, which will lead
us to improperly continue with the snapshot drop with the incorrect
information. Instead break if we get any error from walk_down_proc or
do_walk_down, and handle the case of ret == 1 by returning 0, otherwise
return the ret value that we have.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we mount the file system we do something like this:
    while (1) {
        lookup fs roots;
        for (i = 0; i < num_roots; i++) {
            ret = btrfs_orphan_cleanup(roots[i]);
            if (ret)
                break;
            btrfs_put_root(roots[i]);
        }
    }
    for (; i < num_roots; i++)
        btrfs_put_root(roots[i]);
As you can see if we break in that inner loop we just go back to the
outer loop and lose the fact that we have to drop references on the
remaining roots we looked up. Fix this by adding an out label and
jumping to it on error, so we don't leak a reference to the roots we
looked up (see the sketch below).
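With the fix the flow becomes (sketch, mirroring the pseudocode above):
    while (1) {
        lookup fs roots;
        for (i = 0; i < num_roots; i++) {
            ret = btrfs_orphan_cleanup(roots[i]);
            if (ret)
                goto out;
            btrfs_put_root(roots[i]);
        }
    }
out:
    /* Drop the references of any roots we did not get to. */
    for (; i < num_roots; i++)
        btrfs_put_root(roots[i]);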
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We missed a couple of iput()s in the orphan cleanup failure paths, add
them so we don't get refcount errors. The iput needs to be done in the
check and not under a common label due to the way the code is
structured.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While investigating a problem with error injection I tripped over
curious behavior in the node/leaf splitting code. If we get an EIO when
trying to read either the left or right leaf/node for splitting we'll
simply treat the node as if it were full and continue on. The end
result of this isn't too bad, we simply end up allocating a block when
we may have pushed items into the adjacent blocks.
However this does essentially allow us to continue to modify a file
system that we've gotten errors on, either from a bad disk or csum
mismatch or other corruption. This isn't particularly safe, so instead
handle these btrfs_read_node_slot() usages differently. We allow you to
pass in any slot, the idea being that we save some code if the slot
number is outside of the range of the parent. This means we treat all
errors the same, when in reality we only want to ignore -ENOENT.
Fix this by changing how we call btrfs_read_node_slot(), which is to
only call it for slots we know are valid. This way if we get an error
back from reading the block we can properly pass the error up the chain.
This was validated with the error injection testing I was doing.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_read_node_slot() we have a BUG_ON() that can be converted to an
ASSERT(): the level comes from an extent buffer and is validated at the
time the buffer is read from disk.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While trying to track down a lost EIO problem I hit the following
assertion while doing my error injection testing
BTRFS warning (device nvme1n1): transaction 1609 (with 180224 dirty metadata bytes) is not committed
assertion failed: !found, in fs/btrfs/disk-io.c:4456
------------[ cut here ]------------
kernel BUG at fs/btrfs/messages.h:169!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 1445 Comm: mount Tainted: G W 6.2.0-rc5+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.1-2.fc37 04/01/2014
RIP: 0010:btrfs_assertfail.constprop.0+0x18/0x1a
RSP: 0018:ffffb95fc3b0bc68 EFLAGS: 00010286
RAX: 0000000000000034 RBX: ffff9941c2ac2000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffb6741f7d RDI: 00000000ffffffff
RBP: ffff9941c2ac2428 R08: 0000000000000000 R09: ffffb95fc3b0bb38
R10: 0000000000000003 R11: ffffffffb71438a8 R12: ffff9941c2ac2428
R13: ffff9941c2ac2450 R14: ffff9941c2ac2450 R15: 000000000002c000
FS: 00007fcea2d07800(0000) GS:ffff9941fbc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f00cc7c83a8 CR3: 000000010c686000 CR4: 0000000000350ef0
Call Trace:
<TASK>
close_ctree+0x426/0x48f
btrfs_mount_root.cold+0x7e/0xee
? legacy_parse_param+0x2b/0x220
legacy_get_tree+0x2b/0x50
vfs_get_tree+0x29/0xc0
vfs_kern_mount.part.0+0x73/0xb0
btrfs_mount+0x11d/0x3d0
? legacy_parse_param+0x2b/0x220
legacy_get_tree+0x2b/0x50
vfs_get_tree+0x29/0xc0
path_mount+0x438/0xa40
__x64_sys_mount+0xe9/0x130
do_syscall_64+0x3e/0x90
entry_SYSCALL_64_after_hwframe+0x72/0xdc
This is because the error injection did an EIO for the root inode lookup
and we simply jumped to closing the ctree. However because we didn't
mark the file system as having an error we skipped all of the broken
transaction cleanup stuff, and thus triggered this ASSERT(). Fix this
by calling btrfs_handle_fs_error() in this case so we have the error set
on the file system.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
@server->origin_fullpath already contains the tree name + optional
prefix, so avoid calling __build_path_from_dentry_optional_prefix() as
it might end up duplicating the prefix path from @cifs_sb->prepath into
the final full path.
Instead, generate DFS full path by simply merging
@server->origin_fullpath with dentry's path.
This fixes the following case:
  mount.cifs //root/dfs/dir /mnt/ -o ...
  ls /mnt/link
where cifs_dfs_do_automount() will call smb3_parse_devname() with
@devname set to "//root/dfs/dir/link" instead of
"//root/dfs/dir/dir/link".
Fixes: 7ad54b98fc ("cifs: use origin fullpath for automounts")
Cc: <stable@vger.kernel.org> # 6.2+
Signed-off-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
This is a proposal to revert commit 914eedcb9b.
I found this when writing a simple UFFDIO_API test to be the first unit
test in this set. Two things break with the commit:
- The UFFDIO_API check was lost. According to the man page, the kernel
should reject ioctl(UFFDIO_API) if uffdio_api.api != 0xaa. This check is
needed in case the api version is extended in the future; without it, a
user app won't be able to identify whether it runs on a new kernel.
- The feature flag checks were removed, which means UFFDIO_API with a
feature that does not exist will also succeed. According to the man
page, we should (and it makes sense to) reject ioctl(UFFDIO_API) if
unknown features are passed in (see the sketch below).
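A sketch of the checks being restored in the UFFDIO_API handler (based
on the man page semantics described above; error unwinding is omitted):
    ret = -EINVAL;
    if (uffdio_api.api != UFFD_API)
        goto err_out;
    if (uffdio_api.features & ~UFFD_API_FEATURES)
        goto err_out;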
Link: https://lore.kernel.org/r/20220722201513.1624158-1-axelrasmussen@google.com
Link: https://lkml.kernel.org/r/20230412163922.327282-2-peterx@redhat.com
Fixes: 914eedcb9b ("userfaultfd: don't fail on unrecognized features")
Signed-off-by: Peter Xu <peterx@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport (IBM) <rppt@kernel.org>
Cc: Zach O'Keefe <zokeefe@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
KASAN report null-ptr-deref:
==================================================================
BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
Write of size 8 at addr 0000000000000000 by task sync/943
CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
Call Trace:
<TASK>
dump_stack_lvl+0x7f/0xc0
print_report+0x2ba/0x340
kasan_report+0xc4/0x120
kasan_check_range+0x1b7/0x2e0
__kasan_check_write+0x24/0x40
bdi_split_work_to_wbs+0x5c5/0x7b0
sync_inodes_sb+0x195/0x630
sync_inodes_one_sb+0x3a/0x50
iterate_supers+0x106/0x1b0
ksys_sync+0x98/0x160
[...]
==================================================================
The race that causes the above issue is as follows:
cpu1                                  cpu2
--------------------------------------|--------------------------------------
inode_switch_wbs
  INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
  queue_rcu_work(isw_wq, &isw->work)
  // queue_work async
inode_switch_wbs_work_fn
  wb_put_many(old_wb, nr_switched)
    percpu_ref_put_many
      ref->data->release(ref)
        cgwb_release
          queue_work(cgwb_release_wq, &wb->release_work)
          // queue_work async
&wb->release_work
  cgwb_release_workfn
                                      ksys_sync
                                        iterate_supers
                                          sync_inodes_one_sb
                                            sync_inodes_sb
                                              bdi_split_work_to_wbs
                                                kmalloc(sizeof(*work), GFP_ATOMIC)
                                                // alloc memory failed
    percpu_ref_exit
      ref->data = NULL
      kfree(data)
                                                wb_get(wb)
                                                  percpu_ref_get(&wb->refcnt)
                                                    percpu_ref_get_many(ref, 1)
                                                      atomic_long_add(nr, &ref->data->count)
                                                        atomic64_add(i, v)
                                                        // trigger null-ptr-deref
bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
wbs. If the allocation of new work fails, the on-stack fallback will be
used and the reference count of the current wb is increased afterwards.
If cgroup writeback membership switches occur before getting the reference
count and the current wb is released as old_wb, then calling wb_get() or
wb_put() will trigger the null pointer dereference above.
This issue was introduced in v4.3-rc7 (see fix tag1). Both
sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
bdi_split_work_to_wbs() can trigger this issue. For scenarios called via
sync_inodes_sb(), originally commit 7fc5854f8c ("writeback: synchronize
sync(2) against cgroup writeback membership switches") reduced the
possibility of the issue by adding wb_switch_rwsem, but in v5.14-rc1 (see
fix tag2) removed the "inode_io_list_del_locked(inode, old_wb)" from
inode_switch_wbs_work_fn() so that wb->state contains WB_has_dirty_io,
thus old_wb is not skipped when traversing wbs in bdi_split_work_to_wbs(),
and the issue becomes easily reproducible again.
To solve this problem, percpu_ref_exit() is called under RCU protection to
avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
and skip the current wb if wb_tryget() fails because the wb has already
been shutdown.
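A sketch of the bdi_split_work_to_wbs() side of the fix (only the
fallback path is shown and the list iteration details are simplified):
    list_for_each_entry_continue_rcu(wb, &bdi->wb_list, bdi_node) {
        /* ... try to allocate a dedicated work item ... */
        /* Fallback: skip wbs that are already being shut down. */
        if (!wb_tryget(wb))
            continue;
        /* ... use the on-stack work item with the pinned wb ... */
    }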
Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
Fixes: b817525a4a ("writeback: bdi_writeback iteration must not skip dying ones")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Hou Tao <houtao1@huawei.com>
Cc: yangerkun <yangerkun@huawei.com>
Cc: Zhang Yi <yi.zhang@huawei.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Switch EROFS_I_{VERSION,DATALAYOUT}_BITS into
EROFS_I_{VERSION,DATALAYOUT}_MASK.
Also avoid erofs_bitrange() since its functionality is simple enough.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230414083027.12307-2-hsiangkao@linux.alibaba.com
Given that the on-disk i_xattr_icount is 16 bits and xattr_isize is
calculated by multiplying i_xattr_icount by 4, xattr_isize has a
theoretical maximum of 256K (64K * 4).
Thus declare xattr_isize as unsigned int to avoid the potential overflow.
Fixes: bfb8674dc0 ("staging: erofs: add erofs in-memory stuffs")
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230414061810.6479-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Prior to big pclusters, non-compact compression indexes could have
empty headers.
Let's just avoid the legacy path since it can be handled properly
as a specific compression header with z_erofs_fill_inode_lazy() too.
Tested with existing erofs-utils versions.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230413092241.73829-1-hsiangkao@linux.alibaba.com
Let's enable the long xattr name prefix feature. Old kernels will just
ignore / skip such extended attributes. In addition, in case you
don't want to mount such images, add another incompatible feature as
an option for this.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407222808.19670-1-jefflexu@linux.alibaba.com
[ Gao Xiang: minor commit message fix. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Adapt the .{list,get}xattr routines to long xattr name prefixes.
When the bit 7 of erofs_xattr_entry.e_name_index is set, it indicates
that it refers to a long xattr name prefix.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230411093537.127286-1-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Long xattr name prefixes will be scanned upon mounting and the in-memory
long xattr name prefix array will be initialized accordingly.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-6-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Besides the predefined xattr name prefixes, introduce long xattr name
prefixes, which work similarly to the predefined name prefixes, except
that they are user specified.
This is especially useful for use cases together with overlayfs, like
the Composefs model, which introduces diverse xattr values with only a few
common xattr names (trusted.overlay.redirect, trusted.overlay.digest,
and maybe more in the future). That makes the existing predefined
prefixes ineffective in both image size and runtime performance.
When a user specified long xattr name prefix is used, only the trailing
part of the xattr name apart from the long xattr name prefix will be
stored in erofs_xattr_entry.e_name. e_name is empty if the xattr name
matches the long xattr name prefix exactly. All long xattr prefixes
are stored in the packed or meta inode, depending on whether the
fragments feature is enabled or not.
For each long xattr name prefix, the on-disk format is kept the same
as the unique metadata format: ALIGN({__le16 len, data}, 4), where len
represents the total size of struct erofs_xattr_long_prefix, followed
by data of struct erofs_xattr_long_prefix itself.
Each erofs_xattr_long_prefix keeps predefined prefixes (base_index)
and the remaining prefix string without the trailing '\0'.
Two fields are introduced to the on-disk superblock, where
xattr_prefix_count represents the total number of the long xattr name
prefixes recorded, and xattr_prefix_start represents the start offset of
recorded name prefixes in the packed/meta inode divided by 4.
When referring to a long xattr name prefix, the highest bit (bit 7) of
erofs_xattr_entry.e_name_index is set, while the lower bits (bit 0-6)
as a whole represents the index of the referred long name prefix among
all long xattr name prefixes.
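The on-disk bits roughly look like this (sketch following the
description above; see the actual erofs_fs.h for the authoritative
layout):
    struct erofs_xattr_long_prefix {
        __u8 base_index;    /* short xattr name prefix index */
        char infix[];       /* the rest, without trailing '\0' */
    };
    /* in the on-disk super block */
    __u8  xattr_prefix_count;    /* # of long xattr name prefixes */
    __le32 xattr_prefix_start;   /* start offset in the packed/meta
                                    inode, divided by 4 */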
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-5-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The packed inode could be used in more scenarios independent of
compression in the future.
For example, the packed inode could be used to keep extra long xattr
prefixes with the help of the following patches.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-4-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
So that erofs_read_metadata() can read metadata from other inodes
(e.g. packed inode) as well.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-2-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
As with commit 8f7acdae2c ("staging: erofs: kill all failure handling in
fill_super()"), move the initialization of the packed inode to after the
root inode is assigned, so that the iput() in .put_super() is adequate as
the failure handling.
Otherwise, iput() would also be needed in .kill_sb() in case the mounting
fails halfway.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Fixes: b15b2e307c ("erofs: support on-disk compressed fragments data")
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230407141710.113882-3-jefflexu@linux.alibaba.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
The ztailpacking feature has been merged for a year and has been mostly
stable by now.
Signed-off-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230227084457.3510-1-zbestahu@gmail.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
erofs_xattr_generic_get() won't be called from xattr handlers other than
the user/trusted/security xattr handlers, and thus there's no need for
extra checking.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230330082910.125374-4-jefflexu@linux.alibaba.com
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Move the xattrblock_addr() and xattrblock_offset() helpers into xattr.c,
as they are not used outside of xattr.c.
inlinexattr_header_size() has only one caller, so inline it into the
caller directly.
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230330082910.125374-2-jefflexu@linux.alibaba.com
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Acked-by: Chao Yu <chao@kernel.org>
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
- Get rid of all "vle" (variable-length extents) expressions
since they only expand overall name lengths unnecessarily;
- Rename COMPRESSION_LEGACY to COMPRESSED_FULL;
- Move on-disk directory definitions ahead of compression;
- Drop unused extended attribute definitions;
- Move inode ondisk union `i_u` out as `union erofs_inode_i_u`.
No actual logical change.
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230331063149.25611-1-hsiangkao@linux.alibaba.com
In order to support mounting a multi-blob container image as a single
block device, add a flattened block device feature for EROFS.
In this mode, all meta/data contents will be mapped into one block
space. Users could compose a block device (by nbd/ublk/virtio-blk/
vhost-user-blk) from multiple sources and mount the block device with
EROFS directly. It can reduce the number of block devices used, and
it also benefits both VM file passthrough and distributed storage
scenarios.
You can test this using the method mentioned in:
https://github.com/dragonflyoss/image-service/pull/1139
1. Compose an (nbd) block device from multiple blobs.
2. Mount EROFS on mntdir/.
3. Compare the md5sum between the source dir and mntdir/.
Later, we could also use it to refer to original tar blobs.
Signed-off-by: Jia Zhu <zhujia.zj@bytedance.com>
Signed-off-by: Xin Yin <yinxin.x@bytedance.com>
Reviewed-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Acked-by: Chao Yu <chao@kernel.org>
Tested-by: Jiang Liu <gerry@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230302071751.48425-1-zhujia.zj@bytedance.com
[ Gao Xiang: refine commit message and use erofs_pos(). ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Set the block size to that specified in the on-disk superblock.
Also remove the hard constraint of PAGE_SIZE block size for the
uncompressed device backend. This constraint is temporarily retained
for the compressed device and fscache backends, as there is more work
needed to handle the condition where the block size is not equal to
PAGE_SIZE.
It is worth noting that the on-disk block size is read prior to
erofs_superblock_csum_verify(), as the read block size is needed in the
latter.
Besides, later we are going to make erofs refer to tar data blobs (which
are 512-byte aligned) for OCI containers, where the block size is 512
bytes. In this case, the 512-byte block size may not be adequate for a
directory to contain enough dirents. To fix this, we are also going to
introduce a directory block size independent of the block size.
Since we have only just gained support for block sizes smaller than
PAGE_SIZE, disable all images with such a separate directory block size
until that feature is supported later.
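As a rough illustration of the idea (not the actual patch), the mount path
can size the superblock from the on-disk blkszbits field via the stock VFS
helper rather than assuming PAGE_SIZE:

/* Illustrative only: error handling and the fscache/compressed-device
 * constraints described above are omitted. */
static int erofs_set_blocksize_sketch(struct super_block *sb, u8 blkszbits)
{
	if (blkszbits > PAGE_SHIFT)	/* larger-than-page blocks unsupported */
		return -EINVAL;

	/* bdev-backed mount: let the VFS set both s_blocksize and
	 * s_blocksize_bits from the on-disk value. */
	if (!sb_set_blocksize(sb, 1 << blkszbits))
		return -EINVAL;
	return 0;
}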
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230313135309.75269-3-jefflexu@linux.alibaba.com
[ Gao Xiang: update documentation. ]
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
As the first step of converting the hardcoded blocksize to that specified
in the on-disk superblock, convert all call sites of the hardcoded
blocksize to sb->s_blocksize, except for:
1) use sbi->blkszbits instead of sb->s_blocksize in
erofs_superblock_csum_verify() since sb->s_blocksize has not been
updated with the on-disk blocksize yet when the function is called.
2) use inode->i_blkbits instead of sb->s_blocksize in erofs_bread(),
since the inode operated on may be an anonymous inode in fscache mode.
Currently the anonymous inode is allocated from an anonymous mount
maintained in erofs, while in the near future we may allocate anonymous
inodes from a generic API directly and thus have no access to the
anonymous inode's i_sb. Thus we keep the block size in i_blkbits for
anonymous inodes in fscache mode.
Note that this patch only gets rid of the hardcoded blocksize, in
preparation for actually setting the on-disk block size in the following
patch. The hard limit constraining the block size to PAGE_SIZE still
exists until the next patch.
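The shape of the conversion can be sketched with helpers along the following
lines (illustrative only; the incore erofs types are assumed), deriving
positions from sb->s_blocksize_bits instead of PAGE_SIZE-based constants:

/* Sketch: block/offset conversions driven by the VFS superblock fields
 * rather than a hardcoded EROFS_BLKSIZ == PAGE_SIZE. */
static inline erofs_off_t erofs_pos_sketch(struct super_block *sb, erofs_blk_t blk)
{
	return (erofs_off_t)blk << sb->s_blocksize_bits;
}

static inline erofs_blk_t erofs_blknr_sketch(struct super_block *sb, erofs_off_t pos)
{
	return pos >> sb->s_blocksize_bits;
}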
Signed-off-by: Jingbo Xu <jefflexu@linux.alibaba.com>
Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Reviewed-by: Yue Hu <huyue2@coolpad.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Link: https://lore.kernel.org/r/20230313135309.75269-2-jefflexu@linux.alibaba.com
[ Gao Xiang: fold a patch to fix incorrect truncated offsets. ]
Link: https://lore.kernel.org/r/20230413035734.15457-1-zhujia.zj@bytedance.com
Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Merge tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbd
Pull ksmbd server fix from Steve French:
"smb311 server preauth integrity negotiate context parsing fix (check
for out of bounds access)"
* tag '6.3-rc6-ksmbd-server-fix' of git://git.samba.org/ksmbd:
ksmbd: avoid out of bounds access in decode_preauth_ctxt()
smb311_decode_neg_context() doesn't properly check against SMB packet
boundaries prior to accessing individual negotiate context entries. This
is due to the length check omitting the eight byte smb2_neg_context
header, as well as incorrect decrementing of len_of_ctxts.
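A simplified sketch of the corrected walk (not the exact ksmbd code) shows
where the header size has to be accounted for:

/* Simplified sketch of a bounds-checked negotiate-context walk; 'ctxt'
 * points at the first context, 'len_of_ctxts' is the number of context
 * bytes actually present in the SMB packet. */
static void decode_neg_contexts_sketch(char *ctxt, int len_of_ctxts)
{
	while (len_of_ctxts >= (int)sizeof(struct smb2_neg_context)) {
		struct smb2_neg_context *pctx = (struct smb2_neg_context *)ctxt;
		int clen = le16_to_cpu(pctx->DataLength);
		/* the 8-byte header must be counted, not just the payload */
		int ctxt_len = sizeof(struct smb2_neg_context) + clen;
		int adv;

		if (ctxt_len > len_of_ctxts)
			break;		/* entry would run past the packet */

		/* ... handle pctx according to pctx->ContextType ... */

		adv = ALIGN(ctxt_len, 8);	/* contexts are 8-byte aligned */
		if (adv >= len_of_ctxts)
			break;			/* last context consumed */
		ctxt += adv;
		len_of_ctxts -= adv;
	}
}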
Fixes: 5100d8a3fe ("SMB311: Improve checking of negotiate security contexts")
Reported-by: Volker Lendecke <vl@samba.org>
Reviewed-by: Paulo Alcantara (SUSE) <pc@manguebit.com>
Signed-off-by: David Disseldorp <ddiss@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
These checks are also related to feature compatibility checks, so move
them into ext4_check_feature_compatibility(). No functional change.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230323140517.1070239-9-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The naming styles are inconsistent for some functions with 'check' in
their names. Some of them look like:
ext4_check_quota_consistency
ext4_check_test_dummy_encryption
ext4_check_opt_consistency
ext4_check_descriptors
ext4_check_feature_compatibility
While the others look like:
ext4_geometry_check
ext4_journal_data_mode_check
This is not a big deal and boils down to personal preference. But I'd
like to make them consistent.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230323140517.1070239-6-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The only difference here is that ->s_group_desc and ->s_flex_groups share
the same rcu read lock, but that is not necessary. In other places they
do not share the lock at all.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230323140517.1070239-4-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Factor out ext4_percpu_param_init() and ext4_percpu_param_destroy(). And
also use ext4_percpu_param_destroy() in ext4_put_super() to avoid
duplicated code. No functional change.
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Link: https://lore.kernel.org/r/20230323140517.1070239-3-yanaijie@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
After making ext4_writepages() properly clean all pages there is no need
for special treatment of filesystem freezing. Revert commit
e6c28a26b7.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-13-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Since filemap_write_and_wait() is now enough to get journalled data to
its final location, update the comment in mpage_prepare_extent_to_map().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-12-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that ext4_writepages() gets journalled data into its final location,
just use filemap_write_and_wait() instead of the special handling of
journalled data in ext4_bmap(). We can also drop the EXT4_STATE_JDATA
flag as it is not used anymore.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-11-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that ext4_writepages() makes sure all journalled data is committed
and checkpointed, sync_filesystem() call done by dquot_quota_on() is
enough for quota IO to see uptodate data. So drop special handling of
journalled data from ext4_quota_on().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-10-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that ext4_writepages() makes sure journalled data is on stable
storage, write_inode_now() call in iput_final() is enough to make
pagecache pages with journalled data really clean (data committed and
checkpointed). So we can drop special handling of journalled data in
ext4_evict_inode().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-9-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
The handling of journalled data in ext4_zero_range() is incomplete. We
do not need to commit the running transaction; rather, we need to
checkpoint pages with journalled data. If we don't, the journal tail can
be advanced beyond the transaction containing the journalled data, and if
we then crash before committing the transaction doing the zeroing, we
will have inconsistent (too old) data in the file. Make sure file pages
with journalled data are properly checkpointed before removing them from
the page cache.
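A rough sketch of the required ordering (illustrative, not the literal
patch): the journalled range must be written back, and thereby checkpointed
once ext4_writepages() takes care of that, before the page cache for the
range is dropped:

/* Illustrative ordering for data=journal zeroing; error handling trimmed. */
static int zero_range_flush_sketch(struct inode *inode, loff_t start, loff_t end)
{
	int ret;

	/* Write back (and, with this series, checkpoint) journalled pages
	 * covering the range so the on-disk copy is self-contained ... */
	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
	if (ret)
		return ret;

	/* ... only then is it safe to drop the cached pages. */
	truncate_pagecache_range(inode, start, end);
	return 0;
}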
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that filemap_write_and_wait() makes sure pages with journalled data
are safely on disk, ext4_collapse_range() and ext4_insert_range() do
not need special handling of journalled data.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Now that ext4_writepages() makes sure all pages with journalled data are
stable on disk, we don't need special handling of journalled data in
ext4_sync_file().
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When journalling data we currently just walk over pages, journal those
that are marked for delayed dirtying (only pinned pages dirtied behind
our back these days) and checkpoint other dirty pages. Because some
pages may be part of the running transaction, the result is that after
filemap_write_and_wait() we are not guaranteed pages are stable on disk.
Thus places that want to flush current pagecache content need to jump
through hoops to make sure journalled data is not lost. This is
manageable in cases completely controlled by ext4 (such as extent
shifting operations or inode eviction) but it gets ugly for stuff like
fsverity. Furthermore it is rather error prone as people often do not
realize journalled data needs special handling.
So change ext4_writepages() to commit transaction with inode's data
before going through the writeback loop in WB_SYNC_ALL mode. As a result
filemap_write_and_wait() is now really getting pages to stable storage
and makes pagecache pages safe to reclaim. Consequently we can remove
the special handling of journalled data from several places in follow up
patches.
Note that this will make fsync(2) for journalled data more expensive as
we will end up not only committing the transaction we need but also
checkpointing the data (which we may have previously skipped if the data
was part of the running transaction). If we really cared, we would need
to introduce special VFS function for writing out & invalidating page
cache for a range, use ->launder_page callback to perform checkpointing,
and use it from all the places that need this functionality. But at this
point I'm not convinced the complexity is worth it.
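In outline, the change amounts to something like the following in the
writepages path; this is a sketch under the assumptions named in the
comments, not the literal patch:

/* Sketch: for data=journal inodes in WB_SYNC_ALL writeback, commit the
 * transaction first so that all journalled buffers become checkpointable,
 * then let the regular writeback loop write (checkpoint) them. */
static int writepages_journalled_sketch(struct inode *inode,
					struct writeback_control *wbc)
{
	int ret = 0;

	if (ext4_should_journal_data(inode) && wbc->sync_mode == WB_SYNC_ALL)
		ret = ext4_force_commit(inode->i_sb);
	if (ret)
		return ret;

	/* ... continue with the normal page walk / mpage submission ... */
	return ret;
}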
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
With journalled data it can happen that the checkpointing code will write
out page contents without clearing the page dirty bit. The logic in
ext4_page_nomap_can_writeout() then results in us never calling
mpage_submit_page() and thus never clearing the dirty bit. Drop the
optimization with ext4_page_nomap_can_writeout() and just always call
mpage_submit_page(). ext4_bio_write_page() knows when to redirty the
page, and the additional clearing & setting of the page dirty bit for
ordered-mode writeout is not expensive enough to be worth jumping through
hoops to avoid.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently we clear page dirty bit when we checkpoint some buffers from a
page with journalled data or when we perform delayed dirtying of a page
in ext4_writepages(). In a quest to simplify handling of journalled data
we want to keep page dirty as long as it has either buffers to
checkpoint or journalled dirty data. So make sure to keep page dirty in
ext4_writepages() if it still has journalled data attached to it.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Currently pages with journalled data written by write(2) or modified by
block zeroing during truncate(2) are not marked as dirty. They are
dirtied only once the transaction commits. This however makes the
writeback code think the inode has no pages to write, and so
ext4_writepages() is not called to make pages with journalled data
persistent. Mark pages with journalled data dirty (similarly to what
happens for writes through mmap) so that the writeback code knows about
them and ext4_writepages() can do what it needs to do to the inode.
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
When invalidating buffers under the partial tail page,
jbd2_journal_invalidate_folio() returns -EBUSY if the buffer is part of
the committing transaction as we cannot safely modify buffer state.
However if the buffer is already invalidated (due to previous
invalidation attempts from ext4_wait_for_tail_page_commit()), there's
nothing to do and there's no point in returning -EBUSY. This fixes
occasional warnings from ext4_journalled_invalidate_folio() triggered by
generic/051 fstest when blocksize < pagesize.
Fixes: 53e872681f ("ext4: fix deadlock in journal_unmap_buffer()")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20230329154950.19720-1-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
PRINT_QUOTA_WARNING has been deprecated since commit 8e8934695d ("quota:
send messages via netlink"), merged in 2007. Users should rather be using
notification over the netlink socket if they are interested in explicit
notification in addition to the plain EDQUOT error. Since printing to the
console from deep inside filesystem code is problematic, mark the
feature as BROKEN now and see who complains.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230413163833.43913-1-frank.li@vivo.com>
Wire up the iopoll method to the common implementation, as f2fs uses the
common DIO infrastructure since commit a1e09b03e6 ("f2fs: use iomap for
direct I/O").
Signed-off-by: Wu Bo <bo.wu@vivo.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In __replace_atomic_write_block(), we missed checking the return value
of inc_valid_block_count(). For an extreme test case where the f2fs image
runs out of space, this may cause an inconsistent state between the SIT
table and the total valid block count.
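The shape of the fix is simply to propagate the failure instead of ignoring
it; a sketch, assuming inc_valid_block_count() keeps its (sbi, inode, &count)
form with -ENOSPC semantics:

/* Sketch only: the inc_valid_block_count() signature and the surrounding
 * context are assumptions, not the actual __replace_atomic_write_block(). */
static int reserve_one_block_sketch(struct f2fs_sb_info *sbi, struct inode *inode)
{
	blkcnt_t count = 1;
	int err = inc_valid_block_count(sbi, inode, &count);

	/* On failure (e.g. -ENOSPC), bail out instead of updating the SIT
	 * entry so the SIT table and the total valid block count stay
	 * consistent. */
	return err;
}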
Cc: Daeho Jeong <daehojeong@google.com>
Fixes: 3db1de0e58 ("f2fs: change the current atomic write way")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Otherwise, if truncation on cow_inode fails, the remaining data may
pollute the current transaction of the atomic write.
Cc: Daeho Jeong <daehojeong@google.com>
Fixes: a46bebd502 ("f2fs: synchronize atomic write aborts")
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
We should not pass a zone-relative address to
__f2fs_issue_discard_zone().
Signed-off-by: Daeho Jeong <daehojeong@google.com>
Reviewed-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Last week, I was fiddling around with the metadump name obfuscation code
while writing a debugger command to generate directories full of names
that all have the same hash name. I had a few questions about how well
all that worked with ascii-ci mode, and discovered a nasty discrepancy
between the kernel and glibc's implementations of the tolower()
function.
I discovered that I could create a directory that is large enough to
require separate leaf index blocks. The hashes stored in the dabtree
use the ascii-ci specific hash function, which uses a library function
to convert the name to lowercase before hashing. If the kernel and C
library's versions of tolower do not behave exactly identically,
xfs_ascii_ci_hashname will not produce the same results for the same
inputs. xfs_repair will deem the leaf information corrupt and rebuild
the directory. After that, lookups in the kernel will fail because the
hash index doesn't work.
The kernel's tolower function will convert extended ascii uppercase
letters (e.g. A-with-umlaut) to extended ascii lowercase letters (e.g.
a-with-umlaut), whereas glibc's will only do that if you force LANG to
ascii. Tiny embedded libc implementations just plain won't do it at
all, and the result is a mess. Stabilize the behavior of the hash
function by encoding the name transformation function in libxfs, add it
to the selftest, and fix all the userspace tools, none of which handle
this transformation correctly.
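A hypothetical transform with the kernel's latin1 behaviour illustrates what
"encoding the name transformation function in libxfs" means; the helper name
here is made up, the series uses its own:

/* Hypothetical stable case transform for ascii-ci hashing: fold A-Z and
 * the latin1 uppercase range 0xC0-0xDE (except 0xD7, the multiplication
 * sign) to lowercase, independent of the C library's locale handling. */
static inline unsigned char ascii_ci_xfrm_sketch(unsigned char c)
{
	if (c >= 'A' && c <= 'Z')
		return c + ('a' - 'A');
	if (c >= 0xC0 && c <= 0xDE && c != 0xD7)
		return c + 0x20;
	return c;
}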
The v1 series generated a /lot/ of discussion, in which several things
became very clear: (1) Linus is not enamored of case folding of any
kind; (2) Dave and Christoph don't seem to agree on whether the feature
is supposed to work for 7-bit ascii or latin1; (3) it trashes UTF8
encoded names if those happen to show up; and (4) I don't want to
maintain this mess any longer than I have to. Kill it in 2030.
v2: rename the functions to make it clear we're moving away from the
letters t, o, l, o, w, e, and r; and deprecate the whole feature once
we've fixed the bugs and added tests.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'fix-asciici-bugs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix ascii-ci problems, then kill it [v2]
Signed-off-by: Dave Chinner <david@fromorbit.com>
This series strengthens space allocation record cross referencing by
using AG block bitmaps to compute the difference between space used
according to the rmap records and the primary metadata, and reports
cross-referencing errors for any discrepancies.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-strengthen-rmap-checking-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: strengthen rmapbt scrubbing [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
In this series, we make some changes to the incore bitmap code: First,
we shorten the prefix to 'xbitmap'. Then, we rework some utility
functions for later use by online repair and clarify how the walk
functions are supposed to be used.
Finally, we use all these new pieces to convert the incore bitmap to use
an interval tree instead of linked lists. This lifts the limitation
that callers had to be careful not to set a range that was already set;
and gets us ready for the btree rebuilder functions needing to be able
to set bits in a bitmap and generate maximal contiguous extents for the
set ranges.
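A sketch of the idea using the kernel's generic interval tree (not the xfs
code) shows why setting an already-set range becomes harmless: overlapping
nodes are simply absorbed into the new range:

#include <linux/interval_tree.h>
#include <linux/slab.h>

/* Sketch: setting [start, last] removes every overlapping node and inserts
 * one node covering the union, so callers need not avoid double-setting. */
static int bitmap_set_sketch(struct rb_root_cached *root,
			     unsigned long start, unsigned long last)
{
	unsigned long ustart = start, ulast = last;
	struct interval_tree_node *node, *next;

	for (node = interval_tree_iter_first(root, start, last); node; node = next) {
		next = interval_tree_iter_next(node, start, last);
		if (node->start < ustart)
			ustart = node->start;
		if (node->last > ulast)
			ulast = node->last;
		interval_tree_remove(node, root);
		kfree(node);
	}

	node = kzalloc(sizeof(*node), GFP_KERNEL);
	if (!node)
		return -ENOMEM;
	node->start = ustart;
	node->last = ulast;
	interval_tree_insert(node, root);
	return 0;
}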
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'repair-bitmap-rework-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: rework online fsck incore bitmap [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Currently, the extended attribute scrubber uses a single VLA to store
all the context information needed in various parts of the scrubber
code. This includes xattr leaf block space usage bitmaps, and the value
buffer used to check the correctness of remote xattr value block
headers. We try to minimize the insanity through the use of helper
functions, but this is a memory management nightmare. Clean this up by
making the bitmap and value pointers explicit members of struct
xchk_xattr_buf.
Second, strengthen the xattr checking by teaching it to look for overlapping
data structures in the shortform attr data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-fix-xattr-memory-mgmt-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: clean up memory management in xattr scrub [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
While I was doing differential fuzz analysis between xfs_scrub and
xfs_repair, I noticed that xfs_repair was only partially effective at
detecting btree records that can be merged, and xfs_scrub totally didn't
notice at all.
For every interval btree type except for the bmbt, there should never
exist two adjacent records with adjacent keyspaces because the
blockcount field is always large enough to span the entire keyspace of
the domain. This is because the free space, rmap, and refcount btrees
have a blockcount field large enough to store the maximum AG length, and
there can never be an allocation larger than an AG.
The bmbt is a different story due to its ondisk encoding where the
blockcount is only 21 bits wide. Because AGs can span up to 2^31 blocks
and the RT volume can span up to 2^52 blocks, a preallocation of 2^22
blocks will be expressed as two records of 2^21 length. We don't
opportunistically combine records when doing bmbt operations, which is
why the fsck tools have never complained about this scenario.
Offline repair is partially effective at detecting mergeable records
because I taught it to do that for the rmap and refcount btrees. This
series enhances the free space, rmap, and refcount scrubbers to detect
mergeable records. For the bmbt, it will flag the file as being
eligible for an optimization to shrink the size of the data structure.
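For the non-bmbt btrees, "mergeable" reduces to a simple adjacency test on
the incore records; a sketch for the by-block free space case:

/* Sketch: two by-block free-space records are mergeable if they describe
 * physically adjacent extents; a correctly maintained bnobt should never
 * contain such a pair, so scrub can flag it. */
static bool alloc_recs_mergeable_sketch(const struct xfs_alloc_rec_incore *r1,
					const struct xfs_alloc_rec_incore *r2)
{
	return r1->ar_startblock + r1->ar_blockcount == r2->ar_startblock;
}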
The last patch in this set also enhances the rmap scrubber to detect
records that overlap incorrectly. This check is done automatically for
non-overlapping btree types, but we have to do it separately for the
rmapbt because there are constraints on which allocation types are
allowed to overlap.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-detect-mergeable-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect mergeable and overlapping btree records [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
I started looking into performance problems with the data fork scrubber
in generic/333, and noticed a few things that needed improving. First,
due to design reasons, it's possible for file fork btrees to contain
multiple contiguous mappings to the same physical space. Instead of
checking each ondisk mapping individually, it's much faster to combine
them when possible and check the combined mapping because that's fewer
trips through the rmap btree, and we can drop this check-around
behavior that it does when an rmapbt lookup produces a record that
starts before or ends after a particular bmbt mapping.
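Contiguity of two incore bmbt mappings is cheap to test, so the scrubber can
accumulate a combined mapping before the rmap lookup; roughly, as a sketch:

/* Sketch: cur extends prev if it continues both the file offset range and
 * the physical block range with the same written/unwritten state. */
static bool bmap_ext_contiguous_sketch(const struct xfs_bmbt_irec *prev,
				       const struct xfs_bmbt_irec *cur)
{
	return prev->br_startoff + prev->br_blockcount == cur->br_startoff &&
	       prev->br_startblock + prev->br_blockcount == cur->br_startblock &&
	       prev->br_state == cur->br_state;
}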
Second, I noticed that the bmbt scrubber decides to walk every reverse
mapping in the filesystem if the file fork is in btree format. This is
very costly, and only necessary if the inode repair code had to zap a
fork to convince iget to work. Constraining the full-rmap scan to this
one case means we can skip it for normal files, which drives the runtime
of this test from 8 hours down to 45 minutes (observed with realtime
reflink and rebuild-all mode.)
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-merge-bmap-records-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: merge bmap records for faster scrubs [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
This patchset fixes a handful of problems relating to how we get and
release incore inodes in the online scrub code. The first patch fixes
how we handle DONTCACHE -- our reasons for setting (or clearing it)
depend entirely on the runtime environment at irele time. Hence we can
refactor iget and irele to use our own wrappers that set that context
appropriately.
The second patch fixes a race between the iget call in the inode core
scrubber and other writer threads that are allocating or freeing inodes
in the same AG by changing the behavior of xchk_iget (and the inode core
scrub setup function) to return either an incore inode or the AGI buffer
so that we can be sure that the inode cannot disappear on us.
The final patch elides MMAPLOCK from scrub paths when possible. It did
not fit anywhere else.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix iget/irele usage in online fsck [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Jan Kara pointed out that the VFS doesn't take i_rwsem of a child
subdirectory that is being moved from one parent to another. Upon
deeper analysis, I realized that this was the source of a very hard to
trigger false corruption report in the parent pointer checking code.
Now that we've refactored how directory walks work in scrub, we can also
get rid of all the unnecessary and broken locking to make parent pointer
scrubbing work properly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-parent-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix bugs in parent pointer checking [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
In this series, we fix some problems with how the directory scrubber
grabs child inodes. First, we want to reduce EDEADLOCK returns by
replacing fixed-iteration loops with interruptible trylock loops.
Second, we add UNTRUSTED to the child iget call so that we can detect a
dirent that points to an unallocated inode. Third, we fix a bug where
we weren't checking the inode pointed to by dotdot entries at all.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-dir-iget-fixes-6.4_2023-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix iget usage in directory scrub [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Following in the theme of the last two patchsets, this one strengthens
the rmap btree record checking so that scrub can count the number of
space records that map to a given owner and that do not map to a given
owner. This enables us to determine exclusive ownership of space that
can't be shared.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-detect-rmapbt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect incorrect gaps in rmap btree [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
This series continues the corrections for a couple of problems I found
in the inode btree scrubber. The first problem is that we don't
directly check the inobt records have a direct correspondence with the
finobt records, and vice versa. The second problem occurs on
filesystems with sparse inode chunks -- the cross-referencing we do
detects sparseness, but it doesn't actually check the consistency
between the inobt hole records and the rmap data.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-detect-inobt-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect incorrect gaps in inode btree [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
The next few patchsets address a deficiency in scrub that I found while
QAing the refcount btree scrubber. If there's a gap between refcount
records, we need to cross-reference that gap with the reverse mappings
to ensure that there are no overlapping records in the rmap btree. If
we find any, then the refcount btree is not consistent. This is not a
property that is specific to the refcount btree; they all need to have
this sort of keyspace scanning logic to detect inconsistencies.
To do this accurately, we need to be able to scan the keyspace of a
btree (which we already do) to be able to tell the caller if the
keyspace is empty, sparse, or fully covered by records. The first few
patches add the keyspace scanner to the generic btree code, along with
the ability to mask off parts of btree keys because when we scan the
rmapbt, we only care about space usage, not the owners.
The final patch closes the scanning gap in the refcountbt scanner.
v23.1: create helpers for the key extraction and comparison functions,
improve documentation, and eliminate the ->mask_key indirect
calls
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-detect-refcount-gaps-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: detect incorrect gaps in refcount btree [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
This series fixes the scrub btree block checker to ensure that the keys
in the parent block accurately represent the block, and check the
ordering of all interior key records.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-btree-key-enhancements-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: enhance btree key scrubbing [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
This series fixes numerous flag handling bugs in the rmapbt key code.
The most serious transgression is that key comparisons completely strip
out all flag bits from rm_offset, including the ones that participate in
record lookups. The second problem is that for years we've been letting
the unwritten flag (which is an attribute of a specific record and not
part of the record key) escape from leaf records into key records.
The solution to the second problem is to filter attribute flags when
creating keys from records, and the solution to the first problem is to
preserve *only* the flags used for key lookups. The ATTR and BMBT flags
are a part of the lookup key, and the UNWRITTEN flag is a record
attribute.
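In other words, when a leaf record is turned into a key, only the
lookup-relevant flag bits of rm_offset may survive; a sketch of the masking,
assuming the usual on-disk rm_offset flag bits:

/* Sketch: keep the ATTR/BMBT fork bits (part of the lookup key) and strip
 * the UNWRITTEN bit (a per-record attribute) when forming a key. */
static uint64_t rmap_key_offset_sketch(uint64_t rm_offset)
{
	return rm_offset & ~XFS_RMAP_OFF_UNWRITTEN;
}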
This has worked for years without generating user complaints because
ATTR and BMBT extents cannot be shared, so key comparisons succeed
solely on rm_startblock. Only file data fork extents can be shared, and
those records never set any of the three flag bits, so comparisons that
dig into rm_owner and rm_offset work just fine.
A filesystem written with an unpatched kernel and mounted on a patched
kernel will work correctly because the ATTR/BMBT flags have been
conveyed into keys correctly all along, and we still ignore the
UNWRITTEN flag in any key record. This was what doomed my previous
attempt to correct this problem in 2019.
A filesystem written with a patched kernel and mounted on an unpatched
kernel will also work correctly because unpatched kernels ignore all
flags.
With this patchset applied, the scrub code gains the ability to detect
rmap btrees with incorrectly set attr and bmbt flags in the key records.
After three years of testing, I haven't encountered any problems.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'rmap-btree-fix-key-handling-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: fix rmap btree key flag handling [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
There are a few things about btree records that scrub checked but the
libxfs _get_rec functions didn't. Move these bits into libxfs so that
everyone can benefit.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'btree-hoist-scrub-checks-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: hoist scrub record checks into libxfs [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
While I was cleaning things up for 6.1, I noticed that the btree
_query_range and _query_all functions don't perform the same checking
that the _get_rec functions perform. In fact, they don't perform /any/
sanity checking, which means that callers aren't warned about impossible
records.
Therefore, hoist the record validation and complaint logging code into
separate functions, and call them from any place where we convert an
ondisk record into an incore record. For online scrub, we can replace
checking code with a call to the record checking functions in libxfs,
thereby reducing the size of the codebase.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'btree-complain-bad-records-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: standardize btree record checking code [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
The design doc for XFS online fsck contains a long discussion of the
eventual consistency models in use for XFS metadata. In that chapter,
we note that it is possible for scrub to collide with a chain of
deferred space metadata updates, and proposes a lightweight solution:
The use of a pending-intents counter so that scrub can wait for the
system to drain all chains.
This patchset implements that scrub drain. The first patch implements
the basic mechanism, and the subsequent patches reduce the runtime
overhead by converting the implementation to use sloppy counters and
introducing jump labels to avoid walking into scrub hooks when it isn't
running. This last paradigm repeats elsewhere in this megaseries.
v23.1: make intent items take an active ref to the perag structure and
document why we bump and drop the intent counts when we do
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-drain-intents-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: drain deferred work items when scrubbing [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Fix various attribution problems in the xfs_scrub source code, such as
the author's contact information, out of date SPDX tags, and a rough
estimate of when the feature was under heavy development. The most
egregious parts are the files that are missing license information
completely.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Merge tag 'scrub-fix-legalese-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs_scrub: fix licensing and copyright notices [v24.5]
Signed-off-by: Dave Chinner <david@fromorbit.com>
Avoid the cost of perag radix tree lookups by passing around active perag
references when possible.
v24.2: rework some of the naming and whatnot so there's less opencoding
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnQAKCRBKO3ySh0YR
po/YAPsEFygm4/bQZBtOf0HFmVDtTXYAEujJeiXKbmEqzlMxpQEAhuCqFaTQ+Pnr
zpg1egeIcaw6dNTW4f2slcATaQgG0gM=
=8HsC
-----END PGP SIGNATURE-----
Merge tag 'pass-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: pass perag references around when possible [v24.5]
Avoid the cost of perag radix tree lookups by passing around active perag
references when possible.
v24.2: rework some of the naming and whatnot so there's less opencoding
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Now that we've cleaned up some code warts in the deferred work item
processing code, let's make intent items take an active perag reference
from their creation until they are finally freed by the defer ops
machinery. This change facilitates the scrub drain in the next patchset
and will make it easier for the future AG removal code to detect a busy
AG in need of quiescing.
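A hedged sketch of the lifetime rule described above (the item type and helpers are illustrative; the series uses dedicated "active" reference helpers rather than the plain xfs_perag_get()/xfs_perag_put() pair shown here): the reference is taken when the intent item is created and only dropped when the defer ops machinery finally frees the item, so an AG with outstanding intents is always visibly busy:

struct my_intent_item {
	struct xfs_perag	*mi_pag;	/* reference held for the item's lifetime */
	/* ... payload describing the deferred work ... */
};

static struct my_intent_item *my_intent_create(struct xfs_mount *mp,
					       xfs_agnumber_t agno)
{
	struct my_intent_item	*item;

	item = kzalloc(sizeof(*item), GFP_KERNEL);
	if (!item)
		return NULL;

	/* Pin the AG: the scrub drain and future AG removal can see us. */
	item->mi_pag = xfs_perag_get(mp, agno);
	return item;
}

static void my_intent_free(struct my_intent_item *item)
{
	/* Dropped only once the defer ops machinery is done with the item. */
	xfs_perag_put(item->mi_pag);
	kfree(item);
}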
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZDYdnAAKCRBKO3ySh0YR
poBzAP9+tx/LNTZeLtmjj/d7tVLMm2/f8LPyhDmkF85JWnjknwEAnLQxkqRMfF9i
ah3ACAZ30o+Mp7Qe6tnYVIdOSD2xCAM=
=mRAy
-----END PGP SIGNATURE-----
Merge tag 'intents-perag-refs-6.4_2023-04-11' of git://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into guilt/xfs-for-next
xfs: make intent items take a perag reference [v24.5]
Now that we've cleaned up some code warts in the deferred work item
processing code, let's make intent items take an active perag reference
from their creation until they are finally freed by the defer ops
machinery. This change facilitates the scrub drain in the next patchset
and will make it easier for the future AG removal code to detect a busy
AG in need of quiescing.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Dave Chinner <david@fromorbit.com>
Since commit 8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.
So remove it in the files in this commit, none of which can be built as
modules.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Gabriel Krisman Bertazi <krisman@suse.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Since commit 8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.
So remove it in the files in this commit, none of which can be built as
modules.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Anna Schumaker <anna@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: linux-nfs@vger.kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Since commit 8b41fc4454 ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.
So remove it in the files in this commit, none of which can be built as
modules.
Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
'obj-$(CONFIG_SYSCTL) += sysctls.o' must be moved after "obj-y :=",
or it won't be built because it is overwritten.
Note that nothing is going to break by linking sysctl.o later; we were
just being way too cautious. Patches had been updated (and sent to
stable as well) under the assumption that the whole "base" stuff needed
to be linked prior to the child sysctl tables that use that directory.
All of the kernel sysctl APIs always share the same directory, and races
against using it should end up re-using the same single created
directory.
So something we can eventually do is do away with all the base stuff.
For now it's fine and not creating an issue; it's just a bit pedantic
and careful.
Fixes: ab171b952c ("fs: move namespace sysctls and declare fs base directory")
Cc: stable@vger.kernel.org # v5.17
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
[mcgrof: enhanced commit log for stable criteria and clarify base stuff ]
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Confirm that the accessed pneg_ctxt->HashAlgorithms address sits within
the SMB request boundary; deassemble_neg_contexts() only checks that the
eight byte smb2_neg_context header + (client controlled) DataLength are
within the packet boundary, which is insufficient.
Checking for sizeof(struct smb2_preauth_neg_context) is overkill given
that the type currently assumes SMB311_SALT_SIZE bytes of trailing Salt.
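A hedged sketch of the kind of bounds check described above (the helper name and exact placement are illustrative, not the patch itself):

/* Illustrative: make sure the HashAlgorithms field actually lies inside
 * the received request before dereferencing it. */
static bool preauth_ctxt_in_bounds(struct smb2_preauth_neg_context *pneg_ctxt,
				   const char *req_end)
{
	return (const char *)pneg_ctxt +
	       offsetofend(struct smb2_preauth_neg_context, HashAlgorithms) <=
	       req_end;
}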
Signed-off-by: David Disseldorp <ddiss@suse.de>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
There is no need to declare an extra table just to create a directory;
this can easily be done with a prefix path with register_sysctl().
Simplify this registration.
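For illustration only (the "fs/foo" path, the foo_bar knob, and the table are made up), the pattern looks roughly like this: no parent table whose sole job is to create the directory, just a path prefix handed to register_sysctl():

static int foo_bar;

static struct ctl_table foo_sysctls[] = {
	{
		.procname	= "bar",
		.data		= &foo_bar,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }
};

static struct ctl_table_header *foo_sysctl_header;

static int __init foo_sysctl_init(void)
{
	/* The "fs/foo" directory is created on demand; no extra table needed. */
	foo_sysctl_header = register_sysctl("fs/foo", foo_sysctls);
	return foo_sysctl_header ? 0 : -ENOMEM;
}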
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
There is no need to declare an extra table just to create a directory;
this can easily be done with a prefix path with register_sysctl().
Simplify this registration.
Acked-by: Jan Harkes <jaharkes@cs.cmu.edu>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
There is no need to declare an extra table just to create a directory;
this can easily be done with a prefix path with register_sysctl().
Simplify this registration.
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
There is no need to declare two tables just to create directories;
this can easily be done with a prefix path with register_sysctl().
Simplify this registration.
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
There is no need to declare two tables just to create directories;
this can easily be done with a prefix path with register_sysctl().
Simplify this registration.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
There is no need to declare two tables just to create directories;
this can easily be done with a prefix path with register_sysctl().
Simplify this registration.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
There is no need to declare two tables just to create directories;
this can easily be done with a prefix path with register_sysctl().
Simplify this registration.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Expand documentation to clarify:
o that paths don't need to exist for the new API callers
o that we *require* callers to keep the memory of the table around
during the lifetime of the sysctls (see the sketch below)
o which routines we are trying to deprecate and later remove
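A hedged sketch of the second bullet's lifetime rule (my_enabled, my_table and the "fs/my_feature" path are invented): register_sysctl() keeps pointing at the caller's table, so the table must not live on the stack or be freed before unregister_sysctl_table() runs, and the registered path does not need to exist beforehand:

static int my_enabled;

/* Must stay valid for as long as the sysctls are registered, hence
 * file scope rather than a local variable. */
static struct ctl_table my_table[] = {
	{
		.procname	= "enabled",
		.data		= &my_enabled,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	{ }
};

static struct ctl_table_header *my_header;

static int __init my_init(void)
{
	my_header = register_sysctl("fs/my_feature", my_table);
	return my_header ? 0 : -ENOMEM;
}

static void __exit my_exit(void)
{
	/* Only after this may the table memory be reused or freed. */
	unregister_sysctl_table(my_header);
}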
Cc: stable@vger.kernel.org # v5.17
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
The relatively new docs which I added hinted that the base directories
needed to be created beforehand; that is wrong, so remove the incorrect
comment. Eric has hinted at this twice already [0] [1], I had just not
verified it until now. Now that I've verified it, update the docs to
relax the requirements described.
[0] https://lkml.kernel.org/r/875ys0azt8.fsf@email.froward.int.ebiederm.org
[1] https://lkml.kernel.org/r/87ftbiud6s.fsf@x220.int.ebiederm.org
Cc: stable@vger.kernel.org # v5.17
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Move the code which creates the subdirectories for a ctl table
into a helper routine to make it easier to review. Document
the goal.
This introduces no functional changes.
Reviewed-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Update the docs for __register_sysctl_table() to make it clear that no
child entries can be passed. When child is set, these are non-leaf entries
in the ctl table and sysctl treats them as directories. The point of
__register_sysctl_table() is to deal only with directories not part of
the ctl table where they may reside, to stay simple and avoid recursion.
While at it, hint towards using long for extra1 and extra2 later.
Cc: stable@vger.kernel.org # v5.17
Cc: Christian Brauner <brauner@kernel.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
f2fs has supported quota since commit 0abd675e97 ("f2fs: support plain
user/group quota"); let's document it.
Signed-off-by: Yangtao Li <frank.li@vivo.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Message-Id: <20230413151412.30059-1-frank.li@vivo.com>
F2FS has the same issue as the one fixed in ext4_rename, causing a crash
revealed by xfstests/generic/707.
See also commit 0813299c58 ("ext4: Fix possible corruption when moving a directory")
CC: stable@vger.kernel.org
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
To prevent an excessive increase in the preemption count,
add radix_tree_preload_end() in the retry path.
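A hedged sketch of the pattern being fixed (the tree, lock, and reason for retrying are illustrative): radix_tree_preload() returns with preemption disabled on success, so every path that loops back must call radix_tree_preload_end() first, or the preemption count keeps growing:

#include <linux/radix-tree.h>
#include <linux/spinlock.h>

static int insert_with_retry(struct radix_tree_root *root, spinlock_t *lock,
			     unsigned long index, void *entry)
{
	int	err;

retry:
	if (radix_tree_preload(GFP_NOFS))
		return -ENOMEM;

	spin_lock(lock);
	err = radix_tree_insert(root, index, entry);
	spin_unlock(lock);

	/* Balance the preload before any decision to retry. */
	radix_tree_preload_end();

	if (err == -EEXIST) {
		/* Illustrative: drop the stale entry here, then retry. */
		goto retry;
	}
	return err;
}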
Signed-off-by: Yohan Joung <yohan.joung@sk.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
With the -O quota mkfs option, xfstests generic/417 fails because fsck
detects data corruption on quota inodes.
[ASSERT] (fsck_chk_quota_files:2051) --> Quota file is missing or invalid quota file content found.
The root cause is that there is a window in which f2fs doesn't hold the
quota inodes, so all recovered quota data will be dropped because the
SBI_POR_DOING flag is set.
- f2fs_fill_super
- f2fs_recover_orphan_inodes
- f2fs_enable_quota_files
- f2fs_quota_off_umount
<--- quota inodes were dropped --->
- f2fs_recover_fsync_data
- f2fs_enable_quota_files
- f2fs_quota_off_umount
This patch tries to eliminate the window by holding the quota inodes
during the entire recovery flow, as below:
- f2fs_fill_super
- f2fs_recover_quota_begin
- f2fs_recover_orphan_inodes
- f2fs_recover_fsync_data
- f2fs_recover_quota_end
Then, recovered quota data can be persisted after SBI_POR_DOING
is cleared.
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
In the case below, a multi-device image can be mounted with the rw option
even though one of the secondary devices is set read-only, and a later
update will then cause a panic. Let's introduce f2fs_dev_is_readonly()
and use it in f2fs_remount() to check the rw status of all devices in
order to avoid such an inconsistent mount state.
mkfs.f2fs -c /dev/zram1 /dev/zram0 -f
blockdev --setro /dev/zram1
mount -t f2fs /dev/zram0 /mnt/f2fs
mount: /mnt/f2fs: WARNING: source write-protected, mounted read-only.
mount -t f2fs -o remount,rw /mnt/f2fs
dd if=/dev/zero of=/mnt/f2fs/file bs=1M count=8192
kernel BUG at fs/f2fs/inline.c:258!
RIP: 0010:f2fs_write_inline_data+0x23e/0x2d0 [f2fs]
Call Trace:
f2fs_write_single_data_page+0x26b/0x9f0 [f2fs]
f2fs_write_cache_pages+0x389/0xa60 [f2fs]
__f2fs_write_data_pages+0x26b/0x2d0 [f2fs]
f2fs_write_data_pages+0x2e/0x40 [f2fs]
do_writepages+0xd3/0x1b0
__writeback_single_inode+0x5b/0x420
writeback_sb_inodes+0x236/0x5a0
__writeback_inodes_wb+0x56/0xf0
wb_writeback+0x2a3/0x490
wb_do_writeback+0x2b2/0x330
wb_workfn+0x6a/0x260
process_one_work+0x270/0x5e0
worker_thread+0x52/0x3e0
kthread+0xf4/0x120
ret_from_fork+0x29/0x50
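A hedged sketch of the direction described above (the helper body approximates f2fs conventions such as FDEV() and s_ndevs, and is not the literal patch): treat the filesystem as read-only if any backing device is write-protected:

/* Illustrative: any write-protected device makes the whole fs read-only. */
static bool f2fs_dev_is_readonly(struct f2fs_sb_info *sbi)
{
	int i;

	if (bdev_read_only(sbi->sb->s_bdev))
		return true;

	for (i = 0; i < sbi->s_ndevs; i++)
		if (bdev_read_only(FDEV(i).bdev))
			return true;

	return false;
}

f2fs_remount() would then refuse to clear SB_RDONLY while this helper returns true, returning -EROFS instead of silently allowing a read-write remount.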
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
The i_gc_rwsem[WRITE] and i_gc_rwsem[READ] lock order is reversed
between gc_data_segment() and f2fs_dio_write_iter(); fix it to keep
a consistent lock order, as below:
1. lock i_gc_rwsem[WRITE]
2. lock i_gc_rwsem[READ]
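As a hedged illustration of that ordering (the wrapper functions are invented; the rwsem fields are the ones named above), both the GC path and the DIO write path must take the semaphores like this to avoid an ABBA deadlock:

static void lock_gc_rwsems(struct f2fs_inode_info *fi)
{
	f2fs_down_write(&fi->i_gc_rwsem[WRITE]);	/* first */
	f2fs_down_write(&fi->i_gc_rwsem[READ]);		/* second */
}

static void unlock_gc_rwsems(struct f2fs_inode_info *fi)
{
	f2fs_up_write(&fi->i_gc_rwsem[READ]);
	f2fs_up_write(&fi->i_gc_rwsem[WRITE]);
}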
Signed-off-by: Chao Yu <chao@kernel.org>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
Fix netfs_extract_iter_to_sg() for ITER_UBUF and ITER_IOVEC to set the
size of the page to the part of the page extracted, not the remaining
amount of data in the extracted page array at that point.
This doesn't yet affect anything as cifs, the only current user, only
passes in non-user-backed iterators.
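A hedged fragment showing the shape of the fix (variable names are illustrative, not the function's actual locals): each scatterlist element must carry only the length of the piece taken from the current page, not the total amount still left in the extracted page array:

	while (remaining > 0 && sg_count < sg_max) {
		size_t len = min_t(size_t, remaining, PAGE_SIZE - off);

		/* The bug was passing 'remaining' here instead of 'len'. */
		sg_set_page(sg, *pages++, len, off);
		sg = sg_next(sg);
		sg_count++;

		remaining -= len;
		off = 0;
	}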
Fixes: 0185846975 ("netfs: Add a function to extract an iterator into a scatterlist")
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Cc: Steve French <sfrench@samba.org>
Cc: Shyam Prasad N <nspmangalore@gmail.com>
Cc: Rohith Surabattula <rohiths.msft@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch only converts the actual array, but doesn't touch the
individual nfs_cache_array pages and related functions (that will be
done in the next patch).
I also adjust the names of the fields in the nfs_readdir_descriptor to
say "folio" instead of "page".
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
syzbot is hitting WARN_ON() in hfsplus_cat_{read,write}_inode(), because
a crafted filesystem image can contain a bogus length. These conditions
are not kernel bugs that justify a kernel panic.
Reported-by: syzbot <syzbot+e2787430e752a92b8750@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=e2787430e752a92b8750
Reported-by: syzbot <syzbot+4913dca2ea6e4d43f3f1@syzkaller.appspotmail.com>
Link: https://syzkaller.appspot.com/bug?extid=4913dca2ea6e4d43f3f1
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Viacheslav Dubeyko <slava@dubeyko.com>
Message-Id: <15308173-5252-d6a3-ae3b-e96d46cb6f41@I-love.SAKURA.ne.jp>
Signed-off-by: Christian Brauner <brauner@kernel.org>
syzbot detected a crash during log recovery:
XFS (loop0): Mounting V5 Filesystem bfdc47fc-10d8-4eed-a562-11a831b3f791
XFS (loop0): Torn write (CRC failure) detected at log block 0x180. Truncating head block from 0x200.
XFS (loop0): Starting recovery (logdev: internal)
==================================================================
BUG: KASAN: slab-out-of-bounds in xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
Read of size 8 at addr ffff88807e89f258 by task syz-executor132/5074
CPU: 0 PID: 5074 Comm: syz-executor132 Not tainted 6.2.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1b1/0x290 lib/dump_stack.c:106
print_address_description+0x74/0x340 mm/kasan/report.c:306
print_report+0x107/0x1f0 mm/kasan/report.c:417
kasan_report+0xcd/0x100 mm/kasan/report.c:517
xfs_btree_lookup_get_block+0x15c/0x6d0 fs/xfs/libxfs/xfs_btree.c:1813
xfs_btree_lookup+0x346/0x12c0 fs/xfs/libxfs/xfs_btree.c:1913
xfs_btree_simple_query_range+0xde/0x6a0 fs/xfs/libxfs/xfs_btree.c:4713
xfs_btree_query_range+0x2db/0x380 fs/xfs/libxfs/xfs_btree.c:4953
xfs_refcount_recover_cow_leftovers+0x2d1/0xa60 fs/xfs/libxfs/xfs_refcount.c:1946
xfs_reflink_recover_cow+0xab/0x1b0 fs/xfs/xfs_reflink.c:930
xlog_recover_finish+0x824/0x920 fs/xfs/xfs_log_recover.c:3493
xfs_log_mount_finish+0x1ec/0x3d0 fs/xfs/xfs_log.c:829
xfs_mountfs+0x146a/0x1ef0 fs/xfs/xfs_mount.c:933
xfs_fs_fill_super+0xf95/0x11f0 fs/xfs/xfs_super.c:1666
get_tree_bdev+0x400/0x620 fs/super.c:1282
vfs_get_tree+0x88/0x270 fs/super.c:1489
do_new_mount+0x289/0xad0 fs/namespace.c:3145
do_mount fs/namespace.c:3488 [inline]
__do_sys_mount fs/namespace.c:3697 [inline]
__se_sys_mount+0x2d3/0x3c0 fs/namespace.c:3674
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x3d/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f89fa3f4aca
Code: 83 c4 08 5b 5d c3 66 2e 0f 1f 84 00 00 00 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 c0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fffd5fb5ef8 EFLAGS: 00000206 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 00646975756f6e2c RCX: 00007f89fa3f4aca
RDX: 0000000020000100 RSI: 0000000020009640 RDI: 00007fffd5fb5f10
RBP: 00007fffd5fb5f10 R08: 00007fffd5fb5f50 R09: 000000000000970d
R10: 0000000000200800 R11: 0000000000000206 R12: 0000000000000004
R13: 0000555556c6b2c0 R14: 0000000000200800 R15: 00007fffd5fb5f50
</TASK>
The fuzzed image contains an AGF with an obviously garbage
agf_refcount_level value of 32, and a dirty log with a buffer log item
for that AGF. The ondisk AGF has a higher LSN than the recovered log
item. xlog_recover_buf_commit_pass2 reads the buffer, compares the
LSNs, and decides to skip replay because the ondisk buffer appears to be
newer.
Unfortunately, the ondisk buffer is corrupt, but recovery just read the
buffer with no buffer ops specified:
error = xfs_buf_read(mp->m_ddev_targp, buf_f->blf_blkno,
buf_f->blf_len, buf_flags, &bp, NULL);
Skipping the buffer leaves its contents in memory unverified. This sets
us up for a kernel crash because xfs_refcount_recover_cow_leftovers
reads the buffer (which is still around in XBF_DONE state, so no read
verification) and creates a refcountbt cursor of height 32. This is
impossible so we run off the end of the cursor object and crash.
Fix this by invoking the verifier on all skipped buffers and aborting
log recovery if the ondisk buffer is corrupt. It might be smarter to
force replay the log item atop the buffer and then see if it'll pass the
write verifier (like ext4 does) but for now let's go with the
conservative option where we stop immediately.
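A hedged sketch of that conservative option (paraphrased; the real patch differs in where it attaches the verifier ops and how it unwinds): before honouring the "ondisk is newer" decision, run the buffer's read verifier and fail recovery if it objects:

	if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) >= 0) {
		/*
		 * We are about to skip replay, so the ondisk contents must
		 * already be good; the earlier opless xfs_buf_read() never
		 * checked them, so do it now (assuming the correct b_ops
		 * for this buffer type have been attached).
		 */
		if (bp->b_ops) {
			bp->b_ops->verify_read(bp);
			if (bp->b_error) {
				error = bp->b_error;
				goto out_release;	/* abort recovery */
			}
		}
		goto out_release;	/* skip replay as before */
	}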
Link: https://syzkaller.appspot.com/bug?extid=7e9494b8b399902e994e
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>