Commit Graph

1296593 Commits

Li Zetao
54c78d497b btrfs: convert zlib_decompress() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio.
memcpy_to_page() can be replaced with memcpy_to_folio(). There is no
memzero_folio(), but folio_zero_range() is an equivalent replacement.
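For illustration, a minimal sketch of the substitution, assuming a
hypothetical helper (only memcpy_to_folio() and folio_zero_range() are
the real API here):

  #include <linux/highmem.h>

  /* Hypothetical helper, not the actual patch: copy copy_len bytes of
   * decompressed data into the destination folio and zero the tail. */
  static void copy_and_zero_tail(struct folio *dest, size_t offset,
                                 const char *buf, size_t copy_len,
                                 size_t zero_len)
  {
          /* was: memcpy_to_page(dest_page, offset, buf, copy_len); */
          memcpy_to_folio(dest, offset, buf, copy_len);

          /* there is no memzero_folio(); folio_zero_range() covers it */
          folio_zero_range(dest, offset + copy_len, zero_len);
  }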

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
046c0d6596 btrfs: convert try_release_extent_mapping() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio.
page_to_inode() can now be replaced with folio_to_inode().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
dd0a8df455 btrfs: convert try_release_extent_state() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio.
Moreover, use folio_pos() instead of page_offset(), which is more
consistent with folio usage.
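For illustration, the folio equivalent of the page offset arithmetic
looks roughly like this (variable names are hypothetical):

  /* byte range of the file covered by this folio */
  u64 start = folio_pos(folio);            /* replaces page_offset(page) */
  u64 end = start + folio_size(folio) - 1;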

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
08dd8507b1 btrfs: convert submit_eb_page() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
135873258c btrfs: convert submit_eb_subpage() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio.
Moreover, use folio_pos() instead of page_offset(), which is more
consistent with folio usage.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:21 +02:00
Li Zetao
884937793d btrfs: convert read_key_bytes() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio.
Moreover, use kmap_local_folio() instead of kmap_local_page(), which is
more consistent with folio usage.
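A rough sketch of the mapping pattern (variable names are hypothetical;
the copy is assumed not to cross a page boundary):

  /* kmap_local_folio() maps the page that contains this offset */
  char *kaddr = kmap_local_folio(folio, offset_in_folio(folio, pos));

  memcpy(dst, kaddr, len);
  kunmap_local(kaddr);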

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Li Zetao
b8ae2bfa68 btrfs: convert try_release_extent_buffer() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Li Zetao
0145aa38cb btrfs: convert try_release_subpage_extent_buffer() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio. Use
folio_pos() instead of page_offset(), which is more consistent with folio
usage. At the same time, folio_test_private() can handle a folio
directly, without converting from page to folio first.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Li Zetao
d4aeb5f7a7 btrfs: convert get_next_extent_buffer() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio. Use
folio_pos() instead of page_offset(), which is more consistent with folio
usage.

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Li Zetao
266a9361a4 btrfs: convert clear_page_extent_mapped() to take a folio
The old page API is being gradually replaced with folios to improve code
readability and avoid repeated conversions between page and folio. Now
clear_page_extent_mapped() can deal with a folio directly, so rename it
to clear_folio_extent_mapped().

Signed-off-by: Li Zetao <lizetao1@huawei.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Qu Wenruo
fd1e75d010 btrfs: make compression path to be subpage compatible
Currently the btrfs compression path is not really subpage compatible;
everything is still done in page units.

That's fine for the regular sector size and for the current subpage
routine: even for subpage, compression is only enabled if the whole range
is page aligned, so reading the page cache in page units is totally fine.

However, in preparation for the future subpage perfect compression
support, we need to change the compression routine to properly handle a
subpage range.

This patch prepares both zlib and zstd to read only the subpage range for
compression. Lzo already does subpage-aware reads, as its on-disk format
is sectorsize dependent.
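As a hedged sketch (workspace_buf, cur and remaining are hypothetical
names), reading only the in-range part of a folio could look like:

  /* bytes of [cur, cur + remaining) that live in this folio */
  u32 copy_len = min_t(u64, cur + remaining,
                       folio_pos(folio) + folio_size(folio)) - cur;

  memcpy_from_folio(workspace_buf, folio, offset_in_folio(folio, cur),
                    copy_len);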

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Qu Wenruo
9ca0e58cb7 btrfs: merge btrfs_orig_bbio_end_io() into btrfs_bio_end_io()
There are only two differences between the two functions:

- btrfs_orig_bbio_end_io() does extra error propagation
  This is mostly to allow tolerance for write errors.

- btrfs_orig_bbio_end_io() does an extra pending_ios check
  This check can handle both the original bio and a cloned one
  (all accounting happens in the original one).

This makes btrfs_orig_bbio_end_io() a much safer call.
In fact we already had a double free due to the use of
btrfs_bio_end_io() in the error path of btrfs_submit_chunk().

So just move the whole content of btrfs_orig_bbio_end_io() into
btrfs_bio_end_io().

For normal paths this brings no change, because they are already calling
btrfs_orig_bbio_end_io() in the first place.

For error paths (not only inside bio.c but also external callers), this
change will introduce extra checks, especially for external callers, as
they will error out without submitting the btrfs bio.

But considering it's already in the error path, such slower but much
safer checks are still an overall win.
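A rough sketch of the merged logic, with illustrative field and helper
names (this is not the exact btrfs implementation):

  void btrfs_bio_end_io(struct btrfs_bio *bbio, blk_status_t status)
  {
          /* all accounting happens on the original bio, never on a clone */
          struct btrfs_bio *orig = btrfs_orig_bbio(bbio);

          /* error propagation: keep the first error that was seen */
          if (status && !orig->bio.bi_status)
                  orig->bio.bi_status = status;

          /* complete only when the last outstanding split/clone finishes */
          if (atomic_dec_and_test(&orig->pending_ios))
                  orig->end_io(orig);
  }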

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Josef Bacik
ac325fc2aa btrfs: do not hold the extent lock for entire read
Historically we've held the extent lock throughout the entire read.
There have been a few reasons for this, but it has mostly just caused us
problems.  For example, this prevents us from allowing page faults
during direct io reads, because we could deadlock.  This has forced us
to only allow 4k reads at a time for io_uring NOWAIT requests because we
have no idea if we'll be forced to page fault and thus have to do a
whole lot of work.

On the buffered side we are protected by the page lock: while we're
reading, things like buffered writes, punch hole, and even direct IO to a
certain degree will get hung up on the page lock while the page is in
flight.

On the direct side we have the dio extent lock, which acts much like the
extent lock did before this patch, but only for direct reads.  This
protects direct reads from concurrent direct writes, while we're
protected from buffered writes via the inode lock.

Now that we're protected in all cases, narrow the extent lock to the
part where we're getting the extent map to submit the reads, no longer
holding the extent lock for the entire read operation.  Push the extent
lock down into do_readpage() so that we're only grabbing it when looking
up the extent map.  This portion was contributed by Goldwyn.

Co-developed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Josef Bacik
07d399cb4e btrfs: take the dio extent lock during O_DIRECT operations
Currently we hold the extent lock for the entire duration of a read.
This isn't really necessary in the buffered case, we're protected by the
page lock, however it's necessary for O_DIRECT.

For O_DIRECT reads, if we only locked the extent for the part where we
get the extent, we could potentially race with an O_DIRECT write in the
same region.  This isn't really a problem, unless the read is delayed so
much that the write does the COW, unpins the old extent, and some other
application re-allocates the extent before the read is actually able to
be submitted.  At that point at best we'd have a checksum mismatch, but
at worst we could read data that doesn't belong to us.

To address this potential race we need to make sure we don't have
overlapping, concurrent direct io reads and writes.

To accomplish this, use the new EXTENT_DIO_LOCKED bit in the direct IO
case in the same spot as the current extent lock.  Writes take this bit
while they're creating the ordered extent, which is also used to make
sure concurrent buffered reads or concurrent direct reads are not allowed
to occur, and drop it after the ordered extent is taken.  For reads it
mirrors the current EXTENT_LOCKED read behavior: we set it when starting
the read and clear it in the end_io handler to allow other direct writes
to continue.

This still has the drawback of disallowing concurrent overlapping direct
reads, but that limitation exists with the current extent locking as
well.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
Josef Bacik
7e2a595084 btrfs: introduce EXTENT_DIO_LOCKED
In order to support dropping the extent lock during a read we need a way
to make sure that direct reads and direct writes for overlapping ranges
are protected from each other.  To accomplish this introduce another
lock bit specifically for direct io.  Subsequent patches will utilize
this to protect direct IO operations.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
David Sterba
df2825e985 btrfs: always pass readahead state to defrag
The defrag ioctl passes the readahead state from the file, but autodefrag
does not have a file, so the readahead state is allocated when needed.

The autodefrag loop in the cleaner thread iterates over inodes, so we can
simply provide an on-stack readahead state and no longer need to allocate
it in btrfs_defrag_file(). The size is 32 bytes, which is acceptable.
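A minimal sketch of the on-stack pattern (the surrounding inode loop and
the call into btrfs_defrag_file() are omitted):

  /* per-inode readahead state on the cleaner thread's stack */
  struct file_ra_state ra = { 0 };

  file_ra_state_init(&ra, inode->i_mapping);
  /* ... pass &ra to btrfs_defrag_file() for this inode ... */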

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:20 +02:00
David Sterba
11e3107d47 btrfs: drop transaction parameter from btrfs_add_inode_defrag()
There's only one caller, inode_should_defrag(), and it passes NULL to
btrfs_add_inode_defrag(), so we can drop the parameter and simplify the
code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
91c9f2855e btrfs: return void from btrfs_add_inode_defrag()
The potential memory allocation failure is not a fatal error: skipping
autodefrag is fine and the caller inode_should_defrag() does not care
about the error.  Further writes can attempt to add the inode back to
the defragmentation list again.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
276940915f btrfs: clear defragmented inodes using postorder in btrfs_cleanup_defrag_inodes()
btrfs_cleanup_defrag_inodes() is not called frequently, only on remount
or unmount, but the way it frees the inodes in fs_info->defrag_inodes
is inefficient: each time it needs to locate the first node, remove it,
and potentially rebalance the tree, until it's done. That approach does
allow a conditional reschedule, though.

For cleanups the rbtree_postorder_for_each_entry_safe() iterator is
convenient, but we can't reschedule and restart the iteration because
some of the tree nodes would already be freed.

The cleanup operation is kmem_cache_free(), which will likely take the
fast path for most objects, so rescheduling should not be necessary.
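A sketch of the postorder cleanup pattern (the names follow the defrag
code but are not guaranteed to match it exactly):

  struct inode_defrag *entry, *next;

  rbtree_postorder_for_each_entry_safe(entry, next,
                                       &fs_info->defrag_inodes, rb_node)
          kmem_cache_free(btrfs_inode_defrag_cachep, entry);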

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
ffc531652d btrfs: rename __btrfs_run_defrag_inode() and drop double underscores
The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
4225756902 btrfs: rename __btrfs_add_inode_defrag() and drop double underscores
The function does not follow the pattern where the underscores would be
justified, so rename it.

Also update the misleading comment: the passed item is not freed here,
that's done by the caller.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
6d2f07e13c btrfs: rename __need_auto_defrag() and drop double underscores
The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
b7164d9ab0 btrfs: constify arguments of compare_inode_defrag()
A comparator function does not change its parameters, make them const.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
a92914a80b btrfs: rename __compare_inode_defrag() and drop double underscores
The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
06de42c5a9 btrfs: rename __extent_writepage() and drop double underscores
The function does not follow the pattern where the underscores would be
justified, so rename it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
22b4ef50dc btrfs: rename __btrfs_submit_bio() and drop double underscores
The previous patch freed up the function name btrfs_submit_bio(), so we
can use it for a helper that submits a struct bio.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
David Sterba
792e86ef31 btrfs: rename btrfs_submit_bio() to btrfs_submit_bbio()
The function name is a bit misleading as it submits a btrfs_bio (bbio).
Rename it so that btrfs_submit_bio() can be used when an actual bio is
submitted.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:19 +02:00
Qu Wenruo
ce4a71ee15 btrfs: subpage: remove btrfs_fs_info::subpage_info member
The member btrfs_fs_info::subpage_info stores the cached bitmap start
positions inside the merged bitmap.

However, in reality only one thing depends on the sectorsize:
bitmap_nr_bits, which records the number of sectors that fit inside a
page.

The sub-bitmaps have a fixed order, thus calculating the start position
of each sub-bitmap is just a quick multiplication.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Qu Wenruo
2c70fe16ea btrfs: remove the nr_ret parameter from __extent_writepage_io()
The parameter @nr_ret is used to tell the caller how many sectors have
been submitted for IO.

Callers then check the @nr_ret value to determine if we need to manually
clear PAGECACHE_TAG_DIRTY: if we submitted no sector (e.g. all sectors
are beyond i_size), folio_start_writeback() is never called and thus the
PAGECACHE_TAG_DIRTY tag will not be cleared.

Remove this parameter by:

- Move the btrfs_folio_clear_writeback() call into
  __extent_writepage_io()
  So that if we didn't submit any IO, we manually call
  btrfs_folio_set_writeback() to clear PAGECACHE_TAG_DIRTY when
  the page is no longer dirty.

- Use a bool to record whether we have submitted any sector, instead
  of an int.

- Use subpage compatible helpers to end folio writeback.
  This brings no change to the behavior, just for the sake of consistency.

  As for the call site inside __extent_writepage(), we're always called
  for the whole page, so the existing full page helper
  folio_(start|end)_writeback() is totally fine.

  For the call site inside extent_write_locked_range(), although we can
  have a subpage range, folio_start_writeback() will only clear
  PAGECACHE_TAG_DIRTY if the folio is no longer dirty, and the full folio
  will still be dirty if there is any dirty subpage range.
  Only when the last dirty subpage sector is cleared will
  folio_start_writeback() clear PAGECACHE_TAG_DIRTY.

  So no matter whether we call the full page or the subpage helper, the
  result is the same; just use the subpage helpers for consistency.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Thorsten Blum
e39ba5dfd0 btrfs: send: fix grammar in comments
Fix a few obvious grammar mistakes: a -> an, then -> than.

Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Junchao Sun
3cce39a8ca btrfs: qgroup: use xarray to track dirty extents in transaction
Use xarray to track dirty extents to reduce the size of the struct
btrfs_qgroup_extent_record from 64 bytes to 40 bytes.  The xarray is
more cache line friendly, and it also reduces the complexity of the
insertion and search code compared to the rb-tree.

Another change concerns error handling.  Before this patch, the result
of btrfs_qgroup_trace_extent_nolock() was always a success.  With this
patch the function calls xa_store(), which can fail, so mark the qgroups
as inconsistent if an error happens and then free the preallocated
memory.  Also, we preallocate the memory before taking the spin_lock();
if the preallocation fails, the error handling is the same as in the
existing code.
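A hedged sketch of the preallocate-then-store pattern (the xarray field,
index calculation and locking shown are illustrative, not the exact
btrfs code):

  int ret = xa_reserve(&delayed_refs->dirty_extents, index, GFP_NOFS);

  if (ret)
          return ret;

  spin_lock(&delayed_refs->lock);
  /* the slot was reserved above, so this store does not allocate */
  ret = xa_err(xa_store(&delayed_refs->dirty_extents, index, record,
                        GFP_ATOMIC));
  spin_unlock(&delayed_refs->lock);
  if (ret)
          /* e.g. mark qgroups inconsistent, free the record */
          xa_release(&delayed_refs->dirty_extents, index);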

Suggested-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Junchao Sun
14ed830d10 btrfs: qgroup: use goto style to handle errors in add_delayed_ref()
Clean up resources using goto to get rid of repeated code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Junchao Sun <sunjunchao2870@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Qu Wenruo
8189197425 btrfs: refactor __extent_writepage_io() to do sector-by-sector submission
Unlike the bitmap usage inside raid56, for __extent_writepage_io() we
handle the subpage submission not sector-by-sector, but for each dirty
range we found.

This is not a big deal normally, as the subpage complex code is already
mostly optimized out by the compiler for x86_64.

However for the sake of consistency and for the future of subpage
sector-perfect compression support, this patch does:

- Extract the sector submission code into submit_one_sector()

- Add the needed code to extract the dirty bitmap for subpage case
  There is a small pitfall for the non-subpage case: as we cleared the
  page dirty flag before starting writeback, we have to manually set
  the default dirty_bitmap to 1 in that case.

- Use bitmap_and() to calculate the target sectors we need to submit
  This is done for both subpage and non-subpage cases, and will later
  be expanded to skip inline/compression ranges.

For x86_64, the dirty bitmap will be fixed to 1 with a length of 1,
so we're still doing the same workload per sector.

For larger page sizes the overhead will be a little larger, as previously
we only needed one extent_map lookup per dirty range, but now it will be
one extent_map lookup per sector.

But that is the same frequency as on x86_64, so we're just aligning the
behavior to x86_64.
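A hedged sketch of the bitmap-driven loop (the bitmap size macro and the
submit_one_sector() signature shown are illustrative):

  unsigned long submit_bitmap[BITS_TO_LONGS(MAX_BLOCKS_PER_FOLIO)];
  unsigned int bit;

  /* only submit sectors that are both dirty and inside the target range */
  bitmap_and(submit_bitmap, dirty_bitmap, range_bitmap, nbits);
  for_each_set_bit(bit, submit_bitmap, nbits) {
          u64 cur = folio_pos(folio) + ((u64)bit << fs_info->sectorsize_bits);

          ret = submit_one_sector(inode, folio, cur, bio_ctrl, i_size);
          if (ret < 0)
                  break;
  }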

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Qu Wenruo
77b0b98bb7 btrfs: subpage: fix the bitmap dump which can cause bitmap corruption
In commit 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for
debug") an internal macro GET_SUBPAGE_BITMAP() was introduced to grab the
bitmap of each attribute.

But that commit uses bitmap_cut(), which left-shifts the larger bitmap
and produces incorrect values.

Thankfully this bitmap_cut() is only called for debug usage, and so far
it has not caused problems.

Fix it to use bitmap_read() to only grab the desired sub-bitmap.
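For illustration (start_bit and nbits are hypothetical names, assuming
the merged bitmap lives in subpage->bitmaps):

  /* returns up to BITS_PER_LONG bits starting at start_bit, without
   * modifying the source bitmap (unlike bitmap_cut()) */
  const unsigned long dirty = bitmap_read(subpage->bitmaps, start_bit, nbits);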

Fixes: 75258f20fb ("btrfs: subpage: dump extra subpage bitmaps for debug")
CC: stable@vger.kernel.org # 6.6+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Johannes Thumshirn
0ae653fbec btrfs: reduce chunk_map lookups in btrfs_map_block()
Currently we're calling btrfs_num_copies() before btrfs_get_chunk_map() in
btrfs_map_block(). But btrfs_num_copies() itself does a chunk map lookup
to be able to calculate the number of copies.

So split out the code getting the number of copies from btrfs_num_copies()
into a helper called btrfs_chunk_map_num_copies() and directly call it
from btrfs_map_block() and btrfs_num_copies().

This saves us one rbtree lookup per btrfs_map_block() invocation.
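A rough sketch of the resulting shape (simplified, not the exact
signatures or error handling):

  int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len)
  {
          struct btrfs_chunk_map *map;
          int ret;

          map = btrfs_get_chunk_map(fs_info, logical, len);
          if (IS_ERR(map))
                  return 1;

          ret = btrfs_chunk_map_num_copies(map); /* no second rbtree lookup */
          btrfs_free_chunk_map(map);
          return ret;
  }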

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Filipe Manana
6d752cacae btrfs: directly wake up cleaner kthread in the BTRFS_IOC_SYNC ioctl
The BTRFS_IOC_SYNC ioctl wants to wake up the cleaner kthread so that it
does any pending work (subvolume deletion, delayed iputs, etc), however
it is waking up the transaction kthread, which in turn wakes up the
cleaner. Since we don't have any transaction to commit, as any ongoing
transaction was already committed when it called btrfs_sync_fs() and
the goal is just to wake up the cleaner thread, directly wake up the
cleaner instead of the transaction kthread.
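The change in the ioctl handler amounts to something like the following
sketch:

  /* was: wake_up_process(fs_info->transaction_kthread); */
  wake_up_process(fs_info->cleaner_kthread);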

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Qu Wenruo
efffb803bf btrfs: make btrfs_is_subpage() to return false directly for 4K page size
Btrfs only supports sectorsize 4K, 8K, 16K, 32K, 64K for now, thus for
systems with 4K page size, there is no way the fs is subpage (sectorsize
< PAGE_SIZE).

So here we define btrfs_is_subpage() differently according to
PAGE_SIZE:

- PAGE_SIZE > 4K
  We may hit real subpage cases, so define btrfs_is_subpage() as a
  regular function and do the usual checks.

- PAGE_SIZE == 4K (no smaller PAGE_SIZE is supported, as far as we know)
  There is no way the fs is subpage, so just define btrfs_is_subpage()
  as an inline function which always returns false.
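A sketch of the shape of the change (parameter list simplified for
illustration):

  #if PAGE_SIZE > SZ_4K
  bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info);
  #else
  static inline bool btrfs_is_subpage(const struct btrfs_fs_info *fs_info)
  {
          return false;
  }
  #endif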

This saves about 7K bytes for x86_64 debug builds:

	   text	   data	    bss	    dec	    hex	filename
Before:	1484452	 168693	  25776	1678921	 199e49	fs/btrfs/btrfs.ko
After:	1476605	 168445	  25776	1670826	 197eaa	fs/btrfs/btrfs.ko

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:18 +02:00
Johannes Thumshirn
0c749585fc btrfs: change RST lookup error message level to debug
Now that RAID stripe-tree lookup failures are not treated as a fatal issue
any more, change the RAID stripe-tree lookup error message to debug level.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Johannes Thumshirn
04915240e2 btrfs: don't readahead the relocation inode on RST
On relocation we're doing readahead on the relocation inode, but if the
filesystem is backed by a RAID stripe tree we can get ENOENT (e.g. due to
preallocated extents not being mapped in the RST) from the lookup.

But readahead doesn't handle the error and submits invalid reads to the
device, causing an assertion in the scatter-gather list code:

  BTRFS info (device nvme1n1): balance: start -d -m -s
  BTRFS info (device nvme1n1): relocating block group 6480920576 flags data|raid0
  BTRFS error (device nvme1n1): cannot find raid-stripe for logical [6481928192, 6481969152] devid 2, profile raid0
  ------------[ cut here ]------------
  kernel BUG at include/linux/scatterlist.h:115!
  Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 PID: 1012 Comm: btrfs Not tainted 6.10.0-rc7+ #567
  RIP: 0010:__blk_rq_map_sg+0x339/0x4a0
  RSP: 0018:ffffc90001a43820 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802
  RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000
  RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8
  R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000
  FS:  00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000002cd11000 CR3: 00000001109ea001 CR4: 0000000000370eb0
  Call Trace:
   <TASK>
   ? __die_body.cold+0x14/0x25
   ? die+0x2e/0x50
   ? do_trap+0xca/0x110
   ? do_error_trap+0x65/0x80
   ? __blk_rq_map_sg+0x339/0x4a0
   ? exc_invalid_op+0x50/0x70
   ? __blk_rq_map_sg+0x339/0x4a0
   ? asm_exc_invalid_op+0x1a/0x20
   ? __blk_rq_map_sg+0x339/0x4a0
   nvme_prep_rq.part.0+0x9d/0x770
   nvme_queue_rq+0x7d/0x1e0
   __blk_mq_issue_directly+0x2a/0x90
   ? blk_mq_get_budget_and_tag+0x61/0x90
   blk_mq_try_issue_list_directly+0x56/0xf0
   blk_mq_flush_plug_list.part.0+0x52b/0x5d0
   __blk_flush_plug+0xc6/0x110
   blk_finish_plug+0x28/0x40
   read_pages+0x160/0x1c0
   page_cache_ra_unbounded+0x109/0x180
   relocate_file_extent_cluster+0x611/0x6a0
   ? btrfs_search_slot+0xba4/0xd20
   ? balance_dirty_pages_ratelimited_flags+0x26/0xb00
   relocate_data_extent.constprop.0+0x134/0x160
   relocate_block_group+0x3f2/0x500
   btrfs_relocate_block_group+0x250/0x430
   btrfs_relocate_chunk+0x3f/0x130
   btrfs_balance+0x71b/0xef0
   ? kmalloc_trace_noprof+0x13b/0x280
   btrfs_ioctl+0x2c2e/0x3030
   ? kvfree_call_rcu+0x1e6/0x340
   ? list_lru_add_obj+0x66/0x80
   ? mntput_no_expire+0x3a/0x220
   __x64_sys_ioctl+0x96/0xc0
   do_syscall_64+0x54/0x110
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7fcc04514f9b
  Code: Unable to access opcode bytes at 0x7fcc04514f71.
  RSP: 002b:00007ffeba923370 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
  RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fcc04514f9b
  RDX: 00007ffeba923460 RSI: 00000000c4009420 RDI: 0000000000000003
  RBP: 0000000000000000 R08: 0000000000000013 R09: 0000000000000001
  R10: 00007fcc043fbba8 R11: 0000000000000246 R12: 00007ffeba924fc5
  R13: 00007ffeba923460 R14: 0000000000000002 R15: 00000000004d4bb0
   </TASK>
  Modules linked in:
  ---[ end trace 0000000000000000 ]---
  RIP: 0010:__blk_rq_map_sg+0x339/0x4a0
  RSP: 0018:ffffc90001a43820 EFLAGS: 00010202
  RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802
  RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000
  RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000
  R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8
  R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000
  FS:  00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fcc04514f71 CR3: 00000001109ea001 CR4: 0000000000370eb0
  Kernel panic - not syncing: Fatal exception
  Kernel Offset: disabled
  ---[ end Kernel panic - not syncing: Fatal exception ]---

So in case of a relocation on a RAID stripe-tree based file system, skip
the readahead.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Johannes Thumshirn
f4d39cf1ce btrfs: set search_commit_root on stripe io in case of relocation
Set rst_search_commit_root in the btrfs_io_stripe we're passing to
btrfs_map_block() in case we're doing data relocation.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Johannes Thumshirn
d6106f0dc5 btrfs: rename btrfs_io_stripe::is_scrub to rst_search_commit_root
Rename 'btrfs_io_stripe::is_scrub' to 'rst_search_commit_root'. While
'is_scrub' describes the state of the io_stripe (it is a stripe submitted
by scrub), it does not describe the purpose, namely looking at the commit
root when searching RAID stripe-tree entries.

Renaming the member to rst_search_commit_root describes this purpose.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Johannes Thumshirn
f8428360c8 btrfs: don't dump stripe-tree on lookup error
This just creates unnecessary noise and doesn't provide any insights into
debugging RAID stripe-tree related issues.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Boris Burkov
f8e9f4a76d btrfs: add comment about locking in cow_file_range_inline()
Add a comment to document the complicated locked_page unlock logic in
cow_file_range_inline(). The specific tricky part is that a caller just
up the stack converts ret == 0 to ret == 1, and then another caller far
up the callstack handles ret == 1 as a success AND returns without
cleanup in that case, both of which "feel" unnatural and led to
the original bug.

Try to document that somewhat specific callstack logic here to explain
the weird un-setting of locked_folio on success.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Filipe Manana
68a505bb87 btrfs: more efficient chunk map iteration when device replace finishes
When iterating the chunk maps after a device replace finishes, we are
doing a full rbtree search for each chunk map, which is not the most
efficient thing to do and wastes CPU time. As we are holding a write lock
on the tree
during the whole iteration, we can simply start from the first node in the
tree and then move to the next chunk map by doing a rb_next() call - the
only exception is when we need to reschedule, in which case we have to do
a full rbtree search since we dropped the write lock and the tree may have
changed (chunk maps may have been removed and the tree got rebalanced).
So just do that.
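A hedged sketch of the iteration pattern (field and helper names are
illustrative; locking is elided):

  struct rb_node *node = rb_first_cached(&fs_info->mapping_tree);

  while (node) {
          struct btrfs_chunk_map *map;

          map = rb_entry(node, struct btrfs_chunk_map, rb_node);
          /* ... update this map's stripes to the target device ... */
          node = rb_next(node);
          if (need_resched()) {
                  /* drop the write lock, cond_resched(), retake the lock,
                   * then re-search the tree for the first map after the
                   * last one processed, since the tree may have changed */
          }
  }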

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Filipe Manana
b79f1c2caa btrfs: reschedule when updating chunk maps at the end of a device replace
At the end of a device replace we must go over all the chunk maps and
update their stripes to point to the target device instead of the source
device. We iterate over the chunk maps while holding a write lock and
we never reschedule, which can result in monopolizing a CPU for too long
and blocking readers for too long (it's a rw lock, non-blocking).

So improve on this by rescheduling if necessary. This is safe because at
this point we are holding the chunk mutex, which means no new chunks can
be allocated and therefore we don't risk missing a new chunk map that
covers a range behind the last one we processed before rescheduling.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
5fe1912449 btrfs: convert extent_range_clear_dirty_for_io() to use a folio
Instead of getting a page and using that to clear dirty for io, use the
folio helper and use the appropriate folio functions.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
c86d3aac81 btrfs: convert insert_inline_extent() to use a folio
We only use a page to copy in the data for the inline extent.  Use a
folio for this instead.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
1bbf3a3aea btrfs: convert btrfs_set_range_writeback() to use a folio
We already use a lot of folio-based functions here; update the function
to use __filemap_get_folio() instead of find_get_page() and then use the
folio directly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:17 +02:00
Josef Bacik
dfc9e3017a btrfs: convert wait_subpage_spinlock() to only use a folio
Currently this already uses a folio for most things; update it to take a
folio and replace all the page usage with the corresponding folio usage.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00
Josef Bacik
1a48259d9b btrfs: convert find_next_dirty_byte() to take a folio
We already use a folio in parts of this function; replace all page usage
with the folio and update the function to take the folio as an argument.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2024-09-10 16:51:16 +02:00