The __extent_writepage_io() function originally just iterates through all
the extent maps of a page and submits any regular extents.
This is fine for the sectorsize == PAGE_SIZE case: if a page is dirty, we
need to submit the only sector contained in the page.
But for the subpage case, one dirty page can contain several clean sectors
along with at least one dirty sector.
If __extent_writepage_io() still submits all regular extent maps, it can
submit data which has already been written to disk.
And since such already written data won't have corresponding ordered
extents, it will trigger a BUG_ON() in btrfs_csum_one_bio().
Change the behavior of __extent_writepage_io() by finding the first
dirty byte in the page, and only submitting the dirty range instead of
the full extent.
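Conceptually the dirty range lookup works like the sketch below; the helper
name and the subpage dirty bitmap layout are assumptions for illustration,
not the exact patch:

/*
 * Illustrative only: given the per-page dirty bitmap used by subpage
 * (one bit per sector), find the first dirty run inside
 * [*start, *start + *len) and clamp *start/*len to it.  For regular
 * sectorsize == PAGE_SIZE there is a single bit, so a dirty page still
 * means "submit the whole page".
 */
static void find_first_dirty_range(const unsigned long *dirty_bitmap,
				   u32 sectorsize, u64 page_start,
				   u64 *start, u32 *len)
{
	const unsigned long nbits = PAGE_SIZE / sectorsize;
	unsigned long first_set;
	unsigned long first_clear;

	first_set = find_next_bit(dirty_bitmap, nbits,
				  (u32)(*start - page_start) / sectorsize);
	if (first_set >= nbits) {
		/* Nothing dirty left in this page */
		*len = 0;
		return;
	}
	first_clear = find_next_zero_bit(dirty_bitmap, nbits, first_set);
	*start = page_start + (u64)first_set * sectorsize;
	*len = (first_clear - first_set) * sectorsize;
}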
While we're at it, also make the following calls subpage compatible:
- SetPageError()
- end_page_writeback()
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_set_range_writeback() currently just sets the page
writeback unconditionally.
Change it to call the subpage helper so that we can handle both cases
well.
Since the subpage helper needs btrfs_fs_info, also change the parameter
to accept btrfs_inode.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In cow_file_range(), after we have succeeded creating an inline extent,
we unlock the page with extent_clear_unlock_delalloc() by passing
locked_page == NULL.
For the sectorsize == PAGE_SIZE case, this just makes the page lock and
unlock pairing harder to follow.
But for the incoming subpage case, it can be a big problem.
For the incoming subpage case, page locking has two entry points:
- __process_pages_contig()
In that case, we know exactly the range we want to lock (which only
requires sector alignment).
To handle the subpage requirement, we introduce btrfs_subpage::writers
to page::private, and will update it in __process_pages_contig().
- Other directly lock/unlock_page() call sites
Those won't touch btrfs_subpage::writers at all.
This means, page locked by __process_pages_contig() can only be unlocked
by __process_pages_contig().
Thankfully we already have the existing infrastructure in the form of
@locked_page in various call sites.
Unfortunately, the extent_clear_unlock_delalloc() call in cow_file_range()
after creating an inline extent is the exception.
It intentionally calls extent_clear_unlock_delalloc() with locked_page ==
NULL, to also unlock the current page (and clear its dirty/writeback bits).
To cooperate with the incoming subpage modifications, and to make the page
lock/unlock pairing easier to understand, this patch will still call
extent_clear_unlock_delalloc() with locked_page, and only unlock the
page in __extent_writepage().
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When __process_pages_contig() gets called for
extent_clear_unlock_delalloc(), if we hit the locked page, only the
Private2 bit is updated, but the dirty/writeback/error bits are all skipped.
There are several call sites that call extent_clear_unlock_delalloc()
with locked_page and PAGE_CLEAR_DIRTY/PAGE_SET_WRITEBACK/PAGE_END_WRITEBACK:
- cow_file_range()
- run_delalloc_nocow()
- cow_file_range_async()
All of them in their error handling branches.
For those call sites, since we skip the locked page for the
dirty/error/writeback bit update, the locked page will still have its
subpage dirty bit set.
Normally it's the call site that locked the page that is responsible for
handling it, but it won't hurt if we also do the update here.
Especially since there are already other call sites doing the same thing
by manually passing NULL as locked_page.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This involves the following modifications:
- Ordered extent creation
This is done in process_one_page(); now PAGE_SET_ORDERED will call the
subpage helper to do the work.
- endio functions
This is done in btrfs_mark_ordered_io_finished().
- btrfs_invalidatepage()
- btrfs_cleanup_ordered_extents()
Use the subpage helper, and add an extra branch to exit if the
locked page has covered the full range.
Now the usage of the page Ordered flag for ordered extent accounting is
fully subpage compatible.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch introduces the following functions to handle btrfs subpage
ordered (Private2) status:
- btrfs_subpage_set_ordered()
- btrfs_subpage_clear_ordered()
- btrfs_subpage_test_ordered()
These helpers can only be called when the range is ensured to be
inside the page.
- btrfs_page_set_ordered()
- btrfs_page_clear_ordered()
- btrfs_page_test_ordered()
These helpers can handle both regular sector size and subpage without
problem.
These functions are here to coordinate btrfs_invalidatepage() with
btrfs_writepage_endio_finish_ordered(), to make sure only one of those
functions can finish the ordered extent.
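As a rough illustration, the subpage variant follows the same pattern as the
other subpage bit helpers; the bitmap member and the helper that turns a
range into a per-sector bitmap are assumptions here, not necessarily the
exact names in the patch:

void btrfs_subpage_set_ordered(const struct btrfs_fs_info *fs_info,
			       struct page *page, u64 start, u32 len)
{
	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
	const u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
	unsigned long flags;

	spin_lock_irqsave(&subpage->lock, flags);
	subpage->ordered_bitmap |= tmp;	/* per-sector Ordered bits */
	SetPageOrdered(page);		/* page-level flag stays set */
	spin_unlock_irqrestore(&subpage->lock, flags);
}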
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce a new data-inode-specific subpage member, writers, to record
how many sectors are under page lock for delalloc writing.
This member acts pretty much the same as readers, except it's only for
delalloc writes.
This is important for the delalloc code to track which page can really be
freed, as we have cases like run_delalloc_nocow() where we may stop
processing a nocow range partway through a page and need to fall back to
cow.
In that case, we need a way to determine if we can really unlock a full
page.
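Conceptually the accounting looks roughly like the sketch below; the exact
helper names and assertion details are assumptions:

void btrfs_subpage_start_writer(const struct btrfs_fs_info *fs_info,
				struct page *page, u64 start, u32 len)
{
	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
	const int nbits = len >> fs_info->sectorsize_bits;

	/* Delalloc write locking never mixes with read locking */
	ASSERT(atomic_read(&subpage->readers) == 0);
	atomic_add(nbits, &subpage->writers);
}

/* Returns true if the caller was the last writer and may unlock the page */
bool btrfs_subpage_end_and_test_writer(const struct btrfs_fs_info *fs_info,
				       struct page *page, u64 start, u32 len)
{
	struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
	const int nbits = len >> fs_info->sectorsize_bits;

	ASSERT(atomic_read(&subpage->writers) >= nbits);
	return atomic_sub_and_test(nbits, &subpage->writers);
}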
With the new btrfs_subpage::writers, there is a new requirement:
- Page locked by process_one_page() must be unlocked by
process_one_page()
There are still tons of call sites that manually lock and unlock a page
without updating btrfs_subpage::writers.
So if we lock a page through process_one_page() then it must be
unlocked by process_one_page() to keep btrfs_subpage::writers
consistent.
This will be handled in next patch.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Now in end_bio_extent_writepage(), the only subpage incompatible code is
the end_page_writeback().
Just call the subpage helpers.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For __process_pages_contig() and process_one_page(), to handle subpage
we only need to pass bytenr in and call subpage helpers to handle
dirty/error/writeback status.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Since the extent io tree operations in btrfs_dirty_pages() are already
subpage compatible, we only need to make the page status update to use
subpage helpers.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Just like the read path, for subpage support we only require sector size
alignment.
So change the error message condition to only require sector alignment.
This should not affect existing code, as for regular sectorsize ==
PAGE_SIZE case, we are still requiring page alignment.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the coming subpage RW support, there are a lot of page status update
calls which need to be converted to the subpage compatible version, which
needs @start and @len.
Some call sites already have such @start/@len and are already within the
page range, like various endio functions.
But there are also call sites which need to clamp the range for the subpage
case, like btrfs_dirty_pages() and __process_pages_contig().
Here we introduce new helpers, btrfs_page_clamp_*(), to do (and only do) the
clamp for the subpage version.
Although in theory all existing btrfs_page_*() calls could be converted to
use btrfs_page_clamp_*() directly, that would force us to do unnecessary
clamp operations.
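The clamp step itself is straightforward; a minimal sketch (the wrapper
naming matches the btrfs_page_clamp_*() helpers above, the rest is an
assumption):

/* Trim an arbitrary [*start, *start + *len) range so it fits within @page */
static void btrfs_subpage_clamp_range(struct page *page, u64 *start, u32 *len)
{
	u64 orig_start = *start;
	u64 orig_end = *start + *len;
	u64 page_start = page_offset(page);
	u64 page_end = page_start + PAGE_SIZE;

	*start = max(orig_start, page_start);
	*len = min(orig_end, page_end) - *start;
}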
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In __process_pages_contig() we update page status according to page_ops.
That update process is a bunch of 'if' branches which lie inside
two loops, and this makes it pretty hard to expand for later subpage
operations.
So this patch will extract these operations into their own function,
process_one_page().
Since we're refactoring __process_pages_contig() anyway, also move the new
helper and __process_pages_contig() before their first caller, to
remove the forward declaration.
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
As a preparation for incoming subpage support, we need bytenr passed to
__process_pages_contig() directly, not the current page index.
So change the parameter and all callers to pass bytenr in.
With the modification, here we need to replace the old @index_ret with
@processed_end for __process_pages_contig(), but this brings a small
problem.
Normally we follow the inclusive return value convention, meaning
@processed_end should be the last byte we processed.
If parameter @start is 0 and we failed to lock any page, then we would
return @processed_end as -1, causing more problems for
__unlock_for_delalloc().
So for @processed_end we use two different return value patterns:
if we have locked any page, @processed_end will be the last byte of the
locked pages; otherwise it will be @start.
This change will impact lock_delalloc_pages(), so it needs to check
@processed_end to only unlock the range if we have locked any.
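A simplified sketch of how the caller side can consume @processed_end under
that convention (signatures assumed, not the exact patch):

/* Inside lock_delalloc_pages(), roughly: */
u64 processed_end = delalloc_start;
int ret;

ret = __process_pages_contig(inode->i_mapping, locked_page, delalloc_start,
			     delalloc_end, PAGE_LOCK, &processed_end);
if (ret == -EAGAIN && processed_end > delalloc_start)
	/* Only unlock what we have actually locked */
	__unlock_for_delalloc(inode, locked_page, delalloc_start,
			      processed_end);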
Tested-by: Ritesh Harjani <riteshh@linux.ibm.com> # [ppc64]
Tested-by: Anand Jain <anand.jain@oracle.com> # [aarch64]
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running subpage preparation patches on x86, btrfs/125 will hang
forever with one ordered extent never finished.
[CAUSE]
The test case btrfs/125 itself will always fail as the fix is never merged.
When the test fails at balance, btrfs needs to cleanup the ordered
extent in btrfs_cleanup_ordered_extents() for data reloc inode.
The problem is in the sequence in which we clean up the page Ordered bit.
Currently it works like:
btrfs_cleanup_ordered_extents()
|- find_get_page();
|- btrfs_page_clear_ordered(page);
|  Now the page doesn't have the Ordered bit anymore.
|  !!! This also includes the first (locked) page !!!
|
|- offset += PAGE_SIZE
|  This is to skip the first page
|- __endio_write_update_ordered()
   |- btrfs_mark_ordered_io_finished(NULL)
      Except for the first page, all ordered extents are finished.
Then the locked page is cleaned up in __extent_writepage():
__extent_writepage()
|- If (PageError(page))
   |- end_extent_writepage()
      |- btrfs_mark_ordered_io_finished(page)
         |- if (btrfs_test_page_ordered(page))
            |- !!! The page gets skipped !!!
               The ordered extent is not decreased as the page doesn't
               have the Ordered bit anymore.
This leaves the ordered extent with bytes_left == sectorsize, thus it
never finishes.
[FIX]
The fix is to ensure we never clear the page Ordered bit without running
the ordered extent accounting.
Here we choose to skip the locked page in
btrfs_cleanup_ordered_extents() so that later end_extent_writepage() can
properly finish the ordered extent.
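A simplified sketch of that skip (field and helper names are assumptions
based on the helpers mentioned earlier in this series):

/* Inside btrfs_cleanup_ordered_extents(), roughly: */
while (index <= end_index) {
	struct page *page;

	/*
	 * The locked page is handled later by end_extent_writepage(),
	 * which needs the Ordered bit to still be set in order to
	 * finish the ordered extent.  Skip it here.
	 */
	if (index == locked_page->index) {
		index++;
		continue;
	}
	page = find_get_page(inode->vfs_inode.i_mapping, index);
	index++;
	if (!page)
		continue;
	btrfs_page_clear_ordered(inode->root->fs_info, page,
				 page_offset(page), PAGE_SIZE);
	put_page(page);
}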
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Inside btrfs we use Private2 page status to indicate we have an ordered
extent with pending IO for the sector.
But the page status name, Private2, tells us nothing about the bit
itself, so this patch will rename it to Ordered.
Also add an extra comment about the bit, so that a reader who is still
uncertain about the page Ordered status will find the explanation pretty
easily.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch will refactor btrfs_invalidatepage() for the incoming subpage
support.
The involved modifications are:
- Use while() loop instead of "goto again;"
- Use single variable to determine whether to delete extent states
Each branch will also have comments why we can or cannot delete the
extent states
- Do qgroup free and extent states deletion per-loop
The current code only works for the PAGE_SIZE == sectorsize case.
This refactor also makes it clear what we do for different sectors:
- Sectors without ordered extent
We're completely safe to remove all extent states for the sector(s)
- Sectors with ordered extent, but no Private2 bit
This means the endio has already been executed, we can't remove all
extent states for the sector(s).
- Sectors with an ordered extent, still with the Private2 bit
This means we need to decrease the ordered extent accounting.
And then it comes to two different variants:
* We have finished and removed the ordered extent
Then it's the same as "sectors without ordered extent"
* We didn't finish the ordered extent
We can remove some extent states, but not all.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Although we already have btrfs_lookup_first_ordered_extent() and
btrfs_lookup_ordered_extent(), they both have their own limitations:
- btrfs_lookup_ordered_extent() can't do an extra range check
It's only designed to look up any ordered extent before a certain bytenr.
- btrfs_lookup_first_ordered_extent() may not return the first ordered
extent in the range
It doesn't ensure the first ordered extent is returned.
The existing callers are only interested in exhausting all ordered
extents in a range, the order is not important.
For the incoming btrfs_invalidatepage() refactoring, we need a way to
properly iterate through all ordered extents of a range in bytenr order.
So this patch will introduce a new function,
btrfs_lookup_first_ordered_range(), to do the ordered extent lookup with
bytenr order awareness and an extra range check.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The existing comments in btrfs_invalidatepage() don't really get to the
point, especially for what Private2 is really representing and how the
race avoidance is done.
The truth is, there are only three entry points that do ordered extent
accounting:
- btrfs_writepage_endio_finish_ordered()
- __endio_write_update_ordered()
Those two entry points are just the endio functions for direct IO and
buffered write.
- btrfs_invalidatepage()
But there is a pitfall: in the endio functions there is no check on whether
the ordered extent has already been accounted.
They just blindly clear the Private2 bit and do the accounting.
So it's entirely btrfs_invalidatepage()'s responsibility to make sure we
won't do double accounting for the same sector.
That's why in btrfs_invalidatepage() we have to wait for page writeback;
this will ensure all submitted bios have finished, thus their endio
functions have finished the accounting on the ordered extent.
Then we also check the page Private2 bit to ensure that we only run ordered
extent accounting on pages which have no bio submitted.
This patch will rework related comments to make it more clear on the
race and how we use wait_on_page_writeback() and Private2 to prevent
double accounting on ordered extent.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Btrfs has two endio functions to mark certain io range finished for
ordered extents:
- __endio_write_update_ordered()
This is for direct IO
- btrfs_writepage_endio_finish_ordered()
This is for buffered IO.
However they take different routes to handle ordered extent IO:
- Whether to iterate through all ordered extents
__endio_write_update_ordered() will but
btrfs_writepage_endio_finish_ordered() will not.
In fact, iterating through all ordered extents will benefit later
subpage support, while for the current PAGE_SIZE == sectorsize case
this behavior makes no difference.
- Whether to update page Private2 flag
__endio_write_update_ordered() will not update the page Private2 flag, as
for iomap direct IO the page can not even be mapped.
While btrfs_writepage_endio_finish_ordered() will clear Private2 to
prevent double accounting against btrfs_invalidatepage().
Those differences are pretty subtle, and the ordered extent iteration
code in the callers makes the code much harder to read.
So this patch will introduce a new function,
btrfs_mark_ordered_io_finished(), to do the heavy lifting:
- Iterate through all ordered extents in the range
- Do the ordered extent accounting
- Queue the work for finished ordered extent
This function has two new features:
- Proper underflow detection and recovery
The old underflow detection would only detect the problem, then
continue.
It printed no proper info like root/inode/ordered extent, nor was it
noisy enough to be caught by fstests.
Furthermore, when an underflow happens, the ordered extent will never
finish.
The new error detection will reset bytes_left to 0, emit a proper
kernel warning, and output extra info including the root, ino, ordered
extent range, and the underflow value (see the sketch after this list).
- Prevent double accounting based on Private2 flag
Now if we find a range without the Private2 flag, we will skip to the next
range.
As that means someone else has already finished the accounting of
ordered extent.
This makes no difference for current code, but will be a critical part
for incoming subpage support, as we can call
btrfs_mark_ordered_io_finished() for multiple sectors if they are
beyond inode size.
Thus such double accounting prevention is a key feature for subpage.
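For the underflow handling mentioned above, the logic boils down to roughly
the following; the wrapper helper and the exact warning text are
assumptions, not the verbatim patch:

static void dec_ordered_bytes_left(struct btrfs_fs_info *fs_info,
				   struct btrfs_ordered_extent *ordered,
				   u64 to_dec)
{
	if (WARN_ON_ONCE(to_dec > ordered->bytes_left)) {
		btrfs_crit(fs_info,
			   "bad ordered extent accounting, left=%llu to_dec=%llu",
			   ordered->bytes_left, to_dec);
		/* Reset to 0 so the ordered extent can still finish */
		ordered->bytes_left = 0;
		return;
	}
	ordered->bytes_left -= to_dec;
}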
Now both endio functions only need to call that new function.
And since the only caller of btrfs_dec_test_first_ordered_pending() is
removed, also remove btrfs_dec_test_first_ordered_pending() completely.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we use page Private2 bit to indicate that we have ordered
extent for the page range.
But its lifespan is not consistent: during the regular writeback path,
there are two locations that clear the same PagePrivate2 bit:
T ----- Page marked Dirty
|
+ ----- Page marked Private2, through btrfs_run_delalloc_range()
|
+ ----- Page cleared Private2, through btrfs_writepage_cow_fixup()
|       in __extent_writepage_io()
|       ^^^ Private2 cleared for the first time
|
+ ----- Page marked Writeback, through btrfs_set_range_writeback()
|       in __extent_writepage_io().
|
+ ----- Page cleared Private2, through
|       btrfs_writepage_endio_finish_ordered()
|       ^^^ Private2 cleared for the second time.
|
+ ----- Page cleared Writeback, through
        btrfs_writepage_endio_finish_ordered()
Currently PagePrivate2 is mostly there to prevent ordered extent accounting
from being executed by both endio and invalidatepage.
Thus only the one who cleared the page Private2 bit is responsible for
ordered extent accounting.
But the fact is, in btrfs_writepage_endio_finish_ordered(), page
Private2 is cleared and ordered extent accounting is executed
unconditionally.
The race prevention only happens through btrfs_invalidatepage(), where
we wait for the page writeback first, before checking the Private2 bit.
This means Private2 is also protected by the Writeback bit, and there is no
need for btrfs_writepage_cow_fixup() to clear Private2.
This patch will change btrfs_writepage_cow_fixup() to just check
PagePrivate2, not to clear it.
The clearing will happen in either btrfs_invalidatepage() or
btrfs_writepage_endio_finish_ordered().
This makes the Private2 bit easier to understand, just meaning the page
has unfinished ordered extent attached to it.
And this patch is a hard requirement for the incoming refactoring of
how we finish ordered IO in the endio context, as the coming patch will
check Private2 to determine if we need to do the ordered extent
accounting. Thus this patch is definitely needed or we will hang due to
unfinished ordered extents.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a pretty bad abuse of btrfs_writepage_endio_finish_ordered() in
end_compressed_bio_write().
It passes compressed pages to btrfs_writepage_endio_finish_ordered(),
which is only supposed to accept inode pages.
Thankfully the important info here is the inode, so let's pass
btrfs_inode directly into btrfs_writepage_endio_finish_ordered(), and
make the @page parameter optional.
By this, end_compressed_bio_write() can happily pass page=NULL while
still getting everything done properly.
Also, to cooperate with this modification, replace the @page parameter of
trace_btrfs_writepage_end_io_hook() with btrfs_inode.
Although this removes the page_index info, the existing start/len should be
enough for most usage.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For subpage metadata write, we're currently reusing two functions of the
regular write path:
- end_bio_extent_buffer_writepage()
- write_one_eb()
But the truth is, for subpage we just call
end_bio_subpage_eb_writepage() without using any bit in
end_bio_extent_buffer_writepage().
For write_one_eb(), it's pretty similar, but with a small part of code
reused.
There is really no need to pollute the existing code path if we're not
really using most of it.
So this patch will do the following change to separate the subpage
metadata write path from regular write path by:
- Use end_bio_subpage_eb_writepage() directly as endio in
write_one_subpage_eb()
- Directly call write_one_subpage_eb() in submit_eb_subpage()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a lot of code inside extent_io.c that needs both "struct bio
**bio_ret" and "unsigned long prev_bio_flags", along with some
parameters like "unsigned long bio_flags".
Such strange parameters exist for bio assembly.
For example, we have such inode page layout:
0 4K 8K 12K
|<-- Extent A-->|<- EB->|
Then what we do is:
- Page [0, 4K)
*bio_ret = NULL
So we allocate a new bio for *bio_ret,
and add page [0, 4K) to *bio_ret.
- Page [4K, 8K)
*bio_ret != NULL
We found this page is continuous to *bio_ret,
and if we're not at stripe boundary, we
add page [4K, 8K) to *bio_ret.
- Page [8K, 12K)
*bio_ret != NULL
But we found this page is not continuous, so
we submit *bio_ret, then allocate a new bio,
and add page [8K, 12K) to the new bio.
This means we need to record both the bio and its bio_flags, but we
record them manually using that strange parameter list, instead of
encapsulating them into their own structure.
So this patch will introduce a new structure, btrfs_bio_ctrl, to record
both the bio, and its bio_flags.
Also, in above case, for all pages added to the bio, we need to check if
the new page crosses stripe boundary. This check itself can be time
consuming, and we don't really need to do that for each page.
This patch also integrates the stripe boundary check into btrfs_bio_ctrl.
When a new bio is allocated, the stripe and ordered extent boundaries are
also calculated, so no matter how large the bio will be, we only
calculate the boundaries once, saving some CPU time.
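The new structure then looks roughly like this (a sketch following the
description above; the exact member names are assumptions):

struct btrfs_bio_ctrl {
	struct bio *bio;
	unsigned long bio_flags;
	/* Pre-calculated when the bio is allocated */
	u64 len_to_stripe_boundary;
	u64 len_to_oe_boundary;
};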
The following functions/structures are affected:
- struct extent_page_data
Replace its bio pointer with structure btrfs_bio_ctrl (embedded
structure, not pointer)
- end_write_bio()
- flush_write_bio()
Just change how bio is fetched
- btrfs_bio_add_page()
Use pre-calculated boundaries instead of re-calculating them.
And use @bio_ctrl to replace @bio and @prev_bio_flags.
- calc_bio_boundaries()
New function
- submit_extent_page() callers
- btrfs_do_readpage() callers
- contiguous_readpages() callers
Use @bio_ctrl to replace @bio and @prev_bio_flags, and change how the
bio is grabbed.
- btrfs_bio_fits_in_ordered_extent()
Removed, as now the ordered extent size limit is done at bio
allocation time, no need to check for each page range.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Function btrfs_bio_fits_in_stripe() now requires a bio with at least one
page added, or btrfs_get_chunk_map() will fail with -ENOENT.
But in fact this requirement is not needed at all, as we can just pass
sectorsize to btrfs_get_chunk_map().
This tiny behavior change is important for later subpage refactoring on
submit_extent_page().
As for 64K page size, we can have a page range with pgoff=0 and size=64K.
If the logical bytenr is just 16K before the stripe boundary, we have to
split the page range into two bios.
This means we must check the page range against the stripe boundary, even
when adding the range to an empty bio.
This tiny refactoring is for the incoming changes, but on its own,
regular sectorsize == PAGE_SIZE is not affected anyway.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter @len is not really used in btrfs_bio_fits_in_stripe(),
just remove it.
Its use was removed in 4203431319 ("btrfs: let callers of
btrfs_get_io_geometry pass the em"); before that, btrfs_get_chunk_map()
utilized it.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently free space cache inode size is determined by two factors:
- block group size
- PAGE_SIZE
This means that for block groups of the same size, but with different
PAGE_SIZE, the calculation will result in different inode sizes.
This will not be a good thing for subpage support, so change the
dependency from PAGE_SIZE to sectorsize.
Now for the same 4K sectorsize btrfs, it should result in the same inode
size no matter what the PAGE_SIZE is.
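If the calculation follows the pre-existing free space cache setup logic,
the change amounts to something like the sketch below; the surrounding
function and variable names are assumptions:

/* Roughly: size the cache inode by block group size, in sectors not pages */
u64 cache_size = div_u64(block_group->length, SZ_256M);

if (!cache_size)
	cache_size = 1;
cache_size *= 16;
cache_size *= fs_info->sectorsize;	/* was PAGE_SIZE before this change */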
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
For the following file layout, scrub will not be able to repair these
two repairable errors, and will in fact make one corruption even
unrepairable:
inode offset    0      4K      8K
Mirror 1        |XXXXXX|       |
Mirror 2        |      |XXXXXX|
[CAUSE]
The root cause is the hard coded PAGE_SIZE, which makes scrub repair
go crazy for subpage.
For the above case, when reading the first sector, we use PAGE_SIZE instead
of sectorsize to read, which makes us read the full range [0, 64K).
In fact, after 8K there may be no data at all, so we can just get some
garbage.
Then when doing the repair, we also write back a full page from mirror 2.
This means we will also write the corrupted data in mirror 2 back
to mirror 1, leaving the range [4K, 8K) unrepairable.
[FIX]
This patch will replace the following uses of PAGE_SIZE with sectorsize:
- scrub_print_warning_inode()
Remove the min() and replace PAGE_SIZE with sectorsize.
The min() makes no sense, as csum is done for the full sector with
padding.
This fixes a bug where subpage reports an extra length like:
checksum error at logical 298844160 on dev /dev/mapper/arm_nvme-test,
physical 575668224, root 5, inode 257, offset 0, length 12288, links 1 (path: file)
Where the error is only 1 sector.
- scrub_handle_errored_block()
Comments with PAGE|page involved, all changed to sector.
- scrub_setup_recheck_block()
- scrub_repair_page_from_good_copy()
- scrub_add_page_to_wr_bio()
- scrub_wr_submit()
- scrub_add_page_to_rd_bio()
- scrub_block_complete()
Replace PAGE_SIZE with sectorsize.
This solves several problems where we read/write extra range for
subpage case.
The RAID56 code is excluded intentionally, as RAID56 has extra PAGE_SIZE
usage and is not really safe enough.
Thus we will reject RAID56 for subpage in a later commit.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of calling list_entry with head->prev, simply call
list_last_entry, which makes it obvious which member of the list is
being referred to. This also allows removing the extra 'prev' pointer.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit e5d7490236 ("btrfs: derive maximum output size in the
compression implementation") removed the @max_out argument from
btrfs_compress_pages() but its comment remained, so remove it.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Patch "btrfs: reduce compressed_bio member's types" reduced some
member's size. Function arguments @len, @compressed_len and @nr_pages
can be declared as unsigned int.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Patch "btrfs: reduce compressed_bio member's types" reduced some
member's size. Declare the variables @compressed_len, @nr_pages and
@pg_index size as an unsigned int in the function
btrfs_submit_compressed_read.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When logging an inode we always log all its xattrs, so that we are able
to figure out which ones should be deleted during log replay. However this
is unnecessary when we are doing a fast fsync and no xattrs were added,
changed or deleted since the last time we logged the inode in the current
transaction.
So skip the logging of xattrs when the inode was previously logged in the
current transaction and no xattrs were added, changed or deleted. If any
changes to xattrs happened, then the inode has BTRFS_INODE_COPY_EVERYTHING
set in its runtime flags and the xattrs get logged. This saves time on
scanning for xattrs, allocating memory, COWing log tree extent buffers and
adding more lock contention on the extent buffers when there are multiple
tasks logging in parallel.
The use of xattrs is common when using ACLs, some applications, or when
using security modules like SELinux where every inode gets a security
xattr added to it.
The following test script, using fio, was used on a box with 12 cores, 64G
of RAM, a NVMe device and the default non-debug kernel config from Debian.
It uses 8 concurrent jobs each writing in blocks of 64K to its own 4G file,
each file with a single xattr of 50 bytes (about the same size for an ACL
or SELinux xattr), doing random buffered writes with an fsync after each
write.
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/test
MOUNT_OPTIONS="-o ssd"
MKFS_OPTIONS="-d single -m single"
NUM_JOBS=8
FILE_SIZE=4G
cat <<EOF > /tmp/fio-job.ini
[writers]
rw=randwrite
fsync=1
fallocate=none
group_reporting=1
direct=0
bs=64K
ioengine=sync
size=$FILE_SIZE
directory=$MNT
numjobs=$NUM_JOBS
EOF
echo "performance" | \
tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo "Creating files before fio runs, each with 1 xattr of 50 bytes"
for ((i = 0; i < $NUM_JOBS; i++)); do
path="$MNT/writers.$i.0"
truncate -s $FILE_SIZE $path
setfattr -n user.xa1 -v $(printf '%0.sX' $(seq 50)) $path
done
fio /tmp/fio-job.ini
umount $MNT
fio output before this change:
WRITE: bw=120MiB/s (126MB/s), 120MiB/s-120MiB/s (126MB/s-126MB/s), io=32.0GiB (34.4GB), run=272145-272145msec
fio output after this change:
WRITE: bw=142MiB/s (149MB/s), 142MiB/s-142MiB/s (149MB/s-149MB/s), io=32.0GiB (34.4GB), run=230408-230408msec
+16.8% throughput, -16.6% runtime
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Accept device name "cancel" as a request to cancel running device
deletion operation. The string is literal, in case there's a real device
named "cancel", pass it as full absolute path or as "./cancel"
This works for v1 and v2 ioctls when the device is specified by name.
Moving chunks from the device uses relocation, use the conditional
exclusive operation start and cancellation helpers
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Accept literal string "cancel" as resize operation and interpret that
as a request to cancel the running operation. If it's running, wait
until it finishes current work and return ECANCELED.
Shrinking resize uses relocation to move the chunks away, use the
conditional exclusive operation start and cancellation helpers.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
To support optional cancellation of some operations, add a helper that will
wrap all the combinations. In normal mode it's the same as
btrfs_exclop_start; in cancellation mode it checks if the operation is
already running, requests cancellation and waits until completion.
The error codes can be returned to user space and the semantics are not
changed, apart from the addition of ECANCELED. This should be evaluated as
an error: the operation has not completed, and it should be restarted
or the filesystem status reviewed.
Signed-off-by: David Sterba <dsterba@suse.com>
Add try-lock for exclusive operation start to allow callers to do more
checks. The same operation must already be running. The try-lock and
unlock must pair and are a substitute for btrfs_exclop_start, thus it
must also pair with btrfs_exclop_finish to release the exclop context.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add support code that will allow canceling relocation at the chunk
granularity. This is different from and independent of balance, which also
uses relocation but is a higher level operation and manages its own
state and pause/cancellation requests.
Relocation is used for resize (shrink) and device deletion so this will
be a common point to implement cancellation for both. The context is
entirely in btrfs_relocate_block_group and btrfs_recover_relocation,
enclosing one chunk relocation. The status bit is set and unset between
the chunks. As relocation can take long, the effects may not be
immediate and the request and actual action can slightly race.
The fs_info::reloc_cancel_req is only supposed to be increased and does
not pair with decrement like fs_info::balance_cancel_req.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The exclusive operation is now atomically checked and set using bit
operations. Switch it to protection by a spinlock. The super block lock is
not frequently used and adding a new lock seems like overkill, so it
should be safe to reuse it.
The reason to use a spinlock is to enhance the locking context so more
checks can be done, eg. allowing the same exclusive operation to enter
the exclop section and cancel the running one. This will be used for
resize and device delete.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Move header offsetof() to the expression that calculates the address so
it's part of get_eb_offset_in_page where the 2nd parameter is the member
offset.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The verification copies the calculated checksum bytes to a temporary
buffer but this is not necessary. We can map the eb header on the first
page and use the checksum bytes directly.
This saves at least one function call and boundary checks so it could
lead to a minor performance improvement.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The s_id is already printed by message helpers.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several members of compressed_bio are of a type that's unnecessarily big
for the values that they'd hold:
- the size of the uncompressed and compressed data is at most 128K now, we
can keep it as int
- same for the number of pages
- the compress type fits into a byte
- errors is 0/1
The size of the unpatched structure is 80 bytes with several holes.
By reordering nr_pages next to the pages pointer, the hole after
pending_bios is filled and the resulting size is 56 bytes. This keeps the
csums array
aligned to 8 bytes, which is nice. Further size optimizations may be
possible but right now it looks good to me:
struct compressed_bio {
refcount_t pending_bios; /* 0 4 */
unsigned int nr_pages; /* 4 4 */
struct page * * compressed_pages; /* 8 8 */
struct inode * inode; /* 16 8 */
u64 start; /* 24 8 */
unsigned int len; /* 32 4 */
unsigned int compressed_len; /* 36 4 */
u8 compress_type; /* 40 1 */
u8 errors; /* 41 1 */
/* XXX 2 bytes hole, try to pack */
int mirror_num; /* 44 4 */
struct bio * orig_bio; /* 48 8 */
u8 sums[]; /* 56 0 */
/* size: 56, cachelines: 1, members: 12 */
/* sum members: 54, holes: 1, sum holes: 2 */
/* last cacheline: 56 bytes */
};
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are common values set for the stripe constraints, some of them
are already factored out. Do that for increment and mirror_num as well.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When log recovery is in progress, lots of operations have to take that
into account, so we keep this status per tree during the operation. A long
time ago, the error handling revamp in 79787eaab4 ("btrfs: replace many
BUG_ONs with proper error handling") removed clearing of the status in
an error branch. Add it back as was intended in e02119d5a7 ("Btrfs:
Add a write ahead tree log to optimize synchronous operations").
There are probably no visible effects, log replay is done only during
mount and if it fails all structures are cleared so the stale status
won't be kept.
Fixes: 79787eaab4 ("btrfs: replace many BUG_ONs with proper error handling")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The defrag loop processes leaves in batches, starting a transaction for
each batch. The whole defragmentation on a given root is protected by a
bit, but in case the transaction fails, the bit is not cleared.
The stale bit would prevent starting defragmentation again, so make sure
it's cleared.
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The type of discard_bitmap_bytes and discard_extent_bytes is u64 so the
format should be %llu, though the actual values would hardly ever
overflow to negative values.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While stress testing our error handling I noticed that sometimes we
would still commit the transaction even though we had already aborted
it.
Currently we track if a trans handle has dirtied any metadata, and if it
hasn't we mark the filesystem as having an error (so no new transactions
can be started), but we will allow the current transaction to complete
as we do not mark the transaction itself as having been aborted.
This sounds good in theory, but we were not properly tracking IO errors
in btrfs_finish_ordered_io, and thus committing the transaction with
bogus free space data. This isn't necessarily a problem per se with the
free space cache, as the other guards in place would have kept us from
accepting the free space cache as valid, but highlights a real world
case where we had a bug and could have corrupted the filesystem because
of it.
This "skip abort on empty trans handle" is nice in theory, but assumes
we have perfect error handling everywhere, which we clearly do not.
Also we do not allow further transactions to be started, so all this
does is save the last transaction that was happening, which doesn't
necessarily gain us anything other than the potential for real
corruption.
Remove this particular bit of code: if we decide we need to abort the
transaction, then abort the current one and keep ourselves from doing real
harm to the filesystem, regardless of whether this specific trans handle
dirtied anything or not.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_truncate() where we truncate the inode either to the same size
or to a smaller size, we always set the full sync flag on the inode.
This is needed in case the truncation drops or trims any file extent items
that start beyond or cross the new inode size, so that the next fsync
drops all inode items from the log and scans again the fs/subvolume tree
to find all items that must be logged.
However if the truncation does not drop or trim any file extent items, we
do not need to set the full sync flag and force the next fsync to use the
slow code path. So do not set the full sync flag in such cases.
One use case where it is frequent to do truncations that do not change
the inode size and do not drop any extents (no prealloc extents beyond
i_size) is when running Microsoft's SQL Server inside a Docker container.
One example workload is the one Philipp Fent reported recently, in the
thread with a link below. In this workload a large number of fsyncs are
preceded by such truncate operations.
After this change I consistently get the runtime for that workload from
Philipp reduced by about 12%, for example from 184 seconds down
to 162 seconds.
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The comment at the top of btrfs_truncate() mentions that csum items are
dropped or truncated to the new i_size, but this is wrong and makes no
sense, as they are unrelated to the i_size and are located in the csums
tree, and
not on a tree with inode items (fs/subvolume tree or a log tree). Instead
that claim applies to file extent items, so fix the comment to refer to
them instead.
While at it make the whole comment for the function more descriptive and
follow the kernel doc style.
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to update the delayed inode we need to abort the transaction,
because we could leave an inode with the improper counts or some other
such corruption behind.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we get an error while looking up the inode item we'll simply bail
without cleaning up the delayed node. This results in this style of
warning happening on commit:
WARNING: CPU: 0 PID: 76403 at fs/btrfs/delayed-inode.c:1365 btrfs_assert_delayed_root_empty+0x5b/0x90
CPU: 0 PID: 76403 Comm: fsstress Tainted: G W 5.13.0-rc1+ #373
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
RIP: 0010:btrfs_assert_delayed_root_empty+0x5b/0x90
RSP: 0018:ffffb8bb815a7e50 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff95d6d07e1888 RCX: ffff95d6c0fa3000
RDX: 0000000000000002 RSI: 000000000029e91c RDI: ffff95d6c0fc8060
RBP: ffff95d6c0fc8060 R08: 00008d6d701a2c1d R09: 0000000000000000
R10: ffff95d6d1760ea0 R11: 0000000000000001 R12: ffff95d6c15a4d00
R13: ffff95d6c0fa3000 R14: 0000000000000000 R15: ffffb8bb815a7e90
FS: 00007f490e8dbb80(0000) GS:ffff95d73bc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f6e75555cb0 CR3: 00000001101ce001 CR4: 0000000000370ef0
Call Trace:
btrfs_commit_transaction+0x43c/0xb00
? finish_wait+0x80/0x80
? vfs_fsync_range+0x90/0x90
iterate_supers+0x8c/0x100
ksys_sync+0x50/0x90
__do_sys_sync+0xa/0x10
do_syscall_64+0x3d/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
The iref isn't dropped, which leaves an elevated node->count,
so any release just re-queues it onto the delayed inodes list. Fix this
by going to the out label to handle the proper cleanup of the delayed
node.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Right now we only clean up the delayed iref if we have
BTRFS_DELAYED_NODE_DEL_IREF set on the node. However we have some error
conditions that need to clean up the iref if it still exists, so to make
this code cleaner move the test_bit into btrfs_release_delayed_iref
itself and unconditionally call it in each of the cases instead.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a sysfs interface to limit io during scrub. We relied on the ionice
interface to do that, eg. the idle class kept the system usable while
scrub was running. This changed when mq-deadline became widespread and
did not implement the scheduling classes. That was a CFQ thing that got
deleted. We've got numerous complaints from users about degraded
performance.
Currently only BFQ supports that but it's not a common scheduler and we
can't ask everybody to switch to it.
Alternatively the cgroup io limiting can be used, but that is also a
non-trivial setup (v2 required, the controller must be enabled on the
system). This can still be used if desired.
Other ideas that have been explored: piggy-back on ionice (that is set
per-process and is accessible) and interpret the class and classdata as
bandwidth limits, but this does not have enough flexibility as there are
only 8 allowed and we'd have to map fixed limits to each value. Also
adjusting the value would need to lookup the process that currently runs
scrub on the given device, and the value is not sticky so would have to
be adjusted each time scrub runs.
Running out of options, sysfs does not look that bad:
- it's accessible from scripts, or udev rules
- the name is similar to what MD-RAID has
(/proc/sys/dev/raid/speed_limit_max or /sys/block/mdX/md/sync_speed_max)
- the value is sticky at least for filesystem mount time
- adjusting the value has immediate effect
- sysfs is available in constrained environments (eg. system rescue)
- the limit also applies to device replace
Sysfs:
- raw value is in bytes
- values written to the file accept suffixes like K, M
- file is in the per-device directory /sys/fs/btrfs/FSID/devinfo/DEVID/scrub_speed_max
- 0 means use default priority of IO
The scheduler is a simple deadline one and the accuracy is up to the
nearest 128K.
Signed-off-by: David Sterba <dsterba@suse.com>
To be able to construct a zone append bio we need to look up the
btrfs_device. The code doing the chunk map lookup to get the device is
present in btrfs_submit_compressed_write and submit_extent_page.
Factor out the lookup calls into a helper and use it in the submission
paths.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When inode defrag is canceled, the error is set to EAGAIN but then
overwritten by the number of defragmented bytes. As this would hide the
error, rather return EAGAIN. This does not harm 'btrfs fi defrag', it
will print the error and continue to the next file (as it does for any
other error).
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
The io_failure_record::in_validation member was introduced to handle a
failed bio which crosses several sectors. In such a case, we still need to
verify which sectors are corrupted.
But since we've changed the way we handle corrupted sectors, by only
submitting repair for each corrupted sector, there is no need for the extra
validation any more.
This patch will clean up all io_failure_record::in_validation related
code.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_submit_read_repair() has some extra checks on whether the
failed bio needs extra validation for repair. But we can avoid all
these extra mechanisms if we submit the repair for each sector.
By this, each read repair can be easily handled without the need to
verify which sector is corrupted.
This will also benefit subpage, as one subpage bvec can contain several
sectors, making the extra verification more complex.
So this patch will:
- Introduce repair_one_sector()
The main code submitting repair, which is more or less the same as old
btrfs_submit_read_repair().
But this time, it only repairs one sector.
- Make btrfs_submit_read_repair() handle sectors differently
There are 3 different cases:
* Good sector
We need to release the page and extent, set the range uptodate.
* Bad sector and failed to submit repair bio
We need to release the page and extent, but not set the range
uptodate.
* Bad sector but repair bio submitted
The page and extent release will be handled by the submitted repair
bio. Nothing needs to be done.
Since btrfs_submit_read_repair() will handle the page and extent
release now, we need to skip to the next bvec even if we hit an error.
- Change the lifespan of @uptodate in end_bio_extent_readpage()
Since btrfs_submit_read_repair() will now handle the full bvec
which contains any corruption, we don't need to bother updating the
@uptodate bit anymore.
Just let @uptodate be a local variable inside the main loop,
so that any error from one bvec won't affect later bvecs.
- Only export btrfs_repair_one_sector(), unexport
btrfs_submit_read_repair()
The only outside caller for read repair is DIO, which already submits
its repair for just one sector.
Only export btrfs_repair_one_sector() for DIO.
This patch will focus on the change on the repair path, the extra
validation code is still kept as is, and will be cleaned up later.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will provide the basis for later per-sector repair for subpage,
while still keeping the existing code happy.
If all csums match, the return value will be 0, same as now.
Only when a csum mismatches is the return value different.
The new return value will be a bitmap; for 4K sectorsize and 4K page
size, a failure will be 1, instead of the -EIO (which is not used
directly by the callers, so there is no effective change).
But for 4K sectorsize and 64K page size, aka subpage case, since the
bvec can contain multiple sectors, knowing which sectors are corrupted
will allow us to submit repair only for corrupted sectors.
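A sketch of the new convention (not the exact patch; check_one_sector()
stands in for the existing single-sector csum verification and the other
names are assumptions):

static unsigned int check_data_csums(struct btrfs_fs_info *fs_info,
				     struct page *page, u32 pgoff,
				     u64 start, u32 len)
{
	const u32 sectorsize = fs_info->sectorsize;
	unsigned int error_bitmap = 0;
	int nr_sectors = len / sectorsize;
	int i;

	for (i = 0; i < nr_sectors; i++) {
		/* check_one_sector() is hypothetical: verify one sector's csum */
		if (check_one_sector(fs_info, page, pgoff + i * sectorsize,
				     start + i * sectorsize))
			error_bitmap |= (1U << i);
	}
	/* 0 still means all sectors matched, just like before */
	return error_bitmap;
}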
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The 'check_async_write' function is a helper used in
'btrfs_submit_metadata_bio' and it checks if asynchronous writing can be
used for metadata.
Make the function return bool and get rid of the local variable async in
btrfs_submit_metadata_bio storing the result of check_async_write's
tests.
As this is touching all function call sites, also rename it to
should_async_write as this is more in line with the naming we use.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we can't read a reliable write pointer from a sequential zone, fail
creating the block group with an I/O error.
Also, if the write pointer we read is beyond the end of the respective zone,
fail the creation of the block group on this zone with an I/O error.
While this could also happen in real world scenarios with misbehaving
drives, this issue addresses a problem uncovered by fstests' test case
generic/475.
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This extends patch 784daf2b96 ("btrfs: zoned: sanity check zone
type"), the message was supposed to be there but was lost during merge.
We want to make the error noticeable so add it.
Fixes: 784daf2b96 ("btrfs: zoned: sanity check zone type")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we decide to flush delalloc from the preemptive flusher, we really do
not want to wait on ordered extents, as it gains us nothing. However
there was logic to go ahead and wait on ordered extents if there was
more ordered bytes than delalloc bytes. We do not want this behavior,
so pass through whether this flushing is for preemption, and do not wait
for ordered extents if that's the case. Also break out of the shrink
loop after the first flush, as we just want a one-shot shrink of
delalloc.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While testing heavy delalloc workloads I noticed that sometimes we'd
just stop preemptively flushing when we had loads of delalloc available
to flush. This is because we skip preemptive flushing if delalloc <=
ordered. However if we start with, say, 4GiB of delalloc and we flush
2GiB of that, we'll stop flushing there, even though we still have 2GiB
of delalloc to flush.
Instead adjust the ordered bytes down by half, this way if 2/3 of our
outstanding delalloc reservations are tied up by ordered extents we
don't bother preemptive flushing, as we're getting close to the state
where we need to wait on ordered extents.
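A sketch of what the adjusted check amounts to (an illustration under
assumed field names, not the verbatim patch):

/* Decide whether preemptive delalloc flushing is still worthwhile */
static bool delalloc_flush_worthwhile(struct btrfs_fs_info *fs_info)
{
	/*
	 * Halving the ordered byte count means we keep flushing until
	 * roughly 2/3 of the outstanding reservations are already tied
	 * up in ordered extents.
	 */
	u64 ordered = percpu_counter_read_positive(&fs_info->ordered_bytes) >> 1;
	u64 delalloc = percpu_counter_read_positive(&fs_info->delalloc_bytes);

	return delalloc > ordered;
}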
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When deciding if we should preemptively flush space, we will add in the
amount of space used by all block rsvs. However this also includes the
global block rsv, which isn't flushable so shouldn't be accounted for in
this calculation. If we decide to use ->bytes_may_use in our used
calculation we need to subtract the global rsv size from this amount so
it most closely matches the flushable space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We calculate the amount of "free" space available for normal
reservations by taking the total space and subtracting out the hard used
space, which is readonly, used, and reserved space.
However we weren't taking into account the global block rsv, which is
essentially hard used space. Handle this by subtracting it from the
available free space, so that our threshold more closely mirrors
reality.
We need to do the check because it's possible that the global_rsv_size +
used is > total_bytes, sometimes the global reserve can end up being
calculated as larger than the available size (think small filesystems
where we only have the original 8MiB chunk of metadata). It doesn't
usually happen, but that can get us into trouble so this is safer.
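Roughly, the adjusted calculation looks like this sketch (field names from
struct btrfs_space_info; the helper itself is illustrative):

static u64 calc_available_free_space_sketch(struct btrfs_space_info *space_info,
					    u64 global_rsv_size)
{
	u64 used = space_info->bytes_used + space_info->bytes_readonly +
		   space_info->bytes_reserved + global_rsv_size;

	/* Small filesystems can have global_rsv_size + used > total_bytes */
	if (used >= space_info->total_bytes)
		return 0;
	return space_info->total_bytes - used;
}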
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Global rsv can't be used for normal allocations, and for very full file
systems we can decide to try and async flush constantly even though
there's really not a lot of space to reclaim. Deal with this by
including the global block rsv size in the "total used" calculation.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We were clamping the threshold for preemptive reclaim any time we added
a ticket to wait on, which if we have a lot of threads means we'd
essentially max out the clamp the first time we start to flush.
Instead of doing this, simply do it every time we have to start
flushing, this will make us ramp up gradually instead of going to max
clamping as soon as we start needing to do flushing.
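Conceptually the bump moves from ticket creation to the point where flushing actually starts; a sketch where the clamp field and its maximum of 8 follow the description of the preemptive reclaim code, but the surrounding condition is simplified:
    /* Bump the clamp only when we begin flushing, so it ramps up gradually. */
    if (!space_info->flush) {
        space_info->flush = 1;
        space_info->clamp = min(space_info->clamp + 1, 8);
    }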
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
need_preemptive_reclaim() does some calculations, which aren't heavy,
but if we're already running preemptive reclaim there's no reason to do
them at all, so re-order the checks so that we don't do the calculation
if we're already doing reclaim.
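The reorder amounts to short-circuiting on the cheap condition first; a sketch with made-up helper names standing in for the real checks:
    /* If the preemptive reclaim worker is already running, skip the math. */
    if (preemptive_reclaim_already_running(fs_info))
        return false;

    return preemptive_reclaim_threshold_exceeded(fs_info, space_info);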
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit b2598edf8b ("btrfs: remove unused argument seed from
btrfs_find_device") removed the argument seed from btrfs_find_device
but forgot the comment, so remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
try_lock_extent() returns 1 on success or 0 for failure and not an error
code. If try_lock_extent() fails, read_extent_buffer_subpage() returns
zero indicating subpage extent read success.
Return EAGAIN/EWOULDBLOCK if try_lock_extent() fails in locking the
extent.
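A sketch of the corrected handling in read_extent_buffer_subpage(), with the surrounding read path elided:
    /* try_lock_extent() returns 1 on success and 0 on contention, not -errno. */
    if (!try_lock_extent(io_tree, eb->start, eb->start + eb->len - 1)) {
        ret = -EAGAIN;
        goto out;
    }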
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fix from David Sterba:
"One more fix, for a space accounting bug in zoned mode. It happens
when a block group is switched back rw->ro and unusable bytes (due to
zoned constraints) are subtracted twice.
It has user visible effects so I consider it important enough for late
-rc inclusion and backport to stable"
* tag 'for-5.13-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: zoned: fix negative space_info->bytes_readonly
Consider we have a using block group on zoned btrfs.
|<- ZU ->|<- used ->|<---free--->|
`- Alloc offset
ZU: Zone unusable
Marking the block group read-only will migrate the zone unusable bytes
to the read-only bytes. So, we will have this.
|<- RO ->|<- used ->|<--- RO --->|
RO: Read only
When marking it back to read-write, btrfs_dec_block_group_ro()
subtracts the above "RO" bytes from the
space_info->bytes_readonly. And, it moves the zone unusable bytes back
and again subtracts those bytes from the space_info->bytes_readonly,
leading to negative bytes_readonly.
This can be observed in the output, e.g.:
Data, single: total=512.00MiB, used=165.21MiB, zone_unusable=16.00EiB
Data, single: total=536870912, used=173256704, zone_unusable=18446744073603186688
This commit fixes the issue by reordering the operations.
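A sketch of the reordered accounting in btrfs_dec_block_group_ro(), with locking and the zone_unusable recomputation elided:
    if (btrfs_is_zoned(cache->fs_info)) {
        /* Migrate the (recomputed) zone unusable bytes back first ... */
        sinfo->bytes_readonly -= cache->zone_unusable;
        sinfo->bytes_zone_unusable += cache->zone_unusable;
    }
    /* ... so they are not subtracted a second time as plain read-only bytes. */
    num_bytes = cache->length - cache->reserved - cache->pinned -
                cache->bytes_super - cache->zone_unusable - cache->used;
    sinfo->bytes_readonly -= num_bytes;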
Link: https://github.com/naota/linux/issues/37
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 169e0da91a ("btrfs: zoned: track unusable bytes for zones")
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The replacement is called copy_page_from_iter_atomic(); unlike the old primitive,
callers do *not* need to do iov_iter_advance() after it. In cases where they end
up consuming less than they'd been given, they need to do iov_iter_revert() on
everything they had not consumed. That, however, needs to be done only on slow
paths.
All in-tree callers converted. And that kills the last user of iterate_all_kinds()
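A sketch of the new calling convention from a filesystem write path; commit_write() here is a hypothetical stand-in for whatever accepts the copied bytes:
    size_t copied;
    ssize_t status;

    /* Copies and advances the iterator; no iov_iter_advance() afterwards. */
    copied = copy_page_from_iter_atomic(page, offset, bytes, iter);

    status = commit_write(file, pos, copied);    /* may accept fewer bytes */
    if (status >= 0 && status < copied) {
        /* Slow path only: hand back what was not actually consumed. */
        iov_iter_revert(iter, copied - status);
        copied = status;
    }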
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes that people hit during testing.
Zoned mode fix:
- fix 32bit value wrapping when calculating superblock offsets
Error handling fixes:
- properly check filesystem and device uuids
- properly return errors when marking extents as written
- do not write supers if we have an fs error"
* tag 'for-5.13-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: promote debugging asserts to full-fledged checks in validate_super
btrfs: return value from btrfs_mark_extent_written() in case of error
btrfs: zoned: fix zone number to sector/physical calculation
btrfs: do not write supers if we have an fs error
Syzbot managed to trigger this assert while performing its fuzzing.
Turns out it's better to have those asserts turned into full-fledged
checks so that when buggy btrfs images are mounted the user gets
an error and mounting is stopped. Otherwise, with CONFIG_BTRFS_ASSERT
disabled, such an image would have been erroneously allowed to be mounted.
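A sketch of the promotion from an assert to a full check in validate_super(); the exact fields and message wording in the real code may differ:
    /* Reject the superblock instead of only asserting in debug builds. */
    if (memcmp(fs_info->fs_devices->fsid, sb->fsid, BTRFS_FSID_SIZE) != 0) {
        btrfs_err(fs_info,
                  "superblock fsid doesn't match fsid of fs_devices: %pU != %pU",
                  sb->fsid, fs_info->fs_devices->fsid);
        ret = -EINVAL;
    }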
Reported-by: syzbot+a6bf271c02e4fe66b4e4@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add uuids to the messages ]
Signed-off-by: David Sterba <dsterba@suse.com>
We always return 0 even in case of an error in btrfs_mark_extent_written().
Fix it to return the proper error value in case of a failure. All callers
handle it.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_get_dev_zone_info(), we have "u32 sb_zone" and calculate "sector_t
sector" by shifting it. But the shift is done in 32 bits, so the result
wraps and "sector" ends up being 0 for the 2nd superblock copy.
Since zone number is u32, shifting it to sector (sector_t) or physical
address (u64) can easily trigger a missing cast bug like this.
This commit introduces helpers to convert zone number to sector/LBA, so we
won't fall into the same pitfall again.
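The helpers are conceptually just casts done before the shift; a sketch where the names and shift parameters are illustrative:
    /* Force the shift to happen in 64 bits, not in the u32 zone number. */
    static inline u64 zone_to_physical(u32 zone_num, u8 zone_size_shift)
    {
        return (u64)zone_num << zone_size_shift;
    }

    static inline sector_t zone_to_sector(u32 zone_num, u8 zone_sectors_shift)
    {
        return (sector_t)zone_num << zone_sectors_shift;
    }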
Reported-by: Dmitry Fomichev <Dmitry.Fomichev@wdc.com>
Fixes: 12659251ca ("btrfs: implement log-structured superblock for ZONED mode")
CC: stable@vger.kernel.org # 5.11+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Error injection testing uncovered a pretty severe problem where we could
end up committing a super that pointed to the wrong tree roots,
resulting in transid mismatch errors.
The way we commit the transaction is we update the super copy with the
current generations and bytenrs of the important roots, and then copy
that into our super_for_commit. Then we allow transactions to continue
again, we write out the dirty pages for the transaction, and then we
write the super. If the write out fails we'll bail and skip writing the
supers.
However since we've allowed a new transaction to start, we can have a
log attempting to sync at this point, which would be blocked on
fs_info->tree_log_mutex. Once the commit fails we're allowed to do the
log tree commit, which uses super_for_commit, which now points at fs
trees that were not written out.
Fix this by checking BTRFS_FS_STATE_ERROR once we acquire the
tree_log_mutex. This way if the transaction commit fails we're sure to
see this bit set and we can skip writing the super out. This patch
fixes this specific transid mismatch error I was seeing with this
particular error path.
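A sketch of the added check in the log sync path, with the error handling simplified:
    mutex_lock(&fs_info->tree_log_mutex);

    /*
     * If the transaction commit already failed, super_for_commit may point
     * at roots that were never written out, so do not write the supers.
     */
    if (test_bit(BTRFS_FS_STATE_ERROR, &fs_info->fs_state)) {
        ret = -EIO;
        goto out_unlock;
    }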
CC: stable@vger.kernel.org # 5.12+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Error handling improvements, caught by error injection:
- handle errors during checksum deletion
- set error on mapping when ordered extent io cannot be finished
- inode link count fixup in tree-log
- missing return value checks for inode updates in tree-log
- abort transaction in rename exchange if adding second reference
fails
Fixes:
- fix fsync failure after writes to prealloc extents
- fix deadlock when cloning inline extents and low on available space
- fix compressed writes that cross stripe boundary"
* tag 'for-5.13-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
MAINTAINERS: add btrfs IRC link
btrfs: fix deadlock when cloning inline extents and low on available space
btrfs: fix fsync failure and transaction abort after writes to prealloc extents
btrfs: abort in rename_exchange if we fail to insert the second ref
btrfs: check error value from btrfs_update_inode in tree log
btrfs: fixup error handling in fixup_inode_link_counts
btrfs: mark ordered extent and inode with error if we fail to finish
btrfs: return errors from btrfs_del_csums in cleanup_ref_head
btrfs: fix error handling in btrfs_del_csums
btrfs: fix compressed writes that cross stripe boundary
Replace the per-block device bd_mutex with a per-gendisk open_mutex,
thus simplifying locking wherever we deal with partitions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Roger Pau Monné <roger.pau@citrix.com>
Link: https://lore.kernel.org/r/20210525061301.2242282-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are a few cases where cloning an inline extent requires copying data
into a page of the destination inode. For these cases we are allocating
the required data and metadata space while holding a leaf locked. This can
result in a deadlock when we are low on available space because allocating
the space may flush delalloc and two deadlock scenarios can happen:
1) When starting writeback for an inode with a very small dirty range that
fits in an inline extent, we deadlock during the writeback when trying
to insert the inline extent, at cow_file_range_inline(), if the extent
is going to be located in the leaf for which we are already holding a
read lock;
2) After successfully starting writeback, for non-inline extent cases,
the async reclaim thread will hang waiting for an ordered extent to
complete if the ordered extent completion needs to modify the leaf
for which the clone task is holding a read lock (for adding or
replacing file extent items). So the cloning task will wait forever
on the async reclaim thread to make progress, which in turn is
waiting for the ordered extent completion which in turn is waiting
to acquire a write lock on the same leaf.
So fix this by making sure we release the path (and therefore the leaf)
every time we need to copy the inline extent's data into a page of the
destination inode, as by that time we do not need to have the leaf locked.
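In rough terms the fix looks like the sketch below; copy_inline_to_page() stands for the step that copies the inline data into the destination page, and its arguments here are simplified:
    if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
        /* Copy the inline data out while the leaf is still locked. */
        read_extent_buffer(leaf, inline_data,
                           btrfs_file_extent_inline_start(extent), inline_size);
        /*
         * Release the path (and the leaf lock) before reserving data and
         * metadata space, so flushing delalloc and waiting on ordered
         * extents can never deadlock on this leaf.
         */
        btrfs_release_path(path);
        ret = copy_inline_to_page(inode, new_pos, inline_data, inline_size);
    }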
Fixes: 05a5a7621c ("Btrfs: implement full reflink support for inline extents")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a series of partial writes to different ranges of preallocated
extents with transaction commits and fsyncs in between, we can end up with
checksum items in a log tree that have overlapping ranges. This causes an
fsync to fail with -EIO and abort the transaction, turning the filesystem
into RO mode, when syncing the
log.
For this to happen, we need to have a full fsync of a file following one
or more fast fsyncs.
The following example reproduces the problem and explains how it happens:
$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt
# Create our test file with 2 preallocated extents. Leave a 1M hole
# between them to ensure that we get two file extent items that will
# never be merged into a single one. The extents are contiguous on disk,
# which will later result in the checksums for their data to be merged
# into a single checksum item in the csums btree.
#
$ xfs_io -f \
-c "falloc 0 1M" \
-c "falloc 3M 3M" \
/mnt/foobar
# Now write to the second extent and leave only 1M of it as unwritten,
# which corresponds to the file range [4M, 5M[.
#
# Then fsync the file to flush delalloc and to clear full sync flag from
# the inode, so that a future fsync will use the fast code path.
#
# After the writeback triggered by the fsync we have 3 file extent items
# that point to the second extent we previously allocated:
#
# 1) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [3M, 4M[
#
# 2) One file extent item of type BTRFS_FILE_EXTENT_PREALLOC that covers
# the file range [4M, 5M[
#
# 3) One file extent item of type BTRFS_FILE_EXTENT_REG that covers the
# file range [5M, 6M[
#
# All these file extent items have a generation of 6, which is the ID of
# the transaction where they were created. The split of the original file
# extent item is done at btrfs_mark_extent_written() when ordered extents
# complete for the file ranges [3M, 4M[ and [5M, 6M[.
#
$ xfs_io -c "pwrite -S 0xab 3M 1M" \
-c "pwrite -S 0xef 5M 1M" \
-c "fsync" \
/mnt/foobar
# Commit the current transaction. This wipes out the log tree created by
# the previous fsync.
sync
# Now write to the unwritten range of the second extent we allocated,
# corresponding to the file range [4M, 5M[, and fsync the file, which
# triggers the fast fsync code path.
#
# The fast fsync code path sees that there is a new extent map covering
# the file range [4M, 5M[ and therefore it will log a checksum item
# covering the range [1M, 2M[ of the second extent we allocated.
#
# Also, after the fsync finishes we no longer have the 3 file extent
# items that pointed to 3 sections of the second extent we allocated.
# Instead we end up with a single file extent item pointing to the whole
# extent, with a type of BTRFS_FILE_EXTENT_REG and a generation of 7 (the
# current transaction ID). This is due to the file extent item merging we
# do when completing ordered extents into ranges that point to unwritten
# (preallocated) extents. This merging is done at
# btrfs_mark_extent_written().
#
$ xfs_io -c "pwrite -S 0xcd 4M 1M" \
-c "fsync" \
/mnt/foobar
# Now do some write to our file outside the range of the second extent
# that we allocated with fallocate() and truncate the file size from 6M
# down to 5M.
#
# The truncate operation sets the full sync runtime flag on the inode,
# forcing the next fsync to use the slow code path. It also changes the
# length of the second file extent item so that it represents the file
# range [3M, 5M[ and not the range [3M, 6M[ anymore.
#
# Finally fsync the file. Since this is a fsync that triggers the slow
# code path, it will remove all items associated to the inode from the
# log tree and then it will scan for file extent items in the
# fs/subvolume tree that have a generation matching the current
# transaction ID, which is 7. This means it will log 2 file extent
# items:
#
# 1) One for the first extent we allocated, covering the file range
# [0, 1M[
#
# 2) Another for the first 2M of the second extent we allocated,
# covering the file range [3M, 5M[
#
# When logging the first file extent item we log a single checksum item
# that has all the checksums for the entire extent.
#
# When logging the second file extent item, we also lookup for the
# checksums that are associated with the range [0, 2M[ of the second
# extent we allocated (file range [3M, 5M[), and then we log them with
# btrfs_csum_file_blocks(). However that results in ending up with a log
# that has two checksum items with ranges that overlap:
#
# 1) One for the range [1M, 2M[ of the second extent we allocated,
# corresponding to the file range [4M, 5M[, which we logged in the
# previous fsync that used the fast code path;
#
# 2) One for the ranges [0, 1M[ and [0, 2M[ of the first and second
# extents, respectively, corresponding to the files ranges [0, 1M[
# and [3M, 5M[. This one was added during this last fsync that uses
# the slow code path and overlaps with the previous one logged by
# the previous fast fsync.
#
# This happens because when logging the checksums for the second
# extent, we notice they start at an offset that matches the end of the
# checksums item that we logged for the first extent, and because both
# extents are contiguous on disk, btrfs_csum_file_blocks() decides to
# extend that existing checksums item and append the checksums for the
# second extent to this item. The end result is we end up with two
# checksum items in the log tree that have overlapping ranges, as
# listed before, resulting in the fsync to fail with -EIO and aborting
# the transaction, turning the filesystem into RO mode.
#
$ xfs_io -c "pwrite -S 0xff 0 1M" \
-c "truncate 5M" \
-c "fsync" \
/mnt/foobar
fsync: Input/output error
After running the example, dmesg/syslog shows the tree checker complained
about the checksum items with overlapping ranges and we aborted the
transaction:
$ dmesg
(...)
[756289.557487] BTRFS critical (device sdc): corrupt leaf: root=18446744073709551610 block=30720000 slot=5, csum end range (16777216) goes beyond the start range (15728640) of the next csum item
[756289.560583] BTRFS info (device sdc): leaf 30720000 gen 7 total ptrs 7 free space 11677 owner 18446744073709551610
[756289.562435] BTRFS info (device sdc): refs 2 lock_owner 0 current 2303929
[756289.563654] item 0 key (257 1 0) itemoff 16123 itemsize 160
[756289.564649] inode generation 6 size 5242880 mode 100600
[756289.565636] item 1 key (257 12 256) itemoff 16107 itemsize 16
[756289.566694] item 2 key (257 108 0) itemoff 16054 itemsize 53
[756289.567725] extent data disk bytenr 13631488 nr 1048576
[756289.568697] extent data offset 0 nr 1048576 ram 1048576
[756289.569689] item 3 key (257 108 1048576) itemoff 16001 itemsize 53
[756289.570682] extent data disk bytenr 0 nr 0
[756289.571363] extent data offset 0 nr 2097152 ram 2097152
[756289.572213] item 4 key (257 108 3145728) itemoff 15948 itemsize 53
[756289.573246] extent data disk bytenr 14680064 nr 3145728
[756289.574121] extent data offset 0 nr 2097152 ram 3145728
[756289.574993] item 5 key (18446744073709551606 128 13631488) itemoff 12876 itemsize 3072
[756289.576113] item 6 key (18446744073709551606 128 15728640) itemoff 11852 itemsize 1024
[756289.577286] BTRFS error (device sdc): block=30720000 write time tree block corruption detected
[756289.578644] ------------[ cut here ]------------
[756289.579376] WARNING: CPU: 0 PID: 2303929 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.580857] Modules linked in: btrfs dm_zero dm_dust loop dm_snapshot (...)
[756289.591534] CPU: 0 PID: 2303929 Comm: xfs_io Tainted: G W 5.12.0-rc8-btrfs-next-87 #1
[756289.592580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[756289.594161] RIP: 0010:csum_one_extent_buffer+0xed/0x100 [btrfs]
[756289.595122] Code: 5d c3 e8 76 60 (...)
[756289.597509] RSP: 0018:ffffb51b416cb898 EFLAGS: 00010282
[756289.598142] RAX: 0000000000000000 RBX: fffff02b8a365bc0 RCX: 0000000000000000
[756289.598970] RDX: 0000000000000000 RSI: ffffffffa9112421 RDI: 00000000ffffffff
[756289.599798] RBP: ffffa06500880000 R08: 0000000000000000 R09: 0000000000000000
[756289.600619] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
[756289.601456] R13: ffffa0652b1d8980 R14: ffffa06500880000 R15: 0000000000000000
[756289.602278] FS: 00007f08b23c9800(0000) GS:ffffa0682be00000(0000) knlGS:0000000000000000
[756289.603217] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[756289.603892] CR2: 00005652f32d0138 CR3: 000000025d616003 CR4: 0000000000370ef0
[756289.604725] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[756289.605563] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[756289.606400] Call Trace:
[756289.606704] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[756289.607313] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[756289.608040] submit_one_bio+0x61/0x70 [btrfs]
[756289.608587] btree_write_cache_pages+0x587/0x610 [btrfs]
[756289.609258] ? free_debug_processing+0x1d5/0x240
[756289.609812] ? __module_address+0x28/0xf0
[756289.610298] ? lock_acquire+0x1a0/0x3e0
[756289.610754] ? lock_acquired+0x19f/0x430
[756289.611220] ? lock_acquire+0x1a0/0x3e0
[756289.611675] do_writepages+0x43/0xf0
[756289.612101] ? __filemap_fdatawrite_range+0xa4/0x100
[756289.612800] __filemap_fdatawrite_range+0xc5/0x100
[756289.613393] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[756289.614085] btrfs_sync_log+0x21c/0xf20 [btrfs]
[756289.614661] ? finish_wait+0x90/0x90
[756289.615096] ? __mutex_unlock_slowpath+0x45/0x2a0
[756289.615661] ? btrfs_log_inode_parent+0x3c9/0xdc0 [btrfs]
[756289.616338] ? lock_acquire+0x1a0/0x3e0
[756289.616801] ? lock_acquired+0x19f/0x430
[756289.617284] ? lock_acquire+0x1a0/0x3e0
[756289.617750] ? lock_release+0x214/0x470
[756289.618221] ? lock_acquired+0x19f/0x430
[756289.618704] ? dput+0x20/0x4a0
[756289.619079] ? dput+0x20/0x4a0
[756289.619452] ? lockref_put_or_lock+0x9/0x30
[756289.619969] ? lock_release+0x214/0x470
[756289.620445] ? lock_release+0x214/0x470
[756289.620924] ? lock_release+0x214/0x470
[756289.621415] btrfs_sync_file+0x46a/0x5b0 [btrfs]
[756289.621982] do_fsync+0x38/0x70
[756289.622395] __x64_sys_fsync+0x10/0x20
[756289.622907] do_syscall_64+0x33/0x80
[756289.623438] entry_SYSCALL_64_after_hwframe+0x44/0xae
[756289.624063] RIP: 0033:0x7f08b27fbb7b
[756289.624588] Code: 0f 05 48 3d 00 (...)
[756289.626760] RSP: 002b:00007ffe2583f940 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[756289.627639] RAX: ffffffffffffffda RBX: 00005652f32cd0f0 RCX: 00007f08b27fbb7b
[756289.628464] RDX: 00005652f32cbca0 RSI: 00005652f32cd110 RDI: 0000000000000003
[756289.629323] RBP: 00005652f32cd110 R08: 0000000000000000 R09: 00007f08b28c4be0
[756289.630172] R10: fffffffffffff39a R11: 0000000000000293 R12: 0000000000000001
[756289.631007] R13: 00005652f32cd0f0 R14: 0000000000000001 R15: 00005652f32cc480
[756289.631819] irq event stamp: 0
[756289.632188] hardirqs last enabled at (0): [<0000000000000000>] 0x0
[756289.632911] hardirqs last disabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.633893] softirqs last enabled at (0): [<ffffffffa7e97c29>] copy_process+0x879/0x1cc0
[756289.634871] softirqs last disabled at (0): [<0000000000000000>] 0x0
[756289.635606] ---[ end trace 0a039fdc16ff3fef ]---
[756289.636179] BTRFS: error (device sdc) in btrfs_sync_log:3136: errno=-5 IO failure
[756289.637082] BTRFS info (device sdc): forced readonly
Having checksum items covering ranges that overlap is dangerous as in some
cases it can lead to having extent ranges for which we miss checksums
after log replay or getting the wrong checksum item. There were some fixes
in the past for bugs that resulted in this problem, and were explained and
fixed by the following commits:
27b9a8122f ("Btrfs: fix csum tree corruption, duplicate and outdated checksums")
b84b8390d6 ("Btrfs: fix file read corruption after extent cloning and fsync")
40e046acbd ("Btrfs: fix missing data checksums after replaying a log tree")
e289f03ea7 ("btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents")
Fix the issue by making btrfs_csum_file_blocks() take into account the
start offset of the next checksum item when it decides to extend an
existing checksum item, so that it never extends the checksum to end at a
range that goes beyond the start range of the next checksum item.
When we can not access the next checksum item without releasing the path,
simply drop the optimization of extending the previous checksum item and
fallback to inserting a new checksum item - this happens rarely and the
optimization is not significant enough for a log tree to justify
the extra complexity, as it would only save a few bytes (the size of a
struct btrfs_item) of leaf space.
This behaviour is only needed when inserting into a log tree because
for the regular checksums tree we never have a case where we try to
insert a range of checksums that overlap with a range that was previously
inserted.
A test case for fstests will follow soon.
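A rough sketch of the bound that is now applied when extending an existing checksum item; the variable names are illustrative, not the actual ones in btrfs_csum_file_blocks():
    /*
     * Never let the extended checksum item grow past the start offset of
     * the next checksum item in the tree.
     */
    if (have_next_item && csum_end + extend_bytes > next_item_start)
        extend_bytes = next_item_start - csum_end;

    /*
     * If checking the next item would require releasing the path, skip the
     * extension and insert a new checksum item instead (rare for log trees).
     */
    if (would_release_path)
        goto insert_new_item;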
Reported-by: Philipp Fent <fent@in.tum.de>
Link: https://lore.kernel.org/linux-btrfs/93c4600e-5263-5cba-adf0-6f47526e7561@in.tum.de/
CC: stable@vger.kernel.org # 5.4+
Tested-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Error injection stress uncovered a problem where we'd leave a dangling
inode ref if we failed during a rename_exchange. This happens because
we insert the inode ref for one side of the rename, and then for the
other side. If this second inode ref insert fails we'll leave the first
one dangling and leave a corrupt file system behind. Fix this by
aborting the transaction if the second insert fails after we have already
inserted the first inode ref.
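A sketch of the added abort; the variable names are illustrative and stand for the second of the two inode ref insertions done in btrfs_rename_exchange():
    /* Second of the two inode ref insertions done by the exchange. */
    ret = btrfs_insert_inode_ref(trans, dest_root, new_name, new_name_len,
                                 old_ino, new_dir_ino, new_index);
    if (ret) {
        /* The first ref is already in place; abort or it is left dangling. */
        btrfs_abort_transaction(trans, ret);
        goto out_fail;
    }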
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Error injection testing uncovered a case where we ended up with invalid
link counts on an inode. This happened because we failed to notice an
error when updating the inode while replaying the tree log, and
committed the transaction with an invalid file system.
Fix this by checking the return value of btrfs_update_inode. This
resolved the link count errors I was seeing, and we already properly
handle passing up the error values in these paths.
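The shape of the fix, sketched for one of the several tree-log call sites:
    ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
    if (ret)
        goto out;    /* propagate instead of committing a bad filesystem */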
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This function has the following pattern:
    while (1) {
        ret = whatever();
        if (ret)
            goto out;
    }
    ret = 0;
out:
    return ret;
However several places in this while loop we simply break; when there's
a problem, thus clearing the return value, and in one case we do a
return -EIO, and leak the memory for the path.
Fix this by re-arranging the loop to deal with ret == 1 coming from
btrfs_search_slot, and then simply delete the
ret = 0;
out:
bit so everybody can break if there is an error, which will allow for
proper error handling to occur.
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
While doing error injection testing I saw that sometimes we'd get an
abort that wouldn't stop the current transaction commit from completing.
This abort was coming from finish ordered IO, but at this point in the
transaction commit we should have gotten an error and stopped.
It turns out the abort came from finish ordered io while trying to write
out the free space cache. It occurred to me that any failure inside of
finish_ordered_io isn't actually raised to the person doing the writing,
so we could have any number of failures in this path and think the
ordered extent completed successfully and the inode was fine.
Fix this by marking the ordered extent with BTRFS_ORDERED_IOERR, and
marking the mapping of the inode with mapping_set_error, so any callers
that simply call fdatawait will also get the error.
With this we're seeing the IO error on the free space inode when we fail
to do the finish_ordered_io.
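A sketch of the error propagation in the finish-ordered-io path, where inode is the btrfs inode the ordered extent belongs to:
    if (ret) {
        /* Remember the failure on the ordered extent itself ... */
        set_bit(BTRFS_ORDERED_IOERR, &ordered_extent->flags);
        /* ... and on the mapping, so fdatawait and fsync see it too. */
        mapping_set_error(inode->vfs_inode.i_mapping, ret);
    }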
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We are unconditionally returning 0 in cleanup_ref_head, despite the fact
that btrfs_del_csums could fail. We need to return the error so the
transaction gets aborted properly. Fix this by returning ret from
btrfs_del_csums in cleanup_ref_head.
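A sketch of the change in cleanup_ref_head(), showing only the csum deletion part:
    if (head->must_insert_reserved && head->is_data)
        ret = btrfs_del_csums(trans, fs_info->csum_root,
                              head->bytenr, head->num_bytes);

    return ret;    /* was an unconditional: return 0 */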
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Error injection stress would sometimes fail with checksums on disk that
did not have a corresponding extent. This occurred because the pattern
in btrfs_del_csums was
    while (1) {
        ret = btrfs_search_slot();
        if (ret < 0)
            break;
    }
    ret = 0;
out:
    btrfs_free_path(path);
    return ret;
If we got an error from btrfs_search_slot we'd clear the error because
we were breaking instead of goto out. Instead of using goto out, simply
handle the cases where we may leave a random value in ret, and get rid
of the
ret = 0;
out:
pattern and simply allow break to have the proper error reporting. With
this fix we properly abort the transaction and do not commit thinking we
successfully deleted the csum.
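The reworked pattern, sketched in the same style as above and simplified:
    while (1) {
        ret = btrfs_search_slot(trans, root, &key, path, -1, 1);
        if (ret < 0)
            break;        /* the error survives to the return below */
        if (ret > 0) {
            ret = 0;      /* running out of csum items is not an error */
            break;
        }
        /* delete/truncate the csum items found in this leaf */
    }
    btrfs_free_path(path);
    return ret;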
Reviewed-by: Qu Wenruo <wqu@suse.com>
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
[BUG]
When running btrfs/027 with "-o compress" mount option, it always
crashes with the following call trace:
BTRFS critical (device dm-4): mapping failed logical 298901504 bio len 12288 len 8192
------------[ cut here ]------------
kernel BUG at fs/btrfs/volumes.c:6651!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 5 PID: 31089 Comm: kworker/u24:10 Tainted: G OE 5.13.0-rc2-custom+ #26
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Workqueue: btrfs-delalloc btrfs_work_helper [btrfs]
RIP: 0010:btrfs_map_bio.cold+0x58/0x5a [btrfs]
Call Trace:
btrfs_submit_compressed_write+0x2d7/0x470 [btrfs]
submit_compressed_extents+0x3b0/0x470 [btrfs]
? mark_held_locks+0x49/0x70
btrfs_work_helper+0x131/0x3e0 [btrfs]
process_one_work+0x28f/0x5d0
worker_thread+0x55/0x3c0
? process_one_work+0x5d0/0x5d0
kthread+0x141/0x160
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30
---[ end trace 63113a3a91f34e68 ]---
[CAUSE]
The critical message before the crash means we have a bio at logical
bytenr 298901504 length 12288, but only 8192 bytes can fit into one
stripe, the remaining 4096 bytes go to another stripe.
In btrfs, all bios are properly split to avoid cross stripe boundary,
but commit 764c7c9a46 ("btrfs: zoned: fix parallel compressed writes")
changed the behavior for compressed writes.
Previously, if we found that our new page couldn't fit into the current
stripe, i.e. the "submit == 1" case, we submitted the current bio without
adding the current page.
    submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);
    page->mapping = NULL;
    if (submit || bio_add_page(bio, page, PAGE_SIZE, 0) <
        PAGE_SIZE) {
But after the modification, we add the page regardless of whether it
crosses the stripe boundary, leading to the above crash.
    submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);
    if (pg_index == 0 && use_append)
        len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
    else
        len = bio_add_page(bio, page, PAGE_SIZE, 0);
    page->mapping = NULL;
    if (submit || len < PAGE_SIZE) {
[FIX]
It's no longer possible to revert to the original code style as we have
two different bio_add_*_page() calls now.
The new fix is to skip the bio_add_*_page() call if @submit is true.
Also, to avoid @len being used uninitialized, always initialize it to zero.
If @submit is true, @len will not be checked.
If @submit is not true, @len will be the return value of
bio_add_*_page() call.
Either way, the behavior is still the same as the old code.
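Putting the pieces together, the fixed sequence looks roughly like this, using the same variables as in the snippets above:
    submit = btrfs_bio_fits_in_stripe(page, PAGE_SIZE, bio, 0);
    len = 0;
    if (!submit) {
        /* Only try to add the page when it fits in the current stripe. */
        if (pg_index == 0 && use_append)
            len = bio_add_zone_append_page(bio, page, PAGE_SIZE, 0);
        else
            len = bio_add_page(bio, page, PAGE_SIZE, 0);
    }
    page->mapping = NULL;
    if (submit || len < PAGE_SIZE) {
        /* submit the current bio and retry the page with a fresh one */
    }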
Reported-by: Josef Bacik <josef@toxicpanda.com>
Fixes: 764c7c9a46 ("btrfs: zoned: fix parallel compressed writes")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When multiple processes write data to the same block group on a
compressed zoned filesystem, the underlying device could report I/O
errors and data corruption is possible.
This happens because on a zoned filesystem, compressed data writes
were sent to the device via a REQ_OP_WRITE instead of a
REQ_OP_ZONE_APPEND operation. But with REQ_OP_WRITE and parallel
submission it cannot be guaranteed that the data is always submitted
aligned to the underlying zone's write pointer.
The change to using REQ_OP_ZONE_APPEND instead of REQ_OP_WRITE on a
zoned filesystem is non-intrusive on a regular filesystem or when
submitting to a conventional zone on a zoned filesystem, as it is
guarded by btrfs_use_zone_append.
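A sketch of the op selection in the compressed write submission path, using the byte-start variant of btrfs_use_zone_append() mentioned in the stable dependency below; write_flags stands for the other REQ_* flags already used there:
    const bool use_append = btrfs_use_zone_append(inode, disk_start);
    const unsigned int bio_op = use_append ? REQ_OP_ZONE_APPEND : REQ_OP_WRITE;

    bio->bi_opf = bio_op | write_flags;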
Reported-by: David Sterba <dsterba@suse.com>
Fixes: 9d294a685f ("btrfs: zoned: enable to mount ZONED incompat flag")
CC: stable@vger.kernel.org # 5.12.x: e380adfc21: btrfs: zoned: pass start block to btrfs_use_zone_append
CC: stable@vger.kernel.org # 5.12.x
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_use_zone_append only needs the passed in extent_map's block_start
member, so there's no need to pass in the full extent map.
This also enables the use of btrfs_use_zone_append in places where we only
have a start byte but no extent_map.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A few more fixes:
- fix fiemap to print extents that could get misreported due to
internal extent splitting and logical merging for fiemap output
- fix RCU stalls during delayed iputs
- fix removed dentries still existing after log is synced"
* tag 'for-5.13-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: fix removed dentries still existing after log is synced
btrfs: return whole extents in fiemap
btrfs: avoid RCU stalls while running delayed iputs
btrfs: return 0 for dev_extent_hole_check_zoned hole_start in case of error
When we move one inode from one directory to another and both the inode
and its previous parent directory were logged before, we are not supposed
to still have the dentry in the old parent directory if we have a power
failure after the log is synced. Only the new dentry is supposed to exist.
Generally this works correctly, however there is a scenario where this is
not currently working: the log of the old parent of the file/directory that
was moved is not authoritative for a range that includes the dir index and
dir item keys of the old dentry. This case is better explained with the
following example and reproducer:
# The test requires a very specific layout of keys and items in the
# fs/subvolume btree to trigger the bug. So we want to make sure that
# on whatever platform we are, we have the same leaf/node size.
#
# Currently in btrfs the node/leaf size can not be smaller than the page
# size (but it can be greater than the page size). So use the largest
# supported node/leaf size (64K).
$ mkfs.btrfs -f -n 65536 /dev/sdc
$ mount /dev/sdc /mnt
# "testdir" is inode 257.
$ mkdir /mnt/testdir
$ chmod 755 /mnt/testdir
# Create several empty files to have the directory "testdir" with its
# items spread over several leaves (7 in this case).
$ for ((i = 1; i <= 1200; i++)); do
echo -n > /mnt/testdir/file$i
done
# Create our test directory "dira", inode number 1458, which gets all
# its items in leaf 7.
#
# The BTRFS_DIR_ITEM_KEY item for inode 257 ("testdir") that points to
# the entry named "dira" is in leaf 2, while the BTRFS_DIR_INDEX_KEY
# item that points to that entry is in leaf 3.
#
# For this particular filesystem node size (64K), file count and file
# names, we end up with the directory entry items from inode 257 in
# leaves 2 and 3, as previously mentioned - what matters for triggering
# the bug exercised by this test case is that those items are not placed
# in leaf 1, they must be placed in a leaf different from the one
# containing the inode item for inode 257.
#
# The corresponding BTRFS_DIR_ITEM_KEY and BTRFS_DIR_INDEX_KEY items for
# the parent inode (257) are the following:
#
# item 460 key (257 DIR_ITEM 3724298081) itemoff 48344 itemsize 34
# location key (1458 INODE_ITEM 0) type DIR
# transid 6 data_len 0 name_len 4
# name: dira
#
# and:
#
# item 771 key (257 DIR_INDEX 1202) itemoff 36673 itemsize 34
# location key (1458 INODE_ITEM 0) type DIR
# transid 6 data_len 0 name_len 4
# name: dira
$ mkdir /mnt/testdir/dira
# Make sure everything done so far is durably persisted.
$ sync
# Now do a change to inode 257 ("testdir") that does not result in
# COWing leaves 2 and 3 - the leaves that contain the directory items
# pointing to inode 1458 (directory "dira").
#
# Changing permissions, the owner/group, updating or adding a xattr,
# etc, will not change (COW) leaves 2 and 3. So for the sake of
# simplicity change the permissions of inode 257, which results in
# updating its inode item and therefore change (COW) only leaf 1.
$ chmod 700 /mnt/testdir
# Now fsync directory inode 257.
#
# Since only the first leaf was changed/COWed, we log the inode item of
# inode 257 and only the dentries found in the first leaf, all have a
# key type of BTRFS_DIR_ITEM_KEY, and no keys of type
# BTRFS_DIR_INDEX_KEY, because they sort after the former type and none
# exist in the first leaf.
#
# We also log 3 items that represent ranges for dir items and dir
# indexes for which the log is authoritative:
#
# 1) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
# authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
# in the range [0, 2285968570] (the offset here is the crc32c of the
# dentry's name). The value 2285968570 corresponds to the offset of
# the first key of leaf 2 (which is of type BTRFS_DIR_ITEM_KEY);
#
# 2) a key of type BTRFS_DIR_LOG_ITEM_KEY, which indicates the log is
# authoritative for all BTRFS_DIR_ITEM_KEY keys that have an offset
# in the range [4293818216, (u64)-1] (the offset here is the crc32c
# of the dentry's name). The value 4293818216 corresponds to the
# offset of the highest key of type BTRFS_DIR_ITEM_KEY plus 1
# (4293818215 + 1), which is located in leaf 2;
#
# 3) a key of type BTRFS_DIR_LOG_INDEX_KEY, with an offset of 1203,
# which indicates the log is authoritative for all keys of type
# BTRFS_DIR_INDEX_KEY that have an offset in the range
# [1203, (u64)-1]. The value 1203 corresponds to the offset of the
# last key of type BTRFS_DIR_INDEX_KEY plus 1 (1202 + 1), which is
# located in leaf 3;
#
# Also, because "testdir" is a directory and inode 1458 ("dira") is a
# child directory, we log inode 1458 too.
$ xfs_io -c "fsync" /mnt/testdir
# Now move "dira", inode 1458, to be a child of the root directory
# (inode 256).
#
# Because this inode was previously logged, when "testdir" was fsynced,
# the log is updated so that the old inode reference, referring to inode
# 257 as the parent, is deleted and the new inode reference, referring
# to inode 256 as the parent, is added to the log.
$ mv /mnt/testdir/dira /mnt
# Now change some file and fsync it. This guarantees the log changes
# made by the previous move/rename operation are persisted. We do not
# need to do any special modification to the file, just any change to
# any file and sync the log.
$ xfs_io -c "pwrite -S 0xab 0 64K" -c "fsync" /mnt/testdir/file1
# Simulate a power failure and then mount again the filesystem to
# replay the log tree. We want to verify that we are able to mount the
# filesystem, meaning log replay was successful, and that directory
# inode 1458 ("dira") only has inode 256 (the filesystem's root) as
# its parent (and no longer a child of inode 257).
#
# It used to happen that during log replay we would end up having
# inode 1458 (directory "dira") with 2 hard links, being a child of
# inode 257 ("testdir") and inode 256 (the filesystem's root). This
# resulted in the tree checker detecting the issue and causing the
# mount operation to fail (with -EIO).
#
# This happened because in the log we have the new name/parent for
# inode 1458, which results in adding the new dentry with inode 256
# as the parent, but the previous dentry, under inode 257 was never
# removed - this is because the ranges for dir items and dir indexes
# of inode 257 for which the log is authoritative do not include the
# old dir item and dir index for the dentry of inode 257 referring to
# inode 1458:
#
# - for dir items, the log is authoritative for the ranges
# [0, 2285968570] and [4293818216, (u64)-1]. The dir item at inode 257
# pointing to inode 1458 has a key of (257 DIR_ITEM 3724298081), as
# previously mentioned, so the dir item is not deleted when the log
# replay procedure processes the authoritative ranges, as 3724298081
# is outside both ranges;
#
# - for dir indexes, the log is authoritative for the range
# [1203, (u64)-1], and the dir index item of inode 257 pointing to
# inode 1458 has a key of (257 DIR_INDEX 1202), as previously
# mentioned, so the dir index item is not deleted when the log
# replay procedure processes the authoritative range.
<power failure>
$ mount /dev/sdc /mnt
mount: /mnt: can't read superblock on /dev/sdc.
$ dmesg
(...)
[87849.840509] BTRFS info (device sdc): start tree-log replay
[87849.875719] BTRFS critical (device sdc): corrupt leaf: root=5 block=30539776 slot=554 ino=1458, invalid nlink: has 2 expect no more than 1 for dir
[87849.878084] BTRFS info (device sdc): leaf 30539776 gen 7 total ptrs 557 free space 2092 owner 5
[87849.879516] BTRFS info (device sdc): refs 1 lock_owner 0 current 2099108
[87849.880613] item 0 key (1181 1 0) itemoff 65275 itemsize 160
[87849.881544] inode generation 6 size 0 mode 100644
[87849.882692] item 1 key (1181 12 257) itemoff 65258 itemsize 17
(...)
[87850.562549] item 556 key (1458 12 257) itemoff 16017 itemsize 14
[87850.563349] BTRFS error (device dm-0): block=30539776 write time tree block corruption detected
[87850.564386] ------------[ cut here ]------------
[87850.564920] WARNING: CPU: 3 PID: 2099108 at fs/btrfs/disk-io.c:465 csum_one_extent_buffer+0xed/0x100 [btrfs]
[87850.566129] Modules linked in: btrfs dm_zero dm_snapshot (...)
[87850.573789] CPU: 3 PID: 2099108 Comm: mount Not tainted 5.12.0-rc8-btrfs-next-86 #1
(...)
[87850.587481] Call Trace:
[87850.587768] btree_csum_one_bio+0x244/0x2b0 [btrfs]
[87850.588354] ? btrfs_bio_fits_in_stripe+0xd8/0x110 [btrfs]
[87850.589003] btrfs_submit_metadata_bio+0xb7/0x100 [btrfs]
[87850.589654] submit_one_bio+0x61/0x70 [btrfs]
[87850.590248] submit_extent_page+0x91/0x2f0 [btrfs]
[87850.590842] write_one_eb+0x175/0x440 [btrfs]
[87850.591370] ? find_extent_buffer_nolock+0x1c0/0x1c0 [btrfs]
[87850.592036] btree_write_cache_pages+0x1e6/0x610 [btrfs]
[87850.592665] ? free_debug_processing+0x1d5/0x240
[87850.593209] do_writepages+0x43/0xf0
[87850.593798] ? __filemap_fdatawrite_range+0xa4/0x100
[87850.594391] __filemap_fdatawrite_range+0xc5/0x100
[87850.595196] btrfs_write_marked_extents+0x68/0x160 [btrfs]
[87850.596202] btrfs_write_and_wait_transaction.isra.0+0x4d/0xd0 [btrfs]
[87850.597377] btrfs_commit_transaction+0x794/0xca0 [btrfs]
[87850.598455] ? _raw_spin_unlock_irqrestore+0x32/0x60
[87850.599305] ? kmem_cache_free+0x15a/0x3d0
[87850.600029] btrfs_recover_log_trees+0x346/0x380 [btrfs]
[87850.601021] ? replay_one_extent+0x7d0/0x7d0 [btrfs]
[87850.601988] open_ctree+0x13c9/0x1698 [btrfs]
[87850.602846] btrfs_mount_root.cold+0x13/0xed [btrfs]
[87850.603771] ? kmem_cache_alloc_trace+0x7c9/0x930
[87850.604576] ? vfs_parse_fs_string+0x5d/0xb0
[87850.605293] ? kfree+0x276/0x3f0
[87850.605857] legacy_get_tree+0x30/0x50
[87850.606540] vfs_get_tree+0x28/0xc0
[87850.607163] fc_mount+0xe/0x40
[87850.607695] vfs_kern_mount.part.0+0x71/0x90
[87850.608440] btrfs_mount+0x13b/0x3e0 [btrfs]
(...)
[87850.629477] ---[ end trace 68802022b99a1ea0 ]---
[87850.630849] BTRFS: error (device sdc) in btrfs_commit_transaction:2381: errno=-5 IO failure (Error while writing out transaction)
[87850.632422] BTRFS warning (device sdc): Skipping commit of aborted transaction.
[87850.633416] BTRFS: error (device sdc) in cleanup_transaction:1978: errno=-5 IO failure
[87850.634553] BTRFS: error (device sdc) in btrfs_replay_log:2431: errno=-5 IO failure (Failed to recover log tree)
[87850.637529] BTRFS error (device sdc): open_ctree failed
In this example the inode we moved was a directory, so it was easy to
detect the problem because directories can only have one hard link and
the tree checker immediately detects that. If the moved inode was a file,
then the log replay would succeed and we would end up having both the
new hard link (/mnt/foo) and the old hard link (/mnt/testdir/foo) present,
but only the new one should be present.
Fix this by forcing re-logging of the old parent directory when logging
the new name during a rename operation. This ensures we end up with a log
that is authoritative for a range covering the keys for the old dentry,
therefore causing the old dentry to be deleted when replaying the log.
A test case for fstests will follow up soon.
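Conceptually the change in the rename path is the sketch below; log_old_dir_again() is a made-up name standing in for the existing inode logging machinery that is now also invoked for the old parent:
    /*
     * When logging the new name after a rename, also re-log the old parent
     * directory, so the log becomes authoritative for the key ranges that
     * contain the old dentry's dir item and dir index keys.
     */
    if (old_dir && old_dir->logged_trans == trans->transid)
        ret = log_old_dir_again(trans, old_dir, ctx);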
Fixes: 64d6b281ba ("btrfs: remove unnecessary check_parent_dirs_for_sync()")
CC: stable@vger.kernel.org # 5.12+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
`xfs_io -c 'fiemap <off> <len>' <file>`
can give surprising results on btrfs that differ from xfs.
btrfs prints out extents trimmed to fit the user input. If the user's
fiemap request has an offset, then rather than returning each whole
extent which intersects that range, we also trim the start extent to not
have start < off.
Documentation in filesystems/fiemap.txt and the xfs_io man page suggests
that returning the whole extent is expected.
Some cases which all yield the same fiemap in xfs, but not btrfs:
dd if=/dev/zero of=$f bs=4k count=1
sudo xfs_io -c 'fiemap 0 1024' $f
0: [0..7]: 26624..26631
sudo xfs_io -c 'fiemap 2048 1024' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 2048 4096' $f
0: [4..7]: 26628..26631
sudo xfs_io -c 'fiemap 3584 512' $f
0: [7..7]: 26631..26631
sudo xfs_io -c 'fiemap 4091 5' $f
0: [7..6]: 26631..26630
I believe this is a consequence of the logic for merging contiguous
extents represented by separate extent items. That logic needs to track
the last offset as it loops through the extent items, which happens to
pick up the start offset on the first iteration, and trim off the
beginning of the full extent. To fix it, start `off` at 0 rather than
`start` so that we keep the iteration/merging intact without cutting off
the start of the extent.
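The fix itself is tiny; as a sketch, the tracking variable in extent_fiemap() simply starts at zero:
    u64 off = 0;    /* was: u64 off = start; which trimmed the first extent */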
After the fix, all the above commands give:
0: [0..7]: 26624..26631
The merging logic is exercised by fstest generic/483, and I have written
a new fstest for checking we don't have backwards or zero-length fiemaps
for cases like those above.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>