For subpage metadata, we don't use page locking at all. So just skip
the page locking part for subpage. The rest of the function can be
reused.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new function, write_one_subpage_eb(), as a subroutine for subpage
metadata write, will handle the extent buffer bio submission.
The major differences between the new write_one_subpage_eb() and
write_one_eb() is:
- No page locking
When entering write_one_subpage_eb() the page is no longer locked.
We only lock the page for its status update, and unlock immediately.
Now we completely rely on extent io tree locking.
- Extra bitmap update along with page status update
Now page dirty and writeback is controlled by
btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
They both follow the schema that any sector is dirty/writeback, then
the full page gets dirty/writeback.
- When to update the nr_written number
Now we take a shortcut, if we have cleared the last dirty bit of the
page, we update nr_written.
This is not completely perfect, but should emulate the old behavior
well enough.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The new function, end_bio_subpage_eb_writepage(), will handle the
metadata writeback endio.
The major differences involved are:
- How to grab extent buffer
Now page::private is a pointer to btrfs_subpage, we can no longer grab
extent buffer directly.
Thus we need to use the bv_offset to locate the extent buffer manually
and iterate through the whole range.
- Use btrfs_subpage_end_writeback() caller
This helper will handle the subpage writeback for us.
Since this function is executed under endio context, when grabbing
extent buffers it can't grab eb->refs_lock as that lock is not designed
to be grabbed under hardirq context.
So here introduce a helper, find_extent_buffer_nolock(), for such
situation, and convert find_extent_buffer() to use that helper.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are a few places where we don't check the return value of
btrfs_commit_transaction in relocation.c. Thankfully all these places
have straightforward error handling, so simply change all of the sites
at once.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a BUG_ON() if we get an error back from btrfs_get_fs_root().
This honestly should never fail, as at this point we have a solid
coordination of fs root to reloc root, and these roots will all be in
memory. But in the name of killing BUG_ON()'s remove these and handle
the error condition properly, ASSERT()'ing for developers.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In corruption cases we could have paths from a block up to no root at
all, and thus we'll BUG_ON(!root) in select_one_root. Handle this by
adding an ASSERT() for developers, and returning an error for normal
users.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This probably can't happen even with a corrupt file system, because we
would have failed much earlier on than here. However there's no reason
we can't just check and bail out as appropriate, so do that and convert
the correctness BUG_ON() to an ASSERT().
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we have a duplicate entry for a reloc root then we could have fs
corruption that resulted in a double allocation. Since this shouldn't
happen unless there is corruption, add an ASSERT(ret != -EEXIST) to all
of the callers of __add_reloc_root() to catch any logic mistakes for
developers, otherwise normal error handling will happen for normal
users.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can already handle errors appropriately from this function, deal with
an error coming from __add_reloc_root appropriately.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
We already handle some errors in this function, and the callers do the
correct error handling, so clean up the rest of the function to do the
appropriate error handling.
There's a little extra work that needs to be done here, as we create the
inode item before we create the orphan item. We could potentially add
the orphan item, but if we failed to create the inode item we would have
to abort the transaction.
Instead add a helper to delete the inode item we created in the case
that we're unable to look up the inode (this would likely be caused by
an ENOMEM), which if it succeeds means we can avoid a transaction abort
in this particular error case.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These checks are all taken care of for us by the tree checker code:
- the flags don't change or are updated consistently
- the v0 extent item format is invalid and caught in many other places
too
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
We need to validate that a data extent item does not have the
FULL_BACKREF flag set on its flags.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We can already deal with errors appropriately from do_relocation, simply
handle any errors that come from changing the refs at this point
cleanly. We have to abort the transaction if we fail here as we've
modified metadata at this point.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If any of the reference count manipulation stuff fails in replace_path
we need to abort the transaction, as we've modified the blocks already.
We can simply break at this point and everything will be cleaned up.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The search can fail for various reasons, in case of errors there's no
cleanup to be done so we can pass the error to the caller, adjusting for
the case where the key is not found and search slot returns 1.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
If we error out COWing the root node when doing a replace_path then we
simply unlock and free the buffer and return the error.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A few BUG_ON()'s in replace_path are purely to keep us from making
logical mistakes, so replace them with ASSERT()'s.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We call btrfs_update_root in btrfs_update_reloc_root, which can fail for
all sorts of reasons, including IO errors. Instead of panicing the box
lets return the error, now that all callers properly handle those
errors.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_update_reloc_root will will return errors in the future, so handle
an error properly in prepare_to_merge.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_update_reloc_root will will return errors in the future, so handle
the error properly in insert_dirty_subvol.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This will be able to return errors in the future, so change it to return
an error and handle the errors appropriately.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_update_reloc_root will will return errors in the future, so handle
the error properly in commit_fs_roots.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we fail to setup a root->reloc_root in a different thread that path
will error out, however it still leaves root->reloc_root NULL but would
still appear set up in the transaction. Subsequent calls to
btrfs_record_root_in_transaction would succeed without attempting to
create the reloc root, as the transid has already been updated.
Handle this case by making sure we have a root->reloc_root set after a
btrfs_record_root_in_transaction call so we don't end up dereferencing a
NULL pointer.
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We do memory allocations here, read blocks from disk, all sorts of
operations that could easily fail at any given point. Instead of
panicing the box, simply return the error back up the chain, all callers
at this point have proper error handling.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
create_reloc_root will return errors in the future, and __add_reloc_root
can return ENOMEM or EEXIST, so handle these errors properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
We can create a reloc root when we record the root in the trans, which
can fail for all sorts of different reasons. Propagate this error up
the chain of callers. Future patches will fix the callers of
btrfs_record_root_in_trans() to handle the error.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
record_root_in_trans can currently fail, so handle this failure
properly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
record_root_in_trans can fail currently, handle this failure properly.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
record_root_in_trans can fail currently, so handle this failure
properly.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in start_transaction.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in relocate_tree_block.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in create_subvol.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_recover_log_trees.
This appears tricky, however we have a reference count on the
destination root, so if this fails we need to continue on in the loop to
make sure the proper cleanup is done.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_delete_subvolume.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_rename.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_record_root_in_trans will return errors in the future, so handle
the error properly in btrfs_rename_exchange.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Generally speaking this shouldn't ever fail, the corresponding fs root
for the reloc root will already be in memory, so we won't get ENOMEM
here.
However if there is no corresponding root for the reloc root then we
could get ENOMEM when we try to allocate it or we could get ENOENT
when we look it up and see that it doesn't exist.
Convert these BUG_ON()'s into ASSERT()'s and add proper error handling
for the case of corruption.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We will record the fs root or the reloc root in the trans in
select_reloc_root. These will actually return errors in the following
patches, so check their return value here and return it up the stack.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have several BUG_ON()'s in select_reloc_root() that can be tripped if
there is an extent tree corruption. Convert these to ASSERT()'s, because
if we hit it during testing it really is bad, or could indicate a
problem with the backref walking code.
However if users hit these problems it generally indicates corruption,
I've hit a few machines in the fleet that trip over these with clearly
corrupted extent trees, so be nice and print out an error message and
return an error instead of bringing the whole box down.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently select_reloc_root() doesn't return an error, but followup
patches will make it possible for it to return an error. We do have
proper error recovery in do_relocation however, so handle the
possibility of select_reloc_root() having an error properly instead of
BUG_ON(!root).
I've also adjusted select_reloc_root() to return ERR_PTR(-ENOENT) if we
don't find a root, instead of NULL, to make the error case easier to
deal with. I've replaced the BUG_ON(!root) with an ASSERT(0) for this
case as it indicates we messed up the backref walking code, but it could
also indicate corruption.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a couple of BUG_ON()'s in relocate_tree_block() that can be
tripped if we have file system corruption. Convert these to ASSERT()'s
so developers still get yelled at when they break the backref code, but
error out nicely for users so the whole box doesn't go down.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A few of these are checking for correctness, and won't be triggered by
corrupted file systems, so convert them to ASSERT() instead of BUG_ON()
and add a comment explaining their existence.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Implement readahead_batch_length() to determine the number of bytes in
the current batch of readahead pages and use it in btrfs. Also use the
readahead_pos to get the offset.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two forward declarations deep in extent_io.h, move them
to the beginning and remove the duplicate one.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Wan Jiabing <wanjiabing@vivo.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch adds an overview how btrfs subpage support works:
- limitations
- behavior
- basic implementation points
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Current set_btree_ioerr() only accepts @page parameter and grabs extent
buffer from page::private. This works fine for sector size == PAGE_SIZE
case, but not for subpage case.
Add an extra parameter, @eb, for callers to pass extent buffer to this
function, so that subpage code can reuse this function.
And also add subpage special handling to update
btrfs_subpage::error_bitmap.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.
For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.
This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty() to do this for subpage metadata.
Also since the main part of clearing page dirty code is still the same,
extract that into btree_clear_page_dirty() so that it can be utilized
for both cases.
But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can clear the page dirty.
[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:
T1 (eb1 in the same page) | T2 (eb2 in the same page)
-------------------------------+------------------------------
set_extent_buffer_dirty() | clear_extent_buffer_dirty()
|- was_dirty = false; | |- clear_subpagE_extent_buffer_dirty()
| | |- btrfs_clear_and_test_dirty()
| | | Since eb2 is the last dirty page
| | | we got:
| | | last == true;
| | |
|- btrfs_page_set_dirty() | |
| We set the page dirty and | |
| subpage dirty bitmap | |
| | |- if (last)
| | | Since we don't have subpage lock
| | | held, now @last is no longer
| | | correct
| | |- btree_clear_page_dirty()
| | Now PageDirty == false, even if
| | we have dirty_bitmap not zero.
|- ASSERT(PageDirty()); |
^^^^ CRASH
The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are quite some assert checks on page uptodate in extent buffer
write accessors. They ensure the destination page is already uptodate.
This is fine for regular sector size case, but not for subpage case, as
for subpage we only mark the page uptodate if the page contains no hole
and all its extent buffers are uptodate.
So instead of checking PageUptodate(), for subpage case we check the
uptodate bitmap of btrfs_subpage structure.
To make the check more elegant, introduce a helper,
assert_eb_page_uptodate() to do the check for both subpage and regular
sector size cases.
The following functions are involved:
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer()
- extent_buffer_test_bit()
- extent_buffer_bitmap_set()
- extent_buffer_bitmap_clear()
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In alloc_extent_buffer(), we make sure that the newly allocated page is
never dirty.
This is fine for sector size == PAGE_SIZE case, but for subpage it's
possible that one extent buffer in the page is dirty, thus the whole
page is marked dirty, and could cause false alert.
To support subpage, call btrfs_page_test_dirty() to handle both cases.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Add a new helper, csum_dirty_subpage_buffers(), to iterate through all
dirty extent buffers in one bvec.
Also extract the code of calculating csum for one extent buffer into
csum_one_extent_buffer(), so that both the existing csum_dirty_buffer()
and the new csum_dirty_subpage_buffers() can reuse the same routine.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
For btree_set_page_dirty(), we should also check the extent buffer
sanity for subpage support.
Unlike the regular sector size case, since one page can contain multiple
extent buffers, we need to make sure there is at least one dirty extent
buffer in the page.
So this patch will iterate through the btrfs_subpage::dirty_bitmap
to get the extent buffers, and check if any dirty extent buffer in the page
range has EXTENT_BUFFER_DIRTY and proper refs.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduces the following functions to handle subpage writeback status:
- btrfs_subpage_set_writeback()
- btrfs_subpage_clear_writeback()
- btrfs_subpage_test_writeback()
These helpers can only be called when the range is ensured to be
inside the page.
- btrfs_page_set_writeback()
- btrfs_page_clear_writeback()
- btrfs_page_test_writeback()
These helpers can handle both regular sector size and subpage without
problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Introduce the following functions to handle subpage dirty status:
- btrfs_subpage_set_dirty()
- btrfs_subpage_clear_dirty()
- btrfs_subpage_test_dirty()
These helpers can only be called when the range is ensured to be
inside the page.
- btrfs_page_set_dirty()
- btrfs_page_clear_dirty()
- btrfs_page_test_dirty()
These helpers can handle both regular sector size and subpage without
problem.
Thus they would be used to replace PageDirty() related calls in
later patches.
There is one special point to note here, just like set_page_dirty() and
clear_page_dirty_for_io(), btrfs_*page_set_dirty() and
btrfs_*page_clear_dirty() must be called with page locked.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_invalidatepage() we re-declare @tree variable as
btrfs_ordered_inode_tree.
Since it's only used to do the spinlock, we can grab it from inode
directly, and remove the unnecessary declaration completely.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In btrfs_invalidatepage() we introduce a temporary variable, new_len, to
update ordered->truncated_len. But we can use min() to replace it
completely and no need for the variable.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Export supported sector sizes in /sys/fs/btrfs/features/supported_sectorsizes.
Currently all architectures have PAGE_SIZE, There's some disparity
between read-only and read-write support but that will be unified in the
future so there's only one file exporting the size.
The read-only support for systems with 64K pages also works for 4K
sector size.
This new sysfs interface would help eg. mkfs.btrfs to print more
accurate warnings about potentially incompatible option combinations.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently a full send operation uses the standard btree readahead when
iterating over the subvolume/snapshot btree, which despite bringing good
performance benefits, it could be improved in a few aspects for use cases
such as full send operations, which are guaranteed to visit every node
and leaf of a btree, in ascending and sequential order. The limitations
of that standard btree readahead implementation are the following:
1) It only triggers readahead for leaves that are physically close
to the leaf being read, within a 64K range;
2) It only triggers readahead for the next or previous leaves if the
leaf being read is not currently in memory;
3) It never triggers readahead for nodes.
So add a new readahead mode that addresses all these points and use it
for full send operations.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
The durations of the full send operation in seconds were the following:
Before this change: 217 seconds
After this change: 205 seconds (-5.7%)
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When we are running out of space for updating the chunk tree, that is,
when we are low on available space in the system space info, if we have
many task concurrently allocating block groups, via fallocate for example,
many of them can end up all allocating new system chunks when only one is
needed. In extreme cases this can lead to exhaustion of the system chunk
array, which has a size limit of 2048 bytes, and results in a transaction
abort with errno EFBIG, producing a trace in dmesg like the following,
which was triggered on a PowerPC machine with a node/leaf size of 64K:
[1359.518899] ------------[ cut here ]------------
[1359.518980] BTRFS: Transaction aborted (error -27)
[1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
[1359.519152] Modules linked in: (...)
[1359.519239] Supported: Yes, External
[1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G X 5.3.18-47-default #1 SLE15-SP3
[1359.519274] NIP: c008000000e36fe8 LR: c008000000e36fe4 CTR: 00000000006de8e8
[1359.519293] REGS: c00000056890b700 TRAP: 0700 Tainted: G X (5.3.18-47-default)
[1359.519317] MSR: 800000000282b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 48008222 XER: 00000007
[1359.519356] CFAR: c00000000013e170 IRQMASK: 0
[1359.519356] GPR00: c008000000e36fe4 c00000056890b990 c008000000e83200 0000000000000026
[1359.519356] GPR04: 0000000000000000 0000000000000000 0000d52a3b027651 0000000000000007
[1359.519356] GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000000
[1359.519356] GPR12: 0000000000008000 c00000063fe44600 000000001015e028 000000001015dfd0
[1359.519356] GPR16: 000000000000404f 0000000000000001 0000000000010000 0000dd1e287affff
[1359.519356] GPR20: 0000000000000001 c000000637c9a000 ffffffffffffffe5 0000000000000000
[1359.519356] GPR24: 0000000000000004 0000000000000000 0000000000000100 ffffffffffffffc0
[1359.519356] GPR28: c000000637c9a000 c000000630e09230 c000000630e091d8 c000000562188b08
[1359.519561] NIP [c008000000e36fe8] btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
[1359.519613] LR [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
[1359.519626] Call Trace:
[1359.519671] [c00000056890b990] [c008000000e36fe4] btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
[1359.519729] [c00000056890ba90] [c008000000d68d44] __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
[1359.519782] [c00000056890bae0] [c008000000e309ac] btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
[1359.519844] [c00000056890bba0] [c008000000d8a0fc] btrfs_fallocate+0xe4/0x10e0 [btrfs]
[1359.519891] [c00000056890bd00] [c0000000004a23b4] vfs_fallocate+0x174/0x350
[1359.519929] [c00000056890bd50] [c0000000004a3cf8] ksys_fallocate+0x68/0xf0
[1359.519957] [c00000056890bda0] [c0000000004a3da8] sys_fallocate+0x28/0x40
[1359.519988] [c00000056890bdc0] [c000000000038968] system_call_exception+0xe8/0x170
[1359.520021] [c00000056890be20] [c00000000000cb70] system_call_common+0xf0/0x278
[1359.520037] Instruction dump:
[1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 3c620000
[1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe00000> 3c820000 e884bcc8 7ec6b378
[1359.520122] ---[ end trace d6c186e151022e20 ]---
The following steps explain how we can end up in this situation:
1) Task A is at check_system_chunk(), either because it is allocating a
new data or metadata block group, at btrfs_chunk_alloc(), or because
it is removing a block group or turning a block group RO. It does not
matter why;
2) Task A sees that there is not enough free space in the system
space_info object, that is 'left' is < 'thresh'. And at this point
the system space_info has a value of 0 for its 'bytes_may_use'
counter;
3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
a new system block group (chunk) and then reserves 'thresh' bytes in
the chunk block reserve with the call to btrfs_block_rsv_add(). This
changes the chunk block reserve's 'reserved' and 'size' counters by an
amount of 'thresh', and changes the 'bytes_may_use' counter of the
system space_info object from 0 to 'thresh'.
Also during its call to btrfs_alloc_chunk(), we end up increasing the
value of the 'total_bytes' counter of the system space_info object by
8MiB (the size of a system chunk stripe). This happens through the
call chain:
btrfs_alloc_chunk()
create_chunk()
btrfs_make_block_group()
btrfs_update_space_info()
4) After it finishes the first phase of the block group allocation, at
btrfs_chunk_alloc(), task A unlocks the chunk mutex;
5) At this point the new system block group was added to the transaction
handle's list of new block groups, but its block group item, device
items and chunk item were not yet inserted in the extent, device and
chunk trees, respectively. That only happens later when we call
btrfs_finish_chunk_alloc() through a call to
btrfs_create_pending_block_groups();
Note that only when we update the chunk tree, through the call to
btrfs_finish_chunk_alloc(), we decrement the 'reserved' counter
of the chunk block reserve as we COW/allocate extent buffers,
through:
btrfs_alloc_tree_block()
btrfs_use_block_rsv()
btrfs_block_rsv_use_bytes()
And the system space_info's 'bytes_may_use' is decremented everytime
we allocate an extent buffer for COW operations on the chunk tree,
through:
btrfs_alloc_tree_block()
btrfs_reserve_extent()
find_free_extent()
btrfs_add_reserved_bytes()
If we end up COWing less chunk btree nodes/leaves than expected, which
is the typical case since the amount of space we reserve is always
pessimistic to account for the worst possible case, we release the
unused space through:
btrfs_create_pending_block_groups()
btrfs_trans_release_chunk_metadata()
btrfs_block_rsv_release()
block_rsv_release_bytes()
btrfs_space_info_free_bytes_may_use()
But before task A gets into btrfs_create_pending_block_groups()...
6) Many other tasks start allocating new block groups through fallocate,
each one does the first phase of block group allocation in a
serialized way, since btrfs_chunk_alloc() takes the chunk mutex
before calling check_system_chunk() and btrfs_alloc_chunk().
However before everyone enters the final phase of the block group
allocation, that is, before calling btrfs_create_pending_block_groups(),
new tasks keep coming to allocate new block groups and while at
check_system_chunk(), the system space_info's 'bytes_may_use' keeps
increasing each time a task reserves space in the chunk block reserve.
This means that eventually some other task can end up not seeing enough
free space in the system space_info and decide to allocate yet another
system chunk.
This may repeat several times if yet more new tasks keep allocating
new block groups before task A, and all the other tasks, finish the
creation of the pending block groups, which is when reserved space
in excess is released. Eventually this can result in exhaustion of
system chunk array in the superblock, with btrfs_add_system_chunk()
returning EFBIG, resulting later in a transaction abort.
Even when we don't reach the extreme case of exhausting the system
array, most, if not all, unnecessarily created system block groups
end up being unused since when finishing creation of the first
pending system block group, the creation of the following ones end
up not needing to COW nodes/leaves of the chunk tree, so we never
allocate and deallocate from them, resulting in them never being
added to the list of unused block groups - as a consequence they
don't get deleted by the cleaner kthread - the only exceptions are
if we unmount and mount the filesystem again, which adds any unused
block groups to the list of unused block groups, if a scrub is
run, which also adds unused block groups to the unused list, and
under some circumstances when using a zoned filesystem or async
discard, which may also add unused block groups to the unused list.
So fix this by:
*) Tracking the number of reserved bytes for the chunk tree per
transaction, which is the sum of reserved chunk bytes by each
transaction handle currently being used;
*) When there is not enough free space in the system space_info,
if there are other transaction handles which reserved chunk space,
wait for some of them to complete in order to have enough excess
reserved space released, and then try again. Otherwise proceed with
the creation of a new system chunk.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a
file that has the S_SYNC attribute set, we totally ignore that and do not
durably persist the reflink changes. Since a reflink can change the data
readable from a file (and mtime/ctime, or a file size), it makes sense to
durably persist (fsync) the source and destination files/ranges.
This was previously discussed at:
https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/
The recently introduced test case generic/628, from fstests, exercises
these scenarios and currently fails without this change.
So make sure we fsync the source and destination files/ranges when either
of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set,
just like XFS already does.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
gcc complains that the ctl->max_chunk_size member might be used
uninitialized when none of the three conditions for initializing it in
init_alloc_chunk_ctl_policy_zoned() are true:
In function ‘init_alloc_chunk_ctl_policy_zoned’,
inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used uninitialized [-Werror=maybe-uninitialized]
4998 | ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
| ^~~
fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
5316 | struct alloc_chunk_ctl ctl;
| ^~~
If we ever get into this condition, something is seriously
wrong, as validity is checked in the callers
btrfs_alloc_chunk
init_alloc_chunk_ctl
init_alloc_chunk_ctl_policy_zoned
so the same logic as in init_alloc_chunk_ctl_policy_regular()
and a few other places should be applied. This avoids both further
data corruption, and the compile-time warning.
Fixes: 1cd6121f2a ("btrfs: zoned: implement zoned chunk allocator")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In commit d77815461f ("btrfs: Avoid trucating page or punching hole
in a already existed hole."), existing holes can be skipped by calling
find_first_non_hole() to adjust start and len. However, if the given len
is invalid and large, when an EXTENT_MAP_HOLE extent is found, len will
not be set to zero because (em->start + em->len) is less than
(start + len). Then the ret will be 1 but len will not be set to 0.
The propagated non-zero ret will result in fallocate failure.
In the while-loop of btrfs_replace_file_extents(), len is not updated
every time before it calls find_first_non_hole(). That is, after
btrfs_drop_extents() successfully drops the last non-hole file extent,
it may fail with ENOSPC when attempting to drop a file extent item
representing a hole. The problem can happen. After it calls
find_first_non_hole(), the cur_offset will be adjusted to be larger
than or equal to end. However, since the len is not set to zero, the
break-loop condition (ret && !len) will not be met. After it leaves the
while-loop, fallocate will return 1, which is an unexpected return
value.
We're not able to construct a reproducible way to let
btrfs_drop_extents() fail with ENOSPC after it drops the last non-hole
file extent but with remaining holes left. However, it's quite easy to
fix. We just need to update and check the len every time before we call
find_first_non_hole(). To make the while loop more readable, we also
pull the variable updates to the bottom of loop like this:
while (cur_offset < end) {
...
// update cur_offset & len
// advance cur_offset & len in hole-punching case if needed
}
Reported-by: Robbie Ko <robbieko@synology.com>
Fixes: d77815461f ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Commit 6e37d24599 ("btrfs: zoned: fix deadlock on log sync") pointed out
a deadlock warning and removed mutex_{lock,unlock} of fs_info::tree_root->log_mutex.
While it looks like it always cause a deadlock, we didn't see actual
deadlock in fstests runs. The reason is log_root_tree->log_mutex !=
fs_info->tree_root->log_mutex, not taking the same lock. So, the warning
was actually a false-positive.
Since btrfs_alloc_log_tree_node() is protected only by
fs_info->tree_root->log_mutex, we can (and should) move the code out of
the lock scope of log_root_tree->log_mutex and silence the warning.
Fixes: 6e37d24599 ("btrfs: zoned: fix deadlock on log sync")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is a comment at btrfs_replace_file_extents() that mentions that we
set the full sync flag on an inode when cloning into a file with a size
greater than or equals to 16MiB, through try_release_extent_mapping() when
we truncate the page cache after replacing file extents during a clone
operation.
That is not true anymore since commit 5e548b3201 ("btrfs: do not set
the full sync flag on the inode during page release"), so update the
comment to remove that part and rephrase it slightly to make it more
clear why the full sync flag is set at btrfs_replace_file_extents().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_orphan_cleanup() has a comment referring to find_dead_roots, but
function does not exists since commit cb517eabba ("Btrfs: cleanup the
similar code of the fs root read"). What we use now to find and load dead
roots is btrfs_find_orphan_roots(). So update the comment and make it a
bit more detailed about why we can not delete an orphan item for a root.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We used to encode two different numbers in the tree mod log counter used
for sequence numbers, one in the upper 32 bits and the other one in the
lower 32 bits. However that is no longer the case, we stopped doing that
since commit fcebe4562d ("Btrfs: rework qgroup accounting").
So update the debug message at btrfs_check_delayed_seq to stop extracting
the two 32 bits counters and print instead the 64 bits sequence numbers.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two places outside the tree mod log module that extract the
lowest sequence number of the tree mod log. These places end up
duplicating code and open coding the logic and internal implementation
details of the tree mod log. So add a helper to the tree mod log module
and header that returns the lowest sequence number or 0 if there aren't
any tree mod log users at the moment.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
At btrfs_tree_mod_log_free_eb() we check if we are dealing with a leaf,
and if so, return immediately and do nothing. However this check can be
removed, because after it we call tree_mod_need_log(), which returns
false when given an extent buffer that corresponds to a leaf.
So just remove the leaf check and pass the extent buffer to
tree_mod_need_log().
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Instead of exposing implementation details of the tree mod log to check
if there are active tree mod log users at btrfs_free_tree_block(), use
the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This
way extent-tree.c does not need to known about any of the internals of
the tree mod log and avoids taking a lock unnecessarily as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree modification log functions are called very frequently, basically
they are called every time a btree is modified (a pointer added or removed
to a node, a new root for a btree is set, etc). Because of that, to avoid
heavy lock contention on the lock that protects the list of tree mod log
users, we have checks that test the emptiness of the list with a full
memory barrier before the checks, so that when there are no tree mod log
users we avoid taking the lock.
Replace the memory barrier and list emptiness check with a test for a new
bit set at fs_info->flags. This bit is used to indicate when there are
tree mod log users, set whenever a user is added to the list and cleared
when the last user is removed from the list. This makes the intention a
bit more obvious and possibly more efficient (assuming test_bit() may be
cheaper than a full memory barrier on some architectures).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Several functions of the tree modification log use integers as booleans,
so change them to use booleans instead, making their use more clear.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The tree modification log, which records modifications done to btrees, is
quite large and currently spread all over ctree.c, which is a huge file
already.
To make things better organized, move all that code into its own separate
source and header files. Functions and definitions that are used outside
of the module (mostly by ctree.c) are renamed so that they start with a
"btrfs_" prefix. Everything else remains unchanged.
This makes it easier to go over the tree modification log code every
time I need to go read it to fix a bug.
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor comment updates ]
Signed-off-by: David Sterba <dsterba@suse.com>
btrfsic_read_block() (which calls kmap()) and
btrfsic_release_block_ctx() (which calls kunmap()) are always called
within a single thread of execution.
Therefore the mappings created within these calls can be a thread local
mapping.
Convert the kmap() of bloc_ctx->pagev to kmap_local_page(). Luckily the
unmap loops backwards through the array pointer so no adjustment needs
to be made to the unmapping order.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Again there is an array of pointers which must be unmapped in the correct
order.
Convert the kmap()'s to kmap_local_page() and adjust the unmapping
to work backwards through the unmapping loop.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
These kmaps are thread local and don't need to be atomic. So they can use
the more efficient kmap_local_page(). However, the mapping of pages in
the stripes and the additional parity and qstripe pages are a bit
trickier because the unmapping must occur in the opposite order from the
mapping. Furthermore, the pointer array in __raid_recover_end_io() may
get reordered.
Convert these calls to kmap_local_page() taking care to reverse the
unmappings of any page arrays as well as being careful with the mappings
of any special pages such as the parity and qstripe pages.
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Use a simple coccinelle script to help convert the most common
kmap()/kunmap() patterns to kmap_local_page()/kunmap_local().
Note that some kmaps which were caught by this script needed to be
handled by hand because of the strict unmapping order of kunmap_local()
so they are not included in this patch. But this script got us started.
There's another temp variable added for the final length write to the
first page so it does not interfere with cpage_out that is used for
mapping other pages.
The development of this patch was aided by the follow script:
// <smpl>
// SPDX-License-Identifier: GPL-2.0-only
// Find kmap and replace with kmap_local_page then mark kunmap
//
// Confidence: Low
// Copyright: (C) 2021 Intel Corporation
// URL: http://coccinelle.lip6.fr/
@ catch_all @
expression e, e2;
@@
(
-kmap(e)
+kmap_local_page(e)
)
...
(
-kunmap(...)
+kunmap_local()
)
// </smpl>
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The in_range() macro is defined twice in btrfs' source, once in ctree.h
and once in misc.h.
Remove the definition in ctree.h and include misc.h in the files depending
on it.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently btrfs_inode_in_log() checks the list of modified extents of the
inode, and has a comment mentioning why, as it used to be necessary to
make sure if we did something like the following:
mmap write range A
mmap write range B
msync range A (ranged fsync)
msync range B (ranged fsync)
we ended up with both ranges being logged.
If we did not check it, then the second fsync would do nothing because
btrfs_inode_in_log() would return true. This was added in 125c4cf9f3
("Btrfs: set inode's logged_trans/last_log_commit after ranged fsync") and
test case generic/325 from fstests exercises that scenario.
However, as of commit 487781796d ("btrfs: make fast fsyncs wait only
for writeback"), every ranged fsync is now turned into a full ranged fsync
(operates on the range from 0 to LLONG_MAX), so it is now pointless to
test of emptiness of the list of modified extents, and the comment is
clearly outdated.
So just remove the comment and list emptiness check, while also changing
the function's return type to be a boolean instead of an integer.
In case one day we get support for ranged fsyncs again, it will be easy
to notice the check is necessary again, because it will make generic/325
always fail.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We have a race between marking that an inode needs to be logged, either
at btrfs_set_inode_last_trans() or at btrfs_page_mkwrite(), and between
btrfs_sync_log(). The following steps describe how the race happens.
1) We are at transaction N;
2) Inode I was previously fsynced in the current transaction so it has:
inode->logged_trans set to N;
3) The inode's root currently has:
root->log_transid set to 1
root->last_log_commit set to 0
Which means only one log transaction was committed to far, log
transaction 0. When a log tree is created we set ->log_transid and
->last_log_commit of its parent root to 0 (at btrfs_add_log_tree());
4) One more range of pages is dirtied in inode I;
5) Some task A starts an fsync against some other inode J (same root), and
so it joins log transaction 1.
Before task A calls btrfs_sync_log()...
6) Task B starts an fsync against inode I, which currently has the full
sync flag set, so it starts delalloc and waits for the ordered extent
to complete before calling btrfs_inode_in_log() at btrfs_sync_file();
7) During ordered extent completion we have btrfs_update_inode() called
against inode I, which in turn calls btrfs_set_inode_last_trans(),
which does the following:
spin_lock(&inode->lock);
inode->last_trans = trans->transaction->transid;
inode->last_sub_trans = inode->root->log_transid;
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
So ->last_trans is set to N and ->last_sub_trans set to 1.
But before setting ->last_log_commit...
8) Task A is at btrfs_sync_log():
- it increments root->log_transid to 2
- starts writeback for all log tree extent buffers
- waits for the writeback to complete
- writes the super blocks
- updates root->last_log_commit to 1
It's a lot of slow steps between updating root->log_transid and
root->last_log_commit;
9) The task doing the ordered extent completion, currently at
btrfs_set_inode_last_trans(), then finally runs:
inode->last_log_commit = inode->root->last_log_commit;
spin_unlock(&inode->lock);
Which results in inode->last_log_commit being set to 1.
The ordered extent completes;
10) Task B is resumed, and it calls btrfs_inode_in_log() which returns
true because we have all the following conditions met:
inode->logged_trans == N which matches fs_info->generation &&
inode->last_subtrans (1) <= inode->last_log_commit (1) &&
inode->last_subtrans (1) <= root->last_log_commit (1) &&
list inode->extent_tree.modified_extents is empty
And as a consequence we return without logging the inode, so the
existing logged version of the inode does not point to the extent
that was written after the previous fsync.
It should be impossible in practice for one task be able to do so much
progress in btrfs_sync_log() while another task is at
btrfs_set_inode_last_trans() right after it reads root->log_transid and
before it reads root->last_log_commit. Even if kernel preemption is enabled
we know the task at btrfs_set_inode_last_trans() can not be preempted
because it is holding the inode's spinlock.
However there is another place where we do the same without holding the
spinlock, which is in the memory mapped write path at:
vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
{
(...)
BTRFS_I(inode)->last_trans = fs_info->generation;
BTRFS_I(inode)->last_sub_trans = BTRFS_I(inode)->root->log_transid;
BTRFS_I(inode)->last_log_commit = BTRFS_I(inode)->root->last_log_commit;
(...)
So with preemption happening after setting ->last_sub_trans and before
setting ->last_log_commit, it is less of a stretch to have another task
do enough progress at btrfs_sync_log() such that the task doing the memory
mapped write ends up with ->last_sub_trans and ->last_log_commit set to
the same value. It is still a big stretch to get there, as the task doing
btrfs_sync_log() has to start writeback, wait for its completion and write
the super blocks.
So fix this in two different ways:
1) For btrfs_set_inode_last_trans(), simply set ->last_log_commit to the
value of ->last_sub_trans minus 1;
2) For btrfs_page_mkwrite() only set the inode's ->last_sub_trans, just
like we do for buffered and direct writes at btrfs_file_write_iter(),
which is all we need to make sure multiple writes and fsyncs to an
inode in the same transaction never result in an fsync missing that
the inode changed and needs to be logged. Turn this into a helper
function and use it both at btrfs_page_mkwrite() and at
btrfs_file_write_iter() - this also fixes the problem that at
btrfs_page_mkwrite() we were setting those fields without the
protection of the inode's spinlock.
This is an extremely unlikely race to happen in practice.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing an fsync we flush all delalloc, lock the inode (VFS lock), flush
any new delalloc that might have been created before taking the lock and
then wait either for the ordered extents to complete or just for the
writeback to complete (depending on whether the full sync flag is set or
not). We then start logging the inode and assume that while we are doing it
no one else is touching the inode's file extent items (or adding new ones).
That is generally true because all operations that modify an inode acquire
the inode's lock first, including buffered and direct IO writes. However
there is one exception: memory mapped writes, which do not and can not
acquire the inode's lock.
This can cause two types of issues: ending up logging file extent items
with overlapping ranges, which is detected by the tree checker and will
result in aborting the transaction when starting writeback for a log
tree's extent buffers, or a silent corruption where we log a version of
the file that never existed.
Scenario 1 - logging overlapping extents
The following steps explain how we can end up with file extents items with
overlapping ranges in a log tree due to a race between a fsync and memory
mapped writes:
1) Task A starts an fsync on inode X, which has the full sync runtime flag
set. First it starts by flushing all delalloc for the inode;
2) Task A then locks the inode and flushes any other delalloc that might
have been created after the previous flush and waits for all ordered
extents to complete;
3) In the inode's root we have the following leaf:
Leaf N, generation == current transaction id:
---------------------------------------------------------
| (...) [ file extent item, offset 640K, length 128K ] |
---------------------------------------------------------
The last file extent item in leaf N covers the file range from 640K to
768K;
4) Task B does a memory mapped write for the page corresponding to the
file range from 764K to 768K;
5) Task A starts logging the inode. At copy_inode_items_to_log() it uses
btrfs_search_forward() to search for leafs modified in the current
transaction that contain items for the inode. It finds leaf N and copies
all the inode items from that leaf into the log tree.
Now the log tree has a copy of the last file extent item from leaf N.
At the end of the while loop at copy_inode_items_to_log(), we have the
minimum key set to:
min_key.objectid = <inode X number>
min_key.type = BTRFS_EXTENT_DATA_KEY
min_key.offset = 640K
Then we increment the key's offset by 1 so that the next call to
btrfs_search_forward() leaves us at the first key greater than the key
we just processed.
But before btrfs_search_forward() is called again...
6) Dellaloc for the page at offset 764K, dirtied by task B, is started.
It can be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) When the respective ordered extent completes, it trims the length of
the existing file extent item for file offset 640K from 128K to 124K,
and a new file extent item is added with a key offset of 764K and a
length of 4K;
8) Task A calls btrfs_search_forward(), which returns us a path pointing
to the leaf (can be leaf N or some other) containing the new file extent
item for file offset 764K.
We end up copying this item to the log tree, which overlaps with the
last copied file extent item, which covers the file range from 640K to
768K.
When writeback is triggered for log tree's extent buffers, the issue
will be detected by the tree checker which will dump a trace and an
error message on dmesg/syslog. If the writeback is triggered when
syncing the log, which typically is, then we also end up aborting the
current transaction.
This is the same type of problem fixed in 0c713cbab6 ("Btrfs: fix race
between ranged fsync and writeback of adjacent ranges").
Scenario 2 - logging a version of the file that never existed
This scenario only happens when using the NO_HOLES feature and results in
a silent corruption, in the sense that is not detectable by 'btrfs check'
or the tree checker:
1) We have an inode I with a size of 1M and two file extent items, one
covering an extent with disk_bytenr == X for the file range [0, 512K)
and another one covering another extent with disk_bytenr == Y for the
file range [512K, 1M);
2) A hole is punched for the file range [512K, 1M);
3) Task A starts an fsync of inode I, which has the full sync runtime flag
set. It starts by flushing all existing delalloc, locks the inode (VFS
lock), starts any new delalloc that might have been created before
taking the lock and waits for all ordered extents to complete;
4) Some other task does a memory mapped write for the page corresponding to
the file range [640K, 644K) for example;
5) Task A then logs all items of the inode with the call to
copy_inode_items_to_log();
6) In the meanwhile delalloc for the range [640K, 644K) is started. It can
be started for several reasons:
- The async reclaim task is attempting to satisfy metadata or data
reservation requests, and it has reached a point where it decided
to flush delalloc;
- Due to memory pressure the VMM triggers writeback of dirty pages;
- The system call sync_file_range(2) is called from user space.
7) The ordered extent for the range [640K, 644K) completes and a file
extent item for that range is added to the subvolume tree, pointing
to a 4K extent with a disk_bytenr == Z;
8) Task A then calls btrfs_log_holes(), to scan for implicit holes in
the subvolume tree. It finds two implicit holes:
- one for the file range [512K, 640K)
- one for the file range [644K, 1M)
As a result we end up neither logging a hole for the range [640K, 644K)
nor logging the file extent item with a disk_bytenr == Z.
This means that if we have a power failure and replay the log tree we
end up getting the following file extent layout:
[ disk_bytenr X ] [ hole ] [ disk_bytenr Y ] [ hole ]
0 512K 512K 640K 640K 644K 644K 1M
Which does not corresponding to any layout the file ever had before
the power failure. The only two valid layouts would be:
[ disk_bytenr X ] [ hole ]
0 512K 512K 1M
and
[ disk_bytenr X ] [ hole ] [ disk_bytenr Z ] [ hole ]
0 512K 512K 640K 640K 644K 644K 1M
This can be fixed by serializing memory mapped writes with fsync, and there
are two ways to do it:
1) Make a fsync lock the entire file range, from 0 to (u64)-1 / LLONG_MAX
in the inode's io tree. This prevents the race but also blocks any reads
during the duration of the fsync, which has a negative impact for many
common workloads;
2) Make an fsync write lock the i_mmap_lock semaphore in the inode. This
semaphore was recently added by Josef's patch set:
btrfs: add a i_mmap_lock to our inode
btrfs: cleanup inode_lock/inode_unlock uses
btrfs: exclude mmaps while doing remap
btrfs: exclude mmap from happening during all fallocate operations
and is used to solve races between memory mapped writes and
clone/dedupe/fallocate. This also makes us have the same behaviour we
have regarding other writes (buffered and direct IO) and fsync - block
them while the inode logging is in progress.
This change uses the second approach due to the performance impact of the
first one.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There's a small window where a deadlock can happen between fallocate and
mmap. This is described in detail by Filipe:
"""
When doing a fallocate operation we lock the inode, flush delalloc within
the target range, wait for any ordered extents to complete and then lock
the file range. Before we lock the range and after we flush delalloc,
there is a time window where another task can come in and do a memory
mapped write for a page within the fallocate range.
This means that after fallocate locks the range, there can be a dirty page
in the range. More often than not, this does not cause any problem.
The exception is when we are low on available metadata space, because an
fallocate operation needs to start a transaction while holding the file
range locked, either through btrfs_prealloc_file_range() or through the
call to btrfs_fallocate_update_isize(). If that's the case, we can end up
in a deadlock. The following list of steps explains how that happens:
1) A fallocate operation starts, locks the inode, flushes delalloc in the
range and waits for ordered extents in the range to complete;
2) Before the fallocate task locks the file range, another task does a
memory mapped write for a page in the fallocate target range. This is
possible since memory mapped writes do not (and can not) lock the
inode;
3) The fallocate task locks the file range. At this point there is one
dirty page in the range (due to the memory mapped write);
4) When the fallocate task attempts to start a transaction, it blocks when
attempting to reserve metadata space, since we are low on available
metadata space. Before blocking (wait on its reservation ticket), it
starts the async reclaim task (if not running already);
5) The async reclaim task is not able to release space through any other
means, so it decides to flush delalloc for inodes with dirty pages.
It finds that the inode used in the fallocate operation has a dirty
page and therefore queues a job (fs_info->flush_workers workqueue) to
flush delalloc for that inode and waits on that job to complete;
6) The flush job blocks when attempting to lock the file range because
it is currently locked by the fallocate task;
7) The fallocate task keeps waiting for its metadata reservation, waiting
for a wakeup on its reservation ticket. The async reclaim task is
waiting on the flush job, which in turn is waiting for locking the file
range that is currently locked by the fallocate task. So unless some
other task is able to release enough metadata space, for example an
ordered extent for some other inode completes, we end up in a deadlock
between all these tasks.
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? btrfs_fallocate+0xdcf/0x1260 [btrfs]
btrfs_fallocate+0xdfb/0x1260 [btrfs]
? filename_lookup+0xf1/0x180
vfs_fallocate+0x14f/0x440
ioctl_preallocate+0x92/0xc0
do_vfs_ioctl+0x66b/0x750
? __do_sys_newfstat+0x53/0x60
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix this by disallowing mmaps from happening while we're doing any of
the fallocate operations on this inode.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Darrick reported a potential issue to me where we could allow mmap
writes after validating a page range matched in the case of dedupe.
Generally we rely on lock page -> lock extent with the ordered flush to
protect us, but this is done after we check the pages because we use the
generic helpers, so we could modify the page in between doing the check
and locking the range.
There also exists a deadlock, as described by Filipe
"""
When cloning a file range, we lock the inodes, flush any delalloc within
the respective file ranges, wait for any ordered extents and then lock the
file ranges in both inodes. This means that right after we flush delalloc
and before we lock the file ranges, memory mapped writes can come in and
dirty pages in the file ranges of the clone operation.
Most of the time this is harmless and causes no problems. However, if we
are low on available metadata space, we can later end up in a deadlock
when starting a transaction to replace file extent items. This happens if
when allocating metadata space for the transaction, we need to wait for
the async reclaim thread to release space and the reclaim thread needs to
flush delalloc for the inode that got the memory mapped write and has its
range locked by the clone task.
Basically what happens is the following:
1) A clone operation locks inodes A and B, flushes delalloc for both
inodes in the respective file ranges and waits for any ordered extents
in those ranges to complete;
2) Before the clone task locks the file ranges, another task does a
memory mapped write (which does not lock the inode) for one of the
inodes of the clone operation. So now we have a dirty page in one of
the ranges used by the clone operation;
3) The clone operation locks the file ranges for inodes A and B;
4) Later, when iterating over the file extents of inode A, the clone
task attempts to start a transaction. There's not enough available
free metadata space, so the async reclaim task is started (if not
running already) and we wait for someone to wake us up on our
reservation ticket;
5) The async reclaim task is not able to release space by any other
means and decides to flush delalloc for the inode of the clone
operation;
6) The workqueue job used to flush the inode blocks when starting
delalloc for the inode, since the file range is currently locked by
the clone task;
7) But the clone task is waiting on its reservation ticket and the async
reclaim task is waiting on the flush job to complete, which can't
progress since the clone task has the file range locked. So unless
some other task is able to release space, for example an ordered
extent for some other inode completes, we have a deadlock between all
these tasks;
When this happens stack traces like the following show up in dmesg/syslog:
INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30
(...)
several other tasks blocked on inode locks held by the clone task below
(...)
RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd604690 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000002 R09: 0000560fbd5d75f0
R10: 0000560fbd5d81f0 R11: 0000000000000202 R12: 0000000000000002
R13: 000000000000000b R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task: fdm-stress state:D stack: 0 pid:2508234 ppid:2508153 flags:0x00004000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? lock_release+0x20e/0x4c0
btrfs_clone+0x3e4/0x7e0 [btrfs]
? btrfs_lookup_first_ordered_extent+0x8e/0x100 [btrfs]
btrfs_clone_files+0xf6/0x150 [btrfs]
btrfs_remap_file_range+0x324/0x3d0 [btrfs]
do_clone_file_range+0xd4/0x1f0
vfs_clone_file_range+0x4d/0x230
? lock_release+0x20e/0x4c0
ioctl_file_clone+0x8f/0xc0
do_vfs_ioctl+0x342/0x750
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""
Fix both of these issues by excluding mmaps from happening we are doing
any sort of remap, which prevents this race completely.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
A few places we intermix btrfs_inode_lock with a inode_unlock, and some
places we just use inode_lock/inode_unlock instead of btrfs_inode_lock.
None of these places are using this incorrectly, but as we adjust some
of these callers it would be nice to keep everything consistent, so
convert everybody to use btrfs_inode_lock/btrfs_inode_unlock.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
We need to be able to exclude page_mkwrite from happening concurrently
with certain operations. To facilitate this, add a i_mmap_lock to our
inode, down_read() it in our mkwrite, and add a new ILOCK flag to
indicate that we want to take the i_mmap_lock as well. I used pahole to
check the size of the btrfs_inode, the sizes are as follows
no lockdep:
before: 1120 (3 per 4k page)
after: 1160 (3 per 4k page)
lockdep:
before: 2072 (1 per 4k page)
after: 2224 (1 per 4k page)
We're slightly larger but it doesn't change how many objects we can fit
per page.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The parameter mirror is not used and does not make sense for checksum
verification of the given bio.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
force_cow can be calculated from inode and does not need to be passed as
an argument.
This simplifies run_delalloc_nocow() call from btrfs_run_delalloc_range()
A new function, should_nocow() checks if the range should be NOCOWed or
not. The function returns true iff either BTRFS_INODE_NODATA or
BTRFS_INODE_PREALLOC, but is not a defrag extent.
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Fix the following coccicheck warnings:
./fs/btrfs/volumes.c:1462:10-11: WARNING: return of 0/1 in function
'dev_extent_hole_check_zoned' with return type bool.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we do not do btree read ahead when doing an incremental send,
however we know that we will read and process any node or leaf in the
send root that has a generation greater than the generation of the parent
root. So triggering read ahead for such nodes and leafs is beneficial
for an incremental send.
This change does that, triggers read ahead of any node or leaf in the
send root that has a generation greater then the generation of the
parent root. As for the parent root, no readahead is triggered because
knowing in advance which nodes/leaves are going to be read is not so
linear and there's often a large time window between visiting nodes or
leaves of the parent root. So I opted to leave out the parent root,
and triggering read ahead for its nodes/leaves seemed to have not made
significant difference.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of ram:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, incremental send duration:
with $initial_file_count == 200000: 51 seconds
with $initial_file_count == 500000: 168 seconds
After this change, incremental send duration:
with $initial_file_count == 200000: 39 seconds (-26.7%)
with $initial_file_count == 500000: 125 seconds (-29.4%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, and 77759 nodes and leaves in the btree of
the second snapshot. The root nodes were at level 2.
While for $initial_file_count == 500000 there are 152476 nodes and leaves
in the btree of the first snapshot, and 190511 nodes and leaves in the
btree of the second snapshot. The root nodes were at level 2 as well.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
When doing a full send we know that we are going to be reading every node
and leaf of the send root, so we benefit from enabling read ahead for the
btree.
This change enables read ahead for full send operations only, incremental
sends will have read ahead enabled in a different way by a separate patch.
The following test script was used to measure the improvement on a box
using an average, consumer grade, spinning disk and with 16GiB of RAM:
$ cat test.sh
#!/bin/bash
DEV=/dev/sdj
MNT=/mnt/sdj
MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
MOUNT_OPTIONS="-o max_inline=2048" # default, just to be explicit
mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
# Create files with inline data to make it easier and faster to create
# large btrees.
add_files()
{
local total=$1
local start_offset=$2
local number_jobs=$3
local total_per_job=$(($total / $number_jobs))
echo "Creating $total new files using $number_jobs jobs"
for ((n = 0; n < $number_jobs; n++)); do
(
local start_num=$(($start_offset + $n * $total_per_job))
for ((i = 1; i <= $total_per_job; i++)); do
local file_num=$((start_num + $i))
local file_path="$MNT/file_${file_num}"
xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
if [ $? -ne 0 ]; then
echo "Failed creating file $file_path"
break
fi
done
) &
worker_pids[$n]=$!
done
wait ${worker_pids[@]}
sync
echo
echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 $DEV | egrep '^(node|leaf) ' | wc -l)"
}
initial_file_count=500000
add_files $initial_file_count 0 4
echo
echo "Creating first snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap1
echo
echo "Adding more files..."
add_files $((initial_file_count / 4)) $initial_file_count 4
echo
echo "Updating 1/50th of the initial files..."
for ((i = 1; i < $initial_file_count; i += 50)); do
xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
done
echo
echo "Creating second snapshot..."
btrfs subvolume snapshot -r $MNT $MNT/snap2
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing full send..."
start=$(date +%s)
btrfs send $MNT/snap1 > /dev/null
end=$(date +%s)
echo
echo "Full send took $((end - start)) seconds"
umount $MNT
echo 3 > /proc/sys/vm/drop_caches
blockdev --flushbufs $DEV &> /dev/null
hdparm -F $DEV &> /dev/null
mount $MOUNT_OPTIONS $DEV $MNT
echo
echo "Testing incremental send..."
start=$(date +%s)
btrfs send -p $MNT/snap1 $MNT/snap2 > /dev/null
end=$(date +%s)
echo
echo "Incremental send took $((end - start)) seconds"
umount $MNT
Before this change, full send duration:
with $initial_file_count == 200000: 165 seconds
with $initial_file_count == 500000: 407 seconds
After this change, full send duration:
with $initial_file_count == 200000: 149 seconds (-10.2%)
with $initial_file_count == 500000: 353 seconds (-14.2%)
For $initial_file_count == 200000 there are 62600 nodes and leaves in the
btree of the first snapshot, while for $initial_file_count == 500000 there
are 152476 nodes and leaves. The roots were at level 2.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_block_rsv_add can return only ENOSPC since it's called with
NO_FLUSH modifier. This so simplify the logic in
btrfs_delayed_inode_reserve_metadata to exploit this invariant.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add assert and comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
It's only used for tracepoint to obtain the inode number, but we already
have the ino from btrfs_delayed_node::inode_id.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
It's no longer expected to call this function with an open transaction
so all the workarounds concerning this can be removed. In fact it'll
constitute a bug to call this function with a transaction already held
so WARN in this case.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Drop function declarations at the beginning of the file scrub.c. These
functions are defined before they are used in the same file and don't
need forward declaration.
No functional changes.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_extent_readonly() checks if the block group is readonly, the bool
return type should be used.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So
move it from extent-tree.c to inode.c and declare it as static.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs_inc_block_group_ro wants to ensure that the current transaction is
not running dirty block groups, if it is it waits and loops again.
That logic is currently implemented using a goto label. Actually using
a proper do {} while() construct doesn't hurt readability nor does it
introduce excessive nesting and makes the relevant code stand out by
being encompassed in the loop construct. No functional changes.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
No point in duplicating the functionality just use the generic helper
that has the same semantics.
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There is small error in comment about BTRFS_ORDERED_* flags, added in
commit 3c198fe064 ("btrfs: rework the order of
btrfs_ordered_extent::flags") but the fixup did not get merged in time.
The 4 types are for ordered extent itself, not for direct io.
Only 3 types support direct io, REGULAR/NOCOW/PREALLOC.
Fix the comment to reflect that.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Some filesystem's use a digest of their uuid for f_fsid.
Create a simple wrapper for this open coded folding.
Filesystems that have a non null uuid but use the block device
number for f_fsid may also consider using this helper.
[JK: Added missing asm/byteorder.h include]
Link: https://lore.kernel.org/r/20210322173944.449469-2-amir73il@gmail.com
Acked-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Colin reported before possible overflow and sign extension problems in
io_provide_buffers_prep(). As Linus pointed out previous attempt did nothing
useful, see d81269fecb ("io_uring: fix provide_buffers sign extension").
Do that with help of check_<op>_overflow helpers. And fix struct
io_provide_buf::len type, as it doesn't make much sense to keep it
signed.
Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: efe68c1ca8 ("io_uring: validate the full range of provided buffers for access")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46538827e70fce5f6cdb50897cff4cacc490f380.1618488258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't fail submission attempts if there are CQEs in the overflow
backlog, but give away the decision making to the userspace. It
might be very inconvenient to the userspace, especially if
submission and completion are done by different threads.
We can remove it because of recent changes, where requests
are now not locked by the backlog, backlog entries are allocated
separately, so they take less space and cgroup accounted.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
This does the directory entry name verification for the legacy
"fillonedir" (and compat) interface that goes all the way back to the
dark ages before we had a proper dirent, and the readdir() system call
returned just a single entry at a time.
Nobody should use this interface unless you still have binaries from
1991, but let's do it right.
This came up during discussions about unsafe_copy_to_user() and proper
checking of all the inputs to it, as the networking layer is looking to
use it in a few new places. So let's make sure the _old_ users do it
all right and proper, before we add new ones.
See also commit 8a23eb804c ("Make filldir[64]() verify the directory
entry filename is valid") which did the proper modern interfaces that
people actually use. It had a note:
Note that I didn't bother adding the checks to any legacy interfaces
that nobody uses.
which this now corrects. Note that we really don't care about POSIX and
the presense of '/' in a directory entry, but verify_dirent_name() also
ends up doing the proper name length verification which is what the
input checking discussion was about.
[ Another option would be to remove the support for this particular very
old interface: any binaries that use it are likely a.out binaries, and
they will no longer run anyway since we removed a.out binftm support
in commit eac6165570 ("x86: Deprecate a.out support").
But I'm not sure which came first: getdents() or ELF support, so let's
pretend somebody might still have a working binary that uses the
legacy readdir() case.. ]
Link: https://lore.kernel.org/lkml/CAHk-=wjbvzCAhAtvG0d81W5o0-KT5PPTHhfJ5ieDFq+bGtgOYg@mail.gmail.com/
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
- keep the ZC code, drop the code related to reinit
net/bridge/netfilter/ebtables.c
- fix build after move to net_generic
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmB5/80QHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpgJWD/0YyQ7YlMKw4x2nS6TiojeD5KN8aTvVVHlf
TetYVXIOaMNPatUDREVl7vgaQVFi2cgWLvshYPIarsQK9ESNJ/UqKi0c6pJsa29w
02K8EJsEdmQmJlyC6tUv/Drvdfzinx+jAm9doz8FrJ+2v5+pYqNFhFTh/JKRebI8
kBEKH/e3eHg305TDXcWOoNxX3jD77v1IJo8JavLt1vncQKb6P3dahGwmkFRpRcHZ
sY33baI2jdO/79ybH0F0fNgD+XGnBYShKB7iOlJfT9SeI8NfpR/OCuuPl3gxeeKZ
PwHcsPHA/b0AAUKpWY0sX/86xMUUab2DPJV5nQR5SrgniJLs0Xqm0dvz1r6QM2BK
7EfwbXK48xm6Wop5WFlo+wqzW11ES/nRHm9leRomeeKo6Goe/4kyEgK93QJOkckv
/kt8t4PIxALlWKmgJgKk5kEWiP/QT0LjaZE2lYy9VcfpSoC7E7ZnK1KlvwbYOfZK
QqjNNsRr2z7zQaUCFHDc3mPe/WkV8DygRrXZYl8qfopqndOduceFaGqNe5iZV2IX
NnC7FGY66dcmjqGmOqr+4RvhUYR2+GtazaYgV14LM1cGBO5OtqlS1m+DRDbiavp4
fpHeGGQx78O97LURNJG4Ti/AR5dfSQLQwZzb60UBk43a9ZwVsLtxFquSaV8dkm3U
0hfxANWYXg==
=KEBW
-----END PGP SIGNATURE-----
Merge tag 'io_uring-5.12-2021-04-16' of git://git.kernel.dk/linux-block
Pull io_uring fix from Jens Axboe:
"Fix for a potential hang at exit with SQPOLL from Pavel"
* tag 'io_uring-5.12-2021-04-16' of git://git.kernel.dk/linux-block:
io_uring: fix early sqd_list removal sqpoll hangs
A hand-edit while applying this patch on top of a new base resulted in
a reverted check for re-issue, resulting in spurious -EAGAIN errors.
Fixes: 8c130827f4 ("io_uring: don't alter iopoll reissue fail ret code")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We manage these separately right now, just tie it to the request lifetime
and make it be part of the usual REQ_F_NEED_CLEANUP logic.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We have this in two spots right now, which is a bit fragile. In
preparation for moving REQ_F_POLLED cleanup into the same spot, move
the check into a separate helper so we only have it once.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The function was renamed, so get rid of the declaration.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Implement debugfs_create_str() to easily display names and such in
debugfs.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Tested-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210412102001.415407080@infradead.org
Introduces the TRAP_PERF si_code, and associated siginfo_t field
si_perf. These will be used by the perf event subsystem to send signals
(if requested) to the task where an event occurred.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Acked-by: Arnd Bergmann <arnd@arndb.de> # asm-generic
Link: https://lkml.kernel.org/r/20210408103605.1676875-6-elver@google.com
The re-add handling isn't correct for the multi wait case, so let's
just disable it for now explicitly until we can get that sorted out. This
just turns it into a one-shot request. Since we pass back whether or not
a poll request terminates in multishot mode on completion, this should
not break properly behaving applications that check for IORING_CQE_F_MORE
on completion.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The in-memory XFS_IFEXTENTS is now only used to check if an inode with
extents still needs the extents to be read into memory before doing
operations that need the extent map. Add a new xfs_need_iread_extents
helper that returns true for btree format forks that do not have any
entries in the in-memory extent btree, and use that instead of checking
the XFS_IFEXTENTS flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Just check for an inline format fork instead of the using the equivalent
in-memory XFS_IFINLINE flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Just check for a btree format fork instead of the using the equivalent
in-memory XFS_IFBROOT flag.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Stop using the XFS_IFEXTENTS flag, and instead switch on the fork format
in xfs_idestroy_fork to decide how to cleanup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Directly return from the subfunctions and avoid the error variable. Also
remove the not really needed dp local variable.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
xfs_bmap_one_block is only called for the attribute fork. Move it to
xfs_attr.c, drop the unused whichfork argument and code only executed for
the data fork and rename the result to xfs_attr_is_leaf.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Move the XFS_IFEXTENTS check from the callers into xfs_iread_extents to
simplify the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
[ 245.463317] INFO: task iou-sqp-1374:1377 blocked for more than 122 seconds.
[ 245.463334] task:iou-sqp-1374 state:D flags:0x00004000
[ 245.463345] Call Trace:
[ 245.463352] __schedule+0x36b/0x950
[ 245.463376] schedule+0x68/0xe0
[ 245.463385] __io_uring_cancel+0xfb/0x1a0
[ 245.463407] do_exit+0xc0/0xb40
[ 245.463423] io_sq_thread+0x49b/0x710
[ 245.463445] ret_from_fork+0x22/0x30
It happens when sqpoll forgot to run park_task_work and goes to exit,
then exiting user may remove ctx from sqd_list, and so corresponding
io_sq_thread() -> io_uring_cancel_sqpoll() won't be executed. Hopefully
it just stucks in do_exit() in this case.
Fixes: dbe1bdbb39 ("io_uring: handle signals for IO threads like a normal thread")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmBzdS0eHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGDdAIAIpKH/tAHhH7s7QH
m5ewgE8foP7M5Ue9fp3+JmbtaYSzhCAMcKhqGtat/zk5PvA9AoYCDXrTetfYtBHh
LUOmhL9hcKItNobfkYBok6BiFjGUEL3HMqz5w+MUsMwnXIc4RXqfJmsQ932z9Kxf
yDwe6ehIzJVrQLI/C0mTamYRHu2aiZ1VWzhKuT493rLeg0R2odCCIClPN+/QvCwb
8/sk6l1c8eOUYYMUzKFZifaZGb12qDjRt4pZmk51aMTzg0WCpElJG+7Uqr4QQhZP
p6xeNuUQq6WwxtlDkmo79Uzkrurb5tN2/hZ1RcJhs3EdHfpR0MjIyH3Znnb31gnu
39VjHhg=
=4KP/
-----END PGP SIGNATURE-----
Merge tag 'v5.12-rc7' into driver-core-next
We need the driver core fix in here as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Having poll update function as a part of IORING_OP_POLL_ADD is not
great, we have to do hack around struct layouts and add some overhead in
the way of more popular POLL_ADD. Even more serious drawback is that
POLL_ADD requires file and always grabs it, and so poll update, which
doesn't need it.
Incorporate poll update into IORING_OP_POLL_REMOVE instead of
IORING_OP_POLL_ADD. It also more consistent with timeout remove/update.
Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Isolate poll mask SQE parsing and preparations into a new function,
which will be reused shortly.
Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't allow REQ_OP_POLL_REMOVE to kill apoll requests, users should not
know about it. Also, remove weird -EACCESS in io_poll_update(), it
shouldn't know anything about apoll, and have to work even if happened
to have a poll and an async poll'ed request with same user_data.
Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't reinit io_ring_exit_work()'s exit work/completions on each
iteration, that's wasteful. Also add list_rotate_left(), so if we failed
to complete the task job, we don't try it again and again but defer it
until others are processed.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The only way to get out of io_iopoll_getevents() and continue iterating
is to have empty iopoll_list, otherwise the main loop would just exit.
So, instead of the unlock on 8th time heuristic, do that based on
iopoll_list.
Also, as no one can add new requests to iopoll_list while
io_iopoll_check() hold uring_lock, it's useless to spin with the list
empty, return in that case.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5b8ebe84f5fff7ffa1f708952dfef7fc78b668e2.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
As CQE overflows are now untied from requests and so don't hold any
ref, we don't need to handle exiting/exec'ing cases there anymore.
Moreover, it's much nicer in regards to userspace to save overflowed
CQEs whenever possible, so remove failing on in_idle.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d873b7dab75c7f3039ead9628a745bea01f2cfd2.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't save return values of hrtimer_try_to_cancel() in a variable, but
use right away. It's in general safer to not have an intermediate
variable, which may be reused and passed out wrongly, but it be
contracted out. Also clean io_timeout_extract().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d2566ef7ce632e6882dc13e022a26249b3fd30b5.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
If io_sqe_files_unregister() faults on io_rsrc_ref_quiesce(), it will
fail to do unregister leaving files referenced. And that may well happen
because of a strayed signal or just because it does allocations inside.
In io_ring_ctx_free() do an unsafe version of unregister, as it's
guaranteed to not have requests by that point and so quiesce is useless.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e696e9eade571b51997d0dc1d01f144c6d685c05.1618278933.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Function parameter 'cmd' is rewritten with unused value at locks.c
Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Jeff Layton <jlayton@kernel.org>
Revert of revert of "io_uring: wait potential ->release() on resurrect",
which adds a helper for resurrect not racing completion reinit, as was
removed because of a strange bug with no clear root or link to the
patch.
Was improved, instead of rcu_synchronize(), just wait_for_completion()
because we're at 0 refs and it will happen very shortly. Specifically
use non-interruptible version to ignore all pending signals that may
have ended prior interruptible wait.
This reverts commit cb5e1b8130.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7a080c20f686d026efade810b116b72f88abaff9.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
req_set_fail_links() condition checking is bulky. Even though it's
always in a slow path, it's inlined and generates lots of extra code,
simplify it be moving HARDLINK checking into helpers killing linked
requests.
text data bss dec hex filename
before: 79318 12330 8 91656 16608 ./fs/io_uring.o
after: 79126 12330 8 91464 16548 ./fs/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/96a9387db658a9d5a44ecbfd57c2a62cb888c9b6.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We don't need to store file tables in rsrc nodes, for now it's easier to
handle tables not generically, so move file tables into the context. A
nice side effect is having one less pointer dereference for request with
fixed file initialisation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/de9fc4cd3545f24c26c03be4556f58ba3d18b9c3.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
In preparation for more changes do a little cleanup of
io_sqe_buffers_register(). Move all args/invariant checking into it from
io_buffers_map_alloc(), because it's confusing. And add a bit more
cleaning for the loop.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/93292cb9708c8455e5070cc855861d94e11ca042.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
The FS_IOC_[GS]ETFLAGS/FS_IOC_FS[GS]ETXATTR ioctls are now handled via the
fileattr api. The only unconverted filesystem remaining is CIFS and it is
not allowed to be overlayed due to case insensitive filenames.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Since fuse just passes ioctl args through to/from server, converting to the
fileattr API is more involved, than most other filesystems.
Both .fileattr_set() and .fileattr_get() need to obtain an open file to
operate on. The simplest way is with the following sequence:
FUSE_OPEN
FUSE_IOCTL
FUSE_RELEASE
If this turns out to be a performance problem, it could be optimized for
the case when there's already a file (any file) open for the inode.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Remove vfs_ioc_setflags_prepare(), vfs_ioc_fssetxattr_check() and
simple_fill_fsxattr(), which are no longer used.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Richard Weinberger <richard@nod.at>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Joel Becker <jlbec@evilplan.org>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Ryusuke Konishi <konishi.ryusuke@gmail.com>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Dave Kleikamp <shaggy@kernel.org>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Matthew Garrett <mjg59@srcf.ucam.org>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Mike Marshall <hubcap@omnibond.com>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Andreas Gruenbacher <agruenba@redhat.com>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: Jan Kara <jack@suse.cz>
Use the fileattr API to let the VFS handle locking, permission checking and
conversion.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Cc: David Sterba <dsterba@suse.com>
Add stacking for the fileattr operations.
Add hack for calling security_file_ioctl() for now. Probably better to
have a pair of specific hooks for these operations.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
There's a substantial amount of boilerplate in filesystems handling
FS_IOC_[GS]ETFLAGS/ FS_IOC_FS[GS]ETXATTR ioctls.
Also due to userspace buffers being involved in the ioctl API this is
difficult to stack, as shown by overlayfs issues related to these ioctls.
Introduce a new internal API named "fileattr" (fsxattr can be confused with
xattr, xflags is inappropriate, since this is more than just flags).
There's significant overlap between flags and xflags and this API handles
the conversions automatically, so filesystems may choose which one to use.
In ->fileattr_get() a hint is provided to the filesystem whether flags or
xattr are being requested by userspace, but in this series this hint is
ignored by all filesystems, since generating all the attributes is cheap.
If a filesystem doesn't implemement the fileattr API, just fall back to
f_op->ioctl(). When all filesystems are converted, the fallback can be
removed.
32bit compat ioctls are now handled by the generic code as well.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Randy reports the following error on CONFIG_BLOCK not being set:
../fs/io_uring.c: In function ‘kiocb_done’:
../fs/io_uring.c:2766:7: error: implicit declaration of function ‘io_resubmit_prep’; did you mean ‘io_put_req’? [-Werror=implicit-function-declaration]
if (io_resubmit_prep(req)) {
Provide a dummy stub for io_resubmit_prep() like we do for
io_rw_should_reissue(), which also helps remove an ifdef sequence from
io_complete_rw_iopoll() as well.
Fixes: 8c130827f4 ("io_uring: don't alter iopoll reissue fail ret code")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
There are three cases where we much care about performance of
io_cqring_fill_event() -- flushing inline completions, iopoll and
io_req_complete_post(). Inline a hot part of fill_event() into them.
All others are not as important and we don't want to bloat binary for
them, so add a noinline version of the function for all other use
use cases.
nops test(batch=32): 16.932 vs 17.822 KIOPS
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a11d59424bf4417aca33f5ec21008bb3b0ebd11e.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Take advantage of delayed/inline completion flushing and pass right
issue flags for completion of open, open2, fadvise and poll remove
opcodes. All others either already use it or always punted and never
executed inline.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0badc7512e82f7350b73bb09abbebbecbdd5dab8.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Instead of keeping requests in a inflight_list, just track them with a
per tctx atomic counter. Apart from it being much easier and more
consistent with task cancel, it frees ->inflight_entry from being shared
between iopoll and cancel-track, so less headache for us.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3c2ee0863cd7eeefa605f3eaff4c1c461a6f1157.1618101759.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
hash_del() works well with non-hashed nodes, there's no need to check
if it is hashed first.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Remove error parameter from io_poll_complete(), 0 is always passed,
and do a bit of cleaning on top.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
io_poll_complete() always fills an event (even an overflowed one), so we
always should do io_cqring_ev_posted() afterwards. And that's what is
currently happening, because second EPOLLONESHOT check is always true,
it can't return !done for oneshots.
Remove those branching, it's much easier to read.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Do not include private headers and do not frob in internals.
On top of that, while the previous code restores the affinity, it
doesn't ensure the task actually moves there if it was running,
leading to the fun situation that it can be observed running outside
of its allowed mask for potentially significant time.
Use the proper API instead.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/YG7QkiUzlEbW85TU@hirez.programming.kicks-ass.net
Signed-off-by: Jens Axboe <axboe@kernel.dk>
We currently allow racy updates to multishot requests, but we can end up
double adding the poll request if both completion and update does it.
Ensure that we skip re-add on the update side if someone else is
completing it.
Fixes: b69de288e9 ("io_uring: allow events and user_data update of running poll requests")
Reported-by: Joakim Hassila <joj@mac.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Add struct io_fixed_file representing a single registered file, first to
hide ugly struct file **, which may be misleading, and secondly to
retype it to unsigned long as conversions to it and back to file * for
handling and masking FFS_* flags are getting nasty.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/78669731a605a7614c577c3de552631cfaf0869a.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Introduce a heler io_free_file_tables() doing all the cleaning, there
are several places where it's hand coded. Also move all allocations into
io_sqe_alloc_file_tables() and rename it, so all of it is in one place.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/502a84ebf41ff119b095e59661e678eacb752bf8.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Don't put linked timeout req in io_async_find_and_cancel() but do it in
io_link_timeout_fn(), so we have only one point for that and won't have
to do it differently as it's now (put vs put_deferred). Btw, improve a
bit io_async_find_and_cancel()'s locking.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d75b70957f245275ab7cba83e0ac9c1b86aae78a.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Replace a hand-coded overflow check with a specialised function. Even
though compilers are smart enough to generate identical binary (i.e.
check carry bit), but it's more foolproof and conveys the intention
better.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e437dcdc929bacbb6f11a4824ecbbf17225cb82a.1617287883.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>