Commit Graph

10890 Commits

Author SHA1 Message Date
Omar Sandoval
caae78e032 btrfs: move common inode creation code into btrfs_create_new_inode()
All of our inode creation code paths duplicate the calls to
btrfs_init_inode_security() and btrfs_add_link(). Subvolume creation
additionally duplicates property inheritance and the call to
btrfs_set_inode_index(). Fix this by moving the common code into
btrfs_create_new_inode(). This accomplishes a few things at once:

1. It reduces code duplication.

2. It allows us to set up the inode completely before inserting the
   inode item, removing calls to btrfs_update_inode().

3. It fixes a leak of an inode on disk in some error cases. For example,
   in btrfs_create(), if btrfs_new_inode() succeeds, then we have
   inserted an inode item and its inode ref. However, if something after
   that fails (e.g., btrfs_init_inode_security()), then we end the
   transaction and then decrement the link count on the inode. If the
   transaction is committed and the system crashes before the failed
   inode is deleted, then we leak that inode on disk. Instead, this
   refactoring aborts the transaction when we can't recover more
   gracefully.

4. It exposes various ways that subvolume creation diverges from mkdir
   in terms of inheriting flags, properties, permissions, and POSIX
   ACLs, a lot of which appears to be accidental. This patch explicitly
   does _not_ change the existing non-standard behavior, but it makes
   those differences more clear in the code and documents them so that
   we can discuss whether they should be changed.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Omar Sandoval
3538d68dbd btrfs: reserve correct number of items for inode creation
The various inode creation code paths do not account for the compression
property, POSIX ACLs, or the parent inode item when starting a
transaction. Fix it by refactoring all of these code paths to use a new
function, btrfs_new_inode_prepare(), which computes the correct number
of items. To do so, it needs to know whether POSIX ACLs will be created,
so move the ACL creation into that function. To reduce the number of
arguments that need to be passed around for inode creation, define
struct btrfs_new_inode_args containing all of the relevant information.

btrfs_new_inode_prepare() will also be a good place to set up the
fscrypt context and encrypted filename in the future.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Omar Sandoval
5f465bf1f1 btrfs: factor out common part of btrfs_{mknod,create,mkdir}()
btrfs_{mknod,create,mkdir}() are now identical other than the inode
initialization and some inconsequential function call order differences.
Factor out the common code to reduce code duplication.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Omar Sandoval
a1fd0c35ff btrfs: allocate inode outside of btrfs_new_inode()
Instead of calling new_inode() and inode_init_owner() inside of
btrfs_new_inode(), do it in the callers. This allows us to pass in just
the inode instead of the mnt_userns and mode and removes the need for
memalloc_nofs_{save,restores}() since we do it before starting a
transaction. In create_subvol(), it also means we no longer have to look
up the inode again to instantiate it. This also paves the way for some
more cleanups in later patches.

This also removes the comments about Smack checking i_op, which are no
longer true since commit 5d6c31910b ("xattr: Add
__vfs_{get,set,remove}xattr helpers"). Now it checks inode->i_opflags &
IOP_XATTR, which is set based on sb->s_xattr.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Qu Wenruo
b95b78e628 btrfs: warn when extent buffer leak test fails
Although we have btrfs_extent_buffer_leak_debug_check() (enabled by
CONFIG_BTRFS_DEBUG option) to detect and warn QA testers that we have
some extent buffer leakage, it's just pr_err(), not noisy enough for
fstests to cache.

So here we trigger a WARN_ON() if the allocated_ebs list is not empty.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Anand Jain
b67d73c1ff btrfs: use a local variable for fs_devices pointer in btrfs_dev_replace_finishing
In the function btrfs_dev_replace_finishing, we dereferenced
fs_info->fs_devices 6 times. Use keep local variable for that.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
184b3d1900 btrfs: use btrfs_for_each_slot in btrfs_listxattr
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
43cb1478de btrfs: use btrfs_for_each_slot in btrfs_read_chunk_tree
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
3d64f060a7 btrfs: use btrfs_for_each_slot in btrfs_unlink_all_paths
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
9930e9d4ad btrfs: use btrfs_for_each_slot in process_all_extents
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
69e4317759 btrfs: use btrfs_for_each_slot in process_all_new_xattrs
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
649b96355d btrfs: use btrfs_for_each_slot in process_all_refs
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
35a68080ff btrfs: use btrfs_for_each_slot in is_ancestor
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:08 +02:00
Gabriel Niebler
18f80f1fa4 btrfs: use btrfs_for_each_slot in can_rmdir
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Gabriel Niebler
6dcee26087 btrfs: use btrfs_for_each_slot in did_create_dir
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Gabriel Niebler
a8ce68fd04 btrfs: use btrfs_for_each_slot in btrfs_real_readdir
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Gabriel Niebler
9dcbe16fcc btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Gabriel Niebler
9bc5fc0417 btrfs: use btrfs_for_each_slot in mark_block_group_to_copy
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Gabriel Niebler
36dfbbe25e btrfs: use btrfs_for_each_slot in find_first_block_group
This function can be simplified by refactoring to use the new iterator
macro.  No functional changes.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Gabriel Niebler
62142be363 btrfs: introduce btrfs_for_each_slot iterator macro
There is a common pattern when searching for a key in btrfs:

* Call btrfs_search_slot to find the slot for the key
* Enter an endless loop:
  * If the found slot is larger than the no. of items in the current
    leaf, check the next leaf
  * If it's still not found in the next leaf, terminate the loop
  * Otherwise do something with the found key
  * Increment the current slot and continue

To reduce code duplication, we can replace this code pattern with an
iterator macro, similar to the existing for_each_X macros found
elsewhere in the kernel.  This also makes the code easier to understand
for newcomers by putting a name to the encapsulated functionality.

Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
Signed-off-by: Gabriel Niebler <gniebler@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Qu Wenruo
e360d2f581 btrfs: scrub: rename scrub_bio::pagev and related members
Since the subpage support for scrub, one page no longer always represents
one sector, thus scrub_bio::pagev and scrub_bio::sector_count are no
longer accurate.

Rename them to scrub_bio::sectors and scrub_bio::sector_count respectively.
This also involves scrub_ctx::pages_per_bio and other macros involved.

Now the renaming of pages involved in scrub is be finished.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Qu Wenruo
4634350172 btrfs: scrub: rename scrub_page to scrub_sector
Since the subpage support of scrub, scrub_sector is in fact just
representing one sector.

Thus the name scrub_page is no longer correct, rename it to
scrub_sector.

This also involves the following renames:

- spage -> sector
  Normally we would just replace "page" with "sector" and result
  something like "ssector".
  But the repeating 's' is not really eye friendly.

  So here we just simple use "sector", as there is nothing from MM layer
  called "sector" to cause any confusion.

- scrub_parity::spages -> sectors_list
  Normally we use plural to indicate an array, not a list.
  Rename it to @sectors_list to be more explicit on the list part.

- Also reformat and update comments that get changed

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Qu Wenruo
7e737cbca6 btrfs: scrub: rename members related to scrub_block::pagev
The following will be renamed in this patch:

- scrub_block::pagev -> sectors

- scrub_block::page_count -> sector_count

- SCRUB_MAX_PAGES_PER_BLOCK -> SCRUB_MAX_SECTORS_PER_BLOCK

- page_num -> sector_num to iterate scrub_block::sectors

For now scrub_page is not yet renamed to keep the patch reasonable and
it will be updated in a followup.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Filipe Manana
6a2e9dc46f btrfs: remove trivial wrapper btrfs_read_buffer()
The function btrfs_read_buffer() is useless, it just calls
btree_read_extent_buffer_pages() with exactly the same arguments.

So remove it and rename btree_read_extent_buffer_pages() to
btrfs_read_extent_buffer(), which is a shorter name, has the "btrfs_"
prefix (since it's used outside disk-io.c) and the name is clear enough
about what it does.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Filipe Manana
376a21d752 btrfs: update outdated comment for read_block_for_search()
The comment at the top of read_block_for_search() is very outdated, as it
refers to the blocking versus spinning path locking modes. We no longer
have these two locking modes after we switched the btree locks from custom
code to rw semaphores. So update the comment to stop referring to the
blocking mode and put it more up to date.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:07 +02:00
Filipe Manana
b246666ef7 btrfs: release upper nodes when reading stale btree node from disk
When reading a btree node (or leaf), at read_block_for_search(), if we
can't find its extent buffer in the cache (the fs_info->buffer_radix
radix tree), then we unlock all upper level nodes before reading the
btree node/leaf from disk, to prevent blocking other tasks for too long.

However if we find that the extent buffer is in the cache but it is not
up to date, we don't unlock upper level nodes before reading it from disk,
potentially blocking other tasks on upper level nodes for too long.

Fix this inconsistent behaviour by unlocking upper level nodes if we need
to read a node/leaf from disk because its in-memory extent buffer is not
up to date. If we unlocked upper level nodes then we must return -EAGAIN
to the caller, just like the case where the extent buffer is not cached in
memory. And like that case, we determine if upper level nodes are locked
by checking only if the parent node is locked - if it isn't, then no other
upper level nodes are locked.

This is actually a rare case, as if we have an extent buffer in memory,
it typically has the uptodate flag set and passes all the checks done by
btrfs_buffer_uptodate().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Filipe Manana
4bb59055bc btrfs: avoid unnecessary btree search restarts when reading node
When reading a btree node, at read_block_for_search(), if we don't find
the node's (or leaf) extent buffer in the cache, we will read it from
disk. Since that requires waiting on IO, we release all upper level nodes
from our path before reading the target node/leaf, and then return -EAGAIN
to the caller, which will make the caller restart the while btree search.

However we are causing the restart of btree search even for cases where
it is not necessary:

1) We have a path with ->skip_locking set to true, typically when doing
   a search on a commit root, so we are never holding locks on any node;

2) We are doing a read search (the "ins_len" argument passed to
   btrfs_search_slot() is 0), or we are doing a search to modify an
   existing key (the "cow" argument passed to btrfs_search_slot() has
   a value of 1 and "ins_len" is 0), in which case we never hold locks
   for upper level nodes;

3) We are doing a search to insert or delete a key, in which case we may
   or may not have upper level nodes locked. That depends on the current
   minimum write lock levels at btrfs_search_slot(), if we had to split
   or merge parent nodes, if we had to COW upper level nodes and if
   we ever visited slot 0 of an upper level node. It's still common to
   not have upper level nodes locked, but our current node must be at
   least at level 1, for insertions, or at least at level 2 for deletions.
   In these cases when we have locks on upper level nodes, they are always
   write locks.

These cases where we are not holding locks on upper level nodes far
outweigh the cases where we are holding locks, so it's completely wasteful
to retry the whole search when we have no upper nodes locked.

So change the logic to not return -EAGAIN, and make the caller retry the
search, when we don't have the parent node locked - when it's not locked
it means no other upper level nodes are locked as well.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
305eaac009 btrfs: set inode flags earlier in btrfs_new_inode()
btrfs_new_inode() inherits the inode flags from the parent directory and
the mount options _after_ we fill the inode item. This works because all
of the callers of btrfs_new_inode() make further changes to the inode
and then call btrfs_update_inode(). It'd be better to fully initialize
the inode once to avoid the extra update, so as a first step, set the
inode flags _before_ filling the inode item.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
6437d45835 btrfs: move btrfs_get_free_objectid() call into btrfs_new_inode()
Every call of btrfs_new_inode() is immediately preceded by a call to
btrfs_get_free_objectid(). Since getting an inode number is part of
creating a new inode, this is better off being moved into
btrfs_new_inode(). While we're here, get rid of the comment about
reclaiming inode numbers, since we only did that when using the ino
cache, which was removed by commit 5297199a8b ("btrfs: remove inode
number cache feature").

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
23c24ef8e4 btrfs: don't pass parent objectid to btrfs_new_inode() explicitly
For everything other than a subvolume root inode, we get the parent
objectid from the parent directory. For the subvolume root inode, the
parent objectid is the same as the inode's objectid. We can find this
within btrfs_new_inode() instead of passing it.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
70dc55f428 btrfs: remove redundant name and name_len parameters to create_subvol
The passed dentry already contains the name.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
75b993cf43 btrfs: remove unused mnt_userns parameter from __btrfs_set_acl
Commit 4a8b34afa9 ("btrfs: handle ACLs on idmapped mounts") added this
parameter but didn't use it. __btrfs_set_acl() is the low-level helper
that writes an ACL to disk. The higher-level btrfs_set_acl() is the one
that translates the ACL based on the user namespace.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
c51fa51190 btrfs: remove unnecessary set_nlink() in btrfs_create_subvol_root()
btrfs_new_inode() already returns an inode with nlink set to 1 (via
inode_init_always()). Get rid of the unnecessary set.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
6d831f7ef9 btrfs: remove unnecessary inode_set_bytes(0) call
new_inode() always returns an inode with i_blocks and i_bytes set to 0
(via inode_init_always()). Remove the unnecessary call to
inode_set_bytes() in btrfs_new_inode().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
9124e15f27 btrfs: remove unnecessary btrfs_i_size_write(0) calls
btrfs_new_inode() always returns an inode with i_size and disk_i_size
set to 0 (via inode_init_always() and btrfs_alloc_inode(),
respectively). Remove the unnecessary calls to btrfs_i_size_write() in
btrfs_mkdir() and btrfs_create_subvol_root().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
81512e89f2 btrfs: get rid of btrfs_add_nondir()
This is a trivial wrapper around btrfs_add_link(). The only thing it
does other than moving arguments around is translating a > 0 return
value to -EEXIST. As far as I can tell, btrfs_add_link() won't return >
0 (and if it did, the existing callsites in, e.g., btrfs_mkdir() would
be broken). The check itself dates back to commit 2c90e5d658 ("Btrfs:
still corruption hunting"), so it's probably left over from debugging.
Let's just get rid of btrfs_add_nondir().

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
2256e901f5 btrfs: fix anon_dev leak in create_subvol()
When btrfs_qgroup_inherit(), btrfs_alloc_tree_block, or
btrfs_insert_root() fail in create_subvol(), we return without freeing
anon_dev. Reorganize the error handling in create_subvol() to fix this.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
c162187143 btrfs: reserve correct number of items for rename
btrfs_rename() and btrfs_rename_exchange() don't account for enough
items. Replace the incorrect explanations with a specific breakdown of
the number of items and account them accurately.

Note that this glosses over RENAME_WHITEOUT because the next commit is
going to rework that, too.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:06 +02:00
Omar Sandoval
bca4ad7c0b btrfs: reserve correct number of items for unlink and rmdir
__btrfs_unlink_inode() calls btrfs_update_inode() on the parent
directory in order to update its size and sequence number. Make sure we
account for it.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:05 +02:00
Linus Torvalds
4b97bac075 for-5.18-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJ1X+YACgkQxWXV+ddt
 WDs75g/8DsrDBe/Sbn/bDLRw30D0StC5xRhaVJLNcAhSfIyyoBo0EKifd2nXdNSn
 rdc3pvRToPh2X1YQ/5FmtxuYW12N4pVOfWtXKdoFFXabMVJetDBSnS8KNzxBd9Ys
 OkiKj2qb8H+1uNwWVLxfbFBNWWPyCQLe/2DDmePxOAszs4YUPvwxffyA8oclyHb5
 wW0qOJAC76Lz6y5Wo/GJrtAdlvWwb3t8+IDUKgGPT1BEIOE/fna+MhvEwqvCdY9F
 PPU5sWIXkAv3addgrcBHVX5HdAzRC0jAv2lHsttfu4dEOmTzw8dCTUh/IzSRa3fj
 IVy7AIGR+ZR++d4tOhAEZBe1KAqntY3UVmXY19cKTHOLLWFjv4XkKw8KJzsDiHUt
 rczedAFdgRRo9QIgwSsAU9Zi8DSG56BSenxsqFzqiL5BDDX1bUFXCegNYR42GfB/
 8E89eYkBCXxP6XeA1+44EOalCdqLKuQOyEijTUxn0UDGdqHHB1Gd/IUQDU5fpCqo
 kX6gdgMhNmjGJ/zETfOdFrYZaHYJUiiIRc6z8SCgM3JIPBgVw0FAa99Hs8Pl+eJn
 idmpfnHn6Xvmq46FISVRWgolkuj7VzTEOM65rNgOY3889Vk9Qt2qjI/DvYPGDs9Y
 9PQI1FY+2E+ZqbNY8dZpXDii+6y36aAmGR1B10x3545+C36CA/0=
 =S4KK
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "Regression fixes in zone activation:

   - move a loop invariant out of the loop to avoid checking space
     status

   - properly handle unlimited activation

  Other fixes:

   - for subpage, force the free space v2 mount to avoid a warning and
     make it easy to switch a filesystem on different page size systems

   - export sysfs status of exclusive operation 'balance paused', so the
     user space tools can recognize it and allow adding a device with
     paused balance

   - fix assertion failure when logging directory key range item"

* tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: sysfs: export the balance paused state of exclusive operation
  btrfs: fix assertion failure when logging directory key range item
  btrfs: zoned: activate block group properly on unlimited active zone device
  btrfs: zoned: move non-changing condition check out of the loop
  btrfs: force v2 space cache usage for subpage mount
2022-05-06 14:32:16 -07:00
David Sterba
3e1ad19638 btrfs: sysfs: export the balance paused state of exclusive operation
The new state allowing device addition with paused balance is not
exported to user space so it can't recognize it and actually start the
operation.

Fixes: efc0e69c2f ("btrfs: introduce exclusive operation BALANCE_PAUSED state")
CC: stable@vger.kernel.org # 5.17
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05 21:05:56 +02:00
Filipe Manana
750ee45490 btrfs: fix assertion failure when logging directory key range item
When inserting a key range item (BTRFS_DIR_LOG_INDEX_KEY) while logging
a directory, we don't expect the insertion to fail with -EEXIST, because
we are holding the directory's log_mutex and we have dropped all existing
BTRFS_DIR_LOG_INDEX_KEY keys from the log tree before we started to log
the directory. However it's possible that during the logging we attempt
to insert the same BTRFS_DIR_LOG_INDEX_KEY key twice, but for this to
happen we need to race with insertions of items from other inodes in the
subvolume's tree while we are logging a directory. Here's how this can
happen:

1) We are logging a directory with inode number 1000 that has its items
   spread across 3 leaves in the subvolume's tree:

   leaf A - has index keys from the range 2 to 20 for example. The last
   item in the leaf corresponds to a dir item for index number 20. All
   these dir items were created in a past transaction.

   leaf B - has index keys from the range 22 to 100 for example. It has
   no keys from other inodes, all its keys are dir index keys for our
   directory inode number 1000. Its first key is for the dir item with
   a sequence number of 22. All these dir items were also created in a
   past transaction.

   leaf C - has index keys for our directory for the range 101 to 120 for
   example. This leaf also has items from other inodes, and its first
   item corresponds to the dir item for index number 101 for our directory
   with inode number 1000;

2) When we finish processing the items from leaf A at log_dir_items(),
   we log a BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and a last
   offset of 21, meaning the log is authoritative for the index range
   from 21 to 21 (a single sequence number). At this point leaf B was
   not yet modified in the current transaction;

3) When we return from log_dir_items() we have released our read lock on
   leaf B, and have set *last_offset_ret to 21 (index number of the first
   item on leaf B minus 1);

4) Some other task inserts an item for other inode (inode number 1001 for
   example) into leaf C. That resulted in pushing some items from leaf C
   into leaf B, in order to make room for the new item, so now leaf B
   has dir index keys for the sequence number range from 22 to 102 and
   leaf C has the dir items for the sequence number range 103 to 120;

5) At log_directory_changes() we call log_dir_items() again, passing it
   a 'min_offset' / 'min_key' value of 22 (*last_offset_ret from step 3
   plus 1, so 21 + 1). Then btrfs_search_forward() leaves us at slot 0
   of leaf B, since leaf B was modified in the current transaction.

   We have also initialized 'last_old_dentry_offset' to 20 after calling
   btrfs_previous_item() at log_dir_items(), as it left us at the last
   item of leaf A, which refers to the dir item with sequence number 20;

6) We then call process_dir_items_leaf() to process the dir items of
   leaf B, and when we process the first item, corresponding to slot 0,
   sequence number 22, we notice the dir item was created in a past
   transaction and its sequence number is greater than the value of
   *last_old_dentry_offset + 1 (20 + 1), so we decide to log again a
   BTRFS_DIR_LOG_INDEX_KEY key with an offset of 21 and an end range
   of 21 (key.offset - 1 == 22 - 1 == 21), which results in an -EEXIST
   error from insert_dir_log_key(), as we have already inserted that
   key at step 2, triggering the assertion at process_dir_items_leaf().

The trace produced in dmesg is like the following:

assertion failed: ret != -EEXIST, in fs/btrfs/tree-log.c:3857
[198255.980839][ T7460] ------------[ cut here ]------------
[198255.981666][ T7460] kernel BUG at fs/btrfs/ctree.h:3617!
[198255.983141][ T7460] invalid opcode: 0000 [#1] PREEMPT SMP KASAN PTI
[198255.984080][ T7460] CPU: 0 PID: 7460 Comm: repro-ghost-dir Not tainted 5.18.0-5314c78ac373-misc-next+
[198255.986027][ T7460] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[198255.988600][ T7460] RIP: 0010:assertfail.constprop.0+0x1c/0x1e
[198255.989465][ T7460] Code: 8b 4c 89 (...)
[198255.992599][ T7460] RSP: 0018:ffffc90007387188 EFLAGS: 00010282
[198255.993414][ T7460] RAX: 000000000000003d RBX: 0000000000000065 RCX: 0000000000000000
[198255.996056][ T7460] RDX: 0000000000000001 RSI: ffffffff8b62b180 RDI: fffff52000e70e24
[198255.997668][ T7460] RBP: ffffc90007387188 R08: 000000000000003d R09: ffff8881f0e16507
[198255.999199][ T7460] R10: ffffed103e1c2ca0 R11: 0000000000000001 R12: 00000000ffffffef
[198256.000683][ T7460] R13: ffff88813befc630 R14: ffff888116c16e70 R15: ffffc90007387358
[198256.007082][ T7460] FS:  00007fc7f7c24640(0000) GS:ffff8881f0c00000(0000) knlGS:0000000000000000
[198256.009939][ T7460] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[198256.014133][ T7460] CR2: 0000560bb16d0b78 CR3: 0000000140b34005 CR4: 0000000000170ef0
[198256.015239][ T7460] Call Trace:
[198256.015674][ T7460]  <TASK>
[198256.016313][ T7460]  log_dir_items.cold+0x16/0x2c
[198256.018858][ T7460]  ? replay_one_extent+0xbf0/0xbf0
[198256.025932][ T7460]  ? release_extent_buffer+0x1d2/0x270
[198256.029658][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
[198256.031114][ T7460]  ? lock_acquired+0xbe/0x660
[198256.032633][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
[198256.034386][ T7460]  ? lock_release+0xcf/0x8a0
[198256.036152][ T7460]  log_directory_changes+0xf9/0x170
[198256.036993][ T7460]  ? log_dir_items+0xba0/0xba0
[198256.037661][ T7460]  ? do_raw_write_unlock+0x7d/0xe0
[198256.038680][ T7460]  btrfs_log_inode+0x233b/0x26d0
[198256.041294][ T7460]  ? log_directory_changes+0x170/0x170
[198256.042864][ T7460]  ? btrfs_attach_transaction_barrier+0x60/0x60
[198256.045130][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
[198256.046568][ T7460]  ? lock_release+0xcf/0x8a0
[198256.047504][ T7460]  ? lock_downgrade+0x420/0x420
[198256.048712][ T7460]  ? ilookup5_nowait+0x81/0xa0
[198256.049747][ T7460]  ? lock_downgrade+0x420/0x420
[198256.050652][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
[198256.051618][ T7460]  ? __might_resched+0x128/0x1c0
[198256.052511][ T7460]  ? __might_sleep+0x66/0xc0
[198256.053442][ T7460]  ? __kasan_check_read+0x11/0x20
[198256.054251][ T7460]  ? iget5_locked+0xbd/0x150
[198256.054986][ T7460]  ? run_delayed_iput_locked+0x110/0x110
[198256.055929][ T7460]  ? btrfs_iget+0xc7/0x150
[198256.056630][ T7460]  ? btrfs_orphan_cleanup+0x4a0/0x4a0
[198256.057502][ T7460]  ? free_extent_buffer+0x13/0x20
[198256.058322][ T7460]  btrfs_log_inode+0x2654/0x26d0
[198256.059137][ T7460]  ? log_directory_changes+0x170/0x170
[198256.060020][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
[198256.060930][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
[198256.061905][ T7460]  ? lock_contended+0x770/0x770
[198256.062682][ T7460]  ? btrfs_log_inode_parent+0xd04/0x1750
[198256.063582][ T7460]  ? lock_downgrade+0x420/0x420
[198256.064432][ T7460]  ? preempt_count_sub+0x18/0xc0
[198256.065550][ T7460]  ? __mutex_lock+0x580/0xdc0
[198256.066654][ T7460]  ? stack_trace_save+0x94/0xc0
[198256.068008][ T7460]  ? __kasan_check_write+0x14/0x20
[198256.072149][ T7460]  ? __mutex_unlock_slowpath+0x12a/0x430
[198256.073145][ T7460]  ? mutex_lock_io_nested+0xcd0/0xcd0
[198256.074341][ T7460]  ? wait_for_completion_io_timeout+0x20/0x20
[198256.075345][ T7460]  ? lock_downgrade+0x420/0x420
[198256.076142][ T7460]  ? lock_contended+0x770/0x770
[198256.076939][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
[198256.078401][ T7460]  ? btrfs_sync_file+0x5e6/0xa40
[198256.080598][ T7460]  btrfs_log_inode_parent+0x523/0x1750
[198256.081991][ T7460]  ? wait_current_trans+0xc8/0x240
[198256.083320][ T7460]  ? lock_downgrade+0x420/0x420
[198256.085450][ T7460]  ? btrfs_end_log_trans+0x70/0x70
[198256.086362][ T7460]  ? rcu_read_lock_sched_held+0x16/0x80
[198256.087544][ T7460]  ? lock_release+0xcf/0x8a0
[198256.088305][ T7460]  ? lock_downgrade+0x420/0x420
[198256.090375][ T7460]  ? dget_parent+0x8e/0x300
[198256.093538][ T7460]  ? do_raw_spin_lock+0x1c0/0x1c0
[198256.094918][ T7460]  ? lock_downgrade+0x420/0x420
[198256.097815][ T7460]  ? do_raw_spin_unlock+0xa9/0x100
[198256.101822][ T7460]  ? dget_parent+0xb7/0x300
[198256.103345][ T7460]  btrfs_log_dentry_safe+0x48/0x60
[198256.105052][ T7460]  btrfs_sync_file+0x629/0xa40
[198256.106829][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
[198256.109655][ T7460]  ? __fget_files+0x161/0x230
[198256.110760][ T7460]  vfs_fsync_range+0x6d/0x110
[198256.111923][ T7460]  ? start_ordered_ops.constprop.0+0x120/0x120
[198256.113556][ T7460]  __x64_sys_fsync+0x45/0x70
[198256.114323][ T7460]  do_syscall_64+0x5c/0xc0
[198256.115084][ T7460]  ? syscall_exit_to_user_mode+0x3b/0x50
[198256.116030][ T7460]  ? do_syscall_64+0x69/0xc0
[198256.116768][ T7460]  ? do_syscall_64+0x69/0xc0
[198256.117555][ T7460]  ? do_syscall_64+0x69/0xc0
[198256.118324][ T7460]  ? sysvec_call_function_single+0x57/0xc0
[198256.119308][ T7460]  ? asm_sysvec_call_function_single+0xa/0x20
[198256.120363][ T7460]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[198256.121334][ T7460] RIP: 0033:0x7fc7fe97b6ab
[198256.122067][ T7460] Code: 0f 05 48 (...)
[198256.125198][ T7460] RSP: 002b:00007fc7f7c23950 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[198256.126568][ T7460] RAX: ffffffffffffffda RBX: 00007fc7f7c239f0 RCX: 00007fc7fe97b6ab
[198256.127942][ T7460] RDX: 0000000000000002 RSI: 000056167536bcf0 RDI: 0000000000000004
[198256.129302][ T7460] RBP: 0000000000000004 R08: 0000000000000000 R09: 000000007ffffeb8
[198256.130670][ T7460] R10: 00000000000001ff R11: 0000000000000293 R12: 0000000000000001
[198256.132046][ T7460] R13: 0000561674ca8140 R14: 00007fc7f7c239d0 R15: 000056167536dab8
[198256.133403][ T7460]  </TASK>

Fix this by treating -EEXIST as expected at insert_dir_log_key() and have
it update the item with an end offset corresponding to the maximum between
the previously logged end offset and the new requested end offset. The end
offsets may be different due to dir index key deletions that happened as
part of unlink operations while we are logging a directory (triggered when
fsyncing some other inode parented by the directory) or during renames
which always attempt to log a single dir index deletion.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Link: https://lore.kernel.org/linux-btrfs/YmyefE9mc2xl5ZMz@hungrycats.org/
Fixes: 732d591a5d ("btrfs: stop copying old dir items when logging a directory")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05 21:05:56 +02:00
Naohiro Aota
ceb4f60830 btrfs: zoned: activate block group properly on unlimited active zone device
btrfs_zone_activate() checks if it activated all the underlying zones in
the loop. However, that check never hit on an unlimited activate zone
device (max_active_zones == 0).

Fortunately, it still works without ENOSPC because btrfs_zone_activate()
returns true in the end, even if block_group->zone_is_active == 0. But, it
is confusing to have non zone_is_active block group still usable for
allocation. Also, we are wasting CPU time to iterate the loop every time
btrfs_zone_activate() is called for the blog groups.

Since error case in the loop is handled by out_unlock, we can just set
zone_is_active and do the list stuff after the loop.

Fixes: f9a912a3c4 ("btrfs: zoned: make zone activation multi stripe capable")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05 21:05:56 +02:00
Naohiro Aota
549577127a btrfs: zoned: move non-changing condition check out of the loop
btrfs_zone_activate() checks if block_group->alloc_offset ==
block_group->zone_capacity every time it iterates the loop. But, it is
not depending on the index. Move out the check and do it only once.

Fixes: f9a912a3c4 ("btrfs: zoned: make zone activation multi stripe capable")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05 21:05:56 +02:00
Qu Wenruo
9f73f1aef9 btrfs: force v2 space cache usage for subpage mount
[BUG]
For a 4K sector sized btrfs with v1 cache enabled and only mounted on
systems with 4K page size, if it's mounted on subpage (64K page size)
systems, it can cause the following warning on v1 space cache:

 BTRFS error (device dm-1): csum mismatch on free space cache
 BTRFS warning (device dm-1): failed to load free space cache for block group 84082688, rebuilding it now

Although not a big deal, as kernel can rebuild it without problem, such
warning will bother end users, especially if they want to switch the
same btrfs seamlessly between different page sized systems.

[CAUSE]
V1 free space cache is still using fixed PAGE_SIZE for various bitmap,
like BITS_PER_BITMAP.

Such hard-coded PAGE_SIZE usage will cause various mismatch, from v1
cache size to checksum.

Thus kernel will always reject v1 cache with a different PAGE_SIZE with
csum mismatch.

[FIX]
Although we should fix v1 cache, it's already going to be marked
deprecated soon.

And we have v2 cache based on metadata (which is already fully subpage
compatible), and it has almost everything superior than v1 cache.

So just force subpage mount to use v2 cache on mount.

Reported-by: Matt Corallo <blnxfsl@bluematt.me>
CC: stable@vger.kernel.org # 5.15+
Link: https://lore.kernel.org/linux-btrfs/61aa27d1-30fc-c1a9-f0f4-9df544395ec3@bluematt.me/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-05 21:05:56 +02:00
Linus Torvalds
9050ba3a61 for-5.18-rc5-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJwBoQACgkQxWXV+ddt
 WDu/FQ/+L2LpN5Zu1NkjOAh2Lcvz5RYZjcVext4bbPUW2yhXYH3e6836R/feLOCG
 RxRICOAQhJ7I6ct/N1aToI2AbjWnMSJK+IgageA1UdIS8McbcSP/qJOYwJ/+2Xhl
 AvK5psoj+qwGbTI9e0luNe6b+UWTGIMyYRjmN0SlkBOdg9/xqQpBMQxfKJumMvEc
 3ZwLpcNjhUUwdFKHvHZNCOQhZiZwloKFeq9MLaEL5LO30wKSY6ShiCA3pafFoVN0
 mvEEVtIGgZUsgeQTzSzD8UhGDvZtZ1+aaX0dcNMRzwI2h2pkBmPkd5QtFM9Qs0xP
 hGibSN9bC/SzQyE9v4cKohwS+g4dE4r+dUWFNpdZLIOpBt5PJBDA0tjcjxquFtMr
 6JX77coAl9kt0jspBmHVPb3qmIc1Xo3Iw2PrVgTK14QUo46XwF5Rga68wKOfNt0u
 LbD9+KCLnwxoOhvXh/LJX6nvPT8tuMrT/5AOXULI2oMCnpCY6Vl+kX8zFlAfbDk1
 d4/jy42bgHCso60j2vAcdQmZB+/snpboOhKJwkE2FqOTs+hBR8PBln6BqEt5xkHZ
 q2mBfYDujnmZWaiAU6+ETjOcCooWQioi3335tp/C9TdOrIkFDij1ztYRxhgP0g0X
 w6d6wlbkcol1Zubb/zEiBiwe+6GR/KtNYc4PBEfVe5Vw0npFreU=
 =7k9f
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more fixes mostly around how some file attributes could be set.

   - fix handling of compression property:
      - don't allow setting it on anything else than regular file or
        directory
      - do not allow setting it on nodatacow files via properties

   - improved error handling when setting xattr

   - make sure symlinks are always properly logged"

* tag 'for-5.18-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: skip compression property for anything other than files and dirs
  btrfs: do not BUG_ON() on failure to update inode when setting xattr
  btrfs: always log symlinks in full mode
  btrfs: do not allow compression on nodatacow files
  btrfs: export a helper for compression hard check
2022-05-02 10:09:02 -07:00
Filipe Manana
4b73c55fde btrfs: skip compression property for anything other than files and dirs
The compression property only has effect on regular files and directories
(so that it's propagated to files and subdirectories created inside a
directory). For any other inode type (symlink, fifo, device, socket),
it's pointless to set the compression property because it does nothing
and ends up unnecessarily wasting leaf space due to the pointless xattr
(75 or 76 bytes, depending on the compression value). Symlinks in
particular are very common (for example, I have almost 10k symlinks under
/etc, /usr and /var alone) and therefore it's worth to avoid wasting
leaf space with the compression xattr.

For example, the compression property can end up on a symlink or character
device implicitly, through inheritance from a parent directory

  $ mkdir /mnt/testdir
  $ btrfs property set /mnt/testdir compression lzo

  $ ln -s yadayada /mnt/testdir/lnk
  $ mknod /mnt/testdir/dev c 0 0

Or explicitly like this:

  $ ln -s yadayda /mnt/lnk
  $ setfattr -h -n btrfs.compression -v lzo /mnt/lnk

So skip the compression property on inodes that are neither a regular
file nor a directory.

CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27 22:20:21 +02:00
Filipe Manana
193b4e8398 btrfs: do not BUG_ON() on failure to update inode when setting xattr
We are doing a BUG_ON() if we fail to update an inode after setting (or
clearing) a xattr, but there's really no reason to not instead simply
abort the transaction and return the error to the caller. This should be
a rare error because we have previously reserved enough metadata space to
update the inode and the delayed inode should have already been setup, so
an -ENOSPC or -ENOMEM, which are the possible errors, are very unlikely to
happen.

So replace the BUG_ON()s with a transaction abort.

CC: stable@vger.kernel.org # 4.9+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27 22:20:21 +02:00
Filipe Manana
d0e64a981f btrfs: always log symlinks in full mode
On Linux, empty symlinks are invalid, and attempting to create one with
the system call symlink(2) results in an -ENOENT error and this is
explicitly documented in the man page.

If we rename a symlink that was created in the current transaction and its
parent directory was logged before, we actually end up logging the symlink
without logging its content, which is stored in an inline extent. That
means that after a power failure we can end up with an empty symlink,
having no content and an i_size of 0 bytes.

It can be easily reproduced like this:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt

  $ mkdir /mnt/testdir
  $ sync

  # Create a file inside the directory and fsync the directory.
  $ touch /mnt/testdir/foo
  $ xfs_io -c "fsync" /mnt/testdir

  # Create a symlink inside the directory and then rename the symlink.
  $ ln -s /mnt/testdir/foo /mnt/testdir/bar
  $ mv /mnt/testdir/bar /mnt/testdir/baz

  # Now fsync again the directory, this persist the log tree.
  $ xfs_io -c "fsync" /mnt/testdir

  <power failure>

  $ mount /dev/sdc /mnt
  $ stat -c %s /mnt/testdir/baz
  0
  $ readlink /mnt/testdir/baz
  $

Fix this by always logging symlinks in full mode (LOG_INODE_ALL), so that
their content is also logged.

A test case for fstests will follow.

CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27 22:20:21 +02:00
Chung-Chiang Cheng
0e852ab897 btrfs: do not allow compression on nodatacow files
Compression and nodatacow are mutually exclusive. A similar issue was
fixed by commit f37c563bab ("btrfs: add missing check for nocow and
compression inode flags"). Besides ioctl, there is another way to
enable/disable/reset compression directly via xattr. The following
steps will result in a invalid combination.

  $ touch bar
  $ chattr +C bar
  $ lsattr bar
  ---------------C-- bar
  $ setfattr -n btrfs.compression -v zstd bar
  $ lsattr bar
  --------c------C-- bar

To align with the logic in check_fsflags, nocompress will also be
unacceptable after this patch, to prevent mix any compression-related
options with nodatacow.

  $ touch bar
  $ chattr +C bar
  $ lsattr bar
  ---------------C-- bar
  $ setfattr -n btrfs.compression -v zstd bar
  setfattr: bar: Invalid argument
  $ setfattr -n btrfs.compression -v no bar
  setfattr: bar: Invalid argument

When both compression and nodatacow are enabled, then
btrfs_run_delalloc_range prefers nodatacow and no compression happens.

Reported-by: Jayce Lin <jaycelin@synology.com>
CC: stable@vger.kernel.org # 5.10.x: e6f9d69648: btrfs: export a helper for compression hard check
CC: stable@vger.kernel.org # 5.10.x
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27 22:19:33 +02:00
Chung-Chiang Cheng
e6f9d69648 btrfs: export a helper for compression hard check
inode_can_compress will be used outside of inode.c to check the
availability of setting compression flag by xattr. This patch moves
this function as an internal helper and renames it to
btrfs_inode_can_compress.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Chung-Chiang Cheng <cccheng@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-27 22:15:40 +02:00
Linus Torvalds
fd574a2f84 for-5.18-rc4-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJnGGIACgkQxWXV+ddt
 WDumDw//cE1NcawdnVkEaKr20PetHfzPyFSIIr17nedtnVvWYyOFF/0uJlHNhv8Z
 CZIfJ7fmH/pO5oWPXN84wKNfumDWNwc36QrvoXC67TrKUSiBN8BzL83HvAjGwYFH
 G+LfZXGnVbqq8F1iYkIsuH0Oo1x/N/LPM3s6iZy3O4l8s96u+J4GRnc8Tr0AH4MA
 zgz3fab8Ec378HTG9fvdAQNLxFEe0VatD6WrzILnmM8UgeQK7g73dqH9Ni9gz2DW
 2GDlO6aevQ1G6dm2AJ0ItExnbHH7TfOThkG56Gdqrzb/d39GzrVpeob7QiorETus
 EWS1rXaeikUiD4Bzt/RszUNL80yMN1DjcN3QBkiDf3ShSDFteoHMPw3e6jcQCy1m
 Dxf5oditQqltuFNLeSiVbZEMw2kXqBP7RoPiirF9rdvrDNLHhAE9wu0kpSGSSvT7
 Tyu9JyLw2axU6wGTi1GHAXurlW2ItRRyFAewWWul1lLkuz/6YXI4F/EHm3Mbh6Nh
 pMIFMNr4Oafdx+3Ful8ZA4PynirXub/xVDefcFBibz/PTGEnHG4ZVzRudmVnowh7
 GP2pql1+Y/TFkXdD98V8GWD+E10JAmNCkQSoiggJooNWR28whukmDVX/HY8lGmWg
 DjxwGkte3SltUBWNOTGnO7546hMwOxOPZHENPh+gffYkeMeIxPI=
 =xDWz
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - direct IO fixes:

      - restore passing file offset to correctly calculate checksums
        when repairing on read and bio split happens

      - use correct bio when sumitting IO on zoned filesystem

 - zoned mode fixes:

      - fix selection of device to correctly calculate device
        capabilities when allocating a new bio

      - use a dedicated lock for exclusion during relocation

      - fix leaked plug after failure syncing log

 - fix assertion during scrub and relocation

* tag 'for-5.18-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: zoned: use dedicated lock for data relocation
  btrfs: fix assertion failure during scrub due to block group reallocation
  btrfs: fix direct I/O writes for split bios on zoned devices
  btrfs: fix direct I/O read repair for split bios
  btrfs: fix and document the zoned device choice in alloc_new_bio
  btrfs: fix leaked plug after failure syncing log on zoned filesystems
2022-04-26 11:10:42 -07:00
Naohiro Aota
5f0addf7b8 btrfs: zoned: use dedicated lock for data relocation
Currently, we use btrfs_inode_{lock,unlock}() to grant an exclusive
writeback of the relocation data inode in
btrfs_zoned_data_reloc_{lock,unlock}(). However, that can cause a deadlock
in the following path.

Thread A takes btrfs_inode_lock() and waits for metadata reservation by
e.g, waiting for writeback:

prealloc_file_extent_cluster()
  - btrfs_inode_lock(&inode->vfs_inode, 0);
  - btrfs_prealloc_file_range()
  ...
    - btrfs_replace_file_extents()
      - btrfs_start_transaction
      ...
        - btrfs_reserve_metadata_bytes()

Thread B (e.g, doing a writeback work) needs to wait for the inode lock to
continue writeback process:

do_writepages
  - btrfs_writepages
    - extent_writpages
      - btrfs_zoned_data_reloc_lock(BTRFS_I(inode));
        - btrfs_inode_lock()

The deadlock is caused by relying on the vfs_inode's lock. By using it, we
introduced unnecessary exclusion of writeback and
btrfs_prealloc_file_range(). Also, the lock at this point is useless as we
don't have any dirty pages in the inode yet.

Introduce fs_info->zoned_data_reloc_io_lock and use it for the exclusive
writeback.

Fixes: 35156d8527 ("btrfs: zoned: only allow one process to add pages to a relocation inode")
CC: stable@vger.kernel.org # 5.16.x: 869f4cdc73: btrfs: zoned: encapsulate inode locking for zoned relocation
CC: stable@vger.kernel.org # 5.16.x
CC: stable@vger.kernel.org # 5.17
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-21 16:06:24 +02:00
Filipe Manana
a692e13d87 btrfs: fix assertion failure during scrub due to block group reallocation
During a scrub, or device replace, we can race with block group removal
and allocation and trigger the following assertion failure:

[7526.385524] assertion failed: cache->start == chunk_offset, in fs/btrfs/scrub.c:3817
[7526.387351] ------------[ cut here ]------------
[7526.387373] kernel BUG at fs/btrfs/ctree.h:3599!
[7526.388001] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
[7526.388970] CPU: 2 PID: 1158150 Comm: btrfs Not tainted 5.17.0-rc8-btrfs-next-114 #4
[7526.390279] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[7526.392430] RIP: 0010:assertfail.constprop.0+0x18/0x1a [btrfs]
[7526.393520] Code: f3 48 c7 c7 20 (...)
[7526.396926] RSP: 0018:ffffb9154176bc40 EFLAGS: 00010246
[7526.397690] RAX: 0000000000000048 RBX: ffffa0db8a910000 RCX: 0000000000000000
[7526.398732] RDX: 0000000000000000 RSI: ffffffff9d7239a2 RDI: 00000000ffffffff
[7526.399766] RBP: ffffa0db8a911e10 R08: ffffffffa71a3ca0 R09: 0000000000000001
[7526.400793] R10: 0000000000000001 R11: 0000000000000000 R12: ffffa0db4b170800
[7526.401839] R13: 00000003494b0000 R14: ffffa0db7c55b488 R15: ffffa0db8b19a000
[7526.402874] FS:  00007f6c99c40640(0000) GS:ffffa0de6d200000(0000) knlGS:0000000000000000
[7526.404038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[7526.405040] CR2: 00007f31b0882160 CR3: 000000014b38c004 CR4: 0000000000370ee0
[7526.406112] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[7526.407148] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[7526.408169] Call Trace:
[7526.408529]  <TASK>
[7526.408839]  scrub_enumerate_chunks.cold+0x11/0x79 [btrfs]
[7526.409690]  ? do_wait_intr_irq+0xb0/0xb0
[7526.410276]  btrfs_scrub_dev+0x226/0x620 [btrfs]
[7526.410995]  ? preempt_count_add+0x49/0xa0
[7526.411592]  btrfs_ioctl+0x1ab5/0x36d0 [btrfs]
[7526.412278]  ? __fget_files+0xc9/0x1b0
[7526.412825]  ? kvm_sched_clock_read+0x14/0x40
[7526.413459]  ? lock_release+0x155/0x4a0
[7526.414022]  ? __x64_sys_ioctl+0x83/0xb0
[7526.414601]  __x64_sys_ioctl+0x83/0xb0
[7526.415150]  do_syscall_64+0x3b/0xc0
[7526.415675]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[7526.416408] RIP: 0033:0x7f6c99d34397
[7526.416931] Code: 3c 1c e8 1c ff (...)
[7526.419641] RSP: 002b:00007f6c99c3fca8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[7526.420735] RAX: ffffffffffffffda RBX: 00005624e1e007b0 RCX: 00007f6c99d34397
[7526.421779] RDX: 00005624e1e007b0 RSI: 00000000c400941b RDI: 0000000000000003
[7526.422820] RBP: 0000000000000000 R08: 00007f6c99c40640 R09: 0000000000000000
[7526.423906] R10: 00007f6c99c40640 R11: 0000000000000246 R12: 00007fff746755de
[7526.424924] R13: 00007fff746755df R14: 0000000000000000 R15: 00007f6c99c40640
[7526.425950]  </TASK>

That assertion is relatively new, introduced with commit d04fbe19ae
("btrfs: scrub: cleanup the argument list of scrub_chunk()").

The block group we get at scrub_enumerate_chunks() can actually have a
start address that is smaller then the chunk offset we extracted from a
device extent item we got from the commit root of the device tree.
This is very rare, but it can happen due to a race with block group
removal and allocation. For example, the following steps show how this
can happen:

1) We are at transaction T, and we have the following blocks groups,
   sorted by their logical start address:

   [ bg A, start address A, length 1G (data) ]
   [ bg B, start address B, length 1G (data) ]
   (...)
   [ bg W, start address W, length 1G (data) ]

     --> logical address space hole of 256M,
         there used to be a 256M metadata block group here

   [ bg Y, start address Y, length 256M (metadata) ]

      --> Y matches W's end offset + 256M

   Block group Y is the block group with the highest logical address in
   the whole filesystem;

2) Block group Y is deleted and its extent mapping is removed by the call
   to remove_extent_mapping() made from btrfs_remove_block_group().

   So after this point, the last element of the mapping red black tree,
   its rightmost node, is the mapping for block group W;

3) While still at transaction T, a new data block group is allocated,
   with a length of 1G. When creating the block group we do a call to
   find_next_chunk(), which returns the logical start address for the
   new block group. This calls returns X, which corresponds to the
   end offset of the last block group, the rightmost node in the mapping
   red black tree (fs_info->mapping_tree), plus one.

   So we get a new block group that starts at logical address X and with
   a length of 1G. It spans over the whole logical range of the old block
   group Y, that was previously removed in the same transaction.

   However the device extent allocated to block group X is not the same
   device extent that was used by block group Y, and it also does not
   overlap that extent, which must be always the case because we allocate
   extents by searching through the commit root of the device tree
   (otherwise it could corrupt a filesystem after a power failure or
   an unclean shutdown in general), so the extent allocator is behaving
   as expected;

4) We have a task running scrub, currently at scrub_enumerate_chunks().
   There it searches for device extent items in the device tree, using
   its commit root. It finds a device extent item that was used by
   block group Y, and it extracts the value Y from that item into the
   local variable 'chunk_offset', using btrfs_dev_extent_chunk_offset();

   It then calls btrfs_lookup_block_group() to find block group for
   the logical address Y - since there's currently no block group that
   starts at that logical address, it returns block group X, because
   its range contains Y.

   This results in triggering the assertion:

      ASSERT(cache->start == chunk_offset);

   right before calling scrub_chunk(), as cache->start is X and
   chunk_offset is Y.

This is more likely to happen of filesystems not larger than 50G, because
for these filesystems we use a 256M size for metadata block groups and
a 1G size for data block groups, while for filesystems larger than 50G,
we use a 1G size for both data and metadata block groups (except for
zoned filesystems). It could also happen on any filesystem size due to
the fact that system block groups are always smaller (32M) than both
data and metadata block groups, but these are not frequently deleted, so
much less likely to trigger the race.

So make scrub skip any block group with a start offset that is less than
the value we expect, as that means it's a new block group that was created
in the current transaction. It's pointless to continue and try to scrub
its extents, because scrub searches for extents using the commit root, so
it won't find any. For a device replace, skip it as well for the same
reasons, and we don't need to worry about the possibility of extents of
the new block group not being to the new device, because we have the write
duplication setup done through btrfs_map_block().

Fixes: d04fbe19ae ("btrfs: scrub: cleanup the argument list of scrub_chunk()")
CC: stable@vger.kernel.org # 5.17
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-21 16:06:19 +02:00
Christoph Hellwig
0fdf977d45 btrfs: fix direct I/O writes for split bios on zoned devices
When a bio is split in btrfs_submit_direct, dip->file_offset contains
the file offset for the first bio.  But this means the start value used
in btrfs_end_dio_bio to record the write location for zone devices is
incorrect for subsequent bios.

CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19 15:45:04 +02:00
Christoph Hellwig
00d825258b btrfs: fix direct I/O read repair for split bios
When a bio is split in btrfs_submit_direct, dip->file_offset contains
the file offset for the first bio.  But this means the start value used
in btrfs_check_read_dio_bio is incorrect for subsequent bios.  Add
a file_offset field to struct btrfs_bio to pass along the correct offset.

Given that check_data_csum only uses start of an error message this
means problems with this miscalculation will only show up when I/O fails
or checksums mismatch.

The logic was removed in f4f39fc5dc ("btrfs: remove btrfs_bio::logical
member") but we need it due to the bio splitting.

CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19 15:44:56 +02:00
Christoph Hellwig
50f1cff3d8 btrfs: fix and document the zoned device choice in alloc_new_bio
Zone Append bios only need a valid block device in struct bio, but
not the device in the btrfs_bio.  Use the information from
btrfs_zoned_get_device to set up bi_bdev and fix zoned writes on
multi-device file system with non-homogeneous capabilities and remove
the pointless btrfs_bio.device assignment.

Add big fat comments explaining what is going on here.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19 15:44:49 +02:00
Filipe Manana
50ff57888d btrfs: fix leaked plug after failure syncing log on zoned filesystems
On a zoned filesystem, if we fail to allocate the root node for the log
root tree while syncing the log, we end up returning without finishing
the IO plug we started before, resulting in leaking resources as we
have started writeback for extent buffers of a log tree before. That
allocation failure, which typically is either -ENOMEM or -ENOSPC, is not
fatal and the fsync can safely fallback to a full transaction commit.

So release the IO plug if we fail to allocate the extent buffer for the
root of the log root tree when syncing the log on a zoned filesystem.

Fixes: 3ddebf27fc ("btrfs: zoned: reorder log node allocation on zoned filesystem")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-19 15:44:17 +02:00
Linus Torvalds
722985e2f6 for-5.18-rc2-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJUfvwACgkQxWXV+ddt
 WDtiuA//csj0CrJq7wyRvgDkbPtCCMyDtL4zfmjP4s++NWaMDOKTxBU8msuGUJgR
 Xribel2zWqlFiOzptd9sGEfxfKfz1p5Rm/gtFj57WVSGV7YtiAGyFGuzn/vpCrgq
 NP5LFY2z5N36VxDXUPKHzvdqczO5W2n9KdaysJCr6FpCz+vVrplFiT5U+X175Sgg
 zWS/XrPIHYbtaEFdb3rUKol6riu7vXW3MxEA9di8K4Xo9TJrp0NtBoGZDsWFQxf7
 vfVwtYsQPsACJxw+MjBcVQ5fNXO5iATL1JfBb9ltN589xouja7bDCb40Fm2gEwWB
 IUatnCq/4MN2S2NdPtEcXQ52W9svT/87z6ZblefSEiQqvQBBJHN131RTM/s8LBG4
 fkHcGV6PsiutSIFycrID0bpXr1Mhvg2aMjxdriLPBtYopaMPh+ivK6LPYoE5MggQ
 /rshWfjiWJhPKXPsng+H7UGbViOiYeG0kUBuaqFx4ARnESpN1gF2dJRYvXYFL/8a
 Q0wmLr1tf3M82VAaAFNOl/BVk8blutCSHcJDLDKxcl3DhVVlY5J8onEPXjneJRkk
 rRB3zoxLlptgfW75CPNyFrpicPdAXzGCoccienIUoEdHSX/W5rNA4L6XpVE5Hv94
 dWR1aVAbCUcuhY1QfBU7H2iYw0RHzqDO3+IRfqJuYsLGciijLbs=
 =APMY
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:
 "A few more code and warning fixes.

  There's one feature ioctl removal patch slated for 5.18 that did not
  make it to the main pull request. It's just a one-liner and the ioctl
  has a v2 that's in use for a long time, no point to postpone it to
  5.19.

  Late update:

   - remove balance v1 ioctl, superseded by v2 in 2012

  Fixes:

   - add back cgroup attribution for compressed writes

   - add super block write start/end annotations to asynchronous balance

   - fix root reference count on an error handling path

   - in zoned mode, activate zone at the chunk allocation time to avoid
     ENOSPC due to timing issues

   - fix delayed allocation accounting for direct IO

  Warning fixes:

   - simplify assertion condition in zoned check

   - remove an unused variable"

* tag 'for-5.18-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix btrfs_submit_compressed_write cgroup attribution
  btrfs: fix root ref counts in error handling in btrfs_get_root_ref
  btrfs: zoned: activate block group only for extent allocation
  btrfs: return allocated block group from do_chunk_alloc()
  btrfs: mark resumed async balance as writing
  btrfs: remove support of balance v1 ioctl
  btrfs: release correct delalloc amount in direct IO write path
  btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()
  btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range
2022-04-14 10:58:27 -07:00
Dennis Zhou
acee08aaf6 btrfs: fix btrfs_submit_compressed_write cgroup attribution
This restores the logic from commit 46bcff2bfc ("btrfs: fix compressed
write bio blkcg attribution") which added cgroup attribution to btrfs
writeback. It also adds back the REQ_CGROUP_PUNT flag for these ios.

Fixes: 9150724048 ("btrfs: determine stripe boundary at bio allocation time in btrfs_submit_compressed_write")
CC: stable@vger.kernel.org # 5.16+
Signed-off-by: Dennis Zhou <dennis@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:50:51 +02:00
Jia-Ju Bai
168a2f776b btrfs: fix root ref counts in error handling in btrfs_get_root_ref
In btrfs_get_root_ref(), when btrfs_insert_fs_root() fails,
btrfs_put_root() can happen for two reasons:

- the root already exists in the tree, in that case it returns the
  reference obtained in btrfs_lookup_fs_root()

- another error so the cleanup is done in the fail label

Calling btrfs_put_root() unconditionally would lead to double decrement
of the root reference possibly freeing it in the second case.

Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Fixes: bc44d7c4b2 ("btrfs: push btrfs_grab_fs_root into btrfs_get_fs_root")
CC: stable@vger.kernel.org # 5.10+
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:50:47 +02:00
Naohiro Aota
760e69c4c2 btrfs: zoned: activate block group only for extent allocation
In btrfs_make_block_group(), we activate the allocated block group,
expecting that the block group is soon used for allocation. However, the
chunk allocation from flush_space() context broke the assumption. There
can be a large time gap between the chunk allocation time and the extent
allocation time from the chunk.

Activating the empty block groups pre-allocated from flush_space()
context can exhaust the active zone counter of a device. Once we use all
the active zone counts for empty pre-allocated block groups, we cannot
activate new block group for the other things: metadata, tree-log, or
data relocation block group.  That failure results in a fake -ENOSPC.

This patch introduces CHUNK_ALLOC_FORCE_FOR_EXTENT to distinguish the
chunk allocation from find_free_extent(). Now, the new block group is
activated only in that context.

Fixes: eb66a010d5 ("btrfs: zoned: activate new block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:50:41 +02:00
Naohiro Aota
820c363bd5 btrfs: return allocated block group from do_chunk_alloc()
Return the allocated block group from do_chunk_alloc(). This is a
preparation patch for the next patch.

CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:50:39 +02:00
Naohiro Aota
a690e5f2db btrfs: mark resumed async balance as writing
When btrfs balance is interrupted with umount, the background balance
resumes on the next mount. There is a potential deadlock with FS freezing
here like as described in commit 26559780b953 ("btrfs: zoned: mark
relocation as writing"). Mark the process as sb_writing to avoid it.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
CC: stable@vger.kernel.org # 4.9+
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:49:50 +02:00
Nikolay Borisov
d03ae0d3b6 btrfs: remove support of balance v1 ioctl
It was scheduled for removal in kernel v5.18 commit 6c405b2409
("btrfs: deprecate BTRFS_IOC_BALANCE ioctl") thus its time has come.

Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:49:39 +02:00
Naohiro Aota
6d82ad13c4 btrfs: release correct delalloc amount in direct IO write path
Running generic/406 causes the following WARNING in btrfs_destroy_inode()
which tells there are outstanding extents left.

In btrfs_get_blocks_direct_write(), we reserve a temporary outstanding
extents with btrfs_delalloc_reserve_metadata() (or indirectly from
btrfs_delalloc_reserve_space(()). We then release the outstanding extents
with btrfs_delalloc_release_extents(). However, the "len" can be modified
in the COW case, which releases fewer outstanding extents than expected.

Fix it by calling btrfs_delalloc_release_extents() for the original length.

To reproduce the warning, the filesystem should be 1 GiB.  It's
triggering a short-write, due to not being able to allocate a large
extent and instead allocating a smaller one.

  WARNING: CPU: 0 PID: 757 at fs/btrfs/inode.c:8848 btrfs_destroy_inode+0x1e6/0x210 [btrfs]
  Modules linked in: btrfs blake2b_generic xor lzo_compress
  lzo_decompress raid6_pq zstd zstd_decompress zstd_compress xxhash zram
  zsmalloc
  CPU: 0 PID: 757 Comm: umount Not tainted 5.17.0-rc8+ #101
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS d55cb5a 04/01/2014
  RIP: 0010:btrfs_destroy_inode+0x1e6/0x210 [btrfs]
  RSP: 0018:ffffc9000327bda8 EFLAGS: 00010206
  RAX: 0000000000000000 RBX: ffff888100548b78 RCX: 0000000000000000
  RDX: 0000000000026900 RSI: 0000000000000000 RDI: ffff888100548b78
  RBP: ffff888100548940 R08: 0000000000000000 R09: ffff88810b48aba8
  R10: 0000000000000001 R11: ffff8881004eb240 R12: ffff88810b48a800
  R13: ffff88810b48ec08 R14: ffff88810b48ed00 R15: ffff888100490c68
  FS:  00007f8549ea0b80(0000) GS:ffff888237c00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f854a09e733 CR3: 000000010a2e9003 CR4: 0000000000370eb0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   destroy_inode+0x33/0x70
   dispose_list+0x43/0x60
   evict_inodes+0x161/0x1b0
   generic_shutdown_super+0x2d/0x110
   kill_anon_super+0xf/0x20
   btrfs_kill_super+0xd/0x20 [btrfs]
   deactivate_locked_super+0x27/0x90
   cleanup_mnt+0x12c/0x180
   task_work_run+0x54/0x80
   exit_to_user_mode_prepare+0x152/0x160
   syscall_exit_to_user_mode+0x12/0x30
   do_syscall_64+0x42/0x80
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   RIP: 0033:0x7f854a000fb7

Fixes: f0bfa76a11 ("btrfs: fix ENOSPC failure when attempting direct IO write into NOCOW range")
CC: stable@vger.kernel.org # 5.17
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Tested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:49:35 +02:00
Nathan Chancellor
6d4a6b515c btrfs: remove unused variable in btrfs_{start,write}_dirty_block_groups()
Clang's version of -Wunused-but-set-variable recently gained support for
unary operations, which reveals two unused variables:

  fs/btrfs/block-group.c:2949:6: error: variable 'num_started' set but not used [-Werror,-Wunused-but-set-variable]
          int num_started = 0;
              ^
  fs/btrfs/block-group.c:3116:6: error: variable 'num_started' set but not used [-Werror,-Wunused-but-set-variable]
          int num_started = 0;
              ^
  2 errors generated.

These variables appear to be unused from their introduction, so just
remove them to silence the warnings.

Fixes: c9dc4c6578 ("Btrfs: two stage dirty block group writeout")
Fixes: 1bbc621ef2 ("Btrfs: allow block group cache writeout outside critical section in commit")
CC: stable@vger.kernel.org # 5.4+
Link: https://github.com/ClangBuiltLinux/linux/issues/1614
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:49:15 +02:00
Haowen Bai
9435be734a btrfs: zoned: remove redundant condition in btrfs_run_delalloc_range
The logic !A || A && B is equivalent to !A || B. so we can
make code clear.

Note: though it's preferred to be in the more human readable form, there
have been repeated reports and patches as the expression is detected by
tools so apply it to reduce the load.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Haowen Bai <baihaowen@meizu.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-04-06 00:49:09 +02:00
Linus Torvalds
ce4c854ee8 for-5.18-rc1-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmJLaJoACgkQxWXV+ddt
 WDvd3g/+KXpWfPOg4h7rCKiZJoE0/XnAaqHacGmIy/8+TqiFm7VEdG2uA4wEjqYq
 yeCr+2NhsWcGKOWARUPmP8sBmla98tE0+1abxQFYhGdsLMcgbI6EyW+j53E7++gY
 0j4ZY4s7c1c4lsyStNrS7kauDcbZ3MHWhG1VNdTyWCf6wal/5jU93SwdDSz81Int
 iD8y2nVQoHcWdyBbPA7t9dYL5cCGR425gRrSb3Yhvc3cI6KJLbPLDpt1AAnR6XVx
 W/ELKYMVBF70nTqp3ptBTfbmm83/AcrA1/Epoe2sEV+3gaejx1vkIYwcWlvgW//+
 e980zd0k/QSse8O8s66Z7c9QnLML+L/6xxK2vJObw85NFzEGkrmoZdCa7HyyI3MS
 pkI0ox9z4I4OEgzBaH8ZpUYgRxQlRnbDv58GrTs/aEBuNaHGQJfiCmh4sx75jGvR
 eXMnqmL/EIKqZqTsLfWVuMLC7ZgTG8V76VOJ3gDE4v5Yxi306bCLcdTS+0HYz5GA
 fkCisFy69jSbkMbOClTegCkYiRHxV6GjI19yPqj29OsnFpk+htnubOvgs3Q1link
 odpBavejmbVxffrYw92Qm7NRMEldZoSaMa39zFJ9Wgtwui7Oe66K3JFlrSnth5mD
 YowfkApQCzsSiXzitwLEHmfs7F1MVh2a0jY4hTz1XrizxZPWKHE=
 =XZaw
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - prevent deleting subvolume with active swapfile

 - fix qgroup reserve limit calculation overflow

 - remove device count in superblock and its item in one transaction so
   they cant't get out of sync

 - skip defragmenting an isolated sector, this could cause some extra IO

 - unify handling of mtime/permissions in hole punch with fallocate

 - zoned mode fixes:
     - remove assert checking for only single mode, we have the
       DUP mode implemented
     - fix potential lockdep warning while traversing devices
       when checking for zone activation

* tag 'for-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: prevent subvol with swapfile from being deleted
  btrfs: do not warn for free space inode in cow_file_range
  btrfs: avoid defragging extents whose next extents are not targets
  btrfs: fix fallocate to use file_modified to update permissions consistently
  btrfs: remove device item and update super block in the same transaction
  btrfs: fix qgroup reserve overflow the qgroup limit
  btrfs: zoned: remove left over ASSERT checking for single profile
  btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone
2022-04-05 08:59:37 -07:00
Matthew Wilcox (Oracle)
5a60542c61 btrfs: Remove a use of PAGE_SIZE in btrfs_invalidate_folio()
While btrfs doesn't use large folios yet, this should have been changed
as part of the conversion from invalidatepage to invalidate_folio.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-01 14:40:44 -04:00
Matthew Wilcox (Oracle)
704528d895 fs: Remove ->readpages address space operation
All filesystems have now been converted to use ->readahead, so
remove the ->readpages operation and fix all the comments that
used to refer to it.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
2022-04-01 13:45:33 -04:00
Linus Torvalds
561593a048 for-5.18/write-streams-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI1AHwQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgplPjEACVJzKg5NkxpdkDThvq5tejws9KxB/4mHJg
 NoDMcv1TF+Orsd/HNW6XrgYnbU0ObHom3568/xb8BNegRVFe7V4ME/4IYNRyGOmV
 qbfciu04L1UkJhI52CIidkOioFABL3r1zgLCIz5vk0Cv9X7Le9x0UabHxJf7u9S+
 Z3lNdyxezN0SGx8VT86l/7lSoHtG3VHO9IsQCuNGF02SB+6uGpXBlptbEoQ4nTxd
 T7/H9FNOe2Wf7eKvcOOds8UlvZYAfYcY0GcRrIOXdHIy25mKFWwn5cDgFTMOH5ID
 xXpm+JFkDkrfSW1o4FFPxbN9Z6RbVXbGCsrXlIragLO2MJQdXiIUxS1OPT5oAado
 H9MlX6QtkwziLW9zUWa/N/jmRjc2vzHAxD6JFg/wXxNdtY0kd8TQpaxwTB8mVDPN
 VCGutt7lJS1CQInQ+ppzbdqzzuLHC1RHAyWSmfUE9rb8cbjxtJBnSIorYRLUesMT
 GRwqVTXW0osxSgCb1iDiBCJANrX1yPZcemv4Wh1gzbT6IE9sWxWXsE5sy9KvswNc
 M+E4nu/TYYTfkynItJjLgmDLOoi+V0FBY6ba0mRPBjkriSP4AVlwsZLGVsAHQzuA
 o5paW1GjRCCwhIQ6+AzZIoOz6wqvprBlUgUkUneyYAQ2ZKC3pZi8zPnpoVdFucVa
 VaTzP71C1Q==
 =efaq
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block

Pull NVMe write streams removal from Jens Axboe:
 "This removes the write streams support in NVMe. No vendor ever really
  shipped working support for this, and they are not interested in
  supporting it.

  With the NVMe support gone, we have nothing in the tree that supports
  this. Remove passing around of the hints.

  The only discussion point in this patchset imho is the fact that the
  file specific write hint setting/getting fcntl helpers will now return
  -1/EINVAL like they did before we supported write hints. No known
  applications use these functions, I only know of one prototype that I
  help do for RocksDB, and that's not used. That said, with a change
  like this, it's always a bit controversial. Alternatively, we could
  just make them return 0 and pretend it worked. It's placement based
  hints after all"

* tag 'for-5.18/write-streams-2022-03-18' of git://git.kernel.dk/linux-block:
  fs: remove fs.f_write_hint
  fs: remove kiocb.ki_hint
  block: remove the per-bio/request write hint
  nvme: remove support or stream based temperature hint
2022-03-26 11:51:46 -07:00
Linus Torvalds
50560ce6a0 Kbuild -std=gnu11 updates for v5.18
Linus pointed out the benefits of C99 some years ago, especially variable
 declarations in loops [1]. At that time, we were not ready for the
 migration due to old compilers.
 
 Recently, Jakob Koschel reported a bug in list_for_each_entry(), which
 leaks the invalid pointer out of the loop [2]. In the discussion, we
 agreed that the time had come. Now that GCC 5.1 is the minimum compiler
 version, there is nothing to prevent us from going to -std=gnu99, or even
 straight to -std=gnu11.
 
 Discussions for a better list iterator implementation are ongoing, but
 this patch set must land first.
 
 [1] https://lore.kernel.org/all/CAHk-=wgr12JkKmRd21qh-se-_Gs69kbPgR9x4C+Es-yJV2GLkA@mail.gmail.com/
 [2] https://lore.kernel.org/lkml/86C4CE7D-6D93-456B-AA82-F8ADEACA40B7@gmail.com/
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAmI9JqMVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsG3dkP/Ar7r8m4hc60kJE8JfXaxDpGOGka
 2yVm0EPfwV1lFGq440p4mqKc1iRTVLNMPsyaG/ZhriIp8PlSUjXLW290Sty6Z8Pd
 zcxwDg09ZXkMoDX+lc2Wr9F0wpswWJjqU/TzGLP5/qkVMe46KheXIQSPJAp8tVUt
 u2of/MTgTVMa4r7Iex/+NFWCPr4lTkWkSfzVN/Jd1r91UOyzy4E1VFRNlXIk/Fc9
 BFa67k0SHx/3FFElfwzFaejYUZjHjNzK3E1Zq8Q1vkWUxrzeEnzqTEiP7QaAi4Sa
 7MbqyqQvNoPw3uvKu5kwjDE+LHMEPTsmuaKVFpAc+qCpMtZCI6g9Q48pzQsWBMO2
 hZlEmYR9Zk0TpJp1flpOnNzoy7xPzNs0rcB3PaSOZyv+dTqtJ981IP+r4RNVlwje
 y3N9vq4RSAj/kAE/wi6FiPc/8vfbY71PbEXmg8556+kn3ne6aXl13ZrXIxz8w5jK
 bIgIFmrEPH7941KvFjoXhaFp/qv9hvLpWhQZu7CFRaj5V28qqUQ5TQFJREPePRtJ
 RFPEuOJqEGMxW/xbhcfrA1AO/y9Grxbe65e8Mph4YCfWpWaUww6vN01LC+k6UgDm
 Yq2u+wSFjWpRxOEPLWNsjnrZZgfdjk22O+TNOMs92X8/gXinmu3kZG5IUavahg7+
 J0SsIjIXhmLGKdDm
 =KMDk
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild update for C11 language base from Masahiro Yamada:
 "Kbuild -std=gnu11 updates for v5.18

  Linus pointed out the benefits of C99 some years ago, especially
  variable declarations in loops [1]. At that time, we were not ready
  for the migration due to old compilers.

  Recently, Jakob Koschel reported a bug in list_for_each_entry(), which
  leaks the invalid pointer out of the loop [2]. In the discussion, we
  agreed that the time had come. Now that GCC 5.1 is the minimum
  compiler version, there is nothing to prevent us from going to
  -std=gnu99, or even straight to -std=gnu11.

  Discussions for a better list iterator implementation are ongoing, but
  this patch set must land first"

[1] https://lore.kernel.org/all/CAHk-=wgr12JkKmRd21qh-se-_Gs69kbPgR9x4C+Es-yJV2GLkA@mail.gmail.com/
[2] https://lore.kernel.org/lkml/86C4CE7D-6D93-456B-AA82-F8ADEACA40B7@gmail.com/

* tag 'kbuild-gnu11-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  Kbuild: use -std=gnu11 for KBUILD_USERCFLAGS
  Kbuild: move to -std=gnu11
  Kbuild: use -Wdeclaration-after-statement
  Kbuild: add -Wno-shift-negative-value where -Wextra is used
2022-03-25 11:48:01 -07:00
Kaiwen Hu
60021bd754 btrfs: prevent subvol with swapfile from being deleted
A subvolume with an active swapfile must not be deleted otherwise it
would not be possible to deactivate it.

After the subvolume is deleted, we cannot swapoff the swapfile in this
deleted subvolume because the path is unreachable.  The swapfile is
still active and holding references, the filesystem cannot be unmounted.

The test looks like this:

  mkfs.btrfs -f $dev > /dev/null
  mount $dev $mnt

  btrfs sub create $mnt/subvol
  touch $mnt/subvol/swapfile
  chmod 600 $mnt/subvol/swapfile
  chattr +C $mnt/subvol/swapfile
  dd if=/dev/zero of=$mnt/subvol/swapfile bs=1K count=4096
  mkswap $mnt/subvol/swapfile
  swapon $mnt/subvol/swapfile

  btrfs sub delete $mnt/subvol
  swapoff $mnt/subvol/swapfile  # failed: No such file or directory
  swapoff --all

  unmount $mnt                  # target is busy.

To prevent above issue, we simply check that whether the subvolume
contains any active swapfile, and stop the deleting process.  This
behavior is like snapshot ioctl dealing with a swapfile.

CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Kaiwen Hu <kevinhu@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-24 17:50:57 +01:00
Josef Bacik
a7d16d9a07 btrfs: do not warn for free space inode in cow_file_range
This is a long time leftover from when I originally added the free space
inode, the point was to catch cases where we weren't honoring the NOCOW
flag.  However there exists a race with relocation, if we allocate our
free space inode in a block group that is about to be relocated, we
could trigger the COW path before the relocation has the opportunity to
find the extents and delete the free space cache.  In production where
we have auto-relocation enabled we're seeing this WARN_ON_ONCE() around
5k times in a 2 week period, so not super common but enough that it's at
the top of our metrics.

We're properly handling the error here, and with us phasing out v1 space
cache anyway just drop the WARN_ON_ONCE.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-24 17:50:53 +01:00
Qu Wenruo
75a36a7d3e btrfs: avoid defragging extents whose next extents are not targets
[BUG]
There is a report that autodefrag is defragging single sector, which
is completely waste of IO, and no help for defragging:

   btrfs-cleaner-808 defrag_one_locked_range: root=256 ino=651122 start=0 len=4096

[CAUSE]
In defrag_collect_targets(), we check if the current range (A) can be merged
with next one (B).

If mergeable, we will add range A into target for defrag.

However there is a catch for autodefrag, when checking mergeability
against range B, we intentionally pass 0 as @newer_than, hoping to get a
higher chance to merge with the next extent.

But in the next iteration, range B will looked up by defrag_lookup_extent(),
with non-zero @newer_than.

And if range B is not really newer, it will rejected directly, causing
only range A being defragged, while we expect to defrag both range A and
B.

[FIX]
Since the root cause is the difference in check condition of
defrag_check_next_extent() and defrag_collect_targets(), we fix it by:

1. Pass @newer_than to defrag_check_next_extent()
2. Pass @extent_thresh to defrag_check_next_extent()

This makes the check between defrag_collect_targets() and
defrag_check_next_extent() more consistent.

While there is still some minor difference, the remaining checks are
focus on runtime flags like writeback/delalloc, which are mostly
transient and safe to be checked only in defrag_collect_targets().

Link: https://github.com/btrfs/linux/issues/423#issuecomment-1066981856
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-24 17:50:39 +01:00
Darrick J. Wong
05fd9564e9 btrfs: fix fallocate to use file_modified to update permissions consistently
Since the initial introduction of (posix) fallocate back at the turn of
the century, it has been possible to use this syscall to change the
user-visible contents of files.  This can happen by extending the file
size during a preallocation, or through any of the newer modes (punch,
zero range).  Because the call can be used to change file contents, we
should treat it like we do any other modification to a file -- update
the mtime, and drop set[ug]id privileges/capabilities.

The VFS function file_modified() does all this for us if pass it a
locked inode, so let's make fallocate drop permissions correctly.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-24 17:48:02 +01:00
Qu Wenruo
bbac58698a btrfs: remove device item and update super block in the same transaction
[BUG]
There is a report that a btrfs has a bad super block num devices.

This makes btrfs to reject the fs completely.

  BTRFS error (device sdd3): super_num_devices 3 mismatch with num_devices 2 found here
  BTRFS error (device sdd3): failed to read chunk tree: -22
  BTRFS error (device sdd3): open_ctree failed

[CAUSE]
During btrfs device removal, chunk tree and super block num devs are
updated in two different transactions:

  btrfs_rm_device()
  |- btrfs_rm_dev_item(device)
  |  |- trans = btrfs_start_transaction()
  |  |  Now we got transaction X
  |  |
  |  |- btrfs_del_item()
  |  |  Now device item is removed from chunk tree
  |  |
  |  |- btrfs_commit_transaction()
  |     Transaction X got committed, super num devs untouched,
  |     but device item removed from chunk tree.
  |     (AKA, super num devs is already incorrect)
  |
  |- cur_devices->num_devices--;
  |- cur_devices->total_devices--;
  |- btrfs_set_super_num_devices()
     All those operations are not in transaction X, thus it will
     only be written back to disk in next transaction.

So after the transaction X in btrfs_rm_dev_item() committed, but before
transaction X+1 (which can be minutes away), a power loss happen, then
we got the super num mismatch.

[FIX]
Instead of starting and committing a transaction inside
btrfs_rm_dev_item(), start a transaction in side btrfs_rm_device() and
pass it to btrfs_rm_dev_item().

And only commit the transaction after everything is done.

Reported-by: Luca Béla Palkovics <luca.bela.palkovics@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CA+8xDSpvdm_U0QLBAnrH=zqDq_cWCOH5TiV46CKmp3igr44okQ@mail.gmail.com/
CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-24 17:47:58 +01:00
Ethan Lien
b642b52d0b btrfs: fix qgroup reserve overflow the qgroup limit
We use extent_changeset->bytes_changed in qgroup_reserve_data() to record
how many bytes we set for EXTENT_QGROUP_RESERVED state. Currently the
bytes_changed is set as "unsigned int", and it will overflow if we try to
fallocate a range larger than 4GiB. The result is we reserve less bytes
and eventually break the qgroup limit.

Unlike regular buffered/direct write, which we use one changeset for
each ordered extent, which can never be larger than 256M.  For
fallocate, we use one changeset for the whole range, thus it no longer
respects the 256M per extent limit, and caused the problem.

The following example test script reproduces the problem:

  $ cat qgroup-overflow.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  # Set qgroup limit to 2GiB.
  btrfs quota enable $MNT
  btrfs qgroup limit 2G $MNT

  # Try to fallocate a 3GiB file. This should fail.
  echo
  echo "Try to fallocate a 3GiB file..."
  fallocate -l 3G $MNT/3G.file

  # Try to fallocate a 5GiB file.
  echo
  echo "Try to fallocate a 5GiB file..."
  fallocate -l 5G $MNT/5G.file

  # See we break the qgroup limit.
  echo
  sync
  btrfs qgroup show -r $MNT

  umount $MNT

When running the test:

  $ ./qgroup-overflow.sh
  (...)

  Try to fallocate a 3GiB file...
  fallocate: fallocate failed: Disk quota exceeded

  Try to fallocate a 5GiB file...

  qgroupid         rfer         excl     max_rfer
  --------         ----         ----     --------
  0/5           5.00GiB      5.00GiB      2.00GiB

Since we have no control of how bytes_changed is used, it's better to
set it to u64.

CC: stable@vger.kernel.org # 4.14+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Ethan Lien <ethanlien@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-23 23:34:15 +01:00
Johannes Thumshirn
62ed0bf731 btrfs: zoned: remove left over ASSERT checking for single profile
With commit dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block
groups") we started allowing DUP on metadata block groups, so the
ASSERT()s in btrfs_can_activate_zone() and btrfs_zoned_get_device() are
no longer valid and in fact even harmful.

Fixes: dcf5652291f6 ("btrfs: zoned: allow DUP on meta-data block groups")
CC: stable@vger.kernel.org # 5.17
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-23 23:34:15 +01:00
Johannes Thumshirn
0b9e66762a btrfs: zoned: traverse devices under chunk_mutex in btrfs_can_activate_zone
btrfs_can_activate_zone() can be called with the device_list_mutex already
held, which will lead to a deadlock:

insert_dev_extents() // Takes device_list_mutex
`-> insert_dev_extent()
 `-> btrfs_insert_empty_item()
  `-> btrfs_insert_empty_items()
   `-> btrfs_search_slot()
    `-> btrfs_cow_block()
     `-> __btrfs_cow_block()
      `-> btrfs_alloc_tree_block()
       `-> btrfs_reserve_extent()
        `-> find_free_extent()
         `-> find_free_extent_update_loop()
          `-> can_allocate_chunk()
           `-> btrfs_can_activate_zone() // Takes device_list_mutex again

Instead of using the RCU on fs_devices->device_list we
can use fs_devices->alloc_list, protected by the chunk_mutex to traverse
the list of active devices.

We are in the chunk allocation thread. The newer chunk allocation
happens from the devices in the fs_device->alloc_list protected by the
chunk_mutex.

  btrfs_create_chunk()
    lockdep_assert_held(&info->chunk_mutex);
    gather_device_info
      list_for_each_entry(device, &fs_devices->alloc_list, dev_alloc_list)

Also, a device that reappears after the mount won't join the alloc_list
yet and, it will be in the dev_list, which we don't want to consider in
the context of the chunk alloc.

  [15.166572] WARNING: possible recursive locking detected
  [15.167117] 5.17.0-rc6-dennis #79 Not tainted
  [15.167487] --------------------------------------------
  [15.167733] kworker/u8:3/146 is trying to acquire lock:
  [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: find_free_extent+0x15a/0x14f0 [btrfs]
  [15.167733]
  [15.167733] but task is already holding lock:
  [15.167733] ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
  [15.167733]
  [15.167733] other info that might help us debug this:
  [15.167733]  Possible unsafe locking scenario:
  [15.167733]
  [15.171834]        CPU0
  [15.171834]        ----
  [15.171834]   lock(&fs_devs->device_list_mutex);
  [15.171834]   lock(&fs_devs->device_list_mutex);
  [15.171834]
  [15.171834]  *** DEADLOCK ***
  [15.171834]
  [15.171834]  May be due to missing lock nesting notation
  [15.171834]
  [15.171834] 5 locks held by kworker/u8:3/146:
  [15.171834]  #0: ffff888100050938 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
  [15.171834]  #1: ffffc9000067be80 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c3/0x5a0
  [15.176244]  #2: ffff88810521e620 (sb_internal){.+.+}-{0:0}, at: flush_space+0x335/0x600 [btrfs]
  [15.176244]  #3: ffff888102962ee0 (&fs_devs->device_list_mutex){+.+.}-{3:3}, at: btrfs_create_pending_block_groups+0x20a/0x560 [btrfs]
  [15.176244]  #4: ffff8881152e4b78 (btrfs-dev-00){++++}-{3:3}, at: __btrfs_tree_lock+0x27/0x130 [btrfs]
  [15.179641]
  [15.179641] stack backtrace:
  [15.179641] CPU: 1 PID: 146 Comm: kworker/u8:3 Not tainted 5.17.0-rc6-dennis #79
  [15.179641] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
  [15.179641] Workqueue: events_unbound btrfs_async_reclaim_data_space [btrfs]
  [15.179641] Call Trace:
  [15.179641]  <TASK>
  [15.179641]  dump_stack_lvl+0x45/0x59
  [15.179641]  __lock_acquire.cold+0x217/0x2b2
  [15.179641]  lock_acquire+0xbf/0x2b0
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  __mutex_lock+0x8e/0x970
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? lock_is_held_type+0xd7/0x130
  [15.183838]  ? find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  find_free_extent+0x15a/0x14f0 [btrfs]
  [15.183838]  ? _raw_spin_unlock+0x24/0x40
  [15.183838]  ? btrfs_get_alloc_profile+0x106/0x230 [btrfs]
  [15.187601]  btrfs_reserve_extent+0x131/0x260 [btrfs]
  [15.187601]  btrfs_alloc_tree_block+0xb5/0x3b0 [btrfs]
  [15.187601]  __btrfs_cow_block+0x138/0x600 [btrfs]
  [15.187601]  btrfs_cow_block+0x10f/0x230 [btrfs]
  [15.187601]  btrfs_search_slot+0x55f/0xbc0 [btrfs]
  [15.187601]  ? lock_is_held_type+0xd7/0x130
  [15.187601]  btrfs_insert_empty_items+0x2d/0x60 [btrfs]
  [15.187601]  btrfs_create_pending_block_groups+0x2b3/0x560 [btrfs]
  [15.187601]  __btrfs_end_transaction+0x36/0x2a0 [btrfs]
  [15.192037]  flush_space+0x374/0x600 [btrfs]
  [15.192037]  ? find_held_lock+0x2b/0x80
  [15.192037]  ? btrfs_async_reclaim_data_space+0x49/0x180 [btrfs]
  [15.192037]  ? lock_release+0x131/0x2b0
  [15.192037]  btrfs_async_reclaim_data_space+0x70/0x180 [btrfs]
  [15.192037]  process_one_work+0x24c/0x5a0
  [15.192037]  worker_thread+0x4a/0x3d0

Fixes: a85f05e59b ("btrfs: zoned: avoid chunk allocation if active block group has enough space")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-23 23:34:09 +01:00
Linus Torvalds
6b1f86f8e9 Filesystem folio changes for 5.18
Primarily this series converts some of the address_space operations
 to take a folio instead of a page.
 
 ->is_partially_uptodate() takes a folio instead of a page and changes the
 type of the 'from' and 'count' arguments to make it obvious they're bytes.
 ->invalidatepage() becomes ->invalidate_folio() and has a similar type change.
 ->launder_page() becomes ->launder_folio()
 ->set_page_dirty() becomes ->dirty_folio() and adds the address_space as
 an argument.
 
 There are a couple of other misc changes up front that weren't worth
 separating into their own pull request.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCgAdFiEEejHryeLBw/spnjHrDpNsjXcpgj4FAmI4hqMACgkQDpNsjXcp
 gj7r7Af/fVJ7m8kKqjP/IayX3HiJRuIDQw+vM++BlRNXdjz+IyED6whdmFGxJeOY
 BMyT+8ApOAz7ErS4G+7fAv4ScJK/aEgFUsnSeAiCp0PliiEJ5NNJzElp6sVmQ7H5
 SX7+Ek444FZUGsQuy0qL7/ELpR3ditnD7x+5U2g0p5TeaHGUQn84crRyfR4xuhNG
 EBD9D71BOb7OxUcOHe93pTkK51QsQ0aCrcIsB1tkK5KR0BAthn1HqF7ehL90Rvrr
 omx5M7aDWGY4oj7IKrhlAs+55Ah2WaOzrZBp0FXNbr4UENDBKWKyUxErwa4xPkf6
 Gm1iQG/CspOHnxN3YWsd5WjtlL3A+A==
 =cOiq
 -----END PGP SIGNATURE-----

Merge tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache

Pull filesystem folio updates from Matthew Wilcox:
 "Primarily this series converts some of the address_space operations to
  take a folio instead of a page.

  Notably:

   - a_ops->is_partially_uptodate() takes a folio instead of a page and
     changes the type of the 'from' and 'count' arguments to make it
     obvious they're bytes.

   - a_ops->invalidatepage() becomes ->invalidate_folio() and has a
     similar type change.

   - a_ops->launder_page() becomes ->launder_folio()

   - a_ops->set_page_dirty() becomes ->dirty_folio() and adds the
     address_space as an argument.

  There are a couple of other misc changes up front that weren't worth
  separating into their own pull request"

* tag 'folio-5.18b' of git://git.infradead.org/users/willy/pagecache: (53 commits)
  fs: Remove aops ->set_page_dirty
  fb_defio: Use noop_dirty_folio()
  fs: Convert __set_page_dirty_no_writeback to noop_dirty_folio
  fs: Convert __set_page_dirty_buffers to block_dirty_folio
  nilfs: Convert nilfs_set_page_dirty() to nilfs_dirty_folio()
  mm: Convert swap_set_page_dirty() to swap_dirty_folio()
  ubifs: Convert ubifs_set_page_dirty to ubifs_dirty_folio
  f2fs: Convert f2fs_set_node_page_dirty to f2fs_dirty_node_folio
  f2fs: Convert f2fs_set_data_page_dirty to f2fs_dirty_data_folio
  f2fs: Convert f2fs_set_meta_page_dirty to f2fs_dirty_meta_folio
  afs: Convert afs_dir_set_page_dirty() to afs_dir_dirty_folio()
  btrfs: Convert extent_range_redirty_for_io() to use folios
  fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
  btrfs: Convert from set_page_dirty to dirty_folio
  fscache: Convert fscache_set_page_dirty() to fscache_dirty_folio()
  fs: Add aops->dirty_folio
  fs: Remove aops->launder_page
  orangefs: Convert launder_page to launder_folio
  nfs: Convert from launder_page to launder_folio
  fuse: Convert from launder_page to launder_folio
  ...
2022-03-22 18:26:56 -07:00
Linus Torvalds
3bf03b9a08 Merge branch 'akpm' (patches from Andrew)
Merge updates from Andrew Morton:

 - A few misc subsystems: kthread, scripts, ntfs, ocfs2, block, and vfs

 - Most the MM patches which precede the patches in Willy's tree: kasan,
   pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
   sparsemem, vmalloc, pagealloc, memory-failure, mlock, hugetlb,
   userfaultfd, vmscan, compaction, mempolicy, oom-kill, migration, thp,
   cma, autonuma, psi, ksm, page-poison, madvise, memory-hotplug, rmap,
   zswap, uaccess, ioremap, highmem, cleanups, kfence, hmm, and damon.

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (227 commits)
  mm/damon/sysfs: remove repeat container_of() in damon_sysfs_kdamond_release()
  Docs/ABI/testing: add DAMON sysfs interface ABI document
  Docs/admin-guide/mm/damon/usage: document DAMON sysfs interface
  selftests/damon: add a test for DAMON sysfs interface
  mm/damon/sysfs: support DAMOS stats
  mm/damon/sysfs: support DAMOS watermarks
  mm/damon/sysfs: support schemes prioritization
  mm/damon/sysfs: support DAMOS quotas
  mm/damon/sysfs: support DAMON-based Operation Schemes
  mm/damon/sysfs: support the physical address space monitoring
  mm/damon/sysfs: link DAMON for virtual address spaces monitoring
  mm/damon: implement a minimal stub for sysfs-based DAMON interface
  mm/damon/core: add number of each enum type values
  mm/damon/core: allow non-exclusive DAMON start/stop
  Docs/damon: update outdated term 'regions update interval'
  Docs/vm/damon/design: update DAMON-Idle Page Tracking interference handling
  Docs/vm/damon: call low level monitoring primitives the operations
  mm/damon: remove unnecessary CONFIG_DAMON option
  mm/damon/paddr,vaddr: remove damon_{p,v}a_{target_valid,set_operations}()
  mm/damon/dbgfs-test: fix is_target_id() change
  ...
2022-03-22 16:11:53 -07:00
Muchun Song
fd60b28842 fs: allocate inode by using alloc_inode_sb()
The inode allocation is supposed to use alloc_inode_sb(), so convert
kmem_cache_alloc() of all filesystems to alloc_inode_sb().

Link: https://lkml.kernel.org/r/20220228122126.37293-5-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Theodore Ts'o <tytso@mit.edu>		[ext4]
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Alex Shi <alexs@kernel.org>
Cc: Anna Schumaker <Anna.Schumaker@Netapp.com>
Cc: Chao Yu <chao@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Fam Zheng <fam.zheng@bytedance.com>
Cc: Jaegeuk Kim <jaegeuk@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kari Argillander <kari.argillander@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Cc: Yang Shi <shy828301@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2022-03-22 15:57:03 -07:00
Linus Torvalds
5191290407 for-5.18-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmI44SgACgkQxWXV+ddt
 WDtzyg//YgMKr05jRsU3I/pIQ9znuKZmmllThwF63ZRG4PvKz2QfzvKdrMuzNjru
 5kHbG59iJqtLmU/aVsdp8mL6mmg5U3Ym2bIRsrW5m4HTtTowKdirvL/lQ3/tWm8j
 CSDJhUdCL2SwFjpru+4cxOeHLXNSfsk4BoCu8nsLitL+oXv/EPo/dkmu6nPjiMY3
 RjsIDBeDEf7J20KOuP/qJuN2YOAT7TeISPD3Ow4aDsmndWQ8n6KehEmAZb7QuqZQ
 SYubZ2wTb9HuPH/qpiTIA7innBIr+JkYtUYlz2xxixM2BUWNfqD6oKHw9RgOY5Sg
 CULFssw0i7cgGKsvuPJw1zdM002uG4wwXKigGiyljTVWvxneyr4mNDWiGad+LyFJ
 XWhnABPidkLs/1zbUkJ23DVub5VlfZsypkFDJAUXI0nGu3VrhjDfTYMa8eCe2L/F
 YuGG6CrAC+5K/arKAWTVj7hOb+52UzBTEBJz60LJJ6dS9eQoBy857V6pfo7w7ukZ
 t/tqA6q75O4tk/G3Ix3V1CjuAH3kJE6qXrvBxhpu8aZNjofopneLyGqS5oahpcE8
 8edtT+ZZhNuU9sLSEJCJATVxXRDdNzpQ8CHgOR5HOUbmM/vwKNzHPfRQzDnImznw
 UaUlFaaHwK17M6Y/6CnMecz26U2nVSJ7pyh39mb784XYe2a1efE=
 =YARd
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "This contains feature updates, performance improvements, preparatory
  and core work and some related VFS updates:

  Features:

   - encoded read/write ioctls, allows user space to read or write raw
     data directly to extents (now compressed, encrypted in the future),
     will be used by send/receive v2 where it saves processing time

   - zoned mode now works with metadata DUP (the mkfs.btrfs default)

   - error message header updates:
      - print error state: transaction abort, other error, log tree
        errors
      - print transient filesystem state: remount, device replace,
        ignored checksum verifications

   - tree-checker: verify the transaction id of the to-be-written dirty
     extent buffer

  Performance improvements for fsync:

   - directory logging speedups (up to -90% run time)

   - avoid logging all directory changes during renames (up to -60% run
     time)

   - avoid inode logging during rename and link when possible (up to
     -60% run time)

   - prepare extents to be logged before locking a log tree path
     (throughput +7%)

   - stop copying old file extents when doing a full fsync()

   - improved logging of old extents after truncate

  Core, fixes:

   - improved stale device identification by dev_t and not just path
     (for devices that are behind other layers like device mapper)

   - continued extent tree v2 preparatory work
      - disable features that won't work yet
      - add wrappers and abstractions for new tree roots

   - improved error handling

   - add super block write annotations around background block group
     reclaim

   - fix device scanning messages potentially accessing stale pointer

   - cleanups and refactoring

  VFS:

   - allow reflinks/deduplication from two different mounts of the same
     filesystem

   - export and add helpers for read/write range verification, for the
     encoded ioctls"

* tag 'for-5.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (98 commits)
  btrfs: zoned: put block group after final usage
  btrfs: don't access possibly stale fs_info data in device_list_add
  btrfs: add lockdep_assert_held to need_preemptive_reclaim
  btrfs: verify the tranisd of the to-be-written dirty extent buffer
  btrfs: unify the error handling of btrfs_read_buffer()
  btrfs: unify the error handling pattern for read_tree_block()
  btrfs: factor out do_free_extent_accounting helper
  btrfs: remove last_ref from the extent freeing code
  btrfs: add a alloc_reserved_extent helper
  btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block
  btrfs: add and use helper for unlinking inode during log replay
  btrfs: extend locking to all space_info members accesses
  btrfs: zoned: mark relocation as writing
  fs: allow cross-vfsmount reflink/dedupe
  btrfs: remove the cross file system checks from remap
  btrfs: pass btrfs_fs_info to btrfs_recover_relocation
  btrfs: pass btrfs_fs_info for deleting snapshots and cleaner
  btrfs: add filesystems state details to error messages
  btrfs: deal with unexpected extent type during reflinking
  btrfs: fix unexpected error path when reflinking an inline extent
  ...
2022-03-22 10:51:40 -07:00
Linus Torvalds
616355cc81 for-5.18/block-2022-03-18
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmI0+GcQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgprUpD/9aTJEnj7VCw7UouSsg098sdjtoy9ilslU3
 ew47K8CIXHbCB4CDqLnFyvCwAdG1XGgS+fUmFAxvTr29R9SZeS5d+bXL6sZzEo0C
 bwxsJy9MM2QRtMvB+giAt1myXbwB8cG+ketMBWXqwXXRHRzPbbQfMZia7FqWMnfY
 KQanH9IwYHp1oa5U/W6Qcjm4oCnLgBMRwqByzUCtiF3y9qgaLkK+3IgkNwjJQjLA
 DTeUJ/9CgxGQQbzA+LPktbw2xfTqiUfcKq0mWx6Zt4wwNXn1ClqUDUXX6QSM8/5u
 3OimbscSkEPPTIYZbVBPkhFnAlQb4JaJEgOrbXvYKVV2Dh+eZY81XwNeE/E8gdBY
 TnHOTOCjkN/4sR3hIrWazlJzPLdpPA0eOYrhguCraQsX9mcsYNxlJ9otRv/Ve99g
 uqL0RZg3+NoK84fm79FCGy/ZmPQJvJttlBT9CKVwylv/Lky42xWe7AdM3OipKluY
 2nh+zN5Ai7WxZdTKXQFRhCSWfWQ+1qW51tB3dcGW+BooZr/oox47qKQVcHsEWbq1
 RNR45F5a4AuPwYUHF/P36WviLnEuq9AvX7OTTyYOplyVQohKIoDXp9chVzLNzBiZ
 KBR00W6MLKKKN+8foalQWgNyb2i2PH7Ib4xRXvXj/22Vwxg5UmUoBmSDSas9SZUS
 +dMo7CtNgA==
 =DpgP
 -----END PGP SIGNATURE-----

Merge tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block

Pull block updates from Jens Axboe:

 - BFQ cleanups and fixes (Yu, Zhang, Yahu, Paolo)

 - blk-rq-qos completion fix (Tejun)

 - blk-cgroup merge fix (Tejun)

 - Add offline error return value to distinguish it from an IO error on
   the device (Song)

 - IO stats fixes (Zhang, Christoph)

 - blkcg refcount fixes (Ming, Yu)

 - Fix for indefinite dispatch loop softlockup (Shin'ichiro)

 - blk-mq hardware queue management improvements (Ming)

 - sbitmap dead code removal (Ming, John)

 - Plugging merge improvements (me)

 - Show blk-crypto capabilities in sysfs (Eric)

 - Multiple delayed queue run improvement (David)

 - Block throttling fixes (Ming)

 - Start deprecating auto module loading based on dev_t (Christoph)

 - bio allocation improvements (Christoph, Chaitanya)

 - Get rid of bio_devname (Christoph)

 - bio clone improvements (Christoph)

 - Block plugging improvements (Christoph)

 - Get rid of genhd.h header (Christoph)

 - Ensure drivers use appropriate flush helpers (Christoph)

 - Refcounting improvements (Christoph)

 - Queue initialization and teardown improvements (Ming, Christoph)

 - Misc fixes/improvements (Barry, Chaitanya, Colin, Dan, Jiapeng,
   Lukas, Nian, Yang, Eric, Chengming)

* tag 'for-5.18/block-2022-03-18' of git://git.kernel.dk/linux-block: (127 commits)
  block: cancel all throttled bios in del_gendisk()
  block: let blkcg_gq grab request queue's refcnt
  block: avoid use-after-free on throttle data
  block: limit request dispatch loop duration
  block/bfq-iosched: Fix spelling mistake "tenative" -> "tentative"
  sr: simplify the local variable initialization in sr_block_open()
  block: don't merge across cgroup boundaries if blkcg is enabled
  block: fix rq-qos breakage from skipping rq_qos_done_bio()
  block: flush plug based on hardware and software queue order
  block: ensure plug merging checks the correct queue at least once
  block: move rq_qos_exit() into disk_release()
  block: do more work in elevator_exit
  block: move blk_exit_queue into disk_release
  block: move q_usage_counter release into blk_queue_release
  block: don't remove hctx debugfs dir from blk_mq_exit_queue
  block: move blkcg initialization/destroy into disk allocation/release handler
  sr: implement ->free_disk to simplify refcounting
  sd: implement ->free_disk to simplify refcounting
  sd: delay calling free_opal_dev
  sd: call sd_zbc_release_disk before releasing the scsi_device reference
  ...
2022-03-21 16:48:55 -07:00
Matthew Wilcox (Oracle)
ebf55c886e btrfs: Convert extent_range_redirty_for_io() to use folios
This removes a call to __set_page_dirty_nobuffers().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:38 -04:00
Matthew Wilcox (Oracle)
187c82cb03 fs: Convert trivial uses of __set_page_dirty_nobuffers to filemap_dirty_folio
These filesystems use __set_page_dirty_nobuffers() either directly or
with a very thin wrapper; convert them en masse.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:38 -04:00
Matthew Wilcox (Oracle)
0079c3b176 btrfs: Convert from set_page_dirty to dirty_folio
Optimise the non-DEBUG case to just call filemap_dirty_folio
directly.  The DEBUG case doesn't actually compile, but convert
it to dirty_folio anyway.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:34:38 -04:00
Matthew Wilcox (Oracle)
895586eb68 btrfs: Convert from invalidatepage to invalidate_folio
A lot of the underlying infrastructure in btrfs needs to be switched
over to folios, but this at least documents that invalidatepage can't
be passed a tail page.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:29 -04:00
Matthew Wilcox (Oracle)
8e1dec8eb8 btrfs: Use folio_invalidate()
Instead of calling ->invalidatepage directly, use folio_invalidate().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Acked-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Mike Marshall <hubcap@omnibond.com> # orangefs
Tested-by: David Howells <dhowells@redhat.com> # afs
2022-03-15 08:23:29 -04:00
Nikolay Borisov
d3e2996707 btrfs: zoned: put block group after final usage
It's counter-intuitive (and wrong) to put the block group _before_ the
final usage in submit_eb_page. Fix it by re-ordering the call to
btrfs_put_block_group after its final reference. Also fix a minor typo
in 'implies'

Fixes: be1a1d7a5d ("btrfs: zoned: finish fully written block group")
CC: stable@vger.kernel.org # 5.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:54 +01:00
Dongliang Mu
79c9234ba5 btrfs: don't access possibly stale fs_info data in device_list_add
Syzbot reported a possible use-after-free in printing information
in device_list_add.

Very similar with the bug fixed by commit 0697d9a610 ("btrfs: don't
access possibly stale fs_info data for printing duplicate device"),
but this time the use occurs in btrfs_info_in_rcu.

  Call Trace:
   kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
   btrfs_printk+0x395/0x425 fs/btrfs/super.c:244
   device_list_add.cold+0xd7/0x2ed fs/btrfs/volumes.c:957
   btrfs_scan_one_device+0x4c7/0x5c0 fs/btrfs/volumes.c:1387
   btrfs_control_ioctl+0x12a/0x2d0 fs/btrfs/super.c:2409
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:874 [inline]
   __se_sys_ioctl fs/ioctl.c:860 [inline]
   __x64_sys_ioctl+0x193/0x200 fs/ioctl.c:860
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Fix this by modifying device->fs_info to NULL too.

Reported-and-tested-by: syzbot+82650a4e0ed38f218363@syzkaller.appspotmail.com
CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:54 +01:00
Niels Dossche
bf7bd725b0 btrfs: add lockdep_assert_held to need_preemptive_reclaim
In a previous patch ("btrfs: extend locking to all space_info members
accesses") the locking for the space_info members was extended in
btrfs_preempt_reclaim_metadata_space because not all the member
accesses that needed locks were actually locked (bytes_pinned et al).

It was then suggested to also add a call to lockdep_assert_held to
need_preemptive_reclaim. This function also works with space_info
members. As of now, it has only two call sites which both hold the lock.

Suggested-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Qu Wenruo
3777369ff1 btrfs: verify the tranisd of the to-be-written dirty extent buffer
[BUG]
There is a bug report that a bitflip in the transid part of an extent
buffer makes btrfs to reject certain tree blocks:

  BTRFS error (device dm-0): parent transid verify failed on 1382301696 wanted 262166 found 22

[CAUSE]
Note the failed transid check, hex(262166) = 0x40016, while
hex(22) = 0x16.

It's an obvious bitflip.

Furthermore, the reporter also confirmed the bitflip is from the
hardware, so it's a real hardware caused bitflip, and such problem can
not be detected by the existing tree-checker framework.

As tree-checker can only verify the content inside one tree block, while
generation of a tree block can only be verified against its parent.

So such problem remain undetected.

[FIX]
Although tree-checker can not verify it at write-time, we still have a
quick (but not the most accurate) way to catch such obvious corruption.

Function csum_one_extent_buffer() is called before we submit metadata
write.

Thus it means, all the extent buffer passed in should be dirty tree
blocks, and should be newer than last committed transaction.

Using that we can catch the above bitflip.

Although it's not a perfect solution, as if the corrupted generation is
higher than the correct value, we have no way to catch it at all.

Reported-by: Christoph Anton Mitterer <calestyo@scientia.org>
Link: https://lore.kernel.org/linux-btrfs/2dfcbc130c55cc6fd067b93752e90bd2b079baca.camel@scientia.org/
CC: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@sus,ree.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Qu Wenruo
9a4ffa1bd6 btrfs: unify the error handling of btrfs_read_buffer()
There is one oddball error handling of btrfs_read_buffer():

	ret = btrfs_read_buffer(tmp, gen, parent_level - 1, &first_key);
	if (!ret) {
		*eb_ret = tmp;
		return 0;
	}
	free_extent_buffer(tmp);
	btrfs_release_path(p);
	return -EIO;

While all other call sites check the error first.  Unify the behavior.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Qu Wenruo
4eb150d612 btrfs: unify the error handling pattern for read_tree_block()
We had an error handling pattern for read_tree_block() like this:

	eb = read_tree_block();
	if (IS_ERR(eb)) {
		/*
		 * Handling error here
		 * Normally ended up with return or goto out.
		 */
	} else if (!extent_buffer_uptodate(eb)) {
		/*
		 * Different error handling here
		 * Normally also ended up with return or goto out;
		 */
	}

This is fine, but if we want to add extra check for each
read_tree_block(), the existing if-else-if is not that expandable and
will take reader some seconds to figure out there is no extra branch.

Here we change it to a more common way, without the extra else:

	eb = read_tree_block();
	if (IS_ERR(eb)) {
		/*
		 * Handling error here
		 */
		return eb or goto out;
	}
	if (!extent_buffer_uptodate(eb)) {
		/*
		 * Different error handling here
		 */
		return eb or goto out;
	}

This also removes some oddball call sites which uses some creative way
to check error.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Josef Bacik
8f8aa4c7a9 btrfs: factor out do_free_extent_accounting helper
__btrfs_free_extent() does all of the hard work of updating the extent
ref items, and then at the end if we dropped the extent completely it
does the cleanup accounting work.  We're going to only want to do that
work for metadata with extent tree v2, so extract this bit into its own
helper.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Josef Bacik
5b2a54bb7c btrfs: remove last_ref from the extent freeing code
This is a remnant of the work I did for qgroups a long time ago to only
run for a block when we had dropped the last ref.  We haven't done that
for years, but the code remains.  Drop this remnant.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Josef Bacik
3466670558 btrfs: add a alloc_reserved_extent helper
We duplicate this logic for both data and metadata, at this point we've
already done our type specific extent root operations, this is just
doing the accounting and removing the space from the free space tree.
Extract this common logic out into a helper.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Josef Bacik
b3c958a369 btrfs: remove BUG_ON(ret) in alloc_reserved_tree_block
Switch this to an ASSERT() and return the error in the normal case.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Filipe Manana
313ab75399 btrfs: add and use helper for unlinking inode during log replay
During log replay there is this pattern of running delayed items after
every inode unlink. To avoid repeating this several times, move the
logic into an helper function and use it instead of calling
btrfs_unlink_inode() followed by btrfs_run_delayed_items().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Niels Dossche
06bae87663 btrfs: extend locking to all space_info members accesses
bytes_pinned is always accessed under space_info->lock, except in
btrfs_preempt_reclaim_metadata_space, however the other members are
accessed under that lock. The reserved member of the rsv's are also
partially accessed under a lock and partially not. Move all these
accesses into the same lock to ensure consistency.

This could potentially race and lead to a flush instead of a commit but
it's not a big problem as it's only for preemptive flush.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Niels Dossche <niels.dossche@ugent.be>
Signed-off-by: Niels Dossche <dossche.niels@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Naohiro Aota
ca5e4ea0be btrfs: zoned: mark relocation as writing
There is a hung_task issue with running generic/068 on an SMR
device. The hang occurs while a process is trying to thaw the
filesystem. The process is trying to take sb->s_umount to thaw the
FS. The lock is held by fsstress, which calls btrfs_sync_fs() and is
waiting for an ordered extent to finish. However, as the FS is frozen,
the ordered extents never finish.

Having an ordered extent while the FS is frozen is the root cause of
the hang. The ordered extent is initiated from btrfs_relocate_chunk()
which is called from btrfs_reclaim_bgs_work().

This commit adds sb_*_write() around btrfs_relocate_chunk() call
site. For the usual "btrfs balance" command, we already call it with
mnt_want_file() in btrfs_ioctl_balance().

Fixes: 18bb8bbf13 ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.13+
Link: https://github.com/naota/linux/issues/56
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:53 +01:00
Josef Bacik
ae460f058e btrfs: remove the cross file system checks from remap
The sb check is already done in do_clone_file_range, and the mnt check
(which will hopefully go away in a subsequent patch) is done in
ioctl_file_clone().  Remove the check in our code and put an ASSERT() to
make sure it doesn't change underneath us.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Josef Bacik
7eefae6bb1 btrfs: pass btrfs_fs_info to btrfs_recover_relocation
We don't need a root here, we just need the btrfs_fs_info, we can just
get the specific roots we need from fs_info.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Josef Bacik
33c4418499 btrfs: pass btrfs_fs_info for deleting snapshots and cleaner
We're passing a root around here, but we only really need the fs_info,
so fix up btrfs_clean_one_deleted_snapshot() to take an fs_info instead,
and then fix up all the callers appropriately.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Sweet Tea Dorminy
c067da8781 btrfs: add filesystems state details to error messages
When a filesystem goes read-only due to an error, multiple errors tend
to be reported, some of which are knock-on failures. Logging fs_states,
in btrfs_handle_fs_error() and btrfs_printk() helps distinguish the
first error from subsequent messages which may only exist due to an
error state.

Under the new format, most initial errors will look like:
`BTRFS: error (device loop0) in ...`
while subsequent errors will begin with:
`error (device loop0: state E) in ...`

An initial transaction abort error will look like
`error (device loop0: state A) in ...`
and subsequent messages will contain
`(device loop0: state EA) in ...`

In addition to the error states we can also print other states that are
temporary, like remounting, device replace, or indicate a global state
that may affect functionality.

Now implemented:

E - filesystem error detected
A - transaction aborted
L - log tree errors

M - remounting in progress
R - device replace in progress
C - data checksums not verified (mounted with ignoredatacsums)

Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Filipe Manana
b2d9f2dc01 btrfs: deal with unexpected extent type during reflinking
Smatch complains about a possible dereference of a pointer that was not
initialized:

    CC [M]  fs/btrfs/reflink.o
    CHECK   fs/btrfs/reflink.c
  fs/btrfs/reflink.c:533 btrfs_clone() error: potentially dereferencing uninitialized 'trans'.

This is because we are not dealing with the case where the type of a file
extent has an unexpected value (not regular, not prealloc and not inline),
in which case the transaction handle pointer is not initialized.

Such unexpected type should be impossible, except in case of some memory
corruption caused either by bad hardware or some software bug causing
something like a buffer overrun.

So ASSERT that if the extent type is neither regular nor prealloc, then
it must be inline. Bail out with -EUCLEAN and a warning in case it is
not. This silences smatch.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Filipe Manana
1f4613cdbe btrfs: fix unexpected error path when reflinking an inline extent
When reflinking an inline extent, we assert that its file offset is 0 and
that its uncompressed length is not greater than the sector size. We then
return an error if one of those conditions is not satisfied. However we
use a return statement, which results in returning from btrfs_clone()
without freeing the path and buffer that were allocated before, as well as
not clearing the flag BTRFS_INODE_NO_DELALLOC_FLUSH for the destination
inode.

Fix that by jumping to the 'out' label instead, and also add a WARN_ON()
for each condition so that in case assertions are disabled, we get to
known which of the unexpected conditions triggered the error.

Fixes: a61e1e0df9 ("Btrfs: simplify inline extent handling when doing reflinks")
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Filipe Manana
23e3337faf btrfs: reset last_reflink_trans after fsyncing inode
When an inode has a last_reflink_trans matching the current transaction,
we have to take special care when logging its checksums in order to
avoid getting checksum items with overlapping ranges in a log tree,
which could result in missing checksums after log replay (more on that
in the changelogs of commit 40e046acbd ("Btrfs: fix missing data
checksums after replaying a log tree") and commit e289f03ea7 ("btrfs:
fix corrupt log due to concurrent fsync of inodes with shared extents")).
We also need to make sure a full fsync will copy all old file extent
items it finds in modified leaves, because they might have been copied
from some other inode.

However once we fsync an inode, we don't need to keep paying the price of
that extra special care in future fsyncs done in the same transaction,
unless the inode is used for another reflink operation or the full sync
flag is set on it (truncate, failure to allocate extent maps for holes,
and other exceptional and infrequent cases).

So after we fsync an inode reset its last_unlink_trans to zero. In case
another reflink happens, we continue to update the last_reflink_trans of
the inode, just as before. Also set last_reflink_trans to the generation
of the last transaction that modified the inode whenever we need to set
the full sync flag on the inode, just like when we need to load an inode
from disk after eviction.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Filipe Manana
96acb3753e btrfs: voluntarily relinquish cpu when doing a full fsync
Doing a full fsync may require processing many leaves of metadata, which
can take some time and result in a task monopolizing a cpu for too long.
So add a cond_resched() after processing a leaf when doing a full fsync,
while not holding any locks on any tree (a subvolume or a log tree).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Filipe Manana
5b7ce5e287 btrfs: hold on to less memory when logging checksums during full fsync
When doing a full fsync, at copy_items(), we iterate over all extents and
then collect their checksums into a list. After copying all the extents to
the log tree, we then log all the previously collected checksums.

Before the previous patch in the series (subject "btrfs: stop copying old
file extents when doing a full fsync"), we had to do it this way, because
while we were iterating over the items in the leaf of the subvolume tree,
we were holding a write lock on a leaf of the log tree, so logging the
checksums for an extent right after we collected them could result in a
deadlock, in case the checksum items ended up in the same leaf.

However after the previous patch in the series we now do a first iteration
over all the items in the leaf of the subvolume tree before locking a path
in the log tree, so we can now log the checksums right after we have
obtained them. This avoids holding in memory all checksums for all extents
in the leaf while copying items from the source leaf to the log tree. The
amount of memory used to hold all checksums of the extents in a leaf can
be significant. For example if a leaf has 200 file extent items referring
to 1M extents, using the default crc32c checksums, would result in using
over 200K of memory (not accounting for the extra overhead of struct
btrfs_ordered_sum), with smaller or less extents it would be less, but
it could be much more with more extents per leaf and/or much larger
extents.

So change copy_items() to log the checksums for an extent after looking
them up, and then free their memory, as they are no longer necessary.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Filipe Manana
7f30c07288 btrfs: stop copying old file extents when doing a full fsync
When logging an inode in full sync mode, we go over every leaf that was
modified in the current transaction and has items associated to our inode,
and then copy all those items into the log tree. This includes copying
file extent items that were created and added to the inode in past
transactions, which is useless and only makes use more leaf space in the
log tree.

It's common to have a file with many file extent items spanning many
leaves where only a few file extent items are new and need to be logged,
and in such case we log all the file extent items we find in the modified
leaves.

So change the full sync behaviour to skip over file extent items that are
not needed. Those are the ones that match the following criteria:

1) Have a generation older than the current transaction and the inode
   was not a target of a reflink operation, as that can copy file extent
   items from a past generation from some other inode into our inode, so
   we have to log them;

2) Start at an offset within i_size - we must log anything at or beyond
   i_size, otherwise we would lose prealloc extents after log replay.

The following script exercises a scenario where this happens, and it's
somehow close enough to what happened often on a SQL Server workload which
I had to debug sometime ago to fix an issue where a pattern of writes to
prealloc extents and fsync resulted in fsync failing with -EIO (that was
commit ea7036de0d ("btrfs: fix fsync failure and transaction abort
after writes to prealloc extents")). In that particular case, we had large
files that had random writes and were often truncated, which made the
next fsync be a full sync.

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  MKFS_OPTIONS="-O no-holes -R free-space-tree"
  MOUNT_OPTIONS="-o ssd"

  FILE_SIZE=$((1 * 1024 * 1024 * 1024)) # 1G
  # FILE_SIZE=$((2 * 1024 * 1024 * 1024)) # 2G
  # FILE_SIZE=$((512 * 1024 * 1024)) # 512M

  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  # Create a file with many extents. Use direct IO to make it faster
  # to create the file - using buffered IO we would have to fsync
  # after each write (terribly slow).
  echo "Creating file with $((FILE_SIZE / 4096)) extents of 4K each..."
  xfs_io -f -d -c "pwrite -b 4K 0 $FILE_SIZE" $MNT/foobar

  # Commit the transaction, so every extent after this is from an
  # old generation.
  sync

  # Now rewrite only a few extents, which are all far spread apart from
  # each other (e.g. 1G / 32M = 32 extents).
  # After this only a few extents have a new generation, while all other
  # ones have an old generation.
  echo "Rewriting $((FILE_SIZE / (32 * 1024 * 1024))) extents..."
  for ((i = 0; i < $FILE_SIZE; i += $((32 * 1024 * 1024)))); do
      xfs_io -c "pwrite $i 4K" $MNT/foobar >/dev/null
  done

  # Fsync, the inode logged in full sync mode since it was never fsynced
  # before.
  echo "Fsyncing file..."
  xfs_io -c "fsync" $MNT/foobar

  umount $MNT

And the following bpftrace program was running when executing the test
script:

  $ cat bpf-script.sh
  #!/usr/bin/bpftrace

  k:btrfs_log_inode
  {
      @start_log_inode[tid] = nsecs;
  }

  kr:btrfs_log_inode
  /@start_log_inode[tid]/
  {
      @log_inode_dur[tid] = (nsecs - @start_log_inode[tid]) / 1000;
      delete(@start_log_inode[tid]);
  }

  k:btrfs_sync_log
  {
      @start_sync_log[tid] = nsecs;
  }

  kr:btrfs_sync_log
  /@start_sync_log[tid]/
  {
      $sync_log_dur = (nsecs - @start_sync_log[tid]) / 1000;
      printf("btrfs_log_inode() took %llu us\n", @log_inode_dur[tid]);
      printf("btrfs_sync_log()  took %llu us\n", $sync_log_dur);
      delete(@start_sync_log[tid]);
      delete(@log_inode_dur[tid]);
      exit();
  }

With 512M test file, before this patch:

  btrfs_log_inode() took 15218 us
  btrfs_sync_log()  took 1328 us

  Log tree has 17 leaves and 1 node, its total size is 294912 bytes.

With 512M test file, after this patch:

  btrfs_log_inode() took 14760 us
  btrfs_sync_log()  took 588 us

  Log tree has a single leaf, its total size is 16K.

With 1G test file, before this patch:

  btrfs_log_inode() took 27301 us
  btrfs_sync_log()  took 1767 us

  Log tree has 33 leaves and 1 node, its total size is 557056 bytes.

With 1G test file, after this patch:

  btrfs_log_inode() took 26166 us
  btrfs_sync_log()  took 593 us

  Log tree has a single leaf, its total size is 16K

With 2G test file, before this patch:

  btrfs_log_inode() took 50892 us
  btrfs_sync_log()  took 3127 us

  Log tree has 65 leaves and 1 node, its total size is 1081344 bytes.

With 2G test file, after this patch:

  btrfs_log_inode() took 50126 us
  btrfs_sync_log()  took 586 us

  Log tree has a single leaf, its total size is 16K.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Josef Bacik
8cbc3001a3 btrfs: do not clean up repair bio if submit fails
The submit helper will always run bio_endio() on the bio if it fails to
submit, so cleaning up the bio just leads to a variety of use-after-free
and NULL pointer dereference bugs because we race with the endio
function that is cleaning up the bio.  Instead just return BLK_STS_OK as
the repair function has to continue to process the rest of the pages,
and the endio for the repair bio will do the appropriate cleanup for the
page that it was given.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Josef Bacik
510671d2d8 btrfs: do not try to repair bio that has no mirror set
If we fail to submit a bio for whatever reason, we may not have setup a
mirror_num for that bio.  This means we shouldn't try to do the repair
workflow, if we do we'll hit an BUG_ON(!failrec->this_mirror) in
clean_io_failure.  Instead simply skip the repair workflow if we have no
mirror set, and add an assert to btrfs_check_repairable() to make it
easier to catch what is happening in the future.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:52 +01:00
Josef Bacik
f9f15de85d btrfs: do not double complete bio on errors during compressed reads
I hit some weird panics while fixing up the error handling from
btrfs_lookup_bio_sums().  Turns out the compression path will complete
the bio we use if we set up any of the compression bios and then return
an error, and then btrfs_submit_data_bio() will also call bio_endio() on
the bio.

Fix this by making btrfs_submit_compressed_read() responsible for
calling bio_endio() on the bio if there are any errors.  Currently it
was only doing it if we created the compression bios, otherwise it was
depending on btrfs_submit_data_bio() to do the right thing.  This
creates the above problem, so fix up btrfs_submit_compressed_read() to
always call bio_endio() in case of an error, and then simply return from
btrfs_submit_data_bio() if we had to call
btrfs_submit_compressed_read().

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Josef Bacik
606f82e797 btrfs: track compressed bio errors as blk_status_t
Right now we just have a binary "errors" flag, so any error we get on
the compressed bio's gets translated to EIO.  This isn't necessarily a
bad thing, but if we get an ENOMEM it may be nice to know that's what
happened instead of an EIO.  Track our errors as a blk_status_t, and do
the appropriate setting of the errors accordingly.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Josef Bacik
e14bfdb5a1 btrfs: remove the bio argument from finish_compressed_bio_read
This bio is usually one of the compressed bio's, and we don't actually
need it in this function, so remove the argument and stop passing it
around.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Josef Bacik
b0bbc8a3d4 btrfs: check correct bio in finish_compressed_bio_read
Commit c09abff87f ("btrfs: cloned bios must not be iterated by
bio_for_each_segment_all") added ASSERT()'s to make sure we weren't
calling bio_for_each_segment_all() on a RAID5/6 bio.  However it was
checking the bio that the compression code passed in, not the
cb->orig_bio that we actually iterate over, so adjust this ASSERT() to
check the correct bio.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Josef Bacik
1784b7d502 btrfs: handle csum lookup errors properly on reads
Currently any error we get while trying to lookup csums during reads
shows up as a missing csum, and then on the read completion side we
print an error saying there was a csum mismatch and we increase the
device corruption count.

However we could have gotten an EIO from the lookup.  We could also be
inside of a memory constrained container and gotten a ENOMEM while
trying to do the read.  In either case we don't want to make this look
like a file system corruption problem, we want to make it look like the
actual error it is.  Capture any negative value, convert it to the
appropriate blk_status_t, free the csum array if we have one and bail.

Note: a possible improvement would be to make the relocation code look
up the owning inode and see if it's marked as NODATASUM and set
EXTENT_NODATASUM there, that way if there's corruption and there isn't a
checksum when we want it we can fail here rather than later.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Josef Bacik
03ddb19d2e btrfs: make search_csum_tree return 0 if we get -EFBIG
We can either fail to find a csum entry at all and return -ENOENT, or we
can find a range that is close, but return -EFBIG.  In essence these
both mean the same thing when we are doing a lookup for a csum in an
existing range, we didn't find a csum.  We want to treat both of these
errors the same way, complain loudly that there wasn't a csum.  This
currently happens anyway because we do

	count = search_csum_tree();
	if (count <= 0) {
		// reloc and error handling
	}

However it forces us to incorrectly treat EIO or ENOMEM errors as on
disk corruption.  Fix this by returning 0 if we get either -ENOENT or
-EFBIG from btrfs_lookup_csum() so we can do proper error handling.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Omar Sandoval
7c0c7269f7 btrfs: add BTRFS_IOC_ENCODED_WRITE
The implementation resembles direct I/O: we have to flush any ordered
extents, invalidate the page cache, and do the io tree/delalloc/extent
map/ordered extent dance. From there, we can reuse the compression code
with a minor modification to distinguish the write from writeback. This
also creates inline extents when possible.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Omar Sandoval
1881fba89b btrfs: add BTRFS_IOC_ENCODED_READ ioctl
There are 4 main cases:

1. Inline extents: we copy the data straight out of the extent buffer.
2. Hole/preallocated extents: we fill in zeroes.
3. Regular, uncompressed extents: we read the sectors we need directly
   from disk.
4. Regular, compressed extents: we read the entire compressed extent
   from disk and indicate what subset of the decompressed extent is in
   the file.

This initial implementation simplifies a few things that can be improved
in the future:

- Cases 1, 3, and 4 allocate temporary memory to read into before
  copying out to userspace.
- We don't do read repair, because it turns out that read repair is
  currently broken for compressed data.
- We hold the inode lock during the operation.

Note that we don't need to hold the mmap lock. We may race with
btrfs_page_mkwrite() and read the old data from before the page was
dirtied:

btrfs_page_mkwrite         btrfs_encoded_read
---------------------------------------------------
(enter)                    (enter)
                           btrfs_wait_ordered_range
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached
(exit)
                           lock_extent_bits
                           read extent (dirty page hasn't been flushed,
                                        so this is the old data)
                           unlock_extent_cached
                           (exit)

we read the old data from before the page was dirtied. But, that's true
even if we were to hold the mmap lock:

btrfs_page_mkwrite               btrfs_encoded_read
-------------------------------------------------------------------
(enter)                          (enter)
                                 btrfs_inode_lock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) (blocked)
                                 btrfs_wait_ordered_range
                                 lock_extent_bits
				 read extent (page hasn't been dirtied,
                                              so this is the old data)
                                 unlock_extent_cached
                                 btrfs_inode_unlock(BTRFS_ILOCK_MMAP)
down_read(i_mmap_lock) returns
lock_extent_bits
btrfs_page_set_dirty
unlock_extent_cached

In other words, this is inherently racy, so it's fine that we return the
old data in this tiny window.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Omar Sandoval
d9496e8aba btrfs: optionally extend i_size in cow_file_range_inline()
Currently, an inline extent is always created after i_size is extended
from btrfs_dirty_pages(). However, for encoded writes, we only want to
update i_size after we successfully created the inline extent. Add an
update_i_size parameter to cow_file_range_inline() and
insert_inline_extent() and pass in the size of the extent rather than
determining it from i_size.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ reformat comment ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Omar Sandoval
8dd9872d2e btrfs: clean up cow_file_range_inline()
The start parameter to cow_file_range_inline() (and
insert_inline_extent()) is always 0, so get rid of it and simplify the
logic in those two functions. Pass btrfs_inode to insert_inline_extent()
and remove the redundant root parameter. Also document the requirements
for creating an inline extent. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Omar Sandoval
28c9b1e75a btrfs: support different disk extent size for delalloc
Currently, we always reserve the same extent size in the file and extent
size on disk for delalloc because the former is the worst case for the
latter. For BTRFS_IOC_ENCODED_WRITE writes, we know the exact size of
the extent on disk, which may be less than or greater than (for
bookends) the size in the file. Add a disk_num_bytes parameter to
btrfs_delalloc_reserve_metadata() so that we can reserve the correct
amount of csum bytes. No functional change.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Omar Sandoval
cb36a9bb17 btrfs: add ram_bytes and offset to btrfs_ordered_extent
Currently, we only create ordered extents when ram_bytes == num_bytes
and offset == 0. However, BTRFS_IOC_ENCODED_WRITE writes may create
extents which only refer to a subset of the full unencoded extent, so we
need to plumb these fields through the ordered extent infrastructure and
pass them down to insert_reserved_file_extent().

Since we're changing the btrfs_add_ordered_extent* signature, let's get
rid of the trivial wrappers and add a kernel-doc.

Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:51 +01:00
Omar Sandoval
e331f6b19f btrfs: don't advance offset for compressed bios in btrfs_csum_one_bio()
btrfs_csum_one_bio() loops over each filesystem block in the bio while
keeping a cursor of its current logical position in the file in order to
look up the ordered extent to add the checksums to. However, this
doesn't make much sense for compressed extents, as a sector on disk does
not correspond to a sector of decompressed file data. It happens to work
because:

1) the compressed bio always covers one ordered extent
2) the size of the bio is always less than the size of the ordered
   extent

However, the second point will not always be true for encoded writes.

Let's add a boolean parameter to btrfs_csum_one_bio() to indicate that
it can assume that the bio only covers one ordered extent. Since we're
already changing the signature, let's get rid of the contig parameter
and make it implied by the offset parameter, similar to the change we
recently made to btrfs_lookup_bio_sums(). Additionally, let's rename
nr_sectors to blockcount to make it clear that it's the number of
filesystem blocks, not the number of 512-byte sectors.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Sidong Yang
457b0a3d6e btrfs: qgroup: remove outdated TODO comments
These comments are old, outdated and not very specific. It seems that it
doesn't help to inspire anybody to work on that.  So we remove them.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Sidong Yang
a8f6f619e4 btrfs: qgroup: remove duplicated check in adding qgroup relations
Removes duplicated check when adding qgroup relations.
btrfs_add_qgroup_relations function adds relations by calling
add_relation_rb(). add_relation_rb() checks that member/parentid exists
in current qgroup_tree. But it already checked before calling the
function. It seems that we don't need to double check.

Add new function __add_relation_rb() that adds relations with
qgroup structures and makes old function use the new one. And it makes
btrfs_add_qgroup_relation() function work without double checks by
calling the new function.

Signed-off-by: Sidong Yang <realwakka@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Dāvis Mosāns
dc4a4bdb3f btrfs: add lzo workspace buffer length constants
It makes it more readable for length checking and is be used repeatedly.

Signed-off-by: Dāvis Mosāns <davispuh@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Qu Wenruo
40e7efe057 btrfs: populate extent_map::generation when reading from disk
When btrfs_get_extent() tries to get some file extent from disk, it
never populates extent_map::generation, leaving the value to be 0.

On the other hand, for extent map generated by IO, it will get its
generation properly set at finish_ordered_io()

 finish_ordered_io()
 |- unpin_extent_cache(gen = trans->transid)
    |- em->generation = gen;

[CAUSE]
Since extent_map::generation is mostly used by fsync code, and for fsync
they only care about modified extents, which all have their
em::generation > 0.

Thus it's fine to not populate em read from disk for fsync.

[CORNER CASE]
However autodefrag also relies on em::generation to determine if one
extent needs to be defragged.

This unpopulated extent_map::generation can prevent the following
autodefrag case from working:

	mkfs.btrfs -f $dev
	mount $dev $mnt -o autodefrag

	# initial write to queue the inode for autodefrag
	xfs_io -f -c "pwrite 0 4k" $mnt/file
	sync

	# Real fragmented write
	xfs_io -f -s -c "pwrite -b 4096 0 32k" $mnt/file
	sync
	echo "=== before autodefrag ==="
	xfs_io -c "fiemap -v" $mnt/file

	# Drop cache to force em to be read from disk
	echo 3 > /proc/sys/vm/drop_caches
	mount -o remount,commit=1 $mnt
	sleep 3
	sync

	echo "=== After autodefrag ==="
	xfs_io -c "fiemap -v" $mnt/file
	umount $mnt

The result looks like this:

  === before autodefrag ===
  /mnt/btrfs/file:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..15]:         26672..26687        16   0x0
     1: [16..31]:        26656..26671        16   0x0
     2: [32..47]:        26640..26655        16   0x0
     3: [48..63]:        26624..26639        16   0x1
  === After autodefrag ===
  /mnt/btrfs/file:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..15]:         26672..26687        16   0x0
     1: [16..31]:        26656..26671        16   0x0
     2: [32..47]:        26640..26655        16   0x0
     3: [48..63]:        26624..26639        16   0x1

This fragmented 32K will not be defragged by autodefrag.

[FIX]
To make things less weird, just populate extent_map::generation when
reading file extents from disk.

This would make above fragmented extents to be properly defragged:

  == before autodefrag ===
  /mnt/btrfs/file:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..15]:         26672..26687        16   0x0
     1: [16..31]:        26656..26671        16   0x0
     2: [32..47]:        26640..26655        16   0x0
     3: [48..63]:        26624..26639        16   0x1
  === After autodefrag ===
  /mnt/btrfs/file:
   EXT: FILE-OFFSET      BLOCK-RANGE      TOTAL FLAGS
     0: [0..63]:         26688..26751        64   0x1

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Filipe Manana
6d3b050efa btrfs: assert we have a write lock when removing and replacing extent maps
Removing or replacing an extent map requires holding a write lock on the
extent map's tree. We currently do that everywhere, except in one of the
self tests, where it's harmless since there's no concurrency.

In order to catch possible races in the future, assert that we are holding
a write lock on the extent map tree before removing or replacing an extent
map in the tree, and update the self test to obtain a write lock before
removing extent maps.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Filipe Manana
ad3fc7946b btrfs: remove no longer used counter when reading data page
After commit 92082d4097 ("btrfs: integrate page status update for
data read path into begin/end_page_read"), the 'nr' counter at
btrfs_do_readpage() is no longer used, we increment it but we never
read from it. So just remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Filipe Manana
bbf0ea7ea3 btrfs: fix lost error return value when reading a data page
At btrfs_do_readpage(), if we get an error when trying to lookup for an
extent map, we end up marking the page with the error bit, clearing
the uptodate bit on it, and doing everything else that should be done.
However we return success (0) to the caller, when we should return the
error encoded in the extent map pointer. So fix that by returning the
error encoded in the pointer.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Filipe Manana
c03475506e btrfs: stop checking for NULL return from btrfs_get_extent()
At extent_io.c, in the read page and write page code paths, we are testing
if the return value from btrfs_get_extent() can be NULL. However that is
not possible, as btrfs_get_extent() always returns either an error pointer
or a (non-NULL) pointer to an extent map structure.

Everywhere else outside extent_io.c we never check for NULL, we always
treat any returned value as a non-NULL pointer if it does not encode an
error.

So check only for the IS_ERR() case at extent_io.c.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Filipe Manana
e1f53ed874 btrfs: prepare extents to be logged before locking a log tree path
When we want to log an extent, in the fast fsync path, we obtain a path
to the leaf that will hold the file extent item either through a deletion
search, via btrfs_drop_extents(), or through an insertion search using
btrfs_insert_empty_item(). After that we fill the file extent item's
fields one by one directly on the leaf.

Instead of doing that, we could prepare the file extent item before
obtaining a btree path, and then copy the prepared extent item with a
single operation once we get the path. This helps avoid some contention
on the log tree, since we are holding write locks for longer than
necessary, especially in the case where the path is obtained via
btrfs_drop_extents() through a deletion search, which always keeps a
write lock on the nodes at levels 1 and 2 (besides the leaf).

This change does that, we prepare the file extent item that is going to
be inserted before acquiring a path, and then copy it into a leaf using
a single copy operation once we get a path.

This change if part of a patchset that is comprised of the following
patches:

  1/6 btrfs: remove unnecessary leaf free space checks when pushing items
  2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
  3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
  4/6 btrfs: remove constraint on number of visited leaves when replacing extents
  5/6 btrfs: remove useless path release in the fast fsync path
  6/6 btrfs: prepare extents to be logged before locking a log tree path

The following test was run to measure the impact of the whole patchset:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-R free-space-tree -O no-holes"

  NUM_JOBS=8
  FILE_SIZE=128M
  RUN_TIME=200

  cat <<EOF > /tmp/fio-job.ini
  [writers]
  rw=randwrite
  fsync=1
  fallocate=none
  group_reporting=1
  direct=0
  bssplit=4k/20:8k/20:16k/20:32k/10:64k/10:128k/5:256k/5:512k/5:1m/5
  ioengine=sync
  filesize=$FILE_SIZE
  runtime=$RUN_TIME
  time_based
  directory=$MNT
  numjobs=$NUM_JOBS
  thread
  EOF

  echo "performance" | \
      tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

  echo
  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  umount $MNT &> /dev/null
  mkfs.btrfs -f $MKFS_OPTIONS $DEV
  mount $MOUNT_OPTIONS $DEV $MNT

  fio /tmp/fio-job.ini

  umount $MNT

The test ran inside a VM (8 cores, 32G of RAM) with the target disk
mapping to a raw NVMe device, and using a non-debug kernel config
(Debian's default config).

Before the patchset:

WRITE: bw=116MiB/s (122MB/s), 116MiB/s-116MiB/s (122MB/s-122MB/s), io=22.7GiB (24.4GB), run=200013-200013msec

After the patchset:

WRITE: bw=125MiB/s (131MB/s), 125MiB/s-125MiB/s (131MB/s-131MB/s), io=24.3GiB (26.1GB), run=200007-200007msec

A 7.8% gain on throughput and +7.0% more IO done in the same period of
time (200 seconds).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:50 +01:00
Filipe Manana
d845753170 btrfs: remove useless path release in the fast fsync path
There's no point in calling btrfs_release_path() after finishing the loop
that logs the modified extents, since log_one_extent() returns with the
path released. In case the list of extents is empty, the path is already
released, so there's no need for that case as well.
So just remove that unnecessary btrfs_release_path() call.

This change if part of a patchset that is comprised of the following
patches:

  1/6 btrfs: remove unnecessary leaf free space checks when pushing items
  2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
  3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
  4/6 btrfs: remove constraint on number of visited leaves when replacing extents
  5/6 btrfs: remove useless path release in the fast fsync path
  6/6 btrfs: prepare extents to be logged before locking a log tree path

The last patch in the series has some performance test result in its
changelog.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Filipe Manana
7ecb4c31e7 btrfs: remove constraint on number of visited leaves when replacing extents
At btrfs_drop_extents(), we try to replace a range of file extent items
with a new file extent in a single btree search, to avoid the need to do
a search for deletion, followed by a path release and followed by yet
another search for insertion.

When I originally added that optimization, in commit 1acae57b16
("Btrfs: faster file extent item replace operations"), I left a constraint
to do the fast replace only if we visited a single leaf. That was because
in the most common case we find all file extent items that need to be
deleted (or trimmed) in a single leaf, however it can work for other
common cases like when we need to delete a few file extent items located
at the end of a leaf and a few more located at the beginning of the next
leaf. The key for the new file extent item is greater than the key of
any deleted or trimmed file extent item from previous leaves, so we are
fine to use the last leaf that we found as long as we are holding a
write lock on it - even if the new key ends up at slot 0, as if that's
the case, the btree search has obtained a write lock on any upper nodes
that need to have a key pointer updated.

So removed the constraint that limits the optimization to the case where
we visited only a single leaf.

This change if part of a patchset that is comprised of the following
patches:

  1/6 btrfs: remove unnecessary leaf free space checks when pushing items
  2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
  3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
  4/6 btrfs: remove constraint on number of visited leaves when replacing extents
  5/6 btrfs: remove useless path release in the fast fsync path
  6/6 btrfs: prepare extents to be logged before locking a log tree path

The last patch in the series has some performance test result in its
changelog.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Filipe Manana
0cae23b66a btrfs: avoid unnecessary computation when deleting items from a leaf
When deleting items from a leaf, we always compute the sum of the data
sizes of the items that are going to be deleted. However we only use
that sum when the last item to delete is behind the last item in the
leaf. This unnecessarily wastes CPU time when we are deleting either
the whole leaf or from some slot > 0 up to the last item in the leaf,
and both of these cases are common (e.g. truncation operation, either
as a result of truncate(2) or when logging inodes, deleting checksums
after removing a large enough extent, etc).

So compute only the sum of the data sizes if the last item to be
deleted does not match the last item in the leaf.

This change if part of a patchset that is comprised of the following
patches:

  1/6 btrfs: remove unnecessary leaf free space checks when pushing items
  2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
  3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
  4/6 btrfs: remove constraint on number of visited leaves when replacing extents
  5/6 btrfs: remove useless path release in the fast fsync path
  6/6 btrfs: prepare extents to be logged before locking a log tree path

The last patch in the series has some performance test result in its
changelog.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Filipe Manana
7c4063d19e btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
When we delete items from a leaf, if we end up with more than two thirds
of unused leaf space, we try to delete the leaf by moving all its items
into its left and right neighbour leaves. Sometimes that is not possible
because there is not enough free space in the left and right leaves, and
in that case we end up not deleting our leaf.

The way we are doing this is not ideal and can be improved in the
following ways:

1) When we call push_leaf_left(), we pass a value of 1 byte to the data
   size parameter of push_leaf_left(). This is not realistic value because
   no item can have a size less than 25 bytes, which is the size of struct
   btrfs_item. This means that means that if the left leaf has not enough
   free space to push any item, we end up COWing it even if we end up not
   changing its content at all.

   COWing that leaf means allocating a new metadata extent, marking it
   dirty and doing more IO when committing a transaction or when syncing a
   log tree. For a log tree case, it's particularly more important to
   avoid the useless COW operation, as more IO can imply a higher latency
   for an fsync operation.

   So instead of passing 1 as the minimum data size for push_leaf_left(),
   pass the size of the first item in our leaf, as we don't want to COW
   the left leaf if we can't at least push the first item of our leaf;

2) When we call push_leaf_right(), we also pass a value of 1 byte as the
   data size parameter of push_leaf_right(). Like the previous case, it
   will also result in COWing the right leaf even if we are not able to
   move any items into it, since there can't be any item with a size
   smaller than 25 bytes (the size of struct btrfs_item).

   So instead of passing 1 as the minimum data size to push_leaf_right(),
   pass a size that corresponds to the sum of the size of all the
   remaining items in our leaf. We are not interested in moving less than
   that, because if we do, we are not able to delete our leaf and we have
   COWed the right leaf for nothing. Plus, moving only some of the items
   of our leaf, it means an even less balanced tree.

   Just like the previous case, we want to avoid the useless COW of the
   right leaf, this way we don't have to spend time allocating one new
   metadata extent, and doing more IO when committing a transaction or
   syncing a log tree. For the log tree case it's specially more important
   because more IO can result in a higher latency for a fsync operation.

So adjust the minimum data size passed to push_leaf_left() and
push_leaf_right() as mentioned above.

This change if part of a patchset that is comprised of the following
patches:

  1/6 btrfs: remove unnecessary leaf free space checks when pushing items
  2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
  3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
  4/6 btrfs: remove constraint on number of visited leaves when replacing extents
  5/6 btrfs: remove useless path release in the fast fsync path
  6/6 btrfs: prepare extents to be logged before locking a log tree path

Not being able to delete a leaf that became less than 1/3 full after
deleting items from it is actually common. For example, for the fio test
mentioned in the changelog of patch 6/6, we are only able to delete a
leaf at btrfs_del_items() about 5.3% of the time, due to its left and
right neighbour leaves not having enough free space to push all the
remaining items into them.

The last patch in the series has some performance test result in its
changelog.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Filipe Manana
b4e098a97f btrfs: remove unnecessary leaf free space checks when pushing items
When trying to push items from a leaf into its left and right neighbours,
we lock the left or right leaf, check if it has the required minimum free
space, COW the leaf and then check again if it has the minimum required
free space. This second check is pointless:

1) Most and foremost because it's not needed. We have a write lock on the
   leaf and on its parent node, so no one can come in and change either
   the pre-COW or post-COW version of the leaf for the whole duration of
   the push_leaf_left() and push_leaf_right() calls;

2) The call to btrfs_leaf_free_space() is not trivial, it has a fair
   amount of arithmetic operations and access to fields in the leaf's
   header and items, so it's not very cheap.

So remove the duplicated free space checks.

This change if part of a patchset that is comprised of the following
patches:

  1/6 btrfs: remove unnecessary leaf free space checks when pushing items
  2/6 btrfs: avoid unnecessary COW of leaves when deleting items from a leaf
  3/6 btrfs: avoid unnecessary computation when deleting items from a leaf
  4/6 btrfs: remove constraint on number of visited leaves when replacing extents
  5/6 btrfs: remove useless path release in the fast fsync path
  6/6 btrfs: prepare extents to be logged before locking a log tree path

The last patch in the series has some performance test result in its
changelog.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Johannes Thumshirn
6b5b7a41d0 btrfs: stop checking for NULL return from btrfs_get_extent_fiemap()
In get_extent_skip_holes() we're checking the return of
btrfs_get_extent_fiemap() for an error pointer or NULL, but
btrfs_get_extent_fiemap() will never return NULL, only error pointers or
a valid extent_map.

The other caller of btrfs_get_extent_fiemap(), find_desired_extent(),
correctly only checks for error-pointers.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Pankaj Raghav
f716fa4798 btrfs: zoned: remove redundant assignment in btrfs_check_zoned_mode
Remove the redundant assignment to zone_info variable in
btrfs_check_zoned_mode function.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
David Sterba
a55e65b80e btrfs: replace BUILD_BUG_ON by static_assert
The static_assert introduced in 6bab69c650 ("build_bug.h: add wrapper
for _Static_assert") has been supported by compilers for a long time
(gcc 4.6, clang 3.0) and can be used in header files. We don't need to
put BUILD_BUG_ON to random functions but rather keep it next to the
definition.

The exception here is the UAPI header btrfs_tree.h that could be
potentially included by userspace code and the static assert is not
defined (nor used in any other header).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Johannes Thumshirn
265f7237dd btrfs: zoned: allow DUP on meta-data block groups
Allow creating or reading block-groups on a zoned device with DUP as a
meta-data profile.

This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE
operations for meta-data on zoned btrfs, so all writes to meta-data zones
are aligned to the zone's write-pointer.

Upon loading of the block-group, it is ensured both zones do have the same
zone capacity and write-pointer offsets, so no extra machinery is needed
to keep the write-pointers in sync for the meta-data zones. If this
prerequisite is not met, loading of the block-group is refused.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Johannes Thumshirn
dbfcc18f27 btrfs: zoned: prepare for allowing DUP on zoned
Allow for a block-group to be placed on more than one physical zone.

This is a preparation for allowing DUP profiles for meta-data on a zoned
file-system.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Johannes Thumshirn
4dcbb8ab31 btrfs: zoned: make zone finishing multi stripe capable
Currently finishing of a zone only works if the block group isn't
spanning more than one zone.

This limitation is purely artificial and can be easily expanded to block
groups being places across multiple zones.

This is a preparation for allowing DUP and later more complex block-group
profiles on zoned btrfs.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00
Johannes Thumshirn
f9a912a3c4 btrfs: zoned: make zone activation multi stripe capable
Currently activation of a zone only works if the block group isn't
spanning more than one zone.

This limitation is purely artificial and can be easily expanded to block
groups being places across multiple zones.

This is a preparation for allowing DUP and later more complex block-group
profiles on zoned btrfs.

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-03-14 13:13:49 +01:00