Commit Graph

1216481 Commits

Author SHA1 Message Date
Filipe Manana
28270e25c6 btrfs: always reserve space for delayed refs when starting transaction
When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers. Basically how it works:

1) When we start a transaction we reserve space for the number of items
   the caller wants to be inserted/modified/deleted in a btree. This space
   goes to the transaction block reserve;

2) If the delayed refs block reserve is not full, its size is greater
   than the amount of its reserved space, and the flush method is
   BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
   it corresponding to the number of items the caller wants to
   insert/modify/delete in a btree;

3) The size of the delayed refs block reserve is increased when a task
   creates delayed refs after COWing an extent buffer, allocating a new
   one or deleting (freeing) an extent buffer. This happens after the
   the task started or joined a transaction, whenever it calls
   btrfs_update_delayed_refs_rsv();

4) The delayed refs block reserve is then refilled by anyone calling
   btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
   operations or when someone else calls btrfs_start_transaction() with
   a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;

5) As a task COWs or allocates extent buffers, it consumes space from the
   transaction block reserve. When the task releases its transaction
   handle (btrfs_end_transaction()) or it attempts to commit the
   transaction, it releases any remaining space in the transaction block
   reserve that it did not use, as not all space may have been used (due
   to pessimistic space calculation) by calling btrfs_block_rsv_release()
   which will try to add that unused space to the delayed refs block
   reserve (if its current size is greater than its reserved space).
   That transferred space may not be enough to completely fulfill the
   delayed refs block reserve.

   Plus we have some tasks that will attempt do modify as many leaves
   as they can before getting -ENOSPC (and then reserving more space and
   retrying), such as hole punching and extent cloning which call
   btrfs_replace_file_extents(). Such tasks can generate therefore a
   high number of delayed refs, for both metadata and data (we can't
   know in advance how many file extent items we will find in a range
   and therefore how many delayed refs for dropping references on data
   extents we will generate);

6) If a transaction starts its commit before the delayed refs block
   reserve is refilled, for example by the transaction kthread or by
   someone who called btrfs_join_transaction() before starting the
   commit, then when running delayed references if we don't have enough
   reserved space in the delayed refs block reserve, we will consume
   space from the global block reserve.

Now this doesn't make a lot of sense because:

1) We should reserve space for delayed references when starting the
   transaction, since we have no guarantees the delayed refs block
   reserve will be refilled;

2) If no refill happens then we will consume from the global block reserve
   when running delayed refs during the transaction commit;

3) If we have a bunch of tasks calling btrfs_start_transaction() with a
   number of items greater than zero and at the time the delayed refs
   reserve is full, then we don't reserve any space at
   btrfs_start_transaction() for the delayed refs that will be generated
   by a task, and we can therefore end up using a lot of space from the
   global reserve when running the delayed refs during a transaction
   commit;

4) There are also other operations that result in bumping the size of the
   delayed refs reserve, such as creating and deleting block groups, as
   well as the need to update a block group item because we allocated or
   freed an extent from the respective block group;

5) If we have a significant gap between the delayed refs reserve's size
   and its reserved space, two very bad things may happen:

   1) The reserved space of the global reserve may not be enough and we
      fail the transaction commit with -ENOSPC when running delayed refs;

   2) If the available space in the global reserve is enough it may result
      in nearly exhausting it. If the fs has no more unallocated device
      space for allocating a new block group and all the available space
      in existing metadata block groups is not far from the global
      reserve's size before we started the transaction commit, we may end
      up in a situation where after the transaction commit we have too
      little available metadata space, and any future transaction commit
      will fail with -ENOSPC, because although we were able to reserve
      space to start the transaction, we were not able to commit it, as
      running delayed refs generates some more delayed refs (to update the
      extent tree for example) - this includes not even being able to
      commit a transaction that was started with the goal of unlinking a
      file, removing an empty data block group or doing reclaim/balance,
      so there's no way to release metadata space.

      In the worst case the next time we mount the filesystem we may
      also fail with -ENOSPC due to failure to commit a transaction to
      cleanup orphan inodes. This later case was reported and hit by
      someone running a SLE (SUSE Linux Enterprise) distribution for
      example - where the fs had no more unallocated space that could be
      used to allocate a new metadata block group, and the available
      metadata space was about 1.5M, not enough to commit a transaction
      to cleanup an orphan inode (or do relocation of data block groups
      that were far from being full).

So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full. The space reserved for the delayed refs is added
to a local block reserve that is part of the transaction handle, and when
a task updates the delayed refs block reserve size, after creating a
delayed ref, the space is transferred from that local reserve to the
global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
local reserve does not have enough space, which may happen for tasks
that generate a variable and potentially large number of delayed refs
(such as the hole punching and extent cloning cases mentioned before),
we transfer any available space and then rely on the current behaviour
of hoping some other task refills the delayed refs reserve or fallback
to the global block reserve.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
adb86dbe42 btrfs: stop doing excessive space reservation for csum deletion
Currently when reserving space for deleting the csum items for a data
extent, when adding or updating a delayed ref head, we determine how
many leaves of csum items we can have and then pass that number to the
helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
space for all tree modifications we need when running delayed references,
however the amount of space it computes is excessive for deleting csum
items because:

1) It uses btrfs_calc_insert_metadata_size() which is excessive because
   we only need to delete csum items from the csum tree, we don't need
   to insert any items, so btrfs_calc_metadata_size() is all we need (as
   it computes space needed to delete an item);

2) If the free space tree is enabled, it doubles the amount of space,
   which is pointless for csum deletion since we don't need to touch the
   free space tree or any other tree other than the csum tree.

So improve on this by tracking how many csum deletions we have and using
a new helper to calculate space for csum deletions (just a wrapper around
btrfs_calc_metadata_size() with a comment). This reduces the amount of
space we need to reserve for csum deletions by a factor of 4, and it helps
reduce the number of times we have to block space reservations and have
the reclaim task enter the space flushing algorithm (flush delayed items,
flush delayed refs, etc) in order to satisfy tickets.

For example this results in a total time decrease when unlinking (or
truncating) files with many extents, as we end up having to block on space
metadata reservations less often. Example test:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/nullb0
  MNT=/mnt/test

  umount $DEV &> /dev/null
  mkfs.btrfs -f $DEV
  # Use compression to quickly create files with a lot of extents
  # (each with a size of 128K).
  mount -o compress=lzo $DEV $MNT

  # 100G gives at least 983040 extents with a size of 128K.
  xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar

  # Flush all delalloc and clear all metadata from memory.
  umount $MNT
  mount -o compress=lzo $DEV $MNT

  start=$(date +%s%N)
  rm -f $MNT/foobar
  end=$(date +%s%N)
  dur=$(( (end - start) / 1000000 ))
  echo "rm took $dur milliseconds"

  umount $MNT

Before this change rm took: 7504 milliseconds
After this change rm took:  6574 milliseconds  (-12.4%)

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
b6ea3e6ab5 btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release()
There's no point in initializing to 0 the local variable 'released' as
we don't use it before the next assignment to it. So remove the
initialization. This may help avoid some warnings with clang tools such
as the one reported/fixed by commit 966de47ff0 ("btrfs: remove redundant
initialization of variables in log_new_ancestors").

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
3ee56a58ad btrfs: reserve space for delayed refs on a per ref basis
Currently when reserving space for delayed refs we do it on a per ref head
basis. This is generally enough because most back refs for an extent end
up being inlined in the extent item - with the default leaf size of 16K we
can have at most 33 inline back refs (this is calculated by the macro
BTRFS_MAX_EXTENT_ITEM_SIZE()). The amount of bytes reserved for each ref
head is given by btrfs_calc_delayed_ref_bytes(), which basically
corresponds to a single path for insertion into the extent tree plus
another path for insertion into the free space tree if it's enabled.

However if we have reached the limit of inline refs or we have a mix of
inline and non-inline refs, then we will need to insert a non-inline ref
and update the existing extent item to update the total number of
references for the extent. This implies we need reserved space for two
insertion paths in the extent tree, but we only reserved for one path.
The extent item and the non-inline ref item may be located in different
leaves, or even if they are located in the same leaf, after updating the
extent item and before inserting the non-inline ref item, the extent
buffers in the btree path may have been written (due to memory pressure
for e.g.), in which case we need to COW the entire path again. In this
case since we have not reserved enough space for the delayed refs block
reserve, we will use the global block reserve.

If we are in a situation where the fs has no more unallocated space enough
to allocate a new metadata block group and available space in the existing
metadata block groups is close to the maximum size of the global block
reserve (512M), we may end up consuming too much of the free metadata
space to the point where we can't commit any future transaction because it
will fail, with -ENOSPC, during its commit when trying to allocate an
extent for some COW operation (running delayed refs generated by running
delayed refs or COWing the root tree's root node at commit_cowonly_roots()
for example). Such dramatic scenario can happen if we have many delayed
refs that require the insertion of non-inline ref items, due to too many
reflinks or snapshots. We also have situations where we use the global
block reserve because we could not in advance know that we will need
space to update some trees (block group creation for example), so this
all adds up to increase the chances of exhausting the global block reserve
and making any future transaction commit to fail with -ENOSPC and turn
the fs into RO mode, or fail the mount operation in case the mount needs
to start and commit a transaction, such as when we have orphans to cleanup
for example - such case was reported and hit by someone running a SLE
(SUSE Linux Enterprise) distribution for example - where the fs had no
more unallocated space that could be used to allocate a new metadata block
group, and the available metadata space was about 1.5M, not enough to
commit a transaction to cleanup an orphan inode (or do relocation of data
block groups that were far from being full).

So reserve space for delayed refs by individual refs and not by ref heads,
as we may need to COW multiple extent tree paths due to non-inline ref
items.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
8a526c44da btrfs: allow to run delayed refs by bytes to be released instead of count
When running delayed references, through btrfs_run_delayed_refs(), we can
specify how many to run, run all existing delayed references and keep
running delayed references while we can find any. This is controlled with
the value of the 'count' argument, where a value of 0 means to run all
delayed references that exist by the time btrfs_run_delayed_refs() is
called, (unsigned long)-1 means to keep running delayed references while
we are able find any, and any other value to run that exact number of
delayed references.

Typically a specific value other than 0 or -1 is used when flushing space
to try to release a certain amount of bytes for a ticket. In this case
we just simply calculate how many delayed reference heads correspond to a
specific amount of bytes, with calc_delayed_refs_nr(). However that only
takes into account the space reserved for the reference heads themselves,
and does not account for the space reserved for deleting checksums from
the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
in case we are going to delete a data extent. This means we may end up
running more delayed references than necessary in case we process delayed
references for deleting a data extent.

So change the logic of btrfs_run_delayed_refs() to take a bytes argument
to specify how many bytes of delayed references to run/release, using the
special values of 0 to mean all existing delayed references and U64_MAX
(or (u64)-1) to keep running delayed references while we can find any.

This prevents running more delayed references than necessary, when we have
delayed references for deleting data extents, but also makes the upcoming
changes/patches simpler and it's preparatory work for them.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
da8848ac6a btrfs: simplify check for extent item overrun at lookup_inline_extent_backref()
At lookup_inline_extent_backref() we can simplify the check for an overrun
of the extent item by making the while loop's condition to be "ptr < end"
and then check after the loop if an overrun happened ("ptr > end"). This
reduces indentation and makes the loop condition more clear. So move the
check out of the loop and change the loop condition accordingly, while
also adding the 'unlikely' tag to the check since it's not supposed to be
triggered.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
eba444f1c0 btrfs: return -EUCLEAN if extent item is missing when searching inline backref
At lookup_inline_extent_backref() when trying to insert an inline backref,
if we don't find the extent item we log an error and then return -EIO.
This error code is confusing because there was actually no IO error, and
this means we have some corruption, either caused by a bug or something
like a memory bitflip for example. So change the error code from -EIO to
-EUCLEAN.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
cc925b9646 btrfs: use a single variable for return value at lookup_inline_extent_backref()
At lookup_inline_extent_backref(), instead of using a 'ret' and an 'err'
variable for tracking the return value, use a single one ('ret'). This
simplifies the code, makes it comply with most of the existing code and
it's less prone for logic errors as time has proven over and over in the
btrfs code.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
20fb05a6d1 btrfs: use a single variable for return value at run_delayed_extent_op()
Instead of using a 'ret' and an 'err' variable at run_delayed_extent_op()
for tracking the return value, use a single one ('ret'). This simplifies
the code, makes it comply with most of the existing code and it's less
prone for logic errors as time has proven over and over in the btrfs code.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
e721043a98 btrfs: remove pointless 'ref_root' variable from run_delayed_data_ref()
The 'ref_root' variable, at run_delayed_data_ref(), is not really needed
as we can always use ref->root directly, plus its initialization to 0 is
completely pointless as we assign it ref->root before its first use.
So just drop that variable and use ref->root directly.

This may help avoid some warnings with clang tools such as the one
reported/fixed by commit 966de47ff0 ("btrfs: remove redundant
initialization of variables in log_new_ancestors").

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
7cce0d690d btrfs: initialize key where it's used when running delayed data ref
At run_delayed_data_ref() we are always initializing a key but the key
is only needed and used if we are inserting a new extent. So move the
declaration and initialization of the key to 'if' branch where it's used.
Also rename the key from 'ins' to 'key', as it's a more clear name.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
Filipe Manana
1df6b3c060 btrfs: remove refs_to_drop argument from __btrfs_free_extent()
Currently the 'refs_to_drop' argument of __btrfs_free_extent() always
matches the value of node->ref_mod, so remove the argument and use
node->ref_mod at __btrfs_free_extent().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Filipe Manana
88b2d08879 btrfs: remove refs_to_add argument from __btrfs_inc_extent_ref()
Currently the 'refs_to_add' argument of __btrfs_inc_extent_ref() always
matches the value of node->ref_mod, so remove the argument and use
node->ref_mod at __btrfs_inc_extent_ref().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Filipe Manana
abff279eb3 btrfs: remove the refcount warning/check at btrfs_put_delayed_ref()
At btrfs_put_delayed_ref(), it's pointless to have a WARN_ON() to check if
the refcount of the delayed ref is zero. Such check is already done by the
refcount_t module and refcount_dec_and_test(), which loudly complains if
we try to decrement a reference count that is currently 0.

The WARN_ON() dates back to the time when used a regular atomic_t type
for the reference counter, before we switched to the refcount_t type.
The main goal of the refcount_t type/module is precisely to catch such
types of bugs and loudly complain if they happen.

This also reduces a bit the module's text size.
Before this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1612483	 167145	  16864	1796492	 1b698c	fs/btrfs/btrfs.ko

After this change:

   $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
   1612371	 167073	  16864	1796308	 1b68d4	fs/btrfs/btrfs.ko

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Filipe Manana
3cbb9f5160 btrfs: remove unnecessary logic when running new delayed references
When running delayed references, at btrfs_run_delayed_refs(), we have this
logic to run any new delayed references that might have been added just
after we ran all delayed references. This logic grabs the first delayed
reference, then locks it to wait for any contention on it before running
all new delayed references. This however is pointless and not necessary
because at __btrfs_run_delayed_refs() when we start running delayed
references, we pick the first reference with btrfs_obtain_ref_head() and
then we will lock it (with btrfs_delayed_ref_lock()).

So remove the duplicate and unnecessary logic at btrfs_run_delayed_refs().

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Filipe Manana
03551d651e btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes()
We are passing a block reserve argument to btrfs_reserve_metadata_bytes()
which is not really used, all we need is to pass the space_info associated
to the block reserve, we don't change the block reserve at all.

Not only it's pointless to pass the block reserve, it's also confusing as
one might think that the reserved bytes will end up being added to the
passed block reserve, when that's not the case. The pattern for reserving
space and adding it to a block reserve is to first reserve space with
btrfs_reserve_metadata_bytes() and if that succeeds, then add the space to
a block reserve by calling btrfs_block_rsv_add_bytes().

Also the reverse of btrfs_reserve_metadata_bytes(), which is
btrfs_space_info_free_bytes_may_use(), takes a space_info argument and
not a block reserve, so one more reason to pass a space_info and not a
block reserve to btrfs_reserve_metadata_bytes().

So change btrfs_reserve_metadata_bytes() and its callers to pass a
space_info argument instead of a block reserve argument.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Qu Wenruo
9fb2acc2fe btrfs: remove the need_raid_map parameter from btrfs_map_block()
The parameter @need_raid_map is mostly a legacy from the old days where
we don't yet have a solid definition on the @mirror_num, and only
check-integrity was using that parameter, while all other call sites
just pass 1 for that parameter.

Now since we have removed check-integrity functionality, we can also
remove the @need_raid_map parameter.

This change will also remove the ability to read P/Q stripe directly
when passing 0 as @need_raid_map.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Qu Wenruo
732fab95ab btrfs: check-integrity: remove CONFIG_BTRFS_FS_CHECK_INTEGRITY option
Since all check-integrity entry points have been removed, let's also
remove the config and all related code relying on that.

And since we have removed the mount option for check-integrity, we also
need to re-number all the BTRFS_MOUNT_* enums.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Qu Wenruo
fb2a836da4 btrfs: check-integrity: remove btrfsic_unmount() function
The function btrfsic_mount() is part of the deprecated check-integrity
functionality.

Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Qu Wenruo
af32d3632e btrfs: check-integrity: remove btrfsic_mount() function
The function btrfsic_mount() is part of the deprecated check-integrity
functionality.

Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
Qu Wenruo
51cf580c23 btrfs: check-integrity: remove btrfsic_check_bio() function
The function btrfsic_check_bio() is part of the deprecated
check-integrity functionality.

Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
David Sterba
150cce2d9f btrfs: move extent_buffer::lock_owner to debug section
The lock_owner is used for a rare corruption case and we haven't seen
any reports in years. Move it to the debugging section of eb.  To close
the holes also move log_index so the final layout looks like:

struct extent_buffer {
        u64                        start;                /*     0     8 */
        long unsigned int          len;                  /*     8     8 */
        long unsigned int          bflags;               /*    16     8 */
        struct btrfs_fs_info *     fs_info;              /*    24     8 */
        spinlock_t                 refs_lock;            /*    32     4 */
        atomic_t                   refs;                 /*    36     4 */
        int                        read_mirror;          /*    40     4 */
        s8                         log_index;            /*    44     1 */

        /* XXX 3 bytes hole, try to pack */

        struct callback_head       callback_head __attribute__((__aligned__(8))); /*    48    16 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        struct rw_semaphore        lock;                 /*    64    40 */
        struct page *              pages[16];            /*   104   128 */

        /* size: 232, cachelines: 4, members: 11 */
        /* sum members: 229, holes: 1, sum holes: 3 */
        /* forced alignments: 1, forced holes: 1, sum forced holes: 3 */
        /* last cacheline: 40 bytes */
} __attribute__((__aligned__(8)));

This saves 8 bytes in total and still keeps the lock on a separate cacheline.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:05 +02:00
David Sterba
321f4992c1 btrfs: reduce size of struct btrfs_ref
We can reduce two members' size that in turn reduce size of struct
btrfs_ref from 64 to 56 bytes. As the structure is often used as a local
variable several functions reduce their stack usage.

- make enum btrfs_ref_type packed, there are only 4 values

- switch action and its values to a packed enum

Final structure layout:

struct btrfs_ref {
        enum btrfs_ref_type        type;                 /*     0     1 */
        enum btrfs_delayed_ref_action action;            /*     1     1 */
        bool                       skip_qgroup;          /*     2     1 */

        /* XXX 5 bytes hole, try to pack */

        u64                        bytenr;               /*     8     8 */
        u64                        len;                  /*    16     8 */
        u64                        parent;               /*    24     8 */
        union {
                struct btrfs_data_ref data_ref;          /*    32    24 */
                struct btrfs_tree_ref tree_ref;          /*    32    16 */
        };                                               /*    32    24 */

        /* size: 56, cachelines: 1, members: 7 */
        /* sum members: 51, holes: 1, sum holes: 5 */
        /* last cacheline: 56 bytes */
};

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
e41570d379 btrfs: reduce size and reorder compression members in struct btrfs_inode
Currently the compression type values are bounded and fit to an u8, we
can pack the btrfs_inode a bit by reordering them to the space created
by the location key. This reduces size from 1112 to 1104.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
105c8c4214 btrfs: reduce size of prelim_ref::level
The values of level are bounded and fit into a byte so let's use it for
the structure to reduce size from 88 to 80 bytes on a release build,
which increases number of objects in the default 8K slab from 93 to 102.

struct prelim_ref {
        struct rb_node             rbnode __attribute__((__aligned__(8))); /*     0    24 */
        u64                        root_id;              /*    24     8 */
        struct btrfs_key           key_for_search;       /*    32    17 */
        u8                         level;                /*    49     1 */

        /* XXX 2 bytes hole, try to pack */

        int                        count;                /*    52     4 */
        struct extent_inode_elem * inode_list;           /*    56     8 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u64                        parent;               /*    64     8 */
        u64                        wanted_disk_byte;     /*    72     8 */

        /* size: 80, cachelines: 2, members: 8 */
        /* sum members: 78, holes: 1, sum holes: 2 */
        /* forced alignments: 1 */
        /* last cacheline: 16 bytes */
} __attribute__((__aligned__(8)));

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
02cd00fa78 btrfs: reduce arguments of helpers space accounting root item
There are two helpers to increase used bytes of root items that add or
subtract one node size, we don't need to pass the argument for that.
Rename the function so it matches the root item member that gets
changed.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
007dec8c7e btrfs: reduce parameters of btrfs_pin_extent_for_log_replay
Both callers of btrfs_pin_extent_for_log_replay expand the parameters to
extent buffer members. We can simply pass the extent buffer instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
f863c50277 btrfs: reduce parameters of btrfs_pin_reserved_extent
There is only one caller of btrfs_pin_reserved_extent that expands the
parameters to extent buffer members. We can simply pass the extent
buffer instead.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
203f6a8772 btrfs: drop __must_check annotations
Drop all __must_check annotations because they're used in random
functions and not consistently. All errors should be handled.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
9580503bcb btrfs: reformat remaining kdoc style comments
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
David Sterba
33b6b25191 btrfs: move functions comments from qgroup.h to qgroup.c
We keep the comments next to the implementation, there were some left
to move.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
Anand Jain
cb6eb4757e btrfs: comment about fsid and metadata_uuid relationship
Add a comment explaining the relationship between fsid and metadata_uuid
in the on-disk superblock and the in-memory struct btrfs_fs_devices.

Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:04 +02:00
Jiapeng Chong
1246873114 btrfs: qgroup: remove unused helpers for ulist aux data
These functions are defined in the qgroup.c file, but not called
anymore since commit "btrfs: qgroup: use qgroup_iterator_nested to in
qgroup_update_refcnt()" so we can delete them.

fs/btrfs/qgroup.c:149:19: warning: unused function 'qgroup_to_aux'.
fs/btrfs/qgroup.c:154:36: warning: unused function 'unode_aux_to_qgroup'.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6566
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Qu Wenruo
79ace7b807 btrfs: qgroup: prealloc btrfs_qgroup_list for __add_relation_rb()
Currently we go GFP_ATOMIC allocation for qgroup relation add, this
includes the following 3 call sites:

- btrfs_read_qgroup_config()
  This is not really needed, as at that time we're still in single
  thread mode, and no spin lock is held.

- btrfs_add_qgroup_relation()
  This one is holding a spinlock, but we're ensured to add at most one
  relation, thus we can easily do a preallocation and use the
  preallocated memory to avoid GFP_ATOMIC.

- btrfs_qgroup_inherit()
  This is a little more tricky, as we may have as many relationships as
  inherit::num_qgroups.
  Thus we have to properly allocate an array then preallocate all the
  memory.

This patch would remove the GFP_ATOMIC allocation for above involved
call sites, by doing preallocation before holding the spinlock, and let
__add_relation_rb() to handle the freeing of the structure.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Qu Wenruo
8d54518b5e btrfs: qgroup: pre-allocate btrfs_qgroup to reduce GFP_ATOMIC usage
Qgroup is the heaviest user of GFP_ATOMIC, but one call site does not
really need GFP_ATOMIC, that is add_qgroup_rb().

That function only searches the rbtree to find if we already have such
entry.  If not, then it would try to allocate memory for it.

This means we can afford to pre-allocate such structure unconditionally,
then free the memory if it's not needed.

Considering this function is not a hot path, only utilized by the
following functions:

- btrfs_qgroup_inherit()
  For "btrfs subvolume snapshot -i" option.

- btrfs_read_qgroup_config()
  At mount time, and we're ensured there would be no existing rb tree
  entry for each qgroup.

- btrfs_create_qgroup()

Thus we're completely safe to pre-allocate the extra memory for btrfs_qgroup
structure, and reduce unnecessary GFP_ATOMIC usage.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Qu Wenruo
dce28769a3 btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()
The ulist @qgroups is utilized to record all involved qgroups from both
old and new roots inside btrfs_qgroup_account_extent().

Due to the fact that qgroup_update_refcnt() itself is already utilizing
qgroup_iterator, here we have to introduce another list_head,
btrfs_qgroup::nested_iterator, allowing nested iteration.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Qu Wenruo
a4a81383fb btrfs: qgroup: use qgroup_iterator to replace tmp ulist in qgroup_update_refcnt()
For function qgroup_update_refcnt(), we use @tmp list to iterate all the
involved qgroups of a subvolume.

It's a perfect match for qgroup_iterator facility, as that @tmp ulist
has a very limited lifespan (just inside the while() loop).

By migrating to qgroup_iterator, we can get rid of the GFP_ATOMIC memory
allocation and no error handling is needed.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Qu Wenruo
a0bdc04b07 btrfs: qgroup: use qgroup_iterator in __qgroup_excl_accounting()
With the new qgroup_iterator_add() and qgroup_iterator_clean(), we can
get rid of the ulist and its GFP_ATOMIC memory allocation.

Furthermore we can merge the code handling the initial and parent
qgroups into one loop, and drop the @tmp ulist parameter for involved
call sites.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Qu Wenruo
0913445082 btrfs: qgroup: use qgroup_iterator in qgroup_convert_meta()
With the new qgroup_iterator_add() and qgroup_iterator_clean(), we can
get rid of the ulist and its GFP_ATOMIC memory allocation.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Qu Wenruo
25152cb7a8 btrfs: qgroup: use qgroup_iterator in btrfs_qgroup_free_refroot()
With the new qgroup_iterator_add() and qgroup_iterator_clean(), we can
get rid of the ulist and its GFP_ATOMIC memory allocation.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Qu Wenruo
686c4a5a42 btrfs: qgroup: iterate qgroups without memory allocation for qgroup_reserve()
Qgroup heavily relies on ulist to go through all the involved
qgroups, but since we're using ulist inside fs_info->qgroup_lock
spinlock, this means we're doing a lot of GFP_ATOMIC allocations.

This patch reduces the GFP_ATOMIC usage for qgroup_reserve() by
eliminating the memory allocation completely.

This is done by moving the needed memory to btrfs_qgroup::iterator
list_head, so that we can put all the involved qgroup into a on-stack
list, thus eliminating the need to allocate memory while holding
spinlock.

The only cost is the slightly higher memory usage, but considering the
reduce GFP_ATOMIC during a hot path, it should still be acceptable.

Function qgroup_reserve() is the perfect start point for this
conversion.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Josef Bacik
2a3a1dd99e btrfs: remove extraneous includes from ctree.h
We don't need any of these includes in the ctree.h header file for the
header file itself, remove them to clean up ctree.h a little bit.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Josef Bacik
c60a28806c btrfs: include linux/security.h in super.c
We use some of the security related code in here, include it in super.c
so we can remove the include from ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Josef Bacik
5335f4376c btrfs: include trace header in where necessary
If we no longer include the tracepoints from ctree.h we fail to compile
because we have the dependency in some of the header files and source
files.  Add the include where we have these dependencies to allow us to
remove the include from ctree.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:03 +02:00
Josef Bacik
82cc2ade2a btrfs: add btrfs_delayed_ref_head declaration to extent-tree.h
extent-tree.h uses btrfs_delayed_ref_head in a function argument but
doesn't pull it's declaration from anywhere, add it to the top of the
header.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:02 +02:00
Josef Bacik
04cc63d12c btrfs: add fscrypt related dependencies to respective headers
These headers have struct fscrypt_str as function arguments, so add
struct fscrypt_str to the theader, and include linux/fscrypt.h in
btrfs_inode.h as it also needs the definition of struct fscrypt_name for
the new inode args.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:02 +02:00
Josef Bacik
3ecb43cb64 btrfs: include linux/iomap.h in file.c
We use the iomap code in file.c, include it so we have our dependencies.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:02 +02:00
Josef Bacik
f005d997c4 btrfs: include asm/unaligned.h in accessors.h
We use the unaligned helpers directly in accessors.h, add the include
here.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:02 +02:00
Josef Bacik
1b9e6a15bc btrfs: move btrfs_name_hash to dir-item.h
This is related to the name hashing for dir items, move it into
dir-item.h.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:02 +02:00
Josef Bacik
98e4f060c4 btrfs: move btrfs_extref_hash into inode-item.h
Ideally this would be un-inlined, but that is a cleanup for later.  For
now move this into inode-item.h, which is where the extref code lives.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:02 +02:00