btrfs: always reserve space for delayed refs when starting transaction

When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers. Basically how it works:

1) When we start a transaction we reserve space for the number of items
   the caller wants to be inserted/modified/deleted in a btree. This space
   goes to the transaction block reserve;

2) If the delayed refs block reserve is not full, its size is greater
   than the amount of its reserved space, and the flush method is
   BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
   it corresponding to the number of items the caller wants to
   insert/modify/delete in a btree;

3) The size of the delayed refs block reserve is increased when a task
   creates delayed refs after COWing an extent buffer, allocating a new
   one or deleting (freeing) an extent buffer. This happens after the
   task has started or joined a transaction, whenever it calls
   btrfs_update_delayed_refs_rsv();

4) The delayed refs block reserve is then refilled by anyone calling
   btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
   operations or when someone else calls btrfs_start_transaction() with
   a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;

5) As a task COWs or allocates extent buffers, it consumes space from the
   transaction block reserve. When the task releases its transaction
   handle (btrfs_end_transaction()) or it attempts to commit the
   transaction, it releases any remaining space in the transaction block
   reserve that it did not use, as not all space may have been used (due
   to pessimistic space calculation) by calling btrfs_block_rsv_release()
   which will try to add that unused space to the delayed refs block
   reserve (if its current size is greater than its reserved space).
   That transferred space may not be enough to completely fulfill the
   delayed refs block reserve.

   Plus we have some tasks that will attempt to modify as many leaves
   as they can before getting -ENOSPC (and then reserving more space and
   retrying), such as hole punching and extent cloning which call
   btrfs_replace_file_extents(). Such tasks can therefore generate a
   high number of delayed refs, for both metadata and data (we can't
   know in advance how many file extent items we will find in a range
   and therefore how many delayed refs for dropping references on data
   extents we will generate);

6) If a transaction starts its commit before the delayed refs block
   reserve is refilled, for example by the transaction kthread or by
   someone who called btrfs_join_transaction() before starting the
   commit, then when running delayed references if we don't have enough
   reserved space in the delayed refs block reserve, we will consume
   space from the global block reserve.
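The accounting in the steps above hinges on the distinction between a reserve's size (how much space it wants) and its reserved space (how much it actually holds). The following is a minimal user-space sketch of steps 3 and 5; the struct and helper names are simplified stand-ins, not the kernel's struct btrfs_block_rsv:

```c
#include <stdbool.h>

/* Minimal user-space model of a block reserve (no locking, no space_info). */
struct block_rsv {
	unsigned long long size;     /* how much space the reserve wants */
	unsigned long long reserved; /* how much space it actually holds */
};

static bool rsv_full(const struct block_rsv *rsv)
{
	return rsv->reserved >= rsv->size;
}

/* Step 3: creating delayed refs only grows the reserve's size; it does
 * not add any reserved space to back it. */
static void delayed_refs_grow(struct block_rsv *delayed, unsigned long long bytes)
{
	delayed->size += bytes;
}

/* Step 5: unused transaction space is transferred to the delayed refs
 * reserve, but only up to the gap between its size and its reserved
 * space; anything beyond the gap is returned to the space_info
 * (modeled here as the return value). */
static unsigned long long release_to_delayed(struct block_rsv *delayed,
					     unsigned long long unused)
{
	unsigned long long gap = delayed->size > delayed->reserved ?
				 delayed->size - delayed->reserved : 0;
	unsigned long long moved = unused < gap ? unused : gap;

	delayed->reserved += moved;
	return unused - moved;
}
```

For example, growing the size by 100 bytes (step 3) and then transferring only 60 unused bytes (step 5) leaves the reserve unbacked by 40 bytes, which is exactly the gap that otherwise has to come out of the global reserve at commit time.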

Now this doesn't make a lot of sense because:

1) We should reserve space for delayed references when starting the
   transaction, since we have no guarantees the delayed refs block
   reserve will be refilled;

2) If no refill happens then we will consume from the global block reserve
   when running delayed refs during the transaction commit;

3) If we have a bunch of tasks calling btrfs_start_transaction() with a
   number of items greater than zero and at the time the delayed refs
   reserve is full, then we don't reserve any space at
   btrfs_start_transaction() for the delayed refs that will be generated
   by a task, and we can therefore end up using a lot of space from the
   global reserve when running the delayed refs during a transaction
   commit;

4) There are also other operations that result in bumping the size of the
   delayed refs reserve, such as creating and deleting block groups, as
   well as the need to update a block group item because we allocated or
   freed an extent from the respective block group;

5) If we have a significant gap between the delayed refs reserve's size
   and its reserved space, two very bad things may happen:

   1) The reserved space of the global reserve may not be enough and we
      fail the transaction commit with -ENOSPC when running delayed refs;

   2) If the available space in the global reserve is enough it may result
      in nearly exhausting it. If the fs has no more unallocated device
      space for allocating a new block group and all the available space
      in existing metadata block groups is not far from the global
      reserve's size before we started the transaction commit, we may end
      up in a situation where after the transaction commit we have too
      little available metadata space, and any future transaction commit
      will fail with -ENOSPC, because although we were able to reserve
      space to start the transaction, we were not able to commit it, as
      running delayed refs generates some more delayed refs (to update the
      extent tree for example) - this includes not even being able to
      commit a transaction that was started with the goal of unlinking a
      file, removing an empty data block group or doing reclaim/balance,
      so there's no way to release metadata space.

      In the worst case the next time we mount the filesystem we may
      also fail with -ENOSPC due to failure to commit a transaction to
      cleanup orphan inodes. This latter case was reported and hit by
      someone running a SLE (SUSE Linux Enterprise) distribution for
      example - where the fs had no more unallocated space that could be
      used to allocate a new metadata block group, and the available
      metadata space was about 1.5M, not enough to commit a transaction
      to cleanup an orphan inode (or do relocation of data block groups
      that were far from being full).

So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full. The space reserved for the delayed refs is added
to a local block reserve that is part of the transaction handle, and when
a task updates the delayed refs block reserve size, after creating a
delayed ref, the space is transferred from that local reserve to the
global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
local reserve does not have enough space, which may happen for tasks
that generate a variable and potentially large number of delayed refs
(such as the hole punching and extent cloning cases mentioned before),
we transfer any available space and then rely on the current behaviour
of hoping some other task refills the delayed refs reserve, or fall back
to the global block reserve.
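In user-space pseudocode, the transfer performed when a task creates delayed refs looks roughly like this (simplified stand-ins for struct btrfs_block_rsv and btrfs_update_delayed_refs_rsv(), without the locking):

```c
#include <stdbool.h>

/* Simplified stand-in for struct btrfs_block_rsv. */
struct rsv {
	unsigned long long size;
	unsigned long long reserved;
	bool full;
};

/* Account for "num_bytes" of newly created delayed refs: the global
 * delayed refs reserve's size grows by the full amount, but it is
 * backed by reserved space only as far as the transaction's local
 * reserve can cover. */
static void update_delayed_refs_rsv(struct rsv *local, struct rsv *global,
				    unsigned long long num_bytes)
{
	unsigned long long reserved_bytes =
		num_bytes < local->reserved ? num_bytes : local->reserved;

	local->reserved -= reserved_bytes;
	local->full = local->reserved >= local->size;

	global->size += num_bytes;
	global->reserved += reserved_bytes;
	global->full = global->reserved >= global->size;
}
```

When the local reserve runs dry (as with the variable workloads mentioned above, like hole punching), the global reserve's size keeps growing without backing, and we are back to the old behaviour of relying on a refill or on the global block reserve.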

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Author: Filipe Manana <fdmanana@suse.com>
Date: 2023-09-08 18:20:38 +01:00
Committed by: David Sterba <dsterba@suse.com>
commit 28270e25c6 (parent adb86dbe42)
4 changed files with 133 additions and 32 deletions


@@ -281,10 +281,10 @@ u64 btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_rsv *target = NULL;
 
 	/*
-	 * If we are the delayed_rsv then push to the global rsv, otherwise dump
-	 * into the delayed rsv if it is not full.
+	 * If we are a delayed block reserve then push to the global rsv,
+	 * otherwise dump into the global delayed reserve if it is not full.
 	 */
-	if (block_rsv == delayed_rsv)
+	if (block_rsv->type == BTRFS_BLOCK_RSV_DELOPS)
 		target = global_rsv;
 	else if (block_rsv != global_rsv && !btrfs_block_rsv_full(delayed_rsv))
 		target = delayed_rsv;


@@ -89,7 +89,9 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
 {
 	struct btrfs_fs_info *fs_info = trans->fs_info;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *local_rsv = &trans->delayed_rsv;
 	u64 num_bytes;
+	u64 reserved_bytes;
 
 	num_bytes = btrfs_calc_delayed_ref_bytes(fs_info, trans->delayed_ref_updates);
 	num_bytes += btrfs_calc_delayed_ref_csum_bytes(fs_info,
@@ -98,9 +100,26 @@ void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans)
 	if (num_bytes == 0)
 		return;
 
+	/*
+	 * Try to take num_bytes from the transaction's local delayed reserve.
+	 * If not possible, try to take as much as it's available. If the local
+	 * reserve doesn't have enough reserved space, the delayed refs reserve
+	 * will be refilled next time btrfs_delayed_refs_rsv_refill() is called
+	 * by someone or if a transaction commit is triggered before that, the
+	 * global block reserve will be used. We want to minimize using the
+	 * global block reserve for cases we can account for in advance, to
+	 * avoid exhausting it and reach -ENOSPC during a transaction commit.
+	 */
+	spin_lock(&local_rsv->lock);
+	reserved_bytes = min(num_bytes, local_rsv->reserved);
+	local_rsv->reserved -= reserved_bytes;
+	local_rsv->full = (local_rsv->reserved >= local_rsv->size);
+	spin_unlock(&local_rsv->lock);
+
 	spin_lock(&delayed_rsv->lock);
 	delayed_rsv->size += num_bytes;
-	delayed_rsv->full = false;
+	delayed_rsv->reserved += reserved_bytes;
+	delayed_rsv->full = (delayed_rsv->reserved >= delayed_rsv->size);
 	spin_unlock(&delayed_rsv->lock);
 	trans->delayed_ref_updates = 0;
 	trans->delayed_ref_csum_deletions = 0;


@@ -561,6 +561,69 @@ static inline bool need_reserve_reloc_root(struct btrfs_root *root)
 	return true;
 }
 
+static int btrfs_reserve_trans_metadata(struct btrfs_fs_info *fs_info,
+					enum btrfs_reserve_flush_enum flush,
+					u64 num_bytes,
+					u64 *delayed_refs_bytes)
+{
+	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_space_info *si = fs_info->trans_block_rsv.space_info;
+	u64 extra_delayed_refs_bytes = 0;
+	u64 bytes;
+	int ret;
+
+	/*
+	 * If there's a gap between the size of the delayed refs reserve and
+	 * its reserved space, than some tasks have added delayed refs or bumped
+	 * its size otherwise (due to block group creation or removal, or block
+	 * group item update). Also try to allocate that gap in order to prevent
+	 * using (and possibly abusing) the global reserve when committing the
+	 * transaction.
+	 */
+	if (flush == BTRFS_RESERVE_FLUSH_ALL &&
+	    !btrfs_block_rsv_full(delayed_refs_rsv)) {
+		spin_lock(&delayed_refs_rsv->lock);
+		if (delayed_refs_rsv->size > delayed_refs_rsv->reserved)
+			extra_delayed_refs_bytes = delayed_refs_rsv->size -
+						   delayed_refs_rsv->reserved;
+		spin_unlock(&delayed_refs_rsv->lock);
+	}
+
+	bytes = num_bytes + *delayed_refs_bytes + extra_delayed_refs_bytes;
+
+	/*
+	 * We want to reserve all the bytes we may need all at once, so we only
+	 * do 1 enospc flushing cycle per transaction start.
+	 */
+	ret = btrfs_reserve_metadata_bytes(fs_info, si, bytes, flush);
+	if (ret == 0) {
+		if (extra_delayed_refs_bytes > 0)
+			btrfs_migrate_to_delayed_refs_rsv(fs_info,
+							  extra_delayed_refs_bytes);
+		return 0;
+	}
+
+	if (extra_delayed_refs_bytes > 0) {
+		bytes -= extra_delayed_refs_bytes;
+		ret = btrfs_reserve_metadata_bytes(fs_info, si, bytes, flush);
+		if (ret == 0)
+			return 0;
+	}
+
+	/*
+	 * If we are an emergency flush, which can steal from the global block
+	 * reserve, then attempt to not reserve space for the delayed refs, as
+	 * we will consume space for them from the global block reserve.
+	 */
+	if (flush == BTRFS_RESERVE_FLUSH_ALL_STEAL) {
+		bytes -= *delayed_refs_bytes;
+		*delayed_refs_bytes = 0;
+		ret = btrfs_reserve_metadata_bytes(fs_info, si, bytes, flush);
+	}
+
+	return ret;
+}
+
 static struct btrfs_trans_handle *
 start_transaction(struct btrfs_root *root, unsigned int num_items,
 		  unsigned int type, enum btrfs_reserve_flush_enum flush,
@@ -568,10 +631,12 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	struct btrfs_block_rsv *delayed_refs_rsv = &fs_info->delayed_refs_rsv;
+	struct btrfs_block_rsv *trans_rsv = &fs_info->trans_block_rsv;
 	struct btrfs_trans_handle *h;
 	struct btrfs_transaction *cur_trans;
 	u64 num_bytes = 0;
 	u64 qgroup_reserved = 0;
+	u64 delayed_refs_bytes = 0;
 	bool reloc_reserved = false;
 	bool do_chunk_alloc = false;
 	int ret;
@@ -594,9 +659,6 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	 * the appropriate flushing if need be.
 	 */
 	if (num_items && root != fs_info->chunk_root) {
-		struct btrfs_block_rsv *rsv = &fs_info->trans_block_rsv;
-		u64 delayed_refs_bytes = 0;
-
 		qgroup_reserved = num_items * fs_info->nodesize;
 		/*
 		 * Use prealloc for now, as there might be a currently running
@@ -608,20 +670,16 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 		if (ret)
 			return ERR_PTR(ret);
 
-		/*
-		 * We want to reserve all the bytes we may need all at once, so
-		 * we only do 1 enospc flushing cycle per transaction start. We
-		 * accomplish this by simply assuming we'll do num_items worth
-		 * of delayed refs updates in this trans handle, and refill that
-		 * amount for whatever is missing in the reserve.
-		 */
 		num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_items);
-		if (flush == BTRFS_RESERVE_FLUSH_ALL &&
-		    !btrfs_block_rsv_full(delayed_refs_rsv)) {
-			delayed_refs_bytes = btrfs_calc_delayed_ref_bytes(fs_info,
-									  num_items);
-			num_bytes += delayed_refs_bytes;
-		}
+		/*
+		 * If we plan to insert/update/delete "num_items" from a btree,
+		 * we will also generate delayed refs for extent buffers in the
+		 * respective btree paths, so reserve space for the delayed refs
+		 * that will be generated by the caller as it modifies btrees.
+		 * Try to reserve them to avoid excessive use of the global
+		 * block reserve.
+		 */
+		delayed_refs_bytes = btrfs_calc_delayed_ref_bytes(fs_info, num_items);
 
 		/*
 		 * Do the reservation for the relocation root creation
@@ -631,17 +689,14 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 			reloc_reserved = true;
 		}
 
-		ret = btrfs_reserve_metadata_bytes(fs_info, rsv->space_info,
-						   num_bytes, flush);
+		ret = btrfs_reserve_trans_metadata(fs_info, flush, num_bytes,
+						   &delayed_refs_bytes);
 		if (ret)
 			goto reserve_fail;
-		if (delayed_refs_bytes) {
-			btrfs_migrate_to_delayed_refs_rsv(fs_info, delayed_refs_bytes);
-			num_bytes -= delayed_refs_bytes;
-		}
-		btrfs_block_rsv_add_bytes(rsv, num_bytes, true);
 
-		if (rsv->space_info->force_alloc)
+		btrfs_block_rsv_add_bytes(trans_rsv, num_bytes, true);
+
+		if (trans_rsv->space_info->force_alloc)
 			do_chunk_alloc = true;
 	} else if (num_items == 0 && flush == BTRFS_RESERVE_FLUSH_ALL &&
 		   !btrfs_block_rsv_full(delayed_refs_rsv)) {
@@ -701,6 +756,7 @@ again:
 	h->type = type;
 	INIT_LIST_HEAD(&h->new_bgs);
+	btrfs_init_metadata_block_rsv(fs_info, &h->delayed_rsv, BTRFS_BLOCK_RSV_DELOPS);
 
 	smp_mb();
 	if (cur_trans->state >= TRANS_STATE_COMMIT_START &&
@@ -713,8 +769,17 @@ again:
 	if (num_bytes) {
 		trace_btrfs_space_reservation(fs_info, "transaction",
 					      h->transid, num_bytes, 1);
-		h->block_rsv = &fs_info->trans_block_rsv;
+		h->block_rsv = trans_rsv;
 		h->bytes_reserved = num_bytes;
+		if (delayed_refs_bytes > 0) {
+			trace_btrfs_space_reservation(fs_info,
+						      "local_delayed_refs_rsv",
+						      h->transid,
+						      delayed_refs_bytes, 1);
+			h->delayed_refs_bytes_reserved = delayed_refs_bytes;
+			btrfs_block_rsv_add_bytes(&h->delayed_rsv, delayed_refs_bytes, true);
+			delayed_refs_bytes = 0;
+		}
 		h->reloc_reserved = reloc_reserved;
 	}
@@ -770,8 +835,10 @@ join_fail:
 	kmem_cache_free(btrfs_trans_handle_cachep, h);
 alloc_fail:
 	if (num_bytes)
-		btrfs_block_rsv_release(fs_info, &fs_info->trans_block_rsv,
-					num_bytes, NULL);
+		btrfs_block_rsv_release(fs_info, trans_rsv, num_bytes, NULL);
+	if (delayed_refs_bytes)
+		btrfs_space_info_free_bytes_may_use(fs_info, trans_rsv->space_info,
+						    delayed_refs_bytes);
 reserve_fail:
 	btrfs_qgroup_free_meta_prealloc(root, qgroup_reserved);
 	return ERR_PTR(ret);
@@ -992,11 +1059,14 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
 
 	if (!trans->block_rsv) {
 		ASSERT(!trans->bytes_reserved);
+		ASSERT(!trans->delayed_refs_bytes_reserved);
 		return;
 	}
 
-	if (!trans->bytes_reserved)
+	if (!trans->bytes_reserved) {
+		ASSERT(!trans->delayed_refs_bytes_reserved);
 		return;
+	}
 
 	ASSERT(trans->block_rsv == &fs_info->trans_block_rsv);
 	trace_btrfs_space_reservation(fs_info, "transaction",
@@ -1004,6 +1074,16 @@ static void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans)
 	btrfs_block_rsv_release(fs_info, trans->block_rsv,
 				trans->bytes_reserved, NULL);
 	trans->bytes_reserved = 0;
+
+	if (!trans->delayed_refs_bytes_reserved)
+		return;
+
+	trace_btrfs_space_reservation(fs_info, "local_delayed_refs_rsv",
+				      trans->transid,
+				      trans->delayed_refs_bytes_reserved, 0);
+	btrfs_block_rsv_release(fs_info, &trans->delayed_rsv,
+				trans->delayed_refs_bytes_reserved, NULL);
+	trans->delayed_refs_bytes_reserved = 0;
 }
 
 static int __btrfs_end_transaction(struct btrfs_trans_handle *trans,


@@ -118,6 +118,7 @@ enum {
 struct btrfs_trans_handle {
 	u64 transid;
 	u64 bytes_reserved;
+	u64 delayed_refs_bytes_reserved;
 	u64 chunk_bytes_reserved;
 	unsigned long delayed_ref_updates;
 	unsigned long delayed_ref_csum_deletions;
@@ -140,6 +141,7 @@ struct btrfs_trans_handle {
 	bool in_fsync;
 	struct btrfs_fs_info *fs_info;
 	struct list_head new_bgs;
+	struct btrfs_block_rsv delayed_rsv;
 };
 
 /*