A mirror of the official Linux kernel repository just in case
Go to file
Filipe Manana 28270e25c6 btrfs: always reserve space for delayed refs when starting transaction
When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers. Basically how it works:

1) When we start a transaction we reserve space for the number of items
   the caller wants to be inserted/modified/deleted in a btree. This space
   goes to the transaction block reserve;

2) If the delayed refs block reserve is not full, its size is greater
   than the amount of its reserved space, and the flush method is
   BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
   it corresponding to the number of items the caller wants to
   insert/modify/delete in a btree;

3) The size of the delayed refs block reserve is increased when a task
   creates delayed refs after COWing an extent buffer, allocating a new
   one or deleting (freeing) an extent buffer. This happens after the
   the task started or joined a transaction, whenever it calls
   btrfs_update_delayed_refs_rsv();

4) The delayed refs block reserve is then refilled by anyone calling
   btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
   operations or when someone else calls btrfs_start_transaction() with
   a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;

5) As a task COWs or allocates extent buffers, it consumes space from the
   transaction block reserve. When the task releases its transaction
   handle (btrfs_end_transaction()) or it attempts to commit the
   transaction, it releases any remaining space in the transaction block
   reserve that it did not use, as not all space may have been used (due
   to pessimistic space calculation) by calling btrfs_block_rsv_release()
   which will try to add that unused space to the delayed refs block
   reserve (if its current size is greater than its reserved space).
   That transferred space may not be enough to completely fulfill the
   delayed refs block reserve.

   Plus we have some tasks that will attempt do modify as many leaves
   as they can before getting -ENOSPC (and then reserving more space and
   retrying), such as hole punching and extent cloning which call
   btrfs_replace_file_extents(). Such tasks can generate therefore a
   high number of delayed refs, for both metadata and data (we can't
   know in advance how many file extent items we will find in a range
   and therefore how many delayed refs for dropping references on data
   extents we will generate);

6) If a transaction starts its commit before the delayed refs block
   reserve is refilled, for example by the transaction kthread or by
   someone who called btrfs_join_transaction() before starting the
   commit, then when running delayed references if we don't have enough
   reserved space in the delayed refs block reserve, we will consume
   space from the global block reserve.

Now this doesn't make a lot of sense because:

1) We should reserve space for delayed references when starting the
   transaction, since we have no guarantees the delayed refs block
   reserve will be refilled;

2) If no refill happens then we will consume from the global block reserve
   when running delayed refs during the transaction commit;

3) If we have a bunch of tasks calling btrfs_start_transaction() with a
   number of items greater than zero and at the time the delayed refs
   reserve is full, then we don't reserve any space at
   btrfs_start_transaction() for the delayed refs that will be generated
   by a task, and we can therefore end up using a lot of space from the
   global reserve when running the delayed refs during a transaction
   commit;

4) There are also other operations that result in bumping the size of the
   delayed refs reserve, such as creating and deleting block groups, as
   well as the need to update a block group item because we allocated or
   freed an extent from the respective block group;

5) If we have a significant gap between the delayed refs reserve's size
   and its reserved space, two very bad things may happen:

   1) The reserved space of the global reserve may not be enough and we
      fail the transaction commit with -ENOSPC when running delayed refs;

   2) If the available space in the global reserve is enough it may result
      in nearly exhausting it. If the fs has no more unallocated device
      space for allocating a new block group and all the available space
      in existing metadata block groups is not far from the global
      reserve's size before we started the transaction commit, we may end
      up in a situation where after the transaction commit we have too
      little available metadata space, and any future transaction commit
      will fail with -ENOSPC, because although we were able to reserve
      space to start the transaction, we were not able to commit it, as
      running delayed refs generates some more delayed refs (to update the
      extent tree for example) - this includes not even being able to
      commit a transaction that was started with the goal of unlinking a
      file, removing an empty data block group or doing reclaim/balance,
      so there's no way to release metadata space.

      In the worst case the next time we mount the filesystem we may
      also fail with -ENOSPC due to failure to commit a transaction to
      cleanup orphan inodes. This later case was reported and hit by
      someone running a SLE (SUSE Linux Enterprise) distribution for
      example - where the fs had no more unallocated space that could be
      used to allocate a new metadata block group, and the available
      metadata space was about 1.5M, not enough to commit a transaction
      to cleanup an orphan inode (or do relocation of data block groups
      that were far from being full).

So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full. The space reserved for the delayed refs is added
to a local block reserve that is part of the transaction handle, and when
a task updates the delayed refs block reserve size, after creating a
delayed ref, the space is transferred from that local reserve to the
global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
local reserve does not have enough space, which may happen for tasks
that generate a variable and potentially large number of delayed refs
(such as the hole punching and extent cloning cases mentioned before),
we transfer any available space and then rely on the current behaviour
of hoping some other task refills the delayed refs reserve or fallback
to the global block reserve.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-10-12 16:44:06 +02:00
arch A set of updates for interrupt chip drivers: 2023-10-10 11:14:07 -07:00
block block: fix kernel-doc for disk_force_media_change() 2023-09-26 00:43:34 -06:00
certs certs: Reference revocation list for all keyrings 2023-08-17 20:12:41 +00:00
crypto crypto: sm2 - Fix crash caused by uninitialized context 2023-09-20 13:10:10 +08:00
Documentation A set of updates for interrupt chip drivers: 2023-10-10 11:14:07 -07:00
drivers ata fixes for 6.6.0-rc6 2023-10-11 13:46:56 -07:00
fs btrfs: always reserve space for delayed refs when starting transaction 2023-10-12 16:44:06 +02:00
include \n 2023-10-11 14:21:15 -07:00
init workqueue: Changes for v6.6 2023-09-01 16:06:32 -07:00
io_uring io-wq: fully initialize wqe before calling cpuhp_state_add_instance_nocalls() 2023-10-05 14:11:18 -06:00
ipc Add x86 shadow stack support 2023-08-31 12:20:12 -07:00
kernel printk fixup for 6.6-rc6 2023-10-11 13:15:16 -07:00
lib maple_tree: add MAS_UNDERFLOW and MAS_OVERFLOW states 2023-09-29 17:20:46 -07:00
LICENSES LICENSES: Add the copyleft-next-0.3.1 license 2022-11-08 15:44:01 +01:00
mm Fourteen hotfixes, eleven of which are cc:stable. The remainder pertain 2023-10-01 13:33:25 -07:00
net Including fixes from Bluetooth, netfilter, BPF and WiFi. 2023-10-05 11:29:21 -07:00
rust Documentation work keeps chugging along; stuff for 6.6 includes: 2023-08-30 20:05:42 -07:00
samples VFIO updates for v6.6-rc1 2023-08-30 20:36:01 -07:00
scripts kbuild: remove stale code for 'source' symlink in packaging scripts 2023-10-01 23:06:06 +09:00
security KEYS: trusted: Remove redundant static calls usage 2023-10-10 11:19:43 -07:00
sound ALSA: usb-audio: Fix microphone sound on Nexigo webcam. 2023-10-09 08:20:00 +02:00
tools hyperv-fixes for v6.6-rc6 2023-10-10 11:01:21 -07:00
usr initramfs: Encode dependency on KBUILD_BUILD_TIMESTAMP 2023-06-06 17:54:49 +09:00
virt ARM: 2023-09-07 13:52:20 -07:00
.clang-format iommu: Add for_each_group_device() 2023-05-23 08:15:51 +02:00
.cocciconfig
.get_maintainer.ignore get_maintainer: add Alan to .get_maintainer.ignore 2022-08-20 15:17:44 -07:00
.gitattributes .gitattributes: set diff driver for Rust source code files 2023-05-31 17:48:25 +02:00
.gitignore kbuild: rpm-pkg: rename binkernel.spec to kernel.spec 2023-07-25 00:59:33 +09:00
.mailmap MAINTAINERS: update Matthieu's email address 2023-10-05 09:34:32 -07:00
.rustfmt.toml rust: add .rustfmt.toml 2022-09-28 09:02:20 +02:00
COPYING COPYING: state that all contributions really are covered by this file 2020-02-10 13:32:20 -08:00
CREDITS USB: Remove Wireless USB and UWB documentation 2023-08-09 14:17:32 +02:00
Kbuild Kbuild updates for v6.1 2022-10-10 12:00:45 -07:00
Kconfig kbuild: ensure full rebuild when the compiler is updated 2020-05-12 13:28:33 +09:00
MAINTAINERS A set of updates for interrupt chip drivers: 2023-10-10 11:14:07 -07:00
Makefile Linux 6.6-rc5 2023-10-08 13:49:43 -07:00
README

Linux kernel
============

There are several guides for kernel developers and users. These guides can
be rendered in a number of formats, like HTML and PDF. Please read
Documentation/admin-guide/README.rst first.

In order to build the documentation, use ``make htmldocs`` or
``make pdfdocs``.  The formatted documentation can also be read online at:

    https://www.kernel.org/doc/html/latest/

There are various text files in the Documentation/ subdirectory,
several of them using the Restructured Text markup notation.

Please read the Documentation/process/changes.rst file, as it contains the
requirements for building and running the kernel, and information about
the problems which may result by upgrading your kernel.