linux/fs/btrfs
Filipe Manana 79bd37120b btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.

However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.

This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.

The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.

For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.

The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.

CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:41 +02:00
..
tests btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
acl.c fs: make helpers idmap mount aware 2021-01-24 14:27:20 +01:00
async-thread.c Btrfs: fix crash during unmount due to race with delayed inode workers 2020-03-23 17:01:51 +01:00
async-thread.h Btrfs: fix crash during unmount due to race with delayed inode workers 2020-03-23 17:01:51 +01:00
backref.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
backref.h btrfs: add asserts for deleting backref cache nodes 2021-02-08 22:58:56 +01:00
block-group.c btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
block-group.h btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
block-rsv.c btrfs: introduce mount option rescue=ignorebadroots 2020-12-08 15:53:41 +01:00
block-rsv.h btrfs: Remove __ prefix from btrfs_block_rsv_release 2020-03-23 17:01:55 +01:00
btrfs_inode.h btrfs: remove stale comment and logic from btrfs_inode_in_log() 2021-04-19 17:25:16 +02:00
check-integrity.c btrfs: integrity-checker: convert block context kmap's to kmap_local_page 2021-04-19 17:25:16 +02:00
check-integrity.h btrfs: remove btrfsic_submit_bh() 2020-03-23 17:01:39 +01:00
compression.c btrfs: remove a stale comment for btrfs_decompress_bio() 2021-06-22 14:11:57 +02:00
compression.h btrfs: optimize variables size in btrfs_submit_compressed_write 2021-06-21 15:19:07 +02:00
ctree.c btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
ctree.h btrfs: remove unused btrfs_fs_info::total_pinned 2021-06-22 19:58:26 +02:00
delalloc-space.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
delalloc-space.h btrfs: make btrfs_delalloc_reserve_space take btrfs_inode 2020-07-27 12:55:36 +02:00
delayed-inode.c btrfs: remove total_data_size variable in btrfs_batch_insert_items() 2021-06-21 15:19:11 +02:00
delayed-inode.h btrfs: make btrfs_delayed_update_inode take btrfs_inode 2020-12-08 15:54:10 +01:00
delayed-ref.c btrfs: rip out btrfs_space_info::total_bytes_pinned 2021-06-22 14:55:25 +02:00
delayed-ref.h btrfs: only let one thread pre-flush delayed refs in commit 2021-02-08 22:58:56 +01:00
dev-replace.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
dev-replace.h btrfs: zoned: mark block groups to copy for device-replace 2021-02-09 02:46:07 +01:00
dir-item.c btrfs: locking: rip out path->leave_spinning 2020-12-08 15:54:02 +01:00
discard.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
discard.h btrfs: cleanup btrfs_discard_update_discardable usage 2020-12-08 15:54:02 +01:00
disk-io.c btrfs: rip out btrfs_space_info::total_bytes_pinned 2021-06-22 14:55:25 +02:00
disk-io.h btrfs: split alloc_log_tree() 2021-02-09 02:46:07 +01:00
export.c btrfs: locking: rip out path->leave_spinning 2020-12-08 15:54:02 +01:00
export.h btrfs: export helpers for subvolume name/id resolution 2020-03-23 17:01:42 +01:00
extent_io.c btrfs: subpage: fix a rare race between metadata endio and eb freeing 2021-06-21 15:19:10 +02:00
extent_io.h btrfs: rename PagePrivate2 to PageOrdered inside btrfs 2021-06-21 15:19:09 +02:00
extent_map.c btrfs: fix parameter description of btrfs_add_extent_mapping 2021-02-08 22:58:53 +01:00
extent_map.h
extent-io-tree.h btrfs: use fixed width int type for extent_state::state 2020-12-08 15:54:13 +01:00
extent-tree.c btrfs: rip out btrfs_space_info::total_bytes_pinned 2021-06-22 14:55:25 +02:00
file-item.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
file.c btrfs: eliminate insert label in add_falloc_range 2021-06-21 15:19:10 +02:00
free-space-cache.c btrfs: don't set the full sync flag when truncation does not touch extents 2021-06-21 15:19:05 +02:00
free-space-cache.h btrfs: zoned: track unusable bytes for zones 2021-02-09 02:46:03 +01:00
free-space-tree.c btrfs: fix possible free space tree corruption with online conversion 2021-01-25 18:44:37 +01:00
free-space-tree.h
inode-item.c btrfs: locking: rip out path->leave_spinning 2020-12-08 15:54:02 +01:00
inode.c btrfs: compression: don't try to compress if we don't have enough pages 2021-06-22 14:11:57 +02:00
ioctl.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
Kconfig btrfs: disable build on platforms having page size 256K 2021-06-22 14:11:57 +02:00
locking.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
locking.h btrfs: remove the recurse parameter from __btrfs_tree_read_lock 2020-12-08 15:54:09 +01:00
lzo.c btrfs: convert kmap to kmap_local_page, simple cases 2021-04-19 17:25:16 +02:00
Makefile btrfs: move the tree mod log code into its own file 2021-04-19 17:25:16 +02:00
misc.h btrfs: rename tree_entry to rb_simple_node and export it 2020-05-25 11:25:19 +02:00
ordered-data.c btrfs: make page Ordered bit to be subpage compatible 2021-06-21 15:19:10 +02:00
ordered-data.h btrfs: introduce btrfs_lookup_first_ordered_range() 2021-06-21 15:19:08 +02:00
orphan.c
print-tree.c btrfs: print the actual offset in btrfs_root_name 2021-01-07 17:25:05 +01:00
print-tree.h btrfs: print the actual offset in btrfs_root_name 2021-01-07 17:25:05 +01:00
props.c btrfs: props: change how empty value is interpreted 2021-06-22 14:11:58 +02:00
props.h
qgroup.c btrfs: send: fix crash when memory allocations trigger reclaim 2021-06-22 14:11:58 +02:00
qgroup.h btrfs: export and rename qgroup_reserve_meta 2021-03-02 16:58:30 +01:00
raid56.c CFI on arm64 series for v5.13-rc1 2021-04-27 10:16:46 -07:00
raid56.h
rcu-string.h btrfs: rcu-string: Replace zero-length array with flexible-array member 2020-03-23 17:01:53 +01:00
reada.c btrfs: subpage: make readahead work properly 2021-03-16 11:06:21 +01:00
ref-verify.c btrfs: ref-verify: use 'inline void' keyword ordering 2021-03-02 16:55:40 +01:00
ref-verify.h
reflink.c btrfs: reflink: make copy_inline_to_page() to be subpage compatible 2021-06-21 15:19:10 +02:00
reflink.h Btrfs: move all reflink implementation code into its own file 2020-03-23 17:01:54 +01:00
relocation.c btrfs: ensure relocation never runs while we have send operations running 2021-06-22 14:11:58 +02:00
root-tree.c btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations 2020-10-07 12:12:13 +02:00
scrub.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
send.c btrfs: send: fix crash when memory allocations trigger reclaim 2021-06-22 14:11:58 +02:00
send.h btrfs: send: avoid copying file data 2020-10-07 12:13:17 +02:00
space-info.c btrfs: rip out btrfs_space_info::total_bytes_pinned 2021-06-22 14:55:25 +02:00
space-info.h btrfs: rip out btrfs_space_info::total_bytes_pinned 2021-06-22 14:55:25 +02:00
struct-funcs.c btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors 2020-12-09 19:16:10 +01:00
subpage.c btrfs: subpage: fix a rare race between metadata endio and eb freeing 2021-06-21 15:19:10 +02:00
subpage.h btrfs: subpage: fix a rare race between metadata endio and eb freeing 2021-06-21 15:19:10 +02:00
super.c btrfs: shorten integrity checker extent data mount option 2021-06-22 14:11:58 +02:00
sysfs.c btrfs: rip out btrfs_space_info::total_bytes_pinned 2021-06-22 14:55:25 +02:00
sysfs.h btrfs: split and refactor btrfs_sysfs_remove_devices_dir 2020-10-07 12:12:21 +02:00
transaction.c btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
transaction.h btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
tree-checker.c btrfs: tree-checker: check for BTRFS_BLOCK_FLAG_FULL_BACKREF being set improperly 2021-04-19 17:25:21 +02:00
tree-checker.h
tree-defrag.c btrfs: locking: remove all the blocking helpers 2020-12-08 15:54:01 +01:00
tree-log.c btrfs: avoid unnecessary logging of xattrs during fast fsyncs 2021-06-21 15:19:07 +02:00
tree-log.h btrfs: make fast fsyncs wait only for writeback 2020-10-07 12:06:56 +02:00
tree-mod-log.c btrfs: fix race when picking most recent mod log operation for an old root 2021-04-20 19:27:17 +02:00
tree-mod-log.h btrfs: add and use helper to get lowest sequence number for the tree mod log 2021-04-19 17:25:17 +02:00
ulist.c
ulist.h
uuid-tree.c btrfs: remove unnecessary casts in printk 2020-12-08 15:53:52 +01:00
volumes.c btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
volumes.h btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
xattr.c for-5.12-rc1-tag 2021-03-05 12:21:14 -08:00
xattr.h
zlib.c btrfs: use memzero_page() instead of open coded kmap pattern 2021-05-05 11:27:27 -07:00
zoned.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
zoned.h btrfs: zoned: factor out zoned device lookup 2021-06-21 15:19:05 +02:00
zstd.c btrfs: use memzero_page() instead of open coded kmap pattern 2021-05-05 11:27:27 -07:00