linux/fs
Filipe Manana 79bd37120b btrfs: rework chunk allocation to avoid exhaustion of the system chunk array
Commit eafa4fd0ad ("btrfs: fix exhaustion of the system chunk array
due to concurrent allocations") fixed a problem that resulted in
exhausting the system chunk array in the superblock when there are many
tasks allocating chunks in parallel. Basically too many tasks enter the
first phase of chunk allocation without previous tasks having finished
their second phase of allocation, resulting in too many system chunks
being allocated. That was originally observed when running the fallocate
tests of stress-ng on a PowerPC machine, using a node size of 64K.

However that commit also introduced a deadlock where a task in phase 1 of
the chunk allocation waited for another task that had allocated a system
chunk to finish its phase 2, but that other task was waiting on an extent
buffer lock held by the first task, therefore resulting in both tasks not
making any progress. That change was later reverted by a patch with the
subject "btrfs: fix deadlock with concurrent chunk allocations involving
system chunks", since there is no simple and short solution to address it
and the deadlock is relatively easy to trigger on zoned filesystems, while
the system chunk array exhaustion is not so common.

This change reworks the chunk allocation to avoid the system chunk array
exhaustion. It accomplishes that by making the first phase of chunk
allocation do the updates of the device items in the chunk btree and the
insertion of the new chunk item in the chunk btree. This is done while
under the protection of the chunk mutex (fs_info->chunk_mutex), in the
same critical section that checks for available system space, allocates
a new system chunk if needed and reserves system chunk space. This way
we do not have chunk space reserved until the second phase completes.

The same logic is applied to chunk removal as well, since it keeps
reserved system space long after it is done updating the chunk btree.

For direct allocation of system chunks, the previous behaviour remains,
because otherwise we would deadlock on extent buffers of the chunk btree.
Changes to the chunk btree are by large done by chunk allocation and chunk
removal, which first reserve chunk system space and then later do changes
to the chunk btree. The other remaining cases are uncommon and correspond
to adding a device, removing a device and resizing a device. All these
other cases do not pre-reserve system space, they modify the chunk btree
right away, so they don't hold reserved space for a long period like chunk
allocation and chunk removal do.

The diff of this change is huge, but more than half of it is just addition
of comments describing both how things work regarding chunk allocation and
removal, including both the new behavior and the parts of the old behavior
that did not change.

CC: stable@vger.kernel.org # 5.12+
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Tested-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Tested-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2021-07-07 17:42:41 +02:00
..
9p 9p for 5.13-rc1 2021-05-07 11:18:52 -07:00
adfs
affs
afs afs: Re-enable freezing once a page fault is interrupted 2021-06-18 13:49:07 -07:00
autofs
befs
bfs
btrfs btrfs: rework chunk allocation to avoid exhaustion of the system chunk array 2021-07-07 17:42:41 +02:00
cachefiles
ceph Notable items here are a series to take advantage of David Howells' 2021-05-06 10:27:02 -07:00
cifs cifs: change format of CIFS_FULL_KEY_DUMP ioctl 2021-05-27 15:26:32 -05:00
coda
configfs treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
cramfs
crypto
debugfs debugfs: Fix debugfs_read_file_str() 2021-06-04 15:01:08 +02:00
devpts
dlm
ecryptfs fs: ecryptfs: remove BUG_ON from crypt_scatterlist 2021-05-13 18:32:26 +02:00
efivarfs
efs
erofs erofs: fix 1 lcluster-sized pcluster for big pcluster 2021-05-13 15:58:46 +08:00
exfat
exportfs
ext2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
ext4 Miscellaneous ext4 bug fixes for v5.13 2021-06-06 14:24:13 -07:00
f2fs f2fs: return EINVAL for hole cases in swap file 2021-05-12 07:38:00 -07:00
fat fs: fat: fix spelling typo of values 2021-05-07 00:26:34 -07:00
freevxfs
fscache
fuse Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
gfs2 Revert "gfs2: Fix mmap locking for write faults" 2021-06-01 23:16:42 +02:00
hfs
hfsplus hfsplus: prevent corruption in shrinking truncate 2021-05-14 19:41:32 -07:00
hostfs Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-02 09:14:01 -07:00
hpfs hpfs: replace one-element array with flexible-array member 2021-05-06 19:24:13 -07:00
hugetlbfs mm/hugetlb: expand restore_reserve_on_error functionality 2021-06-16 09:24:42 -07:00
iomap mm/filemap: fix readahead return types 2021-05-14 19:41:32 -07:00
isofs isofs: fix fall-through warnings for Clang 2021-05-06 19:24:13 -07:00
jbd2
jffs2 This pull request contains changes for JFFS2, UBI and UBIFS 2021-05-04 18:08:40 -07:00
jfs
kernfs
lockd
minix
netfs netfs: Make CONFIG_NETFS_SUPPORT auto-selected rather than manual 2021-05-25 13:48:04 +01:00
nfs NFSv4: Fix second deadlock in nfs4_evict_inode() 2021-06-03 10:14:42 -04:00
nfs_common
nfsd NFS client updates for Linux 5.13 2021-05-07 11:23:41 -07:00
nilfs2 Merge branch 'akpm' (patches from Andrew) 2021-05-07 00:34:51 -07:00
nls
notify fanotify: fix copy_event_to_user() fid error clean up 2021-06-14 12:16:37 +02:00
ntfs
ocfs2 ocfs2: fix data corruption by fallocate 2021-06-05 08:58:12 -07:00
omfs
openpromfs
orangefs orangefs: leave files in the page cache for a few micro seconds at least 2021-04-29 08:06:05 -04:00
overlayfs overlayfs update for 5.13 2021-04-30 15:17:08 -07:00
proc proc: only require mm_struct for writing 2021-06-15 10:47:51 -07:00
pstore printk changes for 5.13 2021-04-27 18:09:44 -07:00
qnx4
qnx6
quota quota: Use 'hlist_for_each_entry' to simplify code 2021-05-10 16:27:49 +02:00
ramfs
reiserfs treewide: remove editor modelines and cruft 2021-05-07 00:26:34 -07:00
romfs
squashfs squashfs: fix divide error in calculate_skip() 2021-05-14 19:41:32 -07:00
sysfs
sysv
tracefs
ubifs This pull request contains changes for JFFS2, UBI and UBIFS 2021-05-04 18:08:40 -07:00
udf
ufs
unicode .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
vboxsf
verity
xfs xfs: bunmapi has unnecessary AG lock ordering issues 2021-05-27 08:11:24 -07:00
zonefs \n 2021-04-29 11:06:13 -07:00
aio.c Revert "mremap: don't allow MREMAP_DONTUNMAP on special_mappings and aio" 2021-04-30 11:20:39 -07:00
anon_inodes.c
attr.c
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c
binfmt_elf.c
binfmt_em86.c
binfmt_flat.c
binfmt_misc.c
binfmt_script.c
block_dev.c block-5.13-2021-05-22 2021-05-22 07:40:34 -10:00
buffer.c Merge branch 'akpm' (patches from Andrew) 2021-05-05 13:50:15 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c coredump: Limit what can interrupt coredumps 2021-06-10 14:02:29 -07:00
d_path.c
dax.c dax fixes for 5.13-rc2 2021-05-15 08:28:08 -07:00
dcache.c
direct-io.c
drop_caches.c
eventfd.c
eventpoll.c fs/epoll: restore waking from ep_done_scan() 2021-05-06 19:24:13 -07:00
exec.c
fcntl.c
fhandle.c
file_table.c
file.c Merge branch 'work.file' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-05-03 11:05:28 -07:00
filesystems.c
fs_context.c
fs_parser.c vfs: fs_parser: clean up kernel-doc warnings 2021-04-30 11:20:35 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c
fsopen.c
init.c
inode.c mm: remove nrexceptional from inode: remove BUG_ON 2021-05-05 11:27:20 -07:00
internal.h
io_uring.c io_uring: add feature flag for rsrc tags 2021-06-10 16:33:51 -06:00
io-wq.c io-wq: Fix UAF when wakeup wqe in hash waitqueue 2021-05-26 09:03:56 -06:00
io-wq.h io_uring/io-wq: close io-wq full-stop gap 2021-05-25 19:39:58 -06:00
ioctl.c
Kconfig NFS client updates for Linux 5.13 2021-05-07 11:23:41 -07:00
Kconfig.binfmt
kernel_read_file.c
libfs.c
locks.c Additional fixes and clean-ups for NFSD since tags/nfsd-5.13, 2021-05-05 13:44:19 -07:00
Makefile
mbcache.c
mount.h
mpage.c
namei.c
namespace.c fs/mount_setattr: tighten permission checks 2021-05-12 14:13:16 +02:00
no-block.c
nsfs.c
open.c
pipe.c
pnode.c
pnode.h
posix_acl.c
proc_namespace.c
read_write.c
readdir.c
remap_range.c
select.c
seq_file.c seq_file: Add a seq_bprintf function 2021-04-27 15:50:15 -07:00
signalfd.c signalfd: Remove SIL_PERF_EVENT fields from signalfd_siginfo 2021-05-18 16:20:54 -05:00
splice.c
stack.c
stat.c
statfs.c
super.c
sync.c
timerfd.c
userfaultfd.c userfaultfd: add UFFDIO_CONTINUE ioctl 2021-05-05 11:27:22 -07:00
utimes.c
xattr.c