linux/fs/btrfs
Filipe Manana 47e1d1c7bb btrfs: only reserve the needed data space amount during fallocate
During a plain fallocate, we always start by reserving an amount of data
space that matches the length of the range passed to fallocate. When we
already have extents allocated in that range, we may end up trying to
reserve a lot more data space then we need, which can result in several
undesired behaviours:

1) We fail with -ENOSPC. For example the passed range has a length
   of 1G, but there's only one hole with a size of 1M in that range;

2) We temporarily reserve excessive data space that could be used by
   other operations happening concurrently;

3) By reserving much more data space then we need, we can end up
   doing expensive things like triggering dellaloc for other inodes,
   waiting for the ordered extents to complete, trigger transaction
   commits, allocate new block groups, etc.

Example:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdj
  MNT=/mnt/sdj

  mkfs.btrfs -f -b 1g $DEV
  mount $DEV $MNT

  # Create a file with a size of 600M and two holes, one at [200M, 201M[
  # and another at [401M, 402M[
  xfs_io -f -c "pwrite -S 0xab 0 200M" \
            -c "pwrite -S 0xcd 201M 200M" \
            -c "pwrite -S 0xef 402M 198M" \
            $MNT/foobar

  # Now call fallocate against the whole file range, see if it fails
  # with -ENOSPC or not - it shouldn't since we only need to allocate
  # 2M of data space.
  xfs_io -c "falloc 0 600M" $MNT/foobar

  umount $MNT

  $ ./test.sh
  (...)
  wrote 209715200/209715200 bytes at offset 0
  200 MiB, 51200 ops; 0.8063 sec (248.026 MiB/sec and 63494.5831 ops/sec)
  wrote 209715200/209715200 bytes at offset 210763776
  200 MiB, 51200 ops; 0.8053 sec (248.329 MiB/sec and 63572.3172 ops/sec)
  wrote 207618048/207618048 bytes at offset 421527552
  198 MiB, 50688 ops; 0.7925 sec (249.830 MiB/sec and 63956.5548 ops/sec)
  fallocate: No space left on device
  $

So fix this by not allocating an amount of data space that matches the
length of the range passed to fallocate. Instead allocate an amount of
data space that corresponds to the sum of the sizes of each hole found
in the range. This reservation now happens after we have locked the file
range, which is safe since we know at this point there's no delalloc
in the range because we've taken the inode's VFS lock in exclusive mode,
we have taken the inode's i_mmap_lock in exclusive mode, we have flushed
delalloc and waited for all ordered extents in the range to complete.

This type of failure actually seems to happen in practice with systemd,
and we had at least one report about this in a very long thread which
is referenced by the Link tag below.

Link: https://lore.kernel.org/linux-btrfs/bdJVxLiFr_PyQSXRUbZJfFW_jAjsGgoMetqPHJMbg-hdy54Xt_ZHhRetmnJ6cJ99eBlcX76wy-AvWwV715c3YndkxneSlod11P1hlaADx0s=@protonmail.com/
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-05-16 17:03:09 +02:00
..
tests btrfs: assert we have a write lock when removing and replacing extent maps 2022-03-14 13:13:50 +01:00
acl.c btrfs: reserve correct number of items for inode creation 2022-05-16 17:03:08 +02:00
async-thread.c btrfs: fix memory ordering between normal and ordered work functions 2021-11-16 16:50:23 +01:00
async-thread.h
backref.c btrfs: unify the error handling pattern for read_tree_block() 2022-03-14 13:13:53 +01:00
backref.h btrfs: remove ignore_offset argument from btrfs_find_all_roots() 2021-08-23 13:19:01 +02:00
block-group.c btrfs: use btrfs_for_each_slot in find_first_block_group 2022-05-16 17:03:07 +02:00
block-group.h btrfs: zoned: activate block group only for extent allocation 2022-04-06 00:50:41 +02:00
block-rsv.c btrfs: reserve extra space for the free space tree 2022-01-07 14:18:25 +01:00
block-rsv.h btrfs: init root block_rsv at init root time 2022-01-03 15:09:48 +01:00
btrfs_inode.h btrfs: export a helper for compression hard check 2022-04-27 22:15:40 +02:00
check-integrity.c block: remove genhd.h 2022-02-02 07:49:59 -07:00
check-integrity.h
compression.c btrfs: fix btrfs_submit_compressed_write cgroup attribution 2022-04-06 00:50:51 +02:00
compression.h btrfs: track compressed bio errors as blk_status_t 2022-03-14 13:13:51 +01:00
ctree.c btrfs: introduce btrfs_for_each_slot iterator macro 2022-05-16 17:03:07 +02:00
ctree.h btrfs: move common inode creation code into btrfs_create_new_inode() 2022-05-16 17:03:08 +02:00
delalloc-space.c btrfs: support different disk extent size for delalloc 2022-03-14 13:13:51 +01:00
delalloc-space.h
delayed-inode.c btrfs: add an inode-item.h 2022-01-07 14:18:23 +01:00
delayed-inode.h
delayed-ref.c btrfs: reserve extra space for the free space tree 2022-01-07 14:18:25 +01:00
delayed-ref.h btrfs: make btrfs_ref::real_root optional 2021-10-26 19:08:06 +02:00
dev-replace.c btrfs: use a local variable for fs_devices pointer in btrfs_dev_replace_finishing 2022-05-16 17:03:08 +02:00
dev-replace.h btrfs: zoned: mark block groups to copy for device-replace 2021-02-09 02:46:07 +01:00
dir-item.c btrfs: use btrfs_for_each_slot in btrfs_search_dir_index_item 2022-05-16 17:03:07 +02:00
discard.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
discard.h
disk-io.c btrfs: remove trivial wrapper btrfs_read_buffer() 2022-05-16 17:03:07 +02:00
disk-io.h btrfs: remove trivial wrapper btrfs_read_buffer() 2022-05-16 17:03:07 +02:00
export.c
export.h
extent_io.c btrfs: warn when extent buffer leak test fails 2022-05-16 17:03:08 +02:00
extent_io.h btrfs: fix qgroup reserve overflow the qgroup limit 2022-03-23 23:34:15 +01:00
extent_map.c btrfs: assert we have a write lock when removing and replacing extent maps 2022-03-14 13:13:50 +01:00
extent_map.h btrfs: defrag: don't use merged extent map for their generation check 2022-02-23 17:43:13 +01:00
extent-io-tree.h btrfs: Convert from invalidatepage to invalidate_folio 2022-03-15 08:23:29 -04:00
extent-tree.c btrfs: zoned: activate block group only for extent allocation 2022-04-06 00:50:41 +02:00
file-item.c btrfs: handle csum lookup errors properly on reads 2022-03-14 13:13:51 +01:00
file.c btrfs: only reserve the needed data space amount during fallocate 2022-05-16 17:03:09 +02:00
free-space-cache.c btrfs: add inode to truncate control 2022-01-07 14:18:24 +01:00
free-space-cache.h btrfs: change name and type of private member of btrfs_free_space_ctl 2022-01-03 15:09:50 +01:00
free-space-tree.c btrfs: add support for multiple global roots 2022-03-14 13:13:49 +01:00
free-space-tree.h
inode-item.c btrfs: make should_throttle loop local in btrfs_truncate_inode_items 2022-01-07 14:18:25 +01:00
inode-item.h btrfs: add inode to truncate control 2022-01-07 14:18:24 +01:00
inode.c btrfs: restore inode creation before xattr setting 2022-05-16 17:03:09 +02:00
ioctl.c btrfs: move common inode creation code into btrfs_create_new_inode() 2022-05-16 17:03:08 +02:00
Kconfig btrfs: use generic Kconfig option for 256kB page size limit 2022-01-20 08:52:55 +02:00
locking.c btrfs: fix typos in comments 2021-06-22 14:11:57 +02:00
locking.h btrfs: assert that extent buffers are write locked instead of only locked 2021-10-26 19:08:02 +02:00
lzo.c btrfs: add lzo workspace buffer length constants 2022-03-14 13:13:50 +01:00
Makefile Kbuild: add -Wno-shift-negative-value where -Wextra is used 2022-03-13 17:30:31 +09:00
misc.h btrfs: use correct header for div_u64 in misc.h 2021-09-07 14:29:50 +02:00
ordered-data.c btrfs: add BTRFS_IOC_ENCODED_WRITE 2022-03-14 13:13:51 +01:00
ordered-data.h btrfs: add BTRFS_IOC_ENCODED_WRITE 2022-03-14 13:13:51 +01:00
orphan.c
print-tree.c btrfs: unify the error handling pattern for read_tree_block() 2022-03-14 13:13:53 +01:00
print-tree.h
props.c btrfs: move common inode creation code into btrfs_create_new_inode() 2022-05-16 17:03:08 +02:00
props.h btrfs: move common inode creation code into btrfs_create_new_inode() 2022-05-16 17:03:08 +02:00
qgroup.c btrfs: remove trivial wrapper btrfs_read_buffer() 2022-05-16 17:03:07 +02:00
qgroup.h btrfs: fix lock inversion problem when doing qgroup extent tracing 2021-07-22 15:50:07 +02:00
raid56.c btrfs: remove btrfs_raid_bio::fs_info member 2021-10-26 19:08:03 +02:00
raid56.h btrfs: remove btrfs_raid_bio::fs_info member 2021-10-26 19:08:03 +02:00
rcu-string.h
ref-verify.c btrfs: stop accessing ->extent_root directly 2022-01-03 15:09:49 +01:00
ref-verify.h
reflink.c fs: Remove ->readpages address space operation 2022-04-01 13:45:33 -04:00
reflink.h
relocation.c btrfs: unify the error handling pattern for read_tree_block() 2022-03-14 13:13:53 +01:00
root-tree.c btrfs: do not start relocation until in progress drops are done 2022-03-02 16:52:39 +01:00
scrub.c btrfs: scrub: rename scrub_bio::pagev and related members 2022-05-16 17:03:07 +02:00
send.c btrfs: use btrfs_for_each_slot in btrfs_unlink_all_paths 2022-05-16 17:03:08 +02:00
send.h btrfs: reuse existing inode from btrfs_ioctl 2022-03-14 13:13:46 +01:00
space-info.c btrfs: add lockdep_assert_held to need_preemptive_reclaim 2022-03-14 13:13:53 +01:00
space-info.h btrfs: change root to fs_info for btrfs_reserve_metadata_bytes 2022-01-03 15:09:45 +01:00
struct-funcs.c btrfs: add special case to setget helpers for 64k pages 2021-08-23 13:18:58 +02:00
subpage.c btrfs: subpage: fix a wrong check on subpage->writers 2022-03-02 16:51:39 +01:00
subpage.h btrfs: rework page locking in __extent_writepage() 2021-10-26 19:08:05 +02:00
super.c btrfs: add filesystems state details to error messages 2022-03-14 13:13:52 +01:00
sysfs.c btrfs: sysfs: export the balance paused state of exclusive operation 2022-05-05 21:05:56 +02:00
sysfs.h
transaction.c btrfs: pass btrfs_fs_info for deleting snapshots and cleaner 2022-03-14 13:13:52 +01:00
transaction.h btrfs: pass btrfs_fs_info for deleting snapshots and cleaner 2022-03-14 13:13:52 +01:00
tree-checker.c btrfs: add support for multiple global roots 2022-03-14 13:13:49 +01:00
tree-checker.h
tree-defrag.c btrfs: remove unnecessary extent root check in btrfs_defrag_leaves 2022-01-03 15:09:48 +01:00
tree-log.c btrfs: remove trivial wrapper btrfs_read_buffer() 2022-05-16 17:03:07 +02:00
tree-log.h btrfs: avoid inode logging during rename and link when possible 2022-03-14 13:13:48 +01:00
tree-mod-log.c btrfs: fix race when picking most recent mod log operation for an old root 2021-04-20 19:27:17 +02:00
tree-mod-log.h btrfs: add and use helper to get lowest sequence number for the tree mod log 2021-04-19 17:25:17 +02:00
ulist.c
ulist.h
uuid-tree.c btrfs: drop the _nr from the item helpers 2022-01-03 15:09:43 +01:00
verity.c btrfs: drop the _nr from the item helpers 2022-01-03 15:09:43 +01:00
volumes.c btrfs: use btrfs_for_each_slot in btrfs_read_chunk_tree 2022-05-16 17:03:08 +02:00
volumes.h btrfs: fix direct I/O read repair for split bios 2022-04-19 15:44:56 +02:00
xattr.c btrfs: use btrfs_for_each_slot in btrfs_listxattr 2022-05-16 17:03:08 +02:00
xattr.h
zlib.c Revert "btrfs: compression: drop kmap/kunmap from zlib" 2021-10-29 13:03:05 +02:00
zoned.c btrfs: zoned: activate block group properly on unlimited active zone device 2022-05-05 21:05:56 +02:00
zoned.h btrfs: zoned: use dedicated lock for data relocation 2022-04-21 16:06:24 +02:00
zstd.c lib: zstd: Add kernel-specific API 2021-11-08 16:55:21 -08:00