linux/fs
Filipe Manana b6e833567e btrfs: make hole and data seeking a lot more efficient
The current implementation of hole and data seeking for llseek does not
scale well in regards to the number of extents and the distance between
the start offset and the next hole or extent. This is due to a very high
algorithmic complexity. Often we also get reports of btrfs' hole and data
seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link
tag at the bottom).

In order to better understand it, lets consider the case where the start
offset is 0, we are seeking for a hole and the file size is 16G. Between
file offset 0 and the first hole in the file there are 100K extents - this
is common for large files, specially if we have compression enabled, since
the maximum extent size is limited to 128K. The steps take by the main
loop of the current algorithm are the following:

1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which
   calls btrfs_get_extent(). This will first lookup for an extent map in
   the inode's extent map tree (a red black tree). If the extent map is
   not loaded in memory, then it will do a lookup for the corresponding
   file extent item in the subvolume's b+tree, create an extent map based
   on the contents of the file extent item and then add the extent map to
   the extent map tree of the inode;

2) The second iteration calls btrfs_get_extent_fiemap() again, this time
   with a start offset matching the end offset of the previous extent.
   Again, btrfs_get_extent() will first search the extent map tree, and
   if it doesn't find an extent map there, it will again search in the
   b+tree of the subvolume for a matching file extent item, build an
   extent map based on the file extent item, and add the extent map to
   to the extent map tree of the inode;

3) This repeats over and over until we find the first hole (when seeking
   for holes) or until we find the first extent (when seeking for data).

   If there no extent maps loaded in memory for each iteration, then on
   each iteration we do 1 extent map tree search, 1 b+tree search, plus
   1 more extent map tree traversal to insert an extent map - plus we
   allocate memory for the extent map.

   On each iteration we are growing the size of the extent map tree,
   making each future search slower, and also visiting the same b+tree
   leaves over and over again - taking into account with the default leaf
   size of 16K we can fit more than 200 file extent items in a leaf - so
   we can visit the same b+tree leaf 200+ times, on each visit walking
   down a path from the root to the leaf.

So it's easy to see that what we have now doesn't scale well. Also, it
loads an extent map for every file extent item into memory, which is not
efficient - we should add extents maps only when doing IO (writing or
reading file data).

This change implements a new algorithm which scales much better, and
works like this:

1) We iterate over the subvolume's b+tree, visiting each leaf that has
   file extent items once and only once;

2) For any file extent items found, that don't represent holes or prealloc
   extents, it will not search the extent map tree - there's no need at
   all for that - an extent map is just an in-memory representation of a
   file extent item;

3) When a hole is found, or a prealloc extent, it will check if there's
   delalloc for its range. For this it will search for EXTENT_DELALLOC
   bits in the inode's io tree and check the extent map tree - this is
   for accounting for unflushed delalloc and for flushed delalloc (the
   period between running delalloc and ordered extent completion),
   respectively. This is similar to what the current implementation does
   when it finds a hole or prealloc extent, but without creating extent
   maps and adding them to the extent map tree in case they are not
   loaded in memory;

4) It never allocates extent maps, or adds extent maps to the inode's
   extent map tree. This not only saves memory and time (from the tree
   insertions and allocations), but also eliminates the possibility of
   -ENOMEM due to allocating too many extent maps.

Part of this new code will also be used later for fiemap (which also
suffers similar scalability problems).

The following test example can be used to quickly measure the efficiency
before and after this patch:

    $ cat test-seek-hole.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV

    mount -o compress=lzo $DEV $MNT

    # 16G file -> 131073 compressed extents.
    xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar

    # Leave a 1M hole at file offset 15G.
    xfs_io -c "fpunch 15G 1M" $MNT/foobar

    # Unmount and mount again, so that we can test when there's no
    # metadata cached in memory.
    umount $MNT
    mount -o compress=lzo $DEV $MNT

    # Test seeking for hole from offset 0 (hole is at offset 15G).

    start=$(date +%s%N)
    xfs_io -c "seek -h 0" $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "Took $dur milliseconds to seek first hole (metadata not cached)"
    echo

    start=$(date +%s%N)
    xfs_io -c "seek -h 0" $MNT/foobar
    end=$(date +%s%N)
    dur=$(( (end - start) / 1000000 ))
    echo "Took $dur milliseconds to seek first hole (metadata cached)"
    echo

    umount $MNT

Before this change:

    $ ./test-seek-hole.sh
    (...)
    Whence	Result
    HOLE	16106127360
    Took 176 milliseconds to seek first hole (metadata not cached)

    Whence	Result
    HOLE	16106127360
    Took 17 milliseconds to seek first hole (metadata cached)

After this change:

    $ ./test-seek-hole.sh
    (...)
    Whence	Result
    HOLE	16106127360
    Took 43 milliseconds to seek first hole (metadata not cached)

    Whence	Result
    HOLE	16106127360
    Took 13 milliseconds to seek first hole (metadata cached)

That's about 4x faster when no metadata is cached and about 30% faster
when all metadata is cached.

In practice the differences may often be significantly higher, either due
to a higher number of extents in a file or because the subvolume's b+tree
is much bigger than in this example, where we only have one file.

Link: https://lwn.net/Articles/718805/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2022-09-26 12:28:00 +02:00
..
9p 9p: Fix some kernel-doc comments 2022-07-02 18:52:21 +09:00
adfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
affs affs: use memcpy_to_page and remove replace kmap_atomic() 2022-08-01 19:53:31 +02:00
afs Networking fixes for 6.0-rc5, including fixes from rxrpc, netfilter, 2022-09-08 08:15:01 -04:00
autofs autofs: remove unused ino field inode 2022-07-17 17:31:42 -07:00
befs befs: Convert befs_symlink_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
bfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
btrfs btrfs: make hole and data seeking a lot more efficient 2022-09-26 12:28:00 +02:00
cachefiles cachefiles: make on-demand request distribution fairer 2022-08-31 16:41:10 +01:00
ceph We have a good pile of various fixes and cleanups from Xiubo, Jeff, 2022-08-11 12:41:07 -07:00
cifs cifs: update internal module number 2022-09-14 04:00:06 -05:00
coda coda: Convert coda_symlink_filler() to use a folio 2022-08-02 12:34:03 -04:00
configfs configfs: fix a race in configfs_{,un}register_subsystem() 2022-02-22 18:30:28 +01:00
cramfs cramfs: read_mapping_page() is synchronous 2022-08-02 12:34:02 -04:00
crypto We have a good pile of various fixes and cleanups from Xiubo, Jeff, 2022-08-11 12:41:07 -07:00
debugfs debugfs: add debugfs_lookup_and_remove() 2022-09-05 13:02:34 +02:00
devpts fsnotify: fix fsnotify hooks in pseudo filesystems 2022-01-24 14:17:02 +01:00
dlm fs: dlm: move kref_put assert for lkb structs 2022-08-01 09:31:46 -05:00
ecryptfs ecryptfs: Convert ecryptfs to read_folio 2022-05-09 16:21:45 -04:00
efivarfs efi: vars: Move efivar caching layer into efivarfs 2022-06-24 20:40:19 +02:00
efs efs: Convert efs symlinks to read_folio 2022-05-09 16:21:45 -04:00
erofs erofs: fix pcluster use-after-free on UP platforms 2022-09-05 23:23:30 +08:00
exfat exfat: fix overflow for large capacity partition 2022-09-04 09:38:40 +09:00
exportfs exportfs: support idmapped mounts 2022-04-28 16:31:10 +02:00
ext2 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
ext4 ext4: limit the number of retries after discarding preallocations blocks 2022-09-22 10:51:19 -04:00
f2fs f2fs-for-6.0 2022-08-08 11:18:31 -07:00
fat Updates to various subsystems which I help look after. lib, ocfs2, 2022-08-07 10:03:24 -07:00
freevxfs freevxfs: Convert vxfs_immed_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
fscache fscache: add tracepoint when failing cookie 2022-08-09 14:13:59 +01:00
fuse iov_iter stuff, part 2, rebased 2022-08-08 20:04:35 -07:00
gfs2 New code for 6.0: 2022-08-11 13:11:49 -07:00
hfs hfs: Remove check for PageError 2022-06-29 08:51:06 -04:00
hfsplus Folio changes for 6.0 2022-08-03 10:35:43 -07:00
hostfs hostfs: Handle page write errors correctly 2022-08-02 12:34:02 -04:00
hpfs hpfs: Convert symlinks to read_folio 2022-05-09 16:21:45 -04:00
hugetlbfs iov_iter stuff, part 2, rebased 2022-08-08 20:04:35 -07:00
iomap New code for 6.0: 2022-08-11 13:11:49 -07:00
isofs fs/buffer: Combine two submit_bh() and ll_rw_block() arguments 2022-07-14 12:14:32 -06:00
jbd2 - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
jffs2 This pull request contains fixes for JFFS2, UBI and UBIFS 2022-06-03 14:42:24 -07:00
jfs Folio changes for 6.0 2022-08-03 10:35:43 -07:00
kernfs kernfs: Fix typo 'the the' in comment 2022-07-28 10:57:25 +02:00
ksmbd ksmbd: don't remove dos attribute xattr on O_TRUNC open 2022-08-15 21:07:01 -05:00
lockd lockd: detect and reject lock arguments that overflow 2022-08-04 10:28:48 -04:00
minix fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
netfs netfs: do not unlock and put the folio twice 2022-07-14 10:10:12 +02:00
nfs NFS client bugfixes for Linux 6.0 2022-09-12 17:53:46 -04:00
nfs_common
nfsd fix for nfsd regression caused by iov_iter stuff this window 2022-09-13 15:11:38 +02:00
nilfs2 Folio changes for 6.0 2022-08-03 10:35:43 -07:00
nls
notify fsnotify: Fix comment typo 2022-07-26 13:38:47 +02:00
ntfs Folio changes for 6.0 2022-08-03 10:35:43 -07:00
ntfs3 fs.idmapped.fixes.v6.0-rc3 2022-08-22 11:33:02 -07:00
ocfs2 ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown 2022-08-28 14:02:45 -07:00
omfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
openpromfs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
orangefs orangefs: Remove test for folio error 2022-06-29 08:51:07 -04:00
overlayfs acl: handle idmapped mounts for idmapped filesystems 2022-08-17 11:23:31 +02:00
proc mm/smaps: don't access young/dirty bit if pte unpresent 2022-08-20 15:17:45 -07:00
pstore EFI updates for v5.20 2022-08-03 14:38:02 -07:00
qnx4 fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
qnx6 fs: Convert mpage_readpage to mpage_read_folio 2022-05-09 16:21:44 -04:00
quota - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
ramfs
reiserfs Folio changes for 6.0 2022-08-03 10:35:43 -07:00
romfs romfs: Convert romfs to read_folio 2022-05-09 16:21:46 -04:00
smbfs_common Add various fsctl structs 2022-05-23 20:24:12 -05:00
squashfs squashfs: don't call kmalloc in decompressors 2022-08-28 14:02:45 -07:00
sysfs kobject: kobj_type: remove default_attrs 2022-04-05 15:39:19 +02:00
sysv Not a lot of material this cycle. Many singleton patches against various 2022-05-27 11:22:03 -07:00
tracefs tracefs: Only clobber mode/uid/gid on remount if asked 2022-09-08 17:10:54 -04:00
ubifs - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
udf fs/buffer: Combine two submit_bh() and ll_rw_block() arguments 2022-07-14 12:14:32 -06:00
ufs Folio changes for 6.0 2022-08-03 10:35:43 -07:00
unicode kbuild: unify cmd_copy and cmd_shipped 2022-02-14 10:37:32 +09:00
vboxsf vboxsf: Convert vboxsf to read_folio 2022-05-09 16:21:46 -04:00
verity btrfs: send: add support for fs-verity 2022-09-26 12:27:55 +02:00
xfs New code for 6.0: 2022-08-13 13:50:11 -07:00
zonefs New code for 6.0: 2022-08-11 13:11:49 -07:00
aio.c iov_iter work, part 1 - isolated cleanups and optimizations. 2022-08-03 13:50:22 -07:00
anon_inodes.c
attr.c vfs: Check the truncate maximum size in inode_newsize_ok() 2022-08-08 10:39:29 -07:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c coredump: Snapshot the vmas in do_coredump 2022-03-08 12:55:29 -06:00
binfmt_elf_test.c binfmt_elf: Introduce KUnit test 2022-03-03 20:38:56 -08:00
binfmt_elf.c revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE" 2022-04-15 14:49:56 -07:00
binfmt_flat.c binfmt_flat: Remove shared library support 2022-04-22 10:57:18 -07:00
binfmt_misc.c Fix regression due to "fs: move binfmt_misc sysctl to its own file" 2022-02-09 09:50:02 -08:00
binfmt_script.c
buffer.c Folio changes for 6.0 2022-08-03 10:35:43 -07:00
char_dev.c
compat_binfmt_elf.c binfmt_elf: Introduce KUnit test 2022-03-03 20:38:56 -08:00
coredump.c fs: do not compare against ->llseek 2022-07-16 09:19:15 -04:00
d_path.c
dax.c Merge branch 'for-6.0/dax' into libnvdimm-fixes 2022-09-24 18:14:12 -07:00
dcache.c dcache: move the DCACHE_OP_COMPARE case out of the __d_lookup_rcu loop 2022-08-17 14:33:03 -07:00
direct-io.c iov_iter: advancing variants of iov_iter_get_pages{,_alloc}() 2022-08-08 22:37:22 -04:00
drop_caches.c
eventfd.c
eventpoll.c epoll: autoremove wakers even more aggressively 2022-07-17 17:31:40 -07:00
exec.c Revert "fs/exec: allow to unshare a time namespace on vfork+exec" 2022-09-13 10:38:43 -07:00
fcntl.c keep iocb_flags() result cached in struct file 2022-06-10 16:10:23 -04:00
fhandle.c
file_table.c iov_iter work, part 1 - isolated cleanups and optimizations. 2022-08-03 13:50:22 -07:00
file.c fix the breakage in close_fd_get_file() calling conventions change 2022-06-05 15:03:03 -04:00
filesystems.c
fs_context.c vfs: fs_context: fix up param length parsing in legacy_parse_param 2022-01-18 09:23:19 +02:00
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c writeback: avoid use-after-free after removing device 2022-08-28 14:02:43 -07:00
fsopen.c uninline may_mount() and don't opencode it in fspick(2)/fsopen(2) 2022-05-19 23:25:10 -04:00
init.c
inode.c fs: __file_remove_privs(): restore call to inode_has_no_xattr() 2022-08-18 09:39:33 +02:00
internal.h Cleanups (and one fix) around struct mount handling. 2022-06-04 19:00:05 -07:00
ioctl.c Fixes for 5.18-rc1: 2022-04-01 19:35:56 -07:00
Kconfig mm: hugetlb_vmemmap: introduce the name HVO 2022-08-08 18:06:42 -07:00
Kconfig.binfmt m68knommu: changes for linux 5.19 2022-05-30 10:56:18 -07:00
kernel_read_file.c fs/kernel_read_file: allow to read files up-to ssize_t 2022-06-16 19:58:21 -07:00
libfs.c fs: Convert simple_readpage to simple_read_folio 2022-05-09 16:21:44 -04:00
locks.c locks: Fix dropped call to ->fl_release_private() 2022-08-17 15:08:58 -04:00
Makefile io_uring: move to separate directory 2022-07-24 18:39:10 -06:00
mbcache.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
mount.h switch try_to_unlazy_next() to __legitimize_mnt() 2022-07-05 16:18:21 -04:00
mpage.c Folio changes for 6.0 2022-08-03 10:35:43 -07:00
namei.c fs.setgid.v6.0 2022-08-09 09:52:28 -07:00
namespace.c fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts 2022-08-17 11:27:11 +02:00
no-block.c
nsfs.c
open.c open: always initialize ownership fields 2022-09-20 11:57:57 +02:00
pipe.c Not a lot of material this cycle. Many singleton patches against various 2022-05-27 11:22:03 -07:00
pnode.c
pnode.h
posix_acl.c acl: handle idmapped mounts for idmapped filesystems 2022-08-17 11:23:31 +02:00
proc_namespace.c vfs: escape hash as well 2022-06-28 13:58:05 -04:00
read_write.c switch new_sync_{read,write}() to ITER_UBUF 2022-08-08 22:37:15 -04:00
readdir.c
remap_range.c - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe 2022-08-05 16:32:45 -07:00
select.c select: Fix indefinitely sleeping task in poll_schedule_timeout() 2022-01-11 09:03:05 -08:00
seq_file.c rxrpc: Fix locking issue 2022-05-22 21:03:01 +01:00
signalfd.c Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
splice.c iter_to_pipe(): switch to advancing variant of iov_iter_get_pages() 2022-08-08 22:37:23 -04:00
stack.c
stat.c RISC-V Patches for the 5.19 Merge Window, Part 1 2022-05-31 14:10:54 -07:00
statfs.c
super.c fuse update for 6.0 2022-08-08 11:10:02 -07:00
sync.c riscv: compat: syscall: Add compat_sys_call_table implementation 2022-04-26 13:36:25 -07:00
sysctls.c fs: move namespace sysctls and declare fs base directory 2022-01-22 08:33:36 +02:00
timerfd.c
userfaultfd.c mm/uffd: reset write protection when unregister with wp-mode 2022-08-20 15:17:45 -07:00
utimes.c
xattr.c acl: move idmapped mount fixup into vfs_{g,s}etxattr() 2022-07-15 22:08:59 +02:00