linux

History

Filipe Manana b6e833567e btrfs: make hole and data seeking a lot more efficient

The current implementation of hole and data seeking for llseek does not scale well with regard to the number of extents and the distance between the start offset and the next hole or extent. This is due to a very high algorithmic complexity. Often we also get reports of btrfs' hole and data seeking (llseek) being too slow, such as at 2017's LSFMM (see the Link tag at the bottom).

In order to better understand it, let's consider the case where the start offset is 0, we are seeking for a hole and the file size is 16G. Between file offset 0 and the first hole in the file there are 100K extents - this is common for large files, especially if we have compression enabled, since the maximum extent size is limited to 128K. The steps taken by the main loop of the current algorithm are the following:

1) We start by calling btrfs_get_extent_fiemap(), for file offset 0, which calls btrfs_get_extent(). This will first look up an extent map in the inode's extent map tree (a red black tree). If the extent map is not loaded in memory, then it will do a lookup for the corresponding file extent item in the subvolume's b+tree, create an extent map based on the contents of the file extent item and then add the extent map to the extent map tree of the inode;

2) The second iteration calls btrfs_get_extent_fiemap() again, this time with a start offset matching the end offset of the previous extent. Again, btrfs_get_extent() will first search the extent map tree, and if it doesn't find an extent map there, it will again search in the b+tree of the subvolume for a matching file extent item, build an extent map based on the file extent item, and add the extent map to the extent map tree of the inode;

3) This repeats over and over until we find the first hole (when seeking for holes) or until we find the first extent (when seeking for data).
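The loop described in steps 1 to 3 can be illustrated with a small userspace toy model. This is only a sketch under stated assumptions, not the kernel code: a sorted Python list stands in for the subvolume's b+tree, a dict stands in for the inode's extent map tree, and the names `lookup_file_extent`, `seek_hole_old` and the `stats` counters are hypothetical. It counts how many searches and insertions the old approach performs when no extent maps are cached.

```python
from bisect import bisect_right

def lookup_file_extent(extents, offset):
    """Stand-in for one b+tree search: return the (start, length) extent
    containing offset, or None if offset falls in a hole."""
    i = bisect_right(extents, (offset, float("inf"))) - 1
    if i >= 0:
        start, length = extents[i]
        if offset < start + length:
            return extents[i]
    return None

def seek_hole_old(extents, file_size, start, em_cache, stats):
    """Toy version of the old loop: one extent map search per extent and,
    on a cache miss, one b+tree search plus one extent map insertion."""
    offset = start
    while offset < file_size:
        stats["em_searches"] += 1
        em = em_cache.get(offset)      # cache is keyed by extent start offset
        if em is None:
            stats["btree_searches"] += 1
            em = lookup_file_extent(extents, offset)
            if em is None:
                return offset          # no extent here: this is the hole
            stats["em_insertions"] += 1
            em_cache[em[0]] = em       # the cache grows on every iteration
        offset = em[0] + em[1]         # continue at the end of this extent
    return file_size                   # implicit hole at end of file

# Hypothetical layout: 1000 contiguous 128K extents, one 128K hole, one extent.
K = 1024
extents = [(i * 128 * K, 128 * K) for i in range(1000)]
extents.append((1001 * 128 * K, 128 * K))
file_size = 1002 * 128 * K

stats = {"em_searches": 0, "btree_searches": 0, "em_insertions": 0}
cache = {}
hole = seek_hole_old(extents, file_size, 0, cache, stats)
print(hole, stats)
```

In this model every extent before the hole costs one full tree search and one cache insertion, which is the per-extent overhead the commit message describes.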
If there are no extent maps loaded in memory for each iteration, then on each iteration we do 1 extent map tree search, 1 b+tree search, plus 1 more extent map tree traversal to insert an extent map - plus we allocate memory for the extent map.

On each iteration we are growing the size of the extent map tree, making each future search slower, and also visiting the same b+tree leaves over and over again - taking into account that with the default leaf size of 16K we can fit more than 200 file extent items in a leaf - so we can visit the same b+tree leaf 200+ times, on each visit walking down a path from the root to the leaf.

So it's easy to see that what we have now doesn't scale well. Also, it loads an extent map for every file extent item into memory, which is not efficient - we should add extent maps only when doing IO (writing or reading file data).

This change implements a new algorithm which scales much better, and works like this:

1) We iterate over the subvolume's b+tree, visiting each leaf that has file extent items once and only once;

2) For any file extent items found, that don't represent holes or prealloc extents, it will not search the extent map tree - there's no need at all for that - an extent map is just an in-memory representation of a file extent item;

3) When a hole is found, or a prealloc extent, it will check if there's delalloc for its range. For this it will search for EXTENT_DELALLOC bits in the inode's io tree and check the extent map tree - this is for accounting for unflushed delalloc and for flushed delalloc (the period between running delalloc and ordered extent completion), respectively. This is similar to what the current implementation does when it finds a hole or prealloc extent, but without creating extent maps and adding them to the extent map tree in case they are not loaded in memory;

4) It never allocates extent maps, or adds extent maps to the inode's extent map tree.
This not only saves memory and time (from the tree insertions and allocations), but also eliminates the possibility of -ENOMEM due to allocating too many extent maps.

Part of this new code will also be used later for fiemap (which also suffers similar scalability problems).

The following test example can be used to quickly measure the efficiency before and after this patch:

   $ cat test-seek-hole.sh
   #!/bin/bash

   DEV=/dev/sdi
   MNT=/mnt/sdi

   mkfs.btrfs -f $DEV

   mount -o compress=lzo $DEV $MNT

   # 16G file -> 131073 compressed extents.
   xfs_io -f -c "pwrite -S 0xab -b 1M 0 16G" $MNT/foobar

   # Leave a 1M hole at file offset 15G.
   xfs_io -c "fpunch 15G 1M" $MNT/foobar

   # Unmount and mount again, so that we can test when there's no
   # metadata cached in memory.
   umount $MNT
   mount -o compress=lzo $DEV $MNT

   # Test seeking for hole from offset 0 (hole is at offset 15G).

   start=$(date +%s%N)
   xfs_io -c "seek -h 0" $MNT/foobar
   end=$(date +%s%N)
   dur=$(( (end - start) / 1000000 ))
   echo "Took $dur milliseconds to seek first hole (metadata not cached)"
   echo

   start=$(date +%s%N)
   xfs_io -c "seek -h 0" $MNT/foobar
   end=$(date +%s%N)
   dur=$(( (end - start) / 1000000 ))
   echo "Took $dur milliseconds to seek first hole (metadata cached)"
   echo

   umount $MNT

Before this change:

   $ ./test-seek-hole.sh
   (...)
   Whence Result
   HOLE 16106127360
   Took 176 milliseconds to seek first hole (metadata not cached)

   Whence Result
   HOLE 16106127360
   Took 17 milliseconds to seek first hole (metadata cached)

After this change:

   $ ./test-seek-hole.sh
   (...)
   Whence Result
   HOLE 16106127360
   Took 43 milliseconds to seek first hole (metadata not cached)

   Whence Result
   HOLE 16106127360
   Took 13 milliseconds to seek first hole (metadata cached)

That's about 4x faster when no metadata is cached and about 30% faster when all metadata is cached.
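As a back-of-the-envelope check of the figures quoted above, the following sketch recomputes the speedups from the reported timings and compares the number of b+tree descents in the old approach with the number of leaf visits in the new one for the 100K-extent example. The value of 200 items per leaf is an assumption taken from the message's "more than 200" lower bound, and the helper names are hypothetical; this is a rough model, not a measurement.

```python
import math

def old_btree_searches(num_extents):
    """Old approach with nothing cached: one root-to-leaf search per extent."""
    return num_extents

def new_leaf_visits(num_extents, items_per_leaf):
    """New approach: each leaf holding file extent items is visited once."""
    return math.ceil(num_extents / items_per_leaf)

def speedup(before_ms, after_ms):
    """Ratio of the reported wall-clock times."""
    return before_ms / after_ms

extents = 100_000        # extents before the first hole, from the example
items_per_leaf = 200     # assumed lower bound for a 16K leaf

print("old b+tree searches:", old_btree_searches(extents))
print("new leaf visits:    ", new_leaf_visits(extents, items_per_leaf))
print("speedup, metadata not cached:", round(speedup(176, 43), 2))
print("speedup, metadata cached:    ", round(speedup(17, 13), 2))
```

The timing ratios come out at roughly 4.09 and 1.31, consistent with the "about 4x" and "about 30%" stated in the message.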
In practice the differences may often be significantly higher, either due to a higher number of extents in a file or because the subvolume's b+tree is much bigger than in this example, where we only have one file.

Link: https://lwn.net/Articles/718805/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>		2022-09-26 12:28:00 +02:00
..
9p	9p: Fix some kernel-doc comments	2022-07-02 18:52:21 +09:00
adfs	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
affs	affs: use memcpy_to_page and remove replace kmap_atomic()	2022-08-01 19:53:31 +02:00
afs	Networking fixes for 6.0-rc5, including fixes from rxrpc, netfilter,	2022-09-08 08:15:01 -04:00
autofs	autofs: remove unused ino field inode	2022-07-17 17:31:42 -07:00
befs	befs: Convert befs_symlink_read_folio() to use a folio	2022-08-02 12:34:03 -04:00
bfs	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
btrfs	btrfs: make hole and data seeking a lot more efficient	2022-09-26 12:28:00 +02:00
cachefiles	cachefiles: make on-demand request distribution fairer	2022-08-31 16:41:10 +01:00
ceph	We have a good pile of various fixes and cleanups from Xiubo, Jeff,	2022-08-11 12:41:07 -07:00
cifs	cifs: update internal module number	2022-09-14 04:00:06 -05:00
coda	coda: Convert coda_symlink_filler() to use a folio	2022-08-02 12:34:03 -04:00
configfs	configfs: fix a race in configfs_{,un}register_subsystem()	2022-02-22 18:30:28 +01:00
cramfs	cramfs: read_mapping_page() is synchronous	2022-08-02 12:34:02 -04:00
crypto	We have a good pile of various fixes and cleanups from Xiubo, Jeff,	2022-08-11 12:41:07 -07:00
debugfs	debugfs: add debugfs_lookup_and_remove()	2022-09-05 13:02:34 +02:00
devpts	fsnotify: fix fsnotify hooks in pseudo filesystems	2022-01-24 14:17:02 +01:00
dlm	fs: dlm: move kref_put assert for lkb structs	2022-08-01 09:31:46 -05:00
ecryptfs	ecryptfs: Convert ecryptfs to read_folio	2022-05-09 16:21:45 -04:00
efivarfs	efi: vars: Move efivar caching layer into efivarfs	2022-06-24 20:40:19 +02:00
efs	efs: Convert efs symlinks to read_folio	2022-05-09 16:21:45 -04:00
erofs	erofs: fix pcluster use-after-free on UP platforms	2022-09-05 23:23:30 +08:00
exfat	exfat: fix overflow for large capacity partition	2022-09-04 09:38:40 +09:00
exportfs	exportfs: support idmapped mounts	2022-04-28 16:31:10 +02:00
ext2	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
ext4	ext4: limit the number of retries after discarding preallocations blocks	2022-09-22 10:51:19 -04:00
f2fs	f2fs-for-6.0	2022-08-08 11:18:31 -07:00
fat	Updates to various subsystems which I help look after. lib, ocfs2,	2022-08-07 10:03:24 -07:00
freevxfs	freevxfs: Convert vxfs_immed_read_folio() to use a folio	2022-08-02 12:34:03 -04:00
fscache	fscache: add tracepoint when failing cookie	2022-08-09 14:13:59 +01:00
fuse	iov_iter stuff, part 2, rebased	2022-08-08 20:04:35 -07:00
gfs2	New code for 6.0:	2022-08-11 13:11:49 -07:00
hfs	hfs: Remove check for PageError	2022-06-29 08:51:06 -04:00
hfsplus	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
hostfs	hostfs: Handle page write errors correctly	2022-08-02 12:34:02 -04:00
hpfs	hpfs: Convert symlinks to read_folio	2022-05-09 16:21:45 -04:00
hugetlbfs	iov_iter stuff, part 2, rebased	2022-08-08 20:04:35 -07:00
iomap	New code for 6.0:	2022-08-11 13:11:49 -07:00
isofs	fs/buffer: Combine two submit_bh() and ll_rw_block() arguments	2022-07-14 12:14:32 -06:00
jbd2	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
jffs2	This pull request contains fixes for JFFS2, UBI and UBIFS	2022-06-03 14:42:24 -07:00
jfs	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
kernfs	kernfs: Fix typo 'the the' in comment	2022-07-28 10:57:25 +02:00
ksmbd	ksmbd: don't remove dos attribute xattr on O_TRUNC open	2022-08-15 21:07:01 -05:00
lockd	lockd: detect and reject lock arguments that overflow	2022-08-04 10:28:48 -04:00
minix	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
netfs	netfs: do not unlock and put the folio twice	2022-07-14 10:10:12 +02:00
nfs	NFS client bugfixes for Linux 6.0	2022-09-12 17:53:46 -04:00
nfs_common
nfsd	fix for nfsd regression caused by iov_iter stuff this window	2022-09-13 15:11:38 +02:00
nilfs2	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
nls
notify	fsnotify: Fix comment typo	2022-07-26 13:38:47 +02:00
ntfs	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
ntfs3	fs.idmapped.fixes.v6.0-rc3	2022-08-22 11:33:02 -07:00
ocfs2	ocfs2: fix freeing uninitialized resource on ocfs2_dlm_shutdown	2022-08-28 14:02:45 -07:00
omfs	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
openpromfs	fs: allocate inode by using alloc_inode_sb()	2022-03-22 15:57:03 -07:00
orangefs	orangefs: Remove test for folio error	2022-06-29 08:51:07 -04:00
overlayfs	acl: handle idmapped mounts for idmapped filesystems	2022-08-17 11:23:31 +02:00
proc	mm/smaps: don't access young/dirty bit if pte unpresent	2022-08-20 15:17:45 -07:00
pstore	EFI updates for v5.20	2022-08-03 14:38:02 -07:00
qnx4	fs: Convert block_read_full_page() to block_read_full_folio()	2022-05-09 16:21:44 -04:00
qnx6	fs: Convert mpage_readpage to mpage_read_folio	2022-05-09 16:21:44 -04:00
quota	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
ramfs
reiserfs	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
romfs	romfs: Convert romfs to read_folio	2022-05-09 16:21:46 -04:00
smbfs_common	Add various fsctl structs	2022-05-23 20:24:12 -05:00
squashfs	squashfs: don't call kmalloc in decompressors	2022-08-28 14:02:45 -07:00
sysfs	kobject: kobj_type: remove default_attrs	2022-04-05 15:39:19 +02:00
sysv	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
tracefs	tracefs: Only clobber mode/uid/gid on remount if asked	2022-09-08 17:10:54 -04:00
ubifs	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
udf	fs/buffer: Combine two submit_bh() and ll_rw_block() arguments	2022-07-14 12:14:32 -06:00
ufs	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
unicode	kbuild: unify cmd_copy and cmd_shipped	2022-02-14 10:37:32 +09:00
vboxsf	vboxsf: Convert vboxsf to read_folio	2022-05-09 16:21:46 -04:00
verity	btrfs: send: add support for fs-verity	2022-09-26 12:27:55 +02:00
xfs	New code for 6.0:	2022-08-13 13:50:11 -07:00
zonefs	New code for 6.0:	2022-08-11 13:11:49 -07:00
aio.c	iov_iter work, part 1 - isolated cleanups and optimizations.	2022-08-03 13:50:22 -07:00
anon_inodes.c
attr.c	vfs: Check the truncate maximum size in inode_newsize_ok()	2022-08-08 10:39:29 -07:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c	coredump: Snapshot the vmas in do_coredump	2022-03-08 12:55:29 -06:00
binfmt_elf_test.c	binfmt_elf: Introduce KUnit test	2022-03-03 20:38:56 -08:00
binfmt_elf.c	revert "fs/binfmt_elf: use PT_LOAD p_align values for static PIE"	2022-04-15 14:49:56 -07:00
binfmt_flat.c	binfmt_flat: Remove shared library support	2022-04-22 10:57:18 -07:00
binfmt_misc.c	Fix regression due to "fs: move binfmt_misc sysctl to its own file"	2022-02-09 09:50:02 -08:00
binfmt_script.c
buffer.c	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
char_dev.c
compat_binfmt_elf.c	binfmt_elf: Introduce KUnit test	2022-03-03 20:38:56 -08:00
coredump.c	fs: do not compare against ->llseek	2022-07-16 09:19:15 -04:00
d_path.c
dax.c	Merge branch 'for-6.0/dax' into libnvdimm-fixes	2022-09-24 18:14:12 -07:00
dcache.c	dcache: move the DCACHE_OP_COMPARE case out of the __d_lookup_rcu loop	2022-08-17 14:33:03 -07:00
direct-io.c	iov_iter: advancing variants of iov_iter_get_pages{,_alloc}()	2022-08-08 22:37:22 -04:00
drop_caches.c
eventfd.c
eventpoll.c	epoll: autoremove wakers even more aggressively	2022-07-17 17:31:40 -07:00
exec.c	Revert "fs/exec: allow to unshare a time namespace on vfork+exec"	2022-09-13 10:38:43 -07:00
fcntl.c	keep iocb_flags() result cached in struct file	2022-06-10 16:10:23 -04:00
fhandle.c
file_table.c	iov_iter work, part 1 - isolated cleanups and optimizations.	2022-08-03 13:50:22 -07:00
file.c	fix the breakage in close_fd_get_file() calling conventions change	2022-06-05 15:03:03 -04:00
filesystems.c
fs_context.c	vfs: fs_context: fix up param length parsing in legacy_parse_param	2022-01-18 09:23:19 +02:00
fs_parser.c
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c	writeback: avoid use-after-free after removing device	2022-08-28 14:02:43 -07:00
fsopen.c	uninline may_mount() and don't opencode it in fspick(2)/fsopen(2)	2022-05-19 23:25:10 -04:00
init.c
inode.c	fs: __file_remove_privs(): restore call to inode_has_no_xattr()	2022-08-18 09:39:33 +02:00
internal.h	Cleanups (and one fix) around struct mount handling.	2022-06-04 19:00:05 -07:00
ioctl.c	Fixes for 5.18-rc1:	2022-04-01 19:35:56 -07:00
Kconfig	mm: hugetlb_vmemmap: introduce the name HVO	2022-08-08 18:06:42 -07:00
Kconfig.binfmt	m68knommu: changes for linux 5.19	2022-05-30 10:56:18 -07:00
kernel_read_file.c	fs/kernel_read_file: allow to read files up-to ssize_t	2022-06-16 19:58:21 -07:00
libfs.c	fs: Convert simple_readpage to simple_read_folio	2022-05-09 16:21:44 -04:00
locks.c	locks: Fix dropped call to ->fl_release_private()	2022-08-17 15:08:58 -04:00
Makefile	io_uring: move to separate directory	2022-07-24 18:39:10 -06:00
mbcache.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
mount.h	switch try_to_unlazy_next() to __legitimize_mnt()	2022-07-05 16:18:21 -04:00
mpage.c	Folio changes for 6.0	2022-08-03 10:35:43 -07:00
namei.c	fs.setgid.v6.0	2022-08-09 09:52:28 -07:00
namespace.c	fs: require CAP_SYS_ADMIN in target namespace for idmapped mounts	2022-08-17 11:27:11 +02:00
no-block.c
nsfs.c
open.c	open: always initialize ownership fields	2022-09-20 11:57:57 +02:00
pipe.c	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
pnode.c
pnode.h
posix_acl.c	acl: handle idmapped mounts for idmapped filesystems	2022-08-17 11:23:31 +02:00
proc_namespace.c	vfs: escape hash as well	2022-06-28 13:58:05 -04:00
read_write.c	switch new_sync_{read,write}() to ITER_UBUF	2022-08-08 22:37:15 -04:00
readdir.c
remap_range.c	- The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe	2022-08-05 16:32:45 -07:00
select.c	select: Fix indefinitely sleeping task in poll_schedule_timeout()	2022-01-11 09:03:05 -08:00
seq_file.c	rxrpc: Fix locking issue	2022-05-22 21:03:01 +01:00
signalfd.c	Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace	2022-01-17 05:49:30 +02:00
splice.c	iter_to_pipe(): switch to advancing variant of iov_iter_get_pages()	2022-08-08 22:37:23 -04:00
stack.c
stat.c	RISC-V Patches for the 5.19 Merge Window, Part 1	2022-05-31 14:10:54 -07:00
statfs.c
super.c	fuse update for 6.0	2022-08-08 11:10:02 -07:00
sync.c	riscv: compat: syscall: Add compat_sys_call_table implementation	2022-04-26 13:36:25 -07:00
sysctls.c	fs: move namespace sysctls and declare fs base directory	2022-01-22 08:33:36 +02:00
timerfd.c
userfaultfd.c	mm/uffd: reset write protection when unregister with wp-mode	2022-08-20 15:17:45 -07:00
utimes.c
xattr.c	acl: move idmapped mount fixup into vfs_{g,s}etxattr()	2022-07-15 22:08:59 +02:00