linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-01 00:21:32 +00:00

Filipe Manana 956a17d9d0 btrfs: add a shrinker for extent maps

Extent maps are used either to represent existing file extent items, or to represent new extents that are going to be written, in which case the respective file extent items are created when the ordered extent completes.

We currently don't have any limit on how many extent maps we can have, neither per inode nor globally. Most of the time this is not too noticeable because extent maps are removed in the following situations:

1) When evicting an inode;

2) When releasing folios (pages) through the btrfs_release_folio() address space operation callback.

   However we won't release extent maps in the folio range if the folio is either dirty or under writeback, or if the inode's i_size is less than or equal to 16M (see try_release_extent_mapping()). This 16M i_size constraint was added back in 2008 with commit `70dec8079d` ("Btrfs: extent_io and extent_state optimizations"), but there's no explanation of why we have it or why the value is 16M.

This means that for buffered IO we can reach an OOM situation due to too many extent maps if either of the following happens:

1) There's a set of tasks constantly doing IO on many files with a size not larger than 16M, especially if they keep the files open for very long periods, therefore preventing inode eviction.

   This requires a really high number of such files, many non-mergeable extent maps (due to random 4K writes for example) and a machine with very little memory;

2) There's a set of tasks constantly doing random write IO (therefore creating many non-mergeable extent maps) on files and keeping them open for long periods of time, so inode eviction doesn't happen and there are always a lot of dirty pages or pages under writeback, preventing btrfs_release_folio() from releasing the respective extent maps.

This second case was actually reported in the thread pointed to by the Link tag below, and it requires a very large file under heavy IO and a machine with very little RAM, which is probably unlikely in practice in a real-world use case.

However, with direct IO this is much easier to trigger, because the page cache is not used and therefore btrfs_release_folio() is never called. This means extent maps are dropped only when evicting the inode, so if we have tasks that keep a file descriptor open and keep doing IO on a very large file (or files), we can exhaust memory due to an unbounded number of extent maps. This happens especially easily if we have a huge file with millions of small extents whose extent maps are not mergeable (non-contiguous offsets and disk locations). This was reported in that thread with the following fio test:

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/sdj
    MNT=/mnt/sdj
    MOUNT_OPTIONS="-o ssd"
    MKFS_OPTIONS=""

    cat <<EOF > /tmp/fio-job.ini
    [global]
    name=fio-rand-write
    filename=$MNT/fio-rand-write
    rw=randwrite
    bs=4K
    direct=1
    numjobs=16
    fallocate=none
    time_based
    runtime=90000

    [file1]
    size=300G
    ioengine=libaio
    iodepth=16

    EOF

    umount $MNT &> /dev/null
    mkfs.btrfs -f $MKFS_OPTIONS $DEV
    mount $MOUNT_OPTIONS $DEV $MNT

    fio /tmp/fio-job.ini
    umount $MNT

Monitoring the btrfs_extent_map slab while running the test with:

    $ watch -d -n 1 'cat /sys/kernel/slab/btrfs_extent_map/objects \
                    /sys/kernel/slab/btrfs_extent_map/total_objects'

shows the number of active and total extent maps skyrocketing to tens of millions, and on systems with a small amount of memory it's easy and quick to get into an OOM situation, as reported in that thread.

So to avoid this issue, add a shrinker that removes extent maps, as long as they are not pinned, and that takes proper care with any concurrent fsync to avoid missing extents (setting the full sync flag while in the middle of a fast fsync).
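The watch command above can also be approximated with a small script that logs how fast the slab grows. This is a minimal sketch, not part of the commit: it assumes each sysfs attribute begins with the object count (optionally followed by per-node fields such as `N0=...`), and the helper names `parse_slab_count`, `read_slab_counts` and `watch` are hypothetical. Reading these files typically requires root and a kernel that exposes the slab sysfs attributes.

```python
import time
from pathlib import Path

# Path taken from the commit message; it only exists while btrfs is loaded.
SLAB_DIR = Path("/sys/kernel/slab/btrfs_extent_map")


def parse_slab_count(text):
    """Return the leading integer of a slab attribute, e.g. '1234 N0=1234'.

    Assumption: the first whitespace-separated field is the total count.
    """
    return int(text.split()[0])


def read_slab_counts(slab_dir=SLAB_DIR):
    """Read (objects, total_objects) from the given slab directory."""
    objects = parse_slab_count((slab_dir / "objects").read_text())
    total = parse_slab_count((slab_dir / "total_objects").read_text())
    return objects, total


def watch(slab_dir=SLAB_DIR, samples=10, interval=1.0):
    """Print a fixed number of samples plus the change in active objects."""
    previous = None
    for _ in range(samples):
        objects, total = read_slab_counts(slab_dir)
        delta = 0 if previous is None else objects - previous
        print(f"objects={objects} total_objects={total} delta={delta:+d}")
        previous = objects
        time.sleep(interval)


if __name__ == "__main__":
    try:
        watch()
    except OSError as exc:
        # Missing directory or insufficient permissions: report and stop.
        print(f"cannot read {SLAB_DIR}: {exc}")
```

Running it alongside the fio job gives one line per second; a steadily positive delta is the unbounded growth the commit message describes.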

This shrinker is triggered through the nr_cached_objects and free_cached_objects callbacks of struct super_operations. The shrinker iterates over all roots and over all inodes of each root, and keeps track of the last scanned root and inode, so that the next time it runs, it starts from that root and from the next inode. This is similar to what xfs does for its inode reclaim (it implements those callbacks, and cycles through inodes by starting from where it ended last time).

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2024-05-07 21:31:06 +02:00
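The resumable scan described in the commit message can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel code: the classes `ExtentMap`, `Inode`, `Root` and `ExtentMapShrinker` are hypothetical, `count()` stands in for nr_cached_objects, `scan()` stands in for free_cached_objects, and the fsync handling mentioned in the commit is not modeled. The scan budget is checked once per inode, so a call may examine slightly more extent maps than requested.

```python
from dataclasses import dataclass, field


@dataclass
class ExtentMap:
    start: int
    pinned: bool = False  # pinned maps are skipped by the shrinker


@dataclass
class Inode:
    ino: int
    extent_maps: list = field(default_factory=list)


@dataclass
class Root:
    root_id: int
    inodes: list = field(default_factory=list)


class ExtentMapShrinker:
    """Toy model: walk roots and inodes, resuming where the last run stopped."""

    def __init__(self, roots):
        self.roots = sorted(roots, key=lambda r: r.root_id)
        self.last_root = 0  # root id to resume from
        self.last_ino = 0   # first inode number to visit in that root

    def count(self):
        # Stand-in for nr_cached_objects: how many extent maps exist.
        return sum(len(i.extent_maps) for r in self.roots for i in r.inodes)

    def scan(self, nr_to_scan):
        # Stand-in for free_cached_objects: drop unpinned extent maps.
        freed = 0
        scanned = 0
        for root in self.roots:
            if root.root_id < self.last_root:
                continue
            min_ino = self.last_ino if root.root_id == self.last_root else 0
            for inode in sorted(root.inodes, key=lambda i: i.ino):
                if inode.ino < min_ino:
                    continue
                if scanned >= nr_to_scan:
                    # Budget used up: remember this inode for the next run.
                    self.last_root, self.last_ino = root.root_id, inode.ino
                    return freed
                scanned += len(inode.extent_maps)
                kept = [em for em in inode.extent_maps if em.pinned]
                freed += len(inode.extent_maps) - len(kept)
                inode.extent_maps = kept
        # Every root was visited: wrap around to the beginning.
        self.last_root, self.last_ino = 0, 0
        return freed


if __name__ == "__main__":
    roots = [
        Root(5, [Inode(257, [ExtentMap(0), ExtentMap(4096, pinned=True)]),
                 Inode(258, [ExtentMap(0)])]),
        Root(6, [Inode(257, [ExtentMap(0), ExtentMap(4096)])]),
    ]
    shrinker = ExtentMapShrinker(roots)
    print("before:", shrinker.count())
    print("freed:", shrinker.scan(2), "cursor:",
          (shrinker.last_root, shrinker.last_ino))
    print("freed:", shrinker.scan(100), "after:", shrinker.count())
```

Keeping a cursor rather than restarting from the first root each time spreads reclaim across all inodes, so inodes near the start of the walk are not repeatedly targeted while later ones are never reached.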
..
9p	fs/9p: mitigate inode collisions	2024-04-22 15:34:27 +00:00
adfs	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
affs	affs: remove SLAB_MEM_SPREAD flag usage	2024-02-26 11:36:28 +01:00
afs	afs: Fix occasional rmdir-then-VNOVNODE with generic/011	2024-03-14 12:13:21 +01:00
autofs	dcache stuff for this cycle	2024-01-11 20:11:35 -08:00
bcachefs	bcachefs: fix integer conversion bug	2024-04-28 21:34:29 -04:00
befs	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
bfs	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
btrfs	btrfs: add a shrinker for extent maps	2024-05-07 21:31:06 +02:00
cachefiles	cachefiles: fix memory leak in cachefiles_add_cache()	2024-02-20 09:46:07 +01:00
ceph	ceph: switch to use cap_delay_lock for the unlink delay list	2024-04-11 22:56:28 +02:00
coda	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
configfs
cramfs	fs,block: yield devices early	2024-03-27 13:17:15 +01:00
crypto	fscrypt updates for 6.9	2024-03-12 13:17:36 -07:00
debugfs	debugfs: fix wait/cancellation handling during remove	2024-03-07 22:08:15 +00:00
devpts	fs: Remove the now superfluous sentinel elements from ctl_table array	2023-12-28 04:57:57 -08:00
dlm	dlm for 6.9	2024-03-18 15:39:48 -07:00
ecryptfs	Merge tag 'exportfs-6.9' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/cel/linux	2024-01-23 17:56:30 +01:00
efivarfs	efivarfs: Drop 'duplicates' bool parameter on efivar_init()	2024-02-25 09:43:39 +01:00
efs	efs: remove SLAB_MEM_SPREAD flag usage	2024-02-27 11:21:33 +01:00
erofs	erofs: reliably distinguish block based and fscache mode	2024-04-28 20:36:52 +08:00
exfat	Description for this pull request:	2024-03-21 09:47:12 -07:00
exportfs	fs: Create a generic is_dot_dotdot() utility	2024-01-23 10:58:56 -05:00
ext2	\n	2024-03-13 14:30:58 -07:00
ext4	fs,block: yield devices early	2024-03-27 13:17:15 +01:00
f2fs	fs,block: yield devices early	2024-03-27 13:17:15 +01:00
fat	- Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min	2024-03-14 18:03:09 -07:00
freevxfs	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
fuse	cuse: add kernel-doc comments to cuse_process_init_reply()	2024-04-15 11:02:10 +02:00
gfs2	gfs2 fix	2024-03-25 10:53:39 -07:00
hfs	hfs: really remove hfs_writepage	2023-12-29 11:58:34 -08:00
hfsplus	vfs-6.9.misc	2024-03-11 09:38:17 -07:00
hostfs	hostfs: use d_splice_alias() calling conventions to simplify failure exits	2023-12-21 12:51:00 -05:00
hpfs	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
hugetlbfs	vfs-6.9.misc	2024-03-11 09:38:17 -07:00
iomap	vfs-6.9.rw_hint	2024-03-04 18:35:21 +01:00
isofs	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
jbd2	jbd2: abort journal when detecting metadata writeback error of fs dev	2024-01-04 23:42:21 -05:00
jffs2	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
jfs	fs,block: yield devices early	2024-03-27 13:17:15 +01:00
kernfs	kernfs: annotate different lockdep class for of->mutex of writable files	2024-04-14 06:55:46 -04:00
lockd	NFSD 6.9 Release Notes	2024-03-12 14:27:37 -07:00
minix	minix: remove SLAB_MEM_SPREAD flag usage	2024-02-27 11:21:32 +01:00
netfs	netfs: Fix the pre-flush when appending to a file in writethrough mode	2024-04-26 14:56:18 +02:00
nfs	NFS client bugfixes for Linux 6.9	2024-04-29 12:07:37 -07:00
nfs_common
nfsd	nfsd-6.9 fixes:	2024-04-29 14:22:24 -07:00
nilfs2	nilfs2: fix OOB in nilfs_set_de_type	2024-04-16 15:39:52 -07:00
nls
notify	fanotify: allow freeze when waiting response for permission events	2024-03-07 12:59:51 +01:00
ntfs3	ntfs3: add legacy ntfs file operations	2024-04-23 09:39:07 +02:00
ocfs2	- Kuan-Wei Chiu has developed the well-named series "lib min_heap: Min	2024-03-14 18:03:09 -07:00
omfs
openpromfs	openpromfs: remove SLAB_MEM_SPREAD flag usage	2024-02-27 11:21:32 +01:00
orangefs	Julia Lawall reported this null pointer dereference, this should fix it.	2024-02-14 15:57:53 -05:00
overlayfs	ovl: relax WARN_ON in ovl_verify_area()	2024-03-17 15:59:41 +02:00
proc	mm: support page_mapcount() on page_has_type() pages	2024-04-24 19:34:26 -07:00
pstore	pstore/zone: Don't clear memory twice	2024-03-09 12:33:22 -08:00
qnx4	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
qnx6	qnx6: remove SLAB_MEM_SPREAD flag usage	2024-02-27 11:21:32 +01:00
quota	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
ramfs	ramfs: Initialize security of in-memory inodes	2024-01-26 09:08:16 -08:00
reiserfs	fs,block: yield devices early	2024-03-27 13:17:15 +01:00
romfs	fs,block: yield devices early	2024-03-27 13:17:15 +01:00
smb	smb3: fix lock ordering potential deadlock in cifs_sync_mid_result	2024-04-25 12:49:50 -05:00
squashfs	Squashfs: check the inode number is not the invalid value of zero	2024-04-16 15:39:50 -07:00
sysfs	fs: sysfs: Fix reference leak in sysfs_break_active_protection()	2024-04-11 15:16:48 +02:00
sysv	sysv: remove SLAB_MEM_SPREAD flag usage	2024-02-27 11:21:31 +01:00
tracefs	eventfs: Have "events" directory get permissions from its parent	2024-05-04 04:25:37 -04:00
ubifs	This pull request contains updates for UBI and UBIFS:	2024-03-21 15:09:29 -07:00
udf	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
ufs	mm, slab: remove last vestiges of SLAB_MEM_SPREAD	2024-03-12 20:32:19 -07:00
unicode
vboxsf	vboxsf: explicitly deny setlease attempts	2024-04-03 16:06:39 +02:00
verity	Networking changes for 6.9.	2024-03-12 17:44:08 -07:00
xfs	Bug fixes for 6.9-rc3:	2024-04-06 09:14:18 -07:00
zonefs	zonefs: Use str_plural() to fix Coccinelle warning	2024-04-10 07:23:47 +09:00
aio.c	aio: Fix null ptr deref in aio_complete() wakeup	2024-04-05 11:20:28 +02:00
anon_inodes.c	Merge branch 'kvm-guestmemfd' into HEAD	2023-11-14 08:31:31 -05:00
attr.c	lsm/stable-6.9 PR 20240312	2024-03-12 20:03:34 -07:00
backing-file.c	fs: Use KMEM_CACHE instead of kmem_cache_create	2024-02-02 13:11:50 +01:00
bad_inode.c
binfmt_elf_fdpic.c	binfmt: replace deprecated strncpy	2024-03-21 20:20:52 -07:00
binfmt_elf_test.c
binfmt_elf.c
binfmt_flat.c
binfmt_misc.c	execve updates for v6.7-rc1	2023-10-30 19:28:19 -10:00
binfmt_script.c
buffer.c	vfs-6.9.iomap	2024-03-11 10:07:03 -07:00
char_dev.c	As usual, lots of singleton and doubleton patches all over the tree and	2023-11-02 20:53:31 -10:00
compat_binfmt_elf.c
coredump.c	iov_iter: get rid of 'copy_mc' flag	2024-03-06 10:52:12 +01:00
d_path.c
dax.c	fs : Fix warning using plain integer as NULL	2023-11-18 15:00:01 +01:00
dcache.c	vfs-6.9.misc	2024-03-11 09:38:17 -07:00
direct-io.c	block, fs: Restore the per-bio/request data lifetime fields	2024-02-06 14:31:05 +01:00
drop_caches.c
eventfd.c	eventfd: strictly check the count parameter of eventfd_write to avoid inputting illegal strings	2024-02-08 10:12:26 +01:00
eventpoll.c	epoll: be better about file lifetimes	2024-05-05 14:00:48 -07:00
exec.c	execve fixes for v6.9-rc2	2024-03-27 09:57:30 -07:00
fcntl.c	vfs-6.9.iomap	2024-03-11 10:07:03 -07:00
fhandle.c	do_sys_name_to_handle(): use kzalloc() to fix kernel-infoleak	2024-01-22 15:33:38 +01:00
file_table.c	lsm/stable-6.9 PR 20240312	2024-03-12 20:03:34 -07:00
file.c	file: remove __receive_fd()	2023-12-12 14:24:14 +01:00
filesystems.c
fs_context.c
fs_parser.c	__fs_parse: Correct a documentation comment	2024-02-02 13:11:50 +01:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c	writeback: move wb_wakeup_delayed defination to fs-writeback.c	2024-01-22 15:33:38 +01:00
fsopen.c
init.c
inode.c	bcachefs updates for 6.9	2024-03-15 09:00:09 -07:00
internal.h	pidfs: remove config option	2024-03-13 12:53:53 -07:00
ioctl.c	fs: Return ENOTTY directly if FS_IOC_GETUUID or FS_IOC_GETFSSYSFSPATH fail	2024-04-09 12:03:49 +02:00
Kconfig	- Sumanth Korikkar has taught s390 to allocate hotplug-time page frames	2024-03-14 17:43:30 -07:00
Kconfig.binfmt
kernel_read_file.c
libfs.c	pidfs: remove config option	2024-03-13 12:53:53 -07:00
locks.c	filelock: fix deadlock detection in POSIX locking	2024-02-20 09:53:33 +01:00
Makefile	vfs-6.9.pidfd	2024-03-11 10:21:06 -07:00
mbcache.c	vfs: remove SLAB_MEM_SPREAD flag usage	2024-02-27 11:21:31 +01:00
mnt_idmapping.c	fs/mnt_idmapping.c: Return -EINVAL when no map is written	2024-02-08 10:12:37 +01:00
mount.h	mounts: keep list of mounts in an rbtree	2023-11-18 14:56:16 +01:00
mpage.c	block, fs: Restore the per-bio/request data lifetime fields	2024-02-06 14:31:05 +01:00
namei.c	security: Place security_path_post_mknod() where the original IMA call was	2024-04-03 10:21:32 -07:00
namespace.c	fs: relax mount_setattr() permission checks	2024-02-07 21:16:29 +01:00
nsfs.c	pidfs: remove config option	2024-03-13 12:53:53 -07:00
open.c	lsm/stable-6.9 PR 20240312	2024-03-12 20:03:34 -07:00
pidfs.c	pidfs: remove config option	2024-03-13 12:53:53 -07:00
pipe.c	fs/pipe: Convert to lockdep_cmp_fn	2024-02-02 13:11:49 +01:00
pnode.c	mounts: keep list of mounts in an rbtree	2023-11-18 14:56:16 +01:00
pnode.h
posix_acl.c	lsm/stable-6.9 PR 20240312	2024-03-12 20:03:34 -07:00
proc_namespace.c	namespace: extract show_path() helper	2023-11-18 14:56:16 +01:00
read_write.c	fsnotify: optionally pass access range in file permission hooks	2023-12-12 16:20:02 +01:00
readdir.c	fsnotify: optionally pass access range in file permission hooks	2023-12-12 16:20:02 +01:00
remap_range.c	remap_range: merge do_clone_file_range() into vfs_clone_file_range()	2024-02-06 17:07:21 +01:00
select.c	fs/select: rework stack allocation hack for clang	2024-02-20 09:23:52 +01:00
seq_file.c
signalfd.c
splice.c	fs: use splice_copy_file_range() inline helper	2023-12-12 16:20:02 +01:00
stack.c
stat.c	vfs-6.8.mount	2024-01-08 10:57:34 -08:00
statfs.c
super.c	fs,block: yield devices early	2024-03-27 13:17:15 +01:00
sync.c
sysctls.c	fs: Remove the now superfluous sentinel elements from ctl_table array	2023-12-28 04:57:57 -08:00
timerfd.c
userfaultfd.c	userfaultfd: use per-vma locks in userfaultfd operations	2024-02-22 15:27:20 -08:00
utimes.c
xattr.c	evm: Move to LSM infrastructure	2024-02-15 23:43:47 -05:00