linux/fs
Boris Burkov 52bb7a2166 btrfs: introduce size class to block group allocator
The aim of this patch is to reduce the fragmentation of block groups
under certain unhappy workloads. It is particularly effective when the
size of extents correlates with their lifetime, which is something we
have observed causing fragmentation in the fleet at Meta.

This patch categorizes extents into size classes:

- x < 128KiB: "small"
- 128KiB < x < 8MiB: "medium"
- x > 8MiB: "large"

and as much as possible reduces allocations of extents into block groups
that don't match the size class. This takes advantage of any (possible)
correlation between size and lifetime and also leaves behind predictable
re-usable gaps when extents are freed; small writes don't gum up bigger
holes.

Size classes are implemented in the following way:

- Mark each new block group with a size class of the first allocation
  that goes into it.

- Add two new passes to ffe: "unset size class" and "wrong size class".
  First, try only matching block groups, then try unset ones, then allow
  allocation of new ones, and finally allow mismatched block groups.

- Filtering is done just by skipping inappropriate ones, there is no
  special size class indexing.

Other solutions I considered were:

- A best fit allocator with an rb-tree. This worked well, as small
  writes didn't leak big holes from large freed extents, but led to
  regressions in ffe and write performance due to lock contention on
  the rb-tree with every allocation possibly updating it in parallel.
  Perhaps something clever could be done to do the updates in the
  background while being "right enough".

- A fixed size "working set". This prevents freeing an extent
  drastically changing where writes currently land, and seems like a
  good option too. Doesn't take advantage of size in any way.

- The same size class idea, but implemented with xarray marks. This
  turned out to be slower than looping the linked list and skipping
  wrong block groups, and is also less flexible since we must have only
  3 size classes (max #marks). With the current approach we can have as
  many as we like.

Performance testing was done via: https://github.com/josefbacik/fsperf
Of particular relevance are the new fragmentation specific tests.

A brief summary of the testing results:

- Neutral results on existing tests. There are some minor regressions
  and improvements here and there, but nothing that truly stands out as
  notable.
- Improvement on new tests where size class and extent lifetime are
  correlated. Fragmentation in these cases is completely eliminated
  and write performance is generally a little better. There is also
  significant improvement where extent sizes are just a bit larger than
  the size class boundaries.
- Regression on one new tests: where the allocations are sized
  intentionally a hair under the borders of the size classes. Results
  are neutral on the test that intentionally attacks this new scheme by
  mixing extent size and lifetime.

The full dump of the performance results can be found here:
https://bur.io/fsperf/size-class-2022-11-15.txt
(there are ANSI escape codes, so best to curl and view in terminal)

Here is a snippet from the full results for a new test which mixes
buffered writes appending to a long lived set of files and large short
lived fallocates:

bufferedappendvsfallocate results
         metric             baseline       current        stdev            diff
======================================================================================
avg_commit_ms                    31.13         29.20          2.67     -6.22%
bg_count                            14         15.60             0     11.43%
commits                          11.10         12.20          0.32      9.91%
elapsed                          27.30         26.40          2.98     -3.30%
end_state_mount_ns         11122551.90   10635118.90     851143.04     -4.38%
end_state_umount_ns           1.36e+09      1.35e+09   12248056.65     -1.07%
find_free_extent_calls       116244.30     114354.30        964.56     -1.63%
find_free_extent_ns_max      599507.20    1047168.20     103337.08     74.67%
find_free_extent_ns_mean       3607.19       3672.11        101.20      1.80%
find_free_extent_ns_min            500           512          6.67      2.40%
find_free_extent_ns_p50           2848          2876         37.65      0.98%
find_free_extent_ns_p95           4916          5000         75.45      1.71%
find_free_extent_ns_p99       20734.49      20920.48       1670.93      0.90%
frag_pct_max                     61.67             0          8.05   -100.00%
frag_pct_mean                    43.59             0          6.10   -100.00%
frag_pct_min                     25.91             0         16.60   -100.00%
frag_pct_p50                     42.53             0          7.25   -100.00%
frag_pct_p95                     61.67             0          8.05   -100.00%
frag_pct_p99                     61.67             0          8.05   -100.00%
fragmented_bg_count               6.10             0          1.45   -100.00%
max_commit_ms                    49.80            46          5.37     -7.63%
sys_cpu                           2.59          2.62          0.29      1.39%
write_bw_bytes                1.62e+08      1.68e+08   17975843.50      3.23%
write_clat_ns_mean            57426.39      54475.95       2292.72     -5.14%
write_clat_ns_p50             46950.40      42905.60       2101.35     -8.62%
write_clat_ns_p99            148070.40     143769.60       2115.17     -2.90%
write_io_kbytes                4194304       4194304             0      0.00%
write_iops                     2476.15       2556.10        274.29      3.23%
write_lat_ns_max            2101667.60    2251129.50     370556.59      7.11%
write_lat_ns_mean             59374.91      55682.00       2523.09     -6.22%
write_lat_ns_min              17353.10         16250       1646.08     -6.36%

There are some mixed improvements/regressions in most metrics along with
an elimination of fragmentation in this workload.

On the balance, the drastic 1->0 improvement in the happy cases seems
worth the mix of regressions and improvements we do observe.

Some considerations for future work:

- Experimenting with more size classes
- More hinting/search ordering work to approximate a best-fit allocator

Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
2023-02-13 17:50:34 +01:00
..
9p 9p-for-6.2-rc1 2022-12-23 11:39:18 -08:00
adfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
affs affs: initialize fsdata in affs_truncate() 2023-01-10 14:55:20 +01:00
afs rxrpc: Move call state changes from recvmsg to I/O thread 2023-01-06 09:43:33 +00:00
autofs autofs: remove unused ino field inode 2022-07-17 17:31:42 -07:00
befs befs: Convert befs_symlink_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
bfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
btrfs btrfs: introduce size class to block group allocator 2023-02-13 17:50:34 +01:00
cachefiles fscache,cachefiles: add prepare_ondemand_read() callback 2022-12-07 10:56:29 +08:00
ceph ceph: flush cap releases when the session is flushed 2023-02-07 16:55:14 +01:00
cifs cifs: Fix use-after-free in rdata->read_into_pages() 2023-02-06 22:50:25 -06:00
coda coda: Convert coda_symlink_filler() to use a folio 2022-08-02 12:34:03 -04:00
configfs configfs: fix possible memory leak in configfs_create_dir() 2022-12-02 11:11:22 +01:00
cramfs cramfs: read_mapping_page() is synchronous 2022-08-02 12:34:02 -04:00
crypto for-6.2/block-2022-12-08 2022-12-13 10:43:59 -08:00
debugfs debugfs: fix error when writing negative value to atomic_t debugfs file 2022-11-30 16:13:16 -08:00
devpts
dlm Treewide: Stop corrupting socket's task_frag 2022-12-19 17:28:49 -08:00
ecryptfs ecryptfs: use stub posix acl handlers 2022-10-20 10:13:31 +02:00
efivarfs efi: vars: prohibit reading random seed variables 2022-12-01 09:51:21 +01:00
efs efs: Convert efs symlinks to read_folio 2022-05-09 16:21:45 -04:00
erofs erofs: clean up parsing of fscache related options 2023-01-16 22:39:47 +08:00
exfat Description for this pull request: 2022-12-15 18:14:21 -08:00
exportfs exportfs: use pr_debug for unreachable debug statements 2022-11-28 12:54:45 -05:00
ext2 \n 2022-12-12 20:32:50 -08:00
ext4 ext4: make xattr char unsignedness in hash explicit 2023-01-24 12:38:45 -08:00
f2fs f2fs: let's avoid panic if extent_tree is not created 2023-01-03 08:59:06 -08:00
fat MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
freevxfs freevxfs: Kconfig: fix spelling 2023-01-31 16:44:08 -08:00
fscache fscache: Use clear_and_wake_up_bit() in fscache_create_volume_work() 2023-01-30 12:51:54 +00:00
fuse fuse: fixes after adapting to new posix acl api 2023-01-24 16:33:37 +01:00
gfs2 Revert "gfs2: stop using generic_writepages in gfs2_ail1_start_one" 2023-01-22 09:46:14 +01:00
hfs hfs/hfsplus: avoid WARN_ON() for sanity check, use proper error handling 2023-01-06 14:09:13 -08:00
hfsplus MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
hostfs hostfs: move from strlcpy with unused retval to strscpy 2022-09-19 22:46:25 +02:00
hpfs hpfs: remove ->writepage 2022-12-11 18:12:18 -08:00
hugetlbfs hugetlbfs: inode: remove unnecessary (void*) conversions 2022-11-30 15:58:56 -08:00
iomap New XFS code for 6.2: 2022-12-14 10:11:51 -08:00
isofs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
jbd2 jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeout 2022-12-08 21:49:25 -05:00
jffs2 fs: rename current get acl method 2022-10-20 10:13:27 +02:00
jfs MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
kernfs kernfs: fix all kernel-doc warnings and multiple typos 2022-11-23 19:28:26 +01:00
ksmbd ksmbd: downgrade ndr version error message to debug 2023-01-25 18:31:18 -06:00
lockd NFSD 6.2 Release Notes 2022-12-12 20:54:39 -08:00
minix vfs: open inside ->tmpfile() 2022-09-24 07:00:00 +02:00
netfs use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
nfs NFS client fixes for Linux 6.2 2023-01-07 10:38:11 -08:00
nfs_common
nfsd nfsd-6.2 fixes: 2023-01-24 12:58:47 -08:00
nilfs2 nilfs2: fix general protection fault in nilfs_btree_insert() 2023-01-11 16:14:21 -08:00
nls
notify Merge tag 'fsnotify-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2022-10-07 08:28:50 -07:00
ntfs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
ntfs3 fs/ntfs3: don't hold ni_lock when calling truncate_setsize() 2023-01-02 10:31:09 -08:00
ocfs2 Treewide: Stop corrupting socket's task_frag 2022-12-19 17:28:49 -08:00
omfs omfs: remove ->writepage 2022-12-11 18:12:18 -08:00
openpromfs fs: allocate inode by using alloc_inode_sb() 2022-03-22 15:57:03 -07:00
orangefs orangefs: four fixes from Zhang Xiaoxu and two from Colin Ian King 2022-12-14 11:16:33 -08:00
overlayfs ovl: fail on invalid uid/gid mapping at copy up 2023-01-27 16:17:19 +01:00
proc mm: hugetlb: proc: check for hugetlb shared PMD in /proc/PID/smaps 2023-01-31 16:44:09 -08:00
pstore pstore updates for v6.2-rc1-fixes 2022-12-23 11:55:54 -08:00
qnx4 fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
qnx6 fs/qnx6: delete unnecessary checks before brelse() 2022-09-11 21:55:07 -07:00
quota ext4: fix bug_on in __es_tree_search caused by bad quota inode 2022-12-08 21:49:23 -05:00
ramfs tmpfile API change 2022-10-10 19:45:17 -07:00
reiserfs lsm/stable-6.2 PR 20221212 2022-12-13 09:47:48 -08:00
romfs romfs: Convert romfs to read_folio 2022-05-09 16:21:46 -04:00
smbfs_common smb3: define missing create contexts 2022-10-05 01:55:27 -05:00
squashfs Squashfs: fix handling and sanity checking of xattr_ids count 2023-01-31 16:44:10 -08:00
sysfs kobject: kobj_type: remove default_attrs 2022-04-05 15:39:19 +02:00
sysv fs: sysv: Fix sysv_nblocks() returns wrong value 2022-12-10 14:13:37 -05:00
tracefs tracefs: Only clobber mode/uid/gid on remount if asked 2022-09-08 17:10:54 -04:00
ubifs treewide: use get_random_u32_below() instead of deprecated function 2022-11-18 02:15:15 +01:00
udf udf: initialize newblock to 0 2023-01-06 15:44:32 +01:00
ufs ufs: replace ll_rw_block() 2022-09-11 20:26:07 -07:00
unicode
vboxsf vboxsf: Convert vboxsf to read_folio 2022-05-09 16:21:46 -04:00
verity fsverity: simplify fsverity_get_digest() 2022-11-29 21:07:41 -08:00
xfs xfs: fix extent busy updating 2023-01-05 07:34:21 -08:00
zonefs zonefs: Detect append writes at invalid locations 2023-01-16 08:42:12 +09:00
aio.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
anon_inodes.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
attr.c attr: use consistent sgid stripping checks 2022-10-18 10:09:47 +02:00
bad_inode.c fs: rename current get acl method 2022-10-20 10:13:27 +02:00
binfmt_elf_fdpic.c elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size} 2023-01-05 15:12:12 +00:00
binfmt_elf_test.c
binfmt_elf.c elfcore: Add a cprm parameter to elf_core_extra_{phdrs,data_size} 2023-01-05 15:12:12 +00:00
binfmt_flat.c binfmt_flat: Remove shared library support 2022-04-22 10:57:18 -07:00
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
char_dev.c chardev: fix error handling in cdev_device_add() 2022-12-02 17:48:59 +01:00
compat_binfmt_elf.c
coredump.c coredump: Move dump_emit_page() to kill unused warning 2023-01-10 21:03:01 -05:00
d_path.c d_path.c: typo fix... 2022-08-20 11:34:33 -04:00
dax.c fsdax,xfs: port unshare to fsdax 2022-12-11 18:12:17 -08:00
dcache.c tmpfile API change 2022-10-10 19:45:17 -07:00
direct-io.c block: remove PSI accounting from the bio layer 2022-09-20 08:24:38 -06:00
drop_caches.c
eventfd.c eventfd: provide a eventfd_signal_mask() helper 2022-11-22 06:07:55 -07:00
eventpoll.c eventpoll: add EPOLL_URING_WAKE poll wakeup flag 2022-11-21 07:45:29 -07:00
exec.c fs.vfsuid.conversion.v6.2 2022-12-12 19:20:05 -08:00
fcntl.c keep iocb_flags() result cached in struct file 2022-06-10 16:10:23 -04:00
fhandle.c do_sys_name_to_handle(): constify path 2022-09-01 17:36:39 -04:00
file_table.c locks: fix TOCTOU race when granting write lease 2022-08-16 10:59:54 -04:00
file.c fs: use acquire ordering in __fget_light() 2022-10-31 15:30:11 -04:00
filesystems.c
fs_context.c
fs_parser.c ext4: journal_path mount options should follow links 2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c for-6.2/writeback-2022-12-12 2022-12-15 18:09:48 -08:00
fsopen.c uninline may_mount() and don't opencode it in fspick(2)/fsopen(2) 2022-05-19 23:25:10 -04:00
init.c
inode.c fs.vfsuid.conversion.v6.2 2022-12-12 19:20:05 -08:00
internal.h fs.ovl.setgid.v6.2 2022-12-12 19:03:10 -08:00
ioctl.c Fixes for 5.18-rc1: 2022-04-01 19:35:56 -07:00
Kconfig hugetlb: make hugetlb depends on SYSFS or SYSCTL 2022-09-11 20:26:10 -07:00
Kconfig.binfmt Xtensa updates for v6.1 2022-10-10 14:21:11 -07:00
kernel_read_file.c fs/kernel_read_file: allow to read files up-to ssize_t 2022-06-16 19:58:21 -07:00
libfs.c libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value 2022-11-30 16:13:16 -08:00
locks.c Add process name and pid to locks warning 2022-11-30 05:08:10 -05:00
Makefile a.out: Remove the a.out implementation 2022-09-27 07:11:02 -07:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2022-12-08 21:49:25 -05:00
mount.h switch try_to_unlazy_next() to __legitimize_mnt() 2022-07-05 16:18:21 -04:00
mpage.c Folio changes for 6.0 2022-08-03 10:35:43 -07:00
namei.c Landlock updates for v6.2-rc1 2022-12-13 09:14:50 -08:00
namespace.c fs.idmapped.mnt_idmap.v6.2 2022-12-12 19:30:18 -08:00
no-block.c
nsfs.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
open.c Landlock updates for v6.2-rc1 2022-12-13 09:14:50 -08:00
pipe.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
pnode.c pnode: terminate at peers of source 2022-12-21 14:45:25 +01:00
pnode.h
posix_acl.c fs.idmapped.mnt_idmap.v6.2 2022-12-12 19:30:18 -08:00
proc_namespace.c vfs: escape hash as well 2022-06-28 13:58:05 -04:00
read_write.c iov_iter work; most of that is about getting rid of 2022-12-12 18:29:54 -08:00
readdir.c Change calling conventions for filldir_t 2022-08-17 17:25:04 -04:00
remap_range.c New VFS code for 6.2: 2022-12-13 10:26:38 -08:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
signalfd.c
splice.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
stack.c
stat.c fs: use type safe idmapping helpers 2022-10-26 10:02:34 +02:00
statfs.c
super.c misc pile 2022-12-12 18:38:47 -08:00
sync.c riscv: compat: syscall: Add compat_sys_call_table implementation 2022-04-26 13:36:25 -07:00
sysctls.c
timerfd.c
userfaultfd.c mm/userfaultfd: enable writenotify while userfaultfd-wp is enabled for a VMA 2023-01-11 16:14:20 -08:00
utimes.c
xattr.c fs.xattr.simple.rework.rbtree.rwlock.v6.2 2022-12-13 10:08:36 -08:00