linux/fs
Filipe Manana 5e548b3201 btrfs: do not set the full sync flag on the inode during page release
When removing an extent map at try_release_extent_mapping(), called through
the page release callback (btrfs_releasepage()), we always set the full
sync flag on the inode, which forces the next fsync to use a slower code
path.

This hurts performance for workloads that dirty an amount of data that
exceeds or is very close to the system's RAM memory and do frequent fsync
operations (like database servers can for example). In particular if there
are concurrent fsyncs against different files, by falling back to a full
fsync we do a lot more checksum lookups in the checksums btree, as we do
it for all the extents created in the current transaction, instead of only
the new ones since the last fsync. These checksums lookups not only take
some time but, more importantly, they also cause contention on the
checksums btree locks due to the concurrency with checksum insertions in
the btree by ordered extents from other inodes.

We actually don't need to set the full sync flag on the inode, because we
only remove extent maps that are in the list of modified extents if they
were created in a past transaction, in which case an fsync skips them as
it's pointless to log them. So stop setting the full fsync flag on the
inode whenever we remove an extent map.

This patch is part of a patchset that consists of 3 patches, which have
the following subjects:

1/3 btrfs: fix race between page release and a fast fsync
2/3 btrfs: release old extent maps during page release
3/3 btrfs: do not set the full sync flag on the inode during page release

Performance tests were ran against a branch (misc-next) containing the
whole patchset. The test exercises a workload where there are multiple
processes writing to files and fsyncing them (each writing and fsyncing
its own file), and in total the amount of data dirtied ranges from 2x to
4x the system's RAM memory (16GiB), so that the page release callback is
invoked frequently.

The following script, using fio, was used to perform the tests:

  $ cat test-fsync.sh
  #!/bin/bash

  DEV=/dev/sdk
  MNT=/mnt/sdk
  MOUNT_OPTIONS="-o ssd"
  MKFS_OPTIONS="-d single -m single"

  if [ $# -ne 3 ]; then
      echo "Use $0 NUM_JOBS FILE_SIZE FSYNC_FREQ"
      exit 1
  fi

  NUM_JOBS=$1
  FILE_SIZE=$2
  FSYNC_FREQ=$3

  cat <<EOF > /tmp/fio-job.ini
  [writers]
  rw=write
  fsync=$FSYNC_FREQ
  fallocate=none
  group_reporting=1
  direct=0
  bs=64k
  ioengine=sync
  size=$FILE_SIZE
  directory=$MNT
  numjobs=$NUM_JOBS
  thread
  EOF

  echo "Using config:"
  echo
  cat /tmp/fio-job.ini
  echo

  mkfs.btrfs -f $MKFS_OPTIONS $DEV &> /dev/null
  mount $MOUNT_OPTIONS $DEV $MNT
  fio /tmp/fio-job.ini
  umount $MNT

The tests were performed for different numbers of jobs, file sizes and
fsync frequency. A qemu VM using kvm was used, with 8 cores (the host has
12 cores, with cpu governance set to performance mode on all cores), 16GiB
of ram (the host has 64GiB) and using a NVMe device directly (without an
intermediary filesystem in the host). While running the tests, the host
was not used for anything else, to avoid disturbing the tests.

The obtained results were the following, and the last line printed by
fio is pasted (includes aggregated throughput and test run time).

    *****************************************************
    ****     1 job, 32GiB file, fsync frequency 1     ****
    *****************************************************

Before patchset:

WRITE: bw=29.1MiB/s (30.5MB/s), 29.1MiB/s-29.1MiB/s (30.5MB/s-30.5MB/s), io=32.0GiB (34.4GB), run=1127557-1127557msec

After patchset:

WRITE: bw=29.3MiB/s (30.7MB/s), 29.3MiB/s-29.3MiB/s (30.7MB/s-30.7MB/s), io=32.0GiB (34.4GB), run=1119042-1119042msec
(+0.7% throughput, -0.8% run time)

    *****************************************************
    ****     2 jobs, 16GiB files, fsync frequency 1   ****
    *****************************************************

Before patchset:

WRITE: bw=33.5MiB/s (35.1MB/s), 33.5MiB/s-33.5MiB/s (35.1MB/s-35.1MB/s), io=32.0GiB (34.4GB), run=979000-979000msec

After patchset:

WRITE: bw=39.9MiB/s (41.8MB/s), 39.9MiB/s-39.9MiB/s (41.8MB/s-41.8MB/s), io=32.0GiB (34.4GB), run=821283-821283msec
(+19.1% throughput, -16.1% runtime)

    *****************************************************
    ****     4 jobs, 8GiB files, fsync frequency 1    ****
    *****************************************************

Before patchset:

WRITE: bw=52.1MiB/s (54.6MB/s), 52.1MiB/s-52.1MiB/s (54.6MB/s-54.6MB/s), io=32.0GiB (34.4GB), run=629130-629130msec

After patchset:

WRITE: bw=71.8MiB/s (75.3MB/s), 71.8MiB/s-71.8MiB/s (75.3MB/s-75.3MB/s), io=32.0GiB (34.4GB), run=456357-456357msec
(+37.8% throughput, -27.5% runtime)

    *****************************************************
    ****     8 jobs, 4GiB files, fsync frequency 1    ****
    *****************************************************

Before patchset:

WRITE: bw=76.1MiB/s (79.8MB/s), 76.1MiB/s-76.1MiB/s (79.8MB/s-79.8MB/s), io=32.0GiB (34.4GB), run=430708-430708msec

After patchset:

WRITE: bw=133MiB/s (140MB/s), 133MiB/s-133MiB/s (140MB/s-140MB/s), io=32.0GiB (34.4GB), run=245458-245458msec
(+74.7% throughput, -43.0% run time)

    *****************************************************
    ****    16 jobs, 2GiB files, fsync frequency 1    ****
    *****************************************************

Before patchset:

WRITE: bw=74.7MiB/s (78.3MB/s), 74.7MiB/s-74.7MiB/s (78.3MB/s-78.3MB/s), io=32.0GiB (34.4GB), run=438625-438625msec

After patchset:

WRITE: bw=184MiB/s (193MB/s), 184MiB/s-184MiB/s (193MB/s-193MB/s), io=32.0GiB (34.4GB), run=177864-177864msec
(+146.3% throughput, -59.5% run time)

    *****************************************************
    ****    32 jobs, 2GiB files, fsync frequency 1    ****
    *****************************************************

Before patchset:

WRITE: bw=72.6MiB/s (76.1MB/s), 72.6MiB/s-72.6MiB/s (76.1MB/s-76.1MB/s), io=64.0GiB (68.7GB), run=902615-902615msec

After patchset:

WRITE: bw=227MiB/s (238MB/s), 227MiB/s-227MiB/s (238MB/s-238MB/s), io=64.0GiB (68.7GB), run=288936-288936msec
(+212.7% throughput, -68.0% run time)

    *****************************************************
    ****    64 jobs, 1GiB files, fsync frequency 1    ****
    *****************************************************

Before patchset:

WRITE: bw=98.8MiB/s (104MB/s), 98.8MiB/s-98.8MiB/s (104MB/s-104MB/s), io=64.0GiB (68.7GB), run=663126-663126msec

After patchset:

WRITE: bw=294MiB/s (308MB/s), 294MiB/s-294MiB/s (308MB/s-308MB/s), io=64.0GiB (68.7GB), run=222940-222940msec
(+197.6% throughput, -66.4% run time)

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-07-27 12:55:48 +02:00
..
9p 9p: read only once on O_NONBLOCK 2020-03-27 09:29:56 +00:00
adfs docs: filesystems: fix renamed references 2020-04-20 15:45:22 -06:00
affs docs: filesystems: fix renamed references 2020-04-20 15:45:22 -06:00
afs afs: Fix interruption of operations 2020-07-15 15:49:04 -07:00
autofs autofs: switch to kernel_write 2020-07-08 08:27:56 +02:00
befs
bfs docs: filesystems: fix renamed references 2020-04-20 15:45:22 -06:00
btrfs btrfs: do not set the full sync flag on the inode during page release 2020-07-27 12:55:48 +02:00
cachefiles cachefiles: switch to kernel_write 2020-07-08 08:27:56 +02:00
ceph ceph: skip checking caps when session reconnecting and releasing reqs 2020-06-01 13:22:53 +02:00
cifs Revert "cifs: Fix the target file was deleted when rename failed." 2020-07-23 15:44:11 -05:00
coda docs: filesystems: convert coda.txt to ReST 2020-05-05 09:22:21 -06:00
configfs A fair amount of stuff this time around, dominated by yet another massive 2020-06-01 15:45:27 -07:00
cramfs docs: filesystems: fix renamed references 2020-04-20 15:45:22 -06:00
crypto fscrypt updates for 5.8 2020-06-01 12:10:17 -07:00
debugfs Merge 5.7-rc3 into driver-core-next 2020-04-27 09:34:55 +02:00
devpts
dlm dlm for 5.8 2020-06-05 16:43:16 -07:00
ecryptfs A fair amount of stuff this time around, dominated by yet another massive 2020-06-01 15:45:27 -07:00
efivarfs efi/efivars: Expose RT service availability via efivars abstraction 2020-07-09 10:14:29 +03:00
efs
erofs erofs: fix partially uninitialized misuse in z_erofs_onlinepage_fixup 2020-06-24 09:47:44 +08:00
exfat exfat: fix name_hash computation on big endian systems 2020-07-21 10:44:19 +09:00
exportfs
ext2 mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
ext4 This is the second round of ext4 commits for 5.8 merge window. It 2020-06-15 09:32:10 -07:00
f2fs f2fs-for-5.8-rc1 2020-06-09 11:28:59 -07:00
fat fat: improve the readahead for FAT entries 2020-06-04 19:06:25 -07:00
freevxfs
fscache Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-06-03 16:27:18 -07:00
fuse fuse: Fix parameter for FS_IOC_{GET,SET}FLAGS 2020-07-15 14:18:20 +02:00
gfs2 gfs2: Rework read and page fault locking 2020-07-07 23:40:12 +02:00
hfs for-5.8/block-2020-06-01 2020-06-02 15:29:19 -07:00
hfsplus block: remove the error_sector argument to blkdev_issue_flush 2020-05-22 08:45:46 -06:00
hostfs hostfs: Use kasprintf() instead of fixed buffer formatting 2020-03-29 23:23:00 +02:00
hpfs hpfs: fix warning due to superfluous semicolon 2020-06-06 10:08:17 -07:00
hugetlbfs mmap locking API: convert mmap_sem API comments 2020-06-09 09:39:14 -07:00
iomap New code for 5.8: 2020-06-13 12:44:30 -07:00
isofs for-5.8/block-2020-06-01 2020-06-02 15:29:19 -07:00
jbd2 This is the second round of ext4 commits for 5.8 merge window. It 2020-06-15 09:32:10 -07:00
jffs2 jffs2: Replace zero-length array with flexible-array 2020-06-15 23:08:31 -05:00
jfs Replace zero-length array in JFS 2020-06-02 20:11:35 -07:00
kernfs mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
lockd
minix
nfs SUNRPC reverting d03727b248 ("NFSv4 fix CLOSE not waiting for direct IO compeletion") 2020-07-17 14:47:38 -04:00
nfs_common
nfsd nfsd4: fix NULL dereference in nfsd/clients display code 2020-07-22 16:47:14 -04:00
nilfs2 nilfs2: fix null pointer dereference at nilfs_segctor_do_construct() 2020-06-10 19:14:17 -07:00
nls treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
notify treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
ntfs Merge branch 'akpm' (patches from Andrew) 2020-06-02 12:21:36 -07:00
ocfs2 ocfs2: fix value of OCFS2_INVALID_SLOT 2020-06-26 00:27:37 -07:00
omfs fs: convert mpage_readpages to mpage_readahead 2020-06-02 10:59:07 -07:00
openpromfs
orangefs orangefs: a conversion and a cleanup... 2020-06-05 16:44:36 -07:00
overlayfs ovl: fix lookup of indexed hardlinks with metacopy 2020-07-16 07:24:47 +02:00
proc Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-07-03 23:20:14 -07:00
pstore Merge branch 'uaccess.__copy_from_user' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-06-01 16:18:46 -07:00
qnx4
qnx6 fs: convert mpage_readpages to mpage_readahead 2020-06-02 10:59:07 -07:00
quota sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
ramfs
reiserfs \n 2020-06-04 13:53:10 -07:00
romfs treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
squashfs squashfs: fix length field overlap check in metadata reading 2020-07-24 12:42:41 -07:00
sysfs RDMA 5.8 merge window pull request 2020-06-05 14:05:57 -07:00
sysv docs: filesystems: fix renamed references 2020-04-20 15:45:22 -06:00
tracefs
ubifs mm: remove the pgprot argument to __vmalloc 2020-06-02 10:59:11 -07:00
udf for-5.8/block-2020-06-01 2020-06-02 15:29:19 -07:00
ufs
unicode .gitignore: add SPDX License Identifier 2020-03-25 11:50:48 +01:00
vboxsf vboxsf: don't use the source name in the bdi name 2020-05-07 08:45:47 -06:00
verity fs-verity: remove unnecessary extern keywords 2020-05-12 16:44:00 -07:00
xfs xfs: fix use-after-free on CIL context on shutdown 2020-06-22 19:22:57 -07:00
zonefs zonefs: count pages after truncating the iterator 2020-07-20 17:59:31 +09:00
aio.c aio: Replace zero-length array with flexible-array 2020-06-15 23:08:25 -05:00
anon_inodes.c
attr.c
bad_inode.c fs: move the fiemap definitions out of fs.h 2020-06-03 23:16:55 -04:00
binfmt_aout.c exec: Rename flush_old_exec begin_new_exec 2020-05-07 16:55:47 -05:00
binfmt_elf_fdpic.c Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-06-10 16:02:54 -07:00
binfmt_elf.c Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-06-10 16:02:54 -07:00
binfmt_em86.c Merge branch 'akpm' (patches from Andrew) 2020-06-04 19:18:29 -07:00
binfmt_flat.c Merge branch 'uaccess.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-06-10 16:02:54 -07:00
binfmt_misc.c Merge branch 'akpm' (patches from Andrew) 2020-06-04 19:18:29 -07:00
binfmt_script.c Merge branch 'akpm' (patches from Andrew) 2020-06-04 19:18:29 -07:00
block_dev.c block: make function 'kill_bdev' static 2020-06-18 09:24:35 -06:00
buffer.c fs/buffer.c: use attach/detach_page_private 2020-06-02 10:59:07 -07:00
char_dev.c vfs: allow unprivileged whiteout creation 2020-05-14 16:44:23 +02:00
compat_binfmt_elf.c Split the old READ_IMPLIES_EXEC workaround from executable PT_GNU_STACK 2020-06-05 13:45:21 -07:00
compat.c
coredump.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
d_path.c
dax.c dax,iomap: Add helper dax_iomap_zero() to zero a range 2020-04-02 19:15:03 -07:00
dcache.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next 2020-06-03 16:27:18 -07:00
dcookies.c
direct-io.c for-5.8-part2-tag 2020-06-14 09:47:25 -07:00
drop_caches.c sysctl: pass kernel pointers to ->proc_handler 2020-04-27 02:07:40 -04:00
eventfd.c eventfd: convert to f_op->read_iter() 2020-05-06 22:33:43 -04:00
eventpoll.c epoll: call final ep_events_available() check under the lock 2020-05-14 10:00:35 -07:00
exec.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
fcntl.c
fhandle.c
file_table.c Revert "fs: Do not check if there is a fsnotify watcher on pseudo inodes" 2020-06-29 09:40:55 -07:00
file.c fix multiplication overflow in copy_fdtable() 2020-05-19 18:29:36 -04:00
filesystems.c fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once() 2020-04-10 15:36:22 -07:00
fs_context.c vfs: don't parse "silent" option 2020-05-14 16:44:25 +02:00
fs_parser.c fs_parse: remove pr_notice() about each validation 2020-04-02 09:35:26 -07:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c A lot of bug fixes and cleanups for ext4, including: 2020-06-05 16:19:28 -07:00
fsopen.c
inode.c AFS Changes 2020-06-05 16:26:36 -07:00
internal.h A lot of bug fixes and cleanups for ext4, including: 2020-06-05 16:19:28 -07:00
io_uring.c io_uring: missed req_init_async() for IOSQE_ASYNC 2020-07-23 11:20:55 -06:00
io-wq.c io_uring: cancel all task's requests on exit 2020-06-15 08:51:34 -06:00
io-wq.h io_uring: cancel by ->task not pid 2020-06-15 08:51:38 -06:00
ioctl.c fs: remove the access_ok() check in ioctl_fiemap 2020-06-03 23:16:55 -04:00
Kconfig treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
Kconfig.binfmt treewide: replace '---help---' in Kconfig files with 'help' 2020-06-14 01:57:21 +09:00
libfs.c block: remove the error_sector argument to blkdev_issue_flush 2020-05-22 08:45:46 -06:00
locks.c Highlights: 2020-06-11 10:33:13 -07:00
Makefile
mbcache.c
mount.h proc/mounts: add cursor 2020-05-14 16:44:24 +02:00
mpage.c fs: convert mpage_readpages to mpage_readahead 2020-06-02 10:59:07 -07:00
namei.c vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK 2020-06-08 11:04:19 -07:00
namespace.c fuse: reject options on reconfigure via fsconfig(2) 2020-07-14 14:45:41 +02:00
no-block.c
nsfs.c nsproxy: attach to namespaces via pidfds 2020-05-13 11:41:22 +02:00
open.c Merge branch 'akpm' (patches from Andrew) 2020-06-02 12:21:36 -07:00
pipe.c Notifications over pipes + Keyring notifications 2020-06-13 09:56:21 -07:00
pnode.c propagate_one(): mnt_set_mountpoint() needs mount_lock 2020-04-27 10:37:14 -04:00
pnode.h
posix_acl.c vfs: clean up posix_acl_permission() logic aroudn MAY_NOT_BLOCK 2020-06-08 11:04:19 -07:00
proc_namespace.c Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-06-04 13:54:34 -07:00
read_write.c fs: remove __vfs_read 2020-07-08 08:27:57 +02:00
readdir.c readdir.c: get rid of the last __put_user(), drop now-useless access_ok() 2020-05-01 20:29:54 -04:00
select.c pselect6() and friends: take handling the combined 6th/7th args into helper 2020-05-29 19:10:42 -04:00
seq_file.c fs/seq_file.c: seq_read: Update pr_info_ratelimited 2020-06-04 19:06:25 -07:00
signalfd.c
splice.c Notifications over pipes + Keyring notifications 2020-06-13 09:56:21 -07:00
stack.c
stat.c New code for 5.8: 2020-06-02 19:45:12 -07:00
statfs.c
super.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2020-06-10 16:09:11 -07:00
sync.c overlayfs update for 5.8 2020-06-09 15:40:50 -07:00
timerfd.c
userfaultfd.c mmap locking API: convert mmap_sem comments 2020-06-09 09:39:14 -07:00
utimes.c utimensat: AT_EMPTY_PATH support 2020-05-14 16:44:24 +02:00
xattr.c xattr: fix uninitialized out-param 2020-04-09 15:33:09 -04:00