linux/fs
Filipe Manana a5ae50dea9 Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof
While logging the prealloc extents of an inode during a fast fsync we call
btrfs_truncate_inode_items(), through btrfs_log_prealloc_extents(), while
holding a read lock on a leaf of the inode's root (not the log root, the
fs/subvol root), and then that function locks the file range in the inode's
iotree. This can lead to a deadlock when:

* the fsync is ranged

* the file has prealloc extents beyond eof

* writeback for a range different from the fsync range starts
  during the fsync

* the size of the file is not sector size aligned

Because when finishing an ordered extent we lock first a file range and
then try to COW the fs/subvol tree to insert an extent item.

The following diagram shows how the deadlock can happen.

           CPU 1                                        CPU 2

  btrfs_sync_file()
    --> for range [0, 1MiB)

    --> inode has a size of
        1MiB and has 1 prealloc
        extent beyond the
        i_size, starting at offset
        4MiB

    flushes all delalloc for the
    range [0MiB, 1MiB) and waits
    for the respective ordered
    extents to complete

                                              --> before task at CPU 1 locks the
                                                  inode, a write into file range
                                                  [1MiB, 2MiB + 1KiB) is made

                                              --> i_size is updated to 2MiB + 1KiB

                                              --> writeback is started for that
                                                  range, [1MiB, 2MiB + 4KiB)
                                                  --> end offset rounded up to
                                                      be sector size aligned

    btrfs_log_dentry_safe()
      btrfs_log_inode_parent()
        btrfs_log_inode()

          btrfs_log_changed_extents()
            btrfs_log_prealloc_extents()
              --> does a search on the
                  inode's root
              --> holds a read lock on
                  leaf X

                                              btrfs_finish_ordered_io()
                                                --> locks range [1MiB, 2MiB + 4KiB)
                                                    --> end offset rounded up
                                                        to be sector size aligned

                                                --> tries to cow leaf X, through
                                                    insert_reserved_file_extent()
                                                    --> already locked by the
                                                        task at CPU 1

              btrfs_truncate_inode_items()

                --> gets an i_size of
                    2MiB + 1KiB, which is
                    not sector size
                    aligned

                --> tries to lock file
                    range [2MiB, (u64)-1)
                    --> the start range
                        is rounded down
                        from 2MiB + 1K
                        to 2MiB to be sector
                        size aligned

                    --> but the subrange
                        [2MiB, 2MiB + 4KiB) is
                        already locked by
                        task at CPU 2 which
                        is waiting to get a
                        write lock on leaf X
                        for which we are
                        holding a read lock

                                *** deadlock ***

This results in a stack trace like the following, triggered by test case
generic/561 from fstests:

  [ 2779.973608] INFO: task kworker/u8:6:247 blocked for more than 120 seconds.
  [ 2779.979536]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
  [ 2779.984503] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 2779.990136] kworker/u8:6    D    0   247      2 0x80004000
  [ 2779.990457] Workqueue: btrfs-endio-write btrfs_work_helper [btrfs]
  [ 2779.990466] Call Trace:
  [ 2779.990491]  ? __schedule+0x384/0xa30
  [ 2779.990521]  schedule+0x33/0xe0
  [ 2779.990616]  btrfs_tree_read_lock+0x19e/0x2e0 [btrfs]
  [ 2779.990632]  ? remove_wait_queue+0x60/0x60
  [ 2779.990730]  btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
  [ 2779.990782]  btrfs_search_slot+0x510/0x1000 [btrfs]
  [ 2779.990869]  btrfs_lookup_file_extent+0x4a/0x70 [btrfs]
  [ 2779.990944]  __btrfs_drop_extents+0x161/0x1060 [btrfs]
  [ 2779.990987]  ? mark_held_locks+0x6d/0xc0
  [ 2779.990994]  ? __slab_alloc.isra.49+0x99/0x100
  [ 2779.991060]  ? insert_reserved_file_extent.constprop.19+0x64/0x300 [btrfs]
  [ 2779.991145]  insert_reserved_file_extent.constprop.19+0x97/0x300 [btrfs]
  [ 2779.991222]  ? start_transaction+0xdd/0x5c0 [btrfs]
  [ 2779.991291]  btrfs_finish_ordered_io+0x4f4/0x840 [btrfs]
  [ 2779.991405]  btrfs_work_helper+0xaa/0x720 [btrfs]
  [ 2779.991432]  process_one_work+0x26d/0x6a0
  [ 2779.991460]  worker_thread+0x4f/0x3e0
  [ 2779.991481]  ? process_one_work+0x6a0/0x6a0
  [ 2779.991489]  kthread+0x103/0x140
  [ 2779.991499]  ? kthread_create_worker_on_cpu+0x70/0x70
  [ 2779.991515]  ret_from_fork+0x3a/0x50
  (...)
  [ 2780.026211] INFO: task fsstress:17375 blocked for more than 120 seconds.
  [ 2780.027480]       Not tainted 5.6.0-rc2-btrfs-next-53 #1
  [ 2780.028482] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  [ 2780.030035] fsstress        D    0 17375  17373 0x00004000
  [ 2780.030038] Call Trace:
  [ 2780.030044]  ? __schedule+0x384/0xa30
  [ 2780.030052]  schedule+0x33/0xe0
  [ 2780.030075]  lock_extent_bits+0x20c/0x320 [btrfs]
  [ 2780.030094]  ? btrfs_truncate_inode_items+0xf4/0x1150 [btrfs]
  [ 2780.030098]  ? rcu_read_lock_sched_held+0x59/0xa0
  [ 2780.030102]  ? remove_wait_queue+0x60/0x60
  [ 2780.030122]  btrfs_truncate_inode_items+0x133/0x1150 [btrfs]
  [ 2780.030151]  ? btrfs_set_path_blocking+0xb2/0x160 [btrfs]
  [ 2780.030165]  ? btrfs_search_slot+0x379/0x1000 [btrfs]
  [ 2780.030195]  btrfs_log_changed_extents.isra.8+0x841/0x93e [btrfs]
  [ 2780.030202]  ? do_raw_spin_unlock+0x49/0xc0
  [ 2780.030215]  ? btrfs_get_num_csums+0x10/0x10 [btrfs]
  [ 2780.030239]  btrfs_log_inode+0xf83/0x1124 [btrfs]
  [ 2780.030251]  ? __mutex_unlock_slowpath+0x45/0x2a0
  [ 2780.030275]  btrfs_log_inode_parent+0x2a0/0xe40 [btrfs]
  [ 2780.030282]  ? dget_parent+0xa1/0x370
  [ 2780.030309]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
  [ 2780.030329]  btrfs_sync_file+0x3f3/0x490 [btrfs]
  [ 2780.030339]  do_fsync+0x38/0x60
  [ 2780.030343]  __x64_sys_fdatasync+0x13/0x20
  [ 2780.030345]  do_syscall_64+0x5c/0x280
  [ 2780.030348]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
  [ 2780.030356] RIP: 0033:0x7f2d80f6d5f0
  [ 2780.030361] Code: Bad RIP value.
  [ 2780.030362] RSP: 002b:00007ffdba3c8548 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
  [ 2780.030364] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f2d80f6d5f0
  [ 2780.030365] RDX: 00007ffdba3c84b0 RSI: 00007ffdba3c84b0 RDI: 0000000000000003
  [ 2780.030367] RBP: 000000000000004a R08: 0000000000000001 R09: 00007ffdba3c855c
  [ 2780.030368] R10: 0000000000000078 R11: 0000000000000246 R12: 00000000000001f4
  [ 2780.030369] R13: 0000000051eb851f R14: 00007ffdba3c85f0 R15: 0000557a49220d90

So fix this by making btrfs_truncate_inode_items() not lock the range in
the inode's iotree when the target root is a log root, since it's not
needed to lock the range for log roots as the protection from the inode's
lock and log_mutex are all that's needed.

Fixes: 28553fa992 ("Btrfs: fix race between shrinking truncate and fiemap")
CC: stable@vger.kernel.org # 4.4+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2020-02-21 16:21:19 +01:00
..
9p 9p pull request for inclusion in 5.4 2019-09-27 15:10:34 -07:00
adfs Merge branch 'work.adfs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 11:33:22 -07:00
affs affs: fix a memory leak in affs_remount 2019-11-18 14:26:43 +01:00
afs Merge branch 'dhowells' (patches from DavidH) 2020-01-14 09:56:31 -08:00
autofs Merge branch 'next.autofs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-12-05 17:11:48 -08:00
befs fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
bfs fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
btrfs Btrfs: fix deadlock during fast fsync when logging prealloc extents beyond eof 2020-02-21 16:21:19 +01:00
cachefiles
ceph ceph: add more debug info when decoding mdsmap 2019-12-09 20:55:10 +01:00
cifs cifs: Optimize readdir on reparse points 2019-12-23 09:04:44 -06:00
coda y2038: add inode timestamp clamping 2019-09-19 09:42:37 -07:00
configfs configfs: calculate the depth of parent item 2019-11-06 18:36:01 +01:00
cramfs cramfs: fix usage on non-MTD device 2019-11-23 21:44:49 -05:00
crypto treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
debugfs Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-12-06 09:06:58 -08:00
devpts devpts_pty_kill(): don't bother with d_delete() 2019-09-03 09:30:56 -04:00
dlm dlm for 5.3 2019-07-12 17:37:53 -07:00
ecryptfs compat_ioctl: remove most of fs/compat_ioctl.c 2019-12-01 13:46:15 -08:00
efivarfs Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
efs fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
erofs Changes since last update: 2019-12-11 12:25:32 -08:00
exportfs race in exportfs_decode_fh() 2019-11-11 09:21:59 -05:00
ext2 \n 2019-11-30 11:16:07 -08:00
ext4 Ext4 bug fixes (including a regression fix) for 5.5 2019-12-22 10:41:48 -08:00
f2fs compat_ioctl: remove most of fs/compat_ioctl.c 2019-12-01 13:46:15 -08:00
fat compat_ioctl: move drivers to compat_ptr_ioctl 2019-10-23 17:23:43 +02:00
freevxfs fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
fscache
fuse fuse: fix fuse_send_readpages() in the syncronous read case 2020-01-16 11:09:36 +01:00
gfs2 GFS2 changes for this merge window: 2019-12-05 13:20:11 -08:00
hfs
hfsplus fs/hfsplus/xattr.c: replace strncpy with memcpy 2019-07-16 19:23:23 -07:00
hostfs
hpfs fs: compat_ioctl: move FITRIM emulation into file systems 2019-10-23 17:23:46 +02:00
hugetlbfs mm/hugetlbfs: fix for_each_hstate() loop in init_hugetlbfs_fs() 2020-01-03 10:39:08 -08:00
iomap iomap: stop using ioend after it's been freed in iomap_finish_ioend() 2019-12-05 07:41:16 -08:00
isofs y2038: add inode timestamp clamping 2019-09-19 09:42:37 -07:00
jbd2 This merge window saw the the following new featuers added to ext4: 2019-11-30 10:53:02 -08:00
jffs2 Revert "jffs2: Fix possible null-pointer dereferences in jffs2_add_frag_to_fragtree()" 2019-11-29 11:29:58 +01:00
jfs y2038: add inode timestamp clamping 2019-09-19 09:42:37 -07:00
kernfs Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-12-06 09:06:58 -08:00
lockd NFSv4.1: Don't rebind to the same source port when reconnecting to the server 2019-11-03 21:28:45 -05:00
minix fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
nfs reimplement path_mountpoint() with less magic 2020-01-15 01:36:06 -05:00
nfs_common
nfsd This is a relatively quiet cycle for nfsd, mainly various bugfixes. 2019-12-07 16:56:00 -08:00
nilfs2 fs: compat_ioctl: move FITRIM emulation into file systems 2019-10-23 17:23:46 +02:00
nls
notify fs: call fsnotify_sb_delete after evict_inodes 2019-12-18 00:03:01 -05:00
ntfs ntfs: remove (un)?likely() from IS_ERR() conditions 2019-09-26 10:10:44 -07:00
ocfs2 ocfs2: fix the crash due to call ocfs2_get_dlm_debug once less 2020-01-04 13:55:09 -08:00
omfs fs: omfs: Initialize filesystem timestamp ranges 2019-08-30 08:11:25 -07:00
openpromfs Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
orangefs orangefs: posix open permission checking... 2019-12-04 08:52:55 -05:00
overlayfs overlayfs fixes for 5.5-rc2 2019-12-14 11:13:54 -08:00
proc sched/cputime, proc/stat: Fix incorrect guest nice cpustat value 2019-12-11 07:09:58 +01:00
pstore pstore/ram: Regularize prz label allocation lifetime 2020-01-08 17:05:45 -08:00
qnx4 fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
qnx6 fs: Fill in max and min timestamps in superblock 2019-08-30 07:27:17 -07:00
quota fs: avoid softlockups in s_inodes iterators 2019-12-18 00:03:01 -05:00
ramfs vfs: Convert ramfs, shmem, tmpfs, devtmpfs, rootfs to use the new mount API 2019-09-12 21:05:34 -04:00
reiserfs reiserfs: replace open-coded atomic_dec_and_mutex_lock() 2019-11-05 12:25:22 +01:00
romfs Merge branch 'work.mount2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-09-19 10:06:57 -07:00
squashfs Merge branch 'work.mount2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-09-19 10:06:57 -07:00
sysfs Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
sysv fs: sysv: Initialize filesystem timestamp ranges 2019-08-30 07:27:18 -07:00
tracefs tracing: Do not create tracefs files if tracefs lockdown is in effect 2019-10-12 20:49:07 -04:00
ubifs ubifs: ubifs_tnc_start_commit: Fix OOB in layout_in_gaps 2019-11-17 22:22:54 +01:00
udf fs-udf: Delete an unnecessary check before brelse() 2019-09-04 18:19:43 +02:00
ufs y2038: add inode timestamp clamping 2019-09-19 09:42:37 -07:00
unicode unicode: make array 'token' static const, makes object smaller 2019-09-17 11:48:24 -04:00
verity treewide: Use sizeof_field() macro 2019-12-09 10:36:44 -08:00
xfs xfs: Make the symbol 'xfs_rtalloc_log_count' static 2019-12-20 08:07:31 -08:00
aio.c y2038: syscall implementation cleanups 2019-12-01 14:00:59 -08:00
anon_inodes.c Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
attr.c timestamp_truncate: Replace users of timespec64_trunc 2019-08-30 07:27:17 -07:00
bad_inode.c
binfmt_aout.c
binfmt_elf_fdpic.c y2038: elfcore: Use __kernel_old_timeval for process times 2019-11-15 14:38:29 +01:00
binfmt_elf.c fs/binfmt_elf.c: extract elf_read() function 2019-12-04 19:44:13 -08:00
binfmt_em86.c
binfmt_flat.c fs/binfmt_flat.c: remove set but not used variable 'inode' 2019-07-16 19:23:22 -07:00
binfmt_misc.c Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
binfmt_script.c
block_dev.c block: don't send uevent for empty disk when not invalidating 2019-12-02 18:49:30 -07:00
buffer.c fs: move guard_bio_eod() after bio_set_op_attrs 2020-01-09 08:16:12 -07:00
char_dev.c chardev: Avoid potential use-after-free in 'chrdev_open()' 2020-01-06 20:10:26 +01:00
compat_binfmt_elf.c y2038: elfcore: Use __kernel_old_timeval for process times 2019-11-15 14:38:29 +01:00
compat_ioctl.c New code for 5.5: 2019-12-02 14:46:22 -08:00
compat.c
coredump.c coredump: split pipe command whitespace before expanding template 2019-08-03 07:02:01 -07:00
d_path.c [PATCH] fix d_absolute_path() interplay with fsmount() 2019-08-30 19:31:09 -04:00
dax.c New code for 5.5: 2019-11-30 10:44:49 -08:00
dcache.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-12-08 11:08:28 -08:00
dcookies.c
direct-io.c fs/direct-io.c: include fs/internal.h for missing prototype 2020-01-04 13:55:09 -08:00
drop_caches.c fs: avoid softlockups in s_inodes iterators 2019-12-18 00:03:01 -05:00
eventfd.c
eventpoll.c fs/epoll: remove unnecessary wakeups of nested epoll 2019-12-04 19:44:13 -08:00
exec.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2019-12-03 12:20:25 -08:00
fcntl.c Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-12-08 11:08:28 -08:00
fhandle.c fs/handle.c - fix up kerneldoc 2019-08-07 21:51:47 -04:00
file_table.c vfs: Export flush_delayed_fput for use by knfsd. 2019-08-19 11:00:39 -04:00
file.c Revert "fs: remove ksys_dup()" 2020-01-02 16:15:33 -08:00
filesystems.c
fs_context.c vfs: subtype handling moved to fuse 2019-09-06 21:28:49 +02:00
fs_parser.c vfs: Make fs_parse() handle fs_param_is_fd-type params better 2019-09-12 21:06:14 -04:00
fs_pin.c switch the remnants of releasing the mountpoint away from fs_pin 2019-07-16 22:52:37 -04:00
fs_struct.c
fs_types.c
fs-writeback.c cgroup,writeback: don't switch wbs immediately on dead wbs if the memcg is dead 2019-11-08 13:37:24 -07:00
fsopen.c Merge branch 'work.mount0' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2019-07-19 10:42:02 -07:00
inode.c fs: avoid softlockups in s_inodes iterators 2019-12-18 00:03:01 -05:00
internal.h fs: move guard_bio_eod() after bio_set_op_attrs 2020-01-09 08:16:12 -07:00
io_uring.c io_uring: only allow submit from owning task 2020-01-16 21:43:24 -07:00
io-wq.c io-wq: cancel work if we fail getting a mm reference 2020-01-14 22:06:11 -07:00
io-wq.h io-wq: re-add io_wq_current_is_worker() 2019-12-17 19:57:20 -07:00
ioctl.c New code for 5.5: 2019-12-02 14:46:22 -08:00
Kconfig io-wq: small threadpool implementation for io_uring 2019-10-29 12:43:00 -06:00
Kconfig.binfmt
libfs.c fs/libfs.c: fix kernel-doc warning 2019-10-14 15:04:01 -07:00
locks.c locks: print unsigned ino in /proc/locks 2019-12-29 09:00:58 -05:00
Makefile io-wq: small threadpool implementation for io_uring 2019-10-29 12:43:00 -06:00
mbcache.c
mount.h switch the remnants of releasing the mountpoint away from fs_pin 2019-07-16 22:52:37 -04:00
mpage.c fs: move guard_bio_eod() after bio_set_op_attrs 2020-01-09 08:16:12 -07:00
namei.c fix autofs regression caused by follow_managed() changes 2020-01-15 01:36:46 -05:00
namespace.c fs/namespace.c: make to_mnt_ns() static 2020-01-04 13:55:09 -08:00
no-block.c
nsfs.c fs/nsfs.c: include headers for missing declarations 2020-01-04 13:55:09 -08:00
open.c Revert "vfs: properly and reliably lock f_pos in fdget_pos()" 2019-11-26 11:34:06 -08:00
pipe.c pipe: fix empty pipe check in pipe_write() 2019-12-22 09:47:47 -08:00
pnode.c
pnode.h
posix_acl.c fs/posix_acl.c: fix kernel-doc warnings 2020-01-04 13:55:09 -08:00
proc_namespace.c vfs: subtype handling moved to fuse 2019-09-06 21:28:49 +02:00
read_write.c vfs: fix page locking deadlocks when deduping files 2019-08-16 18:43:24 -07:00
readdir.c filldir[64]: remove WARN_ON_ONCE() for bad directory entries 2019-10-18 18:41:16 -04:00
select.c y2038: syscalls: change remaining timeval to __kernel_old_timeval 2019-11-15 14:38:29 +01:00
seq_file.c seq_file: fix problem when seeking mid-record 2019-08-13 16:06:52 -07:00
signalfd.c
splice.c pipe: remove 'waiting_writers' merging logic 2019-12-07 13:21:01 -08:00
stack.c
stat.c
statfs.c vfs: Fix EOVERFLOW testing in put_compat_statfs64 2019-10-03 14:21:35 -07:00
super.c fs: call fsnotify_sb_delete after evict_inodes 2019-12-18 00:03:01 -05:00
sync.c
timerfd.c y2038: timerfd: Use timespec64 internally 2019-11-15 14:38:30 +01:00
userfaultfd.c Merge branch 'akpm' (patches from Andrew) 2019-12-01 20:36:41 -08:00
utimes.c y2038: syscalls: change remaining timeval to __kernel_old_timeval 2019-11-15 14:38:29 +01:00
xattr.c