linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-29 22:31:32 +00:00

Josef Bacik 8d9b4a162a btrfs: exclude mmap from happening during all fallocate operations

There's a small window where a deadlock can happen between fallocate and mmap. This is described in detail by Filipe:

"""
When doing a fallocate operation we lock the inode, flush delalloc within the target range, wait for any ordered extents to complete and then lock the file range. Before we lock the range and after we flush delalloc, there is a time window where another task can come in and do a memory mapped write for a page within the fallocate range.

This means that after fallocate locks the range, there can be a dirty page in the range. More often than not, this does not cause any problem. The exception is when we are low on available metadata space, because an fallocate operation needs to start a transaction while holding the file range locked, either through btrfs_prealloc_file_range() or through the call to btrfs_fallocate_update_isize(). If that's the case, we can end up in a deadlock. The following list of steps explains how that happens:

1) A fallocate operation starts, locks the inode, flushes delalloc in the range and waits for ordered extents in the range to complete;

2) Before the fallocate task locks the file range, another task does a memory mapped write for a page in the fallocate target range. This is possible since memory mapped writes do not (and can not) lock the inode;

3) The fallocate task locks the file range. At this point there is one dirty page in the range (due to the memory mapped write);

4) When the fallocate task attempts to start a transaction, it blocks when attempting to reserve metadata space, since we are low on available metadata space. Before blocking (wait on its reservation ticket), it starts the async reclaim task (if not running already);

5) The async reclaim task is not able to release space through any other means, so it decides to flush delalloc for inodes with dirty pages. It finds that the inode used in the fallocate operation has a dirty page and therefore queues a job (fs_info->flush_workers workqueue) to flush delalloc for that inode and waits on that job to complete;

6) The flush job blocks when attempting to lock the file range because it is currently locked by the fallocate task;

7) The fallocate task keeps waiting for its metadata reservation, waiting for a wakeup on its reservation ticket.

The async reclaim task is waiting on the flush job, which in turn is waiting for locking the file range that is currently locked by the fallocate task.

So unless some other task is able to release enough metadata space, for example an ordered extent for some other inode completes, we end up in a deadlock between all these tasks.

When this happens stack traces like the following show up in dmesg/syslog:

INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
schedule+0x45/0xe0
lock_extent_bits+0x1e6/0x2d0 [btrfs]
? finish_wait+0x90/0x90
btrfs_invalidatepage+0x32c/0x390 [btrfs]
? __mod_memcg_state+0x8e/0x160
__extent_writepage+0x2d4/0x400 [btrfs]
extent_write_cache_pages+0x2b2/0x500 [btrfs]
? lock_release+0x20e/0x4c0
? trace_hardirqs_on+0x1b/0xf0
extent_writepages+0x43/0x90 [btrfs]
? lock_acquire+0x1a3/0x490
do_writepages+0x43/0xe0
? __filemap_fdatawrite_range+0xa4/0x100
__filemap_fdatawrite_range+0xc5/0x100
btrfs_run_delalloc_work+0x17/0x40 [btrfs]
btrfs_work_helper+0xf1/0x600 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x50/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30

INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
Call Trace:
__schedule+0x5d1/0xcf0
? kvm_clock_read+0x14/0x30
? wait_for_completion+0x81/0x110
schedule+0x45/0xe0
schedule_timeout+0x30c/0x580
? _raw_spin_unlock_irqrestore+0x3c/0x60
? lock_acquire+0x1a3/0x490
? try_to_wake_up+0x7a/0xa20
? lock_release+0x20e/0x4c0
? lock_acquired+0x199/0x490
? wait_for_completion+0x81/0x110
wait_for_completion+0xab/0x110
start_delalloc_inodes+0x2af/0x390 [btrfs]
btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
flush_space+0x24f/0x660 [btrfs]
btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
process_one_work+0x24e/0x5e0
worker_thread+0x20f/0x3b0
? process_one_work+0x5e0/0x5e0
kthread+0x153/0x170
? kthread_mod_delayed_work+0xc0/0xc0
ret_from_fork+0x22/0x30

(...)
several tasks waiting for the inode lock held by the fallocate task below
(...)

RIP: 0033:0x7f61efe73fff
Code: Unable to access opcode bytes at RIP 0x7f61efe73fd5.
RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000202 ORIG_RAX: 000000000000013c
RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73fff
RDX: 00000000ffffff9c RSI: 0000560fbd5d90a0 RDI: 00000000ffffff9c
RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
R10: 0000560fbd5d7ad0 R11: 0000000000000202 R12: 0000000000000001
R13: 000000000000005e R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
task:fdm-stress state:D stack: 0 pid:2508243 ppid:2508153 flags:0x00000000
Call Trace:
__schedule+0x5d1/0xcf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
schedule+0x45/0xe0
__reserve_bytes+0x4a4/0xb10 [btrfs]
? finish_wait+0x90/0x90
btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
btrfs_block_rsv_add+0x1f/0x50 [btrfs]
start_transaction+0x2d1/0x760 [btrfs]
btrfs_replace_file_extents+0x120/0x930 [btrfs]
? btrfs_fallocate+0xdcf/0x1260 [btrfs]
btrfs_fallocate+0xdfb/0x1260 [btrfs]
? filename_lookup+0xf1/0x180
vfs_fallocate+0x14f/0x440
ioctl_preallocate+0x92/0xc0
do_vfs_ioctl+0x66b/0x750
? __do_sys_newfstat+0x53/0x60
__x64_sys_ioctl+0x62/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
"""

Fix this by disallowing mmaps from happening while we're doing any of the fallocate operations on this inode.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>

2021-04-19 17:25:15 +02:00
..
tests	idmapped-mounts-v5.12	2021-02-23 13:39:45 -08:00
acl.c	fs: make helpers idmap mount aware	2021-01-24 14:27:20 +01:00
async-thread.c	Btrfs: fix crash during unmount due to race with delayed inode workers	2020-03-23 17:01:51 +01:00
async-thread.h	Btrfs: fix crash during unmount due to race with delayed inode workers	2020-03-23 17:01:51 +01:00
backref.c	btrfs: do not warn if we can't find the reloc root when looking up backref	2021-02-08 22:58:56 +01:00
backref.h	btrfs: add asserts for deleting backref cache nodes	2021-02-08 22:58:56 +01:00
block-group.c	btrfs: replace open coded while loop with proper construct	2021-04-19 17:25:14 +02:00
block-group.h	btrfs: fix race between writes to swap files and scrub	2021-02-22 18:07:15 +01:00
block-rsv.c	btrfs: introduce mount option rescue=ignorebadroots	2020-12-08 15:53:41 +01:00
block-rsv.h	btrfs: Remove __ prefix from btrfs_block_rsv_release	2020-03-23 17:01:55 +01:00
btrfs_inode.h	btrfs: add a i_mmap_lock to our inode	2021-04-19 17:25:15 +02:00
check-integrity.c	block: store a block_device pointer in struct bio	2021-01-24 18:17:20 -07:00
check-integrity.h	btrfs: remove btrfsic_submit_bh()	2020-03-23 17:01:39 +01:00
compression.c	Merge branch 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux	2021-03-01 11:24:18 -08:00
compression.h	btrfs: compression: move declarations to header	2020-10-07 12:06:55 +02:00
ctree.c	btrfs: fix race when cloning extent buffer during rewind of an old root	2021-03-16 20:32:17 +01:00
ctree.h	btrfs: add a i_mmap_lock to our inode	2021-04-19 17:25:15 +02:00
delalloc-space.c	btrfs: fix parameter description of btrfs_inode_rsv_release/btrfs_delalloc_release_space	2021-02-08 22:58:54 +01:00
delalloc-space.h	btrfs: make btrfs_delalloc_reserve_space take btrfs_inode	2020-07-27 12:55:36 +02:00
delayed-inode.c	btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers	2021-04-19 17:25:15 +02:00
delayed-inode.h	btrfs: make btrfs_delayed_update_inode take btrfs_inode	2020-12-08 15:54:10 +01:00
delayed-ref.c	btrfs: account for new extents being deleted in total_bytes_pinned	2021-02-08 22:58:55 +01:00
delayed-ref.h	btrfs: only let one thread pre-flush delayed refs in commit	2021-02-08 22:58:56 +01:00
dev-replace.c	btrfs: do not initialize dev replace for bad dev root	2021-03-17 19:42:14 +01:00
dev-replace.h	btrfs: zoned: mark block groups to copy for device-replace	2021-02-09 02:46:07 +01:00
dir-item.c	btrfs: locking: rip out path->leave_spinning	2020-12-08 15:54:02 +01:00
discard.c	btrfs: document now parameter of peek_discard_list	2021-02-08 22:58:53 +01:00
discard.h	btrfs: cleanup btrfs_discard_update_discardable usage	2020-12-08 15:54:02 +01:00
disk-io.c	for-5.12-rc4-tag	2021-03-25 15:38:22 -07:00
disk-io.h	btrfs: split alloc_log_tree()	2021-02-09 02:46:07 +01:00
export.c	btrfs: locking: rip out path->leave_spinning	2020-12-08 15:54:02 +01:00
export.h	btrfs: export helpers for subvolume name/id resolution	2020-03-23 17:01:42 +01:00
extent_io.c	btrfs: remove mirror argument from btrfs_csum_verify_data()	2021-04-19 17:25:15 +02:00
extent_io.h	btrfs: zoned: redirty released extent buffers	2021-02-09 02:46:04 +01:00
extent_map.c	btrfs: fix parameter description of btrfs_add_extent_mapping	2021-02-08 22:58:53 +01:00
extent_map.h	btrfs: remove extent_map::bdev	2019-11-18 23:43:44 +01:00
extent-io-tree.h	btrfs: use fixed width int type for extent_state::state	2020-12-08 15:54:13 +01:00
extent-tree.c	btrfs: unexport btrfs_extent_readonly() and make it static	2021-04-19 17:25:14 +02:00
file-item.c	btrfs: fix function description formats in file-item.c	2021-02-08 22:58:53 +01:00
file.c	btrfs: exclude mmap from happening during all fallocate operations	2021-04-19 17:25:15 +02:00
free-space-cache.c	btrfs: zoned: do not account freed region of read-only block group as zone_unusable	2021-03-04 16:16:58 +01:00
free-space-cache.h	btrfs: zoned: track unusable bytes for zones	2021-02-09 02:46:03 +01:00
free-space-tree.c	btrfs: fix possible free space tree corruption with online conversion	2021-01-25 18:44:37 +01:00
free-space-tree.h
inode-item.c	btrfs: locking: rip out path->leave_spinning	2020-12-08 15:54:02 +01:00
inode.c	btrfs: add a i_mmap_lock to our inode	2021-04-19 17:25:15 +02:00
ioctl.c	btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers	2021-04-19 17:25:15 +02:00
Kconfig	btrfs: switch to iomap for direct IO	2020-10-07 12:06:57 +02:00
locking.c	btrfs: remove the recurse parameter from __btrfs_tree_read_lock	2020-12-08 15:54:09 +01:00
locking.h	btrfs: remove the recurse parameter from __btrfs_tree_read_lock	2020-12-08 15:54:09 +01:00
lzo.c	btrfs: use memcpy_[to|from]_page() and kmap_local_page()	2021-02-26 12:45:15 +01:00
Makefile	btrfs: fix build when using M=fs/btrfs	2021-03-17 19:42:18 +01:00
misc.h	btrfs: rename tree_entry to rb_simple_node and export it	2020-05-25 11:25:19 +02:00
ordered-data.c	btrfs: replace offset_in_entry with in_range	2021-04-19 17:25:14 +02:00
ordered-data.h	btrfs: fix comment for btrfs ordered extent flag bits	2021-04-19 17:25:14 +02:00
orphan.c
print-tree.c	btrfs: print the actual offset in btrfs_root_name	2021-01-07 17:25:05 +01:00
print-tree.h	btrfs: print the actual offset in btrfs_root_name	2021-01-07 17:25:05 +01:00
props.c	btrfs: simplify iget helpers	2020-05-25 11:25:37 +02:00
props.h
qgroup.c	btrfs: don't opencode extent_changeset_free	2021-04-19 17:25:15 +02:00
qgroup.h	btrfs: export and rename qgroup_reserve_meta	2021-03-02 16:58:30 +01:00
raid56.c	Merge branch 'kmap-conversion-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux	2021-03-01 11:24:18 -08:00
raid56.h
rcu-string.h	btrfs: rcu-string: Replace zero-length array with flexible-array member	2020-03-23 17:01:53 +01:00
reada.c	btrfs: subpage: make readahead work properly	2021-03-16 11:06:21 +01:00
ref-verify.c	btrfs: ref-verify: use 'inline void' keyword ordering	2021-03-02 16:55:40 +01:00
ref-verify.h
reflink.c	btrfs: exclude mmaps while doing remap	2021-04-19 17:25:15 +02:00
reflink.h	Btrfs: move all reflink implementation code into its own file	2020-03-23 17:01:54 +01:00
relocation.c	btrfs: use btrfs_inode_lock/btrfs_inode_unlock inode lock helpers	2021-04-19 17:25:15 +02:00
root-tree.c	btrfs: qgroup: fix qgroup meta rsv leak for subvolume operations	2020-10-07 12:12:13 +02:00
scrub.c	btrfs: scrub: drop a few function declarations	2021-04-19 17:25:14 +02:00
send.c	btrfs: add btree read ahead for incremental send operations	2021-04-19 17:25:15 +02:00
send.h	btrfs: send: avoid copying file data	2020-10-07 12:13:17 +02:00
space-info.c	btrfs: zoned: track unusable bytes for zones	2021-02-09 02:46:03 +01:00
space-info.h	btrfs: zoned: track unusable bytes for zones	2021-02-09 02:46:03 +01:00
struct-funcs.c	btrfs: handle sectorsize < PAGE_SIZE case for extent buffer accessors	2020-12-09 19:16:10 +01:00
subpage.c	btrfs: integrate page status update for data read path into begin/end_page_read	2021-02-08 22:59:03 +01:00
subpage.h	btrfs: integrate page status update for data read path into begin/end_page_read	2021-02-08 22:59:03 +01:00
super.c	btrfs: fix spurious free_space_tree remount warning	2021-03-02 16:55:55 +01:00
sysfs.c	btrfs: zoned: track unusable bytes for zones	2021-02-09 02:46:03 +01:00
sysfs.h	btrfs: split and refactor btrfs_sysfs_remove_devices_dir	2020-10-07 12:12:21 +02:00
transaction.c	btrfs: zoned: redirty released extent buffers	2021-02-09 02:46:04 +01:00
transaction.h	btrfs: zoned: redirty released extent buffers	2021-02-09 02:46:04 +01:00
tree-checker.c	btrfs: tree-checker: do not error out if extent ref hash doesn't match	2021-02-22 18:07:44 +01:00
tree-checker.h
tree-defrag.c	btrfs: locking: remove all the blocking helpers	2020-12-08 15:54:01 +01:00
tree-log.c	btrfs: zoned: fix linked list corruption after log root tree allocation failure	2021-03-15 16:57:19 +01:00
tree-log.h	btrfs: make fast fsyncs wait only for writeback	2020-10-07 12:06:56 +02:00
ulist.c
ulist.h
uuid-tree.c	btrfs: remove unnecessary casts in printk	2020-12-08 15:53:52 +01:00
volumes.c	btrfs: assign proper values to a bool variable in dev_extent_hole_check_zoned	2021-04-19 17:25:15 +02:00
volumes.h	btrfs: zoned: relocate block group to repair IO failure in zoned filesystems	2021-02-09 02:46:07 +01:00
xattr.c	for-5.12-rc1-tag	2021-03-05 12:21:14 -08:00
xattr.h
zlib.c	btrfs: use memcpy_[to|from]_page() and kmap_local_page()	2021-02-26 12:45:15 +01:00
zoned.c	for-5.12-rc6-tag	2021-04-11 11:53:36 -07:00
zoned.h	btrfs: zoned: extend zoned allocator to use dedicated tree-log block group	2021-02-09 02:46:08 +01:00
zstd.c	btrfs: use memcpy_[to|from]_page() and kmap_local_page()	2021-02-26 12:45:15 +01:00