linux/fs
Dave Chinner a9a4bc8c76 xfs: log worker needs to start before intent/unlink recovery
After 963 iterations of generic/530, it deadlocked during recovery
on a pinned inode cluster buffer like so:

XFS (pmem1): Starting recovery (logdev: internal)
INFO: task kworker/8:0:306037 blocked for more than 122 seconds.
      Not tainted 5.17.0-rc6-dgc+ #975
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/8:0     state:D stack:13024 pid:306037 ppid:     2 flags:0x00004000
Workqueue: xfs-inodegc/pmem1 xfs_inodegc_worker
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 schedule_timeout+0x114/0x160
 __down+0x99/0xf0
 down+0x5e/0x70
 xfs_buf_lock+0x36/0xf0
 xfs_buf_find+0x418/0x850
 xfs_buf_get_map+0x47/0x380
 xfs_buf_read_map+0x54/0x240
 xfs_trans_read_buf_map+0x1bd/0x490
 xfs_imap_to_bp+0x4f/0x70
 xfs_iunlink_map_ino+0x66/0xd0
 xfs_iunlink_map_prev.constprop.0+0x148/0x2f0
 xfs_iunlink_remove_inode+0xf2/0x1d0
 xfs_inactive_ifree+0x1a3/0x900
 xfs_inode_unlink+0xcc/0x210
 xfs_inodegc_worker+0x1ac/0x2f0
 process_one_work+0x1ac/0x390
 worker_thread+0x56/0x3c0
 kthread+0xf6/0x120
 ret_from_fork+0x1f/0x30
 </TASK>
task:mount           state:D stack:13248 pid:324509 ppid:324233 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 schedule_timeout+0x114/0x160
 __down+0x99/0xf0
 down+0x5e/0x70
 xfs_buf_lock+0x36/0xf0
 xfs_buf_find+0x418/0x850
 xfs_buf_get_map+0x47/0x380
 xfs_buf_read_map+0x54/0x240
 xfs_trans_read_buf_map+0x1bd/0x490
 xfs_imap_to_bp+0x4f/0x70
 xfs_iget+0x300/0xb40
 xlog_recover_process_one_iunlink+0x4c/0x170
 xlog_recover_process_iunlinks.isra.0+0xee/0x130
 xlog_recover_finish+0x57/0x110
 xfs_log_mount_finish+0xfc/0x1e0
 xfs_mountfs+0x540/0x910
 xfs_fs_fill_super+0x495/0x850
 get_tree_bdev+0x171/0x270
 xfs_fs_get_tree+0x15/0x20
 vfs_get_tree+0x24/0xc0
 path_mount+0x304/0xba0
 __x64_sys_mount+0x108/0x140
 do_syscall_64+0x35/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
 </TASK>
task:xfsaild/pmem1   state:D stack:14544 pid:324525 ppid:     2 flags:0x00004000
Call Trace:
 <TASK>
 __schedule+0x30d/0x9e0
 schedule+0x55/0xd0
 io_schedule+0x4b/0x80
 xfs_buf_wait_unpin+0x9e/0xf0
 __xfs_buf_submit+0x14a/0x230
 xfs_buf_delwri_submit_buffers+0x107/0x280
 xfs_buf_delwri_submit_nowait+0x10/0x20
 xfsaild+0x27e/0x9d0
 kthread+0xf6/0x120
 ret_from_fork+0x1f/0x30

We have the mount process waiting on an inode cluster buffer read,
inodegc doing unlink waiting on the same inode cluster buffer, and
the AIL push thread blocked in writeback waiting for the inode
cluster buffer to become unpinned.

What has happened here is that the AIL push thread has raced with
the inodegc process modifying, committing and pinning the inode
cluster buffer here in xfs_buf_delwri_submit_buffers() here:

	blk_start_plug(&plug);
	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
Here >>>>>>
			if (!xfs_buf_trylock(bp))
				continue;

Basically, the AIL has found the buffer wasn't pinned and got the
lock without blocking, but then the buffer was pinned. This implies
the processing here was pre-empted between the pin check and the
lock, because the pin count can only be increased while holding the
buffer locked. Hence when it has gone to submit the IO, it has
blocked waiting for the buffer to be unpinned.

With all executing threads now waiting on the buffer to be unpinned,
we normally get out of situations like this via the background log
worker issuing a log force which will unpinned stuck buffers like
this. But at this point in recovery, we haven't started the log
worker. In fact, the first thing we do after processing intents and
unlinked inodes is *start the log worker*. IOWs, we start it too
late to have it break deadlocks like this.

Avoid this and any other similar deadlock vectors in intent and
unlinked inode recovery by starting the log worker before we recover
intents and unlinked inodes. This part of recovery runs as though
the filesystem is fully active, so we really should have the same
infrastructure running as we normally do at runtime.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Chandan Babu R <chandan.babu@oracle.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
2022-03-20 08:59:49 -07:00
..
9p Revert "fs/9p: search open fids first" 2022-01-30 22:13:37 +09:00
adfs fs/adfs: remove unneeded variable make code cleaner 2022-01-20 08:52:55 +02:00
affs affs: use bdev_nr_sectors instead of open coding it 2021-10-18 14:43:22 -06:00
afs proc: remove PDE_DATA() completely 2022-01-22 08:33:37 +02:00
autofs autofs: fix wait name hash calculation in autofs_wait() 2021-10-20 21:09:02 -04:00
befs isystem: ship and use stdarg.h 2021-08-19 09:02:55 +09:00
bfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
btrfs for-5.17-rc5-tag 2022-02-25 14:08:03 -08:00
cachefiles netfs, cachefiles: Add a method to query presence of data in the cache 2022-02-01 10:29:18 -06:00
ceph ceph: set pool_ns in new inode layout for async creates 2022-01-26 20:17:50 +01:00
cifs cifs: fix confusing unneeded warning message on smb2.1 and earlier 2022-02-16 17:16:49 -06:00
coda coda: bump module version to 7.2 2021-11-09 10:02:51 -08:00
configfs configfs: fix a race in configfs_{,un}register_subsystem() 2022-02-22 18:30:28 +01:00
cramfs cramfs: use bdev_nr_bytes instead of open coding it 2021-10-18 14:43:22 -06:00
crypto fscrypt: improve a few comments 2021-10-25 19:11:50 -07:00
debugfs debugfs: lockdown: Allow reading debugfs files that are not world readable 2022-01-06 15:47:41 +01:00
devpts fsnotify: fix fsnotify hooks in pseudo filesystems 2022-01-24 14:17:02 +01:00
dlm driver core changes for 5.17-rc1 2022-01-12 11:11:34 -08:00
ecryptfs fs: add is_idmapped_mnt() helper 2021-12-03 18:44:06 +01:00
efivarfs
efs
erofs erofs: fix small compressed files inlining 2022-02-04 12:37:12 +08:00
exfat exfat: fix missing REQ_SYNC in exfat_update_bhs() 2022-01-10 11:00:04 +09:00
exportfs
ext2 fsdax: shift partition offset handling into the file systems 2021-12-04 08:58:54 -08:00
ext4 Various bug fixes for ext4 fast commit and inline data handling. Also 2022-02-06 10:34:45 -08:00
f2fs Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the 2022-02-01 11:13:24 -08:00
fat FAT: use io_schedule_timeout() instead of congestion_wait() 2022-01-20 08:52:54 +02:00
freevxfs
fscache fscache: Fix the volume collision wait condition 2022-01-21 21:36:28 +00:00
fuse virtio,vdpa,qemu_fw_cfg: features, cleanups, fixes 2022-01-18 10:05:48 +02:00
gfs2 gfs2 fixes: 2022-02-11 11:36:32 -08:00
hfs Merge branch 'akpm' (patches from Andrew) 2021-11-09 10:11:53 -08:00
hfsplus hfsplus: use struct_group_attr() for memcpy() region 2022-01-20 08:52:54 +02:00
hostfs hostfs: Fix writeback of dirty pages 2021-12-21 21:44:27 +01:00
hpfs treewide: Replace open-coded flex arrays in unions 2021-10-18 12:28:53 -07:00
hugetlbfs hugetlbfs: fix off-by-one error in hugetlb_vmdelete_list() 2022-01-15 16:30:30 +02:00
iomap xfs, iomap: limit individual ioend chain lengths in writeback 2022-01-26 09:19:20 -08:00
isofs isofs: Fix out of bound access for corrupted isofs image 2021-10-19 12:51:02 +02:00
jbd2 Various bug fixes for ext4 fast commit and inline data handling. Also 2022-02-06 10:34:45 -08:00
jffs2 Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
jfs Just one JFS patch 2021-11-03 09:23:25 -07:00
kernfs kernfs: prevent early freeing of root node 2021-12-03 14:36:21 +01:00
ksmbd ksmbd: add support for key exchange 2022-02-04 00:12:22 -06:00
lockd Notable bug fixes: 2022-02-02 10:14:31 -08:00
minix mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
netfs netfs: Make ops->init_rreq() optional 2022-01-21 21:36:28 +00:00
nfs NFS: Do not report writeback errors in nfs_getattr() 2022-02-16 15:15:22 -05:00
nfs_common nfs: Fix kerneldoc warning shown up by W=1 2021-10-04 22:02:17 +01:00
nfsd Notable bug fixes: 2022-02-09 09:56:57 -08:00
nilfs2 Merge branch 'akpm' (patches from Andrew) 2022-01-20 10:41:01 +02:00
nls
notify fanotify: Fix stale file descriptor in copy_event_to_user() 2022-02-01 12:52:07 +01:00
ntfs fs/ntfs/attrib.c: fix one kernel-doc comment 2022-01-15 16:30:24 +02:00
ntfs3 mm: remove cleancache 2022-01-22 08:33:38 +02:00
ocfs2 ocfs2: fix a deadlock when commit trans 2022-01-30 09:56:58 +02:00
omfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
openpromfs
orangefs orangefs: Fix the size of a memory allocation in orangefs_bufmap_alloc() 2021-12-31 14:37:43 -05:00
overlayfs overlayfs fixes for 5.17-rc3 2022-02-01 11:23:02 -08:00
proc fs/proc: task_mmu.c: don't read mapcount for migration entry 2022-02-11 17:55:00 -08:00
pstore pstore update for v5.17-rc1 2022-01-10 11:48:37 -08:00
qnx4 qnx4: work around gcc false positive warning bug 2021-09-21 08:36:48 -07:00
qnx6
quota quota: make dquot_quota_sync return errors from ->sync_fs 2022-01-30 08:59:47 -08:00
ramfs Merge branch 'akpm' (patches from Andrew) 2021-11-09 10:11:53 -08:00
reiserfs reiserfs: don't use congestion_wait() 2021-11-18 11:52:22 +01:00
romfs
smbfs_common smb3: add new defines from protocol specification 2022-01-18 16:50:47 -06:00
squashfs squashfs: provide backing_dev_info in order to disable read-ahead 2022-01-15 16:30:24 +02:00
sysfs fs/sysfs/dir.c: replace S_IRWXU|S_IRUGO|S_IXUGO with 0755 sysfs_create_dir_ns() 2021-10-05 16:35:05 +02:00
sysv sysv: use BUILD_BUG_ON instead of runtime check 2021-11-09 10:02:52 -08:00
tracefs tracefs: Set the group ownership in apply_options() not parse_options() 2022-02-25 21:05:04 -05:00
ubifs ubifs: read-only if LEB may always be taken in ubifs_garbage_collect 2021-12-23 22:30:38 +01:00
udf udf: Restore i_lenAlloc when inode expansion fails 2022-01-24 14:45:02 +01:00
ufs isystem: ship and use stdarg.h 2021-08-19 09:02:55 +09:00
unicode Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the 2022-02-01 11:13:24 -08:00
vboxsf vboxfs: fix broken legacy mount signature checking 2021-09-27 11:26:21 -07:00
verity fs-verity: fix signed integer overflow with i_size near S64_MAX 2021-09-22 10:56:34 -07:00
xfs xfs: log worker needs to start before intent/unlink recovery 2022-03-20 08:59:49 -07:00
zonefs zonefs: add MODULE_ALIAS_FS 2021-12-17 16:56:35 +09:00
aio.c aio: move aio sysctl to aio.c 2022-01-22 08:33:34 +02:00
anon_inodes.c fs: add anon_inode_getfile_secure() similar to anon_inode_getfd_secure() 2021-09-19 22:35:37 -04:00
attr.c fs: handle circular mappings correctly 2021-11-17 09:26:09 +01:00
bad_inode.c vfs: add rcu argument to ->get_acl() callback 2021-08-18 22:08:24 +02:00
binfmt_aout.c binfmt: a.out: Fix bogus semicolon 2021-09-05 10:15:05 -07:00
binfmt_elf_fdpic.c coredump: Limit coredumps to a single thread group 2021-10-08 12:06:02 -05:00
binfmt_elf.c fs/binfmt_elf: fix PT_LOAD p_align values for loaders 2022-02-11 17:55:00 -08:00
binfmt_flat.c binfmt: remove in-tree usage of MAP_EXECUTABLE 2021-06-29 10:53:50 -07:00
binfmt_misc.c Fix regression due to "fs: move binfmt_misc sysctl to its own file" 2022-02-09 09:50:02 -08:00
binfmt_script.c
buffer.c fs/buffer: Convert __block_write_begin_int() to take a folio 2021-12-16 15:49:51 -05:00
char_dev.c
compat_binfmt_elf.c
coredump.c fs/coredump: move coredump sysctls into its own file 2022-01-22 08:33:36 +02:00
d_path.c d_path: fix Kernel doc validator complaining 2021-11-06 13:30:32 -07:00
dax.c dax: remove the copy_from_iter and copy_to_iter methods 2021-12-18 08:04:53 -08:00
dcache.c fs: move dcache sysctls to its own file 2022-01-22 08:33:36 +02:00
direct-io.c fs: get rid of the res2 iocb->ki_complete argument 2021-10-25 10:36:24 -06:00
drop_caches.c fs: drop_caches: fix skipping over shadow cache inodes 2021-09-03 09:58:10 -07:00
eventfd.c eventfd: Export eventfd_wake_count to modules 2021-09-06 07:20:56 -04:00
eventpoll.c eventpoll: simplify sysctl declaration with register_sysctl() 2022-01-22 08:33:35 +02:00
exec.c fs/coredump: move coredump sysctls into its own file 2022-01-22 08:33:36 +02:00
fcntl.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
fhandle.c
file_table.c fs/file_table: fix adding missing kmemleak_not_leak() 2022-02-17 10:23:19 -08:00
file.c fget: clarify and improve __fget_files() implementation 2021-12-13 10:55:30 -08:00
filesystems.c fs: simplify get_filesystem_list / get_all_fs_names 2021-08-23 01:25:40 -04:00
fs_context.c vfs: fs_context: fix up param length parsing in legacy_parse_param 2022-01-18 09:23:19 +02:00
fs_parser.c fs_parse: allow parameter value to be empty 2021-12-09 14:09:36 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c fscache rewrite 2022-01-12 13:45:12 -08:00
fsopen.c
init.c
inode.c fs: move inode sysctls to its own file 2022-01-22 08:33:35 +02:00
internal.h fs/buffer: Convert __block_write_begin_int() to take a folio 2021-12-16 15:49:51 -05:00
io_uring.c io_uring: disallow modification of rsrc_data during quiesce 2022-02-22 09:57:32 -07:00
io-wq.c io_uring-5.17-2022-01-21 2022-01-21 16:07:21 +02:00
io-wq.h Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
ioctl.c fs/ioctl: remove unnecessary __user annotation 2022-01-15 16:30:25 +02:00
Kconfig ksmbd: add support for key exchange 2022-02-04 00:12:22 -06:00
Kconfig.binfmt binfmt: remove support for em86 (alpha only) 2021-07-25 22:33:03 -07:00
kernel_read_file.c vfs: check fd has read access in kernel_read_file_from_fd() 2021-10-18 20:22:03 -10:00
libfs.c unicode: clean up the Kconfig symbol confusion 2022-01-20 19:57:24 -05:00
locks.c fs: move locking sysctls where they are used 2022-01-22 08:33:36 +02:00
Makefile Fix from Christoph Hellwig merging the CONFIG_UNICODE_UTF8_DATA into the 2022-02-01 11:13:24 -08:00
mbcache.c
mount.h
mpage.c mm: remove cleancache 2022-01-22 08:33:38 +02:00
namei.c \n 2022-01-28 17:51:31 +02:00
namespace.c fs: add kernel doc for mnt_{hold,unhold}_writers() 2022-02-14 08:35:32 +01:00
no-block.c
nsfs.c
open.c fs: support mapped mounts of mapped filesystems 2021-12-05 10:28:57 +01:00
pipe.c fs: move pipe sysctls to is own file 2022-01-22 08:33:36 +02:00
pnode.c
pnode.h
posix_acl.c fs: support mapped mounts of mapped filesystems 2021-12-05 10:28:57 +01:00
proc_namespace.c fs: add is_idmapped_mnt() helper 2021-12-03 18:44:06 +01:00
read_write.c fs: remove leftover comments from mandatory locking removal 2021-10-26 12:20:50 -04:00
readdir.c
remap_range.c fs: Convert vfs_dedupe_file_range_compare to folios 2022-01-08 00:28:41 -05:00
select.c select: Fix indefinitely sleeping task in poll_schedule_timeout() 2022-01-11 09:03:05 -08:00
seq_file.c seq_file: move seq_escape() to a header 2021-11-09 10:02:52 -08:00
signalfd.c Merge branch 'signal-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2022-01-17 05:49:30 +02:00
splice.c
stack.c
stat.c fs: add generic helper for filling statx attribute flags 2021-08-17 11:47:43 +02:00
statfs.c
super.c vfs: make freeze_super abort when sync_filesystem returns error 2022-01-30 08:59:47 -08:00
sync.c vfs: make sync_filesystem return errors from ->sync_fs 2022-01-30 08:59:47 -08:00
sysctls.c fs: move namespace sysctls and declare fs base directory 2022-01-22 08:33:36 +02:00
timerfd.c timerfd: Provide timerfd_resume() 2021-08-10 17:57:22 +02:00
userfaultfd.c mm: move anon_vma declarations to linux/mm_inline.h 2022-01-15 16:30:27 +02:00
utimes.c
xattr.c