linux/fs
Zhihao Cheng 2a0629834c
vfs: Don't evict inode under the inode lru traversing context
The inode reclaiming process(See function prune_icache_sb) collects all
reclaimable inodes and mark them with I_FREEING flag at first, at that
time, other processes will be stuck if they try getting these inodes
(See function find_inode_fast), then the reclaiming process destroy the
inodes by function dispose_list(). Some filesystems(eg. ext4 with
ea_inode feature, ubifs with xattr) may do inode lookup in the inode
evicting callback function, if the inode lookup is operated under the
inode lru traversing context, deadlock problems may happen.

Case 1: In function ext4_evict_inode(), the ea inode lookup could happen
        if ea_inode feature is enabled, the lookup process will be stuck
	under the evicting context like this:

 1. File A has inode i_reg and an ea inode i_ea
 2. getfattr(A, xattr_buf) // i_ea is added into lru // lru->i_ea
 3. Then, following three processes running like this:

    PA                              PB
 echo 2 > /proc/sys/vm/drop_caches
  shrink_slab
   prune_dcache_sb
   // i_reg is added into lru, lru->i_ea->i_reg
   prune_icache_sb
    list_lru_walk_one
     inode_lru_isolate
      i_ea->i_state |= I_FREEING // set inode state
     inode_lru_isolate
      __iget(i_reg)
      spin_unlock(&i_reg->i_lock)
      spin_unlock(lru_lock)
                                     rm file A
                                      i_reg->nlink = 0
      iput(i_reg) // i_reg->nlink is 0, do evict
       ext4_evict_inode
        ext4_xattr_delete_inode
         ext4_xattr_inode_dec_ref_all
          ext4_xattr_inode_iget
           ext4_iget(i_ea->i_ino)
            iget_locked
             find_inode_fast
              __wait_on_freeing_inode(i_ea) ----→ AA deadlock
    dispose_list // cannot be executed by prune_icache_sb
     wake_up_bit(&i_ea->i_state)

Case 2: In deleted inode writing function ubifs_jnl_write_inode(), file
        deleting process holds BASEHD's wbuf->io_mutex while getting the
	xattr inode, which could race with inode reclaiming process(The
        reclaiming process could try locking BASEHD's wbuf->io_mutex in
	inode evicting function), then an ABBA deadlock problem would
	happen as following:

 1. File A has inode ia and a xattr(with inode ixa), regular file B has
    inode ib and a xattr.
 2. getfattr(A, xattr_buf) // ixa is added into lru // lru->ixa
 3. Then, following three processes running like this:

        PA                PB                        PC
                echo 2 > /proc/sys/vm/drop_caches
                 shrink_slab
                  prune_dcache_sb
                  // ib and ia are added into lru, lru->ixa->ib->ia
                  prune_icache_sb
                   list_lru_walk_one
                    inode_lru_isolate
                     ixa->i_state |= I_FREEING // set inode state
                    inode_lru_isolate
                     __iget(ib)
                     spin_unlock(&ib->i_lock)
                     spin_unlock(lru_lock)
                                                   rm file B
                                                    ib->nlink = 0
 rm file A
  iput(ia)
   ubifs_evict_inode(ia)
    ubifs_jnl_delete_inode(ia)
     ubifs_jnl_write_inode(ia)
      make_reservation(BASEHD) // Lock wbuf->io_mutex
      ubifs_iget(ixa->i_ino)
       iget_locked
        find_inode_fast
         __wait_on_freeing_inode(ixa)
          |          iput(ib) // ib->nlink is 0, do evict
          |           ubifs_evict_inode
          |            ubifs_jnl_delete_inode(ib)
          ↓             ubifs_jnl_write_inode
     ABBA deadlock ←-----make_reservation(BASEHD)
                   dispose_list // cannot be executed by prune_icache_sb
                    wake_up_bit(&ixa->i_state)

Fix the possible deadlock by using new inode state flag I_LRU_ISOLATING
to pin the inode in memory while inode_lru_isolate() reclaims its pages
instead of using ordinary inode reference. This way inode deletion
cannot be triggered from inode_lru_isolate() thus avoiding the deadlock.
evict() is made to wait for I_LRU_ISOLATING to be cleared before
proceeding with inode cleanup.

Link: https://lore.kernel.org/all/37c29c42-7685-d1f0-067d-63582ffac405@huaweicloud.com/
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219022
Fixes: e50e5129f3 ("ext4: xattr-in-inode support")
Fixes: 7959cf3a75 ("ubifs: journal: Handle xattrs like files")
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Link: https://lore.kernel.org/r/20240809031628.1069873-1-chengzhihao@huaweicloud.com
Reviewed-by: Jan Kara <jack@suse.cz>
Suggested-by: Jan Kara <jack@suse.cz>
Suggested-by: Mateusz Guzik <mjguzik@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-08-13 13:52:16 +02:00
..
9p Two fixes headed to stable trees: 2024-05-29 09:25:15 -07:00
adfs fs/adfs: add MODULE_DESCRIPTION 2024-07-18 09:50:08 +02:00
affs affs: struct slink_front: Replace 1-element array with flexible array 2024-07-11 16:14:26 +02:00
afs - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
autofs vfs-6.11.mount.api 2024-07-15 11:31:32 -07:00
bcachefs bcachefs fixes for 6.11-rc1 2024-07-22 10:59:08 -07:00
befs befs: Convert befs_symlink_read_folio() to use folio_end_read() 2024-05-31 12:31:39 +02:00
bfs mm, slab: remove last vestiges of SLAB_MEM_SPREAD 2024-03-12 20:32:19 -07:00
btrfs - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
cachefiles cachefiles: Set the max subreq size for cache writes to MAX_RW_COUNT 2024-07-24 10:53:13 +02:00
ceph netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags 2024-08-12 22:03:27 +02:00
coda coda: Convert coda_symlink_filler() to use folio_end_read() 2024-05-31 12:31:39 +02:00
configfs fs/configfs: Add a callback to determine attribute visibility 2024-06-17 20:42:57 +02:00
cramfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
crypto The usual shower of singleton fixes and minor series all over MM, 2024-05-19 09:21:03 -07:00
debugfs vfs-6.11.mount.api 2024-07-15 11:31:32 -07:00
devpts
dlm dlm: add rcu_barrier before destroy kmem cache 2024-06-13 12:48:46 -05:00
ecryptfs hardening updates for 6.10-rc1 2024-05-13 14:14:05 -07:00
efivarfs efivarfs: Convert to new uid/gid option parsing helpers 2024-07-02 06:21:18 +02:00
efs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
erofs erofs: convert comma to semicolon 2024-07-26 18:48:12 +08:00
exfat Description for this pull request: 2024-07-17 12:53:47 -07:00
exportfs fhandle: relax open_by_handle_at() permission checks 2024-05-28 15:57:23 +02:00
ext2 ext2: Verify bitmap and itable block numbers before using them 2024-06-26 12:54:11 +02:00
ext4 Many cleanups and bug fixes in ext4, especially for the fast commit 2024-07-18 17:03:42 -07:00
f2fs f2fs update for 6.11-rc1 2024-07-23 15:21:19 -07:00
fat vfs-6.11.mount.api 2024-07-15 11:31:32 -07:00
freevxfs freevxfs: Convert freevxfs to the new mount API. 2024-03-26 09:04:53 +01:00
fuse virtio: features, fixes, cleanups 2024-07-19 11:57:55 -07:00
gfs2 gfs2: Clean up glock demote logic 2024-07-09 10:40:03 +02:00
hfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
hfsplus vfs-6.11.misc 2024-07-15 10:52:51 -07:00
hostfs vfs-6.11-rc1.fixes.3 2024-07-27 15:11:59 -07:00
hpfs vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
hugetlbfs - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
iomap vfs-6.11.iomap 2024-07-15 13:28:14 -07:00
isofs \n 2024-07-17 13:11:42 -07:00
jbd2 jbd2: increase maximum transaction size 2024-07-08 23:59:37 -04:00
jffs2 Kbuild updates for v6.11 2024-07-23 14:32:21 -07:00
jfs Folio conversion from Matthew Wilcox and a few various fixes 2024-07-23 15:15:16 -07:00
kernfs kernfs: mount: Remove unnecessary ‘NULL’ values from knparent 2024-05-04 19:02:39 +02:00
lockd lockd: Use *-y instead of *-objs in Makefile 2024-07-08 14:10:03 -04:00
minix vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
netfs netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags 2024-08-12 22:03:27 +02:00
nfs netfs: Fix handling of USE_PGPRIV2 and WRITE_TO_CACHE flags 2024-08-12 22:03:27 +02:00
nfs_common fs: nfs: add missing MODULE_DESCRIPTION() macros 2024-07-08 13:47:24 -04:00
nfsd NFSD 6.11 Release Notes 2024-07-17 12:00:49 -07:00
nilfs2 nilfs2: handle inconsistent state in nilfs_btnode_create_block() 2024-07-26 14:33:10 -07:00
nls fs: nls: add missing MODULE_DESCRIPTION() macros 2024-06-03 16:37:07 +02:00
notify fsnotify: clear PARENT_WATCHED flags lazily 2024-06-05 09:52:38 +02:00
ntfs3 ntfs3 changes for 6.11-rc1 2024-07-22 10:50:18 -07:00
ocfs2 - In the series "treewide: Refactor heap related implementation", 2024-07-21 17:56:22 -07:00
omfs
openpromfs openpromfs: add missing MODULE_DESCRIPTION() macro 2024-06-20 09:46:01 +02:00
orangefs orangefs: Remove calls to set/clear the error flag 2024-05-31 12:31:41 +02:00
overlayfs ovl: fix encoding fid for lower only root 2024-06-14 10:30:40 +02:00
proc Random number generator updates for Linux 6.11-rc1. 2024-07-24 10:29:50 -07:00
pstore memblock: updates for 6.11-rc1 2024-07-18 14:48:11 -07:00
qnx4 qnx4: add MODULE_DESCRIPTION() 2024-05-28 11:52:53 +02:00
qnx6 qnx6: add MODULE_DESCRIPTION() 2024-05-28 11:52:49 +02:00
quota sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
ramfs mm: switch mm->get_unmapped_area() to a flag 2024-04-25 20:56:25 -07:00
reiserfs reiserfs: Remove call to folio_set_error() 2024-05-31 12:31:41 +02:00
romfs romfs: Convert romfs_read_folio() to use a folio 2024-05-31 12:31:42 +02:00
smb six smb3 client fixes 2024-07-27 20:08:07 -07:00
squashfs Mainly singleton patches, documented in their respective changelogs. 2024-05-19 14:02:03 -07:00
sysfs Merge 6.9-rc5 into driver-core-next 2024-04-23 13:27:43 +02:00
sysv fs: sysv: add MODULE_DESCRIPTION() 2024-05-28 11:52:45 +02:00
tests execve: Move KUnit tests to tests/ subdirectory 2024-07-22 18:25:47 -07:00
tracefs tracefs: Convert to new uid/gid option parsing helpers 2024-07-02 06:21:20 +02:00
ubifs ubifs: add check for crypto_shash_tfm_digest 2024-07-12 22:01:09 +02:00
udf udf: prevent integer overflow in udf_bitmap_free_blocks() 2024-06-26 12:54:11 +02:00
ufs - In the series "treewide: Refactor heap related implementation", 2024-07-21 17:56:22 -07:00
unicode unicode: add MODULE_DESCRIPTION() macros 2024-06-20 19:30:02 -04:00
vboxsf vfs-6.11.mount.api 2024-07-15 11:31:32 -07:00
verity bpf: treewide: Align kfunc signatures to prog point-of-view 2024-06-12 11:01:31 -07:00
xfs sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
zonefs zonefs: enable support for large folios 2024-06-11 11:22:57 +09:00
aio.c - 875fa64577da ("mm/hugetlb_vmemmap: fix race with speculative PFN 2024-07-21 17:15:46 -07:00
anon_inodes.c fs: Create anon_inode_getfile_fmode() 2024-04-26 10:33:05 +02:00
attr.c fs: Export in_group_or_capable() 2024-06-25 11:15:48 +02:00
backing-file.c ovl: implement tmpfile 2024-05-02 20:35:57 +02:00
bad_inode.c
binfmt_elf_fdpic.c fs: don't block i_writecount during exec 2024-06-03 15:52:10 +02:00
binfmt_elf.c execve fix for v6.11-rc1 2024-07-23 17:30:42 -07:00
binfmt_flat.c
binfmt_misc.c vfs-6.11.module.description 2024-07-15 11:14:59 -07:00
binfmt_script.c fs: binfmt: add missing MODULE_DESCRIPTION() macros 2024-05-28 12:06:51 +02:00
buffer.c Many cleanups and bug fixes in ext4, especially for the fast commit 2024-07-18 17:03:42 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
d_path.c
dax.c dax: use huge_zero_folio 2024-04-25 20:56:20 -07:00
dcache.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
direct-io.c fs/direct-io: remove redundant assignment to variable retval 2024-04-11 10:21:24 +02:00
drop_caches.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
eventfd.c eventfd: strictly check the count parameter of eventfd_write to avoid inputting illegal strings 2024-02-08 10:12:26 +01:00
eventpoll.c epoll: be better about file lifetimes 2024-05-05 14:00:48 -07:00
exec.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
fcntl.c fcntl: add F_DUPFD_QUERY fcntl() 2024-05-10 08:26:31 +02:00
fhandle.c fhandle: relax open_by_handle_at() permission checks 2024-05-28 15:57:23 +02:00
file_table.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
file.c fs/file: fix the check in find_next_fd() 2024-05-30 09:11:47 +02:00
filesystems.c
fs_context.c
fs_parser.c fs_parse: add uid & gid option option parsing helpers 2024-07-02 06:20:49 +02:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
fsopen.c vfs: retire user_path_at_empty and drop empty arg from getname_flags 2024-06-05 17:03:57 +02:00
init.c
inode.c vfs: Don't evict inode under the inode lru traversing context 2024-08-13 13:52:16 +02:00
internal.h vfs-6.11.pidfs 2024-07-15 12:34:01 -07:00
ioctl.c fs/ioctl: Add a comment to keep the logic in sync with LSM policies 2024-05-13 06:58:35 +02:00
Kconfig - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames 2024-03-14 17:43:30 -07:00
Kconfig.binfmt exec: Add KUnit test for bprm_stack_limits() 2024-06-19 13:13:55 -07:00
kernel_read_file.c
libfs.c libfs: fix infinite directory reads for offset dir 2024-08-12 22:03:26 +02:00
locks.c filelock: fix name of file_lease slab cache 2024-08-12 22:03:25 +02:00
Makefile vfs-6.9.pidfd 2024-03-11 10:21:06 -07:00
mbcache.c vfs: remove SLAB_MEM_SPREAD flag usage 2024-02-27 11:21:31 +01:00
mnt_idmapping.c fs/mnt_idmapping.c: Return -EINVAL when no map is written 2024-02-08 10:12:37 +01:00
mount.h vfs-6.11.mount 2024-07-15 11:54:04 -07:00
mpage.c buffer: Remove calls to set and clear the folio error flag 2024-05-31 12:31:43 +02:00
namei.c vfs: correct the comments of vfs_*() helpers 2024-07-24 10:53:12 +02:00
namespace.c fs: use all available ids 2024-07-24 10:53:13 +02:00
nsfs.c nsfs: use cleanup guard 2024-07-18 09:50:08 +02:00
open.c vfs-6.11.misc 2024-07-15 10:52:51 -07:00
pidfs.c pidfs: handle kernels without namespaces cleanly 2024-07-24 10:53:13 +02:00
pipe.c sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
pnode.c
pnode.h
posix_acl.c lsm/stable-6.9 PR 20240312 2024-03-12 20:03:34 -07:00
proc_namespace.c fs: rename show_mnt_opts -> show_vfsmnt_opts 2024-06-28 14:36:43 +02:00
read_write.c fs: Initial atomic write support 2024-06-20 15:19:17 -06:00
readdir.c readdir: Add missing quote in macro comment 2024-06-03 15:49:26 +02:00
remap_range.c vfs: export remap and write check helpers 2024-04-15 14:54:13 -07:00
select.c fs/select: rework stack allocation hack for clang 2024-02-20 09:23:52 +01:00
seq_file.c seq_file: Simplify __seq_puts() 2024-05-02 16:28:20 +02:00
signalfd.c signalfd: drop an obsolete comment 2024-05-24 13:34:07 +02:00
splice.c remove call_{read,write}_iter() functions 2024-04-15 16:03:25 -04:00
stack.c
stat.c for-6.11/block-20240710 2024-07-15 14:20:22 -07:00
statfs.c
super.c fs: don't allow non-init s_user_ns for filesystems without FS_USERNS_MOUNT 2024-07-27 09:56:33 +02:00
sync.c
sysctls.c
timerfd.c timerfd: convert to ->read_iter() 2024-04-10 16:23:02 -06:00
userfaultfd.c mm: provide mm_struct and address to huge_ptep_get() 2024-07-12 15:52:15 -07:00
utimes.c
xattr.c vfs: Fix potential circular locking through setxattr() and removexattr() 2024-07-24 10:53:14 +02:00