linux/fs
Stephen Brennan da4d6b9cf8 proc: allow pid_revalidate() during LOOKUP_RCU
Problem Description:

When running ~128 parallel instances of

  TZ=/etc/localtime ps -fe >/dev/null

on a 128-CPU machine, the %sys utilization reaches 97%, and perf shows
the following code path as being responsible for heavy contention on the
d_lockref spinlock:

      walk_component()
        lookup_fast()
          d_revalidate()
            pid_revalidate() // returns -ECHILD
          unlazy_child()
            lockref_get_not_dead(&nd->path.dentry->d_lockref) <-- contention

The reason is that pid_revalidate() is triggering a drop from RCU to ref
path walk mode.  All concurrent path lookups thus try to grab a
reference to the dentry for /proc/, before re-executing pid_revalidate()
and then stepping into the /proc/$pid directory.  Thus there is huge
spinlock contention.

This patch allows pid_revalidate() to execute in RCU mode, meaning that
the path lookup can successfully enter the /proc/$pid directory while
still in RCU mode.  Later on, the path lookup may still drop into ref
mode, but the contention will be much reduced at this point.
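
The effect described above can be illustrated with a small user-space
model. This is only a sketch under stated assumptions, not kernel code:
every toy_* name and the TOY_LOOKUP_RCU value are hypothetical stand-ins,
and the counter merely models the reference grab on the parent dentry
that happens when a lookup drops out of RCU mode.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the kernel's LOOKUP_RCU flag; value is illustrative. */
#define TOY_LOOKUP_RCU 0x0040u

/* Toy dentry: only counts how often a reference on it is taken. */
struct toy_dentry {
	unsigned long ref_grabs;
};

typedef int (*toy_revalidate_fn)(unsigned int flags);

/* Old behaviour: refuse to run in RCU mode and ask the caller to retry. */
static int toy_revalidate_bail(unsigned int flags)
{
	if (flags & TOY_LOOKUP_RCU)
		return -ECHILD;
	return 1; /* dentry is still valid */
}

/* New behaviour: the check never sleeps, so it answers directly in RCU mode. */
static int toy_revalidate_rcu_ok(unsigned int flags)
{
	(void)flags;
	return 1;
}

/*
 * One path component: try RCU mode first. On -ECHILD, take a reference
 * on the parent (the contended step in the real kernel) and retry in
 * ref mode.
 */
static int toy_walk_component(struct toy_dentry *parent, toy_revalidate_fn reval)
{
	int ret = reval(TOY_LOOKUP_RCU);

	if (ret == -ECHILD) {
		parent->ref_grabs++; /* models the reference grab on the parent */
		ret = reval(0);
	}
	return ret;
}

/* Run a number of lookups and report how many touched the parent refcount. */
static unsigned long toy_count_ref_grabs(toy_revalidate_fn reval, int lookups)
{
	struct toy_dentry proc_root = { 0 };

	for (int i = 0; i < lookups; i++) {
		int ret = toy_walk_component(&proc_root, reval);

		assert(ret == 1);
		(void)ret;
	}
	return proc_root.ref_grabs;
}
```

In this model, toy_count_ref_grabs(toy_revalidate_bail, n) touches the
shared counter once per lookup, while
toy_count_ref_grabs(toy_revalidate_rcu_ok, n) never touches it.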

With this patch applied, %sys utilization falls to around 85% under the
same workload, and the number of ps processes executed per unit time
increases by 3x-4x.  Although this particular workload is a bit
contrived, we have seen some large collections of eager monitoring
scripts which produced similarly high %sys time due to contention in the
/proc directory.

As a result of this patch, Al noted that several procfs methods which were
only called in ref-walk mode could now be called from RCU mode.  To
ensure that this patch is safe, I audited all the inode get_link and
permission() implementations, as well as dentry d_revalidate()
implementations, in fs/proc.  The purpose here is to ensure that they
either are safe to call in RCU (i.e.  don't sleep) or correctly bail out
of RCU mode if they don't support it.  My analysis shows that all
at-risk procfs methods are safe to call under RCU, and thus this patch
is safe.

Procfs RCU-walk Analysis:

This analysis is up-to-date with 5.15-rc3.  When called under RCU mode,
these functions have arguments as follows:

* get_link() receives a NULL dentry pointer when called in RCU mode.
* permission() receives MAY_NOT_BLOCK in the mode parameter when called
  from RCU.
* d_revalidate() receives LOOKUP_RCU in flags.
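
The three conventions listed above might be sketched as follows. This is
a user-space illustration only: the toy_* functions, struct toy_dentry,
and the TOY_* flag values are hypothetical stand-ins, and each function
shows just the "bail out" pattern (returning -ECHILD so the caller can
retry in ref mode), not any real procfs method.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for MAY_NOT_BLOCK and LOOKUP_RCU; values are illustrative. */
#define TOY_MAY_NOT_BLOCK 0x0080
#define TOY_LOOKUP_RCU    0x0040u

struct toy_dentry {
	int placeholder;
};

/* get_link(): RCU mode is signalled by a NULL dentry pointer. */
static int toy_get_link(const struct toy_dentry *dentry)
{
	if (dentry == NULL)
		return -ECHILD; /* cannot proceed without blocking: bail out */
	return 0;
}

/* permission(): RCU mode is signalled by MAY_NOT_BLOCK in the mask. */
static int toy_permission(int mask)
{
	if (mask & TOY_MAY_NOT_BLOCK)
		return -ECHILD;
	return 0;
}

/* d_revalidate(): RCU mode is signalled by LOOKUP_RCU in flags. */
static int toy_d_revalidate(unsigned int flags)
{
	if (flags & TOY_LOOKUP_RCU)
		return -ECHILD;
	return 1; /* valid */
}

/* Exercises each convention once in RCU mode and once in ref mode. */
static void toy_check_conventions(void)
{
	struct toy_dentry d = { 0 };

	assert(toy_get_link(NULL) == -ECHILD);
	assert(toy_get_link(&d) == 0);
	assert(toy_permission(TOY_MAY_NOT_BLOCK) == -ECHILD);
	assert(toy_permission(0) == 0);
	assert(toy_d_revalidate(TOY_LOOKUP_RCU) == -ECHILD);
	assert(toy_d_revalidate(0) == 1);
}
```

A method that is RCU safe simply skips the early return and does its
work without sleeping.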

The following functions are either trivially RCU safe, or they explicitly
bail out at the start of the function when called in RCU mode:

proc_ns_get_link       (bails out)
proc_get_link          (RCU safe)
proc_pid_get_link      (bails out)
map_files_d_revalidate (bails out)
map_misc_d_revalidate  (bails out)
proc_net_d_revalidate  (RCU safe)
proc_sys_revalidate    (bails out, also not under /proc/$pid)
tid_fd_revalidate      (bails out)
proc_sys_permission    (not under /proc/$pid)

The remainder of the functions require a bit more detail:

* proc_fd_permission: RCU safe. All of the body of this function is
  under rcu_read_lock(), except generic_permission() which declares
  itself RCU safe in its documentation string.
* proc_self_get_link uses GFP_ATOMIC in the RCU case, so it is RCU aware
  and otherwise looks safe. The same is true of proc_thread_self_get_link.
* proc_map_files_get_link: calls ns_capable, which calls capable(), and
  thus calls into the audit code (see note #1 below). The remainder is
  just a call to the trivially safe proc_pid_get_link().
* proc_pid_permission: calls ptrace_may_access(), which appears RCU
  safe, although it does call into the "security_ptrace_access_check()"
  hook, which looks safe under Smack and SELinux. Just the audit code is
  of concern. Also uses get_task_struct() and put_task_struct(), see
  note #2 below.
* proc_tid_comm_permission: Appears safe, though calls put_task_struct
  (see note #2 below).

Note #1:
  Most of the concern about RCU safety has centered around the audit code.
  However, since b17ec22fb3 ("selinux: slow_avc_audit has become
  non-blocking"), it's safe to call this code under RCU. So all of the
  above are safe by my estimation.

Note #2: get_task_struct() and put_task_struct():
  The majority of get_task_struct() is under RCU read lock, and in any
  case it is a simple increment. But put_task_struct() is complex, given
  that it could at some point free the task struct, and this process has
  many steps which I couldn't manually verify. However, several other
  places call put_task_struct() under RCU, so it appears safe to use
  here too (see kernel/hung_task.c:165 or rcu/tree-stall.h:296).
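
The asymmetry in Note #2 can be seen in a minimal reference-counting
sketch. It is a user-space model using C11 atomics, not the kernel's
implementation: struct toy_task, the toy_* helpers, and the freed_flag
field are all hypothetical, and the real release path has far more steps
than a single free().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Toy task object with a usage count, loosely modelled on a refcounted struct. */
struct toy_task {
	atomic_int usage;
	bool *freed_flag; /* lets the caller observe when the object is released */
};

/* Allocate a task holding one initial reference. */
static struct toy_task *toy_task_alloc(bool *freed_flag)
{
	struct toy_task *t = malloc(sizeof(*t));

	if (t == NULL)
		return NULL;
	atomic_init(&t->usage, 1);
	t->freed_flag = freed_flag;
	*freed_flag = false;
	return t;
}

/* "get": a simple increment, which never blocks. */
static void toy_get_task(struct toy_task *t)
{
	assert(t != NULL);
	atomic_fetch_add(&t->usage, 1);
}

/*
 * "put": a decrement, but the caller that drops the last reference also
 * has to run the release path. That release path is the part whose
 * safety depends on the calling context.
 */
static void toy_put_task(struct toy_task *t)
{
	assert(t != NULL);
	if (atomic_fetch_sub(&t->usage, 1) == 1) {
		*t->freed_flag = true;
		free(t);
	}
}
```

A get followed by a put leaves the object alive; only the put that
drops the count to zero runs the release step.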

Patch description:

pid_revalidate() drops from RCU into REF lookup mode.  When many threads
are resolving paths within /proc in parallel, this can result in heavy
spinlock contention on d_lockref as each thread tries to grab a
reference to the /proc dentry (and drop it shortly thereafter).

Investigation indicates that it is not necessary to drop RCU in
pid_revalidate(), as no RCU data is modified and the function never
sleeps.  So, remove the LOOKUP_RCU check.

Link: https://lkml.kernel.org/r/20211004175629.292270-2-stephen.s.brennan@oracle.com
Signed-off-by: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2021-11-09 10:02:49 -08:00
9p 9p: Fix a bunch of kerneldoc warnings shown up by W=1 2021-10-04 22:07:46 +01:00
adfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
affs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
afs netfslib, cachefiles and afs fixes 2021-10-07 11:20:08 -07:00
autofs autofs: fix wait name hash calculation in autofs_wait() 2021-10-20 21:09:02 -04:00
befs isystem: ship and use stdarg.h 2021-08-19 09:02:55 +09:00
bfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
btrfs for-5.15-rc7-tag 2021-10-29 10:46:59 -07:00
cachefiles cachefiles: Change %p in format strings to something else 2021-08-27 13:34:02 +01:00
ceph ceph: fix handling of "meta" errors 2021-10-19 09:36:06 +02:00
cifs cifs: fix incorrect check for null pointer in header_assemble 2021-09-23 21:12:53 -05:00
coda coda: fix reference counting in coda_file_mmap error path 2021-04-23 14:42:39 -07:00
configfs configfs: fix a race in configfs_lookup() 2021-08-25 07:58:49 +02:00
cramfs
crypto fscrypt: align Base64 encoding with RFC 4648 base64url 2021-07-25 20:47:05 -07:00
debugfs debugfs: debugfs_create_file_size(): use IS_ERR to check for error 2021-09-21 09:09:06 +02:00
devpts
dlm fs: dlm: avoid comms shutdown delay in release_lockspace 2021-09-01 11:29:14 -05:00
ecryptfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
efivarfs efivars: convert to fileattr 2021-04-12 15:04:29 +02:00
efs
erofs erofs: clear compacted_2b if compacted_4b_initial > totalidx 2021-09-23 23:23:04 +08:00
exfat Description for this pull request: 2021-07-06 11:06:04 -07:00
exportfs
ext2 ext2: fix sleeping in atomic bugs on error 2021-09-22 13:05:23 +02:00
ext4 Fix a number of ext4 bugs in fast_commit, inline data, and delayed 2021-10-03 13:56:53 -07:00
f2fs f2fs-for-5.15-rc1 2021-09-04 10:48:47 -07:00
fat linux-kselftest-kunit-5.15-rc1 2021-09-02 12:32:12 -07:00
freevxfs
fscache fscache: Remove an unused static variable 2021-10-04 22:13:12 +01:00
fuse fuse: clean up error exits in fuse_fill_super() 2021-10-21 10:01:39 +02:00
gfs2 Merge branch 'work.gfs2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-09-09 12:45:26 -07:00
hfs hfs: add lock nesting notation to hfs_find_init 2021-07-15 10:13:49 -07:00
hfsplus hfsplus: report create_date to kstat.btime 2021-07-01 11:06:06 -07:00
hostfs hostfs: support splice_write 2021-08-26 22:28:02 +02:00
hpfs hpfs: use iomap_fiemap to implement ->fiemap 2021-07-27 11:00:36 +02:00
hugetlbfs mm,hugetlb: remove mlock ulimit for SHM_HUGETLB 2021-11-09 10:02:48 -08:00
iomap iomap: standardize tracepoint formatting and storage 2021-08-26 09:18:53 -07:00
isofs isofs: joliet: Fix iocharset=utf8 mount option 2021-08-12 16:07:14 +02:00
jbd2 jbd2: add sparse annotations for add_transaction_credits() 2021-08-30 23:36:50 -04:00
jffs2 vfs: add rcu argument to ->get_acl() callback 2021-08-18 22:08:24 +02:00
jfs vfs: add rcu argument to ->get_acl() callback 2021-08-18 22:08:24 +02:00
kernfs kernfs: don't create a negative dentry if inactive node exists 2021-10-04 10:27:18 +02:00
ksmbd ksmbd: add buffer validation in session setup 2021-10-20 00:07:10 -05:00
lockd Critical bug fixes: 2021-09-22 09:21:02 -07:00
minix mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
netfs netfs: Fix READ/WRITE confusion when calling iov_iter_xarray() 2021-10-05 11:22:06 +01:00
nfs NFS Client Updates for Linux 5.15 2021-09-04 10:25:26 -07:00
nfs_common nfs: Fix kerneldoc warning shown up by W=1 2021-10-04 22:02:17 +01:00
nfsd Bug fixes for NFSD error handling paths 2021-10-07 14:11:40 -07:00
nilfs2 Merge branch 'akpm' (patches from Andrew) 2021-09-08 12:55:35 -07:00
nls
notify fsnotify: fix sb_connectors leak 2021-09-10 09:46:48 -07:00
ntfs Merge branch 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs 2021-07-03 11:30:04 -07:00
ntfs3 Fixed xfstests generic/016 generic/021 generic/022 generic/041 generic/274 generic/423, 2021-10-15 09:58:11 -04:00
ocfs2 ocfs2: do not zero pages beyond i_size 2021-11-06 13:30:32 -07:00
omfs mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
openpromfs
orangefs vfs: add rcu argument to ->get_acl() callback 2021-08-18 22:08:24 +02:00
overlayfs ovl: fix IOCB_DIRECT if underlying fs doesn't support direct IO 2021-09-28 09:16:12 +02:00
proc proc: allow pid_revalidate() during LOOKUP_RCU 2021-11-09 10:02:49 -08:00
pstore for-5.14/drivers-2021-06-29 2021-06-30 12:21:16 -07:00
qnx4 qnx4: work around gcc false positive warning bug 2021-09-21 08:36:48 -07:00
qnx6
quota quota: remove unnecessary oom message 2021-06-22 10:40:52 +02:00
ramfs fs: move ramfs_aops to libfs 2021-06-29 10:53:48 -07:00
reiserfs Kbuild updates for v5.15 2021-09-03 15:33:47 -07:00
romfs
smbfs_common cifs: remove pathname for file from SPDX header 2021-09-13 14:51:10 -05:00
squashfs squashfs: use bvec_virt 2021-08-16 10:50:32 -06:00
sysfs sysfs: Allow deferred execution of iomem_get_mapping() 2021-08-06 13:05:28 +02:00
sysv mm: require ->set_page_dirty to be explicitly wired up 2021-06-29 10:53:48 -07:00
tracefs tracing: Fix various typos in comments 2021-03-23 14:08:18 -04:00
ubifs ubifs: report correct st_size for encrypted symlinks 2021-07-25 20:01:07 -07:00
udf udf_get_extendedattr() had no boundary checks. 2021-08-23 13:35:19 +02:00
ufs isystem: ship and use stdarg.h 2021-08-19 09:02:55 +09:00
unicode .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
vboxsf vboxfs: fix broken legacy mount signature checking 2021-09-27 11:26:21 -07:00
verity fs-verity: fix signed integer overflow with i_size near S64_MAX 2021-09-22 10:56:34 -07:00
xfs libnvdimm for v5.15 2021-09-09 11:39:57 -07:00
zonefs \n 2021-08-30 10:24:50 -07:00
aio.c eventfd: Make signal recursion protection a task bit 2021-08-28 01:33:02 +02:00
anon_inodes.c
attr.c fs: Move notify_change permission checks into may_setattr 2021-08-13 00:41:05 -04:00
bad_inode.c vfs: add rcu argument to ->get_acl() callback 2021-08-18 22:08:24 +02:00
binfmt_aout.c binfmt: a.out: Fix bogus semicolon 2021-09-05 10:15:05 -07:00
binfmt_elf_fdpic.c binfmt: remove in-tree usage of MAP_DENYWRITE 2021-09-03 18:42:01 +02:00
binfmt_elf.c elf: don't use MAP_FIXED_NOREPLACE for elf interpreter mappings 2021-10-03 14:02:58 -07:00
binfmt_flat.c binfmt: remove in-tree usage of MAP_EXECUTABLE 2021-06-29 10:53:50 -07:00
binfmt_misc.c
binfmt_script.c
buffer.c mm: fs: invalidate bh_lrus for only cold path 2021-09-24 16:13:35 -07:00
char_dev.c
compat_binfmt_elf.c
coredump.c coredump: fix memleak in dump_vma_snapshot() 2021-09-08 11:50:27 -07:00
d_path.c d_path: fix Kernel doc validator complaining 2021-11-06 13:30:32 -07:00
dax.c New code for 5.15: 2021-08-31 11:13:35 -07:00
dcache.c useful constants: struct qstr for ".." 2021-04-15 22:36:45 -04:00
direct-io.c fs: direct-io: fix missing sdio->boundary 2021-04-09 14:54:23 -07:00
drop_caches.c fs: drop_caches: fix skipping over shadow cache inodes 2021-09-03 09:58:10 -07:00
eventfd.c eventfd: Export eventfd_wake_count to modules 2021-09-06 07:20:56 -04:00
eventpoll.c ARM development updates for 5.15: 2021-09-09 13:25:49 -07:00
exec.c Merge tag 'denywrite-for-5.15' of git://github.com/davidhildenbrand/linux 2021-09-04 11:35:47 -07:00
fcntl.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
fhandle.c switch file_open_root() to struct path 2021-04-07 13:56:43 -04:00
file_table.c
file.c virtio,vdpa,vhost: features, fixes 2021-09-11 14:48:42 -07:00
filesystems.c fs: simplify get_filesystem_list / get_all_fs_names 2021-08-23 01:25:40 -04:00
fs_context.c memcg: charge fs_context and legacy_fs_context 2021-09-03 09:58:12 -07:00
fs_parser.c namei: Standardize callers of filename_lookup() 2021-09-07 16:07:47 -04:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
fsopen.c
init.c
inode.c vfs: keep inodes with page cache off the inode shrinker LRU 2021-11-09 10:02:48 -08:00
internal.h vfs: keep inodes with page cache off the inode shrinker LRU 2021-11-09 10:02:48 -08:00
io_uring.c io_uring: apply worker limits to previous users 2021-10-21 11:19:38 -06:00
io-wq.c io-wq: max_worker fixes 2021-10-19 17:09:34 -06:00
io-wq.h io-wq: provide a way to limit max number of workers 2021-08-29 07:55:55 -06:00
ioctl.c New code for 5.15: 2021-08-31 11:06:32 -07:00
Kconfig 4 cifs/smb3 fixes, one for DFS reconnect, and one to begin creating common headers for server and client and the other two to rename the cifs_common directory to smbfs_common to be more consistent ie change use of the name cifs to smb which is more accurate 2021-09-12 10:10:21 -07:00
Kconfig.binfmt binfmt: remove support for em86 (alpha only) 2021-07-25 22:33:03 -07:00
kernel_read_file.c vfs: check fd has read access in kernel_read_file_from_fd() 2021-10-18 20:22:03 -10:00
libfs.c fs: remove noop_set_page_dirty() 2021-06-29 10:53:48 -07:00
locks.c Revert "memcg: enable accounting for file lock caches" 2021-09-07 11:21:48 -07:00
Makefile 4 cifs/smb3 fixes, one for DFS reconnect, and one to begin creating common headers for server and client and the other two to rename the cifs_common directory to smbfs_common to be more consistent ie change use of the name cifs to smb which is more accurate 2021-09-12 10:10:21 -07:00
mbcache.c
mount.h
mpage.c
namei.c putname(): IS_ERR_OR_NULL() is wrong here 2021-09-07 16:14:05 -04:00
namespace.c Merge branch 'akpm' (patches from Andrew) 2021-09-03 10:08:28 -07:00
no-block.c
nsfs.c
open.c mm, thp: fix incorrect unmap behavior for private pages 2021-11-06 13:30:41 -07:00
pipe.c Revert "mm/gup: remove try_get_page(), call try_get_compound_head() directly" 2021-09-07 11:03:45 -07:00
pnode.c
pnode.h
posix_acl.c fs/posix_acl.c: avoid -Wempty-body warning 2021-11-06 13:30:32 -07:00
proc_namespace.c
read_write.c fs: clean up after mandatory file locking support removal 2021-08-24 07:52:45 -04:00
readdir.c readdir: make sure to verify directory entry for legacy interfaces too 2021-04-17 11:39:49 -07:00
remap_range.c fs: remove mandatory file locking support 2021-08-23 06:15:36 -04:00
select.c Revert "memcg: enable accounting for pollfd and select bits arrays" 2021-09-07 11:26:23 -07:00
seq_file.c seq_file: disallow extremely large seq buffer allocations 2021-07-19 17:18:48 -07:00
signalfd.c signal: Rename SIL_PERF_EVENT SIL_FAULT_PERF_EVENT for consistency 2021-07-23 13:16:43 -05:00
splice.c
stack.c
stat.c fs: add generic helper for filling statx attribute flags 2021-08-17 11:47:43 +02:00
statfs.c
super.c fs: explicitly unregister per-superblock BDIs 2021-11-06 13:30:34 -07:00
sync.c
timerfd.c timerfd: Provide timerfd_resume() 2021-08-10 17:57:22 +02:00
userfaultfd.c userfaultfd: fix a race between writeprotect and exit_mmap() 2021-10-18 20:22:02 -10:00
utimes.c
xattr.c xattr: fix kernel-doc for mnt_userns and vfs xattr helpers 2021-03-23 11:20:26 +01:00