linux/fs
Mathieu Desnoyers af7f588d8f sched: Introduce per-memory-map concurrency ID
This feature allows the scheduler to expose a per-memory map concurrency
ID to user-space. This concurrency ID is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a memory map. If a memory map has fewer threads than
cores, or is limited to running on a few cores concurrently through sched
affinity or cgroup cpusets, the concurrency IDs will be values close
to 0, thus allowing efficient use of user-space memory for per-cpu
data structures.
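
As an illustration of why the IDs stay close to 0, here is a minimal
user-space sketch, assuming the pool always hands out the lowest free ID.
It is a model, not the kernel implementation: struct mm_model,
MODEL_NR_CPUS, model_cid_get() and model_cid_put() are hypothetical names
introduced for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical user-space model of one memory map's ID pool (not kernel code). */
#define MODEL_NR_CPUS 64	/* assumed size of the possible-cpus range */

struct mm_model {
	uint64_t cid_mask;	/* bit n set => concurrency ID n is in use */
};

/* Hand out the lowest free ID, or -1 if every ID is taken. */
static int model_cid_get(struct mm_model *mm)
{
	for (int cid = 0; cid < MODEL_NR_CPUS; cid++) {
		uint64_t bit = UINT64_C(1) << cid;

		if (!(mm->cid_mask & bit)) {
			mm->cid_mask |= bit;
			return cid;
		}
	}
	return -1;
}

/* Return an ID to the pool once its thread stops running. */
static void model_cid_put(struct mm_model *mm, int cid)
{
	assert(cid >= 0 && cid < MODEL_NR_CPUS);
	mm->cid_mask &= ~(UINT64_C(1) << cid);
}
```

With three threads running concurrently the model only ever hands out IDs
0, 1 and 2, whatever the CPU numbers are, so a user-space array indexed by
the ID needs just three populated slots.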

This feature is meant to be exposed by a new rseq thread area field.

The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:

- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited
  cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with
  restricted cgroup cpuset per container.

One of the key concerns from scheduler maintainers is the overhead
associated with additional spin locks or atomic operations in the
scheduler fast-path. This is why the following optimization is
implemented.

On context switch between threads belonging to the same memory map,
transfer the mm_cid from prev to next without any atomic ops. This
takes care of use-cases involving frequent context switch between
threads belonging to the same memory map.
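
The hand-over described above can be sketched in user space as follows.
This is a simplified model under stated assumptions, not the patch's code:
struct mm_model, struct task_model, the slow_ops counter and the model_*
helpers are hypothetical, and slow_ops merely counts the pool accesses
that would require a lock in a real implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct mm_model {
	uint64_t cid_mask;	/* bit n set => concurrency ID n is in use */
	unsigned int slow_ops;	/* pool accesses that would need the lock */
};

struct task_model {
	struct mm_model *mm;	/* NULL models a kernel thread */
	int mm_cid;		/* -1 while the task is not running */
};

/* Slow path: take the lowest free ID from the pool. */
static int model_cid_get(struct mm_model *mm)
{
	mm->slow_ops++;
	for (int cid = 0; cid < 64; cid++) {
		if (!(mm->cid_mask & (UINT64_C(1) << cid))) {
			mm->cid_mask |= UINT64_C(1) << cid;
			return cid;
		}
	}
	return -1;
}

/* Slow path: give an ID back to the pool. */
static void model_cid_put(struct mm_model *mm, int cid)
{
	assert(cid >= 0 && cid < 64);
	mm->slow_ops++;
	mm->cid_mask &= ~(UINT64_C(1) << cid);
}

/* Model of the context-switch hook: prev stops running, next starts. */
static void model_switch_mm_cid(struct task_model *prev,
				struct task_model *next)
{
	if (prev->mm != NULL) {
		if (next->mm == prev->mm) {
			/* Same memory map: plain hand-over, pool untouched. */
			next->mm_cid = prev->mm_cid;
			prev->mm_cid = -1;
			return;
		}
		model_cid_put(prev->mm, prev->mm_cid);
		prev->mm_cid = -1;
	}
	if (next->mm != NULL)
		next->mm_cid = model_cid_get(next->mm);
}
```

In this model a switch between two threads of the same memory map leaves
slow_ops unchanged, while a switch across memory maps costs one put on the
outgoing map and one get on the incoming one.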

Additional optimizations can be done if the spin locks added when
context switching between threads belonging to different memory maps end
up being a performance bottleneck. Those are left out of this patch
though. A performance impact would have to be clearly demonstrated to
justify the added complexity.

Credit goes to Paul Turner (Google) for the original virtual cpu id
idea. This feature is implemented based on discussions with Paul
Turner and Peter Oskolkov (Google), but I took the liberty of implementing
scheduler fast-path optimizations and my own NUMA-awareness scheme.
Rumor has it that Google has been running a rseq vcpu_id extension
internally in production for a year. The tcmalloc source code indeed has
comments hinting at a vcpu_id prototype extension to the rseq system
call [1].

The following benchmarks do not show any significant overhead added to
the scheduler context switch by this feature:

* perf bench sched messaging (process)

Baseline:                    86.5±0.3 ms
With mm_cid:                 86.7±2.6 ms

* perf bench sched messaging (threaded)

Baseline:                    84.3±3.0 ms
With mm_cid:                 84.7±2.6 ms

* hackbench (process)

Baseline:                    82.9±2.7 ms
With mm_cid:                 82.9±2.9 ms

* hackbench (threaded)

Baseline:                    85.2±2.6 ms
With mm_cid:                 84.4±2.9 ms

[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20221122203932.231377-8-mathieu.desnoyers@efficios.com
2022-12-27 12:52:11 +01:00
9p 9p-for-6.2-rc1 2022-12-23 11:39:18 -08:00
adfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
affs affs: move from strlcpy with unused retval to strscpy 2022-08-19 13:03:10 +02:00
afs afs: Stop implementing ->writepage() 2022-12-22 11:40:35 +00:00
autofs autofs: remove unused ino field inode 2022-07-17 17:31:42 -07:00
befs befs: Convert befs_symlink_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
bfs fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
btrfs hardening updates for v6.2-rc1 2022-12-14 12:20:00 -08:00
cachefiles fscache,cachefiles: add prepare_ondemand_read() callback 2022-12-07 10:56:29 +08:00
ceph A fix to facilitate prompt cap releases on async creates from Xiubo. 2022-12-14 10:35:47 -08:00
cifs 20 cifs/smb3 client fixes, mostly related to reconnect and/or DFS 2022-12-21 10:40:08 -08:00
coda coda: Convert coda_symlink_filler() to use a folio 2022-08-02 12:34:03 -04:00
configfs configfs: fix possible memory leak in configfs_create_dir() 2022-12-02 11:11:22 +01:00
cramfs cramfs: read_mapping_page() is synchronous 2022-08-02 12:34:02 -04:00
crypto for-6.2/block-2022-12-08 2022-12-13 10:43:59 -08:00
debugfs debugfs: fix error when writing negative value to atomic_t debugfs file 2022-11-30 16:13:16 -08:00
devpts
dlm Treewide: Stop corrupting socket's task_frag 2022-12-19 17:28:49 -08:00
ecryptfs ecryptfs: use stub posix acl handlers 2022-10-20 10:13:31 +02:00
efivarfs efi: vars: prohibit reading random seed variables 2022-12-01 09:51:21 +01:00
efs efs: Convert efs symlinks to read_folio 2022-05-09 16:21:45 -04:00
erofs Changes since the last update: 2022-12-12 20:14:04 -08:00
exfat Description for this pull request: 2022-12-15 18:14:21 -08:00
exportfs exportfs: use pr_debug for unreachable debug statements 2022-11-28 12:54:45 -05:00
ext2 \n 2022-12-12 20:32:50 -08:00
ext4 treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
f2fs f2fs-for-6.2-rc1 2022-12-14 15:27:57 -08:00
fat MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
freevxfs freevxfs: Convert vxfs_immed_read_folio() to use a folio 2022-08-02 12:34:03 -04:00
fscache iov_iter work; most of that is about getting rid of 2022-12-12 18:29:54 -08:00
fuse MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
gfs2 gfs2 fixes 2022-12-17 08:18:04 -06:00
hfs MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
hfsplus MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
hostfs hostfs: move from strlcpy with unused retval to strscpy 2022-09-19 22:46:25 +02:00
hpfs hpfs: remove ->writepage 2022-12-11 18:12:18 -08:00
hugetlbfs hugetlbfs: inode: remove unnecessary (void*) conversions 2022-11-30 15:58:56 -08:00
iomap New XFS code for 6.2: 2022-12-14 10:11:51 -08:00
isofs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
jbd2 jbd2: switch jbd2_submit_inode_data() to use fs-provided hook for data writeout 2022-12-08 21:49:25 -05:00
jffs2 fs: rename current get acl method 2022-10-20 10:13:27 +02:00
jfs MM patches for 6.2-rc1. 2022-12-13 19:29:45 -08:00
kernfs kernfs: fix all kernel-doc warnings and multiple typos 2022-11-23 19:28:26 +01:00
ksmbd six ksmbd server fixes 2022-12-15 09:29:19 -08:00
lockd NFSD 6.2 Release Notes 2022-12-12 20:54:39 -08:00
minix vfs: open inside ->tmpfile() 2022-09-24 07:00:00 +02:00
netfs use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
nfs Driver Core changes for 6.2-rc1 2022-12-16 03:54:54 -08:00
nfs_common
nfsd nfsd-6.2 supplement: 2022-12-19 09:10:33 -06:00
nilfs2 treewide: Convert del_timer*() to timer_shutdown*() 2022-12-25 13:38:09 -08:00
nls
notify Merge tag 'fsnotify-for_v6.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs 2022-10-07 08:28:50 -07:00
ntfs - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
ntfs3 ntfs3 for 6.2 2022-12-21 10:18:17 -08:00
ocfs2 Treewide: Stop corrupting socket's task_frag 2022-12-19 17:28:49 -08:00
omfs omfs: remove ->writepage 2022-12-11 18:12:18 -08:00
openpromfs
orangefs orangefs: four fixes from Zhang Xiaoxu and two from Colin Ian King 2022-12-14 11:16:33 -08:00
overlayfs overlayfs update for 6.2 2022-12-12 20:18:26 -08:00
proc ARM64: 2022-12-15 11:12:21 -08:00
pstore pstore updates for v6.2-rc1-fixes 2022-12-23 11:55:54 -08:00
qnx4 fs: Convert block_read_full_page() to block_read_full_folio() 2022-05-09 16:21:44 -04:00
qnx6 fs/qnx6: delete unnecessary checks before brelse() 2022-09-11 21:55:07 -07:00
quota ext4: fix bug_on in __es_tree_search caused by bad quota inode 2022-12-08 21:49:23 -05:00
ramfs tmpfile API change 2022-10-10 19:45:17 -07:00
reiserfs lsm/stable-6.2 PR 20221212 2022-12-13 09:47:48 -08:00
romfs romfs: Convert romfs to read_folio 2022-05-09 16:21:46 -04:00
smbfs_common smb3: define missing create contexts 2022-10-05 01:55:27 -05:00
squashfs fs.idmapped.squashfs.v6.2 2022-12-12 20:24:51 -08:00
sysfs kobject: kobj_type: remove default_attrs 2022-04-05 15:39:19 +02:00
sysv fs: sysv: Fix sysv_nblocks() returns wrong value 2022-12-10 14:13:37 -05:00
tracefs tracefs: Only clobber mode/uid/gid on remount if asked 2022-09-08 17:10:54 -04:00
ubifs treewide: use get_random_u32_below() instead of deprecated function 2022-11-18 02:15:15 +01:00
udf \n 2022-12-12 20:32:50 -08:00
ufs ufs: replace ll_rw_block() 2022-09-11 20:26:07 -07:00
unicode
vboxsf vboxsf: Convert vboxsf to read_folio 2022-05-09 16:21:46 -04:00
verity fsverity: simplify fsverity_get_digest() 2022-11-29 21:07:41 -08:00
xfs New XFS code for 6.2: 2022-12-14 10:11:51 -08:00
zonefs zonefs: Fix active zone accounting 2022-11-25 17:01:22 +09:00
aio.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
anon_inodes.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
attr.c attr: use consistent sgid stripping checks 2022-10-18 10:09:47 +02:00
bad_inode.c fs: rename current get acl method 2022-10-20 10:13:27 +02:00
binfmt_elf_fdpic.c binfmt: Fix error return code in load_elf_fdpic_binary() 2022-12-01 19:15:52 -08:00
binfmt_elf_test.c
binfmt_elf.c rseq: Introduce feature size and alignment ELF auxiliary vector entries 2022-12-27 12:52:10 +01:00
binfmt_flat.c binfmt_flat: Remove shared library support 2022-04-22 10:57:18 -07:00
binfmt_misc.c binfmt_misc: fix shift-out-of-bounds in check_special_flags 2022-12-02 13:57:04 -08:00
binfmt_script.c
buffer.c - hfs and hfsplus kmap API modernization from Fabio Francesco 2022-10-12 11:00:22 -07:00
char_dev.c chardev: fix error handling in cdev_device_add() 2022-12-02 17:48:59 +01:00
compat_binfmt_elf.c
coredump.c hardening updates for v6.2-rc1 2022-12-14 12:20:00 -08:00
d_path.c d_path.c: typo fix... 2022-08-20 11:34:33 -04:00
dax.c fsdax,xfs: port unshare to fsdax 2022-12-11 18:12:17 -08:00
dcache.c tmpfile API change 2022-10-10 19:45:17 -07:00
direct-io.c block: remove PSI accounting from the bio layer 2022-09-20 08:24:38 -06:00
drop_caches.c
eventfd.c eventfd: provide a eventfd_signal_mask() helper 2022-11-22 06:07:55 -07:00
eventpoll.c eventpoll: add EPOLL_URING_WAKE poll wakeup flag 2022-11-21 07:45:29 -07:00
exec.c sched: Introduce per-memory-map concurrency ID 2022-12-27 12:52:11 +01:00
fcntl.c keep iocb_flags() result cached in struct file 2022-06-10 16:10:23 -04:00
fhandle.c do_sys_name_to_handle(): constify path 2022-09-01 17:36:39 -04:00
file_table.c locks: fix TOCTOU race when granting write lease 2022-08-16 10:59:54 -04:00
file.c fs: use acquire ordering in __fget_light() 2022-10-31 15:30:11 -04:00
filesystems.c
fs_context.c
fs_parser.c ext4: journal_path mount options should follow links 2022-12-01 10:46:54 -05:00
fs_pin.c
fs_struct.c
fs_types.c
fs-writeback.c for-6.2/writeback-2022-12-12 2022-12-15 18:09:48 -08:00
fsopen.c uninline may_mount() and don't opencode it in fspick(2)/fsopen(2) 2022-05-19 23:25:10 -04:00
init.c
inode.c fs.vfsuid.conversion.v6.2 2022-12-12 19:20:05 -08:00
internal.h fs.ovl.setgid.v6.2 2022-12-12 19:03:10 -08:00
ioctl.c Fixes for 5.18-rc1: 2022-04-01 19:35:56 -07:00
Kconfig hugetlb: make hugetlb depends on SYSFS or SYSCTL 2022-09-11 20:26:10 -07:00
Kconfig.binfmt Xtensa updates for v6.1 2022-10-10 14:21:11 -07:00
kernel_read_file.c fs/kernel_read_file: allow to read files up-to ssize_t 2022-06-16 19:58:21 -07:00
libfs.c libfs: add DEFINE_SIMPLE_ATTRIBUTE_SIGNED for signed value 2022-11-30 16:13:16 -08:00
locks.c Add process name and pid to locks warning 2022-11-30 05:08:10 -05:00
Makefile a.out: Remove the a.out implementation 2022-09-27 07:11:02 -07:00
mbcache.c ext4: fix deadlock due to mbcache entry corruption 2022-12-08 21:49:25 -05:00
mount.h switch try_to_unlazy_next() to __legitimize_mnt() 2022-07-05 16:18:21 -04:00
mpage.c Folio changes for 6.0 2022-08-03 10:35:43 -07:00
namei.c Landlock updates for v6.2-rc1 2022-12-13 09:14:50 -08:00
namespace.c fs.idmapped.mnt_idmap.v6.2 2022-12-12 19:30:18 -08:00
no-block.c
nsfs.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
open.c Landlock updates for v6.2-rc1 2022-12-13 09:14:50 -08:00
pipe.c dynamic_dname(): drop unused dentry argument 2022-08-20 11:34:04 -04:00
pnode.c pnode: terminate at peers of source 2022-12-21 14:45:25 +01:00
pnode.h
posix_acl.c fs.idmapped.mnt_idmap.v6.2 2022-12-12 19:30:18 -08:00
proc_namespace.c vfs: escape hash as well 2022-06-28 13:58:05 -04:00
read_write.c iov_iter work; most of that is about getting rid of 2022-12-12 18:29:54 -08:00
readdir.c Change calling conventions for filldir_t 2022-08-17 17:25:04 -04:00
remap_range.c New VFS code for 6.2: 2022-12-13 10:26:38 -08:00
select.c
seq_file.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
signalfd.c
splice.c use less confusing names for iov_iter direction initializers 2022-11-25 13:01:55 -05:00
stack.c
stat.c fs: use type safe idmapping helpers 2022-10-26 10:02:34 +02:00
statfs.c
super.c misc pile 2022-12-12 18:38:47 -08:00
sync.c riscv: compat: syscall: Add compat_sys_call_table implementation 2022-04-26 13:36:25 -07:00
sysctls.c
timerfd.c
userfaultfd.c fs/userfaultfd: Fix maple tree iterator in userfaultfd_unregister() 2022-11-07 12:58:26 -08:00
utimes.c
xattr.c fs.xattr.simple.rework.rbtree.rwlock.v6.2 2022-12-13 10:08:36 -08:00