linux

Linus Torvalds cd4699c5fd

prlimit and set/getpriority tasklist_lock optimizations

The tasklist_lock popped up as a scalability bottleneck on some testing
workloads. The readlocks in do_prlimit and set/getpriority are not
necessary in all cases.

Based on a cycles profile, it looked like ~87% of the time was spent in
the kernel, ~42% of which was just trying to get *some* spinlock
(queued_spin_lock_slowpath, not necessarily the tasklist_lock).

The big offenders (with rough percentages in cycles of the overall trace):

- do_wait 11%
- setpriority 8% (this patchset)
- kill 8%
- do_exit 5%
- clone 3%
- prlimit64 2% (this patchset)
- getrlimit 1% (this patchset)

I can't easily test this patchset on the original workload for various
reasons. Instead, I used the microbenchmark below to at least verify
there was some improvement. This patchset had a 28% speedup (12% from
baseline to set/getprio, then another 14% for prlimit).

One interesting thing is that my libc's getrlimit() was calling
prlimit64, so hoisting the read_lock(tasklist_lock) into sys_prlimit64
had no effect - it essentially optimized the older syscalls only. I
didn't do that in this patchset, but figured I'd mention it since it was
an option from the previous patch's discussion.

v3: https://lkml.kernel.org/r/20220106172041.522167-1-brho@google.com
v2: https://lore.kernel.org/lkml/20220105212828.197013-1-brho@google.com/
- update_rlimit_cpu on the group_leader instead of for_each_thread.
- update_rlimit_cpu still returns 0 or -ESRCH, even though we don't care
  about the error here. It felt safer that way in case someone uses
  that function again.
v1: https://lore.kernel.org/lkml/20211213220401.1039578-1-brho@google.com/

Barret Rhoden (3):
  setpriority: only grab the tasklist_lock for PRIO_PGRP
  prlimit: make do_prlimit() static
  prlimit: do not grab the tasklist_lock

 include/linux/posix-timers.h   |   2 +-
 include/linux/resource.h       |   2 -
 kernel/sys.c                   | 127 +++++++++++++++++----------------
 kernel/time/posix-cpu-timers.c |  12 +++-
 4 files changed, 76 insertions(+), 67 deletions(-)

I have dropped the first change in this series as an almost identical
change was merged as commit `7f8ca0edfe` ("kernel/sys.c: only take
tasklist_lock for get/setpriority(PRIO_PGRP)").

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>

Merge tag 'prlimit-tasklist_lock-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace

Pull tasklist_lock optimizations from Eric Biederman:
 "prlimit and getpriority tasklist_lock optimizations

  The tasklist_lock popped up as a scalability bottleneck on some
  testing workloads. The readlocks in do_prlimit and set/getpriority are
  not necessary in all cases.

  Based on a cycles profile, it looked like ~87% of the time was spent
  in the kernel, ~42% of which was just trying to get *some* spinlock
  (queued_spin_lock_slowpath, not necessarily the tasklist_lock).

  The big offenders (with rough percentages in cycles of the overall
  trace):

   - do_wait 11%
   - setpriority 8% (done previously in commit `7f8ca0edfe`)
   - kill 8%
   - do_exit 5%
   - clone 3%
   - prlimit64 2% (this patchset)
   - getrlimit 1% (this patchset)

  I can't easily test this patchset on the original workload for various
  reasons. Instead, I used the microbenchmark below to at least verify
  there was some improvement. This patchset had a 28% speedup (12% from
  baseline to set/getprio, then another 14% for prlimit).

  This series used to do the setpriority case, but an almost identical
  change was merged as commit `7f8ca0edfe` ("kernel/sys.c: only take
  tasklist_lock for get/setpriority(PRIO_PGRP)") so that has been
  dropped from here.

  One interesting thing is that my libc's getrlimit() was calling
  prlimit64, so hoisting the read_lock(tasklist_lock) into sys_prlimit64
  had no effect - it essentially optimized the older syscalls only. I
  didn't do that in this patchset, but figured I'd mention it since it
  was an option from the previous patch's discussion"

microbenchmark.c:
-----------------

    #include <signal.h>
    #include <stdlib.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(int argc, char **argv)
    {
        pid_t child;
        struct rlimit rlim[1];

        /* Six forks: 64 processes run the loop below concurrently. */
        fork(); fork(); fork(); fork(); fork(); fork();

        for (int i = 0; i < 5000; i++) {
            child = fork();
            if (child < 0)
                exit(1);
            if (child > 0) {
                /* Parent: let the child spin briefly, then kill and reap it. */
                usleep(1000);
                kill(child, SIGTERM);
                waitpid(child, NULL, 0);
            } else {
                /* Child: hammer the syscalls that took the tasklist_lock. */
                for (;;) {
                    setpriority(PRIO_PROCESS, 0,
                                getpriority(PRIO_PROCESS, 0));
                    getrlimit(RLIMIT_CPU, rlim);
                }
            }
        }
        return 0;
    }

Link: https://lore.kernel.org/lkml/20211213220401.1039578-1-brho@google.com/ [v1]
Link: https://lore.kernel.org/lkml/20220105212828.197013-1-brho@google.com/ [v2]
Link: https://lore.kernel.org/lkml/20220106172041.522167-1-brho@google.com/ [v3]

* tag 'prlimit-tasklist_lock-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  prlimit: do not grab the tasklist_lock
  prlimit: make do_prlimit() static

2022-03-24 10:16:00 -07:00
..
bpf	bpf: Add schedule points in batch ops	2022-02-17 10:48:26 -08:00
cgroup	Merge branch 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	2022-03-23 12:43:35 -07:00
configs	configs/debug: restore DEBUG_INFO=y for overriding	2022-03-17 11:02:13 -07:00
debug	kdb: Adopt scheduler's task classification	2021-11-03 17:21:37 +00:00
dma	ARM: SoC updates for 5.18	2022-03-23 18:20:09 -07:00
entry	Changes in this cycle were:	2022-03-22 14:39:12 -07:00
events	asm-generic updates for 5.18	2022-03-23 18:03:08 -07:00
futex	mm/truncate: Inline invalidate_complete_page() into its one caller	2022-03-21 12:59:01 -04:00
gcov	gcov: Remove compiler version check	2021-12-02 17:25:21 +09:00
irq	Changes in this cycle were:	2022-03-22 14:39:12 -07:00
kcsan	KCSAN updates for v5.17	2022-01-11 09:51:26 -08:00
livepatch	Livepatching changes for 5.17	2022-01-16 10:08:13 +02:00
locking	Changes in this cycle were:	2022-03-22 13:44:21 -07:00
power	for-5.18/block-2022-03-18	2022-03-21 16:48:55 -07:00
printk	printk changes for 5.18	2022-03-23 10:54:27 -07:00
rcu	Changes in this cycle were:	2022-03-22 14:39:12 -07:00
sched	Merge branch 'akpm' (patches from Andrew)	2022-03-22 16:11:53 -07:00
time	prlimit and set/getpriority tasklist_lock optimizations	2022-03-24 10:16:00 -07:00
trace	asm-generic updates for 5.18	2022-03-23 18:03:08 -07:00
.gitignore
acct.c	kernel: remove spurious blkdev.h includes	2021-10-18 06:17:01 -06:00
async.c	Revert "module, async: async_synchronize_full() on module init iff async is used"	2022-02-03 11:20:34 -08:00
audit_fsnotify.c	fsnotify: clarify contract for create event hooks	2021-10-27 12:32:34 +02:00
audit_tree.c	audit: use struct_size() helper in kmalloc()	2021-12-14 17:39:42 -05:00
audit_watch.c	\n	2021-11-06 16:43:20 -07:00
audit.c	audit: improve audit queue handling when "audit=1" on cmdline	2022-01-25 13:22:51 -05:00
audit.h	audit: log AUDIT_TIME_* records only from rules	2022-02-22 13:51:40 -05:00
auditfilter.c	audit/stable-5.17 PR 20220110	2022-01-11 13:08:21 -08:00
auditsc.c	audit/stable-5.18 PR 20220321	2022-03-21 20:53:11 -07:00
backtracetest.c
bounds.c
capability.c
cfi.c	cfi: Use rcu_read_{un}lock_sched_notrace	2021-08-11 13:11:12 -07:00
compat.c	arch: remove compat_alloc_user_space	2021-09-08 15:32:35 -07:00
configs.c
context_tracking.c
cpu_pm.c	PM: cpu: Make notifier chain use a raw_spinlock_t	2021-08-16 18:55:32 +02:00
cpu.c	Changes in this cycle were:	2022-03-22 14:39:12 -07:00
crash_core.c	kernel/crash_core: suppress unknown crashkernel parameter warning	2021-12-25 12:20:55 -08:00
crash_dump.c
cred.c	ucounts: Base set_cred_ucounts changes on the real user	2022-02-17 09:11:02 -06:00
delayacct.c	delayacct: track delays from memory compact	2022-01-20 08:52:55 +02:00
dma.c
exec_domain.c
exit.c	asm-generic updates for 5.18	2022-03-23 18:03:08 -07:00
extable.c	extable: use is_kernel_text() helper	2021-11-09 10:02:51 -08:00
fail_function.c
fork.c	Core code updates:	2022-03-21 12:37:33 -07:00
freezer.c	sched: Add get_current_state()	2021-06-18 11:43:08 +02:00
gen_kheaders.sh	kbuild: clean up ${quiet} checks in shell scripts	2021-05-27 04:01:50 +09:00
groups.c
hung_task.c	hung_task: move hung_task sysctl interface to hung_task.c	2022-01-22 08:33:34 +02:00
iomem.c
irq_work.c	irq_work: Also rcuwait for !IRQ_WORK_HARD_IRQ on PREEMPT_RT	2021-10-15 11:25:18 +02:00
jump_label.c	jump_label: Fix jump_label_text_reserved() vs __init	2021-07-05 10:46:20 +02:00
kallsyms.c	Livepatching changes for 5.17	2022-01-16 10:08:13 +02:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks	locking/rwlock: Provide RT variant	2021-08-17 17:50:51 +02:00
Kconfig.preempt	Changes in this cycle were:	2022-03-22 14:39:12 -07:00
kcov.c	kcov: replace local_irq_save() with a local_lock_t	2021-11-09 10:02:52 -08:00
kexec_core.c	exit: Move oops specific logic from do_exit into make_task_dead	2021-12-13 12:04:45 -06:00
kexec_elf.c
kexec_file.c	memblock: add MEMBLOCK_DRIVER_MANAGED to mimic IORESOURCE_SYSRAM_DRIVER_MANAGED	2021-11-06 13:30:42 -07:00
kexec_internal.h
kexec.c	kexec: avoid compat_alloc_user_space	2021-09-08 15:32:34 -07:00
kheaders.c
kmod.c
kprobes.c	kprobe: move sysctl_kprobes_optimization to kprobes.c	2022-01-22 08:33:36 +02:00
ksysfs.c
kthread.c	asm-generic updates for 5.18	2022-03-23 18:03:08 -07:00
latencytop.c
Makefile	module: add in-kernel support for decompressing	2022-01-11 18:45:02 -08:00
module_decompress.c	module: fix building with sysfs disabled	2022-02-16 12:51:32 -08:00
module_signature.c
module_signing.c
module-internal.h	module: add in-kernel support for decompressing	2022-01-11 18:45:02 -08:00
module.c	NFSD: Remove svc_serv_ops::svo_module	2022-02-28 10:26:40 -05:00
notifier.c	notifier: Return an error when a callback has already been registered	2021-12-29 10:37:33 +01:00
nsproxy.c	memcg: enable accounting for new namesapces and struct nsproxy	2021-09-03 09:58:12 -07:00
padata.c	padata: replace cpumask_weight with cpumask_empty in padata.c	2022-01-31 11:21:46 +11:00
panic.c	panic: remove oops_id	2022-01-20 08:52:55 +02:00
params.c	kobject: remove kset from struct kset_uevent_ops callbacks	2021-12-28 11:26:18 +01:00
pid_namespace.c	memcg: enable accounting for new namesapces and struct nsproxy	2021-09-03 09:58:12 -07:00
pid.c	pid: add pidfd_get_task() helper	2021-10-14 13:29:18 +02:00
profile.c	exit: Remove profile_handoff_task	2022-01-08 12:43:57 -06:00
ptrace.c	ptrace: Remove second setting of PT_SEIZED in ptrace_attach	2022-01-08 12:43:57 -06:00
range.c
reboot.c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input	2021-11-12 11:53:16 -08:00
regset.c
relay.c
resource_kunit.c
resource.c	proc: remove PDE_DATA() completely	2022-01-22 08:33:37 +02:00
rseq.c	rseq: Remove broken uapi field layout on 32-bit little endian	2022-02-02 13:11:34 +01:00
scftorture.c	scftorture: Always log error message	2021-12-07 16:36:17 -08:00
scs.c	scs: Release kasan vmalloc poison in scs_free process	2021-09-30 09:37:27 +01:00
seccomp.c	seccomp: Invalidate seccomp mode to catch death failures	2022-02-10 19:09:12 -08:00
signal.c	signal, x86: Delay calling signals in atomic on RT enabled kernels	2022-03-04 14:58:54 +01:00
smp.c	sched: Improve wake_up_all_idle_cpus() take #2	2021-10-22 15:32:46 +02:00
smpboot.c	smpboot: Replace deprecated CPU-hotplug functions.	2021-08-10 14:57:42 +02:00
smpboot.h
softirq.c	genirq, softirq: Use in_hardirq() instead of in_irq()	2022-02-02 21:34:19 +01:00
stackleak.c	gcc-plugins/stackleak: Use noinstr in favor of notrace	2022-02-03 17:02:21 -08:00
stacktrace.c	uaccess: remove CONFIG_SET_FS	2022-02-25 09:36:06 +01:00
static_call.c	static_call: Fix static_call_text_reserved() vs __init	2021-07-05 10:46:33 +02:00
stop_machine.c
sys_ni.c	mm/mempolicy: wire up syscall set_mempolicy_home_node	2022-01-15 16:30:30 +02:00
sys.c	prlimit: do not grab the tasklist_lock	2022-03-08 14:33:36 -06:00
sysctl-test.c	kernel/sysctl-test: Remove some casts which are no-longer required	2021-06-23 16:41:24 -06:00
sysctl.c	Merge branch 'akpm' (patches from Andrew)	2022-03-22 16:11:53 -07:00
task_work.c
taskstats.c
torture.c	torture: Wake up kthreads after storing task_struct pointer	2022-02-01 17:24:39 -08:00
tracepoint.c	tracepoint: Fix kerneldoc comments	2021-08-16 11:39:51 -04:00
tsacct.c	taskstats: Cleanup the use of task->exit_code	2022-01-08 12:43:57 -06:00
ucount.c	ucounts: Handle wrapping in is_ucounts_overlimit	2022-02-17 09:11:57 -06:00
uid16.c
uid16.h
umh.c
up.c
user_namespace.c	ucounts: Fix systemd LimitNPROC with private users regression	2022-02-25 10:40:14 -06:00
user-return-notifier.c
user.c	fs/epoll: use a per-cpu counter for user's watches count	2021-09-08 11:50:27 -07:00
usermode_driver.c	Merge branch 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs	2021-07-03 11:41:14 -07:00
utsname_sysctl.c
utsname.c
watch_queue.c	watch_queue: Actually free the watch	2022-03-21 12:48:32 +00:00
watchdog_hld.c
watchdog.c	sched/isolation: Use single feature type while referring to housekeeping cpumask	2022-02-16 15:57:55 +01:00
workqueue_internal.h	workqueue: Assign a color to barrier work items	2021-08-17 07:49:10 -10:00
workqueue.c	Merge branch 'for-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	2022-03-23 12:40:51 -07:00