linux

K Prateek Nayak f5b2eeb499 sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()

In the case of systems containing multiple LLCs per socket, like AMD Zen systems, users want to spread bandwidth hungry applications across multiple LLCs. Stream is one such representative workload where the best performance is obtained by limiting one stream thread per LLC. To ensure this, users are known to pin the tasks to a subset of the CPUs consisting of one CPU per LLC while running such bandwidth hungry tasks.

Suppose we kickstart a multi-threaded task like stream with 8 threads using taskset or numactl to run on a subset of CPUs on a 2 socket Zen3 server where each socket contains 128 CPUs (0-63,128-191 in one socket, 64-127,192-255 in another socket).

For example:

    numactl -C 0,16,32,48,64,80,96,112 ./stream8

Here each CPU in the list is from a different LLC and 4 of those LLCs are on one socket, while the other 4 are on another socket.

Ideally we would prefer that each stream thread runs on a different CPU from the allowed list of CPUs. However, the current heuristics in find_idlest_group() do not allow this during the initial placement.

Suppose the first socket (0-63,128-191) is our local group from which we are kickstarting the stream tasks. The first four stream threads will be placed in this socket. When it comes to placing the 5th thread, all the allowed CPUs from the local group (0,16,32,48) would have been taken.

However, the current scheduler code simply checks if the number of tasks in the local group is fewer than the allowed numa-imbalance threshold. This threshold was previously 25% of the NUMA domain span (in this case threshold = 32), but after v6 of Mel's patchset "Adjust NUMA imbalance for multiple LLCs" got merged in sched-tip, commit `e496132ebe` ("sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs"), it is now equal to the number of LLCs in the NUMA domain for processors with multiple LLCs (in this case threshold = 8).

For this example, the number of tasks will always be within the threshold and thus all the 8 stream threads will be woken up on the first socket, thereby resulting in sub-optimal performance.

The following sched_wakeup_new tracepoint output shows the initial placement of tasks in the current tip/sched/core on the Zen3 machine:

    stream-5313 [016] d..2. 627.005036: sched_wakeup_new: comm=stream pid=5315 prio=120 target_cpu=032
    stream-5313 [016] d..2. 627.005086: sched_wakeup_new: comm=stream pid=5316 prio=120 target_cpu=048
    stream-5313 [016] d..2. 627.005141: sched_wakeup_new: comm=stream pid=5317 prio=120 target_cpu=000
    stream-5313 [016] d..2. 627.005183: sched_wakeup_new: comm=stream pid=5318 prio=120 target_cpu=016
    stream-5313 [016] d..2. 627.005218: sched_wakeup_new: comm=stream pid=5319 prio=120 target_cpu=016
    stream-5313 [016] d..2. 627.005256: sched_wakeup_new: comm=stream pid=5320 prio=120 target_cpu=016
    stream-5313 [016] d..2. 627.005295: sched_wakeup_new: comm=stream pid=5321 prio=120 target_cpu=016

Once the first four threads are distributed among the allowed CPUs of socket one, the rest of the threads start piling on these same CPUs when clearly there are CPUs on the second socket that can be used.

Following the initial pile up on a small number of CPUs, though the load-balancer eventually kicks in, it takes a while to get to {4}{4}, and even {4}{4} isn't stable, as we observe a bunch of ping ponging between {4}{4} to {5}{3} and back before a stable state is reached much later (1 Stream thread per allowed CPU) and no more migration is required.

We can detect this piling and avoid it by checking if the number of allowed CPUs in the local group is fewer than the number of tasks running in the local group, and use this information to spread the 5th task out into the next socket (after all, the goal in this slowpath is to find the idlest group and the idlest CPU during the initial placement!).

The following sched_wakeup_new tracepoint output shows the initial placement of tasks after adding this fix on the Zen3 machine:

    stream-4485 [016] d..2. 230.784046: sched_wakeup_new: comm=stream pid=4487 prio=120 target_cpu=032
    stream-4485 [016] d..2. 230.784123: sched_wakeup_new: comm=stream pid=4488 prio=120 target_cpu=048
    stream-4485 [016] d..2. 230.784167: sched_wakeup_new: comm=stream pid=4489 prio=120 target_cpu=000
    stream-4485 [016] d..2. 230.784222: sched_wakeup_new: comm=stream pid=4490 prio=120 target_cpu=112
    stream-4485 [016] d..2. 230.784271: sched_wakeup_new: comm=stream pid=4491 prio=120 target_cpu=096
    stream-4485 [016] d..2. 230.784322: sched_wakeup_new: comm=stream pid=4492 prio=120 target_cpu=080
    stream-4485 [016] d..2. 230.784368: sched_wakeup_new: comm=stream pid=4493 prio=120 target_cpu=064

We see that threads are using all of the allowed CPUs and there is no pileup.

No output is generated for tracepoint sched_migrate_task with this patch due to a perfect initial placement which removes the need for balancing later on - both across NUMA boundaries and within NUMA boundaries for stream.

Following are the results from running 8 Stream threads with and without pinning on a dual socket Zen3 Machine (2 x 64C/128T):

During the testing of this patch, the tip sched/core was at commit `089c02ae27` ("ftrace: Use preemption model accessors for trace header printout").

Pinning is done using:

    numactl -C 0,16,32,48,64,80,96,112 ./stream8

                     5.18.0-rc1                5.18.0-rc1                5.18.0-rc1
                 tip sched/core            tip sched/core            tip sched/core
                   (no pinning)                 + pinning              + this-patch
                                                                          + pinning

     Copy:  109364.74 (0.00 pct)    94220.50 (-13.84 pct)    158301.28 (44.74 pct)
    Scale:  109670.26 (0.00 pct)    90210.59 (-17.74 pct)    149525.64 (36.34 pct)
      Add:  129029.01 (0.00 pct)   101906.00 (-21.02 pct)    186658.17 (44.66 pct)
    Triad:  127260.05 (0.00 pct)   106051.36 (-16.66 pct)    184327.30 (44.84 pct)

Pinning currently hurts the performance compared to the unbound case on tip/sched/core. With the addition of this patch, we are able to outperform tip/sched/core by a good margin with pinning.

Following are the results from running 16 Stream threads with and without pinning on a dual socket Icelake Machine (2 x 32C/64T):

NUMA Topology of Intel Icelake machine:

    Node 1: 0,2,4,6 ... 126 (Even numbers)
    Node 2: 1,3,5,7 ... 127 (Odd numbers)

Pinning is done using:

    numactl -C 0-15 ./stream16

                     5.18.0-rc1                5.18.0-rc1                5.18.0-rc1
                 tip sched/core            tip sched/core            tip sched/core
                   (no pinning)                 + pinning              + this-patch
                                                                          + pinning

     Copy:   85815.31 (0.00 pct)   149819.21 (74.58 pct)    156807.48 (82.72 pct)
    Scale:   64795.60 (0.00 pct)    97595.07 (50.61 pct)     99871.96 (54.13 pct)
      Add:   71340.68 (0.00 pct)   111549.10 (56.36 pct)    114598.33 (60.63 pct)
    Triad:   68890.97 (0.00 pct)   111635.16 (62.04 pct)    114589.24 (66.33 pct)

In the case of the Icelake machine, with a single LLC per socket, pinning across the two sockets reduces cache contention, thus showing great improvement in the pinned case, which is further benefited by this patch.

Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lkml.kernel.org/r/20220407111222.22649-1-kprateek.nayak@amd.com
		2022-06-13 10:30:00 +02:00
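The placement check described in the commit message can be illustrated with a small user-space sketch in C. This is not the kernel code: `struct cpu_set256`, `keep_task_local()` and `zen3_example_keep_local()` are hypothetical names made up for this illustration, and the decision is reduced to a single comparison. It assumes the Zen3 layout from the commit message (local socket 0-63,128-191, threshold of 8 LLCs) and caps the allowed imbalance at the number of CPUs the task may actually use in the local group.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS    256
#define MASK_WORDS (NR_CPUS / 64)

/* Hypothetical user-space stand-in for a CPU mask; not the kernel's struct cpumask. */
struct cpu_set256 {
	uint64_t bits[MASK_WORDS];
};

static void set_cpu(struct cpu_set256 *m, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	m->bits[cpu / 64] |= (uint64_t)1 << (cpu % 64);
}

static void set_cpu_range(struct cpu_set256 *m, int first, int last)
{
	for (int cpu = first; cpu <= last; cpu++)
		set_cpu(m, cpu);
}

/* Count the CPUs that are present in both masks. */
static int and_weight(const struct cpu_set256 *a, const struct cpu_set256 *b)
{
	int weight = 0;

	for (int i = 0; i < MASK_WORDS; i++) {
		uint64_t v = a->bits[i] & b->bits[i];

		while (v) {
			v &= v - 1;	/* clear the lowest set bit */
			weight++;
		}
	}
	return weight;
}

/*
 * Simplified decision (an assumption of this sketch): keep the new task in
 * the local group only while the task count, including the new task, stays
 * within min(allowed CPUs in the local group, NUMA imbalance threshold).
 */
static bool keep_task_local(int local_running, int imb_numa_nr,
			    const struct cpu_set256 *local_span,
			    const struct cpu_set256 *allowed)
{
	int usable = and_weight(local_span, allowed);
	int threshold = usable < imb_numa_nr ? usable : imb_numa_nr;

	return local_running + 1 <= threshold;
}

/* Zen3 example from the commit message: local socket is 0-63,128-191. */
static bool zen3_example_keep_local(int local_running, bool pinned)
{
	struct cpu_set256 local = {{0}}, allowed = {{0}};

	set_cpu_range(&local, 0, 63);
	set_cpu_range(&local, 128, 191);

	if (pinned) {
		/* numactl -C 0,16,32,48,64,80,96,112 */
		for (int cpu = 0; cpu <= 112; cpu += 16)
			set_cpu(&allowed, cpu);
	} else {
		set_cpu_range(&allowed, 0, NR_CPUS - 1);
	}

	return keep_task_local(local_running, 8, &local, &allowed);
}
```

With four stream threads already on the local socket, `zen3_example_keep_local(4, true)` returns false, so the fifth pinned thread would be sent to the other socket. The unpinned call `zen3_example_keep_local(4, false)` returns true because the cap stays at the LLC count of 8.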
bpf	bpf: Fix calling global functions from BPF_PROG_TYPE_EXT programs	2022-06-07 10:41:20 -07:00
cgroup	Merge branch 'for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup	2022-05-25 11:47:25 -07:00
configs	x86/configs: Add x86 debugging Kconfig fragment plus docs	2022-04-06 19:56:29 +02:00
debug	Modules updates for v5.19-rc1	2022-05-26 17:13:43 -07:00
dma	swiotlb: fix setting ->force_bounce	2022-06-02 07:17:59 +02:00
entry	* Fix syzkaller NULL pointer dereference	2022-06-08 09:16:31 -07:00
events	Two small perf updates:	2022-06-05 10:40:31 -07:00
futex	drm for 5.19-rc1	2022-05-25 16:18:27 -07:00
gcov	gcov: Remove compiler version check	2021-12-02 17:25:21 +09:00
irq	Updates for interrupt core and drivers:	2022-05-23 16:58:49 -07:00
kcsan	linux-kselftest-kunit-5.19-rc1	2022-05-25 11:32:53 -07:00
livepatch	Livepatching changes for 5.19	2022-06-02 08:55:01 -07:00
locking	sysctl changes for v5.19-rc1	2022-05-26 16:57:20 -07:00
module	module: Fix prefix for module.sig_enforce module param	2022-06-02 12:44:33 -07:00
power	cxl for 5.19	2022-05-27 21:24:19 -07:00
printk	Revert "printk: wake up all waiters"	2022-05-27 13:04:46 +02:00
rcu	sysctl changes for v5.19-rc1	2022-05-26 16:57:20 -07:00
sched	sched/fair: Consider CPU affinity when allowing NUMA imbalance in find_idlest_group()	2022-06-13 10:30:00 +02:00
time	While looking at the ptrace problems with PREEMPT_RT and the problems	2022-06-03 16:13:25 -07:00
trace	Networking fixes for 5.19-rc2, including fixes from bpf and netfilter.	2022-06-09 12:06:52 -07:00
.gitignore
acct.c	kernel/acct: move acct sysctls to its own file	2022-04-06 13:43:44 -07:00
async.c	Revert "module, async: async_synchronize_full() on module init iff async is used"	2022-02-03 11:20:34 -08:00
audit_fsnotify.c	fsnotify: make allow_dups a property of the group	2022-04-25 14:37:18 +02:00
audit_tree.c	audit: use fsnotify group lock helpers	2022-04-25 14:37:28 +02:00
audit_watch.c	fsnotify: pass flags argument to fsnotify_alloc_group()	2022-04-25 14:37:12 +02:00
audit.c	audit: improve audit queue handling when "audit=1" on cmdline	2022-01-25 13:22:51 -05:00
audit.h	audit: log AUDIT_TIME_* records only from rules	2022-02-22 13:51:40 -05:00
auditfilter.c	audit/stable-5.17 PR 20220110	2022-01-11 13:08:21 -08:00
auditsc.c	audit,io_uring,io-wq: call __audit_uring_exit for dummy contexts	2022-05-17 15:03:36 -04:00
backtracetest.c
bounds.c
capability.c	xfs: don't generate selinux audit messages for capability testing	2022-03-09 10:32:06 -08:00
cfi.c
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c	Intel Trust Domain Extensions	2022-05-23 17:51:12 -07:00
crash_core.c	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
crash_dump.c
cred.c	x86: Mark __invalid_creds() __noreturn	2022-03-15 10:32:44 +01:00
delayacct.c	delayacct: track delays from write-protect copy	2022-06-01 15:55:25 -07:00
dma.c
exec_domain.c
exit.c	ptrace: Cleanups for v5.18	2022-03-28 17:29:53 -07:00
extable.c	lkdtm: Really write into kernel text in WRITE_KERN	2022-02-16 23:25:12 +11:00
fail_function.c
fork.c	This set of changes updates init and user mode helper tasks to be	2022-06-03 16:03:05 -07:00
freezer.c
gen_kheaders.sh	kheaders: Have cpio unconditionally replace files	2022-05-08 03:16:59 +09:00
groups.c
hung_task.c	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
iomem.c
irq_work.c	irq_work: use kasan_record_aux_stack_noalloc() record callstack	2022-04-15 14:49:55 -07:00
jump_label.c
kallsyms.c	ftrace: Add ftrace_lookup_symbols function	2022-05-10 14:42:06 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt	Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"	2022-03-31 10:36:55 +02:00
kcov.c	kcov: update pos before writing pc in trace function	2022-05-25 13:05:42 -07:00
kexec_core.c	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
kexec_elf.c
kexec_file.c	RISC-V Patches for the 5.19 Merge Window, Part 1	2022-05-31 14:10:54 -07:00
kexec_internal.h
kexec.c
kheaders.c
kmod.c
kprobes.c	tracing updates for 5.19:	2022-05-29 10:31:36 -07:00
ksysfs.c	kernel/ksysfs.c: use helper macro __ATTR_RW	2022-03-23 19:00:33 -07:00
kthread.c	kthread: unexport kthread_blkcg	2022-05-02 14:06:20 -06:00
latencytop.c	latencytop: move sysctl to its own file	2022-04-21 11:40:59 -07:00
Makefile	kernel: add platform_has() infrastructure	2022-06-06 08:06:00 +02:00
module_signature.c
notifier.c	notifier: Add blocking/atomic_notifier_chain_register_unique_prio()	2022-05-19 19:30:30 +02:00
nsproxy.c
padata.c	padata: replace cpumask_weight with cpumask_empty in padata.c	2022-01-31 11:21:46 +11:00
panic.c	sysctl changes for v5.19-rc1	2022-05-26 16:57:20 -07:00
params.c	kobject: remove kset from struct kset_uevent_ops callbacks	2021-12-28 11:26:18 +01:00
pid_namespace.c	kernel: pid_namespace: use NULL instead of using plain integer as pointer	2022-04-29 14:38:00 -07:00
pid.c
platform-feature.c	kernel: add platform_has() infrastructure	2022-06-06 08:06:00 +02:00
profile.c	exit: Remove profile_handoff_task	2022-01-08 12:43:57 -06:00
ptrace.c	While looking at the ptrace problems with PREEMPT_RT and the problems	2022-06-03 16:13:25 -07:00
range.c
reboot.c	kernel/reboot: Fix powering off using a non-syscall code paths	2022-06-07 19:42:31 +02:00
regset.c
relay.c	relay: remove redundant assignment to pointer buf	2022-05-12 20:38:37 -07:00
resource_kunit.c
resource.c	kernel/resource: fix kfree() of bootmem memory again	2022-03-23 19:00:35 -07:00
rseq.c	rseq: Remove broken uapi field layout on 32-bit little endian	2022-02-02 13:11:34 +01:00
scftorture.c	scftorture: Fix distribution of short handler delays	2022-04-11 17:07:29 -07:00
scs.c	kasan, vmalloc: only tag normal vmalloc allocations	2022-03-24 19:06:48 -07:00
seccomp.c	seccomp: Add wait_killable semantic to seccomp user notifier	2022-05-03 14:11:58 -07:00
signal.c	While looking at the ptrace problems with PREEMPT_RT and the problems	2022-06-03 16:13:25 -07:00
smp.c	Scheduler changes in this cycle were:	2022-05-24 11:11:13 -07:00
smpboot.c	cpu/hotplug: Allow the CPU in CPU_UP_PREPARE state to be brought up again.	2022-04-12 14:13:01 +02:00
smpboot.h
softirq.c	smp: Make softirq handling RT safe in flush_smp_call_function_queue()	2022-05-01 10:03:43 +02:00
stackleak.c	stackleak: add on/off stack variants	2022-05-08 01:33:09 -07:00
stacktrace.c	uaccess: remove CONFIG_SET_FS	2022-02-25 09:36:06 +01:00
static_call_inline.c	static_call: Don't make __static_call_return0 static	2022-04-05 09:59:38 +02:00
static_call.c	static_call: Don't make __static_call_return0 static	2022-04-05 09:59:38 +02:00
stop_machine.c	Scheduler changes in this cycle were:	2022-05-24 11:11:13 -07:00
sys_ni.c	mm/mempolicy: wire up syscall set_mempolicy_home_node	2022-01-15 16:30:30 +02:00
sys.c	arm64/sme: Implement vector length configuration prctl()s	2022-04-22 18:50:54 +01:00
sysctl-test.c
sysctl.c	sysctl changes for v5.19-rc1	2022-05-26 16:57:20 -07:00
task_work.c	task_work: allow TWA_SIGNAL without a rescheduling IPI	2022-04-30 08:39:32 -06:00
taskstats.c	kernel: make taskstats available from all net namespaces	2022-04-29 14:38:03 -07:00
torture.c	torture: Wake up kthreads after storing task_struct pointer	2022-02-01 17:24:39 -08:00
tracepoint.c
tsacct.c	taskstats: version 12 with thread group and exe info	2022-04-29 14:38:03 -07:00
ucount.c	ucounts: Handle wrapping in is_ucounts_overlimit	2022-02-17 09:11:57 -06:00
uid16.c
uid16.h
umh.c	kthread: Don't allocate kthread_struct for init and umh	2022-05-06 14:49:44 -05:00
up.c
user_namespace.c	ucounts: Fix systemd LimitNPROC with private users regression	2022-02-25 10:40:14 -06:00
user-return-notifier.c
user.c
usermode_driver.c	blob_to_mnt(): kern_unmount() is needed to undo kern_mount()	2022-05-19 23:25:47 -04:00
utsname_sysctl.c
utsname.c
watch_queue.c	watch_queue: Free the page array when watch_queue is dismantled	2022-04-02 10:37:39 -07:00
watchdog_hld.c	printk: add functions to prefer direct printing	2022-04-22 21:30:58 +02:00
watchdog.c	Not a lot of material this cycle. Many singleton patches against various	2022-05-27 11:22:03 -07:00
workqueue_internal.h
workqueue.c	workqueue: Wrap flush_workqueue() using a macro	2022-06-07 07:07:14 -10:00