linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-23 04:31:50 +00:00

Vincent Guittot f60a631ab9 sched/fair: Fix tg->load when offlining a CPU

When a CPU is taken offline, the contribution of its cfs_rqs to task_groups' load may remain and will negatively impact the calculation of the share of the online CPUs.

To fix this bug, clear the contribution of an offlining CPU to task groups' load and skip its contribution while it is inactive.

Here's the reproducer of the anomaly, by Imran Khan:

"So far I have encountered only one rather lengthy way of reproducing this issue, which is as follows:

1. Take a KVM guest (booted with 4 CPUs and can be scaled up to 124 CPUs) and create 2 custom cgroups: /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/cpu/test_group_2

2. Assign a CPU intensive workload to each of these cgroups and start the workload.

For my tests I am using following app:

    int main(int argc, char *argv[])
    {
            unsigned long count, i, val;
            if (argc != 2) {
                    printf("usage: ./a.out <number of random nums to generate> \n");
                    return 0;
            }

            count = strtoul(argv[1], NULL, 10);

            printf("Generating %lu random numbers \n", count);
            for (i = 0; i < count; i++) {
                    val = rand();
                    val = val % 2;
                    //usleep(1);
            }
            printf("Generated %lu random numbers \n", count);
            return 0;
    }

Also since the system is booted with 4 CPUs, in order to completely load the system I am also launching 4 instances of same test app under: /sys/fs/cgroup/cpu/

3. We can see that both of the cgroups get similar CPU time:

    # systemd-cgtop --depth 1
    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                  659      -     5.5G        -         -
    /system.slice        -      -     5.7G        -         -
    /test_group_1        4      -        -        -         -
    /test_group_2        3      -        -        -         -
    /user.slice         31      -    56.5M        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                  659  394.6     5.5G        -         -
    /test_group_2        3   65.7        -        -         -
    /user.slice         29   55.1    48.0M        -         -
    /test_group_1        4   47.3        -        -         -
    /system.slice        -    2.2     5.7G        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                  659  394.8     5.5G        -         -
    /test_group_1        4   62.9        -        -         -
    /user.slice         28   44.9    54.2M        -         -
    /test_group_2        3   44.7        -        -         -
    /system.slice        -    0.9     5.7G        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                  659  394.4     5.5G        -         -
    /test_group_2        3   58.8        -        -         -
    /test_group_1        4   51.9        -        -         -
    /user.slice         30   39.3    59.6M        -         -
    /system.slice        -    1.9     5.7G        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                  659  394.7     5.5G        -         -
    /test_group_1        4   60.9        -        -         -
    /test_group_2        3   57.9        -        -         -
    /user.slice         28   43.5    36.9M        -         -
    /system.slice        -    3.0     5.7G        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                  659  395.0     5.5G        -         -
    /test_group_1        4   66.8        -        -         -
    /test_group_2        3   56.3        -        -         -
    /user.slice         29   43.1    51.8M        -         -
    /system.slice        -    0.7     5.7G        -         -

4. Now move systemd-udevd to one of these test groups, say test_group_1, and perform scale up to 124 CPUs followed by scale down back to 4 CPUs from the host side.

5. Run the same workload i.e 4 instances of CPU hogger under /sys/fs/cgroup/cpu and one instance of CPU hogger each in /sys/fs/cgroup/cpu/test_group_1 and /sys/fs/cgroup/test_group_2.

It can be seen that test_group_1 (the one where systemd-udevd was moved) is getting much less CPU time than the test_group_2, even though at this point of time both of these groups have only CPU hogger running:

    # systemd-cgtop --depth 1
    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                 1219      -     5.4G        -         -
    /system.slice        -      -     5.6G        -         -
    /test_group_1        4      -        -        -         -
    /test_group_2        3      -        -        -         -
    /user.slice         26      -    91.3M        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                 1221  394.3     5.4G        -         -
    /test_group_2        3   82.7        -        -         -
    /test_group_1        4   14.3        -        -         -
    /system.slice        -    0.8     5.6G        -         -
    /user.slice         26    0.4    91.2M        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                 1221  394.6     5.4G        -         -
    /test_group_2        3   67.4        -        -         -
    /system.slice        -   24.6     5.6G        -         -
    /test_group_1        4   12.5        -        -         -
    /user.slice         26    0.4    91.2M        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                 1221  395.2     5.4G        -         -
    /test_group_2        3   60.9        -        -         -
    /system.slice        -   27.9     5.6G        -         -
    /test_group_1        4   12.2        -        -         -
    /user.slice         26    0.4    91.2M        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                 1221  395.2     5.4G        -         -
    /test_group_2        3   69.4        -        -         -
    /test_group_1        4   13.9        -        -         -
    /user.slice         28    1.6    92.0M        -         -
    /system.slice        -    1.0     5.6G        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                 1221  395.6     5.4G        -         -
    /test_group_2        3   59.3        -        -         -
    /test_group_1        4   14.1        -        -         -
    /user.slice         28    1.3    92.2M        -         -
    /system.slice        -    0.7     5.6G        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                 1221  395.5     5.4G        -         -
    /test_group_2        3   67.2        -        -         -
    /test_group_1        4   11.5        -        -         -
    /user.slice         28    1.3    92.5M        -         -
    /system.slice        -    0.6     5.6G        -         -

    Path             Tasks   %CPU   Memory  Input/s  Output/s
    /                 1221  395.1     5.4G        -         -
    /test_group_2        3   76.8        -        -         -
    /test_group_1        4   12.9        -        -         -
    /user.slice         28    1.3    92.8M        -         -
    /system.slice        -    1.2     5.6G        -         -

From sched_debug data it can be seen that in bad case the load.weight of per-CPU sched entities corresponding to test_group_1 has reduced significantly and also load_avg of test_group_1 remains much higher than that of test_group_2, even though systemd-udevd stopped running long time back and at this point of time both cgroups just have the CPU hogger app as running entity."

[ mingo: Added details from the original discussion, plus minor edits to the patch. ]

Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Aaron Lu <aaron.lu@intel.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Imran Khan <imran.f.khan@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/r/20231223111545.62135-1-vincent.guittot@linaro.org

2023-12-29 13:22:03 +01:00
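The effect described in the commit message above can be illustrated with a small toy model. This is a sketch under stated assumptions, not the kernel implementation: the `toy_*` names, the struct layouts, and the simplified proportional formula (`shares * cpu_load / group_load`) are all hypothetical and only approximate how a group's weight is split across CPUs. It shows how a stale contribution left behind by an offlined CPU inflates the group total and shrinks the share seen on the remaining CPU, and how clearing that contribution at offline time avoids this.

```c
/* Toy model only: every name here is hypothetical, none is a kernel API. */

struct toy_cfs_rq {
    unsigned long load_avg;    /* this CPU's current load for the group */
    unsigned long tg_contrib;  /* what this CPU last added to the group total */
    int online;                /* 1 while the CPU is active */
};

struct toy_task_group {
    unsigned long shares;      /* configured group weight, e.g. 1024 */
    unsigned long load_avg;    /* sum of tg_contrib over all CPUs */
};

/* Publish a CPU's load into the group total; inactive CPUs are skipped. */
void toy_update_contrib(struct toy_task_group *tg, struct toy_cfs_rq *rq)
{
    if (!rq->online)
        return;
    tg->load_avg = tg->load_avg - rq->tg_contrib + rq->load_avg;
    rq->tg_contrib = rq->load_avg;
}

/* Remove whatever this CPU previously contributed to the group total. */
void toy_clear_contrib(struct toy_task_group *tg, struct toy_cfs_rq *rq)
{
    tg->load_avg -= rq->tg_contrib;
    rq->tg_contrib = 0;
}

/*
 * Take a CPU offline. With apply_fix == 0 its old contribution stays in
 * the group total (the anomaly); with apply_fix != 0 it is cleared.
 */
void toy_offline(struct toy_task_group *tg, struct toy_cfs_rq *rq,
                 int apply_fix)
{
    rq->online = 0;
    if (apply_fix)
        toy_clear_contrib(tg, rq);
}

/* Assumed simplification: a CPU's share is proportional to its load. */
unsigned long toy_group_share(const struct toy_task_group *tg,
                              const struct toy_cfs_rq *rq)
{
    if (tg->load_avg == 0)
        return tg->shares;
    return tg->shares * rq->load_avg / tg->load_avg;
}
```

In this model, take a group with `shares = 1024` and two CPUs that each contribute a load of 512. If the second CPU goes offline without the fix and the first CPU's load grows to 1024, the group total becomes 1536 and `toy_group_share()` returns 682 for the first CPU. With the fix the total is 1024 and the share is the full 1024.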
..
bpf	bpf: Fix prog_array_map_poke_run map poke update	2023-12-06 22:40:16 +01:00
cgroup	cgroup: Fixes for v6.7-rc4	2023-12-07 12:42:40 -08:00
configs	hardening: Provide Kconfig fragments for basic options	2023-09-22 09:50:55 -07:00
debug	kdb: Corrects comment for kdballocenv	2023-11-06 17:13:55 +00:00
dma	swiotlb: fix out-of-bounds TLB allocations with CONFIG_SWIOTLB_DYNAMIC	2023-11-08 16:27:05 +01:00
entry	entry: Remove empty addr_limit_user_check()	2023-08-23 10:32:39 +02:00
events	perf: Fix perf_event_validate_size() lockdep splat	2023-12-15 12:33:23 +01:00
futex	futex: Fix hardcoded flags	2023-11-15 04:02:25 +01:00
gcov	gcov: annotate struct gcov_iterator with __counted_by	2023-10-18 14:43:22 -07:00
irq	As usual, lots of singleton and doubleton patches all over the tree and	2023-11-02 20:53:31 -10:00
kcsan	mm: delete checks for xor_unlock_is_negative_byte()	2023-10-18 14:34:17 -07:00
livepatch	livepatch: Fix missing newline character in klp_resolve_symbols()	2023-09-20 11:24:18 +02:00
locking	lockdep: Fix block chain corruption	2023-11-24 11:04:54 +01:00
module	This update includes the following changes:	2023-11-02 16:15:30 -10:00
power	Power management updates for 6.7-rc1	2023-10-31 15:38:12 -10:00
printk	TTY/Serial changes for 6.7-rc1	2023-11-03 15:44:25 -10:00
rcu	RCU fixes for v6.7	2023-11-08 09:47:52 -08:00
sched	sched/fair: Fix tg->load when offlining a CPU	2023-12-29 13:22:03 +01:00
time	posix-timers: Get rid of [COMPAT_]SYS_NI() uses	2023-12-20 21:30:27 -08:00
trace	Tracing fixes for 6.7:	2023-12-21 09:31:45 -08:00
.gitignore
acct.c	fs: rename __mnt_{want,drop}_write*() helpers	2023-09-11 15:05:50 +02:00
async.c
audit_fsnotify.c
audit_tree.c	As usual, lots of singleton and doubleton patches all over the tree and	2023-11-02 20:53:31 -10:00
audit_watch.c	audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare()	2023-11-14 17:34:27 -05:00
audit.c	audit: move trailing statements to next line	2023-08-15 18:16:14 -04:00
audit.h	audit: correct audit_filter_inodes() definition	2023-07-21 12:17:25 -04:00
auditfilter.c	audit: move trailing statements to next line	2023-08-15 18:16:14 -04:00
auditsc.c	audit,io_uring: io_uring openat triggers audit reference count underflow	2023-10-13 18:34:46 +02:00
backtracetest.c
bounds.c
capability.c	lsm: constify the 'target' parameter in security_capget()	2023-08-08 16:48:47 -04:00
cfi.c
compat.c	sched_getaffinity: don't assume 'cpumask_size()' is fully initialized	2023-03-14 19:32:38 -07:00
configs.c
context_tracking.c	locking/atomic: treewide: use raw_atomic*_<op>()	2023-06-05 09:57:20 +02:00
cpu_pm.c	cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}()	2023-01-13 11:48:15 +01:00
cpu.c	- Do the push of pending hrtimers away from a CPU which is being	2023-11-19 13:35:07 -08:00
crash_core.c	crash_core: fix the check for whether crashkernel is from high memory	2023-12-12 17:20:18 -08:00
crash_dump.c
cred.c	cred: get rid of CONFIG_DEBUG_CREDENTIALS	2023-12-15 14:19:48 -08:00
delayacct.c	delayacct: track delays from IRQ/SOFTIRQ	2023-04-18 16:39:34 -07:00
dma.c
exec_domain.c
exit.c	cred: get rid of CONFIG_DEBUG_CREDENTIALS	2023-12-15 14:19:48 -08:00
exit.h	exit: add internal include file with helpers	2023-09-21 12:03:50 -06:00
extable.c
fail_function.c	kernel/fail_function: fix memory leak with using debugfs_lookup()	2023-02-08 13:36:22 +01:00
fork.c	As usual, lots of singleton and doubleton patches all over the tree and	2023-11-02 20:53:31 -10:00
freezer.c	freezer,sched: Do not restore saved_state of a thawed task	2023-11-29 15:43:48 +01:00
gen_kheaders.sh	Revert "kheaders: substituting --sort in archive creation"	2023-05-28 16:20:21 +09:00
groups.c	groups: Convert group_info.usage to refcount_t	2023-09-29 11:28:39 -07:00
hung_task.c	kernel/hung_task.c: set some hung_task.c variables storage-class-specifier to static	2023-04-08 13:45:37 -07:00
iomem.c	kernel/iomem.c: remove __weak ioremap_cache helper	2023-08-21 13:37:28 -07:00
irq_work.c	trace: Add trace_ipi_send_cpu()	2023-03-24 11:01:29 +01:00
jump_label.c	jump_label: Prevent key->enabled int overflow	2022-12-01 15:53:05 -08:00
kallsyms_internal.h	kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[]	2022-11-12 18:47:36 -08:00
kallsyms_selftest.c	Modules changes for v6.6-rc1	2023-08-29 17:32:32 -07:00
kallsyms_selftest.h	kallsyms: Add self-test facility	2022-11-15 00:42:02 -08:00
kallsyms.c	kallsyms: Change func signature for cleanup_symbol_name()	2023-08-25 15:00:36 -07:00
kcmp.c	file: convert to SLAB_TYPESAFE_BY_RCU	2023-10-19 11:02:48 +02:00
Kconfig.freezer
Kconfig.hz
Kconfig.kexec	kexec: drop dependency on ARCH_SUPPORTS_KEXEC from CRASH_DUMP	2023-12-12 17:20:16 -08:00
Kconfig.locks
Kconfig.preempt
kcov.c	kcov: add prototypes for helper functions	2023-06-09 17:44:17 -07:00
kexec_core.c	crash_core: move crashk_*res definition into crash_core.c	2023-10-04 10:41:58 -07:00
kexec_elf.c
kexec_file.c	integrity-v6.6	2023-08-30 09:16:56 -07:00
kexec_internal.h
kexec.c	kernel: kexec: copy user-array safely	2023-10-09 16:59:47 +10:00
kheaders.c	kheaders: Use array declaration instead of char	2023-03-24 20:10:59 -07:00
kprobes.c	kprobes: consistent rcu api usage for kretprobe holder	2023-12-01 14:53:55 +09:00
ksyms_common.c	kallsyms: make kallsyms_show_value() as generic function	2023-06-08 12:27:20 -07:00
ksysfs.c	crash: hotplug support for kexec_load()	2023-08-24 16:25:14 -07:00
kthread.c	As usual, lots of singleton and doubleton patches all over the tree and	2023-11-02 20:53:31 -10:00
latencytop.c
Makefile	v6.5-rc1-modules-next	2023-06-28 15:51:08 -07:00
module_signature.c
notifier.c	notifiers: add tracepoints to the notifiers infrastructure	2023-04-08 13:45:38 -07:00
nsproxy.c	nsproxy: Convert nsproxy.count to refcount_t	2023-08-21 11:29:12 -07:00
padata.c	padata: Fix refcnt handling in padata_free_shell()	2023-10-27 18:04:24 +08:00
panic.c	panic: use atomic_try_cmpxchg in panic() and nmi_panic()	2023-10-04 10:41:56 -07:00
params.c	kernel: params: Remove unnecessary ‘0’ values from err	2023-07-10 12:47:01 -07:00
pid_namespace.c	pid: pid_ns_ctl_handler: remove useless comment	2023-10-04 10:41:57 -07:00
pid_sysctl.h	memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy	2023-08-21 13:37:59 -07:00
pid.c	pidfd: prevent a kernel-doc warning	2023-09-19 13:21:33 -07:00
profile.c
ptrace.c	mm: make __access_remote_vm() static	2023-10-18 14:34:15 -07:00
range.c
reboot.c	kernel/reboot: Add device to sys_off_handler	2023-07-28 11:33:09 +01:00
regset.c
relay.c	kernel: relay: remove unnecessary NULL values from relay_open_buf	2023-08-18 10:18:55 -07:00
resource_kunit.c
resource.c	kernel/resource: Increment by align value in get_free_mem_region()	2023-12-04 17:19:03 -08:00
rseq.c	rseq: Extend struct rseq with per-memory-map concurrency ID	2022-12-27 12:52:12 +01:00
scftorture.c	scftorture: Pause testing after memory-allocation failure	2023-07-14 15:02:57 -07:00
scs.c	scs: add support for dynamic shadow call stacks	2022-11-09 18:06:35 +00:00
seccomp.c	seccomp: Add missing kerndoc notations	2023-08-17 12:32:15 -07:00
signal.c	As usual, lots of singleton and doubleton patches all over the tree and	2023-11-02 20:53:31 -10:00
smp.c	CSD lock commits for v6.7	2023-10-30 17:56:53 -10:00
smpboot.c	kthread: add kthread_stop_put	2023-10-04 10:41:57 -07:00
smpboot.h
softirq.c	sched/core: introduce sched_core_idle_cpu()	2023-07-13 15:21:50 +02:00
stackleak.c	stackleak: allow to specify arch specific stackleak poison function	2023-04-20 11:36:35 +02:00
stacktrace.c	stacktrace: Export stack_trace_save_tsk	2023-09-11 23:59:47 -04:00
static_call_inline.c
static_call.c
stop_machine.c
sys_ni.c	posix-timers: Get rid of [COMPAT_]SYS_NI() uses	2023-12-20 21:30:27 -08:00
sys.c	prctl: Disable prctl(PR_SET_MDWE) on parisc	2023-11-18 19:35:31 +01:00
sysctl-test.c
sysctl.c	asm-generic updates for v6.7	2023-11-01 15:28:33 -10:00
task_work.c	task_work: add kerneldoc annotation for 'data' argument	2023-09-19 13:21:32 -07:00
taskstats.c	taskstats: fill_stats_for_tgid: use for_each_thread()	2023-10-04 10:41:57 -07:00
torture.c	torture: Print out torture module parameters	2023-09-24 17:24:01 +02:00
tracepoint.c	tracepoint: Allow livepatch module add trace event	2023-02-18 14:34:36 -05:00
tsacct.c
ucount.c	sysctl: Add size to register_sysctl	2023-08-15 15:26:17 -07:00
uid16.c
uid16.h
umh.c	sysctl: fix unused proc_cap_handler() function warning	2023-06-29 15:19:43 -07:00
up.c	smp: Change function signatures to use call_single_data_t	2023-09-13 14:59:24 +02:00
user_namespace.c	As usual, lots of singleton and doubleton patches all over the tree and	2023-11-02 20:53:31 -10:00
user-return-notifier.c
user.c	binfmt_misc: enable sandboxed mounts	2023-10-11 08:46:01 -07:00
usermode_driver.c
utsname_sysctl.c	utsname: simplify one-level sysctl registration for uts_kern_table	2023-04-13 11:49:35 -07:00
utsname.c
vhost_task.c	vhost: Fix worker hangs due to missed wake up calls	2023-06-08 15:43:09 -04:00
watch_queue.c	kernel: watch_queue: copy user-array safely	2023-10-09 16:59:48 +10:00
watchdog_buddy.c	watchdog/hardlockup: move SMP barriers from common code to buddy code	2023-06-19 16:25:28 -07:00
watchdog_perf.c	watchdog/perf: add a weak function for an arch to detect if perf can use NMIs	2023-06-09 17:44:21 -07:00
watchdog.c	watchdog: move softlockup_panic back to early_param	2023-11-01 12:10:02 -07:00
workqueue_internal.h	workqueue: Drop the special locking rule for worker->flags and worker_pool->flags	2023-08-07 15:57:22 -10:00
workqueue.c	workqueue: Make sure that wq_unbound_cpumask is never empty	2023-11-22 06:17:26 -10:00