linux

History

Ken Chen 908a7c1b9b sched: fix improper load balance across sched domain

We recently discovered a nasty performance bug in the kernel CPU load balancer that hit us with a 50% performance regression.

When tasks are assigned via CPU affinity to a subset of CPUs that spans sched_domains (either a ccNUMA node or the new multi-core domain), the kernel fails to load-balance properly at these domains, because several pieces of logic in find_busiest_group() misidentify the busiest sched group within a given domain. This leads to inadequate load balancing and causes the 50% performance hit.

To give a concrete example, a dual-core, 2-socket NUMA system has 4 logical CPUs, organized as:

  CPU0 attaching sched-domain:
   domain 0: span 0003 groups: 0001 0002
   domain 1: span 000f groups: 0003 000c
  CPU1 attaching sched-domain:
   domain 0: span 0003 groups: 0002 0001
   domain 1: span 000f groups: 0003 000c
  CPU2 attaching sched-domain:
   domain 0: span 000c groups: 0004 0008
   domain 1: span 000f groups: 000c 0003
  CPU3 attaching sched-domain:
   domain 0: span 000c groups: 0008 0004
   domain 1: span 000f groups: 000c 0003

Suppose I run 2 tasks with CPU affinity set to 0x5. There are situations where cpu0 has a run queue length of 2 while cpu2 is idle. The kernel load balancer is unable to spread these two tasks over cpu0 and cpu2, because at least three pieces of logic in find_busiest_group() heavily bias load balancing towards power saving mode. For example, while determining the "busiest" variable, the kernel only sets it when "sum_nr_running > group_capacity". This test is flawed because "sum_nr_running" is not necessarily the same as the sum of tasks allowed to run within the sched group. The end result is that the kernel "thinks" everything is balanced, when in reality there is an imbalance that leaves one CPU over-subscribed and the other idle. Two other pieces of logic in the same function cause a similar effect.

The nastiness of this bug is that the kernel is not able to get unstuck from this broken state. From what we've seen in our environment, the kernel stays stuck in the imbalanced state for extended periods of time, and it is also very easy for the kernel to get into that state (it's pretty much 100% reproducible for us).

So I propose the following fix: add additional logic in find_busiest_group() to detect intrinsic imbalance within the busiest group. When such a condition is detected, load balance goes into spread mode instead of the default grouping mode.

Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2007-10-17 16:55:11 +02:00
..
irq	request_irq: fix DEBUG_SHIRQ handling	2007-08-31 01:42:23 -07:00
power	hibernation doesn't even build on frv - tons of helpers are missing	2007-09-26 09:22:04 -07:00
time	time: introduce xtime_seconds	2007-10-16 10:01:50 -07:00
.gitignore
acct.c	Cleanup non-arch xtime uses, use get_seconds() or current_kernel_time().	2007-07-25 10:09:20 -07:00
audit.c	[NET]: make netlink user -> kernel interface synchronious	2007-10-10 21:15:29 -07:00
audit.h	Audit: add TTY input auditing	2007-07-16 09:05:47 -07:00
auditfilter.c	[PATCH] allow audit filtering on bit & operations	2007-07-22 09:57:02 -04:00
auditsc.c	SUNRPC: Convert rpc_pipefs to use the generic filesystem notification hooks	2007-10-09 17:15:26 -04:00
capability.c	[PATCH] pid: replace do/while_each_task_pid with do/while_each_pid_task	2007-02-12 09:48:32 -08:00
compat.c	signal/timer/event: timerfd compat code	2007-05-11 08:29:36 -07:00
configs.c	use simple_read_from_buffer in kernel/	2007-05-09 12:30:49 -07:00
cpu.c	PM: Fix dependencies of CONFIG_SUSPEND and CONFIG_HIBERNATION	2007-08-31 01:42:22 -07:00
cpuset.c	cpuset: remove sched domain hooks from cpusets	2007-10-16 09:43:09 -07:00
delayacct.c	sched: clean up schedstats, cnt -> count	2007-10-15 17:00:12 +02:00
die_notifier.c	move die notifier handling to common code	2007-05-08 11:15:04 -07:00
dma.c	[PATCH] struct seq_operations and struct file_operations constification	2006-12-07 08:39:46 -08:00
exec_domain.c	Remove obsolete #include <linux/config.h>	2006-06-30 19:25:36 +02:00
exit.c	sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields	2007-10-15 17:00:19 +02:00
extable.c
fork.c	sched: guest CPU accounting: add guest-CPU /proc/<pid>/stat fields	2007-10-15 17:00:19 +02:00
futex_compat.c	robust futex thread exit race	2007-10-01 07:52:23 -07:00
futex.c	robust futex thread exit race	2007-10-01 07:52:23 -07:00
hrtimer.c	[KTIME]: Introduce ktime_sub_ns and ktime_sub_us	2007-10-10 16:48:12 -07:00
itimer.c	The scheduled -EINVAL for invalid timevals in setitimer	2007-05-08 11:15:13 -07:00
kallsyms.c	kallsyms: make KSYM_NAME_LEN include space for trailing '\0'	2007-07-17 10:23:03 -07:00
Kconfig.hz	[PATCH] HZ: 300Hz support	2006-12-07 08:39:36 -08:00
Kconfig.preempt	[PATCH] sched: arch preempt notifier mechanism	2007-07-26 13:40:43 +02:00
kexec.c	kdump/kexec: calculate note size at compile time	2007-05-08 11:15:07 -07:00
kfifo.c	is_power_of_2: kernel/kfifo.c	2007-07-16 09:05:50 -07:00
kmod.c	Restore call_usermodehelper_pipe() behaviour	2007-09-11 17:21:20 -07:00
kprobes.c	kprobes: support kretprobe blacklist	2007-10-16 09:43:10 -07:00
ksysfs.c	sched: group scheduling, sysfs tunables	2007-10-15 17:00:14 +02:00
kthread.c	kthread: silence bogus section mismatch warning	2007-07-31 15:39:42 -07:00
latency.c	[PATCH] severing module.h->sched.h	2006-12-04 02:00:22 -05:00
lockdep_internals.h	[PATCH] lockdep: more chains	2006-12-07 08:39:43 -08:00
lockdep_proc.c	lockdep: Avoid /proc/lockdep & lock_stat infinite output	2007-10-11 22:11:11 +02:00
lockdep.c	lockdep: syscall exit check	2007-10-11 22:11:12 +02:00
Makefile	user namespace: add the framework	2007-07-16 09:05:47 -07:00
module.c	Fix Off-by-one in /sys/module/*/refcnt	2007-08-22 14:35:35 -07:00
mutex-debug.c	[PATCH] remove many unneeded #includes of sched.h	2007-02-14 08:09:54 -08:00
mutex-debug.h	[PATCH] lockdep: better lock debugging	2006-07-03 15:27:01 -07:00
mutex.c	lockdep: fixup mutex annotations	2007-10-11 22:11:12 +02:00
mutex.h	[PATCH] lockdep: prove mutex locking correctness	2006-07-03 15:27:04 -07:00
nsproxy.c	[NET]: Add network namespace clone & unshare support.	2007-10-10 16:52:46 -07:00
panic.c	Report that kernel is tainted if there was an OOPS	2007-07-17 10:23:02 -07:00
params.c	modules: better error messages when modules fail to load due to a sysfs problem.	2007-07-30 14:25:23 -07:00
pid.c	namespace: ensure clone_flags are always stored in an unsigned long	2007-07-16 09:05:48 -07:00
posix-cpu-timers.c	sched: make posix-cpu-timers use CFS's accounting information	2007-07-09 18:51:58 +02:00
posix-timers.c	more low-hanging fruits - kernel, fs, lib signedness	2007-10-14 12:41:52 -07:00
printk.c	slow down printk during boot	2007-10-16 09:42:49 -07:00
profile.c	Memoryless nodes: Allow profiling data to fall back to other nodes	2007-10-16 09:42:58 -07:00
ptrace.c	m32r: convert to generic sys_ptrace	2007-10-16 09:43:04 -07:00
rcupdate.c	lockdep: annotate rcu_read_{,un}lock{,_bh}	2007-10-11 22:11:12 +02:00
rcutorture.c	Freezer: make kernel threads nonfreezable by default	2007-07-17 10:23:02 -07:00
relay.c	Fix a use after free bug in kernel->userspace relay file support	2007-07-31 15:39:42 -07:00
resource.c	memory unplug: memory hotplug cleanup	2007-10-16 09:43:01 -07:00
rtmutex_common.h	FUTEX: Tidy up the code	2007-07-16 09:05:49 -07:00
rtmutex-debug.c	FUTEX: Tidy up the code	2007-07-16 09:05:49 -07:00
rtmutex-debug.h	[PATCH] lockdep: better lock debugging	2006-07-03 15:27:01 -07:00
rtmutex-tester.c	Freezer: make kernel threads nonfreezable by default	2007-07-17 10:23:02 -07:00
rtmutex.c	FUTEX: Tidy up the code	2007-07-16 09:05:49 -07:00
rtmutex.h	[PATCH] lockdep: better lock debugging	2006-07-03 15:27:01 -07:00
rwsem.c	lockstat: hook into spinlock_t, rwlock_t, rwsem and mutex	2007-07-19 10:04:49 -07:00
sched_debug.c	Make scheduler debug file operations const	2007-10-15 17:00:19 +02:00
sched_fair.c	sched: reintroduce cache-hot affinity	2007-10-15 17:00:18 +02:00
sched_idletask.c	sched: mark scheduling classes as const	2007-10-15 17:00:12 +02:00
sched_rt.c	sched: tidy up SCHED_RR	2007-10-15 17:00:13 +02:00
sched_stats.h	sched: clean up schedstats, cnt -> count	2007-10-15 17:00:12 +02:00
sched.c	sched: fix improper load balance across sched domain	2007-10-17 16:55:11 +02:00
seccomp.c	make seccomp zerocost in schedule	2007-07-16 09:05:50 -07:00
signal.c	fix bogus reporting of signals by audit	2007-10-07 16:28:43 -07:00
softirq.c	[KERNEL]: Unexport raise_softirq_irqoff	2007-10-10 16:49:18 -07:00
softlockup.c	Freezer: make kernel threads nonfreezable by default	2007-07-17 10:23:02 -07:00
spinlock.c	lockstat: hook into spinlock_t, rwlock_t, rwsem and mutex	2007-07-19 10:04:49 -07:00
srcu.c	[PATCH] SRCU: report out-of-memory errors	2006-10-04 07:55:30 -07:00
stacktrace.c	[PATCH] lockdep: stacktrace subsystem, core	2006-07-03 15:27:02 -07:00
stop_machine.c	Fix stop_machine_run problem with naughty real time process	2007-07-16 09:05:41 -07:00
sys_ni.c	diskquota: 32bit quota tools on 64bit architectures	2007-07-16 09:05:48 -07:00
sys.c	Fix SMP poweroff hangs	2007-10-01 07:52:23 -07:00
sysctl.c	hugetlb: Add hugetlb_dynamic_pool sysctl	2007-10-16 09:43:02 -07:00
taskstats.c	taskstats: add context-switch counters	2007-07-16 09:05:46 -07:00
time.c	time: introduce xtime_seconds	2007-10-16 10:01:50 -07:00
timer.c	Pull ia64-clocksource into release branch	2007-07-20 11:26:47 -07:00
tsacct.c	Cleanup non-arch xtime uses, use get_seconds() or current_kernel_time().	2007-07-25 10:09:20 -07:00
uid16.c	header cleaning: don't include smp_lock.h when not used	2007-05-08 11:15:07 -07:00
user_namespace.c	Fix user namespace exiting OOPs	2007-09-19 11:24:18 -07:00
user.c	sched: generate uevents for user creation/destruction	2007-10-15 17:00:18 +02:00
utsname_sysctl.c	remove CONFIG_UTS_NS and CONFIG_IPC_NS	2007-07-16 09:05:47 -07:00
utsname.c	Fix UTS corruption during clone(CLONE_NEWUTS)	2007-09-19 11:24:17 -07:00
wait.c	Fix occurrences of "the the "	2007-05-09 08:57:56 +02:00
workqueue.c	fix bogus hotplug cpu warning	2007-08-27 10:27:48 -07:00