linux

mirror of https://github.com/torvalds/linux.git synced 2024-11-24 13:11:40 +00:00


Preeti U Murthy	d4573c3e1c	2015-03-27 09:36:09 +01:00

sched: Improve load balancing in the presence of idle CPUs

When a CPU is kicked to do nohz idle balancing, it wakes up to do load balancing on itself, followed by load balancing on behalf of idle CPUs. But it may end up with load after the load balancing attempt on itself. This aborts nohz idle balancing. As a result, several idle CPUs are left without tasks until an ILB CPU finds it unfavorable to pull tasks upon itself. This delays the spreading of load across idle CPUs and, worse, clutters only a few CPUs with tasks.

The effect of the above problem was observed on an SMT8 POWER server with 2 levels of NUMA domains. Busy loops equal to the number of cores were spawned. Since load balancing on fork/exec is discouraged across NUMA domains, all busy loops would start on one of the NUMA domains. However, it was expected that eventually one busy loop would run per core across all domains due to nohz idle load balancing. But it was observed that it took as long as 10 seconds to spread the load across NUMA domains.

Further investigation showed that this was a consequence of the following:

1. An ILB CPU was chosen from the first NUMA domain to trigger nohz idle load balancing. [Given the experiment, up to 6 CPUs per core could be potentially idle in this domain.]
2. However, the ILB CPU would call load_balance() on itself before initiating nohz idle load balancing.
3. Given that the cores are SMT8, the ILB CPU had enough opportunities to pull tasks from its sibling cores to even out load.
4. Now that the ILB CPU was no longer idle, it would abort nohz idle load balancing.

As a result, the opportunities to spread load across NUMA domains were lost until the cores within the first NUMA domain had an equal number of tasks among themselves. This is a pretty bad scenario, since the cores within the first NUMA domain would have as many as 4 tasks each, while cores in the neighbouring NUMA domains would all remain idle.

Fix this by checking whether a CPU was woken up to do nohz idle load balancing before it does load balancing upon itself. This way we allow idle CPUs across the system to do load balancing, which results in a quicker spread of load, instead of performing load balancing within the local sched domain hierarchy of the ILB CPU alone under circumstances such as the above.

Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Jason Low <jason.low2@hp.com>
Cc: benh@kernel.crashing.org
Cc: daniel.lezcano@linaro.org
Cc: efault@gmx.de
Cc: iamjoonsoo.kim@lge.com
Cc: morten.rasmussen@arm.com
Cc: pjt@google.com
Cc: riel@redhat.com
Cc: srikar@linux.vnet.ibm.com
Cc: svaidy@linux.vnet.ibm.com
Cc: tim.c.chen@linux.intel.com
Cc: vincent.guittot@linaro.org
Link: http://lkml.kernel.org/r/20150326130014.21532.17158.stgit@preeti.in.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
auto_group.c	sched/autogroup: Fix failure to set cpu.rt_runtime_us	2015-02-18 16:17:20 +01:00
auto_group.h	Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled"	2012-12-11 10:23:45 +01:00
clock.c	kernel/sched/clock.c: add another clock for use with the soft lockup watchdog	2015-02-12 18:54:13 -08:00
completion.c	sched/completion: Serialize completion_done() with complete()	2015-02-18 14:27:40 +01:00
core.c	sched: Add SD_PREFER_SIBLING for SMT level	2015-03-27 09:36:05 +01:00
cpuacct.c	cgroup: rename cgroup_subsys->base_cftypes to ->legacy_cftypes	2014-07-15 11:05:09 -04:00
cpuacct.h	sched/cpuacct: Initialize root cpuacct earlier	2013-04-10 13:54:20 +02:00
cpudeadline.c	sched/deadline: Remove cpu_active_mask from cpudl_find()	2015-02-04 07:52:29 +01:00
cpudeadline.h	sched/deadline: Modify cpudl::free_cpus to reflect rd->online	2015-01-30 19:39:16 +01:00
cpupri.c	Merge commit '3cf2f34' into sched/core, to fix build error	2014-06-12 13:46:37 +02:00
cpupri.h	sched/cpupri: Remove unnecessary definitions in cpupri.h	2014-11-16 10:58:59 +01:00
cputime.c	sched, time: Fix build error with 64 bit cputime_t on 32 bit systems	2014-10-03 05:46:55 +02:00
deadline.c	sched/deadline: Add rq->clock update skip for dl task yield	2015-03-10 05:46:50 +01:00
debug.c	sched: Track group sched_entity usage contributions	2015-03-27 09:35:58 +01:00
fair.c	sched: Improve load balancing in the presence of idle CPUs	2015-03-27 09:36:09 +01:00
features.h	sched/rt: Use IPI to trigger RT task push migration instead of pulling	2015-03-23 10:55:22 +01:00
idle_task.c	sched: Provide update_curr callbacks for stop/idle scheduling classes	2014-11-23 14:14:40 -08:00
idle.c	cpuidle / sleep: Use broadcast timer for states that stop local timer	2015-03-05 23:13:19 +01:00
Makefile	ftrace: allow architectures to specify ftrace compile options	2015-01-29 09:19:19 +01:00
proc.c	cpuidle: menu: Lookup CPU runqueues less	2014-08-06 21:17:45 +02:00
rt.c	sched/rt: Use IPI to trigger RT task push migration instead of pulling	2015-03-23 10:55:22 +01:00
sched.h	sched: Optimize freq invariant accounting	2015-03-27 09:36:08 +01:00
stats.c	sched: use %*pb[l] to print bitmaps including cpumasks and nodemasks	2015-02-13 21:21:37 -08:00
stats.h	sched: Micro-optimize by dropping unnecessary task_rq() calls	2013-09-25 13:51:06 +02:00
stop_task.c	sched: Provide update_curr callbacks for stop/idle scheduling classes	2014-11-23 14:14:40 -08:00
wait.c	sched/wait: Fix a kthread race with wait_woken()	2014-11-04 07:17:44 +01:00