linux

Vladimir Davydov 685207963b sched: Move h_load calculation to task_h_load()	2013-07-23 12:18:41 +02:00

The bad thing about update_h_load(), which computes hierarchical load
factor for task groups, is that it is called for each task group in the
system before every load balancer run, and since rebalance can be
triggered very often, this function can eat really a lot of cpu time if
there are many cpu cgroups in the system.

Although the situation was improved significantly by commit `a35b646`
('sched, cgroup: Reduce rq->lock hold times for large cgroup
hierarchies'), the problem still can arise under some kinds of loads,
e.g. when cpus are switching from idle to busy and back very frequently.

For instance, when I start 1000 of processes that wake up every
millisecond on my 8 cpus host, 'top' and 'perf top' show:

  Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
  Events: 243K cycles
    7.57%  [kernel]      [k] __schedule
    7.08%  [kernel]      [k] timerqueue_add
    6.13%  libc-2.12.so  [.] usleep

Then if I create 10000 idle cpu cgroups (no processes in them), cpu
usage increases significantly although the 'wakers' are still executing
in the root cpu cgroup:

  Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
  Events: 230K cycles
   24.56%  [kernel]      [k] tg_load_down
    5.76%  [kernel]      [k] __schedule

This happens because this particular kind of load triggers 'new idle'
rebalance very frequently, which requires calling update_h_load(),
which, in turn, calls tg_load_down() for every idle cpu cgroup even
though it is absolutely useless, because idle cpu cgroups have no tasks
to pull.

This patch tries to improve the situation by making h_load calculation
proceed only when h_load is really necessary. To achieve this, it
substitutes update_h_load() with update_cfs_rq_h_load(), which computes
h_load only for a given cfs_rq and all its ascendants, and makes the
load balancer call this function whenever it considers if a task should
be pulled, i.e. it moves h_load calculations directly to task_h_load().

For h_load of the same cfs_rq not to be updated multiple times (in case
several tasks in the same cgroup are considered during the same balance
run), the patch keeps the time of the last h_load update for each cfs_rq
and breaks calculation when it finds h_load to be uptodate.

The benefit of it is that h_load is computed only for those cfs_rq's,
which really need it, in particular all idle task groups are skipped.
Although this, in fact, moves h_load calculation under rq lock, it
should not affect latency much, because the amount of work done under rq
lock while trying to pull tasks is limited by sched_nr_migrate.

After the patch applied with the setup described above (1000 wakers in
the root cgroup and 10000 idle cgroups), I get:

  Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
  Events: 242K cycles
    7.57%  [kernel]      [k] __schedule
    6.70%  [kernel]      [k] timerqueue_add
    5.93%  libc-2.12.so  [.] usleep

Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
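
The lazy scheme the commit message describes (compute h_load only for one cfs_rq and its ancestors, and stop at any level already computed during the current tick) can be illustrated with a small user-space model. This is a rough sketch and not the kernel code: `struct node`, its fields, the `updates` counter, and this `update_h_load()` signature are all hypothetical stand-ins, and the scaling formula `parent_h_load * contrib / (parent_load + 1)` is an assumed simplification. A `last_update` of 0 is taken to mean "never computed", so ticks are assumed to start at 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a per-group cfs_rq. */
struct node {
    struct node *parent;        /* NULL for the root group */
    struct node *next;          /* scratch: child to descend into */
    unsigned long load;         /* load queued on this node */
    unsigned long contrib;      /* this node's share of parent->load */
    unsigned long h_load;       /* cached hierarchical load */
    unsigned long last_update;  /* tick of the last h_load computation */
};

/* Counts per-node recomputations, only to make the savings observable. */
static unsigned long updates;

/* Bring n->h_load up to date for tick 'now', touching only n's ancestors. */
static void update_h_load(struct node *n, unsigned long now)
{
    struct node *cur = n;

    assert(n != NULL);
    if (n->last_update == now)
        return;                 /* already computed during this tick */

    /* Walk up, recording the path, until the root or a fresh ancestor. */
    n->next = NULL;
    while (cur->parent != NULL) {
        cur->parent->next = cur;
        cur = cur->parent;
        if (cur->last_update == now)
            break;
    }

    /* Only the root can still be stale here; its h_load is its own load. */
    if (cur->last_update != now) {
        cur->h_load = cur->load;
        cur->last_update = now;
        updates++;
    }

    /* Walk back down the recorded path, scaling by each child's share. */
    while (cur->next != NULL) {
        struct node *child = cur->next;

        child->h_load = cur->h_load * child->contrib / (cur->load + 1);
        child->last_update = now;
        updates++;
        cur = child;
    }
}
```

In this model, groups that are never passed to `update_h_load()` and are not ancestors of such a group are never visited, which mirrors how idle task groups are skipped; a second call for the same node, or for a sibling, within the same tick reuses the ancestors' cached values.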
..
auto_group.c	sched/autogroup: Fix race with task_groups list	2013-05-28 09:40:22 +02:00
auto_group.h	Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled"	2012-12-11 10:23:45 +01:00
clock.c	sched_clock: Prevent 64bit inatomicity on 32bit systems	2013-04-08 11:50:44 +02:00
core.c	kernel: delete __cpuinit usage from all core kernel files	2013-07-14 19:36:59 -04:00
cpuacct.c	sched/cpuacct/UML: Fix header file dependency bug on the UML build	2013-04-10 15:12:41 +02:00
cpuacct.h	sched/cpuacct: Initialize root cpuacct earlier	2013-04-10 13:54:20 +02:00
cpupri.c	sched/rt: Move rt specific bits into new header file	2013-02-07 20:51:08 +01:00
cpupri.h
cputime.c	Linux 3.10	2013-07-01 11:18:53 +02:00
debug.c	sched/debug: Remove CONFIG_FAIR_GROUP_SCHED mask	2013-06-28 13:17:17 +02:00
fair.c	sched: Move h_load calculation to task_h_load()	2013-07-23 12:18:41 +02:00
features.h	mutex: Move mutex spinning code from sched/core.c back to mutex.c	2013-04-19 09:33:34 +02:00
idle_task.c	sched: Keep at least 1 tick per second for active dynticks tasks	2013-05-04 08:32:02 +02:00
Makefile	sched: Factor out load calculation code from sched/core.c --> sched/proc.c	2013-05-07 13:14:50 +02:00
proc.c	sched: Change get_rq_runnable_load() to static and inline	2013-06-27 10:07:44 +02:00
rt.c	sched/rt: Simplify pull_rt_task() logic and remove .leaf_rt_rq_list	2013-06-19 12:58:40 +02:00
sched.h	sched: Move h_load calculation to task_h_load()	2013-07-23 12:18:41 +02:00
stats.c	fix a leak in /proc/schedstats	2013-04-29 15:41:45 -04:00
stats.h	Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip	2013-07-06 14:09:38 -07:00
stop_task.c	sched: Use an accessor to read the rq clock	2013-05-28 09:40:27 +02:00