linux/kernel/sched
Huaixin Chang f4183717b3 sched/fair: Introduce the burstable CFS controller
The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled even when their average utilization is under
quota. And they are latency sensitive at the same time so that
throttling them is undesired.

We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guaranteeds both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.

This work observes that a workload doesn't always executes the full
quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.

At the same time, we can say that the worst case deadline miss, will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).

The benefit of burst is seen when testing with schbench. Default value of
kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.

	mkdir /sys/fs/cgroup/cpu/test
	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I run this for 10 times, and got long tail
latency for 6 times and got throttled for 8 times.

Tail latencies are shown below, and it wasn't the worst case.

	Latency percentiles (usec)
		50.0000th: 19872
		75.0000th: 21344
		90.0000th: 22176
		95.0000th: 22496
		*99.0000th: 22752
		99.5000th: 22752
		99.9000th: 22752
		min=0, max=22727
	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%

The interferenece when using burst is valued by the possibilities for
missing the deadline and the average WCET. Test results showed that when
there many cgroups or CPU is under utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
2021-06-24 09:07:50 +02:00
..
autogroup.c sched/autogroup: Make autogroup_path() always available 2019-06-24 19:23:40 +02:00
autogroup.h
clock.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
completion.c completion: Use lockdep_assert_RT_in_threaded_ctx() in complete_all() 2020-03-23 18:40:25 +01:00
core_sched.c sched: prctl() core-scheduling interface 2021-05-12 11:43:31 +02:00
core.c sched/fair: Introduce the burstable CFS controller 2021-06-24 09:07:50 +02:00
cpuacct.c sched: Wrap rq::lock access 2021-05-12 11:43:26 +02:00
cpudeadline.c sched,rt: Use the full cpumask for balancing 2020-11-10 18:39:00 +01:00
cpudeadline.h
cpufreq_schedutil.c sched/cpufreq: Consider reduced CPU capacity in energy calculation 2021-06-17 14:11:43 +02:00
cpufreq.c cpufreq: Avoid leaving stale IRQ work items during CPU offline 2019-12-12 17:59:43 +01:00
cpupri.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
cpupri.h sched/cpupri: Add CPUPRI_HIGHER 2020-10-29 11:00:30 +01:00
cputime.c Scheduler updates for this cycle are: 2021-04-28 13:33:57 -07:00
deadline.c sched/rt: Fix Deadline utilization tracking during policy change 2021-06-22 16:41:59 +02:00
debug.c Merge branch 'sched/urgent' into sched/core, to resolve conflicts 2021-06-18 11:31:25 +02:00
fair.c sched/fair: Introduce the burstable CFS controller 2021-06-24 09:07:50 +02:00
features.h sched: Warn on long periods of pending need_resched 2021-04-21 13:55:41 +02:00
idle.c sched: Trivial forced-newidle balancer 2021-05-12 11:43:30 +02:00
isolation.c sched/isolation: Reconcile rcu_nocbs= and nohz_full= 2021-05-13 14:12:47 +02:00
loadavg.c sched: Make multiple runqueue task counters 32-bit 2021-05-12 21:34:17 +02:00
Makefile sched: Trivial core scheduling cookie management 2021-05-12 11:43:31 +02:00
membarrier.c sched/membarrier: fix missing local execution of ipi_sync_rq_state() 2021-03-06 12:40:21 +01:00
pelt.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
pelt.h Merge branch 'sched/urgent' into sched/core, to resolve conflicts 2021-06-18 11:31:25 +02:00
psi.c psi: Fix psi state corruption when schedule() races with cgroup move 2021-05-06 15:33:26 +02:00
rt.c sched/rt: Fix RT utilization tracking during policy change 2021-06-22 16:41:59 +02:00
sched-pelt.h sched/fair: Fix "runnable_avg_yN_inv" not used warnings 2019-06-17 12:15:58 +02:00
sched.h sched/fair: Introduce the burstable CFS controller 2021-06-24 09:07:50 +02:00
smp.h sched/headers: Split out open-coded prototypes into kernel/sched/smp.h 2020-05-28 11:03:20 +02:00
stats.c sched: Fix various typos 2021-03-22 00:11:52 +01:00
stats.h sched: Introduce task_is_running() 2021-06-18 11:43:07 +02:00
stop_task.c sched: Introduce sched_class::pick_task() 2021-05-12 11:43:28 +02:00
swait.c sched/swait: Prepare usage in completions 2020-03-21 16:00:23 +01:00
topology.c sched: Wrap rq::lock access 2021-05-12 11:43:26 +02:00
wait_bit.c sched/wait: fix ___wait_var_event(exclusive) 2019-12-17 13:32:50 +01:00
wait.c sched/wait: Add add_wait_queue_priority() 2020-11-15 09:49:09 -05:00