linux/kernel
Huaixin Chang f4183717b3 sched/fair: Introduce the burstable CFS controller
The CFS bandwidth controller limits CPU requests of a task group to
quota during each period. However, parallel workloads might be bursty
so that they get throttled even when their average utilization is under
quota. And they are latency sensitive at the same time so that
throttling them is undesired.

We borrow time now against our future underrun, at the cost of increased
interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guaranteeds both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still, there is
never time to catch up, unbounded fail.

This work observes that a workload doesn't always executes the full
quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.

At the same time, we can say that the worst case deadline miss, will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).

The benefit of burst is seen when testing with schbench. Default value of
kernel.sched_cfs_bandwidth_slice_us(5ms) and CONFIG_HZ(1000) is used.

	mkdir /sys/fs/cgroup/cpu/test
	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I run this for 10 times, and got long tail
latency for 6 times and got throttled for 8 times.

Tail latencies are shown below, and it wasn't the worst case.

	Latency percentiles (usec)
		50.0000th: 19872
		75.0000th: 21344
		90.0000th: 22176
		95.0000th: 22496
		*99.0000th: 22752
		99.5000th: 22752
		99.9000th: 22752
		min=0, max=22727
	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752 p95/cputime 28.12% p99/cputime 28.44%

The interferenece when using burst is valued by the possibilities for
missing the deadline and the average WCET. Test results showed that when
there many cgroups or CPU is under utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
2021-06-24 09:07:50 +02:00
..
bpf bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks 2021-06-02 21:59:22 +02:00
cgroup sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
configs drivers/char: remove /dev/kmem for good 2021-05-07 00:26:34 -07:00
debug sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
dma Merge branch 'stable/for-linus-5.13' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb 2021-05-04 10:58:49 -07:00
entry tick/nohz: Only check for RCU deferred wakeup on user/guest entry when needed 2021-05-31 10:14:49 +02:00
events sched,perf,kvm: Fix preemption condition 2021-06-18 11:43:07 +02:00
gcov gcov: clang: drop support for clang-10 and older 2021-05-07 00:26:32 -07:00
irq gpio updates for v5.13 2021-05-05 12:39:29 -07:00
kcsan sched: Introduce task_is_running() 2021-06-18 11:43:07 +02:00
livepatch Livepatching changes for 5.13 2021-04-27 18:14:38 -07:00
locking sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
power PM: sleep: fix typos in comments 2021-04-08 19:37:21 +02:00
printk kernel/printk.c: Fixed mundane typos 2021-03-30 15:34:17 +02:00
rcu sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
sched sched/fair: Introduce the burstable CFS controller 2021-06-24 09:07:50 +02:00
time sched,timer: Use __set_current_state() 2021-06-18 11:43:08 +02:00
trace tracing: Correct the length check which causes memory corruption 2021-06-08 16:44:00 -04:00
.gitignore .gitignore: prefix local generated files with a slash 2021-05-02 00:43:35 +09:00
acct.c kernel/acct.c: use #elif instead of #end and #elif 2020-12-15 22:46:15 -08:00
async.c kernel/async.c: remove async_unregister_domain() 2021-05-07 00:26:33 -07:00
audit_fsnotify.c audit_alloc_mark(): don't open-code ERR_CAST() 2021-02-23 10:25:27 -05:00
audit_tree.c fsnotify: generalize handle_inode_event() 2020-12-03 14:58:35 +01:00
audit_watch.c fsnotify: generalize handle_inode_event() 2020-12-03 14:58:35 +01:00
audit.c lsm: separate security_task_getsecid() into subjective and objective variants 2021-03-22 15:23:32 -04:00
audit.h audit: avoid -Wempty-body warning 2021-03-24 12:11:48 -04:00
auditfilter.c lsm: separate security_task_getsecid() into subjective and objective variants 2021-03-22 15:23:32 -04:00
auditsc.c audit/stable-5.13 PR 20210426 2021-04-27 13:50:58 -07:00
backtracetest.c
bounds.c
capability.c capability: handle idmapped mounts 2021-01-24 14:27:16 +01:00
cfi.c add support for Clang CFI 2021-04-08 16:04:20 -07:00
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c cpumask/hotplug: Fix cpu_dying() state tracking 2021-04-21 13:55:43 +02:00
crash_core.c kdump: append uts_namespace.name offset to VMCOREINFO 2020-12-15 22:46:18 -08:00
crash_dump.c
cred.c kernel/cred.c: make init_groups static 2021-05-06 19:24:11 -07:00
delayacct.c delayacct: Add sysctl to enable at runtime 2021-05-12 11:43:25 +02:00
dma.c
exec_domain.c
exit.c do_wait: make PIDTYPE_PID case O(1) instead of O(n) 2021-05-06 19:24:13 -07:00
extable.c
fail_function.c fault-injection: handle EI_ETYPE_TRUE 2020-12-15 22:46:19 -08:00
fork.c sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
freezer.c sched: Add get_current_state() 2021-06-18 11:43:08 +02:00
futex.c futex: Make syscall entry points less convoluted 2021-05-06 20:19:04 +02:00
gen_kheaders.sh kbuild: redo fake deps at include/config/*.h 2021-04-25 05:26:10 +09:00
groups.c groups: simplify struct group_info allocation 2021-02-26 09:41:03 -08:00
hung_task.c sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
iomem.c
irq_work.c irq_work: Make irq_work_queue() NMI-safe again 2021-06-10 10:00:08 +02:00
jump_label.c static_call: Fix static_call_update() sanity check 2021-03-19 13:16:44 +01:00
kallsyms.c kallsyms: strip ThinLTO hashes from static functions 2021-04-08 16:04:21 -07:00
kcmp.c Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-12-15 19:36:48 -08:00
Kconfig.freezer
Kconfig.hz
Kconfig.locks
Kconfig.preempt sched: Add CONFIG_SCHED_CORE help text 2021-06-01 16:00:10 +02:00
kcov.c kernel: make kcov_common_handle consider the current context 2020-11-02 18:00:20 -08:00
kexec_core.c kexec: dump kmessage before machine_kexec 2021-05-07 00:26:32 -07:00
kexec_elf.c
kexec_file.c kernel: kexec_file: fix error return code of kexec_calculate_store_digests() 2021-05-07 00:26:32 -07:00
kexec_internal.h kexec: move machine_kexec_post_load() to public interface 2021-02-22 12:33:26 +00:00
kexec.c
kheaders.c
kmod.c modules: add CONFIG_MODPROBE_PATH 2021-05-07 00:26:33 -07:00
kprobes.c kprobes: Fix to delay the kprobes jump optimization 2021-02-19 14:57:12 -05:00
ksysfs.c
kthread.c sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
latencytop.c
Makefile kbuild: update config_data.gz only when the content of .config is changed 2021-05-02 00:43:35 +09:00
module_signature.c module: harden ELF info handling 2021-01-19 10:24:45 +01:00
module_signing.c module: harden ELF info handling 2021-01-19 10:24:45 +01:00
module-internal.h
module.c module: check for exit sections in layout_sections() instead of module_init_section() 2021-05-17 09:48:24 +02:00
notifier.c
nsproxy.c fixes-v5.11 2020-12-14 16:40:27 -08:00
padata.c
panic.c panic: don't dump stack twice on warn 2020-11-14 11:26:04 -08:00
params.c Modules updates for v5.11 2020-12-17 13:01:31 -08:00
pid_namespace.c fixes-v5.11 2020-12-14 16:40:27 -08:00
pid.c Merge branch 'exec-update-lock-for-v5.11' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace 2020-12-15 19:36:48 -08:00
profile.c kernel: Initialize cpumask before parsing 2021-04-10 13:35:54 +02:00
ptrace.c sched: Change task_struct::state 2021-06-18 11:43:09 +02:00
range.c kernel.h: split out min()/max() et al. helpers 2020-10-16 11:11:19 -07:00
reboot.c Revert "PM: ACPI: reboot: Use S5 for reboot" 2021-03-18 16:58:02 +01:00
regset.c
relay.c relay: allow the use of const callback structs 2020-12-15 22:46:18 -08:00
resource_kunit.c resource: provide meaningful MODULE_LICENSE() in test suite 2020-11-25 18:52:35 +01:00
resource.c kernel/resource: fix return code check in __request_free_mem_region 2021-05-14 19:41:32 -07:00
rseq.c rseq: Optimise rseq_get_rseq_cs() and clear_rseq_cs() 2021-04-14 18:04:09 +02:00
scftorture.c scftorture: Add debug output for wrong-CPU warning 2021-01-04 13:53:41 -08:00
scs.c scs: switch to vmapped shadow stacks 2020-12-01 10:30:28 +00:00
seccomp.c seccomp: Refactor notification handler to prepare for new semantics 2021-05-29 11:13:27 -07:00
signal.c sched: Introduce task_is_running() 2021-06-18 11:43:07 +02:00
smp.c smp: Fix smp_call_function_single_async prototype 2021-05-06 15:33:49 +02:00
smpboot.c sched/core: Initialize the idle task with preemption disabled 2021-05-12 13:01:45 +02:00
smpboot.h
softirq.c sched: Introduce task_is_running() 2021-06-18 11:43:07 +02:00
stackleak.c
stacktrace.c
static_call.c static_call: Fix unused variable warn w/o MODULE 2021-04-09 13:22:12 +02:00
stop_machine.c stop_machine: Add caller debug info to queue_stop_cpus_work 2021-03-23 16:01:58 +01:00
sys_ni.c Add Landlock, a new LSM from Mickaël Salaün <mic@linux.microsoft.com> 2021-05-01 18:50:44 -07:00
sys.c sched: prctl() core-scheduling interface 2021-05-12 11:43:31 +02:00
sysctl-test.c
sysctl.c Merge branch 'sched/urgent' into sched/core, to pick up fixes 2021-06-03 19:00:49 +02:00
task_work.c kasan: record task_work_add() call stack 2021-04-30 11:20:42 -07:00
taskstats.c treewide: rename nla_strlcpy to nla_strscpy. 2020-11-16 08:08:54 -08:00
test_kprobes.c
torture.c torture: Replace torture_init_begin string with %s 2021-03-08 14:22:28 -08:00
tracepoint.c tracepoints: Code clean up 2021-02-09 12:27:29 -05:00
tsacct.c
ucount.c fanotify: configurable limits via sysfs 2021-03-16 16:49:31 +01:00
uid16.c
uid16.h
umh.c kernel/umh.c: fix some spelling mistakes 2021-05-07 00:26:34 -07:00
up.c A set of locking related fixes and updates: 2021-05-09 13:07:03 -07:00
user_namespace.c kernel/user_namespace.c: fix typos 2021-05-07 00:26:34 -07:00
user-return-notifier.c
user.c
usermode_driver.c bpf: Fix umd memory leak in copy_process() 2021-03-19 22:23:19 +01:00
utsname_sysctl.c
utsname.c
watch_queue.c watch_queue: rectify kernel-doc for init_watch() 2021-01-26 11:16:34 +00:00
watchdog_hld.c
watchdog.c watchdog: reliable handling of timestamps 2021-05-22 15:09:07 -10:00
workqueue_internal.h
workqueue.c wq: handle VM suspension in stall detection 2021-05-20 12:58:30 -04:00