linux/kernel/sched
David Vernet 8a010b81b3 sched_ext: Implement runnable task stall watchdog
The most common and critical way that a BPF scheduler can misbehave is by
failing to run runnable tasks for too long. This patch implements a
watchdog.

* All tasks record when they become runnable.

* A watchdog work periodically scans all runnable tasks. If any task has
  stayed runnable for too long, the BPF scheduler is aborted.

* scheduler_tick() monitors whether the watchdog itself is stuck. If so, the
  BPF scheduler is aborted.

Because the watchdog only scans the tasks which are currently runnable and
usually very infrequently, the overhead should be negligible.
scx_qmap is updated so that it can be told to stall user and/or
kernel tasks.

A detected task stall looks like the following:

 sched_ext: BPF scheduler "qmap" errored, disabling
 sched_ext: runnable task stall (dbus-daemon[953] failed to run for 6.478s)
    scx_check_timeout_workfn+0x10e/0x1b0
    process_one_work+0x287/0x560
    worker_thread+0x234/0x420
    kthread+0xe9/0x100
    ret_from_fork+0x1f/0x30

A detected watchdog stall:

 sched_ext: BPF scheduler "qmap" errored, disabling
 sched_ext: runnable task stall (watchdog failed to check in for 5.001s)
    scheduler_tick+0x2eb/0x340
    update_process_times+0x7a/0x90
    tick_sched_timer+0xd8/0x130
    __hrtimer_run_queues+0x178/0x3b0
    hrtimer_interrupt+0xfc/0x390
    __sysvec_apic_timer_interrupt+0xb7/0x2b0
    sysvec_apic_timer_interrupt+0x90/0xb0
    asm_sysvec_apic_timer_interrupt+0x1b/0x20
    default_idle+0x14/0x20
    arch_cpu_idle+0xf/0x20
    default_idle_call+0x50/0x90
    do_idle+0xe8/0x240
    cpu_startup_entry+0x1d/0x20
    kernel_init+0x0/0x190
    start_kernel+0x0/0x392
    start_kernel+0x324/0x392
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x104/0x109
    secondary_startup_64_no_verify+0xce/0xdb

Note that this patch exposes scx_ops_error[_type]() in kernel/sched/ext.h to
inline scx_notify_sched_tick().

v4: - While disabling, cancel_delayed_work_sync(&scx_watchdog_work) was
      being called before forward progress was guaranteed and thus could
      lead to system lockup. Relocated.

    - While enabling, it was comparing msecs against jiffies without
      conversion leading to spurious load failures on lower HZ kernels.
      Fixed.

    - runnable list management is now used by core bypass logic and moved to
      the patch implementing sched_ext core.

v3: - bpf_scx_init_member() was incorrectly comparing ops->timeout_ms
      against SCX_WATCHDOG_MAX_TIMEOUT which is in jiffies without
      conversion leading to spurious load failures in lower HZ kernels.
      Fixed.

v2: - Julia Lawall noticed that the watchdog code was mixing msecs and
      jiffies. Fix by using jiffies for everything.

Signed-off-by: David Vernet <dvernet@meta.com>
Reviewed-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Cc: Julia Lawall <julia.lawall@inria.fr>
2024-06-18 10:09:18 -10:00
..
autogroup.c scheduler: Remove the now superfluous sentinel elements from ctl_table array 2024-04-24 09:43:54 +02:00
autogroup.h sched/headers: Add header guard to kernel/sched/stats.h and kernel/sched/autogroup.h 2022-02-23 08:22:00 +01:00
build_policy.c sched_ext: Add sysrq-S which disables the BPF scheduler 2024-06-18 10:09:18 -10:00
build_utility.c sched/headers: Remove duplicate header inclusions 2023-10-03 21:27:55 +02:00
clock.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
completion.c sched: add a few helpers to wake up tasks on the current cpu 2023-07-17 16:08:08 -07:00
core_sched.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
core.c sched_ext: Implement runnable task stall watchdog 2024-06-18 10:09:18 -10:00
cpuacct.c Merge branch 'sched/fast-headers' into sched/core 2022-03-15 09:05:05 +01:00
cpudeadline.c sched/topology: Consolidate and clean up access to a CPU's max compute capacity 2023-10-09 12:59:48 +02:00
cpudeadline.h
cpufreq_schedutil.c sched/fair: Fix frequency selection for non-invariant case 2024-01-16 10:41:25 +01:00
cpufreq.c sched/headers: Introduce kernel/sched/build_utility.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
cpupri.c sched/rt: Fix live lock between select_fallback_rq() and RT push 2023-09-28 22:58:13 +02:00
cpupri.h sched/cpupri: Add CPUPRI_HIGHER 2020-10-29 11:00:30 +01:00
cputime.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
deadline.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
debug.c sched_ext: Implement BPF extensible scheduler class 2024-06-18 10:09:17 -10:00
ext.c sched_ext: Implement runnable task stall watchdog 2024-06-18 10:09:18 -10:00
ext.h sched_ext: Implement runnable task stall watchdog 2024-06-18 10:09:18 -10:00
fair.c sched: Add normal_policy() 2024-06-18 10:09:17 -10:00
features.h sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true) 2023-12-23 15:59:56 +01:00
idle.c sched_ext: Add boilerplate for extensible scheduler class 2024-06-18 10:09:17 -10:00
isolation.c sched/isolation: Fix boot crash when maxcpus < first housekeeping CPU 2024-04-28 10:08:21 +02:00
loadavg.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
Makefile sched/headers: Introduce kernel/sched/build_policy.c and build multiple .c files there 2022-02-23 10:58:33 +01:00
membarrier.c RISC-V Patches for the 6.9 Merge Window 2024-03-22 10:41:13 -07:00
pelt.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
pelt.h sched/cpufreq: Rename arch_update_thermal_pressure() => arch_update_hw_pressure() 2024-04-24 12:08:01 +02:00
psi.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
rt.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
sched-pelt.h
sched.h sched_ext: Implement BPF extensible scheduler class 2024-06-18 10:09:17 -10:00
smp.h sched, smp: Trace smp callback causing an IPI 2023-03-24 11:01:29 +01:00
stats.c sched/debug: Increase SCHEDSTAT_VERSION to 16 2024-03-12 11:03:40 +01:00
stats.h sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
stop_task.c sched: Unify runtime accounting across classes 2023-11-15 09:57:48 +01:00
swait.c sched: add a few helpers to wake up tasks on the current cpu 2023-07-17 16:08:08 -07:00
syscalls.c sched_ext: Implement BPF extensible scheduler class 2024-06-18 10:09:17 -10:00
topology.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
wait_bit.c sched: Fix spelling in comments 2024-05-27 17:00:21 +02:00
wait.c sched: remove wait bookmarks 2023-10-18 14:34:18 -07:00