Commit Graph

80 Commits

Andrea Righi
431844b65f sched_ext: Provide a sysfs enable_seq counter
As discussed during the distro-centric session within the sched_ext
Microconference at LPC 2024, introduce a sequence counter that is
incremented every time a BPF scheduler is loaded.

This feature can help distributions in diagnosing potential performance
regressions by identifying systems where users are running (or have run)
custom BPF schedulers.

Example:

 arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
 0
 arighi@virtme-ng~> sudo scx_simple
 local=1 global=0
 ^CEXIT: unregistered from user space
 arighi@virtme-ng~> cat /sys/kernel/sched_ext/enable_seq
 1

This way, user-space tools (such as Ubuntu's apport and similar) can gather
and include this information in bug reports.
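
For illustration only, a minimal sketch of how such a tool might read the
counter (the sysfs path is the one introduced here; the program itself is
just an example):

  /* Illustrative reader for the enable_seq counter. */
  #include <stdio.h>

  int main(void)
  {
          unsigned long long seq = 0;
          FILE *f = fopen("/sys/kernel/sched_ext/enable_seq", "r");

          if (!f)
                  return 1;       /* sched_ext not available on this kernel */
          if (fscanf(f, "%llu", &seq) == 1 && seq > 0)
                  printf("BPF scheduler(s) loaded %llu time(s) since boot\n", seq);
          fclose(f);
          return 0;
  }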

Cc: Giovanni Gherdovich <giovanni.gherdovich@suse.com>
Cc: Kleber Sacilotto de Souza <kleber.souza@canonical.com>
Cc: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Cc: Phil Auld <pauld@redhat.com>
Signed-off-by: Andrea Righi <andrea.righi@linux.dev>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-23 06:53:02 -10:00
Tejun Heo
62d3726d4c sched_ext: Fix build when !CONFIG_STACKTRACE
a2f4b16e73 ("sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]") tried to
fix the build when !CONFIG_STACKTRACE but didn't do so fully. Also put
stack_trace_print() and stack_trace_save() inside CONFIG_STACKTRACE to fix the
build when !CONFIG_STACKTRACE.
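
The shape of the fix is the usual config guard around the stacktrace helpers;
a simplified sketch, not the exact diff (scx_dump_bt() and the buffer fields
are placeholders):

  /* Sketch of the guard pattern; names are placeholders. */
  static void scx_dump_bt(struct task_struct *p, struct dump_buf *buf)
  {
  #ifdef CONFIG_STACKTRACE
          buf->bt_len = stack_trace_save_tsk(p, buf->bt, ARRAY_SIZE(buf->bt), 1);
  #else
          buf->bt_len = 0;
  #endif
  }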

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409220642.fDW2OmWc-lkp@intel.com/
2024-09-23 06:45:22 -10:00
Tejun Heo
513ed0c7cc sched_ext: Don't trigger ops.quiescent/runnable() on migrations
A task moving across CPUs should not trigger quiescent/runnable task state
events as the task is staying runnable the whole time and just stopping and
then starting on different CPUs. Suppress quiescent/runnable task state
events if task_on_rq_migrating().
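
Conceptually the change amounts to a guard like the following sketch
(ops_dequeue_sketch() and the callback invocation are placeholders, not the
actual diff):

  /* Sketch: a dequeue caused by a migration is not a real quiescent
   * event - the task stays runnable, only its CPU changes. */
  static void ops_dequeue_sketch(struct task_struct *p, u64 deq_flags)
  {
          if (task_on_rq_migrating(p))
                  return;         /* suppress ops.quiescent() for migrations */

          /* otherwise report the quiescent transition as before */
  }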

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-10 10:45:20 -10:00
Tejun Heo
750a40d816 sched_ext: Synchronize bypass state changes with rq lock
While the BPF scheduler is being unloaded, the following warning messages
trigger sometimes:

 NOHZ tick-stop error: local softirq work is pending, handler #80!!!

This is caused by the CPU entering idle while there are pending softirqs.
The main culprit is the bypassing state assertion not being synchronized
with rq operations. As the BPF scheduler cannot be trusted in the disable
path, the first step is entering the bypass mode where the BPF scheduler is
ignored and scheduling becomes global FIFO.

This is implemented by making scx_ops_bypassing() return true. However, the
transition isn't synchronized against anything, so the enqueue and dispatch
paths can have different ideas of whether bypass mode is on.

Make each rq track its own bypass state with SCX_RQ_BYPASSING which is
modified while rq is locked.
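
Both the enqueue and dispatch paths then consult the per-rq flag while holding
the rq lock; roughly (the accessor below is a sketch, not necessarily the
final form):

  /* SCX_RQ_BYPASSING is only flipped while the rq is locked, so any
   * path holding the rq lock sees a stable bypass state. */
  static bool scx_rq_bypassing(struct rq *rq)
  {
          return unlikely(rq->scx.flags & SCX_RQ_BYPASSING);
  }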

This removes most of the NOHZ tick-stop messages, but not all of them. I
believe the stragglers are from the sched core bug where pick_task_scx() can
be called without preceding balance_scx(). Once that bug is fixed, we should
verify that all occurrences of this error message are gone too.

v2: scx_enabled() test moved inside the for_each_possible_cpu() loop so that
    the per-cpu states are always synchronized with the global state.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: David Vernet <void@manifault.com>
2024-09-10 10:43:32 -10:00
Tejun Heo
4c30f5ce4f sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.

Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which doesn't hold a rq lock, including BPF timers and SYSCALL programs.

This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:

  http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org

v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
    they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
    count limit and often won't be needed anyway. Instead provide
    scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
    only when needed and override the specified parameter for the subsequent
    dispatch.
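
For reference, a usage sketch from the BPF scheduler side, assuming the DSQ
iterator and kfunc declarations from the sched_ext tooling headers (MY_DSQ is
an example DSQ ID):

  /* Sketch: walk a user DSQ during ops.dispatch() and move one task to
   * the local DSQ, overriding its slice for this dispatch. */
  void BPF_STRUCT_OPS(example_dispatch, s32 cpu, struct task_struct *prev)
  {
          struct task_struct *p;

          bpf_for_each(scx_dsq, p, MY_DSQ, 0) {
                  scx_bpf_dispatch_from_dsq_set_slice(BPF_FOR_EACH_ITER,
                                                      SCX_SLICE_DFL);
                  if (scx_bpf_dispatch_from_dsq(BPF_FOR_EACH_ITER, p,
                                                SCX_DSQ_LOCAL, 0))
                          break;
          }
  }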

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
6462dd53a2 sched_ext: Compact struct bpf_iter_scx_dsq_kern
struct bpf_iter_scx_dsq is defined as 6 u64's and bpf_iter_scx_dsq_kern was
using 5 of them. We want to add two more u64 fields but it's better if we do
so while staying within bpf_iter_scx_dsq to maintain binary compatibility.

The way bpf_iter_scx_dsq_kern is laid out is rather inefficient - the node
field takes up three u64's but only one bit of the last u64 is used. Turn
the bool into u32 flags and only use the lower 16 bits, freeing up 48 bits -
16 bits for flags, 32 bits for a u32 - for use by struct
bpf_iter_scx_dsq_kern.

This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
into the cursor field reducing the struct size by a full u64.

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09 13:42:47 -10:00
Tejun Heo
cf3e94430d sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().

- Rename consume_local_task() to move_local_task_to_local_dsq() and remove
  task_unlink_from_dsq() and source DSQ unlocking from it.

This is to make the migration code easier to reuse.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
d434210e13 sched_ext: Move consume_local_task() upward
So that the local case comes first and two CONFIG_SMP blocks can be merged.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
6557133ecd sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
task_unlink_from_dsq(). Also move sanity check into it.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
1389f49098 sched_ext: Reorder args for consume_local/remote_task()
Reorder args for consistency in the order of:

  current_rq, p, src_[rq|dsq], dst_[rq|dsq].

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-09 13:42:47 -10:00
Tejun Heo
18f856991d sched_ext: Restructure dispatch_to_local_dsq()
Now that there's nothing left after the big if block, flip the if condition
and unindent the body.

No functional changes intended.

v2: Add BUG() to clarify control can't reach the end of
    dispatch_to_local_dsq() in UP kernels per David.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
0aab26309e sched_ext: Fix process_ddsp_deferred_locals() by unifying DTL_INVALID handling
With the preceding update, the only return value which makes a meaningful
difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
back to the global DSQ and the other, process_ddsp_deferred_locals(),
doesn't do anything.

It should always fall back to the global DSQ. Move the global DSQ fallback
into dispatch_to_local_dsq() and remove the return value.

v2: Patch title and description updated to reflect the behavior fix for
    process_ddsp_deferred_locals().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
e683949a4b sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
Instead, each caller is handling SCX_DSQ_LOCAL_ON before calling it. Move
SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
code in direct_dispatch() and dispatch_to_local_dsq().

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
4d3ca89bdd sched_ext: Refactor consume_remote_task()
The tricky p->scx.holding_cpu handling was split across
consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:

- All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with
  consolidated documentation.

- move_task_to_local_dsq() now implements straightforward task migration
  making it easier to use in other places.

- dispatch_to_local_dsq() is another user of move_task_to_local_dsq(). The
  usage is updated accordingly. This makes the local and remote cases more
  symmetric.

No functional changes intended.

v2: s/task_rq/src_rq/ for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
fdaedba2f9 sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
Sleepable kfuncs don't need to be in their own kfunc set as each is tagged
with KF_SLEEPABLE. Rename the set to scx_kfunc_set_unlocked, indicating that
the rq lock is not held, and relocate it right above the "any" set. This will
be used to add kfuncs that are allowed to be called from SYSCALL but not
TRACING.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-09-09 13:42:47 -10:00
Tejun Heo
3ac352797c sched_ext: Add missing static to scx_dump_data
scx_dump_data is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/
2024-09-09 13:34:33 -10:00
Tejun Heo
02e65e1c12 sched_ext: Add missing static to scx_has_op[]
scx_has_op[] is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409062337.m7qqI88I-lkp@intel.com/
2024-09-06 08:18:55 -10:00
Tejun Heo
da330f5e4c sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
pick_task_scx() must be preceded by balance_scx() but there currently is a
bug where fair could say yes on balance() but no on pick_task(), which then
ends up calling pick_task_scx() without preceding balance_scx(). Work around
by dropping WARN_ON_ONCE() and ignoring cases which don't make sense.

This isn't great and can theoretically lead to stalls. However, for
switch_all cases, this happens only while a BPF scheduler is being loaded or
unloaded, and, for partial cases, fair will likely keep triggering this CPU.

This will be reverted once the fair behavior is fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-09-06 08:17:09 -10:00
Tejun Heo
8195136669 sched_ext: Add cgroup support
Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. A BPF scheduler may implement only a subset of the cgroup
features, or none at all. The implemented features can be indicated using the
SCX_OPS_HAS_CGROUP_* flags. If the cgroup configuration makes use of features
that are not implemented, a warning is triggered.

While a BPF scheduler is being enabled and disabled, relevant cgroup
operations are locked out using scx_cgroup_rwsem. This avoids situations
like task prep taking place while the task is being moved across cgroups,
making things easier for BPF schedulers.
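
A minimal sketch of the BPF side, assuming the callback names shown in this
series and the declarations from the tooling headers (the bodies and the
"example" naming are illustrative):

  /* Sketch: opt into cgroup weight handling and receive cgroup events. */
  s32 BPF_STRUCT_OPS_SLEEPABLE(example_cgroup_init, struct cgroup *cgrp,
                               struct scx_cgroup_init_args *args)
  {
          /* e.g. set up per-cgroup state for this scheduler */
          return 0;
  }

  void BPF_STRUCT_OPS(example_cgroup_set_weight, struct cgroup *cgrp, u32 weight)
  {
          /* record the new weight for the scheduler's own accounting */
  }

  SEC(".struct_ops.link")
  struct sched_ext_ops example_ops = {
          .cgroup_init       = (void *)example_cgroup_init,
          .cgroup_set_weight = (void *)example_cgroup_set_weight,
          .flags             = SCX_OPS_HAS_CGROUP_WEIGHT,
          .name              = "example",
  };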

v7: - cgroup interface file visibility toggling is dropped in favor of just
      warning messages. Dynamically changing interface visibility caused more
      confusion than it helped.

v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.

    - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for
      !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.

v5: - Flipped the locking order between scx_cgroup_rwsem and
      cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
      documentation around locking.

    - sched_move_task() takes an early exit if the source and destination
      are identical. This triggered the warning in scx_cgroup_can_attach()
      as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
      migration path so that ops.cgroup_prep_move() is skipped for identity
      migrations so that its invocations always match ops.cgroup_move()
      one-to-one.

v4: - Example schedulers moved into their own patches.

    - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.

v3: - Make scx_example_pair switch all tasks by default.

    - Convert to BPF inline iterators.

    - scx_bpf_task_cgroup() is added to determine the current cgroup from
      CPU controller's POV. This allows BPF schedulers to accurately track
      CPU cgroup membership.

    - scx_example_flatcg added. This demonstrates flattened hierarchy
      implementation of CPU cgroup control and shows significant performance
      improvement when cgroups which are nested multiple levels are under
      competition.

v2: - Build fixes for different CONFIG combinations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
2024-09-04 10:24:59 -10:00
Tejun Heo
a8532fac7b sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable
During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
on every task. To do this, it does get_task_struct() on each iterated task,
drops the lock and then calls ops.init_task().

However, a TASK_DEAD task may already have lost all its usage count and be
waiting for RCU grace period to be freed. If get_task_struct() is called on
such task, use-after-free can happen. To avoid such situations,
scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
as they are never going to be scheduled again.

Unfortunately, a racing sched_setscheduler(2) can grab the task before the
task is unhashed and then continue to e.g. move the task from RT to SCX
after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
gone through scx_ops_init_task(), scx_ops_enable_task() called from
switching_to_scx() triggers the following warning:

  sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
  WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
  ...
  RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
  ...
   switching_to_scx+0x13/0xa0
   __sched_setscheduler+0x84e/0xa50
   do_sched_setscheduler+0x104/0x1c0
   __x64_sys_sched_setscheduler+0x18/0x30
   do_syscall_64+0x7b/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

As in the ops_disable path, it just doesn't seem like a good idea to leave
any task in an inconsistent state, even when the task is dead. The root
cause is ops_enable not being able to tell reliably whether a task is truly
dead (no one else is looking at it and it's about to be freed) and was
testing TASK_DEAD instead. Fix it by testing the task's usage count
directly.

- ops_init no longer ignores TASK_DEAD tasks. As all users now iterate all
  tasks, @include_dead is removed from scx_task_iter_next_locked() along
  with dead task filtering.

- tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct()
  fails.
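
A sketch of such a helper, which grabs a reference only while the usage count
is still non-zero (the exact definition is an assumption):

  /* Return the task with a reference held, or NULL if the task has
   * already dropped its last reference and is about to be freed. */
  static inline struct task_struct *tryget_task_struct(struct task_struct *t)
  {
          return refcount_inc_not_zero(&t->usage) ? t : NULL;
  }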

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-09-04 10:23:32 -10:00
Tejun Heo
61eeb9a905 sched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disable
scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while
calling scx_ops_exit_task() on all tasks including dead ones. This can leave
a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent.

If another task was in the process of changing the TASK_DEAD task's
scheduling class and grabs the rq lock after scx_ops_disable_workfn() is
done with the task, the task ends up calling scx_ops_disable_task() on the
dead task which is in an inconsistent state triggering a warning:

  WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160
  ...
  RIP: 0010:scx_ops_disable_task+0x12c/0x160
  ...
  Call Trace:
   <TASK>
   check_class_changed+0x2c/0x70
   __sched_setscheduler+0x8a0/0xa50
   do_sched_setscheduler+0x104/0x1c0
   __x64_sys_sched_setscheduler+0x18/0x30
   do_syscall_64+0x7b/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f140d70ea5b

There is no reason to leave dead tasks on SCX when unloading the BPF
scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including
the dead ones from SCX.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-04 10:22:55 -10:00
Tejun Heo
f422316d74 sched_ext: Remove switch_class_scx()
Now that put_prev_task_scx() is called with @next on task switches, there's
no reason to use sched_class.switch_class(). Rename switch_class_scx() to
switch_class() and call it from put_prev_task_scx().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 21:54:29 -10:00
Tejun Heo
65aaf90569 sched_ext: Relocate functions in kernel/sched/ext.c
Relocate functions to ease the removal of switch_class_scx(). No functional
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 21:54:29 -10:00
Tejun Heo
753e2836d1 sched_ext: Unify regular and core-sched pick task paths
Because the BPF scheduler's dispatch path is invoked from balance(),
sched_ext needs to invoke balance_one() on all sibling rq's before picking
the next task for core-sched.

Before the recent pick_next_task() updates, sched_ext couldn't share pick
task between regular and core-sched paths because pick_next_task() depended
on put_prev_task() being called on the current task. Tasks currently running
on sibling rq's can't be put when one rq is trying to pick the next task, so
pick_task_scx() had to have a separate mechanism to pick between a sibling
rq's current task and the first task in its local DSQ.

However, with the preceding updates, pick_next_task_scx() no longer depends
on the current task being put and can compare the current task and the next
in line statelessly, and the pick task logic should be shareable between
regular and core-sched paths.

Unify regular and core-sched pick task paths:

- There's no reason to distinguish local and sibling picks anymore. @local
  is removed from balance_one().

- pick_next_task_scx() is turned into pick_task_scx() by dropping the
  put_prev_set_next_task() call.

- The old pick_task_scx() is dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 21:54:29 -10:00
Tejun Heo
8b1451f2f7 sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP
SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to
keep running the current task. It's not really a task property. Replace it
with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for
the usage. Also, the existing clearing rule is unnecessarily strict and
makes it difficult to use with core-sched. Just clear it on entry to
balance_one().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 21:54:28 -10:00
Tejun Heo
7c65ae81ea sched_ext: Don't call put_prev_task_scx() before picking the next task
fd03c5b858 ("sched: Rework pick_next_task()") changed the definition of
pick_next_task() from:

  pick_next_task() := pick_task() + set_next_task(.first = true)

to:

  pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

making invoking put_prev_task() pick_next_task()'s responsibility. This
reordering allows pick_task() to be shared between regular and core-sched
paths and put_prev_task() to know the next task.

sched_ext depended on put_prev_task_scx() enqueueing the current task before
pick_next_task_scx() is called. While pulling sched/core changes,
70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an
explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx()
before picking the first task as a workaround.

Clean it up and adopt the conventions that other sched classes are
following.

The logic for keeping the current task running was spread out and required
the task to be put on the local DSQ before picking:

  - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still
    runnable, hasn't exhausted its slice, and thus should keep running.

  - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP
    is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the
    only runnable task. do_enqueue_task() in turn decided whether to use the
    local DSQ depending on SCX_OPS_ENQ_LAST.

Consolidate the logic in balance_one() as it always knows whether it is
going to keep the current task. balance_one() now considers all conditions
where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell
pick_next_task_scx() to keep the current task instead of picking one from
the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from
put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is
updated to pick the current task if SCX_TASK_BAL_KEEP is set.

The workaround put_prev_task[_scx]() calls are replaced with
put_prev_set_next_task().

This causes two behavior changes observable from the BPF scheduler:

- When a task keeps running, it no longer goes through an enqueue/dequeue
  cycle and thus ops.stopping/running() transitions. The new behavior is
  better and all the existing schedulers should be able to handle it.

- The BPF scheduler cannot keep executing the current task by enqueueing
  SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
  BPF scheduler is responsible for resuming execution after each
  SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling
  decisions are not made on the local CPU - e.g. central or userspace-driven
  scheduling - and the new behavior is more logical and shouldn't pose any
  problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it
  doesn't fit that well anymore and the last task handling is moved to the
  end of qmap_dispatch().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-09-03 21:54:28 -10:00
Tejun Heo
d7b01aef9d Merge branch 'tip/sched/core' into for-6.12
- Resolve trivial context conflicts from dl_server clearing being moved
  around.

- Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to
  match sched/core.

- Merge sched_class->switch_class() addition from sched_ext with
  tip/sched/core changes in __pick_next_task().

- Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous
  behavior where sched_class->put_prev_task() was called before
  sched_class->pick_next_task().

While this makes sched_ext build and function, the behavior is not in line
with other sched classes. The follow-up patches will address the
discrepancies and remove sched_class->switch_class().

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-09-03 12:49:18 -10:00
Tejun Heo
62607d033b sched_ext: Use sched_clock_cpu() instead of rq_clock_task() in touch_core_sched()
Since 3cf78c5d01 ("sched_ext: Unpin and repin rq lock from
balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost
function. This is simpler and in line with what other balance functions are
doing but it loses control over rq->clock_update_flags which makes
assert_clock_updated() trigger if another CPU pins the rq lock.

The only place this matters is touch_core_sched() which uses the timestamp
to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may
be better to use per-core dispatch sequence number.

v2: Use sched_clock_cpu() instead of ktime_get_ns() per David.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3cf78c5d01 ("sched_ext: Unpin and repin rq lock from balance_scx()")
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
2024-08-30 19:35:19 -10:00
Tejun Heo
0366017e09 sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()
When deciding whether a task can be migrated to a CPU,
dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online()
tests instead of using task_can_run_on_remote_rq(). This had two problems.

- It was missing the is_migration_disabled() check and thus could try to
  migrate a task which shouldn't be migrated, leading to assertion and
  scheduling failures.

- It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu()
  and thus failed to consider ISA compatibility.

Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq():

- Move scx_ops_error() triggering into task_can_run_on_remote_rq().

- When migration isn't allowed, fall back to the global DSQ instead of the
  source DSQ by returning DTL_INVALID. This is both simpler and an overall
  better behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-30 19:34:46 -10:00
Tejun Heo
bf934bed5e sched_ext: Add missing cfi stub for ops.tick
The cfi stub for ops.tick was missing which will fail scheduler loading
after pending BPF changes. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-27 14:19:03 -10:00
Yipeng Zou
9ad2861b77 sched_ext: Allow dequeue_task_scx to fail
Since dequeue_task() is allowed to fail, there is a compile error:

kernel/sched/ext.c:3630:19: error: initialization of ‘bool (*)(struct rq*, struct task_struct *, int)’ {aka ‘_Bool (*)(struct rq *, struct task_struct *, int)’} from incompatible pointer type ‘void (*)(struct rq*, struct task_struct *, int)’
  3630 |  .dequeue_task  = dequeue_task_scx,
       |                   ^~~~~~~~~~~~~~~~

Allow dequeue_task_scx to fail too.

Fixes: 863ccdbb91 ("sched: Allow sched_class::dequeue_task() to fail")
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-20 09:09:01 -10:00
Tejun Heo
89909296a5 sched_ext: Don't use double locking to migrate tasks across CPUs
consume_remote_task() and dispatch_to_local_dsq() use
move_task_to_local_dsq() to migrate the task to the target CPU. Currently,
move_task_to_local_dsq() expects the caller to lock both the source and
destination rq's. While this may save a few lock operations while the rq's
are not contended, under contention, the double locking can exacerbate the
situation significantly (refer to the linked message below).

Update the migration path so that double locking is not used.
move_task_to_local_dsq() now expects the caller to be locking the source rq,
drops it and then acquires the destination rq lock. Code is simpler this way
and, on a 2-way NUMA machine w/ Xeon Gold 6138, `hackbench 100 thread 5000`
shows ~3% improvement with scx_simple.
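
The resulting lock dance, in a rough sketch (function name and flags are
simplified placeholders, not the actual implementation):

  /* Migrate with only one rq lock held at a time. The caller enters
   * with src_rq locked; dst_rq is locked only after src_rq is dropped. */
  static void migrate_sketch(struct task_struct *p, struct rq *src_rq,
                             struct rq *dst_rq)
  {
          deactivate_task(src_rq, p, 0);
          set_task_cpu(p, cpu_of(dst_rq));
          raw_spin_rq_unlock(src_rq);

          raw_spin_rq_lock(dst_rq);
          activate_task(dst_rq, p, 0);
  }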

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20240806082716.GP37996@noisy.programming.kicks-ass.net
Acked-by: David Vernet <void@manifault.com>
2024-08-13 09:08:50 -10:00
Manu Bretelle
33d031ec12 sched_ext: define missing cfi stubs for sched_ext
`__bpf_ops_sched_ext_ops` was missing the initialization of some struct
attributes. With

  https://lore.kernel.org/all/20240722183049.2254692-4-martin.lau@linux.dev/

every single attribute needs to be initialized, otherwise programs (like
scx_layered) will fail to load.

  05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_init not found in kernel, skipping it as it's set to zero
  05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_exit not found in kernel, skipping it as it's set to zero
  05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_prep_move not found in kernel, skipping it as it's set to zero
  05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_move not found in kernel, skipping it as it's set to zero
  05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_cancel_move not found in kernel, skipping it as it's set to zero
  05:26:48 [INFO] libbpf: struct_ops layered: member cgroup_set_weight not found in kernel, skipping it as it's set to zero
  05:26:48 [WARN] libbpf: prog 'layered_dump': BPF program load failed: unknown error (-524)
  05:26:48 [WARN] libbpf: prog 'layered_dump': -- BEGIN PROG LOAD LOG --
  attach to unsupported member dump of struct sched_ext_ops
  processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
  -- END PROG LOAD LOG --
  05:26:48 [WARN] libbpf: prog 'layered_dump': failed to load: -524
  05:26:48 [WARN] libbpf: failed to load object 'bpf_bpf'
  05:26:48 [WARN] libbpf: failed to load BPF skeleton 'bpf_bpf': -524
  Error: Failed to load BPF program

Signed-off-by: Manu Bretelle <chantr4@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-13 09:03:26 -10:00
Tejun Heo
344576fa6a sched_ext: Improve logging around enable/disable
sched_ext currently doesn't generate messages when the BPF scheduler is
enabled and disabled unless there are errors. It is useful to have a paper
trail. Improve logging around enable/disable:

- Generate info messages on enable and non-error disable.

- Update error exit message formatting so that it's consistent with
  non-error message. Also, prefix ei->msg with the BPF scheduler's name to
  make it clear where the message is coming from.

- Shorten scx_exit_reason() strings for SCX_EXIT_UNREG* for brevity and
  consistency.

v2: Use pr_*() instead of KERN_* consistently. (David)

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Phil Auld <pauld@redhat.com>
Acked-by: David Vernet <void@manifault.com>
2024-08-08 13:42:37 -10:00
Tejun Heo
991ef53a48 sched_ext: Make scx_rq_online() also test cpu_active() in addition to SCX_RQ_ONLINE
scx_rq_online() currently only tests SCX_RQ_ONLINE. This isn't fully correct
- e.g. consume_dispatch_q() uses task_can_run_on_remote_rq() which tests
scx_rq_online() to see whether the current rq can run the task, and, if so,
calls consume_remote_task() to migrate the task to @rq. While the test
itself was done while locking @rq, @rq can be temporarily unlocked by
consume_remote_task() and nothing prevents SCX_RQ_ONLINE from going offline
before the migration takes place.

To address the issue, add cpu_active() test to scx_rq_online(). There is a
synchronize_rcu() between cpu_active() being cleared and the rq going
offline, so if an on-going scheduling operation sees cpu_active(), the
associated rq is guaranteed to not go offline until the scheduling operation
is complete.
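
With the change, the test is roughly the following conjunction (a sketch,
assumed to approximate the final form):

  /* An rq is a valid target only while marked online by sched_ext and
   * while its CPU is still active. */
  static bool scx_rq_online(struct rq *rq)
  {
          return likely((rq->scx.flags & SCX_RQ_ONLINE) && cpu_active(cpu_of(rq)));
  }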

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 60c27fb59f ("sched_ext: Implement sched_ext_ops.cpu_online/offline()")
Acked-by: David Vernet <void@manifault.com>
2024-08-08 13:38:19 -10:00
Tejun Heo
72763ea3d4 sched_ext: Fix unsafe list iteration in process_ddsp_deferred_locals()
process_ddsp_deferred_locals() executes deferred direct dispatches to the
local DSQs of remote CPUs. It iterates the tasks on
rq->scx.ddsp_deferred_locals list, removing and calling
dispatch_to_local_dsq() on each. However, the list is protected by the rq
lock that can be dropped by dispatch_to_local_dsq() temporarily, so the list
can be modified during the iteration, which can lead to oopses and other
failures.

Fix it by popping from the head of the list instead of iterating the list.
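
The pop-from-head pattern, sketched generically (the list and element types
below are placeholders, not the actual sched_ext fields):

  /* Popping from the head is safe even if the lock is dropped while
   * processing, because no iterator is kept across the drop. */
  struct deferred_item {
          struct list_head node;
  };

  static void drain(struct list_head *head, raw_spinlock_t *lock)
  {
          struct deferred_item *it;

          raw_spin_lock(lock);
          while ((it = list_first_entry_or_null(head, struct deferred_item,
                                                node))) {
                  list_del_init(&it->node);
                  /* processing may drop and re-acquire *lock; the list can
                   * change underneath, but the next pop is still valid */
          }
          raw_spin_unlock(lock);
  }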

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 5b26f7b920 ("sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches")
Acked-by: David Vernet <void@manifault.com>
2024-08-08 13:38:09 -10:00
Tejun Heo
2c390dda9e sched_ext: Make task_can_run_on_remote_rq() use common task_allowed_on_cpu()
task_can_run_on_remote_rq() is similar to is_cpu_allowed() but there are
subtle differences. It currently open codes all the tests. This is
cumbersome to understand and error-prone in case the intersecting tests need
to be updated.

Factor out the common part - testing whether the task is allowed on the CPU
at all regardless of the CPU state - into task_allowed_on_cpu() and make
both is_cpu_allowed() and SCX's task_can_run_on_remote_rq() use it. As the
code is now linked between the two and each contains only the extra tests
that differ between them, it's less error-prone when the conditions need to
be updated. Also, improve the comment to explain why they are different.

v2: Replace accidental "extern inline" with "static inline" (Peter).

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:11 -10:00
Tejun Heo
9390a923e1 sched_ext: Improve comment on idle_sched_class exception in scx_task_iter_next_locked()
scx_task_iter_next_locked() skips tasks whose sched_class is
idle_sched_class. While it has a short comment explaining why it's testing
the sched_class directly instead of using is_idle_task(), the comment
doesn't sufficiently explain what's going on and why. Improve the comment.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:11 -10:00
Tejun Heo
a735d43c7f sched_ext: Simplify UP support by enabling sched_class->balance() in UP
On SMP, SCX performs dispatch from sched_class->balance(). As balance() was
not available in UP, it instead called the internal balance function from
put_prev_task_scx() and pick_next_task_scx() to emulate the effect, which is
rather nasty.

Enabling sched_class->balance() on UP shouldn't cause any meaningful
overhead. Enable balance() on UP and drop the ugly workaround.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:11 -10:00
Tejun Heo
7799140b6a sched_ext: Use update_curr_common() in update_curr_scx()
update_curr_scx() is open coding runtime updates. Use update_curr_common()
instead and avoid unnecessary deviations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
2024-08-06 09:40:11 -10:00
Tejun Heo
e99129e5db sched_ext: Allow p->scx.disallow only while loading
p->scx.disallow provides a way for the BPF scheduler to reject certain tasks
from attaching. It's currently allowed for both the load and fork paths;
however, the latter doesn't actually work as p->sched_class is already set
by the time scx_ops_init_task() is called during fork.

This is a convenience feature which is mostly useful from the load path
anyway. Allow it only from the load path.
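
For example, a scheduler that wants to keep certain tasks off SCX would set
the flag from ops.init_task() on the load path, along these lines (signature
assumed from the example schedulers; disallowed_tgid is illustrative):

  /* Sketch: reject one process at scheduler load time. args->fork
   * distinguishes the fork path, where disallow is no longer honored. */
  const volatile s32 disallowed_tgid;

  s32 BPF_STRUCT_OPS(example_init_task, struct task_struct *p,
                     struct scx_init_task_args *args)
  {
          if (!args->fork && p->tgid == disallowed_tgid)
                  p->scx.disallow = true;
          return 0;
  }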

v2: Trigger scx_ops_error() iff @p->policy == SCHED_EXT to make it a bit
    easier for the BPF scheduler (David).

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: "Zhangqiao (2012 lab)" <zhangqiao22@huawei.com>
Link: http://lkml.kernel.org/r/20240711110720.1285-1-zhangqiao22@huawei.com
Fixes: 7bb6f0810e ("sched_ext: Allow BPF schedulers to disallow specific tasks from joining SCHED_EXT")
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-02 08:59:32 -10:00
Tejun Heo
a2f4b16e73 sched_ext: Build fix on !CONFIG_STACKTRACE[_SUPPORT]
scx_dump_task() uses stack_trace_save_tsk() which is only available when
CONFIG_STACKTRACE. Make CONFIG_SCHED_CLASS_EXT select CONFIG_STACKTRACE if
the support is available and skip capturing stack trace if
!CONFIG_STACKTRACE.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202407161844.reewQQrR-lkp@intel.com/
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-08-01 07:08:01 -10:00
David Vernet
298dec19bd scx: Allow calling sleepable kfuncs from BPF_PROG_TYPE_SYSCALL
We currently only allow calling sleepable scx kfuncs (i.e.
scx_bpf_create_dsq()) from BPF_PROG_TYPE_STRUCT_OPS progs. The idea here
was that we'd never have to call scx_bpf_create_dsq() outside of a
sched_ext struct_ops callback, but that might not actually be true. For
example, a scheduler could do something like the following:

1. Open and load (not yet attach) a scheduler skel

2. Synchronously call into a BPF_PROG_TYPE_SYSCALL prog from user space.
   For example, to initialize an LLC domain, or some other global,
   read-only state.

3. Attach the skel, which actually enables the scheduler

The advantage of doing this is that it can preclude having to do pretty
ugly boilerplate like initializing a read-only, statically sized array of
u64[]'s which the kernel consumes literally once at init time to then
create struct bpf_cpumask objects which are actually queried at runtime.

Doing the above is already possible given that we can invoke core BPF
kfuncs, such as bpf_cpumask_create(), from BPF_PROG_TYPE_SYSCALL progs. We
already allow many scx kfuncs to be called from BPF_PROG_TYPE_SYSCALL progs
(e.g. scx_bpf_kick_cpu()). Let's allow the sleepable kfuncs as well.
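
Step 2 above might look like the following sketch, assuming the scx kfunc
declarations from the tooling headers (MY_DSQ_ID is illustrative); user space
would run the prog via BPF_PROG_TEST_RUN after load and before attaching the
struct_ops map:

  /* Sketch: create a DSQ from a BPF_PROG_TYPE_SYSCALL prog before the
   * scheduler is attached; -1 means any NUMA node. */
  enum { MY_DSQ_ID = 1000 };

  SEC("syscall")
  int setup_dsqs(void *ctx)
  {
          return scx_bpf_create_dsq(MY_DSQ_ID, -1);
  }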

Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-31 07:45:28 -10:00
Jiapeng Chong
8bb30798fd sched_ext: Fixes incorrect type in bpf_scx_init()
type_id is defined as u32, so the check if (type_id < 0) can never be true.
Change its type to s32.

./kernel/sched/ext.c:4958:5-12: WARNING: Unsigned expression compared with zero: type_id < 0.
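
The pitfall in isolation (lookup_type_id() is a hypothetical stand-in for a
lookup that can return a negative errno):

  /* With u32, the check below is always false and a negative return is
   * silently treated as a huge valid ID. */
  u32 type_id = lookup_type_id();         /* hypothetical helper */

  if (type_id < 0)                        /* never true; must be s32 */
          return -EINVAL;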

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=9523
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-07-14 18:10:10 -10:00
Tejun Heo
5b26f7b920 sched_ext: Allow SCX_DSQ_LOCAL_ON for direct dispatches
In ops.dispatch(), SCX_DSQ_LOCAL_ON can be used to dispatch the task to the
local DSQ of any CPU. However, during direct dispatch from ops.select_cpu()
and ops.enqueue(), this isn't allowed. This is because dispatching to the
local DSQ of a remote CPU requires locking both the task's current and new
rq's and such double locking can't be done directly from ops.enqueue().

While waking up a task, as ops.select_cpu() can pick any CPU and both
ops.select_cpu() and ops.enqueue() can use SCX_DSQ_LOCAL as the dispatch
target to dispatch to the DSQ of the picked CPU, the BPF scheduler can still
do whatever it wants to do. However, while a task is being enqueued for a
different reason, e.g. after its slice expiration, only ops.enqueue() is
called and there's no way for the BPF scheduler to directly dispatch to the
local DSQ of a remote CPU. This gap in the API forces schedulers into
workarounds which are neither straightforward nor optimal, such as skipping
direct dispatches in such cases.

Implement deferred enqueueing to allow directly dispatching to the local DSQ
of a remote CPU from ops.select_cpu() and ops.enqueue(). Such tasks are
temporarily queued on rq->scx.ddsp_deferred_locals. When the rq lock can be
safely released, the tasks are taken off the list and queued on the target
local DSQs using dispatch_to_local_dsq().
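
From the BPF scheduler's side, a direct dispatch to a remote CPU's local DSQ
then looks roughly like this (kfunc declarations assumed from the tooling
headers; the target CPU choice is illustrative):

  /* Sketch: in ops.enqueue(), send the task straight to the local DSQ
   * of a CPU picked by the scheduler's own policy. */
  void BPF_STRUCT_OPS(example_enqueue, struct task_struct *p, u64 enq_flags)
  {
          s32 target_cpu = 0;     /* illustrative; normally policy-driven */

          scx_bpf_dispatch(p, SCX_DSQ_LOCAL_ON | target_cpu,
                           SCX_SLICE_DFL, enq_flags);
  }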

v2: - Add missing return after queue_balance_callback() in
      schedule_deferred(). (David).

    - dispatch_to_local_dsq() now assumes that @rq is locked but unpinned
      and thus no longer takes @rf. Updated accordingly.

    - UP build warning fix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Tested-by: Andrea Righi <righi.andrea@gmail.com>
Acked-by: David Vernet <void@manifault.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
Cc: Changwoo Min <changwoo@igalia.com>
2024-07-12 08:20:33 -10:00
Tejun Heo
f47a818950 sched_ext: s/SCX_RQ_BALANCING/SCX_RQ_IN_BALANCE/ and add SCX_RQ_IN_WAKEUP
SCX_RQ_BALANCING is used to mark that the rq is currently in balance().
Rename it to SCX_RQ_IN_BALANCE and add SCX_RQ_IN_WAKEUP which marks whether
the rq is currently enqueueing for a wakeup. This will be used to implement
direct dispatching to local DSQ of another CPU.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-07-12 08:20:33 -10:00
Tejun Heo
3cf78c5d01 sched_ext: Unpin and repin rq lock from balance_scx()
sched_ext often needs to migrate tasks across CPUs right before execution
and thus uses the balance path to dispatch tasks from the BPF scheduler.
balance_scx() is called with rq locked and pinned but is passed @rf and thus
allowed to unpin and unlock. Currently, @rf is passed down the call stack so
the rq lock is unpinned just when double locking is needed.

This creates unnecessary complications such as having to explicitly
manipulate lock pinning for core scheduling. We also want to use
dispatch_to_local_dsq_lock() from other paths which are called with rq
locked but unpinned.

rq lock handling in the dispatch path is straightforward outside the
migration implementation and extending the pinning protection down the
callstack doesn't add enough meaningful extra protection to justify the
extra complexity.

Unpin and repin rq lock from the outer balance_scx() and drop @rf passing
and lock pinning handling from the inner functions. UP is updated to call
balance_one() instead of balance_scx() to avoid adding NULL @rf handling to
balance_scx(). As this makes balance_scx() unused in UP, it's put inside a
CONFIG_SMP block.

No functional changes intended outside of lock annotation updates.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Righi <righi.andrea@gmail.com>
2024-07-12 08:20:32 -10:00
Tejun Heo
d6a05910d2 sched_ext: Open-code task_linked_on_dsq()
task_linked_on_dsq() exists as a helper because it used to test both the
rbtree and list nodes. It now only tests the list node and the list node
will soon be used for something else too. The helper doesn't improve
anything materially and the naming will become confusing. Open-code the list
node testing and remove task_linked_on_dsq().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-07-12 08:20:32 -10:00
Tejun Heo
e7a6395a88 sched_ext: Make scx_bpf_reenqueue_local() skip tasks that are being migrated
When a running task is migrated to another CPU, the stop_task is used to
preempt the running task and migrate it. This, expectedly, invokes
ops.cpu_release(). If the BPF scheduler then calls
scx_bpf_reenqueue_local(), it re-enqueues all tasks on the local DSQ
including the task which is being migrated.

This creates an unnecessary re-enqueue of a task which is about to be
deactivated and re-activated for migration anyway. It can also cause
confusion for the BPF scheduler as scx_bpf_task_cpu() of the task and its
allowed CPUs may not agree while migration is pending.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 245254f708 ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
Acked-by: David Vernet <void@manifault.com>
2024-07-09 12:30:26 -10:00
Tejun Heo
fd0cf51695 sched_ext: Reimplement scx_bpf_reenqueue_local()
scx_bpf_reenqueue_local() is used to re-enqueue tasks on the local DSQ from
ops.cpu_release(). Because the BPF scheduler may dispatch tasks to the same
local DSQ, to avoid processing the same tasks repeatedly, it first takes the
number of queued tasks and processes the task at the head of the queue that
number of times.
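
The typical call site, sketched from the BPF side (callback signature assumed
from the tooling headers):

  /* Sketch: when the CPU is taken by a higher priority sched class,
   * push the tasks queued on its local DSQ back into the scheduler. */
  void BPF_STRUCT_OPS(example_cpu_release, s32 cpu,
                      struct scx_cpu_release_args *args)
  {
          scx_bpf_reenqueue_local();
  }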

This is incorrect as a task can be dispatched to the same local DSQ with
SCX_ENQ_HEAD. Such a task will be processed repeatedly until the count is
exhausted and the succeeding tasks won't be processed at all.

Fix it by first moving all candidate tasks to a private list and then
processing that list. While at it, remove the WARNs. They're rather
superfluous as later steps will check them anyway.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 245254f708 ("sched_ext: Implement sched_ext_ops.cpu_acquire/release()")
Acked-by: David Vernet <void@manifault.com>
2024-07-09 12:30:26 -10:00