docs: scheduler: convert docs to ReST and rename to *.rst

In order to prepare to add them to the Kernel API book,
convert the files to ReST format.

The conversion is actually:
  - add blank lines and identation in order to identify paragraphs;
  - fix tables markups;
  - add some lists markups;
  - mark literal blocks;
  - adjust title markups.

At its new index.rst, let's add a :orphan: while this is not linked to
the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
This commit is contained in:
Mauro Carvalho Chehab 2019-06-12 14:53:03 -03:00 committed by Jonathan Corbet
parent d223884089
commit d6a3b24762
16 changed files with 324 additions and 226 deletions

View File

@ -11,4 +11,4 @@ Description:
example would be, if User A has shares = 1024 and user
B has shares = 2048, User B will get twice the CPU
bandwidth user A will. For more details refer
Documentation/scheduler/sched-design-CFS.txt
Documentation/scheduler/sched-design-CFS.rst

View File

@ -1,3 +1,4 @@
================================================
Completions - "wait for completion" barrier APIs
================================================
@ -46,7 +47,7 @@ it has to wait for it.
To use completions you need to #include <linux/completion.h> and
create a static or dynamic variable of type 'struct completion',
which has only two fields:
which has only two fields::
struct completion {
unsigned int done;
@ -57,7 +58,7 @@ This provides the ->wait waitqueue to place tasks on for waiting (if any), and
the ->done completion flag for indicating whether it's completed or not.
Completions should be named to refer to the event that is being synchronized on.
A good example is:
A good example is::
wait_for_completion(&early_console_added);
@ -81,7 +82,7 @@ have taken place, even if these wait functions return prematurely due to a timeo
or a signal triggering.
Initializing of dynamically allocated completion objects is done via a call to
init_completion():
init_completion()::
init_completion(&dynamic_object->done);
@ -100,7 +101,8 @@ but be aware of other races.
For static declaration and initialization, macros are available.
For static (or global) declarations in file scope you can use DECLARE_COMPLETION():
For static (or global) declarations in file scope you can use
DECLARE_COMPLETION()::
static DECLARE_COMPLETION(setup_done);
DECLARE_COMPLETION(setup_done);
@ -111,7 +113,7 @@ initialized to 'not done' and doesn't require an init_completion() call.
When a completion is declared as a local variable within a function,
then the initialization should always use DECLARE_COMPLETION_ONSTACK()
explicitly, not just to make lockdep happy, but also to make it clear
that limited scope had been considered and is intentional:
that limited scope had been considered and is intentional::
DECLARE_COMPLETION_ONSTACK(setup_done)
@ -140,11 +142,11 @@ Waiting for completions:
------------------------
For a thread to wait for some concurrent activity to finish, it
calls wait_for_completion() on the initialized completion structure:
calls wait_for_completion() on the initialized completion structure::
void wait_for_completion(struct completion *done)
A typical usage scenario is:
A typical usage scenario is::
CPU#1 CPU#2
@ -192,17 +194,17 @@ A common problem that occurs is to have unclean assignment of return types,
so take care to assign return-values to variables of the proper type.
Checking for the specific meaning of return values also has been found
to be quite inaccurate, e.g. constructs like:
to be quite inaccurate, e.g. constructs like::
if (!wait_for_completion_interruptible_timeout(...))
... would execute the same code path for successful completion and for the
interrupted case - which is probably not what you want.
interrupted case - which is probably not what you want::
int wait_for_completion_interruptible(struct completion *done)
This function marks the task TASK_INTERRUPTIBLE while it is waiting.
If a signal was received while waiting it will return -ERESTARTSYS; 0 otherwise.
If a signal was received while waiting it will return -ERESTARTSYS; 0 otherwise::
unsigned long wait_for_completion_timeout(struct completion *done, unsigned long timeout)
@ -214,7 +216,7 @@ Timeouts are preferably calculated with msecs_to_jiffies() or usecs_to_jiffies()
to make the code largely HZ-invariant.
If the returned timeout value is deliberately ignored a comment should probably explain
why (e.g. see drivers/mfd/wm8350-core.c wm8350_read_auxadc()).
why (e.g. see drivers/mfd/wm8350-core.c wm8350_read_auxadc())::
long wait_for_completion_interruptible_timeout(struct completion *done, unsigned long timeout)
@ -225,14 +227,14 @@ jiffies if completion occurred.
Further variants include _killable which uses TASK_KILLABLE as the
designated tasks state and will return -ERESTARTSYS if it is interrupted,
or 0 if completion was achieved. There is a _timeout variant as well:
or 0 if completion was achieved. There is a _timeout variant as well::
long wait_for_completion_killable(struct completion *done)
long wait_for_completion_killable_timeout(struct completion *done, unsigned long timeout)
The _io variants wait_for_completion_io() behave the same as the non-_io
variants, except for accounting waiting time as 'waiting on IO', which has
an impact on how the task is accounted in scheduling/IO stats:
an impact on how the task is accounted in scheduling/IO stats::
void wait_for_completion_io(struct completion *done)
unsigned long wait_for_completion_io_timeout(struct completion *done, unsigned long timeout)
@ -243,11 +245,11 @@ Signaling completions:
A thread that wants to signal that the conditions for continuation have been
achieved calls complete() to signal exactly one of the waiters that it can
continue:
continue::
void complete(struct completion *done)
... or calls complete_all() to signal all current and future waiters:
... or calls complete_all() to signal all current and future waiters::
void complete_all(struct completion *done)
@ -268,7 +270,7 @@ probably are a design bug.
Signaling completion from IRQ context is fine as it will appropriately
lock with spin_lock_irqsave()/spin_unlock_irqrestore() and it will never
sleep.
sleep.
try_wait_for_completion()/completion_done():
@ -276,14 +278,14 @@ try_wait_for_completion()/completion_done():
The try_wait_for_completion() function will not put the thread on the wait
queue but rather returns false if it would need to enqueue (block) the thread,
else it consumes one posted completion and returns true.
else it consumes one posted completion and returns true::
bool try_wait_for_completion(struct completion *done)
Finally, to check the state of a completion without changing it in any way,
call completion_done(), which returns false if there are no posted
completions that were not yet consumed by waiters (implying that there are
waiters) and true otherwise;
waiters) and true otherwise::
bool completion_done(struct completion *done)

View File

@ -0,0 +1,29 @@
:orphan:
===============
Linux Scheduler
===============
.. toctree::
:maxdepth: 1
completion
sched-arch
sched-bwc
sched-deadline
sched-design-CFS
sched-domains
sched-energy
sched-nice-design
sched-rt-group
sched-stats
text_files
.. only:: subproject and html
Indices
=======
* :ref:`genindex`

View File

@ -1,4 +1,6 @@
CPU Scheduler implementation hints for architecture specific code
=================================================================
CPU Scheduler implementation hints for architecture specific code
=================================================================
Nick Piggin, 2005
@ -35,9 +37,10 @@ Your cpu_idle routines need to obey the following rules:
4. The only time interrupts need to be disabled when checking
need_resched is if we are about to sleep the processor until
the next interrupt (this doesn't provide any protection of
need_resched, it prevents losing an interrupt).
need_resched, it prevents losing an interrupt):
4a. Common problem with this type of sleep appears to be::
4a. Common problem with this type of sleep appears to be:
local_irq_disable();
if (!need_resched()) {
local_irq_enable();
@ -51,10 +54,10 @@ Your cpu_idle routines need to obey the following rules:
although it may be reasonable to do some background work or enter
a low CPU priority.
5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
an interrupt sleep, it needs to be cleared then a memory
barrier issued (followed by a test of need_resched with
interrupts disabled, as explained in 3).
- 5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
an interrupt sleep, it needs to be cleared then a memory
barrier issued (followed by a test of need_resched with
interrupts disabled, as explained in 3).
arch/x86/kernel/process.c has examples of both polling and
sleeping idle functions.
@ -71,4 +74,3 @@ sh64 - Is sleeping racy vs interrupts? (See #4a)
sparc - IRQs on at this point(?), change local_irq_save to _disable.
- TODO: needs secondary CPUs to disable preempt (See #1)

View File

@ -1,8 +1,9 @@
=====================
CFS Bandwidth Control
=====================
[ This document only discusses CPU bandwidth control for SCHED_NORMAL.
The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.txt ]
The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst ]
CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.
@ -27,7 +28,8 @@ cpu.cfs_quota_us: the total available run-time within a period (in microseconds)
cpu.cfs_period_us: the length of a period (in microseconds)
cpu.stat: exports throttling statistics [explained further below]
The default values are:
The default values are::
cpu.cfs_period_us=100ms
cpu.cfs_quota=-1
@ -55,7 +57,8 @@ For efficiency run-time is transferred between the global pool and CPU local
on large systems. The amount transferred each time such an update is required
is described as the "slice".
This is tunable via procfs:
This is tunable via procfs::
/proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)
Larger slice values will reduce transfer overheads, while smaller values allow
@ -66,6 +69,7 @@ Statistics
A group's bandwidth statistics are exported via 3 fields in cpu.stat.
cpu.stat:
- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
@ -78,12 +82,15 @@ Hierarchical considerations
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy.
within a hierarchy:
e.g. \Sum (c_i) may exceed C
[ Where C is the parent's bandwidth, and c_i its children ]
There are two ways in which a group may become throttled:
a. it fully consumes its own quota within a period
b. a parent's quota is fully consumed within its period
@ -92,7 +99,7 @@ be allowed to until the parent's runtime is refreshed.
Examples
--------
1. Limit a group to 1 CPU worth of runtime.
1. Limit a group to 1 CPU worth of runtime::
If period is 250ms and quota is also 250ms, the group will get
1 CPU worth of runtime every 250ms.
@ -100,10 +107,10 @@ Examples
# echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
# echo 250000 > cpu.cfs_period_us /* period = 250ms */
2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.
2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
runtime every 500ms.
With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
runtime every 500ms::
# echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
# echo 500000 > cpu.cfs_period_us /* period = 500ms */
@ -112,11 +119,10 @@ Examples
3. Limit a group to 20% of 1 CPU.
With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU.
With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::
# echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
# echo 50000 > cpu.cfs_period_us /* period = 50ms */
By using a small period here we are ensuring a consistent latency
response at the expense of burst capacity.
By using a small period here we are ensuring a consistent latency
response at the expense of burst capacity.

View File

@ -1,29 +1,29 @@
Deadline Task Scheduling
------------------------
========================
Deadline Task Scheduling
========================
CONTENTS
========
.. CONTENTS
0. WARNING
1. Overview
2. Scheduling algorithm
2.1 Main algorithm
2.2 Bandwidth reclaiming
3. Scheduling Real-Time Tasks
3.1 Definitions
3.2 Schedulability Analysis for Uniprocessor Systems
3.3 Schedulability Analysis for Multiprocessor Systems
3.4 Relationship with SCHED_DEADLINE Parameters
4. Bandwidth management
4.1 System-wide settings
4.2 Task interface
4.3 Default behavior
4.4 Behavior of sched_yield()
5. Tasks CPU affinity
5.1 SCHED_DEADLINE and cpusets HOWTO
6. Future plans
A. Test suite
B. Minimal main()
0. WARNING
1. Overview
2. Scheduling algorithm
2.1 Main algorithm
2.2 Bandwidth reclaiming
3. Scheduling Real-Time Tasks
3.1 Definitions
3.2 Schedulability Analysis for Uniprocessor Systems
3.3 Schedulability Analysis for Multiprocessor Systems
3.4 Relationship with SCHED_DEADLINE Parameters
4. Bandwidth management
4.1 System-wide settings
4.2 Task interface
4.3 Default behavior
4.4 Behavior of sched_yield()
5. Tasks CPU affinity
5.1 SCHED_DEADLINE and cpusets HOWTO
6. Future plans
A. Test suite
B. Minimal main()
0. WARNING
@ -44,7 +44,7 @@ CONTENTS
2. Scheduling algorithm
==================
=======================
2.1 Main algorithm
------------------
@ -80,7 +80,7 @@ CONTENTS
a "remaining runtime". These two parameters are initially set to 0;
- When a SCHED_DEADLINE task wakes up (becomes ready for execution),
the scheduler checks if
the scheduler checks if::
remaining runtime runtime
---------------------------------- > ---------
@ -97,7 +97,7 @@ CONTENTS
left unchanged;
- When a SCHED_DEADLINE task executes for an amount of time t, its
remaining runtime is decreased as
remaining runtime is decreased as::
remaining runtime = remaining runtime - t
@ -112,7 +112,7 @@ CONTENTS
- When the current time is equal to the replenishment time of a
throttled task, the scheduling deadline and the remaining runtime are
updated as
updated as::
scheduling deadline = scheduling deadline + period
remaining runtime = remaining runtime + runtime
@ -129,7 +129,7 @@ CONTENTS
Reclamation of Unused Bandwidth) algorithm [15, 16, 17] and it is enabled
when flag SCHED_FLAG_RECLAIM is set.
The following diagram illustrates the state names for tasks handled by GRUB:
The following diagram illustrates the state names for tasks handled by GRUB::
------------
(d) | Active |
@ -168,7 +168,7 @@ CONTENTS
breaking the real-time guarantees.
The 0-lag time for a task entering the ActiveNonContending state is
computed as
computed as::
(runtime * dl_period)
deadline - ---------------------
@ -183,7 +183,7 @@ CONTENTS
the task's utilization must be removed from the previous runqueue's active
utilization and must be added to the new runqueue's active utilization.
In order to avoid races between a task waking up on a runqueue while the
"inactive timer" is running on a different CPU, the "dl_non_contending"
"inactive timer" is running on a different CPU, the "dl_non_contending"
flag is used to indicate that a task is not on a runqueue but is active
(so, the flag is set when the task blocks and is cleared when the
"inactive timer" fires or when the task wakes up).
@ -222,36 +222,36 @@ CONTENTS
Let's now see a trivial example of two deadline tasks with runtime equal
to 4 and period equal to 8 (i.e., bandwidth equal to 0.5):
to 4 and period equal to 8 (i.e., bandwidth equal to 0.5)::
A Task T1
|
| |
| |
|-------- |----
| | V
|---|---|---|---|---|---|---|---|--------->t
0 1 2 3 4 5 6 7 8
A Task T1
|
| |
| |
|-------- |----
| | V
|---|---|---|---|---|---|---|---|--------->t
0 1 2 3 4 5 6 7 8
A Task T2
|
| |
| |
| ------------------------|
| | V
|---|---|---|---|---|---|---|---|--------->t
0 1 2 3 4 5 6 7 8
A Task T2
|
| |
| |
| ------------------------|
| | V
|---|---|---|---|---|---|---|---|--------->t
0 1 2 3 4 5 6 7 8
A running_bw
|
1 ----------------- ------
| | |
0.5- -----------------
| |
|---|---|---|---|---|---|---|---|--------->t
0 1 2 3 4 5 6 7 8
A running_bw
|
1 ----------------- ------
| | |
0.5- -----------------
| |
|---|---|---|---|---|---|---|---|--------->t
0 1 2 3 4 5 6 7 8
- Time t = 0:
@ -284,7 +284,7 @@ CONTENTS
2.3 Energy-aware scheduling
------------------------
---------------------------
When cpufreq's schedutil governor is selected, SCHED_DEADLINE implements the
GRUB-PA [19] algorithm, reducing the CPU operating frequency to the minimum
@ -299,15 +299,20 @@ CONTENTS
3. Scheduling Real-Time Tasks
=============================
* BIG FAT WARNING ******************************************************
*
* This section contains a (not-thorough) summary on classical deadline
* scheduling theory, and how it applies to SCHED_DEADLINE.
* The reader can "safely" skip to Section 4 if only interested in seeing
* how the scheduling policy can be used. Anyway, we strongly recommend
* to come back here and continue reading (once the urge for testing is
* satisfied :P) to be sure of fully understanding all technical details.
************************************************************************
.. BIG FAT WARNING ******************************************************
.. warning::
This section contains a (not-thorough) summary on classical deadline
scheduling theory, and how it applies to SCHED_DEADLINE.
The reader can "safely" skip to Section 4 if only interested in seeing
how the scheduling policy can be used. Anyway, we strongly recommend
to come back here and continue reading (once the urge for testing is
satisfied :P) to be sure of fully understanding all technical details.
.. ************************************************************************
There are no limitations on what kind of task can exploit this new
scheduling discipline, even if it must be said that it is particularly
@ -329,6 +334,7 @@ CONTENTS
sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally,
d_j = r_j + D, where D is the task's relative deadline.
Summing up, a real-time task can be described as
Task = (WCET, D, P)
The utilization of a real-time task is defined as the ratio between its
@ -352,13 +358,15 @@ CONTENTS
between the finishing time of a job and its absolute deadline).
More precisely, it can be proven that using a global EDF scheduler the
maximum tardiness of each task is smaller or equal than
((M 1) · WCET_max WCET_min)/(M (M 2) · U_max) + WCET_max
where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
utilization[12].
3.2 Schedulability Analysis for Uniprocessor Systems
------------------------
----------------------------------------------------
If M=1 (uniprocessor system), or in case of partitioned scheduling (each
real-time task is statically assigned to one and only one CPU), it is
@ -370,7 +378,9 @@ CONTENTS
a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
of all the tasks running on a CPU if the sum of the densities of the tasks
running on such a CPU is smaller or equal than 1:
sum(WCET_i / min{D_i, P_i}) <= 1
It is important to notice that this condition is only sufficient, and not
necessary: there are task sets that are schedulable, but do not respect the
condition. For example, consider the task set {Task_1,Task_2} composed by
@ -379,7 +389,9 @@ CONTENTS
(Task_1 is scheduled as soon as it is released, and finishes just in time
to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
its response time cannot be larger than 50ms + 10ms = 60ms) even if
50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
Of course it is possible to test the exact schedulability of tasks with
D_i != P_i (checking a condition that is both sufficient and necessary),
but this cannot be done by comparing the total utilization or density with
@ -399,7 +411,7 @@ CONTENTS
4 Linux uses an admission test based on the tasks' utilizations.
3.3 Schedulability Analysis for Multiprocessor Systems
------------------------
------------------------------------------------------
On multiprocessor systems with global EDF scheduling (non partitioned
systems), a sufficient test for schedulability can not be based on the
@ -428,7 +440,9 @@ CONTENTS
between total utilization (or density) and a fixed constant. If all tasks
have D_i = P_i, a sufficient schedulability condition can be expressed in
a simple way:
sum(WCET_i / P_i) <= M - (M - 1) · U_max
where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
just confirms the Dhall's effect. A more complete survey of the literature
@ -447,7 +461,7 @@ CONTENTS
the tasks are limited.
3.4 Relationship with SCHED_DEADLINE Parameters
------------------------
-----------------------------------------------
Finally, it is important to understand the relationship between the
SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
@ -473,6 +487,7 @@ CONTENTS
this task, as it is not possible to respect its temporal constraints.
References:
1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram-
ming in a hard-real-time environment. Journal of the Association for
Computing Machinery, 20(1), 1973.
@ -550,7 +565,7 @@ CONTENTS
The interface used to control the CPU bandwidth that can be allocated
to -deadline tasks is similar to the one already used for -rt
tasks with real-time group scheduling (a.k.a. RT-throttling - see
Documentation/scheduler/sched-rt-group.txt), and is based on readable/
Documentation/scheduler/sched-rt-group.rst), and is based on readable/
writable control files located in procfs (for system wide settings).
Notice that per-group settings (controlled through cgroupfs) are still not
defined for -deadline tasks, because more discussion is needed in order to
@ -596,11 +611,13 @@ CONTENTS
Specifying a periodic/sporadic task that executes for a given amount of
runtime at each instance, and that is scheduled according to the urgency of
its own timing constraints needs, in general, a way of declaring:
- a (maximum/typical) instance execution time,
- a minimum interval between consecutive instances,
- a time constraint by which each instance must be completed.
Therefore:
* a new struct sched_attr, containing all the necessary fields is
provided;
* the new scheduling related syscalls that manipulate it, i.e.,
@ -658,21 +675,21 @@ CONTENTS
------------------------------------
An example of a simple configuration (pin a -deadline task to CPU0)
follows (rt-app is used to create a -deadline task).
follows (rt-app is used to create a -deadline task)::
mkdir /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset
cd /dev/cpuset
mkdir cpu0
echo 0 > cpu0/cpuset.cpus
echo 0 > cpu0/cpuset.mems
echo 1 > cpuset.cpu_exclusive
echo 0 > cpuset.sched_load_balance
echo 1 > cpu0/cpuset.cpu_exclusive
echo 1 > cpu0/cpuset.mem_exclusive
echo $$ > cpu0/tasks
rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
task affinity)
mkdir /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset
cd /dev/cpuset
mkdir cpu0
echo 0 > cpu0/cpuset.cpus
echo 0 > cpu0/cpuset.mems
echo 1 > cpuset.cpu_exclusive
echo 0 > cpuset.sched_load_balance
echo 1 > cpu0/cpuset.cpu_exclusive
echo 1 > cpu0/cpuset.mem_exclusive
echo $$ > cpu0/tasks
rt-app -t 100000:10000:d:0 -D5 # it is now actually superfluous to specify
# task affinity
6. Future plans
===============
@ -711,7 +728,7 @@ Appendix A. Test suite
rt-app is available at: https://github.com/scheduler-tools/rt-app.
Thread parameters can be specified from the command line, with something like
this:
this::
# rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5
@ -721,27 +738,27 @@ Appendix A. Test suite
of 5 seconds.
More interestingly, configurations can be described with a json file that
can be passed as input to rt-app with something like this:
can be passed as input to rt-app with something like this::
# rt-app my_config.json
The parameters that can be specified with the second method are a superset
of the command line options. Please refer to rt-app documentation for more
details (<rt-app-sources>/doc/*.json).
details (`<rt-app-sources>/doc/*.json`).
The second testing application is a modification of schedtool, called
schedtool-dl, which can be used to setup SCHED_DEADLINE parameters for a
certain pid/application. schedtool-dl is available at:
https://github.com/scheduler-tools/schedtool-dl.git.
The usage is straightforward:
The usage is straightforward::
# schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app
With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
of 10ms every 100ms (note that parameters are expressed in microseconds).
You can also use schedtool to create a reservation for an already running
application, given that you know its pid:
application, given that you know its pid::
# schedtool -E -t 10000000:100000000 my_app_pid
@ -750,43 +767,43 @@ Appendix B. Minimal main()
We provide in what follows a simple (ugly) self-contained code snippet
showing how SCHED_DEADLINE reservations can be created by a real-time
application developer.
application developer::
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <linux/unistd.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <sys/syscall.h>
#include <pthread.h>
#define _GNU_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <linux/unistd.h>
#include <linux/kernel.h>
#include <linux/types.h>
#include <sys/syscall.h>
#include <pthread.h>
#define gettid() syscall(__NR_gettid)
#define gettid() syscall(__NR_gettid)
#define SCHED_DEADLINE 6
#define SCHED_DEADLINE 6
/* XXX use the proper syscall numbers */
#ifdef __x86_64__
#define __NR_sched_setattr 314
#define __NR_sched_getattr 315
#endif
/* XXX use the proper syscall numbers */
#ifdef __x86_64__
#define __NR_sched_setattr 314
#define __NR_sched_getattr 315
#endif
#ifdef __i386__
#define __NR_sched_setattr 351
#define __NR_sched_getattr 352
#endif
#ifdef __i386__
#define __NR_sched_setattr 351
#define __NR_sched_getattr 352
#endif
#ifdef __arm__
#define __NR_sched_setattr 380
#define __NR_sched_getattr 381
#endif
#ifdef __arm__
#define __NR_sched_setattr 380
#define __NR_sched_getattr 381
#endif
static volatile int done;
static volatile int done;
struct sched_attr {
struct sched_attr {
__u32 size;
__u32 sched_policy;
@ -802,25 +819,25 @@ Appendix B. Minimal main()
__u64 sched_runtime;
__u64 sched_deadline;
__u64 sched_period;
};
};
int sched_setattr(pid_t pid,
int sched_setattr(pid_t pid,
const struct sched_attr *attr,
unsigned int flags)
{
{
return syscall(__NR_sched_setattr, pid, attr, flags);
}
}
int sched_getattr(pid_t pid,
int sched_getattr(pid_t pid,
struct sched_attr *attr,
unsigned int size,
unsigned int flags)
{
{
return syscall(__NR_sched_getattr, pid, attr, size, flags);
}
}
void *run_deadline(void *data)
{
void *run_deadline(void *data)
{
struct sched_attr attr;
int x = 0;
int ret;
@ -851,10 +868,10 @@ Appendix B. Minimal main()
printf("deadline thread dies [%ld]\n", gettid());
return NULL;
}
}
int main (int argc, char **argv)
{
int main (int argc, char **argv)
{
pthread_t thread;
printf("main thread [%ld]\n", gettid());
@ -868,4 +885,4 @@ Appendix B. Minimal main()
printf("main dies [%ld]\n", gettid());
return 0;
}
}

View File

@ -1,9 +1,10 @@
=============
CFS Scheduler
=============
=============
CFS Scheduler
=============
1. OVERVIEW
============
CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
@ -27,6 +28,7 @@ is its actual runtime normalized to the total number of running tasks.
2. FEW IMPLEMENTATION DETAILS
==============================
In CFS the virtual runtime is expressed and tracked via the per-task
p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
@ -49,6 +51,7 @@ algorithm variants to recognize sleepers.
3. THE RBTREE
==============
CFS's design is quite radical: it does not use the old data structures for the
runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
@ -84,6 +87,7 @@ picked and the current task is preempted.
4. SOME FEATURES OF CFS
========================
CFS uses nanosecond granularity accounting and does not rely on any jiffies or
other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
@ -113,6 +117,7 @@ result.
5. Scheduling policies
======================
CFS implements three scheduling policies:
@ -137,6 +142,7 @@ SCHED_IDLE.
6. SCHEDULING CLASSES
======================
The new CFS scheduler has been designed in such a way to introduce "Scheduling
Classes," an extensible hierarchy of scheduler modules. These modules
@ -197,6 +203,7 @@ This is the (partial) list of the hooks:
7. GROUP SCHEDULER EXTENSIONS TO CFS
=====================================
Normally, the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks and
@ -219,7 +226,7 @@ SCHED_BATCH) tasks.
When CONFIG_FAIR_GROUP_SCHED is defined, a "cpu.shares" file is created for each
group created using the pseudo filesystem. See example steps below to create
task groups and modify their CPU share using the "cgroups" pseudo filesystem.
task groups and modify their CPU share using the "cgroups" pseudo filesystem::
# mount -t tmpfs cgroup_root /sys/fs/cgroup
# mkdir /sys/fs/cgroup/cpu

View File

@ -1,3 +1,7 @@
=================
Scheduler Domains
=================
Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they are
@ -46,7 +50,9 @@ CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The exact number of tasks amounts to an imbalance previously
computed while iterating over this sched domain's groups.
*** Implementing sched domains ***
Implementing sched domains
==========================
The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

View File

@ -1,6 +1,6 @@
=======================
Energy Aware Scheduling
=======================
=======================
Energy Aware Scheduling
=======================
1. Introduction
---------------
@ -12,7 +12,7 @@ with a minimal impact on throughput. This document aims at providing an
introduction on how EAS works, what are the main design decisions behind it, and
details what is needed to get it to run.
Before going any further, please note that at the time of writing:
Before going any further, please note that at the time of writing::
/!\ EAS does not support platforms with symmetric CPU topologies /!\
@ -33,13 +33,13 @@ To make it clear from the start:
- power = energy/time = [joule/second] = [watt]
The goal of EAS is to minimize energy, while still getting the job done. That
is, we want to maximize:
is, we want to maximize::
performance [inst/s]
--------------------
power [W]
which is equivalent to minimizing:
which is equivalent to minimizing::
energy [J]
-----------
@ -97,7 +97,7 @@ domains can contain duplicate elements.
Example 1.
Let us consider a platform with 12 CPUs, split in 3 performance domains
(pd0, pd4 and pd8), organized as follows:
(pd0, pd4 and pd8), organized as follows::
CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
PDs: |--pd0--|--pd4--|---pd8---|
@ -108,6 +108,7 @@ Example 1.
containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
above figure. Since pd4 intersects with both rd1 and rd2, it will be
present in the linked list '->pd' attached to each of them:
* rd1->pd: pd0 -> pd4
* rd2->pd: pd4 -> pd8
@ -159,9 +160,9 @@ Example 2.
Each performance domain has three Operating Performance Points (OPPs).
The CPU capacity and power cost associated with each OPP is listed in
the Energy Model table. The util_avg of P is shown on the figures
below as 'PP'.
below as 'PP'::
CPU util.
CPU util.
1024 - - - - - - - Energy Model
+-----------+-------------+
| Little | Big |
@ -188,8 +189,7 @@ Example 2.
(which is coherent with the behaviour of the schedutil CPUFreq
governor, see Section 6. for more details on this topic).
Case 1. P is migrated to CPU1
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Case 1. P is migrated to CPU1**::
1024 - - - - - - -
@ -207,8 +207,7 @@ Example 2.
CPU0 CPU1 CPU2 CPU3
Case 2. P is migrated to CPU3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Case 2. P is migrated to CPU3**::
1024 - - - - - - -
@ -226,8 +225,7 @@ Example 2.
CPU0 CPU1 CPU2 CPU3
Case 3. P stays on prev_cpu / CPU 0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**Case 3. P stays on prev_cpu / CPU 0**::
1024 - - - - - - -
@ -324,7 +322,9 @@ hardware properties and on other features of the kernel being enabled. This
section lists these dependencies and provides hints as to how they can be met.
6.1 - Asymmetric CPU topology
6.1 - Asymmetric CPU topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
As mentioned in the introduction, EAS is only supported on platforms with
asymmetric CPU topologies for now. This requirement is checked at run-time by
@ -347,7 +347,8 @@ significant savings on SMP platforms have been observed yet. This restriction
could be amended in the future if proven otherwise.
6.2 - Energy Model presence
6.2 - Energy Model presence
^^^^^^^^^^^^^^^^^^^^^^^^^^^
EAS uses the EM of a platform to estimate the impact of scheduling decisions on
energy. So, your platform must provide power cost tables to the EM framework in
@ -358,7 +359,8 @@ Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.
6.3 - Energy Model complexity
6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The task wake-up path is very latency-sensitive. When the EM of a platform is
too complex (too many CPUs, too many performance domains, too many performance
@ -388,7 +390,8 @@ two possible options:
hence enabling it to cope with larger EMs in reasonable time.
6.4 - Schedutil governor
6.4 - Schedutil governor
^^^^^^^^^^^^^^^^^^^^^^^^
EAS tries to predict at which OPP will the CPUs be running in the close future
in order to estimate their energy consumption. To do so, it is assumed that OPPs
@ -405,7 +408,8 @@ frequency requests and energy predictions.
Using EAS with any other governor than schedutil is not supported.
6.5 Scale-invariant utilization signals
6.5 Scale-invariant utilization signals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In order to make accurate prediction across CPUs and for all performance
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
@ -416,7 +420,8 @@ Using EAS on a platform that doesn't implement these two callbacks is not
supported.
6.6 Multithreading (SMT)
6.6 Multithreading (SMT)
^^^^^^^^^^^^^^^^^^^^^^^^
EAS in its current form is SMT unaware and is not able to leverage
multithreaded hardware to save energy. EAS considers threads as independent

View File

@ -1,3 +1,7 @@
=====================
Scheduler Nice Design
=====================
This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.
@ -14,7 +18,7 @@ much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!):
alert!)::
A

View File

@ -1,18 +1,18 @@
Real-Time group scheduling
--------------------------
==========================
Real-Time group scheduling
==========================
CONTENTS
========
.. CONTENTS
0. WARNING
1. Overview
1.1 The problem
1.2 The solution
2. The interface
2.1 System-wide settings
2.2 Default behaviour
2.3 Basis for grouping tasks
3. Future plans
0. WARNING
1. Overview
1.1 The problem
1.2 The solution
2. The interface
2.1 System-wide settings
2.2 Default behaviour
2.3 Basis for grouping tasks
3. Future plans
0. WARNING
@ -159,9 +159,11 @@ Consider two sibling groups A and B; both have 50% bandwidth, but A's
period is twice the length of B's.
* group A: period=100000us, runtime=50000us
- this runs for 0.05s once every 0.1s
* group B: period= 50000us, runtime=25000us
- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
This means that currently a while (1) loop in A will run for the full period of

View File

@ -1,3 +1,7 @@
====================
Scheduler Statistics
====================
Version 15 of schedstats dropped counters for some sched_yield:
yld_exp_empty, yld_act_empty and yld_both_empty. Otherwise, it is
identical to version 14.
@ -35,19 +39,23 @@ CPU statistics
cpu<N> 1 2 3 4 5 6 7 8 9
First field is a sched_yield() statistic:
1) # of times sched_yield() was called
Next three are schedule() statistics:
2) This field is a legacy array expiration count field used in the O(1)
scheduler. We kept it for ABI compatibility, but it is always set to zero.
3) # of times schedule() was called
4) # of times schedule() left the processor idle
Next two are try_to_wake_up() statistics:
5) # of times try_to_wake_up() was called
6) # of times try_to_wake_up() was called to wake up the local cpu
Next three are statistics describing scheduling latency:
7) sum of all time spent running by tasks on this processor (in jiffies)
8) sum of all time spent waiting to run by tasks on this processor (in
jiffies)
@ -67,24 +75,23 @@ The first field is a bit mask indicating what cpus this domain operates over.
The next 24 are a variety of load_balance() statistics in grouped into types
of idleness (idle, busy, and newly idle):
1) # of times in this domain load_balance() was called when the
1) # of times in this domain load_balance() was called when the
cpu was idle
2) # of times in this domain load_balance() checked but found
2) # of times in this domain load_balance() checked but found
the load did not require balancing when the cpu was idle
3) # of times in this domain load_balance() tried to move one or
3) # of times in this domain load_balance() tried to move one or
more tasks and failed, when the cpu was idle
4) sum of imbalances discovered (if any) with each call to
4) sum of imbalances discovered (if any) with each call to
load_balance() in this domain when the cpu was idle
5) # of times in this domain pull_task() was called when the cpu
5) # of times in this domain pull_task() was called when the cpu
was idle
6) # of times in this domain pull_task() was called even though
6) # of times in this domain pull_task() was called even though
the target task was cache-hot when idle
7) # of times in this domain load_balance() was called but did
7) # of times in this domain load_balance() was called but did
not find a busier queue while the cpu was idle
8) # of times in this domain a busier queue was found while the
8) # of times in this domain a busier queue was found while the
cpu was idle but no busier group was found
9) # of times in this domain load_balance() was called when the
9) # of times in this domain load_balance() was called when the
cpu was busy
10) # of times in this domain load_balance() checked but found the
load did not require balancing when busy
@ -117,21 +124,25 @@ of idleness (idle, busy, and newly idle):
was just becoming idle but no busier group was found
Next three are active_load_balance() statistics:
25) # of times active_load_balance() was called
26) # of times active_load_balance() tried to move a task and failed
27) # of times active_load_balance() successfully moved a task
Next three are sched_balance_exec() statistics:
28) sbe_cnt is not used
29) sbe_balanced is not used
30) sbe_pushed is not used
Next three are sched_balance_fork() statistics:
31) sbf_cnt is not used
32) sbf_balanced is not used
33) sbf_pushed is not used
Next three are try_to_wake_up() statistics:
34) # of times in this domain try_to_wake_up() awoke a task that
last ran on a different cpu in this domain
35) # of times in this domain try_to_wake_up() moved a task to the
@ -139,10 +150,11 @@ of idleness (idle, busy, and newly idle):
36) # of times in this domain try_to_wake_up() started passive balancing
/proc/<pid>/schedstat
----------------
---------------------
schedstats also adds a new /proc/<pid>/schedstat file to include some of
the same information on a per-process level. There are three fields in
this file correlating for that process to:
1) time spent on the cpu
2) time spent waiting on a runqueue
3) # of timeslices run on this cpu
@ -151,4 +163,5 @@ A program could be easily written to make use of these extra fields to
report on how well a particular process or set of processes is faring
under the scheduler's policies. A simple version of such a program is
available at
http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c

View File

@ -0,0 +1,5 @@
Scheduler pelt c program
------------------------
.. literalinclude:: sched-pelt.c
:language: c

View File

@ -99,7 +99,7 @@ Local allocation will tend to keep subsequent access to the allocated memory
as long as the task on whose behalf the kernel allocated some memory does not
later migrate away from that memory. The Linux scheduler is aware of the
NUMA topology of the platform--embodied in the "scheduling domains" data
structures [see Documentation/scheduler/sched-domains.txt]--and the scheduler
structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler
attempts to minimize task migration to distant scheduling domains. However,
the scheduler does not take a task's NUMA footprint into account directly.
Thus, under sufficient imbalance, tasks can migrate between nodes, remote

View File

@ -734,7 +734,7 @@ menuconfig CGROUPS
use with process control subsystems such as Cpusets, CFS, memory
controls or device isolation.
See
- Documentation/scheduler/sched-design-CFS.txt (CFS)
- Documentation/scheduler/sched-design-CFS.rst (CFS)
- Documentation/cgroup-v1/ (features for grouping, isolation
and resource control)
@ -835,7 +835,7 @@ config CFS_BANDWIDTH
tasks running within the fair group scheduler. Groups with no limit
set are considered to be unconstrained and will run with no
restriction.
See Documentation/scheduler/sched-bwc.txt for more information.
See Documentation/scheduler/sched-bwc.rst for more information.
config RT_GROUP_SCHED
bool "Group scheduling for SCHED_RR/FIFO"
@ -846,7 +846,7 @@ config RT_GROUP_SCHED
to task groups. If enabled, it will also make it impossible to
schedule realtime tasks for non-root users until you allocate
realtime bandwidth for them.
See Documentation/scheduler/sched-rt-group.txt for more information.
See Documentation/scheduler/sched-rt-group.rst for more information.
endif #CGROUP_SCHED

View File

@ -726,7 +726,7 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
* refill the runtime and set the deadline a period in the future,
* because keeping the current (absolute) deadline of the task would
* result in breaking guarantees promised to other tasks (refer to
* Documentation/scheduler/sched-deadline.txt for more information).
* Documentation/scheduler/sched-deadline.rst for more information).
*
* This function returns true if:
*