Mirror of https://github.com/torvalds/linux.git
Synced 2024-11-01 17:51:43 +00:00

Commit cb2517653f
schedstats is very useful during debugging and performance tuning but it
incurs overhead to calculate the stats. As such, even though it can be
disabled at build time, it is often enabled as the information is useful.

This patch adds a kernel command-line and sysctl tunable to enable or
disable schedstats on demand (when it's built in). It is disabled by
default as someone who knows they need it can also learn to enable it
when necessary.

The benefits depend on how scheduler-intensive the workload is. If it is,
then the patch reduces the number of cycles spent calculating the stats,
with a small additional benefit from reducing the cache footprint of the
scheduler.

These measurements were taken from a 48-core 2-socket machine with
Xeon(R) E5-2670 v3 CPUs, although they were also tested on a
single-socket 8-core machine with Intel i7-3770 processors.

netperf-tcp
                       4.5.0-rc1             4.5.0-rc1
                         vanilla          nostats-v3r1
Hmean    64        560.45 (  0.00%)      575.98 (  2.77%)
Hmean    128       766.66 (  0.00%)      795.79 (  3.80%)
Hmean    256       950.51 (  0.00%)      981.50 (  3.26%)
Hmean    1024     1433.25 (  0.00%)     1466.51 (  2.32%)
Hmean    2048     2810.54 (  0.00%)     2879.75 (  2.46%)
Hmean    3312     4618.18 (  0.00%)     4682.09 (  1.38%)
Hmean    4096     5306.42 (  0.00%)     5346.39 (  0.75%)
Hmean    8192    10581.44 (  0.00%)    10698.15 (  1.10%)
Hmean    16384   18857.70 (  0.00%)    18937.61 (  0.42%)

Small gains here; UDP_STREAM showed nothing interesting and neither did
the TCP_RR tests. The gains on the 8-core machine were very similar.

tbench4
                          4.5.0-rc1             4.5.0-rc1
                            vanilla          nostats-v3r1
Hmean    mb/sec-1       500.85 (  0.00%)      522.43 (  4.31%)
Hmean    mb/sec-2       984.66 (  0.00%)     1018.19 (  3.41%)
Hmean    mb/sec-4      1827.91 (  0.00%)     1847.78 (  1.09%)
Hmean    mb/sec-8      3561.36 (  0.00%)     3611.28 (  1.40%)
Hmean    mb/sec-16     5824.52 (  0.00%)     5929.03 (  1.79%)
Hmean    mb/sec-32    10943.10 (  0.00%)    10802.83 ( -1.28%)
Hmean    mb/sec-64    15950.81 (  0.00%)    16211.31 (  1.63%)
Hmean    mb/sec-128   15302.17 (  0.00%)    15445.11 (  0.93%)
Hmean    mb/sec-256   14866.18 (  0.00%)    15088.73 (  1.50%)
Hmean    mb/sec-512   15223.31 (  0.00%)    15373.69 (  0.99%)
Hmean    mb/sec-1024  14574.25 (  0.00%)    14598.02 (  0.16%)
Hmean    mb/sec-2048  13569.02 (  0.00%)    13733.86 (  1.21%)
Hmean    mb/sec-3072  12865.98 (  0.00%)    13209.23 (  2.67%)

Small gains of 2-4% at low thread counts and otherwise flat. The gains
on the 8-core machine were slightly different:

tbench4 on 8-core i7-3770 single socket machine
Hmean    mb/sec-1      442.59 (  0.00%)      448.73 (  1.39%)
Hmean    mb/sec-2      796.68 (  0.00%)      794.39 ( -0.29%)
Hmean    mb/sec-4     1322.52 (  0.00%)     1343.66 (  1.60%)
Hmean    mb/sec-8     2611.65 (  0.00%)     2694.86 (  3.19%)
Hmean    mb/sec-16    2537.07 (  0.00%)     2609.34 (  2.85%)
Hmean    mb/sec-32    2506.02 (  0.00%)     2578.18 (  2.88%)
Hmean    mb/sec-64    2511.06 (  0.00%)     2569.16 (  2.31%)
Hmean    mb/sec-128   2313.38 (  0.00%)     2395.50 (  3.55%)
Hmean    mb/sec-256   2110.04 (  0.00%)     2177.45 (  3.19%)
Hmean    mb/sec-512   2072.51 (  0.00%)     2053.97 ( -0.89%)

In contrast, this shows a relatively steady 2-3% gain at higher thread
counts. Due to the nature of the patch and the type of workload, it's no
surprise that the result depends on the CPU used.
hackbench-pipes
                      4.5.0-rc1             4.5.0-rc1
                        vanilla          nostats-v3r1
Amean    1       0.0637 (  0.00%)      0.0660 ( -3.59%)
Amean    4       0.1229 (  0.00%)      0.1181 (  3.84%)
Amean    7       0.1921 (  0.00%)      0.1911 (  0.52%)
Amean    12      0.3117 (  0.00%)      0.2923 (  6.23%)
Amean    21      0.4050 (  0.00%)      0.3899 (  3.74%)
Amean    30      0.4586 (  0.00%)      0.4433 (  3.33%)
Amean    48      0.5910 (  0.00%)      0.5694 (  3.65%)
Amean    79      0.8663 (  0.00%)      0.8626 (  0.43%)
Amean    110     1.1543 (  0.00%)      1.1517 (  0.22%)
Amean    141     1.4457 (  0.00%)      1.4290 (  1.16%)
Amean    172     1.7090 (  0.00%)      1.6924 (  0.97%)
Amean    192     1.9126 (  0.00%)      1.9089 (  0.19%)

Some small gains and losses; while the variance data is not included,
the differences are close to the noise. The UMA machine did not show
anything particularly different.

pipetest
                             4.5.0-rc1             4.5.0-rc1
                               vanilla          nostats-v2r2
Min          Time        4.13 (  0.00%)        3.99 (  3.39%)
1st-qrtle    Time        4.38 (  0.00%)        4.27 (  2.51%)
2nd-qrtle    Time        4.46 (  0.00%)        4.39 (  1.57%)
3rd-qrtle    Time        4.56 (  0.00%)        4.51 (  1.10%)
Max-90%      Time        4.67 (  0.00%)        4.60 (  1.50%)
Max-93%      Time        4.71 (  0.00%)        4.65 (  1.27%)
Max-95%      Time        4.74 (  0.00%)        4.71 (  0.63%)
Max-99%      Time        4.88 (  0.00%)        4.79 (  1.84%)
Max          Time        4.93 (  0.00%)        4.83 (  2.03%)
Mean         Time        4.48 (  0.00%)        4.39 (  1.91%)
Best99%Mean  Time        4.47 (  0.00%)        4.39 (  1.91%)
Best95%Mean  Time        4.46 (  0.00%)        4.38 (  1.93%)
Best90%Mean  Time        4.45 (  0.00%)        4.36 (  1.98%)
Best50%Mean  Time        4.36 (  0.00%)        4.25 (  2.49%)
Best10%Mean  Time        4.23 (  0.00%)        4.10 (  3.13%)
Best5%Mean   Time        4.19 (  0.00%)        4.06 (  3.20%)
Best1%Mean   Time        4.13 (  0.00%)        4.00 (  3.39%)

A small improvement, and similar gains were seen on the UMA machine. The
gain is small but it stands to reason that doing less work in the
scheduler is a good thing.

The downside is that the lack of schedstats and tracepoints may be
surprising to experts doing performance analysis until they find the
existence of the schedstats= parameter or the schedstats sysctl.
Schedstats will be automatically activated for latencytop and sleep
profiling to alleviate the problem. For tracepoints, there is a simple
warning, as it's not safe to activate schedstats in the context where it
becomes known that the tracepoint may be wanted but is unavailable.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <mgalbraith@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
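
With CONFIG_SCHEDSTATS built in, the runtime switch described above can be
flipped from userspace. Below is a minimal sketch, assuming the sysctl is
exposed as /proc/sys/kernel/sched_schedstats (i.e. kernel.sched_schedstats)
and that the schedstats= boot parameter mentioned in the commit message is the
boot-time equivalent; it reads the current value and, when asked, toggles it
(writing requires root).

/* Query or toggle the runtime schedstats switch via its sysctl file. */
#include <stdio.h>
#include <string.h>

#define SCHEDSTATS_SYSCTL "/proc/sys/kernel/sched_schedstats"  /* assumed path */

static int read_schedstats(void)
{
        FILE *f = fopen(SCHEDSTATS_SYSCTL, "r");
        int val = -1;

        if (!f)
                return -1;      /* kernel built without CONFIG_SCHEDSTATS, or older kernel */
        if (fscanf(f, "%d", &val) != 1)
                val = -1;
        fclose(f);
        return val;
}

static int write_schedstats(int enable)
{
        FILE *f = fopen(SCHEDSTATS_SYSCTL, "w");

        if (!f)
                return -1;      /* not root, or sysctl not present */
        fprintf(f, "%d\n", enable ? 1 : 0);
        fclose(f);
        return 0;
}

int main(int argc, char **argv)
{
        if (argc > 1 && !strcmp(argv[1], "enable"))
                write_schedstats(1);
        else if (argc > 1 && !strcmp(argv[1], "disable"))
                write_schedstats(0);

        printf("kernel.sched_schedstats = %d\n", read_schedstats());
        return 0;
}

Run it with no arguments to query, or with "enable"/"disable" to toggle; a
printed value of -1 means the sysctl file could not be read on this kernel.
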
265 lines
7.9 KiB
C

#ifdef CONFIG_SCHEDSTATS

/*
 * Expects runqueue lock to be held for atomicity of update
 */
static inline void
rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
{
        if (rq) {
                rq->rq_sched_info.run_delay += delta;
                rq->rq_sched_info.pcount++;
        }
}

/*
 * Expects runqueue lock to be held for atomicity of update
 */
static inline void
rq_sched_info_depart(struct rq *rq, unsigned long long delta)
{
        if (rq)
                rq->rq_cpu_time += delta;
}

static inline void
rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
{
        if (rq)
                rq->rq_sched_info.run_delay += delta;
}
# define schedstat_enabled()            static_branch_unlikely(&sched_schedstats)
# define schedstat_inc(rq, field)       do { if (schedstat_enabled()) { (rq)->field++; } } while (0)
# define schedstat_add(rq, field, amt)  do { if (schedstat_enabled()) { (rq)->field += (amt); } } while (0)
# define schedstat_set(var, val)        do { if (schedstat_enabled()) { var = (val); } } while (0)
#else /* !CONFIG_SCHEDSTATS */
static inline void
rq_sched_info_arrive(struct rq *rq, unsigned long long delta)
{}
static inline void
rq_sched_info_dequeued(struct rq *rq, unsigned long long delta)
{}
static inline void
rq_sched_info_depart(struct rq *rq, unsigned long long delta)
{}
# define schedstat_enabled()            0
# define schedstat_inc(rq, field)       do { } while (0)
# define schedstat_add(rq, field, amt)  do { } while (0)
# define schedstat_set(var, val)        do { } while (0)
#endif
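
/*
 * Usage sketch (illustrative addition, not from the upstream header):
 * scheduler code is expected to route its accounting through the macros
 * above, e.g. something like
 *
 *      schedstat_inc(rq, yld_count);
 *
 * so that when schedstats is disabled at runtime the update is skipped
 * behind the sched_schedstats static key, and with CONFIG_SCHEDSTATS=n
 * it compiles away entirely.
 */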

#ifdef CONFIG_SCHED_INFO
static inline void sched_info_reset_dequeued(struct task_struct *t)
{
        t->sched_info.last_queued = 0;
}

/*
 * We are interested in knowing how long it was from the *first* time a
 * task was queued to the time that it finally hit a cpu; we call this routine
 * from dequeue_task() to account for possible rq->clock skew across cpus. The
 * delta taken on each cpu would annul the skew.
 */
static inline void sched_info_dequeued(struct rq *rq, struct task_struct *t)
{
        unsigned long long now = rq_clock(rq), delta = 0;

        if (unlikely(sched_info_on()))
                if (t->sched_info.last_queued)
                        delta = now - t->sched_info.last_queued;
        sched_info_reset_dequeued(t);
        t->sched_info.run_delay += delta;

        rq_sched_info_dequeued(rq, delta);
}

/*
 * Called when a task finally hits the cpu. We can now calculate how
 * long it was waiting to run. We also note when it began so that we
 * can keep stats on how long its timeslice is.
 */
static void sched_info_arrive(struct rq *rq, struct task_struct *t)
{
        unsigned long long now = rq_clock(rq), delta = 0;

        if (t->sched_info.last_queued)
                delta = now - t->sched_info.last_queued;
        sched_info_reset_dequeued(t);
        t->sched_info.run_delay += delta;
        t->sched_info.last_arrival = now;
        t->sched_info.pcount++;

        rq_sched_info_arrive(rq, delta);
}

/*
 * This function is only called from enqueue_task(), but also only updates
 * the timestamp if it is not already set. It's assumed that
 * sched_info_dequeued() will clear that stamp when appropriate.
 */
static inline void sched_info_queued(struct rq *rq, struct task_struct *t)
{
        if (unlikely(sched_info_on()))
                if (!t->sched_info.last_queued)
                        t->sched_info.last_queued = rq_clock(rq);
}

/*
 * Called when a process ceases being the active-running process involuntarily
 * due, typically, to expiring its time slice (this may also be called when
 * switching to the idle task). Now we can calculate how long we ran.
 * Also, if the process is still in the TASK_RUNNING state, call
 * sched_info_queued() to mark that it has now again started waiting on
 * the runqueue.
 */
static inline void sched_info_depart(struct rq *rq, struct task_struct *t)
{
        unsigned long long delta = rq_clock(rq) -
                                        t->sched_info.last_arrival;

        rq_sched_info_depart(rq, delta);

        if (t->state == TASK_RUNNING)
                sched_info_queued(rq, t);
}

/*
 * Called when tasks are switched involuntarily due, typically, to expiring
 * their time slice. (This may also be called when switching to or from
 * the idle task.) We are only called when prev != next.
 */
static inline void
__sched_info_switch(struct rq *rq,
                    struct task_struct *prev, struct task_struct *next)
{
        /*
         * prev now departs the cpu. It's not interesting to record
         * stats about how efficient we were at scheduling the idle
         * process, however.
         */
        if (prev != rq->idle)
                sched_info_depart(rq, prev);

        if (next != rq->idle)
                sched_info_arrive(rq, next);
}
static inline void
sched_info_switch(struct rq *rq,
                  struct task_struct *prev, struct task_struct *next)
{
        if (unlikely(sched_info_on()))
                __sched_info_switch(rq, prev, next);
}
#else
#define sched_info_queued(rq, t)                do { } while (0)
#define sched_info_reset_dequeued(t)            do { } while (0)
#define sched_info_dequeued(rq, t)              do { } while (0)
#define sched_info_depart(rq, t)                do { } while (0)
#define sched_info_arrive(rq, next)             do { } while (0)
#define sched_info_switch(rq, t, next)          do { } while (0)
#endif /* CONFIG_SCHED_INFO */

/*
 * The following are functions that support scheduler-internal time accounting.
 * These functions are generally called at the timer tick. None of this depends
 * on CONFIG_SCHEDSTATS.
 */

/**
 * cputimer_running - return true if cputimer is running
 *
 * @tsk: Pointer to target task.
 */
static inline bool cputimer_running(struct task_struct *tsk)

{
        struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

        /* Check if cputimer isn't running. This is accessed without locking. */
        if (!READ_ONCE(cputimer->running))
                return false;

        /*
         * After we flush the task's sum_exec_runtime to sig->sum_sched_runtime
         * in __exit_signal(), we won't account to the signal struct further
         * cputime consumed by that task, even though the task can still be
         * ticking after __exit_signal().
         *
         * In order to keep a consistent behaviour between thread group cputime
         * and thread group cputimer accounting, let's also ignore the cputime
         * elapsing after __exit_signal() in any thread group timer running.
         *
         * This makes sure that POSIX CPU clocks and timers are synchronized, so
         * that a POSIX CPU timer won't expire while the corresponding POSIX CPU
         * clock delta is behind the expiring timer value.
         */
        if (unlikely(!tsk->sighand))
                return false;

        return true;
}

/**
 * account_group_user_time - Maintain utime for a thread group.
 *
 * @tsk: Pointer to task structure.
 * @cputime: Time value by which to increment the utime field of the
 *           thread_group_cputime structure.
 *
 * If thread group time is being maintained, get the structure for the
 * running CPU and update the utime field there.
 */
static inline void account_group_user_time(struct task_struct *tsk,
                                           cputime_t cputime)
{
        struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

        if (!cputimer_running(tsk))
                return;

        atomic64_add(cputime, &cputimer->cputime_atomic.utime);
}

/**
 * account_group_system_time - Maintain stime for a thread group.
 *
 * @tsk: Pointer to task structure.
 * @cputime: Time value by which to increment the stime field of the
 *           thread_group_cputime structure.
 *
 * If thread group time is being maintained, get the structure for the
 * running CPU and update the stime field there.
 */
static inline void account_group_system_time(struct task_struct *tsk,
                                             cputime_t cputime)
{
        struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

        if (!cputimer_running(tsk))
                return;

        atomic64_add(cputime, &cputimer->cputime_atomic.stime);
}

/**
 * account_group_exec_runtime - Maintain exec runtime for a thread group.
 *
 * @tsk: Pointer to task structure.
 * @ns: Time value by which to increment the sum_exec_runtime field
 *      of the thread_group_cputime structure.
 *
 * If thread group time is being maintained, get the structure for the
 * running CPU and update the sum_exec_runtime field there.
 */
static inline void account_group_exec_runtime(struct task_struct *tsk,
                                              unsigned long long ns)
{
        struct thread_group_cputimer *cputimer = &tsk->signal->cputimer;

        if (!cputimer_running(tsk))
                return;

        atomic64_add(ns, &cputimer->cputime_atomic.sum_exec_runtime);
}
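
To make the wait-time bookkeeping above concrete outside the kernel, here is a
self-contained userspace analogue. It is only a sketch: the fake_sched_info
structure and fake_info_*() helpers are invented for illustration and mirror
sched_info_queued()/sched_info_arrive(), with CLOCK_MONOTONIC standing in for
rq_clock().

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>

struct fake_sched_info {
        uint64_t last_queued;   /* 0 means "not currently queued" */
        uint64_t run_delay;     /* total time spent waiting for a CPU, in ns */
        uint64_t pcount;        /* number of times the task got the CPU */
};

static uint64_t now_ns(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ull + ts.tv_nsec;
}

/* Analogue of sched_info_queued(): stamp only if not already stamped. */
static void fake_info_queued(struct fake_sched_info *si)
{
        if (!si->last_queued)
                si->last_queued = now_ns();
}

/* Analogue of sched_info_arrive(): wait time = now - first queue stamp. */
static void fake_info_arrive(struct fake_sched_info *si)
{
        uint64_t now = now_ns(), delta = 0;

        if (si->last_queued)
                delta = now - si->last_queued;
        si->last_queued = 0;    /* like sched_info_reset_dequeued() */
        si->run_delay += delta;
        si->pcount++;
}

int main(void)
{
        struct fake_sched_info si = { 0 };

        fake_info_queued(&si);
        usleep(2000);           /* pretend the task waited ~2ms for a CPU */
        fake_info_arrive(&si);

        printf("pcount=%llu run_delay=%llu ns\n",
               (unsigned long long)si.pcount,
               (unsigned long long)si.run_delay);
        return 0;
}

Built with a plain cc invocation, the printed run_delay should be roughly the
2ms of simulated queueing time, which is the quantity the kernel accumulates
per task and per runqueue in the functions above.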