Convert to use folio_xchg_last_cpupid() in should_numa_migrate_memory().
Link: https://lkml.kernel.org/r/20231018140806.2783514-14-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Convert to use folio_xchg_access_time() in numa_hint_fault_latency().
Link: https://lkml.kernel.org/r/20231018140806.2783514-9-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
There are no users of wait bookmarks left, so simplify the wait
code by removing them.
Link: https://lkml.kernel.org/r/20231010035829.544242-2-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Benjamin Segall <bsegall@google.com>
Cc: Bin Lai <sclaibin@gmail.com>
Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The cpupid (or access time) is stored in the head page for THP, so it is
safely to make should_numa_migrate_memory() and numa_hint_fault_latency()
to take a folio. This is in preparation for large folio numa balancing.
Link: https://lkml.kernel.org/r/20230921074417.24004-7-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
During RCU-boost testing with the TREE03 rcutorture config, I found that
after a few hours, the machine locks up.
On tracing, I found that there is a live lock happening between 2 CPUs.
One CPU has an RT task running, while another CPU is being offlined
which also has an RT task running. During this offlining, all threads
are migrated. The migration thread is repeatedly scheduled to migrate
actively running tasks on the CPU being offlined. This results in a live
lock because select_fallback_rq() keeps picking the CPU that an RT task
is already running on only to get pushed back to the CPU being offlined.
It is anyway pointless to pick CPUs for pushing tasks to if they are
being offlined only to get migrated away to somewhere else. This could
also add unwanted latency to this task.
Fix these issues by not selecting CPUs in RT if they are not 'active'
for scheduling, using the cpu_active_mask. Other parts in core.c already
use cpu_active_mask to prevent tasks from being put on CPUs going
offline.
With this fix I ran the tests for days and could not reproduce the
hang. Without the patch, I hit it in a few hours.
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20230923011409.3522762-1-joel@joelfernandes.org
Initial booting is setting the task flag to idle (PF_IDLE) by the call
path sched_init() -> init_idle(). Having the task idle and calling
call_rcu() in kernel/rcu/tiny.c means that TIF_NEED_RESCHED will be
set. Subsequent calls to any cond_resched() will enable IRQs,
potentially earlier than the IRQ setup has completed. Recent changes
have caused just this scenario and IRQs have been enabled early.
This causes a warning later in start_kernel() as interrupts are enabled
before they are fully set up.
Fix this issue by setting the PF_IDLE flag later in the boot sequence.
Although the boot task was marked as idle since (at least) d80e4fda576d,
I am not sure that it is wrong to do so. The forced context-switch on
idle task was introduced in the tiny_rcu update, so I'm going to claim
this fixes 5f6130fa52.
Fixes: 5f6130fa52 ("tiny_rcu: Directly force QS when call_rcu_[bh|sched]() on idle_task")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/linux-mm/CAMuHMdWpvpWoDa=Ox-do92czYRvkok6_x6pYUH+ZouMcJbXy+Q@mail.gmail.com/
For SMT4, any group with more than 2 tasks will be marked as
group_smt_balance. Retain the behaviour of group_has_spare by marking
the busiest group as the group which has the least number of idle_cpus.
Also, handle rounding effect of adding (ncores_local + ncores_busy) when
the local is fully idle and busy group imbalance is less than 2 tasks.
Local group should try to pull at least 1 task in this case so imbalance
should be set to 2 instead.
Fixes: fee1759e4f ("sched/fair: Determine active load balance for SMT sched groups")
Acked-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/6cd1633036bb6b651af575c32c2a9608a106702c.camel@linux.intel.com
should_we_balance() is called in load_balance() to find out if the CPU that
is trying to do the load balance is the right one or not.
With commit:
b1bfeab9b002("sched/fair: Consider the idle state of the whole core for load balance")
the code tries to find an idle core to do the load balancing
and falls back on an idle sibling CPU if there is no idle core.
However, on larger SMT systems, it could be needlessly iterating to find a
idle by scanning all the CPUs in an non-idle core. If the core is not idle,
and first SMT sibling which is idle has been found, then its not needed to
check other SMT siblings for idleness
Lets say in SMT4, Core0 has 0,2,4,6 and CPU0 is BUSY and rest are IDLE.
balancing domain is MC/DIE. CPU2 will be set as the first idle_smt and
same process would be repeated for CPU4 and CPU6 but this is unnecessary.
Since calling is_core_idle loops through all CPU's in the SMT mask, effect
is multiplied by weight of smt_mask. For example,when say 1 CPU is busy,
we would skip loop for 2 CPU's and skip iterating over 8CPU's. That
effect would be more in DIE/NUMA domain where there are more cores.
Testing and performance evaluation
==================================
The test has been done on this system which has 12 cores, i.e 24 small
cores with SMT=4:
lscpu
Architecture: ppc64le
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Model name: POWER10 (architected), altivec supported
Thread(s) per core: 8
Used funclatency bcc tool to evaluate the time taken by should_we_balance(). For
base tip/sched/core the time taken is collected by making the
should_we_balance() noinline. time is in nanoseconds. The values are
collected by running the funclatency tracer for 60 seconds. values are
average of 3 such runs. This represents the expected reduced time with
patch.
tip/sched/core was at commit:
2f88c8e802 ("sched/eevdf/doc: Modify the documented knob to base_slice_ns as well")
Results:
------------------------------------------------------------------------------
workload tip/sched/core with_patch(%gain)
------------------------------------------------------------------------------
idle system 809.3 695.0(16.45)
stress ng – 12 threads -l 100 1013.5 893.1(13.49)
stress ng – 24 threads -l 100 1073.5 980.0(9.54)
stress ng – 48 threads -l 100 683.0 641.0(6.55)
stress ng – 96 threads -l 100 2421.0 2300(5.26)
stress ng – 96 threads -l 15 375.5 377.5(-0.53)
stress ng – 96 threads -l 25 635.5 637.5(-0.31)
stress ng – 96 threads -l 35 934.0 891.0(4.83)
Ran schbench(old), hackbench and stress_ng to evaluate the workload
performance between tip/sched/core and with patch.
No modification to tip/sched/core
TL;DR:
Good improvement is seen with schbench. when hackbench and stress_ng
runs for longer good improvement is seen.
------------------------------------------------------------------------------
schbench(old) tip +patch(%gain)
10 iterations sched/core
------------------------------------------------------------------------------
1 Threads
50.0th: 8.00 9.00(-12.50)
75.0th: 9.60 9.00(6.25)
90.0th: 11.80 10.20(13.56)
95.0th: 12.60 10.40(17.46)
99.0th: 13.60 11.90(12.50)
99.5th: 14.10 12.60(10.64)
99.9th: 15.90 14.60(8.18)
2 Threads
50.0th: 9.90 9.20(7.07)
75.0th: 12.60 10.10(19.84)
90.0th: 15.50 12.00(22.58)
95.0th: 17.70 14.00(20.90)
99.0th: 21.20 16.90(20.28)
99.5th: 22.60 17.50(22.57)
99.9th: 30.40 19.40(36.18)
4 Threads
50.0th: 12.50 10.60(15.20)
75.0th: 15.30 12.00(21.57)
90.0th: 18.60 14.10(24.19)
95.0th: 21.30 16.20(23.94)
99.0th: 26.00 20.70(20.38)
99.5th: 27.60 22.50(18.48)
99.9th: 33.90 31.40(7.37)
8 Threads
50.0th: 16.30 14.30(12.27)
75.0th: 20.20 17.40(13.86)
90.0th: 24.50 21.90(10.61)
95.0th: 27.30 24.70(9.52)
99.0th: 35.00 31.20(10.86)
99.5th: 46.40 33.30(28.23)
99.9th: 89.30 57.50(35.61)
16 Threads
50.0th: 22.70 20.70(8.81)
75.0th: 30.10 27.40(8.97)
90.0th: 36.00 32.80(8.89)
95.0th: 39.60 36.40(8.08)
99.0th: 49.20 44.10(10.37)
99.5th: 64.90 50.50(22.19)
99.9th: 143.50 100.60(29.90)
32 Threads
50.0th: 34.60 35.50(-2.60)
75.0th: 48.20 50.50(-4.77)
90.0th: 59.20 62.40(-5.41)
95.0th: 65.20 69.00(-5.83)
99.0th: 80.40 83.80(-4.23)
99.5th: 102.10 98.90(3.13)
99.9th: 727.10 506.80(30.30)
schbench does improve in general. There is some run to run variation with
schbench. Did a validation run to confirm that trend is similar.
------------------------------------------------------------------------------
hackbench tip +patch(%gain)
20 iterations, 50000 loops sched/core
------------------------------------------------------------------------------
Process 10 groups : 11.74 11.70(0.34)
Process 20 groups : 22.73 22.69(0.18)
Process 30 groups : 33.39 33.40(-0.03)
Process 40 groups : 43.73 43.61(0.27)
Process 50 groups : 53.82 54.35(-0.98)
Process 60 groups : 64.16 65.29(-1.76)
thread 10 Time : 12.81 12.79(0.16)
thread 20 Time : 24.63 24.47(0.65)
Process(Pipe) 10 Time : 6.40 6.34(0.94)
Process(Pipe) 20 Time : 10.62 10.63(-0.09)
Process(Pipe) 30 Time : 15.09 14.84(1.66)
Process(Pipe) 40 Time : 19.42 19.01(2.11)
Process(Pipe) 50 Time : 24.04 23.34(2.91)
Process(Pipe) 60 Time : 28.94 27.51(4.94)
thread(Pipe) 10 Time : 6.96 6.87(1.29)
thread(Pipe) 20 Time : 11.74 11.73(0.09)
hackbench shows slight improvement with pipe. Slight degradation in process.
------------------------------------------------------------------------------
stress_ng tip +patch(%gain)
10 iterations 100000 cpu_ops sched/core
------------------------------------------------------------------------------
--cpu=96 -util=100 Time taken : 5.30, 5.01(5.47)
--cpu=48 -util=100 Time taken : 7.94, 6.73(15.24)
--cpu=24 -util=100 Time taken : 11.67, 8.75(25.02)
--cpu=12 -util=100 Time taken : 15.71, 15.02(4.39)
--cpu=96 -util=10 Time taken : 22.71, 22.19(2.29)
--cpu=96 -util=20 Time taken : 12.14, 12.37(-1.89)
--cpu=96 -util=30 Time taken : 8.76, 8.86(-1.14)
--cpu=96 -util=40 Time taken : 7.13, 7.14(-0.14)
--cpu=96 -util=50 Time taken : 6.10, 6.13(-0.49)
--cpu=96 -util=60 Time taken : 5.42, 5.41(0.18)
--cpu=96 -util=70 Time taken : 4.94, 4.94(0.00)
--cpu=96 -util=80 Time taken : 4.56, 4.53(0.66)
--cpu=96 -util=90 Time taken : 4.27, 4.26(0.23)
Good improvement seen with 24 CPUs. In this case only one CPU is busy,
and no core is idle. Decent improvement with 100% utilization case. no
difference in other utilization.
Fixes: b1bfeab9b0 ("sched/fair: Consider the idle state of the whole core for load balance")
Signed-off-by: Shrikanth Hegde <sshegde@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230902081204.232218-1-sshegde@linux.vnet.ibm.com
The following commit deserves special mention:
22dc02f81c Revert "sched/fair: Move unused stub functions to header"
This is in x86/cleanups, because the revert is a re-application of a
number of cleanups that got removed inadvertedly.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDkoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1jCMw//UvQGM8yxsTa57r0/ZpJHS2++P5pJxOsz
45kBb3aBiDV6idArce4EHpthp3MvF3Pycibp9w0qg//NOtIHTKeagXv52abxsu1W
hmS6gXJZDXZvjO1BFaUlmv97iYtzGfKnQppj32C4tMr9SaP49h3KvOHH1Z8CR3mP
1nZaJJwYIi2qBh7msnmLGG+F0drb85O/dfHdoLX6iVJw9UP4n5nu9u8u1E0iC7J7
2GC6AwP60A0EBRTK9EHQQEYwy9uvdS/TG5f2Qk1VP87KA9TTocs8MyapMG4DQu79
hZKVEGuVQAlV3rYe9cJBNpDx1mTu3rmuMH0G71KEe3T6UcG5QRUiAPm8UfA9prPD
uWjY4zm5o0W3tUio4V1MqqiLFIaBU63WmTY9RyM0QH8Ms8r8GugWKmnrTIuHfEC3
9D+Uhyb5d8ID6qFGLTOvPm0g+v64lnH71qq83PcVJgsmZvUb2XvFA3d/A0h9JzLT
2In/yfU10UsLUFTiNRyAgcLccjaGhliDB2oke9Kp0OyOTSQRcWmiq8kByVxCPImP
auOWWcNXjcuOgjlnziEkMTDuRY12MgUB2If4zhELvdEFibIaaNW5sNCbY2msWaN1
CUD7fcj0L3HZvzujUm72l5hxL2brJMuPwVNJfuOe4T8wzy569d6VJULrd1URBM1B
vfaPs1Dz46Q=
=kiAA
-----END PGP SIGNATURE-----
Merge tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 cleanups from Ingo Molnar:
"The following commit deserves special mention:
22dc02f81c Revert "sched/fair: Move unused stub functions to header"
This is in x86/cleanups, because the revert is a re-application of a
number of cleanups that got removed inadvertedly"
[ This also effectively undoes the amd_check_microcode() microcode
declaration change I had done in my microcode loader merge in commit
42a7f6e3ff ("Merge tag 'x86_microcode_for_v6.6_rc1' [...]").
I picked the declaration change by Arnd from this branch instead,
which put it in <asm/processor.h> instead of <asm/microcode.h> like I
had done in my merge resolution - Linus ]
* tag 'x86-cleanups-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/uv: Refactor code using deprecated strncpy() interface to use strscpy()
x86/hpet: Refactor code using deprecated strncpy() interface to use strscpy()
x86/platform/uv: Refactor code using deprecated strcpy()/strncpy() interfaces to use strscpy()
x86/qspinlock-paravirt: Fix missing-prototype warning
x86/paravirt: Silence unused native_pv_lock_init() function warning
x86/alternative: Add a __alt_reloc_selftest() prototype
x86/purgatory: Include header for warn() declaration
x86/asm: Avoid unneeded __div64_32 function definition
Revert "sched/fair: Move unused stub functions to header"
x86/apic: Hide unused safe_smp_processor_id() on 32-bit UP
x86/cpu: Fix amd_check_microcode() declaration
- The biggest change is introduction of a new iteration of the
SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
Deadline First") scheduler.
EEVDF too is a virtual-time scheduler, with two parameters (weight
and relative deadline), compared to CFS that had weight only.
It completely reworks the base scheduler: placement, preemption,
picking -- everything.
LWN.net, as usual, has a terrific writeup about EEVDF:
https://lwn.net/Articles/925371/
Preemption (both tick and wakeup) is driven by testing against
a fresh pick. Because the tree is now effectively an interval
tree, and the selection is no longer the 'leftmost' task,
over-scheduling is less of a problem. A lot of the CFS
heuristics are removed or replaced by more natural latency-space
parameters & constructs.
In terms of expected performance regressions: we'll and can fix
everything where a 'good' workload misbehaves with the new scheduler,
but EEVDF inevitably changes workload scheduling in a binary fashion,
hopefully for the better in the overwhelming majority of cases,
but in some cases it won't, especially in adversarial loads that
got lucky with the previous code, such as some variants of hackbench.
We are trying hard to err on the side of fixing all performance
regressions, but we expect some inevitable post-release iterations
of that process.
- Improve load-balancing on hybrid x86 systems: enable cluster
scheduling (again).
- Improve & fix bandwidth-scheduling on nohz systems.
- Improve bandwidth-throttling.
- Use lock guards to simplify and de-goto-ify control flow.
- Misc improvements, cleanups and fixes.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmTtDOgRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1iS4g//b9yewVW9OPxetKoN8zIJA0TjFYuuOVHK
BlCJi5dbzXeCTrtENI65BRA7kPbTQ3AjwLRQ2BallAZ4dJceK0RhlZJvcrMNsm4e
Adcpoch/FbqPKCrtAJQY04Ln1B244n/KyVifYett9220dMgTFQGJJYxrTc2G2+Kp
F44vdUHzRczIE+KeOgBild1CwfKv5Zn5xgaXgtuoPLZtWBE0C1fSSzbK/PTINcUx
bS4NVxK0CpOqSiNjnugV8KsYb71/0U6IgShBVjfHsrlBYigOH2NbVTH5xyjF8f83
WxiGstlhxj+N6Kv4L6FOJIAr2BIggH82j3FaPACmv4c8pzEoBBbvlAJkfinLEgbn
Povg3OF2t6uZ8NoHjeu3WxOjBsphbpkFz7H5nno1ibXSIR/JyUH5MdBPSx93QITB
QoUKQpr/L8zWauWDOEzSaJjEsZbl8rkcIVq5Bk0bR3qn2xkZsIeVte+vCEu3+tBc
b4JOZjq7AuPDqPnsBLvuyiFZ7zwsAfm+pOD5UF3/zbLjPn1N/7wTNQZ29zjc04jl
SifpCZGgF1KlG8m8wNTlSfVvq0ksppCzJt+C6VFuejZ191IGpirQHn4Vp0sluMhC
WRzXhb7v37Bq5JY10GMfeKb/jAiRs68kozhzqVPsBSAPS6I6jJssONgedq+LbQdC
tFsmE9n09do=
=XtCD
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
- The biggest change is introduction of a new iteration of the
SCHED_FAIR interactivity code: the EEVDF ("Earliest Eligible Virtual
Deadline First") scheduler
EEVDF too is a virtual-time scheduler, with two parameters (weight
and relative deadline), compared to CFS that had weight only. It
completely reworks the base scheduler: placement, preemption, picking
-- everything
LWN.net, as usual, has a terrific writeup about EEVDF:
https://lwn.net/Articles/925371/
Preemption (both tick and wakeup) is driven by testing against a
fresh pick. Because the tree is now effectively an interval tree, and
the selection is no longer the 'leftmost' task, over-scheduling is
less of a problem. A lot of the CFS heuristics are removed or
replaced by more natural latency-space parameters & constructs
In terms of expected performance regressions: we will and can fix
everything where a 'good' workload misbehaves with the new scheduler,
but EEVDF inevitably changes workload scheduling in a binary fashion,
hopefully for the better in the overwhelming majority of cases, but
in some cases it won't, especially in adversarial loads that got
lucky with the previous code, such as some variants of hackbench. We
are trying hard to err on the side of fixing all performance
regressions, but we expect some inevitable post-release iterations of
that process
- Improve load-balancing on hybrid x86 systems: enable cluster
scheduling (again)
- Improve & fix bandwidth-scheduling on nohz systems
- Improve bandwidth-throttling
- Use lock guards to simplify and de-goto-ify control flow
- Misc improvements, cleanups and fixes
* tag 'sched-core-2023-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (43 commits)
sched/eevdf/doc: Modify the documented knob to base_slice_ns as well
sched/eevdf: Curb wakeup-preemption
sched: Simplify sched_core_cpu_{starting,deactivate}()
sched: Simplify try_steal_cookie()
sched: Simplify sched_tick_remote()
sched: Simplify sched_exec()
sched: Simplify ttwu()
sched: Simplify wake_up_if_idle()
sched: Simplify: migrate_swap_stop()
sched: Simplify sysctl_sched_uclamp_handler()
sched: Simplify get_nohz_timer_target()
sched/rt: sysctl_sched_rr_timeslice show default timeslice after reset
sched/rt: Fix sysctl_sched_rr_timeslice intial value
sched/fair: Block nohz tick_stop when cfs bandwidth in use
sched, cgroup: Restore meaning to hierarchical_quota
MAINTAINERS: Add Peter explicitly to the psi section
sched/psi: Select KERNFS as needed
sched/topology: Align group flags when removing degenerate domain
sched/fair: remove util_est boosting
sched/fair: Propagate enqueue flags into place_entity()
...
Mike and others noticed that EEVDF does like to over-schedule quite a
bit -- which does hurt performance of a number of benchmarks /
workloads.
In particular, what seems to cause over-scheduling is that when lag is
of the same order (or larger) than the request / slice then placement
will not only cause the task to be placed left of current, but also
with a smaller deadline than current, which causes immediate
preemption.
[ notably, lag bounds are relative to HZ ]
Mike suggested we stick to picking 'current' for as long as it's
eligible to run, giving it uninterrupted runtime until it reaches
parity with the pack.
Augment Mike's suggestion by only allowing it to exhaust it's initial
request.
One random data point:
echo NO_RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
3,723,554 context-switches ( +- 0.56% )
9.5136 +- 0.0394 seconds time elapsed ( +- 0.41% )
echo RUN_TO_PARITY > /debug/sched/features
perf stat -a -e context-switches --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000
2,556,535 context-switches ( +- 0.51% )
9.2427 +- 0.0302 seconds time elapsed ( +- 0.33% )
Suggested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230816134059.GC982867@hirez.programming.kicks-ass.net
Use guards to reduce gotos and simplify control flow.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/20230801211811.828443100@infradead.org
The sched_rr_timeslice can be reset to default by writing value that is
<= 0. However after reading from this file we always got the last value
written, which is not useful at all.
$ echo -1 > /proc/sys/kernel/sched_rr_timeslice_ms
$ cat /proc/sys/kernel/sched_rr_timeslice_ms
-1
Fix this by setting the variable that holds the sysctl file value to the
jiffies_to_msecs(RR_TIMESLICE) in case that <= 0 value was written.
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-3-chrubis@suse.cz
There is a 10% rounding error in the intial value of the
sysctl_sched_rr_timeslice with CONFIG_HZ_300=y.
This was found with LTP test sched_rr_get_interval01:
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
sched_rr_get_interval01.c:57: TPASS: sched_rr_get_interval() passed
sched_rr_get_interval01.c:64: TPASS: Time quantum 0s 99999990ns
sched_rr_get_interval01.c:72: TFAIL: /proc/sys/kernel/sched_rr_timeslice_ms != 100 got 90
What this test does is to compare the return value from the
sched_rr_get_interval() and the sched_rr_timeslice_ms sysctl file and
fails if they do not match.
The problem it found is the intial sysctl file value which was computed as:
static int sysctl_sched_rr_timeslice = (MSEC_PER_SEC / HZ) * RR_TIMESLICE;
which works fine as long as MSEC_PER_SEC is multiple of HZ, however it
introduces 10% rounding error for CONFIG_HZ_300:
(MSEC_PER_SEC / HZ) * (100 * HZ / 1000)
(1000 / 300) * (100 * 300 / 1000)
3 * 30 = 90
This can be easily fixed by reversing the order of the multiplication
and division. After this fix we get:
(MSEC_PER_SEC * (100 * HZ / 1000)) / HZ
(1000 * (100 * 300 / 1000)) / 300
(1000 * 30) / 300 = 100
Fixes: 975e155ed8 ("sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds")
Signed-off-by: Cyril Hrubis <chrubis@suse.cz>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
Acked-by: Mel Gorman <mgorman@suse.de>
Tested-by: Petr Vorel <pvorel@suse.cz>
Link: https://lore.kernel.org/r/20230802151906.25258-2-chrubis@suse.cz
Pick up the EEVDF work into the main branch - it's looking good so far.
Conflicts:
kernel/sched/features.h
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CFS bandwidth limits and NOHZ full don't play well together. Tasks
can easily run well past their quotas before a remote tick does
accounting. This leads to long, multi-period stalls before such
tasks can run again. Currently, when presented with these conflicting
requirements the scheduler is favoring nohz_full and letting the tick
be stopped. However, nohz tick stopping is already best-effort, there
are a number of conditions that can prevent it, whereas cfs runtime
bandwidth is expected to be enforced.
Make the scheduler favor bandwidth over stopping the tick by setting
TICK_DEP_BIT_SCHED when the only running task is a cfs task with
runtime limit enabled. We use cfs_b->hierarchical_quota to
determine if the task requires the tick.
Add check in pick_next_task_fair() as well since that is where
we have a handle on the task that is actually going to be running.
Add check in sched_can_stop_tick() to cover some edge cases such
as nr_running going from 2->1 and the 1 remains the running task.
Reviewed-By: Ben Segall <bsegall@google.com>
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230712133357.381137-3-pauld@redhat.com
In cgroupv2 cfs_b->hierarchical_quota is set to -1 for all task
groups due to the previous fix simply taking the min. It should
reflect a limit imposed at that level or by an ancestor. Even
though cgroupv2 does not require child quota to be less than or
equal to that of its ancestors the task group will still be
constrained by such a quota so this should be shown here. Cgroupv1
continues to set this correctly.
In both cases, add initialization when a new task group is created
based on the current parent's value (or RUNTIME_INF in the case of
root_task_group). Otherwise, the field is wrong until a quota is
changed after creation and __cfs_schedulable() is called.
Fixes: c53593e5cb ("sched, cgroup: Don't reject lower cpu.max on ancestors")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230714125746.812891-1-pauld@redhat.com
Revert commit 7aa55f2a59 ("sched/fair: Move unused stub functions to
header"), for while it has the right Changelog, the actual patch
content a revert of the previous 4 patches:
f7df852ad6 ("sched: Make task_vruntime_update() prototype visible")
c0bdfd72fb ("sched/fair: Hide unused init_cfs_bandwidth() stub")
378be384e0 ("sched: Add schedule_user() declaration")
d55ebae3f3 ("sched: Hide unused sched_update_scaling()")
So in effect this is a revert of a revert and re-applies those
patches.
Fixes: 7aa55f2a59 ("sched/fair: Move unused stub functions to header")
Reported-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
The flags of the child of a given scheduling domain are used to initialize
the flags of its scheduling groups. When the child of a scheduling domain
is degenerated, the flags of its local scheduling group need to be updated
to align with the flags of its new child domain.
The flag SD_SHARE_CPUCAPACITY was aligned in
Commit bf2dc42d6b ("sched/topology: Propagate SMT flags when removing degenerate domain").
Further generalize this alignment so other flags can be used later, such as
in cluster-based task wakeup. [1]
Reported-by: Yicong Yang <yangyicong@huawei.com>
Suggested-by: Ricardo Neri <ricardo.neri@intel.com>
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Yicong Yang <yangyicong@hisilicon.com>
Link: https://lore.kernel.org/r/20230713013133.2314153-1-yu.c.chen@intel.com
There is no need to use runnable_avg when estimating util_est and that
even generates wrong behavior because one includes blocked tasks whereas
the other one doesn't. This can lead to accounting twice the waking task p,
once with the blocked runnable_avg and another one when adding its
util_est.
cpu's runnable_avg is already used when computing util_avg which is then
compared with util_est.
In some situation, feec will not select prev_cpu but another one on the
same performance domain because of higher max_util
Fixes: 7d0583cf9e ("sched/fair, cpufreq: Introduce 'runnable boosting'")
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20230706135144.324311-1-vincent.guittot@linaro.org
EEVDF uses this tunable as the base request/slice -- make sure the
name reflects this.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.205287511@infradead.org
EEVDF is a better defined scheduling policy, as a result it has less
heuristics/tunables. There is no compelling reason to keep CFS around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.137187212@infradead.org
Using lag is both more correct and simpler when moving between
runqueues.
Notable, min_vruntime() was invented as a cheap approximation of
avg_vruntime() for this very purpose (SMP migration). Since we now
have the real thing; use it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.068911180@infradead.org
Removes the FAIR_SLEEPERS code in favour of the new LAG based
placement.
Specifically, the whole FAIR_SLEEPER thing was a very crude
approximation to make up for the lack of lag based placement,
specifically the 'service owed' part. This is important for things
like 'starve' and 'hackbench'.
One side effect of FAIR_SLEEPER is that it caused 'small' unfairness,
specifically, by always ignoring up-to 'thresh' sleeptime it would
have a 50%/50% time distribution for a 50% sleeper vs a 100% runner,
while strictly speaking this should (of course) result in a 33%/67%
split (as CFS will also do if the sleep period exceeds 'thresh').
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124604.000198861@infradead.org
Where CFS is currently a WFQ based scheduler with only a single knob,
the weight. The addition of a second, latency oriented parameter,
makes something like WF2Q or EEVDF based a much better fit.
Specifically, EEVDF does EDF like scheduling in the left half of the
tree -- those entities that are owed service. Except because this is a
virtual time scheduler, the deadlines are in virtual time as well,
which is what allows over-subscription.
EEVDF has two parameters:
- weight, or time-slope: which is mapped to nice just as before
- request size, or slice length: which is used to compute
the virtual deadline as: vd_i = ve_i + r_i/w_i
Basically, by setting a smaller slice, the deadline will be earlier
and the task will be more eligible and ran earlier.
Tick driven preemption is driven by request/slice completion; while
wakeup preemption is driven by the deadline.
Because the tree is now effectively an interval tree, and the
selection is no longer 'leftmost', over-scheduling is less of a
problem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.931005524@infradead.org
With the introduction of avg_vruntime, it is possible to approximate
lag (the entire purpose of introducing it in fact). Use this to do lag
based placement over sleep+wake.
Specifically, the FAIR_SLEEPERS thing places things too far to the
left and messes up the deadline aspect of EEVDF.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.794929315@infradead.org
With the introduction of avg_vruntime() there is no need to use worse
approximations. Take the 0-lag point as starting point for inserting
new tasks.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.722361178@infradead.org
In order to move to an eligibility based scheduling policy, we need
to have a better approximation of the ideal scheduler.
Specifically, for a virtual time weighted fair queueing based
scheduler the ideal scheduler will be the weighted average of the
individual virtual runtimes (math in the comment).
As such, compute the weighted average to approximate the ideal
scheduler -- note that the approximation is in the individual task
behaviour, which isn't strictly conformant.
Specifically consider adding a task with a vruntime left of center, in
this case the average will move backwards in time -- something the
ideal scheduler would of course never do.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lore.kernel.org/r/20230531124603.654144274@infradead.org
Add complete_on_current_cpu, wake_up_poll_on_current_cpu helpers to wake
up tasks on the current CPU.
These two helpers are useful when the task needs to make a synchronous context
switch to another task. In this context, synchronous means it wakes up the
target task and falls asleep right after that.
One example of such workloads is seccomp user notifies. This mechanism allows
the supervisor process handles system calls on behalf of a target process.
While the supervisor is handling an intercepted system call, the target process
will be blocked in the kernel, waiting for a response to come back.
On-CPU context switches are much faster than regular ones.
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-4-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
Add WF_CURRENT_CPU wake flag that advices the scheduler to
move the wakee to the current CPU. This is useful for fast on-CPU
context switching use cases.
In addition, make ttwu external rather than static so that
the flag could be passed to it from outside of sched/core.c.
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
Acked-by: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Link: https://lore.kernel.org/r/20230308073201.3102738-3-avagin@google.com
Signed-off-by: Kees Cook <keescook@chromium.org>
select_idle_capacity() not only looks for an idle cpu that fits for the
waking task but also for cpu with highest bandwidth when no cpu fits.
Start the loop with target cpu so it will be selected 1st when no cpu fits
but several cpus shared the same bandwidth. Starting with target cpu
prevents the task to migrate between cpus with same bandwidth at every
wakeup when no cpu fits.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20230711081359.868862-1-vincent.guittot@linaro.org
There have been a case where the SD_SHARE_CPUCAPACITY sched group flag
in a parent domain were not set and propagated properly when a degenerate
domain is removed.
Add dump of domain sched group flags of a CPU to make debug easier
in the future.
Usage:
cat /debug/sched/domains/cpu0/domain1/groups_flags
to dump cpu0 domain1's sched group flags.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <vschneid@redhat.com>
Link: https://lore.kernel.org/r/ed1749262d94d95a8296c86a415999eda90bcfe3.1688770494.git.tim.c.chen@linux.intel.com
should_we_balance() traverses the group_balance_mask (AND'ed with lb_env::
cpus) starting from lower numbered CPUs looking for the first idle CPU.
In hybrid x86 systems, the siblings of SMT cores get CPU numbers, before
non-SMT cores:
[0, 1] [2, 3] [4, 5] 6 7 8 9
b i b i b i b i i i
In the figure above, CPUs in brackets are siblings of an SMT core. The
rest are non-SMT cores. 'b' indicates a busy CPU, 'i' indicates an
idle CPU.
We should let a CPU on a fully idle core get the first chance to idle
load balance as it has more CPU capacity than a CPU on an idle SMT
CPU with busy sibling. So for the figure above, if we are running
should_we_balance() to CPU 1, we should return false to let CPU 7 on
idle core to have a chance first to idle load balance.
A partially busy (i.e., of type group_has_spare) local group with SMT
cores will often have only one SMT sibling busy. If the destination CPU
is a non-SMT core, partially busy, lower-numbered, SMT cores should not
be considered when finding the first idle CPU.
However, in should_we_balance(), when we encounter idle SMT first in partially
busy core, we prematurely break the search for the first idle CPU.
Higher-numbered, non-SMT cores is not given the chance to have
idle balance done on their behalf. Those CPUs will only be considered
for idle balancing by chance via CPU_NEWLY_IDLE.
Instead, consider the idle state of the whole SMT core.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Co-developed-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/807bdd05331378ea3bf5956bda87ded1036ba769.1688770494.git.tim.c.chen@linux.intel.com
In the current prefer sibling load balancing code, there is an implicit
assumption that the busiest sched group and local sched group are
equivalent, hence the tasks to be moved is simply the difference in
number of tasks between the two groups (i.e. imbalance) divided by two.
However, we may have different number of cores between the cluster groups,
say when we take CPU offline or we have hybrid groups. In that case,
we should balance between the two groups such that #tasks/#cores ratio
is the same between the same between both groups. Hence the imbalance
computed will need to reflect this.
Adjust the sibling imbalance computation to take into account of the
above considerations.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/4eacbaa236e680687dae2958378a6173654113df.1688770494.git.tim.c.chen@linux.intel.com
When balancing sibling domains that have different number of cores,
tasks in respective sibling domain should be proportional to the
number of cores in each domain. In preparation of implementing such a
policy, record the number of cores in a scheduling group.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/04641eeb0e95c21224352f5743ecb93dfac44654.1688770494.git.tim.c.chen@linux.intel.com
On hybrid CPUs with scheduling cluster enabled, we will need to
consider balancing between SMT CPU cluster, and Atom core cluster.
Below shows such a hybrid x86 CPU with 4 big cores and 8 atom cores.
Each scheduling cluster span a L2 cache.
--L2-- --L2-- --L2-- --L2-- ----L2---- -----L2------
[0, 1] [2, 3] [4, 5] [5, 6] [7 8 9 10] [11 12 13 14]
Big Big Big Big Atom Atom
core core core core Module Module
If the busiest group is a big core with both SMT CPUs busy, we should
active load balance if destination group has idle CPU cores. Such
condition is considered by asym_active_balance() in load balancing but not
considered when looking for busiest group and computing load imbalance.
Add this consideration in find_busiest_group() and calculate_imbalance().
In addition, update the logic determining the busier group when one group
is SMT and the other group is non SMT but both groups are partially busy
with idle CPU. The busier group should be the group with idle cores rather
than the group with one busy SMT CPU. We do not want to make the SMT group
the busiest one to pull the only task off SMT CPU and causing the whole core to
go empty.
Otherwise suppose in the search for the busiest group, we first encounter
an SMT group with 1 task and set it as the busiest. The destination
group is an atom cluster with 1 task and we next encounter an atom
cluster group with 3 tasks, we will not pick this atom cluster over the
SMT group, even though we should. As a result, we do not load balance
the busier Atom cluster (with 3 tasks) towards the local atom cluster
(with 1 task). And it doesn't make sense to pick the 1 task SMT group
as the busier group as we also should not pull task off the SMT towards
the 1 task atom cluster and make the SMT core completely empty.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/e24f35d142308790f69be65930b82794ef6658a2.1688770494.git.tim.c.chen@linux.intel.com
The static key psi_cgroups_enabled is only used inside file psi.c.
Make it static.
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Suren Baghdasaryan <surenb@google.com>
Link: https://lore.kernel.org/r/20230525103428.49712-1-linmiaohe@huawei.com
As core scheduling introduced, a new state of idle is defined as
force idle, running idle task but nr_running greater than zero.
If a cpu is in force idle state, idle_cpu() will return zero. This
result makes sense in some scenarios, e.g., load balance,
showacpu when dumping, and judge the RCU boost kthread is starving.
But this will cause error in other scenarios, e.g., tick_irq_exit():
When force idle, rq->curr == rq->idle but rq->nr_running > 0, results
that idle_cpu() returns 0. In function tick_irq_exit(), if idle_cpu()
is 0, tick_nohz_irq_exit() will not be called, and ts->idle_active will
not become 1, which became 0 in tick_nohz_irq_enter().
ts->idle_sleeptime won't update in function update_ts_time_stats(), if
ts->idle_active is 0, which should be 1. And this bug will result that
ts->idle_sleeptime is less than the actual value, and finally will
result that the idle time in /proc/stat is less than the actual value.
To solve this problem, we introduce sched_core_idle_cpu(), which
returns 1 when force idle. We audit all users of idle_cpu(), and
change idle_cpu() into sched_core_idle_cpu() in function
tick_irq_exit().
v2-->v3: Only replace idle_cpu() with sched_core_idle_cpu() in
function tick_irq_exit(). And modify the corresponding commit log.
Signed-off-by: Cruz Zhao <CruzZhao@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Reviewed-by: Joel Fernandes <joel@joelfernandes.org>
Link: https://lore.kernel.org/r/1688011324-42406-1-git-send-email-CruzZhao@linux.alibaba.com
We currently export the total throttled time for cgroups that are given
a bandwidth limit. This patch extends this accounting to also account
the total time that each children cgroup has been throttled.
This is useful to understand the degree to which children have been
affected by the throttling control. Children which are not runnable
during the entire throttled period, for example, will not show any
self-throttling time during this period.
Expose this in a new interface, 'cpu.stat.local', which is similar to
how non-hierarchical events are accounted in 'memory.events.local'.
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20230620183247.737942-2-joshdon@google.com