082f764a2f
708 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Mel Gorman
|
082f764a2f |
sched/fair: Do not migrate on wake_affine_weight() if weights are equal
wake_affine_weight() will consider migrating a task to, or near, the current CPU if there is a load imbalance. If the CPUs share LLC then either CPU is valid as a search-for-idle-sibling target and equally appropriate for stacking two tasks on one CPU if an idle sibling is unavailable. If they do not share cache then a cross-node migration potentially impacts locality so while they are equal from a CPU capacity point of view, they are not equal in terms of memory locality. In either case, it's more appropriate to migrate only if there is a difference in their effective load. This patch modifies wake_affine_weight() to only consider migrating a task if there is a load imbalance for normal wakeups but will allow potential stacking if the loads are equal and it's a sync wakeup. For the most part, the different in performance is marginal. For example, on a 4-socket server running netperf UDP_STREAM on localhost the differences are as follows: 4.15.0 4.15.0 16rc0 noequal-v1r23 Hmean send-64 355.47 ( 0.00%) 349.50 ( -1.68%) Hmean send-128 697.98 ( 0.00%) 693.35 ( -0.66%) Hmean send-256 1328.02 ( 0.00%) 1318.77 ( -0.70%) Hmean send-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%) Hmean send-2048 9637.02 ( 0.00%) 9601.34 ( -0.37%) Hmean send-3312 14355.37 ( 0.00%) 14414.51 ( 0.41%) Hmean send-4096 16464.97 ( 0.00%) 16301.37 ( -0.99%) Hmean send-8192 26722.42 ( 0.00%) 26428.95 ( -1.10%) Hmean send-16384 38137.81 ( 0.00%) 38046.11 ( -0.24%) Hmean recv-64 355.47 ( 0.00%) 349.50 ( -1.68%) Hmean recv-128 697.98 ( 0.00%) 693.35 ( -0.66%) Hmean recv-256 1328.02 ( 0.00%) 1318.77 ( -0.70%) Hmean recv-1024 5051.83 ( 0.00%) 5051.11 ( -0.01%) Hmean recv-2048 9636.95 ( 0.00%) 9601.30 ( -0.37%) Hmean recv-3312 14355.32 ( 0.00%) 14414.48 ( 0.41%) Hmean recv-4096 16464.74 ( 0.00%) 16301.16 ( -0.99%) Hmean recv-8192 26721.63 ( 0.00%) 26428.17 ( -1.10%) Hmean recv-16384 38136.00 ( 0.00%) 38044.88 ( -0.24%) Stddev send-64 7.30 ( 0.00%) 4.75 ( 34.96%) Stddev send-128 15.15 ( 0.00%) 22.38 ( -47.66%) Stddev send-256 13.99 ( 0.00%) 19.14 ( -36.81%) Stddev send-1024 105.73 ( 0.00%) 67.38 ( 36.27%) Stddev send-2048 294.57 ( 0.00%) 223.88 ( 24.00%) Stddev send-3312 302.28 ( 0.00%) 271.74 ( 10.10%) Stddev send-4096 195.92 ( 0.00%) 121.10 ( 38.19%) Stddev send-8192 399.71 ( 0.00%) 563.77 ( -41.04%) Stddev send-16384 1163.47 ( 0.00%) 1103.68 ( 5.14%) Stddev recv-64 7.30 ( 0.00%) 4.75 ( 34.96%) Stddev recv-128 15.15 ( 0.00%) 22.38 ( -47.66%) Stddev recv-256 13.99 ( 0.00%) 19.14 ( -36.81%) Stddev recv-1024 105.73 ( 0.00%) 67.38 ( 36.27%) Stddev recv-2048 294.59 ( 0.00%) 223.89 ( 24.00%) Stddev recv-3312 302.24 ( 0.00%) 271.75 ( 10.09%) Stddev recv-4096 196.03 ( 0.00%) 121.14 ( 38.20%) Stddev recv-8192 399.86 ( 0.00%) 563.65 ( -40.96%) Stddev recv-16384 1163.79 ( 0.00%) 1103.86 ( 5.15%) The difference in overall performance is marginal but note that most measurements are less variable. There were similar observations for other netperf comparisons. hackbench with sockets or threads with processes or threads showed minor difference with some reduction of migration. tbench showed only marginal differences that were within the noise. dbench, regardless of filesystem, showed minor differences all of which are within noise. Multiple machines, both UMA and NUMA were tested without any regressions showing up. The biggest risk with a patch like this is affecting wakeup latencies. However, the schbench load from Facebook which is very sensitive to wakeup latency showed a mixed result with mostly improvements in wakeup latency: 4.15.0 4.15.0 16rc0 noequal-v1r23 Lat 50.00th-qrtle-1 38.00 ( 0.00%) 38.00 ( 0.00%) Lat 75.00th-qrtle-1 49.00 ( 0.00%) 41.00 ( 16.33%) Lat 90.00th-qrtle-1 52.00 ( 0.00%) 50.00 ( 3.85%) Lat 95.00th-qrtle-1 54.00 ( 0.00%) 51.00 ( 5.56%) Lat 99.00th-qrtle-1 63.00 ( 0.00%) 60.00 ( 4.76%) Lat 99.50th-qrtle-1 66.00 ( 0.00%) 61.00 ( 7.58%) Lat 99.90th-qrtle-1 78.00 ( 0.00%) 65.00 ( 16.67%) Lat 50.00th-qrtle-2 38.00 ( 0.00%) 38.00 ( 0.00%) Lat 75.00th-qrtle-2 42.00 ( 0.00%) 43.00 ( -2.38%) Lat 90.00th-qrtle-2 46.00 ( 0.00%) 48.00 ( -4.35%) Lat 95.00th-qrtle-2 49.00 ( 0.00%) 50.00 ( -2.04%) Lat 99.00th-qrtle-2 55.00 ( 0.00%) 57.00 ( -3.64%) Lat 99.50th-qrtle-2 58.00 ( 0.00%) 60.00 ( -3.45%) Lat 99.90th-qrtle-2 65.00 ( 0.00%) 68.00 ( -4.62%) Lat 50.00th-qrtle-4 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-4 45.00 ( 0.00%) 46.00 ( -2.22%) Lat 90.00th-qrtle-4 50.00 ( 0.00%) 50.00 ( 0.00%) Lat 95.00th-qrtle-4 54.00 ( 0.00%) 53.00 ( 1.85%) Lat 99.00th-qrtle-4 61.00 ( 0.00%) 61.00 ( 0.00%) Lat 99.50th-qrtle-4 65.00 ( 0.00%) 64.00 ( 1.54%) Lat 99.90th-qrtle-4 76.00 ( 0.00%) 82.00 ( -7.89%) Lat 50.00th-qrtle-8 48.00 ( 0.00%) 46.00 ( 4.17%) Lat 75.00th-qrtle-8 55.00 ( 0.00%) 54.00 ( 1.82%) Lat 90.00th-qrtle-8 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 95.00th-qrtle-8 63.00 ( 0.00%) 63.00 ( 0.00%) Lat 99.00th-qrtle-8 71.00 ( 0.00%) 69.00 ( 2.82%) Lat 99.50th-qrtle-8 74.00 ( 0.00%) 73.00 ( 1.35%) Lat 99.90th-qrtle-8 98.00 ( 0.00%) 90.00 ( 8.16%) Lat 50.00th-qrtle-16 56.00 ( 0.00%) 55.00 ( 1.79%) Lat 75.00th-qrtle-16 68.00 ( 0.00%) 67.00 ( 1.47%) Lat 90.00th-qrtle-16 77.00 ( 0.00%) 78.00 ( -1.30%) Lat 95.00th-qrtle-16 82.00 ( 0.00%) 84.00 ( -2.44%) Lat 99.00th-qrtle-16 90.00 ( 0.00%) 93.00 ( -3.33%) Lat 99.50th-qrtle-16 93.00 ( 0.00%) 97.00 ( -4.30%) Lat 99.90th-qrtle-16 110.00 ( 0.00%) 110.00 ( 0.00%) Lat 50.00th-qrtle-32 68.00 ( 0.00%) 62.00 ( 8.82%) Lat 75.00th-qrtle-32 90.00 ( 0.00%) 83.00 ( 7.78%) Lat 90.00th-qrtle-32 110.00 ( 0.00%) 100.00 ( 9.09%) Lat 95.00th-qrtle-32 122.00 ( 0.00%) 111.00 ( 9.02%) Lat 99.00th-qrtle-32 145.00 ( 0.00%) 133.00 ( 8.28%) Lat 99.50th-qrtle-32 154.00 ( 0.00%) 143.00 ( 7.14%) Lat 99.90th-qrtle-32 2316.00 ( 0.00%) 515.00 ( 77.76%) Lat 50.00th-qrtle-35 69.00 ( 0.00%) 72.00 ( -4.35%) Lat 75.00th-qrtle-35 92.00 ( 0.00%) 95.00 ( -3.26%) Lat 90.00th-qrtle-35 111.00 ( 0.00%) 114.00 ( -2.70%) Lat 95.00th-qrtle-35 122.00 ( 0.00%) 124.00 ( -1.64%) Lat 99.00th-qrtle-35 142.00 ( 0.00%) 144.00 ( -1.41%) Lat 99.50th-qrtle-35 150.00 ( 0.00%) 154.00 ( -2.67%) Lat 99.90th-qrtle-35 6104.00 ( 0.00%) 5640.00 ( 7.60%) Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-4-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
eeb6039863 |
sched/fair: Defer calculation of 'prev_eff_load' in wake_affine_weight() until needed
On sync wakeups, the previous CPU effective load may not be used so delay the calculation until it's needed. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-3-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
7ebb66a12f |
sched/fair: Avoid an unnecessary lookup of current CPU ID during wake_affine
The only caller of wake_affine() knows the CPU ID. Pass it in instead of rechecking it. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Giovanni Gherdovich <ggherdovich@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180213133730.24064-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Vincent Guittot
|
387f77cc82 |
sched/fair: Remove stray space in #ifdef
Remove a useless space in # ifdef and align it with others. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1518512382-29426-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
32e839dda3 |
sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS
The select_idle_sibling() (SIS) rewrite in commit:
|
||
Mel Gorman
|
806486c377 |
sched/fair: Do not migrate if the prev_cpu is idle
wake_affine_idle() prefers to move a task to the current CPU if the wakeup is due to an interrupt. The expectation is that the interrupt data is cache hot and relevant to the waking task as well as avoiding a search. However, there is no way to determine if there was cache hot data on the previous CPU that may exceed the interrupt data. Furthermore, round-robin delivery of interrupts can migrate tasks around a socket where each CPU is under-utilised. This can interact badly with cpufreq which makes decisions based on per-cpu data. It has been observed on machines with HWP that p-states are not boosted to their maximum levels even though the workload is latency and throughput sensitive. This patch uses the previous CPU for the task if it's idle and cache-affine with the current CPU even if the current CPU is idle due to the wakup being related to the interrupt. This reduces migrations at the cost of the interrupt data not being cache hot when the task wakes. A variety of workloads were tested on various machines and no adverse impact was noticed that was outside noise. dbench on ext4 on UMA showed roughly 10% reduction in the number of CPU migrations and it is a case where interrupts are frequent for IO competions. In most cases, the difference in performance is quite small but variability is often reduced. For example, this is the result for pgbench running on a UMA machine with different numbers of clients. 4.15.0-rc9 4.15.0-rc9 baseline waprev-v1 Hmean 1 22096.28 ( 0.00%) 22734.86 ( 2.89%) Hmean 4 74633.42 ( 0.00%) 75496.77 ( 1.16%) Hmean 7 115017.50 ( 0.00%) 113030.81 ( -1.73%) Hmean 12 126209.63 ( 0.00%) 126613.40 ( 0.32%) Hmean 16 131886.91 ( 0.00%) 130844.35 ( -0.79%) Stddev 1 636.38 ( 0.00%) 417.11 ( 34.46%) Stddev 4 614.64 ( 0.00%) 583.24 ( 5.11%) Stddev 7 542.46 ( 0.00%) 435.45 ( 19.73%) Stddev 12 173.93 ( 0.00%) 171.50 ( 1.40%) Stddev 16 671.42 ( 0.00%) 680.30 ( -1.32%) CoeffVar 1 2.88 ( 0.00%) 1.83 ( 36.26%) Note that the different in performance is marginal but for low utilisation, there is less variability. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
3b76c4a339 |
sched/fair: Restructure wake_affine*() to return a CPU id
This is a preparation patch that has wake_affine*() return a CPU ID instead of a boolean. The intent is to allow the wake_affine() helpers to be avoided if a decision is already made. This patch has no functional change. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
89a55f56fd |
sched/fair: Remove unnecessary parameters from wake_affine_idle()
wake_affine_idle() takes parameters it never uses so clean it up. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
2ed41a5502 |
sched/core: Optimize update_stats_*()
These functions are already gated by schedstats_enabled(), there is no point in then issuing another static_branch for every individual update in them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Linus Torvalds
|
af8c5e2d60 |
Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar: "The main changes in this cycle were: - Implement frequency/CPU invariance and OPP selection for SCHED_DEADLINE (Juri Lelli) - Tweak the task migration logic for better multi-tasking workload scalability (Mel Gorman) - Misc cleanups, fixes and improvements" * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/deadline: Make bandwidth enforcement scale-invariant sched/cpufreq: Move arch_scale_{freq,cpu}_capacity() outside of #ifdef CONFIG_SMP sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter sched/cpufreq: Always consider all CPUs when deciding next freq sched/cpufreq: Split utilization signals sched/cpufreq: Change the worker kthread to SCHED_DEADLINE sched/deadline: Move CPU frequency selection triggering points sched/cpufreq: Use the DEADLINE utilization signal sched/deadline: Implement "runtime overrun signal" support sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache sched/fair: Correct obsolete comment about cpufreq_update_util() sched/fair: Remove impossible condition from find_idlest_group_cpu() sched/cpufreq: Don't pass flags to sugov_set_iowait_boost() sched/cpufreq: Initialize sg_cpu->flags to 0 sched/fair: Consider RT/IRQ pressure in capacity_spare_wake() sched/fair: Use 'unsigned long' for utilization, consistently sched/core: Rework and clarify prepare_lock_switch() sched/fair: Remove unused 'curr' parameter from wakeup_gran sched/headers: Constify object_is_on_stack() |
||
Peter Zijlstra
|
ce48c14649 |
sched/core: Fix cpu.max vs. cpuhotplug deadlock
Tejun reported the following cpu-hotplug lock (percpu-rwsem) read recursion: tg_set_cfs_bandwidth() get_online_cpus() cpus_read_lock() cfs_bandwidth_usage_inc() static_key_slow_inc() cpus_read_lock() Reported-by: Tejun Heo <tj@kernel.org> Tested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20180122215328.GP3397@worktop Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
07881166a8 |
sched/deadline: Make bandwidth enforcement scale-invariant
Apply frequency and CPU scale-invariance correction factor to bandwidth enforcement (similar to what we already do to fair utilization tracking). Each delta_exec gets scaled considering current frequency and maximum CPU capacity; which means that the reservation runtime parameter (that need to be specified profiling the task execution at max frequency on biggest capacity core) gets thus scaled accordingly. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Claudio Scordino <claudio@evidence.eu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luca Abeni <luca.abeni@santannapisa.it> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Viresh Kumar <viresh.kumar@linaro.org> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-9-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Juri Lelli
|
7673c8a4c7 |
sched/cpufreq: Remove arch_scale_freq_capacity()'s 'sd' parameter
The 'sd' parameter is never used in arch_scale_freq_capacity() (and it's hard to see where information coming from scheduling domains might help doing frequency invariance scaling). Remove it; also in anticipation of moving arch_scale_freq_capacity() outside CONFIG_SMP. Signed-off-by: Juri Lelli <juri.lelli@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: alessio.balsini@arm.com Cc: bristot@redhat.com Cc: claudio@evidence.eu.com Cc: dietmar.eggemann@arm.com Cc: joelaf@google.com Cc: juri.lelli@redhat.com Cc: luca.abeni@santannapisa.it Cc: mathieu.poirier@linaro.org Cc: morten.rasmussen@arm.com Cc: patrick.bellasi@arm.com Cc: rjw@rjwysocki.net Cc: rostedt@goodmis.org Cc: tkjos@android.com Cc: tommaso.cucinotta@santannapisa.it Cc: vincent.guittot@linaro.org Cc: viresh.kumar@linaro.org Link: http://lkml.kernel.org/r/20171204102325.5110-7-juri.lelli@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Mel Gorman
|
7332dec055 |
sched/fair: Only immediately migrate tasks due to interrupts if prev and target CPUs share cache
If waking from an idle CPU due to an interrupt then it's possible that the waker task will be pulled to wake on the current CPU. Unfortunately, depending on the type of interrupt and IRQ configuration, there may not be a strong relationship between the CPU an interrupt was delivered on and the CPU a task was running on. For example, the interrupts could all be delivered to CPUs on one particular node due to the machine topology or IRQ affinity configuration. Another example is an interrupt for an IO completion which can be delivered to any CPU where there is no guarantee the data is either cache hot or even local. This patch was motivated by the observation that an IO workload was being pulled cross-node on a frequent basis when IO completed. From a wakeup latency perspective, it's still useful to know that an idle CPU is immediately available for use but lets only consider an automatic migration if the CPUs share cache to limit damage due to NUMA migrations. Migrations may still occur if wake_affine_weight determines it's appropriate. These are the throughput results for dbench running on ext4 comparing 4.15-rc3 and this patch on a 2-socket machine where interrupts due to IO completions can happen on any CPU. 4.15.0-rc3 4.15.0-rc3 vanilla lessmigrate Hmean 1 854.64 ( 0.00%) 865.01 ( 1.21%) Hmean 2 1229.60 ( 0.00%) 1274.44 ( 3.65%) Hmean 4 1591.81 ( 0.00%) 1628.08 ( 2.28%) Hmean 8 1845.04 ( 0.00%) 1831.80 ( -0.72%) Hmean 16 2038.61 ( 0.00%) 2091.44 ( 2.59%) Hmean 32 2327.19 ( 0.00%) 2430.29 ( 4.43%) Hmean 64 2570.61 ( 0.00%) 2568.54 ( -0.08%) Hmean 128 2481.89 ( 0.00%) 2499.28 ( 0.70%) Stddev 1 14.31 ( 0.00%) 5.35 ( 62.65%) Stddev 2 21.29 ( 0.00%) 11.09 ( 47.92%) Stddev 4 7.22 ( 0.00%) 6.80 ( 5.92%) Stddev 8 26.70 ( 0.00%) 9.41 ( 64.76%) Stddev 16 22.40 ( 0.00%) 20.01 ( 10.70%) Stddev 32 45.13 ( 0.00%) 44.74 ( 0.85%) Stddev 64 93.10 ( 0.00%) 93.18 ( -0.09%) Stddev 128 184.28 ( 0.00%) 177.85 ( 3.49%) Note the small increase in throughput for low thread counts but also note that the standard deviation for each sample during the test run is lower. The throughput figures for dbench can be misleading so the benchmark is actually modified to time the latency of the processing of one load file with many samples taken. The difference in latency is 4.15.0-rc3 4.15.0-rc3 vanilla lessmigrate Amean 1 21.71 ( 0.00%) 21.47 ( 1.08%) Amean 2 30.89 ( 0.00%) 29.58 ( 4.26%) Amean 4 47.54 ( 0.00%) 46.61 ( 1.97%) Amean 8 82.71 ( 0.00%) 82.81 ( -0.12%) Amean 16 149.45 ( 0.00%) 145.01 ( 2.97%) Amean 32 265.49 ( 0.00%) 248.43 ( 6.42%) Amean 64 463.23 ( 0.00%) 463.55 ( -0.07%) Amean 128 933.97 ( 0.00%) 935.50 ( -0.16%) Stddev 1 1.58 ( 0.00%) 1.54 ( 2.26%) Stddev 2 2.84 ( 0.00%) 2.95 ( -4.15%) Stddev 4 6.78 ( 0.00%) 6.85 ( -0.99%) Stddev 8 16.85 ( 0.00%) 16.37 ( 2.85%) Stddev 16 41.59 ( 0.00%) 41.04 ( 1.32%) Stddev 32 111.05 ( 0.00%) 105.11 ( 5.35%) Stddev 64 285.94 ( 0.00%) 288.01 ( -0.72%) Stddev 128 803.39 ( 0.00%) 809.73 ( -0.79%) It's a small improvement which is not surprising given that migrations that migrate to a different node as not that common. However, it is noticeable in the CPU migration statistics which are reduced by 24%. There was a query for v1 of this patch about NAS so here are the results for C-class using MPI for parallelisation on the same machine nas-mpi 4.15.0-rc3 4.15.0-rc3 vanilla noirq Time cg.C 24.25 ( 0.00%) 23.17 ( 4.45%) Time ep.C 8.22 ( 0.00%) 8.29 ( -0.85%) Time ft.C 22.67 ( 0.00%) 20.34 ( 10.28%) Time is.C 1.42 ( 0.00%) 1.47 ( -3.52%) Time lu.C 55.62 ( 0.00%) 54.81 ( 1.46%) Time mg.C 7.93 ( 0.00%) 7.91 ( 0.25%) 4.15.0-rc3 4.15.0-rc3 vanilla noirq-v1r1 User 3799.96 3748.34 System 672.10 626.15 Elapsed 91.91 79.49 lu.C sees a small gain, ft.C a large gain and ep.C and is.C see small regressions but in terms of absolute time, the difference is small and likely within run-to-run variance. System CPU usage is slightly reduced. schbench from Facebook was also requested. This is a bit of a mixed bag but it's important to note that this workload should not be heavily impacted by wakeups from interrupt context. 4.15.0-rc3 4.15.0-rc3 vanilla noirq-v1r1 Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%) Lat 90.00th-qrtle-1 43.00 ( 0.00%) 44.00 ( -2.33%) Lat 95.00th-qrtle-1 44.00 ( 0.00%) 46.00 ( -4.55%) Lat 99.00th-qrtle-1 57.00 ( 0.00%) 58.00 ( -1.75%) Lat 99.50th-qrtle-1 59.00 ( 0.00%) 59.00 ( 0.00%) Lat 99.90th-qrtle-1 67.00 ( 0.00%) 78.00 ( -16.42%) Lat 50.00th-qrtle-2 40.00 ( 0.00%) 51.00 ( -27.50%) Lat 75.00th-qrtle-2 45.00 ( 0.00%) 56.00 ( -24.44%) Lat 90.00th-qrtle-2 53.00 ( 0.00%) 59.00 ( -11.32%) Lat 95.00th-qrtle-2 57.00 ( 0.00%) 61.00 ( -7.02%) Lat 99.00th-qrtle-2 67.00 ( 0.00%) 71.00 ( -5.97%) Lat 99.50th-qrtle-2 69.00 ( 0.00%) 74.00 ( -7.25%) Lat 99.90th-qrtle-2 83.00 ( 0.00%) 77.00 ( 7.23%) Lat 50.00th-qrtle-4 51.00 ( 0.00%) 51.00 ( 0.00%) Lat 75.00th-qrtle-4 57.00 ( 0.00%) 56.00 ( 1.75%) Lat 90.00th-qrtle-4 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 95.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%) Lat 99.00th-qrtle-4 73.00 ( 0.00%) 72.00 ( 1.37%) Lat 99.50th-qrtle-4 76.00 ( 0.00%) 74.00 ( 2.63%) Lat 99.90th-qrtle-4 85.00 ( 0.00%) 78.00 ( 8.24%) Lat 50.00th-qrtle-8 54.00 ( 0.00%) 58.00 ( -7.41%) Lat 75.00th-qrtle-8 59.00 ( 0.00%) 62.00 ( -5.08%) Lat 90.00th-qrtle-8 65.00 ( 0.00%) 66.00 ( -1.54%) Lat 95.00th-qrtle-8 67.00 ( 0.00%) 70.00 ( -4.48%) Lat 99.00th-qrtle-8 78.00 ( 0.00%) 79.00 ( -1.28%) Lat 99.50th-qrtle-8 81.00 ( 0.00%) 80.00 ( 1.23%) Lat 99.90th-qrtle-8 116.00 ( 0.00%) 83.00 ( 28.45%) Lat 50.00th-qrtle-16 65.00 ( 0.00%) 64.00 ( 1.54%) Lat 75.00th-qrtle-16 77.00 ( 0.00%) 71.00 ( 7.79%) Lat 90.00th-qrtle-16 83.00 ( 0.00%) 82.00 ( 1.20%) Lat 95.00th-qrtle-16 87.00 ( 0.00%) 87.00 ( 0.00%) Lat 99.00th-qrtle-16 95.00 ( 0.00%) 96.00 ( -1.05%) Lat 99.50th-qrtle-16 99.00 ( 0.00%) 103.00 ( -4.04%) Lat 99.90th-qrtle-16 104.00 ( 0.00%) 122.00 ( -17.31%) Lat 50.00th-qrtle-32 71.00 ( 0.00%) 73.00 ( -2.82%) Lat 75.00th-qrtle-32 91.00 ( 0.00%) 92.00 ( -1.10%) Lat 90.00th-qrtle-32 108.00 ( 0.00%) 107.00 ( 0.93%) Lat 95.00th-qrtle-32 118.00 ( 0.00%) 115.00 ( 2.54%) Lat 99.00th-qrtle-32 134.00 ( 0.00%) 129.00 ( 3.73%) Lat 99.50th-qrtle-32 138.00 ( 0.00%) 133.00 ( 3.62%) Lat 99.90th-qrtle-32 149.00 ( 0.00%) 146.00 ( 2.01%) Lat 50.00th-qrtle-39 83.00 ( 0.00%) 81.00 ( 2.41%) Lat 75.00th-qrtle-39 105.00 ( 0.00%) 102.00 ( 2.86%) Lat 90.00th-qrtle-39 120.00 ( 0.00%) 119.00 ( 0.83%) Lat 95.00th-qrtle-39 129.00 ( 0.00%) 128.00 ( 0.78%) Lat 99.00th-qrtle-39 153.00 ( 0.00%) 149.00 ( 2.61%) Lat 99.50th-qrtle-39 166.00 ( 0.00%) 156.00 ( 6.02%) Lat 99.90th-qrtle-39 12304.00 ( 0.00%) 12848.00 ( -4.42%) When heavily loaded (e.g. 99.50th-qrtle-39 indicates 39 threads), there are small gains in many cases. Otherwise it depends on the quartile used where it can be bad -- e.g. 75.00th-qrtle-2. However, even these results are probably a co-incidence. For this workload, much depends on what node the threads get placed on and their relative locality and not wakeups from interrupt context. A larger component on how it behaves would be automatic NUMA balancing where a fault incurred to measure locality would be a much larger contributer to latency than the wakeup path. This is the results from an almost identical machine that happened to run the same test. They only differ in terms of storage which is irrelevant for this test. 4.15.0-rc3 4.15.0-rc3 vanilla noirq-v1r1 Lat 50.00th-qrtle-1 41.00 ( 0.00%) 41.00 ( 0.00%) Lat 75.00th-qrtle-1 42.00 ( 0.00%) 42.00 ( 0.00%) Lat 90.00th-qrtle-1 44.00 ( 0.00%) 43.00 ( 2.27%) Lat 95.00th-qrtle-1 53.00 ( 0.00%) 45.00 ( 15.09%) Lat 99.00th-qrtle-1 59.00 ( 0.00%) 58.00 ( 1.69%) Lat 99.50th-qrtle-1 60.00 ( 0.00%) 59.00 ( 1.67%) Lat 99.90th-qrtle-1 86.00 ( 0.00%) 61.00 ( 29.07%) Lat 50.00th-qrtle-2 52.00 ( 0.00%) 41.00 ( 21.15%) Lat 75.00th-qrtle-2 57.00 ( 0.00%) 46.00 ( 19.30%) Lat 90.00th-qrtle-2 60.00 ( 0.00%) 53.00 ( 11.67%) Lat 95.00th-qrtle-2 62.00 ( 0.00%) 57.00 ( 8.06%) Lat 99.00th-qrtle-2 73.00 ( 0.00%) 68.00 ( 6.85%) Lat 99.50th-qrtle-2 74.00 ( 0.00%) 71.00 ( 4.05%) Lat 99.90th-qrtle-2 90.00 ( 0.00%) 75.00 ( 16.67%) Lat 50.00th-qrtle-4 57.00 ( 0.00%) 52.00 ( 8.77%) Lat 75.00th-qrtle-4 60.00 ( 0.00%) 58.00 ( 3.33%) Lat 90.00th-qrtle-4 62.00 ( 0.00%) 62.00 ( 0.00%) Lat 95.00th-qrtle-4 65.00 ( 0.00%) 65.00 ( 0.00%) Lat 99.00th-qrtle-4 76.00 ( 0.00%) 75.00 ( 1.32%) Lat 99.50th-qrtle-4 77.00 ( 0.00%) 77.00 ( 0.00%) Lat 99.90th-qrtle-4 87.00 ( 0.00%) 81.00 ( 6.90%) Lat 50.00th-qrtle-8 59.00 ( 0.00%) 57.00 ( 3.39%) Lat 75.00th-qrtle-8 63.00 ( 0.00%) 62.00 ( 1.59%) Lat 90.00th-qrtle-8 66.00 ( 0.00%) 67.00 ( -1.52%) Lat 95.00th-qrtle-8 68.00 ( 0.00%) 70.00 ( -2.94%) Lat 99.00th-qrtle-8 79.00 ( 0.00%) 80.00 ( -1.27%) Lat 99.50th-qrtle-8 80.00 ( 0.00%) 84.00 ( -5.00%) Lat 99.90th-qrtle-8 84.00 ( 0.00%) 90.00 ( -7.14%) Lat 50.00th-qrtle-16 65.00 ( 0.00%) 65.00 ( 0.00%) Lat 75.00th-qrtle-16 77.00 ( 0.00%) 75.00 ( 2.60%) Lat 90.00th-qrtle-16 84.00 ( 0.00%) 83.00 ( 1.19%) Lat 95.00th-qrtle-16 88.00 ( 0.00%) 87.00 ( 1.14%) Lat 99.00th-qrtle-16 97.00 ( 0.00%) 96.00 ( 1.03%) Lat 99.50th-qrtle-16 100.00 ( 0.00%) 104.00 ( -4.00%) Lat 99.90th-qrtle-16 110.00 ( 0.00%) 126.00 ( -14.55%) Lat 50.00th-qrtle-32 70.00 ( 0.00%) 71.00 ( -1.43%) Lat 75.00th-qrtle-32 92.00 ( 0.00%) 94.00 ( -2.17%) Lat 90.00th-qrtle-32 110.00 ( 0.00%) 110.00 ( 0.00%) Lat 95.00th-qrtle-32 121.00 ( 0.00%) 118.00 ( 2.48%) Lat 99.00th-qrtle-32 135.00 ( 0.00%) 137.00 ( -1.48%) Lat 99.50th-qrtle-32 140.00 ( 0.00%) 146.00 ( -4.29%) Lat 99.90th-qrtle-32 150.00 ( 0.00%) 160.00 ( -6.67%) Lat 50.00th-qrtle-39 80.00 ( 0.00%) 71.00 ( 11.25%) Lat 75.00th-qrtle-39 102.00 ( 0.00%) 91.00 ( 10.78%) Lat 90.00th-qrtle-39 118.00 ( 0.00%) 108.00 ( 8.47%) Lat 95.00th-qrtle-39 128.00 ( 0.00%) 117.00 ( 8.59%) Lat 99.00th-qrtle-39 149.00 ( 0.00%) 133.00 ( 10.74%) Lat 99.50th-qrtle-39 160.00 ( 0.00%) 139.00 ( 13.12%) Lat 99.90th-qrtle-39 13808.00 ( 0.00%) 4920.00 ( 64.37%) Despite being nearly identical, it showed a variety of major gains so I'm not convinced that heavy emphasis should be placed on this particular workload in terms of evaluating this particular patch. Further evidence of this is the fact that testing on a UMA machine showed small gains/losses even though the patch should be a no-op on UMA. Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171219085947.13136-2-mgorman@techsingularity.net Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Joel Fernandes
|
9783be2c0e |
sched/fair: Correct obsolete comment about cpufreq_update_util()
Since the remote cpufreq callback work, the cpufreq_update_util() call can happen from remote CPUs. The comment about local CPUs is thus obsolete. Update it accordingly. Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Viresh Kumar <viresh.kumar@linaro.org> Cc: Android Kernel <kernel-team@android.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: EAS Dev <eas-dev@lists.linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rohit Jain <rohit.k.jain@oracle.com> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20171215153944.220146-2-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Joel Fernandes
|
18cec7e0dd |
sched/fair: Remove impossible condition from find_idlest_group_cpu()
find_idlest_group_cpu() goes through CPUs of a group previous selected by find_idlest_group(). find_idlest_group() returns NULL if the local group is the selected one and doesn't execute find_idlest_group_cpu if the group to which 'cpu' belongs to is chosen. So we're always guaranteed to call find_idlest_group_cpu() with a group to which 'cpu' is non-local. This makes one of the conditions in find_idlest_group_cpu() an impossible one, which we can get rid off. Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Android Kernel <kernel-team@android.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: EAS Dev <eas-dev@lists.linaro.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Rohit Jain <rohit.k.jain@oracle.com> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171215153944.220146-3-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Joel Fernandes
|
f453ae2200 |
sched/fair: Consider RT/IRQ pressure in capacity_spare_wake()
capacity_spare_wake() in the slow path influences choice of idlest groups, as we search for groups with maximum spare capacity. In scenarios where RT pressure is high, a sub optimal group can be chosen and hurt performance of the task being woken up. Fix this by using capacity_of() instead of capacity_orig_of() in capacity_spare_wake(). Tests results from improvements with this change are below. More tests were also done by myself and Matt Fleming to ensure no degradation in different benchmarks. 1) Rohit ran barrier.c test (details below) with following improvements: ------------------------------------------------------------------------ This was Rohit's original use case for a patch he posted at [1] however from his recent tests he showed my patch can replace his slow path changes [1] and there's no need to selectively scan/skip CPUs in find_idlest_group_cpu in the slow path to get the improvement he sees. barrier.c (open_mp code) as a micro-benchmark. It does a number of iterations and barrier sync at the end of each for loop. Here barrier,c is running in along with ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' barrier.c can be found at: http://www.spinics.net/lists/kernel/msg2506955.html Following are the results for the iterations per second with this micro-benchmark (higher is better), on a 44 core, 2 socket 88 Threads Intel x86 machine: +--------+------------------+---------------------------+ |Threads | Without patch | With patch | | | | | +--------+--------+---------+-----------------+---------+ | | Mean | Std Dev | Mean | Std Dev | +--------+--------+---------+-----------------+---------+ |1 | 539.36 | 60.16 | 572.54 (+6.15%) | 40.95 | |2 | 481.01 | 19.32 | 530.64 (+10.32%)| 56.16 | |4 | 474.78 | 22.28 | 479.46 (+0.99%) | 18.89 | |8 | 450.06 | 24.91 | 447.82 (-0.50%) | 12.36 | |16 | 436.99 | 22.57 | 441.88 (+1.12%) | 7.39 | |32 | 388.28 | 55.59 | 429.4 (+10.59%)| 31.14 | |64 | 314.62 | 6.33 | 311.81 (-0.89%) | 11.99 | +--------+--------+---------+-----------------+---------+ 2) ping+hackbench test on bare-metal sever (by Rohit) ----------------------------------------------------- Here hackbench is running in threaded mode along with, running ping on CPU 0 and 1 as: 'ping -l 10000 -q -s 10 -f hostX' This test is running on 2 socket, 20 core and 40 threads Intel x86 machine: Number of loops is 10000 and runtime is in seconds (Lower is better). +--------------+-----------------+--------------------------+ |Task Groups | Without patch | With patch | | +-------+---------+----------------+---------+ |(Groups of 40)| Mean | Std Dev | Mean | Std Dev | +--------------+-------+---------+----------------+---------+ |1 | 0.851 | 0.007 | 0.828 (+2.77%)| 0.032 | |2 | 1.083 | 0.203 | 1.087 (-0.37%)| 0.246 | |4 | 1.601 | 0.051 | 1.611 (-0.62%)| 0.055 | |8 | 2.837 | 0.060 | 2.827 (+0.35%)| 0.031 | |16 | 5.139 | 0.133 | 5.107 (+0.63%)| 0.085 | |25 | 7.569 | 0.142 | 7.503 (+0.88%)| 0.143 | +--------------+-------+---------+----------------+---------+ [1] https://patchwork.kernel.org/patch/9991635/ Matt Fleming also ran several different hackbench tests and cyclic test to santiy-check that the patch doesn't harm other usecases. Tested-by: Matt Fleming <matt@codeblueprint.co.uk> Tested-by: Rohit Jain <rohit.k.jain@oracle.com> Signed-off-by: Joel Fernandes <joelaf@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Atish Patra <atish.patra@oracle.com> Cc: Brendan Jackman <brendan.jackman@arm.com> Cc: Chris Redpath <Chris.Redpath@arm.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Juri Lelli <juri.lelli@arm.com> Cc: Len Brown <lenb@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Ramussen <morten.rasmussen@arm.com> Cc: Patrick Bellasi <patrick.bellasi@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J. Wysocki <rjw@rjwysocki.net> Cc: Saravana Kannan <skannan@quicinc.com> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Steve Muckle <smuckle@google.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vikram Mulukutla <markivx@codeaurora.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171214212158.188190-1-joelaf@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Patrick Bellasi
|
f01415fdbf |
sched/fair: Use 'unsigned long' for utilization, consistently
Utilization and capacity are tracked as 'unsigned long', however some functions using them return an 'int' which is ultimately assigned back to 'unsigned long' variables. Since there is not scope on using a different and signed type, consolidate the signature of functions returning utilization to always use the native type. This change improves code consistency, and it also benefits code paths where utilizations should be clamped by avoiding further type conversions or ugly type casts. Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chris Redpath <chris.redpath@arm.com> Reviewed-by: Brendan Jackman <brendan.jackman@arm.com> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Todd Kjos <tkjos@android.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Viresh Kumar <viresh.kumar@linaro.org> Link: http://lkml.kernel.org/r/20171205171018.9203-2-patrick.bellasi@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Cheng Jian
|
a555e9d86e |
sched/fair: Remove unused 'curr' parameter from wakeup_gran
The first parameter of wakeup_gran(), 'curr', is unnecessary now. Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: huawei.libin@huawei.com Cc: xiexiuqi@huawei.com Link: http://lkml.kernel.org/r/1512653443-179848-1-git-send-email-cj.chengjian@huawei.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Vincent Guittot
|
a4c3c04974 |
sched/fair: Update and fix the runnable propagation rule
Unlike running, the runnable part can't be directly propagated through the hierarchy when we migrate a task. The main reason is that runnable time can be shared with other sched_entities that stay on the rq and this runnable time will also remain on prev cfs_rq and must not be removed. Instead, we can estimate what should be the new runnable of the prev cfs_rq and check that this estimation stay in a possible range. The prop_runnable_sum is a good estimation when adding runnable_sum but fails most often when we remove it. Instead, we could use the formula below instead: gcfs_rq's runnable_sum = gcfs_rq->avg.load_sum / gcfs_rq->load.weight which assumes that tasks are equally runnable which is not true but easy to compute. Beside these estimates, we have several simple rules that help us to filter out wrong ones: - ge->avg.runnable_sum <= than LOAD_AVG_MAX - ge->avg.runnable_sum >= ge->avg.running_sum (ge->avg.util_sum << LOAD_AVG_MAX) - ge->avg.runnable_sum can't increase when we detach a task The effect of these fixes is better cgroups balancing. Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ben Segall <bsegall@google.com> Cc: Chris Mason <clm@fb.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yuyang Du <yuyang.du@intel.com> Link: http://lkml.kernel.org/r/1510842112-21028-1-git-send-email-vincent.guittot@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Linus Torvalds
|
22714a2ba4 |
Merge branch 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo: "Cgroup2 cpu controller support is finally merged. - Basic cpu statistics support to allow monitoring by default without the CPU controller enabled. - cgroup2 cpu controller support. - /sys/kernel/cgroup files to help dealing with new / optional features" * 'for-4.15' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: export list of cgroups v2 features using sysfs cgroup: export list of delegatable control files using sysfs cgroup: mark @cgrp __maybe_unused in cpu_stat_show() MAINTAINERS: relocate cpuset.c cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat sched: Implement interface for cgroup unified hierarchy sched: Misc preps for cgroup unified hierarchy interface sched/cputime: Add dummy cputime_adjust() implementation for CONFIG_VIRT_CPU_ACCOUNTING_NATIVE cgroup: statically initialize init_css_set->dfl_cgrp cgroup: Implement cgroup2 basic CPU usage accounting cpuacct: Introduce cgroup_account_cputime[_field]() sched/cputime: Expose cputime_adjust() |
||
Ingo Molnar
|
8a103df440 |
Merge branch 'linus' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Greg Kroah-Hartman
|
b24413180f |
License cleanup: add SPDX GPL-2.0 license identifier to files with no license
Many source files in the tree are missing licensing information, which makes it harder for compliance tools to determine the correct license. By default all files without license information are under the default license of the kernel, which is GPL version 2. Update the files which contain no license information with the 'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally binding shorthand, which can be used instead of the full boiler plate text. This patch is based on work done by Thomas Gleixner and Kate Stewart and Philippe Ombredanne. How this work was done: Patches were generated and checked against linux-4.14-rc6 for a subset of the use cases: - file had no licensing information it it. - file was a */uapi/* one with no licensing information in it, - file was a */uapi/* one with existing licensing information, Further patches will be generated in subsequent months to fix up cases where non-standard license headers were used, and references to license had to be inferred by heuristics based on keywords. The analysis to determine which SPDX License Identifier to be applied to a file was done in a spreadsheet of side by side results from of the output of two independent scanners (ScanCode & Windriver) producing SPDX tag:value files created by Philippe Ombredanne. Philippe prepared the base worksheet, and did an initial spot review of a few 1000 files. The 4.13 kernel was the starting point of the analysis with 60,537 files assessed. Kate Stewart did a file by file comparison of the scanner results in the spreadsheet to determine which SPDX license identifier(s) to be applied to the file. She confirmed any determination that was not immediately clear with lawyers working with the Linux Foundation. Criteria used to select files for SPDX license identifier tagging was: - Files considered eligible had to be source code files. - Make and config files were included as candidates if they contained >5 lines of source - File already had some variant of a license header in it (even if <5 lines). All documentation files were explicitly excluded. The following heuristics were used to determine which SPDX license identifiers to apply. - when both scanners couldn't find any license traces, file was considered to have no license information in it, and the top level COPYING file license applied. For non */uapi/* files that summary was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 11139 and resulted in the first patch in this series. If that file was a */uapi/* path one, it was "GPL-2.0 WITH Linux-syscall-note" otherwise it was "GPL-2.0". Results of that was: SPDX license identifier # files ---------------------------------------------------|------- GPL-2.0 WITH Linux-syscall-note 930 and resulted in the second patch in this series. - if a file had some form of licensing information in it, and was one of the */uapi/* ones, it was denoted with the Linux-syscall-note if any GPL family license was found in the file or had no licensing in it (per prior point). Results summary: SPDX license identifier # files ---------------------------------------------------|------ GPL-2.0 WITH Linux-syscall-note 270 GPL-2.0+ WITH Linux-syscall-note 169 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause) 21 ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) 17 LGPL-2.1+ WITH Linux-syscall-note 15 GPL-1.0+ WITH Linux-syscall-note 14 ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause) 5 LGPL-2.0+ WITH Linux-syscall-note 4 LGPL-2.1 WITH Linux-syscall-note 3 ((GPL-2.0 WITH Linux-syscall-note) OR MIT) 3 ((GPL-2.0 WITH Linux-syscall-note) AND MIT) 1 and that resulted in the third patch in this series. - when the two scanners agreed on the detected license(s), that became the concluded license(s). - when there was disagreement between the two scanners (one detected a license but the other didn't, or they both detected different licenses) a manual inspection of the file occurred. - In most cases a manual inspection of the information in the file resulted in a clear resolution of the license that should apply (and which scanner probably needed to revisit its heuristics). - When it was not immediately clear, the license identifier was confirmed with lawyers working with the Linux Foundation. - If there was any question as to the appropriate license identifier, the file was flagged for further research and to be revisited later in time. In total, over 70 hours of logged manual review was done on the spreadsheet to determine the SPDX license identifiers to apply to the source files by Kate, Philippe, Thomas and, in some cases, confirmation by lawyers working with the Linux Foundation. Kate also obtained a third independent scan of the 4.13 code base from FOSSology, and compared selected files where the other two scanners disagreed against that SPDX file, to see if there was new insights. The Windriver scanner is based on an older version of FOSSology in part, so they are related. Thomas did random spot checks in about 500 files from the spreadsheets for the uapi headers and agreed with SPDX license identifier in the files he inspected. For the non-uapi files Thomas did random spot checks in about 15000 files. In initial set of patches against 4.14-rc6, 3 files were found to have copy/paste license identifier errors, and have been fixed to reflect the correct identifier. Additionally Philippe spent 10 hours this week doing a detailed manual inspection and review of the 12,461 patched files from the initial patch version early this week with: - a full scancode scan run, collecting the matched texts, detected license ids and scores - reviewing anything where there was a license detected (about 500+ files) to ensure that the applied SPDX license was correct - reviewing anything where there was no detection but the patch license was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied SPDX license was correct This produced a worksheet with 20 files needing minor correction. This worksheet was then exported into 3 different .csv files for the different types of files to be modified. These .csv files were then reviewed by Greg. Thomas wrote a script to parse the csv files and add the proper SPDX tag to the file, in the format that the file expected. This script was further refined by Greg based on the output to detect more types of files automatically and to distinguish between header and source .c files (which need different comment types.) Finally Greg ran the script using the .csv files to generate the patches. Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
Frederic Weisbecker
|
de201559df |
sched/isolation: Introduce housekeeping flags
Before we implement isolcpus under housekeeping, we need the isolation features to be more finegrained. For example some people want NOHZ_FULL without the full scheduler isolation, others want full scheduler isolation without NOHZ_FULL. So let's cut all these isolation features piecewise, at the risk of overcutting it right now. We can still merge some flags later if they always make sense together. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1509072159-31808-9-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Frederic Weisbecker
|
204c083a00 |
sched/isolation: Rename is_housekeeping_cpu() to housekeeping_cpu()
Fit it into the housekeeping_*() namespace. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1509072159-31808-7-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Frederic Weisbecker
|
7863406143 |
sched/isolation: Move housekeeping related code to its own file
The housekeeping code is currently tied to the NOHZ code. As we are planning to make housekeeping independent from it, start with moving the relevant code to its own file. Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Acked-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Chris Metcalf <cmetcalf@mellanox.com> Cc: Christoph Lameter <cl@linux.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Luiz Capitulino <lcapitulino@redhat.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Wanpeng Li <kernellwp@gmail.com> Link: http://lkml.kernel.org/r/1509072159-31808-2-git-send-email-frederic@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Brendan Jackman
|
93f50f9024 |
sched/fair: Fix usage of find_idlest_group() when the local group is idlest
find_idlest_group() returns NULL when the local group is idlest. The caller then continues the find_idlest_group() search at a lower level of the current CPU's sched_domain hierarchy. find_idlest_group_cpu() is not consulted and, crucially, @new_cpu is not updated. This means the search is pointless and we return @prev_cpu from select_task_rq_fair(). This is fixed by initialising @new_cpu to @cpu instead of @prev_cpu. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-6-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Brendan Jackman
|
6fee85ccbc |
sched/fair: Fix usage of find_idlest_group() when no groups are allowed
When 'p' is not allowed on any of the CPUs in the sched_domain, we currently return NULL from find_idlest_group(), and pointlessly continue the search on lower sched_domain levels (where 'p' is also not allowed) before returning prev_cpu regardless (as we have not updated new_cpu). Add an explicit check for this case, and add a comment to find_idlest_group(). Now when find_idlest_group() returns NULL, it always means that the local group is allowed and idlest. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-5-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Brendan Jackman
|
0d10ab952e |
sched/fair: Fix find_idlest_group() when local group is not allowed
When the local group is not allowed we do not modify this_*_load from their initial value of 0. That means that the load checks at the end of find_idlest_group cause us to incorrectly return NULL. Fixing the initial values to ULONG_MAX means we will instead return the idlest remote group in that case. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-4-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Brendan Jackman
|
e90381eaec |
sched/fair: Remove unnecessary comparison with -1
Since commit:
|
||
Brendan Jackman
|
18bd1b4bd5 |
sched/fair: Move select_task_rq_fair() slow-path into its own function
In preparation for changes that would otherwise require adding a new level of indentation to the while(sd) loop, create a new function find_idlest_cpu() which contains this loop, and rename the existing find_idlest_cpu() to find_idlest_group_cpu(). Code inside the while(sd) loop is unchanged. @new_cpu is added as a variable in the new function, with the same initial value as the @new_cpu in select_task_rq_fair(). Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Josef Bacik <jbacik@fb.com> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20171005114516.18617-2-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Brendan Jackman
|
583ffd99d7 |
sched/fair: Force balancing on NOHZ balance if local group has capacity
The "goto force_balance" here is intended to mitigate the fact that
avg_load calculations can result in bad placement decisions when
priority is asymmetrical.
The original commit that adds it:
|
||
Brendan Jackman
|
ea16f0ea6c |
sched/fair: Sync task util before slow-path wakeup
We use task_util() in find_idlest_group() via capacity_spare_wake(). This task_util() updated in wake_cap(). However wake_cap() is not the only reason for ending up in find_idlest_group() - we could have been sent there by wake_wide(). So explicitly sync the task util with prev_cpu when we are about to head to find_idlest_group(). We could simply do this at the beginning of select_task_rq_fair() (i.e. irrespective of whether we're heading to select_idle_sibling() or find_idlest_group() & co), but I didn't want to slow down the select_idle_sibling() path more than necessary. Don't do this during fork balancing, we won't need the task_util and we'd just clobber the last_update_time, which is supposed to be 0. Signed-off-by: Brendan Jackman <brendan.jackman@arm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Andres Oportus <andresoportus@google.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Joel Fernandes <joelaf@google.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Morten Rasmussen <morten.rasmussen@arm.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincent Guittot <vincent.guittot@linaro.org> Link: http://lkml.kernel.org/r/20170808095519.10077-1-brendan.jackman@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Uladzislau Rezki
|
93824900a2 |
sched/fair: Search a task from the tail of the queue
As a first step this patch makes cfs_tasks list as MRU one. It means, that when a next task is picked to run on physical CPU it is moved to the front of the list. Therefore, the cfs_tasks list is more or less sorted (except woken tasks) starting from recently given CPU time tasks toward tasks with max wait time in a run-queue, i.e. MRU list. Second, as part of the load balance operation, this approach starts detach_tasks()/detach_one_task() from the tail of the queue instead of the head, giving some advantages: - tends to pick a task with highest wait time; - tasks located in the tail are less likely cache-hot, therefore the can_migrate_task() decision is higher. hackbench illustrates slightly better performance. For example doing 1000 samples and 40 groups on i5-3320M CPU, it shows below figures: default: 0.657 avg patched: 0.646 avg Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kirill Tkhai <tkhai@yandex.ru> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Mike Galbraith <umgwanakikbuti@gmail.com> Cc: Nicolas Pitre <nicolas.pitre@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Paul Turner <pjt@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tim Chen <tim.c.chen@linux.intel.com> Link: http://lkml.kernel.org/r/20170913102430.8985-2-urezki@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Ingo Molnar
|
151aeab777 |
Merge branch 'sched/urgent' into sched/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
024c9d2fae |
sched/core: Ensure load_balance() respects the active_mask
While load_balance() masks the source CPUs against active_mask, it had
a hole against the destination CPU. Ensure the destination CPU is also
part of the 'domain-mask & active-mask' set.
Reported-by: Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes:
|
||
Peter Zijlstra
|
f2cdd9cc6c |
sched/core: Address more wake_affine() regressions
The trivial wake_affine_idle() implementation is very good for a number of workloads, but it comes apart at the moment there are no idle CPUs left, IOW. the overloaded case. hackbench: NO_WA_WEIGHT WA_WEIGHT hackbench-20 : 7.362717561 seconds 6.450509391 seconds (win) netperf: NO_WA_WEIGHT WA_WEIGHT TCP_SENDFILE-1 : Avg: 54524.6 Avg: 52224.3 TCP_SENDFILE-10 : Avg: 48185.2 Avg: 46504.3 TCP_SENDFILE-20 : Avg: 29031.2 Avg: 28610.3 TCP_SENDFILE-40 : Avg: 9819.72 Avg: 9253.12 TCP_SENDFILE-80 : Avg: 5355.3 Avg: 4687.4 TCP_STREAM-1 : Avg: 41448.3 Avg: 42254 TCP_STREAM-10 : Avg: 24123.2 Avg: 25847.9 TCP_STREAM-20 : Avg: 15834.5 Avg: 18374.4 TCP_STREAM-40 : Avg: 5583.91 Avg: 5599.57 TCP_STREAM-80 : Avg: 2329.66 Avg: 2726.41 TCP_RR-1 : Avg: 80473.5 Avg: 82638.8 TCP_RR-10 : Avg: 72660.5 Avg: 73265.1 TCP_RR-20 : Avg: 52607.1 Avg: 52634.5 TCP_RR-40 : Avg: 57199.2 Avg: 56302.3 TCP_RR-80 : Avg: 25330.3 Avg: 26867.9 UDP_RR-1 : Avg: 108266 Avg: 107844 UDP_RR-10 : Avg: 95480 Avg: 95245.2 UDP_RR-20 : Avg: 68770.8 Avg: 68673.7 UDP_RR-40 : Avg: 76231 Avg: 75419.1 UDP_RR-80 : Avg: 34578.3 Avg: 35639.1 UDP_STREAM-1 : Avg: 64684.3 Avg: 66606 UDP_STREAM-10 : Avg: 52701.2 Avg: 52959.5 UDP_STREAM-20 : Avg: 30376.4 Avg: 29704 UDP_STREAM-40 : Avg: 15685.8 Avg: 15266.5 UDP_STREAM-80 : Avg: 8415.13 Avg: 7388.97 (wins and losses) sysbench: NO_WA_WEIGHT WA_WEIGHT sysbench-mysql-2 : 2135.17 per sec. 2142.51 per sec. sysbench-mysql-5 : 4809.68 per sec. 4800.19 per sec. sysbench-mysql-10 : 9158.59 per sec. 9157.05 per sec. sysbench-mysql-20 : 14570.70 per sec. 14543.55 per sec. sysbench-mysql-40 : 22130.56 per sec. 22184.82 per sec. sysbench-mysql-80 : 20995.56 per sec. 21904.18 per sec. sysbench-psql-2 : 1679.58 per sec. 1705.06 per sec. sysbench-psql-5 : 3797.69 per sec. 3879.93 per sec. sysbench-psql-10 : 7253.22 per sec. 7258.06 per sec. sysbench-psql-20 : 11166.75 per sec. 11220.00 per sec. sysbench-psql-40 : 17277.28 per sec. 17359.78 per sec. sysbench-psql-80 : 17112.44 per sec. 17221.16 per sec. (increase on the top end) tbench: NO_WA_WEIGHT Throughput 685.211 MB/sec 2 clients 2 procs max_latency=0.123 ms Throughput 1596.64 MB/sec 5 clients 5 procs max_latency=0.119 ms Throughput 2985.47 MB/sec 10 clients 10 procs max_latency=0.262 ms Throughput 4521.15 MB/sec 20 clients 20 procs max_latency=0.506 ms Throughput 9438.1 MB/sec 40 clients 40 procs max_latency=2.052 ms Throughput 8210.5 MB/sec 80 clients 80 procs max_latency=8.310 ms WA_WEIGHT Throughput 697.292 MB/sec 2 clients 2 procs max_latency=0.127 ms Throughput 1596.48 MB/sec 5 clients 5 procs max_latency=0.080 ms Throughput 2975.22 MB/sec 10 clients 10 procs max_latency=0.254 ms Throughput 4575.14 MB/sec 20 clients 20 procs max_latency=0.502 ms Throughput 9468.65 MB/sec 40 clients 40 procs max_latency=2.069 ms Throughput 8631.73 MB/sec 80 clients 80 procs max_latency=8.605 ms (increase on the top end) Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Rik van Riel <riel@redhat.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
d153b15344 |
sched/core: Fix wake_affine() performance regression
Eric reported a sysbench regression against commit: |
||
Peter Zijlstra
|
17de4ee04c |
sched/fair: Update calc_group_*() comments
I had a wee bit of trouble recalling how the calc_group_runnable() stuff worked.. add hopefully better comments. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Josef Bacik
|
2c8e4dce79 |
sched/fair: Calculate runnable_weight slightly differently
Our runnable_weight currently looks like this runnable_weight = shares * runnable_load_avg / load_avg The goal is to scale the runnable weight for the group based on its runnable to load_avg ratio. The problem with this is it biases us towards tasks that never go to sleep. Tasks that go to sleep are going to have their runnable_load_avg decayed pretty hard, which will drastically reduce the runnable weight of groups with interactive tasks. To solve this imbalance we tweak this slightly, so in the ideal case it is still the above, but in the interactive case it is runnable_weight = shares * runnable_weight / load_weight which will make the weight distribution fairer between interactive and non-interactive groups. Signed-off-by: Josef Bacik <jbacik@fb.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kernel-team@fb.com Cc: linux-kernel@vger.kernel.org Cc: riel@redhat.com Cc: tj@kernel.org Link: http://lkml.kernel.org/r/1501773219-18774-2-git-send-email-jbacik@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
9a2dd585b2 |
sched/fair: Implement more accurate async detach
The problem with the overestimate is that it will subtract too big a value from the load_sum, thereby pushing it down further than it ought to go. Since runnable_load_avg is not subject to a similar 'force', this results in the occasional 'runnable_load > load' situation. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
f207934fb7 |
sched/fair: Align PELT windows between cfs_rq and its se
The PELT _sum values are a saw-tooth function, dropping on the decay edge and then growing back up again during the window. When these window-edges are not aligned between cfs_rq and se, we can have the situation where, for example, on dequeue, the se decays first. Its _sum values will be small(er), while the cfs_rq _sum values will still be on their way up. Because of this, the subtraction: cfs_rq->avg._sum -= se->avg._sum will result in a positive value. This will then, once the cfs_rq reaches an edge, translate into its _avg value jumping up. This is especially visible with the runnable_load bits, since they get added/subtracted a lot. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
144d8487bc |
sched/fair: Implement synchonous PELT detach on load-balance migrate
Vincent wondered why his self migrating task had a roughly 50% dip in load_avg when landing on the new CPU. This is because we uncondionally take the asynchronous detatch_entity route, which can lead to the attach on the new CPU still seeing the old CPU's contribution to tg->load_avg, effectively halving the new CPU's shares. While in general this is something we have to live with, there is the special case of runnable migration where we can do better. Tested-by: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
1ea6c46a23 |
sched/fair: Propagate an effective runnable_load_avg
The load balancer uses runnable_load_avg as load indicator. For !cgroup this is: runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq That is, a direct sum of all runnable tasks on that runqueue. As opposed to load_avg, which is a sum of all tasks on the runqueue, which includes a blocked component. However, in the cgroup case, this comes apart since the group entities are always runnable, even if most of their constituent entities are blocked. Therefore introduce a runnable_weight which for task entities is the same as the regular weight, but for group entities is a fraction of the entity weight and represents the runnable part of the group runqueue. Then propagate this load through the PELT hierarchy to arrive at an effective runnable load avgerage -- which we should not confuse with the canonical runnable load average. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
0e2d2aaaae |
sched/fair: Rewrite PELT migration propagation
When an entity migrates in (or out) of a runqueue, we need to add (or remove) its contribution from the entire PELT hierarchy, because even non-runnable entities are included in the load average sums. In order to do this we have some propagation logic that updates the PELT tree, however the way it 'propagates' the runnable (or load) change is (more or less): tg->weight * grq->avg.load_avg ge->avg.load_avg = ------------------------------ tg->load_avg But that is the expression for ge->weight, and per the definition of load_avg: ge->avg.load_avg := ge->weight * ge->avg.runnable_avg That destroys the runnable_avg (by setting it to 1) we wanted to propagate. Instead directly propagate runnable_sum. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
2a2f5d4e44 |
sched/fair: Rewrite cfs_rq->removed_*avg
Since on wakeup migration we don't hold the rq->lock for the old CPU we cannot update its state. Instead we add the removed 'load' to an atomic variable and have the next update on that CPU collect and process it. Currently we have 2 atomic variables; which already have the issue that they can be read out-of-sync. Also, two atomic ops on a single cacheline is already more expensive than an uncontended lock. Since we want to add more, convert the thing over to an explicit cacheline with a lock in. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Vincent Guittot
|
9059393e4e |
sched/fair: Use reweight_entity() for set_user_nice()
Now that we directly change load_avg and propagate that change into the sums, sys_nice() and co should do the same, otherwise its possible to confuse load accounting when we migrate near the weight change. Fixes-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org> [ Added changelog, fixed the call condition. ] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
840c5abca4 |
sched/fair: More accurate reweight_entity()
When a (group) entity changes it's weight we should instantly change its load_avg and propagate that change into the sums it is part of. Because we use these values to predict future behaviour and are not interested in its historical value. Without this change, the change in load would need to propagate through the average, by which time it could again have changed etc.. always chasing itself. With this change, the cfs_rq load_avg sum will more accurately reflect the current runnable and expected return of blocked load. Reported-by: Paul Turner <pjt@google.com> [josef: compile fix !SMP || !FAIR_GROUP] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
8d5b9025f9 |
sched/fair: Introduce {en,de}queue_load_avg()
Analogous to the existing {en,de}queue_runnable_load_avg() add helpers for {en,de}queue_load_avg(). More users will follow. Includes some code movement to avoid fwd declarations. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
Peter Zijlstra
|
b5b3e35f41 |
sched/fair: Rename {en,de}queue_entity_load_avg()
Since they're now purely about runnable_load, rename them. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-kernel@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> |