During load-balance, groups classified as group_misfit_task are filtered
out if they do not pass
group_smaller_max_cpu_capacity(<candidate group>, <local group>);
which itself employs fits_capacity() to compare the sgc->max_capacity of
both groups.
Due to the underlying margin, fits_capacity(X, 1024) will return false for
any X > 819. Tough luck, the capacity_orig's on e.g. the Pixel 4 are
{261, 871, 1024}. If a CPU-bound task ends up on one of those "medium"
CPUs, misfit migration will never intentionally upmigrate it to a CPU of
higher capacity due to the aforementioned margin.
One may argue the 20% margin of fits_capacity() is excessive in the advent
of counter-enhanced load tracking (APERF/MPERF, AMUs), but one point here
is that fits_capacity() is meant to compare a utilization value to a
capacity value, whereas here it is being used to compare two capacity
values. As CPU capacity and task utilization have different dynamics, a
sensible approach here would be to add a new helper dedicated to comparing
CPU capacities.
Also note that comparing capacity extrema of local and source sched_group's
doesn't make much sense when at the day of the day the imbalance will be
pulled by a known env->dst_cpu, whose capacity can be anywhere within the
local group's capacity extrema.
While at it, replace group_smaller_{min, max}_cpu_capacity() with
comparisons of the source group's min/max capacity and the destination
CPU's capacity.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-4-valentin.schneider@arm.com
When triggering an active load balance, sd->nr_balance_failed is set to
such a value that any further can_migrate_task() using said sd will ignore
the output of task_hot().
This behaviour makes sense, as active load balance intentionally preempts a
rq's running task to migrate it right away, but this asynchronous write is
a bit shoddy, as the stopper thread might run active_load_balance_cpu_stop
before the sd->nr_balance_failed write either becomes visible to the
stopper's CPU or even happens on the CPU that appended the stopper work.
Add a struct lb_env flag to denote active balancing, and use it in
can_migrate_task(). Remove the sd->nr_balance_failed write that served the
same purpose. Cleanup the LBF_DST_PINNED active balance special case.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-3-valentin.schneider@arm.com
During load balance, LBF_SOME_PINNED will be set if any candidate task
cannot be detached due to CPU affinity constraints. This can result in
setting env->sd->parent->sgc->group_imbalance, which can lead to a group
being classified as group_imbalanced (rather than any of the other, lower
group_type) when balancing at a higher level.
In workloads involving a single task per CPU, LBF_SOME_PINNED can often be
set due to per-CPU kthreads being the only other runnable tasks on any
given rq. This results in changing the group classification during
load-balance at higher levels when in reality there is nothing that can be
done for this affinity constraint: per-CPU kthreads, as the name implies,
don't get to move around (modulo hotplug shenanigans).
It's not as clear for userspace tasks - a task could be in an N-CPU cpuset
with N-1 offline CPUs, making it an "accidental" per-CPU task rather than
an intended one. KTHREAD_IS_PER_CPU gives us an indisputable signal which
we can leverage here to not set LBF_SOME_PINNED.
Note that the aforementioned classification to group_imbalance (when
nothing can be done) is especially problematic on big.LITTLE systems, which
have a topology the likes of:
DIE [ ]
MC [ ][ ]
0 1 2 3
L L B B
arch_scale_cpu_capacity(L) < arch_scale_cpu_capacity(B)
Here, setting LBF_SOME_PINNED due to a per-CPU kthread when balancing at MC
level on CPUs [0-1] will subsequently prevent CPUs [2-3] from classifying
the [0-1] group as group_misfit_task when balancing at DIE level. Thus, if
CPUs [0-1] are running CPU-bound (misfit) tasks, ill-timed per-CPU kthreads
can significantly delay the upgmigration of said misfit tasks. Systems
relying on ASYM_PACKING are likely to face similar issues.
Signed-off-by: Lingutla Chandrasekhar <clingutla@codeaurora.org>
[Use kthread_is_per_cpu() rather than p->nr_cpus_allowed]
[Reword changelog]
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210407220628.3798191-2-valentin.schneider@arm.com
Mel Gorman did some nice work in 9fe1f127b9 ("sched/fair: Merge
select_idle_core/cpu()"), resulting in the kernel being more efficient
at finding an idle CPU, and in tasks spending less time waiting to be
run, both according to the schedstats run_delay numbers, and according
to measured application latencies. Yay.
The flip side of this is that we see more task migrations (about 30%
more), higher cache misses, higher memory bandwidth utilization, and
higher CPU use, for the same number of requests/second.
This is most pronounced on a memcache type workload, which saw a
consistent 1-3% increase in total CPU use on the system, due to those
increased task migrations leading to higher L2 cache miss numbers, and
higher memory utilization. The exclusive L3 cache on Skylake does us
no favors there.
On our web serving workload, that effect is usually negligible.
It appears that the increased number of CPU migrations is generally a
good thing, since it leads to lower cpu_delay numbers, reflecting the
fact that tasks get to run faster. However, the reduced locality and
the corresponding increase in L2 cache misses hurts a little.
The patch below appears to fix the regression, while keeping the
benefit of the lower cpu_delay numbers, by reintroducing
select_idle_smt with a twist: when a socket has no idle cores, check
to see if the sibling of "prev" is idle, before searching all the
other CPUs.
This fixes both the occasional 9% regression on the web serving
workload, and the continuous 2% CPU use regression on the memcache
type workload.
With Mel's patches and this patch together, task migrations are still
high, but L2 cache misses, memory bandwidth, and CPU time used are
back down to what they were before. The p95 and p99 response times for
the memcache type application improve by about 10% over what they were
before Mel's patches got merged.
Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210326151932.2c187840@imladris.surriel.com
Currently only root can write files under /proc/pressure. Relax this to
allow tasks running as unprivileged users with CAP_SYS_RESOURCE to be
able to write to these files.
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210402025833.27599-1-johunt@akamai.com
mask is built in build_balance_mask() by for_each_cpu(i, sg_span), so
it must be a subset of sched_group_span(sg).
So the cpumask_and() call is redundant - remove it.
[ mingo: Adjusted the changelog a bit. ]
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Link: https://lore.kernel.org/r/20210325023140.23456-1-song.bao.hua@hisilicon.com
-1 is -EPERM which is a somewhat odd error to return from
sched_dynamic_write(). No other callers care about which negative
value is used.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20210325004515.531631-2-linux@rasmusvillemoes.dk
Use the enum names which are also what is used in the switch() in
sched_dynamic_update().
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: https://lore.kernel.org/r/20210325004515.531631-1-linux@rasmusvillemoes.dk
A long-tail load balance cost is observed on the newly idle path,
this is caused by a race window between the first nr_running check
of the busiest runqueue and its nr_running recheck in detach_tasks.
Before the busiest runqueue is locked, the tasks on the busiest
runqueue could be pulled by other CPUs and nr_running of the busiest
runqueu becomes 1 or even 0 if the running task becomes idle, this
causes detach_tasks breaks with LBF_ALL_PINNED flag set, and triggers
load_balance redo at the same sched_domain level.
In order to find the new busiest sched_group and CPU, load balance will
recompute and update the various load statistics, which eventually leads
to the long-tail load balance cost.
This patch clears LBF_ALL_PINNED flag for this race condition, and hence
reduces the long-tail cost of newly idle balance.
Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/1614154549-116078-1-git-send-email-aubrey.li@intel.com
update_idle_core() is only done for the case of sched_smt_present.
but test_idle_cores() is done for all machines even those without
SMT.
This can contribute to up 8%+ hackbench performance loss on a
machine like kunpeng 920 which has no SMT. This patch removes the
redundant test_idle_cores() for !SMT machines.
Hackbench is ran with -g {2..14}, for each g it is ran 10 times to get
an average.
$ numactl -N 0 hackbench -p -T -l 20000 -g $1
The below is the result of hackbench w/ and w/o this patch:
g= 2 4 6 8 10 12 14
w/o: 1.8151 3.8499 5.5142 7.2491 9.0340 10.7345 12.0929
w/ : 1.8428 3.7436 5.4501 6.9522 8.2882 9.9535 11.3367
+4.1% +8.3% +7.3% +6.3%
Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20210320221432.924-1-song.bao.hua@hisilicon.com
We noticed that the cost of psi increases with the increase in the
levels of the cgroups. Particularly the cost of cpu_clock() sticks out
as the kernel calls it multiple times as it traverses up the cgroup
tree. This patch reduces the calls to cpu_clock().
Performed perf bench on Intel Broadwell with 3 levels of cgroup.
Before the patch:
$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.747 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 3.516 [sec]
3.516689 usecs/op
284358 ops/sec
After the patch:
$ perf bench sched all
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run
Total time: 0.640 [sec]
# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes
Total time: 3.329 [sec]
3.329820 usecs/op
300316 ops/sec
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210321205156.4186483-1-shakeelb@google.com
Most callsites were covered by commit
a8b62fd085 ("stop_machine: Add function and caller debug info")
but this skipped queue_stop_cpus_work(). Add caller debug info to it.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201210163830.21514-2-valentin.schneider@arm.com
Fix ~42 single-word typos in scheduler code comments.
We have accumulated a few fun ones over the years. :-)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: linux-kernel@vger.kernel.org
For userspace checkpoint and restore (C/R) a way of getting process state
containing RSEQ configuration is needed.
There are two ways this information is going to be used:
- to re-enable RSEQ for threads which had it enabled before C/R
- to detect if a thread was in a critical section during C/R
Since C/R preserves TLS memory and addresses RSEQ ABI will be restored
using the address registered before C/R.
Detection whether the thread is in a critical section during C/R is needed
to enforce behavior of RSEQ abort during C/R. Attaching with ptrace()
before registers are dumped itself doesn't cause RSEQ abort.
Restoring the instruction pointer within the critical section is
problematic because rseq_cs may get cleared before the control is passed
to the migrated application code leading to RSEQ invariants not being
preserved. C/R code will use RSEQ ABI address to find the abort handler
to which the instruction pointer needs to be set.
To achieve above goals expose the RSEQ ABI address and the signature value
with the new ptrace request PTRACE_GET_RSEQ_CONFIGURATION.
This new ptrace request can also be used by debuggers so they are aware
of stops within restartable sequences in progress.
Signed-off-by: Piotr Figiel <figiel@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Michal Miroslaw <emmir@google.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lkml.kernel.org/r/20210226135156.1081606-1-figiel@google.com
Since 565790d28b (sched: Fix balance_callback(), 2020-05-11), there
is no longer a need to reuse the result value of the call to finish_task_switch()
inside schedule_tail(), therefore the variable used to hold that value
(rq) is no longer needed.
Signed-off-by: Edmundo Carmona Antoranz <eantoranz@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210306210739.1370486-1-eantoranz@gmail.com
A significant portion of __calc_delta() time is spent in the loop
shifting a u64 by 32 bits. Use `fls` instead of iterating.
This is ~7x faster on benchmarks.
The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.
On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use fls are more than twice as
slow as the builtin. This is because the way the `fls` function is
written, clang puts the value in memory:
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?idI406.
```
name cpu/op
BM_Calc<__calc_delta_loop> 9.57ms Â=B112%
BM_Calc<__calc_delta_generic_fls> 2.36ms Â=B113%
BM_Calc<__calc_delta_asm_fls> 2.45ms Â=B113%
BM_Calc<__calc_delta_asm_fls_nomem> 1.66ms Â=B112%
BM_Calc<__calc_delta_asm_fls64> 2.46ms Â=B113%
BM_Calc<__calc_delta_asm_fls64_nomem> 1.34ms Â=B115%
BM_Calc<__calc_delta_builtin> 1.32ms Â=B111%
```
Signed-off-by: Clement Courbet <courbet@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com
The commit 36b238d571 ("psi: Optimize switching tasks inside shared
cgroups") only update cgroups whose state actually changes during a
task switch only in task preempt case, not in task sleep case.
We actually don't need to clear and set TSK_ONCPU state for common cgroups
of next and prev task in sleep case, that can save many psi_group_change
especially when most activity comes from one leaf cgroup.
sleep before:
psi_dequeue()
while ((group = iterate_groups(prev))) # all ancestors
psi_group_change(prev, .clear=TSK_RUNNING|TSK_ONCPU)
psi_task_switch()
while ((group = iterate_groups(next))) # all ancestors
psi_group_change(next, .set=TSK_ONCPU)
sleep after:
psi_dequeue()
nop
psi_task_switch()
while ((group = iterate_groups(next))) # until (prev & next)
psi_group_change(next, .set=TSK_ONCPU)
while ((group = iterate_groups(prev))) # all ancestors
psi_group_change(prev, .clear=common?TSK_RUNNING:TSK_RUNNING|TSK_ONCPU)
When a voluntary sleep switches to another task, we remove one call of
psi_group_change() for every common cgroup ancestor of the two tasks.
Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-5-zhouchengming@bytedance.com
Move the unlikely branches out of line. This eliminates undesirable
jumps during wakeup and sleeps for workloads that aren't under any
sort of resource pressure.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-4-zhouchengming@bytedance.com
Move the reclaim detection from the timer tick to the task state
tracking machinery using the recently added ONCPU state. And we
also add task psi_flags changes checking in the psi_task_switch()
optimization to update the parents properly.
In terms of performance and cost, this ONCPU task state tracking
is not cheaper than previous timer tick in aggregate. But the code is
simpler and shorter this way, so it's a maintainability win. And
Johannes did some testing with perf bench, the performace and cost
changes would be acceptable for real workloads.
Thanks to Johannes Weiner for pointing out the psi_task_switch()
optimization things and the clearer changelog.
Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-3-zhouchengming@bytedance.com
The FULL state doesn't exist for the CPU resource at the system level,
but exist at the cgroup level, means all non-idle tasks in a cgroup are
delayed on the CPU resource which used by others outside of the cgroup
or throttled by the cgroup cpu.max configuration.
Co-developed-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/20210303034659.91735-2-zhouchengming@bytedance.com
Factorizing and unifying cpuhp callback range invocations, especially for
the hotunplug path, where two different ways of decrementing were used. The
first one, decrements before the callback is called:
cpuhp_thread_fun()
state = st->state;
st->state--;
cpuhp_invoke_callback(state);
The second one, after:
take_down_cpu()|cpuhp_down_callbacks()
cpuhp_invoke_callback(st->state);
st->state--;
This is problematic for rolling back the steps in case of error, as
depending on the decrement, the rollback will start from N or N-1. It also
makes tracing inconsistent, between steps run in the cpuhp thread and
the others.
Additionally, avoid useless cpuhp_thread_fun() loops by skipping empty
steps.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210216103506.416286-4-vincent.donnefort@arm.com
The atomic states (between CPUHP_AP_IDLE_DEAD and CPUHP_AP_ONLINE) are
triggered by the CPUHP_BRINGUP_CPU step. If the latter fails, no atomic
state can be rolled back.
DEAD callbacks too can't fail and disallow recovery. As a consequence,
during hotunplug, the fail injection interface should prohibit all states
from CPUHP_BRINGUP_CPU to CPUHP_ONLINE.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210216103506.416286-3-vincent.donnefort@arm.com
Currently, the only way of resetting the fail injection is to trigger a
hotplug, hotunplug or both. This is rather annoying for testing
and, as the default value for this file is -1, it seems pretty natural to
let a user write it.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20210216103506.416286-2-vincent.donnefort@arm.com
Being called for each dequeue, util_est reduces the number of its updates
by filtering out when the EWMA signal is different from the task util_avg
by less than 1%. It is a problem for a sudden util_avg ramp-up. Due to the
decay from a previous high util_avg, EWMA might now be close enough to
the new util_avg. No update would then happen while it would leave
ue.enqueued with an out-of-date value.
Taking into consideration the two util_est members, EWMA and enqueued for
the filtering, ensures, for both, an up-to-date value.
This is for now an issue only for the trace probe that might return the
stale value. Functional-wise, it isn't a problem, as the value is always
accessed through max(enqueued, ewma).
This problem has been observed using LISA's UtilConvergence:test_means on
the sd845c board.
No regression observed with Hackbench on sd845c and Perf-bench sched pipe
on hikey/hikey960.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20210225165820.1377125-1-vincent.donnefort@arm.com
Syzbot reported a handful of occurrences where an sd->nr_balance_failed can
grow to much higher values than one would expect.
A successful load_balance() resets it to 0; a failed one increments
it. Once it gets to sd->cache_nice_tries + 3, this *should* trigger an
active balance, which will either set it to sd->cache_nice_tries+1 or reset
it to 0. However, in case the to-be-active-balanced task is not allowed to
run on env->dst_cpu, then the increment is done without any further
modification.
This could then be repeated ad nauseam, and would explain the absurdly high
values reported by syzbot (86, 149). VincentG noted there is value in
letting sd->cache_nice_tries grow, so the shift itself should be
fixed. That means preventing:
"""
If the value of the right operand is negative or is greater than or equal
to the width of the promoted left operand, the behavior is undefined.
"""
Thus we need to cap the shift exponent to
BITS_PER_TYPE(typeof(lefthand)) - 1.
I had a look around for other similar cases via coccinelle:
@expr@
position pos;
expression E1;
expression E2;
@@
(
E1 >> E2@pos
|
E1 >> E2@pos
)
@cst depends on expr@
position pos;
expression expr.E1;
constant cst;
@@
(
E1 >> cst@pos
|
E1 << cst@pos
)
@script:python depends on !cst@
pos << expr.pos;
exp << expr.E2;
@@
# Dirty hack to ignore constexpr
if exp.upper() != exp:
coccilib.report.print_report(pos[0], "Possible UB shift here")
The only other match in kernel/sched is rq_clock_thermal() which employs
sched_thermal_decay_shift, and that exponent is already capped to 10, so
that one is fine.
Fixes: 5a7f555904 ("sched/fair: Relax constraint on task's load during load balance")
Reported-by: syzbot+d7581744d5fd27c9fbe1@syzkaller.appspotmail.com
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: http://lore.kernel.org/r/000000000000ffac1205b9a2112f@google.com
The sub_positive local version is saving an explicit load-store and is
enough for the cpu_util_next() usage.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-3-vincent.donnefort@arm.com
find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
energy delta as follows:
feec(task)
for_each_pd
base_energy = compute_energy(task, -1, pd)
-> for_each_cpu(pd)
-> cpu_util_next(cpu, task, -1)
energy_delta = compute_energy(task, dst_cpu, pd)
-> for_each_cpu(pd)
-> cpu_util_next(cpu, task, dst_cpu)
energy_delta -= base_energy
Then it picks the best CPU as being the one that minimizes energy_delta.
cpu_util_next() estimates the CPU utilization that would happen if the
task was placed on dst_cpu as follows:
max(cpu_util + task_util, cpu_util_est + _task_util_est)
The task contribution to the energy delta can then be either:
(1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
and _task_util_est > cpu_util.
(2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
(cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
otherwise must be small enough so that feec() takes the CPU as a
potential target for the task placement)
This is problematic for feec(), as cpu_util_next() might give an unfair
advantage to a CPU which is mostly busy (2) compared to one which is
mostly idle (1). _task_util_est being always bigger than task_util in
feec() (as the task is waking up), the task contribution to the energy
might look smaller on certain CPUs (2) and this breaks the energy
comparison.
This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
their cpu_util < _task_util_est (1) while others will maintain cpu_util >
_task_util_est (2).
Fix this problem by always using max(task_util, _task_util_est) as a task
contribution to the energy (ENERGY_UTIL). The new estimated CPU
utilization for the energy would then be:
max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
compute_energy() still needs to know which OPP would be selected if the
task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
cpu_util_next() is still used to estimate the maximum util within the pd.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
Start to update last_blocked_load_update_tick to reduce the possibility
of another cpu starting the update one more time
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-8-vincent.guittot@linaro.org
Instead of waking up a random and already idle CPU, we can take advantage
of this_cpu being about to enter idle to run the ILB and update the
blocked load.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-7-vincent.guittot@linaro.org
Reorder the tests and skip useless ones when no load balance has been
performed and rq lock has not been released.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-6-vincent.guittot@linaro.org
Remove the specific case for handling this_cpu outside for_each_cpu() loop
when running ILB. Instead we use for_each_cpu_wrap() and start with the
next cpu after this_cpu so we will continue to finish with this_cpu.
update_nohz_stats() is now used for this_cpu too and will prevents
unnecessary update. We don't need a special case for handling the update of
nohz.next_balance for this_cpu anymore because it is now handled by the
loop like others.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-5-vincent.guittot@linaro.org
idle load balance is the only user of update_nohz_stats and doesn't use
force parameter. Remove it
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-4-vincent.guittot@linaro.org
The return of _nohz_idle_balance() is not used anymore so we can remove
it
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-3-vincent.guittot@linaro.org
newidle_balance runs with both preempt and irq disabled which prevent
local irq to run during this period. The duration for updating the
blocked load of CPUs varies according to the number of CPU cgroups
with non-decayed load and extends this critical period to an uncontrolled
level.
Remove the update from newidle_balance and trigger a normal ILB that
will take care of the update instead.
This reduces the IRQ latency from O(nr_cgroups * nr_nohz_cpus) to
O(nr_cgroups).
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224133007.28644-2-vincent.guittot@linaro.org
The recent addition of in_serving_softirq() to kconv.h results in
compile failure on PREEMPT_RT because it requires
task_struct::softirq_disable_cnt. This is not available if kconv.h is
included from sched.h.
It is not needed to include kconv.h from sched.h. All but the net/ user
already include the kconv header file.
Move the include of the kconv.h header from sched.h it its users.
Additionally include sched.h from kconv.h to ensure that everything
task_struct related is available.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Acked-by: Andrey Konovalov <andreyknvl@google.com>
Link: https://lkml.kernel.org/r/20210218173124.iy5iyqv3a4oia4vv@linutronix.de
Since, when ->stop_pending, only the stopper can uninstall
p->migration_pending. This could simplify a few ifs, because:
(pending != NULL) => (pending == p->migration_pending)
Also, the fatty comment above affine_move_task() probably needs a bit
of gardening.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The function sync_runqueues_membarrier_state() should copy the
membarrier state from the @mm received as parameter to each runqueue
currently running tasks using that mm.
However, the use of smp_call_function_many() skips the current runqueue,
which is unintended. Replace by a call to on_each_cpu_mask().
Fixes: 227a4aadc7 ("sched/membarrier: Fix p->mm->membarrier_state racy load")
Reported-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: stable@vger.kernel.org # 5.4.x+
Link: https://lore.kernel.org/r/74F1E842-4A84-47BF-B6C2-5407DFDD4A4A@gmail.com
Now that we have set_affinity_pending::stop_pending to indicate if a
stopper is in progress, and we have the guarantee that if that stopper
exists, it will (eventually) complete our @pending we can simplify the
refcount scheme by no longer counting the stopper thread.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.724130207@infradead.org
Consider:
sched_setaffinity(p, X); sched_setaffinity(p, Y);
Then the first will install p->migration_pending = &my_pending; and
issue stop_one_cpu_nowait(pending); and the second one will read
p->migration_pending and _also_ issue: stop_one_cpu_nowait(pending),
the _SAME_ @pending.
This causes stopper list corruption.
Add set_affinity_pending::stop_pending, to indicate if a stopper is in
progress.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.649146419@infradead.org
When the purpose of migration_cpu_stop() is to migrate the task to
'any' valid CPU, don't migrate the task when it's already running on a
valid CPU.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.569238629@infradead.org
The SCA_MIGRATE_ENABLE and task_running() cases are almost identical,
collapse them to avoid further duplication.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.500108964@infradead.org
When affine_move_task() issues a migration_cpu_stop(), the purpose of
that function is to complete that @pending, not any random other
p->migration_pending that might have gotten installed since.
This realization much simplifies migration_cpu_stop() and allows
further necessary steps to fix all this as it provides the guarantee
that @pending's stopper will complete @pending (and not some random
other @pending).
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.430014682@infradead.org
When affine_move_task(p) is called on a running task @p, which is not
otherwise already changing affinity, we'll first set
p->migration_pending and then do:
stop_one_cpu(cpu_of_rq(rq), migration_cpu_stop, &arg);
This then gets us to migration_cpu_stop() running on the CPU that was
previously running our victim task @p.
If we find that our task is no longer on that runqueue (this can
happen because of a concurrent migration due to load-balance etc.),
then we'll end up at the:
} else if (dest_cpu < 1 || pending) {
branch. Which we'll take because we set pending earlier. Here we first
check if the task @p has already satisfied the affinity constraints,
if so we bail early [A]. Otherwise we'll reissue migration_cpu_stop()
onto the CPU that is now hosting our task @p:
stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
&pending->arg, &pending->stop_work);
Except, we've never initialized pending->arg, which will be all 0s.
This then results in running migration_cpu_stop() on the next CPU with
arg->p == NULL, which gives the by now obvious result of fireworks.
The cure is to change affine_move_task() to always use pending->arg,
furthermore we can use the exact same pattern as the
SCA_MIGRATE_ENABLE case, since we'll block on the pending->done
completion anyway, no point in adding yet another completion in
stop_one_cpu().
This then gives a clear distinction between the two
migration_cpu_stop() use cases:
- sched_exec() / migrate_task_to() : arg->pending == NULL
- affine_move_task() : arg->pending != NULL;
And we can have it ignore p->migration_pending when !arg->pending. Any
stop work from sched_exec() / migrate_task_to() is in addition to stop
works from affine_move_task(), which will be sufficient to issue the
completion.
Fixes: 6d337eab04 ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210224131355.357743989@infradead.org
- Fix corner cases in the rxe reference counting cleanup that are causing
regressions in blktests for SRP
- Two kdoc fixes so W=1 is clean
- Missing error return in error unwind for mlx5
- Wrong lock type nesting in IB CM
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAmBCeEwACgkQOG33FX4g
mxohmQ/7Bnf52Ay59dqdQzc8nDPXMwQ5ejZke+MnYZG3j79UqNQrbEnLJ8Y2BlQU
IkpdTi8zUHm69GH7Ix9m/FsW52XFUuq2OEfGLvZNRUhi7hGEmc6llvfCK9y6a4Mt
VEpyo7vC4QvA1OvMBF3Zv91jylKPJt5U51OuRtbh+6eoS6oVm7ajjOay+USYT/kX
3zftTd6DUg6vpUP5j2WvjocgKUVGsh8GiyA3xC7u7AUTemcxvZ40wB535fLpn1JC
ObU8aiRKw0Ib5JbRukCFFBV13kmA3lnAEy0SktPVhIoeUnF0m1vNtAJrdnHsWPxW
lxtAXA3bN3oINhsEhnBllONKRxlqsUz7KT5bVAehsl8LENNLHhJWp0hwKawKK74K
5MAQ+P6CIW8TLMKeOcYMXBD2UiElc9sdhYsDliM7leclftwOO5gFI4WgTWrQwxvl
m+KewiByT6pzK7oyGvg9w/dl3y7atzy8y/7uMAfI2L6oL43z9yt24hHQTo3Z7UM7
/bbSas/yRbZujB7qE45c6NvwRFQk6mYTK0Is0huMWOLn0Zf0fVcy3um0q+OKQ7At
VS1Z7Fhy+ev1Lk764n5rmVUx8NWDlXkzwX7qk7SnQk7dWQ5lNFbziFDKmkiKgcAa
//oN0I/FbAM4FjhKImRzbXJiLdULxJCqpfQ5VXbUXlfXdpzjabY=
=j5ZU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Nothing special here, though Bob's regression fixes for rxe would have
made it before the rc cycle had there not been such strong winter
weather!
- Fix corner cases in the rxe reference counting cleanup that are
causing regressions in blktests for SRP
- Two kdoc fixes so W=1 is clean
- Missing error return in error unwind for mlx5
- Wrong lock type nesting in IB CM"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/rxe: Fix errant WARN_ONCE in rxe_completer()
RDMA/rxe: Fix extra deref in rxe_rcv_mcast_pkt()
RDMA/rxe: Fix missed IB reference counting in loopback
RDMA/uverbs: Fix kernel-doc warning of _uverbs_alloc
RDMA/mlx5: Set correct kernel-doc identifier
IB/mlx5: Add missing error code
RDMA/rxe: Fix missing kconfig dependency on CRYPTO
RDMA/cm: Fix IRQ restore in ib_send_cm_sidr_rep
- Fix coding style issues (Jason Yan)
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAmBCursACgkQiXL039xt
wCZWzw//aBKQN8oUisNVhzWnUOpXEFV1GfJPaqgdZv89fCSnC20ojxzxXG4bTVod
mnDvkYYAce3A8uLyACdt2iVdXCKF5LbiglkipqK6S4N/0DzoN4IBhgEtmNxMswjc
LwXTzG3R7lu6J2CQatk0uRd2QLVWBHkjyMWRBpg3ryYJoj2ROscWlCL86u3Kdwyh
2XKozHs+mTrDz7sOHeDlEfeJTF7AniN3Px6ajf0st0up5NIiZ70uuaHqhKMWuYji
4fnO2Pf2G0b9QaB6W7k0S/M2VGhmd+1h3psI9PJJ3glrKe3WsHRm9Akq8MF08vr7
UEvYqEPuU6/YJDEXJ3C22xEeN9HBfXgrX1qAf/1/mE9c1AGvIPNmQYp+xE1xSXzO
79GRPf1OrZLntJuQG9ir1BrhryehrynhH2dS0F1ByT2u/lYpGw9o+vu7rpnQ6DUR
sjnhJSCyzDMHdZPzaUldFaydEquDE5LohY2TQDeWtT6tBtkjIONj0YqGlm2IlzFL
OlM0YkjSd4fDlKIhMbP/Vm+ScAJfs3dR7cWwn3F3JIoDTZf5f4IAbsKCyPJ3wIHm
A0BEXs9bZ4MCmNPtZDuSGExnHs+te4yBoahkvw/84TG2PBJVOCDdB4F7noFu47vJ
1QvL7TiRTr782fqZQwSTUZZHxmc6NhZRySsXRIHBA4AyboPnnYk=
=oow8
-----END PGP SIGNATURE-----
Merge tag 'gcc-plugins-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull gcc-plugins fixes from Kees Cook:
"Tiny gcc-plugin fixes for v5.12-rc2. These issues are small but have
been reported a couple times now by static analyzers, so best to get
them fixed to reduce the noise. :)
- Fix coding style issues (Jason Yan)"
* tag 'gcc-plugins-v5.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
gcc-plugins: latent_entropy: remove unneeded semicolon
gcc-plugins: structleak: remove unneeded variable 'ret'
Reed-Solomon roots that are unaligned to block size.
-----BEGIN PGP SIGNATURE-----
iQFHBAABCAAxFiEEJfWUX4UqZ4x1O2wixSPxCi2dA1oFAmBCgmkTHHNuaXR6ZXJA
cmVkaGF0LmNvbQAKCRDFI/EKLZ0DWlF9B/481eKdOYK5RLj6LneVf10niUACEN0G
tp8CeKh2wEcTjX+9pWxvigAE7FZnvD0sCts3eRCd8egkdV3L9uHyaCHU9V8iGP8L
dSjHXjbwbOumww5FY1ddx2ZiRImcU7YpEj5TrUZi2TyoAB2jOblDgaUt8jWsbppa
8miPUZi0Kp56Z5EsFJcT8dYbIlLfUpD/XfZ0hjqMoc9XZOeoGvYqpYX1pZgIfj2s
F23Dru858zqjv7OLmIgjzSgS0dZSwdLMUpKInaW1AipvNQpWSuh2DQdP7Ail3+yy
gUirpBjztoYCfqwpQTJpGnnb47tECYUXef6dI7xR5bXnyaX8CG5bRvVV
=a00f
-----END PGP SIGNATURE-----
Merge tag 'for-5.12/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
"Fix DM verity target's optional Forward Error Correction (FEC) for
Reed-Solomon roots that are unaligned to block size"
* tag 'for-5.12/dm-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm verity: fix FEC for RS roots unaligned to block size
dm bufio: subtract the number of initial sectors in dm_bufio_get_device_size
-----BEGIN PGP SIGNATURE-----
iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAmBCa2kQHGF4Ym9lQGtl
cm5lbC5kawAKCRD301j7KXHgpo5TEADb8p//Qr/Uv8hiVqvrMKoVbiJ+ZXB+Z+6k
ZMlcOryHWLbNOYYsA6soUygRkEFlFyUhfWJEvIYIr612s964+aD8cH9JhpPbGzkd
rTGUWeHXcW0is03AgEsq5SmjRO3zI+I0Ln2IMlePY+W8xwIL8jI3pf+LiI/gzJcG
sOxqgtS3zBzby3sU/9LDJfBCB4soqBcrp/6G6p2tK+gtVsud122LInl2/gGtXml4
AT15T1JaPbdnOCFXiy+GIK+AQiITfmPBiM+m7mh2LabxJNTDtYSrwh5rjKQjVpyi
N+AwVTeshzDbuoac6yuZSkLRPFfSPdgeefqCGNcpSjdJiJsvwQWHqf119CKYraLd
d2nOn4YRK4dtLWlu4bZkJCUC+KWQhCfQNAeu/ok9OQ7d6eKmigiDA+YlAf9NdymL
yknWzjjVxFwDKO/fGuOXq3kj/EW5JNgX1/esJAilVBojzavR9uu+fxVCvohEbc6I
hJ9hXijY1R3wmbJbpFRo8msKSifgaFVVrb2bKUbxtUI6er1mJ+dkAM8MeKf6+LTq
DLfyoD8o4F9+dJqHfRKIt+Pdgrm0nvcgxDPykAsX5vMoAwgeVIcyRK59wPOnTMWV
NaRabO85hEdkTTdAB10UB7Ix7PTNHGtCxfvb9mNnPDYMrHr8qolRfCzQW9nULzLi
ggVLgn2RLQ==
=Cd/Z
-----END PGP SIGNATURE-----
Merge tag 'block-5.12-2021-03-05' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe fixes:
- more device quirks (Julian Einwag, Zoltán Böszörményi, Pascal
Terjan)
- fix a hwmon error return (Daniel Wagner)
- fix the keep alive timeout initialization (Martin George)
- ensure the model_number can't be changed on a used subsystem
(Max Gurtovoy)
- rsxx missing -EFAULT on copy_to_user() failure (Dan)
- rsxx remove unused linux.h include (Tian)
- kill unused RQF_SORTED (Jean)
- updated outdated BFQ comments (Joseph)
- revert work-around commit for bd_size_lock, since we removed the
offending user in this merge window (Damien)
* tag 'block-5.12-2021-03-05' of git://git.kernel.dk/linux-block:
nvmet: model_number must be immutable once set
nvme-fabrics: fix kato initialization
nvme-hwmon: Return error code when registration fails
nvme-pci: add quirks for Lexar 256GB SSD
nvme-pci: mark Kingston SKC2000 as not supporting the deepest power state
nvme-pci: mark Seagate Nytro XM1440 as QUIRK_NO_NS_DESC_LIST.
rsxx: Return -EFAULT if copy_to_user() fails
block/bfq: update comments and default value in docs for fifo_expire
rsxx: remove unused including <linux/version.h>
block: Drop leftover references to RQF_SORTED
block: revert "block: fix bd_size_lock use"