In order to minimize the interference of migrate_disable() on lower
priority tasks, which can be deprived of runtime due to being stuck
below a higher priority task. Teach the RT/DL balancers to push away
these higher priority tasks when a lower priority task gets selected
to run on a freshly demoted CPU (pull).
This adds migration interference to the higher priority task, but
restores bandwidth to system that would otherwise be irrevocably lost.
Without this it would be possible to have all tasks on the system
stuck on a single CPU, each task preempted in a migrate_disable()
section with a single high priority task running.
This way we can still approximate running the M highest priority tasks
on the system.
Migrating the top task away is (ofcourse) still subject to
migrate_disable() too, which means the lower task is subject to an
interference equivalent to the worst case migrate_disable() section.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102347.499155098@infradead.org
Concurrent migrate_disable() and set_cpus_allowed_ptr() has
interesting features. We rely on set_cpus_allowed_ptr() to not return
until the task runs inside the provided mask. This expectation is
exported to userspace.
This means that any set_cpus_allowed_ptr() caller must wait until
migrate_enable() allows migrations.
At the same time, we don't want migrate_enable() to schedule, due to
patterns like:
preempt_disable();
migrate_disable();
...
migrate_enable();
preempt_enable();
And:
raw_spin_lock(&B);
spin_unlock(&A);
this means that when migrate_enable() must restore the affinity
mask, it cannot wait for completion thereof. Luck will have it that
that is exactly the case where there is a pending
set_cpus_allowed_ptr(), so let that provide storage for the async stop
machine.
Much thanks to Valentin who used TLA+ most effective and found lots of
'interesting' cases.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.921768277@infradead.org
Add the base migrate_disable() support (under protest).
While migrate_disable() is (currently) required for PREEMPT_RT, it is
also one of the biggest flaws in the system.
Notably this is just the base implementation, it is broken vs
sched_setaffinity() and hotplug, both solved in additional patches for
ease of review.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.818170844@infradead.org
Since we now migrate tasks away before DYING, we should also move
bandwidth unthrottle, otherwise we can gain tasks from unthrottle
after we expect all tasks to be gone already.
Also; it looks like the RT balancers don't respect cpu_active() and
instead rely on rq->online in part, complete this. This too requires
we do set_rq_offline() earlier to match the cpu_active() semantics.
(The bigger patch is to convert RT to cpu_active() entirely)
Since set_rq_online() is called from sched_cpu_activate(), place
set_rq_offline() in sched_cpu_deactivate().
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.639538965@infradead.org
With the new mechanism which kicks tasks off the outgoing CPU at the end of
schedule() the situation on an outgoing CPU right before the stopper thread
brings it down completely is:
- All user tasks and all unbound kernel threads have either been migrated
away or are not running and the next wakeup will move them to a online CPU.
- All per CPU kernel threads, except cpu hotplug thread and the stopper
thread have either been unbound or parked by the responsible CPU hotplug
callback.
That means that at the last step before the stopper thread is invoked the
cpu hotplug thread is the last legitimate running task on the outgoing
CPU.
Add a final wait step right before the stopper thread is kicked which
ensures that any still running tasks on the way to park or on the way to
kick themself of the CPU are either sleeping or gone.
This allows to remove the migrate_tasks() crutch in sched_cpu_dying(). If
sched_cpu_dying() detects that there is still another running task aside of
the stopper thread then it will explode with the appropriate fireworks.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.547163969@infradead.org
RT kernels need to ensure that all tasks which are not per CPU kthreads
have left the outgoing CPU to guarantee that no tasks are force migrated
within a migrate disabled section.
There is also some desire to (ab)use fine grained CPU hotplug control to
clear a CPU from active state to force migrate tasks which are not per CPU
kthreads away for power control purposes.
Add a mechanism which waits until all tasks which should leave the CPU
after the CPU active flag is cleared have moved to a different online CPU.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.377836842@infradead.org
In preparation for migrate_disable(), make sure only per-cpu kthreads
are allowed to run on !active CPUs.
This is ran (as one of the very first steps) from the cpu-hotplug
task which is a per-cpu kthread and completion of the hotplug
operation only requires such tasks.
This constraint enables the migrate_disable() implementation to wait
for completion of all migrate_disable regions on this CPU at hotplug
time without fear of any new ones starting.
This replaces the unlikely(rq->balance_callbacks) test at the tail of
context_switch with an unlikely(rq->balance_work), the fast path is
not affected.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.292709163@infradead.org
The intent of balance_callback() has always been to delay executing
balancing operations until the end of the current rq->lock section.
This is because balance operations must often drop rq->lock, and that
isn't safe in general.
However, as noted by Scott, there were a few holes in that scheme;
balance_callback() was called after rq->lock was dropped, which means
another CPU can interleave and touch the callback list.
Rework code to call the balance callbacks before dropping rq->lock
where possible, and otherwise splice the balance list onto a local
stack.
This guarantees that the balance list must be empty when we take
rq->lock. IOW, we'll only ever run our own balance callbacks.
Reported-by: Scott Wood <swood@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Link: https://lkml.kernel.org/r/20201023102346.203901269@infradead.org
In the case of a thread wakeup, wake_affine determines whether a core
will be chosen for the thread on the socket where the thread ran
previously or on the socket of the waker. This is done primarily by
comparing the load of the core where th thread ran previously (prev)
and the load of the waker (this).
commit 11f10e5420 ("sched/fair: Use load instead of runnable load
in wakeup path") changed the load computation from the runnable load
to the load average, where the latter includes the load of threads
that have already blocked on the core.
When a short-running daemon processes happens to run on prev, this
change raised the situation that prev could appear to have a greater
load than this, even when prev is actually idle. When prev and this
are on the same socket, the idle prev is detected later, in
select_idle_sibling. But if that does not hold, prev is completely
ignored, causing the waking thread to move to the socket of the waker.
In the case of N mostly active threads on N cores, this triggers other
migrations and hurts performance.
In contrast, before commit 11f10e5420, the load on an idle core
was 0, and in the case of a non-idle waker core, the effect of
wake_affine was to select prev as the target for searching for a core
for the waking thread.
To avoid unnecessary migrations, extend wake_affine_idle to check
whether the core where the thread previously ran is currently idle,
and if so simply return that core as the target.
[1] commit 11f10e5420 ("sched/fair: Use load instead of runnable
load in wakeup path")
This particularly has an impact when using the ondemand power manager,
where kworkers run every 0.004 seconds on all cores, increasing the
likelihood that an idle core will be considered to have a load.
The following numbers were obtained with the benchmarking tool
hyperfine (https://github.com/sharkdp/hyperfine) on the NAS parallel
benchmarks (https://www.nas.nasa.gov/publications/npb.html). The
tests were run on an 80-core Intel(R) Xeon(R) CPU E7-8870 v4 @
2.10GHz. Active (intel_pstate) and passive (intel_cpufreq) power
management were used. Times are in seconds. All experiments use all
160 hardware threads.
v5.9/intel-pstate v5.9+patch/intel-pstate
bt.C.c 24.725724+-0.962340 23.349608+-1.607214
lu.C.x 29.105952+-4.804203 25.249052+-5.561617
sp.C.x 31.220696+-1.831335 30.227760+-2.429792
ua.C.x 26.606118+-1.767384 25.778367+-1.263850
v5.9/ondemand v5.9+patch/ondemand
bt.C.c 25.330360+-1.028316 23.544036+-1.020189
lu.C.x 35.872659+-4.872090 23.719295+-3.883848
sp.C.x 32.141310+-2.289541 29.125363+-0.872300
ua.C.x 29.024597+-1.667049 25.728888+-1.539772
On the smaller data sets (A and B) and on the other NAS benchmarks
there is no impact on performance.
This also has a major impact on the splash2x.volrend benchmark of the
parsec benchmark suite that goes from 1m25 without this patch to 0m45,
in active (intel_pstate) mode.
Fixes: 11f10e5420 ("sched/fair: Use load instead of runnable load in wakeup path")
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/1603372550-14680-1-git-send-email-Julia.Lawall@inria.fr
Florian reported that all of kernel/sched/ is rebuild when
CONFIG_BLK_DEV_INITRD is changed, which, while not a bug is
unexpected. This is due to us including vmlinux.lds.h.
Jakub explained that the problem is that we put the alignment
requirement on the type instead of on a variable. Type alignment is a
minimum, the compiler is free to pick any larger alignment for a
specific instance of the type (eg. the variable).
So force the type alignment on all individual variable definitions and
remove the undesired dependency on vmlinux.lds.h.
Fixes: 85c2ce9104 ("sched, vmlinux.lds: Increase STRUCT_ALIGNMENT to 64 bytes for GCC-4.9")
Reported-by: Florian Fainelli <f.fainelli@gmail.com>
Suggested-by: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
do_sched_yield() invokes schedule() with interrupts disabled which is
not allowed. This goes back to the pre git era to commit a6efb709806c
("[PATCH] irqlock patch 2.5.27-H6") in the history tree.
Reenable interrupts and remove the misleading comment which "explains" it.
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
Add comments and memory barrier to kthread_use_mm and kthread_unuse_mm
to allow the effect of membarrier(2) to apply to kthreads accessing
user-space memory as well.
Given that no prior kthread use this guarantee and that it only affects
kthreads, adding this guarantee does not affect user-space ABI.
Refine the check in membarrier_global_expedited to exclude runqueues
running the idle thread rather than all kthreads from the IPI cpumask.
Now that membarrier_global_expedited can IPI kthreads, the scheduler
also needs to update the runqueue's membarrier_state when entering lazy
TLB state.
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-3-mathieu.desnoyers@efficios.com
exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.
exit_mm should also update the runqueue's membarrier_state so
membarrier global expedited IPIs are not sent when they are not
needed.
The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.
Here is the scenario I have in mind:
Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).
The AFAIU we can have:
Userspace variables:
int x = 0, y = 0;
CPU 0 CPU 1
Thread A Thread B
(in thread group A) (in thread group B)
x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
r1 = load y
membarrier()
skips CPU 0 (no IPI) because its current mm is NULL
r2 = load x
BUG_ON(r1 == 1 && r2 == 0)
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
It is possible for find_new_ilb() to select the current CPU, however,
this only happens from newidle balancing, in which case need_resched()
will be true, and consequently nohz_csd_func() will not trigger the
softirq.
Exclude the current CPU from becoming an ILB target.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Add CPUPRI_HIGHER above the RT99 priority to denote the CPU is in use
by higher priority tasks (specifically deadline).
XXX: we should probably drive PUSH-PULL from cpupri, that would
automagically result in an RT-PUSH when DL sets cpupri to CPUPRI_HIGHER.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
pri_to_cpu[CPUPRI_IDLE=0] isn't used since cpupri_set(..., newpri) is
never called with newpri = MAX_PRIO (140).
Current mapping:
p->rt_priority p->prio newpri cpupri
-1 -1 (CPUPRI_INVALID)
140 0 (CPUPRI_IDLE)
100 1 (CPUPRI_NORMAL)
1 98 98 3
...
49 50 50 51
50 49 49 52
...
99 0 0 101
Even when cpupri was introduced with commit 6e0534f278 ("sched: use a
2-d bitmap for searching lowest-pri CPU") in v2.6.27, only
(1) CPUPRI_INVALID (-1),
(2) MAX_RT_PRIO (100),
(3) an RT prio (RT1..RT99)
were used as newprio in cpupri_set(..., newpri) -> convert_prio(newpri).
MAX_RT_PRIO is used only in dec_rt_tasks() -> dec_rt_prio() ->
dec_rt_prio_smp() -> cpupri_set() in case of !rt_rq->rt_nr_running.
I.e. it stands for a non-rt task, including the IDLE task.
Commit 57785df5ac ("sched: Fix task priority bug") removed code in
v2.6.33 which did set the priority of the IDLE task to MAX_PRIO.
Although this happened after the introduction of cpupri, it didn't have
an effect on the values used for cpupri_set(..., newpri).
Remove CPUPRI_IDLE and adapt the cpupri implementation accordingly.
This will save a useless for loop with an atomic_read in
cpupri_find_fitness() calling __cpupri_find().
New mapping:
p->rt_priority p->prio newpri cpupri
-1 -1 (CPUPRI_INVALID)
100 0 (CPUPRI_NORMAL)
1 98 98 2
...
49 50 50 50
50 49 49 51
...
99 0 0 100
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200922083934.19275-2-dietmar.eggemann@arm.com
When change sched_rt_{runtime, period}_us, we validate that the new
settings should at least accommodate the currently allocated -dl
bandwidth:
sched_rt_handler()
--> sched_dl_bandwidth_validate()
{
new_bw = global_rt_runtime()/global_rt_period();
for_each_possible_cpu(cpu) {
dl_b = dl_bw_of(cpu);
if (new_bw < dl_b->total_bw) <-------
ret = -EBUSY;
}
}
But under CONFIG_SMP, dl_bw is per root domain , but not per CPU,
dl_b->total_bw is the allocated bandwidth of the whole root domain.
Instead, we should compare dl_b->total_bw against "cpus*new_bw",
where 'cpus' is the number of CPUs of the root domain.
Also, below annotation(in kernel/sched/sched.h) implied implementation
only appeared in SCHED_DEADLINE v2[1], then deadline scheduler kept
evolving till got merged(v9), but the annotation remains unchanged,
meaningless and misleading, update it.
* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
* - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
* - dl_total_bw array contains, in the i-eth element, the currently
* allocated bandwidth on the i-eth CPU.
[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/
Fixes: 332ac17ef5 ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
Pull tracing fix from Steven Rostedt:
"Fix synthetic event "strcat" overrun
New synthetic event code used strcat() and miscalculated the ending,
causing the concatenation to write beyond the allocated memory.
Instead of using strncat(), the code is switched over to seq_buf which
has all the mechanisms in place to protect against writing more than
what is allocated, and cleans up the code a bit"
* tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing, synthetic events: Replace buggy strcat() with seq_buf operations
There was a memory corruption bug happening while running the synthetic
event selftests:
kmemleak: Cannot insert 0xffff8c196fa2afe5 into the object search tree (overlaps existing)
CPU: 5 PID: 6866 Comm: ftracetest Tainted: G W 5.9.0-rc5-test+ #577
Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
Call Trace:
dump_stack+0x8d/0xc0
create_object.cold+0x3b/0x60
slab_post_alloc_hook+0x57/0x510
? tracing_map_init+0x178/0x340
__kmalloc+0x1b1/0x390
tracing_map_init+0x178/0x340
event_hist_trigger_func+0x523/0xa40
trigger_process_regex+0xc5/0x110
event_trigger_write+0x71/0xd0
vfs_write+0xca/0x210
ksys_write+0x70/0xf0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fef0a63a487
Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff76f18398 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000039 RCX: 00007fef0a63a487
RDX: 0000000000000039 RSI: 000055eb3b26d690 RDI: 0000000000000001
RBP: 000055eb3b26d690 R08: 000000000000000a R09: 0000000000000038
R10: 000055eb3b2cdb80 R11: 0000000000000246 R12: 0000000000000039
R13: 00007fef0a70b500 R14: 0000000000000039 R15: 00007fef0a70b700
kmemleak: Kernel memory leak detector disabled
kmemleak: Object 0xffff8c196fa2afe0 (size 8):
kmemleak: comm "ftracetest", pid 6866, jiffies 4295082531
kmemleak: min_count = 1
kmemleak: count = 0
kmemleak: flags = 0x1
kmemleak: checksum = 0
kmemleak: backtrace:
__kmalloc+0x1b1/0x390
tracing_map_init+0x1be/0x340
event_hist_trigger_func+0x523/0xa40
trigger_process_regex+0xc5/0x110
event_trigger_write+0x71/0xd0
vfs_write+0xca/0x210
ksys_write+0x70/0xf0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9
The cause came down to a use of strcat() that was adding an string that was
shorten, but the strcat() did not take that into account.
strcat() is extremely dangerous as it does not care how big the buffer is.
Replace it with seq_buf operations that prevent the buffer from being
overwritten if what is being written is bigger than the buffer.
Fixes: 10819e2579 ("tracing: Handle synthetic event array field type checking correctly")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
tid_addr is not a "pointer to (pointer to int in userspace)"; it is in
fact a "pointer to (pointer to int in userspace) in userspace". So
sparse rightfully complains about passing a kernel pointer to
put_user().
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull timer fixes from Thomas Gleixner:
"A time namespace fix and a matching selftest. The futex absolute
timeouts which are based on CLOCK_MONOTONIC require time namespace
corrected. This was missed in the original time namesapce support"
* tag 'timers-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/timens: Add a test for futex()
futex: Adjust absolute futex timeouts with per time namespace offset
Pull scheduler fixes from Thomas Gleixner:
"Two scheduler fixes:
- A trivial build fix for sched_feat() to compile correctly with
CONFIG_JUMP_LABEL=n
- Replace a zero lenght array with a flexible array"
* tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/features: Fix !CONFIG_JUMP_LABEL case
sched: Replace zero-length array with flexible-array
Pull SafeSetID updates from Micah Morton:
"The changes are mostly contained to within the SafeSetID LSM, with the
exception of a few 1-line changes to change some ns_capable() calls to
ns_capable_setid() -- causing a flag (CAP_OPT_INSETID) to be set that
is examined by SafeSetID code and nothing else in the kernel.
The changes to SafeSetID internally allow for setting up GID
transition security policies, as already existed for UIDs"
* tag 'safesetid-5.10' of git://github.com/micah-morton/linux:
LSM: SafeSetID: Fix warnings reported by test bot
LSM: SafeSetID: Add GID security policy handling
LSM: Signal to SafeSetID when setting group IDs
Pull random32 updates from Willy Tarreau:
"Make prandom_u32() less predictable.
This is the cleanup of the latest series of prandom_u32
experimentations consisting in using SipHash instead of Tausworthe to
produce the randoms used by the network stack.
The changes to the files were kept minimal, and the controversial
commit that used to take noise from the fast_pool (f227e3ec3b) was
reverted. Instead, a dedicated "net_rand_noise" per_cpu variable is
fed from various sources of activities (networking, scheduling) to
perturb the SipHash state using fast, non-trivially predictable data,
instead of keeping it fully deterministic. The goal is essentially to
make any occasional memory leakage or brute-force attempt useless.
The resulting code was verified to be very slightly faster on x86_64
than what is was with the controversial commit above, though this
remains barely above measurement noise. It was also tested on i386 and
arm, and build- tested only on arm64"
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
* tag '20201024-v4-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom:
random32: add a selftest for the prandom32 code
random32: add noise from network and scheduling activity
random32: make prandom_u32() output unpredictable
Pull dma-mapping fixes from Christoph Hellwig:
- document the new dma_{alloc,free}_pages() API
- two fixups for the dma-mapping.h split
* tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: document dma_{alloc,free}_pages
dma-mapping: move more functions to dma-map-ops.h
ARM/sa1111: add a missing include of dma-map-ops.h
With the removal of the interrupt perturbations in previous random32
change (random32: make prandom_u32() output unpredictable), the PRNG
has become 100% deterministic again. While SipHash is expected to be
way more robust against brute force than the previous Tausworthe LFSR,
there's still the risk that whoever has even one temporary access to
the PRNG's internal state is able to predict all subsequent draws till
the next reseed (roughly every minute). This may happen through a side
channel attack or any data leak.
This patch restores the spirit of commit f227e3ec3b ("random32: update
the net random state on interrupt and activity") in that it will perturb
the internal PRNG's statee using externally collected noise, except that
it will not pick that noise from the random pool's bits nor upon
interrupt, but will rather combine a few elements along the Tx path
that are collectively hard to predict, such as dev, skb and txq
pointers, packet length and jiffies values. These ones are combined
using a single round of SipHash into a single long variable that is
mixed with the net_rand_state upon each invocation.
The operation was inlined because it produces very small and efficient
code, typically 3 xor, 2 add and 2 rol. The performance was measured
to be the same (even very slightly better) than before the switch to
SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
(i40e), the connection rate dropped from 556k/s to 555k/s while the
SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
Cc: George Spelvin <lkml@sdf.org>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Non-cryptographic PRNGs may have great statistical properties, but
are usually trivially predictable to someone who knows the algorithm,
given a small sample of their output. An LFSR like prandom_u32() is
particularly simple, even if the sample is widely scattered bits.
It turns out the network stack uses prandom_u32() for some things like
random port numbers which it would prefer are *not* trivially predictable.
Predictability led to a practical DNS spoofing attack. Oops.
This patch replaces the LFSR with a homebrew cryptographic PRNG based
on the SipHash round function, which is in turn seeded with 128 bits
of strong random key. (The authors of SipHash have *not* been consulted
about this abuse of their algorithm.) Speed is prioritized over security;
attacks are rare, while performance is always wanted.
Replacing all callers of prandom_u32() is the quick fix.
Whether to reinstate a weaker PRNG for uses which can tolerate it
is an open question.
Commit f227e3ec3b ("random32: update the net random state on interrupt
and activity") was an earlier attempt at a solution. This patch replaces
it.
Reported-by: Amit Klein <aksecurity@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Fixes: f227e3ec3b ("random32: update the net random state on interrupt and activity")
Signed-off-by: George Spelvin <lkml@sdf.org>
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
[ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions
to prandom.h for later use; merged George's prandom_seed() proposal;
inlined siprand_u32(); replaced the net_rand_state[] array with 4
members to fix a build issue; cosmetic cleanups to make checkpatch
happy; fixed RANDOM32_SELFTEST build ]
Signed-off-by: Willy Tarreau <w@1wt.eu>
Pull tracing ring-buffer fix from Steven Rostedt:
"The success return value of ring_buffer_resize() is stated to be
zero and checked that way.
But it was incorrectly returning the size allocated.
Also, a fix to a comment"
* tag 'trace-v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ring-buffer: Update the description for ring_buffer_wait
ring-buffer: Return 0 on success from ring_buffer_resize()
Pull more power management updates from Rafael Wysocki:
"First of all, the adaptive voltage scaling (AVS) drivers go to new
platform-specific locations as planned (this part was reported to have
merge conflicts against the new arm-soc updates in linux-next).
In addition to that, there are some fixes (intel_idle, intel_pstate,
RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state
accounting support to the generic power domains (genpd) code and some
janitorial changes all over.
Specifics:
- Move the AVS drivers to new platform-specific locations and get rid
of the drivers/power/avs directory (Ulf Hansson).
- Add on/off notifiers and idle state accounting support to the
generic power domains (genpd) framework (Ulf Hansson, Lina Iyer).
- Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson).
- Make intel_idle disregard ACPI _CST if it cannot use the data
returned by that method (Mel Gorman).
- Modify intel_pstate to avoid leaving useless sysfs directory
structure behind if it cannot be registered (Chen Yu).
- Fix domain detection in the RAPL power capping driver and prevent
it from failing to enumerate the Psys RAPL domain (Zhang Rui).
- Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and
later AMD chips (Wei Huang).
- Update the driver assumptions comment in intel_idle and fix a
kerneldoc comment in the runtime PM framework (Alexander Monakov,
Bean Huo).
- Avoid unnecessary resets of the cached frequency in the schedutil
cpufreq governor to reduce overhead (Wei Wang).
- Clean up the cpufreq core a bit (Viresh Kumar).
- Make assorted minor janitorial changes (Daniel Lezcano, Geert
Uytterhoeven, Hubert Jasudowicz, Tom Rix).
- Clean up and optimize the cpupower utility somewhat (Colin Ian
King, Martin Kaistra)"
* tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
PM: sleep: remove unreachable break
PM: AVS: Drop the avs directory and the corresponding Kconfig
PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers
PM: runtime: Fix typo in pm_runtime_set_active() helper comment
PM: domains: Fix build error for genpd notifiers
powercap: Fix typo in Kconfig "Plance" -> "Plane"
cpufreq: schedutil: restore cached freq when next_f is not changed
acpi-cpufreq: Honor _PSD table setting on new AMD CPUs
PM: AVS: smartreflex Move driver to soc specific drivers
PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers
PM: domains: enable domain idle state accounting
PM: domains: Add curly braces to delimit comment + statement block
PM: domains: Add support for PM domain on/off notifiers for genpd
powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain
powercap/intel_rapl: Fix domain detection
intel_idle: Ignore _CST if control cannot be taken from the platform
cpuidle: Remove pointless stub
intel_idle: mention assumption that WBINVD is not needed
MAINTAINERS: Add section for cpuidle-psci PM domain
cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver
...
Pull networking fixes from Jakub Kicinski:
"Cross-tree/merge window issues:
- rtl8150: don't incorrectly assign random MAC addresses; fix late in
the 5.9 cycle started depending on a return code from a function
which changed with the 5.10 PR from the usb subsystem
Current release regressions:
- Revert "virtio-net: ethtool configurable RXCSUM", it was causing
crashes at probe when control vq was not negotiated/available
Previous release regressions:
- ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO
bus, only first device would be probed correctly
- nexthop: Fix performance regression in nexthop deletion by
effectively switching from recently added synchronize_rcu() to
synchronize_rcu_expedited()
- netsec: ignore 'phy-mode' device property on ACPI systems; the
property is not populated correctly by the firmware, but firmware
configures the PHY so just keep boot settings
Previous releases - always broken:
- tcp: fix to update snd_wl1 in bulk receiver fast path, addressing
bulk transfers getting "stuck"
- icmp: randomize the global rate limiter to prevent attackers from
getting useful signal
- r8169: fix operation under forced interrupt threading, make the
driver always use hard irqs, even on RT, given the handler is light
and only wants to schedule napi (and do so through a _irqoff()
variant, preferably)
- bpf: Enforce pointer id generation for all may-be-null register
type to avoid pointers erroneously getting marked as null-checked
- tipc: re-configure queue limit for broadcast link
- net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN
tunnels
- fix various issues in chelsio inline tls driver
Misc:
- bpf: improve just-added bpf_redirect_neigh() helper api to support
supplying nexthop by the caller - in case BPF program has already
done a lookup we can avoid doing another one
- remove unnecessary break statements
- make MCTCP not select IPV6, but rather depend on it"
* tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
tcp: fix to update snd_wl1 in bulk receiver fast path
net: Properly typecast int values to set sk_max_pacing_rate
netfilter: nf_fwd_netdev: clear timestamp in forwarding path
ibmvnic: save changed mac address to adapter->mac_addr
selftests: mptcp: depends on built-in IPv6
Revert "virtio-net: ethtool configurable RXCSUM"
rtnetlink: fix data overflow in rtnl_calcit()
net: ethernet: mtk-star-emac: select REGMAP_MMIO
net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup
net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device
bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static
bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh()
bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop
mptcp: depends on IPV6 but not as a module
sfc: move initialisation of efx->filter_sem to efx_init_struct()
mpls: load mpls_gso after mpls_iptunnel
net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels
net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action()
net: dsa: bcm_sf2: make const array static, makes object smaller
mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it
...
Pull arch task_work cleanups from Jens Axboe:
"Two cleanups that don't fit other categories:
- Finally get the task_work_add() cleanup done properly, so we don't
have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates
all callers, and also fixes up the documentation for
task_work_add().
- While working on some TIF related changes for 5.11, this
TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch
duplication for how that is handled"
* tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block:
task_work: cleanup notification modes
tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()
Pull Kbuild updates from Masahiro Yamada:
- Support 'make compile_commands.json' to generate the compilation
database more easily, avoiding stale entries
- Support 'make clang-analyzer' and 'make clang-tidy' for static checks
using clang-tidy
- Preprocess scripts/modules.lds.S to allow CONFIG options in the
module linker script
- Drop cc-option tests from compiler flags supported by our minimal
GCC/Clang versions
- Use always 12-digits commit hash for CONFIG_LOCALVERSION_AUTO=y
- Use sha1 build id for both BFD linker and LLD
- Improve deb-pkg for reproducible builds and rootless builds
- Remove stale, useless scripts/namespace.pl
- Turn -Wreturn-type warning into error
- Fix build error of deb-pkg when CONFIG_MODULES=n
- Replace 'hostname' command with more portable 'uname -n'
- Various Makefile cleanups
* tag 'kbuild-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
kbuild: Use uname for LINUX_COMPILE_HOST detection
kbuild: Only add -fno-var-tracking-assignments for old GCC versions
kbuild: remove leftover comment for filechk utility
treewide: remove DISABLE_LTO
kbuild: deb-pkg: clean up package name variables
kbuild: deb-pkg: do not build linux-headers package if CONFIG_MODULES=n
kbuild: enforce -Werror=return-type
scripts: remove namespace.pl
builddeb: Add support for all required debian/rules targets
builddeb: Enable rootless builds
builddeb: Pass -n to gzip for reproducible packages
kbuild: split the build log of kallsyms
kbuild: explicitly specify the build id style
scripts/setlocalversion: make git describe output more reliable
kbuild: remove cc-option test of -Werror=date-time
kbuild: remove cc-option test of -fno-stack-check
kbuild: remove cc-option test of -fno-strict-overflow
kbuild: move CFLAGS_{KASAN,UBSAN,KCSAN} exports to relevant Makefiles
kbuild: remove redundant CONFIG_KASAN check from scripts/Makefile.kasan
kbuild: do not create built-in objects for external module builds
...
Pull modules updates from Jessica Yu:
"Code cleanups: more informative error messages and statically
initialize init_free_wq to avoid a workqueue warning"
* tag 'modules-for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: statically initialize init section freeing data
module: Add more error message for failed kernel module loading