With debugobjects enabled the timer hint for freeing of active timers
embedded inside delayed works is always the same, i.e. the hint is
delayed_work_timer_fn, even though the function the delayed work is going
to run can be wildly different depending on what work was queued. Enabling
workqueue debugobjects doesn't help either because the delayed work isn't
considered active until it is actually queued to run on a workqueue. If the
work is freed while the timer is pending the work isn't considered active
so there is no information from workqueue debugobjects.
Special case delayed works in the timer debugobjects hint logic so that the
delayed work function is returned instead of the delayed_work_timer_fn.
This will help to understand which delayed work was pending that got
freed.
Apply the same treatment for kthread_delayed_work because it follows the
same pattern.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Stephen Boyd <swboyd@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220511201951.42408-1-swboyd@chromium.org
The kernel uses kHz as the unit for clock rates reported between 1MHz
(inclusive) and 4MHz (exclusive), e.g.:
sched_clock: 64 bits at 1000kHz, resolution 1000ns, wraps every 2199023255500ns
This reduces the amount of data lost due to rounding, but hasn't been
replicated for the kHz range when support was added for proper reporting of
sub-kHz clock rates. Take the same approach for rates between 1kHz
(inclusive) and 4kHz (exclusive), which makes it consistent.
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2204240106380.9383@angie.orcam.me.uk
The frequency reported for clock sources are rounded down, which gives
misleading figures, e.g.:
I/O ASIC clock frequency 24999480Hz
sched_clock: 32 bits at 24MHz, resolution 40ns, wraps every 85901132779ns
MIPS counter frequency 59998512Hz
sched_clock: 32 bits at 59MHz, resolution 16ns, wraps every 35792281591ns
Rounding to nearest is more adequate:
I/O ASIC clock frequency 24999664Hz
sched_clock: 32 bits at 25MHz, resolution 40ns, wraps every 85900499947ns
MIPS counter frequency 59999728Hz
sched_clock: 32 bits at 60MHz, resolution 16ns, wraps every 35791556599ns
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/r/alpine.DEB.2.21.2204240055590.9383@angie.orcam.me.uk
Accessing timekeeper::offset_boot in ktime_get_boot_fast_ns() is an
intended data race as the reader side cannot synchronize with a writer and
there is no space in struct tk_read_base of the NMI safe timekeeper.
Mark it so.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220415091920.956045162@linutronix.de
When tick_nohz_stop_tick() stops the tick and high resolution timers are
disabled, then the clock event device is not put into ONESHOT_STOPPED
mode. This can lead to spurious timer interrupts with some clock event
device drivers that don't shut down entirely after firing.
Eliminate these by putting the device into ONESHOT_STOPPED mode at points
where it is not being reprogrammed. When there are no timers active, then
tick_program_event() with KTIME_MAX can be used to stop the device. When
there is a timer active, the device can be stopped at the next tick (any
new timer added by timers will reprogram the tick).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220422141446.915024-1-npiggin@gmail.com
Introduce fast/NMI safe accessor to clock tai for tracing. The Linux kernel
tracing infrastructure has support for using different clocks to generate
timestamps for trace events. Especially in TSN networks it's useful to have TAI
as trace clock, because the application scheduling is done in accordance to the
network time, which is based on TAI. With a tai trace_clock in place, it becomes
very convenient to correlate network activity with Linux kernel application
traces.
Use the same implementation as ktime_get_boot_fast_ns() does by reading the
monotonic time and adding the TAI offset. The same limitations as for the fast
boot implementation apply. The TAI offset may change at run time e.g., by
setting the time or using adjtimex() with an offset. However, these kind of
offset changes are rare events. Nevertheless, the user has to be aware and deal
with it in post processing.
An alternative approach would be to use the same implementation as
ktime_get_real_fast_ns() does. However, this requires to add an additional u64
member to the tk_read_base struct. This struct together with a seqcount is
designed to fit into a single cache line on 64 bit architectures. Adding a new
member would violate this constraint.
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20220414091805.89667-2-kurt@linutronix.de
clocksource_verify_percpu() calls cpumask_weight() to check if any bit of a
given cpumask is set.
This can be done more efficiently with cpumask_empty() because
cpumask_empty() stops traversing the cpumask as soon as it finds first set
bit, while cpumask_weight() counts all bits unconditionally.
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220210224933.379149-24-yury.norov@gmail.com
Pull perf fixes from Borislav Petkov:
- A couple of fixes to cgroup-related handling of perf events
- A couple of fixes to event encoding on Sapphire Rapids
- Pass event caps of inherited events so that perf doesn't fail wrongly
at fork()
- Add support for a new Raptor Lake CPU
* tag 'perf_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Always set cpuctx cgrp when enable cgroup event
perf/core: Fix perf_cgroup_switch()
perf/core: Use perf_cgroup_info->active to check if cgroup is active
perf/core: Don't pass task around when ctx sched in
perf/x86/intel: Update the FRONTEND MSR mask on Sapphire Rapids
perf/x86/intel: Don't extend the pseudo-encoding to GP counters
perf/core: Inherit event_caps
perf/x86/uncore: Add Raptor Lake uncore support
perf/x86/msr: Add Raptor Lake CPU support
perf/x86/cstate: Add Raptor Lake support
perf/x86: Add Intel Raptor Lake support
Pull locking fixes from Borislav Petkov:
- Allow the compiler to optimize away unused percpu accesses and change
the local_lock_* macros back to inline functions
- A couple of fixes to static call insn patching
* tag 'locking_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "mm/page_alloc: mark pagesets as __maybe_unused"
Revert "locking/local_lock: Make the empty local_lock_*() function a macro."
x86/percpu: Remove volatile from arch_raw_cpu_ptr().
static_call: Remove __DEFINE_STATIC_CALL macro
static_call: Properly initialise DEFINE_STATIC_CALL_RET0()
static_call: Don't make __static_call_return0 static
x86,static_call: Fix __static_call_return0 for i386
Pull scheduler fixes from Borislav Petkov:
- Use the correct static key checking primitive on the IRQ exit path
- Two fixes for the new forceidle balancer
* tag 'sched_urgent_for_v5.18_rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Fix compile error in dynamic_irqentry_exit_cond_resched()
sched: Teach the forced-newidle balancer about CPU affinity limitation.
sched/core: Fix forceidle balancing
The level granularity round up of calc_index() does:
(x + (1 << n)) >> n
which is obviously equivalent to
(x >> n) + 1
but compilers can't figure that out despite the fact that the input range
is known to not cause an overflow. It's neither intuitive to read.
Just write out the obvious.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/87h778j46c.ffs@tglx
When base::next_expiry_recalc is not initialized to false during cpu
bringup in HOTPLUG_CPU and is accidently true and no timer is queued in the
meantime, the loop through the wheel to find __next_timer_interrupt() might
be done for nothing.
Therefore initialize base::next_expiry_recalc to false in
timers_prepare_cpu().
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220405191732.7438-2-anna-maria@linutronix.de
When the timer base is empty, base::next_expiry is set to base::clk +
NEXT_TIMER_MAX_DELTA and base::next_expiry_recalc is false. When no timer
is queued until jiffies reaches base::next_expiry value, the warning for
not finding any expired timer and base::next_expiry_recalc is false in
__run_timers() triggers.
To prevent triggering the warning in this valid scenario
base::timers_pending needs to be added to the warning condition.
Fixes: 31cd0e119d ("timers: Recalculate next timer interrupt only when necessary")
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20220405191732.7438-3-anna-maria@linutronix.de
Pull networking fixes from Paolo Abeni:
"Including fixes from bpf and netfilter.
Current release - new code bugs:
- mctp: correct mctp_i2c_header_create result
- eth: fungible: fix reference to __udivdi3 on 32b builds
- eth: micrel: remove latencies support lan8814
Previous releases - regressions:
- bpf: resolve to prog->aux->dst_prog->type only for BPF_PROG_TYPE_EXT
- vrf: fix packet sniffing for traffic originating from ip tunnels
- rxrpc: fix a race in rxrpc_exit_net()
- dsa: revert "net: dsa: stop updating master MTU from master.c"
- eth: ice: fix MAC address setting
Previous releases - always broken:
- tls: fix slab-out-of-bounds bug in decrypt_internal
- bpf: support dual-stack sockets in bpf_tcp_check_syncookie
- xdp: fix coalescing for page_pool fragment recycling
- ovs: fix leak of nested actions
- eth: sfc:
- add missing xdp queue reinitialization
- fix using uninitialized xdp tx_queue
- eth: ice:
- clear default forwarding VSI during VSI release
- fix broken IFF_ALLMULTI handling
- synchronize_rcu() when terminating rings
- eth: qede: confirm skb is allocated before using
- eth: aqc111: fix out-of-bounds accesses in RX fixup
- eth: slip: fix NPD bug in sl_tx_timeout()"
* tag 'net-5.18-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
drivers: net: slip: fix NPD bug in sl_tx_timeout()
bpf: Adjust bpf_tcp_check_syncookie selftest to test dual-stack sockets
bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
myri10ge: fix an incorrect free for skb in myri10ge_sw_tso
net: usb: aqc111: Fix out-of-bounds accesses in RX fixup
qede: confirm skb is allocated before using
net: ipv6mr: fix unused variable warning with CONFIG_IPV6_PIMSM_V2=n
net: phy: mscc-miim: reject clause 45 register accesses
net: axiemac: use a phandle to reference pcs_phy
dt-bindings: net: add pcs-handle attribute
net: axienet: factor out phy_node in struct axienet_local
net: axienet: setup mdio unconditionally
net: sfc: fix using uninitialized xdp tx_queue
rxrpc: fix a race in rxrpc_exit_net()
net: openvswitch: fix leak of nested actions
net: ethernet: mv643xx: Fix over zealous checking of_get_mac_address()
net: openvswitch: don't send internal clone attribute to the userspace.
net: micrel: Fix KS8851 Kconfig
ice: clear cmd_type_offset_bsz for TX rings
ice: xsk: fix VSI state check in ice_xsk_wakeup()
...
Alexei Starovoitov says:
====================
pull-request: bpf 2022-04-06
We've added 8 non-merge commits during the last 8 day(s) which contain
a total of 9 files changed, 139 insertions(+), 36 deletions(-).
The main changes are:
1) rethook related fixes, from Jiri and Masami.
2) Fix the case when tracing bpf prog is attached to struct_ops, from Martin.
3) Support dual-stack sockets in bpf_tcp_check_syncookie, from Maxim.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Adjust bpf_tcp_check_syncookie selftest to test dual-stack sockets
bpf: Support dual-stack sockets in bpf_tcp_check_syncookie
bpf: selftests: Test fentry tracing a struct_ops program
bpf: Resolve to prog->aux->dst_prog->type only for BPF_PROG_TYPE_EXT
rethook: Fix to use WRITE_ONCE() for rethook:: Handler
selftests/bpf: Fix warning comparing pointer to 0
bpf: Fix sparse warnings in kprobe_multi_resolve_syms
bpftool: Explicit errno handling in skeletons
====================
Link: https://lore.kernel.org/r/20220407031245.73026-1-alexei.starovoitov@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When enable a cgroup event, cpuctx->cgrp setting is conditional
on the current task cgrp matching the event's cgroup, so have to
do it for every new event. It brings complexity but no advantage.
To keep it simple, this patch would always set cpuctx->cgrp
when enable the first cgroup event, and reset to NULL when disable
the last cgroup event.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-5-zhouchengming@bytedance.com
There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
in perf_cgroup_switch().
CPU1 CPU2
perf_cgroup_sched_out(prev, next)
cgrp1 = perf_cgroup_from_task(prev)
cgrp2 = perf_cgroup_from_task(next)
if (cgrp1 != cgrp2)
perf_cgroup_switch(prev, PERF_CGROUP_SWOUT)
cgroup_migrate_execute()
task->cgroups = ?
perf_cgroup_attach()
task_function_call(task, __perf_cgroup_move)
perf_cgroup_sched_in(prev, next)
cgrp1 = perf_cgroup_from_task(prev)
cgrp2 = perf_cgroup_from_task(next)
if (cgrp1 != cgrp2)
perf_cgroup_switch(next, PERF_CGROUP_SWIN)
__perf_cgroup_move()
perf_cgroup_switch(task, PERF_CGROUP_SWOUT | PERF_CGROUP_SWIN)
The commit a8d757ef07 ("perf events: Fix slow and broken cgroup
context switch code") want to skip perf_cgroup_switch() when the
perf_cgroup of "prev" and "next" are the same.
But task->cgroups can change in concurrent with context_switch()
in cgroup_migrate_execute(). If cgrp1 == cgrp2 in sched_out(),
cpuctx won't do sched_out. Then task->cgroups changed cause
cgrp1 != cgrp2 in sched_in(), cpuctx will do sched_in. So trigger
WARN_ON_ONCE(cpuctx->cgrp).
Even though __perf_cgroup_move() will be synchronized as the context
switch disables the interrupt, context_switch() still can see the
task->cgroups is changing in the middle, since task->cgroups changed
before sending IPI.
So we have to combine perf_cgroup_sched_in() into perf_cgroup_sched_out(),
unified into perf_cgroup_switch(), to fix the incosistency between
perf_cgroup_sched_out() and perf_cgroup_sched_in().
But we can't just compare prev->cgroups with next->cgroups to decide
whether to skip cpuctx sched_out/in since the prev->cgroups is changing
too. For example:
CPU1 CPU2
cgroup_migrate_execute()
prev->cgroups = ?
perf_cgroup_attach()
task_function_call(task, __perf_cgroup_move)
perf_cgroup_switch(task)
cgrp1 = perf_cgroup_from_task(prev)
cgrp2 = perf_cgroup_from_task(next)
if (cgrp1 != cgrp2)
cpuctx sched_out/in ...
task_function_call() will return -ESRCH
In the above example, prev->cgroups changing cause (cgrp1 == cgrp2)
to be true, so skip cpuctx sched_out/in. And later task_function_call()
would return -ESRCH since the prev task isn't running on cpu anymore.
So we would leave perf_events of the old prev->cgroups still sched on
the CPU, which is wrong.
The solution is that we should use cpuctx->cgrp to compare with
the next task's perf_cgroup. Since cpuctx->cgrp can only be changed
on local CPU, and we have irq disabled, we can read cpuctx->cgrp to
compare without holding ctx lock.
Fixes: a8d757ef07 ("perf events: Fix slow and broken cgroup context switch code")
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-4-zhouchengming@bytedance.com
The current code pass task around for ctx_sched_in(), only
to get perf_cgroup of the task, then update the timestamp
of it and its ancestors and set them to active.
But we can use cpuctx->cgrp to get active perf_cgroup and
its ancestors since cpuctx->cgrp has been set before
ctx_sched_in().
This patch remove the task argument in ctx_sched_in()
and cleanup related code.
Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20220329154523.86438-2-zhouchengming@bytedance.com
It was reported that some perf event setup can make fork failed on
ARM64. It was the case of a group of mixed hw and sw events and it
failed in perf_event_init_task() due to armpmu_event_init().
The ARM PMU code checks if all the events in a group belong to the
same PMU except for software events. But it didn't set the event_caps
of inherited events and no longer identify them as software events.
Therefore the test failed in a child process.
A simple reproducer is:
$ perf stat -e '{cycles,cs,instructions}' perf bench sched messaging
# Running 'sched/messaging' benchmark:
perf: fork(): Invalid argument
The perf stat was fine but the perf bench failed in fork(). Let's
inherit the event caps from the parent.
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20220328200112.457740-1-namhyung@kernel.org
System.map shows that vmlinux contains several instances of
__static_call_return0():
c0004fc0 t __static_call_return0
c0011518 t __static_call_return0
c00d8160 t __static_call_return0
arch_static_call_transform() uses the middle one to check whether we are
setting a call to __static_call_return0 or not:
c0011520 <arch_static_call_transform>:
c0011520: 3d 20 c0 01 lis r9,-16383 <== r9 = 0xc001 << 16
c0011524: 39 29 15 18 addi r9,r9,5400 <== r9 += 0x1518
c0011528: 7c 05 48 00 cmpw r5,r9 <== r9 has value 0xc0011518 here
So if static_call_update() is called with one of the other instances of
__static_call_return0(), arch_static_call_transform() won't recognise it.
In order to work properly, global single instance of __static_call_return0() is required.
Fixes: 3f2a8fc4b1 ("static_call/x86: Add __static_call_return0()")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/30821468a0e7d28251954b578e5051dc09300d04.1647258493.git.christophe.leroy@csgroup.eu
kernel/entry/common.c: In function ‘dynamic_irqentry_exit_cond_resched’:
kernel/entry/common.c:409:14: error: implicit declaration of function ‘static_key_unlikely’; did you mean ‘static_key_enable’? [-Werror=implicit-function-declaration]
409 | if (!static_key_unlikely(&sk_dynamic_irqentry_exit_cond_resched))
| ^~~~~~~~~~~~~~~~~~~
| static_key_enable
static_key_unlikely() should be static_branch_unlikely().
Fixes: 99cf983cc8 ("sched/preempt: Add PREEMPT_DYNAMIC using static keys")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Link: https://lore.kernel.org/r/20220330084328.1805665-1-svens@linux.ibm.com
try_steal_cookie() looks at task_struct::cpus_mask to decide if the
task could be moved to `this' CPU. It ignores that the task might be in
a migration disabled section while not on the CPU. In this case the task
must not be moved otherwise per-CPU assumption are broken.
Use is_cpu_allowed(), as suggested by Peter Zijlstra, to decide if the a
task can be moved.
Fixes: d2dfa17bc7 ("sched: Trivial forced-newidle balancer")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YjNK9El+3fzGmswf@linutronix.de
Steve reported that ChromeOS encounters the forceidle balancer being
ran from rt_mutex_setprio()'s balance_callback() invocation and
explodes.
Now, the forceidle balancer gets queued every time the idle task gets
selected, set_next_task(), which is strictly too often.
rt_mutex_setprio() also uses set_next_task() in the 'change' pattern:
queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
running = task_current(rq, p); /* rq->curr == p */
if (queued)
dequeue_task(...);
if (running)
put_prev_task(...);
/* change task properties */
if (queued)
enqueue_task(...);
if (running)
set_next_task(...);
However, rt_mutex_setprio() will explicitly not run this pattern on
the idle task (since priority boosting the idle task is quite insane).
Most other 'change' pattern users are pidhash based and would also not
apply to idle.
Also, the change pattern doesn't contain a __balance_callback()
invocation and hence we could have an out-of-band balance-callback,
which *should* trigger the WARN in rq_pin_lock() (which guards against
this exact anti-pattern).
So while none of that explains how this happens, it does indicate that
having it in set_next_task() might not be the most robust option.
Instead, explicitly queue the forceidle balancer from pick_next_task()
when it does indeed result in forceidle selection. Having it here,
ensures it can only be triggered under the __schedule() rq->lock
instance, and hence must be ran from that context.
This also happens to clean up the code a little, so win-win.
Fixes: d2dfa17bc7 ("sched: Trivial forced-newidle balancer")
Reported-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: T.J. Alumbaugh <talumbau@chromium.org>
Link: https://lkml.kernel.org/r/20220330160535.GN8939@worktop.programming.kicks-ass.net
Pull more tracing updates from Steven Rostedt:
- Rename the staging files to give them some meaning. Just
stage1,stag2,etc, does not show what they are for
- Check for NULL from allocation in bootconfig
- Hold event mutex for dyn_event call in user events
- Mark user events to broken (to work on the API)
- Remove eBPF updates from user events
- Remove user events from uapi header to keep it from being installed.
- Move ftrace_graph_is_dead() into inline as it is called from hot
paths and also convert it into a static branch.
* tag 'trace-v5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Move user_events.h temporarily out of include/uapi
ftrace: Make ftrace_graph_is_dead() a static branch
tracing: Set user_events to BROKEN
tracing/user_events: Remove eBPF interfaces
tracing/user_events: Hold event_mutex during dyn_event_add
proc: bootconfig: Add null pointer check
tracing: Rename the staging files for trace_events
Pull RT signal fix from Thomas Gleixner:
"Revert the RT related signal changes. They need to be reworked and
generalized"
* tag 'core-urgent-2022-04-03' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "signal, x86: Delay calling signals in atomic on RT enabled kernels"
Pull more dma-mapping updates from Christoph Hellwig:
- fix a regression in dma remap handling vs AMD memory encryption (me)
- finally kill off the legacy PCI DMA API (Christophe JAILLET)
* tag 'dma-mapping-5.18-1' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: move pgprot_decrypted out of dma_pgprot
PCI/doc: cleanup references to the legacy PCI DMA API
PCI: Remove the deprecated "pci-dma-compat.h" API
After being merged, user_events become more visible to a wider audience
that have concerns with the current API.
It is too late to fix this for this release, but instead of a full
revert, just mark it as BROKEN (which prevents it from being selected in
make config). Then we can work finding a better API. If that fails,
then it will need to be completely reverted.
To not have the code silently bitrot, still allow building it with
COMPILE_TEST.
And to prevent the uapi header from being installed, then later changed,
and then have an old distro user space see the old version, move the
header file out of the uapi directory.
Surround the include with CONFIG_COMPILE_TEST to the current location,
but when the BROKEN tag is taken off, it will use the uapi directory,
and fail to compile. This is a good way to remind us to move the header
back.
Link: https://lore.kernel.org/all/20220330155835.5e1f6669@gandalf.local.home
Link: https://lkml.kernel.org/r/20220330201755.29319-1-mathieu.desnoyers@efficios.com
Suggested-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ftrace_graph_is_dead() is used on hot paths, it just reads a variable
in memory and is not worth suffering function call constraints.
For instance, at entry of prepare_ftrace_return(), inlining it avoids
saving prepare_ftrace_return() parameters to stack and restoring them
after calling ftrace_graph_is_dead().
While at it using a static branch is even more performant and is
rather well adapted considering that the returned value will almost
never change.
Inline ftrace_graph_is_dead() and replace 'kill_ftrace_graph' bool
by a static branch.
The performance improvement is noticeable.
Link: https://lkml.kernel.org/r/e0411a6a0ed3eafff0ad2bc9cd4b0e202b4617df.1648623570.git.christophe.leroy@csgroup.eu
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
pgprot_decrypted is used by AMD SME systems to allow access to memory
that was set to not encrypted using set_memory_decrypted. That only
happens for dma-direct memory as the IOMMU solves the addressing
challenges for the encryption bit using its own remapping.
Move the pgprot_decrypted call out of dma_pgprot which is also used
by the IOMMU mappings and into dma-direct so that it is only used with
memory that was set decrypted.
Fixes: f5ff79fddf ("dma-mapping: remove CONFIG_DMA_REMAP")
Reported-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Pull more networking updates from Jakub Kicinski:
"Networking fixes and rethook patches.
Features:
- kprobes: rethook: x86: replace kretprobe trampoline with rethook
Current release - regressions:
- sfc: avoid null-deref on systems without NUMA awareness in the new
queue sizing code
Current release - new code bugs:
- vxlan: do not feed vxlan_vnifilter_dump_dev with non-vxlan devices
- eth: lan966x: fix null-deref on PHY pointer in timestamp ioctl when
interface is down
Previous releases - always broken:
- openvswitch: correct neighbor discovery target mask field in the
flow dump
- wireguard: ignore v6 endpoints when ipv6 is disabled and fix a leak
- rxrpc: fix call timer start racing with call destruction
- rxrpc: fix null-deref when security type is rxrpc_no_security
- can: fix UAF bugs around echo skbs in multiple drivers
Misc:
- docs: move netdev-FAQ to the 'process' section of the
documentation"
* tag 'net-5.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
vxlan: do not feed vxlan_vnifilter_dump_dev with non vxlan devices
openvswitch: Add recirc_id to recirc warning
rxrpc: fix some null-ptr-deref bugs in server_key.c
rxrpc: Fix call timer start racing with call destruction
net: hns3: fix software vlan talbe of vlan 0 inconsistent with hardware
net: hns3: fix the concurrency between functions reading debugfs
docs: netdev: move the netdev-FAQ to the process pages
docs: netdev: broaden the new vs old code formatting guidelines
docs: netdev: call out the merge window in tag checking
docs: netdev: add missing back ticks
docs: netdev: make the testing requirement more stringent
docs: netdev: add a question about re-posting frequency
docs: netdev: rephrase the 'should I update patchwork' question
docs: netdev: rephrase the 'Under review' question
docs: netdev: shorten the name and mention msgid for patch status
docs: netdev: note that RFC postings are allowed any time
docs: netdev: turn the net-next closed into a Warning
docs: netdev: move the patch marking section up
docs: netdev: minor reword
docs: netdev: replace references to old archives
...
Pull dma-mapping updates from Christoph Hellwig:
- do not zero buffer in set_memory_decrypted (Kirill A. Shutemov)
- fix return value of dma-debug __setup handlers (Randy Dunlap)
- swiotlb cleanups (Robin Murphy)
- remove most remaining users of the pci-dma-compat.h API
(Christophe JAILLET)
- share the ABI header for the DMA map_benchmark with userspace
(Tian Tao)
- update the maintainer for DMA MAPPING BENCHMARK (Xiang Chen)
- remove CONFIG_DMA_REMAP (me)
* tag 'dma-mapping-5.18' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: benchmark: extract a common header file for map_benchmark definition
dma-debug: fix return value of __setup handlers
dma-mapping: remove CONFIG_DMA_REMAP
media: v4l2-pci-skeleton: Remove usage of the deprecated "pci-dma-compat.h" API
rapidio/tsi721: Remove usage of the deprecated "pci-dma-compat.h" API
sparc: Remove usage of the deprecated "pci-dma-compat.h" API
agp/intel: Remove usage of the deprecated "pci-dma-compat.h" API
alpha: Remove usage of the deprecated "pci-dma-compat.h" API
MAINTAINERS: update maintainer list of DMA MAPPING BENCHMARK
swiotlb: simplify array allocation
swiotlb: tidy up includes
swiotlb: simplify debugfs setup
swiotlb: do not zero buffer in set_memory_decrypted()
Pull ptrace cleanups from Eric Biederman:
"This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was around
task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where anything left in tracehook.h was
some weird strange thing that was difficult to understand"
* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ptrace: Remove duplicated include in ptrace.c
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Move setting/clearing ptrace_message into ptrace_stop
tracehook: Remove tracehook.h
resume_user_mode: Move to resume_user_mode.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Introduce task_work_pending
task_work: Remove unnecessary include from posix_timers.h
ptrace: Remove tracehook_signal_handler
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Move ptrace_report_syscall into ptrace.h