Add support for multiple histograms per event. This allows
the user to set multiple histograms on a single event.
ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist[.N] {
...
}
The 'N' is a string starting with a digit, and it can be omitted
for the default histogram.
For example, the multiple hist triggers example in
Documentation/trace/histogram.rst can be written as below:
ftrace.event.net.netif_receive_skb.hist {
     1 {
        keys = skbaddr.hex
        values = len
        filter = len < 0
     }
     2 {
        keys = skbaddr.hex
        values = len
        filter = len > 4096
     }
     3 {
        keys = skbaddr.hex
        values = len
        filter = len == 256
     }
     4 {
        keys = skbaddr.hex
        values = len
     }
     5 {
        keys = len
        values = common_preempt_count
     }
}
Link: https://lkml.kernel.org/r/162856125628.203126.15846930277378572120.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Support multiple handlers for per-event histogram in boot-time tracing.
Since the histogram can register the same handler-action multiple times
with different parameters, this expands the syntax to support such cases.
With this update, the 'onmax', 'onchange' and 'onmatch' handler subkeys
under the per-event histogram option can optionally take a number subkey,
as below (see [.N]).
ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist {
     onmax|onchange[.N] { var = <VAR>; <ACTION> [= <PARAM>] }
     onmatch[.N] { event = <EVENT>; <ACTION> [= <PARAM>] }
}
The 'N' must be a digit (or a word starting with a digit).
Thus the user can add several handler-actions to the histogram,
for example:
ftrace.event.SOMEGROUP.SOMEEVENT.hist {
     keys = SOME_ID; lat = common_timestamp.usecs-$ts0
     onmatch.1 {
        event = GROUP1.STARTEVENT1
        trace = latency_event, SOME_ID, $lat
     }
     onmatch.2 {
        event = GROUP2.STARTEVENT2
        trace = latency_event, SOME_ID, $lat
     }
}
Then it can trace the elapsed time from GROUP1.STARTEVENT1 to
SOMEGROUP.SOMEEVENT, and from GROUP2.STARTEVENT2 to
SOMEGROUP.SOMEEVENT, keyed by SOME_ID.
Link: https://lkml.kernel.org/r/162856124905.203126.14913731908137885922.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add hist-trigger action syntax support to boot-time tracing.
Currently, boot-time tracing supports per-event actions as option
strings. However, the histogram action has a special syntax and
usually needs a long action definition.
To make it readable and fit the bootconfig syntax, this introduces
new options for the histogram.
Here are the histogram action options for boot-time tracing.
ftrace.[instance.INSTANCE.]event.GROUP.EVENT.hist {
     keys = <KEY>[,...]
     values = <VAL>[,...]
     sort = <SORT-KEY>[,...]
     size = <ENTRIES>
     name = <HISTNAME>
     var { <VAR> = <EXPR> ... }
     pause|continue|clear
     onmax|onchange { var = <VAR>; <ACTION> [= <PARAM>] }
     onmatch { event = <EVENT>; <ACTION> [= <PARAM>] }
     filter = <FILTER>
}
Where <ACTION> is one of the following:
     trace = <EVENT>, <ARG1>[, ...]
     save = <ARG1>[, ...]
     snapshot
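For illustration, a minimal sketch of how these options compose (the
kmem.kmalloc event and its bytes_req/bytes_alloc fields are only an
assumed example, not taken from this patch):
     ftrace.event.kmem.kmalloc.hist {
          keys = bytes_req
          values = bytes_alloc
          sort = bytes_req
          size = 2048
          filter = bytes_req > 128
     }
which should roughly correspond to setting the runtime trigger
'hist:keys=bytes_req:vals=bytes_alloc:sort=bytes_req:size=2048
if bytes_req > 128' on that event.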
Link: https://lkml.kernel.org/r/162856124106.203126.10501871028479029087.stgit@devnote2
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The entire FTRACE block is surrounded by 'if TRACING_SUPPORT' ...
'endif'.
Using 'depends on' is a simpler way to guard FTRACE.
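A rough sketch of the idea (not the exact hunk; unrelated entries are
elided):
     # before: every entry sits inside the guard block
     if TRACING_SUPPORT
     menuconfig FTRACE
             bool "Tracers"
             ...
     endif # TRACING_SUPPORT

     # after: the guard becomes part of the entry itself
     menuconfig FTRACE
             bool "Tracers"
             depends on TRACING_SUPPORT
             ...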
Link: https://lkml.kernel.org/r/20210731052233.4703-1-masahiroy@kernel.org
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Allow common_pid.execname to be saved in a variable in one histogram to be
passed to another histogram that can pass it as a parameter to a synthetic
event.
># echo 'hist:keys=pid:__arg__1=common_timestamp.usecs:arg2=common_pid.execname' \
> events/sched/sched_waking/trigger
># echo 'wakeup_lat s32 pid; u64 delta; char wake_comm[]' > synthetic_events
># echo 'hist:keys=next_pid:pid=next_pid,delta=common_timestamp.usecs-$__arg__1,exec=$arg2'\
':onmatch(sched.sched_waking).trace(wakeup_lat,$pid,$delta,$exec)' \
> events/sched/sched_switch/trigger
The above is a wake up latency synthetic event setup that passes the execname
of the common_pid that woke the task to the scheduling of that task, which
triggers a synthetic event that passes the original execname as a
parameter to display it.
># echo 1 > events/synthetic/enable
># cat trace
<idle>-0 [006] d..4 186.863801: wakeup_lat: pid=1306 delta=65 wake_comm=kworker/u16:3
<idle>-0 [000] d..4 186.863858: wakeup_lat: pid=163 delta=27 wake_comm=<idle>
<idle>-0 [001] d..4 186.863903: wakeup_lat: pid=1307 delta=36 wake_comm=kworker/u16:4
<idle>-0 [000] d..4 186.863927: wakeup_lat: pid=163 delta=5 wake_comm=<idle>
<idle>-0 [006] d..4 186.863957: wakeup_lat: pid=1306 delta=24 wake_comm=kworker/u16:3
sshd-1306 [006] d..4 186.864051: wakeup_lat: pid=61 delta=62 wake_comm=<idle>
<idle>-0 [000] d..4 186.965030: wakeup_lat: pid=609 delta=18 wake_comm=<idle>
<idle>-0 [006] d..4 186.987582: wakeup_lat: pid=1306 delta=65 wake_comm=kworker/u16:3
<idle>-0 [000] d..4 186.987639: wakeup_lat: pid=163 delta=27 wake_comm=<idle>
Link: https://lkml.kernel.org/r/20210722142837.458596338@goodmis.org
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Instead of kstrdup("const", GFP_KERNEL), have the hist_field type simply
assign the constant hist_field->type = "const"; And when the value passed
to it is a variable, use "kstrdup_const(var, GFP_KERNEL);" which avoids
the copy when the variable is already a constant. This saves on having
to allocate when not needed.
All frees of the hist_field->type will need to use kfree_const().
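A minimal sketch of the resulting pattern (the surrounding hist-trigger
code is elided; only the allocation/free discipline is shown):

     /* string literal: no allocation, just point at the constant */
     hist_field->type = "const";

     /* variable string: kstrdup_const() skips the copy when the source
      * is already a compile-time constant in .rodata */
     hist_field->type = kstrdup_const(var, GFP_KERNEL);

     /* matching free: kfree_const() is a no-op for constant strings */
     kfree_const(hist_field->type);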
Link: https://lkml.kernel.org/r/20210722142837.280718447@goodmis.org
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Update both the tracefs README file as well as the histogram.rst to
include an explanation of what the buckets modifier is and how to use it.
Include an example with the wakeup_latency example for both log2 and the
buckets modifiers as there was no existing log2 example.
Link: https://lkml.kernel.org/r/20210707213922.167218794@goodmis.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
There have been several times I wished the histogram logic had a "grouping"
feature for the buckets. Currently, each bucket has a size of one. That
is, if you trace the amount of requested allocations, each allocation is
its own bucket, even if you are interested in what allocates 100 bytes or
less, 100 to 200, 200 to 300, etc.
Also, without grouping, the allocated histogram buckets fill up quickly.
If you are tracking latency, and don't care whether something is 200
microseconds off or 201 microseconds off, but want to track it in buckets
of, say, 10 microseconds each, this cannot currently be done.
There is a log2 modifier, but that grouping gets too big too fast for a
lot of cases.
Introduce a "buckets=SIZE" modifier for each field, which records the
value rounded down to the given bucket size. For example:
># echo 'hist:keys=bytes_req.buckets=100:sort=bytes_req' > events/kmem/kmalloc/trigger
># cat events/kmem/kmalloc/hist
# event histogram
#
# trigger info:
hist:keys=bytes_req.buckets=100:vals=hitcount:sort=bytes_req.buckets=100:size=2048
[active]
#
{ bytes_req: ~ 0-99 } hitcount: 3149
{ bytes_req: ~ 100-199 } hitcount: 1468
{ bytes_req: ~ 200-299 } hitcount: 39
{ bytes_req: ~ 300-399 } hitcount: 306
{ bytes_req: ~ 400-499 } hitcount: 364
{ bytes_req: ~ 500-599 } hitcount: 32
{ bytes_req: ~ 600-699 } hitcount: 69
{ bytes_req: ~ 700-799 } hitcount: 37
{ bytes_req: ~ 1200-1299 } hitcount: 16
{ bytes_req: ~ 1400-1499 } hitcount: 30
{ bytes_req: ~ 2000-2099 } hitcount: 6
{ bytes_req: ~ 4000-4099 } hitcount: 2168
{ bytes_req: ~ 5000-5099 } hitcount: 6
Totals:
Hits: 7690
Entries: 13
Dropped: 0
Link: https://lkml.kernel.org/r/20210707213921.980359719@goodmis.org
Acked-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Tested-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Fixes a build error when CONFIG_HIST_TRIGGERS=n with boot-time
tracing. Since the trigger_process_regex() is defined only
when CONFIG_HIST_TRIGGERS=y, if it is disabled, the 'actions'
event option also must be disabled.
Link: https://lkml.kernel.org/r/162856123376.203126.582144262622247352.stgit@devnote2
Fixes: 81a59555ff ("tracing/boot: Add per-event settings")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The event filters are not applied to all of the output, which results in
a flood of printk output when using tp_printk. Unfold
event_trigger_unlock_commit_regs() into trace_event_buffer_commit() so
that the filters can be applied to every output.
Link: https://lkml.kernel.org/r/20210814034538.8428-1-kernelfans@gmail.com
Cc: stable@vger.kernel.org
Fixes: 0daa230296 ("tracing: Add tp_printk cmdline to have tracepoints go to printk()")
Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
It is a useful helper, hence move it to common code so others can enjoy
it.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Merge tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of fixes for PCI/MSI and x86 interrupt startup:
- Mask all MSI-X entries when enabling MSI-X otherwise stale unmasked
entries stay around e.g. when a crashkernel is booted.
- Enforce masking of a MSI-X table entry when updating it, which is
mandatory according to the specification
- Ensure that writes to MSI[-X] tables are flushed.
- Prevent invalid bits being set in the MSI mask register
- Properly serialize modifications to the mask cache and the mask
register for multi-MSI.
- Cure the violation of the affinity setting rules on X86 during
interrupt startup which can cause lost and stale interrupts. Move
the initial affinity setting ahead of actually enabling the
interrupt.
- Ensure that MSI interrupts are completely torn down before freeing
them in the error handling case.
- Prevent an array out of bounds access in the irq timings code"
* tag 'irq-urgent-2021-08-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
driver core: Add missing kernel doc for device::msi_lock
genirq/msi: Ensure deactivation on teardown
genirq/timings: Prevent potential array overflow in __irq_timings_store()
x86/msi: Force affinity setup before startup
x86/ioapic: Force affinity setup before startup
genirq: Provide IRQCHIP_AFFINITY_PRE_STARTUP
PCI/MSI: Protect msi_desc::masked for multi-MSI
PCI/MSI: Use msi_mask_irq() in pci_msi_shutdown()
PCI/MSI: Correct misleading comments
PCI/MSI: Do not set invalid bits in MSI mask
PCI/MSI: Enforce MSI[X] entry updates to be visible
PCI/MSI: Enforce that MSI-X table entry is masked for update
PCI/MSI: Mask all unused MSI-X entries
PCI/MSI: Enable and mask MSI-X early
/proc/net/unix uses "%c" to print a single-byte character to escape '\0' in
the name of the abstract UNIX domain socket. The following selftest uses
it, so this patch adds support for "%c". Note that it does not support
wide character ("%lc" and "%llc") for simplicity.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210814015718.42704-3-kuniyu@amazon.co.jp
These can only return 0 for failure or the number of entries, so turn
the return value into an unsigned int.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
This is similar to existing BPF_PROG_TYPE_CGROUP_SOCK
and BPF_PROG_TYPE_CGROUP_SOCK_ADDR.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210813230530.333779-2-sdf@google.com
"access skb fields ok" verifier test fails on s390 with the "verifier
bug. zext_dst is set, but no reg is defined" message. The first insns
of the test prog are ...
0: 61 01 00 00 00 00 00 00 ldxw %r0,[%r1+0]
8: 35 00 00 01 00 00 00 00 jge %r0,0,1
10: 61 01 00 08 00 00 00 00 ldxw %r0,[%r1+8]
... and the 3rd one is dead (this does not look intentional to me, but
this is a separate topic).
sanitize_dead_code() converts dead insns into "ja -1", but keeps
zext_dst. When opt_subreg_zext_lo32_rnd_hi32() tries to parse such
an insn, it sees this discrepancy and bails. This problem can be seen
only with JITs whose bpf_jit_needs_zext() returns true.
Fix by clearing dead insns' zext_dst.
The commits that contributed to this problem are:
1. 5aa5bd14c5 ("bpf: add initial suite for selftests"), which
introduced the test with the dead code.
2. 5327ed3d44 ("bpf: verifier: mark verified-insn with
sub-register zext flag"), which introduced the zext_dst flag.
3. 83a2881903 ("bpf: Account for BPF_FETCH in
insn_has_def32()"), which introduced the sanity check.
4. 9183671af6 ("bpf: Fix leakage under speculation on
mispredicted branches"), which bisect points to.
It's best to fix this on stable branches that contain the second one,
since that's the point where the inconsistency was introduced.
Fixes: 5327ed3d44 ("bpf: verifier: mark verified-insn with sub-register zext flag")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210812151811.184086-2-iii@linux.ibm.com
Fixes: 61377ec144 ("genirq: Clarify documentation for request_threaded_irq()")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Merge tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes, including fixes from netfilter, bpf, can and
ieee802154.
The size of this is pretty normal, but we got more fixes for 5.14
changes this week than last week. Nothing major but the trend is the
opposite of what we like. We'll see how the next week goes..
Current release - regressions:
- r8169: fix ASPM-related link-up regressions
- bridge: fix flags interpretation for extern learn fdb entries
- phy: micrel: fix link detection on ksz87xx switch
- Revert "tipc: Return the correct errno code"
- ptp: fix possible memory leak caused by invalid cast
Current release - new code bugs:
- bpf: add missing bpf_read_[un]lock_trace() for syscall program
- bpf: fix potentially incorrect results with bpf_get_local_storage()
- page_pool: mask the page->signature before the checking, avoid dma
mapping leaks
- netfilter: nfnetlink_hook: 5 fixes to information in netlink dumps
- bnxt_en: fix firmware interface issues with PTP
- mlx5: Bridge, fix ageing time
Previous releases - regressions:
- linkwatch: fix failure to restore device state across
suspend/resume
- bareudp: fix invalid read beyond skb's linear data
Previous releases - always broken:
- bpf: fix integer overflow involving bucket_size
- ppp: fix issues when desired interface name is specified via
netlink
- wwan: mhi_wwan_ctrl: fix possible deadlock
- dsa: microchip: ksz8795: fix number of VLAN related bugs
- dsa: drivers: fix broken backpressure in .port_fdb_dump
- dsa: qca: ar9331: make proper initial port defaults
Misc:
- bpf: add lockdown check for probe_write_user helper
- netfilter: conntrack: remove offload_pickup sysctl before 5.14 is
out
- netfilter: conntrack: collect all entries in one cycle,
heuristically slow down garbage collection scans on idle systems to
prevent frequent wake ups"
* tag 'net-5.14-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (87 commits)
vsock/virtio: avoid potential deadlock when vsock device remove
wwan: core: Avoid returning NULL from wwan_create_dev()
net: dsa: sja1105: unregister the MDIO buses during teardown
Revert "tipc: Return the correct errno code"
net: mscc: Fix non-GPL export of regmap APIs
net: igmp: increase size of mr_ifc_count
MAINTAINERS: switch to my OMP email for Renesas Ethernet drivers
tcp_bbr: fix u32 wrap bug in round logic if bbr_init() called after 2B packets
net: pcs: xpcs: fix error handling on failed to allocate memory
net: linkwatch: fix failure to restore device state across suspend/resume
net: bridge: fix memleak in br_add_if()
net: switchdev: zero-initialize struct switchdev_notifier_fdb_info emitted by drivers towards the bridge
net: bridge: fix flags interpretation for extern learn fdb entries
net: dsa: sja1105: fix broken backpressure in .port_fdb_dump
net: dsa: lantiq: fix broken backpressure in .port_fdb_dump
net: dsa: lan9303: fix broken backpressure in .port_fdb_dump
net: dsa: hellcreek: fix broken backpressure in .port_fdb_dump
bpf, core: Fix kernel-doc notation
net: igmp: fix data-race in igmp_ifc_timer_expire()
net: Fix memory leak in ieee802154_raw_deliver
...
When a user changes cpuset.cpus, each task in a v2 cpuset will be moved
to one of the new cpus if it is not there already. For memory, however,
they won't be migrated to the new nodes when cpuset.mems changes. This is
an inconsistency in behavior.
In cpuset v1, there is a memory_migrate control file to enable such
behavior by setting the CS_MEMORY_MIGRATE flag. Make it the default
for cpuset v2 so that we have a consistent set of behavior for both
cpus and memory.
There is certainly a cost to make memory migration the default, but it
is a one time cost that shouldn't really matter as long as cpuset.mems
isn't changed frequently. Update the cgroup-v2.rst file to document the
new behavior and recommend against changing cpuset.mems frequently.
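For illustration (assuming cgroup2 is mounted at /sys/fs/cgroup and a
cpuset-enabled child cgroup named 'test' already exists):
     # switch the cgroup's allowed memory nodes from node 0 to node 1;
     # with the new default, pages its tasks already allocated on node 0
     # are migrated as well
     echo 1 > /sys/fs/cgroup/test/cpuset.mems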
Since there won't be any concurrent access to the newly allocated cpuset
structure in cpuset_css_alloc(), we can use the cheaper non-atomic
__set_bit() instead of the more expensive atomic set_bit().
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Since the recent consolidation of reprogramming functions,
hrtimer_force_reprogram() is affected by a check whether the new expiry
time is past the current expiry time.
This breaks the NOHZ logic as that relies on the fact that the tick hrtimer
is moved into the future. That means cpu_base->expires_next becomes stale
and subsequent reprogramming attempts fail as well until the situation is
cleaned up by an hrtimer interrupt.
For some yet unknown reason this leads to a complete stall, so for now
partially revert the offending commit to a known working state. The root
cause of the stall is still being investigated and will be fixed in a
subsequent commit.
Fixes: b14bca97c9 ("hrtimer: Consolidate reprogramming code")
Reported-by: Mike Galbraith <efault@gmx.de>
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Mike Galbraith <efault@gmx.de>
Link: https://lore.kernel.org/r/8735recskh.ffs@tglx
clock_was_set() can be invoked from preemptible context. Use raw_cpu_ptr()
to check whether high resolution mode is active or not. It does not matter
whether the task migrates after acquiring the pointer.
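A sketch of the check under discussion (illustrative, not the exact hunk):

     /* May run preemptible: raw_cpu_ptr() avoids the smp_processor_id()
      * debug check, and any CPU's base is good enough to tell whether
      * high resolution mode is active. */
     struct hrtimer_cpu_base *cpu_base = raw_cpu_ptr(&hrtimer_bases);

     if (!cpu_base->hres_active)
             return;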
Fixes: e71a4153b7 ("hrtimer: Force clock_was_set() handling for the HIGHRES=n, NOHZ=y case")
Reported-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/875ywacsmb.ffs@tglx
Commit 2860cd8a23 ("livepatch: Use the default ftrace_ops instead of
REGS when ARGS is available") intends to enable config LIVEPATCH when
ftrace with ARGS is available. However, the chain of configs to enable
LIVEPATCH is incomplete, as HAVE_DYNAMIC_FTRACE_WITH_ARGS is available,
but the definition of DYNAMIC_FTRACE_WITH_ARGS, combining DYNAMIC_FTRACE
and HAVE_DYNAMIC_FTRACE_WITH_ARGS, needed to enable LIVEPATCH, is missing
in the commit.
Fortunately, ./scripts/checkkconfigsymbols.py detects this and warns:
DYNAMIC_FTRACE_WITH_ARGS
Referencing files: kernel/livepatch/Kconfig
So, define the config DYNAMIC_FTRACE_WITH_ARGS analogously to the already
existing similar configs, DYNAMIC_FTRACE_WITH_REGS and
DYNAMIC_FTRACE_WITH_DIRECT_CALLS, in ./kernel/trace/Kconfig to connect the
chain of configs.
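A sketch of the connecting definition, modeled on the existing
DYNAMIC_FTRACE_WITH_REGS entry (exact placement in kernel/trace/Kconfig
elided):
     config DYNAMIC_FTRACE_WITH_ARGS
             def_bool y
             depends on DYNAMIC_FTRACE
             depends on HAVE_DYNAMIC_FTRACE_WITH_ARGS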
Link: https://lore.kernel.org/kernel-janitors/CAKXUXMwT2zS9fgyQHKUUiqo8ynZBdx2UEUu1WnV_q0OCmknqhw@mail.gmail.com/
Link: https://lkml.kernel.org/r/20210806195027.16808-1-lukas.bulwahn@gmail.com
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: stable@vger.kernel.org
Fixes: 2860cd8a23 ("livepatch: Use the default ftrace_ops instead of REGS when ARGS is available")
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Some extra flags are printed to the trace header when using the
PREEMPT_RT config. The extra flags are: need-resched-lazy,
preempt-lazy-depth, and migrate-disable.
Without printing these fields, the timerlat specific fields are
shifted by three positions, for example:
# tracer: timerlat
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# || /
# |||| ACTIVATION
# TASK-PID CPU# |||| TIMESTAMP ID CONTEXT LATENCY
# | | | |||| | | | |
<idle>-0 [000] d..h... 3279.798871: #1 context irq timer_latency 830 ns
<...>-807 [000] ....... 3279.798881: #1 context thread timer_latency 11301 ns
Add a new header for timerlat with the missing fields, to be used
when the PREEMPT_RT is enabled.
Link: https://lkml.kernel.org/r/babb83529a3211bd0805be0b8c21608230202c55.1626598844.git.bristot@kernel.org
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Some extra flags are printed to the trace header when using the
PREEMPT_RT config. The extra flags are: need-resched-lazy,
preempt-lazy-depth, and migrate-disable.
Without printing these fields, the osnoise specific fields are
shifted by three positions, for example:
# tracer: osnoise
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth MAX
# || / SINGLE Interference counters:
# |||| RUNTIME NOISE %% OF CPU NOISE +-----------------------------+
# TASK-PID CPU# |||| TIMESTAMP IN US IN US AVAILABLE IN US HW NMI IRQ SIRQ THREAD
# | | | |||| | | | | | | | | | |
<...>-741 [000] ....... 1105.690909: 1000000 234 99.97660 36 21 0 1001 22 3
<...>-742 [001] ....... 1105.691923: 1000000 281 99.97190 197 7 0 1012 35 14
<...>-743 [002] ....... 1105.691958: 1000000 1324 99.86760 118 11 0 1016 155 143
<...>-744 [003] ....... 1105.691998: 1000000 109 99.98910 21 4 0 1004 33 7
<...>-745 [004] ....... 1105.692015: 1000000 2023 99.79770 97 37 0 1023 52 18
Add a new header for osnoise with the missing fields, to be used
when the PREEMPT_RT is enabled.
Link: https://lkml.kernel.org/r/1f03289d2a51fde5a58c2e7def063dc630820ad1.1626598844.git.bristot@kernel.org
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull ucounts fix from Eric Biederman:
"This fixes the ucount sysctls on big endian architectures.
The counts were expanded to be longs instead of ints, and the sysctl
code was overlooked, so only the low 32bit were being processed. On
little endian just processing the low 32bits is fine, but on 64bit big
endian processing just the low 32bits results in the high order bits
instead of the low order bits being processed and nothing works
properly.
This change took a little bit to mature as we have the SYSCTL_ZERO,
and SYSCTL_INT_MAX macros that are only usable for sysctls operating
on ints, but unfortunately are not obviously broken. Which resulted in
the versions of this change working on big endian and not on little
endian, because the int SYSCTL_ZERO when extended 64bit wound up being
0x100000000. So we only allowed values greater than 0x100000000 and
less than 0faff, which unfortunately broke everything that tried to
set the sysctls. (First reported with the windows subsystem for
linux).
I have tested this on x86_64 64bit after first reproducing the
problems with the earlier version of this change, and then verifying
the problems do not exist when we use appropriate long min and max
values for extra1 and extra2"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: add missing data type changes
In copy_process(), non-root users with the capability CAP_SYS_RESOURCE
or CAP_SYS_ADMIN will clear the PF_NPROC_EXCEEDED flag even when
rlimit(RLIMIT_NPROC) is exceeded. Add the same capability check logic here.
Align the permission checks in copy_process() and set_user(). In
copy_process() CAP_SYS_RESOURCE or CAP_SYS_ADMIN capable users will be
able to circumvent and clear the PF_NPROC_EXCEEDED flag whereas they
aren't able to do the same in set_user(). There's no obvious logic to this
and trying to unearth the reason in the thread didn't go anywhere.
The gist seems to be that this code wants to make sure that a program
can't successfully exec if it has gone through a set*id() transition
while exceeding its RLIMIT_NPROC.
A capable but non-INIT_USER caller getting PF_NPROC_EXCEEDED set during
a set*id() transition wouldn't be able to exec right away if they still
exceed their RLIMIT_NPROC at the time of exec. So their exec would fail
in fs/exec.c:
if ((current->flags & PF_NPROC_EXCEEDED) &&
is_ucounts_overlimit(current_ucounts(), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
retval = -EAGAIN;
goto out_ret;
}
However, if the caller were to fork() right after the set*id()
transition but before the exec while still exceeding their RLIMIT_NPROC
then they would get PF_NPROC_EXCEEDED cleared (while the child would
inherit it):
retval = -EAGAIN;
if (is_ucounts_overlimit(task_ucounts(p), UCOUNT_RLIMIT_NPROC, rlimit(RLIMIT_NPROC))) {
if (p->real_cred->user != INIT_USER &&
!capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
goto bad_fork_free;
}
current->flags &= ~PF_NPROC_EXCEEDED;
which means a subsequent exec by the capable caller would now succeed
even though they could still exceed their RLIMIT_NPROC limit. This seems
inconsistent. Allow a CAP_SYS_ADMIN or CAP_SYS_RESOURCE capable user to
avoid PF_NPROC_EXCEEDED as they already can in copy_process().
Cc: peterz@infradead.org, tglx@linutronix.de, linux-kernel@vger.kernel.org, Ran Xiaokai <ran.xiaokai@zte.com.cn>
Link: https://lore.kernel.org/r/20210728072629.530435-1-ran.xiaokai@zte.com.cn
Cc: Neil Brown <neilb@suse.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: James Morris <jamorris@linux.microsoft.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
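The conversion is mechanical; a sketch (the guarded section is a
placeholder):

     /* deprecated */
     get_online_cpus();
     do_protected_work();    /* placeholder for the code run under the lock */
     put_online_cpus();

     /* official replacement, same semantics */
     cpus_read_lock();
     do_protected_work();
     cpus_read_unlock();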
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: linux-crypto@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
If rcu_read_lock_sched tracing is enabled, the tracing subsystem can
perform a jump which needs to be checked by CFI. For example, stm_ftrace
source is enabled as a module and hooks into enabled ftrace events. This
can cause a recursive loop where find_shadow_check_fn ->
rcu_read_lock_sched -> (call to stm_ftrace generates cfi slowpath) ->
find_shadow_check_fn -> rcu_read_lock_sched -> ...
To avoid the recursion, either the ftrace code needs to be marked with
__no_cfi or CFI should not trace. Use the "_notrace" variant in CFI to avoid
tracing so that CFI can guard ftrace.
Signed-off-by: Elliot Berman <quic_eberman@quicinc.com>
Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
Cc: stable@vger.kernel.org
Fixes: cf68fffb66 ("add support for Clang CFI")
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210811155914.19550-1-quic_eberman@quicinc.com
Currently, if bpf_get_current_cgroup_id() or
bpf_get_current_ancestor_cgroup_id() helper is
called with sleepable programs e.g., sleepable
fentry/fmod_ret/fexit/lsm programs, a rcu warning
may appear. For example, if I added the following
hack to test_progs/test_lsm sleepable fentry program
test_sys_setdomainname:
--- a/tools/testing/selftests/bpf/progs/lsm.c
+++ b/tools/testing/selftests/bpf/progs/lsm.c
@@ -168,6 +168,10 @@ int BPF_PROG(test_sys_setdomainname, struct pt_regs *regs)
int buf = 0;
long ret;
+ __u64 cg_id = bpf_get_current_cgroup_id();
+ if (cg_id == 1000)
+ copy_test++;
+
ret = bpf_copy_from_user(&buf, sizeof(buf), ptr);
if (len == -2 && ret == 0 && buf == 1234)
copy_test++;
I will hit the following rcu warning:
include/linux/cgroup.h:481 suspicious rcu_dereference_check() usage!
other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
1 lock held by test_progs/260:
#0: ffffffffa5173360 (rcu_read_lock_trace){....}-{0:0}, at: __bpf_prog_enter_sleepable+0x0/0xa0
stack backtrace:
CPU: 1 PID: 260 Comm: test_progs Tainted: G O 5.14.0-rc2+ #176
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack_lvl+0x56/0x7b
bpf_get_current_cgroup_id+0x9c/0xb1
bpf_prog_a29888d1c6706e09_test_sys_setdomainname+0x3e/0x89c
bpf_trampoline_6442469132_0+0x2d/0x1000
__x64_sys_setdomainname+0x5/0x110
do_syscall_64+0x3a/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
I can get similar warning using bpf_get_current_ancestor_cgroup_id() helper.
syzbot reported a similar issue in [1] for syscall program. Helper
bpf_get_current_cgroup_id() or bpf_get_current_ancestor_cgroup_id()
has the following callchain:
task_dfl_cgroup
task_css_set
task_css_set_check
and we have
#define task_css_set_check(task, __c) \
rcu_dereference_check((task)->cgroups, \
lockdep_is_held(&cgroup_mutex) || \
lockdep_is_held(&css_set_lock) || \
((task)->flags & PF_EXITING) || (__c))
Since cgroup_mutex/css_set_lock is not held, the task is not exiting,
and rcu_read_lock() is not held, a warning
will be issued. Note that a bpf sleepable program is protected by
rcu_read_lock_trace().
The above sleepable bpf programs are already protected
by migrate_disable(). Adding rcu_read_lock() in these
two helpers will silence the above warning.
I marked the patch as fixing 95b861a793
("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
which added bpf_get_current_ancestor_cgroup_id() to tracing programs
in 5.14. I think backporting to 5.14 is probably good enough as sleepable
programs are not widely used.
This patch should fix [1] as well since syscall program is a sleepable
program protected with migrate_disable().
[1] https://lore.kernel.org/bpf/0000000000006d5cab05c7d9bb87@google.com/
Fixes: 95b861a793 ("bpf: Allow bpf_get_current_ancestor_cgroup_id for tracing")
Reported-by: syzbot+7ee5c2c09c284495371f@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810230537.2864668-1-yhs@fb.com
A valid cpuset partition can become invalid if all its CPUs are offlined
or somehow removed. This can happen through external events without
"cpuset.cpus.partition" being touched at all.
Users that rely on the property of a partition being present do not
currently have a simple way to get such an event notified other than
constant periodic polling which is both inefficient and cumbersome.
To make life easier for those users, event notification is now enabled
for "cpuset.cpus.partition" whenever its state changes.
Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix kernel-doc warnings found in cgroup-v1.c:
kernel/cgroup/cgroup-v1.c:55: warning: No description found for return value of 'cgroup_attach_task_all'
kernel/cgroup/cgroup-v1.c:94: warning: expecting prototype for cgroup_trasnsfer_tasks(). Prototype was for cgroup_transfer_tasks() instead
cgroup-v1.c:96: warning: No description found for return value of 'cgroup_transfer_tasks'
kernel/cgroup/cgroup-v1.c:687: warning: No description found for return value of 'cgroupstats_build'
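These are typically resolved by filling in the missing kernel-doc
sections; a sketch of the style for the first one (wording illustrative):

     /**
      * cgroup_attach_task_all - attach task 'tsk' to all cgroups of task 'from'
      * @from: attach to all cgroups of a given task
      * @tsk: the task to be attached
      *
      * Return: 0 on success or a negative errno code on failure.
      */
     int cgroup_attach_task_all(struct task_struct *from, struct task_struct *tsk);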
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Tejun Heo <tj@kernel.org>
Fix the following warnings:
kernel/smp.c:1189: warning: cannot understand function prototype: 'struct smp_call_on_cpu_struct '
kernel/smp.c:788: warning: No description found for return value of 'smp_call_function_single_async'
kernel/smp.c:990: warning: Function parameter or member 'wait' not described in 'smp_call_function_many'
kernel/smp.c:990: warning: Excess function parameter 'flags' description in 'smp_call_function_many'
kernel/smp.c:1198: warning: Function parameter or member 'work' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'done' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'func' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'data' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'ret' not described in 'smp_call_on_cpu_struct'
kernel/smp.c:1198: warning: Function parameter or member 'cpu' not described in 'smp_call_on_cpu_struct'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210810225051.3938-1-rdunlap@infradead.org
Fix all kernel-doc warnings in these 3 files and do some simple editing
(capitalize acronyms, capitalize Linux).
kernel/irq/pm.c:235: warning: expecting prototype for irq_pm_syscore_ops(). Prototype was for irq_pm_syscore_resume() instead
kernel/irq/msi.c:530: warning: expecting prototype for __msi_domain_free_irqs(). Prototype was for msi_domain_free_irqs() instead
kernel/irq/msi.c:31: warning: No description found for return value of 'alloc_msi_entry'
kernel/irq/msi.c:103: warning: No description found for return value of 'msi_domain_set_affinity'
kernel/irq/msi.c:288: warning: No description found for return value of 'msi_create_irq_domain'
kernel/irq/msi.c:499: warning: No description found for return value of 'msi_domain_alloc_irqs'
kernel/irq/msi.c:545: warning: No description found for return value of 'msi_get_domain_info'
kernel/irq/ipi.c:264: warning: expecting prototype for ipi_send_mask(). Prototype was for __ipi_send_mask() instead
kernel/irq/ipi.c:25: warning: No description found for return value of 'irq_reserve_ipi'
kernel/irq/ipi.c:116: warning: No description found for return value of 'irq_destroy_ipi'
kernel/irq/ipi.c:163: warning: No description found for return value of 'ipi_get_hwirq'
kernel/irq/ipi.c:222: warning: No description found for return value of '__ipi_send_single'
kernel/irq/ipi.c:308: warning: No description found for return value of 'ipi_send_single'
kernel/irq/ipi.c:329: warning: No description found for return value of 'ipi_send_mask'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210810234835.12547-1-rdunlap@infradead.org
Return a negative error code from the error handling case instead of 0, as
done elsewhere in this function.
Fixes: f52da98d90 ("genirq/timings: Add selftest for irqs circular buffer")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210811093333.2376-1-thunder.leizhen@huawei.com
Describe the arguments correctly.
Fixes the following W=1 kernel build warning(s):
kernel/irq/matrix.c:287: warning: Function parameter or
member 'msk' not described in 'irq_matrix_alloc_managed'
kernel/irq/matrix.c:287: warning: Function parameter or
member 'mapped_cpu' not described in 'irq_matrix_alloc_managed'
kernel/irq/matrix.c:287: warning: Excess function
parameter 'cpu' description in 'irq_matrix_alloc_managed'
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210605063413.684085-1-libaokun1@huawei.com
With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads
could incur a cache line miss in invoke_softirq() and other places.
Replace the test with a static key to avoid the potential cache miss.
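A minimal sketch of the pattern (the key name and the helper called on
the slow path are illustrative):

     /* replaces: extern bool force_irqthreads; */
     DEFINE_STATIC_KEY_FALSE(force_irqthreads_key);

     /* hot path: the branch is patched at runtime, no memory load needed */
     if (static_branch_unlikely(&force_irqthreads_key))
             handle_forced_threading();      /* placeholder */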
[ tglx: Dropped the IDE part, removed the export and updated blk-mq ]
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tanner Love <tannerlove@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: rcu@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
CPU hotplug callbacks can fail and cause a rollback to the previous
state. These failures are silent and therefore hard to debug.
Add pr_debug() to the up and down paths which provide information about the
error code, the CPU and the failed state. The debug printks can be enabled
via kernel command line or sysfs.
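For example, with CONFIG_DYNAMIC_DEBUG the new messages can be turned on
at runtime or at boot (path assumes the usual debugfs mount):
     # runtime, via the dynamic debug control file
     echo 'file cpu.c +p' > /sys/kernel/debug/dynamic_debug/control
     # or on the kernel command line
     dyndbg="file cpu.c +p"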
[ tglx: Adopt to current mainline, massage printk and changelog ]
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210409055316.1709-1-dongli.zhang@oracle.com
Use DEVICE_ATTR_*() helper instead of plain DEVICE_ATTR,
which makes the code a bit shorter and easier to read.
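A sketch of the conversion (the attribute name 'foo' is illustrative):

     /* before: mode and callbacks spelled out by hand */
     static DEVICE_ATTR(foo, 0444, foo_show, NULL);

     /* after: DEVICE_ATTR_RO() implies the 0444 mode and the foo_show() callback */
     static DEVICE_ATTR_RO(foo);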
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210527141105.2312-1-yuehaibing@huawei.com
kernel/cpu.c:57: warning: cannot understand function prototype: 'struct cpuhp_cpu_state '
kernel/cpu.c:115: warning: cannot understand function prototype: 'struct cpuhp_step '
kernel/cpu.c:146: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
* cpuhp_invoke_callback _ Invoke the callbacks for a given state
kernel/cpu.c:75: warning: Function parameter or member 'fail' not described in 'cpuhp_cpu_state'
kernel/cpu.c:75: warning: Function parameter or member 'cpu' not described in 'cpuhp_cpu_state'
kernel/cpu.c:75: warning: Function parameter or member 'node' not described in 'cpuhp_cpu_state'
kernel/cpu.c:75: warning: Function parameter or member 'last' not described in 'cpuhp_cpu_state'
kernel/cpu.c:130: warning: Function parameter or member 'list' not described in 'cpuhp_step'
kernel/cpu.c:130: warning: Function parameter or member 'multi_instance' not described in 'cpuhp_step'
kernel/cpu.c:158: warning: No description found for return value of 'cpuhp_invoke_callback'
kernel/cpu.c:1188: warning: No description found for return value of 'cpu_device_down'
kernel/cpu.c:1400: warning: No description found for return value of 'cpu_device_up'
kernel/cpu.c:1425: warning: No description found for return value of 'bringup_hibernate_cpu'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210809223825.24512-1-rdunlap@infradead.org
Fixes the following W=1 kernel build warning(s):
kernel/cpu.c:1949: warning: Function parameter or member
'name' not described in '__cpuhp_setup_state_cpuslocked'
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210605063003.681049-1-libaokun1@huawei.com
By unconditionally updating the offsets there are more indicators of
whether the SMP function calls on clock_was_set() can be avoided:
- When the offset update already happened on the remote CPU then the
remote update attempt will yield the same sequence number and no
IPI is required.
- When the remote CPU is currently handling hrtimer_interrupt(). In
that case the remote CPU will reevaluate the timer bases before
reprogramming anyway, so nothing to do.
- After updating it can be checked whether the first expiring timer in
the affected clock bases moves before the first expiring (softirq)
timer of the CPU. If that's not the case then sending the IPI is not
required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.887322464@linutronix.de
Setting of clocks triggers an unconditional SMP function call on all online
CPUs to reprogram the clock event device.
However, only some clocks have their offsets updated and therefore
potentially require a reprogram. That's CLOCK_REALTIME and CLOCK_TAI and in
the case of resume (delayed sleep time injection) also CLOCK_BOOTTIME.
Instead of sending an IPI unconditionally, check each per CPU hrtimer base
whether it has active timers in the affected clock bases which are
indicated by the caller in the @bases argument of clock_was_set().
If that's not the case, skip the IPI and update the offsets remotely which
ensures that any subsequently armed timers on the affected clocks are
evaluated with the correct offsets.
[ tglx: Adopted to the new bases argument, removed the softirq_active
check, added comment, fixed up stale comment ]
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.787536542@linutronix.de
clock_was_set() unconditionally invokes retrigger_next_event() on all online
CPUs. This was necessary because that mechanism was also used for resume
from suspend to idle, which is no longer the case.
The @bases argument allows the callers of clock_was_set() to hand in a mask
which tells clock_was_set() which of the hrtimer clock bases are affected
by the clock setting. This mask will be used in the next step to check
whether a CPU base has timers queued on a clock base affected by the event
and avoid the SMP function call if there are none.
Add a @bases argument, provide defines for the active bases masking and
fixup all callsites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.691083465@linutronix.de
do_adjtimex() might end up scheduling a delayed clock_was_set() via
timekeeping_advance() and then invoke clock_was_set() directly which is
pointless.
Make timekeeping_advance() return whether an invocation of clock_was_set()
is required and handle it at the call sites which allows do_adjtimex() to
issue a single direct call if required.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.580966888@linutronix.de
Resuming timekeeping is a clock-was-set event and uses the clock-was-set
notification mechanism. This is in the way of making the clock-was-set
update for hrtimers selective so unnecessary IPIs are avoided when a CPU
base does not have timers queued which are affected by the clock setting.
Disentangle it by invoking hrtimer_resume() on each unfreezing CPU and invoking
the new timerfd_resume() function from timekeeping_resume(), which is the
only place where this is needed.
Rename hrtimer_resume() to hrtimer_resume_local() to reflect the change.
With this, the clock_was_set*() functions are no longer required to IPI all
CPUs unconditionally and can get some smarts to avoid them.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.488853478@linutronix.de
When CONFIG_HIGH_RES_TIMERS is disabled, but NOHZ is enabled then
clock_was_set() is not doing anything. With HIGHRES=n the kernel relies on
the periodic tick to update the clock offsets, but when NOHZ is enabled and
active then CPUs which are in a deep idle sleep do not have a periodic tick
which means the expiry of timers affected by clock_was_set() can be
arbitrarily delayed up to the point where the CPUs are brought out of idle
again.
Make the clock_was_set() logic unconditionally available so that idle CPUs
are kicked out of idle to handle the update.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.288697903@linutronix.de
If high resolution timers are disabled, the timerfd notification about a
clock-was-set event does not happen for all cases which use
clock_was_set_delayed(), because that's a NOP for HIGHRES=n, which is wrong.
Make clock_was_set_delayed() unconditionally available to fix that.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.196661266@linutronix.de
This code is mostly duplicated. The redundant store in the force reprogram
case does no harm and the "in hrtimer interrupt" condition cannot be true for
the force reprogram invocations.
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135158.054424875@linutronix.de
If __hrtimer_start_range_ns() is invoked with an already armed hrtimer then
the timer has to be canceled first and then added back. If the timer is the
first expiring timer then on removal the clockevent device is reprogrammed
to the next expiring timer to avoid that the pending expiry fires needlessly.
If the new expiry time ends up to be the first expiry again then the clock
event device has to reprogrammed again.
Avoid this by checking whether the timer is the first to expire and in that
case, keep the timer on the current CPU and delay the reprogramming up to
the point where the timer has been enqueued again.
Reported-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210713135157.873137732@linutronix.de
There are several scenarios that can result in posix_cpu_timer_set()
not queueing the timer but still leaving the threadgroup cputime counter
running or keeping the tick dependency around for a random amount of time.
1) If timer_settime() is called with a 0 expiration on a timer that is
already disabled, the process wide cputime counter will be started
and won't ever get a chance to be stopped by stop_process_timer()
since no timer is actually armed to be processed.
The following snippet is enough to trigger the issue.
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
timer_settime(id, TIMER_ABSTIME, &val, NULL);
timer_delete(id);
}
2) If timer_settime() is called with a 0 expiration on a timer that is
already armed, the timer is dequeued but not really disarmed. So the
process wide cputime counter and the tick dependency may still remain
around for a while.
The following code snippet keeps this overhead around for one week after
the timer deletion:
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
val.it_value.tv_sec = 604800;
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
timer_settime(id, 0, &val, NULL);
timer_delete(id);
}
3) If the timer was initially deactivated, this call to timer_settime()
with an early expiration may have started the process wide cputime
counter even though the timer hasn't been queued and armed because it
has fired early and inline within posix_cpu_timer_set() itself. As a
result the process wide cputime counter may never stop until a new
timer is ever armed in the future.
The following code snippet can reproduce this:
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
signal(SIGALRM, SIG_IGN);
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
val.it_value.tv_nsec = 1;
timer_settime(id, TIMER_ABSTIME, &val, NULL);
}
4) If the timer was initially armed with a former expiration value
before this call to timer_settime() and the current call sets an
early deadline that has already expired, the timer fires inline
within posix_cpu_timer_set(). In this case it must have been dequeued
before firing inline with its new expiration value, yet it hasn't
been disarmed in this case. So the process wide cputime counter and
the tick dependency may still be around for a while even after the
timer fired.
The following code snippet can reproduce this:
void trigger_process_counter(void)
{
timer_t id;
struct itimerspec val = { };
signal(SIGALRM, SIG_IGN);
timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
val.it_value.tv_sec = 100;
timer_settime(id, TIMER_ABSTIME, &val, NULL);
val.it_value.tv_sec = 0;
val.it_value.tv_nsec = 1;
timer_settime(id, TIMER_ABSTIME, &val, NULL);
}
Fix all these issues by triggering a recalculation of the related base's
next expiration on the next tick. This also implies re-evaluating the need
to keep the process wide cputime counter and the tick dependency around,
in a similar fashion to disarm_timer().
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-7-frederic@kernel.org
Remove the ad-hoc timer base accessors and provide a consolidated one.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-6-frederic@kernel.org
The end of the function cannot be reached with an error in variable
ret. Unconfuse reviewers about that.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-5-frederic@kernel.org
When an itimer deactivates a previously armed expiration, it simply doesn't
do anything. As a result the process wide cputime counter keeps running and
the tick dependency stays set until it reaches the old ghost expiration
value.
This can be reproduced with the following snippet:
void trigger_process_counter(void)
{
struct itimerval n = {};
n.it_value.tv_sec = 100;
setitimer(ITIMER_VIRTUAL, &n, NULL);
n.it_value.tv_sec = 0;
setitimer(ITIMER_VIRTUAL, &n, NULL);
}
Fix this by resetting the relevant base expiration. This is similar to
disarming a timer.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-4-frederic@kernel.org
A timer deletion only dequeues the timer but it doesn't shut down
the related costly process wide cputime counter and the tick dependency.
The following code snippet keeps this overhead around for one week after
the timer deletion:
void trigger_process_counter(void)
{
        timer_t id;
        struct itimerspec val = { };
        val.it_value.tv_sec = 604800;
        timer_create(CLOCK_PROCESS_CPUTIME_ID, NULL, &id);
        timer_settime(id, 0, &val, NULL);
        timer_delete(id);
}
Make sure the next target's tick recalculates the nearest expiration and
clears the process wide counter and tick dependency if necessary.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-3-frederic@kernel.org
Starting the process wide cputime counter needs to be done in the same
sighand locking sequence as actually arming the related timer, otherwise
this races against concurrent timers setting/expiring in the same
threadgroup.
Detecting that the cputime counter is started without holding the sighand
lock is a first step toward debugging such situations.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210726125513.271824-2-frederic@kernel.org
The variable ret is being initialized with a value that is never read; it
is updated later on. The assignment is redundant and can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210721120147.109570-1-colin.king@canonical.com
Daniel Borkmann says:
====================
bpf-next 2021-08-10
We've added 31 non-merge commits during the last 8 day(s) which contain
a total of 28 files changed, 3644 insertions(+), 519 deletions(-).
1) Native XDP support for bonding driver & related BPF selftests, from Jussi Maki.
2) Large batch of new BPF JIT tests for test_bpf.ko that came out as a result from
32-bit MIPS JIT development, from Johan Almbladh.
3) Rewrite of netcnt BPF selftest and merge into test_progs, from Stanislav Fomichev.
4) Fix XDP bpf_prog_test_run infra after net to net-next merge, from Andrii Nakryiko.
5) Follow-up fix in unix_bpf_update_proto() to enforce socket type, from Cong Wang.
6) Fix bpf-iter-tcp4 selftest to print the correct dest IP, from Jose Blanquicet.
7) Various misc BPF XDP sample improvements, from Niklas Söderlund, Matthew Cover,
and Muhammad Falak R Wani.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (31 commits)
bpf, tests: Add tail call test suite
bpf, tests: Add tests for BPF_CMPXCHG
bpf, tests: Add tests for atomic operations
bpf, tests: Add test for 32-bit context pointer argument passing
bpf, tests: Add branch conversion JIT test
bpf, tests: Add word-order tests for load/store of double words
bpf, tests: Add tests for ALU operations implemented with function calls
bpf, tests: Add more ALU64 BPF_MUL tests
bpf, tests: Add more BPF_LSH/RSH/ARSH tests for ALU64
bpf, tests: Add more ALU32 tests for BPF_LSH/RSH/ARSH
bpf, tests: Add more tests of ALU32 and ALU64 bitwise operations
bpf, tests: Fix typos in test case descriptions
bpf, tests: Add BPF_MOV tests for zero and sign extension
bpf, tests: Add BPF_JMP32 test cases
samples, bpf: Add an explict comment to handle nested vlan tagging.
selftests/bpf: Add tests for XDP bonding
selftests/bpf: Fix xdp_tx.c prog section name
net, core: Allow netdev_lower_get_next_private_rcu in bh context
bpf, devmap: Exclude XDP broadcast to master device
net, bonding: Add XDP support to the bonding driver
...
====================
Link: https://lore.kernel.org/r/20210810130038.16927-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
msi_domain_alloc_irqs() invokes irq_domain_activate_irq(), but
msi_domain_free_irqs() does not enforce deactivation before tearing down
the interrupts.
This happens when PCI/MSI interrupts are set up and never used before being
torn down again, e.g. in error handling paths. The only place which cleans
that up is the error handling path in msi_domain_alloc_irqs().
Move the cleanup from msi_domain_alloc_irqs() into msi_domain_free_irqs()
to cure that.
Fixes: f3b0946d62 ("genirq/msi: Make sure PCI MSIs are activated early")
Signed-off-by: Bixuan Cui <cuibixuan@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210518033117.78104-1-cuibixuan@huawei.com
When the interrupt interval is greater than 2 ^ PREDICTION_BUFFER_SIZE *
PREDICTION_FACTOR us and less than 1s, the calculated index will be greater
than the length of irqs->ema_time[]. Check the calculated index before
using it to prevent array overflow.
Fixes: 23aa3b9a6b ("genirq/timings: Encapsulate storing function")
Signed-off-by: Ben Dai <ben.dai@unisoc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210425150903.25456-1-ben.dai9703@gmail.com
Make use of the struct_size() helper instead of an open-coded version,
in order to avoid any potential type mistakes or integer overflows
that, in the worst scenario, could lead to heap overflows.
This code was detected with the help of Coccinelle, and audited and
fixed manually.
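For illustration, the pattern being replaced looks roughly like this (the
struct and variable names are made up for the example, not taken from the
patched file):
        struct sample_buf {
                int nr;
                u64 samples[];          /* flexible array member */
        };
        struct sample_buf *buf;
        size_t nr = 128;
        /* Before: open-coded sizing, the multiplication can overflow and wrap. */
        buf = kzalloc(sizeof(*buf) + sizeof(u64) * nr, GFP_KERNEL);
        /* After: struct_size() from <linux/overflow.h> checks the math and saturates. */
        buf = kzalloc(struct_size(buf, samples, nr), GFP_KERNEL);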
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210513212729.GA214145@embeddedor
PCI MSI interrupt numbers are now mapped in a PCI-MSI domain but the
underlying calls handling the passthrough of the interrupt in the
guest need a number in the XIVE IRQ domain.
Use the IRQ data mapped in the XIVE IRQ domain and not the one in the
PCI-MSI domain.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20210701132750.1475580-16-clg@kaod.org
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
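The change is mechanical; a hypothetical call site looks like this before
and after (do_something() and the cpu variable are placeholders):
        int cpu;
        /* Before: deprecated helpers. */
        get_online_cpus();
        for_each_online_cpu(cpu)
                do_something(cpu);
        put_online_cpus();
        /* After: same semantics with the official API. */
        cpus_read_lock();
        for_each_online_cpu(cpu)
                do_something(cpu);
        cpus_read_unlock();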
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-34-bigeasy@linutronix.de
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-35-bigeasy@linutronix.de
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-33-bigeasy@linutronix.de
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20210803141621.780504-26-bigeasy@linutronix.de
Fix kernel-doc warnings in kernel/bpf/core.c (found by scripts/kernel-doc
and W=1 builds). That is, correct a function name in a comment and add
return descriptions for 2 functions.
Fixes these kernel-doc warnings:
kernel/bpf/core.c:1372: warning: expecting prototype for __bpf_prog_run(). Prototype was for ___bpf_prog_run() instead
kernel/bpf/core.c:1372: warning: No description found for return value of '___bpf_prog_run'
kernel/bpf/core.c:1883: warning: No description found for return value of 'bpf_prog_select_runtime'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210809215229.7556-1-rdunlap@infradead.org
By adding the pidfd_create() declaration to linux/pid.h, we
effectively expose this function to the rest of the kernel. In order
to avoid any unintended behavior, or set false expectations upon this
function, ensure that constraints are forced upon each of the passed
parameters. This includes the checking of whether the passed struct
pid is a thread-group leader as pidfd creation is currently limited to
such pid types.
Link: https://lore.kernel.org/r/2e9b91c2d529d52a003b8b86c45f866153be9eb5.1628398044.git.repnop@google.com
Signed-off-by: Matthew Bobrowski <repnop@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
With the idea of returning pidfds from the fanotify API, we need to
expose a mechanism for creating pidfds. We drop the static qualifier
from pidfd_create() and add its declaration to linux/pid.h so that the
pidfd_create() helper can be called from other kernel subsystems
i.e. fanotify.
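Per the changelog, the exposed interface amounts to a declaration along
these lines in include/linux/pid.h (shown as a sketch):
        int pidfd_create(struct pid *pid, unsigned int flags);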
Link: https://lore.kernel.org/r/0c68653ec32f1b7143301f0231f7ed14062fd82b.1628398044.git.repnop@google.com
Signed-off-by: Matthew Bobrowski <repnop@google.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Jan Kara <jack@suse.cz>
X86 IO/APIC and MSI interrupts (when used without interrupts remapping)
require that the affinity setup on startup is done before the interrupt is
enabled for the first time as the non-remapped operation mode cannot safely
migrate enabled interrupts from arbitrary contexts. Provide a new irq chip
flag which allows affected hardware to request this.
This has to be opt-in because there have been reports in the past that some
interrupt chips cannot handle affinity setting before startup.
Fixes: 1840475676 ("genirq: Expose default irq affinity mask (take 3)")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20210729222542.779791738@linutronix.de
Commit b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed a bug for bpf_get_local_storage() helper so different tasks
won't mess up with each other's percpu local storage.
The percpu data contains 8 slots so it can hold up to 8 contexts (same or
different tasks), for 8 different program runs, at the same time. This in
general is sufficient. But our internal testing showed the following warning
multiple times:
[...]
warning: WARNING: CPU: 13 PID: 41661 at include/linux/bpf-cgroup.h:193
__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
RIP: 0010:__cgroup_bpf_run_filter_sock_ops+0x13e/0x180
<IRQ>
tcp_call_bpf.constprop.99+0x93/0xc0
tcp_conn_request+0x41e/0xa50
? tcp_rcv_state_process+0x203/0xe00
tcp_rcv_state_process+0x203/0xe00
? sk_filter_trim_cap+0xbc/0x210
? tcp_v6_inbound_md5_hash.constprop.41+0x44/0x160
tcp_v6_do_rcv+0x181/0x3e0
tcp_v6_rcv+0xc65/0xcb0
ip6_protocol_deliver_rcu+0xbd/0x450
ip6_input_finish+0x11/0x20
ip6_input+0xb5/0xc0
ip6_sublist_rcv_finish+0x37/0x50
ip6_sublist_rcv+0x1dc/0x270
ipv6_list_rcv+0x113/0x140
__netif_receive_skb_list_core+0x1a0/0x210
netif_receive_skb_list_internal+0x186/0x2a0
gro_normal_list.part.170+0x19/0x40
napi_complete_done+0x65/0x150
mlx5e_napi_poll+0x1ae/0x680
__napi_poll+0x25/0x120
net_rx_action+0x11e/0x280
__do_softirq+0xbb/0x271
irq_exit_rcu+0x97/0xa0
common_interrupt+0x7f/0xa0
</IRQ>
asm_common_interrupt+0x1e/0x40
RIP: 0010:bpf_prog_1835a9241238291a_tw_egress+0x5/0xbac
? __cgroup_bpf_run_filter_skb+0x378/0x4e0
? do_softirq+0x34/0x70
? ip6_finish_output2+0x266/0x590
? ip6_finish_output+0x66/0xa0
? ip6_output+0x6c/0x130
? ip6_xmit+0x279/0x550
? ip6_dst_check+0x61/0xd0
[...]
Using drgn [0] to dump the percpu buffer contents showed that on this CPU
slot 0 is still available, but slots 1-7 are occupied and those tasks in
slots 1-7 mostly don't exist any more. So we might have issues in
bpf_cgroup_storage_unset().
Further debugging confirmed that there is a bug in bpf_cgroup_storage_unset().
Currently, it tries to unset "current" slot with searching from the start.
So the following sequence is possible:
1. A task is running and claims slot 0
2. Running BPF program is done, and it checked slot 0 has the "task"
and ready to reset it to NULL (not yet).
3. An interrupt happens, another BPF program runs and it claims slot 1
with the *same* task.
4. The unset() in interrupt context releases slot 0 since it matches "task".
5. Interrupt is done, the task in process context resets slot 0.
At the end, slot 1 is not reset and the same process can continue to occupy
slots 2-7. Finally, when the above steps 1-5 are repeated, the step 3 BPF
program won't be able to claim an empty slot and a warning will be issued.
To fix the issue, the unset() function should traverse from the last slot
to the first. This way, the above issue can be avoided.
The same reverse traversal should also be done in the bpf_get_local_storage()
helper itself. Otherwise, incorrect local storage may be returned to the BPF
program.
[0] https://github.com/osandov/drgn
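A simplified sketch of the reverse scan (the 'slots' array below condenses
the real per-CPU bpf_cgroup_storage_info from include/linux/bpf-cgroup.h;
only the traversal direction matters here):
        static struct {
                const struct task_struct *task;
        } slots[BPF_CGROUP_STORAGE_NEST_MAX];
        static void cgroup_storage_unset(const struct task_struct *task)
        {
                int i;
                /* Scan from the last slot to the first so that the most
                 * recently claimed slot (e.g. from a nested/interrupting
                 * program) is released, not an older slot owned by the
                 * outer context. */
                for (i = BPF_CGROUP_STORAGE_NEST_MAX - 1; i >= 0; i--) {
                        if (slots[i].task != task)
                                continue;
                        slots[i].task = NULL;
                        return;
                }
        }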
Fixes: b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210810010413.1976277-1-yhs@fb.com
Back then, commit 96ae522795 ("bpf: Add bpf_probe_write_user BPF helper
to be called in tracers") added the bpf_probe_write_user() helper in order
to allow to override user space memory. Its original goal was to have a
facility to "debug, divert, and manipulate execution of semi-cooperative
processes" under CAP_SYS_ADMIN. Write to kernel was explicitly disallowed
since it would otherwise tamper with its integrity.
One use case was shown in cf9b1199de ("samples/bpf: Add test/example of
using bpf_probe_write_user bpf helper") where the program DNATs traffic
at the time of connect(2) syscall, meaning, it rewrites the arguments to
a syscall while they're still in userspace, and before the syscall has a
chance to copy the argument into kernel space. These days we have better
mechanisms in BPF for achieving the same (e.g. for load-balancers), but
without having to write to userspace memory.
Of course the bpf_probe_write_user() helper can also be used to abuse
many other things, for good or bad purposes. Outside of BPF, there is
a similar mechanism for ptrace(2) such as PTRACE_PEEK{TEXT,DATA} and
PTRACE_POKE{TEXT,DATA}, but it would likely require some more effort.
Commit 96ae522795 explicitly dedicated the helper for experimentation
purpose only. Thus, move the helper's availability behind a newly added
LOCKDOWN_BPF_WRITE_USER lockdown knob so that the helper is disabled under
the "integrity" mode. More fine-grained control can be implemented also
from LSM side with this change.
Fixes: 96ae522795 ("bpf: Add bpf_probe_write_user BPF helper to be called in tracers")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Pull cgroup fix from Tejun Heo:
"One commit to fix a possible A-A deadlock around u64_stats_sync on
32bit machines caused by updating it without disabling IRQ when it may
be read from IRQ context"
* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: rstat: fix A-A deadlock on 32bit around u64_stats_sync
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: cgroups@vger.kernel.org
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
The cpuset fields that manage partition root state do not strictly
follow the cpuset locking rule that updates to a cpuset have to be done
with both the callback_lock and cpuset_mutex held. This is now fixed
by making sure that the locking rule is upheld.
Fixes: 3881b86128 ("cpuset: Add an error state to cpuset.sched.partition")
Fixes: 4b842da276 ("cpuset: Make CPU hotplug work with partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Cc: Tejun Heo <tj@kernel.org>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Tejun Heo <tj@kernel.org>
Replace ida_simple_get() with ida_alloc() and ida_simple_remove() with
ida_free(); the latter pair is more concise and intuitive.
In addition, if ida_alloc() fails, NULL is returned directly. This
eliminates the unnecessary initialization of two local variables and an
'if' check.
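A minimal before/after sketch of the API swap (the 'example_ida' and the
error handling around it are hypothetical and omitted, respectively):
        static DEFINE_IDA(example_ida);
        int id;
        /* Before: the deprecated 'simple' wrappers. */
        id = ida_simple_get(&example_ida, 0, 0, GFP_KERNEL);
        ida_simple_remove(&example_ida, id);
        /* After: the plain ida API with the same semantics. */
        id = ida_alloc(&example_ida, GFP_KERNEL);
        ida_free(&example_ida, id);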
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Rename LOCKDOWN_BPF_READ to LOCKDOWN_BPF_READ_KERNEL so the naming is
more consistent with the LOCKDOWN_BPF_WRITE_USER option that we are adding.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Now that all the .map_sg operations have been converted to returning
proper error codes, drop the code that handles a zero return value and
add a warning if zero is returned.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
The .map_sg() op now expects an error code instead of zero on failure.
The only errno to return is -EINVAL in the case when DMA is not
supported.
Signed-off-by: Martin Oliveira <martin.oliveira@eideticom.com>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Now that the map_sg() op expects error codes instead of returning zero on
error, convert dma_direct_map_sg() to return an error code. Per the
documentation for dma_map_sgtable(), -EIO is returned for a
DMA_MAPPING_ERROR with unknown cause.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Allow dma_map_sgtable() to pass errors from the map_sg() ops. This
will be required for returning appropriate error codes when mapping
P2PDMA memory.
Introduce __dma_map_sg_attrs() which will return the raw error code
from the map_sg operation (whether it be negative or zero). Then add a
dma_map_sg_attrs() wrapper to convert any negative errors to zero to
satisfy the existing calling convention.
dma_map_sgtable() defines three error codes that .map_sg implementations
are allowed to return: -EINVAL, -ENOMEM and -EIO, the latter being a
generic return for cases that pass DMA_MAPPING_ERROR through.
dma_map_sgtable() will convert a zero error return for old map_sg() ops
into a -EIO return and return any negative errors as reported.
This allows map_sg implementations to start returning multiple
negative error codes. Legacy map_sg implementations can continue
to return zero until they are all converted.
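A condensed sketch of the calling-convention split described above
(simplified; argument validation, debug hooks and the actual op dispatch
in kernel/dma/mapping.c are elided):
        int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
                               int nents, enum dma_data_direction dir,
                               unsigned long attrs); /* raw result: >0, 0 or -errno */
        int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
                             int nents, enum dma_data_direction dir,
                             unsigned long attrs)
        {
                int ret = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
                /* Keep the legacy contract: 0 means failure. */
                return ret < 0 ? 0 : ret;
        }
        int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
                            enum dma_data_direction dir, unsigned long attrs)
        {
                int ret = __dma_map_sg_attrs(dev, sgt->sgl, sgt->orig_nents,
                                             dir, attrs);
                if (ret < 0)
                        return ret;     /* -EINVAL, -ENOMEM or -EIO as reported */
                if (ret == 0)
                        return -EIO;    /* legacy .map_sg op signalling failure with 0 */
                sgt->nents = ret;
                return 0;
        }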
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Due to link order, dma_debug_init is called before debugfs has a chance
to initialize (via debugfs_init which also happens in the core initcall
stage), so the directories for dma-debug are never created.
Decouple dma_debug_fs_init from dma_debug_init and defer its init until
core_initcall_sync (after debugfs has been initialized) while letting
dma-debug initialization occur as soon as possible to catch any early
mappings, as suggested in [1].
[1] https://lore.kernel.org/linux-iommu/YIgGa6yF%2Fadg8OSN@kroah.com/
Fixes: 15b28bbcd5 ("dma-debug: move initialization to common code")
Signed-off-by: Anthony Iliopoulos <ailiop@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
There is already a memory_intersects() helper in sections.h;
use memory_intersects() directly instead of the private overlap() helper.
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
- Prevent a memory ordering issue in the timer expiry code which makes it
possible to observe falsely that the callback has been executed already
while that's not the case, which violates the guarantee of del_timer_sync().
Merge tag 'timers-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single timer fix:
- Prevent a memory ordering issue in the timer expiry code which
makes it possible to observe falsely that the callback has been
executed already while that's not the case, which violates the
guarantee of del_timer_sync()"
* tag 'timers-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Move clearing of base::timer_running under base:: Lock
- Prevent a double enqueue caused by rt_effective_prio() being invoked
twice in __sched_setscheduler().
Merge tag 'sched-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single scheduler fix:
- Prevent a double enqueue caused by rt_effective_prio() being
invoked twice in __sched_setscheduler()"
* tag 'sched-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/rt: Fix double enqueue caused by rt_effective_prio
- Correct the permission checks for perf event which send SIGTRAP to a
different process and clean up that code to be more readable.
- Prevent an out of bound MSR access in the x86 perf code which happened
due to an incomplete limiting to the actually available hardware
counters.
- Prevent access to the AMD64_EVENTSEL_HOSTONLY bit when running inside a
guest.
- Handle small core counter re-enabling correctly by issuing an ACK right
before reenabling it to prevent a stale PEBS record being kept around.
Merge tag 'perf-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"A set of perf fixes:
- Correct the permission checks for perf event which send SIGTRAP to
a different process and clean up that code to be more readable.
- Prevent an out of bound MSR access in the x86 perf code which
happened due to an incomplete limiting to the actually available
hardware counters.
- Prevent access to the AMD64_EVENTSEL_HOSTONLY bit when running
inside a guest.
- Handle small core counter re-enabling correctly by issuing an ACK
right before reenabling it to prevent a stale PEBS record being
kept around"
* tag 'perf-urgent-2021-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Apply mid ACK for small core
perf/x86/amd: Don't touch the AMD64_EVENTSEL_HOSTONLY bit inside the guest
perf/x86: Fix out of bound MSR access
perf: Refactor permissions check into perf_check_permission()
perf: Fix required permissions if sigtrap is requested
Daniel Borkmann says:
====================
pull-request: bpf 2021-08-07
The following pull-request contains BPF updates for your *net* tree.
We've added 4 non-merge commits during the last 9 day(s) which contain
a total of 4 files changed, 8 insertions(+), 7 deletions(-).
The main changes are:
1) Fix integer overflow in htab's lookup + delete batch op, from Tatsuhiko Yasumatsu.
2) Fix invalid fd 0 close in libbpf if BTF parsing failed, from Daniel Xu.
3) Fix libbpf feature probe for BPF_PROG_TYPE_CGROUP_SOCKOPT, from Robin Gögge.
4) Fix minor libbpf doc warning regarding code-block language, from Randy Dunlap.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
In __htab_map_lookup_and_delete_batch(), hash buckets are iterated
over to count the number of elements in each bucket (bucket_size).
If bucket_size is large enough, the multiplication to calculate
kvmalloc() size could overflow, resulting in out-of-bounds write
as reported by KASAN:
[...]
[ 104.986052] BUG: KASAN: vmalloc-out-of-bounds in __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.986489] Write of size 4194224 at addr ffffc9010503be70 by task crash/112
[ 104.986889]
[ 104.987193] CPU: 0 PID: 112 Comm: crash Not tainted 5.14.0-rc4 #13
[ 104.987552] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 104.988104] Call Trace:
[ 104.988410] dump_stack_lvl+0x34/0x44
[ 104.988706] print_address_description.constprop.0+0x21/0x140
[ 104.988991] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.989327] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.989622] kasan_report.cold+0x7f/0x11b
[ 104.989881] ? __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.990239] kasan_check_range+0x17c/0x1e0
[ 104.990467] memcpy+0x39/0x60
[ 104.990670] __htab_map_lookup_and_delete_batch+0x5ce/0xb60
[ 104.990982] ? __wake_up_common+0x4d/0x230
[ 104.991256] ? htab_of_map_free+0x130/0x130
[ 104.991541] bpf_map_do_batch+0x1fb/0x220
[...]
In hashtable, if the elements' keys have the same jhash() value, the
elements will be put into the same bucket. By putting a lot of elements
into a single bucket, the value of bucket_size can be increased to
trigger the integer overflow.
Triggering the overflow is possible for both callers with CAP_SYS_ADMIN
and callers without CAP_SYS_ADMIN.
It will be trivial for a caller with CAP_SYS_ADMIN to intentionally
reach this overflow by enabling BPF_F_ZERO_SEED. As this flag will set
the random seed passed to jhash() to 0, it will be easy for the caller
to prepare keys which will be hashed into the same value, and thus put
all the elements into the same bucket.
If the caller does not have CAP_SYS_ADMIN, BPF_F_ZERO_SEED cannot be
used. However, it will still be technically possible to trigger the
overflow, by guessing the random seed value passed to jhash() (32bit)
and repeating the attempt to trigger the overflow. In this case,
the probability of triggering the overflow will be low and it will take
a very long time.
Fix the integer overflow by calling kvmalloc_array() instead of
kvmalloc() to allocate memory.
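An illustrative before/after of the allocation ('values', 'value_size' and
'bucket_size' stand in for the locals used in
__htab_map_lookup_and_delete_batch(); the GFP flags are illustrative):
        void *values;
        size_t value_size, bucket_size;
        /* Before: the multiplication can wrap around on overflow. */
        values = kvmalloc(value_size * bucket_size, GFP_USER | __GFP_NOWARN);
        /* After: kvmalloc_array() checks the multiplication and fails instead. */
        values = kvmalloc_array(bucket_size, value_size, GFP_USER | __GFP_NOWARN);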
Fixes: 057996380a ("bpf: Add batch ops to all htab bpf map")
Signed-off-by: Tatsuhiko Yasumatsu <th.yasumatsu@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20210806150419.109658-1-th.yasumatsu@gmail.com
The WARN_ON_ONCE() invocation within the CONFIG_PREEMPT=y version of
rcu_note_context_switch() triggers when there is a voluntary context
switch in an RCU read-side critical section, but there is quite a gap
between the output of that WARN_ON_ONCE() and this RCU-usage error.
This commit therefore converts the WARN_ON_ONCE() to a WARN_ONCE()
that explicitly describes the problem in its message.
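The conversion looks roughly like this (the condition shown is a stand-in,
not the exact check in rcu_note_context_switch()):
        /* Before: fires, but says nothing about what went wrong. */
        WARN_ON_ONCE(!preempt && rcu_preempt_depth() > 0);
        /* After: same one-shot check, now with a self-describing message. */
        WARN_ONCE(!preempt && rcu_preempt_depth() > 0,
                  "Voluntary context switch within RCU read-side critical section!");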
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The cond_resched() function reports an RCU quiescent state only in
non-preemptible TREE RCU implementation. This commit therefore adds a
comment explaining why cond_resched() does nothing in preemptible kernels.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Neeraj Upadhyay <neeraju@codeaurora.org>
Cc: Joel Fernandes <joel@joelfernandes.org>
Cc: Uladzislau Rezki <urezki@gmail.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are a few remaining locations in kernel/rcu that still use
"&per_cpu()". This commit replaces them with "per_cpu_ptr(&)", and does
not introduce any functional change.
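The conversion is a one-for-one replacement; a small hypothetical example
(the struct, variable and cpu are placeholders):
        static DEFINE_PER_CPU(struct rcu_example, rcu_example_data);
        struct rcu_example *rep;
        /* Before: take the address of the per_cpu() lvalue. */
        rep = &per_cpu(rcu_example_data, cpu);
        /* After: equivalent, using the pointer-returning accessor directly. */
        rep = per_cpu_ptr(&rcu_example_data, cpu);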
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Reviewed-by: Neeraj Upadhyay <neeraju@codeaurora.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Within rcu_gp_fqs_loop(), the "ret" local variable is set to the
return value from swait_event_idle_timeout_exclusive(), but "ret" is
unconditionally overwritten later in the code. This commit therefore
removes this useless assignment.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit marks the accesses in tree_stall.h so as to both avoid
undesirable compiler optimizations and to keep KCSAN focused on the
accesses of the core algorithm.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The kbuild test project found an oversized stack frame in rcu_gp_kthread()
for some kernel configurations. This oversizing was due to a very large
amount of inlining, which is unnecessary due to the fact that this code
executes infrequently. This commit therefore marks rcu_gp_init() and
rcu_gp_fqs_loop() noinline_for_stack to conserve stack space.
Reported-by: kernel test robot <lkp@intel.com>
Tested-by: Rong Chen <rong.a.chen@intel.com>
[ paulmck: noinline_for_stack per Nathan Chancellor. ]
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Accesses to ->qsmask are normally protected by ->lock, but there is an
exception in the diagnostic code in rcu_check_boost_fail(). This commit
therefore applies data_race() to this access to avoid KCSAN complaining
about the C-language writes protected by ->lock.
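For illustration, a diagnostic-only read of ->qsmask can be wrapped like
this (the surrounding rcu_check_boost_fail() code is elided):
        /* Lockless diagnostic read; data_race() tells KCSAN this is intended,
         * while the ->lock-protected writes stay unmarked. */
        unsigned long qsmask = data_race(rnp->qsmask);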
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit marks some interrupt-induced read-side data races in
__srcu_read_lock(), __srcu_read_unlock(), and srcu_torture_stats_print().
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Systems with low-bandwidth consoles can have very large printk()
latencies, and on such systems it makes no sense to have the next RCU CPU
stall warning message start output before the prior message completed.
This commit therefore sets the time of the next stall only after the
prints have completed. While printing, the time of the next stall
message is set to ULONG_MAX/2 jiffies into the future.
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
rcu_cpu_stall_reset() is one of the functions virtual CPUs
execute during VM resume in order to handle jiffies skew
that can trigger false positive stall warnings. Paul has
pointed out that this approach is problematic because
rcu_cpu_stall_reset() disables RCU grace period stall-detection
virtually forever, while in fact it can just restart the
stall-detection timeout.
Suggested-by: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The soft watchdog timer function checks if a virtual machine
was suspended and hence what looks like a lockup in fact
is a false positive.
This is what kvm_check_and_clear_guest_paused() does: it
tests guest PVCLOCK_GUEST_STOPPED (which is set by the host)
and if it's set then we need to touch all watchdogs and bail
out.
Watchdog timer function runs from IRQ, so PVCLOCK_GUEST_STOPPED
check works fine.
There is, however, one more watchdog that runs from IRQ, so the
watchdog timer function races with it, and that watchdog is not aware
of PVCLOCK_GUEST_STOPPED: the RCU stall detector.
apic_timer_interrupt()
smp_apic_timer_interrupt()
hrtimer_interrupt()
__hrtimer_run_queues()
tick_sched_timer()
tick_sched_handle()
update_process_times()
rcu_sched_clock_irq()
This triggers RCU stalls on our devices during VM resume.
If tick_sched_handle()->rcu_sched_clock_irq() runs on a VCPU
before watchdog_timer_fn()->kvm_check_and_clear_guest_paused(),
then there is nothing on this VCPU that touches watchdogs, and
RCU reads a stale gp stall timestamp and a new jiffies value, which
makes it think that RCU has stalled.
Make the RCU stall watchdog aware of PVCLOCK_GUEST_STOPPED and
don't report RCU stalls when resuming the VM.
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
KCSAN flags accesses to ->rcu_read_lock_nesting as data races, but
in the past, the overhead of marked accesses was excessive. However,
that was long ago, and much has changed since then, both in terms of
hardware and of compilers. Here is data taken on an eight-core laptop
using Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz with a kernel built
using gcc version 9.3.0, with all data in nanoseconds.
Unmarked accesses (status quo), measured by three refscale runs:
Minimum reader duration: 3.286 2.851 3.395
Median reader duration: 3.698 3.531 3.4695
Maximum reader duration: 4.481 5.215 5.157
Marked accesses, also measured by three refscale runs:
Minimum reader duration: 3.501 3.677 3.580
Median reader duration: 4.053 3.723 3.895
Maximum reader duration: 7.307 4.999 5.511
This focused microbenchmark shows only sub-nanosecond differences, which
are unlikely to be visible at the system level. This commit therefore
marks data-racing accesses to ->rcu_read_lock_nesting.
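An illustrative sketch of what the marking amounts to (the real code goes
through small helpers; this is not the literal diff):
        /* Unmarked (status quo): KCSAN flags concurrent readers. */
        current->rcu_read_lock_nesting++;
        /* Marked: the concurrent access is annotated as intentional. */
        WRITE_ONCE(current->rcu_read_lock_nesting,
                   READ_ONCE(current->rcu_read_lock_nesting) + 1);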
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Accesses to the rcu_data structure's ->dynticks field have always been
fully ordered because it was not possible to prove that weaker ordering
was safe. However, with the removal of the rcu_eqs_special_set() function
and the advent of the Linux-kernel memory model, it is now easy to show
that two of the four original full memory barriers can be weakened to
acquire and release operations. The remaining pair must remain full
memory barriers. This change makes the memory ordering requirements
more evident, and it might well also speed up the to-idle and from-idle
fastpaths on some architectures.
The following litmus test, adapted from one supplied off-list by Frederic
Weisbecker, models the RCU grace-period kthread detecting an idle CPU
that is concurrently transitioning to non-idle:
C dynticks-from-idle
{
        DYNTICKS=0; (* Initially idle. *)
}
P0(int *X, int *DYNTICKS)
{
        int dynticks;
        int x;
        // Idle.
        dynticks = READ_ONCE(*DYNTICKS);
        smp_store_release(DYNTICKS, dynticks + 1);
        smp_mb();
        // Now non-idle
        x = READ_ONCE(*X);
}
P1(int *X, int *DYNTICKS)
{
        int dynticks;
        WRITE_ONCE(*X, 1);
        smp_mb();
        dynticks = smp_load_acquire(DYNTICKS);
}
exists (1:dynticks=0 /\ 0:x=1)
Running "herd7 -conf linux-kernel.cfg dynticks-from-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread (P1)
sees another CPU as idle (P0), then any memory access prior to the start
of the grace period (P1's write to X) will be seen by any RCU read-side
critical section following the to-non-idle transition (P0's read from X).
This is a straightforward use of full memory barriers to force ordering
in a store-buffering (SB) litmus test.
The following litmus test, also adapted from the one supplied off-list
by Frederic Weisbecker, models the RCU grace-period kthread detecting
a non-idle CPU that is concurrently transitioning to idle:
C dynticks-into-idle
{
        DYNTICKS=1; (* Initially non-idle. *)
}
P0(int *X, int *DYNTICKS)
{
        int dynticks;
        // Non-idle.
        WRITE_ONCE(*X, 1);
        dynticks = READ_ONCE(*DYNTICKS);
        smp_store_release(DYNTICKS, dynticks + 1);
        smp_mb();
        // Now idle.
}
P1(int *X, int *DYNTICKS)
{
        int x;
        int dynticks;
        smp_mb();
        dynticks = smp_load_acquire(DYNTICKS);
        x = READ_ONCE(*X);
}
exists (1:dynticks=2 /\ 1:x=0)
Running "herd7 -conf linux-kernel.cfg dynticks-into-idle.litmus" verifies
this transition, namely, showing that if the RCU grace-period kthread
(P1) sees another CPU as newly idle (P0), then any pre-idle memory access
(P0's write to X) will be seen by any code following the grace period
(P1's read from X). This is a simple release-acquire pair forcing
ordering in a message-passing (MP) litmus test.
Of course, if the grace-period kthread detects the CPU as non-idle,
it will refrain from reporting a quiescent state on behalf of that CPU,
so there are no ordering requirements from the grace-period kthread in
that case. However, other subsystems call rcu_is_idle_cpu() to check
for CPUs being non-idle from an RCU perspective. That case is also
verified by the above litmus tests with the proviso that the sense of
the low-order bit of the DYNTICKS counter be inverted.
Unfortunately, on x86 smp_mb() is as expensive as a cache-local atomic
increment. This commit therefore weakens only the read from ->dynticks.
However, the updates are abstracted into a rcu_dynticks_inc() function
to ease any future changes that might be needed.
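A hedged sketch of what such an update helper might look like (the real
rcu_dynticks_inc() lives in kernel/rcu/tree.c and may differ in detail):
        static unsigned long rcu_dynticks_inc(int incby)
        {
                /* All ->dynticks updates funnel through one fully ordered helper;
                 * only the reads elsewhere are weakened by this change. */
                return atomic_add_return(incby, this_cpu_ptr(&rcu_data.dynticks));
        }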
[ paulmck: Apply Linus Torvalds feedback. ]
Link: https://lore.kernel.org/lkml/20210721202127.2129660-4-paulmck@kernel.org/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Commit b8c17e6664 ("rcu: Maintain special bits at bottom of ->dynticks
counter") reserved a bit at the bottom of the ->dynticks counter to defer
flushing of TLBs, but this facility never has been used. This commit
therefore removes this capability along with the rcu_eqs_special_set()
function used to trigger it.
Link: https://lore.kernel.org/linux-doc/CALCETrWNPOOdTrFabTDd=H7+wc6xJ9rJceg6OL1S0rTV5pfSsA@mail.gmail.com/
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
[ paulmck: Forward-port to v5.13-rc1. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
If rcu_print_task_stall() is invoked on an rcu_node structure that does
not contain any tasks blocking the current grace period, it takes an
early exit that fails to release that rcu_node structure's lock. This
results in a self-deadlock, which is detected by lockdep.
To reproduce this bug:
tools/testing/selftests/rcutorture/bin/kvm.sh --allcpus --duration 3 --trust-make --configs "TREE03" --kconfig "CONFIG_PROVE_LOCKING=y" --bootargs "rcutorture.stall_cpu=30 rcutorture.stall_cpu_block=1 rcutorture.fwd_progress=0 rcutorture.test_boost=0"
This will also result in other complaints, including RCU's scheduler
hook complaining about blocking rather than preemption and an rcutorture
writer stall.
Only a partial RCU CPU stall warning message will be printed because of
the self-deadlock.
This commit therefore releases the lock on the rcu_print_task_stall()
function's early exit path.
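A condensed sketch of the shape of the fix (the surrounding loop and the
rnp/flags setup in rcu_print_task_stall() are elided):
        /* Early exit: no tasks are blocking this grace period on this rnp. */
        if (!rcu_preempt_blocked_readers_cgp(rnp)) {
                raw_spin_unlock_irqrestore_rcu_node(rnp, flags); /* previously missing */
                return 0;
        }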
Fixes: c583bcb8f5 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Tested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The for loop in rcu_print_task_stall() always omits ts[0], which points
to the first task blocking the stalled grace period. This in turn fails
to count this first task, which means that ndetected will be equal to
zero when all CPUs have passed through their quiescent states and only
one task is blocking the stalled grace period. This zero value for
ndetected will in turn result in an incorrect "All QSes seen" message:
rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
rcu: Tasks blocked on level-1 rcu_node (CPUs 12-23):
(detected by 15, t=6504 jiffies, g=164777, q=9011209)
rcu: All QSes seen, last rcu_preempt kthread activity 1 (4295252379-4295252378), jiffies_till_next_fqs=1, root ->qsmask 0x2
BUG: sleeping function called from invalid context at include/linux/uaccess.h:156
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 70613, name: msgstress04
INFO: lockdep is turned off.
Preemption disabled at:
[<ffff8000104031a4>] create_object.isra.0+0x204/0x4b0
CPU: 15 PID: 70613 Comm: msgstress04 Kdump: loaded Not tainted
5.12.2-yoctodev-standard #1
Hardware name: Marvell OcteonTX CN96XX board (DT)
Call trace:
dump_backtrace+0x0/0x2cc
show_stack+0x24/0x30
dump_stack+0x110/0x188
___might_sleep+0x214/0x2d0
__might_sleep+0x7c/0xe0
This commit therefore fixes the loop to include ts[0].
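An illustrative sketch of the off-by-one in the loop bounds (i, nblk and
the print helper are placeholders, not the exact code):
        /* Before: stops at i == 1 and never visits ts[0]. */
        for (i = nblk - 1; i > 0; i--)
                print_blocked_task(ts[i]);
        /* After: ts[0], the first blocked task, is counted and printed too. */
        for (i = nblk - 1; i >= 0; i--)
                print_blocked_task(ts[i]);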
Fixes: c583bcb8f5 ("rcu: Don't invoke try_invoke_on_locked_down_task() with irqs disabled")
Tested-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
As callbacks to a tracepoint are paired with the data that is passed in when
the callback is registered to the tracepoint, it must have that data passed
to the callback when the tracepoint is triggered, else bad things will
happen. To keep the two together, they are both assigned to a tracepoint
structure and added to an array. The tracepoint call site will dereference
the structure (via RCU) and call the callback in that structure along with
the data in that structure. This keeps the callback and data tightly
coupled.
Because of the overhead that retpolines have on tracepoint callbacks, if
there's only one callback attached to a tracepoint (a common case), then it
is called via a static call (code modified to do a direct call instead of an
indirect call). But to implement this, the data had to be decoupled from the
callback, as now the callback is implemented via a direct call from the
static call and not an indirect call from the dereferenced structure.
Note, the static call only calls a callback used when there's a single
callback attached to the tracepoint. If more than one callback is attached
to the same tracepoint, then the static call will call an iterator
function that goes back to dereferencing the structure keeping the callback
and its data tightly coupled again.
Issues can arise when going from 0 callbacks to one, as the static call is
assigned to the callback, and it must take care that the data passed to it
is loaded before the static call calls the callback. Going from 1 to 2
callbacks is not an issue, as long as the static call is updated to the
iterator before the tracepoint structure array is updated via RCU. Going
from 2 to more or back down to 2 is not an issue as the iterator can handle
all these cases. But going from 2 to 1, care must be taken as the static
call is now calling a callback and the data that is loaded must be the data
for that callback.
Care was taken to ensure the callback and data would be in-sync, but after
a bug was reported, it became clear that not enough was done to make sure
that was the case. These changes address this.
The first change is to compare the old and new data instead of the old and
new callback, as it's the data that can corrupt the callback, even if the
callback is the same (something getting freed).
The next change is to convert these transitions into states, to make it
easier to know when a synchronization is needed, and to perform those
synchronizations. The problem with this patch is that it slows down
disabling all events from under a second to over 10 seconds for the same
work. But that is addressed in the final patch.
The final patch uses the RCU state functions to keep track of the RCU state
between the transitions, and only needs to perform the synchronization if an
RCU synchronization hasn't been done already. This brings the performance of
disabling all events back to its original value. That's because no
synchronization is required between disabling tracepoints but is required
when enabling a tracepoint after it's been disabled. If an RCU
synchronization happens after the tracepoint is disabled, and before it is
re-enabled, there's no need to do the synchronization again.
Both the second and third patches have subtle complexities, which is why they
are separated into two patches. But because the second patch causes such a
regression in performance, the third patch adds a "Fixes" tag to the second
patch, so that the two must be backported together and not just the second
patch.
Merge tag 'trace-v5.14-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Fix tracepoint race between static_call and callback data
As callbacks to a tracepoint are paired with the data that is passed
in when the callback is registered to the tracepoint, it must have
that data passed to the callback when the tracepoint is triggered,
else bad things will happen. To keep the two together, they are both
assigned to a tracepoint structure and added to an array. The
tracepoint call site will dereference the structure (via RCU) and call
the callback in that structure along with the data in that structure.
This keeps the callback and data tightly coupled.
Because of the overhead that retpolines have on tracepoint callbacks,
if there's only one callback attached to a tracepoint (a common case),
then it is called via a static call (code modified to do a direct call
instead of an indirect call). But to implement this, the data had to
be decoupled from the callback, as now the callback is implemented via
a direct call from the static call and not an indirect call from the
dereferenced structure.
Note, the static call only calls a callback used when there's a single
callback attached to the tracepoint. If more than one callback is
attached to the same tracepoint, then the static call will call an
iterator function that goes back to dereferencing the structure
keeping the callback and its data tightly coupled again.
Issues can arise when going from 0 callbacks to one, as the static
call is assigned to the callback, and it must take care that the data
passed to it is loaded before the static call calls the callback.
Going from 1 to 2 callbacks is not an issue, as long as the static
call is updated to the iterator before the tracepoint structure array
is updated via RCU. Going from 2 to more or back down to 2 is not an
issue as the iterator can handle all these cases. But going from 2 to
1, care must be taken as the static call is now calling a callback and
the data that is loaded must be the data for that callback.
Care was taken to ensure the callback and data would be in-sync, but
after a bug was reported, it became clear that not enough was done to
make sure that was the case. These changes address this.
The first change is to compare the old and new data instead of the old
and new callback, as it's the data that can corrupt the callback, even
if the callback is the same (something getting freed).
The next change is to convert these transitions into states, to make
it easier to know when a synchronization is needed, and to perform
those synchronizations. The problem with this patch is that it slows
down disabling all events from under a second to over 10 seconds for
the same work. But that is addressed in the final patch.
The final patch uses the RCU state functions to keep track of the RCU
state between the transitions, and only needs to perform the
synchronization if an RCU synchronization hasn't been done already.
This brings the performance of disabling all events back to its
original value. That's because no synchronization is required between
disabling tracepoints but is required when enabling a tracepoint after
it's been disabled. If an RCU synchronization happens after the
tracepoint is disabled, and before it is re-enabled, there's no need
to do the synchronization again.
Both the second and third patches have subtle complexities, which is why
they are separated into two patches. But because the second patch causes
such a regression in performance, the third patch adds a "Fixes" tag to
the second patch, so that the two must be backported together and not
just the second patch"
* tag 'trace-v5.14-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoint: Use rcu get state and cond sync for static call updates
tracepoint: Fix static call function vs data state mismatch
tracepoint: static call: Compare data on transition from 2->1 callees
State transitions from 1->0->1 and N->2->1 callbacks require RCU
synchronization. Rather than performing the RCU synchronization every
time the state change occurs, which is quite slow when many tracepoints
are registered in batch, instead keep a snapshot of the RCU state on the
most recent transitions which belong to a chain, and conditionally wait
for a grace period on the last transition of the chain if one g.p. has
not elapsed since the last snapshot.
This applies to both RCU and SRCU.
This brings the performance, which regressed with commit 231264d692
("Fix: tracepoint: static call function vs data state mismatch"), back to
what it was originally.
Before this commit:
# trace-cmd start -e all
# time trace-cmd start -p nop
real 0m10.593s
user 0m0.017s
sys 0m0.259s
After this commit:
# trace-cmd start -e all
# time trace-cmd start -p nop
real 0m0.878s
user 0m0.000s
sys 0m0.103s
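A minimal sketch of the idea using the public RCU API (the tracepoint code
also handles SRCU and chains several transitions; the cookie variable below
is illustrative):
        static unsigned long rcu_cookie;
        /* On a transition that would otherwise need synchronize_rcu(),
         * just record where the grace-period machinery currently is. */
        rcu_cookie = get_state_synchronize_rcu();
        /* Later, on the last transition of the chain, wait only if no
         * grace period has elapsed since the snapshot was taken. */
        cond_synchronize_rcu(rcu_cookie);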
Link: https://lkml.kernel.org/r/20210805192954.30688-1-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: 231264d692 ("Fix: tracepoint: static call function vs data state mismatch")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The Energy Model (EM) provides useful information about device power in
each performance state to other subsystems like the Energy Aware Scheduler
(EAS). The energy calculation in EAS does arithmetic operations based on
the EM's em_cpu_energy(). The current implementation of that function uses
em_perf_state::cost as a pre-computed cost coefficient equal to:
cost = power * max_frequency / frequency.
The 'power' is expressed in milli-Watts (or in an abstract scale).
There are corner cases where the EAS energy calculation for two Performance
Domains (PDs) returns the same value. EAS compares these values to choose
the smaller one. It might happen that these values are equal due to
rounding error. In such a scenario, we need better resolution, e.g. 1000
times better. To provide this possibility, increase the resolution of
em_perf_state::cost for 64-bit architectures. The cost of increasing the
resolution on 32-bit is pretty high (64-bit division) and is not justified,
since no new 32-bit big.LITTLE EAS systems are expected which would benefit
from this higher resolution.
This patch avoids the milli-Watt rounding errors, which might occur in the
EAS energy estimation for each PD. The rounding error is common for small
tasks which have small utilization values.
There are two places in the code where it makes a difference:
1. In find_energy_efficient_cpu(), where we are searching for
best_delta. We might suffer there when two PDs return the same result,
as in the example below.
Scenario:
Low utilized system, e.g. ~200 sum_util for PD0 and ~220 for PD1. There
are quite a few small tasks, ~10-15 util. These tasks would suffer from
the rounding error. These utilization values are typical when running games
on Android. One of our partners has reported 5..10mA less battery drain
when running with increased resolution.
Some details:
We have two PDs: PD0 (big) and PD1 (little)
Let's compare w/o patch set ('old') and w/ patch set ('new')
We are comparing energy w/ task and w/o task placed in the PDs
a) 'old' w/o patch set, PD0
task_util = 13
cost = 480
sum_util_w/o_task = 215
sum_util_w_task = 228
scale_cpu = 1024
energy_w/o_task = 480 * 215 / 1024 = 100.78 => 100
energy_w_task = 480 * 228 / 1024 = 106.87 => 106
energy_diff = 106 - 100 = 6
(this is equal to 'old' PD1's energy_diff in 'c)')
b) 'new' w/ patch set, PD0
task_util = 13
cost = 480 * 1000 = 480000
sum_util_w/o_task = 215
sum_util_w_task = 228
energy_w/o_task = 480000 * 215 / 1024 = 100781
energy_w_task = 480000 * 228 / 1024 = 106875
energy_diff = 106875 - 100781 = 6094
(this is not equal to 'new' PD1's energy_diff in 'd)')
c) 'old' w/o patch set, PD1
task_util = 13
cost = 160
sum_util_w/o_task = 283
sum_util_w_task = 293
scale_cpu = 355
energy_w/o_task = 160 * 283 / 355 = 127.55 => 127
energy_w_task = 160 * 296 / 355 = 133.41 => 133
energy_diff = 133 - 127 = 6
(this is equal to 'old' PD0's energy_diff in 'a)')
d) 'new' w/ patch set, PD1
task_util = 13
cost = 160 * 1000 = 160000
sum_util_w/o_task = 283
sum_util_w_task = 293
scale_cpu = 355
energy_w/o_task = 160000 * 283 / 355 = 127549
energy_w_task = 160000 * 296 / 355 = 133408
energy_diff = 133408 - 127549 = 5859
(this is not equal to 'new' PD0's energy_diff in 'b)')
2. Difference in the 6% energy margin filter at the end of
find_energy_efficient_cpu(). With this patch the margin comparison also
has better resolution, so it's possible to have better task placement
thanks to that.
Fixes: 27871f7a8a ("PM: Introduce an Energy Model management framework")
Reported-by: CCJ Yeh <CCj.Yeh@mediatek.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
SCHED_FLAG_KEEP_PARAMS can be passed to sched_setattr to specify that
the call must not touch scheduling parameters (nice or priority). This
is particularly handy for uclamp when used in conjunction with
SCHED_FLAG_KEEP_POLICY as that allows issuing a syscall that only
impacts uclamp values.
However, sched_setattr always checks whether the priorities and nice
values passed in sched_attr are valid first, even if those never get
used down the line. This is useless at best since userspace can
trivially bypass this check to set the uclamp values by specifying low
priorities. However, it is cumbersome to do so as there is no single
expression of this that skips both RT and CFS checks at once. As such,
userspace needs to query the task policy first with e.g. sched_getattr
and then set sched_attr.sched_priority accordingly. This is racy and
slower than a single call.
As the priority and nice checks are useless when SCHED_FLAG_KEEP_PARAMS
is specified, simply inherit them in this case to match the policy
inheritance of SCHED_FLAG_KEEP_POLICY.
Reported-by: Wei Wang <wvw@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-3-qperret@google.com
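For illustration, a user-space sketch of the single call this enables
(struct layout and flag values mirror include/uapi/linux/sched.h; treat
them as assumptions and verify against your headers):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <string.h>
  #include <unistd.h>
  #include <sys/syscall.h>
  #include <sys/types.h>

  struct sched_attr {
          uint32_t size;
          uint32_t sched_policy;
          uint64_t sched_flags;
          int32_t  sched_nice;
          uint32_t sched_priority;
          uint64_t sched_runtime, sched_deadline, sched_period;
          uint32_t sched_util_min, sched_util_max;
  };

  #define SCHED_FLAG_KEEP_POLICY          0x08
  #define SCHED_FLAG_KEEP_PARAMS          0x10
  #define SCHED_FLAG_UTIL_CLAMP_MIN       0x20
  #define SCHED_FLAG_UTIL_CLAMP_MAX       0x40

  static long set_uclamp_only(pid_t pid, unsigned int min, unsigned int max)
  {
          struct sched_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          /* keep the policy and (with this patch) priority/nice untouched */
          attr.sched_flags = SCHED_FLAG_KEEP_POLICY | SCHED_FLAG_KEEP_PARAMS |
                             SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
          attr.sched_util_min = min;
          attr.sched_util_max = max;

          return syscall(SYS_sched_setattr, pid, &attr, 0);
  }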
The UCLAMP_FLAG_IDLE flag is set on a runqueue when dequeueing the last
uclamp active task (that is, when buckets.tasks reaches 0 for all
buckets) to maintain the last uclamp.max and prevent blocked util from
suddenly becoming visible.
However, there is an asymmetry in how the flag is set and cleared which
can lead to having the flag set whilst there are active tasks on the rq.
Specifically, the flag is cleared in the uclamp_rq_inc() path, which is
called at enqueue time, but set in uclamp_rq_dec_id() which is called
both when dequeueing a task _and_ in the update_uclamp_active() path. As
a result, when both uclamp_rq_{inc,dec}_id() are called from
update_uclamp_active(), the flag ends up being set but not cleared,
hence leaving the runqueue in a broken state.
Fix this by clearing the flag in update_uclamp_active() as well.
Fixes: e496187da7 ("sched/uclamp: Enforce last task's UCLAMP_MAX")
Reported-by: Rick Yiu <rickyiu@google.com>
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Qais Yousef <qais.yousef@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lore.kernel.org/r/20210805102154.590709-2-qperret@google.com
A missing clock update is causing the following warning:
rq->clock_update_flags < RQCF_ACT_SKIP
WARNING: CPU: 112 PID: 2041 at kernel/sched/sched.h:1453
sub_running_bw.isra.0+0x190/0x1a0
...
CPU: 112 PID: 2041 Comm: sugov:112 Tainted: G W 5.14.0-rc1 #1
Hardware name: WIWYNN Mt.Jade Server System
B81.030Z1.0007/Mt.Jade Motherboard, BIOS 1.6.20210526 (SCP:
1.06.20210526) 2021/05/26
...
Call trace:
sub_running_bw.isra.0+0x190/0x1a0
migrate_task_rq_dl+0xf8/0x1e0
set_task_cpu+0xa8/0x1f0
try_to_wake_up+0x150/0x3d4
wake_up_q+0x64/0xc0
__up_write+0xd0/0x1c0
up_write+0x4c/0x2b0
cppc_set_perf+0x120/0x2d0
cppc_cpufreq_set_target+0xe0/0x1a4 [cppc_cpufreq]
__cpufreq_driver_target+0x74/0x140
sugov_work+0x64/0x80
kthread_worker_fn+0xe0/0x230
kthread+0x138/0x140
ret_from_fork+0x10/0x18
The task causing this is the `cppc_fie` DL task introduced by
commit 1eb5dde674 ("cpufreq: CPPC: Add support for frequency
invariance").
With CONFIG_ACPI_CPPC_CPUFREQ_FIE=y and schedutil cpufreq governor on
slow-switching system (like on this Ampere Altra WIWYNN Mt. Jade Arm
Server):
DL task `curr=sugov:112` lets `p=cppc_fie` migrate and since the latter
is in `non_contending` state, migrate_task_rq_dl() calls
sub_running_bw()->__sub_running_bw()->cpufreq_update_util()->
rq_clock()->assert_clock_updated()
on p.
Fix this by updating the clock for a non_contending task in
migrate_task_rq_dl() before calling sub_running_bw().
Reported-by: Bruno Goncalves <bgoncalv@redhat.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@kernel.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20210804135925.3734605-1-dietmar.eggemann@arm.com
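A minimal sketch of the shape of the fix (hypothetical helper, not the
deadline.c function verbatim; the caller holds the rq lock):

  static void fix_non_contending_accounting(struct rq *rq, struct task_struct *p)
  {
          update_rq_clock(rq);                    /* the missing clock update */
          sub_running_bw(&p->dl, &rq->dl);        /* may reach cpufreq_update_util() */
  }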
Build failure in drivers/net/wwan/mhi_wwan_mbim.c:
add missing parameter (0, assuming we don't want buffer pre-alloc).
Conflict in drivers/net/dsa/sja1105/sja1105_main.c between:
589918df93 ("net: dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
0fac6aa098 ("net: dsa: sja1105: delete the best_effort_vlan_filtering mode")
Follow the instructions from the commit message of the former commit
- removed the if conditions. When looking at commit 589918df93 ("net:
dsa: sja1105: be stateless with FDB entries on SJA1105P/Q/R/S/SJA1110 too")
note that the mask_iotag fields get removed by the following patch.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
On a 1->0->1 callbacks transition, there is an issue with the new
callback using the old callback's data.
Considering __DO_TRACE_CALL:
do { \
struct tracepoint_func *it_func_ptr; \
void *__data; \
it_func_ptr = \
rcu_dereference_raw((&__tracepoint_##name)->funcs); \
if (it_func_ptr) { \
__data = (it_func_ptr)->data; \
----> [ delayed here on one CPU (e.g. vcpu preempted by the host) ]
static_call(tp_func_##name)(__data, args); \
} \
} while (0)
It has loaded the tp->funcs of the old callback, so it will try to use the old
data. This can be fixed by adding a RCU sync anywhere in the 1->0->1
transition chain.
On a N->2->1 transition, we need an rcu-sync because you may have a
sequence of 3->2->1 (or 1->2->1) where the element 0 data is unchanged
between 2->1, but was changed from 3->2 (or from 1->2), which may be
observed by the static call. This can be fixed by adding an
unconditional RCU sync in transition 2->1.
Note, this fixes a correctness issue at the cost of adding a tremendous
performance regression to the disabling of tracepoints.
Before this commit:
# trace-cmd start -e all
# time trace-cmd start -p nop
real 0m0.778s
user 0m0.000s
sys 0m0.061s
After this commit:
# trace-cmd start -e all
# time trace-cmd start -p nop
real 0m10.593s
user 0m0.017s
sys 0m0.259s
A follow up fix will introduce a more lightweight scheme based on RCU
get_state and cond_sync, that will return the performance back to what it
was. As both this change and the lightweight versions are complex on their
own, for bisecting any issues that this may cause, they are kept as two
separate changes.
Link: https://lkml.kernel.org/r/20210805132717.23813-3-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: d25e37d89d ("tracepoint: Optimize using static_call()")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
On transition from 2->1 callees, we should be comparing .data rather
than .func, because the same callback can be registered twice with
different data, and what we care about here is that the data of array
element 0 is unchanged to skip rcu sync.
Link: https://lkml.kernel.org/r/20210805132717.23813-2-mathieu.desnoyers@efficios.com
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Stefan Metzmacher <metze@samba.org>
Fixes: 547305a646 ("tracepoint: Fix out of sync data passing by static caller")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
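An illustrative reduction of that check (hypothetical helper, not the
tracepoint.c code):

  static bool may_skip_rcu_sync(struct tracepoint_func *new0,
                                struct tracepoint_func *old0)
  {
          /*
           * The same callback can be registered twice with different data,
           * so only an unchanged .data in element 0 justifies skipping the
           * rcu sync on the 2->1 transition.
           */
          return new0->data == old0->data;        /* was: ->func == ->func */
  }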
Pull ucounts fix from Eric Biederman:
"Fix a subtle locking versus reference counting bug in the ucount
changes, found by syzbot"
* 'for-v5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ucounts: Fix race condition between alloc_ucounts and put_ucounts
Merge tag 'trace-v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Various tracing fixes:
- Fix NULL pointer dereference caused by an error path
- Give histogram calculation fields a size, otherwise it breaks
synthetic creation based on them.
- Reject strings being used for number calculations.
- Fix recordmcount.pl warning on llvm building RISC-V allmodconfig
- Fix the draw_functrace.py script to handle the new trace output
- Fix warning of smp_processor_id() in preemptible code"
* tag 'trace-v5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Quiet smp_processor_id() use in preemptable warning in hwlat
scripts/tracing: fix the bug that can't parse raw_trace_func
scripts/recordmcount.pl: Remove check_objcopy() and $can_use_local
tracing: Reject string operand in the histogram expression
tracing / histogram: Give calculation hist_fields a size
tracing: Fix NULL pointer dereference in start_creating
The hardware latency detector (hwlat) has a mode that it runs one thread
across CPUs. The logic to move from the currently running CPU to the next
one in the list does a smp_processor_id() to find where it currently is.
Unfortunately, it's done with preemption enabled, and this triggers a
warning for using smp_processor_id() in a preempt enabled section.
As it is only using smp_processor_id() to get information on where it
currently is in order to simply move it to the next CPU, it doesn't really
care if it got moved in the meantime. It will simply balance out later if
such a case arises.
Switch smp_processor_id() to raw_smp_processor_id() to quiet that warning.
Link: https://lkml.kernel.org/r/20210804141848.79edadc0@oasis.local.home
Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Fixes: 8fa826b734 ("trace/hwlat: Implement the mode config option")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
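A minimal sketch of the pattern (not the exact hwlat code): the CPU number
is only a hint for picking the next CPU, so the raw variant is acceptable
even with preemption enabled.

  static int pick_next_cpu(const struct cpumask *mask)
  {
          int cur = raw_smp_processor_id();       /* no preemption warning */
          int next = cpumask_next(cur, mask);

          if (next >= nr_cpu_ids)
                  next = cpumask_first(mask);
          return next;
  }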
Since a string type cannot be the operand of an addition / subtraction
operation, it must be rejected. Without this fix, a string type is silently
converted to a number.
Link: https://lkml.kernel.org/r/162742654278.290973.1523000673366456634.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: 100719dcef ("tracing: Add simple expression support to hist triggers")
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
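A sketch of the added guard (hypothetical helper; the flag name follows
kernel/trace/trace_events_hist.c, the error handling is simplified):

  static int check_expr_operand(struct hist_field *operand)
  {
          /* string fields cannot take part in +/- expressions */
          if (operand->flags & HIST_FIELD_FL_STRING)
                  return -EINVAL;
          return 0;
  }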
When working on my user space applications, I found a bug in the synthetic
event code where the automated synthetic event field was not matching the
event field calculation it was attached to. Looking deeper into it, it was
because the calculation hist_field was not given a size.
The synthetic event fields are matched to their hist_fields either by
having the field have an identical string type, or if that does not match,
then the size and signed values are used to match the fields.
The problem arose when I tried to match a calculation where the fields
were "unsigned int". My tool created a synthetic event of type "u32". But
it failed to match. The string was:
diff=field1-field2:onmatch(event).trace(synth,$diff)
Adding debugging into the kernel, I found that the size of "diff" was 0.
And since it was given "unsigned int" as a type, the histogram fallback
code used size and signed. The signed matched, but the size of u32 (4) did
not match zero, and the event failed to be created.
This can be worse if the field you want to match is not one of the
acceptable fields for a synthetic event. As event fields can have any type
that is supported in Linux, this can cause an issue. For example, if a
type is an enum, then there is no way to use it in any calculations.
Have the calculation field simply take on the size of what it is
calculating.
Link: https://lkml.kernel.org/r/20210730171951.59c7743f@oasis.local.home
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 100719dcef ("tracing: Add simple expression support to hist triggers")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
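A sketch of the idea (hypothetical helper; the actual hist_field handling in
trace_events_hist.c may differ in detail):

  static void set_expr_size(struct hist_field *expr, struct hist_field *operand)
  {
          /*
           * The expression inherits the size of what it is calculating,
           * instead of being left at 0 and failing the synthetic match.
           */
          expr->size = operand->size;
  }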
Test RTC_FEATURE_ALARM instead of relying on ops->set_alarm to know whether
alarms are available.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
The functions get_online_cpus() and put_online_cpus() have been
deprecated during the CPU hotplug rework. They map directly to
cpus_read_lock() and cpus_read_unlock().
Replace deprecated CPU-hotplug functions with the official version.
The behavior remains unchanged.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
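The mechanical shape of the replacement (illustrative helper):

  static void walk_online_cpus(void (*fn)(int cpu))
  {
          int cpu;

          cpus_read_lock();                       /* was: get_online_cpus() */
          for_each_online_cpu(cpu)
                  fn(cpu);
          cpus_read_unlock();                     /* was: put_online_cpus() */
  }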
When select_idle_cpu starts scanning for an idle CPU, it starts with
a target CPU that has already been checked by select_idle_sibling.
This patch starts with the next CPU instead.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210804115857.6253-3-mgorman@techsingularity.net
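A sketch of the scan with the adjusted starting point (illustrative helper,
not the fair.c function):

  static int scan_for_idle(const struct cpumask *cpus, int target)
  {
          int cpu;

          /*
           * select_idle_sibling() already rejected 'target' itself,
           * so start the wrap-around scan at the next CPU.
           */
          for_each_cpu_wrap(cpu, cpus, target + 1) {
                  if (available_idle_cpu(cpu))
                          return cpu;
          }
          return -1;
  }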
After select_idle_sibling, p->recent_used_cpu is set to the
new target. However on the next wakeup, prev will be the same as
recent_used_cpu unless the load balancer has moved the task since the
last wakeup. It still works, but is less efficient than it could be.
This patch preserves recent_used_cpu for longer.
The impact on SIS efficiency is tiny so the SIS statistic patches were
used to track the hit rate for using recent_used_cpu. With perf bench
pipe on a 2-socket Cascadelake machine, the hit rate went from 57.14%
to 85.32%. For more intensive wakeup loads like hackbench, the hit rate
is almost negligible but rose from 0.21% to 6.64%. For scaling loads
like tbench, the hit rate goes from almost 0% to 25.42% overall. Broadly
speaking, on tbench, the success rate is much higher for lower thread
counts and drops to almost 0 as the workload scales towards saturation.
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210804115857.6253-2-mgorman@techsingularity.net
SCHED_FLAG_SUGOV is supposed to be a kernel-only flag that userspace
cannot interact with. However, sched_getattr() currently reports it
in sched_flags if called on a sugov worker even though it is not
actually defined in a UAPI header. To avoid this, make sure to
clean up the sched_flags field in sched_getattr() before returning to
userspace.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210727101103.2729607-3-qperret@google.com
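A sketch of the clean-up (hypothetical helper; SCHED_FLAG_ALL is the UAPI
mask of user-visible flags, which kernel-only bits such as SCHED_FLAG_SUGOV
fall outside of):

  static void sanitize_uapi_flags(struct sched_attr *kattr)
  {
          /* mask out kernel-internal bits before copying back to userspace */
          kattr->sched_flags &= SCHED_FLAG_ALL;
  }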
It is possible for sched_getattr() to incorrectly report the state of
the reset_on_fork flag when called on a deadline task.
Indeed, if the flag was set on a deadline task using sched_setattr()
with flags (SCHED_FLAG_RESET_ON_FORK | SCHED_FLAG_KEEP_PARAMS), then
p->sched_reset_on_fork will be set, but __setscheduler() will bail out
early, which means that the dl_se->flags will not get updated by
__setscheduler_params()->__setparam_dl(). Consequently, if
sched_getattr() is then called on the task, __getparam_dl() will
override kattr.sched_flags with the now out-of-date copy in dl_se->flags
and report the stale value to userspace.
To fix this, make sure to only copy the flags that are relevant to
sched_deadline to and from the dl_se->flags field.
Signed-off-by: Quentin Perret <qperret@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20210727101103.2729607-2-qperret@google.com
Use the loop variable instead of the function argument to test the
other SMT siblings for idle.
Fixes: ff7db0bf24 ("sched/numa: Prefer using an idle CPU as a migration target instead of comparing tasks")
Signed-off-by: Mika Penttilä <mika.penttila@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Pankaj Gupta <pankaj.gupta@ionos.com>
Link: https://lkml.kernel.org/r/20210722063946.28951-1-mika.penttila@gmail.com
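An illustrative reduction of the bug pattern (not the exact fair.c code):

  static bool smt_siblings_idle(int core)
  {
          int cpu;

          for_each_cpu(cpu, cpu_smt_mask(core)) {
                  if (cpu == core)
                          continue;
                  if (!idle_cpu(cpu))     /* the bug tested 'core' here */
                          return false;
          }
          return true;
  }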
Double enqueues in rt runqueues (list) have been reported while running
a simple test that spawns a number of threads doing a short sleep/run
pattern while being concurrently setscheduled between rt and fair class.
WARNING: CPU: 3 PID: 2825 at kernel/sched/rt.c:1294 enqueue_task_rt+0x355/0x360
CPU: 3 PID: 2825 Comm: setsched__13
RIP: 0010:enqueue_task_rt+0x355/0x360
Call Trace:
__sched_setscheduler+0x581/0x9d0
_sched_setscheduler+0x63/0xa0
do_sched_setscheduler+0xa0/0x150
__x64_sys_sched_setscheduler+0x1a/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
list_add double add: new=ffff9867cb629b40, prev=ffff9867cb629b40,
next=ffff98679fc67ca0.
kernel BUG at lib/list_debug.c:31!
invalid opcode: 0000 [#1] PREEMPT_RT SMP PTI
CPU: 3 PID: 2825 Comm: setsched__13
RIP: 0010:__list_add_valid+0x41/0x50
Call Trace:
enqueue_task_rt+0x291/0x360
__sched_setscheduler+0x581/0x9d0
_sched_setscheduler+0x63/0xa0
do_sched_setscheduler+0xa0/0x150
__x64_sys_sched_setscheduler+0x1a/0x30
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xae
__sched_setscheduler() uses rt_effective_prio() to handle proper queuing
of priority boosted tasks that are setscheduled while being boosted.
rt_effective_prio() is however called twice per each
__sched_setscheduler() call: first directly by __sched_setscheduler()
before dequeuing the task and then by __setscheduler() to actually do
the priority change. If the priority of the pi_top_task is concurrently
being changed however, it might happen that the two calls return
different results. If, for example, the first call returned the same rt
priority the task was running at and the second one a fair priority, the
task won't be removed from the rt list (on_list still set) and will then be
enqueued in the fair runqueue. When eventually setscheduled back to rt
it will be seen as enqueued already and the WARNING/BUG be issued.
Fix this by calling rt_effective_prio() only once and then reusing the
return value. While at it refactor code as well for clarity. Concurrent
priority inheritance handling is still safe and will eventually converge
to a new state by following the inheritance chain(s).
Fixes: 0782e63bc6 ("sched: Handle priority boosted tasks proper in setscheduler()")
[squashed Peterz changes; added changelog]
Reported-by: Mark Simmons <msimmons@redhat.com>
Signed-off-by: Juri Lelli <juri.lelli@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210803104501.38333-1-juri.lelli@redhat.com
Implementing live patching on s390 requires each function's prologue to
contain a very special kind of nop, which gcc and clang don't generate.
However, the current code assumes that if CC_USING_NOP_MCOUNT is
defined, then whatever the compiler generates is good enough.
Move the CC_USING_NOP_MCOUNT check into the new ftrace_need_init_nop()
macro, that the architectures can override.
An alternative solution is to disable using -mnop-mcount in the
Makefile; however, this makes the build logic (even) more complicated
and forces the arch-specific code to deal with the useless __fentry__
symbol.
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Link: https://lore.kernel.org/r/20210728212546.128248-2-iii@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
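A sketch of the overridable default, assuming the names from the
description (an architecture would define its own ftrace_need_init_nop()):

  #ifndef ftrace_need_init_nop
  #define ftrace_need_init_nop() (!__is_defined(CC_USING_NOP_MCOUNT))
  #endif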
Before, the interpreter allowed up to MAX_TAIL_CALL_CNT + 1 tail calls.
Now precisely MAX_TAIL_CALL_CNT is allowed, which is in line with the
behavior of the x86 JITs.
Signed-off-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210728164741.350370-1-johan.almbladh@anyfinetworks.com
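An illustrative reduction of the interpreter-side check (hypothetical
helper, not the core.c code verbatim):

  static bool tail_call_allowed(u32 *tail_call_cnt)
  {
          /* precisely MAX_TAIL_CALL_CNT tail calls may pass */
          if (*tail_call_cnt >= MAX_TAIL_CALL_CNT)
                  return false;
          (*tail_call_cnt)++;
          return true;
  }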
Andrii Nakryiko says:
====================
bpf-next 2021-07-30
We've added 64 non-merge commits during the last 15 day(s) which contain
a total of 83 files changed, 5027 insertions(+), 1808 deletions(-).
The main changes are:
1) BTF-guided binary data dumping libbpf API, from Alan.
2) Internal factoring out of libbpf CO-RE relocation logic, from Alexei.
3) Ambient BPF run context and cgroup storage cleanup, from Andrii.
4) Few small API additions for libbpf 1.0 effort, from Evgeniy and Hengqi.
5) bpf_program__attach_kprobe_opts() fixes in libbpf, from Jiri.
6) bpf_{get,set}sockopt() support in BPF iterators, from Martin.
7) BPF map pinning improvements in libbpf, from Martynas.
8) Improved module BTF support in libbpf and bpftool, from Quentin.
9) Bpftool cleanups and documentation improvements, from Quentin.
10) Libbpf improvements for supporting CO-RE on old kernels, from Shuyi.
11) Increased maximum cgroup storage size, from Stanislav.
12) Small fixes and improvements to BPF tests and samples, from various folks.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (64 commits)
tools: bpftool: Complete metrics list in "bpftool prog profile" doc
tools: bpftool: Document and add bash completion for -L, -B options
selftests/bpf: Update bpftool's consistency script for checking options
tools: bpftool: Update and synchronise option list in doc and help msg
tools: bpftool: Complete and synchronise attach or map types
selftests/bpf: Check consistency between bpftool source, doc, completion
tools: bpftool: Slightly ease bash completion updates
unix_bpf: Fix a potential deadlock in unix_dgram_bpf_recvmsg()
libbpf: Add btf__load_vmlinux_btf/btf__load_module_btf
tools: bpftool: Support dumping split BTF by id
libbpf: Add split BTF support for btf__load_from_kernel_by_id()
tools: Replace btf__get_from_id() with btf__load_from_kernel_by_id()
tools: Free BTF objects at various locations
libbpf: Rename btf__get_from_id() as btf__load_from_kernel_by_id()
libbpf: Rename btf__load() as btf__load_into_kernel()
libbpf: Return non-null error on failures in libbpf_find_prog_btf_id()
bpf: Emit better log message if bpf_iter ctx arg btf_id == 0
tools/resolve_btfids: Emit warnings and patch zero id for missing symbols
bpf: Increase supported cgroup storage value size
libbpf: Fix race when pinning maps in parallel
...
====================
Link: https://lore.kernel.org/r/20210730225606.1897330-1-andrii@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.14-rc4, including fixes from bpf, can, WiFi
(mac80211) and netfilter trees.
Current release - regressions:
- mac80211: fix starting aggregation sessions on mesh interfaces
Current release - new code bugs:
- sctp: send pmtu probe only if packet loss in Search Complete state
- bnxt_en: add missing periodic PHC overflow check
- devlink: fix phys_port_name of virtual port and merge error
- hns3: change the method of obtaining default ptp cycle
- can: mcba_usb_start(): add missing urb->transfer_dma initialization
Previous releases - regressions:
- set true network header for ECN decapsulation
- mlx5e: RX, avoid possible data corruption w/ relaxed ordering and
LRO
- phy: re-add check for PHY_BRCM_DIS_TXCRXC_NOENRGY on the BCM54811
PHY
- sctp: fix return value check in __sctp_rcv_asconf_lookup
Previous releases - always broken:
- bpf:
- more spectre corner case fixes, introduce a BPF nospec
instruction for mitigating Spectre v4
- fix OOB read when printing XDP link fdinfo
- sockmap: fix cleanup related races
- mac80211: fix enabling 4-address mode on a sta vif after assoc
- can:
- raw: raw_setsockopt(): fix raw_rcv panic for sock UAF
- j1939: j1939_session_deactivate(): clarify lifetime of session
object, avoid UAF
- fix number of identical memory leaks in USB drivers
- tipc:
- do not blindly write skb_shinfo frags when doing decryption
- fix sleeping in tipc accept routine"
* tag 'net-5.14-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
gve: Update MAINTAINERS list
can: esd_usb2: fix memory leak
can: ems_usb: fix memory leak
can: usb_8dev: fix memory leak
can: mcba_usb_start(): add missing urb->transfer_dma initialization
can: hi311x: fix a signedness bug in hi3110_cmd()
MAINTAINERS: add Yasushi SHOJI as reviewer for the Microchip CAN BUS Analyzer Tool driver
bpf: Fix leakage due to insufficient speculative store bypass mitigation
bpf: Introduce BPF nospec instruction for mitigating Spectre v4
sis900: Fix missing pci_disable_device() in probe and remove
net: let flow have same hash in two directions
nfc: nfcsim: fix use after free during module unload
tulip: windbond-840: Fix missing pci_disable_device() in probe and remove
sctp: fix return value check in __sctp_rcv_asconf_lookup
nfc: s3fwrn5: fix undefined parameter values in dev_err()
net/mlx5: Fix mlx5_vport_tbl_attr chain from u16 to u32
net/mlx5e: Fix nullptr in mlx5e_hairpin_get_mdev()
net/mlx5: Unload device upon firmware fatal error
net/mlx5e: Fix page allocation failure for ptp-RQ over SF
net/mlx5e: Fix page allocation failure for trap-RQ over SF
...
The event_trace_add_tracer() can fail. In this case, it leads to a crash
in start_creating with below call stack. Handle the error scenario
properly in trace_array_create_dir.
Call trace:
down_write+0x7c/0x204
start_creating.25017+0x6c/0x194
tracefs_create_file+0xc4/0x2b4
init_tracer_tracefs+0x5c/0x940
trace_array_create_dir+0x58/0xb4
trace_array_create+0x1bc/0x2b8
trace_array_get_by_name+0xdc/0x18c
Link: https://lkml.kernel.org/r/1627651386-21315-1-git-send-email-kamaagra@codeaurora.org
Cc: stable@vger.kernel.org
Fixes: 4114fbfd02 ("tracing: Enable creating new instance early boot")
Signed-off-by: Kamal Agrawal <kamaagra@codeaurora.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
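A sketch of the error handling shape (illustrative fragment; the exact
cleanup in trace.c may differ):

  ret = event_trace_add_tracer(tr->dir, tr);
  if (ret) {
          tracefs_remove(tr->dir);
          tr->dir = NULL;
          return ret;
  }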
The HW IRQ numbers generated by the PCI MSI layer can be quite large
on a pSeries machine when running under the IBM Hypervisor and they
appear as negative. Use '%lu' instead to show them correctly.
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
cycles_t has a different type across architectures: unsigned int,
unsigned long, or unsigned long long. Depending on architecture this
will generate this warning:
kernel/kcsan/debugfs.c: In function ‘microbenchmark’:
./include/linux/kern_levels.h:5:25: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘cycles_t’ {aka ‘long unsigned int’} [-Wformat=]
To avoid this, simply change the type of cycles to u64 in microbenchmark(),
since u64 is of type unsigned long long for all architectures.
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20210729142811.1309391-1-hca@linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
refcount_t type and corresponding API can protect refcounters from
accidental underflow and overflow and further use-after-free situations.
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Acked-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
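The generic pattern such conversions move to (placeholder struct and
names, real refcount_t API):

  struct obj {                            /* placeholder */
          refcount_t refcnt;
  };

  static void obj_get(struct obj *o)
  {
          refcount_inc(&o->refcnt);       /* warns on 0 and on saturation */
  }

  static void obj_put(struct obj *o)
  {
          if (refcount_dec_and_test(&o->refcnt))
                  kfree(o);
  }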
To avoid kernel build failure due to some missing .BTF-ids referenced
functions/types, the patch ([1]) tries to fill btf_id 0 for
these types.
In the bpf verifier, for the percpu variable and helper-returning-btf_id
cases, the verifier already emits a proper warning with something like
verbose(env, "Helper has invalid btf_id in R%d\n", regno);
verbose(env, "invalid return type %d of func %s#%d\n",
fn->ret_type, func_id_name(func_id), func_id);
But this is not the case for bpf_iter context arguments.
I hacked resolve_btfids to encode btf_id 0 for struct task_struct.
With `./test_progs -n 7/5`, I got,
0: (79) r2 = *(u64 *)(r1 +0)
func 'bpf_iter_task' arg0 has btf_id 29739 type STRUCT 'bpf_iter_meta'
; struct seq_file *seq = ctx->meta->seq;
1: (79) r6 = *(u64 *)(r2 +0)
; struct task_struct *task = ctx->task;
2: (79) r7 = *(u64 *)(r1 +8)
; if (task == (void *)0) {
3: (55) if r7 != 0x0 goto pc+11
...
; BPF_SEQ_PRINTF(seq, "%8d %8d\n", task->tgid, task->pid);
26: (61) r1 = *(u32 *)(r7 +1372)
Type '(anon)' is not a struct
Basically, verifier will return btf_id 0 for task_struct.
Later on, when the code tries to access task->tgid, the
verifier correctly complains the type is '(anon)' and it is
not a struct. Users still need to backtrace to find out
what is going on.
Let us catch the invalid btf_id 0 earlier
and provide a better message indicating that the btf_id is wrong.
The new error message looks like below:
R1 type=ctx expected=fp
; struct seq_file *seq = ctx->meta->seq;
0: (79) r2 = *(u64 *)(r1 +0)
func 'bpf_iter_task' arg0 has btf_id 29739 type STRUCT 'bpf_iter_meta'
; struct seq_file *seq = ctx->meta->seq;
1: (79) r6 = *(u64 *)(r2 +0)
; struct task_struct *task = ctx->task;
2: (79) r7 = *(u64 *)(r1 +8)
invalid btf_id for context argument offset 8
invalid bpf_context access off=8 size=8
[1] https://lore.kernel.org/bpf/20210727132532.2473636-1-hengqi.chen@gmail.com/
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210728183025.1461750-1-yhs@fb.com
In the error handling branch "if (WARN_ON(node == NUMA_NO_NODE))", the
previously allocated memory is not released. Doing this check before
allocating memory eliminates the memory leak.
tj: Note that the condition only occurs when the arch code is pretty broken
and the WARN_ON might as well be BUG_ON().
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
console_verbose() increases console loglevel to
CONSOLE_LOGLEVEL_MOTORMOUTH, which provides more information
to debug a panic/oops.
Unfortunately, at Arista we maintain some DUTs (Devices Under Test) that
are configured with a 9600 baud rate. While verbose console messages
have their value for post-analyzing crashes, on such a setup they:
- may prevent panic/oops messages being printed
- take too long to flush on console resulting in watchdog reboot
In all our setups we use kdump, which saves the dmesg buffer after a panic,
so in reality those extra console messages provide no additional value,
and instead add the risk of never reaching __crash_kexec().
Provide the printk.console_no_auto_verbose boot parameter, which allows
switching off verbose printk output on oops/panic/lockdep.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210727130635.675184-3-dima@arista.com
Daniel Borkmann says:
====================
pull-request: bpf 2021-07-29
The following pull-request contains BPF updates for your *net* tree.
We've added 9 non-merge commits during the last 14 day(s) which contain
a total of 20 files changed, 446 insertions(+), 138 deletions(-).
The main changes are:
1) Fix UBSAN out-of-bounds splat for showing XDP link fdinfo, from Lorenz Bauer.
2) Fix insufficient Spectre v4 mitigation in BPF runtime, from Daniel Borkmann,
Piotr Krysiuk and Benedict Schlueter.
3) Batch of fixes for BPF sockmap found under stress testing, from John Fastabend.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Spectre v4 gadgets make use of memory disambiguation, which is a set of
techniques that execute memory access instructions, that is, loads and
stores, out of program order; Intel's optimization manual, section 2.4.4.5:
A load instruction micro-op may depend on a preceding store. Many
microarchitectures block loads until all preceding store addresses are
known. The memory disambiguator predicts which loads will not depend on
any previous stores. When the disambiguator predicts that a load does
not have such a dependency, the load takes its data from the L1 data
cache. Eventually, the prediction is verified. If an actual conflict is
detected, the load and all succeeding instructions are re-executed.
af86ca4e30 ("bpf: Prevent memory disambiguation attack") tried to mitigate
this attack by sanitizing the memory locations through preemptive "fast"
(low latency) stores of zero prior to the actual "slow" (high latency) store
of a pointer value such that upon dependency misprediction the CPU then
speculatively executes the load of the pointer value and retrieves the zero
value instead of the attacker controlled scalar value previously stored at
that location, meaning, subsequent access in the speculative domain is then
redirected to the "zero page".
The sanitized preemptive store of zero prior to the actual "slow" store is
done through a simple ST instruction based on r10 (frame pointer) with
relative offset to the stack location that the verifier has been tracking
on the original used register for STX, which does not have to be r10. Thus,
there are no memory dependencies for this store, since it's only using r10
and immediate constant of zero; hence af86ca4e30 /assumed/ a low latency
operation.
However, a recent attack demonstrated that this mitigation is not sufficient
since the preemptive store of zero could also be turned into a "slow" store
and is thus bypassed as well:
[...]
// r2 = oob address (e.g. scalar)
// r7 = pointer to map value
31: (7b) *(u64 *)(r10 -16) = r2
// r9 will remain "fast" register, r10 will become "slow" register below
32: (bf) r9 = r10
// JIT maps BPF reg to x86 reg:
// r9 -> r15 (callee saved)
// r10 -> rbp
// train store forward prediction to break dependency link between both r9
// and r10 by evicting them from the predictor's LRU table.
33: (61) r0 = *(u32 *)(r7 +24576)
34: (63) *(u32 *)(r7 +29696) = r0
35: (61) r0 = *(u32 *)(r7 +24580)
36: (63) *(u32 *)(r7 +29700) = r0
37: (61) r0 = *(u32 *)(r7 +24584)
38: (63) *(u32 *)(r7 +29704) = r0
39: (61) r0 = *(u32 *)(r7 +24588)
40: (63) *(u32 *)(r7 +29708) = r0
[...]
543: (61) r0 = *(u32 *)(r7 +25596)
544: (63) *(u32 *)(r7 +30716) = r0
// prepare call to bpf_ringbuf_output() helper. the latter will cause rbp
// to spill to stack memory while r13/r14/r15 (all callee saved regs) remain
// in hardware registers. rbp becomes slow due to push/pop latency. below is
// disasm of bpf_ringbuf_output() helper for better visual context:
//
// ffffffff8117ee20: 41 54 push r12
// ffffffff8117ee22: 55 push rbp
// ffffffff8117ee23: 53 push rbx
// ffffffff8117ee24: 48 f7 c1 fc ff ff ff test rcx,0xfffffffffffffffc
// ffffffff8117ee2b: 0f 85 af 00 00 00 jne ffffffff8117eee0 <-- jump taken
// [...]
// ffffffff8117eee0: 49 c7 c4 ea ff ff ff mov r12,0xffffffffffffffea
// ffffffff8117eee7: 5b pop rbx
// ffffffff8117eee8: 5d pop rbp
// ffffffff8117eee9: 4c 89 e0 mov rax,r12
// ffffffff8117eeec: 41 5c pop r12
// ffffffff8117eeee: c3 ret
545: (18) r1 = map[id:4]
547: (bf) r2 = r7
548: (b7) r3 = 0
549: (b7) r4 = 4
550: (85) call bpf_ringbuf_output#194288
// instruction 551 inserted by verifier \
551: (7a) *(u64 *)(r10 -16) = 0 | /both/ are now slow stores here
// storing map value pointer r7 at fp-16 | since value of r10 is "slow".
552: (7b) *(u64 *)(r10 -16) = r7 /
// following "fast" read to the same memory location, but due to dependency
// misprediction it will speculatively execute before insn 551/552 completes.
553: (79) r2 = *(u64 *)(r9 -16)
// in speculative domain contains attacker controlled r2. in non-speculative
// domain this contains r7, and thus accesses r7 +0 below.
554: (71) r3 = *(u8 *)(r2 +0)
// leak r3
As can be seen, the current speculative store bypass mitigation which the
verifier inserts at line 551 is insufficient since /both/, the write of
the zero sanitation as well as the map value pointer are a high latency
instruction due to prior memory access via push/pop of r10 (rbp) in contrast
to the low latency read in line 553 as r9 (r15) which stays in hardware
registers. Thus, architecturally, fp-16 is r7, however, microarchitecturally,
fp-16 can still be r2.
Initial thoughts to address this issue was to track spilled pointer loads
from stack and enforce their load via LDX through r10 as well so that /both/
the preemptive store of zero /as well as/ the load use the /same/ register
such that a dependency is created between the store and load. However, this
option is not sufficient either since it can be bypassed as well under
speculation. An updated attack with pointer spill/fills now _all_ based on
r10 would look as follows:
[...]
// r2 = oob address (e.g. scalar)
// r7 = pointer to map value
[...]
// longer store forward prediction training sequence than before.
2062: (61) r0 = *(u32 *)(r7 +25588)
2063: (63) *(u32 *)(r7 +30708) = r0
2064: (61) r0 = *(u32 *)(r7 +25592)
2065: (63) *(u32 *)(r7 +30712) = r0
2066: (61) r0 = *(u32 *)(r7 +25596)
2067: (63) *(u32 *)(r7 +30716) = r0
// store the speculative load address (scalar) this time after the store
// forward prediction training.
2068: (7b) *(u64 *)(r10 -16) = r2
// preoccupy the CPU store port by running sequence of dummy stores.
2069: (63) *(u32 *)(r7 +29696) = r0
2070: (63) *(u32 *)(r7 +29700) = r0
2071: (63) *(u32 *)(r7 +29704) = r0
2072: (63) *(u32 *)(r7 +29708) = r0
2073: (63) *(u32 *)(r7 +29712) = r0
2074: (63) *(u32 *)(r7 +29716) = r0
2075: (63) *(u32 *)(r7 +29720) = r0
2076: (63) *(u32 *)(r7 +29724) = r0
2077: (63) *(u32 *)(r7 +29728) = r0
2078: (63) *(u32 *)(r7 +29732) = r0
2079: (63) *(u32 *)(r7 +29736) = r0
2080: (63) *(u32 *)(r7 +29740) = r0
2081: (63) *(u32 *)(r7 +29744) = r0
2082: (63) *(u32 *)(r7 +29748) = r0
2083: (63) *(u32 *)(r7 +29752) = r0
2084: (63) *(u32 *)(r7 +29756) = r0
2085: (63) *(u32 *)(r7 +29760) = r0
2086: (63) *(u32 *)(r7 +29764) = r0
2087: (63) *(u32 *)(r7 +29768) = r0
2088: (63) *(u32 *)(r7 +29772) = r0
2089: (63) *(u32 *)(r7 +29776) = r0
2090: (63) *(u32 *)(r7 +29780) = r0
2091: (63) *(u32 *)(r7 +29784) = r0
2092: (63) *(u32 *)(r7 +29788) = r0
2093: (63) *(u32 *)(r7 +29792) = r0
2094: (63) *(u32 *)(r7 +29796) = r0
2095: (63) *(u32 *)(r7 +29800) = r0
2096: (63) *(u32 *)(r7 +29804) = r0
2097: (63) *(u32 *)(r7 +29808) = r0
2098: (63) *(u32 *)(r7 +29812) = r0
// overwrite scalar with dummy pointer; same as before, also including the
// sanitation store with 0 from the current mitigation by the verifier.
2099: (7a) *(u64 *)(r10 -16) = 0 | /both/ are now slow stores here
2100: (7b) *(u64 *)(r10 -16) = r7 | since store unit is still busy.
// load from stack intended to bypass stores.
2101: (79) r2 = *(u64 *)(r10 -16)
2102: (71) r3 = *(u8 *)(r2 +0)
// leak r3
[...]
Looking at the CPU microarchitecture, the scheduler might issue loads (such
as seen in line 2101) before stores (line 2099,2100) because the load execution
units become available while the store execution unit is still busy with the
sequence of dummy stores (line 2069-2098). And so the load may use the prior
stored scalar from r2 at address r10 -16 for speculation. The updated attack
may work less reliably on CPU microarchitectures where loads and stores share
execution resources.
This concludes that the sanitizing with zero stores from af86ca4e30 ("bpf:
Prevent memory disambiguation attack") is insufficient. Moreover, the detection
of stack reuse from af86ca4e30 where previously data (STACK_MISC) has been
written to a given stack slot where a pointer value is now to be stored does
not have sufficient coverage as precondition for the mitigation either; for
several reasons outlined as follows:
1) Stack content from prior program runs could still be preserved and is
therefore not "random", best example is to split a speculative store
bypass attack between tail calls, program A would prepare and store the
oob address at a given stack slot and then tail call into program B which
does the "slow" store of a pointer to the stack with subsequent "fast"
read. From program B PoV such stack slot type is STACK_INVALID, and
therefore also must be subject to mitigation.
2) The STACK_SPILL must not be coupled to register_is_const(&stack->spilled_ptr)
condition, for example, the previous content of that memory location could
also be a pointer to map or map value. Without the fix, a speculative
store bypass is not mitigated in such precondition and can then lead to
a type confusion in the speculative domain leaking kernel memory near
these pointer types.
While brainstorming on various alternative mitigation possibilities, we also
stumbled upon a retrospective from Chrome developers [0]:
[...] For variant 4, we implemented a mitigation to zero the unused memory
of the heap prior to allocation, which cost about 1% when done concurrently
and 4% for scavenging. Variant 4 defeats everything we could think of. We
explored more mitigations for variant 4 but the threat proved to be more
pervasive and dangerous than we anticipated. For example, stack slots used
by the register allocator in the optimizing compiler could be subject to
type confusion, leading to pointer crafting. Mitigating type confusion for
stack slots alone would have required a complete redesign of the backend of
the optimizing compiler, perhaps man years of work, without a guarantee of
completeness. [...]
From BPF side, the problem space is reduced, however, options are rather
limited. One idea that has been explored was to xor-obfuscate pointer spills
to the BPF stack:
[...]
// preoccupy the CPU store port by running sequence of dummy stores.
[...]
2106: (63) *(u32 *)(r7 +29796) = r0
2107: (63) *(u32 *)(r7 +29800) = r0
2108: (63) *(u32 *)(r7 +29804) = r0
2109: (63) *(u32 *)(r7 +29808) = r0
2110: (63) *(u32 *)(r7 +29812) = r0
// overwrite scalar with dummy pointer; xored with random 'secret' value
// of 943576462 before store ...
2111: (b4) w11 = 943576462
2112: (af) r11 ^= r7
2113: (7b) *(u64 *)(r10 -16) = r11
2114: (79) r11 = *(u64 *)(r10 -16)
2115: (b4) w2 = 943576462
2116: (af) r2 ^= r11
// ... and restored with the same 'secret' value with the help of AX reg.
2117: (71) r3 = *(u8 *)(r2 +0)
[...]
While the above would not prevent speculation, it would make data leakage
infeasible by directing it to random locations. In order to be effective
and prevent type confusion under speculation, such random secret would have
to be regenerated for each store. The additional complexity involved for a
tracking mechanism that prevents jumps such that restoring spilled pointers
would not get corrupted is not worth the gain for unprivileged. Hence, the
fix in here eventually opted for emitting a non-public BPF_ST | BPF_NOSPEC
instruction which the x86 JIT translates into a lfence opcode. Inserting the
latter in between the store and load instruction is one of the mitigations
options [1]. The x86 instruction manual notes:
[...] An LFENCE that follows an instruction that stores to memory might
complete before the data being stored have become globally visible. [...]
The latter meaning that the preceding store instruction finished execution
and the store is at minimum guaranteed to be in the CPU's store queue, but
it's not guaranteed to be in that CPU's L1 cache at that point (globally
visible). The latter would only be guaranteed via sfence. So the load which
is guaranteed to execute after the lfence for that local CPU would have to
rely on store-to-load forwarding. [2], in section 2.3 on store buffers says:
[...] For every store operation that is added to the ROB, an entry is
allocated in the store buffer. This entry requires both the virtual and
physical address of the target. Only if there is no free entry in the store
buffer, the frontend stalls until there is an empty slot available in the
store buffer again. Otherwise, the CPU can immediately continue adding
subsequent instructions to the ROB and execute them out of order. On Intel
CPUs, the store buffer has up to 56 entries. [...]
One small upside on the fix is that it lifts constraints from af86ca4e30
where the sanitize_stack_off relative to r10 must be the same when coming
from different paths. The BPF_ST | BPF_NOSPEC gets emitted after a BPF_STX
or BPF_ST instruction. This happens either when we store a pointer or data
value to the BPF stack for the first time, or upon later pointer spills.
The former needs to be enforced since otherwise stale stack data could be
leaked under speculation as outlined earlier. For non-x86 JITs the BPF_ST |
BPF_NOSPEC mapping is currently optimized away, but others could emit a
speculation barrier as well if necessary. For real-world unprivileged
programs e.g. generated by LLVM, pointer spill/fill is only generated upon
register pressure and LLVM only tries to do that for pointers which are not
used often. The program main impact will be the initial BPF_ST | BPF_NOSPEC
sanitation for the STACK_INVALID case when the first write to a stack slot
occurs e.g. upon map lookup. In future we might refine ways to mitigate
the latter cost.
[0] https://arxiv.org/pdf/1902.05178.pdf
[1] https://msrc-blog.microsoft.com/2018/05/21/analysis-and-mitigation-of-speculative-store-bypass-cve-2018-3639/
[2] https://arxiv.org/pdf/1905.05725.pdf
Fixes: af86ca4e30 ("bpf: Prevent memory disambiguation attack")
Fixes: f7cf25b202 ("bpf: track spill/fill of constants")
Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
In case of JITs, each of the JIT backends compiles the BPF nospec instruction
/either/ to a machine instruction which emits a speculation barrier /or/ to
/no/ machine instruction in case the underlying architecture is not affected
by Speculative Store Bypass or has different mitigations in place already.
This covers both x86 and (implicitly) arm64: In case of x86, we use 'lfence'
instruction for mitigation. In case of arm64, we rely on the firmware mitigation
as controlled via the ssbd kernel parameter. Whenever the mitigation is enabled,
it works for all of the kernel code with no need to provide any additional
instructions here (hence only comment in arm64 JIT). Other archs can follow
as needed. The BPF nospec instruction is specifically targeting Spectre v4
since i) we don't use a serialization barrier for the Spectre v1 case, and
ii) mitigation instructions for v1 and v4 might be different on some archs.
The BPF nospec is required for a future commit, where the BPF verifier does
annotate intermediate BPF programs with speculation barriers.
Co-developed-by: Piotr Krysiuk <piotras@gmail.com>
Co-developed-by: Benedict Schlueter <benedict.schlueter@rub.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Piotr Krysiuk <piotras@gmail.com>
Signed-off-by: Benedict Schlueter <benedict.schlueter@rub.de>
Acked-by: Alexei Starovoitov <ast@kernel.org>
0fa294fb19 ("cgroup: Replace cgroup_rstat_mutex with a spinlock") added
cgroup_rstat_flush_irqsafe() allowing flushing to happen from the irq
context. However, rstat paths use u64_stats_sync to synchronize access to
64bit stat counters on 32bit machines. u64_stats_sync is implemented using
seq_lock and trying to read from an irq context can lead to A-A deadlock if
the irq happens to interrupt the stat update.
Fix it by using the irqsafe variants - u64_stats_update_begin_irqsave() and
u64_stats_update_end_irqrestore() - in the update paths. Note that none of
this matters on 64bit machines. All these are just for 32bit SMP setups.
Note that the interface was introduced way back, its first and currently
only use was recently added by 2d146aa3aa ("mm: memcontrol: switch to
rstat"). Stable tagging targets this commit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Rik van Riel <riel@surriel.com>
Fixes: 2d146aa3aa ("mm: memcontrol: switch to rstat")
Cc: stable@vger.kernel.org # v5.13+
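An illustrative update-side pattern (placeholder struct and field names;
the u64_stats_* API is the real one):

  struct cg_stat {
          struct u64_stats_sync sync;
          u64 bytes;
  };

  static void cg_stat_add(struct cg_stat *s, u64 delta)
  {
          unsigned long flags;

          /* irqsafe variants allow a flush from irq context on 32-bit SMP */
          flags = u64_stats_update_begin_irqsave(&s->sync);
          s->bytes += delta;
          u64_stats_update_end_irqrestore(&s->sync, flags);
  }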
Current max cgroup storage value size is 4k (PAGE_SIZE). The other local
storages accept up to 64k (BPF_LOCAL_STORAGE_MAX_VALUE_SIZE). Let's align
max cgroup value size with the other storages.
For percpu, the max is 32k (PCPU_MIN_UNIT_SIZE) because percpu
allocator is not happy about larger values.
The netcnt test is extended to exercise those maximum values
(the non-percpu max size is close to, but not exactly, the real max).
v4:
* remove inner union (Andrii Nakryiko)
* keep net_cnt on the stack (Andrii Nakryiko)
v3:
* refine SIZEOF_BPF_LOCAL_STORAGE_ELEM comment (Yonghong Song)
* anonymous struct in percpu_net_cnt & net_cnt (Yonghong Song)
* reorder free (Yonghong Song)
v2:
* cap max_value_size instead of BUILD_BUG_ON (Martin KaFai Lau)
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20210727222335.4029096-1-sdf@google.com
Pull cgroup fix from Tejun Heo:
"Fix leak of filesystem context root which is triggered by LTP.
Not too likely to be a problem in non-testing environments"
* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup1: fix leaked context root causing sporadic NULL deref in LTP
Pull workqueue fix from Tejun Heo:
"Fix a use-after-free in allocation failure handling path"
* 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: fix UAF in pwq_unbound_release_workfn()
syzbot reported KCSAN data races vs. timer_base::timer_running being set to
NULL without holding base::lock in expire_timers().
This looks innocent and most reads are clearly not problematic, but
Frederic identified an issue which is:
int data = 0;
void timer_func(struct timer_list *t)
{
data = 1;
}
CPU 0 CPU 1
------------------------------ --------------------------
base = lock_timer_base(timer, &flags); raw_spin_unlock(&base->lock);
if (base->running_timer != timer) call_timer_fn(timer, fn, baseclk);
ret = detach_if_pending(timer, base, true); base->running_timer = NULL;
raw_spin_unlock_irqrestore(&base->lock, flags); raw_spin_lock(&base->lock);
x = data;
If the timer has previously executed on CPU 1, then CPU 0 can observe
base->running_timer == NULL and return, assuming the timer has completed,
but that is not guaranteed on all architectures. The comment for
del_timer_sync() makes that guarantee. Moving the assignment under
base->lock prevents this.
For a non-RT kernel it is, performance-wise, completely irrelevant whether the
store happens before or after taking the lock. For an RT kernel, moving the
store under the lock requires an extra unlock/lock pair in the case that
there is a waiter for the timer, but that's not the end of the world.
Reported-by: syzbot+aa7c2385d46c5eba0b89@syzkaller.appspotmail.com
Reported-by: syzbot+abea4558531bae1ba9fe@syzkaller.appspotmail.com
Fixes: 030dcdd197 ("timers: Prepare support for PREEMPT_RT")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Link: https://lore.kernel.org/r/87lfea7gw8.fsf@nanos.tec.linutronix.de
Cc: stable@vger.kernel.org
When scftorture finds an error in the module parameters controlling
the relative frequencies of smp_call_function*() variants, it takes an
early exit. So early that it has not allocated memory to track the
kthreads running the test, which results in a segfault. This commit
therefore checks for the existence of the memory before attempting
to stop the kthreads that would otherwise have been recorded in that
non-existent memory.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds the single_weight_rpc module parameter, which causes the
IPI handler to awaken the IPI sender. In many scheduler configurations,
this will result in an IPI back to the sender that is likely to be
received at a time when the sender CPU is idle. The intent is to stress
IPI reception during CPU busy-to-idle transitions.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, the lock_is_read_held variable is bool, so that a reader sets
it to true just after lock acquisition and then to false just before
lock release. This works in a rough statistical sense, but can result
in false negatives just after one of a pair of concurrent readers has
released the lock. This approach does have low overhead, but at the
expense of the setting to true potentially never leaving the reader's
store buffer, thus resulting in an unconditional false negative.
This commit therefore converts this variable to atomic_t and makes
the reader use atomic_inc() just after acquisition and atomic_dec()
just before release. This does increase overhead, but this increase is
negligible compared to the 10-microsecond lock hold time.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
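An illustrative reader-side pattern (the lock is a placeholder; locktorture's
actual plumbing differs):

  static atomic_t lock_is_read_held;

  static void reader_section(rwlock_t *lock)
  {
          read_lock(lock);
          atomic_inc(&lock_is_read_held);         /* was: = true  */
          /* read-side critical section */
          atomic_dec(&lock_is_read_held);         /* was: = false */
          read_unlock(lock);
  }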
The lock_stress_stats structure's ->n_lock_fail and ->n_lock_acquired
fields are incremented and sampled locklessly using plain C-language
statements, which KCSAN objects to. This commit therefore marks the
statistics gathering with data_race() to flag the intent. While in
the area, this commit also reduces the number of accesses to the
->n_lock_acquired field, thus eliminating some possible check/use
confusion.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The rcuscale console output claims N grace periods, numbered from zero
to N, which means that there were really N+1 grace periods. The root
cause of this bug is that rcu_scale_writer() stores the number of the
last grace period (numbered from zero) into writer_n_durations[me]
instead of the number of grace periods. This commit therefore assigns
the actual number of grace periods to writer_n_durations[me], and also
makes the corresponding adjustment to the loop outputting per-grace-period
measurements.
Sample of old console output:
rcu-scale: writer 0 gps: 133
......
rcu-scale: 0 writer-duration: 0 44003961
rcu-scale: 0 writer-duration: 1 32003582
......
rcu-scale: 0 writer-duration: 132 28004391
rcu-scale: 0 writer-duration: 133 27996410
Sample of new console output:
rcu-scale: writer 0 gps: 134
......
rcu-scale: 0 writer-duration: 0 44003961
rcu-scale: 0 writer-duration: 1 32003582
......
rcu-scale: 0 writer-duration: 132 28004391
rcu-scale: 0 writer-duration: 133 27996410
Signed-off-by: Jiangong.Han <jiangong.han@windriver.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Currently, rcu_torture_stall() does a one-jiffy timed wait when
stall_cpu_block is set. This works, but emits a pointless splat in
CONFIG_PREEMPT=y kernels. This commit avoids this splat by instead
invoking preempt_schedule() in CONFIG_PREEMPT=y kernels.
This uses an admittedly ugly #ifdef, but abstracted approaches just
looked worse. A prettier approach would provide a preempt_schedule()
definition with a WARN_ON() for CONFIG_PREEMPT=n kernels, but this seems
quite silly.
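A minimal sketch of the #ifdef in question (the exact config symbol and the
surrounding rcutorture code may differ from this illustration):
	if (stall_cpu_block) {
#ifdef CONFIG_PREEMPT
		preempt_schedule();			/* no splat in preemptible kernels */
#else
		schedule_timeout_uninterruptible(1);	/* one-jiffy wait as before */
#endif
	}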
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit adds a "clock" type to refscale, which checks the performance
of ktime_get_real_fast_ns(). Use the "clocksource=" kernel boot parameter
to select the underlying clock source.
[ paulmck: Work around compiler false positive per kernel test robot. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Remove the redundant prefix "cmd_" from the names of members in struct
kdbtab_t for better readability.
Suggested-by: Doug Anderson <dianders@chromium.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20210712134620.276667-5-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Switch to using a linked list instead of a dynamic array, which makes
allocation of kdb macros and traversal of the kdb macro commands list
simpler.
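For illustration, a minimal sketch of such list-based storage (the struct
layout and names here are assumptions for the example):
#include <linux/list.h>
#include <linux/slab.h>
struct kdb_macro_statement {
	char *statement;
	struct list_head list_node;
};
/* Appending one command to a macro becomes a simple list operation. */
static void kdb_macro_append(struct list_head *statements, char *cmd)
{
	struct kdb_macro_statement *kms = kzalloc(sizeof(*kms), GFP_KERNEL);

	if (!kms)
		return;
	kms->statement = cmd;
	list_add_tail(&kms->list_node, statements);
}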
Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20210712134620.276667-4-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Commit e4f291b3f7 ("kdb: Simplify kdb commands registration")
allowed registration of pre-allocated kdb commands with pointer to
struct kdbtab_t. Let's switch other users as well to register pre-
allocated kdb commands via:
- Changing the prototype of kdb_register() to pass a pointer to struct
  kdbtab_t instead.
- Embedding the kdbtab_t structure in kdb_macro_t rather than passing
  individual params.
With these changes kdb_register_flags() becomes redundant and hence is
removed. Also, since we have switched all users to register
pre-allocated commands, the "is_dynamic" flag in struct kdbtab_t becomes
redundant and hence is removed as well.
Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20210712134620.276667-3-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Rename struct defcmd_set to struct kdb_macro as that sounds more
appropriate given its purpose.
Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20210712134620.276667-2-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
Currently the only user of the debug heap is kdbnearsym(), which can be
modified to instead use a statically allocated buffer for the symbol
name, in line with its current usage. So do that and hence remove the
custom debug heap allocator.
Note that this change puts a restriction on kdbnearsym() callers to
use the shared namebuf carefully, such that a caller should consume the
returned symbol immediately, prior to another call to fetch a different
symbol.
Also, this change uses the standard KSYM_NAME_LEN macro for namebuf
allocation instead of the local variable knt1_size, which should avoid
any conflicts caused by changes to the KSYM_NAME_LEN macro value.
This change has been tested using kgdbtest on arm64 which doesn't show
any regressions.
Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20210714055620.369915-1-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
In cpuset_hotplug_workfn(), the detection of whether the cpu list
has been changed is done by comparing the effective cpus of the top
cpuset with the cpu_active_mask. However, in the rare case that all
the CPUs in subparts_cpus are offlined, the detection fails
and the partition states are not updated correctly. Fix it by forcing
the cpus_updated flag to true in this particular case.
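A minimal sketch of the forced update (the field names follow the cpuset
code, but this fragment is illustrative, not the exact fix):
	if (!cpus_updated && top_cpuset.nr_subparts_cpus &&
	    !cpumask_intersects(top_cpuset.subparts_cpus, cpu_active_mask))
		cpus_updated = true;	/* all partitioned CPUs went offline */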
Fixes: 4b842da276 ("cpuset: Make CPU hotplug work with partition")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Use more descriptive variable names for update_prstate(), remove
unnecessary code and fix some typos. There is no functional change.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Syslog's SYSLOG_ACTION_READ is supposed to block until the next
syslog record can be read, and then it should read that record.
However, because @syslog_lock is not held between waking up and
reading the record, another reader could read the record first,
thus causing SYSLOG_ACTION_READ to return with a value of 0, never
having read _anything_.
By holding @syslog_lock between waking up and reading, it can be
guaranteed that SYSLOG_ACTION_READ blocks until it successfully
reads a syslog record (or a real error occurs).
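A minimal sketch of the "hold @syslog_lock across wake-up and read" idea
(the real syslog read path differs in detail; this is only an illustration
using names mentioned in this series):
	mutex_lock(&syslog_lock);
	while (!prb_read_valid(prb, syslog_seq, NULL)) {
		mutex_unlock(&syslog_lock);
		error = wait_event_interruptible(log_wait,
				prb_read_valid(prb, syslog_seq, NULL));
		if (error)
			return error;
		mutex_lock(&syslog_lock);
	}
	/* Still holding @syslog_lock: read the record before anyone else can. */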
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210715193359.25946-7-john.ogness@linutronix.de
@syslog_lock was a raw_spin_lock to simplify the transition of
removing @logbuf_lock and the safe buffers. With that transition
complete, and since all uses of @syslog_lock are within sleepable
contexts, @syslog_lock can become a mutex.
Note that until now register_console() would disable interrupts
using irqsave, which implies that it may be called with interrupts
disabled. And indeed, there is one possible call chain on parisc
where this happens:
handle_interruption(code=1) /* High-priority machine check (HPMC) */
pdc_console_restart()
pdc_console_init_force()
register_console()
However, register_console() calls console_lock(), which might sleep.
So it has never been allowed to call register_console() from an
atomic context and the above call chain is a bug.
Note that the removal of read_syslog_seq_irq() slightly changes the
behavior of SYSLOG_ACTION_READ by testing against a possibly
outdated @seq value. However, the value of @seq could have changed
after the test, so it is not a new window. A follow-up commit closes
this window.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210715193359.25946-6-john.ogness@linutronix.de
All NMI contexts are handled the same as the safe context: store the
message and defer printing. There is no need to have special NMI
context tracking for this. Using in_nmi() is enough.
There are several parts of the kernel that are manually calling into
the printk NMI context tracking in order to cause general printk
deferred printing:
arch/arm/kernel/smp.c
arch/powerpc/kexec/crash.c
kernel/trace/trace.c
For arm/kernel/smp.c and powerpc/kexec/crash.c, provide a new
function pair printk_deferred_enter/exit that explicitly achieves the
same objective.
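A minimal sketch of how such a call site looks after the change (illustrative;
the real code in those files differs):
	printk_deferred_enter();	/* defer console printing in this section */
	/* ... emit diagnostics with printk(); records are stored, not printed ... */
	printk_deferred_exit();		/* allow deferred console output again */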
For ftrace, remove the printk context manipulation completely. It was
added in commit 03fc7f9c99 ("printk/nmi: Prevent deadlock when
accessing the main log buffer in NMI"). The purpose was to enforce
storing messages directly into the ring buffer even in NMI context.
It really should have only modified the behavior in NMI context.
There is no need for a special behavior any longer. All messages are
always stored directly now. The console deferring is handled
transparently in vprintk().
Signed-off-by: John Ogness <john.ogness@linutronix.de>
[pmladek@suse.com: Remove special handling in ftrace.c completely.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210715193359.25946-5-john.ogness@linutronix.de
With @logbuf_lock removed, the high level printk functions for
storing messages are lockless. Messages can be stored from any
context, so there is no need for the NMI and safe buffers anymore.
Remove the NMI and safe buffers.
Although the safe buffers are removed, the NMI and safe context
tracking is still in place. In these contexts, store the message
immediately but still use irq_work to defer the console printing.
Since printk recursion tracking is in place, safe context tracking
for most of printk is not needed. Remove it. Only safe context
tracking relating to the console and console_owner locks is left
in place. This is because the console and console_owner locks are
needed for the actual printing.
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210715193359.25946-4-john.ogness@linutronix.de
Currently the printk safe buffers provide a form of recursion
protection by redirecting to the safe buffers whenever printk() is
recursively called.
In preparation for removal of the safe buffers, provide an alternate
explicit recursion protection. Recursion is limited to 3 levels
per-CPU and per-context.
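A minimal sketch of a per-CPU recursion limiter of this kind (this ignores
the per-context split and the interrupt handling that the real printk code
needs; names are illustrative):
#include <linux/percpu.h>
#define PRINTK_MAX_RECURSION 3
static DEFINE_PER_CPU(u8, printk_recursion_count);
static bool printk_enter_sketch(void)
{
	u8 *count = this_cpu_ptr(&printk_recursion_count);

	if (*count >= PRINTK_MAX_RECURSION)
		return false;		/* too deep: drop the message instead */
	(*count)++;
	return true;
}
static void printk_exit_sketch(void)
{
	(*this_cpu_ptr(&printk_recursion_count))--;
}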
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20210715193359.25946-3-john.ogness@linutronix.de
Commit 3370155737 ("printk: Userspace format indexing support") turned
printk() into a macro, but left the kerneldoc comment for it with the (now)
_printk() function, resulting in this docs-build warning:
kernel/printk/printk.c:1: warning: 'printk' not found
Move the kerneldoc comment back next to the (now) macro it's meant to
describe and have the docs build find it there.
Fixes: 3370155737 ("printk: Userspace format indexing support")
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/87o8aqt7qn.fsf@meer.lwn.net
The commit 3370155737 ("printk: Userspace format indexing support")
triggered the following build failure:
kernel/printk/index.c:140:6: warning: no previous prototype for ‘pi_create_file’ [-Wmissing-prototypes]
void pi_create_file(struct module *mod)
^~~~~~~~~~~~~~
kernel/printk/index.c:146:6: warning: no previous prototype for ‘pi_remove_file’ [-Wmissing-prototypes]
void pi_remove_file(struct module *mod)
^~~~~~~~~~~~~~
Fixes: 3370155737 ("printk: Userspace format indexing support")
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Chris Down <chris@chrisdown.name>
[pmladek@suse.com: Let the compiler decide about inlining.]
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/lkml/YPql089IwSpudw%2F1@alley/
gcc doesn't care, but clang quite reasonably pointed out that the recent
commit e9ba16e68c ("smpboot: Mark idle_init() as __always_inlined to
work around aggressive compiler un-inlining") did some really odd
things:
kernel/smpboot.c:50:20: warning: duplicate 'inline' declaration specifier [-Wduplicate-decl-specifier]
static inline void __always_inline idle_init(unsigned int cpu)
^
which not only has that duplicate inlining specifier, but the new
__always_inline was put in the wrong place of the function definition.
We put the storage class specifiers (i.e. things like "static" and
"extern") first, and the type information after that. And while the
compiler may not care, we put the inline specifier before the types.
So it should be just
static __always_inline void idle_init(unsigned int cpu)
instead.
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Plug a race between rearm and process tick in the posix CPU timers code
- Make the optimization to avoid recalculation of the next timer interrupt
work correctly when there are no timers pending.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmD9LLUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocqMD/9ciaE5J4gbsamavrPfMaQNo3c3mom1
3xoskHFKiz9fnYL/f3yNQ2OjXl+lxV0VLSwjJnc1TfhQM5g+X7uI4P/sCZzHtEmP
F3jTCUX99DSlTEB4Fc3ssr/hZPQab9nXearek9eAEpLKQDe1U1q2p314Mr4dA+Kj
awJUuOlkLt+NwRCajuZK6Ur0D1Zte56C/Nl3PeppgJY1U5tLCKTE8ZmRbdLo11zM
BEMFL95on1a/wKzTkuYqhfSQn4JWZo4lLP9jVHueF2nbPbOGdC9VwvegRMtRKG/k
0I8n7mcXvzi9pDP08o96rzjdZ9KBpG2hLkL0PgCDjrXwlOBH7tDkOBajJ+x5AgAf
71Zi/XOfjaHbkm37uLTTeerG2pKilds7ukjnQ3VS3t/XfPw6Nr6r9ig5AyBxpnjk
8Shw8kfEOApLEnmYwKAXHCewjxppp9pDsR4J7sYIfiYoDZRfF56Xyr/pKa8cpZge
9ByK8Pul4J9RhTOgIgJNMBb78gmFxsKi5CMt6kj8O89omc4pIUVpEK0z3kWmbjrd
m1mtcO2kS/ry+7TgAjkxHcrcm/QX+y/2SrvvLoqVLAJQfIrffsiGGehaIXS8rAKg
jlCd3s0NET6yULyQwI7qUdS+ZgtYSQKfIkU38VVcXaplUOABWcCwNXRh0rk6+gBs
+HK8UxQnAYRGrw==
=Q6oh
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
"A small set of timer related fixes:
- Plug a race between rearm and process tick in the posix CPU timers
code
- Make the optimization to avoid recalculation of the next timer
interrupt work correctly when there are no timers pending"
* tag 'timers-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix get_next_timer_interrupt() with no timers pending
posix-cpu-timers: Fix rearm racing against process tick
A single update for the boot code to prevent aggressive un-inlining which
causes a section mismatch.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmD9KsYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYof1HEACfpcjQbezuNjuoSYQNpAz9jz/W63iH
FXTBAiC8IVzloRgNjlbVAVGgJlYFgbgkke2ulNwsRIwwVhhyDYDPio88U498LdB5
bQdw/jsylHY+5ipvGCxf2uvMI8xIk6Gj3bbgIsVxBqpCTaJOsFE8E69UUVZjxPMB
kpHQ2GzCLe0FxXC7reFoIIPhY9U1cYXNHUso4w2um6enr10uEwMouXOGkhNAboOl
bmi97wWd/wgfSJNUGrsGpK+f4yTAhfdXvPv79t0Dzbg7KqTxFdU2dmrHQYinuw96
YTZLW32o5JJzW/AjPcDTTixXStbwtrAS6GPqoAL65n2rvwit64SYfhPjJAie3+Ly
eHzNUyZX6+zczfe7zj8nXiutMc/KI5aRcwf1S2PCMefi1IoJuVOdbe9Pgj/iFtCt
GRGSZ+gob7R4Onvs2Yaw8iS32KjetmGTN7V9BuPTIgPgZ9UjQvB9pnDWcICJnT9U
6HLvKBhVBfDVOSe2YDNEIMFVJdacb0EKegyLXU0S/E0CUPitXM9TivazRrSprEiJ
nZxH5sM+9rFtpHWA6lxKYmRP5e0BoMCZ6kerHmOMNa6dW4ha6D3J1KUJPYgJ4YH0
wQ9dF4ppHgBGsBtkZ7YVwXHkNnKJnr8GN8Xcf6CUFh1Yu61QMDCDyGR8cEVFc8Ai
TtZDCYJLBBA5bA==
=t8VJ
-----END PGP SIGNATURE-----
Merge tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fix from Thomas Gleixner:
"A single update for the boot code to prevent aggressive un-inlining
which causes a section mismatch"
* tag 'core-urgent-2021-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
smpboot: Mark idle_init() as __always_inlined to work around aggressive compiler un-inlining
Although swiotlb_exit() frees the 'slots' metadata array referenced by
'io_tlb_default_mem', it leaves the underlying buffer pages allocated
despite no longer being usable.
Extend swiotlb_exit() to free the buffer pages as well as the slots
array.
Cc: Claire Chang <tientzu@chromium.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Claire Chang <tientzu@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
A recent debugging session would have been made a little bit easier if
we had noticed sooner that swiotlb_exit() was being called during boot.
Add a simple diagnostic message to swiotlb_exit() to complement the one
from swiotlb_print_info() during initialisation.
Cc: Claire Chang <tientzu@chromium.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Link: https://lore.kernel.org/r/20210705190352.GA19461@willie-the-truck
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Claire Chang <tientzu@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
Since commit 69031f5008 ("swiotlb: Set dev->dma_io_tlb_mem to the
swiotlb pool used"), 'struct device' may hold a copy of the global
'io_default_tlb_mem' pointer if the device is using swiotlb for DMA. A
subsequent call to swiotlb_exit() will therefore leave dangling pointers
behind in these device structures, resulting in KASAN splats such as:
| BUG: KASAN: use-after-free in __iommu_dma_unmap_swiotlb+0x64/0xb0
| Read of size 8 at addr ffff8881d7830000 by task swapper/0/0
|
| CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc3-debug #1
| Hardware name: HP HP Desktop M01-F1xxx/87D6, BIOS F.12 12/17/2020
| Call Trace:
| <IRQ>
| dump_stack+0x9c/0xcf
| print_address_description.constprop.0+0x18/0x130
| kasan_report.cold+0x7f/0x111
| __iommu_dma_unmap_swiotlb+0x64/0xb0
| nvme_pci_complete_rq+0x73/0x130
| blk_complete_reqs+0x6f/0x80
| __do_softirq+0xfc/0x3be
Convert 'io_default_tlb_mem' to a static structure, so that the
per-device pointers remain valid after swiotlb_exit() has been invoked.
All users are updated to reference the static structure directly, using
the 'nslabs' field to determine whether swiotlb has been initialised.
The 'slots' array is still allocated dynamically and referenced via a
pointer rather than a flexible array member.
Cc: Claire Chang <tientzu@chromium.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Fixes: 69031f5008 ("swiotlb: Set dev->dma_io_tlb_mem to the swiotlb pool used")
Reported-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Nathan Chancellor <nathan@kernel.org>
Tested-by: Claire Chang <tientzu@chromium.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Konrad Rzeszutek Wilk <konrad@kernel.org>
This patch allows bpf tcp iter to call bpf_(get|set)sockopt.
To allow a specific bpf iter (tcp here) to call a set of helpers,
get_func_proto function pointer is added to bpf_iter_reg.
The bpf iter is a tracing prog which currently requires
CAP_PERFMON or CAP_SYS_ADMIN, so this patch does not
impose other capability checks for bpf_(get|set)sockopt.
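A minimal sketch of how such a get_func_proto hook could be wired up (the
callback body and the fallback behaviour here are assumptions for
illustration, not the exact tcp iter code):
static const struct bpf_func_proto *
tcp_iter_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
{
	switch (func_id) {
	case BPF_FUNC_setsockopt:
		return &bpf_sk_setsockopt_proto;
	case BPF_FUNC_getsockopt:
		return &bpf_sk_getsockopt_proto;
	default:
		return NULL;	/* fall back to the generic tracing helpers */
	}
}
static const struct bpf_iter_reg tcp_reg_info = {
	.target		= "tcp",
	.get_func_proto	= tcp_iter_get_func_proto,
	/* ... rest of the registration info ... */
};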
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210701200619.1036715-1-kafai@fb.com
- Fix deadloop in ring buffer because of using stale "read" variable
- Fix synthetic event use of field_pos as boolean and not an index
- Fixed histogram special var "cpu" overriding event fields called "cpu"
- Cleaned up error prone logic in alloc_synth_event()
- Removed call to synchronize_rcu_tasks_rude() when not needed
- Removed redundant initialization of a local variable "ret"
- Fixed kernel crash when updating tracepoint callbacks of different
priorities.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYPrGTxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qusoAQDZkMo/vBFZgNGcL8GNCFpOF9HcV7QI
JtBU+UG5GY2LagD/SEFEoG1o9UwKnIaBwr7qxGvrPgg8jKWtK/HEVFU94wk=
=EVfM
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix deadloop in ring buffer because of using stale "read" variable
- Fix synthetic event use of field_pos as boolean and not an index
- Fixed histogram special var "cpu" overriding event fields called
"cpu"
- Cleaned up error prone logic in alloc_synth_event()
- Removed call to synchronize_rcu_tasks_rude() when not needed
- Removed redundant initialization of a local variable "ret"
- Fixed kernel crash when updating tracepoint callbacks of different
priorities.
* tag 'trace-v5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracepoints: Update static_call before tp_funcs when adding a tracepoint
ftrace: Remove redundant initialization of variable ret
ftrace: Avoid synchronize_rcu_tasks_rude() call when not necessary
tracing: Clean up alloc_synth_event()
tracing/histogram: Rename "cpu" to "common_cpu"
tracing: Synthetic event field_pos is an index not a boolean
tracing: Fix bug in rb_per_cpu_empty() that might cause deadloop.
Now that __ARCH_SI_TRAPNO is no longer set by any architecture, remove
all of the code it enabled from the kernel.
On alpha and sparc, take a more explicit approach of using
send_sig_fault_trapno or force_sig_fault_trapno in the very limited
circumstances where si_trapno was set to a non-zero value.
The generic support that is being removed always set si_trapno on all
fault signals. With only SIGILL ILL_ILLTRAP on sparc and SIGFPE and
SIGTRAP TRAP_UNK on alpha providing si_trapno values, asking all senders
of fault signals to provide an si_trapno value does not make sense.
Making si_trapno an ordinary extension of the fault siginfo layout has
enabled the architecture generic implementation of SIGTRAP TRAP_PERF,
and enables other faulting signals to grow architecture generic
senders as well.
v1: https://lkml.kernel.org/r/m18s4zs7nu.fsf_-_@fess.ebiederm.org
v2: https://lkml.kernel.org/r/20210505141101.11519-8-ebiederm@xmission.com
Link: https://lkml.kernel.org/r/87bl73xx6x.fsf_-_@disp2133
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Because of the significant overhead that retpolines pose on indirect
calls, the tracepoint code was updated to use the new "static_calls" that
can modify the running code to directly call a function instead of using
an indirect caller, and this function can be changed at runtime.
In the tracepoint code that calls all the registered callbacks that are
attached to a tracepoint, the following is done:
it_func_ptr = rcu_dereference_raw((&__tracepoint_##name)->funcs);
if (it_func_ptr) {
__data = (it_func_ptr)->data;
static_call(tp_func_##name)(__data, args);
}
If there's just a single callback, the static_call is updated to just call
that callback directly. Once another handler is added, then the static
caller is updated to call the iterator, that simply loops over all the
funcs in the array and calls each of the callbacks like the old method
using indirect calling.
The issue was discovered with a race between updating the funcs array and
updating the static_call. The funcs array was updated first and then the
static_call was updated. This is not an issue as long as the first element
in the old array is the same as the first element in the new array. But
that assumption is incorrect, because callbacks also have a priority
field, and if there's a callback added that has a higher priority than the
callback on the old array, then it will become the first callback in the
new array. This means that it is possible to call the old callback with
the new callback data element, which can cause a kernel panic.
static_call = callback1()
funcs[] = {callback1,data1};
callback2 has higher priority than callback1
	CPU 1				CPU 2
	-----				-----
	new_funcs = {callback2,data2},
		    {callback1,data1}
	rcu_assign_pointer(tp->funcs, new_funcs);
	/*
	 * Now tp->funcs has the new array
	 * but the static_call still calls callback1
	 */
					it_func_ptr = tp->funcs [ new_funcs ]
					data = it_func_ptr->data [ data2 ]
					static_call(callback1, data);
					/* Now callback1 is called with
					 * callback2's data */
					[ KERNEL PANIC ]
	update_static_call(iterator);
To prevent this from happening, always switch the static_call to the
iterator before assigning the tp->funcs to the new array. The iterator will
always properly match the callback with its data.
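A minimal sketch of the corrected ordering (the static-call and iterator
names here are illustrative, not the exact tracepoint.c code):
	/* 1) Always switch the static call to the safe iterator first ... */
	static_call_update(tp_func_sched_waking, tp_stub_iterator);

	/* 2) ... and only then publish the reordered callbacks array. */
	rcu_assign_pointer(tp->funcs, new_funcs);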
To trigger this bug:
In one terminal:
while :; do hackbench 50; done
In another terminal
echo 1 > /sys/kernel/tracing/events/sched/sched_waking/enable
while :; do
echo 1 > /sys/kernel/tracing/set_event_pid;
sleep 0.5
echo 0 > /sys/kernel/tracing/set_event_pid;
sleep 0.5
done
And it doesn't take long to crash. This is because the set_event_pid adds
a callback to the sched_waking tracepoint with a high priority, which will
be called before the sched_waking trace event callback is called.
Note, the removal down to a single callback updates the array first,
before changing the static_call to that single callback, which is the
proper order, as the first element in the array is the same as what the
static_call is being changed to.
Link: https://lore.kernel.org/io-uring/4ebea8f0-58c9-e571-fd30-0ce4f6f09c70@samba.org/
Cc: stable@vger.kernel.org
Fixes: d25e37d89d ("tracepoint: Optimize using static_call()")
Reported-by: Stefan Metzmacher <metze@samba.org>
Tested-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The variable ret is being initialized with a value that is never
read; it is being updated later on. The assignment is redundant and
can be removed.
Link: https://lkml.kernel.org/r/20210721120915.122278-1-colin.king@canonical.com
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
synchronize_rcu_tasks_rude() triggers IPIs and forces rescheduling on
all CPUs. It is a costly operation and, when targeting nohz_full CPUs,
very disrupting (hence the name). So avoid calling it when 'old_hash'
doesn't need to be freed.
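A minimal sketch of the guard described above (the surrounding ftrace code
is illustrative):
	if (old_hash) {
		/* Only pay for the rude synchronization when there is
		 * actually an old hash to free.
		 */
		synchronize_rcu_tasks_rude();
		free_ftrace_hash(old_hash);
	}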
Link: https://lkml.kernel.org/r/20210721114726.1545103-1-nsaenzju@redhat.com
Signed-off-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
alloc_synth_event() currently has the following code to initialize the
event fields and dynamic_fields:
for (i = 0, j = 0; i < n_fields; i++) {
event->fields[i] = fields[i];
if (fields[i]->is_dynamic) {
event->dynamic_fields[j] = fields[i];
event->dynamic_fields[j]->field_pos = i;
event->dynamic_fields[j++] = fields[i];
event->n_dynamic_fields++;
}
}
1) It would make more sense to have all fields keep track of their
field_pos.
2) event->dynamic_fields[j] is assigned twice for no reason.
3) We can move updating event->n_dynamic_fields outside the loop, and just
assign it to j.
This combination makes the code much cleaner.
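For reference, a sketch of the cleaned-up loop implied by points 1)-3) above
(illustrative; the committed code may differ slightly):
	for (i = 0, j = 0; i < n_fields; i++) {
		fields[i]->field_pos = i;	/* 1) every field records its position */
		event->fields[i] = fields[i];
		if (fields[i]->is_dynamic)
			event->dynamic_fields[j++] = fields[i];	/* 2) single assignment */
	}
	event->n_dynamic_fields = j;		/* 3) set once, outside the loop */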
Link: https://lkml.kernel.org/r/20210721195341.29bb0f77@oasis.local.home
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Currently the histogram logic allows the user to write "cpu" in as an
event field, and it will record the CPU that the event happened on.
The problem with this is that there's a lot of events that have "cpu"
as a real field, and using "cpu" as the CPU it ran on, makes it
impossible to run histograms on the "cpu" field of events.
For example, if I want to have a histogram on the count of the
workqueue_queue_work event on its cpu field, running:
># echo 'hist:keys=cpu' > events/workqueue/workqueue_queue_work/trigger
Gives a misleading and wrong result.
Change the command to "common_cpu" as no event should have "common_*"
fields as that's a reserved name for fields used by all events. And
this makes sense here as common_cpu would be a field used by all events.
Now we can even do:
># echo 'hist:keys=common_cpu,cpu if cpu < 100' > events/workqueue/workqueue_queue_work/trigger
># cat events/workqueue/workqueue_queue_work/hist
# event histogram
#
# trigger info: hist:keys=common_cpu,cpu:vals=hitcount:sort=hitcount:size=2048 if cpu < 100 [active]
#
{ common_cpu: 0, cpu: 2 } hitcount: 1
{ common_cpu: 0, cpu: 4 } hitcount: 1
{ common_cpu: 7, cpu: 7 } hitcount: 1
{ common_cpu: 0, cpu: 7 } hitcount: 1
{ common_cpu: 0, cpu: 1 } hitcount: 1
{ common_cpu: 0, cpu: 6 } hitcount: 2
{ common_cpu: 0, cpu: 5 } hitcount: 2
{ common_cpu: 1, cpu: 1 } hitcount: 4
{ common_cpu: 6, cpu: 6 } hitcount: 4
{ common_cpu: 5, cpu: 5 } hitcount: 14
{ common_cpu: 4, cpu: 4 } hitcount: 26
{ common_cpu: 0, cpu: 0 } hitcount: 39
{ common_cpu: 2, cpu: 2 } hitcount: 184
Now for backward compatibility, I added a trick. If "cpu" is used, and
the field is not found, it will fall back to "common_cpu" and work as
it did before. This way, it will still work for old programs that use
"cpu" to get the actual CPU, but if the event has a "cpu" as a field, it
will get that event's "cpu" field, which is probably what it wants
anyway.
I updated the tracefs/README to include documentation about both the
common_timestamp and the common_cpu. This way, if that text is present in
the README, then an application can know that common_cpu is supported over
just plain "cpu".
Link: https://lkml.kernel.org/r/20210721110053.26b4f641@oasis.local.home
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Fixes: 8b7622bf94 ("tracing: Add cpu field for hist triggers")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The variable stype is being initialized with a value that is never
read; it is being updated later on. The assignment is redundant and
can be removed.
Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210721115630.109279-1-colin.king@canonical.com
kp->addr is a pointer, so it cannot be cast directly to a 'u64'
when it gets interpreted as an integer value:
kernel/trace/bpf_trace.c: In function '____bpf_get_func_ip_kprobe':
kernel/trace/bpf_trace.c:968:21: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
968 | return kp ? (u64) kp->addr : 0;
Use the uintptr_t type instead.
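That is, a minimal sketch of the fixed cast (illustrative):
	return kp ? (u64)(uintptr_t)kp->addr : 0;	/* cast via uintptr_t, not directly */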
Fixes: 9ffd9f3ff7 ("bpf: Add bpf_get_func_ip helper for kprobe programs")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20210721212007.3876595-1-arnd@kernel.org
Pull networking fixes from David Miller:
1) Fix type of bind option flag in af_xdp, from Baruch Siach.
2) Fix use after free in bpf_xdp_link_release(), from Xuan Zhao.
3) PM refcnt imbalance in r8152, from Takashi Iwai.
4) Sign extension bug in liquidio, from Colin Ian King.
5) Missing range check in s390 bpf jit, from Colin Ian King.
6) Uninit value in caif_seqpkt_sendmsg(), from Ziyong Xuan.
7) Fix skb page recycling race, from Ilias Apalodimas.
8) Fix memory leak in tcindex_partial_destroy_work, from Pavel Skripkin.
9) netrom timer sk refcnt issues, from Nguyen Dinh Phi.
10) Fix data races around tcp's tfo_active_disable_stamp, from Eric
Dumazet.
11) act_skbmod should only operate on ethernet packets, from Peilin Ye.
12) Fix slab out-of-bounds in fib6_nh_flush_exceptions(), from Paolo
Abeni.
13) Fix sparx5 dependencies, from Yajun Deng.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (74 commits)
dpaa2-switch: seed the buffer pool after allocating the swp
net: sched: cls_api: Fix the the wrong parameter
net: sparx5: fix unmet dependencies warning
net: dsa: tag_ksz: dont let the hardware process the layer 4 checksum
net: dsa: ensure linearized SKBs in case of tail taggers
ravb: Remove extra TAB
ravb: Fix a typo in comment
net: dsa: sja1105: make VID 4095 a bridge VLAN too
tcp: disable TFO blackhole logic by default
sctp: do not update transport pathmtu if SPP_PMTUD_ENABLE is not set
net: ixp46x: fix ptp build failure
ibmvnic: Remove the proper scrq flush
selftests: net: add ESP-in-UDP PMTU test
udp: check encap socket in __udp_lib_err
sctp: update active_key for asoc when old key is being replaced
r8169: Avoid duplicate sysfs entry creation error
ixgbe: Fix packet corruption due to missing DMA sync
Revert "qed: fix possible unpaired spin_{un}lock_bh in _qed_mcp_cmd_and_union()"
ipv6: fix another slab-out-of-bounds in fib6_nh_flush_exceptions
fsl/fman: Add fibre support
...
The "rb_per_cpu_empty()" misinterpret the condition (as not-empty) when
"head_page" and "commit_page" of "struct ring_buffer_per_cpu" points to
the same buffer page, whose "buffer_data_page" is empty and "read" field
is non-zero.
An error scenario could be constructed as follows (kernel perspective):
1. All pages in the buffer have been accessed by reader(s) so that all of
them will have a non-zero "read" field.
2. Read and clear all buffer pages so that "rb_num_of_entries()" will
return 0, indicating there's no more data to read. It is also required
that the "read_page", "commit_page" and "tail_page" point to the same
page, while "head_page" is the next page after them.
3. Invoke "ring_buffer_lock_reserve()" with a large enough "length"
so that it shoots past the end of the current tail buffer page. Now the
"head_page", "commit_page" and "tail_page" point to the same page.
4. Discard the current event with "ring_buffer_discard_commit()", so that
"head_page", "commit_page" and "tail_page" point to a page whose buffer
data page is now empty.
When the error scenario has been constructed, "tracing_read_pipe" will
be trapped inside a deadloop: "trace_empty()" returns 0 since
"rb_per_cpu_empty()" returns 0 when it hits the CPU containing such a
constructed ring buffer. Then "trace_find_next_entry_inc()" always
returns NULL since "rb_num_of_entries()" reports there's no more entry
to read. Finally "trace_seq_to_user()" returns "-EBUSY", sending
"tracing_read_pipe" back to the start of the "waitagain" loop.
I've also written a proof-of-concept script to construct the scenario
and trigger the bug automatically, you can use it to trace and validate
my reasoning above:
https://github.com/aegistudio/RingBufferDetonator.git
Tests have been carried out on Linux kernel 5.14-rc2
(2734d6c1b1), my fixed version
of the kernel (for testing whether my update fixes the bug) and
some older kernels (to determine the range of affected kernels). Test
results are also attached to the proof-of-concept repository.
Link: https://lore.kernel.org/linux-trace-devel/YPaNxsIlb2yjSi5Y@aegistudio/
Link: https://lore.kernel.org/linux-trace-devel/YPgrN85WL9VyrZ55@aegistudio
Cc: stable@vger.kernel.org
Fixes: bf41a158ca ("ring-buffer: make reentrant")
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Haoran Luo <www@aegistudio.net>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Richard reported sporadic (roughly one in 10 or so) null dereferences and
other strange behaviour for a set of automated LTP tests. Things like:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
RIP: 0010:kernfs_sop_show_path+0x1b/0x60
...or these others:
RIP: 0010:do_mkdirat+0x6a/0xf0
RIP: 0010:d_alloc_parallel+0x98/0x510
RIP: 0010:do_readlinkat+0x86/0x120
There were other less common instances of some kind of a general scribble
but the common theme was mount and cgroup and a dubious dentry triggering
the NULL dereference. I was only able to reproduce it under qemu by
replicating Richard's setup as closely as possible - I never did get it
to happen on bare metal, even while keeping everything else the same.
In commit 71d883c37e ("cgroup_do_mount(): massage calling conventions")
we see this as a part of the overall change:
--------------
struct cgroup_subsys *ss;
- struct dentry *dentry;
[...]
- dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
- CGROUP_SUPER_MAGIC, ns);
[...]
- if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
- struct super_block *sb = dentry->d_sb;
- dput(dentry);
+ ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
+ if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
+ struct super_block *sb = fc->root->d_sb;
+ dput(fc->root);
deactivate_locked_super(sb);
msleep(10);
return restart_syscall();
}
--------------
In changing from the local "*dentry" variable to using fc->root, we now
export/leave that dentry pointer in the file context after doing the dput()
in the unlikely "is_dying" case. With LTP doing a crazy amount of back to
back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
becomes slightly likely and then bad things happen.
A fix would be to not leave the stale reference in fc->root as follows:
--------------
dput(fc->root);
+ fc->root = NULL;
deactivate_locked_super(sb);
--------------
...but then we are just open-coding a duplicate of fc_drop_locked() so we
simply use that instead.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Fixes: 71d883c37e ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Add rules to ignore data-racy reads with only 1-bit value changes.
Details about the rules are captured in comments in
kernel/kcsan/permissive.h. More background follows.
While investigating a number of data races, we've encountered data-racy
accesses on flags variables to be very common. The typical pattern is a
reader masking all but one bit, and/or the writer setting/clearing only
1 bit (current->flags being a frequently encountered case; more examples
in mm/sl[au]b.c, which disable KCSAN for this reason).
Since these types of data-racy accesses are common (with the assumption
they are intentional and hard to miscompile), having the option (with
CONFIG_KCSAN_PERMISSIVE=y) to filter them avoids forcing everyone to
mark them; the choice is deliberately left to preference at this time.
One important motivation for having this option built-in is to move
closer to being able to enable KCSAN on CI systems or for testers
wishing to test the whole kernel, while more easily filtering
less interesting data races with higher probability.
For the implementation, we considered several alternatives, but had one
major requirement: that the rules be kept together with the Linux-kernel
tree. Adding them to the compiler would preclude us from making changes
quickly; if the rules require tweaks, having them part of the compiler
requires waiting another ~1 year for the next release -- that's not
realistic. We are left with the following options:
1. Maintain compiler plugins as part of the kernel-tree that
removes instrumentation for some accesses (e.g. plain-& with
1-bit mask). The analysis would be reader-side focused, as
no assumption can be made about racing writers.
Because it seems unrealistic to maintain 2 plugins, one for LLVM and
GCC, we would likely pick LLVM. Furthermore, no kernel infrastructure
exists to maintain LLVM plugins, and the build-system implications and
maintenance overheads do not look great (historically, plugins written
against old LLVM APIs are not guaranteed to work with newer LLVM APIs).
2. Find a set of rules that can be expressed in terms of
observed value changes, and make it part of the KCSAN runtime.
The analysis is writer-side focused, given we rely on observed
value changes.
The approach taken here is (2). While a complete approach requires both
(1) and (2), experiments show that the majority of data races involving
trivial bit operations on flags variables can be removed with (2) alone.
It goes without saying that the filtering of data races using (1) or (2)
does _not_ guarantee they are safe! Therefore, limiting ourselves to (2)
for now is the conservative choice for setups that wish to enable
CONFIG_KCSAN_PERMISSIVE=y.
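As an illustration of option (2), a value-change rule that ignores
single-bit flips could look roughly like this (the real rules live in
kernel/kcsan/permissive.h and are more involved):
static bool kcsan_ignore_1bit_change(unsigned long old, unsigned long new)
{
	unsigned long diff = old ^ new;

	/* Ignore the data race if exactly one bit changed value. */
	return diff && !(diff & (diff - 1));
}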
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Show a brief message if KCSAN is strict or non-strict, and if non-strict
also say that CONFIG_KCSAN_STRICT=y can be used to see all data races.
This is to hint to users of KCSAN who blindly use the default config
that their configuration might miss data races of interest.
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Rework atomic.h into permissive.h to better reflect its purpose, and
introduce kcsan_ignore_address() and kcsan_ignore_data_race().
Introduce CONFIG_KCSAN_PERMISSIVE and update the stub functions in
preparation for subsequent changes.
As before, developers who choose to use KCSAN in "strict" mode will see
all data races and are not affected. Furthermore, by relying on the
value-change filter logic for kcsan_ignore_data_race(), even if the
permissive rules are enabled, the opt-outs in report.c:skip_report()
override them (such as for RCU-related functions by default).
The option CONFIG_KCSAN_PERMISSIVE is disabled by default, so that the
documented default behaviour of KCSAN does not change. Instead, like
CONFIG_KCSAN_IGNORE_ATOMICS, the option needs to be explicitly opted in.
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are a number get_ctx() calls that are close to each other, which
results in poor codegen (repeated preempt_count loads).
Specifically in kcsan_found_watchpoint() (even though it's a slow-path)
it is beneficial to keep the race-window small until the watchpoint has
actually been consumed to avoid missed opportunities to report a race.
Let's clean it up a bit before we add more code in
kcsan_found_watchpoint().
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
By this point CONFIG_KCSAN_DEBUG is pretty useless, as the system just
isn't usable with it due to it spamming the console (I imagine a randconfig
test robot will run into this sooner or later). Remove it.
Back in 2019 I used it occasionally to record traces of watchpoints and
verify the encoding is correct, but these days we have proper tests. If
something similar is needed in future, just add it back ad-hoc.
Signed-off-by: Marco Elver <elver@google.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit fixes several typos where CONFIG_TASKS_RCU_TRACE should
instead be CONFIG_TASKS_TRACE_RCU. Among other things, these typos
could cause CONFIG_TASKS_TRACE_RCU_READ_MB=y kernels to suffer from
memory-ordering bugs that could result in false-positive quiescent
states and too-short grace periods.
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
This commit replaces the fictitious synchronize_rcu_rude() function with
its real-world synchronize_rcu_tasks_rude() counterpart.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are several ->trc_reader_special.b.need_qs data races that are
too low-probability for KCSAN to notice, but which will happen sooner
or later. This commit therefore marks these accesses.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
There are several ->trc_reader_nesting data races that are too
low-probability for KCSAN to notice, but which will happen sooner or
later. This commit therefore marks these accesses, and comments one
that cannot race.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Accesses to task_struct structures must be either protected by RCU
or by get_task_struct(). Tasks trace RCU uses these in a non-obvious
combination, in conjunction with an IPI handler. This commit therefore
adds comments explaining this usage.
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
At CPU offline time, we must handle any pending wakeup for the nocb_gp
kthread linked to the outgoing CPU.
Now we are making sure of that twice:
1) From rcu_report_dead() when the outgoing CPU makes the very last
local cleanups by itself before switching offline.
2) From rcutree_dead_cpu(). Here the offlining CPU has gone and is truly
now offline. Another CPU takes care of post-mortem cleanup and
checks if the offline CPU had a pending wakeup.
Both ways are fine but we have to choose one or the other because we
don't need to repeat that action. Simply benefit from cache locality
and keep only the first solution.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
The kernel/rcu/tree_plugin.h file contains not only the plugins for
preemptible RCU, but also many other features including rcu_nocbs
callback offloading. This offloading has become large and complex,
so it is time to put it in its own file.
This commit starts that process.
Suggested-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
[ paulmck: Rename to tree_nocb.h, add Frederic as author. ]
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Protect kernel/audit.h against multiple #include's.
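That is, a standard include guard of roughly this shape (the actual macro
name used in kernel/audit.h may differ):
#ifndef _KERNEL_AUDIT_H_
#define _KERNEL_AUDIT_H_

/* ... declarations shared by the audit code ... */

#endif /* _KERNEL_AUDIT_H_ */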
Signed-off-by: MaYuming <mayuming77@hotmail.com>
[PM: rewrite subj/description]
Signed-off-by: Paul Moore <paul@paul-moore.com>
We have a number of systems industry-wide that have a subset of their
functionality that works as follows:
1. Receive a message from local kmsg, serial console, or netconsole;
2. Apply a set of rules to classify the message;
3. Do something based on this classification (like scheduling a
remediation for the machine), rinse, and repeat.
As a couple of examples of places we have this implemented just inside
Facebook, although this isn't a Facebook-specific problem, we have this
inside our netconsole processing (for alarm classification), and as part
of our machine health checking. We use these messages to determine
fairly important metrics around production health, and it's important
that we get them right.
While for some kinds of issues we have counters, tracepoints, or metrics
with a stable interface which can reliably indicate the issue, in order
to react to production issues quickly we need to work with the interface
which most kernel developers naturally use when developing: printk.
Most production issues come from unexpected phenomena, and as such
usually the code in question doesn't have easily usable tracepoints or
other counters available for the specific problem being mitigated. We
have a number of lines of monitoring defence against problems in
production (host metrics, process metrics, service metrics, etc), and
where it's not feasible to reliably monitor at another level, this kind
of pragmatic netconsole monitoring is essential.
As one would expect, monitoring using printk is rather brittle for a
number of reasons -- most notably that the message might disappear
entirely in a new version of the kernel, or that the message may change
in some way that the regex or other classification methods start to
silently fail.
One factor that makes this even harder is that, under normal operation,
many of these messages are never expected to be hit. For example, there
may be a rare hardware bug which one wants to detect if it was to ever
happen again, but its recurrence is not likely or anticipated. This
precludes using something like checking whether the printk in question
was printed somewhere fleetwide recently to determine whether the
message in question is still present or not, since we don't anticipate
that it should be printed anywhere, but still need to monitor for its
future presence in the long-term.
This class of issue has happened on a number of occasions, causing
unhealthy machines with hardware issues to remain in production for
longer than ideal. As a recent example, some monitoring around
blk_update_request fell out of date and caused semi-broken machines to
remain in production for longer than would be desirable.
Searching through the codebase to find the message is also extremely
fragile, because many of the messages are further constructed beyond
their callsite (e.g. btrfs_printk and other module-specific wrappers,
each with their own functionality). Even if they aren't, guessing the
format and formulation of the underlying message based on the aesthetics
of the message emitted is not a recipe for success at scale, and our
previous issues with fleetwide machine health checking demonstrate as
much.
This provides a solution to the issue of silently changed or deleted
printks: we record pointers to all printk format strings known at
compile time into a new .printk_index section, both in vmlinux and
modules. At runtime, this can then be iterated by looking at
<debugfs>/printk/index/<module>, which emits the following format, both
readable by humans and able to be parsed by machines:
$ head -1 vmlinux; shuf -n 5 vmlinux
# <level[,flags]> filename:line function "format"
<5> block/blk-settings.c:661 disk_stack_limits "%s: Warning: Device %s is misaligned\n"
<4> kernel/trace/trace.c:8296 trace_create_file "Could not create tracefs '%s' entry\n"
<6> arch/x86/kernel/hpet.c:144 _hpet_print_config "hpet: %s(%d):\n"
<6> init/do_mounts.c:605 prepare_namespace "Waiting for root device %s...\n"
<6> drivers/acpi/osl.c:1410 acpi_no_auto_serialize_setup "ACPI: auto-serialization disabled\n"
This mitigates the majority of cases where we have a highly-specific
printk which we want to match on, as we can now enumerate and check
whether the format changed or the printk callsite disappeared entirely
in userspace. This allows us to catch changes to printks we monitor
earlier and decide what to do about it before it becomes problematic.
There is no additional runtime cost for printk callers or printk itself,
and the assembly generated is exactly the same.
Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: John Ogness <john.ogness@linutronix.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Tested-by: Petr Mladek <pmladek@suse.com>
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Acked-by: Jessica Yu <jeyu@kernel.org> # for module.{c,h}
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/e42070983637ac5e384f17fbdbe86d19c7b212a5.1623775748.git.chris@chrisdown.name
parse_prefix is needed externally by later patches, so move it into a
context where it can be used as such. Also give it the printk_ prefix to
reduce the chance of collisions.
Signed-off-by: Chris Down <chris@chrisdown.name>
Cc: Petr Mladek <pmladek@suse.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/b22ba314a860e5c7f887958f1eab2649f9bd1d06.1623775748.git.chris@chrisdown.name
In the past, `enum log_flags` was part of `struct log`, hence the name.
`struct log` has since been reworked and now this struct is stored
inside `struct printk_info`. However, the name was never updated, which
is somewhat confusing -- especially since these flags operate at the
record level rather than at the level of an abstract log.
printk_info_flags also joins its other metadata struct friends in
printk_ringbuffer.h.
Signed-off-by: Chris Down <chris@chrisdown.name>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/3dd801982f02603e6e3aa4f8bc4f5ebb830a4949.1623775748.git.chris@chrisdown.name
Working on the histogram code, I found that if you dereference a char
pointer in a trace event that happens to point to user space, it can crash
the kernel, as it does no checks of that pointer. I have code coming that
will do this better, so just remove this ability to treat character
pointers in trace events as strings in the histogram.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYPH9FRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qsyhAQDKiQzVJtjfsNbIWliDQOaUwJMO9tNl
Qu5TUDmPbAA4fwD+MgYsnITPL+o/YcKQ+aMdj/wLLMKfIjhNkFY8wqdLvwg=
=CN97
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix the histogram logic from possibly crashing the kernel
Working on the histogram code, I found that if you dereference a char
pointer in a trace event that happens to point to user space, it can
crash the kernel, as it does no checks of that pointer. I have code
coming that will do this better, so just remove this ability to treat
character pointers in trace events as strings in the histogram"
* tag 'trace-v5.14-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Do not reference char * as a string in histograms
b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage()
helper") fixed the problem with cgroup-local storage use in BPF by
pre-allocating per-CPU array of 8 cgroup storage pointers to accommodate
possible BPF program preemptions and nested executions.
While this seems to work well in practice, it introduces a new and unnecessary
failure mode in which not all BPF programs might be executed if we fail to
find an unused slot for cgroup storage, however unlikely it is. It might also
not be so unlikely when/if we allow sleepable cgroup BPF programs in the
future.
Further, the way that cgroup storage is implemented as ambiently-available
property during entire BPF program execution is a convenient way to pass extra
information to BPF program and helpers without requiring user code to pass
around extra arguments explicitly. So it would be good to have a generic
solution that can allow implementing this without arbitrary restrictions.
Ideally, such solution would work for both preemptable and sleepable BPF
programs in exactly the same way.
This patch introduces such solution, bpf_run_ctx. It adds one pointer field
(bpf_ctx) to task_struct. This field is maintained by BPF_PROG_RUN family of
macros in such a way that it always stays valid throughout BPF program
execution. BPF program preemption is handled by remembering previous
current->bpf_ctx value locally while executing nested BPF program and
restoring old value after nested BPF program finishes. This is handled by two
helper functions, bpf_set_run_ctx() and bpf_reset_run_ctx(), which are
supposed to be used before and after BPF program runs, respectively.
Restoring old value of the pointer handles preemption, while bpf_run_ctx
pointer being a property of current task_struct naturally solves this problem
for sleepable BPF programs by "following" BPF program execution as it is
scheduled in and out of CPU. It would even allow CPU migration of BPF
programs, even though it's not currently allowed by BPF infra.
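A minimal sketch of the save/restore pattern described above (the member
names of bpf_cg_run_ctx are assumptions for the example):
	struct bpf_run_ctx *old_run_ctx;
	struct bpf_cg_run_ctx run_ctx = {};

	old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
	/* ... run the (possibly nested) BPF program ... */
	bpf_reset_run_ctx(old_run_ctx);	/* restore the outer program's context */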
This patch cleans up cgroup local storage handling as a first application. The
design itself is generic, though, with bpf_run_ctx being an empty struct that
is supposed to be embedded into a specific struct for a given BPF program type
(bpf_cg_run_ctx in this case). Follow up patches are planned that will expand
this mechanism for other uses within tracing BPF programs.
To verify that this change doesn't revert the fix to the original cgroup
storage issue, I ran the same repro as in the original report ([0]) and didn't
get any problems. Replacing bpf_reset_run_ctx(old_run_ctx) with
bpf_reset_run_ctx(NULL) triggers the issue pretty quickly (so repro does work).
[0] https://lore.kernel.org/bpf/YEEvBUiJl2pJkxTd@krava/
Fixes: b910eaaaa4 ("bpf: Fix NULL pointer dereference in bpf_get_local_storage() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20210712230615.3525979-1-andrii@kernel.org
Pull RCU fixes from Paul McKenney:
- fix regressions induced by a merge-window change in scheduler
semantics, which means that smp_processor_id() can no longer be used
in kthreads using simple affinity to bind themselves to a specific
CPU.
- fix a bug in Tasks Trace RCU that was thought to be strictly
theoretical. However, production workloads have started hitting this,
so these fixes need to be merged sooner rather than later.
- fix a minor printk()-format-mismatch issue introduced during the
merge window.
* 'urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
rcu: Fix pr_info() formats and values in show_rcu_gp_kthreads()
rcu-tasks: Don't delete holdouts within trc_wait_for_one_reader()
rcu-tasks: Don't delete holdouts within trc_inspect_reader()
refscale: Avoid false-positive warnings in ref_scale_reader()
scftorture: Avoid false-positive warnings in scftorture_invoker()
The second parameter, 'count', is not used in this function, so remove it.
The places where the function is called are modified accordingly.
Signed-off-by: xuyehan <xuyehan@xiaomi.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/1625547043-28103-1-git-send-email-yehanxu1@gmail.com
Refactor the permission check in perf_event_open() into a helper
perf_check_permission(). This makes the permission check logic more
readable (because we no longer have a negated disjunction). Add a
comment mentioning that the ptrace check also checks the uid.
No functional change intended.
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Link: https://lore.kernel.org/r/20210705084453.2151729-2-elver@google.com
If perf_event_open() is called with another task as target and
perf_event_attr::sigtrap is set, and the target task's user does not
match the calling user, also require the CAP_KILL capability or
PTRACE_MODE_ATTACH permissions.
Otherwise, with the CAP_PERFMON capability alone it would be possible
for a user to send SIGTRAP signals via perf events to another user's
tasks. This could potentially result in those tasks being terminated if
they cannot handle SIGTRAP signals.
Note: The check complements the existing capability check, but is not
supposed to supersede the ptrace_may_access() check. At a high level we
now have:
capable of CAP_PERFMON and (CAP_KILL if sigtrap)
OR
ptrace_may_access(...) // also checks for same thread-group and uid
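A condensed sketch of how the combined check can be expressed in code
(modeled on the perf_check_permission() helper introduced above; the exact
ptrace modes and locking here are assumptions):

    static bool perf_check_permission(struct perf_event_attr *attr,
                                      struct task_struct *task)
    {
        unsigned int ptrace_mode = PTRACE_MODE_READ | PTRACE_MODE_REALCREDS;
        bool is_capable = perfmon_capable();

        if (attr->sigtrap) {
            /*
             * Sending SIGTRAP to another user's task additionally requires
             * CAP_KILL; without it, fall back to the stricter ATTACH check.
             */
            rcu_read_lock();
            is_capable &= ns_capable(__task_cred(task)->user_ns, CAP_KILL);
            rcu_read_unlock();

            ptrace_mode = PTRACE_MODE_ATTACH_REALCREDS;
        }

        /* ptrace_may_access() also covers the same-thread-group and uid checks. */
        return is_capable || ptrace_may_access(task, ptrace_mode);
    }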
Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Dmitry Vyukov <dvyukov@google.com>
Cc: <stable@vger.kernel.org> # 5.13+
Link: https://lore.kernel.org/r/20210705084453.2151729-1-elver@google.com
Get rid of an outdated comment.
Since cgroup was fully switched to fs_context, cgroup_mount() is gone and it
is confusing to mention it in the comments of cgroup_kill_sb(). Delete it.
Signed-off-by: zhaoxiaoqiang11 <zhaoxiaoqiang11@jd.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
In 7fedb63a83 ("bpf: Tighten speculative pointer arithmetic mask") we
narrowed the offset mask for unprivileged pointer arithmetic in order to
mitigate a corner case where in the speculative domain it is possible to
advance, for example, the map value pointer by up to value_size-1 out-of-
bounds in order to leak kernel memory via a side channel to user space.
The verifier's state pruning for scalars leaves one corner case open
where in the first verification path R_x holds an unknown scalar with an
aux->alu_limit of e.g. 7, and in a second verification path that same
register R_x, here denoted as R_x', holds an unknown scalar which has
tighter bounds and would thus satisfy range_within(R_x, R_x') as well as
tnum_in(R_x, R_x') for state pruning, yielding an aux->alu_limit of 3.
Given the second path fits the register constraints for pruning, the final
generated mask from aux->alu_limit will remain at 7. While technically
not wrong for the non-speculative domain, it would however be possible
to craft similar cases where the mask would be too wide as in 7fedb63a83.
One way to fix it is to detect the presence of unknown scalar map pointer
arithmetic and force a deeper search on unknown scalars to ensure that
we do not run into a masking mismatch.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Follow-up to fe9a5ca7e3 ("bpf: Do not mark insn as seen under speculative
path verification"). The sanitize_insn_aux_data() helper does not serve a
particular purpose in today's code. The original intention for the helper
was that if function-by-function verification fails, a given program would
be cleared from temporary insn_aux_data[], and then its verification would
be re-attempted in the context of the main program a second time.
However, a failure in do_check_subprogs() will skip do_check_main() and
propagate the error to the user instead, thus such a situation can never
occur. Given its interaction is not compatible with the Spectre v1 mitigation
(due to comparing aux->seen with env->pass_cnt), just remove
sanitize_insn_aux_data() to avoid future bugs in this area.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
xen-swiotlb can use vmalloc-backed addresses for DMA coherent allocations
and uses the common helpers. Properly handle them to unbreak Xen on
ARM platforms.
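The core of the handling is translating a possibly vmalloc-backed address to
a struct page before it is used; a minimal sketch (the helper name is
illustrative, and as noted the real helpers were renamed when the patch was
split):

    static struct page *cpu_addr_to_page(void *cpu_addr)
    {
        /* vmalloc addresses are not covered by virt_to_page(). */
        if (is_vmalloc_addr(cpu_addr))
            return vmalloc_to_page(cpu_addr);
        return virt_to_page(cpu_addr);
    }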
Fixes: 1b65c4e5a9 ("swiotlb-xen: use xen_alloc/free_coherent_pages")
Signed-off-by: Roman Skakun <roman_skakun@epam.com>
Reviewed-by: Andrii Anisov <andrii_anisov@epam.com>
[hch: split the patch, renamed the helpers]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Alexei Starovoitov says:
====================
pull-request: bpf-next 2021-07-15
The following pull-request contains BPF updates for your *net-next* tree.
We've added 45 non-merge commits during the last 15 day(s) which contain
a total of 52 files changed, 3122 insertions(+), 384 deletions(-).
The main changes are:
1) Introduce bpf timers, from Alexei.
2) Add sockmap support for unix datagram socket, from Cong.
3) Fix potential memleak and UAF in the verifier, from He.
4) Add bpf_get_func_ip helper, from Jiri.
5) Improvements to generic XDP mode, from Kumar.
6) Support for passing xdp_md to XDP programs in bpf_prog_run, from Zvi.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently sock_map still has a Kconfig dependency on CONFIG_INET,
but there is no actual functional dependency on it after we
introduced ->psock_update_sk_prot().
We have to extend it to CONFIG_NET now as we are going to
support AF_UNIX.
Signed-off-by: Cong Wang <cong.wang@bytedance.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210704190252.11866-2-xiyou.wangcong@gmail.com
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_KPROBE programs,
so it's now possible to call bpf_get_func_ip from both kprobe and
kretprobe programs.
The caller's address is taken from 'struct kprobe::addr', which is defined
for both kprobe and kretprobe.
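A minimal libbpf-style usage sketch (the probed function, section names, and
program names are illustrative):

    #include "vmlinux.h"
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_tracing.h>

    SEC("kprobe/do_sys_openat2")
    int BPF_KPROBE(probe_entry)
    {
        /* Address of the probed function, taken from kprobe::addr. */
        __u64 ip = bpf_get_func_ip(ctx);

        bpf_printk("entry ip: %llx", ip);
        return 0;
    }

    SEC("kretprobe/do_sys_openat2")
    int BPF_KRETPROBE(probe_exit)
    {
        /* The same address is available from the kretprobe as well. */
        __u64 ip = bpf_get_func_ip(ctx);

        bpf_printk("exit ip: %llx", ip);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";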
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-5-jolsa@kernel.org
Adding bpf_get_func_ip helper for BPF_PROG_TYPE_TRACING programs,
specifically for all trampoline attach types.
The trampoline stores the caller's IP address at (ctx - 8), so there's no
reason to actually call the helper; instead, fix up the call instruction to
return the [ctx - 8] value directly.
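For illustration, the inlined form amounts to a single load emitted during
the verifier's fixup pass (a sketch; the surrounding rewrite machinery is
omitted):

    /* r0 = *(u64 *)(r1 - 8): read the IP the trampoline saved just below
     * the program's ctx pointer, instead of emitting a real helper call. */
    insn_buf[0] = BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, -8);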
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-4-jolsa@kernel.org
Enabling BPF_TRAMP_F_IP_ARG for trampolines that actually need it.
The BPF_TRAMP_F_IP_ARG flag adds three extra instructions to the trampoline
code and is used only by programs with the bpf_get_func_ip helper, which is
added in a following patch and sets the call_get_func_ip bit.
This patch ensures that the BPF_TRAMP_F_IP_ARG flag is used only for
trampolines that have programs with call_get_func_ip set.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20210714094400.396467-3-jolsa@kernel.org
Andrii Nakryiko says:
====================
pull-request: bpf 2021-07-15
The following pull-request contains BPF updates for your *net* tree.
We've added 9 non-merge commits during the last 5 day(s) which contain
a total of 9 files changed, 37 insertions(+), 15 deletions(-).
The main changes are:
1) Fix NULL pointer dereference in BPF_TEST_RUN for BPF_XDP_DEVMAP and
BPF_XDP_CPUMAP programs, from Xuan Zhuo.
2) Fix use-after-free of net_device in XDP bpf_link, from Xuan Zhuo.
3) Follow-up fix to subprog poke descriptor use-after-free problem, from
Daniel Borkmann and John Fastabend.
4) Fix out-of-range array access in s390 BPF JIT backend, from Colin Ian King.
5) Fix memory leak in BPF sockmap, from John Fastabend.
6) Fix for sockmap to prevent proc stats reporting bug, from John Fastabend
and Jakub Sitnicki.
7) Fix NULL pointer dereference in bpftool, from Tobias Klauser.
8) AF_XDP documentation fixes, from Baruch Siach.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Hi Linus,
Please pull the following patches that fix many fall-through
warnings when building with Clang and -Wimplicit-fallthrough.
This pull-request also contains the patch for Makefile that enables
-Wimplicit-fallthrough for Clang, globally.
It's also important to notice that since we have adopted the use of
the pseudo-keyword macro fallthrough; we also want to avoid having
more /* fall through */ comments being introduced. Notice that contrary
to GCC, Clang doesn't recognize any comments as implicit fall-through
markings when the -Wimplicit-fallthrough option is enabled. So, in
order to avoid having more comments being introduced, we have to use
the option -Wimplicit-fallthrough=5 for GCC, which, similar to Clang,
will cause a warning in case a code comment is intended to be used
as a fall-through marking. The patch for Makefile also enforces this.
We had almost 4,000 of these issues for Clang in the beginning,
and there might be a couple more out there when building some
architectures with certain configurations. However, with the
recent fixes I think we are in good shape and it is now possible
to enable -Wimplicit-fallthrough for Clang. :)
Thanks!
Merge tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux
Pull fallthrough fixes from Gustavo Silva:
"This fixes many fall-through warnings when building with Clang and
-Wimplicit-fallthrough, and also enables -Wimplicit-fallthrough for
Clang, globally.
It's also important to notice that since we have adopted the use of
the pseudo-keyword macro fallthrough, we also want to avoid having
more /* fall through */ comments being introduced. Contrary to GCC,
Clang doesn't recognize any comments as implicit fall-through markings
when the -Wimplicit-fallthrough option is enabled.
So, in order to avoid having more comments being introduced, we use
the option -Wimplicit-fallthrough=5 for GCC, which, similar to Clang,
will cause a warning in case a code comment is intended to be used as
a fall-through marking. The patch for Makefile also enforces this.
We had almost 4,000 of these issues for Clang in the beginning, and
there might be a couple more out there when building some
architectures with certain configurations. However, with the recent
fixes I think we are in good shape and it is now possible to enable
the warning for Clang"
* tag 'Wimplicit-fallthrough-clang-5.14-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux: (27 commits)
Makefile: Enable -Wimplicit-fallthrough for Clang
powerpc/smp: Fix fall-through warning for Clang
dmaengine: mpc512x: Fix fall-through warning for Clang
usb: gadget: fsl_qe_udc: Fix fall-through warning for Clang
powerpc/powernv: Fix fall-through warning for Clang
MIPS: Fix unreachable code issue
MIPS: Fix fall-through warnings for Clang
ASoC: Mediatek: MT8183: Fix fall-through warning for Clang
power: supply: Fix fall-through warnings for Clang
dmaengine: ti: k3-udma: Fix fall-through warning for Clang
s390: Fix fall-through warnings for Clang
dmaengine: ipu: Fix fall-through warning for Clang
iommu/arm-smmu-v3: Fix fall-through warning for Clang
mmc: jz4740: Fix fall-through warning for Clang
PCI: Fix fall-through warning for Clang
scsi: libsas: Fix fall-through warning for Clang
video: fbdev: Fix fall-through warning for Clang
math-emu: Fix fall-through warning
cpufreq: Fix fall-through warning for Clang
drm/msm: Fix fall-through warning in msm_gem_new_impl()
...
Teach the max stack depth checking algorithm about async callbacks
that don't increase the bpf program stack size.
Also add a sanity check that bpf_tail_call didn't sneak into an async
callback. It's impossible, since PTR_TO_CTX is not available in an async
callback, hence the program cannot contain bpf_tail_call(ctx,...);
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-10-alexei.starovoitov@gmail.com
The bpf_for_each_map_elem() and bpf_timer_set_callback() helpers rely on
the PTR_TO_FUNC infra in the verifier to validate addresses of subprograms
and pass them into the helpers as function callbacks.
In the case of bpf_for_each_map_elem() the callback is invoked synchronously
and the verifier treats it as a normal subprogram call by adding another
bpf_func_state and a new frame in __check_func_call().
bpf_timer_set_callback() doesn't invoke the callback directly.
The subprogram will be called asynchronously from bpf_timer_cb().
Teach the verifier to validate such async callbacks as a special kind of
jump by pushing the verifier state onto the stack and letting pop_stack()
process it.
Special care needs to be taken during state pruning. The call insn doing
bpf_timer_set_callback() has to be a prune point. Otherwise short timer
callbacks might not have prune points in front of bpf_timer_set_callback(),
which means is_state_visited() will be called after this call insn is
processed in __check_func_call(), so another async_cb state will be pushed
to be walked later and the verifier will eventually hit the
BPF_COMPLEXITY_LIMIT_JMP_SEQ limit.
Since push_async_cb() looks like another push_stack() branch, the infinite
loop detection would trigger a false positive. To recognize this case, mark
such states as in_async_callback_fn.
To distinguish an infinite loop in an async callback from the same callback
being called with different arguments for different maps and timers, add
async_entry_cnt to bpf_func_state.
Enforce return zero from async callbacks.
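A condensed sketch of the asynchronous pattern the verifier now has to
reason about (libbpf-style C, with the usual vmlinux.h/bpf_helpers.h
includes assumed; map and program names are illustrative):

    struct elem {
        struct bpf_timer t;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct elem);
    } timer_map SEC(".maps");

    /* Called asynchronously from bpf_timer_cb(); must return 0. */
    static int timer_cb(void *map, int *key, struct elem *val)
    {
        return 0;
    }

    SEC("fentry/bpf_fentry_test1")
    int arm_timer(void *ctx)
    {
        int key = 0;
        struct elem *val = bpf_map_lookup_elem(&timer_map, &key);

        if (!val)
            return 0;
        bpf_timer_init(&val->t, &timer_map, 1 /* CLOCK_MONOTONIC */);
        bpf_timer_set_callback(&val->t, timer_cb);
        bpf_timer_start(&val->t, 0 /* expire immediately */, 0);
        return 0;
    }

    char LICENSE[] SEC("license") = "GPL";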
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-9-alexei.starovoitov@gmail.com
In the following bpf subprogram:
static int timer_cb(void *map, void *key, void *value)
{
bpf_timer_set_callback(.., timer_cb);
}
the 'timer_cb' is a pointer to a function.
An ld_imm64 insn is used to carry this pointer, and bpf_pseudo_func()
returns true for such an ld_imm64 insn.
Unlike bpf_for_each_map_elem(), bpf_timer_set_callback() is asynchronous.
Relax the control flow check to allow such "recursion", which is seen as an
infinite loop by check_cfg(). The distinction between bpf_for_each_map_elem()
and bpf_timer_set_callback() is made in the follow-up patch.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-8-alexei.starovoitov@gmail.com
BTF is required for 'struct bpf_timer' to be recognized inside a map value.
BPF timers are supported inside inner maps.
Remember 'struct btf *' in inner_map_meta to make it available
to the verifier in the sequence:
struct bpf_map *inner_map = bpf_map_lookup_elem(&outer_map, ...);
if (inner_map)
timer = bpf_map_lookup_elem(inner_map, ...);
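For example, a libbpf-style declaration of an inner map whose value embeds a
timer might look like this (a sketch; map names and sizes are arbitrary):

    struct inner_val {
        struct bpf_timer timer;
    };

    struct inner_map_t {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, struct inner_val);
    } inner_map SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
        __uint(max_entries, 1);
        __type(key, int);
        __array(values, struct inner_map_t);
    } outer_map SEC(".maps") = {
        .values = { &inner_map },
    };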
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20210715005417.78572-7-alexei.starovoitov@gmail.com