Current release - regressions:
- bpf, sockmap: re-evaluate proto ops when psock is removed from sockmap
Current release - new code bugs:
- bpf: fix bpf_check_mod_kfunc_call for built-in modules
- ice: fixes for TC classifier offloads
- vrf: don't run conntrack on vrf with !dflt qdisc
Previous releases - regressions:
- bpf: fix the off-by-two error in range markings
- seg6: fix the iif in the IPv6 socket control block
- devlink: fix netns refcount leak in devlink_nl_cmd_reload()
- dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
- dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
Previous releases - always broken:
- ethtool: do not perform operations on net devices being unregistered
- udp: use datalen to cap max gso segments
- ice: fix races in stats collection
- fec: only clear interrupt of handling queue in fec_enet_rx_queue()
- m_can: pci: fix incorrect reference clock rate
- m_can: disable and ignore ELO interrupt
- mvpp2: fix XDP rx queues registering
Misc:
- treewide: add missing includes masked by cgroup -> bpf.h dependency
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGyN1AACgkQMUZtbf5S
IrtgMA/8D0qk3c75ts0hCzGXwdNdEBs+e7u1bJVPqdyU8x/ZLAp2c0EKB/7IWuxA
CtsnbanPcmibqvQJDI1hZEBdafi43BmF5VuFSIxYC4EM/1vLoRprurXlIwL2YWki
aWi//tyOIGBl6/ClzJ9Vm51HTJQwDmdv8GRnKAbsC1eOTM3pmmcg+6TLbDhycFEQ
F9kkDCvyB9kWIH645QyJRH+Y5qQOvneCyQCPkkyjTgEADzV5i7YgtRol6J3QIbw3
umPHSckCBTjMacYcCLsbhQaF2gTMgPV1basNLPMjCquJVrItE0ZaeX3MiD6nBFae
yY5+Wt5KAZDzjERhneX8AINHoRPA/tNIahC1+ytTmsTA8Hj230FHE5hH1ajWiJ9+
GSTBCBqjtZXce3r2Efxfzy0Kb9JwL3vDi7LS2eKQLv0zBLfYp2ry9Sp9qe4NhPkb
OYrxws9kl5GOPvrFB5BWI9XBINciC9yC3PjIsz1noi0vD8/Hi9dPwXeAYh36fXU3
rwRg9uAt6tvFCpwbuQ9T2rsMST0miur2cDYd8qkJtuJ7zFvc+suMXwBZyI29nF2D
uyymIC2XStHJfAjUkFsGVUSXF5FhML9OQsqmisdQ8KdH26jMnDeMjIWJM7UWK+zY
E/fqWT8UyS3mXWqaggid4ZbotipCwA0gxiDHuqqUGTM+dbKrzmk=
=F6rS
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and netfilter.
Current release - regressions:
- bpf, sockmap: re-evaluate proto ops when psock is removed from
sockmap
Current release - new code bugs:
- bpf: fix bpf_check_mod_kfunc_call for built-in modules
- ice: fixes for TC classifier offloads
- vrf: don't run conntrack on vrf with !dflt qdisc
Previous releases - regressions:
- bpf: fix the off-by-two error in range markings
- seg6: fix the iif in the IPv6 socket control block
- devlink: fix netns refcount leak in devlink_nl_cmd_reload()
- dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
- dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
Previous releases - always broken:
- ethtool: do not perform operations on net devices being
unregistered
- udp: use datalen to cap max gso segments
- ice: fix races in stats collection
- fec: only clear interrupt of handling queue in fec_enet_rx_queue()
- m_can: pci: fix incorrect reference clock rate
- m_can: disable and ignore ELO interrupt
- mvpp2: fix XDP rx queues registering
Misc:
- treewide: add missing includes masked by cgroup -> bpf.h
dependency"
* tag 'net-5.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (82 commits)
net: dsa: mv88e6xxx: allow use of PHYs on CPU and DSA ports
net: wwan: iosm: fixes unable to send AT command during mbim tx
net: wwan: iosm: fixes net interface nonfunctional after fw flash
net: wwan: iosm: fixes unnecessary doorbell send
net: dsa: felix: Fix memory leak in felix_setup_mmio_filtering
MAINTAINERS: s390/net: remove myself as maintainer
net/sched: fq_pie: prevent dismantle issue
net: mana: Fix memory leak in mana_hwc_create_wq
seg6: fix the iif in the IPv6 socket control block
nfp: Fix memory leak in nfp_cpp_area_cache_add()
nfc: fix potential NULL pointer deref in nfc_genl_dump_ses_done
nfc: fix segfault in nfc_genl_dump_devices_done
udp: using datalen to cap max gso segments
net: dsa: mv88e6xxx: error handling for serdes_power functions
can: kvaser_usb: get CAN clock frequency from device
can: kvaser_pciefd: kvaser_pciefd_rx_error_frame(): increase correct stats->{rx,tx}_errors counter
net: mvpp2: fix XDP rx queues registering
vmxnet3: fix minimum vectors alloc issue
net, neigh: clear whole pneigh_entry at alloc time
net: dsa: mv88e6xxx: fix "don't use PHY_DETECT on internal PHY's"
...
Several ->poll() implementations are special in that they use a
waitqueue whose lifetime is the current task, rather than the struct
file as is normally the case. This is okay for blocking polls, since a
blocking poll occurs within one task; however, non-blocking polls
require another solution. This solution is for the queue to be cleared
before it is freed, using 'wake_up_poll(wq, EPOLLHUP | POLLFREE);'.
However, that has a bug: wake_up_poll() calls __wake_up() with
nr_exclusive=1. Therefore, if there are multiple "exclusive" waiters,
and the wakeup function for the first one returns a positive value, only
that one will be called. That's *not* what's needed for POLLFREE;
POLLFREE is special in that it really needs to wake up everyone.
Considering the three non-blocking poll systems:
- io_uring poll doesn't handle POLLFREE at all, so it is broken anyway.
- aio poll is unaffected, since it doesn't support exclusive waits.
However, that's fragile, as someone could add this feature later.
- epoll doesn't appear to be broken by this, since its wakeup function
returns 0 when it sees POLLFREE. But this is fragile.
Although there is a workaround (see epoll), it's better to define a
function which always sends POLLFREE to all waiters. Add such a
function. Also make it verify that the queue really becomes empty after
all waiters have been woken up.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20211209010455.42744-2-ebiggers@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
There's error paths in __create_synth_event() after the argv is allocated
that fail to free it. Add a jump to free it when necessary.
Link: https://lkml.kernel.org/r/20211209024317.11783-1-linmq006@gmail.com
Suggested-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Miaoqian Lin <linmq006@gmail.com>
[ Fixed up the patch and change log ]
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The pointer t is being initialized with a value that is never read. The
pointer is re-assigned a value a littler later on, hence the initialization
is redundant and can be removed.
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211207224718.59593-1-colin.i.king@gmail.com
Daniel Borkmann says:
====================
bpf 2021-12-08
We've added 12 non-merge commits during the last 22 day(s) which contain
a total of 29 files changed, 659 insertions(+), 80 deletions(-).
The main changes are:
1) Fix an off-by-two error in packet range markings and also add a batch of
new tests for coverage of these corner cases, from Maxim Mikityanskiy.
2) Fix a compilation issue on MIPS JIT for R10000 CPUs, from Johan Almbladh.
3) Fix two functional regressions and a build warning related to BTF kfunc
for modules, from Kumar Kartikeya Dwivedi.
4) Fix outdated code and docs regarding BPF's migrate_disable() use on non-
PREEMPT_RT kernels, from Sebastian Andrzej Siewior.
5) Add missing includes in order to be able to detangle cgroup vs bpf header
dependencies, from Jakub Kicinski.
6) Fix regression in BPF sockmap tests caused by missing detachment of progs
from sockets when they are removed from the map, from John Fastabend.
7) Fix a missing "no previous prototype" warning in x86 JIT caused by BPF
dispatcher, from Björn Töpel.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Add selftests to cover packet access corner cases
bpf: Fix the off-by-two error in range markings
treewide: Add missing includes masked by cgroup -> bpf dependency
tools/resolve_btfids: Skip unresolved symbol warning for empty BTF sets
bpf: Fix bpf_check_mod_kfunc_call for built-in modules
bpf: Make CONFIG_DEBUG_INFO_BTF depend upon CONFIG_BPF_SYSCALL
mips, bpf: Fix reference to non-existing Kconfig symbol
bpf: Make sure bpf_disable_instrumentation() is safe vs preemption.
Documentation/locking/locktypes: Update migrate_disable() bits.
bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap
bpf, sockmap: Attach map progs to psock early for feature probes
bpf, x86: Fix "no previous prototype" warning
====================
Link: https://lore.kernel.org/r/20211208155125.11826-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Adding ops cleanup to unregister_ftrace_direct_multi,
so it can be reused in another register call.
Link: https://lkml.kernel.org/r/20211206182032.87248-3-jolsa@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Fixes: f64dd4627e ("ftrace: Add multi direct register/unregister interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Now when we have *direct_multi interface the direct_functions
hash is no longer owned just by direct_ops. It's also used by
any other ftrace_ops passed to *direct_multi interface.
Thus to find out that we are unregistering the last function
from direct_ops, we need to check directly direct_ops's hash.
Link: https://lkml.kernel.org/r/20211206182032.87248-2-jolsa@kernel.org
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Fixes: f64dd4627e ("ftrace: Add multi direct register/unregister interface")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When CONFIG_DEBUG_INFO_BTF_MODULES is not set
the following warning can be seen:
kernel/bpf/btf.c:6588:13: warning: 'purge_cand_cache' defined but not used [-Wunused-function]
Fix it.
Fixes: 1e89106da2 ("bpf: Add bpf_core_add_cands() and wire it into bpf_core_apply_relo_insn().")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211207014839.6976-1-alexei.starovoitov@gmail.com
Branch data available to BPF programs can be very useful to get stack traces
out of userspace application.
Commit fff7b64355 ("bpf: Add bpf_read_branch_records() helper") added BPF
support to capture branch records in x86. Enable this feature also for other
architectures as well by removing checks specific to x86.
If an architecture doesn't support branch records, bpf_read_branch_records()
still has appropriate checks and it will return an -EINVAL in that scenario.
Based on UAPI helper doc in include/uapi/linux/bpf.h, unsupported architectures
should return -ENOENT in such case. Hence, update the appropriate check to
return -ENOENT instead.
Selftest 'perf_branches' result on power9 machine which has the branch stacks
support:
- Before this patch:
[command]# ./test_progs -t perf_branches
#88/1 perf_branches/perf_branches_hw:FAIL
#88/2 perf_branches/perf_branches_no_hw:OK
#88 perf_branches:FAIL
Summary: 0/1 PASSED, 0 SKIPPED, 1 FAILED
- After this patch:
[command]# ./test_progs -t perf_branches
#88/1 perf_branches/perf_branches_hw:OK
#88/2 perf_branches/perf_branches_no_hw:OK
#88 perf_branches:OK
Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
Selftest 'perf_branches' result on power9 machine which doesn't have branch
stack report:
- After this patch:
[command]# ./test_progs -t perf_branches
#88/1 perf_branches/perf_branches_hw:SKIP
#88/2 perf_branches/perf_branches_no_hw:OK
#88 perf_branches:OK
Summary: 1/1 PASSED, 1 SKIPPED, 0 FAILED
Fixes: fff7b64355 ("bpf: Add bpf_read_branch_records() helper")
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kajol Jain <kjain@linux.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211206073315.77432-1-kjain@linux.ibm.com
mode runs for prolonged periods with interrupts disabled and ends up
programming the next tick in the past, leading to that storm
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGsp+EACgkQEsHwGGHe
VUpmAA/6A8W0Nb6Doc8B3emuy9qv3NeqLGWqSIKcJnOz0GYhlWuFKGmH6zWQ/ZKZ
ihjw5fP7aOEytLhLagnn1k2weRZrgBavHaxQskuL3HBFD0mT6Gz1TfJC9JlE5s2Q
KxaDjRLx5RGJb/KHZDiZv6Kz61Ouh14KfHHymVhZndcPNZ7UjsCgacyUkctGKcoc
DtNW0Z6tjUGbp1MXyGcOiTiM7hUS8SWsdJbMfn0Eu+/NKvnkua7vwTgEMTwYwrK0
88sLYyVygL+NHjE9LpSGrRj1HjEV4dSMC3r18UYuWQYkzBvA+/SQbIKD5QoeFmZU
st5dMBD8Q3KvAWQ8mXE5ymaYaIZxv21PaL1J7lZ3J3osMASH0LkMWXLYoMVtO5rq
OIpZlODSGLiamGcC5uieoBR/f4Zzn+sEZZ6TyoXWOBv4Cap2XnlJP5WjJ4ARJvzT
MLX2u8MPPMTL7vtd2Xb4kPZcWH5irrCENXlbz0UG08ZHj4CvBFb+a87f+E4aNUs4
uBsTf/kS5SihE1ripSCJEnFsc/QgVPr/9jBXQehRcuI4NgT4pUg85LWDj3gSIcH8
wMRbiX2ND0ZWk89RYaoiDQ6JPGrsnwKvGLRk9ZhFNtUfpycv5JWKwepVbmAKfos+
JtmG/6kcFQKBofR7EA4Xuh7DHv7LKCRf3MMlAR6Gzx/3K2kyIoQ=
=Ft9k
-----END PGP SIGNATURE-----
Merge tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Borislav Petkov:
- Prevent a tick storm when a dedicated timekeeper CPU in nohz_full
mode runs for prolonged periods with interrupts disabled and ends up
programming the next tick in the past, leading to that storm
* tag 'timers_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/nohz: Last resort update jiffies on nohz_full IRQ entry
- Fix preempt= callback return values
- Correct utime/stime resource usage reporting on nohz_full to return
the proper times instead of shorter ones
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGspTcACgkQEsHwGGHe
VUqHhBAAhEd9DMoJwREKDCDMqc3pttNYpTpSVo1K6oBTsOh7mwEilPdlmsTl239V
jRocVJST+/JmJ424j7t0Sp42tREMKNlbyf+ddvr0oUwi0mLUnN6J83NU4WK4Jisf
gyXFIkeMR+/W6/LO7gDdq/+rlRDtJcllwHoOm1yyiy5Zc0qDrcy6CjgP5/9hEsh6
xvRvPOXbeZZVA+a+n+G9xGN836aBe1VptoABbdAlOSTiOvAVkS95UCb9rfPTvMtq
/71jjZMmhTxGUhg5oLpgvfRRZE608X6b2RCTcAPKa5mfMpN5YMQLcD9G0f8XZjkq
iOO/+arE6XQJlTzhAEsGxkSXaVweYxRHHP1yAlWYlWV/xGhoaAyq/tXE1KusAnng
16/eTbrPb1eawpI6p1AAScCQuF/TlYZCMqjbFVhViXM5Rkd6jrii9vz/JnkdokGR
3TH0n4WAJkdZeg18WS3B0eIt6zDTvxbR9g5ap2/10xYnYHMNdHXGH8A+5Grw9/Ln
Qsv0V43OjdUK2tVuIHYblx1X9dOlLdpTEg9FCfjiZTQVor1pTwcbG62qNMozanlf
lQqI6f63E0jugHqhrqrfBvl4lUuoajN5SvXfBNFDIzxwWBGSdr+hJQXstUatfSZZ
MdmJX+Dk5cAk4CpQQ1ofPvYkS3Ade0vxaL4H++KHYtRvpPvxCXA=
=XQFF
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Properly init uclamp_flags of a runqueue, on first enqueuing
- Fix preempt= callback return values
- Correct utime/stime resource usage reporting on nohz_full to return
the proper times instead of shorter ones
* tag 'sched_urgent_for_v5.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/uclamp: Fix rq->uclamp_max not set on first enqueue
preempt/dynamic: Fix setup_preempt_mode() return value
sched/cputime: Fix getrusage(RUSAGE_THREAD) with nohz_full
BPF_LOG_KERNEL is only used internally, so disallow bpf_btf_load()
to set log level as BPF_LOG_KERNEL. The same checking has already
been done in bpf_check(), so factor out a helper to check the
validity of log attributes and use it in both places.
Fixes: 8580ac9404 ("bpf: Process in-kernel BTF")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211203053001.740945-1-houtao1@huawei.com
Commit d81ae8aac8 ("sched/uclamp: Fix initialization of struct
uclamp_rq") introduced a bug where uclamp_max of the rq is not reset to
match the woken up task's uclamp_max when the rq is idle.
The code was relying on rq->uclamp_max initialized to zero, so on first
enqueue
static inline void uclamp_rq_inc_id(struct rq *rq, struct task_struct *p,
enum uclamp_id clamp_id)
{
...
if (uc_se->value > READ_ONCE(uc_rq->value))
WRITE_ONCE(uc_rq->value, uc_se->value);
}
was actually resetting it. But since commit d81ae8aac8 changed the
default to 1024, this no longer works. And since rq->uclamp_flags is
also initialized to 0, neither above code path nor uclamp_idle_reset()
update the rq->uclamp_max on first wake up from idle.
This is only visible from first wake up(s) until the first dequeue to
idle after enabling the static key. And it only matters if the
uclamp_max of this task is < 1024 since only then its uclamp_max will be
effectively ignored.
Fix it by properly initializing rq->uclamp_flags = UCLAMP_FLAG_IDLE to
ensure uclamp_idle_reset() is called which then will update the rq
uclamp_max value as expected.
Fixes: d81ae8aac8 ("sched/uclamp: Fix initialization of struct uclamp_rq")
Signed-off-by: Qais Yousef <qais.yousef@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <Valentin.Schneider@arm.com>
Tested-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20211202112033.1705279-1-qais.yousef@arm.com
__setup() callbacks expect 1 for success and 0 for failure. Correct the
usage here to reflect that.
Fixes: 826bfeb37b ("preempt/dynamic: Support dynamic preempt with preempt= boot option")
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Andrew Halaney <ahalaney@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211203233203.133581-1-ahalaney@redhat.com
The first commit cited below attempts to fix the off-by-one error that
appeared in some comparisons with an open range. Due to this error,
arithmetically equivalent pieces of code could get different verdicts
from the verifier, for example (pseudocode):
// 1. Passes the verifier:
if (data + 8 > data_end)
return early
read *(u64 *)data, i.e. [data; data+7]
// 2. Rejected by the verifier (should still pass):
if (data + 7 >= data_end)
return early
read *(u64 *)data, i.e. [data; data+7]
The attempted fix, however, shifts the range by one in a wrong
direction, so the bug not only remains, but also such piece of code
starts failing in the verifier:
// 3. Rejected by the verifier, but the check is stricter than in #1.
if (data + 8 >= data_end)
return early
read *(u64 *)data, i.e. [data; data+7]
The change performed by that fix converted an off-by-one bug into
off-by-two. The second commit cited below added the BPF selftests
written to ensure than code chunks like #3 are rejected, however,
they should be accepted.
This commit fixes the off-by-two error by adjusting new_range in the
right direction and fixes the tests by changing the range into the
one that should actually fail.
Fixes: fb2a311a31 ("bpf: fix off by one for range markings with L{T, E} patterns")
Fixes: b37242c773 ("bpf: add test cases to bpf selftests to cover all access tests")
Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20211130181607.593149-1-maximmi@nvidia.com
When module registering its set is built-in, THIS_MODULE will be NULL,
hence we cannot return early in case owner is NULL.
Fixes: 14f267d95f ("bpf: btf: Introduce helpers for dynamic BTF set registration")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211122144742.477787-3-memxor@gmail.com
Vinicius Costa Gomes reported [0] that build fails when
CONFIG_DEBUG_INFO_BTF is enabled and CONFIG_BPF_SYSCALL is disabled.
This leads to btf.c not being compiled, and then no symbol being present
in vmlinux for the declarations in btf.h. Since BTF is not useful
without enabling BPF subsystem, disallow this combination.
However, theoretically disabling both now could still fail, as the
symbol for kfunc_btf_id_list variables is not available. This isn't a
problem as the compiler usually optimizes the whole register/unregister
call, but at lower optimization levels it can fail the build in linking
stage.
Fix that by adding dummy variables so that modules taking address of
them still work, but the whole thing is a noop.
[0]: https://lore.kernel.org/bpf/20211110205418.332403-1-vinicius.gomes@intel.com
Fixes: 14f267d95f ("bpf: btf: Introduce helpers for dynamic BTF set registration")
Reported-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211122144742.477787-2-memxor@gmail.com
Given BPF program's BTF root type name perform the following steps:
. search in vmlinux candidate cache.
. if (present in cache and candidate list >= 1) return candidate list.
. do a linear search through kernel BTFs for possible candidates.
. regardless of number of candidates found populate vmlinux cache.
. if (candidate list >= 1) return candidate list.
. search in module candidate cache.
. if (present in cache) return candidate list (even if list is empty).
. do a linear search through BTFs of all kernel modules
collecting candidates from all of them.
. regardless of number of candidates found populate module cache.
. return candidate list.
Then wire the result into bpf_core_apply_relo_insn().
When BPF program is trying to CO-RE relocate a type
that doesn't exist in either vmlinux BTF or in modules BTFs
these steps will perform 2 cache lookups when cache is hit.
Note the cache doesn't prevent the abuse by the program that might
have lots of relocations that cannot be resolved. Hence cond_resched().
CO-RE in the kernel requires CAP_BPF, since BTF loading requires it.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-9-alexei.starovoitov@gmail.com
Make BTF log size limit to be the same as the verifier log size limit.
Otherwise tools that progressively increase log size and use the same log
for BTF loading and program loading will be hitting hard to debug EINVAL.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-7-alexei.starovoitov@gmail.com
struct bpf_core_relo is generated by llvm and processed by libbpf.
It's a de-facto uapi.
With CO-RE in the kernel the struct bpf_core_relo becomes uapi de-jure.
Add an ability to pass a set of 'struct bpf_core_relo' to prog_load command
and let the kernel perform CO-RE relocations.
Note the struct bpf_line_info and struct bpf_func_info have the same
layout when passed from LLVM to libbpf and from libbpf to the kernel
except "insn_off" fields means "byte offset" when LLVM generates it.
Then libbpf converts it to "insn index" to pass to the kernel.
The struct bpf_core_relo's "insn_off" field is always "byte offset".
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-6-alexei.starovoitov@gmail.com
Make relo_core.c to be compiled for the kernel and for user space libbpf.
Note the patch is reducing BPF_CORE_SPEC_MAX_LEN from 64 to 32.
This is the maximum number of nested structs and arrays.
For example:
struct sample {
int a;
struct {
int b[10];
};
};
struct sample *s = ...;
int *y = &s->b[5];
This field access is encoded as "0:1:0:5" and spec len is 4.
The follow up patch might bump it back to 64.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-4-alexei.starovoitov@gmail.com
Rename btf_member_bit_offset() and btf_member_bitfield_size() to
avoid conflicts with similarly named helpers in libbpf's btf.h.
Rename the kernel helpers, since libbpf helpers are part of uapi.
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211201181040.23337-3-alexei.starovoitov@gmail.com
getrusage(RUSAGE_THREAD) with nohz_full may return shorter utime/stime
than the actual time.
task_cputime_adjusted() snapshots utime and stime and then adjust their
sum to match the scheduler maintained cputime.sum_exec_runtime.
Unfortunately in nohz_full, sum_exec_runtime is only updated once per
second in the worst case, causing a discrepancy against utime and stime
that can be updated anytime by the reader using vtime.
To fix this situation, perform an update of cputime.sum_exec_runtime
when the cputime snapshot reports the task as actually running while
the tick is disabled. The related overhead is then contained within the
relevant situations.
Reported-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Hasegawa Hitomi <hasegawa-hitomi@fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Acked-by: Phil Auld <pauld@redhat.com>
Link: https://lore.kernel.org/r/20211026141055.57358-3-frederic@kernel.org
When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.
Meanwhile on some rare case, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.
If jiffies stop being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in an tick storm.
Here is a scenario where it matters:
0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.
1) A stop machine callback is queued to execute somewhere.
2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
can still take IRQs.
3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.
4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
last_jiffies_update as a base. But last_jiffies_update hasn't been
updated for 2 jiffies since the timekeeper has interrupts disabled.
5) clockevents_program_event(), which relies on ktime_get(), observes
that the expiration is in the past and therefore programs the min
delta event on the clock.
6) The tick fires immediately, goto 3)
7) Tick storm, the nohz_full CPU is drown and takes ages to reach
MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.
Solve this with unconditionally updating jiffies if the value is stale
on nohz_full IRQ entry. IRQs and other disturbances are expected to be
rare enough on nohz_full for the unconditional call to ktime_get() to
actually matter.
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org
The 'kprobe::data_size' is unsigned, thus it can not be negative. But if
user sets it enough big number (e.g. (size_t)-8), the result of 'data_size
+ sizeof(struct kretprobe_instance)' becomes smaller than sizeof(struct
kretprobe_instance) or zero. In result, the kretprobe_instance are
allocated without enough memory, and kretprobe accesses outside of
allocated memory.
To avoid this issue, introduce a max limitation of the
kretprobe::data_size. 4KB per instance should be OK.
Link: https://lkml.kernel.org/r/163836995040.432120.10322772773821182925.stgit@devnote2
Cc: stable@vger.kernel.org
Fixes: f47cd9b553 ("kprobes: kretprobe user entry-handler")
Reported-by: zhangyue <zhangyue1@kylinos.cn>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
When comparing two strings for the "onmatch" histogram trigger, fields
that are strings use string comparisons, which do not care about being
signed or not.
Do not fail to match two string fields if one is unsigned char array and
the other is a signed char array.
Link: https://lore.kernel.org/all/20211129123043.5cfd687a@gandalf.local.home/
Cc: stable@vgerk.kernel.org
Cc: Tom Zanussi <zanussi@kernel.org>
Cc: Yafang Shao <laoar.shao@gmail.com>
Fixes: b05e89ae7c ("tracing: Accept different type for synthetic event fields")
Reviewed-by: Masami Hiramatsu <mhiramatsu@kernel.org>
Reported-by: Sven Schnelle <svens@linux.ibm.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
An extra newline will output for bpf_log() with BPF_LOG_KERNEL level
as shown below:
[ 52.095704] BPF:The function test_3 has 12 arguments. Too many.
[ 52.095704]
[ 52.096896] Error in parsing func ptr test_3 in struct bpf_dummy_ops
Now all bpf_log() are ended by newline, but not all btf_verifier_log()
are ended by newline, so checking whether or not the log message
has the trailing newline and adding a newline if not.
Also there is no need to calculate the left userspace buffer size
for kernel log output and to truncate the output by '\0' which
has already been done by vscnprintf(), so only do these for
userspace log output.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20211201073458.2731595-2-houtao1@huawei.com
This patch adds the kernel-side and API changes for a new helper
function, bpf_loop:
long bpf_loop(u32 nr_loops, void *callback_fn, void *callback_ctx,
u64 flags);
where long (*callback_fn)(u32 index, void *ctx);
bpf_loop invokes the "callback_fn" **nr_loops** times or until the
callback_fn returns 1. The callback_fn can only return 0 or 1, and
this is enforced by the verifier. The callback_fn index is zero-indexed.
A few things to please note:
~ The "u64 flags" parameter is currently unused but is included in
case a future use case for it arises.
~ In the kernel-side implementation of bpf_loop (kernel/bpf/bpf_iter.c),
bpf_callback_t is used as the callback function cast.
~ A program can have nested bpf_loop calls but the program must
still adhere to the verifier constraint of its stack depth (the stack depth
cannot exceed MAX_BPF_STACK))
~ Recursive callback_fns do not pass the verifier, due to the call stack
for these being too deep.
~ The next patch will include the tests and benchmark
Signed-off-by: Joanne Koong <joannekoong@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211130030622.4131246-2-joannekoong@fb.com
The eBPF name has completely taken over from eBPF in general usage for
the actual eBPF representation, or BPF for any general in-kernel use.
Prune all remaining references to "internal BPF".
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211119163215.971383-4-hch@lst.de
The comment telling that the prog_free helper is freeing the program is
not exactly useful, so just remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20211119163215.971383-3-hch@lst.de
left on the idle task's stack when a CPU is brought up after it was brought
down before.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGjr0UTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoadaD/0Q3hMjI+N3AigZiBToGccafOfsmiMH
fJ6fUM7gh4pTrGuoDQSGt02zYNYx9Zx7X8PpiuWAAIKbppiKmvniCgPMgMGARUBn
UQ/W2XWUiu/wtleRf4JtE6VwHciNVgLdnWIazRWsjDryUXVcJwhn8J1o5K6LnwjD
Rof/aYuVR47DprYG03OI0FD1GwlSPWMbAgB6OlJS6ZRvpq+7ergVKA0PQAY7ZZko
vBlDU7Sq4dJ2CE4aiRGLyLNhZfrubmfeMP2UVmVSpMBta7zs+YmaYjZvKfgO3KZT
OVbyFfDbL8FJgUmTSI1WBKq+W44o1D1e8VrKiCFj+y5w9diHW9OQEg2wqQdsMB6a
QgNgDZjg8UHancF5O2kNYjnUVGgxUww7PftWbxkg4VAUmlCzhbZAAegspZHow0mU
zcqDaMTky0FbcbB/Ukik/HG6J3KrR34GYjui3fe0wZHZlDim6azZucRTd+x9jRsB
jPUlE3DW0JfNFKcMnlLLNvS8h3j7iCbb3XDv1y4BW0+EB76IsCThjqFO0dIPpiju
T9ituTr6p4+B4U37Cz5qOMgUSha+f9/6blYG8NgCeHyD2l5HDnavO9lGhoP3jsZJ
LJRa8mWd+oZbZlpBtTkaSOA55cTxonsIuCseTdXlfsVtzuJBmLKwdRPuDSRCEo0G
xH1vNNUba86+6A==
=ne0K
-----END PGP SIGNATURE-----
Merge tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fix from Thomas Gleixner:
"A single scheduler fix to ensure that there is no stale KASAN shadow
state left on the idle task's stack when a CPU is brought up after it
was brought down before"
* tag 'sched-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/scs: Reset task stack state in bringup_cpu()
a trace point event as it's not possible to deliver a synchronous signal to
a different task from there.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGjrj0THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoRHRD/9T8sQw4arpmaFvB76m1LijsGrAuoXv
XH/gTcUupCdo0J1X8iEZfuGKx3C89BqLFaGpucK+9TCl6VMKHqtDTunciKV79tVQ
TcaTKYFwCwNrAQ0eATNzuM4RzzHGx0TK6u1DB0iFTSUJfAQ/EUE4/+yau2qDVfql
Pud/Fm5uHtqxDq5T9XqG3w324e8HWJr2johGMeg4ukbuKppRoNWlZcm75HndyK4m
OT8svA9Yg8GhSZNQ3q4HQTwof4zcGyaln+wxf7GWr9oryBPiqhHQuvWKXqDXLCVb
SbhsYmYcHEQgM3wpNaNqSf1LV1RoPuhFhgWB0te5SoVzoF7KpJLs+VIP/0q27Mcu
6aF7eTUG92NkR1uvSQ2d62UBE4EM0bFBvPaD4A5hLX1JAkVxHi+vxRFf5q0bUliO
Yybia4bv1WYwCVajBbpgwNDMKb4qacoIcXPlsjkRqkxk/vedOBkJadJnIEqc1iOl
Ld70jylQmj/TxmFM3iGk+QyFwFNpPnUxu0wws7A4YxYFknrhW+/8pcVTsUApBuYN
LWWiC08QelvQucCYGqpbEX37WA3DFXj4AHDp7nCJBkweMGhcgIBvZbz8yz/mgT7T
CTMkT5ZZY93mAWiXdagNJI4EWnjHZgeVtSlKRvF1D0J49SyKepqogOxNgi7KnW+/
tbCmxOTH9eA2Eg==
=yMum
-----END PGP SIGNATURE-----
Merge tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Thomas Gleixner:
"A single fix for perf to prevent it from sending SIGTRAP to another
task from a trace point event as it's not possible to deliver a
synchronous signal to a different task from there"
* tag 'perf-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Ignore sigtrap for tracepoints destined for other tasks
- Plug a race in the lock handoff which is caused by inconsistency of the
reader and writer path and can lead to corruption of the underlying
counter.
- down_read_trylock() is suboptimal when the lock is contended and
multiple readers trylock concurrently. That's due to the initial value
being read non-atomically which results in at least two compare exchange
loops. Making the initial readout atomic reduces this significantly.
Whith 40 readers by 11% in a benchmark which enforces contention on
mmap_sem.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGjrRITHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodsdEACRDUU5tkNVIgNTsGrO4IUhNW9fxyfG
3dCAzcQx9w1UjjBn23/B0c6rPsVqEv6hKouBGXqdOHj0kLx6Xn0IPMTvqycPL+mp
OyDzx+t773BlvTZyaYFa6vBiWbEVGzedDp6uLsYaBNo//4yN1WZY3mevTwzKVceX
WOoobHjsoh5Wfwr1XmNw+7HVhPaY0E50DaIuRQrJjNj1zsUhzJsjr/M1NpiqCaSm
PleDum3Dg0PD/pxdWtm34teuGQur0QknqPc2I6sZGnX0UMsCozeZAuH/MGnwwXec
fsweMXBVyDngOIZbFX/tPbVTocOpfxkYgJKXwIrlmVwHzFeT6KFfpEPXxVhUj6ao
3KNqD+V5VL2zdMF11WB2lVQaX2/48WIXz23ppiUA5R7tJTPr+yAIYIUzT2GFkMTr
u//41pxnoXlm9RCjANrbzGSl049exf01mMFVzm6zGt6PZqTE/kaBuklRy6Vibk/C
cSB7Iy/iVaySunmF6X5RuBT7HsKrIN6SgYRCHZ7BI9aelQpHztJuy4LZAbgRPZZU
/VKB2BKLx1KeRNfn6ScvF1uSSLmXoFVs0PP7HwMrPs3AdI+KaHmYLqZf+Bf4W1q2
5bAfj2x5qWwvMrV4RnwLltWAASw1G/o5fs8WhPA6cZkG9iZCB5EBCnHv4B0pm+oq
xw8RPYImZFzK8w==
=dKz+
-----END PGP SIGNATURE-----
Merge tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Thomas Gleixner:
"Two regression fixes for reader writer semaphores:
- Plug a race in the lock handoff which is caused by inconsistency of
the reader and writer path and can lead to corruption of the
underlying counter.
- down_read_trylock() is suboptimal when the lock is contended and
multiple readers trylock concurrently. That's due to the initial
value being read non-atomically which results in at least two
compare exchange loops. Making the initial readout atomic reduces
this significantly. Whith 40 readers by 11% in a benchmark which
enforces contention on mmap_sem"
* tag 'locking-urgent-2021-11-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rwsem: Optimize down_read_trylock() under highly contended case
locking/rwsem: Make handoff bit handling more consistent
- The setting of the pid filtering flag tested the "trace only this
pid" case twice, and ignored the "trace everything but this pid" case.
Note, the 5.15 kernel does things a little differently due to the new
sparse pid mask introduced in 5.16, and as the bug was discovered
running the 5.15 kernel, and the first fix was initially done for
that kernel, that fix handled both cases (only pid and all but pid),
but the forward port to 5.16 created this bug.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYaOnPxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqUTAP9KCOe2rZBjbn14xiCm/wbECjox58Uf
PrJ3fCDBVt8E0gEAjHkR3ybVE4xYLKj4RrO5GJ/pk/x1NeMmHdi+ls5hOQg=
=MZso
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull another tracing fix from Steven Rostedt:
"Fix the fix of pid filtering
The setting of the pid filtering flag tested the "trace only this pid"
case twice, and ignored the "trace everything but this pid" case.
The 5.15 kernel does things a little differently due to the new sparse
pid mask introduced in 5.16, and as the bug was discovered running the
5.15 kernel, and the first fix was initially done for that kernel,
that fix handled both cases (only pid and all but pid), but the
forward port to 5.16 created this bug"
* tag 'trace-v5.16-rc2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Test the 'Do not trace this pid' case in create event
When creating a new event (via a module, kprobe, eprobe, etc), the
descriptors that are created must add flags for pid filtering if an
instance has pid filtering enabled, as the flags are used at the time the
event is executed to know if pid filtering should be done or not.
The "Only trace this pid" case was added, but a cut and paste error made
that case checked twice, instead of checking the "Trace all but this pid"
case.
Link: https://lore.kernel.org/all/202111280401.qC0z99JB-lkp@intel.com/
Fixes: 6cb206508b ("tracing: Check pid filtering when creating events")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Have created events reflect the current state of pid filtering
- Test pid filtering on discard test of recorded logic.
(Also clean up the if statement to be cleaner).
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYaJ3ZhQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qhusAQC3nj0Xj4LRJXJtH4ALoJuthoBNoRHN
SslcuItuFLheyQD/URecPD2h4O+u/GQs1rjEUJ3B/mdzXojIrTz6Stagkwg=
=QCQF
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Two fixes to event pid filtering:
- Make sure newly created events reflect the current state of pid
filtering
- Take pid filtering into account when recording trigger events.
(Also clean up the if statement to be cleaner)"
* tag 'trace-v5.16-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix pid filtering when triggers are attached
tracing: Check pid filtering when creating events
If a event is filtered by pid and a trigger that requires processing of
the event to happen is a attached to the event, the discard portion does
not take the pid filtering into account, and the event will then be
recorded when it should not have been.
Cc: stable@vger.kernel.org
Fixes: 3fdaf80f4a ("tracing: Implement event pid filtering")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
- Make intel_pstate work correctly on Ice Lake server systems with
out-of-band performance control enabled (Adamos Ttofari).
- Fix EPP handling in intel_pstate during CPU offline and online in
the active mode (Rafael Wysocki).
- Make intel_pstate support ITMT on asymmetric systems with
overclocking enabled (Srinivas Pandruvada).
- Fix hibernation image saving when using the user space interface
based on the snapshot special device file (Evan Green).
- Make the hibernation code release the snapshot block device using
the same mode that was used when acquiring it (Thomas Zeitlhofer).
-----BEGIN PGP SIGNATURE-----
iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAmGhM8QSHHJqd0Byand5
c29ja2kubmV0AAoJEILEb/54YlRx/IkP/2VVQ2c56QZsGWmeyu6plAZBDXu69rHm
GeIO2/q0tEVZrIjmZkwPkSg0mKWw1cUEbiMq6pWShvSurJrko8Te3IECPE/2kOYO
R6crlBDxy2gcpoa5KIlKGz/qQBJPknDHMDSHE0kzmRokOl+/bCCgZkWWzRpR91EX
YlwBstvG1nd2F8Pi0UT59OTLVoTClIW5eTQRZtOY38Ip3PBiziMQAIwk/BFRtRSn
6H9xIdwg/KffTCmtMAq44O7Q5H5Kv6xhgJNNRlKClKnOCmMXGfuKaYDbzddkzEDW
8AAIt8mxZR9TWlhRJRbwTilcjQX/Ph0z2mpMmhcR9NdVm3g8rwHwrKxFirGOc4cQ
q3LXHma3csQ8PqagPoZV77rkBmVzd5ByYYYHQIZP7729WgzPlQ4XhDLU7+gd+eEI
pChycSNH9QNkgrBTvk7BTiD0C9EUYNIex2ptqf4sK7Tcr0pMSG2l9BjQBqQEyYns
O+fhkHkAuK+1dCJjhxcj6gAIuae2FEjjp1MOGkUVeozNwKKmx3ps4BcE9v5syuKi
HRJ72+8RTfV5FhEMZ7rPpWwibGJj6ZLYfuFUEngoWoc1t+sMkAIhpnadsEujcyIX
NzFpM3R0/LATNuYWquLiMHH3/AxOCe1Ezgc0cP8HaXYlZfb8k6p0IxkzNXWc3xLN
6m/j+ppjbXoK
=JN2D
-----END PGP SIGNATURE-----
Merge tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These address three issues in the intel_pstate driver and fix two
problems related to hibernation.
Specifics:
- Make intel_pstate work correctly on Ice Lake server systems with
out-of-band performance control enabled (Adamos Ttofari).
- Fix EPP handling in intel_pstate during CPU offline and online in
the active mode (Rafael Wysocki).
- Make intel_pstate support ITMT on asymmetric systems with
overclocking enabled (Srinivas Pandruvada).
- Fix hibernation image saving when using the user space interface
based on the snapshot special device file (Evan Green).
- Make the hibernation code release the snapshot block device using
the same mode that was used when acquiring it (Thomas Zeitlhofer)"
* tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Fix snapshot partial write lengths
PM: hibernate: use correct mode for swsusp_close()
cpufreq: intel_pstate: ITMT support for overclocked system
cpufreq: intel_pstate: Fix active mode offline/online EPP handling
cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs
When pid filtering is activated in an instance, all of the events trace
files for that instance has the PID_FILTER flag set. This determines
whether or not pid filtering needs to be done on the event, otherwise the
event is executed as normal.
If pid filtering is enabled when an event is created (via a dynamic event
or modules), its flag is not updated to reflect the current state, and the
events are not filtered properly.
Cc: stable@vger.kernel.org
Fixes: 3fdaf80f4a ("tracing: Implement event pid filtering")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
snapshot_write() is inappropriately limiting the amount of data that can
be written in cases where a partial page has already been written. For
example, one would expect to be able to write 1 byte, then 4095 bytes to
the snapshot device, and have both of those complete fully (since now
we're aligned to a page again). But what ends up happening is we write 1
byte, then 4094/4095 bytes complete successfully.
The reason is that simple_write_to_buffer()'s second argument is the
total size of the buffer, not the size of the buffer minus the offset.
Since simple_write_to_buffer() accounts for the offset in its
implementation, snapshot_write() can just pass the full page size
directly down.
Signed-off-by: Evan Green <evgreen@chromium.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 39fbef4b0f ("PM: hibernate: Get block device exclusively in
swsusp_check()") changed the opening mode of the block device to
(FMODE_READ | FMODE_EXCL).
In the corresponding calls to swsusp_close(), the mode is still just
FMODE_READ which triggers the warning in blkdev_flush_mapping() on
resume from hibernate.
So, use the mode (FMODE_READ | FMODE_EXCL) also when closing the
device.
Fixes: 39fbef4b0f ("PM: hibernate: Get block device exclusively in swsusp_check()")
Signed-off-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
To hot unplug a CPU, the idle task on that CPU calls a few layers of C
code before finally leaving the kernel. When KASAN is in use, poisoned
shadow is left around for each of the active stack frames, and when
shadow call stacks are in use. When shadow call stacks (SCS) are in use
the task's saved SCS SP is left pointing at an arbitrary point within
the task's shadow call stack.
When a CPU is offlined than onlined back into the kernel, this stale
state can adversely affect execution. Stale KASAN shadow can alias new
stackframes and result in bogus KASAN warnings. A stale SCS SP is
effectively a memory leak, and prevents a portion of the shadow call
stack being used. Across a number of hotplug cycles the idle task's
entire shadow call stack can become unusable.
We previously fixed the KASAN issue in commit:
e1b77c9298 ("sched/kasan: remove stale KASAN poison after hotplug")
... by removing any stale KASAN stack poison immediately prior to
onlining a CPU.
Subsequently in commit:
f1a0a376ca ("sched/core: Initialize the idle task with preemption disabled")
... the refactoring left the KASAN and SCS cleanup in one-time idle
thread initialization code rather than something invoked prior to each
CPU being onlined, breaking both as above.
We fixed SCS (but not KASAN) in commit:
63acd42c0d ("sched/scs: Reset the shadow stack when idle_task_exit")
... but as this runs in the context of the idle task being offlined it's
potentially fragile.
To fix these consistently and more robustly, reset the SCS SP and KASAN
shadow of a CPU's idle task immediately before we online that CPU in
bringup_cpu(). This ensures the idle task always has a consistent state
when it is running, and removes the need to so so when exiting an idle
task.
Whenever any thread is created, dup_task_struct() will give the task a
stack which is free of KASAN shadow, and initialize the task's SCS SP,
so there's no need to specially initialize either for idle thread within
init_idle(), as this was only necessary to handle hotplug cycles.
I've tested this on arm64 with:
* gcc 11.1.0, defconfig +KASAN_INLINE, KASAN_STACK
* clang 12.0.0, defconfig +KASAN_INLINE, KASAN_STACK, SHADOW_CALL_STACK
... offlining and onlining CPUS with:
| while true; do
| for C in /sys/devices/system/cpu/cpu*/online; do
| echo 0 > $C;
| echo 1 > $C;
| done
| done
Fixes: f1a0a376ca ("sched/core: Initialize the idle task with preemption disabled")
Reported-by: Qian Cai <quic_qiancai@quicinc.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Tested-by: Qian Cai <quic_qiancai@quicinc.com>
Link: https://lore.kernel.org/lkml/20211115113310.35693-1-mark.rutland@arm.com/
Add missing 'tu' variable initialization in the probes loop,
otherwise the head 'tu' is used instead of added probes.
Link: https://lkml.kernel.org/r/20211123142801.182530-1-jolsa@kernel.org
Cc: stable@vger.kernel.org
Fixes: 99c9a923e9 ("tracing/uprobe: Fix double perf_event linking on multiprobe uprobe")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
syzbot reported that the warning in perf_sigtrap() fires, saying that
the event's task does not match current:
| WARNING: CPU: 0 PID: 9090 at kernel/events/core.c:6446 perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| Modules linked in:
| CPU: 0 PID: 9090 Comm: syz-executor.1 Not tainted 5.15.0-syzkaller #0
| Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
| RIP: 0010:perf_sigtrap kernel/events/core.c:6446 [inline]
| RIP: 0010:perf_pending_event_disable kernel/events/core.c:6470 [inline]
| RIP: 0010:perf_pending_event+0x40d/0x4b0 kernel/events/core.c:6513
| ...
| Call Trace:
| <IRQ>
| irq_work_single+0x106/0x220 kernel/irq_work.c:211
| irq_work_run_list+0x6a/0x90 kernel/irq_work.c:242
| irq_work_run+0x4f/0xd0 kernel/irq_work.c:251
| __sysvec_irq_work+0x95/0x3d0 arch/x86/kernel/irq_work.c:22
| sysvec_irq_work+0x8e/0xc0 arch/x86/kernel/irq_work.c:17
| </IRQ>
| <TASK>
| asm_sysvec_irq_work+0x12/0x20 arch/x86/include/asm/idtentry.h:664
| RIP: 0010:__raw_spin_unlock_irqrestore include/linux/spinlock_api_smp.h:152 [inline]
| RIP: 0010:_raw_spin_unlock_irqrestore+0x38/0x70 kernel/locking/spinlock.c:194
| ...
| coredump_task_exit kernel/exit.c:371 [inline]
| do_exit+0x1865/0x25c0 kernel/exit.c:771
| do_group_exit+0xe7/0x290 kernel/exit.c:929
| get_signal+0x3b0/0x1ce0 kernel/signal.c:2820
| arch_do_signal_or_restart+0x2a9/0x1c40 arch/x86/kernel/signal.c:868
| handle_signal_work kernel/entry/common.c:148 [inline]
| exit_to_user_mode_loop kernel/entry/common.c:172 [inline]
| exit_to_user_mode_prepare+0x17d/0x290 kernel/entry/common.c:207
| __syscall_exit_to_user_mode_work kernel/entry/common.c:289 [inline]
| syscall_exit_to_user_mode+0x19/0x60 kernel/entry/common.c:300
| do_syscall_64+0x42/0xb0 arch/x86/entry/common.c:86
| entry_SYSCALL_64_after_hwframe+0x44/0xae
On x86 this shouldn't happen, which has arch_irq_work_raise().
The test program sets up a perf event with sigtrap set to fire on the
'sched_wakeup' tracepoint, which fired in ttwu_do_wakeup().
This happened because the 'sched_wakeup' tracepoint also takes a task
argument passed on to perf_tp_event(), which is used to deliver the
event to that other task.
Since we cannot deliver synchronous signals to other tasks, skip an event if
perf_tp_event() is targeted at another task and perf_event_attr::sigtrap is
set, which will avoid ever entering perf_sigtrap() for such events.
Fixes: 97ba62b278 ("perf: Add support for SIGTRAP on perf events")
Reported-by: syzbot+663359e32ce6f1a305ad@syzkaller.appspotmail.com
Signed-off-by: Marco Elver <elver@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/YYpoCOBmC/kJWfmI@elver.google.com
We found that a process with 10 thousnads threads has been encountered
a regression problem from Linux-v4.14 to Linux-v5.4. It is a kind of
workload which will concurrently allocate lots of memory in different
threads sometimes. In this case, we will see the down_read_trylock()
with a high hotspot. Therefore, we suppose that rwsem has a regression
at least since Linux-v5.4. In order to easily debug this problem, we
write a simply benchmark to create the similar situation lile the
following.
```c++
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sched.h>
#include <cstdio>
#include <cassert>
#include <thread>
#include <vector>
#include <chrono>
volatile int mutex;
void trigger(int cpu, char* ptr, std::size_t sz)
{
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
assert(pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0);
while (mutex);
for (std::size_t i = 0; i < sz; i += 4096) {
*ptr = '\0';
ptr += 4096;
}
}
int main(int argc, char* argv[])
{
std::size_t sz = 100;
if (argc > 1)
sz = atoi(argv[1]);
auto nproc = std:🧵:hardware_concurrency();
std::vector<std::thread> thr;
sz <<= 30;
auto* ptr = mmap(nullptr, sz, PROT_READ | PROT_WRITE, MAP_ANON |
MAP_PRIVATE, -1, 0);
assert(ptr != MAP_FAILED);
char* cptr = static_cast<char*>(ptr);
auto run = sz / nproc;
run = (run >> 12) << 12;
mutex = 1;
for (auto i = 0U; i < nproc; ++i) {
thr.emplace_back(std::thread([i, cptr, run]() { trigger(i, cptr, run); }));
cptr += run;
}
rusage usage_start;
getrusage(RUSAGE_SELF, &usage_start);
auto start = std::chrono::system_clock::now();
mutex = 0;
for (auto& t : thr)
t.join();
rusage usage_end;
getrusage(RUSAGE_SELF, &usage_end);
auto end = std::chrono::system_clock::now();
timeval utime;
timeval stime;
timersub(&usage_end.ru_utime, &usage_start.ru_utime, &utime);
timersub(&usage_end.ru_stime, &usage_start.ru_stime, &stime);
printf("usr: %ld.%06ld\n", utime.tv_sec, utime.tv_usec);
printf("sys: %ld.%06ld\n", stime.tv_sec, stime.tv_usec);
printf("real: %lu\n",
std::chrono::duration_cast<std::chrono::milliseconds>(end -
start).count());
return 0;
}
```
The functionality of above program is simply which creates `nproc`
threads and each of them are trying to touch memory (trigger page
fault) on different CPU. Then we will see the similar profile by
`perf top`.
25.55% [kernel] [k] down_read_trylock
14.78% [kernel] [k] handle_mm_fault
13.45% [kernel] [k] up_read
8.61% [kernel] [k] clear_page_erms
3.89% [kernel] [k] __do_page_fault
The highest hot instruction, which accounts for about 92%, in
down_read_trylock() is cmpxchg like the following.
91.89 │ lock cmpxchg %rdx,(%rdi)
Sice the problem is found by migrating from Linux-v4.14 to Linux-v5.4,
so we easily found that the commit ddb20d1d3a ("locking/rwsem: Optimize
down_read_trylock()") caused the regression. The reason is that the
commit assumes the rwsem is not contended at all. But it is not always
true for mmap lock which could be contended with thousands threads.
So most threads almost need to run at least 2 times of "cmpxchg" to
acquire the lock. The overhead of atomic operation is higher than
non-atomic instructions, which caused the regression.
By using the above benchmark, the real executing time on a x86-64 system
before and after the patch were:
Before Patch After Patch
# of Threads real real reduced by
------------ ------ ------ ----------
1 65,373 65,206 ~0.0%
4 15,467 15,378 ~0.5%
40 6,214 5,528 ~11.0%
For the uncontended case, the new down_read_trylock() is the same as
before. For the contended cases, the new down_read_trylock() is faster
than before. The more contended, the more fast.
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Waiman Long <longman@redhat.com>
Link: https://lore.kernel.org/r/20211118094455.9068-1-songmuchun@bytedance.com
There are some inconsistency in the way that the handoff bit is being
handled in readers and writers that lead to a race condition.
Firstly, when a queue head writer set the handoff bit, it will clear
it when the writer is being killed or interrupted on its way out
without acquiring the lock. That is not the case for a queue head
reader. The handoff bit will simply be inherited by the next waiter.
Secondly, in the out_nolock path of rwsem_down_read_slowpath(), both
the waiter and handoff bits are cleared if the wait queue becomes
empty. For rwsem_down_write_slowpath(), however, the handoff bit is
not checked and cleared if the wait queue is empty. This can
potentially make the handoff bit set with empty wait queue.
Worse, the situation in rwsem_down_write_slowpath() relies on wstate,
a variable set outside of the critical section containing the ->count
manipulation, this leads to race condition where RWSEM_FLAG_HANDOFF
can be double subtracted, corrupting ->count.
To make the handoff bit handling more consistent and robust, extract
out handoff bit clearing code into the new rwsem_del_waiter() helper
function. Also, completely eradicate wstate; always evaluate
everything inside the same critical section.
The common function will only use atomic_long_andnot() to clear bits
when the wait queue is empty to avoid possible race condition. If the
first waiter with handoff bit set is killed or interrupted to exit the
slowpath without acquiring the lock, the next waiter will inherit the
handoff bit.
While at it, simplify the trylock for loop in
rwsem_down_write_slowpath() to make it easier to read.
Fixes: 4f23dbc1e6 ("locking/rwsem: Implement lock handoff to prevent lock starvation")
Reported-by: Zhenhua Ma <mazhenhua@xiaomi.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211116012912.723980-1-longman@redhat.com
- Fix double free in destroy_hist_field
- Harden memset() of trace_iterator structure
- Do not warn in trace printk check when test buffer fills up
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYZgSTRQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qqsJAQDg6Oe0XMclYPLMyRlEJEMEV2bFh8ZQ
G1jqvMLcMnuFZAEA2onhzHzjR1amXuSw9YwNHcDB7eHiaIg9pgdOFFDUpwI=
=KTcf
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix double free in destroy_hist_field
- Harden memset() of trace_iterator structure
- Do not warn in trace printk check when test buffer fills up
* tag 'trace-v5.16-6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Don't use out-of-sync va_list in event printing
tracing: Use memset_startat() to zero struct trace_iterator
tracing/histogram: Fix UAF in destroy_hist_field()
Pull exit-vs-signal handling fixes from Eric Biederman:
"This is a small set of changes where debuggers were no longer able to
intercept synchronous SIGTRAP and SIGSEGV, introduced by the exit
cleanups.
This is essentially the change you suggested with all of i's dotted
and the t's crossed so that ptrace can intercept all of the cases it
has been able to intercept the past, and all of the cases that made it
to exit without giving ptrace a chance still don't give ptrace a
chance"
* 'SA_IMMUTABLE-fixes-for-v5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
signal: Replace force_fatal_sig with force_exit_sig when in doubt
signal: Don't always set SA_IMMUTABLE for forced signals
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.
Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit. This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.
This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.
In the future is should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig. That can be done where it matters on
a case-by-case basis with careful analysis.
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c02 ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27ad7 ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf07 ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444f8 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306d5 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d634 ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df8a ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5bf1 ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Recently to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered SA_IMMUTABLE was added.
Unfortunately this broke debuggers[1][2] which reasonably expect to be
able to trap synchronous SIGTRAP and SIGSEGV even when the target
process is not configured to handle those signals.
Update force_sig_to_task to support both the case when we can allow
the debugger to intercept and possibly ignore the signal and the case
when it is not safe to let userspace know about the signal until the
process has exited.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Cc: stable@vger.kernel.org
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da29c ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Link: https://lkml.kernel.org/r/877dd5qfw5.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
If trace_seq becomes full, trace_seq_vprintf() no longer consumes
arguments from va_list, making va_list out of sync with format
processing by trace_check_vprintf().
This causes va_arg() in trace_check_vprintf() to return wrong
positional argument, which results into a WARN_ON_ONCE() hit.
ftrace_stress_test from LTP triggers this situation.
Fix it by explicitly avoiding further use if va_list at the point
when it's consistency can no longer be guaranteed.
Link: https://lkml.kernel.org/r/20211118145516.13219-1-nikita.yushchenko@virtuozzo.com
Signed-off-by: Nikita Yushchenko <nikita.yushchenko@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
In preparation for FORTIFY_SOURCE performing compile-time and run-time
field bounds checking for memset(), avoid intentionally writing across
neighboring fields.
Use memset_startat() to avoid confusing memset() about writing beyond
the target struct member.
Link: https://lkml.kernel.org/r/20211118202217.1285588-1-keescook@chromium.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Current release - regressions:
- devlink: don't throw an error if flash notification sent before
devlink visible
- page_pool: Revert "page_pool: disable dma mapping support...",
turns out there are active arches who need it
Current release - new code bugs:
- amt: cancel delayed_work synchronously in amt_fini()
Previous releases - regressions:
- xsk: fix crash on double free in buffer pool
- bpf: fix inner map state pruning regression causing program
rejections
- mac80211: drop check for DONT_REORDER in __ieee80211_select_queue,
preventing mis-selecting the best effort queue
- mac80211: do not access the IV when it was stripped
- mac80211: fix radiotap header generation, off-by-one
- nl80211: fix getting radio statistics in survey dump
- e100: fix device suspend/resume
Previous releases - always broken:
- tcp: fix uninitialized access in skb frags array for Rx 0cp
- bpf: fix toctou on read-only map's constant scalar tracking
- bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
- tipc: only accept encrypted MSG_CRYPTO msgs
- smc: transfer remaining wait queue entries during fallback,
fix missing wake ups
- udp: validate checksum in udp_read_sock() (when sockmap is used)
- sched: act_mirred: drop dst for the direction from egress to ingress
- virtio_net_hdr_to_skb: count transport header in UFO, prevent
allowing bad skbs into the stack
- nfc: reorder the logic in nfc_{un,}register_device, fix unregister
- ipsec: check return value of ipv6_skip_exthdr
- usb: r8152: add MAC passthrough support for more Lenovo Docks
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGWf08ACgkQMUZtbf5S
Irt+lxAAj8FAoLoSmQKUK3LttLLh0ZQQXu8Riey+wrP8Z9Yp8xWXIaVRF1c0vCE6
clbrF+mLfk6Wvv/RzOgwyBMHvK+djr/oVDNSmjlRvss4MLDfOQZhUV8V4XpvF4Up
hI7wyKfHtd7niosNqil6wklJFpLU8WyIAWrPSIPE6JlPkJmcm3GUGsliwEPwdLY1
yl7z4zsxigjA+hKxYqNQX6tixF3xnbDUbAnWshrSPL89melwz4GMao45qmcxJEVr
EipPhKifk0hT067jG08FMXcKBFKt6rKk7SVNo4mtq8Tl6HleJBj8fdaJAjSdFahB
+rYJ0sDZwGoDL5CxZ5mD3fM1cDgh4WFEM0z//0b/bZhoPDRKEpLr9LPuv+N6+/rA
8D98EHsvyNjlFgdyd8celMstiGtBn4YLEoLNYYh9Qibgm0XsCuv0yox7g0AOLMmQ
QiBmh2EnaXNPQ8PRZNMK3VH5ol2KoYWL6yrpJYV+wOWVLfezwlSsjkPSfW5pF9FG
hU0iQBp/YTCdCadR9YLj8qfDWDUAkCN7WpqIu9EA9FXJcYjJVaix0MA/tAVlzXyR
xlB7cU6O5NABcs/+04zPkKLwKbVYNMqgvKE+FVDVm+BKxo0UMxcmz/Np/ZYxfhkF
bwKplaiPb2H4D6t0sdxqaeYirPwt1BcleLilae6vHG1jO90H9Vw=
=tlqV
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, mac80211.
Current release - regressions:
- devlink: don't throw an error if flash notification sent before
devlink visible
- page_pool: Revert "page_pool: disable dma mapping support...",
turns out there are active arches who need it
Current release - new code bugs:
- amt: cancel delayed_work synchronously in amt_fini()
Previous releases - regressions:
- xsk: fix crash on double free in buffer pool
- bpf: fix inner map state pruning regression causing program
rejections
- mac80211: drop check for DONT_REORDER in __ieee80211_select_queue,
preventing mis-selecting the best effort queue
- mac80211: do not access the IV when it was stripped
- mac80211: fix radiotap header generation, off-by-one
- nl80211: fix getting radio statistics in survey dump
- e100: fix device suspend/resume
Previous releases - always broken:
- tcp: fix uninitialized access in skb frags array for Rx 0cp
- bpf: fix toctou on read-only map's constant scalar tracking
- bpf: forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing
progs
- tipc: only accept encrypted MSG_CRYPTO msgs
- smc: transfer remaining wait queue entries during fallback, fix
missing wake ups
- udp: validate checksum in udp_read_sock() (when sockmap is used)
- sched: act_mirred: drop dst for the direction from egress to
ingress
- virtio_net_hdr_to_skb: count transport header in UFO, prevent
allowing bad skbs into the stack
- nfc: reorder the logic in nfc_{un,}register_device, fix unregister
- ipsec: check return value of ipv6_skip_exthdr
- usb: r8152: add MAC passthrough support for more Lenovo Docks"
* tag 'net-5.16-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (96 commits)
ptp: ocp: Fix a couple NULL vs IS_ERR() checks
net: ethernet: dec: tulip: de4x5: fix possible array overflows in type3_infoblock()
net: tulip: de4x5: fix the problem that the array 'lp->phy[8]' may be out of bound
ipv6: check return value of ipv6_skip_exthdr
e100: fix device suspend/resume
devlink: Don't throw an error if flash notification sent before devlink visible
page_pool: Revert "page_pool: disable dma mapping support..."
ethernet: hisilicon: hns: hns_dsaf_misc: fix a possible array overflow in hns_dsaf_ge_srst_by_port()
octeontx2-af: debugfs: don't corrupt user memory
NFC: add NCI_UNREG flag to eliminate the race
NFC: reorder the logic in nfc_{un,}register_device
NFC: reorganize the functions in nci_request
tipc: check for null after calling kmemdup
i40e: Fix display error code in dmesg
i40e: Fix creation of first queue by omitting it if is not power of two
i40e: Fix warning message and call stack during rmmod i40e driver
i40e: Fix ping is lost after configuring ADq on VF
i40e: Fix changing previously set num_queue_pairs for PFs
i40e: Fix NULL ptr dereference on VSI filter sync
i40e: Fix correct max_pkt_size on VF RX queue
...
Calling destroy_hist_field() on an expression will recursively free
any operands associated with the expression. If during expression
parsing the operands of the expression are already set when an error
is encountered, there is no need to explicity free the operands. Doing
so will result in destroy_hist_field() being called twice for the
operands and lead to a use-after-free (UAF) error.
If the operands are associated with the expression, only call
destroy_hist_field() on the expression since the operands will be
recursively freed.
Link: https://lore.kernel.org/all/CAHk-=wgcrEbFgkw9720H3tW-AhHOoEKhYwZinYJw4FpzSaJ6_Q@mail.gmail.com/
Link: https://lkml.kernel.org/r/20211118011542.1420131-1-kaleshsingh@google.com
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 8b5d46fd7a ("tracing/histogram: Optimize division by constants")
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEESH4wyp42V4tXvYsjUqAMR0iAlPIFAmGWF6YACgkQUqAMR0iA
lPJJ+RAAm9pi/EElKKl+lOlBl+ehJlKuNLnPQWFmmaRc9xd0ruUipp1nsaktLJ8f
R/PkSwR/YWpBWlF8P4o+x9sOFyTNyLasoHtqsinEcAJI4lb7d1KOrPliTXyr15Ai
A303djwJmwCw5KxAPOjkG/nMBlpMIAQRee9GDWs1ykfSlIsI4jp7isVbCFNCQNVF
auHYq1bfJ5MJYPjxIDZUt+NF7kg7dD4k4g+VCVjaH1u8pGeaCUCtnNjMFOk1XfU8
yFQnaDcrAu4zJPq3d74z4eN9Bk+su8+DhnfrAEFjuFxGTgYc2MyRt0gGFeiUtNs4
rvST/eHBO4zeuL18S8G+fLcig/9ZYE73xzjdOCzRzLDjn0VQr9t06ez1QqJOb4D6
A4SSufwek5NIqYKMlhV/az2EceQYK8Wv3KAz8w98KDfwvVVhUSgE23MbTCO0hvQU
PWF35d3hQ+9oH0ZGYRumb8OpXtKJ+2KmzyN8Z0xhivHFBIKlcW6IBGhWRANclJO8
jNAM3jiwi8fRDVM2wI1fmgeEmMhG+WuTI3dJVu3tu4vI923FW5GdY6ev5EvH0Ts0
khTwIjtmCHUJGSeWajy3Gi9irdyhPyPNRMqgal4GS+gGpVU2mMMKTG+NzxxtCRKR
BUgfCjFDoDJWrNWIzzOwNqgF0Y+V9GgCZOkb73u/y+xVx0Rmc6U=
=wbBy
-----END PGP SIGNATURE-----
Merge tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux
Pull printk fixes from Petr Mladek:
- Try to flush backtraces from other CPUs also on the local one. This
was a regression caused by printk_safe buffers removal.
- Remove header dependency warning.
* tag 'printk-for-5.16-fixup' of git://git.kernel.org/pub/scm/linux/kernel/git/printk/linux:
printk: Remove printk.h inclusion in percpu.h
printk: restore flushing of NMI buffers on remote CPUs after NMI backtraces
Daniel Borkmann says:
====================
pull-request: bpf 2021-11-16
We've added 12 non-merge commits during the last 5 day(s) which contain
a total of 23 files changed, 573 insertions(+), 73 deletions(-).
The main changes are:
1) Fix pruning regression where verifier went overly conservative rejecting
previsouly accepted programs, from Alexei Starovoitov and Lorenz Bauer.
2) Fix verifier TOCTOU bug when using read-only map's values as constant
scalars during verification, from Daniel Borkmann.
3) Fix a crash due to a double free in XSK's buffer pool, from Magnus Karlsson.
4) Fix libbpf regression when cross-building runqslower, from Jean-Philippe Brucker.
5) Forbid use of bpf_ktime_get_coarse_ns() and bpf_timer_*() helpers in tracing
programs due to deadlock possibilities, from Dmitrii Banshchikov.
6) Fix checksum validation in sockmap's udp_read_sock() callback, from Cong Wang.
7) Various BPF sample fixes such as XDP stats in xdp_sample_user, from Alexander Lobakin.
8) Fix libbpf gen_loader error handling wrt fd cleanup, from Kumar Kartikeya Dwivedi.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
udp: Validate checksum in udp_read_sock()
bpf: Fix toctou on read-only map's constant scalar tracking
samples/bpf: Fix build error due to -isystem removal
selftests/bpf: Add tests for restricted helpers
bpf: Forbid bpf_ktime_get_coarse_ns and bpf_timer_* in tracing progs
libbpf: Perform map fd cleanup for gen_loader in case of error
samples/bpf: Fix incorrect use of strlen in xdp_redirect_cpu
tools/runqslower: Fix cross-build
samples/bpf: Fix summary per-sec stats in xdp_sample_user
selftests/bpf: Check map in map pruning
bpf: Fix inner map state pruning regression.
xsk: Fix crash on double free in buffer pool
====================
Link: https://lore.kernel.org/r/20211116141134.6490-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In the current code, the actual max tail call count is 33 which is greater
than MAX_TAIL_CALL_CNT (defined as 32). The actual limit is not consistent
with the meaning of MAX_TAIL_CALL_CNT and thus confusing at first glance.
We can see the historical evolution from commit 04fd61ab36 ("bpf: allow
bpf programs to tail-call other bpf programs") and commit f9dabe016b
("bpf: Undo off-by-one in interpreter tail call count limit"). In order
to avoid changing existing behavior, the actual limit is 33 now, this is
reasonable.
After commit 874be05f52 ("bpf, tests: Add tail call test suite"), we can
see there exists failed testcase.
On all archs when CONFIG_BPF_JIT_ALWAYS_ON is not set:
# echo 0 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf
# dmesg | grep -w FAIL
Tail call error path, max count reached jited:0 ret 34 != 33 FAIL
On some archs:
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf
# dmesg | grep -w FAIL
Tail call error path, max count reached jited:1 ret 34 != 33 FAIL
Although the above failed testcase has been fixed in commit 18935a72eb
("bpf/tests: Fix error in tail call limit tests"), it would still be good
to change the value of MAX_TAIL_CALL_CNT from 32 to 33 to make the code
more readable.
The 32-bit x86 JIT was using a limit of 32, just fix the wrong comments and
limit to 33 tail calls as the constant MAX_TAIL_CALL_CNT updated. For the
mips64 JIT, use "ori" instead of "addiu" as suggested by Johan Almbladh.
For the riscv JIT, use RV_REG_TCC directly to save one register move as
suggested by Björn Töpel. For the other implementations, no function changes,
it does not change the current limit 33, the new value of MAX_TAIL_CALL_CNT
can reflect the actual max tail call count, the related tail call testcases
in test_bpf module and selftests can work well for the interpreter and the
JIT.
Here are the test results on x86_64:
# uname -m
x86_64
# echo 0 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf test_suite=test_tail_calls
# dmesg | tail -1
test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [0/8 JIT'ed]
# rmmod test_bpf
# echo 1 > /proc/sys/net/core/bpf_jit_enable
# modprobe test_bpf test_suite=test_tail_calls
# dmesg | tail -1
test_bpf: test_tail_calls: Summary: 8 PASSED, 0 FAILED, [8/8 JIT'ed]
# rmmod test_bpf
# ./test_progs -t tailcalls
#142 tailcalls:OK
Summary: 1/11 PASSED, 0 SKIPPED, 0 FAILED
Signed-off-by: Tiezhu Yang <yangtiezhu@loongson.cn>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Tested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Björn Töpel <bjorn@kernel.org>
Acked-by: Johan Almbladh <johan.almbladh@anyfinetworks.com>
Acked-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/bpf/1636075800-3264-1-git-send-email-yangtiezhu@loongson.cn
Commit a23740ec43 ("bpf: Track contents of read-only maps as scalars") is
checking whether maps are read-only both from BPF program side and user space
side, and then, given their content is constant, reading out their data via
map->ops->map_direct_value_addr() which is then subsequently used as known
scalar value for the register, that is, it is marked as __mark_reg_known()
with the read value at verification time. Before a23740ec43, the register
content was marked as an unknown scalar so the verifier could not make any
assumptions about the map content.
The current implementation however is prone to a TOCTOU race, meaning, the
value read as known scalar for the register is not guaranteed to be exactly
the same at a later point when the program is executed, and as such, the
prior made assumptions of the verifier with regards to the program will be
invalid which can cause issues such as OOB access, etc.
While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
specified at map creation time, the map->frozen property is initially set to
false for the map given the map value needs to be populated, e.g. for global
data sections. Once complete, the loader "freezes" the map from user space
such that no subsequent updates/deletes are possible anymore. For the rest
of the lifetime of the map, this freeze one-time trigger cannot be undone
anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
cmd calls which would update/delete map entries will be rejected with -EPERM
since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
means that pending update/delete map entries must still complete before this
guarantee is given. This corner case is not an issue for loaders since they
create and prepare such program private map in successive steps.
However, a malicious user is able to trigger this TOCTOU race in two different
ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
used to expand the competition interval, so that map_update_elem() can modify
the contents of the map after map_freeze() and bpf_prog_load() were executed.
This works, because userfaultfd halts the parallel thread which triggered a
map_update_elem() at the time where we copy key/value from the user buffer and
this already passed the FMODE_CAN_WRITE capability test given at that time the
map was not "frozen". Then, the main thread performs the map_freeze() and
bpf_prog_load(), and once that had completed successfully, the other thread
is woken up to complete the pending map_update_elem() which then changes the
map content. For ii) the idea of the batched update is similar, meaning, when
there are a large number of updates to be processed, it can increase the
competition interval between the two. It is therefore possible in practice to
modify the contents of the map after executing map_freeze() and bpf_prog_load().
One way to fix both i) and ii) at the same time is to expand the use of the
map's map->writecnt. The latter was introduced in fc9702273e ("bpf: Add mmap()
support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19be2 ("bpf:
Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
the rationale to make a writable mmap()'ing of a map mutually exclusive with
read-only freezing. The counter indicates writable mmap() mappings and then
prevents/fails the freeze operation. Its semantics can be expanded beyond
just mmap() by generally indicating ongoing write phases. This would essentially
span any parallel regular and batched flavor of update/delete operation and
then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
last pending writes have completed via bpf_map_write_active() test. Once the
map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
only then we are really guaranteed to use the map's data as known constants.
For map->frozen being set and pending writes in process of still being completed
we fall back to marking that register as unknown scalar so we don't end up
making assumptions about it. With this, both TOCTOU reproducers from i) and
ii) are fixed.
Note that the map->writecnt has been converted into a atomic64 in the fix in
order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
the freeze_mutex over entire map update/delete operations in syscall side
would not be possible due to then causing everything to be serialized.
Similarly, something like synchronize_rcu() after setting map->frozen to wait
for update/deletes to complete is not possible either since it would also
have to span the user copy which can sleep. On the libbpf side, this won't
break d66562fba1 ("libbpf: Add BPF object skeleton support") as the
anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
mmap()-ed memory where for .rodata it's non-writable.
Fixes: a23740ec43 ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann says:
====================
pull-request: bpf-next 2021-11-15
We've added 72 non-merge commits during the last 13 day(s) which contain
a total of 171 files changed, 2728 insertions(+), 1143 deletions(-).
The main changes are:
1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to
BTF such that BPF verifier will be able to detect misuse, from Yonghong Song.
2) Big batch of libbpf improvements including various fixes, future proofing APIs,
and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko.
3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the
programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush.
4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to
ensure exception handling still works, from Russell King and Alan Maguire.
5) Add a new bpf_find_vma() helper for tracing to map an address to the backing
file such as shared library, from Song Liu.
6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump,
updating documentation and bash-completion among others, from Quentin Monnet.
7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as
the API is heavily tailored around perf and is non-generic, from Dave Marchevsky.
8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an
opt-out for more relaxed BPF program requirements, from Stanislav Fomichev.
9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
bpftool: Use libbpf_get_error() to check error
bpftool: Fix mixed indentation in documentation
bpftool: Update the lists of names for maps and prog-attach types
bpftool: Fix indent in option lists in the documentation
bpftool: Remove inclusion of utilities.mak from Makefiles
bpftool: Fix memory leak in prog_dump()
selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
selftests/bpf: Fix an unused-but-set-variable compiler warning
bpf: Introduce btf_tracing_ids
bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
bpftool: Enable libbpf's strict mode by default
docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support
selftests/bpf: Clarify llvm dependency with btf_tag selftest
selftests/bpf: Add a C test for btf_type_tag
selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication
selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests
selftests/bpf: Test libbpf API function btf__add_type_tag()
bpftool: Support BTF_KIND_TYPE_TAG
libbpf: Support BTF_KIND_TYPE_TAG
...
====================
Link: https://lore.kernel.org/r/20211115162008.25916-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
A fix to only copy the size of the field to the histogram string
did not take into account that the size can be larger than the
storage.
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYZHGYBQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qi4RAP9Lr7RqTRQQ3C9BHZfCmIgwZtAqT+Z4
U+nHva6FcI9ufQEAtWAAHleVHUcfVB90mahMFSEnJ7yESKC3k1ZKXsTsYwo=
=X961
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Update to tracing histogram variable string copy
A fix to only copy the size of the field to the histogram string did
not take into account that the size can be larger than the storage"
* tag 'trace-v5.16-5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Add length protection to histogram string copies
The string copies to the histogram storage has a max size of 256 bytes
(defined by MAX_FILTER_STR_VAL). Only the string size of the event field
needs to be copied to the event storage, but no more than what is in the
event storage. Although nothing should be bigger than 256 bytes, there's
no protection against overwriting of the storage if one day there is.
Copy no more than the destination size, and enforce it.
Also had to turn MAX_FILTER_STR_VAL into an unsigned int, to keep the
min() comparison of the string sizes of comparable types.
Link: https://lore.kernel.org/all/CAHk-=wjREUihCGrtRBwfX47y_KrLCGjiq3t6QtoNJpmVrAEb1w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20211114132834.183429a4@rorschach.local.home
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 63f84ae6b8 ("tracing/histogram: Do not copy the fixed-size char array field over the field size")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
timer delivery stops working for a new child task because copy_process()
copies state information which is only valid for the parent task.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGRDVUTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocOFD/42NOdli73N+Jdq7APHUIHXzu+6DVT6
CI5toLQw+0KPoF0s1wg4+J0YCDt2k0Pu4lOabF3Ze/+c6RlR5zfCXESqsXdHaCjh
E91Vs57u0ataRMEHo6KB6eBIutuF8hyxfY6vVXfkTRNAreUIWiwWYrlB0G64JVOG
+/l1W7adovjLcLwcW+ArrnLJwkBKtXunK6PVv2IrdRHwpMHbwoNRCCCFvzkqnWmQ
4Yy2/NaB/PEBK5kezP1/j9EMcGCTWk1JJIm+l/PEwCCcbIgIdUahpW3XHAaqms6R
oukqCvE5ukfmVzBFYBhCamhF8heyEeBVRqGU+Yyk48+I+DQFBCqaqa1NKSuEUdNL
Nycy6Rp1yn7CHVSB461shMS6NJGOSNDBjv7vxer3WjV3HPJu7y0RrN7jXbkSfQnm
hVKjkmbDEYwylgzFE5+T857NqD5MEXeuIBtTO08hNRnpd61aB3x+qq+8ElE6ST8Y
pm6rMzw0AZ5buPK8QdGVDk0dD4WKObj1LzmRZvBtYeWynO6sxyKUl6B2CgAxrvn5
D1Li2/arkJMCVeIuIL5uE6DPoxSh8J7OuEC4KeWX8M8xQSEDImqfZ+tDL2Esv6jv
xDmymq584hiCBc1CJjCOA9kZYe6KNXC7lkVOns6GaKKzLhkrcvUR3dUGhMyzxAMO
t9QIAinR6JwRRA==
=EBbc
-----END PGP SIGNATURE-----
Merge tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A single fix for POSIX CPU timers to address a problem where POSIX CPU
timer delivery stops working for a new child task because
copy_process() copies state information which is only valid for the
parent task"
* tag 'timers-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
posix-cpu-timers: Clear task::posix_cputimers_work in copy_process()
- Core code:
A regression fix for the Open Firmware interrupt mapping code where a
interrupt controller property in a node caused a map property in the
same node to be ignored.
- Interrupt chip drivers:
- Workaround a limitation in SiFive PLIC interrupt chip which silently
ignores an EOI when the interrupt line is masked.
- Provide the missing mask/unmask implementation for the CSKY MP
interrupt controller.
- PCI/MSI:
- Prevent a use after free when PCI/MSI interrupts are released by
destroying the sysfs entries before freeing the memory which is
accessed in the sysfs show() function.
- Implement a mask quirk for the Nvidia ION AHCI chip which does not
advertise masking capability despite implementing it. Even worse the
chip comes out of reset with all MSI entries masked, which due to the
missing masking capability never get unmasked.
- Move the check which prevents accessing the MSI[X] masking for XEN
back into the low level accessors. The recent consolidation missed
that these accessors can be invoked from places which do not have
that check which broke XEN. Move them back to he original place
instead of sprinkling tons of these checks all over the code.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmGRDCsTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTL5D/4n7CUudohHPckr0Rl3LbnSUfyY9g3H
irTKur71AT392YerJtQp+WBp3AKYMDD8wPTgydfpWe95ouIjx5jhb/co7uSifG6k
ZssXYS10bkvjqyS8E2s5FnA5xbnagunK/R981qju14Ec39xqx1JzlUnO/Pra0Kcr
5rBV7br9jJMBleBI4OFuS9fS8dVL1MH/yushkuDNfIKEnaElnaxaYUk/ZdzkMMAW
lt1B+dPhK24t1hXQvZKp/iVQUGrJWdzzy9aDiUYPv1IZP+V5nbLMgmFvEv8jNdNa
6kkfp0l30nXM9rgvcp2KkasVUPVhurVEwitzz9+tT6LRA+/kSwi2yx8/FwCVUcL6
xD0AgKQgxOj/WwGJTZswvPu3afsLuw3rGmx5uH1IV40P9mPX0AiHWgvoaInHjzlJ
QKFQ7mJEuUcC6cJ36RGqX9njhKvPIcUENGCTjGSffcXsWltPrOCg2mQFcsDa9fSH
qPfXDVv4YINI+0MAlOULh6TLWQ07xy37HiskJu/AgILOfipoDi8pXdqNJRfvxB1S
D3O8vB+SH3lPj69w4dtj7539SdNZn8CCyN3RbNlstl2vHV5Bus3cVk0CcOhG8qNW
KwK/tSH8O0ZYHAsUu8OqBipXy6qOPi/10MJQn3NOpvvOmS4oDd+82bq+jp5qJpsG
42WNuzEoBdaUiA==
=LBQL
-----END PGP SIGNATURE-----
Merge tag 'irq-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of fixes for the interrupt subsystem
Core code:
- A regression fix for the Open Firmware interrupt mapping code where
a interrupt controller property in a node caused a map property in
the same node to be ignored.
Interrupt chip drivers:
- Workaround a limitation in SiFive PLIC interrupt chip which
silently ignores an EOI when the interrupt line is masked.
- Provide the missing mask/unmask implementation for the CSKY MP
interrupt controller.
PCI/MSI:
- Prevent a use after free when PCI/MSI interrupts are released by
destroying the sysfs entries before freeing the memory which is
accessed in the sysfs show() function.
- Implement a mask quirk for the Nvidia ION AHCI chip which does not
advertise masking capability despite implementing it. Even worse
the chip comes out of reset with all MSI entries masked, which due
to the missing masking capability never get unmasked.
- Move the check which prevents accessing the MSI[X] masking for XEN
back into the low level accessors. The recent consolidation missed
that these accessors can be invoked from places which do not have
that check which broke XEN. Move them back to he original place
instead of sprinkling tons of these checks all over the code"
* tag 'irq-urgent-2021-11-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
of/irq: Don't ignore interrupt-controller when interrupt-map failed
irqchip/sifive-plic: Fixup EOI failed when masked
irqchip/csky-mpintc: Fixup mask/unmask implementation
PCI/MSI: Destroy sysfs before freeing entries
PCI: Add MSI masking quirk for Nvidia ION AHCI
PCI/MSI: Deal with devices lying about their MSI mask capability
PCI/MSI: Move non-mask check back into low level accessors
the preemption model
- clear cluster CPU masks too, on the CPU unplug path
- prevent use-after-free in cfs
- Prevent a race condition when updating CPU cache domains
- Factor out common shared part of smp_prepare_cpus() into a common
helper which can be called by both baremetal and Xen, in order to fix a
booting of Xen PV guests
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGQ8HcACgkQEsHwGGHe
VUouoA//WAZ/dZu7IiM06JhZWswa2yNsdU8qQHys81lEqstaBqiWuZdg1qJTVIir
2d0aN0keiPcsLyAsp1UJ2g/K/7D5vSJWDzsHKfEAToiAm8Tntai2LlSocWWfeSQm
10grDHWpEHbj0hTHTA6HYOr2WbY4/LnR4cdL0WobIzivIrRTx49d0XUOUfWLP5KX
60uM6dSjwpJrQUnvzk+bhGiHVmutFrEJy+UU/0o+nxkdhwraNiSbLi0007BGRCof
6dokRRvLLR09dl1LMG51gVjQch4j/lCx6EWWUhYOFeV3I3gibSCNkmu7dpmMCBTR
QWO01cR9gyFN4xQ2is4I36M5L0/8T+sbGvvXIXNDT/XWr0/p+g6p2mx0cd2XiYIr
ZthGRcxxV/KGmxfPaygKS9tpQseMEIrdd6VjAnGfZ3OS6CtUvYt8d0B2Soj8FALQ
N9fMXDIEP3uUZim8UvCT6HBKlj9LR5uI5n+dAQ6uzsenO9WqeGeldc/N26/+osdN
vo4lNYTqiXJPhJvunYW5t4j5JnUa3grDHioAPWaQRJlWtEZBGKs9SXTcweg/KURb
mNfe1RfSlGJt28RD3E18gXeSS7xWdKgpcVX1rmW/9tUjX04NNDWjq4sAzOj7c+Ir
4sr78XgCY0pUxFaFYxvQWFUy7wcm0zAczo1RGUhcDTf1edDEvjo=
=s2MX
-----END PGP SIGNATURE-----
Merge tag 'sched_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Avoid touching ~100 config files in order to be able to select the
preemption model
- clear cluster CPU masks too, on the CPU unplug path
- prevent use-after-free in cfs
- Prevent a race condition when updating CPU cache domains
- Factor out common shared part of smp_prepare_cpus() into a common
helper which can be called by both baremetal and Xen, in order to fix
a booting of Xen PV guests
* tag 'sched_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
preempt: Restore preemption model selection configs
arch_topology: Fix missing clear cluster_cpumask in remove_cpu_topology()
sched/fair: Prevent dead task groups from regaining cfs_rq's
sched/core: Mitigate race cpus_share_cache()/update_top_cache_domain()
x86/smp: Factor out parts of native_smp_prepare_cpus()
reference to a PMU samples page has been acquired properly before that
- Make sure the LBR_SELECT MSR is saved/restored too
- Reset the LBR_SELECT MSR when resetting the LBR PMU to clear any
residual data left
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmGQ5z8ACgkQEsHwGGHe
VUqqdQ/+JIV6t0yIj7aNADaakwAe+i9zFzzUuvb5KT0zPzZirswkz6xeZ4g8S8PZ
lSjqKk8M2Yt3SJiqi/s3KNIOev52wtKGmeOFz1I+DUNpgk0wGHkRtVHV/iSptB61
Kp/fJvOVppY5grs5B0fRYkM5e477RPyZo+E0COKnff1bQ+k+z2ItMLCVxFCxQS6k
HmgPW7CBye811YcEg28lSwgS1OXiMZ19gACIsqnQ6kQP2Puo8+HT1/V1n+0grejb
OeYxURuYSRPd6Ft76qz0YlRIe1dgKllUBr7b0AaM11ADBMtWBTxqJcQvq/mOIHmT
9to0dVB/xFySR57iaL7BRuZFOrt8MRqJniEedMO99Dm9sxEVfHs1iXC9r7wZxQAf
/HcvVkcyOJD92Kv+4LS5tKjowCByOYEJW2YQIgXEbA6oIhRuM9/fdxEW6lHwgdwc
BPnOR6rtYuq+I+merBIIijAuf8OsIGY7ap2B+f7DkiOtA9+SHZsrU22J8T7CED/w
gmrAC3+3KGt7YDs6WZTbvkXminZQyu5WpHe+2K6dlCIPmJLqEsYUx8TeXa/okyvb
8ZXy/CfJNbHUrk6GZw7RFoeannwSPv9ZJO3Mfy5PDvwDk0Fj0J+/G92mR2Zucxpo
siNyBCivPY5vBPqk+x6eUPev/C3wPS+dNrs4HOyr1N2gZwgTk40=
=Ciqw
-----END PGP SIGNATURE-----
Merge tag 'perf_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Prevent unintentional page sharing by checking whether a page
reference to a PMU samples page has been acquired properly before
that
- Make sure the LBR_SELECT MSR is saved/restored too
- Reset the LBR_SELECT MSR when resetting the LBR PMU to clear any
residual data left
* tag 'perf_urgent_for_v5.16_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: Avoid put_page() when GUP fails
perf/x86/vlbr: Add c->flags to vlbr event constraints
perf/x86/lbr: Reset LBR_SELECT during vlbr reset
- Make local osnoise_instances static
- Copy just actual size of histogram strings
- Properly check missing operands in histogram expressions
-----BEGIN PGP SIGNATURE-----
iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCYY++DxQccm9zdGVkdEBn
b29kbWlzLm9yZwAKCRAp5XQQmuv6qn93AQD9sBFtm7D/90P8KMp/yl75OTd1InGm
uZPOioR/itFXBwD6A4Y4xbpN0aWByM4P31pqFjZRxY0wmInHw3fkd8EjmQM=
=LgAs
-----END PGP SIGNATURE-----
Merge tag 'trace-v5.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Three tracing fixes:
- Make local osnoise_instances static
- Copy just actual size of histogram strings
- Properly check missing operands in histogram expressions"
* tag 'trace-v5.16-4' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/histogram: Fix check for missing operands in an expression
tracing/histogram: Do not copy the fixed-size char array field over the field size
tracing/osnoise: Make osnoise_instances static
If a binary operation is detected while parsing an expression string,
the operand strings are deduced by splitting the experssion string at
the position of the detected binary operator. Both operand strings are
sub-strings (can be empty string) of the expression string but will
never be NULL.
Currently a NULL check is used for missing operands, fix this by
checking for empty strings instead.
Link: https://lkml.kernel.org/r/20211112191324.1302505-1-kaleshsingh@google.com
Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
Fixes: 9710b2f341 ("tracing: Fix operator precedence for hist triggers expression")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Do not copy the fixed-size char array field of the events over
the field size. The histogram treats char array as a string and
there are 2 types of char array in the event, fixed-size and
dynamic string. The dynamic string (__data_loc) field must be
null terminated, but the fixed-size char array field may not
be null terminated (not a string, but just a data).
In that case, histogram can copy the data after the field.
This uses the original field size for fixed-size char array
field to restrict the histogram not to access over the original
field size.
Link: https://lkml.kernel.org/r/163673292822.195747.3696966210526410250.stgit@devnote2
Fixes: 02205a6752 (tracing: Add support for 'field variables')
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Pull input updates from Dmitry Torokhov:
"Just one new driver (Cypress StreetFighter touchkey), and no input
core changes this time.
Plus various fixes and enhancements to existing drivers"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input: (54 commits)
Input: iforce - fix control-message timeout
Input: wacom_i2c - use macros for the bit masks
Input: ili210x - reduce sample period to 15ms
Input: ili210x - improve polled sample spacing
Input: ili210x - special case ili251x sample read out
Input: elantench - fix misreporting trackpoint coordinates
Input: synaptics-rmi4 - Fix device hierarchy
Input: i8042 - Add quirk for Fujitsu Lifebook T725
Input: cap11xx - add support for cap1206
Input: remove unused header <linux/input/cy8ctmg110_pdata.h>
Input: ili210x - add ili251x firmware update support
Input: ili210x - export ili251x version details via sysfs
Input: ili210x - use resolution from ili251x firmware
Input: pm8941-pwrkey - respect reboot_mode for warm reset
reboot: export symbol 'reboot_mode'
Input: max77693-haptic - drop unneeded MODULE_ALIAS
Input: cpcap-pwrbutton - do not set input parent explicitly
Input: max8925_onkey - don't mark comment as kernel-doc
Input: ads7846 - do not attempt IRQ workaround when deferring probe
Input: ads7846 - use input_set_capability()
...
Similar to btf_sock_ids, btf_tracing_ids provides btf ID for task_struct,
file, and vm_area_struct via easy to understand format like
btf_tracing_ids[BTF_TRACING_TYPE_[TASK|file|VMA]].
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20211112150243.1270987-3-songliubraving@fb.com
Introduction of map_uid made two lookups from outer map to be distinct.
That distinction is only necessary when inner map has an embedded timer.
Otherwise it will make the verifier state pruning to be conservative
which will cause complex programs to hit 1M insn_processed limit.
Tighten map_uid logic to apply to inner maps with timers only.
Fixes: 3e8ce29850 ("bpf: Prevent pointer mismatch in bpf_timer_init.")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/CACAyw99hVEJFoiBH_ZGyy=+oO-jyydoz6v1DeKPKs2HVsUH28w@mail.gmail.com
Link: https://lore.kernel.org/bpf/20211110172556.20754-1-alexei.starovoitov@gmail.com
LLVM patches ([1] for clang, [2] and [3] for BPF backend)
added support for btf_type_tag attributes. This patch
added support for the kernel.
The main motivation for btf_type_tag is to bring kernel
annotations __user, __rcu etc. to btf. With such information
available in btf, bpf verifier can detect mis-usages
and reject the program. For example, for __user tagged pointer,
developers can then use proper helper like bpf_probe_read_user()
etc. to read the data.
BTF_KIND_TYPE_TAG may also useful for other tracing
facility where instead of to require user to specify
kernel/user address type, the kernel can detect it
by itself with btf.
[1] https://reviews.llvm.org/D111199
[2] https://reviews.llvm.org/D113222
[3] https://reviews.llvm.org/D113496
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20211112012609.1505032-1-yhs@fb.com
This series contains initialization fixups, testing improvements, addition
of instruction pointer to data-race reports, and scoped data-race checks.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEbK7UrM+RBIrCoViJnr8S83LZ+4wFAmGNQO4THHBhdWxtY2tA
a2VybmVsLm9yZwAKCRCevxLzctn7jIECD/49FaTsFhtZdEDlvLI2u2QJnxkVjwda
PBZkJrB66jDk0Dyc0oUxOu4GGSw64vze8HOJxWhaBA4tmqWGDA0DmTqRFQ3VJ4uW
Csl1uCzkIR9R0dgkDFwkvnq2fNbcr4SwDu0i+7Iig3zws7nhnZlSJPSze6gFkVX2
mLtUXybSR4FvlFMRePHd6cxltmwUohLKOklsI6emOfnSgBBFQ3584wEZ2HN5KwwO
8EwVxE5YNWyZQKqIj76tUoa8qkWbp5SoiiK6mzSQbJpgX8gLN3GngeAc9ZrfY09R
aiSQK9FnkcNkpnRROKA6Go6ze5NGa+1NvF32swZ1nSYOb/LFBDtwt4G8Y8cqdmLv
UtsxjFX4hhxdZzBSbGK3GwDWtDLWgHrmf5K/qPNHkM+QwdoyS27C5Kzfs4jkbtZ0
rAEWBxTrtdTCd+xMIz04ZDlio05CqSqme2/t4xaxGpcYGHLcuSi3uFa1cRvfaew8
rSfq2WKd9Cu2dKmjyF+EtN4Y2o8l8IaxJyeq5bVrBHeijIBH0KdCWkeDhWIJcMmE
Wo36PYsFLyCdAwr66IoNFHvOxbtAQsERZa0/2FGlOKBAzntNA72BdlAFgKJWiLKg
M5K1Q+r7kfns/T1JhftTByryZBd5JM+OiZ/rwU0hCRY48L93ftTzGYSyLVfPBeZ0
lDgc/oJQziv9fA==
=MjQ1
-----END PGP SIGNATURE-----
Merge tag 'kcsan.2021.11.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu
Pull KCSAN updates from Paul McKenney:
"This contains initialization fixups, testing improvements, addition of
instruction pointer to data-race reports, and scoped data-race checks"
* tag 'kcsan.2021.11.11a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
kcsan: selftest: Cleanup and add missing __init
kcsan: Move ctx to start of argument list
kcsan: Support reporting scoped read-write access type
kcsan: Start stack trace with explicit location if provided
kcsan: Save instruction pointer for scoped accesses
kcsan: Add ability to pass instruction pointer of access to reporting
kcsan: test: Fix flaky test case
kcsan: test: Use kunit_skip() to skip tests
kcsan: test: Defer kcsan_test_init() after kunit initialization
and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different
from the tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit operations
to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries
to access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmGNQdwACgkQMUZtbf5S
IrsiMQ//f66lTJ8PJ5Qj70hX9dC897olx7uGHB9eiKoyOcJI459hFlfXwRU2T4Tf
fPNwPNUQ9Mynw9tX/jWEi+7zd6r6TSHGXK49U9/rIbQ95QjKY4LHowIE63x+vPl2
5Cpf+80zXC3DUX1fijgyG1ujnU3kBaqopTxDLmlsHw2PGkwT5Ox1DUwkhc370eEL
xlpq3PYGWA8/AQNyhSVBkG/UmoLaq0jYNP5yVcOj4jGjgcgLe1SLrqczENr35QHZ
cRkuBsFBMBZF7wSX2f9qQIB/+b1pcLlD9IO+K3S7Ruq+rUd7qfL/tmwNxEh0axYK
AyIun1Bxcy7QJGjtpGAz+Ku7jS9T3HxzyxhqilQo3co8jAW0WJ1YwHl+XPgQXyjV
DLG6Vxt4syiwsoSXGn8MQugs4nlBT+0qWl8YamIR+o7KkAYPc2QWkXlzEDfNeIW8
JNCZA3sy7VGi1ytorZGx16sQsEWnyRG9a6/WV20Dr+HVs1SKPcFzIfG6mVngR07T
mQMHnbAF6Z5d8VTcPQfMxd7UH48s1bHtk5lcSTa3j0Cw+GkA6ytTmjPdJ1qRcdkH
dl9jAfADe4O6frG+9XH7FEFqhmkghVI7bOCA4ZOhClVaIcDGgEZc2y7sY9/oZ7P4
KXBD2R5X1caCUM0UtzwL7/8ddOtPtHIrFnhY+7+I6ijt9qmI0BY=
=Ttgq
-----END PGP SIGNATURE-----
Merge tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf, can and netfilter.
Current release - regressions:
- bpf: do not reject when the stack read size is different from the
tracked scalar size
- net: fix premature exit from NAPI state polling in napi_disable()
- riscv, bpf: fix RV32 broken build, and silence RV64 warning
Current release - new code bugs:
- net: fix possible NULL deref in sock_reserve_memory
- amt: fix error return code in amt_init(); fix stopping the
workqueue
- ax88796c: use the correct ioctl callback
Previous releases - always broken:
- bpf: stop caching subprog index in the bpf_pseudo_func insn
- security: fixups for the security hooks in sctp
- nfc: add necessary privilege flags in netlink layer, limit
operations to admin only
- vsock: prevent unnecessary refcnt inc for non-blocking connect
- net/smc: fix sk_refcnt underflow on link down and fallback
- nfnetlink_queue: fix OOB when mac header was cleared
- can: j1939: ignore invalid messages per standard
- bpf, sockmap:
- fix race in ingress receive verdict with redirect to self
- fix incorrect sk_skb data_end access when src_reg = dst_reg
- strparser, and tls are reusing qdisc_skb_cb and colliding
- ethtool: fix ethtool msg len calculation for pause stats
- vlan: fix a UAF in vlan_dev_real_dev() when ref-holder tries to
access an unregistering real_dev
- udp6: make encap_rcv() bump the v6 not v4 stats
- drv: prestera: add explicit padding to fix m68k build
- drv: felix: fix broken VLAN-tagged PTP under VLAN-aware bridge
- drv: mvpp2: fix wrong SerDes reconfiguration order
Misc & small latecomers:
- ipvs: auto-load ipvs on genl access
- mctp: sanity check the struct sockaddr_mctp padding fields
- libfs: support RENAME_EXCHANGE in simple_rename()
- avoid double accounting for pure zerocopy skbs"
* tag 'net-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (123 commits)
selftests/net: udpgso_bench_rx: fix port argument
net: wwan: iosm: fix compilation warning
cxgb4: fix eeprom len when diagnostics not implemented
net: fix premature exit from NAPI state polling in napi_disable()
net/smc: fix sk_refcnt underflow on linkdown and fallback
net/mlx5: Lag, fix a potential Oops with mlx5_lag_create_definer()
gve: fix unmatched u64_stats_update_end()
net: ethernet: lantiq_etop: Fix compilation error
selftests: forwarding: Fix packet matching in mirroring selftests
vsock: prevent unnecessary refcnt inc for nonblocking connect
net: marvell: mvpp2: Fix wrong SerDes reconfiguration order
net: ethernet: ti: cpsw_ale: Fix access to un-initialized memory
net: stmmac: allow a tc-taprio base-time of zero
selftests: net: test_vxlan_under_vrf: fix HV connectivity test
net: hns3: allow configure ETS bandwidth of all TCs
net: hns3: remove check VF uc mac exist when set by PF
net: hns3: fix some mac statistics is always 0 in device version V2
net: hns3: fix kernel crash when unload VF while it is being reset
net: hns3: sync rx ring head in echo common pull
net: hns3: fix pfc packet number incorrect after querying pfc parameters
...
PEBS PERF_SAMPLE_PHYS_ADDR events use perf_virt_to_phys() to convert PMU
sampled virtual addresses to physical using get_user_page_fast_only()
and page_to_phys().
Some get_user_page_fast_only() error cases return false, indicating no
page reference, but still initialize the output page pointer with an
unreferenced page. In these error cases perf_virt_to_phys() calls
put_page(). This causes page reference count underflow, which can lead
to unintentional page sharing.
Fix perf_virt_to_phys() to only put_page() if get_user_page_fast_only()
returns a referenced page.
Fixes: fc7ce9c74c ("perf/core, x86: Add PERF_SAMPLE_PHYS_ADDR")
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20211111021814.757086-1-gthelen@google.com
Commit c597bfddc9 ("sched: Provide Kconfig support for default dynamic
preempt mode") changed the selectable config names for the preemption
model. This means a config file must now select
CONFIG_PREEMPT_BEHAVIOUR=y
rather than
CONFIG_PREEMPT=y
to get a preemptible kernel. This means all arch config files would need to
be updated - right now they'll all end up with the default
CONFIG_PREEMPT_NONE_BEHAVIOUR.
Rather than touch a good hundred of config files, restore usage of
CONFIG_PREEMPT{_NONE, _VOLUNTARY}. Make them configure:
o The build-time preemption model when !PREEMPT_DYNAMIC
o The default boot-time preemption model when PREEMPT_DYNAMIC
Add siblings of those configs with the _BUILD suffix to unconditionally
designate the build-time preemption model (PREEMPT_DYNAMIC is built with
the "highest" preemption model it supports, aka PREEMPT). Downstream
configs should by now all be depending / selected by CONFIG_PREEMPTION
rather than CONFIG_PREEMPT, so only a few sites need patching up.
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Marco Elver <elver@google.com>
Link: https://lore.kernel.org/r/20211110202448.4054153-2-valentin.schneider@arm.com
Kevin is reporting crashes which point to a use-after-free of a cfs_rq
in update_blocked_averages(). Initial debugging revealed that we've
live cfs_rq's (on_list=1) in an about to be kfree()'d task group in
free_fair_sched_group(). However, it was unclear how that can happen.
His kernel config happened to lead to a layout of struct sched_entity
that put the 'my_q' member directly into the middle of the object
which makes it incidentally overlap with SLUB's freelist pointer.
That, in combination with SLAB_FREELIST_HARDENED's freelist pointer
mangling, leads to a reliable access violation in form of a #GP which
made the UAF fail fast.
Michal seems to have run into the same issue[1]. He already correctly
diagnosed that commit a7b359fc6a ("sched/fair: Correctly insert
cfs_rq's to list on unthrottle") is causing the preconditions for the
UAF to happen by re-adding cfs_rq's also to task groups that have no
more running tasks, i.e. also to dead ones. His analysis, however,
misses the real root cause and it cannot be seen from the crash
backtrace only, as the real offender is tg_unthrottle_up() getting
called via sched_cfs_period_timer() via the timer interrupt at an
inconvenient time.
When unregister_fair_sched_group() unlinks all cfs_rq's from the dying
task group, it doesn't protect itself from getting interrupted. If the
timer interrupt triggers while we iterate over all CPUs or after
unregister_fair_sched_group() has finished but prior to unlinking the
task group, sched_cfs_period_timer() will execute and walk the list of
task groups, trying to unthrottle cfs_rq's, i.e. re-add them to the
dying task group. These will later -- in free_fair_sched_group() -- be
kfree()'ed while still being linked, leading to the fireworks Kevin
and Michal are seeing.
To fix this race, ensure the dying task group gets unlinked first.
However, simply switching the order of unregistering and unlinking the
task group isn't sufficient, as concurrent RCU walkers might still see
it, as can be seen below:
CPU1: CPU2:
: timer IRQ:
: do_sched_cfs_period_timer():
: :
: distribute_cfs_runtime():
: rcu_read_lock();
: :
: unthrottle_cfs_rq():
sched_offline_group(): :
: walk_tg_tree_from(…,tg_unthrottle_up,…):
list_del_rcu(&tg->list); :
(1) : list_for_each_entry_rcu(child, &parent->children, siblings)
: :
(2) list_del_rcu(&tg->siblings); :
: tg_unthrottle_up():
unregister_fair_sched_group(): struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
: :
list_del_leaf_cfs_rq(tg->cfs_rq[cpu]); :
: :
: if (!cfs_rq_is_decayed(cfs_rq) || cfs_rq->nr_running)
(3) : list_add_leaf_cfs_rq(cfs_rq);
: :
: :
: :
: :
: :
(4) : rcu_read_unlock();
CPU 2 walks the task group list in parallel to sched_offline_group(),
specifically, it'll read the soon to be unlinked task group entry at
(1). Unlinking it on CPU 1 at (2) therefore won't prevent CPU 2 from
still passing it on to tg_unthrottle_up(). CPU 1 now tries to unlink
all cfs_rq's via list_del_leaf_cfs_rq() in
unregister_fair_sched_group(). Meanwhile CPU 2 will re-add some of
these at (3), which is the cause of the UAF later on.
To prevent this additional race from happening, we need to wait until
walk_tg_tree_from() has finished traversing the task groups, i.e.
after the RCU read critical section ends in (4). Afterwards we're safe
to call unregister_fair_sched_group(), as each new walk won't see the
dying task group any more.
On top of that, we need to wait yet another RCU grace period after
unregister_fair_sched_group() to ensure print_cfs_stats(), which might
run concurrently, always sees valid objects, i.e. not already free'd
ones.
This patch survives Michal's reproducer[2] for 8h+ now, which used to
trigger within minutes before.
[1] https://lore.kernel.org/lkml/20211011172236.11223-1-mkoutny@suse.com/
[2] https://lore.kernel.org/lkml/20211102160228.GA57072@blackbody.suse.cz/
Fixes: a7b359fc6a ("sched/fair: Correctly insert cfs_rq's to list on unthrottle")
[peterz: shuffle code around a bit]
Reported-by: Kevin Tanguy <kevin.tanguy@corp.ovh.com>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Nothing protects the access to the per_cpu variable sd_llc_id. When testing
the same CPU (i.e. this_cpu == that_cpu), a race condition exists with
update_top_cache_domain(). One scenario being:
CPU1 CPU2
==================================================================
per_cpu(sd_llc_id, CPUX) => 0
partition_sched_domains_locked()
detach_destroy_domains()
cpus_share_cache(CPUX, CPUX) update_top_cache_domain(CPUX)
per_cpu(sd_llc_id, CPUX) => 0
per_cpu(sd_llc_id, CPUX) = CPUX
per_cpu(sd_llc_id, CPUX) => CPUX
return false
ttwu_queue_cond() wouldn't catch smp_processor_id() == cpu and the result
is a warning triggered from ttwu_queue_wakelist().
Avoid a such race in cpus_share_cache() by always returning true when
this_cpu == that_cpu.
Fixes: 518cd62341 ("sched: Only queue remote wakeups when crossing cache boundaries")
Reported-by: Jing-Ting Wu <jing-ting.wu@mediatek.com>
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lore.kernel.org/r/20211104175120.857087-1-vincent.donnefort@arm.com
The recent rework of PCI/MSI[X] masking moved the non-mask checks from the
low level accessors into the higher level mask/unmask functions.
This missed the fact that these accessors can be invoked from other places
as well. The missing checks break XEN-PV which sets pci_msi_ignore_mask and
also violates the virtual MSIX and the msi_attrib.maskbit protections.
Instead of sprinkling checks all over the place, lift them back into the
low level accessor functions. To avoid checking three different conditions
combine them into one property of msi_desc::msi_attrib.
[ josef: Fixed the missed conversion in the core code ]
Fixes: fcacdfbef5 ("PCI/MSI: Provide a new set of mask and unmask functions")
Reported-by: Josef Johansson <josef@oderland.se>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Josef Johansson <josef@oderland.se>
Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: stable@vger.kernel.org
Pull exit cleanups from Eric Biederman:
"While looking at some issues related to the exit path in the kernel I
found several instances where the code is not using the existing
abstractions properly.
This set of changes introduces force_fatal_sig a way of sending a
signal and not allowing it to be caught, and corrects the misuse of
the existing abstractions that I found.
A lot of the misuse of the existing abstractions are silly things such
as doing something after calling a no return function, rolling BUG by
hand, doing more work than necessary to terminate a kernel thread, or
calling do_exit(SIGKILL) instead of calling force_sig(SIGKILL).
In the review a deficiency in force_fatal_sig and force_sig_seccomp
where ptrace or sigaction could prevent the delivery of the signal was
found. I have added a change that adds SA_IMMUTABLE to change that
makes it impossible to interrupt the delivery of those signals, and
allows backporting to fix force_sig_seccomp
And Arnd found an issue where a function passed to kthread_run had the
wrong prototype, and after my cleanup was failing to build."
* 'exit-cleanups-for-v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
soc: ti: fix wkup_m3_rproc_boot_thread return type
signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed
signal: Replace force_sigsegv(SIGSEGV) with force_fatal_sig(SIGSEGV)
exit/r8188eu: Replace the macro thread_exit with a simple return 0
exit/rtl8712: Replace the macro thread_exit with a simple return 0
exit/rtl8723bs: Replace the macro thread_exit with a simple return 0
signal/x86: In emulate_vsyscall force a signal instead of calling do_exit
signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig
signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails
exit/syscall_user_dispatch: Send ordinary signals on failure
signal: Implement force_fatal_sig
exit/kthread: Have kernel threads return instead of calling do_exit
signal/s390: Use force_sigsegv in default_trap_handler
signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.
signal/vm86_32: Replace open coded BUG_ON with an actual BUG_ON
signal/sparc: In setup_tsb_params convert open coded BUG into BUG
signal/powerpc: On swapcontext failure force SIGSEGV
signal/sh: Use force_sig(SIGKILL) instead of do_group_exit(SIGKILL)
signal/mips: Update (_save|_restore)_fp_context to fail with -EFAULT
signal/sparc32: Remove unreachable do_exit in do_sparc_fault
...
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYYvEbgAKCRCRxhvAZXjc
og17AQDj+gsxk2lT4GsRo+WrI9qegGSvYHaxbOoqqSL6rHrrsQD+IU92dwVfuUXE
oP+De6/TBmsdygnlECxITp8p4ByhGAM=
=wi2X
-----END PGP SIGNATURE-----
Merge tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull prctl updates from Christian Brauner:
"This contains the missing prctl uapi pieces for PR_SCHED_CORE.
In order to activate core scheduling the caller is expected to specify
the scope of the new core scheduling domain.
For example, passing 2 in the 4th argument of
prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, <pid>, 2, 0);
would indicate that the new core scheduling domain encompasses all
tasks in the process group of <pid>. Specifying 0 would only create a
core scheduling domain for the thread identified by <pid> and 2 would
encompass the whole thread-group of <pid>.
Note, the values 0, 1, and 2 correspond to PIDTYPE_PID, PIDTYPE_TGID,
and PIDTYPE_PGID. A first version tried to expose those values
directly to which I objected because:
- PIDTYPE_* is an enum that is kernel internal which we should not
expose to userspace directly.
- PIDTYPE_* indicates what a given struct pid is used for it doesn't
express a scope.
But what the 4th argument of PR_SCHED_CORE prctl() expresses is the
scope of the operation, i.e. the scope of the core scheduling domain
at creation time. So Eugene's patch now simply introduces three new
defines PR_SCHED_CORE_SCOPE_THREAD, PR_SCHED_CORE_SCOPE_THREAD_GROUP,
and PR_SCHED_CORE_SCOPE_PROCESS_GROUP. They simply express what
happens.
This has been on the mailing list for quite a while with all relevant
scheduler folks Cced. I announced multiple times that I'd pick this up
if I don't see or her anyone else doing it. None of this touches
proper scheduler code but only concerns uapi so I think this is fine.
With core scheduling being quite common now for vm managers (e.g.
moving individual vcpu threads into their own core scheduling domain)
and container managers (e.g. moving the init process into its own core
scheduling domain and letting all created children inherit it) having
to rely on raw numbers passed as the 4th argument in prctl() is a bit
annoying and everyone is starting to come up with their own defines"
* tag 'kernel.sys.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
uapi/linux/prctl: provide macro definitions for the PR_SCHED_CORE type argument
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCYYvE0wAKCRCRxhvAZXjc
oo36AQCQRC9+LsfBsfoqrrdfWqp9ifs9DuytUg+CTftsy1Pn0QD/ZtySkNx9mnNl
0/lSTN5dJBfEYm6Xcfxuu/vu/iauhw0=
=dY6T
-----END PGP SIGNATURE-----
Merge tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull pidfd updates from Christian Brauner:
"Various places in the kernel have picked up pidfds.
The two most recent additions have probably been the ability to use
pidfds in bpf maps and the usage of pidfds in mm-based syscalls such
as process_mrelease() and process_madvise().
The same pattern to turn a pidfd into a struct task exists in two
places. One of those places used PIDTYPE_TGID while the other one used
PIDTYPE_PID even though it is clearly documented in all pidfd-helpers
that pidfds __currently__ only refer to thread-group leaders (subject
to change in the future if need be).
This isn't a bug per se but has the potential to be one if we allow
pidfds to refer to individual threads. If that happens we want to
audit all codepaths that make use of them to ensure they can deal with
pidfds refering to individual threads.
This adds a simple helper to turn a pidfd into a struct task making it
easy to grep for such places. Plus, it gets rid of code-duplication"
* tag 'pidfd.v5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
mm: use pidfd_get_task()
pid: add pidfd_get_task() helper
We can't call unregister_ftrace_function under ftrace_lock.
Link: https://lkml.kernel.org/r/20211109114217.1645296-1-jolsa@kernel.org
Fixes: ed29271894 ("ftrace/direct: Do not disable when switching direct callers")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
The resetting of the entire ring buffer use to simply go through and reset
each individual CPU buffer that had its own protection and synchronization.
But this was very slow, due to performing a synchronization for each CPU.
The code was reshuffled to do one disabling of all CPU buffers, followed
by a single RCU synchronization, and then the resetting of each of the CPU
buffers. But unfortunately, the mutex that prevented multiple occurrences
of resetting the buffer was not moved to the upper function, and there is
nothing to protect from it.
Take the ring buffer mutex around the global reset.
Cc: stable@vger.kernel.org
Fixes: b23d7a5f4a ("ring-buffer: speed up buffer resets by avoiding synchronize_rcu for each CPU")
Reported-by: "Tzvetomir Stoyanov (VMware)" <tz.stoyanov@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
printk from NMI context relies on irq work being raised on the local CPU
to print to console. This can be a problem if the NMI was raised by a
lockup detector to print lockup stack and regs, because the CPU may not
enable irqs (because it is locked up).
Introduce printk_trigger_flush() that can be called another CPU to try
to get those messages to the console, call that where printk_safe_flush
was previously called.
Fixes: 93d102f094 ("printk: remove safe buffers")
Cc: stable@vger.kernel.org # 5.15
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: John Ogness <john.ogness@linutronix.de>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211107045116.1754411-1-npiggin@gmail.com
- convert sparc32 to the generic dma-direct code
- use bitmap_zalloc (Christophe JAILLET)
-----BEGIN PGP SIGNATURE-----
iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmGKfNYLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYMEIRAAhOocEFpeaSg8iLMd7QLzm5vvzAuR43iykkKCvdvV
Q4P+g8H9Jr65ThsGS90AuuDKuyKh3tmbL7loHlyDygmRHhHALOO4127um4RAnOAL
1y2qCRwgHEZTu1uiu65cB+RRrlJP6T4sHV7+U3uZ3P5nfQoVVIoHKMceSTLIa3dx
WPyJXP33TWK50ZvGYuzMhO5hQPA8sKSePiaN3gz3anF0lMnqlUNh1Iso6nasUW40
XifOFM2Bg/SO7HpBGssrku6Zc5x9TpyuQtLP0u+LpjrbUYUZvz/OteyVu5cTZdbP
QG7MG6jcvDuU41sjKYNjaNpGZlvmXrEs4pXiwbOhzHTG8TFIEiR/LRsrvBGS7DJ8
y0NKNryIKR3+9fMKDH0PWHC7NszJbAQR0J7OT7+GP8cx9M62x5MuV8d2uOXp6TPY
v3VO0SJQrBZLKpY7vixZ6TOYMz15kmULMRrkGzf95+z5MpM2RjJ4lY8Kqlm2PBRR
Q3k53Ii8ya9U61SvgcCH39gR1fGT+WO8E5UFttCfhUhn49KJc7DqbEUiOC8Ta7QC
OONXxhGLdXAkti5NLFAexk8zdLBVRMnzfG44tBnP/JWDbQu3lMNuQfUXzsJK9yDb
zWr/832qwTIzT01NGZDFWdKUPNpafyuDQ1lP9rZZ2ZLo+f/EXNsHvczXvkwP08xS
cyY=
=DvuN
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-5.16' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
"Just a small set of changes this time. The request dma_direct_alloc
cleanups are still under review and haven't made the cut.
Summary:
- convert sparc32 to the generic dma-direct code
- use bitmap_zalloc (Christophe JAILLET)"
* tag 'dma-mapping-5.16' of git://git.infradead.org/users/hch/dma-mapping:
dma-mapping: use 'bitmap_zalloc()' when applicable
sparc32: use DMA_DIRECT_REMAP
sparc32: remove dma_make_coherent
sparc32: remove the call to dma_make_coherent in arch_dma_free
Merge more updates from Andrew Morton:
"87 patches.
Subsystems affected by this patch series: mm (pagecache and hugetlb),
procfs, misc, MAINTAINERS, lib, checkpatch, binfmt, kallsyms, ramfs,
init, codafs, nilfs2, hfs, crash_dump, signals, seq_file, fork,
sysvfs, kcov, gdb, resource, selftests, and ipc"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (87 commits)
ipc/ipc_sysctl.c: remove fallback for !CONFIG_PROC_SYSCTL
ipc: check checkpoint_restore_ns_capable() to modify C/R proc files
selftests/kselftest/runner/run_one(): allow running non-executable files
virtio-mem: disallow mapping virtio-mem memory via /dev/mem
kernel/resource: disallow access to exclusive system RAM regions
kernel/resource: clean up and optimize iomem_is_exclusive()
scripts/gdb: handle split debug for vmlinux
kcov: replace local_irq_save() with a local_lock_t
kcov: avoid enable+disable interrupts if !in_task()
kcov: allocate per-CPU memory on the relevant node
Documentation/kcov: define `ip' in the example
Documentation/kcov: include types.h in the example
sysv: use BUILD_BUG_ON instead of runtime check
kernel/fork.c: unshare(): use swap() to make code cleaner
seq_file: fix passing wrong private data
seq_file: move seq_escape() to a header
signal: remove duplicate include in signal.h
crash_dump: remove duplicate include in crash_dump.h
crash_dump: fix boolreturn.cocci warning
hfs/hfsplus: use WARN_ON for sanity check
...
virtio-mem dynamically exposes memory inside a device memory region as
system RAM to Linux, coordinating with the hypervisor which parts are
actually "plugged" and consequently usable/accessible.
On the one hand, the virtio-mem driver adds/removes whole memory blocks,
creating/removing busy IORESOURCE_SYSTEM_RAM resources, on the other
hand, it logically (un)plugs memory inside added memory blocks,
dynamically either exposing them to the buddy or hiding them from the
buddy and marking them PG_offline.
In contrast to physical devices, like a DIMM, the virtio-mem driver is
required to actually make use of any of the device-provided memory,
because it performs the handshake with the hypervisor. virtio-mem
memory cannot simply be access via /dev/mem without a driver.
There is no safe way to:
a) Access plugged memory blocks via /dev/mem, as they might contain
unplugged holes or might get silently unplugged by the virtio-mem
driver and consequently turned inaccessible.
b) Access unplugged memory blocks via /dev/mem because the virtio-mem
driver is required to make them actually accessible first.
The virtio-spec states that unplugged memory blocks MUST NOT be written,
and only selected unplugged memory blocks MAY be read. We want to make
sure, this is the case in sane environments -- where the virtio-mem driver
was loaded.
We want to make sure that in a sane environment, nobody "accidentially"
accesses unplugged memory inside the device managed region. For example,
a user might spot a memory region in /proc/iomem and try accessing it via
/dev/mem via gdb or dumping it via something else. By the time the mmap()
happens, the memory might already have been removed by the virtio-mem
driver silently: the mmap() would succeeed and user space might
accidentially access unplugged memory.
So once the driver was loaded and detected the device along the
device-managed region, we just want to disallow any access via /dev/mem to
it.
In an ideal world, we would mark the whole region as busy ("owned by a
driver") and exclude it; however, that would be wrong, as we don't really
have actual system RAM at these ranges added to Linux ("busy system RAM").
Instead, we want to mark such ranges as "not actual busy system RAM but
still soft-reserved and prepared by a driver for future use."
Let's teach iomem_is_exclusive() to reject access to any range with
"IORESOURCE_SYSTEM_RAM | IORESOURCE_EXCLUSIVE", even if not busy and even
if "iomem=relaxed" is set. Introduce EXCLUSIVE_SYSTEM_RAM to make it
easier for applicable drivers to depend on this setting in their Kconfig.
For now, there are no applicable ranges and we'll modify virtio-mem next
to properly set IORESOURCE_EXCLUSIVE on the parent resource container it
creates to contain all actual busy system RAM added via
add_memory_driver_managed().
Link: https://lkml.kernel.org/r/20210920142856.17758-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>