mirror of
https://github.com/torvalds/linux.git
synced 2024-11-22 12:11:40 +00:00
393397fbdc
45990 Commits
Author | SHA1 | Message | Date | |
---|---|---|---|---|
Hou Tao
|
393397fbdc |
bpf: Check the validity of nr_words in bpf_iter_bits_new()
Check the validity of nr_words in bpf_iter_bits_new(). Without this
check, when multiplication overflow occurs for nr_bits (e.g., when
nr_words = 0x0400-0001, nr_bits becomes 64), stack corruption may occur
due to bpf_probe_read_kernel_common(..., nr_bytes = 0x2000-0008).
Fix it by limiting the maximum value of nr_words to 511. The value is
derived from the current implementation of BPF memory allocator. To
ensure compatibility if the BPF memory allocator's size limitation
changes in the future, use the helper bpf_mem_alloc_check_size() to
check whether nr_bytes is too larger. And return -E2BIG instead of
-ENOMEM for oversized nr_bytes.
Fixes:
|
||
Hou Tao
|
62a898b07b |
bpf: Add bpf_mem_alloc_check_size() helper
Introduce bpf_mem_alloc_check_size() to check whether the allocation size exceeds the limitation for the kmalloc-equivalent allocator. The upper limit for percpu allocation is LLIST_NODE_SZ bytes larger than non-percpu allocation, so a percpu argument is added to the helper. The helper will be used in the following patch to check whether the size parameter passed to bpf_mem_alloc() is too big. Signed-off-by: Hou Tao <houtao1@huawei.com> Link: https://lore.kernel.org/r/20241030100516.3633640-3-houtao@huaweicloud.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
Hou Tao
|
101ccfbabf |
bpf: Free dynamically allocated bits in bpf_iter_bits_destroy()
bpf_iter_bits_destroy() uses "kit->nr_bits <= 64" to check whether the
bits are dynamically allocated. However, the check is incorrect and may
cause a kmemleak as shown below:
unreferenced object 0xffff88812628c8c0 (size 32):
comm "swapper/0", pid 1, jiffies 4294727320
hex dump (first 32 bytes):
b0 c1 55 f5 81 88 ff ff f0 f0 f0 f0 f0 f0 f0 f0 ..U...........
f0 f0 f0 f0 f0 f0 f0 f0 00 00 00 00 00 00 00 00 ..............
backtrace (crc 781e32cc):
[<00000000c452b4ab>] kmemleak_alloc+0x4b/0x80
[<0000000004e09f80>] __kmalloc_node_noprof+0x480/0x5c0
[<00000000597124d6>] __alloc.isra.0+0x89/0xb0
[<000000004ebfffcd>] alloc_bulk+0x2af/0x720
[<00000000d9c10145>] prefill_mem_cache+0x7f/0xb0
[<00000000ff9738ff>] bpf_mem_alloc_init+0x3e2/0x610
[<000000008b616eac>] bpf_global_ma_init+0x19/0x30
[<00000000fc473efc>] do_one_initcall+0xd3/0x3c0
[<00000000ec81498c>] kernel_init_freeable+0x66a/0x940
[<00000000b119f72f>] kernel_init+0x20/0x160
[<00000000f11ac9a7>] ret_from_fork+0x3c/0x70
[<0000000004671da4>] ret_from_fork_asm+0x1a/0x30
That is because nr_bits will be set as zero in bpf_iter_bits_next()
after all bits have been iterated.
Fix the issue by setting kit->bit to kit->nr_bits instead of setting
kit->nr_bits to zero when the iteration completes in
bpf_iter_bits_next(). In addition, use "!nr_bits || bits >= nr_bits" to
check whether the iteration is complete and still use "nr_bits > 64" to
indicate whether bits are dynamically allocated. The "!nr_bits" check is
necessary because bpf_iter_bits_new() may fail before setting
kit->nr_bits, and this condition will stop the iteration early instead
of accessing the zeroed or freed kit->bits.
Considering the initial value of kit->bits is -1 and the type of
kit->nr_bits is unsigned int, change the type of kit->nr_bits to int.
The potential overflow problem will be handled in the following patch.
Fixes:
|
||
Eduard Zingerman
|
d0b98f6a17 |
bpf: disallow 40-bytes extra stack for bpf_fastcall patterns
Hou Tao reported an issue with bpf_fastcall patterns allowing extra
stack space above MAX_BPF_STACK limit. This extra stack allowance is
not integrated properly with the following verifier parts:
- backtracking logic still assumes that stack can't exceed
MAX_BPF_STACK;
- bpf_verifier_env->scratched_stack_slots assumes only 64 slots are
available.
Here is an example of an issue with precision tracking
(note stack slot -8 tracked as precise instead of -520):
0: (b7) r1 = 42 ; R1_w=42
1: (b7) r2 = 42 ; R2_w=42
2: (7b) *(u64 *)(r10 -512) = r1 ; R1_w=42 R10=fp0 fp-512_w=42
3: (7b) *(u64 *)(r10 -520) = r2 ; R2_w=42 R10=fp0 fp-520_w=42
4: (85) call bpf_get_smp_processor_id#8 ; R0_w=scalar(...)
5: (79) r2 = *(u64 *)(r10 -520) ; R2_w=42 R10=fp0 fp-520_w=42
6: (79) r1 = *(u64 *)(r10 -512) ; R1_w=42 R10=fp0 fp-512_w=42
7: (bf) r3 = r10 ; R3_w=fp0 R10=fp0
8: (0f) r3 += r2
mark_precise: frame0: last_idx 8 first_idx 0 subseq_idx -1
mark_precise: frame0: regs=r2 stack= before 7: (bf) r3 = r10
mark_precise: frame0: regs=r2 stack= before 6: (79) r1 = *(u64 *)(r10 -512)
mark_precise: frame0: regs=r2 stack= before 5: (79) r2 = *(u64 *)(r10 -520)
mark_precise: frame0: regs= stack=-8 before 4: (85) call bpf_get_smp_processor_id#8
mark_precise: frame0: regs= stack=-8 before 3: (7b) *(u64 *)(r10 -520) = r2
mark_precise: frame0: regs=r2 stack= before 2: (7b) *(u64 *)(r10 -512) = r1
mark_precise: frame0: regs=r2 stack= before 1: (b7) r2 = 42
9: R2_w=42 R3_w=fp42
9: (95) exit
This patch disables the additional allowance for the moment.
Also, two test cases are removed:
- bpf_fastcall_max_stack_ok:
it fails w/o additional stack allowance;
- bpf_fastcall_max_stack_fail:
this test is no longer necessary, stack size follows
regular rules, pattern invalidation is checked by other
test cases.
Reported-by: Hou Tao <houtao@huaweicloud.com>
Closes: https://lore.kernel.org/bpf/20241023022752.172005-1-houtao@huaweicloud.com/
Fixes:
|
||
Byeonguk Jeong
|
13400ac8fb |
bpf: Fix out-of-bounds write in trie_get_next_key()
trie_get_next_key() allocates a node stack with size trie->max_prefixlen,
while it writes (trie->max_prefixlen + 1) nodes to the stack when it has
full paths from the root to leaves. For example, consider a trie with
max_prefixlen is 8, and the nodes with key 0x00/0, 0x00/1, 0x00/2, ...
0x00/8 inserted. Subsequent calls to trie_get_next_key with _key with
.prefixlen = 8 make 9 nodes be written on the node stack with size 8.
Fixes:
|
||
Eduard Zingerman
|
aa30eb3260 |
bpf: Force checkpoint when jmp history is too long
A specifically crafted program might trick verifier into growing very
long jump history within a single bpf_verifier_state instance.
Very long jump history makes mark_chain_precision() unreasonably slow,
especially in case if verifier processes a loop.
Mitigate this by forcing new state in is_state_visited() in case if
current state's jump history is too long.
Use same constant as in `skip_inf_loop_check`, but multiply it by
arbitrarily chosen value 2 to account for jump history containing not
only information about jumps, but also information about stack access.
For an example of problematic program consider the code below,
w/o this patch the example is processed by verifier for ~15 minutes,
before failing to allocate big-enough chunk for jmp_history.
0: r7 = *(u16 *)(r1 +0);"
1: r7 += 0x1ab064b9;"
2: if r7 & 0x702000 goto 1b;
3: r7 &= 0x1ee60e;"
4: r7 += r1;"
5: if r7 s> 0x37d2 goto +0;"
6: r0 = 0;"
7: exit;"
Perf profiling shows that most of the time is spent in
mark_chain_precision() ~95%.
The easiest way to explain why this program causes problems is to
apply the following patch:
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0c216e71cec7..4b4823961abe 100644
\--- a/include/linux/bpf.h
\+++ b/include/linux/bpf.h
\@@ -1926,7 +1926,7 @@ struct bpf_array {
};
};
-#define BPF_COMPLEXITY_LIMIT_INSNS 1000000 /* yes. 1M insns */
+#define BPF_COMPLEXITY_LIMIT_INSNS 256 /* yes. 1M insns */
#define MAX_TAIL_CALL_CNT 33
/* Maximum number of loops for bpf_loop and bpf_iter_num.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index f514247ba8ba..75e88be3bb3e 100644
\--- a/kernel/bpf/verifier.c
\+++ b/kernel/bpf/verifier.c
\@@ -18024,8 +18024,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
skip_inf_loop_check:
if (!force_new_state &&
env->jmps_processed - env->prev_jmps_processed < 20 &&
- env->insn_processed - env->prev_insn_processed < 100)
+ env->insn_processed - env->prev_insn_processed < 100) {
+ verbose(env, "is_state_visited: suppressing checkpoint at %d, %d jmps processed, cur->jmp_history_cnt is %d\n",
+ env->insn_idx,
+ env->jmps_processed - env->prev_jmps_processed,
+ cur->jmp_history_cnt);
add_new_state = false;
+ }
goto miss;
}
/* If sl->state is a part of a loop and this loop's entry is a part of
\@@ -18142,6 +18147,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
if (!add_new_state)
return 0;
+ verbose(env, "is_state_visited: new checkpoint at %d, resetting env->jmps_processed\n",
+ env->insn_idx);
+
/* There were no equivalent states, remember the current one.
* Technically the current state is not proven to be safe yet,
* but it will either reach outer most bpf_exit (which means it's safe)
And observe verification log:
...
is_state_visited: new checkpoint at 5, resetting env->jmps_processed
5: R1=ctx() R7=ctx(...)
5: (65) if r7 s> 0x37d2 goto pc+0 ; R7=ctx(...)
6: (b7) r0 = 0 ; R0_w=0
7: (95) exit
from 5 to 6: R1=ctx() R7=ctx(...) R10=fp0
6: R1=ctx() R7=ctx(...) R10=fp0
6: (b7) r0 = 0 ; R0_w=0
7: (95) exit
is_state_visited: suppressing checkpoint at 1, 3 jmps processed, cur->jmp_history_cnt is 74
from 2 to 1: R1=ctx() R7_w=scalar(...) R10=fp0
1: R1=ctx() R7_w=scalar(...) R10=fp0
1: (07) r7 += 447767737
is_state_visited: suppressing checkpoint at 2, 3 jmps processed, cur->jmp_history_cnt is 75
2: R7_w=scalar(...)
2: (45) if r7 & 0x702000 goto pc-2
... mark_precise 152 steps for r7 ...
2: R7_w=scalar(...)
is_state_visited: suppressing checkpoint at 1, 4 jmps processed, cur->jmp_history_cnt is 75
1: (07) r7 += 447767737
is_state_visited: suppressing checkpoint at 2, 4 jmps processed, cur->jmp_history_cnt is 76
2: R7_w=scalar(...)
2: (45) if r7 & 0x702000 goto pc-2
...
BPF program is too large. Processed 257 insn
The log output shows that checkpoint at label (1) is never created,
because it is suppressed by `skip_inf_loop_check` logic:
a. When 'if' at (2) is processed it pushes a state with insn_idx (1)
onto stack and proceeds to (3);
b. At (5) checkpoint is created, and this resets
env->{jmps,insns}_processed.
c. Verification proceeds and reaches `exit`;
d. State saved at step (a) is popped from stack and is_state_visited()
considers if checkpoint needs to be added, but because
env->{jmps,insns}_processed had been just reset at step (b)
the `skip_inf_loop_check` logic forces `add_new_state` to false.
e. Verifier proceeds with current state, which slowly accumulates
more and more entries in the jump history.
The accumulation of entries in the jump history is a problem because
of two factors:
- it eventually exhausts memory available for kmalloc() allocation;
- mark_chain_precision() traverses the jump history of a state,
meaning that if `r7` is marked precise, verifier would iterate
ever growing jump history until parent state boundary is reached.
(note: the log also shows a REG INVARIANTS VIOLATION warning
upon jset processing, but that's another bug to fix).
With this patch applied, the example above is rejected by verifier
under 1s of time, reaching 1M instructions limit.
The program is a simplified reproducer from syzbot report.
Previous discussion could be found at [1].
The patch does not cause any changes in verification performance,
when tested on selftests from veristat.cfg and cilium programs taken
from [2].
[1] https://lore.kernel.org/bpf/20241009021254.2805446-1-eddyz87@gmail.com/
[2] https://github.com/anakryiko/cilium
Changelog:
- v1 -> v2:
- moved patch to bpf tree;
- moved force_new_state variable initialization after declaration and
shortened the comment.
v1: https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/
Fixes:
|
||
Linus Torvalds
|
ae90f6a617 |
BPF fixes:
- Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap link file descriptors (Hou Tao) - Fix BPF arm64 JIT's address emission with tag-based KASAN enabled reserving not enough size (Peter Collingbourne) - Fix BPF verifier do_misc_fixups patching for inlining of the bpf_get_branch_snapshot BPF helper (Andrii Nakryiko) - Fix a BPF verifier bug and reject BPF program write attempts into read-only marked BPF maps (Daniel Borkmann) - Fix perf_event_detach_bpf_prog error handling by removing an invalid check which would skip BPF program release (Jiri Olsa) - Fix memory leak when parsing mount options for the BPF filesystem (Hou Tao) Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZxrAzxUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISIPcHwD8DnBSPlHX9OezMWCm8mjVx2Fd26W9 /IaiW2tyOPtoSGIA/3hfgfLrxkb3Raoh0miQB2+FRrz9e+y7i8c4Q91mcUgJ =Hvht -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Daniel Borkmann: - Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap link file descriptors (Hou Tao) - Fix BPF arm64 JIT's address emission with tag-based KASAN enabled reserving not enough size (Peter Collingbourne) - Fix BPF verifier do_misc_fixups patching for inlining of the bpf_get_branch_snapshot BPF helper (Andrii Nakryiko) - Fix a BPF verifier bug and reject BPF program write attempts into read-only marked BPF maps (Daniel Borkmann) - Fix perf_event_detach_bpf_prog error handling by removing an invalid check which would skip BPF program release (Jiri Olsa) - Fix memory leak when parsing mount options for the BPF filesystem (Hou Tao) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: bpf: Check validity of link->type in bpf_link_show_fdinfo() bpf: Add the missing BPF_LINK_TYPE invocation for sockmap bpf: fix do_misc_fixups() for bpf_get_branch_snapshot() bpf,perf: Fix perf_event_detach_bpf_prog error handling selftests/bpf: Add test for passing in uninit mtu_len selftests/bpf: Add test for writes to .rodata bpf: Remove MEM_UNINIT from skb/xdp MTU helpers bpf: Fix overloading of MEM_UNINIT's meaning bpf: Add MEM_WRITE attribute bpf: Preserve param->string when parsing mount options bpf, arm64: Fix address emission with tag-based KASAN enabled |
||
Linus Torvalds
|
d44cd82264 |
Including fixes from netfiler, xfrm and bluetooth.
Current release - regressions: - posix-clock: Fix unbalanced locking in pc_clock_settime() - netfilter: fix typo causing some targets not to load on IPv6 Current release - new code bugs: - xfrm: policy: remove last remnants of pernet inexact list Previous releases - regressions: - core: fix races in netdev_tx_sent_queue()/dev_watchdog() - bluetooth: fix UAF on sco_sock_timeout - eth: hv_netvsc: fix VF namespace also in synthetic NIC NETDEV_REGISTER event - eth: usbnet: fix name regression - eth: be2net: fix potential memory leak in be_xmit() - eth: plip: fix transmit path breakage Previous releases - always broken: - sched: deny mismatched skip_sw/skip_hw flags for actions created by classifiers - netfilter: bpf: must hold reference on net namespace - eth: virtio_net: fix integer overflow in stats - eth: bnxt_en: replace ptp_lock with irqsave variant - eth: octeon_ep: add SKB allocation failures handling in __octep_oq_process_rx() Misc: - MAINTAINERS: add Simon as an official reviewer Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmcaTkUSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOkW8kP/iYfaxQ8zR61wUU7bOcVUSnEADR9XQ1H Nta5Z0tDJprZv254XW3hYDzU0Iy3OgclRE1oewF5fQVLn6Sfg4U5awxRTNdJw7KV wj62ziAv/xht2W/4nBsNfYkOZaDAibItbKtxlkOhgCGXSrXBoS22IonKRqEv2HLV Gu0vAY/VI9YNvB5Z6SEKFmQp2bWfX79AChVT72shLBLakOCUHBavk/DOU56XH1Ci IRmU5Lt8ysXWxCTF91rPCAbMyuxBbIv6phIKPV2ALpRUd6ha5nBqcl0wcS7Y1E+/ 0XOV71zjcXFoE/6hc5W3/mC7jm+ipXKVJOnIkCcWq40p6kDVJJ+E1RWEr5JxGEyF FtnUCZ8iK/F3/jSalMras2z+AZ/CGtfHF9wAS3YfMGtOJJb/k4dCxAddp7UzD9O4 yxAJhJ0DrVuplzwovL5owoJJXeRAMQeFydzHBYun5P8Sc9TtvviICi19fMgKGn4O eUQhjgZZY371sPnTDLDEw1Oqzs9qeaeV3S2dSeFJ98PQuPA5KVOf/R2/CptBIMi5 +UNcqeXrlUeYSBW94pPioEVStZDrzax5RVKh/Jo1tTnKzbnWDOOKZqSVsGPMWXdO 0aBlGuSsNe36VDg2C0QMxGk7+gXbKmk9U4+qVQH3KMpB8uqdAu5deMbTT6dfcwBV O/BaGiqoR4ak =dR3Q -----END PGP SIGNATURE----- Merge tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Including fixes from netfiler, xfrm and bluetooth. Oddly this includes a fix for a posix clock regression; in our previous PR we included a change there as a pre-requisite for networking one. That fix proved to be buggy and requires the follow-up included here. Thomas suggested we should send it, given we sent the buggy patch. Current release - regressions: - posix-clock: Fix unbalanced locking in pc_clock_settime() - netfilter: fix typo causing some targets not to load on IPv6 Current release - new code bugs: - xfrm: policy: remove last remnants of pernet inexact list Previous releases - regressions: - core: fix races in netdev_tx_sent_queue()/dev_watchdog() - bluetooth: fix UAF on sco_sock_timeout - eth: hv_netvsc: fix VF namespace also in synthetic NIC NETDEV_REGISTER event - eth: usbnet: fix name regression - eth: be2net: fix potential memory leak in be_xmit() - eth: plip: fix transmit path breakage Previous releases - always broken: - sched: deny mismatched skip_sw/skip_hw flags for actions created by classifiers - netfilter: bpf: must hold reference on net namespace - eth: virtio_net: fix integer overflow in stats - eth: bnxt_en: replace ptp_lock with irqsave variant - eth: octeon_ep: add SKB allocation failures handling in __octep_oq_process_rx() Misc: - MAINTAINERS: add Simon as an official reviewer" * tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits) net: dsa: mv88e6xxx: support 4000ps cycle counter period net: dsa: mv88e6xxx: read cycle counter period from hardware net: dsa: mv88e6xxx: group cycle counter coefficients net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x Bluetooth: ISO: Fix UAF on iso_sock_timeout Bluetooth: SCO: Fix UAF on sco_sock_timeout Bluetooth: hci_core: Disable works on hci_unregister_dev posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime() r8169: avoid unsolicited interrupts net: sched: use RCU read-side critical section in taprio_dump() net: sched: fix use-after-free in taprio_change() net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers net: usb: usbnet: fix name regression mlxsw: spectrum_router: fix xa_store() error checking virtio_net: fix integer overflow in stats net: fix races in netdev_tx_sent_queue()/dev_watchdog() net: wwan: fix global oob in wwan_rtnl_policy netfilter: xtables: fix typo causing some targets not to load on IPv6 ... |
||
Hou Tao
|
8421d4c876 |
bpf: Check validity of link->type in bpf_link_show_fdinfo()
If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing bpf_link_type_strs[link->type] may result in an out-of-bounds access. To spot such missed invocations early in the future, checking the validity of link->type in bpf_link_show_fdinfo() and emitting a warning when such invocations are missed. Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com |
||
Andrii Nakryiko
|
9806f28314 |
bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()
We need `goto next_insn;` at the end of patching instead of `continue;`.
It currently works by accident by making verifier re-process patched
instructions.
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Fixes:
|
||
Jiri Olsa
|
0ee288e69d |
bpf,perf: Fix perf_event_detach_bpf_prog error handling
Peter reported that perf_event_detach_bpf_prog might skip to release
the bpf program for -ENOENT error from bpf_prog_array_copy.
This can't happen because bpf program is stored in perf event and is
detached and released only when perf event is freed.
Let's drop the -ENOENT check and make sure the bpf program is released
in any case.
Fixes:
|
||
Jinjie Ruan
|
6e62807c7f |
posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()
If get_clock_desc() succeeds, it calls fget() for the clockid's fd,
and get the clk->rwsem read lock, so the error path should release
the lock to make the lock balance and fput the clockid's fd to make
the refcount balance and release the fd related resource.
However the below commit left the error path locked behind resulting in
unbalanced locking. Check timespec64_valid_strict() before
get_clock_desc() to fix it, because the "ts" is not changed
after that.
Fixes:
|
||
Leo Yan
|
0b6e2e22cb |
tracing: Consider the NULL character when validating the event length
strlen() returns a string length excluding the null byte. If the string
length equals to the maximum buffer length, the buffer will have no
space for the NULL terminating character.
This commit checks this condition and returns failure for it.
Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/
Fixes:
|
||
Mikel Rychliski
|
73f3508047 |
tracing/probes: Fix MAX_TRACE_ARGS limit handling
When creating a trace_probe we would set nr_args prior to truncating the
arguments to MAX_TRACE_ARGS. However, we would only initialize arguments
up to the limit.
This caused invalid memory access when attempting to set up probes with
more than 128 fetchargs.
BUG: kernel NULL pointer dereference, address: 0000000000000020
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
RIP: 0010:__set_print_fmt+0x134/0x330
Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return
an error when there are too many arguments instead of silently
truncating.
Link: https://lore.kernel.org/all/20240930202656.292869-1-mikel@mikelr.com/
Fixes:
|
||
Daniel Borkmann
|
8ea607330a |
bpf: Fix overloading of MEM_UNINIT's meaning
Lonial reported an issue in the BPF verifier where check_mem_size_reg() has the following code: if (!tnum_is_const(reg->var_off)) /* For unprivileged variable accesses, disable raw * mode so that the program is required to * initialize all the memory that the helper could * just partially fill up. */ meta = NULL; This means that writes are not checked when the register containing the size of the passed buffer has not a fixed size. Through this bug, a BPF program can write to a map which is marked as read-only, for example, .rodata global maps. The problem is that MEM_UNINIT's initial meaning that "the passed buffer to the BPF helper does not need to be initialized" which was added back in commit |
||
Daniel Borkmann
|
6fad274f06 |
bpf: Add MEM_WRITE attribute
Add a MEM_WRITE attribute for BPF helper functions which can be used in
bpf_func_proto to annotate an argument type in order to let the verifier
know that the helper writes into the memory passed as an argument. In
the past MEM_UNINIT has been (ab)used for this function, but the latter
merely tells the verifier that the passed memory can be uninitialized.
There have been bugs with overloading the latter but aside from that
there are also cases where the passed memory is read + written which
currently cannot be expressed, see also
|
||
Hou Tao
|
1f97c03f43 |
bpf: Preserve param->string when parsing mount options
In bpf_parse_param(), keep the value of param->string intact so it can
be freed later. Otherwise, the kmalloc area pointed to by param->string
will be leaked as shown below:
unreferenced object 0xffff888118c46d20 (size 8):
comm "new_name", pid 12109, jiffies 4295580214
hex dump (first 8 bytes):
61 6e 79 00 38 c9 5c 7e any.8.\~
backtrace (crc e1b7f876):
[<00000000c6848ac7>] kmemleak_alloc+0x4b/0x80
[<00000000de9f7d00>] __kmalloc_node_track_caller_noprof+0x36e/0x4a0
[<000000003e29b886>] memdup_user+0x32/0xa0
[<0000000007248326>] strndup_user+0x46/0x60
[<0000000035b3dd29>] __x64_sys_fsconfig+0x368/0x3d0
[<0000000018657927>] x64_sys_call+0xff/0x9f0
[<00000000c0cabc95>] do_syscall_64+0x3b/0xc0
[<000000002f331597>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
Fixes:
|
||
Qiao Ma
|
373b9338c9 |
uprobe: avoid out-of-bounds memory access of fetching args
Uprobe needs to fetch args into a percpu buffer, and then copy to ring
buffer to avoid non-atomic context problem.
Sometimes user-space strings, arrays can be very large, but the size of
percpu buffer is only page size. And store_trace_args() won't check
whether these data exceeds a single page or not, caused out-of-bounds
memory access.
It could be reproduced by following steps:
1. build kernel with CONFIG_KASAN enabled
2. save follow program as test.c
```
\#include <stdio.h>
\#include <stdlib.h>
\#include <string.h>
// If string length large than MAX_STRING_SIZE, the fetch_store_strlen()
// will return 0, cause __get_data_size() return shorter size, and
// store_trace_args() will not trigger out-of-bounds access.
// So make string length less than 4096.
\#define STRLEN 4093
void generate_string(char *str, int n)
{
int i;
for (i = 0; i < n; ++i)
{
char c = i % 26 + 'a';
str[i] = c;
}
str[n-1] = '\0';
}
void print_string(char *str)
{
printf("%s\n", str);
}
int main()
{
char tmp[STRLEN];
generate_string(tmp, STRLEN);
print_string(tmp);
return 0;
}
```
3. compile program
`gcc -o test test.c`
4. get the offset of `print_string()`
```
objdump -t test | grep -w print_string
0000000000401199 g F .text 000000000000001b print_string
```
5. configure uprobe with offset 0x1199
```
off=0x1199
cd /sys/kernel/debug/tracing/
echo "p /root/test:${off} arg1=+0(%di):ustring arg2=\$comm arg3=+0(%di):ustring"
> uprobe_events
echo 1 > events/uprobes/enable
echo 1 > tracing_on
```
6. run `test`, and kasan will report error.
==================================================================
BUG: KASAN: use-after-free in strncpy_from_user+0x1d6/0x1f0
Write of size 8 at addr ffff88812311c004 by task test/499CPU: 0 UID: 0 PID: 499 Comm: test Not tainted 6.12.0-rc3+ #18
Hardware name: Red Hat KVM, BIOS 1.16.0-4.al8 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x55/0x70
print_address_description.constprop.0+0x27/0x310
kasan_report+0x10f/0x120
? strncpy_from_user+0x1d6/0x1f0
strncpy_from_user+0x1d6/0x1f0
? rmqueue.constprop.0+0x70d/0x2ad0
process_fetch_insn+0xb26/0x1470
? __pfx_process_fetch_insn+0x10/0x10
? _raw_spin_lock+0x85/0xe0
? __pfx__raw_spin_lock+0x10/0x10
? __pte_offset_map+0x1f/0x2d0
? unwind_next_frame+0xc5f/0x1f80
? arch_stack_walk+0x68/0xf0
? is_bpf_text_address+0x23/0x30
? kernel_text_address.part.0+0xbb/0xd0
? __kernel_text_address+0x66/0xb0
? unwind_get_return_address+0x5e/0xa0
? __pfx_stack_trace_consume_entry+0x10/0x10
? arch_stack_walk+0xa2/0xf0
? _raw_spin_lock_irqsave+0x8b/0xf0
? __pfx__raw_spin_lock_irqsave+0x10/0x10
? depot_alloc_stack+0x4c/0x1f0
? _raw_spin_unlock_irqrestore+0xe/0x30
? stack_depot_save_flags+0x35d/0x4f0
? kasan_save_stack+0x34/0x50
? kasan_save_stack+0x24/0x50
? mutex_lock+0x91/0xe0
? __pfx_mutex_lock+0x10/0x10
prepare_uprobe_buffer.part.0+0x2cd/0x500
uprobe_dispatcher+0x2c3/0x6a0
? __pfx_uprobe_dispatcher+0x10/0x10
? __kasan_slab_alloc+0x4d/0x90
handler_chain+0xdd/0x3e0
handle_swbp+0x26e/0x3d0
? __pfx_handle_swbp+0x10/0x10
? uprobe_pre_sstep_notifier+0x151/0x1b0
irqentry_exit_to_user_mode+0xe2/0x1b0
asm_exc_int3+0x39/0x40
RIP: 0033:0x401199
Code: 01 c2 0f b6 45 fb 88 02 83 45 fc 01 8b 45 fc 3b 45 e4 7c b7 8b 45 e4 48 98 48 8d 50 ff 48 8b 45 e8 48 01 d0 ce
RSP: 002b:00007ffdf00576a8 EFLAGS: 00000206
RAX: 00007ffdf00576b0 RBX: 0000000000000000 RCX: 0000000000000ff2
RDX: 0000000000000ffc RSI: 0000000000000ffd RDI: 00007ffdf00576b0
RBP: 00007ffdf00586b0 R08: 00007feb2f9c0d20 R09: 00007feb2f9c0d20
R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000401040
R13: 00007ffdf0058780 R14: 0000000000000000 R15: 0000000000000000
</TASK>
This commit enforces the buffer's maxlen less than a page-size to avoid
store_trace_args() out-of-memory access.
Link: https://lore.kernel.org/all/20241015060148.1108331-1-mqaio@linux.alibaba.com/
Fixes:
|
||
Linus Torvalds
|
2b4d25010d |
- Add PREEMPT_RT maintainers
- Fix another aspect of delayed dequeued tasks wrt determining their state, i.e., whether they're runnable or blocked - Handle delayed dequeued tasks and their migration wrt PSI properly - Fix the situation where a delayed dequeue task gets enqueued into a new class, which should not happen - Fix a case where memory allocation would happen while the runqueue lock is held, which is a no-no - Do not over-schedule when tasks with shorter slices preempt the currently running task - Make sure delayed to deque entities are properly handled before unthrottling - Other smaller cleanups and improvements -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmcU3tMACgkQEsHwGGHe VUqpuhAAqqyi2NNgrIOlEWh/Ej4NQZL7KleF84cSpKCIBK2somYX5ksgMcUgn82i bIuDVErQu/a4lhNAf5zn7TO3yuPA1Q5xd/453qBlWM9ApkH0S69Mp9f0yocVu8F0 t3XsgXm+/R8A4TYbiv8cB+r1Xt8E5NUP6RkNIKCHbPLAG94gqYF8UZEJ9sAl9ZXw qEWc9afpnp+4LQ9PlzePuaM976LWUPB49OoFZMnFmY1VkvFuVjkjXSVzJX6l4qB7 Omo/+TXOOBSHXVVflNx/68Q16irFHAnqwPPrLCBQWBLIPz3iRiZjV9ptD9tUZkRM M+klL7w0jRG+8wa9fTwuqybmBNIBt4Az1/WUw9Lc3ryEWRsCKzkGT8au3lv5FpQY CTwIIBSMmUcqQSG40R0HHS3nDR4UBFFD0PAww+8cJQZc0IPd2rT9/hfqYdt3sq2Z vV9rmTFOcDlApeDdCGcfC7zJhdgVuBgDVjdTsE5SNRUduBUsBYOeLDnT+0Qi0ArJ txVINGxQDm6jz512f4CAB/xzUcYpU4o639Z1Jkd6a8QbO1NBZGX1ioOcvPEMhmFF f/qFyM8ctR5Kj6LJCZiDcstqtAZviW1d2uMp48gk2QeSvkCyIUQqrWshItd02iBG TZdSYRvSYtYSIz7WYtE/CABUDmrJGjuLtb+jOrR93//TsWwwVdE= =1D7H -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduling fixes from Borislav Petkov: - Add PREEMPT_RT maintainers - Fix another aspect of delayed dequeued tasks wrt determining their state, i.e., whether they're runnable or blocked - Handle delayed dequeued tasks and their migration wrt PSI properly - Fix the situation where a delayed dequeue task gets enqueued into a new class, which should not happen - Fix a case where memory allocation would happen while the runqueue lock is held, which is a no-no - Do not over-schedule when tasks with shorter slices preempt the currently running task - Make sure delayed to deque entities are properly handled before unthrottling - Other smaller cleanups and improvements * tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: MAINTAINERS: Add an entry for PREEMPT_RT. sched/fair: Fix external p->on_rq users sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug sched/core: Dequeue PSI signals for blocked tasks that are delayed sched: Fix delayed_dequeue vs switched_from_fair() sched/core: Disable page allocation in task_tick_mm_cid() sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl() sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running sched: Fix sched_delayed vs cfs_bandwidth |
||
Linus Torvalds
|
06526daaff |
ftrace: A couple of fixes to function graph infrastructure
- Fix allocation of idle shadow stack allocation during hotplug If function graph tracing is started when a CPU is offline, if it were come online during the trace then the idle task that represents the CPU will not get a shadow stack allocated for it. This means all function graph hooks that happen while that idle task is running (including in interrupt mode) will have all its events dropped. Switch over to the CPU hotplug mechanism that will have any newly brought on line CPU get a callback that can allocate the shadow stack for its idle task. - Fix allocation size of the ret_stack_list array When function graph tracing converted over to allowing more than one user at a time, it had to convert its shadow stack from an array of ret_stack structures to an array of unsigned longs. The shadow stacks are allocated in batches of 32 at a time and assigned to every running task. The batch is held by the ret_stack_list array. But when the conversion happened, instead of allocating an array of 32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE). This ret_stack_list gets passed to a function that iterates over what it believes is its size defined by the FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32). Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise this would have been an array overflow. This still should be fixed and the ret_stack_list should be allocated to the size it is expected to be as someday it may end up being bigger than SHADOW_STACK_SIZE. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZxP8RhQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qmW3AP4qCOvU/g9g6u32gIZmS1oUWqe3q+Rq 9OKCk0JP6GGc8AD/cF816lbs5vpDiZFdbBvaz5gLHqhfAt35NVU8T5tbJA4= =Lh3A -----END PGP SIGNATURE----- Merge tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ftrace fixes from Steven Rostedt: "A couple of fixes to function graph infrastructure: - Fix allocation of idle shadow stack allocation during hotplug If function graph tracing is started when a CPU is offline, if it were come online during the trace then the idle task that represents the CPU will not get a shadow stack allocated for it. This means all function graph hooks that happen while that idle task is running (including in interrupt mode) will have all its events dropped. Switch over to the CPU hotplug mechanism that will have any newly brought on line CPU get a callback that can allocate the shadow stack for its idle task. - Fix allocation size of the ret_stack_list array When function graph tracing converted over to allowing more than one user at a time, it had to convert its shadow stack from an array of ret_stack structures to an array of unsigned longs. The shadow stacks are allocated in batches of 32 at a time and assigned to every running task. The batch is held by the ret_stack_list array. But when the conversion happened, instead of allocating an array of 32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE). This ret_stack_list gets passed to a function that iterates over what it believes is its size defined by the FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32). Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise this would have been an array overflow. This still should be fixed and the ret_stack_list should be allocated to the size it is expected to be as someday it may end up being bigger than SHADOW_STACK_SIZE" * tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: fgraph: Allocate ret_stack_list with proper size fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks |
||
Steven Rostedt
|
fae4078c28 |
fgraph: Allocate ret_stack_list with proper size
The ret_stack_list is an array of ret_stack shadow stacks for the function
graph usage. When the first function graph is enabled, all tasks in the
system get a shadow stack. The ret_stack_list is a 32 element array of
pointers to these shadow stacks. It allocates the shadow stack in batches
(32 stacks at a time), assigns them to running tasks, and continues until
all tasks are covered.
When the function graph shadow stack changed from an array of
ftrace_ret_stack structures to an array of longs, the allocation of
ret_stack_list went from allocating an array of 32 elements to just a
block defined by SHADOW_STACK_SIZE. Luckily, that's defined as PAGE_SIZE
and is much more than enough to hold 32 pointers. But it is way overkill
for the amount needed to allocate.
Change the allocation of ret_stack_list back to a kcalloc() of
FTRACE_RETSTACK_ALLOC_SIZE pointers.
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241018215212.23f13f40@rorschach
Fixes:
|
||
Steven Rostedt
|
2c02f7375e |
fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks
The function graph infrastructure allocates a shadow stack for every task
when enabled. This includes the idle tasks. The first time the function
graph is invoked, the shadow stacks are created and never freed until the
task exits. This includes the idle tasks.
Only the idle tasks that were for online CPUs had their shadow stacks
created when function graph tracing started. If function graph tracing is
enabled and a CPU comes online, the idle task representing that CPU will
not have its shadow stack created, and all function graph tracing for that
idle task will be silently dropped.
Instead, use the CPU hotplug mechanism to allocate the idle shadow stacks.
This will include idle tasks for CPUs that come online during tracing.
This issue can be reproduced by:
# cd /sys/kernel/tracing
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo 0 > set_ftrace_pid
# echo function_graph > current_tracer
# echo 1 > options/funcgraph-proc
# echo 1 > /sys/devices/system/cpu/cpu1
# grep '<idle>' per_cpu/cpu1/trace | head
Before, nothing would show up.
After:
1) <idle>-0 | 0.811 us | __enqueue_entity();
1) <idle>-0 | 5.626 us | } /* enqueue_entity */
1) <idle>-0 | | dl_server_update_idle_time() {
1) <idle>-0 | | dl_scaled_delta_exec() {
1) <idle>-0 | 0.450 us | arch_scale_cpu_capacity();
1) <idle>-0 | 1.242 us | }
1) <idle>-0 | 1.908 us | }
1) <idle>-0 | | dl_server_start() {
1) <idle>-0 | | enqueue_dl_entity() {
1) <idle>-0 | | task_contending() {
Note, if tracing stops and restarts, the old way would then initialize
the onlined CPUs.
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20241018214300.6df82178@rorschach
Fixes:
|
||
Linus Torvalds
|
3d5ad2d4ec |
BPF fixes:
- Fix BPF verifier to not affect subreg_def marks in its range propagation, from Eduard Zingerman. - Fix a truncation bug in the BPF verifier's handling of coerce_reg_to_size_sx, from Dimitar Kanaliev. - Fix the BPF verifier's delta propagation between linked registers under 32-bit addition, from Daniel Borkmann. - Fix a NULL pointer dereference in BPF devmap due to missing rxq information, from Florian Kauer. - Fix a memory leak in bpf_core_apply, from Jiri Olsa. - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for arrays of nested structs, from Hou Tao. - Fix build ID fetching where memory areas backing the file were created with memfd_secret, from Andrii Nakryiko. - Fix BPF task iterator tid filtering which was incorrectly using pid instead of tid, from Jordan Rome. - Several fixes for BPF sockmap and BPF sockhash redirection in combination with vsocks, from Michal Luczaj. - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered, from Andrea Parri. - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility of an infinite BPF tailcall, from Pu Lehui. - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot be resolved, from Thomas Weißschuh. - Fix a bug in kfunc BTF caching for modules where the wrong BTF object was returned, from Toke Høiland-Jørgensen. - Fix a BPF selftest compilation error in cgroup-related tests with musl libc, from Tony Ambardar. - Several fixes to BPF link info dumps to fill missing fields, from Tyrone Wu. - Add BPF selftests for kfuncs from multiple modules, checking that the correct kfuncs are called, from Simon Sundberg. - Ensure that internal and user-facing bpf_redirect flags don't overlap, also from Toke Høiland-Jørgensen. - Switch to use kvzmalloc to allocate BPF verifier environment, from Rik van Riel. - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat under RT, from Wander Lairson Costa. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> -----BEGIN PGP SIGNATURE----- iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZxK4OhUcZGFuaWVsQGlv Z2VhcmJveC5uZXQACgkQ2yufC7HISIOCrwEAib2kC5EEQn5+wKVE/bnZryVX2leT YXdfItDCBU6zCYUA+wTU5hGGn9lcDUcZx72l/KZPDyPw7HdzNJ+6iR1zQqoM =f9kv -----END PGP SIGNATURE----- Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Pull bpf fixes from Daniel Borkmann: - Fix BPF verifier to not affect subreg_def marks in its range propagation (Eduard Zingerman) - Fix a truncation bug in the BPF verifier's handling of coerce_reg_to_size_sx (Dimitar Kanaliev) - Fix the BPF verifier's delta propagation between linked registers under 32-bit addition (Daniel Borkmann) - Fix a NULL pointer dereference in BPF devmap due to missing rxq information (Florian Kauer) - Fix a memory leak in bpf_core_apply (Jiri Olsa) - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for arrays of nested structs (Hou Tao) - Fix build ID fetching where memory areas backing the file were created with memfd_secret (Andrii Nakryiko) - Fix BPF task iterator tid filtering which was incorrectly using pid instead of tid (Jordan Rome) - Several fixes for BPF sockmap and BPF sockhash redirection in combination with vsocks (Michal Luczaj) - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri) - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility of an infinite BPF tailcall (Pu Lehui) - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot be resolved (Thomas Weißschuh) - Fix a bug in kfunc BTF caching for modules where the wrong BTF object was returned (Toke Høiland-Jørgensen) - Fix a BPF selftest compilation error in cgroup-related tests with musl libc (Tony Ambardar) - Several fixes to BPF link info dumps to fill missing fields (Tyrone Wu) - Add BPF selftests for kfuncs from multiple modules, checking that the correct kfuncs are called (Simon Sundberg) - Ensure that internal and user-facing bpf_redirect flags don't overlap (Toke Høiland-Jørgensen) - Switch to use kvzmalloc to allocate BPF verifier environment (Rik van Riel) - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat under RT (Wander Lairson Costa) * tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits) lib/buildid: Handle memfd_secret() files in build_id_parse() selftests/bpf: Add test case for delta propagation bpf: Fix print_reg_state's constant scalar dump bpf: Fix incorrect delta propagation between linked registers bpf: Properly test iter/task tid filtering bpf: Fix iter/task tid filtering riscv, bpf: Make BPF_CMPXCHG fully ordered bpf, vsock: Drop static vsock_bpf_prot initialization vsock: Update msg_count on read_skb() vsock: Update rx_bytes on read_skb() bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock selftests/bpf: Add asserts for netfilter link info bpf: Fix link info netfilter flags to populate defrag flag selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx() selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx() bpf: Fix truncation bug in coerce_reg_to_size_sx() selftests/bpf: Assert link info uprobe_multi count & path_size if unset bpf: Fix unpopulated path_size when uprobe_multi fields unset selftests/bpf: Fix cross-compiling urandom_read selftests/bpf: Add test for kfunc module order ... |
||
Daniel Borkmann
|
3e9e708757 |
bpf: Fix print_reg_state's constant scalar dump
print_reg_state() should not consider adding reg->off to reg->var_off.value
when dumping scalars. Scalars can be produced with reg->off != 0 through
BPF_ADD_CONST, and thus as-is this can skew the register log dump.
Fixes:
|
||
Daniel Borkmann
|
3878ae04e9 |
bpf: Fix incorrect delta propagation between linked registers
Nathaniel reported a bug in the linked scalar delta tracking, which can lead to accepting a program with OOB access. The specific code is related to the sync_linked_regs() function and the BPF_ADD_CONST flag, which signifies a constant offset between two scalar registers tracked by the same register id. The verifier attempts to track "similar" scalars in order to propagate bounds information learned about one scalar to others. For instance, if r1 and r2 are known to contain the same value, then upon encountering 'if (r1 != 0x1234) goto xyz', not only does it know that r1 is equal to 0x1234 on the path where that conditional jump is not taken, it also knows that r2 is. Additionally, with env->bpf_capable set, the verifier will track scalars which should be a constant delta apart (if r1 is known to be one greater than r2, then if r1 is known to be equal to 0x1234, r2 must be equal to 0x1233.) The code path for the latter in adjust_reg_min_max_vals() is reached when processing both 32 and 64-bit addition operations. While adjust_reg_min_max_vals() knows whether dst_reg was produced by a 32 or a 64-bit addition (based on the alu32 bool), the only information saved in dst_reg is the id of the source register (reg->id, or'ed by BPF_ADD_CONST) and the value of the constant offset (reg->off). Later, the function sync_linked_regs() will attempt to use this information to propagate bounds information from one register (known_reg) to others, meaning, for all R in linked_regs, it copies known_reg range (and possibly adjusting delta) into R for the case of R->id == known_reg->id. For the delta adjustment, meaning, matching reg->id with BPF_ADD_CONST, the verifier adjusts the register as reg = known_reg; reg += delta where delta is computed as (s32)reg->off - (s32)known_reg->off and placed as a scalar into a fake_reg to then simulate the addition of reg += fake_reg. This is only correct, however, if the value in reg was created by a 64-bit addition. When reg contains the result of a 32-bit addition operation, its upper 32 bits will always be zero. sync_linked_regs() on the other hand, may cause the verifier to believe that the addition between fake_reg and reg overflows into those upper bits. For example, if reg was generated by adding the constant 1 to known_reg using a 32-bit alu operation, then reg->off is 1 and known_reg->off is 0. If known_reg is known to be the constant 0xFFFFFFFF, sync_linked_regs() will tell the verifier that reg is equal to the constant 0x100000000. This is incorrect as the actual value of reg will be 0, as the 32-bit addition will wrap around. Example: 0: (b7) r0 = 0; R0_w=0 1: (18) r1 = 0x80000001; R1_w=0x80000001 3: (37) r1 /= 1; R1_w=scalar() 4: (bf) r2 = r1; R1_w=scalar(id=1) R2_w=scalar(id=1) 5: (bf) r4 = r1; R1_w=scalar(id=1) R4_w=scalar(id=1) 6: (04) w2 += 2147483647; R2_w=scalar(id=1+2147483647,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 7: (04) w4 += 0 ; R4_w=scalar(id=1+0,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff)) 8: (15) if r2 == 0x0 goto pc+1 10: R0=0 R1=0xffffffff80000001 R2=0x7fffffff R4=0xffffffff80000001 R10=fp0 What can be seen here is that r1 is copied to r2 and r4, such that {r1,r2,r4}.id are all the same which later lets sync_linked_regs() to be invoked. Then, in a next step constants are added with alu32 to r2 and r4, setting their ->off, as well as id |= BPF_ADD_CONST. Next, the conditional will bind r2 and propagate ranges to its linked registers. The verifier now believes the upper 32 bits of r4 are r4=0xffffffff80000001, while actually r4=r1=0x80000001. One approach for a simple fix suitable also for stable is to limit the constant delta tracking to only 64-bit alu addition. If necessary at some later point, BPF_ADD_CONST could be split into BPF_ADD_CONST64 and BPF_ADD_CONST32 to avoid mixing the two under the tradeoff to further complicate sync_linked_regs(). However, none of the added tests from |
||
Jordan Rome
|
9495a5b731 |
bpf: Fix iter/task tid filtering
In userspace, you can add a tid filter by setting
the "task.tid" field for "bpf_iter_link_info".
However, `get_pid_task` when called for the
`BPF_TASK_ITER_TID` type should have been using
`PIDTYPE_PID` (tid) instead of `PIDTYPE_TGID` (pid).
Fixes:
|
||
Linus Torvalds
|
07d6bf634b |
No contributions from subtrees.
Current release - new code bugs: - eth: mlx5: HWS, don't destroy more bwc queue locks than allocated Previous releases - regressions: - ipv4: give an IPv4 dev to blackhole_netdev - udp: compute L4 checksum as usual when not segmenting the skb - tcp/dccp: don't use timer_pending() in reqsk_queue_unlink(). - eth: mlx5e: don't call cleanup on profile rollback failure - eth: microchip: vcap api: fix memory leaks in vcap_api_encode_rule_test() - eth: enetc: disable Tx BD rings after they are empty - eth: macb: avoid 20s boot delay by skipping MDIO bus registration for fixed-link PHY Previous releases - always broken: - posix-clock: fix missing timespec64 check in pc_clock_settime() - genetlink: hold RCU in genlmsg_mcast() - mptcp: prevent MPC handshake on port-based signal endpoints - eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame - eth: stmmac: dwmac-tegra: fix link bring-up sequence - eth: bcmasp: fix potential memory leak in bcmasp_xmit() Misc: - add Andrew Lunn as a co-maintainer of all networking drivers Signed-off-by: Paolo Abeni <pabeni@redhat.com> -----BEGIN PGP SIGNATURE----- iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmcRDoMSHHBhYmVuaUBy ZWRoYXQuY29tAAoJECkkeY3MjxOklXoP/2VKcUDYMfP03vB6a60riiqMcD+GVZzm lLaa9xBiDdEsnL+4dcQLF7tecYbqpIqh+0ZeEe0aXgp0HjIm2Im1eiXv5cB7INxK n1lxVibI/zD2j2M9cy2NDTeeYSI29GH98g0IkLrU9vnfsrp5jRuXFttrCmotzesZ tYb200cVMR/nt9rrG3aNAxTUHjaykgpKh/zSvFC0MRStXFfUbIia88LzcQOQzEK3 jcWMj8jsJHkDLhUHS8LVZEV1JDYIb+QeEjGz5LxMpQV5/T5sP7Ki4ISPxpUbYNrZ QWwSrFSg2ivaq8PkQ2LBTKAtiHyGlcEJtnlSf7Y50cpc9RNphClq8YMSBUCcGTEi 3W18eVIZ0aj1omTeHEQLbMkqT0soTwYxskDW0uCCFKf/SRCQm1ixrpQiA6PGP266 e0q7mMD7v4S9ZdO5VF+oYzf1fF0OhaOkZtUEsWjxHite6ujh8719EPbUoee4cqPL ofoHYF338BlYl4YIZERZMff8HJD4+PR0R8c9SIBXQcGUMvf2mQdvaD9q2MGySSJd 4mlH6bE8Ckj4KVp9llxB8zyw/5OmJdMZ+ar4o7+CX2Z3Wrs1SMSTy7qAADcV2B2G S9kvwNDffak+N94N/MqswQM4se5yo8dmyUnhqChBFYPFWc4N7v6wwKelROigqLL7 7Xhj2TXRBGKs =ukTi -----END PGP SIGNATURE----- Merge tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Current release - new code bugs: - eth: mlx5: HWS, don't destroy more bwc queue locks than allocated Previous releases - regressions: - ipv4: give an IPv4 dev to blackhole_netdev - udp: compute L4 checksum as usual when not segmenting the skb - tcp/dccp: don't use timer_pending() in reqsk_queue_unlink(). - eth: mlx5e: don't call cleanup on profile rollback failure - eth: microchip: vcap api: fix memory leaks in vcap_api_encode_rule_test() - eth: enetc: disable Tx BD rings after they are empty - eth: macb: avoid 20s boot delay by skipping MDIO bus registration for fixed-link PHY Previous releases - always broken: - posix-clock: fix missing timespec64 check in pc_clock_settime() - genetlink: hold RCU in genlmsg_mcast() - mptcp: prevent MPC handshake on port-based signal endpoints - eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame - eth: stmmac: dwmac-tegra: fix link bring-up sequence - eth: bcmasp: fix potential memory leak in bcmasp_xmit() Misc: - add Andrew Lunn as a co-maintainer of all networking drivers" * tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits) net/mlx5e: Don't call cleanup on profile rollback failure net/mlx5: Unregister notifier on eswitch init failure net/mlx5: Fix command bitmask initialization net/mlx5: Check for invalid vector index on EQ creation net/mlx5: HWS, use lock classes for bwc locks net/mlx5: HWS, don't destroy more bwc queue locks than allocated net/mlx5: HWS, fixed double free in error flow of definer layout net/mlx5: HWS, removed wrong access to a number of rules variable mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame net: dsa: vsc73xx: fix reception from VLAN-unaware bridges net: ravb: Only advertise Rx/Tx timestamps if hardware supports it net: microchip: vcap api: Fix memory leaks in vcap_api_encode_rule_test() net: phy: mdio-bcm-unimac: Add BCM6846 support dt-bindings: net: brcm,unimac-mdio: Add bcm6846-mdio udp: Compute L4 checksum as usual when not segmenting the skb genetlink: hold RCU in genlmsg_mcast() net: dsa: mv88e6xxx: Fix the max_vid definition for the MV88E6361 tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink(). ... |
||
Ingo Molnar
|
be602cde65 |
Merge branch 'linus' into sched/urgent, to resolve conflict
Conflicts: kernel/sched/ext.c There's a context conflict between this upstream commit: |
||
Linus Torvalds
|
dff6584301 |
sched_ext: Fixes for v6.12-rc3
- More issues reported in the enable/disable paths on large machines with many tasks due to scx_tasks_lock being held too long. Break up the task iterations. - Remove ops.select_cpu() dependency in bypass mode so that a misbehaving implementation can't live-lock the machine by pushing all tasks to few CPUs in bypass mode. - Other misc fixes. -----BEGIN PGP SIGNATURE----- iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZw7SMw4cdGpAa2VybmVs Lm9yZwAKCRCxYfJx3gVYGU6+AQDZKNsyRw0c8+rd7eASAtCzvzBU1KGf3RPrsbNF pb3IGgEA4/uVBlVny10ZyeKaQ4A3L2G4ROyJwNEygUvhkvbfcgk= =z7d9 -----END PGP SIGNATURE----- Merge tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext Pull sched_ext fixes from Tejun Heo: - More issues reported in the enable/disable paths on large machines with many tasks due to scx_tasks_lock being held too long. Break up the task iterations - Remove ops.select_cpu() dependency in bypass mode so that a misbehaving implementation can't live-lock the machine by pushing all tasks to few CPUs in bypass mode - Other misc fixes * tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext: sched_ext: Remove unnecessary cpu_relax() sched_ext: Don't hold scx_tasks_lock for too long sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers sched_ext: bypass mode shouldn't depend on ops.select_cpu() sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl() sched_ext: Start schedulers with consistent p->scx.slice values Revert "sched_ext: Use shorter slice while bypassing" sched_ext: use correct function name in pick_task_scx() warning message selftests: sched_ext: Add sched_ext as proper selftest target |
||
Linus Torvalds
|
2f87d0916c |
ring-buffer: Fixes for v6.12
- Fix ref counter of buffers assigned at boot up A tracing instance can be created from the kernel command line. If it maps to memory, it is considered permanent and should not be deleted, or bad things can happen. If it is not mapped to memory, then the user is fine to delete it via rmdir from the instances directory. But the ref counts assumed 0 was free to remove and greater than zero was not. But this was not the case. When an instance is created, it should have the reference of 1, and if it should not be removed, it must be greater than 1. The boot up code set normal instances with a ref count of 0, which could get removed if something accessed it and then released it. And memory mapped instances had a ref count of 1 which meant it could be deleted, and bad things happen. Keep normal instances ref count as 1, and set memory mapped instances ref count to 2. - Protect sub buffer size (order) updates from other modifications When a ring buffer is changing the size of its sub-buffers, no other operations should be performed on the ring buffer. That includes reading it. But the locking only grabbed the buffer->mutex that keeps some operations from touching the ring buffer. It also must hold the cpu_buffer->reader_lock as well when updates happen as other paths use that to do some operations on the ring buffer. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZw6LSRQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qr8NAP4ggN51IEEDLINUctXvTnJAbRdRiZcC Z/q/5hfPS2+uAgEArSfVRCMdCCiVMZv+R9i6YPzn1Ux2OTGbcOHz56lZKgY= =GwKu -----END PGP SIGNATURE----- Merge tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer fixes from Steven Rostedt: - Fix ref counter of buffers assigned at boot up A tracing instance can be created from the kernel command line. If it maps to memory, it is considered permanent and should not be deleted, or bad things can happen. If it is not mapped to memory, then the user is fine to delete it via rmdir from the instances directory. But the ref counts assumed 0 was free to remove and greater than zero was not. But this was not the case. When an instance is created, it should have the reference of 1, and if it should not be removed, it must be greater than 1. The boot up code set normal instances with a ref count of 0, which could get removed if something accessed it and then released it. And memory mapped instances had a ref count of 1 which meant it could be deleted, and bad things happen. Keep normal instances ref count as 1, and set memory mapped instances ref count to 2. - Protect sub buffer size (order) updates from other modifications When a ring buffer is changing the size of its sub-buffers, no other operations should be performed on the ring buffer. That includes reading it. But the locking only grabbed the buffer->mutex that keeps some operations from touching the ring buffer. It also must hold the cpu_buffer->reader_lock as well when updates happen as other paths use that to do some operations on the ring buffer. * tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Fix reader locking when changing the sub buffer order ring-buffer: Fix refcount setting of boot mapped buffers |
||
Dimitar Kanaliev
|
ae67b9fb8c |
bpf: Fix truncation bug in coerce_reg_to_size_sx()
coerce_reg_to_size_sx() updates the register state after a sign-extension
operation. However, there's a bug in the assignment order of the unsigned
min/max values, leading to incorrect truncation:
0: (85) call bpf_get_prandom_u32#7 ; R0_w=scalar()
1: (57) r0 &= 1 ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
2: (07) r0 += 254 ; R0_w=scalar(smin=umin=smin32=umin32=254,smax=umax=smax32=umax32=255,var_off=(0xfe; 0x1))
3: (bf) r0 = (s8)r0 ; R0_w=scalar(smin=smin32=-2,smax=smax32=-1,umin=umin32=0xfffffffe,umax=0xffffffff,var_off=(0xfffffffffffffffe; 0x1))
In the current implementation, the unsigned 32-bit min/max values
(u32_min_value and u32_max_value) are assigned directly from the 64-bit
signed min/max values (s64_min and s64_max):
reg->umin_value = reg->u32_min_value = s64_min;
reg->umax_value = reg->u32_max_value = s64_max;
Due to the chain assigmnent, this is equivalent to:
reg->u32_min_value = s64_min; // Unintended truncation
reg->umin_value = reg->u32_min_value;
reg->u32_max_value = s64_max; // Unintended truncation
reg->umax_value = reg->u32_max_value;
Fixes:
|
||
Petr Pavlu
|
09661f75e7 |
ring-buffer: Fix reader locking when changing the sub buffer order
The function ring_buffer_subbuf_order_set() updates each
ring_buffer_per_cpu and installs new sub buffers that match the requested
page order. This operation may be invoked concurrently with readers that
rely on some of the modified data, such as the head bit (RB_PAGE_HEAD), or
the ring_buffer_per_cpu.pages and reader_page pointers. However, no
exclusive access is acquired by ring_buffer_subbuf_order_set(). Modifying
the mentioned data while a reader also operates on them can then result in
incorrect memory access and various crashes.
Fix the problem by taking the reader_lock when updating a specific
ring_buffer_per_cpu in ring_buffer_subbuf_order_set().
Link: https://lore.kernel.org/linux-trace-kernel/20240715145141.5528-1-petr.pavlu@suse.com/
Link: https://lore.kernel.org/linux-trace-kernel/20241010195849.2f77cc3f@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20241011112850.17212b25@gandalf.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241015112440.26987-1-petr.pavlu@suse.com
Fixes:
|
||
Jinjie Ruan
|
d8794ac20a |
posix-clock: Fix missing timespec64 check in pc_clock_settime()
As Andrew pointed out, it will make sense that the PTP core
checked timespec64 struct's tv_sec and tv_nsec range before calling
ptp->info->settime64().
As the man manual of clock_settime() said, if tp.tv_sec is negative or
tp.tv_nsec is outside the range [0..999,999,999], it should return EINVAL,
which include dynamic clocks which handles PTP clock, and the condition is
consistent with timespec64_valid(). As Thomas suggested, timespec64_valid()
only check the timespec is valid, but not ensure that the time is
in a valid range, so check it ahead using timespec64_valid_strict()
in pc_clock_settime() and return -EINVAL if not valid.
There are some drivers that use tp->tv_sec and tp->tv_nsec directly to
write registers without validity checks and assume that the higher layer
has checked it, which is dangerous and will benefit from this, such as
hclge_ptp_settime(), igb_ptp_settime_i210(), _rcar_gen4_ptp_settime(),
and some drivers can remove the checks of itself.
Cc: stable@vger.kernel.org
Fixes:
|
||
David Vernet
|
60e339be10 |
sched_ext: Remove unnecessary cpu_relax()
As described in commit
|
||
Steven Rostedt
|
2cf9733891 |
ring-buffer: Fix refcount setting of boot mapped buffers
A ring buffer which has its buffered mapped at boot up to fixed memory
should not be freed. Other buffers can be. The ref counting setup was
wrong for both. It made the not mapped buffers ref count have zero, and the
boot mapped buffer a ref count of 1. But an normally allocated buffer
should be 1, where it can be removed.
Keep the ref count of a normal boot buffer with its setup ref count (do
not decrement it), and increment the fixed memory boot mapped buffer's ref
count.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241011165224.33dd2624@gandalf.local.home
Fixes:
|
||
Peter Zijlstra
|
cd9626e9eb |
sched/fair: Fix external p->on_rq users
Sean noted that ever since commit |
||
Johannes Weiner
|
c650812419 |
sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
Since sched_delayed tasks remain queued even after blocking, the load
balancer can migrate them between runqueues while PSI considers them
to be asleep. As a result, it misreads the migration requeue followed
by a wakeup as a double queue:
psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=. set=4
First, call psi_enqueue() after p->sched_class->enqueue_task(). A
wakeup will clear p->se.sched_delayed while a migration will not, so
psi can use that flag to tell them apart.
Then teach psi to migrate any "sleep" state when delayed-dequeue tasks
are being migrated.
Delayed-dequeue tasks can be revived by ttwu_runnable(), which will
call down with a new ENQUEUE_DELAYED. Instead of further complicating
the wakeup conditional in enqueue_task(), identify migration contexts
instead and default to wakeup handling for all other cases.
It's not just the warning in dmesg, the task state corruption causes a
permanent CPU pressure indication, which messes with workload/machine
health monitoring.
Debugged-by-and-original-fix-by: K Prateek Nayak <kprateek.nayak@amd.com>
Fixes:
|
||
Linus Torvalds
|
a1029768f3 |
RCU fix for v6.12
Fix rcuog kthread wakeup invocation from softirq context on a CPU which has been marked offline. This can happen when new callbacks are enqueued from a softirq on an offline CPU before it calls rcutree_report_cpu_dead(). When this happens on NOCB configuration, the rcuog wake-up is deferred through an IPI to an online CPU. This is done to avoid call into the scheduler which can risk arming the RT-bandwidth after hrtimers have been migrated out and disabled. However, doing IPI call from softirq is not allowed Fix this by forcing deferred rcuog wakeup through the NOCB timer when the CPU is offline. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZwjWYAAKCRAAHS7/6Z0w pQ8iAP9vbhWlWuwlfZqQJK8Xb5b0kVx3nJjAin/5pY/lbWV7IgEA1DUzTAZKXU5Z yZmjM3l9RpMAxLSgevJKoJD+vR0POQA= =EVin -----END PGP SIGNATURE----- Merge tag 'rcu.fixes.6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux Pull RCU fix from Neeraj Upadhyay: "Fix rcuog kthread wakeup invocation from softirq context on a CPU which has been marked offline. This can happen when new callbacks are enqueued from a softirq on an offline CPU before it calls rcutree_report_cpu_dead(). When this happens on NOCB configuration, the rcuog wake-up is deferred through an IPI to an online CPU. This is done to avoid call into the scheduler which can risk arming the RT-bandwidth after hrtimers have been migrated out and disabled. However, doing IPI call from softirq is not allowed: Fix this by forcing deferred rcuog wakeup through the NOCB timer when the CPU is offline" * tag 'rcu.fixes.6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux: rcu/nocb: Fix rcuog wake-up from offline softirq |
||
Peter Zijlstra
|
f5aaff7bfa |
sched/core: Dequeue PSI signals for blocked tasks that are delayed
psi_dequeue() in for blocked task expects psi_sched_switch() to clear
the TSK_.*RUNNING PSI flags and set the TSK_IOWAIT flags however
psi_sched_switch() uses "!task_on_rq_queued(prev)" to detect if the task
is blocked or still runnable which is no longer true with DELAY_DEQUEUE
since a blocking task can be left queued on the runqueue.
This can lead to PSI splats similar to:
psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4
when the task is requeued since the TSK_RUNNING flag was not cleared
when the task was blocked.
Explicitly communicate that the task was blocked to psi_sched_switch()
even if it was delayed and is still on the runqueue.
[ prateek: Broke off the relevant part from [1], commit message ]
Fixes:
|
||
Peter Zijlstra
|
98442f0ccd |
sched: Fix delayed_dequeue vs switched_from_fair()
Commit |
||
Waiman Long
|
73ab05aa46 |
sched/core: Disable page allocation in task_tick_mm_cid()
With KASAN and PREEMPT_RT enabled, calling task_work_add() in task_tick_mm_cid() may cause the following splat. [ 63.696416] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [ 63.696416] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 610, name: modprobe [ 63.696416] preempt_count: 10001, expected: 0 [ 63.696416] RCU nest depth: 1, expected: 1 This problem is caused by the following call trace. sched_tick() [ acquire rq->__lock ] -> task_tick_mm_cid() -> task_work_add() -> __kasan_record_aux_stack() -> kasan_save_stack() -> stack_depot_save_flags() -> alloc_pages_mpol_noprof() -> __alloc_pages_noprof() -> get_page_from_freelist() -> rmqueue() -> rmqueue_pcplist() -> __rmqueue_pcplist() -> rmqueue_bulk() -> rt_spin_lock() The rq lock is a raw_spinlock_t. We can't sleep while holding it. IOW, we can't call alloc_pages() in stack_depot_save_flags(). The task_tick_mm_cid() function with its task_work_add() call was introduced by commit |
||
Phil Auld
|
d16b7eb6f5 |
sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
The deadline server code moved one of the start_hrtick_dl() calls
but dropped the dl specific hrtick_enabled check. This causes hrticks
to get armed even when sched_feat(HRTICK_DL) is false. Fix it.
Fixes:
|
||
Tyrone Wu
|
ad6b5b6ea9 |
bpf: Fix unpopulated path_size when uprobe_multi fields unset
Previously when retrieving `bpf_link_info.uprobe_multi` with `path` and
`path_size` fields unset, the `path_size` field is not populated
(remains 0). This behavior was inconsistent with how other input/output
string buffer fields work, as the field should be populated in cases
when:
- both buffer and length are set (currently works as expected)
- both buffer and length are unset (not working as expected)
This patch now fills the `path_size` field when `path` and `path_size`
are unset.
Fixes:
|
||
Tejun Heo
|
b07996c7ab |
sched_ext: Don't hold scx_tasks_lock for too long
While enabling and disabling a BPF scheduler, every task is iterated a couple times by walking scx_tasks. Except for one, all iterations keep holding scx_tasks_lock. On multi-socket systems under heavy rq lock contention and high number of threads, this can can lead to RCU and other stalls. The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs) running `stress-ng --workload 150 --workload-threads 10` with >400k idle threads and RCU stall period reduced to 5s: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: rcu: 91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17 rcu: 186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17 rcu: (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192) Sending NMI from CPU 80 to CPUs 91: NMI backtrace for cpu 91 CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471 Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023 Sched_ext: simple (disabling+all) RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0 Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37 RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002 RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0 RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080 RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460 R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000 FS: 0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0 Call Trace: <NMI> </NMI> <TASK> do_raw_spin_lock+0x9c/0xb0 task_rq_lock+0x50/0x190 scx_task_iter_next_locked+0x157/0x170 scx_ops_disable_workfn+0x2c2/0xbf0 kthread_worker_fn+0x108/0x2a0 kthread+0xeb/0x110 ret_from_fork+0x36/0x40 ret_from_fork_asm+0x1a/0x30 </TASK> Sending NMI from CPU 80 to CPUs 186: NMI backtrace for cpu 186 CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471 scx_task_iter can safely drop locks while iterating. Make scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid stalls. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> |
||
Tejun Heo
|
967da57832 |
sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq lock of the task being iterated. Both locks can be released during iteration and the iteration can be continued after re-grabbing scx_tasks_lock. Currently, all lock handling is pushed to the caller which is a bit cumbersome and makes it difficult to add lock-aware behaviors. Make the scx_task_iter helpers handle scx_tasks_lock. - scx_task_iter_init/scx_taks_iter_exit() now grabs and releases scx_task_lock, respectively. Renamed to scx_task_iter_start/scx_task_iter_stop() to more clearly indicate that there are non-trivial side-effects. - Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function is internal. - Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held) and scx_tasks_lock and the latter re-locks only scx_tasks_lock. This doesn't cause behavior changes and will be used to implement stall avoidance. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> |
||
Tejun Heo
|
aebe7ae4cb |
sched_ext: bypass mode shouldn't depend on ops.select_cpu()
Bypass mode was depending on ops.select_cpu() which can't be trusted as with the rest of the BPF scheduler. Always enable and use scx_select_cpu_dfl() in bypass mode. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> |
||
Tejun Heo
|
cc3e1caca9 |
sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl()
Move the sanity check from the inner function scx_select_cpu_dfl() to the exported kfunc scx_bpf_select_cpu_dfl(). This doesn't cause behavior differences and will allow using scx_select_cpu_dfl() in bypass mode regardless of scx_builtin_idle_enabled. Signed-off-by: Tejun Heo <tj@kernel.org> |
||
Tejun Heo
|
3fdb9ebcec |
sched_ext: Start schedulers with consistent p->scx.slice values
The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already being ignored at this stage during disable, the only effect this has is that when the next BPF scheduler is loaded, it won't see unreasonable left-over slices. Ultimately, this shouldn't matter but it's better to start in a known state. Drop p->scx.slice capping from the disable path and instead reset it to SCX_SLICE_DFL in the enable path. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: David Vernet <void@manifault.com> |
||
Tejun Heo
|
54baa7ac0c |
Revert "sched_ext: Use shorter slice while bypassing"
This reverts commit
|
||
Linus Torvalds
|
0edab8d132 |
ring-buffer: Fix for 6.12
- Do not have boot-mapped buffers use CPU hotplug callbacks When a ring buffer is mapped to memory assigned at boot, it also splits it up evenly between the possible CPUs. But the allocation code still attached a CPU notifier callback to this ring buffer. When a CPU is added, the callback will happen and another per-cpu buffer is created for the ring buffer. But for boot mapped buffers, there is no room to add another one (as they were all created already). The result of calling the CPU hotplug notifier on a boot mapped ring buffer is unpredictable and could lead to a system crash. If the ring buffer is boot mapped simply do not attach the CPU notifier to it. -----BEGIN PGP SIGNATURE----- iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZwfxchQccm9zdGVkdEBn b29kbWlzLm9yZwAKCRAp5XQQmuv6qrduAQC2xZ+lqeoDJ3o+H201bqshgx/YDANF p1q1BFmC0yFlOAD/XB/I9o4UFNHAb9D05zPADS+8bF1RxdisJJjqUae1TgM= =lBZ/ -----END PGP SIGNATURE----- Merge tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fix from Steven Rostedt: "Ring-buffer fix: do not have boot-mapped buffers use CPU hotplug callbacks When a ring buffer is mapped to memory assigned at boot, it also splits it up evenly between the possible CPUs. But the allocation code still attached a CPU notifier callback to this ring buffer. When a CPU is added, the callback will happen and another per-cpu buffer is created for the ring buffer. But for boot mapped buffers, there is no room to add another one (as they were all created already). The result of calling the CPU hotplug notifier on a boot mapped ring buffer is unpredictable and could lead to a system crash. If the ring buffer is boot mapped simply do not attach the CPU notifier to it" * tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: ring-buffer: Do not have boot mapped buffers hook to CPU hotplug |