Mirror of https://github.com/torvalds/linux.git
Synced 2024-11-23 12:42:02 +00:00, at 8e41e0a575, 41290 commits

Commit log (each entry is "Author | SHA1", followed by the commit message):
Linus Torvalds | 23309d600d
Networking fixes for 6.3-rc8, including fixes from netfilter and bpf
Merge tag 'net-6.3-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter and bpf.

  There are a few fixes for new code bugs, including the Mellanox one
  noted in the last networking pull. No known regressions outstanding.

  Current release - regressions:
   - sched: clear actions pointer in miss cookie init fail
   - mptcp: fix accept vs worker race
   - bpf: fix bpf_arch_text_poke() with new_addr == NULL on s390
   - eth: bnxt_en: fix a possible NULL pointer dereference in unload path
   - eth: veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag

  Current release - new code bugs:
   - eth: revert "net/mlx5: Enable management PF initialization"

  Previous releases - regressions:
   - netfilter: fix recent physdev match breakage
   - bpf: fix incorrect verifier pruning due to missing register precision taints
   - eth: virtio_net: fix overflow inside xdp_linearize_page()
   - eth: cxgb4: fix use after free bugs caused by circular dependency problem
   - eth: mlxsw: pci: fix possible crash during initialization

  Previous releases - always broken:
   - sched: sch_qfq: prevent slab-out-of-bounds in qfq_activate_agg
   - netfilter: validate catch-all set elements
   - bridge: don't notify FDB entries with "master dynamic"
   - eth: bonding: fix memory leak when changing bond type to ethernet
   - eth: i40e: fix accessing vsi->active_filters without holding lock

  Misc:
   - Mat is back as MPTCP co-maintainer"

Signed-off-by: Paolo Abeni <pabeni@redhat.com>

* tag 'net-6.3-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (33 commits)
  net: bridge: switchdev: don't notify FDB entries with "master dynamic"
  Revert "net/mlx5: Enable management PF initialization"
  MAINTAINERS: Resume MPTCP co-maintainer role
  mailmap: add entries for Mat Martineau
  e1000e: Disable TSO on i219-LM card to increase speed
  bnxt_en: fix free-runnig PHC mode
  net: dsa: microchip: ksz8795: Correctly handle huge frame configuration
  bpf: Fix incorrect verifier pruning due to missing register precision taints
  hamradio: drop ISA_DMA_API dependency
  mlxsw: pci: Fix possible crash during initialization
  mptcp: fix accept vs worker race
  mptcp: stops worker on unaccepted sockets at listener close
  net: rpl: fix rpl header size calculation
  net: vmxnet3: Fix NULL pointer dereference in vmxnet3_rq_rx_complete()
  bonding: Fix memory leak when changing bond type to Ethernet
  veth: take into account peer device for NETDEV_XDP_ACT_NDO_XMIT xdp_features flag
  mlxfw: fix null-ptr-deref in mlxfw_mfa2_tlv_next()
  bnxt_en: Fix a possible NULL pointer dereference in unload path
  bnxt_en: Do not initialize PTP on older P3/P4 chips
  netfilter: nf_tables: tighten netlink attribute requirements for catch-all elements
  ...

Linus Torvalds | cb0856346a
22 hotfixes.
Merge tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "22 hotfixes.

  19 are cc:stable and the remainder address issues which were
  introduced during this merge cycle, or aren't considered suitable for
  -stable backporting.

  19 are for MM and the remainder are for other subsystems"

* tag 'mm-hotfixes-stable-2023-04-19-16-36' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (22 commits)
  nilfs2: initialize unused bytes in segment summary blocks
  mm: page_alloc: skip regions with hugetlbfs pages when allocating 1G pages
  mm/mmap: regression fix for unmapped_area{_topdown}
  maple_tree: fix mas_empty_area() search
  maple_tree: make maple state reusable after mas_empty_area_rev()
  mm: kmsan: handle alloc failures in kmsan_ioremap_page_range()
  mm: kmsan: handle alloc failures in kmsan_vmap_pages_range_noflush()
  tools/Makefile: do missed s/vm/mm/
  mm: fix memory leak on mm_init error handling
  mm/page_alloc: fix potential deadlock on zonelist_update_seq seqlock
  kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
  Revert "userfaultfd: don't fail on unrecognized features"
  writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs
  maple_tree: fix a potential memory leak, OOB access, or other unpredictable bug
  tools/mm/page_owner_sort.c: fix TGID output when cull=tg is used
  mailmap: update jtoppins' entry to reference correct email
  mm/mempolicy: fix use-after-free of VMA iterator
  mm/huge_memory.c: warn with pr_warn_ratelimited instead of VM_WARN_ON_ONCE_FOLIO
  mm/mprotect: fix do_mprotect_pkey() return on error
  mm/khugepaged: check again on anon uffd-wp during isolation
  ...

Daniel Borkmann | 71b547f561
bpf: Fix incorrect verifier pruning due to missing register precision taints
Juan Jose et al reported an issue found via fuzzing where the verifier's
pruning logic prematurely marks a program path as safe.
Consider the following program:
0: (b7) r6 = 1024
1: (b7) r7 = 0
2: (b7) r8 = 0
3: (b7) r9 = -2147483648
4: (97) r6 %= 1025
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2
7: (97) r6 %= 1
8: (b7) r9 = 0
9: (bd) if r6 <= r9 goto pc+1
10: (b7) r6 = 0
11: (b7) r0 = 0
12: (63) *(u32 *)(r10 -4) = r0
13: (18) r4 = 0xffff888103693400 // map_ptr(ks=4,vs=48)
15: (bf) r1 = r4
16: (bf) r2 = r10
17: (07) r2 += -4
18: (85) call bpf_map_lookup_elem#1
19: (55) if r0 != 0x0 goto pc+1
20: (95) exit
21: (77) r6 >>= 10
22: (27) r6 *= 8192
23: (bf) r1 = r0
24: (0f) r0 += r6
25: (79) r3 = *(u64 *)(r0 +0)
26: (7b) *(u64 *)(r1 +0) = r3
27: (95) exit
The verifier treats this as safe, leading to oob read/write access due
to an incorrect verifier conclusion:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b7) r6 = 1024 ; R6_w=1024
1: (b7) r7 = 0 ; R7_w=0
2: (b7) r8 = 0 ; R8_w=0
3: (b7) r9 = -2147483648 ; R9_w=-2147483648
4: (97) r6 %= 1025 ; R6_w=scalar()
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff00000000; 0xffffffff)) R9_w=-2147483648
7: (97) r6 %= 1 ; R6_w=scalar()
8: (b7) r9 = 0 ; R9=0
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
10: (b7) r6 = 0 ; R6_w=0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 9
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=0
22: (27) r6 *= 8192 ; R6_w=0
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 19
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
last_idx 18 first_idx 9
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
regs=40 stack=0 before 10: (b7) r6 = 0
25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
27: (95) exit
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff8ad3886c2a00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1
frame 0: propagating r6
last_idx 19 first_idx 11
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff8ad3886c2a00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
last_idx 9 first_idx 9
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=0 R10=fp0
last_idx 8 first_idx 0
regs=40 stack=0 before 8: (b7) r9 = 0
regs=40 stack=0 before 7: (97) r6 %= 1
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=40 stack=0 before 5: (05) goto pc+0
regs=40 stack=0 before 4: (97) r6 %= 1025
regs=40 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
19: safe
frame 0: propagating r6
last_idx 9 first_idx 0
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=40 stack=0 before 5: (05) goto pc+0
regs=40 stack=0 before 4: (97) r6 %= 1025
regs=40 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
from 6 to 9: safe
verification time 110 usec
stack depth 4
processed 36 insns (limit 1000000) max_states_per_insn 0 total_states 3 peak_states 3 mark_read 2
The verifier considers this program as safe by mistakenly pruning unsafe
code paths. In the above func#0, code lines 0-10 are of interest. In line
0-3 registers r6 to r9 are initialized with known scalar values. In line 4
the register r6 is reset to an unknown scalar given the verifier does not
track modulo operations. Due to this, the verifier can also not determine
precisely which branches in line 6 and 9 are taken, therefore it needs to
explore them both.
As can be seen, the verifier starts with exploring the false/fall-through
paths first. The 'from 19 to 21' path has both r6=0 and r9=0 and the pointer
arithmetic on r0 += r6 is therefore considered safe. Given the arithmetic,
r6 is correctly marked for precision tracking, at which point backtracking
kicks in and walks back the current path all the way to where r6 was set to 0
in the fall-through branch.
Next, the pruning logic pops the path 'from 9 to 11' from the stack. Also
here, the state of the registers is the same, that is, r6=0 and r9=0, so
that at line 19 the path can be pruned as it is considered safe. It is
interesting to note that the conditional in line 9 turned r6 into a more
precise state, that is, in the fall-through path at the beginning of line
10, it is R6=scalar(umin=1), and in the branch-taken path (which is analyzed
here) at the beginning of line 11, r6 turned into a known const r6=0 as
r9=0 prior to that and therefore (unsigned) r6 <= 0 concludes that r6 must
be 0 (**):
[...] ; R6_w=scalar()
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
[...]
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
[...]
The next path is 'from 6 to 9'. The verifier considers the old and current
state equivalent, and therefore prunes the search incorrectly. Looking into
the two states which are being compared by the pruning logic at line 9, the
old state consists of R6_rwD=Pscalar() R9_rwD=0 R10=fp0 and the new state
consists of R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968)
R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0. While r6 had the reg->precise flag
correctly set in the old state, r9 did not. Both r6'es are considered as
equivalent given the old one is a superset of the current, more precise one,
however, r9's actual values (0 vs 0x80000000) mismatch. Given the old r9
did not have reg->precise flag set, the verifier does not consider the
register as contributing to the precision state of r6, and therefore it
considered both r9 states as equivalent. However, for this specific pruned
path (which is also the actual path taken at runtime), register r6 will be
0x400 and r9 0x80000000 when reaching line 21, thus oob-accessing the map.
The purpose of precision tracking is to initially mark registers (including
spilled ones) as imprecise to help the verifier's pruning logic find equivalent
states it can then prune if they don't contribute to the program's safety
aspects. For example, if registers are used for pointer arithmetic or to pass
constant length to a helper, then the verifier sets reg->precise flag and
backtracks the BPF program instruction sequence and chain of verifier states
to ensure that the given register or stack slot including their dependencies
are marked as precisely tracked scalar. This also includes any other registers
and slots that contribute to a tracked state of given registers/stack slot.
This backtracking relies on recorded jmp_history and is able to traverse
entire chain of parent states. This process ends only when all the necessary
registers/slots and their transitive dependencies are marked as precise.
The backtrack_insn() is called from the current instruction up to the first
instruction, and its purpose is to compute a bitmask of registers and stack
slots that need precision tracking in the parent's verifier state. For example,
if a current instruction is r6 = r7, then r6 needs precision after this
instruction and r7 needs precision before this instruction, that is, in the
parent state. Hence for the latter r7 is marked and r6 unmarked.
For the class of jmp/jmp32 instructions, backtrack_insn() today only looks
at call and exit instructions and for all other conditionals the masks
remain as-is. However, in the given situation register r6 has a dependency
on r9 (as described above in **), so also that one needs to be marked for
precision tracking. In other words, if an imprecise register influences a
precise one, then the imprecise register should also be marked precise.
Meaning, in the parent state both dest and src register need to be tracked
for precision and therefore the marking must be more conservative by setting
reg->precise flag for both. The precision propagation needs to cover both
for the conditional: if the src reg was marked but not the dst reg and vice
versa.
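
As a rough illustration of that fix (a sketch, not the verbatim upstream diff;
reg_mask, dreg and sreg follow the naming used by the precision-backtracking
code in backtrack_insn()), the conditional-jump case propagates the precision
mark to both operands:

    /* Inside backtrack_insn(), for the jmp/jmp32 instruction class.
     * reg_mask is the bitmask of registers that still need a precision
     * mark in the parent verifier state. */
    if (class == BPF_JMP || class == BPF_JMP32) {
        if (opcode != BPF_CALL && opcode != BPF_EXIT &&
            BPF_SRC(insn->code) == BPF_X) {
            /* dreg <cond-jmp> sreg: if either side already needs
             * precision, both sides do, because an imprecise register
             * that influences a precise one becomes precision-relevant
             * itself. */
            if (*reg_mask & (dreg | sreg))
                *reg_mask |= (sreg | dreg);
        }
    }
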
After the fix the program is correctly rejected:
func#0 @0
0: R1=ctx(off=0,imm=0) R10=fp0
0: (b7) r6 = 1024 ; R6_w=1024
1: (b7) r7 = 0 ; R7_w=0
2: (b7) r8 = 0 ; R8_w=0
3: (b7) r9 = -2147483648 ; R9_w=-2147483648
4: (97) r6 %= 1025 ; R6_w=scalar()
5: (05) goto pc+0
6: (bd) if r6 <= r9 goto pc+2 ; R6_w=scalar(umin=18446744071562067969,var_off=(0xffffffff80000000; 0x7fffffff),u32_min=-2147483648) R9_w=-2147483648
7: (97) r6 %= 1 ; R6_w=scalar()
8: (b7) r9 = 0 ; R9=0
9: (bd) if r6 <= r9 goto pc+1 ; R6=scalar(umin=1) R9=0
10: (b7) r6 = 0 ; R6_w=0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 9
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=0
22: (27) r6 *= 8192 ; R6_w=0
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 19
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
parent didn't have regs=40 stack=0 marks: R0_rw=map_value_or_null(id=1,off=0,ks=4,vs=48,imm=0) R6_rw=P0 R7=0 R8=0 R9=0 R10=fp0 fp-8=mmmm????
last_idx 18 first_idx 9
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
regs=40 stack=0 before 10: (b7) r6 = 0
25: (79) r3 = *(u64 *)(r0 +0) ; R0_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
26: (7b) *(u64 *)(r1 +0) = r3 ; R1_w=map_value(off=0,ks=4,vs=48,imm=0) R3_w=scalar()
27: (95) exit
from 9 to 11: R1=ctx(off=0,imm=0) R6=0 R7=0 R8=0 R9=0 R10=fp0
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1
frame 0: propagating r6
last_idx 19 first_idx 11
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_r=P0 R7=0 R8=0 R9=0 R10=fp0
last_idx 9 first_idx 9
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
parent didn't have regs=240 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar() R7_w=0 R8_w=0 R9_rw=P0 R10=fp0
last_idx 8 first_idx 0
regs=240 stack=0 before 8: (b7) r9 = 0
regs=40 stack=0 before 7: (97) r6 %= 1
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
19: safe
from 6 to 9: R1=ctx(off=0,imm=0) R6_w=scalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
9: (bd) if r6 <= r9 goto pc+1
last_idx 9 first_idx 0
regs=40 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
last_idx 9 first_idx 0
regs=200 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
11: R6=scalar(umax=18446744071562067968) R9=-2147483648
11: (b7) r0 = 0 ; R0_w=0
12: (63) *(u32 *)(r10 -4) = r0
last_idx 12 first_idx 11
regs=1 stack=0 before 11: (b7) r0 = 0
13: R0_w=0 R10=fp0 fp-8=0000????
13: (18) r4 = 0xffff9290dc5bfe00 ; R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
15: (bf) r1 = r4 ; R1_w=map_ptr(off=0,ks=4,vs=48,imm=0) R4_w=map_ptr(off=0,ks=4,vs=48,imm=0)
16: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
17: (07) r2 += -4 ; R2_w=fp-4
18: (85) call bpf_map_lookup_elem#1 ; R0_w=map_value_or_null(id=3,off=0,ks=4,vs=48,imm=0)
19: (55) if r0 != 0x0 goto pc+1 ; R0_w=0
20: (95) exit
from 19 to 21: R0=map_value(off=0,ks=4,vs=48,imm=0) R6=scalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
21: (77) r6 >>= 10 ; R6_w=scalar(umax=18014398507384832,var_off=(0x0; 0x3fffffffffffff))
22: (27) r6 *= 8192 ; R6_w=scalar(smax=9223372036854767616,umax=18446744073709543424,var_off=(0x0; 0xffffffffffffe000),s32_max=2147475456,u32_max=-8192)
23: (bf) r1 = r0 ; R0=map_value(off=0,ks=4,vs=48,imm=0) R1_w=map_value(off=0,ks=4,vs=48,imm=0)
24: (0f) r0 += r6
last_idx 24 first_idx 21
regs=40 stack=0 before 23: (bf) r1 = r0
regs=40 stack=0 before 22: (27) r6 *= 8192
regs=40 stack=0 before 21: (77) r6 >>= 10
parent didn't have regs=40 stack=0 marks: R0_rw=map_value(off=0,ks=4,vs=48,imm=0) R6_r=Pscalar(umax=18446744071562067968) R7=0 R8=0 R9=-2147483648 R10=fp0 fp-8=mmmm????
last_idx 19 first_idx 11
regs=40 stack=0 before 19: (55) if r0 != 0x0 goto pc+1
regs=40 stack=0 before 18: (85) call bpf_map_lookup_elem#1
regs=40 stack=0 before 17: (07) r2 += -4
regs=40 stack=0 before 16: (bf) r2 = r10
regs=40 stack=0 before 15: (bf) r1 = r4
regs=40 stack=0 before 13: (18) r4 = 0xffff9290dc5bfe00
regs=40 stack=0 before 12: (63) *(u32 *)(r10 -4) = r0
regs=40 stack=0 before 11: (b7) r0 = 0
parent didn't have regs=40 stack=0 marks: R1=ctx(off=0,imm=0) R6_rw=Pscalar(umax=18446744071562067968) R7_w=0 R8_w=0 R9_w=-2147483648 R10=fp0
last_idx 9 first_idx 0
regs=40 stack=0 before 9: (bd) if r6 <= r9 goto pc+1
regs=240 stack=0 before 6: (bd) if r6 <= r9 goto pc+2
regs=240 stack=0 before 5: (05) goto pc+0
regs=240 stack=0 before 4: (97) r6 %= 1025
regs=240 stack=0 before 3: (b7) r9 = -2147483648
regs=40 stack=0 before 2: (b7) r8 = 0
regs=40 stack=0 before 1: (b7) r7 = 0
regs=40 stack=0 before 0: (b7) r6 = 1024
math between map_value pointer and register with unbounded min value is not allowed
verification time 886 usec
stack depth 4
processed 49 insns (limit 1000000) max_states_per_insn 1 total_states 5 peak_states 5 mark_read 2
Fixes:

Mathieu Desnoyers | b20b0368c6
mm: fix memory leak on mm_init error handling
commit

Ondrej Mosnacek | 659c0ce1cb
kernel/sys.c: fix and improve control flow in __sys_setres[ug]id()
Linux Security Modules (LSMs) that implement the "capable" hook will
usually emit an access denial message to the audit log whenever they
"block" the current task from using the given capability based on their
security policy.
The occurrence of a denial is used as an indication that the given task
has attempted an operation that requires the given access permission, so
the callers of functions that perform LSM permission checks must take care
to avoid calling them too early (before it is decided if the permission is
actually needed to perform the requested operation).
The __sys_setres[ug]id() functions violate this convention by first
calling ns_capable_setid() and only then checking if the operation
requires the capability or not. It means that any caller that has the
capability granted by DAC (task's capability set) but not by MAC (LSMs)
will generate a "denied" audit record, even if it is doing an operation for
which the capability is not required.
Fix this by reordering the checks such that ns_capable_setid() is checked
last and -EPERM is returned immediately if it returns false.
While there, also do two small optimizations:
* move the capability check before prepare_creds() and
* bail out early in case of a no-op.
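
A minimal sketch of the reordered flow described above (illustrative only,
with the euid/suid handling elided; ns_capable_setid(), prepare_creds(),
make_kuid() and uid_eq() are existing helpers, the rest is schematic):

    const struct cred *old = current_cred();
    struct cred *new;
    kuid_t kruid = make_kuid(current_user_ns(), ruid);
    bool ruid_new, euid_new, suid_new;

    /* 1. Work out whether anything would actually change. */
    ruid_new = ruid != (uid_t) -1 && !uid_eq(kruid, old->uid) &&
               !uid_eq(kruid, old->euid) && !uid_eq(kruid, old->suid);
    /* ... euid_new and suid_new are computed the same way ... */

    /* 2. Bail out early on a no-op, before any LSM is consulted. */
    if (!ruid_new && !euid_new && !suid_new)
        return 0;

    /* 3. Only a real change needs CAP_SETUID; checking it after the
     *    no-op test means LSMs never log a denial for an operation
     *    that did not need the capability. */
    if (!ns_capable_setid(old->user_ns, CAP_SETUID))
        return -EPERM;

    new = prepare_creds();
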
Link: https://lkml.kernel.org/r/20230217162154.837549-1-omosnace@redhat.com
Fixes:

Linus Torvalds | 6c538e1adb
Merge tag 'sched_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduler fix from Borislav Petkov:

 - Do not pull tasks to the local scheduling group if its average load
   is higher than the average system load

* tag 'sched_urgent_for_v6.3_rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix imbalance overflow

Linus Torvalds | 44149752e9
cgroup: Fixes for v6.3-rc6
Merge tag 'cgroup-for-6.3-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

Pull cgroup fixes from Tejun Heo:
 "This is a relatively big pull request this late in the cycle but the
  major contributor is the cpuset bug which is rather significant:

   - Fix several cpuset bugs including one where it wasn't applying the
     target cgroup when tasks are created with CLONE_INTO_CGROUP

  With a few smaller fixes:

   - Fix inversed locking order in cgroup1 freezer implementation

   - Fix garbage cpu.stat::core_sched.forceidle_usec reporting in the
     root cgroup"

* tag 'cgroup-for-6.3-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset
  cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
  cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
  cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
  cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex
  cgroup/cpuset: Fix partition root's cpuset.cpus update bug
  cgroup: fix display of forceidle time at root

Waiman Long | 7e27cb6ad4
cgroup/cpuset: Make cpuset_attach_task() skip subpartitions CPUs for top_cpuset
It is found that attaching a task to the top_cpuset does not currently
ignore CPUs allocated to subpartitions in cpuset_attach_task(). So the
code is changed to fix that.

Signed-off-by: Waiman Long <longman@redhat.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>

Waiman Long | eee8785379
cgroup/cpuset: Add cpuset_can_fork() and cpuset_cancel_fork() methods
In the case of CLONE_INTO_CGROUP, not all cpusets are ready to accept
new tasks. It is too late to check that in cpuset_fork(). So we need
to add the cpuset_can_fork() and cpuset_cancel_fork() methods to
pre-check it before we can allow attachment to a different cpuset.
We also need to set the attach_in_progress flag to alert other code
that a new task is going to be added to the cpuset.
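
A sketch of the shape of the change (not the full patch): the cpuset
controller gains can_fork/cancel_fork callbacks so a CLONE_INTO_CGROUP
target can be vetoed, and attach_in_progress raised, before the child
exists, mirroring what the can_attach/cancel_attach pair does:

    struct cgroup_subsys cpuset_cgrp_subsys = {
        /* ... existing callbacks ... */
        .can_fork    = cpuset_can_fork,    /* pre-check target cs, bump attach_in_progress */
        .cancel_fork = cpuset_cancel_fork, /* undo the pre-check if the fork fails */
        .fork        = cpuset_fork,
        /* ... */
    };
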
Fixes:

Waiman Long | 42a11bf5c5
cgroup/cpuset: Make cpuset_fork() handle CLONE_INTO_CGROUP properly
By default, the clone(2) syscall spawns a child process into the same
cgroup as its parent. With the use of the CLONE_INTO_CGROUP flag
introduced by commit

Waiman Long | ba9182a896
cgroup/cpuset: Wake up cpuset_attach_wq tasks in cpuset_cancel_attach()
After a successful cpuset_can_attach() call which increments the
attach_in_progress flag, either cpuset_cancel_attach() or cpuset_attach()
will be called later. In cpuset_attach(), tasks in cpuset_attach_wq,
if present, will be woken up at the end. That is not the case in
cpuset_cancel_attach(). So missed wakeup is possible if the attach
operation is somehow cancelled. Fix that by doing the wakeup in
cpuset_cancel_attach() as well.
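
A minimal sketch of the idea (the waitqueue and counter names are the ones
used in the description above; locking details elided):

    static void cpuset_cancel_attach(struct cgroup_taskset *tset)
    {
        struct cgroup_subsys_state *css;
        struct cpuset *cs;

        cgroup_taskset_first(tset, &css);
        cs = css_cs(css);

        /* Runs under the same cpuset locking as cpuset_can_attach(). */
        cs->attach_in_progress--;
        if (!cs->attach_in_progress)
            wake_up(&cpuset_attach_wq);  /* mirror what cpuset_attach() does */
    }
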
Fixes:

Tetsuo Handa | 57dcd64c7e
cgroup,freezer: hold cpu_hotplug_lock before freezer_mutex
syzbot is reporting a circular locking dependency between cpu_hotplug_lock
and freezer_mutex, for commit

Vincent Guittot | 91dcf1e806
sched/fair: Fix imbalance overflow
When local group is fully busy but its average load is above system load,
computing the imbalance will overflow and local group is not the best
target for pulling this load.
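
A sketch of the guard this implies in the load-balancing code (condition
paraphrased from the description above, not the verbatim patch; local, sds
and env follow the calculate_imbalance() naming):

    if (local->avg_load >= sds->avg_load) {
        /* The local group's average load is already at or above the
         * system average: pulling more tasks here cannot help, and
         * computing sds->avg_load - local->avg_load would wrap around
         * (unsigned underflow), so report no imbalance instead. */
        env->imbalance = 0;
        return;
    }
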
Fixes:

Linus Torvalds | 0d3eb744ae
Urgent RCU pull request for v6.3
Merge tag 'urgent-rcu.2023.04.07a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu

Pull RCU fix from Paul McKenney:
 "This fixes a pair of bugs in which an improbable but very real
  sequence of events can cause kfree_rcu() to be a bit too quick about
  freeing the memory passed to it. It turns out that this pair of bugs
  is about two years old, and so this is not a v6.3 regression.
  However: (1) It just started showing up in the wild and (2) Its
  consequences are dire, so its fix needs to go in sooner rather than
  later.

  Testing is of course being upgraded, and the upgraded tests detect
  this situation very quickly. But to the best of my knowledge right
  now, the tests are not particularly urgent and will thus most likely
  show up in the v6.5 merge window (the one after this coming one).

  Kudos to Ziwei Dai and his group for tracking this one down the hard
  way!"

* tag 'urgent-rcu.2023.04.07a' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu:
  rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period

Linus Torvalds | faf8f41858
Merge tag 'perf_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fixes from Borislav Petkov:

 - Fix "same task" check when redirecting event output

 - Do not wait unconditionally for RCU on the event migration path if
   there are no events to migrate

* tag 'perf_urgent_for_v6.3_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/core: Fix the same task check in perf_event_set_output
  perf: Optimize perf_pmu_migrate_context()

Linus Torvalds | 973ad544f0
dma-mapping fix for Linux 6.3
Merge tag 'dma-mapping-6.3-2023-04-08' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:

 - fix a braino in the swiotlb alignment check fix (Petr Tesarik)

* tag 'dma-mapping-6.3-2023-04-08' of git://git.infradead.org/users/hch/dma-mapping:
  swiotlb: fix a braino in the alignment check fix

Linus Torvalds | 1a8a804a4f
Some more tracing fixes for 6.3:
Merge tag 'trace-v6.3-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:
 "A couple more minor fixes:

   - Reset direct->addr back to its original value on error in updating
     the direct trampoline code

   - Make lastcmd_mutex static"

* tag 'trace-v6.3-rc5-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/synthetic: Make lastcmd_mutex static
  ftrace: Fix issue that 'direct->addr' not restored in modify_ftrace_direct()

Linus Torvalds | 6fda0bb806
28 hotfixes.
Merge tag 'mm-hotfixes-stable-2023-04-07-16-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull MM fixes from Andrew Morton:
 "28 hotfixes.

  23 are cc:stable and the other five address issues which were
  introduced during this merge cycle.

  20 are for MM and the remainder are for other subsystems"

* tag 'mm-hotfixes-stable-2023-04-07-16-23' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (28 commits)
  maple_tree: fix a potential concurrency bug in RCU mode
  maple_tree: fix get wrong data_end in mtree_lookup_walk()
  mm/swap: fix swap_info_struct race between swapoff and get_swap_pages()
  nilfs2: fix sysfs interface lifetime
  mm: take a page reference when removing device exclusive entries
  mm: vmalloc: avoid warn_alloc noise caused by fatal signal
  nilfs2: initialize "struct nilfs_binfo_dat"->bi_pad field
  nilfs2: fix potential UAF of struct nilfs_sc_info in nilfs_segctor_thread()
  zsmalloc: document freeable stats
  zsmalloc: document new fullness grouping
  fsdax: force clear dirty mark if CoW
  mm/hugetlb: fix uffd wr-protection for CoW optimization path
  mm: enable maple tree RCU mode by default
  maple_tree: add RCU lock checking to rcu callback functions
  maple_tree: add smp_rmb() to dead node detection
  maple_tree: fix write memory barrier of nodes once dead for RCU mode
  maple_tree: remove extra smp_wmb() from mas_dead_leaves()
  maple_tree: fix freeing of nodes in rcu mode
  maple_tree: detect dead nodes in mas_start()
  maple_tree: be more cautious about dead nodes
  ...

Steven Rostedt (Google) | 31c6839671
tracing/synthetic: Make lastcmd_mutex static
The lastcmd_mutex is only used in trace_events_synth.c and should be
static.
Link: https://lore.kernel.org/linux-trace-kernel/202304062033.cRStgOuP-lkp@intel.com/
Link: https://lore.kernel.org/linux-trace-kernel/20230406111033.6e26de93@gandalf.local.home
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Tze-nan Wu <Tze-nan.Wu@mediatek.com>
Fixes:

Ziwei Dai | 5da7cb193d
rcu/kvfree: Avoid freeing new kfree_rcu() memory after old grace period
Memory passed to kvfree_rcu() that is to be freed is tracked by a per-CPU
kfree_rcu_cpu structure, which in turn contains pointers to
kvfree_rcu_bulk_data structures that contain pointers to memory that has
not yet been handed to RCU, along with a kfree_rcu_cpu_work structure that
tracks the memory that has already been handed to RCU. These structures
track three categories of memory: (1) memory for kfree(), (2) memory for
kvfree(), and (3) memory for both that arrived during an OOM episode.

The first two categories are tracked in a cache-friendly manner involving
a dynamically allocated page of pointers (the aforementioned
kvfree_rcu_bulk_data structures), while the third uses a simple (but
decidedly cache-unfriendly) linked list through the rcu_head structures in
each block of memory.

On a given CPU, these three categories are handled as a unit, with that
CPU's kfree_rcu_cpu_work structure having one pointer for each of the
three categories. Clearly, new memory for a given category cannot be
placed in the corresponding kfree_rcu_cpu_work structure until any old
memory has had its grace period elapse and thus has been removed. And the
kfree_rcu_monitor() function does in fact check for this.

Except that the kfree_rcu_monitor() function checks these pointers one at
a time. This means that if the previous kfree_rcu() memory passed to RCU
had only category 1 and the current one has only category 2, the
kfree_rcu_monitor() function will send that current category-2 memory
along immediately. This can result in memory being freed too soon, that
is, out from under unsuspecting RCU readers.

To see this, consider the following sequence of events, in which:

o Task A on CPU 0 calls rcu_read_lock(), then uses "from_cset", then is
  preempted.

o CPU 1 calls kfree_rcu(cset, rcu_head) in order to free "from_cset"
  after a later grace period. Except that "from_cset" is freed right
  after the previous grace period ended, so that "from_cset" is
  immediately freed. Task A resumes and references "from_cset"'s member,
  after which nothing good happens.

In full detail:

CPU 0:
  count_memcg_event_mm()
  |rcu_read_lock()  <---
  |mem_cgroup_from_task()
   |// css_set_ptr is the "from_cset" mentioned on CPU 1
   |css_set_ptr = rcu_dereference((task)->cgroups)
   |// Hard irq comes, current task is scheduled out.

CPU 1:
  cgroup_attach_task()
  |cgroup_migrate()
  |cgroup_migrate_execute()
  |css_set_move_task(task, from_cset, to_cset, true)
  |cgroup_move_task(task, to_cset)
  |rcu_assign_pointer(.., to_cset)
  |...
  |cgroup_migrate_finish()
  |put_css_set_locked(from_cset)
  |from_cset->refcount return 0
  |kfree_rcu(cset, rcu_head) // free from_cset after new gp
  |add_ptr_to_bulk_krc_lock()
  |schedule_delayed_work(&krcp->monitor_work, ..)

  kfree_rcu_monitor()
  |krcp->bulk_head[0]'s work attached to krwp->bulk_head_free[]
  |queue_rcu_work(system_wq, &krwp->rcu_work)
  |if rwork->rcu.work is not in WORK_STRUCT_PENDING_BIT state,
  |call_rcu(&rwork->rcu, rcu_work_rcufn) <--- request new gp

  // There is a previous call_rcu(.., rcu_work_rcufn)
  // gp end, rcu_work_rcufn() is called.
  rcu_work_rcufn()
  |__queue_work(.., rwork->wq, &rwork->work);
  |kfree_rcu_work()
  |krwp->bulk_head_free[0] bulk is freed before new gp end!!!
  |The "from_cset" is freed before new gp end.

CPU 0:
  // the task resumes some time later.
  |css_set_ptr->subsys[(subsys_id) <--- Caused kernel crash, because css_set_ptr is freed.

This commit therefore causes kfree_rcu_monitor() to refrain from moving
kfree_rcu() memory to the kfree_rcu_cpu_work structure until the RCU
grace period has completed for all three categories.

v2: Use helper function instead of inserted code block at
kfree_rcu_monitor().

Fixes:
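
A sketch of the v2 helper idea (field names modelled on the structures
named above and to be treated as illustrative): a kfree_rcu_cpu_work slot
may only receive a new batch once none of its three channels still holds
memory from a previous grace period.

    static bool krwp_still_in_flight(struct kfree_rcu_cpu_work *krwp)
    {
        int i;

        /* Channels 1 and 2: bulk pages of pointers. */
        for (i = 0; i < FREE_N_CHANNELS; i++)
            if (!list_empty(&krwp->bulk_head_free[i]))
                return true;

        /* Channel 3: the emergency rcu_head linked list. */
        return !!READ_ONCE(krwp->head_free);
    }

kfree_rcu_monitor() would then skip handing a new batch to any slot for
which this returns true, instead of deciding per channel.
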

Zheng Yejian | 2a2d8c51de
ftrace: Fix issue that 'direct->addr' not restored in modify_ftrace_direct()
Syzkaller report a WARNING: "WARN_ON(!direct)" in modify_ftrace_direct().
Root cause is 'direct->addr' was changed from 'old_addr' to 'new_addr' but
not restored if error happened on calling ftrace_modify_direct_caller().
Then it can no longer find 'direct' by that 'old_addr'.
To fix it, restore 'direct->addr' to 'old_addr' explicitly in error path.
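
A minimal sketch of the error-path pattern being described (the argument
list of ftrace_modify_direct_caller() is shown schematically; only the
restore is the point here):

    direct->addr = new_addr;
    ret = ftrace_modify_direct_caller(entry, ip, old_addr, new_addr);
    if (ret)
        direct->addr = old_addr;  /* keep lookups by 'old_addr' working */
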
Link: https://lore.kernel.org/linux-trace-kernel/20230330025223.1046087-1-zhengyejian1@huawei.com
Cc: stable@vger.kernel.org
Cc: <mhiramat@kernel.org>
Cc: <mark.rutland@arm.com>
Cc: <ast@kernel.org>
Cc: <daniel@iogearbox.net>
Fixes:

Petr Tesarik | bbb73a103f
swiotlb: fix a braino in the alignment check fix
The alignment mask in swiotlb_do_find_slots() masks off the high
bits which are not relevant for the alignment, so multiple
requirements are combined with a bitwise OR rather than AND.
In plain English, the stricter the alignment, the more bits must
be set in iotlb_align_mask.
Confusion may arise from the fact that the same variable is also
used to mask off the offset within a swiotlb slot, which is
achieved with a bitwise AND.
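
A toy, self-contained illustration of why the masks must be OR-ed (this is
not the swiotlb code, just the bit arithmetic): the stricter requirement
sets more low bits, and OR keeps them while AND silently drops them.

    #include <stdio.h>

    int main(void)
    {
        unsigned long mask_4k  = 0xfffUL;   /* low bits constrained by 4 KiB alignment */
        unsigned long mask_64k = 0xffffUL;  /* low bits constrained by 64 KiB alignment */

        /* Combining requirements: OR keeps the stricter (64 KiB) constraint... */
        printf("OR : %#lx\n", mask_4k | mask_64k);  /* 0xffff */
        /* ...while AND would quietly relax it back to 4 KiB. */
        printf("AND: %#lx\n", mask_4k & mask_64k);  /* 0xfff */
        return 0;
    }
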
Fixes:

Liam R. Howlett | 3dd4432549
mm: enable maple tree RCU mode by default
Use the maple tree in RCU mode for VMA tracking.
The maple tree tracks the stack and is able to update the pivot
(lower/upper boundary) in-place to allow the page fault handler to write
to the tree while holding just the mmap read lock. This is safe as the
writes to the stack have a guard VMA which ensures there will always be a
NULL in the direction of the growth and thus will only update a pivot.
It is possible, but not recommended, to have VMAs that grow up/down
without guard VMAs. syzbot has constructed a testcase which sets up a VMA
to grow and consume the empty space. Overwriting the entire NULL entry
causes the tree to be altered in a way that is not safe for concurrent
readers; the readers may see a node being rewritten or one that does not
match the maple state they are using.
Enabling RCU mode allows the concurrent readers to see a stable node and
will return the expected result.
[Liam.Howlett@Oracle.com: we don't need to free the nodes with RCU]
Link: https://lore.kernel.org/linux-mm/000000000000b0a65805f663ace6@google.com/
Link: https://lkml.kernel.org/r/20230227173632.3292573-9-surenb@google.com
Fixes:

Steven Rostedt (Google) | 3357c6e429
tracing: Free error logs of tracing instances
When a tracing instance is removed, the error messages that hold errors
that occurred in the instance need to be freed. The following reports a
memory leak:
# cd /sys/kernel/tracing
# mkdir instances/foo
# echo 'hist:keys=x' > instances/foo/events/sched/sched_switch/trigger
# cat instances/foo/error_log
[ 117.404795] hist:sched:sched_switch: error: Couldn't find field
Command: hist:keys=x
^
# rmdir instances/foo
Then check for memory leaks:
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff88810d8ec700 (size 192):
comm "bash", pid 869, jiffies 4294950577 (age 215.752s)
hex dump (first 32 bytes):
60 dd 68 61 81 88 ff ff 60 dd 68 61 81 88 ff ff `.ha....`.ha....
a0 30 8c 83 ff ff ff ff 26 00 0a 00 00 00 00 00 .0......&.......
backtrace:
[<00000000dae26536>] kmalloc_trace+0x2a/0xa0
[<00000000b2938940>] tracing_log_err+0x277/0x2e0
[<000000004a0e1b07>] parse_atom+0x966/0xb40
[<0000000023b24337>] parse_expr+0x5f3/0xdb0
[<00000000594ad074>] event_hist_trigger_parse+0x27f8/0x3560
[<00000000293a9645>] trigger_process_regex+0x135/0x1a0
[<000000005c22b4f2>] event_trigger_write+0x87/0xf0
[<000000002cadc509>] vfs_write+0x162/0x670
[<0000000059c3b9be>] ksys_write+0xca/0x170
[<00000000f1cddc00>] do_syscall_64+0x3e/0xc0
[<00000000868ac68c>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
unreferenced object 0xffff888170c35a00 (size 32):
comm "bash", pid 869, jiffies 4294950577 (age 215.752s)
hex dump (first 32 bytes):
0a 20 20 43 6f 6d 6d 61 6e 64 3a 20 68 69 73 74 . Command: hist
3a 6b 65 79 73 3d 78 0a 00 00 00 00 00 00 00 00 :keys=x.........
backtrace:
[<000000006a747de5>] __kmalloc+0x4d/0x160
[<000000000039df5f>] tracing_log_err+0x29b/0x2e0
[<000000004a0e1b07>] parse_atom+0x966/0xb40
[<0000000023b24337>] parse_expr+0x5f3/0xdb0
[<00000000594ad074>] event_hist_trigger_parse+0x27f8/0x3560
[<00000000293a9645>] trigger_process_regex+0x135/0x1a0
[<000000005c22b4f2>] event_trigger_write+0x87/0xf0
[<000000002cadc509>] vfs_write+0x162/0x670
[<0000000059c3b9be>] ksys_write+0xca/0x170
[<00000000f1cddc00>] do_syscall_64+0x3e/0xc0
[<00000000868ac68c>] entry_SYSCALL_64_after_hwframe+0x72/0xdc
The problem is that the error log needs to be freed when the instance is
removed.
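
A sketch of the kind of cleanup this implies (the helper name is
hypothetical and the field names indicative; tr->err_log is the
per-instance list visible in the kmemleak backtraces above):

    static void free_instance_err_log(struct trace_array *tr)
    {
        struct tracing_log_err *err, *next;

        /* Drop every queued error record (and its saved command string)
         * before the instance's trace_array goes away. */
        list_for_each_entry_safe(err, next, &tr->err_log, list) {
            list_del(&err->list);
            kfree(err->cmd);
            kfree(err);
        }
    }

This would be called from the instance-removal path so nothing is left
dangling once the directory is gone.
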
Link: https://lore.kernel.org/lkml/76134d9f-a5ba-6a0d-37b3-28310b4a1e91@alu.unizg.hr/
Link: https://lore.kernel.org/linux-trace-kernel/20230404194504.5790b95f@gandalf.local.home
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thorsten Leemhuis <regressions@leemhuis.info>
Cc: Ulf Hansson <ulf.hansson@linaro.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Fixes:

Kan Liang | 24d3ae2f37
perf/core: Fix the same task check in perf_event_set_output
The same-task check in perf_event_set_output() has some potential issues
for some usages. For the current perf code, there is a problem when
perf_event_open() is used to have multiple samples go into the same
mmap'd memory while both events are attached to the same process:
https://lore.kernel.org/all/92645262-D319-4068-9C44-2409EF44888E@gmail.com/
This happens because event->ctx is not yet ready when
perf_event_set_output() is invoked from perf_event_open(). Besides the
above issue, before the commit

Peter Zijlstra | b168098912
perf: Optimize perf_pmu_migrate_context()
Thomas reported that offlining CPUs spends a lot of time in
synchronize_rcu() as called from perf_pmu_migrate_context() even though
he's not actually using uncore events.
Turns out, the thing is unconditionally waiting for RCU, even if there's
no actual events to migrate.
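
A sketch of the shape of the optimization (schematic, not the literal
diff): only pay for the grace period when at least one event was actually
unplugged from the source context.

    /* Inside perf_pmu_migrate_context(), schematically: */
    struct perf_event *event, *tmp;
    LIST_HEAD(events);
    bool moved = false;

    list_for_each_entry_safe(event, tmp, &src_ctx->event_list, event_entry) {
        perf_remove_from_context(event, 0);
        list_add(&event->migrate_entry, &events);
        moved = true;
    }

    /* Wait for the events to quiesce only if something was removed. */
    if (moved)
        synchronize_rcu();
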
Fixes:

Steven Rostedt (Google) | e94891641c
tracing: Fix ftrace_boot_snapshot command line logic
The kernel command line ftrace_boot_snapshot by itself is supposed to
trigger a snapshot at the end of boot up of the main top level trace
buffer. A ftrace_boot_snapshot=foo will do the same for an instance called
foo that was created by trace_instance=foo,...
The logic was broken where if ftrace_boot_snapshot was by itself, it would
trigger a snapshot for all instances that had tracing enabled, regardless
if it asked for a snapshot or not.
When a snapshot is requested for a buffer, the buffer's
tr->allocated_snapshot is set to true. Use that to know if a trace buffer
wants a snapshot at boot up or not.
Since the top level buffer is part of the ftrace_trace_arrays list,
there's no reason to treat it differently than the other buffers. Just
iterate the list if ftrace_boot_snapshot was specified.
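
A sketch of the boot-time loop this describes (assuming the usual
ftrace_trace_arrays list and the tr->allocated_snapshot flag mentioned
above):

    struct trace_array *tr;

    /* Snapshot only the buffers that actually requested it, including
     * the top-level buffer, which sits on the same list. */
    list_for_each_entry(tr, &ftrace_trace_arrays, list) {
        if (tr->allocated_snapshot)
            tracing_snapshot_instance(tr);
    }
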
Link: https://lkml.kernel.org/r/20230405022341.895334039@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Fixes:

Steven Rostedt (Google) | 9d52727f80
tracing: Have tracing_snapshot_instance_cond() write errors to the appropriate instance
If a trace instance has a failure with its snapshot code, the error
message is to be written to that instance's buffer. But currently, the
message is written to the top level buffer. Worse yet, it may also disable
the top level buffer and not the instance that had the issue.
Link: https://lkml.kernel.org/r/20230405022341.688730321@goodmis.org
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Ross Zwisler <zwisler@google.com>
Fixes:

Daniel Bristot de Oliveira | d3cba7f02c
tracing/osnoise: Fix notify new tracing_max_latency
osnoise/timerlat tracers are reporting new max latency on instances
where the tracing is off, creating inconsistencies between the max
reported values in the trace and in the tracing_max_latency. Thus
only report new tracing_max_latency on active tracing instances.
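
A sketch of the guard (tracer_tracing_is_on() and latency_fsnotify() are
existing helpers; the loop shape follows the per-instance list the osnoise
tracer keeps, and is illustrative rather than the verbatim patch):

    rcu_read_lock();
    list_for_each_entry_rcu(inst, &osnoise_instances, list) {
        struct trace_array *tr = inst->tr;

        /* Skip instances whose tracing is currently off so their
         * tracing_max_latency is not updated behind their back. */
        if (!tracer_tracing_is_on(tr) || latency <= tr->max_latency)
            continue;

        tr->max_latency = latency;
        latency_fsnotify(tr);
    }
    rcu_read_unlock();
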
Link: https://lkml.kernel.org/r/ecd109fde4a0c24ab0f00ba1e9a144ac19a91322.1680104184.git.bristot@kernel.org
Cc: stable@vger.kernel.org
Fixes:

Daniel Bristot de Oliveira | b9f451a902
tracing/timerlat: Notify new max thread latency
timerlat is not reporting a new tracing_max_latency for the thread
latency. The reason is that it is not calling notify_new_max_latency()
function after the new thread latency is sampled.
Call notify_new_max_latency() after computing the thread latency.
Link: https://lkml.kernel.org/r/16e18d61d69073d0192ace07bf61e405cca96e9c.1680104184.git.bristot@kernel.org
Cc: stable@vger.kernel.org
Fixes:

Zheng Yejian | 6455b6163d
ring-buffer: Fix race while reader and writer are on the same page
When user reads file 'trace_pipe', kernel keeps printing following logs
that warn at "cpu_buffer->reader_page->read > rb_page_size(reader)" in
rb_get_reader_page(). It just looks like there's an infinite loop in
tracing_read_pipe(). This problem occurs several times on arm64 platform
when testing v5.10 and below.

Call trace:
 rb_get_reader_page+0x248/0x1300
 rb_buffer_peek+0x34/0x160
 ring_buffer_peek+0xbc/0x224
 peek_next_entry+0x98/0xbc
 __find_next_entry+0xc4/0x1c0
 trace_find_next_entry_inc+0x30/0x94
 tracing_read_pipe+0x198/0x304
 vfs_read+0xb4/0x1e0
 ksys_read+0x74/0x100
 __arm64_sys_read+0x24/0x30
 el0_svc_common.constprop.0+0x7c/0x1bc
 do_el0_svc+0x2c/0x94
 el0_svc+0x20/0x30
 el0_sync_handler+0xb0/0xb4
 el0_sync+0x160/0x180

Then I dump the vmcore and look into the problematic per_cpu ring_buffer,
I found that tail_page/commit_page/reader_page are on the same page while
reader_page->read is obviously abnormal:

tail_page == commit_page == reader_page == {
  .write = 0x100d20,
  .read = 0x8f9f4805,    // Far greater than 0xd20, obviously abnormal!!!
  .entries = 0x10004c,
  .real_end = 0x0,
  .page = {
    .time_stamp = 0x857257416af0,
    .commit = 0xd20,     // This page hasn't been full filled.
    // .data[0...0xd20] seems normal.
  }
}

The root cause is most likely the race that reader and writer are on the
same page while reader saw an event that not fully committed by writer.
To fix this, add memory barriers to make sure the reader can see the
content of what is committed. Since commit
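
The generic publish/consume pattern the fix relies on looks roughly like
this (the function names below are placeholders, not ring-buffer API; the
real patch places the barriers in the commit and reader paths):

    /* Writer side: make the event data visible before the updated
     * commit/write counters that announce it. */
    write_event_data(page, event);
    smp_wmb();
    publish_commit_count(page, new_commit);

    /* Reader side: read the counters first, then insist on seeing
     * at least that much committed data. */
    commit = read_commit_count(page);
    smp_rmb();
    consume_event_data(page, commit);
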

Tze-nan Wu | 4ccf11c4e8
tracing/synthetic: Fix races on freeing last_cmd
Currently, the "last_cmd" variable can be accessed by multiple processes
asynchronously when multiple users manipulate synthetic_events node
at the same time, which could lead to use-after-free or double-free.
This patch adds "lastcmd_mutex" to prevent "last_cmd" from being accessed
asynchronously.
================================================================
It's easy to reproduce in the KASAN environment by running the two
scripts below in different shells.
script 1:
while :
do
echo -n -e '\x88' > /sys/kernel/tracing/synthetic_events
done
script 2:
while :
do
echo -n -e '\xb0' > /sys/kernel/tracing/synthetic_events
done
================================================================
double-free scenario:
process A process B
------------------- ---------------
1.kstrdup last_cmd
2.free last_cmd
3.free last_cmd(double-free)
================================================================
use-after-free scenario:
process A process B
------------------- ---------------
1.kstrdup last_cmd
2.free last_cmd
3.tracing_log_err(use-after-free)
================================================================
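
A minimal sketch of the serialization being added (assuming last_cmd is
set via kstrdup() and freed with kfree(), as the scenarios above show):

    static DEFINE_MUTEX(lastcmd_mutex);
    static char *last_cmd;

    static void last_cmd_set(const char *str)
    {
        mutex_lock(&lastcmd_mutex);
        kfree(last_cmd);
        last_cmd = kstrdup(str, GFP_KERNEL);
        mutex_unlock(&lastcmd_mutex);
    }

Readers such as the error-reporting path take the same mutex around their
use of last_cmd, so a concurrent writer cannot free it underneath them.
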
Appendix 1. KASAN report double-free:
BUG: KASAN: double-free in kfree+0xdc/0x1d4
Free of addr ***** by task sh/4879
Call trace:
...
kfree+0xdc/0x1d4
create_or_delete_synth_event+0x60/0x1e8
trace_parse_run_command+0x2bc/0x4b8
synth_events_write+0x20/0x30
vfs_write+0x200/0x830
...
Allocated by task 4879:
...
kstrdup+0x5c/0x98
create_or_delete_synth_event+0x6c/0x1e8
trace_parse_run_command+0x2bc/0x4b8
synth_events_write+0x20/0x30
vfs_write+0x200/0x830
...
Freed by task 5464:
...
kfree+0xdc/0x1d4
create_or_delete_synth_event+0x60/0x1e8
trace_parse_run_command+0x2bc/0x4b8
synth_events_write+0x20/0x30
vfs_write+0x200/0x830
...
================================================================
Appendix 2. KASAN report use-after-free:
BUG: KASAN: use-after-free in strlen+0x5c/0x7c
Read of size 1 at addr ***** by task sh/5483
sh: CPU: 7 PID: 5483 Comm: sh
...
__asan_report_load1_noabort+0x34/0x44
strlen+0x5c/0x7c
tracing_log_err+0x60/0x444
create_or_delete_synth_event+0xc4/0x204
trace_parse_run_command+0x2bc/0x4b8
synth_events_write+0x20/0x30
vfs_write+0x200/0x830
...
Allocated by task 5483:
...
kstrdup+0x5c/0x98
create_or_delete_synth_event+0x80/0x204
trace_parse_run_command+0x2bc/0x4b8
synth_events_write+0x20/0x30
vfs_write+0x200/0x830
...
Freed by task 5480:
...
kfree+0xdc/0x1d4
create_or_delete_synth_event+0x74/0x204
trace_parse_run_command+0x2bc/0x4b8
synth_events_write+0x20/0x30
vfs_write+0x200/0x830
...
Link: https://lore.kernel.org/linux-trace-kernel/20230321110444.1587-1-Tze-nan.Wu@mediatek.com
Fixes:
|
||
Linus Torvalds
|
62bad54b26 |
dma-mapping fixes for Linux 6.3
- fix for swiotlb deadlock due to wrong alignment checks (GuoRui.Yu, Petr Tesarik) -----BEGIN PGP SIGNATURE----- iQI/BAABCgApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAmQmEEsLHGhjaEBsc3Qu ZGUACgkQD55TZVIEUYMfhw//b1pmwO3ESd5mLwQh9sZrvndi6oyYwqwOy5KMyVtx 0BndOh7/wpJZlYASjj2imNCYDr2g9hsKpm3ZLdN0eY0fQbwQ8ZYjMLhNCylW/nsK pr3adV+sZc0VMr3smeB0Jl7p68KU9Tz0vkDEtG/XpllhFfaS52rFSlCqagDbL11t NA+Ev39RaVij2/M8z59jrd4cr0X74PqWHgtbNawXjHKQckiRm1un5Sg05O830VV0 shGQ/msJPbYdCBT9KD7trzRvFViBS+WeMHFx6I/PbsEUt7nPkGjO7eZiORD28AQ0 NjUqVa03m38RFi9YSXE3IZms0xo4panEGndpTF/eJ0Ly3DcES9FepzI6qHQf3Dq6 vPk5ok9DmTvZy/tWcmfHDWPIsn3vOStlf4SSADTYiOcSEysUJmzRcHaSgzYGA9Fd LQV1UVuYb8ARCa8knZqaxfQstPSzX6PDt1wgHY1k0Ikdvu5OwWskSy7wEMs4Gsd8 w6lcPAvx7QRpqQ27WwSBSMWYJXrdRLO+hckchBrJ8jnedd2IEqMUwMcImq3fLogx Kl6MND8tNEcetyzKxdk9oZ7dyAG5iAQY1diBIwzuu7SIJ/Pm1KmcvfzfULYdpnhP hs8HtntEMhuAmQOWSckwxONwzjPSNEUi+SOu0ywLjaQsFpu9J8eI4bNymvx+5WvH kqg= =MxwN -----END PGP SIGNATURE----- Merge tag 'dma-mapping-6.3-2023-03-31' of git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: - fix for swiotlb deadlock due to wrong alignment checks (GuoRui.Yu, Petr Tesarik) * tag 'dma-mapping-6.3-2023-03-31' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: fix slot alignment checks swiotlb: use wrap_area_index() instead of open-coding it swiotlb: fix the deadlock in swiotlb_do_find_slots |
||
Waiman Long
|
292fd843de |
cgroup/cpuset: Fix partition root's cpuset.cpus update bug
It was found that commit |
||
Linus Torvalds
|
18940c888c |
- Fix a corner case where vruntime of a task is not being sanitized
-----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgQeEACgkQEsHwGGHe VUpJ+w//c01JpzBXQvGGlNSTTuUzSxLAQz0n8lQmixpYHOUgVL3CQlOF+OfnkYPp mz8m3nDh1FB9o70Sd2J4/OjY2Gh1qENrBLmj509LTnZE9hADRI0T5Mn93Jo7m7HY QoXrYWEwJYwsLzJ66mR8A0xts5jWgkJsAWKXF9gfxf/ieeycdJ1GdOdzC1tp4Nfe /4SEjSbUhx/bsBbAdJ38Z/iQGT0HuyQLOBGBuBcFE0JnP/aYEanAQsxxP2LObeVw Za7ATxdJ9I1TErVfRsG0GDSiVKCYzSG2GME5TXibgPJ2g1+m0I7gZgpGO9Q8Wzo4 7y0X+vqsykY/Me3xEDBVaeCiHmFTambkxOR2xVJ2TISN8b390yePy4vgY1QQDidd eNh9S2x1dsKp8i4NdYyeW7xwaTfIDp9Yp4XNP8cw02VzW0FSCnsmCzwGHIXsF4K/ Sib+bhKUo+Qmck5nJlV6R5Xr9cvGgyPpBvD8/XqqwF5lHJ7xg4qkPwPKjoKL1HRj YT1t+l0kzcg/onyDAuPe1mIRFf7Q8x5G8zkUGMG401h2tazv19rjK4+V1UemhBqA h5Cf1BBy6+6kn4DDb/zD+0IgpDFKJDaClxNwfPzaplAoMC/8+0nxxnu7f32eT59k /JtMisERhr6lG0Q+TfURhB3yyCiBjBR2EKcKzHz+KARtxs7a4T0= =Z8oJ -----END PGP SIGNATURE----- Merge tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fix from Borislav Petkov: - Fix a corner case where vruntime of a task is not being sanitized * tag 'sched_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/fair: Sanitize vruntime of entity being migrated |
||
Linus Torvalds
|
f6cdaeb08b |
- Do the delayed RCU wakeup for kthreads in the proper order so that
former doesn't get ignored - A noinstr warning fix -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmQgPhsACgkQEsHwGGHe VUqecA//VN/9pvCpJbf0S92lDXjbjuAna4+rYak6mjxke2lYHeXNsCjPpBtdfMnK Mvr1HykKVBGIwlIPsKvwzt1FYN08rhdPnb/XvwM1ZliFlXP/HhN7K3AWgld559sZ hY9gKfijof5D7VhqjRS7ZA2qo2by72ekXnuVQT0cmwZiHPQLx8M4ySHuMUZD3Lmz GQyvITuDyBX3BiIGVtqpOhpugpEjAoEBE0MPwXrMEe9Glk4i1z86Xd1Cr4ksCFZL gIdDSnlUuLuXn1OfLA35fXzoaStB07aTMEbRl+iD1KUopdRzpqj9j0rqYINkW0Ar W/BzyMNw2itEbGrF+kjjCpolwmJvMcJUuvEZKO/gNTv2qYZW5BQqQrHYILdmezyz HvGndyT996D3uoUIMIGqJf+41cwqPEjGyMLs3GfYnZMdVnZZnG/KflWQCiGUR9RR LfxaryNZfT8MfQP6uslxOWubMupfsen7Hk8oljUpT2GzUAsWxTjqYWkFx6bMNMkV Kx304at7R9jD81qC3Rdkqu0F5Z17YZWubd1oJEhi8HeMq8uxFxPb983SkXLY0w7Y 4Ss/MhJwt30e9ltGCMwgF83uOnndXwzFJG4TR9TqsO0TcdUE6XJPbPj/K8Wsi/u7 fvnCbBLsDaEJDEAikWJsSOaMjUwnajyaYomy+9VCR536/DlIq5M= =c5tH -----END PGP SIGNATURE----- Merge tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull core fixes from Borislav Petkov: - Do the delayed RCU wakeup for kthreads in the proper order so that former doesn't get ignored - A noinstr warning fix * tag 'core_urgent_for_v6.3_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up entry: Fix noinstr warning in __enter_from_user_mode() |
||
Linus Torvalds
|
f768b35a23 |
Fixes for 6.3-rc3:
* Fix a race in the percpu counters summation code where the summation failed to add in the values for any CPUs that were dying but not yet dead. This fixes some minor discrepancies and incorrect assertions when running generic/650. Signed-off-by: Darrick J. Wong <djwong@kernel.org> -----BEGIN PGP SIGNATURE----- iHUEABYKAB0WIQQ2qTKExjcn+O1o2YRKO3ySh0YRpgUCZBdAbgAKCRBKO3ySh0YR pkltAQCs4QO5LjYReqjUxd4cSsLtNnNon09qswRsl2GuRyI36AEAxI9QMq4Q6D9V ZasNbiTCkV3KPKfmp6gf1mQNLk1lGQ0= =Bz3q -----END PGP SIGNATURE----- Merge tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux Pull xfs percpu counter fixes from Darrick Wong: "We discovered a filesystem summary counter corruption problem that was traced to cpu hot-remove racing with the call to percpu_counter_sum that sets the free block count in the superblock when writing it to disk. The root cause is that percpu_counter_sum doesn't cull from dying cpus and hence misses those counter values if the cpu shutdown hooks have not yet run to merge the values. I'm hoping this is a fairly painless fix to the problem, since the dying cpu mask should generally be empty. It's been in for-next for a week without any complaints from the bots. - Fix a race in the percpu counters summation code where the summation failed to add in the values for any CPUs that were dying but not yet dead. This fixes some minor discrepancies and incorrect assertions when running generic/650" * tag 'xfs-6.3-fixes-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: pcpcntr: remove percpu_counter_sum_all() fork: remove use of percpu_counter_sum_all pcpcntrs: fix dying cpu summation race cpumask: introduce for_each_cpu_or |
||
Linus Torvalds
|
65aca32efd |
21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder are
for other subsystems. -----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZB48xAAKCRDdBJ7gKXxA js2rAP4zvcMn90vBJhWNElsA7pBgDYD66QCK6JBDHGe3J1qdeQEA8D606pjMBWkL ly7NifwCjOtFhfDRgEHOXu8g8g1k1QM= =Cswg -----END PGP SIGNATURE----- Merge tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "21 hotfixes, 8 of which are cc:stable. 11 are for MM, the remainder are for other subsystems" * tag 'mm-hotfixes-stable-2023-03-24-17-09' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (21 commits) mm: mmap: remove newline at the end of the trace mailmap: add entries for Richard Leitner kcsan: avoid passing -g for test kfence: avoid passing -g for test mm: kfence: fix using kfence_metadata without initialization in show_object() lib: dhry: fix unstable smp_processor_id(_) usage mailmap: add entry for Enric Balletbo i Serra mailmap: map Sai Prakash Ranjan's old address to his current one mailmap: map Rajendra Nayak's old address to his current one Revert "kasan: drop skip_kasan_poison variable in free_pages_prepare" mailmap: add entry for Tobias Klauser kasan, powerpc: don't rename memintrinsics if compiler adds prefixes mm/ksm: fix race with VMA iteration and mm_struct teardown kselftest: vm: fix unused variable warning mm: fix error handling for map_deny_write_exec mm: deduplicate error handling for map_deny_write_exec checksyscalls: ignore fstat to silence build warning on LoongArch nilfs2: fix kernel-infoleak in nilfs_ioctl_wrap_copy() test_maple_tree: add more testing for mas_empty_area() maple_tree: fix mas_skip_node() end slot detection ... |
||
Linus Torvalds
|
608f1b1366 |
Including fixes from bpf, wifi and bluetooth.
Current release - regressions: - wifi: mt76: mt7915: add back 160MHz channel width support for MT7915 - libbpf: revert poisoning of strlcpy, it broke uClibc-ng Current release - new code bugs: - bpf: improve the coverage of the "allow reads from uninit stack" feature to fix verification complexity problems - eth: am65-cpts: reset PPS genf adj settings on enable Previous releases - regressions: - wifi: mac80211: serialize ieee80211_handle_wake_tx_queue() - wifi: mt76: do not run mt76_unregister_device() on unregistered hw, fix null-deref - Bluetooth: btqcomsmd: fix command timeout after setting BD address - eth: igb: revert rtnl_lock() that causes a deadlock - dsa: mscc: ocelot: fix device specific statistics Previous releases - always broken: - xsk: add missing overflow check in xdp_umem_reg() - wifi: mac80211: - fix QoS on mesh interfaces - fix mesh path discovery based on unicast packets - Bluetooth: - ISO: fix timestamped HCI ISO data packet parsing - remove "Power-on" check from Mesh feature - usbnet: more fixes to drivers trusting packet length - wifi: iwlwifi: mvm: fix mvmtxq->stopped handling - Bluetooth: btintel: iterate only bluetooth device ACPI entries - eth: iavf: fix inverted Rx hash condition leading to disabled hash - eth: igc: fix the validation logic for taprio's gate list - dsa: tag_brcm: legacy: fix daisy-chained switches Misc: - bpf: adjust insufficient default bpf_jit_limit to account for growth of BPF use over the last 5 years - xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating no driver support Signed-off-by: Jakub Kicinski <kuba@kernel.org> -----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmQc4vkACgkQMUZtbf5S IruG/w//XixBtdFMHE0/fcGv77jTovlJNiDYeaa+KtyjvIseieYwOKW5F31r3xvl Mf/YhNEjAc++V8Zna/1UM5i/WOj1PJdHgSC+wMUGUXjMF+MfzL57nM83CllOpUB5 Z9YtUqGfolf2Vtx03wnV14qawmVnJWYKHn3AU11cueE5dUu6KNyBTCefQ7uzgcJN zMtHAxw96MRQIDxSfKvZsePk4FnQ4qoSOLkslji5iikcMnKePaqZaxQla2oTcEIR zue9V+ILmi62Y8mPcdT4ePpZQsjB39bpemh+9EL6l03/cjsjqmuiCw/d1+6g9kuy ZD5LgZzUOb6xalhSseiwJL+vj8x2gQhshEfoHQvgp7fzr6agta6sisRX611wtmJl hv4k2PMRqFrMv2S+8m8XC177bXIaGbiWh4vBFOWjf4u0lG55cGlzclbXWWQ80njy C5cE4V7qPRk8Cl/+uT10CLNQx6JmaX8kcddtFrYpu0PZHKx1WfUYKIpgkiiMPRKT njLkDQbFRa8Y3p7UX0wU1TbeuMzzLz+aTBrFEN864IJmbnUnWimeluQzD60WbkSx 6dciqq11LtvYDsR1HZ1pb7IoHYuDsDrO2Rx4zuqsB/SyfrGdRKJoKOnYvsk+AdCL N/e4wivie8s6b+G3yL6p+IdlpEaVo2ZiLINp7JSW8jhW1hRcZUI= =XBLi -----END PGP SIGNATURE----- Merge tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf, wifi and bluetooth. 
Current release - regressions: - wifi: mt76: mt7915: add back 160MHz channel width support for MT7915 - libbpf: revert poisoning of strlcpy, it broke uClibc-ng Current release - new code bugs: - bpf: improve the coverage of the "allow reads from uninit stack" feature to fix verification complexity problems - eth: am65-cpts: reset PPS genf adj settings on enable Previous releases - regressions: - wifi: mac80211: serialize ieee80211_handle_wake_tx_queue() - wifi: mt76: do not run mt76_unregister_device() on unregistered hw, fix null-deref - Bluetooth: btqcomsmd: fix command timeout after setting BD address - eth: igb: revert rtnl_lock() that causes a deadlock - dsa: mscc: ocelot: fix device specific statistics Previous releases - always broken: - xsk: add missing overflow check in xdp_umem_reg() - wifi: mac80211: - fix QoS on mesh interfaces - fix mesh path discovery based on unicast packets - Bluetooth: - ISO: fix timestamped HCI ISO data packet parsing - remove "Power-on" check from Mesh feature - usbnet: more fixes to drivers trusting packet length - wifi: iwlwifi: mvm: fix mvmtxq->stopped handling - Bluetooth: btintel: iterate only bluetooth device ACPI entries - eth: iavf: fix inverted Rx hash condition leading to disabled hash - eth: igc: fix the validation logic for taprio's gate list - dsa: tag_brcm: legacy: fix daisy-chained switches Misc: - bpf: adjust insufficient default bpf_jit_limit to account for growth of BPF use over the last 5 years - xdp: bpf_xdp_metadata() use EOPNOTSUPP as unique errno indicating no driver support" * tag 'net-6.3-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits) Bluetooth: HCI: Fix global-out-of-bounds Bluetooth: mgmt: Fix MGMT add advmon with RSSI command Bluetooth: btsdio: fix use after free bug in btsdio_remove due to unfinished work Bluetooth: L2CAP: Fix responding with wrong PDU type Bluetooth: btqcomsmd: Fix command timeout after setting BD address Bluetooth: btinel: Check ACPI handle for NULL before accessing net: mdio: thunder: Add missing fwnode_handle_put() net: dsa: mt7530: move setting ssc_delta to PHY_INTERFACE_MODE_TRGMII case net: dsa: mt7530: move lowering TRGMII driving to mt7530_setup() net: dsa: mt7530: move enabling disabling core clock to mt7530_pll_setup() net: asix: fix modprobe "sysfs: cannot create duplicate filename" gve: Cache link_speed value from device tools: ynl: Fix genlmsg header encoding formats net: enetc: fix aggregate RMON counters not showing the ranges Bluetooth: Remove "Power-on" check from Mesh feature Bluetooth: Fix race condition in hci_cmd_sync_clear Bluetooth: btintel: Iterate only bluetooth device ACPI entries Bluetooth: ISO: fix timestamped HCI ISO data packet parsing Bluetooth: btusb: Remove detection of ISO packets over bulk Bluetooth: hci_core: Detect if an ACL packet is in fact an ISO packet ... |
||
Marco Elver
|
5eb39cde1e |
kcsan: avoid passing -g for test
Nathan reported that when building with GNU as and a version of clang that
defaults to DWARF5, the assembler will complain with:
Error: non-constant .uleb128 is not supported
This is because `-g` selects the compiler's default debug-info format; if the
assembler does not support some of the directives that format uses, the above
errors occur. To fix this, remove the explicit passing of `-g`.
All the test wants is that stack traces print valid function names, and
debug info is not required for that. (I currently cannot recall why I
added the explicit `-g`.)
Link: https://lkml.kernel.org/r/20230316224705.709984-2-elver@google.com
Fixes:
|
||
Jakub Kicinski
|
1b4ae19e43 |
bpf-for-netdev
-----BEGIN PGP SIGNATURE----- iHUEABYIAB0WIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZBzSGQAKCRDbK58LschI g+dhAP95enbrlwaQ+9aoqrU+GqCq+uo4SkaqnUtq6GSvRNiVBQD8C6iZxrAjyXnm 1wRr3JN/HszPBzgjl3HvDc9y69I/PAI= =8JwR -----END PGP SIGNATURE----- Merge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2023-03-23 We've added 8 non-merge commits during the last 13 day(s) which contain a total of 21 files changed, 238 insertions(+), 161 deletions(-). The main changes are: 1) Fix verification issues in some BPF programs due to their stack usage patterns, from Eduard Zingerman. 2) Fix to add missing overflow checks in xdp_umem_reg and return an error in such case, from Kal Conley. 3) Fix and undo poisoning of strlcpy in libbpf given it broke builds for libcs which provided the former like uClibc-ng, from Jesus Sanchez-Palencia. 4) Fix insufficient bpf_jit_limit default to avoid users running into hard to debug seccomp BPF errors, from Daniel Borkmann. 5) Fix driver return code when they don't support a bpf_xdp_metadata kfunc to make it unambiguous from other errors, from Jesper Dangaard Brouer. 6) Two BPF selftest fixes to address compilation errors from recent changes in kernel structures, from Alexei Starovoitov. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xdp: bpf_xdp_metadata use EOPNOTSUPP for no driver support bpf: Adjust insufficient default bpf_jit_limit xsk: Add missing overflow check in xdp_umem_reg selftests/bpf: Fix progs/test_deny_namespace.c issues. selftests/bpf: Fix progs/find_vma_fail1.c build error. libbpf: Revert poisoning of strlcpy selftests/bpf: Tests for uninitialized stack reads bpf: Allow reads from uninit stack ==================== Link: https://lore.kernel.org/r/20230323225221.6082-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
Petr Tesarik
|
0eee5ae102 |
swiotlb: fix slot alignment checks
Explicit alignment and page alignment are used only to calculate the stride, not when checking the actual physical address of a slot. Originally, only page alignment was implemented, and that worked because the whole SWIOTLB is allocated on a page boundary, so aligning the start index was sufficient to ensure a page-aligned slot (a generic illustration follows this entry). When commit |
||
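A generic illustration of the distinction drawn in the entry above, with illustrative names rather than the swiotlb implementation: a stride derived from the alignment masks only spaces out the candidate slots, while each candidate's physical address must still be checked against the combined mask.

#include <stdbool.h>

/* Combined constraint from the caller's alloc_align_mask and the
 * historical page-alignment rule for large mappings. */
static unsigned long combined_align_mask(unsigned long alloc_align_mask,
					 unsigned long page_mask,
					 bool needs_page_alignment)
{
	return alloc_align_mask | (needs_page_alignment ? page_mask : 0);
}

/* The slot's actual physical address has to satisfy the alignment on
 * its own; the stride alone cannot guarantee that. */
static bool slot_is_aligned(unsigned long slot_phys, unsigned long mask)
{
	return (slot_phys & mask) == 0;
}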
Petr Tesarik
|
39e7d2ab6e |
swiotlb: use wrap_area_index() instead of open-coding it
No functional change, just use an existing helper. Signed-off-by: Petr Tesarik <petr.tesarik.ext@huawei.com> Signed-off-by: Christoph Hellwig <hch@lst.de> |
||
Daniel Borkmann
|
10ec8ca8ec |
bpf: Adjust insufficient default bpf_jit_limit
We've seen recent AWS EKS (Kubernetes) user reports like the following:
After upgrading EKS nodes from v20230203 to v20230217 on our 1.24 EKS
clusters after a few days a number of the nodes have containers stuck
in ContainerCreating state or liveness/readiness probes reporting the
following error:
Readiness probe errored: rpc error: code = Unknown desc = failed to
exec in container: failed to start exec "4a11039f730203ffc003b7[...]":
OCI runtime exec failed: exec failed: unable to start container process:
unable to init seccomp: error loading seccomp filter into kernel:
error loading seccomp filter: errno 524: unknown
However, we had not been seeing this issue on previous AMIs and it only
started to occur on v20230217 (following the upgrade from kernel 5.4 to
5.10) with no other changes to the underlying cluster or workloads.
We tried the suggestions from that issue (sysctl net.core.bpf_jit_limit=452534528)
which helped to immediately allow containers to be created and probes to
execute but after approximately a day the issue returned and the value
returned by cat /proc/vmallocinfo | grep bpf_jit | awk '{s+=$2} END {print s}'
was steadily increasing.
I tested the bpf tree to observe bpf_jit_charge_modmem and bpf_jit_uncharge_modmem,
the sizes passed to them, as well as bpf_jit_current, under a tcpdump BPF filter,
seccomp BPF and native (e)BPF programs, and the behavior all looks sane and
expected; that is, nothing is "leaking" from an upstream perspective.
The bpf_jit_limit knob was originally added in order to avoid a situation
where unprivileged applications loading BPF programs (e.g. seccomp BPF
policies) consuming all the module memory space via BPF JIT such that loading
of kernel modules would be prevented. The default limit was defined back in
2018 and while good enough back then, we are generally seeing far more BPF
consumers today.
Adjust the limit for the BPF JIT pool from originally 1/4 to now 1/2 of the
module memory space to better reflect today's needs and avoid more users
running into potentially hard to debug issues.
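A hedged sketch of the sizing change, using an illustrative constant rather than the kernel's exact formula: the default limit is derived as half of the module memory space instead of a quarter, with an upper clamp.

#include <stdio.h>

static unsigned long default_jit_limit(unsigned long module_space_bytes)
{
	const unsigned long clamp = 1UL << 31;          /* illustrative ceiling */
	unsigned long limit = module_space_bytes >> 1;  /* 1/2 instead of the old 1/4 */

	return limit < clamp ? limit : clamp;
}

int main(void)
{
	/* e.g. a 1 GiB module area would now allow up to 512 MiB of JIT images */
	printf("%lu\n", default_jit_limit(1UL << 30));
	return 0;
}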
Fixes:
|
||
Frederic Weisbecker
|
b416514054 |
entry/rcu: Check TIF_RESCHED _after_ delayed RCU wake-up
RCU sometimes needs to perform a delayed wake up for specific kthreads
handling offloaded callbacks (RCU_NOCB). These wakeups are performed
by timers and upon entry to idle (also to guest and to user on nohz_full).
However, the delayed wake-up on kernel exit is actually performed after
the thread flags are fetched for the fast-path check for work to do on
exit to user. As a result, if there is no other pending work on that
kernel exit, the current task resumes to userspace with TIF_RESCHED set
and the pending wake-up ignored.
Fix this by fetching the thread flags _after_ the delayed RCU-nocb
kthread wake-up (a minimal ordering sketch follows).
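A minimal ordering sketch with hypothetical names, not the kernel's entry code: performing the deferred wake-up before sampling the thread flags means a wake-up that sets the reschedule flag is acted upon on this very exit.

#include <stdio.h>

#define TIF_NEED_RESCHED (1UL << 0)

static unsigned long thread_flags;

/* Waking the offloaded-callback kthread may mark the current task for
 * rescheduling; modelled here by simply setting the flag. */
static void delayed_rcu_wakeup(void)
{
	thread_flags |= TIF_NEED_RESCHED;
}

static void exit_to_user_mode(void)
{
	unsigned long ti_work;

	delayed_rcu_wakeup();        /* deferred wake-up first... */
	ti_work = thread_flags;      /* ...then read the flags */

	if (ti_work & TIF_NEED_RESCHED)
		printf("reschedule before returning to user\n");
}

int main(void)
{
	exit_to_user_mode();
	return 0;
}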
Fixes:
|
||
Vincent Guittot
|
a53ce18cac |
sched/fair: Sanitize vruntime of entity being migrated
Commit |
||
Josh Poimboeuf
|
f87d28673b |
entry: Fix noinstr warning in __enter_from_user_mode()
__enter_from_user_mode() is triggering noinstr warnings with
CONFIG_DEBUG_PREEMPT due to its call of preempt_count_add() via
ct_state().
The preemption disable isn't needed as interrupts are already disabled.
And the context_tracking_enabled() check in ct_state() also isn't needed
as that's already being done by the CT_WARN_ON().
Just use __ct_state() instead.
Fixes the following warnings:
vmlinux.o: warning: objtool: enter_from_user_mode+0xba: call to preempt_count_add() leaves .noinstr.text section
vmlinux.o: warning: objtool: syscall_enter_from_user_mode+0xf9: call to preempt_count_add() leaves .noinstr.text section
vmlinux.o: warning: objtool: syscall_enter_from_user_mode_prepare+0xc7: call to preempt_count_add() leaves .noinstr.text section
vmlinux.o: warning: objtool: irqentry_enter_from_user_mode+0xba: call to preempt_count_add() leaves .noinstr.text section
Fixes:
|
||
Linus Torvalds
|
eaba52d63b |
Tracing fixes for 6.3:
- Fix setting affinity of hwlat threads in containers Using sched_set_affinity() has unwanted side effects when being called within a container. Use set_cpus_allowed_ptr() instead. - Fix per cpu thread management of the hwlat tracer * Do not start per_cpu threads if one is already running for the CPU. * When starting per_cpu threads, do not clear the kthread variable as it may already be set to running per cpu threads - Fix return value for test_gen_kprobe_cmd() On error the return value was overwritten by being set to the result of the call from kprobe_event_delete(), which would likely succeed, and thus have the function return success. - Fix splice() reads from the trace file that was broken by |
||
Costa Shulyupin
|
71c7a30442 |
tracing/hwlat: Replace sched_setaffinity with set_cpus_allowed_ptr
There is a problem with the behavior of hwlat in a container,
resulting in incorrect output. A warning message is generated:
"cpumask changed while in round-robin mode, switching to mode none",
and the tracing_cpumask is ignored. This issue arises because
the kernel thread, hwlatd, is not a part of the container, and
the function sched_setaffinity is unable to locate it using its PID.
Since the task_struct of hwlatd is already known, use
set_cpus_allowed_ptr() instead: it achieves the same outcome as
sched_setaffinity(), but takes a task_struct rather than a PID
(a userspace analogy follows).
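A userspace analogy of that switch, not the kernel patch itself: prefer an affinity call that takes a direct thread handle, much like set_cpus_allowed_ptr() takes a task_struct, over one that must look the target up by PID and can therefore fail across PID namespaces.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void *worker(void *arg)
{
	(void)arg;
	pause();        /* placeholder workload */
	return NULL;
}

int main(void)
{
	pthread_t th;
	cpu_set_t mask;
	int err;

	pthread_create(&th, NULL, worker, NULL);

	CPU_ZERO(&mask);
	CPU_SET(0, &mask);
	/* handle-based call, analogous to set_cpus_allowed_ptr(task, mask) */
	err = pthread_setaffinity_np(th, sizeof(mask), &mask);
	if (err)
		fprintf(stderr, "pthread_setaffinity_np: %d\n", err);

	pthread_cancel(th);
	pthread_join(th, NULL);
	return 0;
}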
Test case:
# cd /sys/kernel/tracing
# echo 0 > tracing_on
# echo round-robin > hwlat_detector/mode
# echo hwlat > current_tracer
# unshare --fork --pid bash -c 'echo 1 > tracing_on'
# dmesg -c
Actual behavior:
[573502.809060] hwlat_detector: cpumask changed while in round-robin mode, switching to mode none
Link: https://lore.kernel.org/linux-trace-kernel/20230316144535.1004952-1-costa.shul@redhat.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Fixes:
|
||
Vlastimil Babka
|
a98151ad53 |
ring-buffer: remove obsolete comment for free_buffer_page()
The comment refers to mm/slob.c which is being removed. It comes from commit |