Commit Graph

Eduard Zingerman
aa30eb3260 bpf: Force checkpoint when jmp history is too long
A specifically crafted program might trick the verifier into growing a
very long jump history within a single bpf_verifier_state instance.
A very long jump history makes mark_chain_precision() unreasonably slow,
especially when the verifier processes a loop.

Mitigate this by forcing a new state in is_state_visited() when the
current state's jump history is too long.

Use the same constant as in `skip_inf_loop_check`, but multiply it by an
arbitrarily chosen value of 2 to account for the jump history containing
not only information about jumps, but also information about stack
accesses.
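
A minimal sketch of the mitigation (illustrative, not the literal hunk),
assuming the existing force_new_state logic in is_state_visited() and the
20-jump constant used by `skip_inf_loop_check` multiplied by 2:

    /* Force a checkpoint once the current state's jump history grows too
     * long, so mark_chain_precision() never walks an unbounded history
     * within a single bpf_verifier_state.
     */
    force_new_state |= cur->jmp_history_cnt > 40;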

For an example of a problematic program, consider the code below.
Without this patch the verifier processes the example for ~15 minutes
before failing to allocate a big enough chunk for jmp_history.

    0: r7 = *(u16 *)(r1 +0);
    1: r7 += 0x1ab064b9;
    2: if r7 & 0x702000 goto 1b;
    3: r7 &= 0x1ee60e;
    4: r7 += r1;
    5: if r7 s> 0x37d2 goto +0;
    6: r0 = 0;
    7: exit;

Perf profiling shows that most of the time (~95%) is spent in
mark_chain_precision().

The easiest way to explain why this program causes problems is to
apply the following patch:

    diff --git a/include/linux/bpf.h b/include/linux/bpf.h
    index 0c216e71cec7..4b4823961abe 100644
    --- a/include/linux/bpf.h
    +++ b/include/linux/bpf.h
    @@ -1926,7 +1926,7 @@ struct bpf_array {
            };
     };

    -#define BPF_COMPLEXITY_LIMIT_INSNS      1000000 /* yes. 1M insns */
    +#define BPF_COMPLEXITY_LIMIT_INSNS      256 /* yes. 1M insns */
     #define MAX_TAIL_CALL_CNT 33

     /* Maximum number of loops for bpf_loop and bpf_iter_num.
    diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
    index f514247ba8ba..75e88be3bb3e 100644
    --- a/kernel/bpf/verifier.c
    +++ b/kernel/bpf/verifier.c
    @@ -18024,8 +18024,13 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
     skip_inf_loop_check:
                            if (!force_new_state &&
                                env->jmps_processed - env->prev_jmps_processed < 20 &&
    -                           env->insn_processed - env->prev_insn_processed < 100)
    +                           env->insn_processed - env->prev_insn_processed < 100) {
    +                               verbose(env, "is_state_visited: suppressing checkpoint at %d, %d jmps processed, cur->jmp_history_cnt is %d\n",
    +                                       env->insn_idx,
    +                                       env->jmps_processed - env->prev_jmps_processed,
    +                                       cur->jmp_history_cnt);
                                    add_new_state = false;
    +                       }
                            goto miss;
                    }
                    /* If sl->state is a part of a loop and this loop's entry is a part of
    @@ -18142,6 +18147,9 @@ static int is_state_visited(struct bpf_verifier_env *env, int insn_idx)
            if (!add_new_state)
                    return 0;

    +       verbose(env, "is_state_visited: new checkpoint at %d, resetting env->jmps_processed\n",
    +               env->insn_idx);
    +
            /* There were no equivalent states, remember the current one.
             * Technically the current state is not proven to be safe yet,
             * but it will either reach outer most bpf_exit (which means it's safe)

And observe verification log:

    ...
    is_state_visited: new checkpoint at 5, resetting env->jmps_processed
    5: R1=ctx() R7=ctx(...)
    5: (65) if r7 s> 0x37d2 goto pc+0     ; R7=ctx(...)
    6: (b7) r0 = 0                        ; R0_w=0
    7: (95) exit

    from 5 to 6: R1=ctx() R7=ctx(...) R10=fp0
    6: R1=ctx() R7=ctx(...) R10=fp0
    6: (b7) r0 = 0                        ; R0_w=0
    7: (95) exit
    is_state_visited: suppressing checkpoint at 1, 3 jmps processed, cur->jmp_history_cnt is 74

    from 2 to 1: R1=ctx() R7_w=scalar(...) R10=fp0
    1: R1=ctx() R7_w=scalar(...) R10=fp0
    1: (07) r7 += 447767737
    is_state_visited: suppressing checkpoint at 2, 3 jmps processed, cur->jmp_history_cnt is 75
    2: R7_w=scalar(...)
    2: (45) if r7 & 0x702000 goto pc-2
    ... mark_precise 152 steps for r7 ...
    2: R7_w=scalar(...)
    is_state_visited: suppressing checkpoint at 1, 4 jmps processed, cur->jmp_history_cnt is 75
    1: (07) r7 += 447767737
    is_state_visited: suppressing checkpoint at 2, 4 jmps processed, cur->jmp_history_cnt is 76
    2: R7_w=scalar(...)
    2: (45) if r7 & 0x702000 goto pc-2
    ...
    BPF program is too large. Processed 257 insn

The log output shows that the checkpoint at label (1) is never created,
because it is suppressed by the `skip_inf_loop_check` logic:
a. When the 'if' at (2) is processed, it pushes a state with insn_idx (1)
   onto the stack and proceeds to (3);
b. At (5) a checkpoint is created, and this resets
   env->{jmps,insns}_processed.
c. Verification proceeds and reaches `exit`;
d. The state saved at step (a) is popped from the stack and
   is_state_visited() considers whether a checkpoint needs to be added,
   but because env->{jmps,insns}_processed had just been reset at step (b),
   the `skip_inf_loop_check` logic forces `add_new_state` to false.
e. The verifier proceeds with the current state, which slowly accumulates
   more and more entries in the jump history.

The accumulation of entries in the jump history is a problem because
of two factors:
- it eventually exhausts the memory available for kmalloc() allocation;
- mark_chain_precision() traverses the jump history of a state,
  meaning that if `r7` is marked precise, the verifier would iterate over
  an ever-growing jump history until the parent state boundary is reached.

(note: the log also shows a REG INVARIANTS VIOLATION warning
       upon jset processing, but that's another bug to fix).

With this patch applied, the example above is rejected by the verifier
in under 1s, after reaching the 1M instruction limit.

The program is a simplified reproducer from the syzbot report.
Previous discussion can be found at [1].
The patch does not cause any changes in verification performance
when tested on selftests from veristat.cfg and the Cilium programs taken
from [2].

[1] https://lore.kernel.org/bpf/20241009021254.2805446-1-eddyz87@gmail.com/
[2] https://github.com/anakryiko/cilium

Changelog:
- v1 -> v2:
  - moved patch to bpf tree;
  - moved force_new_state variable initialization after declaration and
    shortened the comment.
v1: https://lore.kernel.org/bpf/20241018020307.1766906-1-eddyz87@gmail.com/

Fixes: 2589726d12 ("bpf: introduce bounded loops")
Reported-by: syzbot+7e46cdef14bf496a3ab4@syzkaller.appspotmail.com
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20241029172641.1042523-1-eddyz87@gmail.com

Closes: https://lore.kernel.org/bpf/670429f6.050a0220.49194.0517.GAE@google.com/
2024-10-29 11:42:21 -07:00
Aboorva Devarajan
5db91545ef sched: Pass correct scheduling policy to __setscheduler_class
Commit 98442f0ccd ("sched: Fix delayed_dequeue vs
switched_from_fair()") overlooked that __setscheduler_prio(), now
__setscheduler_class(), relies on p->policy for task_should_scx(), and
moved the call before __setscheduler_params() updates it, causing it
to use the old p->policy value.

Resolve this by changing task_should_scx() to take the policy itself
instead of a task pointer, such that __sched_setscheduler() can pass
in the updated policy.
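
A hedged sketch of the interface change (the body is simplified; other
early-outs in the real function are omitted):

    /* Decide based on the policy value passed in, not on p->policy, which
     * __setscheduler_params() has not updated yet at this point.
     */
    bool task_should_scx(int policy)
    {
            if (!scx_enabled())
                    return false;
            return policy == SCHED_EXT;
    }

The caller in __sched_setscheduler() then passes the new policy value
directly instead of a task pointer.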

Fixes: 98442f0ccd ("sched: Fix delayed_dequeue vs switched_from_fair()")
Signed-off-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Tejun Heo <tj@kernel.org>
2024-10-29 13:57:51 +01:00
Gregory Price
b125a0def2 resource,kexec: walk_system_ram_res_rev must retain resource flags
walk_system_ram_res_rev() erroneously discards resource flags when passing
the information to the callback.

This causes systems with IORESOURCE_SYSRAM_DRIVER_MANAGED memory to have
these resources selected during kexec to store kexec buffers if that
memory happens to be placed above normal system RAM.

This leads to undefined behavior after reboot.  If the kexec buffer is
never touched, nothing happens.  If the kexec buffer is touched, it could
lead to a crash (like below) or undefined behavior.

Tested on a system with CXL memory expanders with driver-managed memory,
TPM enabled, and CONFIG_IMA_KEXEC=y.  Adding printk's showed the flags
were being discarded, and as a result the check for
IORESOURCE_SYSRAM_DRIVER_MANAGED passes.

find_next_iomem_res: name(System RAM (kmem))
		     start(10000000000)
		     end(1034fffffff)
		     flags(83000200)

locate_mem_hole_top_down: start(10000000000) end(1034fffffff) flags(0)

[.] BUG: unable to handle page fault for address: ffff89834ffff000
[.] #PF: supervisor read access in kernel mode
[.] #PF: error_code(0x0000) - not-present page
[.] PGD c04c8bf067 P4D c04c8bf067 PUD c04c8be067 PMD 0
[.] Oops: 0000 [#1] SMP
[.] RIP: 0010:ima_restore_measurement_list+0x95/0x4b0
[.] RSP: 0018:ffffc900000d3a80 EFLAGS: 00010286
[.] RAX: 0000000000001000 RBX: 0000000000000000 RCX: ffff89834ffff000
[.] RDX: 0000000000000018 RSI: ffff89834ffff000 RDI: ffff89834ffff018
[.] RBP: ffffc900000d3ba0 R08: 0000000000000020 R09: ffff888132b8a900
[.] R10: 4000000000000000 R11: 000000003a616d69 R12: 0000000000000000
[.] R13: ffffffff8404ac28 R14: 0000000000000000 R15: ffff89834ffff000
[.] FS:  0000000000000000(0000) GS:ffff893d44640000(0000) knlGS:0000000000000000
[.] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[.] ata5: SATA link down (SStatus 0 SControl 300)
[.] CR2: ffff89834ffff000 CR3: 000001034d00f001 CR4: 0000000000770ef0
[.] PKRU: 55555554
[.] Call Trace:
[.]  <TASK>
[.]  ? __die+0x78/0xc0
[.]  ? page_fault_oops+0x2a8/0x3a0
[.]  ? exc_page_fault+0x84/0x130
[.]  ? asm_exc_page_fault+0x22/0x30
[.]  ? ima_restore_measurement_list+0x95/0x4b0
[.]  ? template_desc_init_fields+0x317/0x410
[.]  ? crypto_alloc_tfm_node+0x9c/0xc0
[.]  ? init_ima_lsm+0x30/0x30
[.]  ima_load_kexec_buffer+0x72/0xa0
[.]  ima_init+0x44/0xa0
[.]  __initstub__kmod_ima__373_1201_init_ima7+0x1e/0xb0
[.]  ? init_ima_lsm+0x30/0x30
[.]  do_one_initcall+0xad/0x200
[.]  ? idr_alloc_cyclic+0xaa/0x110
[.]  ? new_slab+0x12c/0x420
[.]  ? new_slab+0x12c/0x420
[.]  ? number+0x12a/0x430
[.]  ? sysvec_apic_timer_interrupt+0xa/0x80
[.]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[.]  ? parse_args+0xd4/0x380
[.]  ? parse_args+0x14b/0x380
[.]  kernel_init_freeable+0x1c1/0x2b0
[.]  ? rest_init+0xb0/0xb0
[.]  kernel_init+0x16/0x1a0
[.]  ret_from_fork+0x2f/0x40
[.]  ? rest_init+0xb0/0xb0
[.]  ret_from_fork_asm+0x11/0x20
[.]  </TASK>
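
A hedged sketch of the kind of change involved, assuming the reverse walk
snapshots System RAM entries into a temporary array before invoking the
callback (array and variable names are illustrative):

    rams[rams_size].start = res.start;
    rams[rams_size].end   = res.end;
    rams[rams_size].flags = res.flags;  /* previously dropped, so the
                                         * callback saw flags == 0 */
    rams_size++;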

Link: https://lore.kernel.org/all/20231114091658.228030-1-bhe@redhat.com/
Link: https://lkml.kernel.org/r/20241017190347.5578-1-gourry@gourry.net
Fixes: 7acf164b25 ("resource: add walk_system_ram_res_rev()")
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: "Huang, Ying" <ying.huang@intel.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Mika Westerberg <mika.westerberg@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28 21:40:40 -07:00
Lorenzo Stoakes
985da552a9 fork: only invoke khugepaged, ksm hooks if no error
There is no reason to invoke these hooks early against an mm that is in an
incomplete state.

The change in commit d240629148 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.

Their placement early in dup_mmap() only appears to have been meaningful
for early error checking, and since functionally it'd require a very small
allocation to fail (in practice 'too small to fail') that'd only occur in
the most dire circumstances, meaning the fork would fail or be OOM'd in
any case.

Since both khugepaged and KSM tracking are there to provide optimisations
to memory performance rather than critical functionality, it doesn't
really matter all that much if, under such dire memory pressure, we fail
to register an mm with these.

As a result, we follow the example of commit d2081b2bf8 ("mm:
khugepaged: make khugepaged_enter() void function") and make ksm_fork() a
void function also.

We only expose the mm to these functions once we are done with them and
only if no error occurred in the fork operation.
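
A hedged sketch of the resulting tail of dup_mmap(), with error labels and
surrounding cleanup omitted:

    /* Expose the new mm to the memory-optimisation machinery only after
     * duplication finished without error; both hooks are void now, so
     * there is nothing to unwind.
     */
    if (!retval) {
            ksm_fork(mm, oldmm);
            khugepaged_fork(mm, oldmm);
    }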

Link: https://lkml.kernel.org/r/e0cb8b840c9d1d5a6e84d4f8eff5f3f2022aa10c.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Jann Horn <jannh@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28 21:40:39 -07:00
Lorenzo Stoakes
f64e67e5d3 fork: do not invoke uffd on fork if error occurs
Patch series "fork: do not expose incomplete mm on fork".

During fork we may place the virtual memory address space into an
inconsistent state before the fork operation is complete.

In addition, we may encounter an error during the fork operation that
indicates that the virtual memory address space is invalidated.

As a result, we should not be exposing it in any way to external machinery
that might interact with the mm or VMAs, machinery that is not designed to
deal with incomplete state.

We specifically update the fork logic to defer khugepaged and ksm to the
end of the operation and only to be invoked if no error arose, and
disallow uffd from observing fork events should an error have occurred.


This patch (of 2):

Currently on fork we expose the virtual address space of a process to
userland unconditionally if uffd is registered in VMAs, regardless of
whether an error arose in the fork.

This is performed in dup_userfaultfd_complete() which is invoked
unconditionally, and performs two duties - invoking registered handlers
for the UFFD_EVENT_FORK event via dup_fctx(), and clearing down
userfaultfd_fork_ctx objects established in dup_userfaultfd().

This is problematic, because the virtual address space may not yet be
correctly initialised if an error arose.

The change in commit d240629148 ("fork: use __mt_dup() to duplicate
maple tree in dup_mmap()") makes this more pertinent as we may be in a
state where entries in the maple tree are not yet consistent.

We address this by, on fork error, ensuring that we roll back state that
we would otherwise expect to clean up through the event being handled by
userland and perform the memory freeing duty otherwise performed by
dup_userfaultfd_complete().

We do this by implementing a new function, dup_userfaultfd_fail(), which
performs the same loop, only decrementing reference counts.

Note that we perform mmgrab() on the parent and child mm's, however
userfaultfd_ctx_put() will mmdrop() this once the reference count drops to
zero, so we will avoid memory leaks correctly here.
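
A hedged sketch of what dup_userfaultfd_fail() amounts to (the structure
follows dup_userfaultfd_complete(); details such as mmap_changing handling
are omitted):

    /* On fork failure, walk the same list dup_userfaultfd() built, drop
     * the references that dup_userfaultfd_complete() would otherwise hand
     * over to a userland handler, and free the temporary fork contexts.
     */
    void dup_userfaultfd_fail(struct list_head *fcs)
    {
            struct userfaultfd_fork_ctx *fctx, *n;

            list_for_each_entry_safe(fctx, n, fcs, list) {
                    userfaultfd_ctx_put(fctx->orig);
                    userfaultfd_ctx_put(fctx->new);
                    list_del(&fctx->list);
                    kfree(fctx);
            }
    }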

Link: https://lkml.kernel.org/r/cover.1729014377.git.lorenzo.stoakes@oracle.com
Link: https://lkml.kernel.org/r/d3691d58bb58712b6fb3df2be441d175bd3cdf07.1729014377.git.lorenzo.stoakes@oracle.com
Fixes: d240629148 ("fork: use __mt_dup() to duplicate maple tree in dup_mmap()")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reported-by: Jann Horn <jannh@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-28 21:40:38 -07:00
Linus Torvalds
f69a1accfe Fixes for function graph:
- Fix missing mutex unlock in error path of register_ftrace_graph()
 
   A previous fix added a return on an error path and forgot to unlock the
   mutex. Instead of dealing with error paths, use guard(mutex) as the mutex
   is just released at the exit of the function anyway. Other functions
   in this file should be updated with this, but that's a cleanup and not
   a fix.
 
 - Change cpuhp setup name to be consistent with other cpuhp states
 
   The same commit that the above patch fixes also added a cpuhp_setup_state()
   call with the name "fgraph_idle_init". I was informed that it should instead
   be something like "fgraph:online". Update that too.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZxydTRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qsE/APoDcsqqaDJvQ0OsMqVaPdHoj2IUkU4M
 yueb6U/Kyq1m4wEA259W1PZuQlM0Vo0yJM1w2YIAH18UpO09ZroLnbWoUAc=
 =2sS+
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fixes from Steven Rostedt:

 - Fix missing mutex unlock in error path of register_ftrace_graph()

   A previous fix added a return on an error path and forgot to unlock
   the mutex. Instead of dealing with error paths, use guard(mutex) as
   the mutex is just released at the exit of the function anyway. Other
   functions in this file should be updated with this, but that's a
   cleanup and not a fix.

 - Change cpuhp setup name to be consistent with other cpuhp states

   The same commit that the above patch fixes also added a
   cpuhp_setup_state() call with the name "fgraph_idle_init". I was
   informed that it should instead be something like "fgraph:online".
   Update that too.

* tag 'ftrace-v6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Change the name of cpuhp state to "fgraph:online"
  fgraph: Fix missing unlock in register_ftrace_graph()
2024-10-27 08:56:22 -10:00
Jinjie Ruan
5f994f5341 genirq/msi: Fix off-by-one error in msi_domain_alloc()
The error path in msi_domain_alloc(), frees the already allocated MSI
interrupts in a loop, but the loop condition terminates when the index
reaches zero, which fails to free the first allocated MSI interrupt at
index zero.

Check for >= 0 so that msi[0] is freed as well.
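
The shape of the bug, as a hedged sketch of the error path (the free_one()
helper is a placeholder for the actual per-interrupt teardown):

    /* Free the MSI interrupts allocated before the failure.  The original
     * loop used `i > 0` and therefore never freed the interrupt at index 0.
     */
    for (i--; i >= 0; i--)          /* was: i > 0 */
            free_one(domain, virq + i);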

Fixes: f3cf8bb0d6 ("genirq: Add generic msi irq domain support")
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20241026063639.10711-1-ruanjinjie@huawei.com
2024-10-27 10:40:47 +01:00
Benjamin Segall
b5413156ba posix-cpu-timers: Clear TICK_DEP_BIT_POSIX_TIMER on clone
When cloning a new thread, its posix_cputimers are not inherited, and
are cleared by posix_cputimers_init(). However, this does not clear the
tick dependency it creates in tsk->tick_dep_mask, and the handler does
not reach the code to clear the dependency if there were no timers to
begin with.

Thus if a thread has a cputimer running before clone/fork, all
descendants will prevent nohz_full unless they create a cputimer of
their own.

Fix this by entirely clearing the tick_dep_mask in copy_process().
(There is currently no inherited state that needs a tick dependency)

Process-wide timers do not have this problem because fork does not copy
signal_struct as a baseline; it creates one from scratch.
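
A hedged sketch of the fix in copy_process() (whether the mask is cleared
directly or via a helper is an implementation detail):

    #ifdef CONFIG_NO_HZ_FULL
            /* The child starts with no posix cputimers, so it must not
             * inherit the parent's tick dependency bits either.
             */
            atomic_set(&p->tick_dep_mask, 0);
    #endif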

Fixes: b78783000d ("posix-cpu-timers: Migrate to use new tick dependency mask model")
Signed-off-by: Ben Segall <bsegall@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/xm26o737bq8o.fsf@google.com
2024-10-27 10:36:04 +01:00
Shawn Wang
9c70b2a33c sched/numa: Fix the potential null pointer dereference in task_numa_work()
When running stress-ng-vm-segv test, we found a null pointer dereference
error in task_numa_work(). Here is the backtrace:

  [323676.066985] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
  ......
  [323676.067108] CPU: 35 PID: 2694524 Comm: stress-ng-vm-se
  ......
  [323676.067113] pstate: 23401009 (nzCv daif +PAN -UAO +TCO +DIT +SSBS BTYPE=--)
  [323676.067115] pc : vma_migratable+0x1c/0xd0
  [323676.067122] lr : task_numa_work+0x1ec/0x4e0
  [323676.067127] sp : ffff8000ada73d20
  [323676.067128] x29: ffff8000ada73d20 x28: 0000000000000000 x27: 000000003e89f010
  [323676.067130] x26: 0000000000080000 x25: ffff800081b5c0d8 x24: ffff800081b27000
  [323676.067133] x23: 0000000000010000 x22: 0000000104d18cc0 x21: ffff0009f7158000
  [323676.067135] x20: 0000000000000000 x19: 0000000000000000 x18: ffff8000ada73db8
  [323676.067138] x17: 0001400000000000 x16: ffff800080df40b0 x15: 0000000000000035
  [323676.067140] x14: ffff8000ada73cc8 x13: 1fffe0017cc72001 x12: ffff8000ada73cc8
  [323676.067142] x11: ffff80008001160c x10: ffff000be639000c x9 : ffff8000800f4ba4
  [323676.067145] x8 : ffff000810375000 x7 : ffff8000ada73974 x6 : 0000000000000001
  [323676.067147] x5 : 0068000b33e26707 x4 : 0000000000000001 x3 : ffff0009f7158000
  [323676.067149] x2 : 0000000000000041 x1 : 0000000000004400 x0 : 0000000000000000
  [323676.067152] Call trace:
  [323676.067153]  vma_migratable+0x1c/0xd0
  [323676.067155]  task_numa_work+0x1ec/0x4e0
  [323676.067157]  task_work_run+0x78/0xd8
  [323676.067161]  do_notify_resume+0x1ec/0x290
  [323676.067163]  el0_svc+0x150/0x160
  [323676.067167]  el0t_64_sync_handler+0xf8/0x128
  [323676.067170]  el0t_64_sync+0x17c/0x180
  [323676.067173] Code: d2888001 910003fd f9000bf3 aa0003f3 (f9401000)
  [323676.067177] SMP: stopping secondary CPUs
  [323676.070184] Starting crashdump kernel...

stress-ng-vm-segv in stress-ng is used to stress test the SIGSEGV error
handling function of the system, which tries to cause a SIGSEGV error on
return from unmapping the whole address space of the child process.

Normally this program will not cause kernel crashes. But before the
munmap system call returns to user mode, a potential task_numa_work()
for numa balancing could be added and executed. In this scenario, since the
child process has no vma after munmap, the vma_next() in task_numa_work()
will return a null pointer even if the vma iterator restarts from 0.

Recheck the vma pointer before dereferencing it in task_numa_work().
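
A hedged sketch of the recheck (the exact exit label in task_numa_work()
may differ):

    vma = vma_next(&vmi);
    /* The address space may already be empty, e.g. after the task
     * munmap'ed everything, so the restarted iterator can still return
     * NULL; bail out instead of dereferencing it.
     */
    if (unlikely(!vma))
            return;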

Fixes: 214dbc4281 ("sched: convert to vma iterator")
Signed-off-by: Shawn Wang <shawnwang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # v6.2+
Link: https://lkml.kernel.org/r/20241025022208.125527-1-shawnwang@linux.alibaba.com
2024-10-26 09:28:37 +02:00
David Vernet
0e7ffff1b8 scx: Fix raciness in scx_ops_bypass()
scx_ops_bypass() can currently race on the ops enable / disable path as
follows:

1. scx_ops_bypass(true) called on enable path, bypass depth is set to 1
2. An op on the init path exits, which schedules scx_ops_disable_workfn()
3. scx_ops_bypass(false) is called on the disable path, and bypass depth
   is decremented to 0
4. kthread is scheduled to execute scx_ops_disable_workfn()
5. scx_ops_bypass(true) called, bypass depth set to 1
6. scx_ops_bypass() races when iterating over CPUs

While it's not safe to take any blocking locks on the bypass path, it is
safe to take a raw spinlock which cannot be preempted. This patch therefore
updates scx_ops_bypass() to use a raw spinlock to synchronize, and changes
scx_ops_bypass_depth to be a regular int.
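
A hedged sketch of the resulting synchronization (the per-CPU iteration
body is omitted):

    static DEFINE_RAW_SPINLOCK(scx_ops_bypass_lock);
    static int scx_ops_bypass_depth;

    static void scx_ops_bypass(bool bypass)
    {
            unsigned long flags;

            raw_spin_lock_irqsave(&scx_ops_bypass_lock, flags);
            if (bypass) {
                    scx_ops_bypass_depth++;
                    if (scx_ops_bypass_depth != 1)
                            goto unlock;
            } else {
                    scx_ops_bypass_depth--;
                    if (scx_ops_bypass_depth != 0)
                            goto unlock;
            }
            /* iterate over CPUs while still holding the raw spinlock */
    unlock:
            raw_spin_unlock_irqrestore(&scx_ops_bypass_lock, flags);
    }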

Without this change, we observe the following warnings when running the
'exit' sched_ext selftest (sometimes requires a couple of runs):

.[root@virtme-ng sched_ext]# ./runner -t exit
===== START =====
TEST: exit
...
[   14.935078] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4332 scx_ops_bypass+0x1ca/0x280
[   14.935126] Modules linked in:
[   14.935150] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Not tainted 6.11.0-virtme #24
[   14.935192] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   14.935242] Sched_ext: exit (enabling+all)
[   14.935244] RIP: 0010:scx_ops_bypass+0x1ca/0x280
[   14.935300] Code: ff ff ff e8 48 96 10 00 fb e9 08 ff ff ff c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 0f 0b 90 41 8b 84 24 24
[   14.935394] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010002
[   14.935424] RAX: 0000000000000009 RBX: 0000000000000001 RCX: 00000000e3fb8b2a
[   14.935465] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffffffff88a4c080
[   14.935512] RBP: 0000000000009b56 R08: 0000000000000004 R09: 00000003f12e520a
[   14.935555] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300
[   14.935598] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018
[   14.935642] FS:  0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000
[   14.935684] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.935721] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0
[   14.935765] PKRU: 55555554
[   14.935782] Call Trace:
[   14.935802]  <TASK>
[   14.935823]  ? __warn+0xce/0x220
[   14.935850]  ? scx_ops_bypass+0x1ca/0x280
[   14.935881]  ? report_bug+0xc1/0x160
[   14.935909]  ? handle_bug+0x61/0x90
[   14.935934]  ? exc_invalid_op+0x1a/0x50
[   14.935959]  ? asm_exc_invalid_op+0x1a/0x20
[   14.935984]  ? raw_spin_rq_lock_nested+0x15/0x30
[   14.936019]  ? scx_ops_bypass+0x1ca/0x280
[   14.936046]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.936081]  ? __pfx_scx_ops_disable_workfn+0x10/0x10
[   14.936111]  scx_ops_disable_workfn+0x146/0xac0
[   14.936142]  ? finish_task_switch+0xa9/0x2c0
[   14.936172]  ? srso_alias_return_thunk+0x5/0xfbef5
[   14.936211]  ? __pfx_scx_ops_disable_workfn+0x10/0x10
[   14.936244]  kthread_worker_fn+0x101/0x2c0
[   14.936268]  ? __pfx_kthread_worker_fn+0x10/0x10
[   14.936299]  kthread+0xec/0x110
[   14.936327]  ? __pfx_kthread+0x10/0x10
[   14.936351]  ret_from_fork+0x37/0x50
[   14.936374]  ? __pfx_kthread+0x10/0x10
[   14.936400]  ret_from_fork_asm+0x1a/0x30
[   14.936427]  </TASK>
[   14.936443] irq event stamp: 21002
[   14.936467] hardirqs last  enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0
[   14.936521] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280
[   14.936571] softirqs last  enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[   14.936622] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[   14.936672] ---[ end trace 0000000000000000 ]---
[   14.953282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[   14.953352] ------------[ cut here ]------------
[   14.953383] WARNING: CPU: 2 PID: 360 at kernel/sched/ext.c:4335 scx_ops_bypass+0x1d8/0x280
[   14.953428] Modules linked in:
[   14.953453] CPU: 2 UID: 0 PID: 360 Comm: sched_ext_ops_h Tainted: G        W          6.11.0-virtme #24
[   14.953505] Tainted: [W]=WARN
[   14.953527] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   14.953574] RIP: 0010:scx_ops_bypass+0x1d8/0x280
[   14.953603] Code: c6 05 7b 34 e8 01 01 90 48 c7 c7 89 86 88 87 e8 be 1d f8 ff 90 0f 0b 90 90 eb 95 90 0f 0b 90 41 8b 84 24 24 0a 00 00 eb 97 90 <0f> 0b 90 41 8b 84 24 24 0a 00 00 eb 92 f3 0f 1e fa 49 8d 84 24 f0
[   14.953693] RSP: 0018:ffffb706c0957ce0 EFLAGS: 00010046
[   14.953722] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000001
[   14.953763] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8fc5fec31318
[   14.953804] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
[   14.953845] R10: ffffffff863a9795 R11: 0000000000000000 R12: ffff8fc5fec31300
[   14.953888] R13: ffff8fc5fec31318 R14: 0000000000000286 R15: 0000000000000018
[   14.953934] FS:  0000000000000000(0000) GS:ffff8fc5fe680000(0000) knlGS:0000000000000000
[   14.953974] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   14.954009] CR2: 0000557d92890b88 CR3: 000000002464a000 CR4: 0000000000750ef0
[   14.954052] PKRU: 55555554
[   14.954068] Call Trace:
[   14.954085]  <TASK>
[   14.954102]  ? __warn+0xce/0x220
[   14.954126]  ? scx_ops_bypass+0x1d8/0x280
[   14.954150]  ? report_bug+0xc1/0x160
[   14.954178]  ? handle_bug+0x61/0x90
[   14.954203]  ? exc_invalid_op+0x1a/0x50
[   14.954226]  ? asm_exc_invalid_op+0x1a/0x20
[   14.954250]  ? raw_spin_rq_lock_nested+0x15/0x30
[   14.954285]  ? scx_ops_bypass+0x1d8/0x280
[   14.954311]  ? __mutex_unlock_slowpath+0x3a/0x260
[   14.954343]  scx_ops_disable_workfn+0xa3e/0xac0
[   14.954381]  ? __pfx_scx_ops_disable_workfn+0x10/0x10
[   14.954413]  kthread_worker_fn+0x101/0x2c0
[   14.954442]  ? __pfx_kthread_worker_fn+0x10/0x10
[   14.954479]  kthread+0xec/0x110
[   14.954507]  ? __pfx_kthread+0x10/0x10
[   14.954530]  ret_from_fork+0x37/0x50
[   14.954553]  ? __pfx_kthread+0x10/0x10
[   14.954576]  ret_from_fork_asm+0x1a/0x30
[   14.954603]  </TASK>
[   14.954621] irq event stamp: 21002
[   14.954644] hardirqs last  enabled at (21001): [<ffffffff863aa35f>] resched_cpu+0x9f/0xd0
[   14.954686] hardirqs last disabled at (21002): [<ffffffff863dd0ba>] scx_ops_bypass+0x11a/0x280
[   14.954735] softirqs last  enabled at (20642): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[   14.954782] softirqs last disabled at (20637): [<ffffffff863683d7>] __irq_exit_rcu+0x67/0xd0
[   14.954829] ---[ end trace 0000000000000000 ]---
[   15.022283] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[   15.092282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[   15.149282] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
ok 1 exit #
=====  END  =====

And with it, the test passes without issue after 1000s of runs:

.[root@virtme-ng sched_ext]# ./runner -t exit
===== START =====
TEST: exit
DESCRIPTION: Verify we can cleanly exit a scheduler in multiple places
OUTPUT:
[    7.412856] sched_ext: BPF scheduler "exit" enabled
[    7.427924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[    7.466677] sched_ext: BPF scheduler "exit" enabled
[    7.475923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[    7.512803] sched_ext: BPF scheduler "exit" enabled
[    7.532924] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[    7.586809] sched_ext: BPF scheduler "exit" enabled
[    7.595926] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[    7.661923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
[    7.723923] sched_ext: BPF scheduler "exit" disabled (unregistered from BPF)
ok 1 exit #
=====  END  =====

=============================

RESULTS:

PASSED:  1
SKIPPED: 0
FAILED:  0

Fixes: f0e1a0643a ("sched_ext: Implement BPF extensible scheduler class")
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-25 11:10:51 -10:00
Steven Rostedt
a574e7f80e fgraph: Change the name of cpuhp state to "fgraph:online"
The cpuhp state name given to cpuhp_setup_state() is "fgraph_idle_init"
which doesn't really conform to the names that are used for cpu hotplug
setups. Instead rename it to "fgraph:online" to be in line with other
states.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20241024222944.473d88c5@rorschach.local.home
Suggested-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 2c02f7375e ("fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-24 23:41:14 -04:00
Li Huafei
bd3734db86 fgraph: Fix missing unlock in register_ftrace_graph()
Use guard(mutex)() to acquire and automatically release ftrace_lock,
fixing the issue of not unlocking when calling cpuhp_setup_state()
fails.

Fixes smatch warning:

kernel/trace/fgraph.c:1317 register_ftrace_graph() warn: inconsistent returns '&ftrace_lock'.
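
A hedged sketch of the pattern (the cpuhp callback name is an assumption
taken from the surrounding code and may differ):

    guard(mutex)(&ftrace_lock);

    ret = cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph_idle_init",
                            fgraph_cpu_init, NULL);
    if (ret < 0)
            return ret;     /* ftrace_lock is released automatically */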

Link: https://lore.kernel.org/20241024155917.1019580-1-lihuafei1@huawei.com
Fixes: 2c02f7375e ("fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202410220121.wxg0olfd-lkp@intel.com/
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Li Huafei <lihuafei1@huawei.com>
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-24 22:26:06 -04:00
Linus Torvalds
ae90f6a617 BPF fixes:
- Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF
   sockmap link file descriptors (Hou Tao)
 
 - Fix BPF arm64 JIT's address emission with tag-based KASAN
   enabled reserving not enough size (Peter Collingbourne)
 
 - Fix BPF verifier do_misc_fixups patching for inlining of the
   bpf_get_branch_snapshot BPF helper (Andrii Nakryiko)
 
 - Fix a BPF verifier bug and reject BPF program write attempts
   into read-only marked BPF maps (Daniel Borkmann)
 
 - Fix perf_event_detach_bpf_prog error handling by removing an
   invalid check which would skip BPF program release (Jiri Olsa)
 
 - Fix memory leak when parsing mount options for the BPF
   filesystem (Hou Tao)
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZxrAzxUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIPcHwD8DnBSPlHX9OezMWCm8mjVx2Fd26W9
 /IaiW2tyOPtoSGIA/3hfgfLrxkb3Raoh0miQB2+FRrz9e+y7i8c4Q91mcUgJ
 =Hvht
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann:

 - Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap
   link file descriptors (Hou Tao)

 - Fix BPF arm64 JIT's address emission with tag-based KASAN enabled
   reserving not enough size (Peter Collingbourne)

 - Fix BPF verifier do_misc_fixups patching for inlining of the
   bpf_get_branch_snapshot BPF helper (Andrii Nakryiko)

 - Fix a BPF verifier bug and reject BPF program write attempts into
   read-only marked BPF maps (Daniel Borkmann)

 - Fix perf_event_detach_bpf_prog error handling by removing an invalid
   check which would skip BPF program release (Jiri Olsa)

 - Fix memory leak when parsing mount options for the BPF filesystem
   (Hou Tao)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Check validity of link->type in bpf_link_show_fdinfo()
  bpf: Add the missing BPF_LINK_TYPE invocation for sockmap
  bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()
  bpf,perf: Fix perf_event_detach_bpf_prog error handling
  selftests/bpf: Add test for passing in uninit mtu_len
  selftests/bpf: Add test for writes to .rodata
  bpf: Remove MEM_UNINIT from skb/xdp MTU helpers
  bpf: Fix overloading of MEM_UNINIT's meaning
  bpf: Add MEM_WRITE attribute
  bpf: Preserve param->string when parsing mount options
  bpf, arm64: Fix address emission with tag-based KASAN enabled
2024-10-24 16:53:20 -07:00
Linus Torvalds
d44cd82264 Including fixes from netfilter, xfrm and bluetooth.
Current release - regressions:
 
   - posix-clock: Fix unbalanced locking in pc_clock_settime()
 
   - netfilter: fix typo causing some targets not to load on IPv6
 
 Current release - new code bugs:
 
   - xfrm: policy: remove last remnants of pernet inexact list
 
 Previous releases - regressions:
 
   - core: fix races in netdev_tx_sent_queue()/dev_watchdog()
 
   - bluetooth: fix UAF on sco_sock_timeout
 
   - eth: hv_netvsc: fix VF namespace also in synthetic NIC NETDEV_REGISTER event
 
   - eth: usbnet: fix name regression
 
   - eth: be2net: fix potential memory leak in be_xmit()
 
   - eth: plip: fix transmit path breakage
 
 Previous releases - always broken:
 
   - sched: deny mismatched skip_sw/skip_hw flags for actions created by classifiers
 
   - netfilter: bpf: must hold reference on net namespace
 
   - eth: virtio_net: fix integer overflow in stats
 
   - eth: bnxt_en: replace ptp_lock with irqsave variant
 
   - eth: octeon_ep: add SKB allocation failures handling in __octep_oq_process_rx()
 
 Misc:
 
   - MAINTAINERS: add Simon as an official reviewer
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmcaTkUSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOkW8kP/iYfaxQ8zR61wUU7bOcVUSnEADR9XQ1H
 Nta5Z0tDJprZv254XW3hYDzU0Iy3OgclRE1oewF5fQVLn6Sfg4U5awxRTNdJw7KV
 wj62ziAv/xht2W/4nBsNfYkOZaDAibItbKtxlkOhgCGXSrXBoS22IonKRqEv2HLV
 Gu0vAY/VI9YNvB5Z6SEKFmQp2bWfX79AChVT72shLBLakOCUHBavk/DOU56XH1Ci
 IRmU5Lt8ysXWxCTF91rPCAbMyuxBbIv6phIKPV2ALpRUd6ha5nBqcl0wcS7Y1E+/
 0XOV71zjcXFoE/6hc5W3/mC7jm+ipXKVJOnIkCcWq40p6kDVJJ+E1RWEr5JxGEyF
 FtnUCZ8iK/F3/jSalMras2z+AZ/CGtfHF9wAS3YfMGtOJJb/k4dCxAddp7UzD9O4
 yxAJhJ0DrVuplzwovL5owoJJXeRAMQeFydzHBYun5P8Sc9TtvviICi19fMgKGn4O
 eUQhjgZZY371sPnTDLDEw1Oqzs9qeaeV3S2dSeFJ98PQuPA5KVOf/R2/CptBIMi5
 +UNcqeXrlUeYSBW94pPioEVStZDrzax5RVKh/Jo1tTnKzbnWDOOKZqSVsGPMWXdO
 0aBlGuSsNe36VDg2C0QMxGk7+gXbKmk9U4+qVQH3KMpB8uqdAu5deMbTT6dfcwBV
 O/BaGiqoR4ak
 =dR3Q
 -----END PGP SIGNATURE-----

Merge tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from netfilter, xfrm and bluetooth.

  Oddly this includes a fix for a posix clock regression; in our
  previous PR we included a change there as a pre-requisite for
  networking one. That fix proved to be buggy and requires the follow-up
  included here. Thomas suggested we should send it, given we sent the
  buggy patch.

  Current release - regressions:

   - posix-clock: Fix unbalanced locking in pc_clock_settime()

   - netfilter: fix typo causing some targets not to load on IPv6

  Current release - new code bugs:

   - xfrm: policy: remove last remnants of pernet inexact list

  Previous releases - regressions:

   - core: fix races in netdev_tx_sent_queue()/dev_watchdog()

   - bluetooth: fix UAF on sco_sock_timeout

   - eth: hv_netvsc: fix VF namespace also in synthetic NIC
     NETDEV_REGISTER event

   - eth: usbnet: fix name regression

   - eth: be2net: fix potential memory leak in be_xmit()

   - eth: plip: fix transmit path breakage

  Previous releases - always broken:

   - sched: deny mismatched skip_sw/skip_hw flags for actions created by
     classifiers

   - netfilter: bpf: must hold reference on net namespace

   - eth: virtio_net: fix integer overflow in stats

   - eth: bnxt_en: replace ptp_lock with irqsave variant

   - eth: octeon_ep: add SKB allocation failures handling in
     __octep_oq_process_rx()

  Misc:

   - MAINTAINERS: add Simon as an official reviewer"

* tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits)
  net: dsa: mv88e6xxx: support 4000ps cycle counter period
  net: dsa: mv88e6xxx: read cycle counter period from hardware
  net: dsa: mv88e6xxx: group cycle counter coefficients
  net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition
  hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event
  net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x
  Bluetooth: ISO: Fix UAF on iso_sock_timeout
  Bluetooth: SCO: Fix UAF on sco_sock_timeout
  Bluetooth: hci_core: Disable works on hci_unregister_dev
  posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()
  r8169: avoid unsolicited interrupts
  net: sched: use RCU read-side critical section in taprio_dump()
  net: sched: fix use-after-free in taprio_change()
  net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers
  net: usb: usbnet: fix name regression
  mlxsw: spectrum_router: fix xa_store() error checking
  virtio_net: fix integer overflow in stats
  net: fix races in netdev_tx_sent_queue()/dev_watchdog()
  net: wwan: fix global oob in wwan_rtnl_policy
  netfilter: xtables: fix typo causing some targets not to load on IPv6
  ...
2024-10-24 16:43:50 -07:00
Hou Tao
8421d4c876 bpf: Check validity of link->type in bpf_link_show_fdinfo()
If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing
bpf_link_type_strs[link->type] may result in an out-of-bounds access.

To spot such missed invocations early in the future, check the
validity of link->type in bpf_link_show_fdinfo() and emit a warning
when such an invocation is missed.
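
A hedged sketch of the check (the exact warning wording is illustrative):

    if (link->type >= ARRAY_SIZE(bpf_link_type_strs) ||
        !bpf_link_type_strs[link->type]) {
            WARN_ONCE(1, "missing BPF_LINK_TYPE() for link type %u\n",
                      link->type);
            return;
    }

    seq_printf(m, "link_type:\t%s\n", bpf_link_type_strs[link->type]);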

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com
2024-10-24 10:17:12 -07:00
Andrii Nakryiko
9806f28314 bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()
We need `goto next_insn;` at the end of patching instead of `continue;`.
It currently works by accident because the verifier re-processes the
patched instructions.
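
A hedged sketch of the patching-loop shape in do_misc_fixups() (names
follow the usual patching pattern there and the inlined sequence itself is
elided):

    if (insn->imm == BPF_FUNC_get_branch_snapshot) {
            /* build new_prog with the inlined sequence, then resume the
             * scan after the patched instructions
             */
            env->prog = prog = new_prog;
            insn      = new_prog->insnsi + i + delta;
            goto next_insn;         /* was: continue; */
    }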

Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Fixes: 314a53623c ("bpf: inline bpf_get_branch_snapshot() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20241023161916.2896274-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-23 22:16:45 -07:00
Jiri Olsa
0ee288e69d bpf,perf: Fix perf_event_detach_bpf_prog error handling
Peter reported that perf_event_detach_bpf_prog might skip releasing
the bpf program on an -ENOENT error from bpf_prog_array_copy.

This can't happen because the bpf program is stored in the perf event
and is detached and released only when the perf event is freed.

Let's drop the -ENOENT check and make sure the bpf program is released
in any case.
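
A hedged sketch of the simplified tail of perf_event_detach_bpf_prog()
(the error handling around the copy is the part that changed):

    ret = bpf_prog_array_copy(old_array, event->prog, NULL, 0, &new_array);
    if (ret < 0)
            bpf_prog_array_delete_safe(old_array, event->prog);
    else {
            rcu_assign_pointer(event->tp_event->prog_array, new_array);
            bpf_prog_array_free_sleepable(old_array);
    }

    /* Release the program unconditionally; no early return on -ENOENT. */
    bpf_prog_put(event->prog);
    event->prog = NULL;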

Fixes: 170a7e3ea0 ("bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not found")
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241023200352.3488610-1-jolsa@kernel.org

Closes: https://lore.kernel.org/lkml/20241022111638.GC16066@noisy.programming.kicks-ass.net/
2024-10-23 14:33:02 -07:00
Peter Zijlstra
b55945c500 sched: Fix pick_next_task_fair() vs try_to_wake_up() race
Syzkaller robot reported KCSAN tripping over the
ASSERT_EXCLUSIVE_WRITER(p->on_rq) in __block_task().

The report noted that both pick_next_task_fair() and try_to_wake_up()
were concurrently trying to write to the same p->on_rq, violating the
assertion -- even though both paths hold rq->__lock.

The logical consequence is that both code paths end up holding a
different rq->__lock. And looking through ttwu(), this is possible
when the __block_task() 'p->on_rq = 0' store is visible to the ttwu()
'p->on_rq' load, which then assumes the task is not queued and
continues to migrate it.

Rearrange things such that __block_task() releases @p with the store
and no code thereafter will use @p again.

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: syzbot+0ec1e96c2cdf5c0e512a@syzkaller.appspotmail.com
Reported-by: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Marco Elver <elver@google.com>
Link: https://lkml.kernel.org/r/20241023093641.GE16066@noisy.programming.kicks-ass.net
2024-10-23 20:52:26 +02:00
Kan Liang
e3dfd64c1f perf: Fix missing RCU reader protection in perf_event_clear_cpumask()
Running rcutorture scenario TREE05, the below warning is triggered.

[   32.604594] WARNING: suspicious RCU usage
[   32.605928] 6.11.0-rc5-00040-g4ba4f1afb6a9 #55238 Not tainted
[   32.607812] -----------------------------
[   32.609140] kernel/events/core.c:13946 RCU-list traversed in non-reader section!!
[   32.611595] other info that might help us debug this:
[   32.614247] rcu_scheduler_active = 2, debug_locks = 1
[   32.616392] 3 locks held by cpuhp/4/35:
[   32.617687]  #0: ffffffffb666a650 (cpu_hotplug_lock){++++}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[   32.620563]  #1: ffffffffb666cd20 (cpuhp_state-down){+.+.}-{0:0}, at: cpuhp_thread_fun+0x4e/0x200
[   32.623412]  #2: ffffffffb677c288 (pmus_lock){+.+.}-{3:3}, at: perf_event_exit_cpu_context+0x32/0x2f0

perf_event_clear_cpumask() uses list_for_each_entry_rcu() without an
obvious RCU read-side critical section.

Either pmus_srcu or pmus_lock is good enough to protect the pmus list.
In the current context, pmus_lock is already held. The
list_for_each_entry_rcu() is not required.
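
A hedged sketch of the iteration under pmus_lock (the per-PMU body is
omitted; pmu is the existing loop cursor):

    lockdep_assert_held(&pmus_lock);

    /* pmus_lock already protects the pmus list here, so plain list
     * iteration is sufficient and avoids the false-positive RCU warning.
     */
    list_for_each_entry(pmu, &pmus, entry) {
            /* per-PMU cpumask handling */
    }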

Fixes: 4ba4f1afb6 ("perf: Generic hotplug support for a PMU with a scope")
Closes: https://lore.kernel.org/lkml/2b66dff8-b827-494b-b151-1ad8d56f13e6@paulmck-laptop/
Closes: https://lore.kernel.org/oe-lkp/202409131559.545634cc-oliver.sang@intel.com
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reported-by: kernel test robot <oliver.sang@intel.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: "Paul E. McKenney" <paulmck@kernel.org>
Link: https://lore.kernel.org/r/20240913162340.2142976-1-kan.liang@linux.intel.com
2024-10-23 20:52:25 +02:00
Jinjie Ruan
6e62807c7f posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()
If get_clock_desc() succeeds, it calls fget() for the clockid's fd
and takes the clk->rwsem read lock, so the error path should release
the lock to keep the locking balanced and fput() the clockid's fd to
keep the refcount balanced and release the fd-related resources.

However, the commit below left the error path locked, resulting in
unbalanced locking. Check timespec64_valid_strict() before
get_clock_desc() to fix it, because "ts" is not changed
after that.
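
A hedged sketch of the reordering in pc_clock_settime() (the write
permission check and the actual clock_settime call are elided):

    static int pc_clock_settime(clockid_t id, const struct timespec64 *ts)
    {
            struct posix_clock_desc cd;
            int err;

            /* Validate caller input before taking any references or
             * locks, so every later error path only has to undo what
             * get_clock_desc() set up.
             */
            if (!timespec64_valid_strict(ts))
                    return -EINVAL;

            err = get_clock_desc(id, &cd);
            if (err)
                    return err;

            /* ... permission check and clk->clock_settime() call ... */

            put_clock_desc(&cd);
            return err;
    }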

Fixes: d8794ac20a ("posix-clock: Fix missing timespec64 check in pc_clock_settime()")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
[pabeni@redhat.com: fixed commit message typo]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-10-23 16:05:01 +02:00
Leo Yan
0b6e2e22cb tracing: Consider the NULL character when validating the event length
strlen() returns the string length excluding the null byte. If the string
length equals the maximum buffer length, the buffer will have no
space for the NUL terminating character.

This commit checks this condition and returns failure for it.
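
A hedged sketch of the corrected length check (MAX_EVENT_NAME_LEN stands in
for the relevant limit):

    len = strlen(event);
    /* A name whose strlen() equals the buffer length would leave no room
     * for the terminating '\0', so reject it as well.
     */
    if (len == 0 || len >= MAX_EVENT_NAME_LEN)
            return -EINVAL;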

Link: https://lore.kernel.org/all/20241007144724.920954-1-leo.yan@arm.com/

Fixes: dec65d79fd ("tracing/probe: Check event name length correctly")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-10-23 17:24:47 +09:00
Mikel Rychliski
73f3508047 tracing/probes: Fix MAX_TRACE_ARGS limit handling
When creating a trace_probe we would set nr_args prior to truncating the
arguments to MAX_TRACE_ARGS. However, we would only initialize arguments
up to the limit.

This caused invalid memory access when attempting to set up probes with
more than 128 fetchargs.

  BUG: kernel NULL pointer dereference, address: 0000000000000020
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 0 P4D 0
  Oops: Oops: 0000 [#1] PREEMPT SMP PTI
  CPU: 0 UID: 0 PID: 1769 Comm: cat Not tainted 6.11.0-rc7+ #8
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
  RIP: 0010:__set_print_fmt+0x134/0x330

Resolve the issue by applying the MAX_TRACE_ARGS limit earlier. Return
an error when there are too many arguments instead of silently
truncating.
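
A hedged sketch of the earlier limit check (the error code is illustrative):

    /* Reject an over-long argument list up front instead of recording
     * nr_args first and truncating afterwards.
     */
    if (argc > MAX_TRACE_ARGS)
            return -E2BIG;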

Link: https://lore.kernel.org/all/20240930202656.292869-1-mikel@mikelr.com/

Fixes: 035ba76014 ("tracing/probes: cleanup: Set trace_probe::nr_args at trace_probe_init")
Signed-off-by: Mikel Rychliski <mikel@mikelr.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-10-23 17:24:44 +09:00
Thorsten Blum
89282beaf7 audit: Use str_yes_no() helper function
Remove hard-coded strings by using the helper function str_yes_no().

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-22 19:16:50 -04:00
Daniel Borkmann
8ea607330a bpf: Fix overloading of MEM_UNINIT's meaning
Lonial reported an issue in the BPF verifier where check_mem_size_reg()
has the following code:

    if (!tnum_is_const(reg->var_off))
        /* For unprivileged variable accesses, disable raw
         * mode so that the program is required to
         * initialize all the memory that the helper could
         * just partially fill up.
         */
         meta = NULL;

This means that writes are not checked when the register containing the
size of the passed buffer is not a known constant. Through this bug, a BPF
program can write to a map which is marked as read-only, for example,
.rodata global maps.

The problem is that MEM_UNINIT's initial meaning, "the passed buffer
to the BPF helper does not need to be initialized", which was added back
in commit 435faee1aa ("bpf, verifier: add ARG_PTR_TO_RAW_STACK type"),
got overloaded over time with "the passed buffer is being written to".

The problem, however, is that checks such as the above, which were added
later via 06c1c04972 ("bpf: allow helpers access to variable memory"), set
meta to NULL in order to force the user to always initialize the buffer
passed to the helper. Due to the current double meaning of MEM_UNINIT, this
bypasses verifier write checks to the memory (not boundary checks though)
and only assumes the memory is read instead.

Fix this by reverting MEM_UNINIT back to its original meaning, and having
MEM_WRITE as an annotation to BPF helpers in order to then trigger the
BPF verifier checks for writing to memory.

Some notes: check_arg_pair_ok() ensures that for ARG_CONST_SIZE{,_OR_ZERO}
we can access fn->arg_type[arg - 1] since it must contain a preceding
ARG_PTR_TO_MEM. For check_mem_reg() the meta argument can be removed
altogether since we do check both BPF_READ and BPF_WRITE. Same for the
equivalent check_kfunc_mem_size_reg().

Fixes: 7b3552d3f9 ("bpf: Reject writes for PTR_TO_MAP_KEY in check_helper_mem_access")
Fixes: 97e6d7dab1 ("bpf: Check PTR_TO_MEM | MEM_RDONLY in check_helper_mem_access")
Fixes: 15baa55ff5 ("bpf/verifier: allow all functions to read user provided context")
Reported-by: Lonial Con <kongln9170@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-2-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22 15:42:56 -07:00
Daniel Borkmann
6fad274f06 bpf: Add MEM_WRITE attribute
Add a MEM_WRITE attribute for BPF helper functions which can be used in
bpf_func_proto to annotate an argument type in order to let the verifier
know that the helper writes into the memory passed as an argument. In
the past MEM_UNINIT has been (ab)used for this purpose, but the latter
merely tells the verifier that the passed memory can be uninitialized.

There have been bugs with overloading the latter but aside from that
there are also cases where the passed memory is read + written which
currently cannot be expressed, see also 4b3786a6c5 ("bpf: Zero former
ARG_PTR_TO_{LONG,INT} args in case of error").
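
A hedged illustration with a hypothetical helper: the buffer argument is
annotated both as not-necessarily-initialized (MEM_UNINIT) and as written
to by the helper (MEM_WRITE), so the verifier applies its write checks to
the pointed-to memory:

    static const struct bpf_func_proto bpf_fill_buf_proto = {
            .func           = bpf_fill_buf,         /* hypothetical helper */
            .ret_type       = RET_INTEGER,
            .arg1_type      = ARG_PTR_TO_MEM | MEM_UNINIT | MEM_WRITE,
            .arg2_type      = ARG_CONST_SIZE_OR_ZERO,
    };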

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241021152809.33343-1-daniel@iogearbox.net
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-22 15:42:56 -07:00
Hou Tao
1f97c03f43 bpf: Preserve param->string when parsing mount options
In bpf_parse_param(), keep the value of param->string intact so it can
be freed later. Otherwise, the kmalloc area pointed to by param->string
will be leaked as shown below:

unreferenced object 0xffff888118c46d20 (size 8):
  comm "new_name", pid 12109, jiffies 4295580214
  hex dump (first 8 bytes):
    61 6e 79 00 38 c9 5c 7e                          any.8.\~
  backtrace (crc e1b7f876):
    [<00000000c6848ac7>] kmemleak_alloc+0x4b/0x80
    [<00000000de9f7d00>] __kmalloc_node_track_caller_noprof+0x36e/0x4a0
    [<000000003e29b886>] memdup_user+0x32/0xa0
    [<0000000007248326>] strndup_user+0x46/0x60
    [<0000000035b3dd29>] __x64_sys_fsconfig+0x368/0x3d0
    [<0000000018657927>] x64_sys_call+0xff/0x9f0
    [<00000000c0cabc95>] do_syscall_64+0x3b/0xc0
    [<000000002f331597>] entry_SYSCALL_64_after_hwframe+0x4b/0x53
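
A hedged sketch of the corrected parsing pattern, assuming the option value
is a separator-delimited list walked with strsep():

    /* Advance a local cursor instead of param->string itself; otherwise
     * strsep() leaves param->string NULL and the original allocation can
     * never be freed by the fs_context core.
     */
    char *str = param->string, *opt;

    while ((opt = strsep(&str, ":"))) {
            /* handle each delegated option token */
    }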

Fixes: 6c1752e0b6 ("bpf: Support symbolic BPF FS delegation mount options")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20241022130133.3798232-1-houtao@huaweicloud.com
2024-10-22 12:56:38 -07:00
Aleksa Sarai
112cca098a sched_getattr: port to copy_struct_to_user
sched_getattr(2) doesn't care about trailing non-zero bytes in the
(ksize > usize) case, so just use copy_struct_to_user() without checking
ignored_trailing.
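
A hedged sketch of the call (types and the exact min() form may differ):

    kattr.size = min_t(unsigned int, sizeof(kattr), usize);

    /* copy_struct_to_user() zero-fills trailing user bytes when usize is
     * larger than the kernel struct; passing NULL for ignored_trailing
     * skips the check sched_getattr() does not need.
     */
    return copy_struct_to_user(uattr, usize, &kattr, sizeof(kattr), NULL);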

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20241010-extensible-structs-check_fields-v3-2-d2833dfe6edd@cyphar.com
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-21 16:51:31 +02:00
Qiao Ma
373b9338c9 uprobe: avoid out-of-bounds memory access of fetching args
Uprobes need to fetch args into a percpu buffer, and then copy them to the
ring buffer to avoid a non-atomic context problem.

Sometimes user-space strings or arrays can be very large, but the size of
the percpu buffer is only one page. store_trace_args() does not check
whether this data exceeds a single page, causing out-of-bounds memory
access.

It can be reproduced by the following steps:
1. build kernel with CONFIG_KASAN enabled
2. save follow program as test.c

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// If the string length is larger than MAX_STRING_SIZE, fetch_store_strlen()
// will return 0, causing __get_data_size() to return a shorter size, and
// store_trace_args() will not trigger the out-of-bounds access.
// So make the string length less than 4096.
#define STRLEN 4093

void generate_string(char *str, int n)
{
    int i;
    for (i = 0; i < n; ++i)
    {
        char c = i % 26 + 'a';
        str[i] = c;
    }
    str[n-1] = '\0';
}

void print_string(char *str)
{
    printf("%s\n", str);
}

int main()
{
    char tmp[STRLEN];

    generate_string(tmp, STRLEN);
    print_string(tmp);

    return 0;
}
```
3. compile program
`gcc -o test test.c`

4. get the offset of `print_string()`
```
objdump -t test | grep -w print_string
0000000000401199 g     F .text  000000000000001b              print_string
```

5. configure uprobe with offset 0x1199
```
off=0x1199

cd /sys/kernel/debug/tracing/
echo "p /root/test:${off} arg1=+0(%di):ustring arg2=\$comm arg3=+0(%di):ustring"
 > uprobe_events
echo 1 > events/uprobes/enable
echo 1 > tracing_on
```

6. run `test`, and kasan will report error.
==================================================================
BUG: KASAN: use-after-free in strncpy_from_user+0x1d6/0x1f0
Write of size 8 at addr ffff88812311c004 by task test/499
CPU: 0 UID: 0 PID: 499 Comm: test Not tainted 6.12.0-rc3+ #18
Hardware name: Red Hat KVM, BIOS 1.16.0-4.al8 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x55/0x70
 print_address_description.constprop.0+0x27/0x310
 kasan_report+0x10f/0x120
 ? strncpy_from_user+0x1d6/0x1f0
 strncpy_from_user+0x1d6/0x1f0
 ? rmqueue.constprop.0+0x70d/0x2ad0
 process_fetch_insn+0xb26/0x1470
 ? __pfx_process_fetch_insn+0x10/0x10
 ? _raw_spin_lock+0x85/0xe0
 ? __pfx__raw_spin_lock+0x10/0x10
 ? __pte_offset_map+0x1f/0x2d0
 ? unwind_next_frame+0xc5f/0x1f80
 ? arch_stack_walk+0x68/0xf0
 ? is_bpf_text_address+0x23/0x30
 ? kernel_text_address.part.0+0xbb/0xd0
 ? __kernel_text_address+0x66/0xb0
 ? unwind_get_return_address+0x5e/0xa0
 ? __pfx_stack_trace_consume_entry+0x10/0x10
 ? arch_stack_walk+0xa2/0xf0
 ? _raw_spin_lock_irqsave+0x8b/0xf0
 ? __pfx__raw_spin_lock_irqsave+0x10/0x10
 ? depot_alloc_stack+0x4c/0x1f0
 ? _raw_spin_unlock_irqrestore+0xe/0x30
 ? stack_depot_save_flags+0x35d/0x4f0
 ? kasan_save_stack+0x34/0x50
 ? kasan_save_stack+0x24/0x50
 ? mutex_lock+0x91/0xe0
 ? __pfx_mutex_lock+0x10/0x10
 prepare_uprobe_buffer.part.0+0x2cd/0x500
 uprobe_dispatcher+0x2c3/0x6a0
 ? __pfx_uprobe_dispatcher+0x10/0x10
 ? __kasan_slab_alloc+0x4d/0x90
 handler_chain+0xdd/0x3e0
 handle_swbp+0x26e/0x3d0
 ? __pfx_handle_swbp+0x10/0x10
 ? uprobe_pre_sstep_notifier+0x151/0x1b0
 irqentry_exit_to_user_mode+0xe2/0x1b0
 asm_exc_int3+0x39/0x40
RIP: 0033:0x401199
Code: 01 c2 0f b6 45 fb 88 02 83 45 fc 01 8b 45 fc 3b 45 e4 7c b7 8b 45 e4 48 98 48 8d 50 ff 48 8b 45 e8 48 01 d0 ce
RSP: 002b:00007ffdf00576a8 EFLAGS: 00000206
RAX: 00007ffdf00576b0 RBX: 0000000000000000 RCX: 0000000000000ff2
RDX: 0000000000000ffc RSI: 0000000000000ffd RDI: 00007ffdf00576b0
RBP: 00007ffdf00586b0 R08: 00007feb2f9c0d20 R09: 00007feb2f9c0d20
R10: 0000000000000001 R11: 0000000000000202 R12: 0000000000401040
R13: 00007ffdf0058780 R14: 0000000000000000 R15: 0000000000000000
 </TASK>

This commit enforces the buffer's maxlen to be less than a page size to
avoid out-of-bounds memory access in store_trace_args().
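
A rough sketch of the shape of the fix (hypothetical variable names, not the
actual patch): clamp each fetched length so the total stays within the
page-sized percpu buffer.

```
/* never let a fetched argument run past the end of the page-sized
 * per-CPU buffer; 'used' is whatever has already been consumed */
maxlen = min_t(unsigned int, maxlen, PAGE_SIZE - used);
```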

Link: https://lore.kernel.org/all/20241015060148.1108331-1-mqaio@linux.alibaba.com/

Fixes: dcad1a204f ("tracing/uprobes: Fetch args before reserving a ring buffer")
Signed-off-by: Qiao Ma <mqaio@linux.alibaba.com>
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
2024-10-21 13:15:28 +09:00
Linus Torvalds
2b4d25010d - Add PREEMPT_RT maintainers
- Fix another aspect of delayed dequeued tasks wrt determining their state,
   i.e., whether they're runnable or blocked
 
 - Handle delayed dequeued tasks and their migration wrt PSI properly
 
 - Fix the situation where a delayed dequeue task gets enqueued into a new
   class, which should not happen
 
 - Fix a case where memory allocation would happen while the runqueue lock is
   held, which is a no-no
 
 - Do not over-schedule when tasks with shorter slices preempt the currently
   running task
 
 - Make sure delayed-dequeue entities are properly handled before unthrottling
 
 - Other smaller cleanups and improvements
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmcU3tMACgkQEsHwGGHe
 VUqpuhAAqqyi2NNgrIOlEWh/Ej4NQZL7KleF84cSpKCIBK2somYX5ksgMcUgn82i
 bIuDVErQu/a4lhNAf5zn7TO3yuPA1Q5xd/453qBlWM9ApkH0S69Mp9f0yocVu8F0
 t3XsgXm+/R8A4TYbiv8cB+r1Xt8E5NUP6RkNIKCHbPLAG94gqYF8UZEJ9sAl9ZXw
 qEWc9afpnp+4LQ9PlzePuaM976LWUPB49OoFZMnFmY1VkvFuVjkjXSVzJX6l4qB7
 Omo/+TXOOBSHXVVflNx/68Q16irFHAnqwPPrLCBQWBLIPz3iRiZjV9ptD9tUZkRM
 M+klL7w0jRG+8wa9fTwuqybmBNIBt4Az1/WUw9Lc3ryEWRsCKzkGT8au3lv5FpQY
 CTwIIBSMmUcqQSG40R0HHS3nDR4UBFFD0PAww+8cJQZc0IPd2rT9/hfqYdt3sq2Z
 vV9rmTFOcDlApeDdCGcfC7zJhdgVuBgDVjdTsE5SNRUduBUsBYOeLDnT+0Qi0ArJ
 txVINGxQDm6jz512f4CAB/xzUcYpU4o639Z1Jkd6a8QbO1NBZGX1ioOcvPEMhmFF
 f/qFyM8ctR5Kj6LJCZiDcstqtAZviW1d2uMp48gk2QeSvkCyIUQqrWshItd02iBG
 TZdSYRvSYtYSIz7WYtE/CABUDmrJGjuLtb+jOrR93//TsWwwVdE=
 =1D7H
 -----END PGP SIGNATURE-----

Merge tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull scheduling fixes from Borislav Petkov:

 - Add PREEMPT_RT maintainers

 - Fix another aspect of delayed dequeued tasks wrt determining their
   state, i.e., whether they're runnable or blocked

 - Handle delayed dequeued tasks and their migration wrt PSI properly

 - Fix the situation where a delayed dequeue task gets enqueued into a
   new class, which should not happen

 - Fix a case where memory allocation would happen while the runqueue
   lock is held, which is a no-no

 - Do not over-schedule when tasks with shorter slices preempt the
   currently running task

 - Make sure delayed-dequeue entities are properly handled before
   unthrottling

 - Other smaller cleanups and improvements

* tag 'sched_urgent_for_v6.12_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  MAINTAINERS: Add an entry for PREEMPT_RT.
  sched/fair: Fix external p->on_rq users
  sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
  sched/core: Dequeue PSI signals for blocked tasks that are delayed
  sched: Fix delayed_dequeue vs switched_from_fair()
  sched/core: Disable page allocation in task_tick_mm_cid()
  sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
  sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
  sched: Fix sched_delayed vs cfs_bandwidth
2024-10-20 11:30:56 -07:00
Linus Torvalds
06526daaff ftrace: A couple of fixes to function graph infrastructure
- Fix allocation of idle shadow stack allocation during hotplug
 
   If function graph tracing is started when a CPU is offline and that CPU
   comes online during the trace, the idle task that represents the CPU
   will not get a shadow stack allocated for it. This means all function
   graph hooks that happen while that idle task is running (including in
   interrupt mode) will have their events dropped.
 
   Switch over to the CPU hotplug mechanism that will have any newly
   brought on line CPU get a callback that can allocate the shadow stack
   for its idle task.
 
 - Fix allocation size of the ret_stack_list array
 
   When function graph tracing converted over to allowing more than one
   user at a time, it had to convert its shadow stack from an array of
   ret_stack structures to an array of unsigned longs. The shadow stacks
   are allocated in batches of 32 at a time and assigned to every running
   task. The batch is held by the ret_stack_list array. But when the
   conversion happened, instead of allocating an array of 32 pointers, it
   was allocated as a ret_stack itself (PAGE_SIZE). This ret_stack_list
   gets passed to a function that iterates over what it believes is its
   size defined by the FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32).
 
   Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise this
   would have been an array overflow. This still should be fixed and the
   ret_stack_list should be allocated to the size it is expected to be as
   someday it may end up being bigger than SHADOW_STACK_SIZE.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZxP8RhQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmW3AP4qCOvU/g9g6u32gIZmS1oUWqe3q+Rq
 9OKCk0JP6GGc8AD/cF816lbs5vpDiZFdbBvaz5gLHqhfAt35NVU8T5tbJA4=
 =Lh3A
 -----END PGP SIGNATURE-----

Merge tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ftrace fixes from Steven Rostedt:
 "A couple of fixes to function graph infrastructure:

   - Fix allocation of idle shadow stack allocation during hotplug

     If function graph tracing is started when a CPU is offline and that
     CPU comes online during the trace, the idle task that represents the
     CPU will not get a shadow stack allocated for it. This means all
     function graph hooks that happen while that idle task is running
     (including in interrupt mode) will have their events dropped.

     Switch over to the CPU hotplug mechanism that will have any newly
     brought on line CPU get a callback that can allocate the shadow
     stack for its idle task.

   - Fix allocation size of the ret_stack_list array

     When function graph tracing converted over to allowing more than
     one user at a time, it had to convert its shadow stack from an
     array of ret_stack structures to an array of unsigned longs. The
     shadow stacks are allocated in batches of 32 at a time and assigned
     to every running task. The batch is held by the ret_stack_list
     array.

     But when the conversion happened, instead of allocating an array of
     32 pointers, it was allocated as a ret_stack itself (PAGE_SIZE).
     This ret_stack_list gets passed to a function that iterates over
     what it believes is its size defined by the
     FTRACE_RETSTACK_ALLOC_SIZE macro (which is 32).

     Luckily (PAGE_SIZE) is greater than 32 * sizeof(long), otherwise
     this would have been an array overflow. This still should be fixed
     and the ret_stack_list should be allocated to the size it is
     expected to be as someday it may end up being bigger than
     SHADOW_STACK_SIZE"

* tag 'ftrace-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  fgraph: Allocate ret_stack_list with proper size
  fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks
2024-10-19 12:42:14 -07:00
Steven Rostedt
fae4078c28 fgraph: Allocate ret_stack_list with proper size
The ret_stack_list is an array of ret_stack shadow stacks for the function
graph usage. When the first function graph is enabled, all tasks in the
system get a shadow stack. The ret_stack_list is a 32 element array of
pointers to these shadow stacks. It allocates the shadow stack in batches
(32 stacks at a time), assigns them to running tasks, and continues until
all tasks are covered.

When the function graph shadow stack changed from an array of
ftrace_ret_stack structures to an array of longs, the allocation of
ret_stack_list went from allocating an array of 32 elements to just a
block defined by SHADOW_STACK_SIZE. Luckily, that's defined as PAGE_SIZE
and is much more than enough to hold 32 pointers. But it is way overkill
for the amount needed to allocate.

Change the allocation of ret_stack_list back to a kcalloc() of
FTRACE_RETSTACK_ALLOC_SIZE pointers.
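
Roughly, the allocation becomes something like (a sketch, not the verbatim
diff; the element type follows the "array of pointers to shadow stacks"
description above):

  unsigned long **ret_stack_list;

  ret_stack_list = kcalloc(FTRACE_RETSTACK_ALLOC_SIZE,
                           sizeof(*ret_stack_list), GFP_KERNEL);
  if (!ret_stack_list)
      return -ENOMEM;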

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241018215212.23f13f40@rorschach
Fixes: 42675b723b ("function_graph: Convert ret_stack to a series of longs")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-18 21:57:20 -04:00
Steven Rostedt
2c02f7375e fgraph: Use CPU hotplug mechanism to initialize idle shadow stacks
The function graph infrastructure allocates a shadow stack for every task
when enabled. This includes the idle tasks. The first time the function
graph is invoked, the shadow stacks are created and never freed until the
task exits. This includes the idle tasks.

Only the idle tasks that were for online CPUs had their shadow stacks
created when function graph tracing started. If function graph tracing is
enabled and a CPU comes online, the idle task representing that CPU will
not have its shadow stack created, and all function graph tracing for that
idle task will be silently dropped.

Instead, use the CPU hotplug mechanism to allocate the idle shadow stacks.
This will include idle tasks for CPUs that come online during tracing.
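
Schematically, the registration looks like the sketch below (the callback
name is illustrative, not necessarily the one used in the patch):

  static int fgraph_cpu_init(unsigned int cpu)
  {
      /* allocate the shadow stack for this CPU's idle task here */
      return 0;
  }

  cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, "fgraph:online",
                    fgraph_cpu_init, NULL);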

This issue can be reproduced by:

 # cd /sys/kernel/tracing
 # echo 0 > /sys/devices/system/cpu/cpu1/online
 # echo 0 > set_ftrace_pid
 # echo function_graph > current_tracer
 # echo 1 > options/funcgraph-proc
 # echo 1 > /sys/devices/system/cpu/cpu1/online
 # grep '<idle>' per_cpu/cpu1/trace | head

Before, nothing would show up.

After:
 1)    <idle>-0    |   0.811 us    |                        __enqueue_entity();
 1)    <idle>-0    |   5.626 us    |                      } /* enqueue_entity */
 1)    <idle>-0    |               |                      dl_server_update_idle_time() {
 1)    <idle>-0    |               |                        dl_scaled_delta_exec() {
 1)    <idle>-0    |   0.450 us    |                          arch_scale_cpu_capacity();
 1)    <idle>-0    |   1.242 us    |                        }
 1)    <idle>-0    |   1.908 us    |                      }
 1)    <idle>-0    |               |                      dl_server_start() {
 1)    <idle>-0    |               |                        enqueue_dl_entity() {
 1)    <idle>-0    |               |                          task_contending() {

Note, if tracing stops and restarts, the old way would then initialize
the onlined CPUs.

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/20241018214300.6df82178@rorschach
Fixes: 868baf07b1 ("ftrace: Fix memory leak with function graph and cpu hotplug")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-18 21:56:56 -04:00
Linus Torvalds
3d5ad2d4ec BPF fixes:
- Fix BPF verifier to not affect subreg_def marks in its range
   propagation, from Eduard Zingerman.
 
 - Fix a truncation bug in the BPF verifier's handling of
   coerce_reg_to_size_sx, from Dimitar Kanaliev.
 
 - Fix the BPF verifier's delta propagation between linked
   registers under 32-bit addition, from Daniel Borkmann.
 
 - Fix a NULL pointer dereference in BPF devmap due to missing
   rxq information, from Florian Kauer.
 
 - Fix a memory leak in bpf_core_apply, from Jiri Olsa.
 
 - Fix an UBSAN-reported array-index-out-of-bounds in BTF
   parsing for arrays of nested structs, from Hou Tao.
 
 - Fix build ID fetching where memory areas backing the file
   were created with memfd_secret, from Andrii Nakryiko.
 
 - Fix BPF task iterator tid filtering which was incorrectly
   using pid instead of tid, from Jordan Rome.
 
 - Several fixes for BPF sockmap and BPF sockhash redirection
   in combination with vsocks, from Michal Luczaj.
 
 - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered,
   from Andrea Parri.
 
 - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the
   possibility of an infinite BPF tailcall, from Pu Lehui.
 
 - Fix a build warning from resolve_btfids that bpf_lsm_key_free
   cannot be resolved, from Thomas Weißschuh.
 
 - Fix a bug in kfunc BTF caching for modules where the wrong
   BTF object was returned, from Toke Høiland-Jørgensen.
 
 - Fix a BPF selftest compilation error in cgroup-related tests
   with musl libc, from Tony Ambardar.
 
 - Several fixes to BPF link info dumps to fill missing fields,
   from Tyrone Wu.
 
 - Add BPF selftests for kfuncs from multiple modules, checking
   that the correct kfuncs are called, from Simon Sundberg.
 
 - Ensure that internal and user-facing bpf_redirect flags
   don't overlap, also from Toke Høiland-Jørgensen.
 
 - Switch to use kvzmalloc to allocate BPF verifier environment,
   from Rik van Riel.
 
 - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic
   splat under RT, from Wander Lairson Costa.
 
 Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
 -----BEGIN PGP SIGNATURE-----
 
 iIsEABYIADMWIQTFp0I1jqZrAX+hPRXbK58LschIgwUCZxK4OhUcZGFuaWVsQGlv
 Z2VhcmJveC5uZXQACgkQ2yufC7HISIOCrwEAib2kC5EEQn5+wKVE/bnZryVX2leT
 YXdfItDCBU6zCYUA+wTU5hGGn9lcDUcZx72l/KZPDyPw7HdzNJ+6iR1zQqoM
 =f9kv
 -----END PGP SIGNATURE-----

Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann:

 - Fix BPF verifier to not affect subreg_def marks in its range
   propagation (Eduard Zingerman)

 - Fix a truncation bug in the BPF verifier's handling of
   coerce_reg_to_size_sx (Dimitar Kanaliev)

 - Fix the BPF verifier's delta propagation between linked registers
   under 32-bit addition (Daniel Borkmann)

 - Fix a NULL pointer dereference in BPF devmap due to missing rxq
   information (Florian Kauer)

 - Fix a memory leak in bpf_core_apply (Jiri Olsa)

 - Fix an UBSAN-reported array-index-out-of-bounds in BTF parsing for
   arrays of nested structs (Hou Tao)

 - Fix build ID fetching where memory areas backing the file were
   created with memfd_secret (Andrii Nakryiko)

 - Fix BPF task iterator tid filtering which was incorrectly using pid
   instead of tid (Jordan Rome)

 - Several fixes for BPF sockmap and BPF sockhash redirection in
   combination with vsocks (Michal Luczaj)

 - Fix riscv BPF JIT and make BPF_CMPXCHG fully ordered (Andrea Parri)

 - Fix riscv BPF JIT under CONFIG_CFI_CLANG to prevent the possibility
   of an infinite BPF tailcall (Pu Lehui)

 - Fix a build warning from resolve_btfids that bpf_lsm_key_free cannot
   be resolved (Thomas Weißschuh)

 - Fix a bug in kfunc BTF caching for modules where the wrong BTF object
   was returned (Toke Høiland-Jørgensen)

 - Fix a BPF selftest compilation error in cgroup-related tests with
   musl libc (Tony Ambardar)

 - Several fixes to BPF link info dumps to fill missing fields (Tyrone
   Wu)

 - Add BPF selftests for kfuncs from multiple modules, checking that the
   correct kfuncs are called (Simon Sundberg)

 - Ensure that internal and user-facing bpf_redirect flags don't overlap
   (Toke Høiland-Jørgensen)

 - Switch to use kvzmalloc to allocate BPF verifier environment (Rik van
   Riel)

 - Use raw_spinlock_t in BPF ringbuf to fix a sleep in atomic splat
   under RT (Wander Lairson Costa)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: (38 commits)
  lib/buildid: Handle memfd_secret() files in build_id_parse()
  selftests/bpf: Add test case for delta propagation
  bpf: Fix print_reg_state's constant scalar dump
  bpf: Fix incorrect delta propagation between linked registers
  bpf: Properly test iter/task tid filtering
  bpf: Fix iter/task tid filtering
  riscv, bpf: Make BPF_CMPXCHG fully ordered
  bpf, vsock: Drop static vsock_bpf_prot initialization
  vsock: Update msg_count on read_skb()
  vsock: Update rx_bytes on read_skb()
  bpf, sockmap: SK_DROP on attempted redirects of unsupported af_vsock
  selftests/bpf: Add asserts for netfilter link info
  bpf: Fix link info netfilter flags to populate defrag flag
  selftests/bpf: Add test for sign extension in coerce_subreg_to_size_sx()
  selftests/bpf: Add test for truncation after sign extension in coerce_reg_to_size_sx()
  bpf: Fix truncation bug in coerce_reg_to_size_sx()
  selftests/bpf: Assert link info uprobe_multi count & path_size if unset
  bpf: Fix unpopulated path_size when uprobe_multi fields unset
  selftests/bpf: Fix cross-compiling urandom_read
  selftests/bpf: Add test for kfunc module order
  ...
2024-10-18 16:27:14 -07:00
Daniel Borkmann
3e9e708757 bpf: Fix print_reg_state's constant scalar dump
print_reg_state() should not consider adding reg->off to reg->var_off.value
when dumping scalars. Scalars can be produced with reg->off != 0 through
BPF_ADD_CONST, and thus as-is this can skew the register log dump.

Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241016134913.32249-2-daniel@iogearbox.net
2024-10-17 11:06:34 -07:00
Daniel Borkmann
3878ae04e9 bpf: Fix incorrect delta propagation between linked registers
Nathaniel reported a bug in the linked scalar delta tracking, which can lead
to accepting a program with OOB access. The specific code is related to the
sync_linked_regs() function and the BPF_ADD_CONST flag, which signifies a
constant offset between two scalar registers tracked by the same register id.

The verifier attempts to track "similar" scalars in order to propagate bounds
information learned about one scalar to others. For instance, if r1 and r2
are known to contain the same value, then upon encountering 'if (r1 != 0x1234)
goto xyz', not only does it know that r1 is equal to 0x1234 on the path where
that conditional jump is not taken, it also knows that r2 is.

Additionally, with env->bpf_capable set, the verifier will track scalars
which should be a constant delta apart (if r1 is known to be one greater than
r2, then if r1 is known to be equal to 0x1234, r2 must be equal to 0x1233.)
The code path for the latter in adjust_reg_min_max_vals() is reached when
processing both 32 and 64-bit addition operations. While adjust_reg_min_max_vals()
knows whether dst_reg was produced by a 32 or a 64-bit addition (based on the
alu32 bool), the only information saved in dst_reg is the id of the source
register (reg->id, or'ed by BPF_ADD_CONST) and the value of the constant
offset (reg->off).

Later, the function sync_linked_regs() will attempt to use this information
to propagate bounds information from one register (known_reg) to others,
meaning, for all R in linked_regs, it copies known_reg range (and possibly
adjusting delta) into R for the case of R->id == known_reg->id.

For the delta adjustment, meaning, matching reg->id with BPF_ADD_CONST, the
verifier adjusts the register as reg = known_reg; reg += delta where delta
is computed as (s32)reg->off - (s32)known_reg->off and placed as a scalar
into a fake_reg to then simulate the addition of reg += fake_reg. This is
only correct, however, if the value in reg was created by a 64-bit addition.
When reg contains the result of a 32-bit addition operation, its upper 32
bits will always be zero. sync_linked_regs() on the other hand, may cause
the verifier to believe that the addition between fake_reg and reg overflows
into those upper bits. For example, if reg was generated by adding the
constant 1 to known_reg using a 32-bit alu operation, then reg->off is 1
and known_reg->off is 0. If known_reg is known to be the constant 0xFFFFFFFF,
sync_linked_regs() will tell the verifier that reg is equal to the constant
0x100000000. This is incorrect as the actual value of reg will be 0, as the
32-bit addition will wrap around.
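
The wraparound itself is easy to see in plain C (a standalone illustration,
independent of the verifier code):

  u64 known = 0xffffffffULL;     /* value of known_reg                    */
  u32 actual = (u32)known + 1;   /* 32-bit add: wraps around to 0         */
  u64 assumed = known + 1;       /* what the propagation assumes: 2^32,
                                    i.e. 0x100000000                      */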

Example:

  0: (b7) r0 = 0;             R0_w=0
  1: (18) r1 = 0x80000001;    R1_w=0x80000001
  3: (37) r1 /= 1;            R1_w=scalar()
  4: (bf) r2 = r1;            R1_w=scalar(id=1) R2_w=scalar(id=1)
  5: (bf) r4 = r1;            R1_w=scalar(id=1) R4_w=scalar(id=1)
  6: (04) w2 += 2147483647;   R2_w=scalar(id=1+2147483647,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  7: (04) w4 += 0 ;           R4_w=scalar(id=1+0,smin=0,smax=umax=0xffffffff,var_off=(0x0; 0xffffffff))
  8: (15) if r2 == 0x0 goto pc+1
 10: R0=0 R1=0xffffffff80000001 R2=0x7fffffff R4=0xffffffff80000001 R10=fp0

What can be seen here is that r1 is copied to r2 and r4, such that {r1,r2,r4}.id
are all the same, which later lets sync_linked_regs() be invoked. Then, in
the next step, constants are added with alu32 to r2 and r4, setting their ->off
as well as id |= BPF_ADD_CONST. Next, the conditional will bind r2 and
propagate ranges to its linked registers. The verifier now believes the upper
32 bits of r4 are set, that is r4=0xffffffff80000001, while actually r4=r1=0x80000001.

One approach for a simple fix suitable also for stable is to limit the constant
delta tracking to only 64-bit alu addition. If necessary at some later point,
BPF_ADD_CONST could be split into BPF_ADD_CONST64 and BPF_ADD_CONST32 to avoid
mixing the two under the tradeoff to further complicate sync_linked_regs().
However, none of the added tests from dedf56d775 ("selftests/bpf: Add tests
for add_const") make this necessary at this point, meaning, BPF CI also passes
with just limiting tracking to 64-bit alu addition.

Fixes: 98d7ca374b ("bpf: Track delta between "linked" registers.")
Reported-by: Nathaniel Theis <nathaniel.theis@nccgroup.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20241016134913.32249-1-daniel@iogearbox.net
2024-10-17 11:06:34 -07:00
Jordan Rome
9495a5b731 bpf: Fix iter/task tid filtering
In userspace, you can add a tid filter by setting
the "task.tid" field for "bpf_iter_link_info".
However, `get_pid_task` when called for the
`BPF_TASK_ITER_TID` type should have been using
`PIDTYPE_PID` (tid) instead of `PIDTYPE_TGID` (pid).
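
In other words, the TID case should resolve the pid along these lines
(a sketch):

  /* BPF_TASK_ITER_TID: the supplied value is a thread id (tid), so it
   * must be resolved as PIDTYPE_PID rather than PIDTYPE_TGID */
  task = get_pid_task(pid, PIDTYPE_PID);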

Fixes: f0d74c4da1 ("bpf: Parameterize task iterators.")
Signed-off-by: Jordan Rome <linux@jordanrome.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241016210048.1213935-1-linux@jordanrome.com
2024-10-17 10:52:18 -07:00
Linus Torvalds
07d6bf634b No contributions from subtrees.
Current release - new code bugs:
 
   - eth: mlx5: HWS, don't destroy more bwc queue locks than allocated
 
 Previous releases - regressions:
 
   - ipv4: give an IPv4 dev to blackhole_netdev
 
   - udp: compute L4 checksum as usual when not segmenting the skb
 
   - tcp/dccp: don't use timer_pending() in reqsk_queue_unlink().
 
   - eth: mlx5e: don't call cleanup on profile rollback failure
 
   - eth: microchip: vcap api: fix memory leaks in vcap_api_encode_rule_test()
 
   - eth: enetc: disable Tx BD rings after they are empty
 
   - eth: macb: avoid 20s boot delay by skipping MDIO bus registration for fixed-link PHY
 
 Previous releases - always broken:
 
   - posix-clock: fix missing timespec64 check in pc_clock_settime()
 
   - genetlink: hold RCU in genlmsg_mcast()
 
   - mptcp: prevent MPC handshake on port-based signal endpoints
 
   - eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame
 
   - eth: stmmac: dwmac-tegra: fix link bring-up sequence
 
   - eth: bcmasp: fix potential memory leak in bcmasp_xmit()
 
 Misc:
 
   - add Andrew Lunn as a co-maintainer of all networking drivers
 
 Signed-off-by: Paolo Abeni <pabeni@redhat.com>
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEEg1AjqC77wbdLX2LbKSR5jcyPE6QFAmcRDoMSHHBhYmVuaUBy
 ZWRoYXQuY29tAAoJECkkeY3MjxOklXoP/2VKcUDYMfP03vB6a60riiqMcD+GVZzm
 lLaa9xBiDdEsnL+4dcQLF7tecYbqpIqh+0ZeEe0aXgp0HjIm2Im1eiXv5cB7INxK
 n1lxVibI/zD2j2M9cy2NDTeeYSI29GH98g0IkLrU9vnfsrp5jRuXFttrCmotzesZ
 tYb200cVMR/nt9rrG3aNAxTUHjaykgpKh/zSvFC0MRStXFfUbIia88LzcQOQzEK3
 jcWMj8jsJHkDLhUHS8LVZEV1JDYIb+QeEjGz5LxMpQV5/T5sP7Ki4ISPxpUbYNrZ
 QWwSrFSg2ivaq8PkQ2LBTKAtiHyGlcEJtnlSf7Y50cpc9RNphClq8YMSBUCcGTEi
 3W18eVIZ0aj1omTeHEQLbMkqT0soTwYxskDW0uCCFKf/SRCQm1ixrpQiA6PGP266
 e0q7mMD7v4S9ZdO5VF+oYzf1fF0OhaOkZtUEsWjxHite6ujh8719EPbUoee4cqPL
 ofoHYF338BlYl4YIZERZMff8HJD4+PR0R8c9SIBXQcGUMvf2mQdvaD9q2MGySSJd
 4mlH6bE8Ckj4KVp9llxB8zyw/5OmJdMZ+ar4o7+CX2Z3Wrs1SMSTy7qAADcV2B2G
 S9kvwNDffak+N94N/MqswQM4se5yo8dmyUnhqChBFYPFWc4N7v6wwKelROigqLL7
 7Xhj2TXRBGKs
 =ukTi
 -----END PGP SIGNATURE-----

Merge tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Current release - new code bugs:

   - eth: mlx5: HWS, don't destroy more bwc queue locks than allocated

  Previous releases - regressions:

   - ipv4: give an IPv4 dev to blackhole_netdev

   - udp: compute L4 checksum as usual when not segmenting the skb

   - tcp/dccp: don't use timer_pending() in reqsk_queue_unlink().

   - eth: mlx5e: don't call cleanup on profile rollback failure

   - eth: microchip: vcap api: fix memory leaks in
     vcap_api_encode_rule_test()

   - eth: enetc: disable Tx BD rings after they are empty

   - eth: macb: avoid 20s boot delay by skipping MDIO bus registration
     for fixed-link PHY

  Previous releases - always broken:

   - posix-clock: fix missing timespec64 check in pc_clock_settime()

   - genetlink: hold RCU in genlmsg_mcast()

   - mptcp: prevent MPC handshake on port-based signal endpoints

   - eth: vmxnet3: fix packet corruption in vmxnet3_xdp_xmit_frame

   - eth: stmmac: dwmac-tegra: fix link bring-up sequence

   - eth: bcmasp: fix potential memory leak in bcmasp_xmit()

  Misc:

   - add Andrew Lunn as a co-maintainer of all networking drivers"

* tag 'net-6.12-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (47 commits)
  net/mlx5e: Don't call cleanup on profile rollback failure
  net/mlx5: Unregister notifier on eswitch init failure
  net/mlx5: Fix command bitmask initialization
  net/mlx5: Check for invalid vector index on EQ creation
  net/mlx5: HWS, use lock classes for bwc locks
  net/mlx5: HWS, don't destroy more bwc queue locks than allocated
  net/mlx5: HWS, fixed double free in error flow of definer layout
  net/mlx5: HWS, removed wrong access to a number of rules variable
  mptcp: pm: fix UaF read in mptcp_pm_nl_rm_addr_or_subflow
  net: ethernet: mtk_eth_soc: fix memory corruption during fq dma init
  vmxnet3: Fix packet corruption in vmxnet3_xdp_xmit_frame
  net: dsa: vsc73xx: fix reception from VLAN-unaware bridges
  net: ravb: Only advertise Rx/Tx timestamps if hardware supports it
  net: microchip: vcap api: Fix memory leaks in vcap_api_encode_rule_test()
  net: phy: mdio-bcm-unimac: Add BCM6846 support
  dt-bindings: net: brcm,unimac-mdio: Add bcm6846-mdio
  udp: Compute L4 checksum as usual when not segmenting the skb
  genetlink: hold RCU in genlmsg_mcast()
  net: dsa: mv88e6xxx: Fix the max_vid definition for the MV88E6361
  tcp/dccp: Don't use timer_pending() in reqsk_queue_unlink().
  ...
2024-10-17 09:31:18 -07:00
Ingo Molnar
be602cde65 Merge branch 'linus' into sched/urgent, to resolve conflict
Conflicts:
	kernel/sched/ext.c

There's a context conflict between this upstream commit:

  3fdb9ebcec sched_ext: Start schedulers with consistent p->scx.slice values

... and this fix in sched/urgent:

  98442f0ccd sched: Fix delayed_dequeue vs switched_from_fair()

Resolve it.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2024-10-17 09:58:07 +02:00
Linus Torvalds
dff6584301 sched_ext: Fixes for v6.12-rc3
- More issues reported in the enable/disable paths on large machines with
   many tasks due to scx_tasks_lock being held too long. Break up the task
   iterations.
 
 - Remove ops.select_cpu() dependency in bypass mode so that a misbehaving
   implementation can't live-lock the machine by pushing all tasks to few
   CPUs in bypass mode.
 
 - Other misc fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZw7SMw4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGU6+AQDZKNsyRw0c8+rd7eASAtCzvzBU1KGf3RPrsbNF
 pb3IGgEA4/uVBlVny10ZyeKaQ4A3L2G4ROyJwNEygUvhkvbfcgk=
 =z7d9
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - More issues reported in the enable/disable paths on large machines
   with many tasks due to scx_tasks_lock being held too long. Break up
   the task iterations

 - Remove ops.select_cpu() dependency in bypass mode so that a
   misbehaving implementation can't live-lock the machine by pushing all
   tasks to few CPUs in bypass mode

 - Other misc fixes

* tag 'sched_ext-for-6.12-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Remove unnecessary cpu_relax()
  sched_ext: Don't hold scx_tasks_lock for too long
  sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
  sched_ext: bypass mode shouldn't depend on ops.select_cpu()
  sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl()
  sched_ext: Start schedulers with consistent p->scx.slice values
  Revert "sched_ext: Use shorter slice while bypassing"
  sched_ext: use correct function name in pick_task_scx() warning message
  selftests: sched_ext: Add sched_ext as proper selftest target
2024-10-15 19:47:19 -07:00
Linus Torvalds
2f87d0916c ring-buffer: Fixes for v6.12
- Fix ref counter of buffers assigned at boot up
 
   A tracing instance can be created from the kernel command line.
   If it maps to memory, it is considered permanent and should not
   be deleted, or bad things can happen. If it is not mapped to memory,
   then the user is fine to delete it via rmdir from the instances
   directory. But the ref counting assumed 0 meant free to remove and
   greater than zero meant it was not, which was not the case. When an
   instance is created, it should have a reference count of 1, and if
   it should not be removed, it must be greater than 1. The boot up
   code set normal instances with a ref count of 0, which could get
   removed if something accessed it and then released it. And memory
   mapped instances had a ref count of 1 which meant it could be deleted,
   and bad things happen. Keep normal instances ref count as 1, and
   set memory mapped instances ref count to 2.
 
 - Protect sub buffer size (order) updates from other modifications
 
   When a ring buffer is changing the size of its sub-buffers, no other
   operations should be performed on the ring buffer. That includes
   reading it. But the locking only grabbed the buffer->mutex that
   keeps some operations from touching the ring buffer. It also must
   hold the cpu_buffer->reader_lock as well when updates happen as
   other paths use that to do some operations on the ring buffer.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZw6LSRQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qr8NAP4ggN51IEEDLINUctXvTnJAbRdRiZcC
 Z/q/5hfPS2+uAgEArSfVRCMdCCiVMZv+R9i6YPzn1Ux2OTGbcOHz56lZKgY=
 =GwKu
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull ring-buffer fixes from Steven Rostedt:

 - Fix ref counter of buffers assigned at boot up

   A tracing instance can be created from the kernel command line. If it
   maps to memory, it is considered permanent and should not be deleted,
   or bad things can happen. If it is not mapped to memory, then the
   user is fine to delete it via rmdir from the instances directory. But
   the ref counting assumed 0 meant free to remove and greater than zero
   meant it was not, which was not the case. When an instance is created,
   it should have a reference count of 1, and if it should not be removed, it
   must be greater than 1. The boot up code set normal instances with a
   ref count of 0, which could get removed if something accessed it and
   then released it. And memory mapped instances had a ref count of 1
   which meant it could be deleted, and bad things happen. Keep normal
   instances ref count as 1, and set memory mapped instances ref count
   to 2.

 - Protect sub buffer size (order) updates from other modifications

   When a ring buffer is changing the size of its sub-buffers, no other
   operations should be performed on the ring buffer. That includes
   reading it. But the locking only grabbed the buffer->mutex that keeps
   some operations from touching the ring buffer. It also must hold the
   cpu_buffer->reader_lock as well when updates happen as other paths
   use that to do some operations on the ring buffer.

* tag 'trace-ringbuffer-v6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Fix reader locking when changing the sub buffer order
  ring-buffer: Fix refcount setting of boot mapped buffers
2024-10-15 11:18:44 -07:00
Dimitar Kanaliev
ae67b9fb8c bpf: Fix truncation bug in coerce_reg_to_size_sx()
coerce_reg_to_size_sx() updates the register state after a sign-extension
operation. However, there's a bug in the assignment order of the unsigned
min/max values, leading to incorrect truncation:

  0: (85) call bpf_get_prandom_u32#7    ; R0_w=scalar()
  1: (57) r0 &= 1                       ; R0_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1))
  2: (07) r0 += 254                     ; R0_w=scalar(smin=umin=smin32=umin32=254,smax=umax=smax32=umax32=255,var_off=(0xfe; 0x1))
  3: (bf) r0 = (s8)r0                   ; R0_w=scalar(smin=smin32=-2,smax=smax32=-1,umin=umin32=0xfffffffe,umax=0xffffffff,var_off=(0xfffffffffffffffe; 0x1))

In the current implementation, the unsigned 32-bit min/max values
(u32_min_value and u32_max_value) are assigned directly from the 64-bit
signed min/max values (s64_min and s64_max):

  reg->umin_value = reg->u32_min_value = s64_min;
  reg->umax_value = reg->u32_max_value = s64_max;

Due to the chained assignment, this is equivalent to:

  reg->u32_min_value = s64_min;  // Unintended truncation
  reg->umin_value = reg->u32_min_value;
  reg->u32_max_value = s64_max;  // Unintended truncation
  reg->umax_value = reg->u32_max_value;
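
One way to avoid the unintended truncation is to assign the 64-bit bounds
first and derive the 32-bit bounds explicitly (a sketch of the idea, not
necessarily the exact patch):

  reg->umin_value = s64_min;
  reg->umax_value = s64_max;
  reg->u32_min_value = (u32)s64_min;
  reg->u32_max_value = (u32)s64_max;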

Fixes: 1f9a1ea821 ("bpf: Support new sign-extension load insns")
Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Reported-by: Zac Ecob <zacecob@protonmail.com>
Signed-off-by: Dimitar Kanaliev <dimitar.kanaliev@siteground.com>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20241014121155.92887-2-dimitar.kanaliev@siteground.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-15 11:16:24 -07:00
Petr Pavlu
09661f75e7 ring-buffer: Fix reader locking when changing the sub buffer order
The function ring_buffer_subbuf_order_set() updates each
ring_buffer_per_cpu and installs new sub buffers that match the requested
page order. This operation may be invoked concurrently with readers that
rely on some of the modified data, such as the head bit (RB_PAGE_HEAD), or
the ring_buffer_per_cpu.pages and reader_page pointers. However, no
exclusive access is acquired by ring_buffer_subbuf_order_set(). Modifying
the mentioned data while a reader also operates on them can then result in
incorrect memory access and various crashes.

Fix the problem by taking the reader_lock when updating a specific
ring_buffer_per_cpu in ring_buffer_subbuf_order_set().
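
Conceptually, the per-CPU update is now wrapped like this (a sketch; the
exact locking primitive follows what the surrounding code already uses):

  raw_spin_lock_irq(&cpu_buffer->reader_lock);
  /* swap in the new sub buffers and update head/reader_page */
  raw_spin_unlock_irq(&cpu_buffer->reader_lock);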

Link: https://lore.kernel.org/linux-trace-kernel/20240715145141.5528-1-petr.pavlu@suse.com/
Link: https://lore.kernel.org/linux-trace-kernel/20241010195849.2f77cc3f@gandalf.local.home/
Link: https://lore.kernel.org/linux-trace-kernel/20241011112850.17212b25@gandalf.local.home/

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241015112440.26987-1-petr.pavlu@suse.com
Fixes: 8e7b58c27b ("ring-buffer: Just update the subbuffers when changing their allocation order")
Signed-off-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-15 11:18:51 -04:00
Jinjie Ruan
d8794ac20a posix-clock: Fix missing timespec64 check in pc_clock_settime()
As Andrew pointed out, it makes sense for the PTP core to check the
timespec64 struct's tv_sec and tv_nsec range before calling
ptp->info->settime64().

As the clock_settime() man page says, if tp.tv_sec is negative or
tp.tv_nsec is outside the range [0..999,999,999], it should return EINVAL;
this includes dynamic clocks, which handle the PTP clock, and the condition is
consistent with timespec64_valid(). As Thomas suggested, timespec64_valid()
only checks that the timespec is valid, but does not ensure that the time is
in a valid range, so check it up front using timespec64_valid_strict()
in pc_clock_settime() and return -EINVAL if it is not valid.

Some drivers use tp->tv_sec and tp->tv_nsec directly to write registers
without validity checks, assuming that the higher layer has checked them,
which is dangerous; they will benefit from this change, for example
hclge_ptp_settime(), igb_ptp_settime_i210(), and _rcar_gen4_ptp_settime(),
and some drivers can now drop their own checks.
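
The added check is conceptually just (a sketch; 'ts' stands for the
timespec64 argument of pc_clock_settime()):

  if (!timespec64_valid_strict(ts))
      return -EINVAL;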

Cc: stable@vger.kernel.org
Fixes: 0606f422b4 ("posix clocks: Introduce dynamic clocks")
Acked-by: Richard Cochran <richardcochran@gmail.com>
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Link: https://patch.msgid.link/20241009072302.1754567-2-ruanjinjie@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-10-14 17:22:43 -07:00
Xiu Jianfeng
3cc4e13bb1 cgroup: Fix potential overflow issue when checking max_depth
cgroup.max.depth is the maximum allowed descent depth below the current
cgroup. If the actual descent depth is equal or larger, an attempt to
create a new child cgroup will fail. However due to the cgroup->max_depth
is of int type and having the default value INT_MAX, the condition
'level > cgroup->max_depth' will never be satisfied, and it will cause
an overflow of the level after it reaches to INT_MAX.

Fix it by starting the level from 0 and using '>=' instead.
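
The check then becomes roughly (a sketch of the idea):

  /* before: 'level' started at 1 and was compared with '>', so with the
   * default max_depth == INT_MAX the test could never fire and 'level'
   * could eventually overflow; starting at 0 with '>=' closes that. */
  if (level >= cgroup->max_depth)
      goto fail;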

It's worth mentioning that this issue is unlikely to occur in reality,
as it's impossible to have a hierarchy with a depth of INT_MAX, but it
should still be avoided logically.

Fixes: 1a926e0bba ("cgroup: implement hierarchy limits")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-14 13:39:25 -10:00
David Vernet
60e339be10 sched_ext: Remove unnecessary cpu_relax()
As described in commit b07996c7ab ("sched_ext: Don't hold
scx_tasks_lock for too long"), we're doing a cond_resched() every 32
calls to scx_task_iter_next() to avoid RCU and other stalls. That commit
also added a cpu_relax() to the codepath where we drop and reacquire the
lock, but as Waiman described in [0], cpu_relax() should only be
necessary in busy loops to avoid pounding on a cacheline (or to allow a
hypertwin to more fully utilize a core).

Let's remove the unnecessary cpu_relax().

[0]: https://lore.kernel.org/all/35b3889b-904a-4d26-981f-c8aa1557a7c7@redhat.com/

Cc: Waiman Long <llong@redhat.com>
Signed-off-by: David Vernet <void@manifault.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-14 13:23:49 -10:00
Steven Rostedt
2cf9733891 ring-buffer: Fix refcount setting of boot mapped buffers
A ring buffer which has its buffer mapped at boot up to fixed memory
should not be freed. Other buffers can be. The ref counting setup was
wrong for both: it gave the non-mapped buffers a ref count of zero and the
boot mapped buffer a ref count of 1, but a normally allocated buffer
should have a ref count of 1, at which point it can be removed.

Keep the ref count of a normal boot buffer with its setup ref count (do
not decrement it), and increment the fixed memory boot mapped buffer's ref
count.

Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241011165224.33dd2624@gandalf.local.home
Fixes: e645535a95 ("tracing: Add option to use memmapped memory for trace boot instance")
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-14 14:30:59 -04:00
Peter Zijlstra
cd9626e9eb sched/fair: Fix external p->on_rq users
Sean noted that ever since commit 152e11f6df ("sched/fair: Implement
delayed dequeue") KVM's preemption notifiers have started
mis-classifying preemption vs blocking.

Notably p->on_rq is no longer sufficient to determine if a task is
runnable or blocked -- the aforementioned commit introduces tasks that
remain on the runqueue even though they will not run again, and
should be considered blocked in many cases.

Add the task_is_runnable() helper to classify things and audit all
external users of the p->on_rq state. Also add a few comments.
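
The helper amounts to something like the sketch below (field names follow
the delayed-dequeue work; not necessarily the verbatim code):

  static inline bool task_is_runnable(struct task_struct *p)
  {
      /* a delayed-dequeue task is still on the runqueue but is blocked */
      return p->on_rq == TASK_ON_RQ_QUEUED && !p->se.sched_delayed;
  }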

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Reported-by: Sean Christopherson <seanjc@google.com>
Tested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
2024-10-14 09:14:35 +02:00
Johannes Weiner
c650812419 sched/psi: Fix mistaken CPU pressure indication after corrupted task state bug
Since sched_delayed tasks remain queued even after blocking, the load
balancer can migrate them between runqueues while PSI considers them
to be asleep. As a result, it misreads the migration requeue followed
by a wakeup as a double queue:

  psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=. set=4

First, call psi_enqueue() after p->sched_class->enqueue_task(). A
wakeup will clear p->se.sched_delayed while a migration will not, so
psi can use that flag to tell them apart.

Then teach psi to migrate any "sleep" state when delayed-dequeue tasks
are being migrated.

Delayed-dequeue tasks can be revived by ttwu_runnable(), which will
call down with a new ENQUEUE_DELAYED. Rather than further complicating
the wakeup conditional in enqueue_task(), identify migration contexts
and default to wakeup handling for all other cases.

It's not just the warning in dmesg, the task state corruption causes a
permanent CPU pressure indication, which messes with workload/machine
health monitoring.

Debugged-by-and-original-fix-by: K Prateek Nayak <kprateek.nayak@amd.com>
Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/
Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lkml.kernel.org/r/20241010193712.GC181795@cmpxchg.org
2024-10-14 09:11:42 +02:00
Linus Torvalds
a1029768f3 RCU fix for v6.12
Fix rcuog kthread wakeup invocation from softirq context on a CPU
 which has been marked offline. This can happen when new callbacks
 are enqueued from a softirq on an offline CPU before it calls
 rcutree_report_cpu_dead(). When this happens on NOCB configuration,
 the rcuog wake-up is deferred through an IPI to an online CPU.
 This is done to avoid call into the scheduler which can risk
 arming the RT-bandwidth after hrtimers have been migrated out
 and disabled. However, doing IPI call from softirq is not allowed
 Fix this by forcing deferred rcuog wakeup through the NOCB timer
 when the CPU is offline.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQSi2tPIQIc2VEtjarIAHS7/6Z0wpQUCZwjWYAAKCRAAHS7/6Z0w
 pQ8iAP9vbhWlWuwlfZqQJK8Xb5b0kVx3nJjAin/5pY/lbWV7IgEA1DUzTAZKXU5Z
 yZmjM3l9RpMAxLSgevJKoJD+vR0POQA=
 =EVin
 -----END PGP SIGNATURE-----

Merge tag 'rcu.fixes.6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux

Pull RCU fix from Neeraj Upadhyay:
 "Fix rcuog kthread wakeup invocation from softirq context on a CPU
  which has been marked offline.

  This can happen when new callbacks are enqueued from a softirq on an
  offline CPU before it calls rcutree_report_cpu_dead(). When this
  happens on NOCB configuration, the rcuog wake-up is deferred through
  an IPI to an online CPU. This is done to avoid call into the scheduler
  which can risk arming the RT-bandwidth after hrtimers have been
  migrated out and disabled.

  However, doing IPI call from softirq is not allowed: Fix this by
  forcing deferred rcuog wakeup through the NOCB timer when the CPU is
  offline"

* tag 'rcu.fixes.6.12-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rcu/linux:
  rcu/nocb: Fix rcuog wake-up from offline softirq
2024-10-11 14:42:27 -07:00
Casey Schaufler
13d826e564 audit: change context data from secid to lsm_prop
Change the LSM data stored in the audit transactions from a secid
to an LSM prop. This is done in struct audit_context and struct
audit_aux_data_pids. Several cases of scaffolding can be removed.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 14:34:16 -04:00
Casey Schaufler
e0a8dcbd53 audit: use an lsm_prop in audit_names
Replace the osid field in the audit_names structure with a
lsm_prop structure. This accommodates the use of an lsm_prop in
security_audit_rule_match() and security_inode_getsecid().

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 14:34:15 -04:00
Casey Schaufler
07f9d2c113 lsm: use lsm_prop in security_inode_getsecid
Change the security_inode_getsecid() interface to fill in a
lsm_prop structure instead of a u32 secid. This allows for its
callers to gather data from all registered LSMs. Data is provided
for IMA and audit. Change the name to security_inode_getlsmprop().

Cc: linux-integrity@vger.kernel.org
Cc: selinux@vger.kernel.org
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subj line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 14:34:14 -04:00
Casey Schaufler
37f670aacd lsm: use lsm_prop in security_current_getsecid
Change the security_current_getsecid_subj() and
security_task_getsecid_obj() interfaces to fill in a lsm_prop structure
instead of a u32 secid.  Audit interfaces will need to collect all
possible security data for possible reporting.

Cc: linux-integrity@vger.kernel.org
Cc: audit@vger.kernel.org
Cc: selinux@vger.kernel.org
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 14:34:14 -04:00
Casey Schaufler
e4f6822044 audit: update shutdown LSM data
The audit process LSM information is changed from a secid audit_sig_sid
to an lsm_prop in audit_sig_lsm. Update the users of this data
appropriately. Calls to security_secid_to_secctx() are changed to use
security_lsmprop_to_secctx() instead. security_current_getsecid_subj()
is scaffolded. It will be updated in a subsequent patch.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 14:34:13 -04:00
Casey Schaufler
f4602f163c lsm: use lsm_prop in security_ipc_getsecid
There may be more than one LSM that provides IPC data for auditing.
Change security_ipc_getsecid() to fill in a lsm_prop structure instead
of the u32 secid.  Change the name to security_ipc_getlsmprop() to
reflect the change.

Cc: audit@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Cc: selinux@vger.kernel.org
Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 14:34:13 -04:00
Casey Schaufler
7183abccd8 audit: maintain an lsm_prop in audit_context
Replace the secid value stored in struct audit_context with a struct
lsm_prop. Change the code that uses this value to accommodate the
change. security_audit_rule_match() expects a lsm_prop, so existing
scaffolding can be removed. A call to security_secid_to_secctx()
is changed to security_lsmprop_to_secctx().  The call to
security_ipc_getsecid() is scaffolded.

A new function lsmprop_is_set() is introduced to identify whether
an lsm_prop contains a non-zero value.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject line tweak, fix lsmprop_is_set() typo]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 14:34:13 -04:00
Casey Schaufler
870b7fdc66 lsm: use lsm_prop in security_audit_rule_match
Change the secid parameter of security_audit_rule_match
to a lsm_prop structure pointer. Pass the entry from the
lsm_prop structure for the appropriate slot to the LSM hook.

Change the users of security_audit_rule_match to use the
lsm_prop instead of a u32. The scaffolding function lsmprop_init()
fills the structure with the value of the old secid, ensuring that
it is available to the appropriate module hook. The sources of
the secid, security_task_getsecid() and security_inode_getsecid(),
will be converted to use the lsm_prop structure later in the series.
At that point the use of lsmprop_init() is dropped.

Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
[PM: subject line tweak]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2024-10-11 14:34:12 -04:00
Paul E. McKenney
9861f7f66f locking/csd-lock: Switch from sched_clock() to ktime_get_mono_fast_ns()
Currently, the CONFIG_CSD_LOCK_WAIT_DEBUG code uses sched_clock() to check
for excessive CSD-lock wait times.  This works, but does not guarantee
monotonic timestamps on x86 due to the sched_clock() function's use of
the rdtsc instruction, which does not guarantee ordering.  This means
that, given successive calls to sched_clock(), the second might return
an earlier time than the first, that is, time might seem to go backwards.
This can (and does!) result in false-positive CSD-lock wait complaints
claiming almost 2^64 nanoseconds of delay.

Therefore, switch from sched_clock() to ktime_get_mono_fast_ns(), which
does guarantee monotonic timestamps via the rdtsc_ordered() function,
which as the name implies, does guarantee ordered timestamps, at least
in the absence of calls from NMI handlers, which are not involved in
this code path.

Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Cc: Leonardo Bras <leobras@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
2024-10-11 09:31:21 -07:00
Peter Zijlstra
f5aaff7bfa sched/core: Dequeue PSI signals for blocked tasks that are delayed
psi_dequeue() for a blocked task expects psi_sched_switch() to clear
the TSK_.*RUNNING PSI flags and set the TSK_IOWAIT flags. However,
psi_sched_switch() uses "!task_on_rq_queued(prev)" to detect whether the task
is blocked or still runnable, which is no longer true with DELAY_DEQUEUE
since a blocking task can be left queued on the runqueue.

This can lead to PSI splats similar to:

    psi: inconsistent task state! task=... cpu=... psi_flags=4 clear=0 set=4

when the task is requeued since the TSK_RUNNING flag was not cleared
when the task was blocked.

Explicitly communicate that the task was blocked to psi_sched_switch()
even if it was delayed and is still on the runqueue.

  [ prateek: Broke off the relevant part from [1], commit message ]

Fixes: 152e11f6df ("sched/fair: Implement delayed dequeue")
Closes: https://lore.kernel.org/lkml/20240830123458.3557-1-spasswolf@web.de/
Closes: https://lore.kernel.org/all/cd67fbcd-d659-4822-bb90-7e8fbb40a856@molgen.mpg.de/
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Not-yet-signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: K Prateek Nayak <kprateek.nayak@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lore.kernel.org/lkml/20241004123506.GR18071@noisy.programming.kicks-ass.net/ [1]
2024-10-11 10:49:33 +02:00
Peter Zijlstra
98442f0ccd sched: Fix delayed_dequeue vs switched_from_fair()
Commit 2e0199df25 ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
and its follow up fixes try to deal with a rather unfortunate
situation where a task is enqueued in a new class even though it
shouldn't have been, mostly because the existing ->switched_to/from()
hooks are in the wrong place for this case.

This all led to Paul being able to trigger failures at something like
once per 10k CPU hours of RCU torture.

For now, do the ugly thing and move the code to the right place by
ignoring the switch hooks.

Note: Clean up the whole sched_class::switch*_{to,from}() thing.

Fixes: 2e0199df25 ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
Reported-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241003185037.GA5594@noisy.programming.kicks-ass.net
2024-10-11 10:49:32 +02:00
Waiman Long
73ab05aa46 sched/core: Disable page allocation in task_tick_mm_cid()
With KASAN and PREEMPT_RT enabled, calling task_work_add() in
task_tick_mm_cid() may cause the following splat.

[   63.696416] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
[   63.696416] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 610, name: modprobe
[   63.696416] preempt_count: 10001, expected: 0
[   63.696416] RCU nest depth: 1, expected: 1

This problem is caused by the following call trace.

  sched_tick() [ acquire rq->__lock ]
   -> task_tick_mm_cid()
    -> task_work_add()
     -> __kasan_record_aux_stack()
      -> kasan_save_stack()
       -> stack_depot_save_flags()
        -> alloc_pages_mpol_noprof()
         -> __alloc_pages_noprof()
	  -> get_page_from_freelist()
	   -> rmqueue()
	    -> rmqueue_pcplist()
	     -> __rmqueue_pcplist()
	      -> rmqueue_bulk()
	       -> rt_spin_lock()

The rq lock is a raw_spinlock_t. We can't sleep while holding
it. IOW, we can't call alloc_pages() in stack_depot_save_flags().

The task_tick_mm_cid() function with its task_work_add() call was
introduced by commit 223baf9d17 ("sched: Fix performance regression
introduced by mm_cid") in v6.4 kernel.

Fortunately, there is a kasan_record_aux_stack_noalloc() variant that
calls stack_depot_save_flags() while not allowing it to allocate
new pages.  To allow task_tick_mm_cid() to use task_work without
page allocation, a new TWAF_NO_ALLOC flag is added to enable calling
kasan_record_aux_stack_noalloc() instead of kasan_record_aux_stack()
if set. The task_tick_mm_cid() function is modified to add this new flag.

The possible downside is the missing stack trace in a KASAN report due
to new page allocation required when task_work_add_noallloc() is called
which should be rare.

Fixes: 223baf9d17 ("sched: Fix performance regression introduced by mm_cid")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20241010014432.194742-1-longman@redhat.com
2024-10-11 10:49:32 +02:00
Phil Auld
d16b7eb6f5 sched/deadline: Use hrtick_enabled_dl() before start_hrtick_dl()
The deadline server code moved one of the start_hrtick_dl() calls
but dropped the dl specific hrtick_enabled check. This causes hrticks
to get armed even when sched_feat(HRTICK_DL) is false. Fix it.
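
The fix essentially restores the guard around the call (a sketch; the second
argument follows whatever the existing call site passes):

  if (hrtick_enabled_dl(rq))
      start_hrtick_dl(rq, dl_se);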

Fixes: 63ba8422f8 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Phil Auld <pauld@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lore.kernel.org/r/20241004123729.460668-1-pauld@redhat.com
2024-10-11 10:49:32 +02:00
Tyrone Wu
ad6b5b6ea9 bpf: Fix unpopulated path_size when uprobe_multi fields unset
Previously, when retrieving `bpf_link_info.uprobe_multi` with the `path` and
`path_size` fields unset, the `path_size` field was not populated
(remained 0). This behavior was inconsistent with how other input/output
string buffer fields work, as the field should be populated in cases
when:
- both buffer and length are set (currently works as expected)
- both buffer and length are unset (not working as expected)

This patch now fills the `path_size` field when `path` and `path_size`
are unset.
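
For illustration, a small userspace analogue of the intended buffer
semantics (fill_path() and its arguments are hypothetical, not kernel API):

    #include <stdio.h>
    #include <string.h>

    /* A "fill info" style API should report the required size even when
     * the caller passes no buffer, so a second call can be sized correctly. */
    static int fill_path(const char *path, char *buf, size_t *len)
    {
            size_t need = strlen(path) + 1;

            if (buf && *len >= need)
                    memcpy(buf, path, need);
            *len = need;            /* always populated, buffer or not */
            return 0;
    }

    int main(void)
    {
            char buf[64];
            size_t len = 0;

            fill_path("/usr/bin/cat", NULL, &len);  /* probe for the size */
            fill_path("/usr/bin/cat", buf, &len);   /* second call fills buf */
            printf("%zu %s\n", len, buf);
            return 0;
    }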

Fixes: e56fdbfb06 ("bpf: Add link_info support for uprobe multi link")
Signed-off-by: Tyrone Wu <wudevelops@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241011000803.681190-1-wudevelops@gmail.com
2024-10-10 19:11:28 -07:00
Tejun Heo
b07996c7ab sched_ext: Don't hold scx_tasks_lock for too long
While enabling and disabling a BPF scheduler, every task is iterated a
couple of times by walking scx_tasks. Except for one, all iterations keep
holding scx_tasks_lock. On multi-socket systems under heavy rq lock
contention and a high number of threads, this can lead to RCU and other
stalls.

The following is triggered on a 2 x AMD EPYC 7642 system (192 logical CPUs)
running `stress-ng --workload 150 --workload-threads 10` with >400k idle
threads and RCU stall period reduced to 5s:

  rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
  rcu:     91-...!: (10 ticks this GP) idle=0754/1/0x4000000000000000 softirq=18204/18206 fqs=17
  rcu:     186-...!: (17 ticks this GP) idle=ec54/1/0x4000000000000000 softirq=25863/25866 fqs=17
  rcu:     (detected by 80, t=10042 jiffies, g=89305, q=33 ncpus=192)
  Sending NMI from CPU 80 to CPUs 91:
  NMI backtrace for cpu 91
  CPU: 91 UID: 0 PID: 284038 Comm: sched_ext_ops_h Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471
  Hardware name: Supermicro Super Server/H11DSi, BIOS 2.8 12/14/2023
  Sched_ext: simple (disabling+all)
  RIP: 0010:queued_spin_lock_slowpath+0x17b/0x2f0
  Code: 02 c0 10 03 00 83 79 08 00 75 08 f3 90 83 79 08 00 74 f8 48 8b 11 48 85 d2 74 09 0f 0d 0a eb 0a 31 d2 eb 06 31 d2 eb 02 f3 90 <8b> 07 66 85 c0 75 f7 39 d8 75 0d be 01 00 00 00 89 d8 f0 0f b1 37
  RSP: 0018:ffffc9000fadfcb8 EFLAGS: 00000002
  RAX: 0000000001700001 RBX: 0000000001700000 RCX: ffff88bfcaaf10c0
  RDX: 0000000000000000 RSI: 0000000000000101 RDI: ffff88bfca8f0080
  RBP: 0000000001700000 R08: 0000000000000090 R09: ffffffffffffffff
  R10: ffff88a74761b268 R11: 0000000000000000 R12: ffff88a6b6765460
  R13: ffffc9000fadfd60 R14: ffff88bfca8f0080 R15: ffff88bfcaac0000
  FS:  0000000000000000(0000) GS:ffff88bfcaac0000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007f5c55f526a0 CR3: 0000000afd474000 CR4: 0000000000350eb0
  Call Trace:
   <NMI>
   </NMI>
   <TASK>
   do_raw_spin_lock+0x9c/0xb0
   task_rq_lock+0x50/0x190
   scx_task_iter_next_locked+0x157/0x170
   scx_ops_disable_workfn+0x2c2/0xbf0
   kthread_worker_fn+0x108/0x2a0
   kthread+0xeb/0x110
   ret_from_fork+0x36/0x40
   ret_from_fork_asm+0x1a/0x30
   </TASK>
  Sending NMI from CPU 80 to CPUs 186:
  NMI backtrace for cpu 186
  CPU: 186 UID: 0 PID: 51248 Comm: fish Kdump: loaded Not tainted 6.12.0-rc2-work-g6bf5681f7ee2-dirty #471

scx_task_iter can safely drop locks while iterating. Make
scx_task_iter_next() drop scx_tasks_lock every 32 iterations to avoid
stalls.
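
The batching idea, shown as a small userspace analogue (a pthread mutex
stands in for scx_tasks_lock; all names are hypothetical):

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define BATCH 32

    static pthread_mutex_t tasks_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Walk a long list under a global lock, but drop and re-take the lock
     * every BATCH items so contenders are not starved for the whole walk. */
    static void walk_tasks(int ntasks)
    {
            pthread_mutex_lock(&tasks_lock);
            for (int i = 0; i < ntasks; i++) {
                    /* ... per-task work under the lock ... */
                    if ((i + 1) % BATCH == 0) {
                            pthread_mutex_unlock(&tasks_lock);
                            sched_yield();  /* let waiters run */
                            pthread_mutex_lock(&tasks_lock);
                            /* NOTE: the walk must tolerate the list changing here */
                    }
            }
            pthread_mutex_unlock(&tasks_lock);
    }

    int main(void)
    {
            walk_tasks(100);
            puts("done");
            return 0;
    }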

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-10-10 11:41:44 -10:00
Tejun Heo
967da57832 sched_ext: Move scx_tasks_lock handling into scx_task_iter helpers
Iterating with scx_task_iter involves scx_tasks_lock and optionally the rq
lock of the task being iterated. Both locks can be released during iteration
and the iteration can be continued after re-grabbing scx_tasks_lock.
Currently, all lock handling is pushed to the caller which is a bit
cumbersome and makes it difficult to add lock-aware behaviors. Make the
scx_task_iter helpers handle scx_tasks_lock.

- scx_task_iter_init()/scx_task_iter_exit() now grab and release
  scx_tasks_lock, respectively. Renamed to
  scx_task_iter_start()/scx_task_iter_stop() to more clearly indicate that
  there are non-trivial side-effects.

- Add __ prefix to scx_task_iter_rq_unlock() to indicate that the function
  is internal.

- Add scx_task_iter_unlock/relock(). The former drops both rq lock (if held)
  and scx_tasks_lock and the latter re-locks only scx_tasks_lock.

This doesn't cause behavior changes and will be used to implement stall
avoidance.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-10-10 11:41:44 -10:00
Tejun Heo
aebe7ae4cb sched_ext: bypass mode shouldn't depend on ops.select_cpu()
Bypass mode was depending on ops.select_cpu() which, like the rest of the
BPF scheduler, can't be trusted. Always enable and use scx_select_cpu_dfl() in
bypass mode.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-10-10 11:41:44 -10:00
Tejun Heo
cc3e1caca9 sched_ext: Move scx_buildin_idle_enabled check to scx_bpf_select_cpu_dfl()
Move the sanity check from the inner function scx_select_cpu_dfl() to the
exported kfunc scx_bpf_select_cpu_dfl(). This doesn't cause behavior
differences and will allow using scx_select_cpu_dfl() in bypass mode
regardless of scx_builtin_idle_enabled.

Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-10 11:41:44 -10:00
Tejun Heo
3fdb9ebcec sched_ext: Start schedulers with consistent p->scx.slice values
The disable path caps p->scx.slice to SCX_SLICE_DFL. As the field is already
being ignored at this stage during disable, the only effect this has is that
when the next BPF scheduler is loaded, it won't see unreasonable left-over
slices. Ultimately, this shouldn't matter but it's better to start in a
known state. Drop p->scx.slice capping from the disable path and instead
reset it to SCX_SLICE_DFL in the enable path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-10-10 11:41:44 -10:00
Tejun Heo
54baa7ac0c Revert "sched_ext: Use shorter slice while bypassing"
This reverts commit 6f34d8d382.

Slice length is ignored while bypassing and tasks are switched on every tick
and thus the patch does not make any difference. The perceived difference
was from test noise.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-10-10 11:41:44 -10:00
Linus Torvalds
0edab8d132 ring-buffer: Fix for 6.12
- Do not have boot-mapped buffers use CPU hotplug callbacks
 
   When a ring buffer is mapped to memory assigned at boot, it
   also splits it up evenly between the possible CPUs. But the
   allocation code still attached a CPU notifier callback to this
   ring buffer. When a CPU is added, the callback will happen and
   another per-cpu buffer is created for the ring buffer. But for
   boot mapped buffers, there is no room to add another one (as
   they were all created already). The result of calling the CPU
   hotplug notifier on a boot mapped ring buffer is unpredictable
   and could lead to a system crash. If the ring buffer is boot mapped
   simply do not attach the CPU notifier to it.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZwfxchQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qrduAQC2xZ+lqeoDJ3o+H201bqshgx/YDANF
 p1q1BFmC0yFlOAD/XB/I9o4UFNHAb9D05zPADS+8bF1RxdisJJjqUae1TgM=
 =lBZ/
 -----END PGP SIGNATURE-----

Merge tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:
 "Ring-buffer fix: do not have boot-mapped buffers use CPU hotplug
  callbacks

  When a ring buffer is mapped to memory assigned at boot, it also
  splits it up evenly between the possible CPUs. But the allocation code
  still attached a CPU notifier callback to this ring buffer. When a CPU
  is added, the callback will happen and another per-cpu buffer is
  created for the ring buffer.

  But for boot mapped buffers, there is no room to add another one (as
  they were all created already). The result of calling the CPU hotplug
  notifier on a boot mapped ring buffer is unpredictable and could lead
  to a system crash.

  If the ring buffer is boot mapped simply do not attach the CPU
  notifier to it"

* tag 'trace-ringbuffer-v6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  ring-buffer: Do not have boot mapped buffers hook to CPU hotplug
2024-10-10 12:25:32 -07:00
Toke Høiland-Jørgensen
6cb86a0fde bpf: fix kfunc btf caching for modules
The verifier contains a cache for looking up module BTF objects when
calling kfuncs defined in modules. This cache uses a 'struct
bpf_kfunc_btf_tab', which contains a sorted list of BTF objects that
were already seen in the current verifier run, and the BTF objects are
looked up by the offset stored in the relocated call instruction using
bsearch().

The first time a given offset is seen, the module BTF is loaded from the
file descriptor passed in by libbpf, and stored into the cache. However,
there's a bug in the code storing the new entry: it stores a pointer to
the new cache entry, then calls sort() to keep the cache sorted for the
next lookup using bsearch(), and then returns the entry that was just
stored through the stored pointer. However, because sort() modifies the
list of entries in place *by value*, the stored pointer may no longer
point to the right entry, in which case the wrong BTF object will be
returned.

The end result of this is an intermittent bug where, if a BPF program
calls two functions with the same signature in two different modules,
the function from the wrong module may sometimes end up being called.
Whether this happens depends on the order of the calls in the BPF
program (as that affects whether sort() reorders the array of BTF
objects), making it especially hard to track down. Simon, credited as
reporter below, spent significant effort analysing and creating a
reproducer for this issue. The reproducer is added as a selftest in a
subsequent patch.

The fix is straightforward: simply don't use the stored pointer after
calling sort(). Since we already have an on-stack pointer to the BTF
object itself at the point where the function returns, just use that, and
populate it from the cache entry in the branch where the lookup
succeeds.
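
For illustration, a userspace analogue of the original bug (hypothetical
names; qsort() plays the role of sort()):

    #include <stdio.h>
    #include <stdlib.h>

    struct entry { long off; int id; };

    static int cmp_off(const void *a, const void *b)
    {
            const struct entry *ea = a, *eb = b;

            return (ea->off > eb->off) - (ea->off < eb->off);
    }

    int main(void)
    {
            struct entry tab[3] = { { 30, 0 }, { 20, 1 } };
            struct entry *new_ent = &tab[2];                /* pointer to the new slot */

            tab[2] = (struct entry){ .off = 10, .id = 2 };  /* newly cached entry */
            qsort(tab, 3, sizeof(tab[0]), cmp_off);         /* reorders by value */

            /* the slot now holds a different entry: off=30 id=0, not 10/2 */
            printf("stale pointer sees: off=%ld id=%d\n", new_ent->off, new_ent->id);
            return 0;
    }

The fix analogue is to remember the looked-up value (here, the BTF object)
before sorting, rather than a pointer into the array that gets reordered.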

Fixes: 2357672c54 ("bpf: Introduce BPF support for kernel module function calls")
Reported-by: Simon Sundberg <simon.sundberg@kau.se>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20241010-fix-kfunc-btf-caching-for-modules-v2-1-745af6c1af98@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-10 10:44:03 -07:00
Frederic Weisbecker
f7345ccc62 rcu/nocb: Fix rcuog wake-up from offline softirq
After a CPU has set itself offline and before it eventually calls
rcutree_report_cpu_dead(), there are still opportunities for callbacks
to be enqueued, for example from a softirq. When that happens on NOCB,
the rcuog wake-up is deferred through an IPI to an online CPU in order
not to call into the scheduler and risk arming the RT-bandwidth after
hrtimers have been migrated out and disabled.

But performing a synchronized IPI from a softirq is buggy as reported in
the following scenario:

        WARNING: CPU: 1 PID: 26 at kernel/smp.c:633 smp_call_function_single
        Modules linked in: rcutorture torture
        CPU: 1 UID: 0 PID: 26 Comm: migration/1 Not tainted 6.11.0-rc1-00012-g9139f93209d1 #1
        Stopper: multi_cpu_stop+0x0/0x320 <- __stop_cpus+0xd0/0x120
        RIP: 0010:smp_call_function_single
        <IRQ>
        swake_up_one_online
        __call_rcu_nocb_wake
        __call_rcu_common
        ? rcu_torture_one_read
        call_timer_fn
        __run_timers
        run_timer_softirq
        handle_softirqs
        irq_exit_rcu
        ? tick_handle_periodic
        sysvec_apic_timer_interrupt
        </IRQ>

Fix this by forcing the deferred rcuog wake-up through the NOCB timer when
the CPU is offline. The actual wake-up will happen from
rcutree_report_cpu_dead().

Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202409231644.4c55582d-lkp@intel.com
Fixes: 9139f93209 ("rcu/nocb: Fix RT throttling hrtimer armed from offline CPU")
Reviewed-by: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
2024-10-10 22:18:19 +05:30
Honglei Wang
c425180d88 sched_ext: use correct function name in pick_task_scx() warning message
pick_next_task_scx() was turned into pick_task_scx() since
commit 753e2836d1 ("sched_ext: Unify regular and core-sched pick
task paths"). Update the outdated message.

Signed-off-by: Honglei Wang <jameshongleiwang@126.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-10 06:35:28 -10:00
Jeff Layton
2a15385742
timekeeping: Add percpu counter for tracking floor swap events
The mgtime_floor value is a global variable for tracking the latest
fine-grained timestamp handed out. Because it's a global, track the
number of times that a new floor value is assigned.

Add a new percpu counter to the timekeeping code to track the number of
floor swap events that have occurred. A later patch will add a debugfs
file to display this counter alongside other stats involving multigrain
timestamps.
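
A minimal sketch of the counter (the variable and helper names here are
hypothetical):

    static DEFINE_PER_CPU(unsigned long, mgtime_floor_swaps);

    /* bumped on the CPU that won a floor swap; summed for the debugfs file */
    static inline void mgtime_counter_inc(void)
    {
            this_cpu_inc(mgtime_floor_swaps);
    }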

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Link: https://lore.kernel.org/all/20241002-mgtime-v10-2-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10 10:20:46 +02:00
Jeff Layton
ee3283c608
timekeeping: Add interfaces for handling timestamps with a floor value
Multigrain timestamps allow the kernel to use fine-grained timestamps when
an inode's attributes are being actively observed via ->getattr().  With
this support, it's possible for a file to get a fine-grained timestamp, and
another modified after it to get a coarse-grained stamp that is earlier
than the fine-grained time.  If this happens then the files can appear to
have been modified in reverse order, which breaks VFS ordering guarantees
[1].

To prevent this, maintain a floor value for multigrain timestamps.
Whenever a fine-grained timestamp is handed out, record it, and when later
coarse-grained stamps are handed out, ensure they are not earlier than that
value. If the coarse-grained timestamp is earlier than the fine-grained
floor, return the floor value instead.

Add a static singleton atomic64_t into timekeeper.c that is used to keep
track of the latest fine-grained time ever handed out. This is tracked as a
monotonic ktime_t value to ensure that it isn't affected by clock
jumps. Because it is updated at different times than the rest of the
timekeeper object, the floor value is managed independently of the
timekeeper via a cmpxchg() operation, and sits on its own cacheline.

Add two new public interfaces:

- ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the
  coarse-grained clock and the floor time

- ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries
  to swap it into the floor. A timespec64 is filled with the result.

The floor value is global and updated via a single try_cmpxchg(). If
that fails then the operation raced with a concurrent update. Any
concurrent update must be later than the existing floor value, so any
racing tasks can accept any resulting floor value without retrying.
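
A userspace sketch of the floor-swap logic (types simplified; the kernel
side uses an atomic64_t monotonic ktime and try_cmpxchg()):

    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic long mg_floor;   /* stand-in for the monotonic floor */

    static long floor_swap(long fine_time)
    {
            long old = atomic_load(&mg_floor);

            if (old < fine_time &&
                atomic_compare_exchange_strong(&mg_floor, &old, fine_time))
                    return fine_time;       /* we advanced the floor */
            return atomic_load(&mg_floor);  /* lost the race: later value is fine */
    }

    int main(void)
    {
            printf("%ld\n", floor_swap(100));       /* installs 100 */
            printf("%ld\n", floor_swap(50));        /* floor never moves backwards */
            return 0;
    }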

[1]: POSIX requires that files be stamped with realtime clock values, and
     makes no provision for dealing with backward clock jumps. If a backward
     realtime clock jump occurs, then files can appear to have been modified
     in reverse order.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Randy Dunlap <rdunlap@infradead.org> # documentation bits
Acked-by: John Stultz <jstultz@google.com>
Link: https://lore.kernel.org/all/20241002-mgtime-v10-1-d1c4717f5284@kernel.org
Signed-off-by: Christian Brauner <brauner@kernel.org>
2024-10-10 10:20:46 +02:00
Tyrone Wu
4deecdd29c bpf: fix unpopulated name_len field in perf_event link info
Previously when retrieving `bpf_link_info.perf_event` for
kprobe/uprobe/tracepoint, the `name_len` field was not populated by the
kernel, leaving it to reflect the value initially set by the user. This
behavior was inconsistent with how other input/output string buffer
fields function (e.g. `raw_tracepoint.tp_name_len`).

This patch fills `name_len` with the actual size of the string name.

Fixes: 1b715e1b0e ("bpf: Support ->fill_link_info for perf_event")
Signed-off-by: Tyrone Wu <wudevelops@gmail.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20241008164312.46269-1-wudevelops@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09 18:17:16 -07:00
Rik van Riel
434247637c bpf: use kvzmalloc to allocate BPF verifier environment
The kzalloc() call in bpf_check() can fail when memory is very fragmented,
which in turn can lead to an OOM kill.

Use kvzalloc() to fall back to vmalloc when memory is too fragmented to
allocate the order-3 sized bpf verifier environment.

Admittedly this is not a very common case, and only happens on systems
where memory has already been squeezed close to the limit, but this does
not seem like much of a hot path, and it's a simple enough fix.
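
The change amounts to swapping the allocator pair, roughly (sketch):

    env = kvzalloc(sizeof(*env), GFP_KERNEL);
    if (!env)
            return -ENOMEM;
    /* ... */
    kvfree(env);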

Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20241008170735.16766766@imladris.surriel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09 18:13:05 -07:00
Hou Tao
797d73ee23 bpf: Check the remaining info_cnt before repeating btf fields
When trying to repeat the btf fields for an array of nested structs, it
doesn't check the remaining info_cnt. The following splat will be
reported when the value of ret * nelems is greater than BTF_FIELDS_MAX:

  ------------[ cut here ]------------
  UBSAN: array-index-out-of-bounds in ../kernel/bpf/btf.c:3951:49
  index 11 is out of range for type 'btf_field_info [11]'
  CPU: 6 UID: 0 PID: 411 Comm: test_progs ...... 6.11.0-rc4+ #1
  Tainted: [O]=OOT_MODULE
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x57/0x70
   dump_stack+0x10/0x20
   ubsan_epilogue+0x9/0x40
   __ubsan_handle_out_of_bounds+0x6f/0x80
   ? kallsyms_lookup_name+0x48/0xb0
   btf_parse_fields+0x992/0xce0
   map_create+0x591/0x770
   __sys_bpf+0x229/0x2410
   __x64_sys_bpf+0x1f/0x30
   x64_sys_call+0x199/0x9f0
   do_syscall_64+0x3b/0xc0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
  RIP: 0033:0x7fea56f2cc5d
  ......
   </TASK>
  ---[ end trace ]---

Fix it by checking the remaining info_cnt in btf_repeat_fields() before
repeating the btf fields.
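
The shape of the check, as a small userspace sketch (names are hypothetical;
FIELDS_MAX stands in for BTF_FIELDS_MAX):

    #include <errno.h>
    #include <stdio.h>

    #define FIELDS_MAX 11

    /* Before repeating 'ret' field records nelems times for an array of
     * nested structs, make sure the copies fit in the remaining slots
     * instead of writing past the end of the fixed-size info array. */
    static int repeat_fields(int remaining, int ret, int nelems)
    {
            if (ret * nelems > remaining)
                    return -E2BIG;
            return ret * nelems;    /* slots that will be used */
    }

    int main(void)
    {
            printf("%d\n", repeat_fields(FIELDS_MAX, 3, 3));  /* fits: 9 */
            printf("%d\n", repeat_fields(FIELDS_MAX, 4, 3));  /* -E2BIG */
            return 0;
    }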

Fixes: 64e8ee8148 ("bpf: look into the types of the fields of a struct type recursively.")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20241008071114.3718177-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-10-09 16:32:46 -07:00
Linus Torvalds
d3d1556696 12 hotfixes, 5 of which are c:stable. All singletons, about half of which
are MM.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQTTMBEPP41GrTpTJgfdBJ7gKXxAjgUCZwcILgAKCRDdBJ7gKXxA
 jjnMAQDRl+UfscRUeMippi7wnL3ee6MKyhhZVOhoxP24uB7yBwD/Ulq4oE+mLHml
 YTlK/wj5qTZIsdxGaBzM1yifqp3L7gU=
 =lFmJ
 -----END PGP SIGNATURE-----

Merge tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
 "12 hotfixes, 5 of which are c:stable. All singletons, about half of
  which are MM"

* tag 'mm-hotfixes-stable-2024-10-09-15-46' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mm: zswap: delete comments for "value" member of 'struct zswap_entry'.
  CREDITS: sort alphabetically by name
  secretmem: disable memfd_secret() if arch cannot set direct map
  .mailmap: update Fangrui's email
  mm/huge_memory: check pmd_special() only after pmd_present()
  resource, kunit: fix user-after-free in resource_test_region_intersects()
  fs/proc/kcore.c: allow translation of physical memory addresses
  selftests/mm: fix incorrect buffer->mirror size in hmm2 double_map test
  device-dax: correct pgoff align in dax_set_mapping()
  kthread: unpark only parked kthread
  Revert "mm: introduce PF_MEMALLOC_NORECLAIM, PF_MEMALLOC_NOWARN"
  bcachefs: do not use PF_MEMALLOC_NORECLAIM
2024-10-09 16:01:40 -07:00
Huang Ying
0665d7a39b resource, kunit: fix user-after-free in resource_test_region_intersects()
In resource_test_insert_resource(), the pointer is used in an error message
after kfree().  This is a use-after-free.  To fix this, we need to call
kunit_add_action_or_reset() to schedule the memory freeing after usage.  But
kunit_add_action_or_reset() itself may fail and free the memory.  So, its
return value should be checked and the test aborted on failure.  Then, we
found that another usage of kunit_add_action_or_reset() in
resource_test_region_intersects() needs to be fixed too.  We fix all these
use-after-free bugs in this patch.

Link: https://lkml.kernel.org/r/20240930070611.353338-1-ying.huang@intel.com
Fixes: 99185c10d5 ("resource, kunit: add test case for region_intersects()")
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reported-by: Kees Bakker <kees@ijzerbout.nl>
Closes: https://lore.kernel.org/lkml/87ldzaotcg.fsf@yhuang6-desk2.ccr.corp.intel.com/
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09 12:47:19 -07:00
Frederic Weisbecker
214e01ad4e kthread: unpark only parked kthread
Calling into kthread unparking unconditionally is mostly harmless when
the kthread is already unparked. The wake up is then simply ignored
because the target is not in TASK_PARKED state.

However if the kthread is per CPU, the wake up is preceded by a call
to kthread_bind() which expects the task to be inactive and in
TASK_PARKED state, which obviously isn't the case if it is unparked.

As a result, calling kthread_stop() on an unparked per-cpu kthread
triggers such a warning:

	WARNING: CPU: 0 PID: 11 at kernel/kthread.c:525 __kthread_bind_mask kernel/kthread.c:525
	 <TASK>
	 kthread_stop+0x17a/0x630 kernel/kthread.c:707
	 destroy_workqueue+0x136/0xc40 kernel/workqueue.c:5810
	 wg_destruct+0x1e2/0x2e0 drivers/net/wireguard/device.c:257
	 netdev_run_todo+0xe1a/0x1000 net/core/dev.c:10693
	 default_device_exit_batch+0xa14/0xa90 net/core/dev.c:11769
	 ops_exit_list net/core/net_namespace.c:178 [inline]
	 cleanup_net+0x89d/0xcc0 net/core/net_namespace.c:640
	 process_one_work kernel/workqueue.c:3231 [inline]
	 process_scheduled_works+0xa2c/0x1830 kernel/workqueue.c:3312
	 worker_thread+0x86d/0xd70 kernel/workqueue.c:3393
	 kthread+0x2f0/0x390 kernel/kthread.c:389
	 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
	 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
	 </TASK>

Fix this by skipping the unnecessary unparking while stopping a kthread.
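
A minimal sketch of the idea (not necessarily the exact diff): make the
unpark path bail out early when the target was never parked, e.g.

    if (!test_bit(KTHREAD_SHOULD_PARK, &kthread->flags))
            return;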

Link: https://lkml.kernel.org/r/20240913214634.12557-1-frederic@kernel.org
Fixes: 5c25b5ff89 ("workqueue: Tag bound workers with KTHREAD_IS_PER_CPU")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com
Tested-by: syzbot+943d34fa3cf2191e3068@syzkaller.appspotmail.com
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-10-09 12:47:19 -07:00
Steven Rostedt
912da2c384 ring-buffer: Do not have boot mapped buffers hook to CPU hotplug
The boot mapped ring buffer has its buffer mapped at a fixed location
found at boot up. It is not dynamic. It cannot grow or be expanded when
new CPUs come online.

Do not hook fixed memory mapped ring buffers to the CPU hotplug callback,
otherwise it can cause a crash when it tries to add the buffer to the
memory that is already fully occupied.

Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20241008143242.25e20801@gandalf.local.home
Fixes: be68d63a13 ("ring-buffer: Add ring_buffer_alloc_range()")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-09 09:24:35 -04:00
Linus Torvalds
75b607fab3 sched_ext: Fixes for v6.12-rc2
- ops.enqueue() didn't have a way to tell whether select_task_rq_scx() and
   thus ops.select() were skipped. Some schedulers were incorrectly using
   SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using it.
 
 - Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit().
 
 - Fix error information clobbering during load.
 
 - Add missing __weak markers to BPF helper declarations.
 
 - Doc update.
 -----BEGIN PGP SIGNATURE-----
 
 iIQEABYKACwWIQTfIjM1kS57o3GsC/uxYfJx3gVYGQUCZwWKkA4cdGpAa2VybmVs
 Lm9yZwAKCRCxYfJx3gVYGelnAQDTA8GSIahTEHKM0c3yXE6K1/M56zo8Spp5OOA7
 kXHR3AD/Y0RcXgaCvMI13aozmQWq756gyB6/qczN0+X3jx6wZwI=
 =6xbe
 -----END PGP SIGNATURE-----

Merge tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext

Pull sched_ext fixes from Tejun Heo:

 - ops.enqueue() didn't have a way to tell whether select_task_rq_scx()
   and thus ops.select() were skipped. Some schedulers were incorrectly
   using SCX_ENQ_WAKEUP. Add SCX_ENQ_CPU_SELECTED and fix scx_qmap using
   it.

 - Remove a spurious WARN_ON_ONCE() in scx_cgroup_exit()

 - Fix error information clobbering during load

 - Add missing __weak markers to BPF helper declarations

 - Doc update

* tag 'sched_ext-for-6.12-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
  sched_ext: Documentation: Update instructions for running example schedulers
  sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED
  sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
  sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
  sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()
  sched_ext: Improve error reporting during loading
  sched_ext: Add __weak markers to BPF helper function decalarations
2024-10-08 12:54:04 -07:00
Thomas Weißschuh
b24d7f0da6 bpf, lsm: Remove bpf_lsm_key_free hook
The key_free LSM hook has been removed.
Remove the corresponding BPF hook.

Avoid warnings during the build:
  BTFIDS  vmlinux
WARN: resolve_btfids: unresolved symbol bpf_lsm_key_free

Fixes: 5f8d28f6d7 ("lsm: infrastructure management of the key security blob")
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20241005-lsm-key_free-v1-1-42ea801dbd63@weissschuh.net
2024-10-08 12:52:40 -07:00
Chen Ridong
117932eea9 cgroup/bpf: use a dedicated workqueue for cgroup bpf destruction
A hung_task problem shown below was found:

INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
 <TASK>
 __schedule+0x5a2/0x2050
 ? find_held_lock+0x33/0x100
 ? wq_worker_sleeping+0x9e/0xe0
 schedule+0x9f/0x180
 schedule_preempt_disabled+0x25/0x50
 __mutex_lock+0x512/0x740
 ? cgroup_bpf_release+0x1e/0x4d0
 ? cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? cgroup_bpf_release+0x1e/0x4d0
 ? mutex_lock_nested+0x2b/0x40
 ? __pfx_delay_tsc+0x10/0x10
 mutex_lock_nested+0x2b/0x40
 cgroup_bpf_release+0xcf/0x4d0
 ? process_scheduled_works+0x161/0x8a0
 ? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
 ? process_scheduled_works+0x161/0x8a0
 process_scheduled_works+0x23a/0x8a0
 worker_thread+0x231/0x5b0
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x14d/0x1c0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x59/0x70
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1b/0x30
 </TASK>

This issue can be reproduced by the following pressure test:
1. A large number of cpuset cgroups are deleted.
2. Set CPUs on and off repeatedly.
3. Set watchdog_thresh repeatedly.
The scripts can be obtained at the LINK mentioned above the signature.

The reason for this issue is that cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to a deadlock.
It can lead to a deadlock through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
   large number of cgroup_bpf_release works into system_wq. The max_active
   of system_wq is WQ_DFL_ACTIVE (256). Consequently, all active works are
   cgroup_bpf_release works, and many cgroup_bpf_release works will be put
   into the inactive queue. As illustrated in the diagram, there are 256 (in
   the active queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put an
   smp_call_on_cpu work into system_wq. However, step 1 has already filled
   system_wq, so 'sscs.work' is put into the inactive queue. 'sscs.work' has
   to wait until the works that were put into the inactive queue earlier
   have executed (n cgroup_bpf_release), so it will be blocked for a while.
3. CPU offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
   cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
   When cgroup_mutex is acquired by the work at css_killed_work_fn, it will
   call cpuset_css_offline, which needs to acquire cpu_hotplug_lock.read.
   However, cpuset_css_offline will be blocked by step 3.
5. At this moment, there are 256 works in the active queue that are
   cgroup_bpf_release; they are attempting to acquire cgroup_mutex, and as
   a result, all of them are blocked. Consequently, sscs.work cannot be
   executed. Ultimately, this situation leads to four processes being
   blocked, forming a deadlock.

system_wq(step1)		WatchDog(step2)			cpu offline(step3)	cgroup_destroy_wq(step4)
...
2000+ cgroups deleted asyn
256 actives + n inactives
				__lockup_detector_reconfigure
				P(cpu_hotplug_lock.read)
				put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
				waiting for sscs.work to finish
								percpu_down_write
								P(cpu_hotplug_lock.write)
								...blocking...
											css_killed_work_fn
											P(cgroup_mutex)
											cpuset_css_offline
											P(cpu_hotplug_lock.read)
											...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
..blocking...

To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue, which breaks the loop and solves the problem. System wqs are
for misc things which shouldn't create a large number of concurrent work
items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
work items, it should use its own dedicated workqueue.
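
Roughly, the approach looks like the following sketch (names may differ
from the final patch):

    static struct workqueue_struct *cgroup_bpf_destroy_wq;

    static int __init cgroup_bpf_wq_init(void)
    {
            cgroup_bpf_destroy_wq = alloc_workqueue("cgroup_bpf_destroy", 0, 1);
            if (!cgroup_bpf_destroy_wq)
                    panic("Failed to alloc workqueue for cgroup bpf destruction\n");
            return 0;
    }
    core_initcall(cgroup_bpf_wq_init);

    /* and the release path queues there instead of on system_wq: */
    queue_work(cgroup_bpf_destroy_wq, &cgrp->bpf.release_work);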

Fixes: 4bfc0bb2c6 ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Cc: stable@vger.kernel.org # v5.3+
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
2024-10-08 08:43:22 -10:00
Jiri Olsa
45126b155e bpf: Fix memory leak in bpf_core_apply
We need to free specs properly.

Fixes: 3d2786d65a ("bpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20241007160958.607434-1-jolsa@kernel.org
2024-10-07 20:28:24 -07:00
Tejun Heo
9b671793c7 sched_ext, scx_qmap: Add and use SCX_ENQ_CPU_SELECTED
scx_qmap and other schedulers in the SCX repo are using SCX_ENQ_WAKEUP to
tell whether ops.select_cpu() was called. This is incorrect as
ops.select_cpu() can be skipped in the wakeup path and leads to e.g.
incorrectly skipping direct dispatch for tasks that are bound to a single
CPU.

sched core has been updated to specify ENQUEUE_RQ_SELECTED if
->select_task_rq() was called. Map it to SCX_ENQ_CPU_SELECTED and update
scx_qmap to test it instead of SCX_ENQ_WAKEUP.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-07 10:16:18 -10:00
Tejun Heo
f207dc2dcd sched/core: Add ENQUEUE_RQ_SELECTED to indicate whether ->select_task_rq() was called
During ttwu, ->select_task_rq() can be skipped if only one CPU is allowed or
migration is disabled. sched_ext schedulers may perform operations such as
direct dispatch from ->select_task_rq() path and it is useful for them to
know whether ->select_task_rq() was skipped in the ->enqueue_task() path.

Currently, sched_ext schedulers are using ENQUEUE_WAKEUP for this purpose
and end up assuming incorrectly that ->select_task_rq() was called for tasks
that are bound to a single CPU or migration disabled.

Make select_task_rq() indicate whether ->select_task_rq() was called by
setting WF_RQ_SELECTED in *wake_flags and make ttwu_do_activate() map that
to ENQUEUE_RQ_SELECTED for ->enqueue_task().

This will be used by sched_ext to fix ->select_task_rq() skip detection.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-10-07 10:16:18 -10:00
Tejun Heo
b62933eee4 sched/core: Make select_task_rq() take the pointer to wake_flags instead of value
This will be used to allow select_task_rq() to indicate whether
->select_task_rq() was called by modifying *wake_flags.

This makes try_to_wake_up() call all functions that take wake_flags with
WF_TTWU set. Previously, only select_task_rq() was. Using the same flags is
more consistent, and, as the flag is only tested by ->select_task_rq()
implementations, it doesn't cause any behavior differences.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
2024-10-07 10:16:18 -10:00
Al Viro
be5498cac2 remove pointless includes of <linux/fdtable.h>
some of those used to be needed, some had been cargo-culted for
no reason...

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-07 13:34:41 -04:00
Al Viro
8fd3395ec9 get rid of ...lookup...fdget_rcu() family
Once upon a time, predecessors of those used to do file lookup
without bumping a refcount, provided that caller held rcu_read_lock()
across the lookup and whatever it wanted to read from the struct
file found.  When struct file allocation switched to SLAB_TYPESAFE_BY_RCU,
that stopped being feasible and these primitives started to bump the
file refcount for lookup result, requiring the caller to call fput()
afterwards.

But that turned them pointless - e.g.
	rcu_read_lock();
	file = lookup_fdget_rcu(fd);
	rcu_read_unlock();
is equivalent to
	file = fget_raw(fd);
and all callers of lookup_fdget_rcu() are of that form.  Similarly,
task_lookup_fdget_rcu() calls can be replaced with calling fget_task().
task_lookup_next_fdget_rcu() doesn't have direct counterparts, but
its callers would be happier if we replaced it with an analogue that
deals with RCU internally.

Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2024-10-07 13:34:41 -04:00
Tejun Heo
ec010333ce sched_ext: scx_cgroup_exit() may be called without successful scx_cgroup_init()
568894edbe ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations
and fix scx_tg_online()") assumed that scx_cgroup_exit() is only called
after scx_cgroup_init() finished successfully. This isn't true.
scx_cgroup_exit() can be called without scx_cgroup_init() being called at
all or after scx_cgroup_init() failed in the middle.

As init state is tracked per cgroup, scx_cgroup_exit() can be used safely to
clean up in all cases. Remove the incorrect WARN_ON_ONCE().

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 568894edbe ("sched_ext: Add scx_cgroup_enabled to gate cgroup operations and fix scx_tg_online()")
2024-10-04 10:12:11 -10:00
Tejun Heo
cc9877fb76 sched_ext: Improve error reporting during loading
When the BPF scheduler fails, ops.exit() allows rich error reporting through
scx_exit_info. Use the ops.exit() path consistently for all failures which can
be caused by the BPF scheduler:

- scx_ops_error() is called after ops.init() and ops.cgroup_init() failure
  to record error information.

- ops.init_task() failure now uses scx_ops_error() instead of pr_err().

- The err_disable path is updated to automatically trigger scx_ops_error() to
  cover cases where the error message hasn't already been generated, and to
  always return 0 indicating init success so that the error is reported
  through ops.exit().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
2024-10-04 10:11:43 -10:00
Linus Torvalds
622a3ed1ac Various fixes for tracing:
- Fix tp_printk command line option crashing the kernel
 
   With the code that can handle a buffer from a previous boot, the
   trace_check_vprintf() needed access to the delta of the address
   space used by the old buffer and the current buffer. To do so,
   the trace_array (tr) parameter was used. But when tp_printk is
   enabled on the kernel command line, no trace buffer is used and
   the trace event is sent directly to printk(). That meant the tr
   field of the iterator descriptor was NULL, and since tp_printk still
   uses trace_check_vprintf() it caused a NULL dereference.
 
 - Add ptrace.h include to x86 ftrace file for completeness
 
 - Fix rtla installation when done with out-of-tree build
 
 - Fix the help messages in rtla that were incorrect
 
 - Several fixes to fix races with the timerlat and hwlat code
 
   Several locking issues were discovered with the coordination
   between timerlat kthread creation and hotplug, as timerlat has
   callbacks from hotplug code to start kthreads when CPUs come online.
   There are also locking issues with grabbing cpus_read_lock()
   and the locks within timerlat.
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCZwAUghQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmMCAP9Z+0sN/Tx+F+5LmrXV1R9UxRPykmpm
 4NeZYEp+hAQSegD+MdHsEJLIDfmsZnGOBivmzfepuv35GMLrqQMIhhWQOA0=
 =/OQq
 -----END PGP SIGNATURE-----

Merge tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fixes from Steven Rostedt:

 - Fix tp_printk command line option crashing the kernel

   With the code that can handle a buffer from a previous boot, the
   trace_check_vprintf() needed access to the delta of the address space
   used by the old buffer and the current buffer. To do so, the
   trace_array (tr) parameter was used. But when tp_printk is enabled on
   the kernel command line, no trace buffer is used and the trace event
   is sent directly to printk(). That meant the tr field of the iterator
   descriptor was NULL, and since tp_printk still uses
   trace_check_vprintf() it caused a NULL dereference.

 - Add ptrace.h include to x86 ftrace file for completeness

 - Fix rtla installation when done with out-of-tree build

 - Fix the help messages in rtla that were incorrect

 - Several fixes to fix races with the timerlat and hwlat code

   Several locking issues were discovered with the coordination between
   timerlat kthread creation and hotplug, as timerlat has callbacks from
   hotplug code to start kthreads when CPUs come online. There are also
   locking issues with grabbing cpus_read_lock() and the locks within
   timerlat.

* tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing/hwlat: Fix a race during cpuhp processing
  tracing/timerlat: Fix a race during cpuhp processing
  tracing/timerlat: Drop interface_lock in stop_kthread()
  tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline
  x86/ftrace: Include <asm/ptrace.h>
  rtla: Fix the help text in osnoise and timerlat top tools
  tools/rtla: Fix installation from out-of-tree build
  tracing: Fix trace_check_vprintf() when tp_printk is used
2024-10-04 12:11:06 -07:00
Linus Torvalds
f6785e0ccf slab fixes for 6.12-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEe7vIQRWZI0iWSE3xu+CwddJFiJoFAmb/8bcACgkQu+CwddJF
 iJoApwf5AWWhKFbbYwFUCXDi7+/Xr7T7c9H9q+GAEOQiDLsDxihEAo1KYQ+DLl+h
 Vp1ddRYIKMIUfllW3bcD4O6C8L46OX3XPHhTHnksEfvtn3fQGjcU3jKH8n0eL01J
 s9eUdvduNSJorAWqjFPPRrGuLJTXmervrDYYPJLaXGITHHMOxMjKfLAxtXehvARv
 mVQV1F0NTvvNqieuibUCM5XqJs37lrmqB39pLun7bQDU48z4OR1L3nkJxTFF1bGm
 EcvAPayTiNybMt08QSVHIwqfSs+e0HmyKqjvSLpJPImDrfSrWOJvBCJxI4DU+1aw
 UiHyWYLaxWZ7DoJgtZuHV2//8wOWww==
 =EXEA
 -----END PGP SIGNATURE-----

Merge tag 'slab-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab

Pull slab fixes from Vlastimil Babka:
 "Fixes for issues introduced in this merge window: kobject memory leak,
  unsuppressed warning and possible lockup in new slub_kunit tests,
  misleading code in kvfree_rcu_queue_batch()"

* tag 'slab-for-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  slub/kunit: skip test_kfree_rcu when the slub kunit test is built-in
  mm, slab: suppress warnings in test_leak_destroy kunit test
  rcu/kvfree: Refactor kvfree_rcu_queue_batch()
  mm, slab: fix use of SLAB_SUPPORTS_SYSFS in kmem_cache_release()
2024-10-04 12:05:39 -07:00
Linus Torvalds
6cca119588 close_range(): fix the logics in descriptor table trimming
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQQqUNBr3gm4hGXdBJlZ7Krx/gZQ6wUCZv90qwAKCRBZ7Krx/gZQ
 629DAP9NT9/ndByazKrP03nQ/ITNhYVd0cby2iPcXZiHGj/fsQD7BF4yUUiBy7EO
 seIY8rGxc8S5TDJAG4HYa7t1Nuksnw4=
 =oji5
 -----END PGP SIGNATURE-----

Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs

Pull close_range() fix from Al Viro:
 "Fix the logic in descriptor table trimming"

* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  close_range(): fix the logics in descriptor table trimming
2024-10-04 09:46:16 -07:00
Mark Brown
91e102e797 prctl: arch-agnostic prctl for shadow stack
Three architectures (x86, aarch64, riscv) have announced support for
shadow stacks with fairly similar functionality.  While x86 is using
arch_prctl() to control the functionality neither arm64 nor riscv uses
that interface so this patch adds arch-agnostic prctl() support to
get and set status of shadow stacks and lock the current configuation to
prevent further changes, with support for turning on and off individual
subfeatures so applications can limit their exposure to features that
they do not need.  The features are:

  - PR_SHADOW_STACK_ENABLE: Tracking and enforcement of shadow stacks,
    including allocation of a shadow stack if one is not already
    allocated.
  - PR_SHADOW_STACK_WRITE: Writes to specific addresses in the shadow
    stack.
  - PR_SHADOW_STACK_PUSH: Push additional values onto the shadow stack.

These features are expected to be inherited by new threads and cleared
on exec(); unknown features should be rejected for enable but accepted
for locking (in order to allow for future-proofing).

This is based on a patch originally written by Deepak Gupta but modified
fairly heavily, support for indirect landing pads is removed, additional
modes added and the locking interface reworked.  The set status prctl()
is also reworked to just set flags, if setting/reading the shadow stack
pointer is required this could be a separate prctl.
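
A hypothetical usage sketch, assuming PR_GET_SHADOW_STACK_STATUS and
PR_SET_SHADOW_STACK_STATUS as the get/set command names from this series
(argument conventions may differ):

    #include <sys/prctl.h>

    unsigned long status = 0;

    if (prctl(PR_GET_SHADOW_STACK_STATUS, &status, 0, 0, 0) == 0 &&
        !(status & PR_SHADOW_STACK_ENABLE))
            prctl(PR_SET_SHADOW_STACK_STATUS, status | PR_SHADOW_STACK_ENABLE,
                  0, 0, 0);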

Reviewed-by: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Reviewed-by: Deepak Gupta <debug@rivosinc.com>
Link: https://lore.kernel.org/r/20241001-arm64-gcs-v13-4-222b78d87eee@kernel.org
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
2024-10-04 12:04:33 +01:00
Johannes Weiner
3840cbe24c sched: psi: fix bogus pressure spikes from aggregation race
Brandon reports sporadic, non-sensical spikes in cumulative pressure
time (total=) when reading cpu.pressure at a high rate. This is due to
a race condition between reader aggregation and tasks changing states.

While it affects all states and all resources captured by PSI, in
practice it most likely triggers with CPU pressure, since scheduling
events are so frequent compared to other resource events.

The race context is the live snooping of ongoing stalls during a
pressure read. The read aggregates per-cpu records for stalls that
have concluded, but will also incorporate ad-hoc the duration of any
active state that hasn't been recorded yet. This is important to get
timely measurements of ongoing stalls. Those ad-hoc samples are
calculated on-the-fly up to the current time on that CPU; since the
stall hasn't concluded, it's expected that this is the minimum amount
of stall time that will enter the per-cpu records once it does.

The problem is that the path that concludes the state uses a CPU clock
read that is not synchronized against aggregators; the clock is read
outside of the seqlock protection. This allows aggregators to race and
snoop a stall with a longer duration than will actually be recorded.

With the recorded stall time being less than the last snapshot
remembered by the aggregator, a subsequent sample will underflow and
observe a bogus delta value, resulting in an erratic jump in pressure.

Fix this by moving the clock read of the state change into the seqlock
protection. This ensures no aggregation can snoop live stalls past the
time that's recorded when the state concludes.
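
Schematically, the fix moves the clock read inside the per-CPU write
section (sketch, not the exact diff):

    write_seqcount_begin(&groupc->seq);
    now = cpu_clock(cpu);   /* clock read now inside the write section */
    /* ... record state times with 'now' ... */
    write_seqcount_end(&groupc->seq);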

Reported-by: Brandon Duffany <brandon@buildbuddy.io>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219194
Link: https://lore.kernel.org/lkml/20240827121851.GB438928@cmpxchg.org/
Fixes: df77430639 ("psi: Reduce calls to sched_clock() in psi")
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2024-10-03 16:03:16 -07:00
Wei Li
2a13ca2e8a tracing/hwlat: Fix a race during cpuhp processing
The cpuhp online/offline processing race also exists in the percpu-mode hwlat
tracer in theory; apply the fix there too. That is:

    T1                       | T2
    [CPUHP_ONLINE]           | cpu_device_down()
     hwlat_hotplug_workfn()  |
                             |     cpus_write_lock()
                             |     takedown_cpu(1)
                             |     cpus_write_unlock()
    [CPUHP_OFFLINE]          |
        cpus_read_lock()     |
        start_kthread(1)     |
        cpus_read_unlock()   |

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-5-liwei391@huawei.com
Fixes: ba998f7d95 ("trace/hwlat: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-03 16:43:23 -04:00
Wei Li
829e0c9f08 tracing/timerlat: Fix a race during cpuhp processing
Another exception was found where the "timerlat/1" thread was
scheduled on CPU0, which finally led to timer corruption:

```
ODEBUG: init active (active state 0) object: ffff888237c2e108 object type: hrtimer hint: timerlat_irq+0x0/0x220
WARNING: CPU: 0 PID: 426 at lib/debugobjects.c:518 debug_print_object+0x7d/0xb0
Modules linked in:
CPU: 0 UID: 0 PID: 426 Comm: timerlat/1 Not tainted 6.11.0-rc7+ #45
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:debug_print_object+0x7d/0xb0
...
Call Trace:
 <TASK>
 ? __warn+0x7c/0x110
 ? debug_print_object+0x7d/0xb0
 ? report_bug+0xf1/0x1d0
 ? prb_read_valid+0x17/0x20
 ? handle_bug+0x3f/0x70
 ? exc_invalid_op+0x13/0x60
 ? asm_exc_invalid_op+0x16/0x20
 ? debug_print_object+0x7d/0xb0
 ? debug_print_object+0x7d/0xb0
 ? __pfx_timerlat_irq+0x10/0x10
 __debug_object_init+0x110/0x150
 hrtimer_init+0x1d/0x60
 timerlat_main+0xab/0x2d0
 ? __pfx_timerlat_main+0x10/0x10
 kthread+0xb7/0xe0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x2d/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
```

After tracing the scheduling event, it was discovered that the migration
of the "timerlat/1" thread was performed during thread creation. Further
analysis confirmed that it is because the CPU online processing for
osnoise is implemented through workers, which is asynchronous with the
offline processing. When the worker was scheduled to create a thread, the
CPU may have already been removed from the cpu_online_mask during the offline
process, resulting in the inability to select the right CPU:

T1                       | T2
[CPUHP_ONLINE]           | cpu_device_down()
osnoise_hotplug_workfn() |
                         |     cpus_write_lock()
                         |     takedown_cpu(1)
                         |     cpus_write_unlock()
[CPUHP_OFFLINE]          |
    cpus_read_lock()     |
    start_kthread(1)     |
    cpus_read_unlock()   |

To fix this, skip online processing if the CPU is already offline.
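
Schematically, the hotplug worker now does (sketch):

    cpus_read_lock();
    if (cpu_online(cpu))
            start_kthread(cpu);
    cpus_read_unlock();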

Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Link: https://lore.kernel.org/20240924094515.3561410-4-liwei391@huawei.com
Fixes: c8895e271f ("trace/osnoise: Support hotplug operations")
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-10-03 16:43:22 -04:00