Commit Graph

35694 Commits

Author SHA1 Message Date
Mathieu Desnoyers
5bc7850232 sched: fix exit_mm vs membarrier (v4)
exit_mm should issue memory barriers after user-space memory accesses,
before clearing current->mm, to order user-space memory accesses
performed prior to exit_mm before clearing tsk->mm, which has the
effect of skipping the membarrier private expedited IPIs.

exit_mm should also update the runqueue's membarrier_state so
membarrier global expedited IPIs are not sent when they are not
needed.

The membarrier system call can be issued concurrently with do_exit
if we have thread groups created with CLONE_VM but not CLONE_THREAD.

Here is the scenario I have in mind:

Two thread groups are created, A and B. Thread group B is created by
issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
Let's assume we have a single thread within each thread group (Thread A
and Thread B).

The AFAIU we can have:

Userspace variables:

int x = 0, y = 0;

CPU 0                   CPU 1
Thread A                Thread B
(in thread group A)     (in thread group B)

x = 1
barrier()
y = 1
exit()
exit_mm()
current->mm = NULL;
                        r1 = load y
                        membarrier()
                          skips CPU 0 (no IPI) because its current mm is NULL
                        r2 = load x
                        BUG_ON(r1 == 1 && r2 == 0)

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20201020134715.13909-2-mathieu.desnoyers@efficios.com
2020-10-29 11:00:30 +01:00
Peter Zijlstra
45da7a2b0a sched/fair: Exclude the current CPU from find_new_ilb()
It is possible for find_new_ilb() to select the current CPU, however,
this only happens from newidle balancing, in which case need_resched()
will be true, and consequently nohz_csd_func() will not trigger the
softirq.

Exclude the current CPU from becoming an ILB target.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-29 11:00:30 +01:00
Peter Zijlstra
b13772f813 sched/cpupri: Add CPUPRI_HIGHER
Add CPUPRI_HIGHER above the RT99 priority to denote the CPU is in use
by higher priority tasks (specifically deadline).

XXX: we should probably drive PUSH-PULL from cpupri, that would
automagically result in an RT-PUSH when DL sets cpupri to CPUPRI_HIGHER.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-29 11:00:30 +01:00
Peter Zijlstra
934fc3314b sched/cpupri: Remap CPUPRI_NORMAL to MAX_RT_PRIO-1
This makes the mapping continuous and frees up 100 for other usage.

Prev mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                               99        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
2020-10-29 11:00:30 +01:00
Dietmar Eggemann
1b08782ce3 sched/cpupri: Remove pri_to_cpu[1]
pri_to_cpu[1] isn't used since cpupri_set(..., newpri) is
never called with newpri = 99.

The valid RT priorities RT1..RT99 (p->rt_priority = [1..99]) map into
cpupri (idx of pri_to_cpu[]) = [2..100]

Current mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        2
           ...
            49        50       50       50
            50        49       49       51
           ...
            99         0        0      100

So cpupri = 1 isn't used.

Reduce the size of pri_to_cpu[] by 1 and adapt the cpupri
implementation accordingly. This will save a useless for loop with an
atomic_read in cpupri_find_fitness() calling __cpupri_find().

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        1
           ...
            49        50       50       49
            50        49       49       50
           ...
            99         0        0       99

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200922083934.19275-3-dietmar.eggemann@arm.com
2020-10-29 11:00:29 +01:00
Dietmar Eggemann
5e054bca44 sched/cpupri: Remove pri_to_cpu[CPUPRI_IDLE]
pri_to_cpu[CPUPRI_IDLE=0] isn't used since cpupri_set(..., newpri) is
never called with newpri = MAX_PRIO (140).

Current mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              140        0 (CPUPRI_IDLE)

                              100        1 (CPUPRI_NORMAL)

             1        98       98        3
           ...
            49        50       50       51
            50        49       49       52
           ...
            99         0        0      101

Even when cpupri was introduced with commit 6e0534f278 ("sched: use a
2-d bitmap for searching lowest-pri CPU") in v2.6.27, only

   (1) CPUPRI_INVALID (-1),
   (2) MAX_RT_PRIO (100),
   (3) an RT prio (RT1..RT99)

were used as newprio in cpupri_set(..., newpri) -> convert_prio(newpri).

MAX_RT_PRIO is used only in dec_rt_tasks() -> dec_rt_prio() ->
dec_rt_prio_smp() -> cpupri_set() in case of !rt_rq->rt_nr_running.
I.e. it stands for a non-rt task, including the IDLE task.

Commit 57785df5ac ("sched: Fix task priority bug") removed code in
v2.6.33 which did set the priority of the IDLE task to MAX_PRIO.
Although this happened after the introduction of cpupri, it didn't have
an effect on the values used for cpupri_set(..., newpri).

Remove CPUPRI_IDLE and adapt the cpupri implementation accordingly.
This will save a useless for loop with an atomic_read in
cpupri_find_fitness() calling __cpupri_find().

New mapping:

p->rt_priority   p->prio   newpri   cpupri

                               -1       -1 (CPUPRI_INVALID)

                              100        0 (CPUPRI_NORMAL)

             1        98       98        2
           ...
            49        50       50       50
            50        49       49       51
           ...
            99         0        0      100

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200922083934.19275-2-dietmar.eggemann@arm.com
2020-10-29 11:00:29 +01:00
Peng Liu
a57415f5d1 sched/deadline: Fix sched_dl_global_validate()
When change sched_rt_{runtime, period}_us, we validate that the new
settings should at least accommodate the currently allocated -dl
bandwidth:

  sched_rt_handler()
    -->	sched_dl_bandwidth_validate()
	{
		new_bw = global_rt_runtime()/global_rt_period();

		for_each_possible_cpu(cpu) {
			dl_b = dl_bw_of(cpu);
			if (new_bw < dl_b->total_bw)    <-------
				ret = -EBUSY;
		}
	}

But under CONFIG_SMP, dl_bw is per root domain , but not per CPU,
dl_b->total_bw is the allocated bandwidth of the whole root domain.
Instead, we should compare dl_b->total_bw against "cpus*new_bw",
where 'cpus' is the number of CPUs of the root domain.

Also, below annotation(in kernel/sched/sched.h) implied implementation
only appeared in SCHED_DEADLINE v2[1], then deadline scheduler kept
evolving till got merged(v9), but the annotation remains unchanged,
meaningless and misleading, update it.

* With respect to SMP, the bandwidth is given on a per-CPU basis,
* meaning that:
*  - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
*  - dl_total_bw array contains, in the i-eth element, the currently
*    allocated bandwidth on the i-eth CPU.

[1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

Fixes: 332ac17ef5 ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
2020-10-29 11:00:29 +01:00
Peng Liu
26762423a2 sched/deadline: Optimize sched_dl_global_validate()
Under CONFIG_SMP, dl_bw is per root domain, but not per CPU.
When checking or updating dl_bw, currently iterating every CPU is
overdoing, just need iterate each root domain once.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Peng Liu <iwtbavbm@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Bristot de Oliveira <bristot@redhat.com>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/78d21ee792cc48ff79e8cd62a5f26208463684d6.1602171061.git.iwtbavbm@gmail.com
2020-10-29 11:00:28 +01:00
jun qian
b9c88f7522 sched/fair: Improve the accuracy of sched_stat_wait statistics
When the sched_schedstat changes from 0 to 1, some sched se maybe
already in the runqueue, the se->statistics.wait_start will be 0.
So it will let the (rq_of(cfs_rq)) - se->statistics.wait_start)
wrong. We need to avoid this scenario.

Signed-off-by: jun qian <qianjun.kernel@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lkml.kernel.org/r/20201015064846.19809-1-qianjun.kernel@gmail.com
2020-10-29 11:00:28 +01:00
Jens Axboe
114518eb64 task_work: Use TIF_NOTIFY_SIGNAL if available
If the arch supports TIF_NOTIFY_SIGNAL, then use that for TWA_SIGNAL as
it's more efficient than using the signal delivery method. This is
especially true on threaded applications, where ->sighand is shared across
threads, but it's also lighter weight on non-shared cases.

io_uring is a heavy consumer of TWA_SIGNAL based task_work. A test with
threads shows a nice improvement running an io_uring based echo server.

stock kernel:
0.01% <= 0.1 milliseconds
95.86% <= 0.2 milliseconds
98.27% <= 0.3 milliseconds
99.71% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
100.00% <= 0.6 milliseconds
100.00% <= 0.7 milliseconds
100.00% <= 0.8 milliseconds
100.00% <= 0.9 milliseconds
100.00% <= 1.0 milliseconds
100.00% <= 1.1 milliseconds
100.00% <= 2 milliseconds
100.00% <= 3 milliseconds
100.00% <= 3 milliseconds
1378930.00 requests per second
~1600% CPU

1.38M requests/second, and all 16 CPUs are maxed out.

patched kernel:
0.01% <= 0.1 milliseconds
98.24% <= 0.2 milliseconds
99.47% <= 0.3 milliseconds
99.99% <= 0.4 milliseconds
100.00% <= 0.5 milliseconds
100.00% <= 0.6 milliseconds
100.00% <= 0.7 milliseconds
100.00% <= 0.8 milliseconds
100.00% <= 0.9 milliseconds
100.00% <= 1.2 milliseconds
1666111.38 requests per second
~1450% CPU

1.67M requests/second, and we're no longer just hammering on the sighand
lock. The original reporter states:

"For 5.7.15 my benchmark achieves 1.6M qps and system cpu is at ~80%.
 for 5.7.16 or later it achieves only 1M qps and the system cpu is is
 at ~100%"

with the only difference there being that TWA_SIGNAL is used
unconditionally in 5.7.16, since it's required to be able to handle the
inability to run task_work if the application is waiting in the kernel
already on an event that needs task_work run to be satisfied. Also see
commit 0ba9c9edcd.

Reported-by: Roman Gershman <romger@amazon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-5-axboe@kernel.dk
2020-10-29 09:37:37 +01:00
Jens Axboe
12db8b6900 entry: Add support for TIF_NOTIFY_SIGNAL
Add TIF_NOTIFY_SIGNAL handling in the generic entry code, which if set,
will return true if signal_pending() is used in a wait loop. That causes an
exit of the loop so that notify_signal tracehooks can be run. If the wait
loop is currently inside a system call, the system call is restarted once
task_work has been processed.

In preparation for only having arch_do_signal() handle syscall restarts if
_TIF_SIGPENDING isn't set, rename it to arch_do_signal_or_restart().  Pass
in a boolean that tells the architecture specific signal handler if it
should attempt to get a signal, or just process a potential syscall
restart.

For !CONFIG_GENERIC_ENTRY archs, add the TIF_NOTIFY_SIGNAL handling to
get_signal(). This is done to minimize the needed architecture changes to
support this feature.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-3-axboe@kernel.dk
2020-10-29 09:37:36 +01:00
Jens Axboe
5c251e9dc0 signal: Add task_sigpending() helper
This is in preparation for maintaining signal_pending() as the decider of
whether or not a schedule() loop should be broken, or continue sleeping.
This is different than the core signal use cases, which really need to know
whether an actual signal is pending or not. task_sigpending() returns
non-zero if TIF_SIGPENDING is set.

Only core kernel use cases should care about the distinction between
the two, make sure those use the task_sigpending() helper.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Link: https://lore.kernel.org/r/20201026203230.386348-2-axboe@kernel.dk
2020-10-29 09:37:36 +01:00
Yonghong Song
cf83b2d2e2 bpf: Permit cond_resched for some iterators
Commit e679654a70 ("bpf: Fix a rcu_sched stall issue with
bpf task/task_file iterator") tries to fix rcu stalls warning
which is caused by bpf task_file iterator when running
"bpftool prog".

      rcu: INFO: rcu_sched self-detected stall on CPU
      rcu: \x097-....: (20999 ticks this GP) idle=302/1/0x4000000000000000 softirq=1508852/1508852 fqs=4913
      \x09(t=21031 jiffies g=2534773 q=179750)
      NMI backtrace for cpu 7
      CPU: 7 PID: 184195 Comm: bpftool Kdump: loaded Tainted: G        W         5.8.0-00004-g68bfc7f8c1b4 #6
      Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A17 05/03/2019
      Call Trace:
      <IRQ>
      dump_stack+0x57/0x70
      nmi_cpu_backtrace.cold+0x14/0x53
      ? lapic_can_unplug_cpu.cold+0x39/0x39
      nmi_trigger_cpumask_backtrace+0xb7/0xc7
      rcu_dump_cpu_stacks+0xa2/0xd0
      rcu_sched_clock_irq.cold+0x1ff/0x3d9
      ? tick_nohz_handler+0x100/0x100
      update_process_times+0x5b/0x90
      tick_sched_timer+0x5e/0xf0
      __hrtimer_run_queues+0x12a/0x2a0
      hrtimer_interrupt+0x10e/0x280
      __sysvec_apic_timer_interrupt+0x51/0xe0
      asm_call_on_stack+0xf/0x20
      </IRQ>
      sysvec_apic_timer_interrupt+0x6f/0x80
      ...
      task_file_seq_next+0x52/0xa0
      bpf_seq_read+0xb9/0x320
      vfs_read+0x9d/0x180
      ksys_read+0x5f/0xe0
      do_syscall_64+0x38/0x60
      entry_SYSCALL_64_after_hwframe+0x44/0xa9

The fix is to limit the number of bpf program runs to be
one million. This fixed the program in most cases. But
we also found under heavy load, which can increase the wallclock
time for bpf_seq_read(), the warning may still be possible.

For example, calling bpf_delay() in the "while" loop of
bpf_seq_read(), which will introduce artificial delay,
the warning will show up in my qemu run.

  static unsigned q;
  volatile unsigned *p = &q;
  volatile unsigned long long ll;
  static void bpf_delay(void)
  {
         int i, j;

         for (i = 0; i < 10000; i++)
                 for (j = 0; j < 10000; j++)
                         ll += *p;
  }

There are two ways to fix this issue. One is to reduce the above
one million threshold to say 100,000 and hopefully rcu warning will
not show up any more. Another is to introduce a target feature
which enables bpf_seq_read() calling cond_resched().

This patch took second approach as the first approach may cause
more -EAGAIN failures for read() syscalls. Note that not all bpf_iter
targets can permit cond_resched() in bpf_seq_read() as some, e.g.,
netlink seq iterator, rcu read lock critical section spans through
seq_ops->next() -> seq_ops->show() -> seq_ops->next().

For the kernel code with the above hack, "bpftool p" roughly takes
38 seconds to finish on my VM with 184 bpf program runs.
Using the following command, I am able to collect the number of
context switches:
   perf stat -e context-switches -- ./bpftool p >& log
Without this patch,
   69      context-switches
With this patch,
   75      context-switches
This patch added additional 6 context switches, roughly every 6 seconds
to reschedule, to avoid lengthy no-rescheduling which may cause the
above RCU warnings.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20201028061054.1411116-1-yhs@fb.com
2020-10-28 14:54:31 -07:00
Al Viro
77f6ab8b77 don't dump the threads that had been already exiting when zapped.
Coredump logics needs to report not only the registers of the dumping
thread, but (since 2.5.43) those of other threads getting killed.

Doing that might require extra state saved on the stack in asm glue at
kernel entry; signal delivery logics does that (we need to be able to
save sigcontext there, at the very least) and so does seccomp.

That covers all callers of do_coredump().  Secondary threads get hit with
SIGKILL and caught as soon as they reach exit_mm(), which normally happens
in signal delivery, so those are also fine most of the time.  Unfortunately,
it is possible to end up with secondary zapped when it has already entered
exit(2) (or, worse yet, is oopsing).  In those cases we reach exit_mm()
when mm->core_state is already set, but the stack contents is not what
we would have in signal delivery.

At least on two architectures (alpha and m68k) it leads to infoleaks - we
end up with a chunk of kernel stack written into coredump, with the contents
consisting of normal C stack frames of the call chain leading to exit_mm()
instead of the expected copy of userland registers.  In case of alpha we
leak 312 bytes of stack.  Other architectures (including the regset-using
ones) might have similar problems - the normal user of regsets is ptrace
and the state of tracee at the time of such calls is special in the same
way signal delivery is.

Note that had the zapper gotten to the exiting thread slightly later,
it wouldn't have been included into coredump anyway - we skip the threads
that have already cleared their ->mm.  So let's pretend that zapper always
loses the race.  IOW, have exit_mm() only insert into the dumper list if
we'd gotten there from handling a fatal signal[*]

As the result, the callers of do_exit() that have *not* gone through get_signal()
are not seen by coredump logics as secondary threads.  Which excludes voluntary
exit()/oopsen/traps/etc.  The dumper thread itself is unaffected by that,
so seccomp is fine.

[*] originally I intended to add a new flag in tsk->flags, but ebiederman pointed
out that PF_SIGNALED is already doing just what we need.

Cc: stable@vger.kernel.org
Fixes: d89f3847def4 ("[PATCH] thread-aware coredumps, 2.5.43-C3")
History-tree: https://git.kernel.org/pub/scm/linux/kernel/git/tglx/history.git
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-10-28 16:39:49 -04:00
David Woodhouse
2cbd5a45e5 genirq/irqdomain: Implement get_name() method on irqchip fwnodes
Prerequisite to make x86 more irqdomain compliant.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20201024213535.443185-23-dwmw2@infradead.org
2020-10-28 20:26:27 +01:00
Linus Torvalds
23859ae444 Merge tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
 "Fix synthetic event "strcat" overrun

  New synthetic event code used strcat() and miscalculated the ending,
  causing the concatenation to write beyond the allocated memory.

  Instead of using strncat(), the code is switched over to seq_buf which
  has all the mechanisms in place to protect against writing more than
  what is allocated, and cleans up the code a bit"

* tag 'trace-v5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing, synthetic events: Replace buggy strcat() with seq_buf operations
2020-10-28 12:05:14 -07:00
Mateusz Nosek
921c7ebd13 futex: Fix incorrect should_fail_futex() handling
If should_futex_fail() returns true in futex_wake_pi(), then the 'ret'
variable is set to -EFAULT and then immediately overwritten. So the failure
injection is non-functional.

Fix it by actually leaving the function and returning -EFAULT.

The Fixes tag is kinda blury because the initial commit which introduced
failure injection was already sloppy, but the below mentioned commit broke
it completely.

[ tglx: Massaged changelog ]

Fixes: 6b4f4bc9cb ("locking/futex: Allow low-level atomic operations to return -EAGAIN")
Signed-off-by: Mateusz Nosek <mateusznosek0@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200927000858.24219-1-mateusznosek0@gmail.com
2020-10-28 15:48:51 +01:00
Richard Guy Briggs
6d915476e6 audit: trigger accompanying records when no rules present
When there are no audit rules registered, mandatory records (config,
etc.) are missing their accompanying records (syscall, proctitle, etc.).

This is due to audit context dummy set on syscall entry based on absence
of rules that signals that no other records are to be printed.  Clear the dummy
bit if any record is generated, open coding this in audit_log_start().

The proctitle context and dummy checks are pointless since the
proctitle record will not be printed if no syscall records are printed.

The fds array is reset to -1 after the first syscall to indicate it
isn't valid any more, but was never set to -1 when the context was
allocated to indicate it wasn't yet valid.

Check ctx->pwd in audit_log_name().

The audit_inode* functions can be called without going through
getname_flags() or getname_kernel() that sets audit_names and cwd, so
set the cwd in audit_alloc_name() if it has not already been done so due to
audit_names being valid and purge all other audit_getcwd() calls.

Revert the LSM dump_common_audit_data() LSM_AUDIT_DATA_* cases from the
ghak96 patch since they are no longer necessary due to cwd coverage in
audit_alloc_name().

Thanks to bauen1 <j2468h@googlemail.com> for reporting LSM situations in
which context->cwd is not valid, inadvertantly fixed by the ghak96 patch.

Please see upstream github issue
https://github.com/linux-audit/audit-kernel/issues/120
This is also related to upstream github issue
https://github.com/linux-audit/audit-kernel/issues/96

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-10-27 21:02:57 -04:00
Mauro Carvalho Chehab
cbb5262192 audit: fix a kernel-doc markup
typo:
	kauditd_print_skb -> kauditd_printk_skb

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-10-27 21:02:54 -04:00
Jackie Zamow
4d4ce8053b PM: sleep: fix typo in kernel/power/process.c
Fix a typo in a comment in freeze_processes().

Signed-off-by: Jackie Zamow <jackie.zamow@gmail.com>
[ rjw: Subject and changelog edits ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-10-27 19:11:44 +01:00
Steven Rostedt (VMware)
761a8c58db tracing, synthetic events: Replace buggy strcat() with seq_buf operations
There was a memory corruption bug happening while running the synthetic
event selftests:

 kmemleak: Cannot insert 0xffff8c196fa2afe5 into the object search tree (overlaps existing)
 CPU: 5 PID: 6866 Comm: ftracetest Tainted: G        W         5.9.0-rc5-test+ #577
 Hardware name: Hewlett-Packard HP Compaq Pro 6300 SFF/339A, BIOS K01 v03.03 07/14/2016
 Call Trace:
  dump_stack+0x8d/0xc0
  create_object.cold+0x3b/0x60
  slab_post_alloc_hook+0x57/0x510
  ? tracing_map_init+0x178/0x340
  __kmalloc+0x1b1/0x390
  tracing_map_init+0x178/0x340
  event_hist_trigger_func+0x523/0xa40
  trigger_process_regex+0xc5/0x110
  event_trigger_write+0x71/0xd0
  vfs_write+0xca/0x210
  ksys_write+0x70/0xf0
  do_syscall_64+0x33/0x40
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fef0a63a487
 Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
 RSP: 002b:00007fff76f18398 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
 RAX: ffffffffffffffda RBX: 0000000000000039 RCX: 00007fef0a63a487
 RDX: 0000000000000039 RSI: 000055eb3b26d690 RDI: 0000000000000001
 RBP: 000055eb3b26d690 R08: 000000000000000a R09: 0000000000000038
 R10: 000055eb3b2cdb80 R11: 0000000000000246 R12: 0000000000000039
 R13: 00007fef0a70b500 R14: 0000000000000039 R15: 00007fef0a70b700
 kmemleak: Kernel memory leak detector disabled
 kmemleak: Object 0xffff8c196fa2afe0 (size 8):
 kmemleak:   comm "ftracetest", pid 6866, jiffies 4295082531
 kmemleak:   min_count = 1
 kmemleak:   count = 0
 kmemleak:   flags = 0x1
 kmemleak:   checksum = 0
 kmemleak:   backtrace:
      __kmalloc+0x1b1/0x390
      tracing_map_init+0x1be/0x340
      event_hist_trigger_func+0x523/0xa40
      trigger_process_regex+0xc5/0x110
      event_trigger_write+0x71/0xd0
      vfs_write+0xca/0x210
      ksys_write+0x70/0xf0
      do_syscall_64+0x33/0x40
      entry_SYSCALL_64_after_hwframe+0x44/0xa9

The cause came down to a use of strcat() that was adding an string that was
shorten, but the strcat() did not take that into account.

strcat() is extremely dangerous as it does not care how big the buffer is.
Replace it with seq_buf operations that prevent the buffer from being
overwritten if what is being written is bigger than the buffer.

Fixes: 10819e2579 ("tracing: Handle synthetic event array field type checking correctly")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Tested-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-27 09:25:36 -04:00
Zong Li
4230e2deaa stop_machine, rcu: Mark functions as notrace
Some architectures assume that the stopped CPUs don't make function calls
to traceable functions when they are in the stopped state. See also commit
cb9d7fd51d ("watchdog: Mark watchdog touch functions as notrace").

Violating this assumption causes kernel crashes when switching tracer on
RISC-V.

Mark rcu_momentary_dyntick_idle() and stop_machine_yield() notrace to
prevent this.

Fixes: 4ecf0a43e7 ("processor: get rid of cpu_relax_yield")
Fixes: 366237e7b0 ("stop_machine: Provide RCU quiescent state in multi_cpu_stop()")
Signed-off-by: Zong Li <zong.li@sifive.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Atish Patra <atish.patra@wdc.com>
Tested-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20201021073839.43935-1-zong.li@sifive.com
2020-10-26 12:12:27 +01:00
Zeng Tao
cb47755725 time: Prevent undefined behaviour in timespec64_to_ns()
UBSAN reports:

Undefined behaviour in ./include/linux/time64.h:127:27
signed integer overflow:
17179869187 * 1000000000 cannot be represented in type 'long long int'
Call Trace:
 timespec64_to_ns include/linux/time64.h:127 [inline]
 set_cpu_itimer+0x65c/0x880 kernel/time/itimer.c:180
 do_setitimer+0x8e/0x740 kernel/time/itimer.c:245
 __x64_sys_setitimer+0x14c/0x2c0 kernel/time/itimer.c:336
 do_syscall_64+0xa1/0x540 arch/x86/entry/common.c:295

Commit bd40a17576 ("y2038: itimer: change implementation to timespec64")
replaced the original conversion which handled time clamping correctly with
timespec64_to_ns() which has no overflow protection.

Fix it in timespec64_to_ns() as this is not necessarily limited to the
usage in itimers.

[ tglx: Added comment and adjusted the fixes tag ]

Fixes: 361a3bf005 ("time64: Add time64.h header and define struct timespec64")
Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/1598952616-6416-1-git-send-email-prime.zeng@hisilicon.com
2020-10-26 11:48:11 +01:00
YueHaibing
9010e3876e timers: Remove unused inline funtion debug_timer_free()
There is no caller in tree, remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200909134749.32300-1-yuehaibing@huawei.com
2020-10-26 11:39:21 +01:00
YueHaibing
5254cb87c0 hrtimer: Remove unused inline function debug_hrtimer_free()
There is no caller in tree, remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20200909134850.21940-1-yuehaibing@huawei.com
2020-10-26 11:39:21 +01:00
Quanyang Wang
4cd2bb1298 time/sched_clock: Mark sched_clock_read_begin/retry() as notrace
Since sched_clock_read_begin() and sched_clock_read_retry() are called
by notrace function sched_clock(), they shouldn't be traceable either,
or else ftrace_graph_caller will run into a dead loop on the path
as below (arm for instance):

  ftrace_graph_caller()
    prepare_ftrace_return()
      function_graph_enter()
        ftrace_push_return_trace()
          trace_clock_local()
            sched_clock()
              sched_clock_read_begin/retry()

Fixes: 1b86abc1c6 ("sched_clock: Expose struct clock_read_data")
Signed-off-by: Quanyang Wang <quanyang.wang@windriver.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200929082027.16787-1-quanyang.wang@windriver.com
2020-10-26 11:34:31 +01:00
Davidlohr Bueso
1a2b85f1e2 timekeeping: Convert jiffies_seq to seqcount_raw_spinlock_t
Use the new api and associate the seqcounter to the jiffies_lock enabling
lockdep support - although for this particular case the write-side locking
and non-preemptibility are quite obvious.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20201021190749.19363-1-dave@stgolabs.net
2020-10-26 11:04:14 +01:00
Joe Perches
33def8498f treewide: Convert macro and uses of __section(foo) to __section("foo")
Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

Signed-off-by: Joe Perches <joe@perches.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@gooogle.com>
Reviewed-by: Miguel Ojeda <ojeda@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25 14:51:49 -07:00
Rasmus Villemoes
986b9eacb2 kernel/sys.c: fix prototype of prctl_get_tid_address()
tid_addr is not a "pointer to (pointer to int in userspace)"; it is in
fact a "pointer to (pointer to int in userspace) in userspace".  So
sparse rightfully complains about passing a kernel pointer to
put_user().

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-10-25 11:44:16 -07:00
Linus Torvalds
672f887126 Merge tag 'timers-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fixes from Thomas Gleixner:
 "A time namespace fix and a matching selftest. The futex absolute
  timeouts which are based on CLOCK_MONOTONIC require time namespace
  corrected. This was missed in the original time namesapce support"

* tag 'timers-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  selftests/timens: Add a test for futex()
  futex: Adjust absolute futex timeouts with per time namespace offset
2020-10-25 11:28:49 -07:00
Linus Torvalds
87702a337f Merge tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Thomas Gleixner:
 "Two scheduler fixes:

   - A trivial build fix for sched_feat() to compile correctly with
     CONFIG_JUMP_LABEL=n

   - Replace a zero lenght array with a flexible array"

* tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/features: Fix !CONFIG_JUMP_LABEL case
  sched: Replace zero-length array with flexible-array
2020-10-25 11:25:16 -07:00
Linus Torvalds
81ecf91eab Merge tag 'safesetid-5.10' of git://github.com/micah-morton/linux
Pull SafeSetID updates from Micah Morton:
 "The changes are mostly contained to within the SafeSetID LSM, with the
  exception of a few 1-line changes to change some ns_capable() calls to
  ns_capable_setid() -- causing a flag (CAP_OPT_INSETID) to be set that
  is examined by SafeSetID code and nothing else in the kernel.

  The changes to SafeSetID internally allow for setting up GID
  transition security policies, as already existed for UIDs"

* tag 'safesetid-5.10' of git://github.com/micah-morton/linux:
  LSM: SafeSetID: Fix warnings reported by test bot
  LSM: SafeSetID: Add GID security policy handling
  LSM: Signal to SafeSetID when setting group IDs
2020-10-25 10:45:26 -07:00
Linus Torvalds
91f28da8c9 Merge tag '20201024-v4-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom
Pull random32 updates from Willy Tarreau:
 "Make prandom_u32() less predictable.

  This is the cleanup of the latest series of prandom_u32
  experimentations consisting in using SipHash instead of Tausworthe to
  produce the randoms used by the network stack.

  The changes to the files were kept minimal, and the controversial
  commit that used to take noise from the fast_pool (f227e3ec3b) was
  reverted. Instead, a dedicated "net_rand_noise" per_cpu variable is
  fed from various sources of activities (networking, scheduling) to
  perturb the SipHash state using fast, non-trivially predictable data,
  instead of keeping it fully deterministic. The goal is essentially to
  make any occasional memory leakage or brute-force attempt useless.

  The resulting code was verified to be very slightly faster on x86_64
  than what is was with the controversial commit above, though this
  remains barely above measurement noise. It was also tested on i386 and
  arm, and build- tested only on arm64"

Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/

* tag '20201024-v4-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/wtarreau/prandom:
  random32: add a selftest for the prandom32 code
  random32: add noise from network and scheduling activity
  random32: make prandom_u32() output unpredictable
2020-10-25 10:40:08 -07:00
Linus Torvalds
1b307ac870 Merge tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping fixes from Christoph Hellwig:

 - document the new dma_{alloc,free}_pages() API

 - two fixups for the dma-mapping.h split

* tag 'dma-mapping-5.10-1' of git://git.infradead.org/users/hch/dma-mapping:
  dma-mapping: document dma_{alloc,free}_pages
  dma-mapping: move more functions to dma-map-ops.h
  ARM/sa1111: add a missing include of dma-map-ops.h
2020-10-24 12:17:05 -07:00
Willy Tarreau
3744741ada random32: add noise from network and scheduling activity
With the removal of the interrupt perturbations in previous random32
change (random32: make prandom_u32() output unpredictable), the PRNG
has become 100% deterministic again. While SipHash is expected to be
way more robust against brute force than the previous Tausworthe LFSR,
there's still the risk that whoever has even one temporary access to
the PRNG's internal state is able to predict all subsequent draws till
the next reseed (roughly every minute). This may happen through a side
channel attack or any data leak.

This patch restores the spirit of commit f227e3ec3b ("random32: update
the net random state on interrupt and activity") in that it will perturb
the internal PRNG's statee using externally collected noise, except that
it will not pick that noise from the random pool's bits nor upon
interrupt, but will rather combine a few elements along the Tx path
that are collectively hard to predict, such as dev, skb and txq
pointers, packet length and jiffies values. These ones are combined
using a single round of SipHash into a single long variable that is
mixed with the net_rand_state upon each invocation.

The operation was inlined because it produces very small and efficient
code, typically 3 xor, 2 add and 2 rol. The performance was measured
to be the same (even very slightly better) than before the switch to
SipHash; on a 6-core 12-thread Core i7-8700k equipped with a 40G NIC
(i40e), the connection rate dropped from 556k/s to 555k/s while the
SYN cookie rate grew from 5.38 Mpps to 5.45 Mpps.

Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
Cc: George Spelvin <lkml@sdf.org>
Cc: Amit Klein <aksecurity@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24 20:21:57 +02:00
George Spelvin
c51f8f88d7 random32: make prandom_u32() output unpredictable
Non-cryptographic PRNGs may have great statistical properties, but
are usually trivially predictable to someone who knows the algorithm,
given a small sample of their output.  An LFSR like prandom_u32() is
particularly simple, even if the sample is widely scattered bits.

It turns out the network stack uses prandom_u32() for some things like
random port numbers which it would prefer are *not* trivially predictable.
Predictability led to a practical DNS spoofing attack.  Oops.

This patch replaces the LFSR with a homebrew cryptographic PRNG based
on the SipHash round function, which is in turn seeded with 128 bits
of strong random key.  (The authors of SipHash have *not* been consulted
about this abuse of their algorithm.)  Speed is prioritized over security;
attacks are rare, while performance is always wanted.

Replacing all callers of prandom_u32() is the quick fix.
Whether to reinstate a weaker PRNG for uses which can tolerate it
is an open question.

Commit f227e3ec3b ("random32: update the net random state on interrupt
and activity") was an earlier attempt at a solution.  This patch replaces
it.

Reported-by: Amit Klein <aksecurity@gmail.com>
Cc: Willy Tarreau <w@1wt.eu>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: tytso@mit.edu
Cc: Florian Westphal <fw@strlen.de>
Cc: Marc Plumb <lkml.mplumb@gmail.com>
Fixes: f227e3ec3b ("random32: update the net random state on interrupt and activity")
Signed-off-by: George Spelvin <lkml@sdf.org>
Link: https://lore.kernel.org/netdev/20200808152628.GA27941@SDF.ORG/
[ willy: partial reversal of f227e3ec3b5c; moved SIPROUND definitions
  to prandom.h for later use; merged George's prandom_seed() proposal;
  inlined siprand_u32(); replaced the net_rand_state[] array with 4
  members to fix a build issue; cosmetic cleanups to make checkpatch
  happy; fixed RANDOM32_SELFTEST build ]
Signed-off-by: Willy Tarreau <w@1wt.eu>
2020-10-24 20:21:57 +02:00
Linus Torvalds
a5e5c274c9 Merge tag 'trace-v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing ring-buffer fix from Steven Rostedt:
 "The success return value of ring_buffer_resize() is stated to be
  zero and checked that way.

  But it was incorrectly returning the size allocated.

  Also, a fix to a comment"

* tag 'trace-v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  ring-buffer: Update the description for ring_buffer_wait
  ring-buffer: Return 0 on success from ring_buffer_resize()
2020-10-23 17:09:38 -07:00
Linus Torvalds
41f762a15a Merge tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull more power management updates from Rafael Wysocki:
 "First of all, the adaptive voltage scaling (AVS) drivers go to new
  platform-specific locations as planned (this part was reported to have
  merge conflicts against the new arm-soc updates in linux-next).

  In addition to that, there are some fixes (intel_idle, intel_pstate,
  RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state
  accounting support to the generic power domains (genpd) code and some
  janitorial changes all over.

  Specifics:

   - Move the AVS drivers to new platform-specific locations and get rid
     of the drivers/power/avs directory (Ulf Hansson).

   - Add on/off notifiers and idle state accounting support to the
     generic power domains (genpd) framework (Ulf Hansson, Lina Iyer).

   - Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson).

   - Make intel_idle disregard ACPI _CST if it cannot use the data
     returned by that method (Mel Gorman).

   - Modify intel_pstate to avoid leaving useless sysfs directory
     structure behind if it cannot be registered (Chen Yu).

   - Fix domain detection in the RAPL power capping driver and prevent
     it from failing to enumerate the Psys RAPL domain (Zhang Rui).

   - Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and
     later AMD chips (Wei Huang).

   - Update the driver assumptions comment in intel_idle and fix a
     kerneldoc comment in the runtime PM framework (Alexander Monakov,
     Bean Huo).

   - Avoid unnecessary resets of the cached frequency in the schedutil
     cpufreq governor to reduce overhead (Wei Wang).

   - Clean up the cpufreq core a bit (Viresh Kumar).

   - Make assorted minor janitorial changes (Daniel Lezcano, Geert
     Uytterhoeven, Hubert Jasudowicz, Tom Rix).

   - Clean up and optimize the cpupower utility somewhat (Colin Ian
     King, Martin Kaistra)"

* tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
  PM: sleep: remove unreachable break
  PM: AVS: Drop the avs directory and the corresponding Kconfig
  PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers
  PM: runtime: Fix typo in pm_runtime_set_active() helper comment
  PM: domains: Fix build error for genpd notifiers
  powercap: Fix typo in Kconfig "Plance" -> "Plane"
  cpufreq: schedutil: restore cached freq when next_f is not changed
  acpi-cpufreq: Honor _PSD table setting on new AMD CPUs
  PM: AVS: smartreflex Move driver to soc specific drivers
  PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers
  PM: domains: enable domain idle state accounting
  PM: domains: Add curly braces to delimit comment + statement block
  PM: domains: Add support for PM domain on/off notifiers for genpd
  powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain
  powercap/intel_rapl: Fix domain detection
  intel_idle: Ignore _CST if control cannot be taken from the platform
  cpuidle: Remove pointless stub
  intel_idle: mention assumption that WBINVD is not needed
  MAINTAINERS: Add section for cpuidle-psci PM domain
  cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver
  ...
2020-10-23 16:27:03 -07:00
Linus Torvalds
3cb12d27ff Merge tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
 "Cross-tree/merge window issues:

   - rtl8150: don't incorrectly assign random MAC addresses; fix late in
     the 5.9 cycle started depending on a return code from a function
     which changed with the 5.10 PR from the usb subsystem

  Current release regressions:

   - Revert "virtio-net: ethtool configurable RXCSUM", it was causing
     crashes at probe when control vq was not negotiated/available

  Previous release regressions:

   - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO
     bus, only first device would be probed correctly

   - nexthop: Fix performance regression in nexthop deletion by
     effectively switching from recently added synchronize_rcu() to
     synchronize_rcu_expedited()

   - netsec: ignore 'phy-mode' device property on ACPI systems; the
     property is not populated correctly by the firmware, but firmware
     configures the PHY so just keep boot settings

  Previous releases - always broken:

   - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing
     bulk transfers getting "stuck"

   - icmp: randomize the global rate limiter to prevent attackers from
     getting useful signal

   - r8169: fix operation under forced interrupt threading, make the
     driver always use hard irqs, even on RT, given the handler is light
     and only wants to schedule napi (and do so through a _irqoff()
     variant, preferably)

   - bpf: Enforce pointer id generation for all may-be-null register
     type to avoid pointers erroneously getting marked as null-checked

   - tipc: re-configure queue limit for broadcast link

   - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN
     tunnels

   - fix various issues in chelsio inline tls driver

  Misc:

   - bpf: improve just-added bpf_redirect_neigh() helper api to support
     supplying nexthop by the caller - in case BPF program has already
     done a lookup we can avoid doing another one

   - remove unnecessary break statements

   - make MCTCP not select IPV6, but rather depend on it"

* tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  tcp: fix to update snd_wl1 in bulk receiver fast path
  net: Properly typecast int values to set sk_max_pacing_rate
  netfilter: nf_fwd_netdev: clear timestamp in forwarding path
  ibmvnic: save changed mac address to adapter->mac_addr
  selftests: mptcp: depends on built-in IPv6
  Revert "virtio-net: ethtool configurable RXCSUM"
  rtnetlink: fix data overflow in rtnl_calcit()
  net: ethernet: mtk-star-emac: select REGMAP_MMIO
  net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup
  net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device
  bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static
  bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh()
  bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop
  mptcp: depends on IPV6 but not as a module
  sfc: move initialisation of efx->filter_sem to efx_init_struct()
  mpls: load mpls_gso after mpls_iptunnel
  net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels
  net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action()
  net: dsa: bcm_sf2: make const array static, makes object smaller
  mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it
  ...
2020-10-23 12:05:49 -07:00
Linus Torvalds
4a22709e21 Merge tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block
Pull arch task_work cleanups from Jens Axboe:
 "Two cleanups that don't fit other categories:

   - Finally get the task_work_add() cleanup done properly, so we don't
     have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates
     all callers, and also fixes up the documentation for
     task_work_add().

   - While working on some TIF related changes for 5.11, this
     TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch
     duplication for how that is handled"

* tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block:
  task_work: cleanup notification modes
  tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()
2020-10-23 10:06:38 -07:00
Linus Torvalds
746b25b1aa Merge tag 'kbuild-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:

 - Support 'make compile_commands.json' to generate the compilation
   database more easily, avoiding stale entries

 - Support 'make clang-analyzer' and 'make clang-tidy' for static checks
   using clang-tidy

 - Preprocess scripts/modules.lds.S to allow CONFIG options in the
   module linker script

 - Drop cc-option tests from compiler flags supported by our minimal
   GCC/Clang versions

 - Use always 12-digits commit hash for CONFIG_LOCALVERSION_AUTO=y

 - Use sha1 build id for both BFD linker and LLD

 - Improve deb-pkg for reproducible builds and rootless builds

 - Remove stale, useless scripts/namespace.pl

 - Turn -Wreturn-type warning into error

 - Fix build error of deb-pkg when CONFIG_MODULES=n

 - Replace 'hostname' command with more portable 'uname -n'

 - Various Makefile cleanups

* tag 'kbuild-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
  kbuild: Use uname for LINUX_COMPILE_HOST detection
  kbuild: Only add -fno-var-tracking-assignments for old GCC versions
  kbuild: remove leftover comment for filechk utility
  treewide: remove DISABLE_LTO
  kbuild: deb-pkg: clean up package name variables
  kbuild: deb-pkg: do not build linux-headers package if CONFIG_MODULES=n
  kbuild: enforce -Werror=return-type
  scripts: remove namespace.pl
  builddeb: Add support for all required debian/rules targets
  builddeb: Enable rootless builds
  builddeb: Pass -n to gzip for reproducible packages
  kbuild: split the build log of kallsyms
  kbuild: explicitly specify the build id style
  scripts/setlocalversion: make git describe output more reliable
  kbuild: remove cc-option test of -Werror=date-time
  kbuild: remove cc-option test of -fno-stack-check
  kbuild: remove cc-option test of -fno-strict-overflow
  kbuild: move CFLAGS_{KASAN,UBSAN,KCSAN} exports to relevant Makefiles
  kbuild: remove redundant CONFIG_KASAN check from scripts/Makefile.kasan
  kbuild: do not create built-in objects for external module builds
  ...
2020-10-22 13:13:57 -07:00
Linus Torvalds
2b71482060 Merge tag 'modules-for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
 "Code cleanups: more informative error messages and statically
  initialize init_free_wq to avoid a workqueue warning"

* tag 'modules-for-v5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
  module: statically initialize init section freeing data
  module: Add more error message for failed kernel module loading
2020-10-22 13:08:57 -07:00
Linus Torvalds
f56e65dff6 Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull initial set_fs() removal from Al Viro:
 "Christoph's set_fs base series + fixups"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Allow a NULL pos pointer to __kernel_read
  fs: Allow a NULL pos pointer to __kernel_write
  powerpc: remove address space overrides using set_fs()
  powerpc: use non-set_fs based maccess routines
  x86: remove address space overrides using set_fs()
  x86: make TASK_SIZE_MAX usable from assembly code
  x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
  lkdtm: remove set_fs-based tests
  test_bitmap: remove user bitmap tests
  uaccess: add infrastructure for kernel builds with set_fs()
  fs: don't allow splice read/write without explicit ops
  fs: don't allow kernel reads and writes without iter ops
  sysctl: Convert to iter interfaces
  proc: add a read_iter method to proc proc_ops
  proc: cleanup the compat vs no compat file ops
  proc: remove a level of indentation in proc_get_inode
2020-10-22 09:59:21 -07:00
Qiujun Huang
e1981f75d3 ring-buffer: Update the description for ring_buffer_wait
The function changed at some point, but the description was not
updated.

Link: https://lkml.kernel.org/r/20201017095246.5170-1-hqjagain@gmail.com

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-22 11:26:26 -04:00
Qiujun Huang
0a1754b2a9 ring-buffer: Return 0 on success from ring_buffer_resize()
We don't need to check the new buffer size, and the return value
had confused resize_buffer_duplicate_size().
...
	ret = ring_buffer_resize(trace_buf->buffer,
		per_cpu_ptr(size_buf->data,cpu_id)->entries, cpu_id);
	if (ret == 0)
		per_cpu_ptr(trace_buf->data, cpu_id)->entries =
			per_cpu_ptr(size_buf->data, cpu_id)->entries;
...

Link: https://lkml.kernel.org/r/20201019142242.11560-1-hqjagain@gmail.com

Cc: stable@vger.kernel.org
Fixes: d60da506cb ("tracing: Add a resize function to make one buffer equivalent to another buffer")
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-10-22 11:24:10 -04:00
Peter Zijlstra
f8e48a3dca lockdep: Fix preemption WARN for spurious IRQ-enable
It is valid (albeit uncommon) to call local_irq_enable() without first
having called local_irq_disable(). In this case we enter
lockdep_hardirqs_on*() with IRQs enabled and trip a preemption warning
for using __this_cpu_read().

Use this_cpu_read() instead to avoid the warning.

Fixes: 4d004099a6 ("lockdep: Fix lockdep recursion")
Reported-by: syzbot+53f8ce8bbc07924b6417@syzkaller.appspotmail.com
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
2020-10-22 12:37:22 +02:00
Sami Tolvanen
0f6372e522 treewide: remove DISABLE_LTO
This change removes all instances of DISABLE_LTO from
Makefiles, as they are currently unused, and the preferred
method of disabling LTO is to filter out the flags instead.

Note added by Masahiro Yamada:
DISABLE_LTO was added as preparation for GCC LTO, but GCC LTO was
not pulled into the mainline. (https://lkml.org/lkml/2014/4/8/272)

Suggested-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-10-21 00:28:53 +09:00
Andrei Vagin
c2f7d08ccc futex: Adjust absolute futex timeouts with per time namespace offset
For all commands except FUTEX_WAIT, the timeout is interpreted as an
absolute value. This absolute value is inside the task's time namespace and
has to be converted to the host's time.

Fixes: 5a590f35ad ("posix-clocks: Wire up clock_gettime() with timens offsets")
Reported-by: Hans van der Laan <j.h.vanderlaan@student.utwente.nl>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20201015160020.293748-1-avagin@gmail.com
2020-10-20 17:02:57 +02:00
Christoph Hellwig
695cebe58d dma-mapping: move more functions to dma-map-ops.h
Due to a mismerge a bunch of prototypes that should have moved to
dma-map-ops.h are still in dma-mapping.h, fix that up.

Signed-off-by: Christoph Hellwig <hch@lst.de>
2020-10-20 10:41:07 +02:00
Martin KaFai Lau
93c230e3f5 bpf: Enforce id generation for all may-be-null register type
The commit af7ec13833 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
introduces RET_PTR_TO_BTF_ID_OR_NULL and
the commit eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
introduces RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL.
Note that for RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, the reg0->type
could become PTR_TO_MEM_OR_NULL which is not covered by
BPF_PROBE_MEM.

The BPF_REG_0 will then hold a _OR_NULL pointer type. This _OR_NULL
pointer type requires the bpf program to explicitly do a NULL check first.
After NULL check, the verifier will mark all registers having
the same reg->id as safe to use.  However, the reg->id
is not set for those new _OR_NULL return types.  One of the ways
that may be wrong is, checking NULL for one btf_id typed pointer will
end up validating all other btf_id typed pointers because
all of them have id == 0.  The later tests will exercise
this path.

To fix it and also avoid similar issue in the future, this patch
moves the id generation logic out of each individual RET type
test in check_helper_call().  Instead, it does one
reg_type_may_be_null() test and then do the id generation
if needed.

This patch also adds a WARN_ON_ONCE in mark_ptr_or_null_reg()
to catch future breakage.

The _OR_NULL pointer usage in the bpf_iter_reg.ctx_arg_info is
fine because it just happens that the existing id generation after
check_ctx_access() has covered it.  It is also using the
reg_type_may_be_null() to decide if id generation is needed or not.

Fixes: af7ec13833 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
Fixes: eaa6bcb71e ("bpf: Introduce bpf_per_cpu_ptr()")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20201019194212.1050855-1-kafai@fb.com
2020-10-19 15:57:42 -07:00