The UEVENT user mode helper is enabled before the initcalls are executed
and is available when the root filesystem has been mounted.
The user mode helper is triggered by device init calls and the executable
might use the futex syscall.
futex_init() is marked __initcall which maps to device_initcall, but there
is no guarantee that futex_init() is invoked _before_ the first device init
call which triggers the UEVENT user mode helper.
If the user mode helper uses the futex syscall before futex_init() then the
syscall crashes with a NULL pointer dereference because the futex subsystem
has not been initialized yet.
Move futex_init() to core_initcall so futexes are initialized before the
root filesystem is mounted and the usermode helper becomes available.
[ tglx: Rewrote changelog ]
Signed-off-by: Yang Yang <yang.yang29@zte.com.cn>
Cc: jiang.biao2@zte.com.cn
Cc: jiang.zhengxiong@zte.com.cn
Cc: zhong.weidong@zte.com.cn
Cc: deng.huali@zte.com.cn
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1483085875-6130-1-git-send-email-yang.yang29@zte.com.cn
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
tick_broadcast_lock is taken from interrupt context, but the following call
chain takes the lock without disabling interrupts:
[ 12.703736] _raw_spin_lock+0x3b/0x50
[ 12.703738] tick_broadcast_control+0x5a/0x1a0
[ 12.703742] intel_idle_cpu_online+0x22/0x100
[ 12.703744] cpuhp_invoke_callback+0x245/0x9d0
[ 12.703752] cpuhp_thread_fun+0x52/0x110
[ 12.703754] smpboot_thread_fn+0x276/0x320
So the following deadlock can happen:
lock(tick_broadcast_lock);
<Interrupt>
lock(tick_broadcast_lock);
intel_idle_cpu_online() is the only place which violates the calling
convention of tick_broadcast_control(). This was caused by the removal of
the smp function call in course of the cpu hotplug rework.
Instead of slapping local_irq_disable/enable() at the call site, we can
relax the calling convention and handle it in the core code, which makes
the whole machinery more robust.
Fixes: 29d7bbada9 ("intel_idle: Remove superfluous SMP fuction call")
Reported-by: Gabriel C <nix.or.die@gmail.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Ruslan Ruslichenko <rruslich@cisco.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: lwn@lwn.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Anna-Maria Gleixner <anna-maria@linutronix.de>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Cc: stable <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1486953115.5912.4.camel@gmx.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If BPF_F_ALLOW_OVERRIDE flag is used in BPF_PROG_ATTACH command
to the given cgroup the descendent cgroup will be able to override
effective bpf program that was inherited from this cgroup.
By default it's not passed, therefore override is disallowed.
Examples:
1.
prog X attached to /A with default
prog Y fails to attach to /A/B and /A/B/C
Everything under /A runs prog X
2.
prog X attached to /A with allow_override.
prog Y fails to attach to /A/B with default (non-override)
prog M attached to /A/B with allow_override.
Everything under /A/B runs prog M only.
3.
prog X attached to /A with allow_override.
prog Y fails to attach to /A with default.
The user has to detach first to switch the mode.
In the future this behavior may be extended with a chain of
non-overridable programs.
Also fix the bug where detach from cgroup where nothing is attached
was not throwing error. Return ENOENT in such case.
Add several testcases and adjust libbpf.
Fixes: 3007098494 ("cgroup: add support for eBPF programs")
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Daniel Mack <daniel@zonque.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow the devname parameter to be NULL and use dev_name(dev) in this case.
This should be an appropriate default for most use cases.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: http://lkml.kernel.org/r/05c63d67-30b4-7026-02d5-ce7fb7bc185f@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timer fix from Ingo Molnar:
"Fix a sporadic missed timer hw reprogramming bug that can result in
random delays"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/nohz: Fix possible missing clock reprog after tick soft restart
While looking through the __ex_table stuff I found that we do a linear
lookup of the module. Also fix up a comment.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
If the irq_desc being output does not have a domain associated the
information following the 'name' is not aligned correctly.
Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Link: http://lkml.kernel.org/r/20170210165416.5629-1-hsweeten@visionengravers.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Currently CONFIG_TIMER_STATS exposes process information across namespaces:
kernel/time/timer_list.c print_timer():
SEQ_printf(m, ", %s/%d", tmp, timer->start_pid);
/proc/timer_list:
#11: <0000000000000000>, hrtimer_wakeup, S:01, do_nanosleep, cron/2570
Given that the tracer can give the same information, this patch entirely
removes CONFIG_TIMER_STATS.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: linux-doc@vger.kernel.org
Cc: Lai Jiangshan <jiangshanlai@gmail.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Xing Gao <xgao01@email.wm.edu>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Jessica Frazelle <me@jessfraz.com>
Cc: kernel-hardening@lists.openwall.com
Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Michal Marek <mmarek@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Olof Johansson <olof@lixom.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-api@vger.kernel.org
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/20170208192659.GA32582@beast
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
ts->next_tick keeps track of the next tick deadline in order to optimize
clock programmation on irq exit and avoid redundant clock device writes.
Now if ts->next_tick missed an update, we may spuriously miss a clock
reprog later as the nohz code is fooled by an obsolete next_tick value.
This is what happens here on a specific path: when we observe an
expired timer from the nohz update code on irq exit, we perform a soft
tick restart which simply fires the closest possible tick without
actually exiting the nohz mode and restoring a periodic state. But we
forget to update ts->next_tick accordingly.
As a result, after the next tick resulting from such soft tick restart,
the nohz code sees a stale value on ts->next_tick which doesn't match
the clock deadline that just expired. If that obsolete ts->next_tick
value happens to collide with the actual next tick deadline to be
scheduled, we may spuriously bypass the clock reprogramming. In the
worst case, the tick may never fire again.
Fix this with a ts->next_tick reset on soft tick restart.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Reviewed: Wanpeng Li <wanpeng.li@hotmail.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1486485894-29173-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The current spinlock lockup detection code can sometimes produce false
positives because of the unfairness of the locking algorithm itself.
So the lockup detection code is now removed. Instead, we are relying
on the NMI watchdog to detect potential lockup. We won't have lockup
detection if the watchdog isn't running.
The commented-out read-write lock lockup detection code are also
removed.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1486583208-11038-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bug messages and stack dump for MAX_LOCKDEP_CHAIN_HLOCKS should only
be printed once.
Signed-off-by: Byungchul Park <byungchul.park@lge.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1484275324-28192-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While supporting file-based address filters for CPU events requires some
extra context switch handling, kernel address filters are easy, since the
kernel mapping is preserved across address spaces. It is also useful as
it permits tracing scheduling paths of the kernel.
This patch allows setting up kernel filters for CPU events.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170126094057.13805-4-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is currently possible to configure a kernel address filter for a
event that excludes kernel from its traces (attr.exclude_kernel==1).
While in reality this doesn't make sense, the SET_FILTER ioctl() should
return a error in such case, currently it does not. Furthermore, it
will still silently discard the filter and any potentially valid filters
that came with it.
This patch makes the SET_FILTER ioctl() error out in such cases.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: Will Deacon <will.deacon@arm.com>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170126094057.13805-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The check for 'running' in sched_move_task() has an unlikely() around it. That
is, it is unlikely that the task being moved is running. That use to be
true. But with a couple of recent updates, it is now likely that the task
will be running.
The first change came from ea86cb4b76 ("sched/cgroup: Fix
cpu_cgroup_fork() handling") that moved around the use case of
sched_move_task() in do_fork() where the call is now done after the task is
woken (hence it is running).
The second change came from 8e5bfa8c1f ("sched/autogroup: Do not use
autogroup->tg in zombie threads") where sched_move_task() is called by the
exit path, by the task that is exiting. Hence it too is running.
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: http://lkml.kernel.org/r/20170206110426.27ca6426@gandalf.local.home
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Alexei had his box explode because doing read() on a package
(rapl/uncore) event that isn't currently scheduled in ends up doing an
out-of-bounds load.
Rework the code to more explicitly deal with event->oncpu being -1.
Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Tested-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: eranian@google.com
Fixes: d6a2f9035b ("perf/core: Introduce PMU_EV_CAP_READ_ACTIVE_PKG")
Link: http://lkml.kernel.org/r/20170131102710.GL6515@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kprobe_exceptions_notify() is not used on some of the architectures such
as arm[64] and powerpc anymore. Introduce a weak variant for such
architectures.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These files were including module.h for exception table related
functions. We've now separated that content out into its own file
"extable.h" so now move over to that and where possible, avoid all
the extra header content in module.h that we don't really need to
compile these non-modular files.
Note:
init/main.c still needs module.h for __init_or_module
kernel/extable.c still needs module.h for is_module_text_address
...and so we don't get the benefit of removing module.h from the cpp
feed for these two files, unlike the almost universal 1:1 exchange
of module.h for extable.h we were able to do in the arch dirs.
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The user_header gets caught by kmemleak with the following splat as
missing a free:
unreferenced object 0xffff99667a733d80 (size 96):
comm "swapper/0", pid 1, jiffies 4294892317 (age 62191.468s)
hex dump (first 32 bytes):
a0 b6 92 b4 ff ff ff ff 00 00 00 00 01 00 00 00 ................
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
kmemleak_alloc+0x4a/0xa0
__kmalloc+0x144/0x260
__register_sysctl_table+0x54/0x5e0
register_sysctl+0x1b/0x20
user_namespace_sysctl_init+0x17/0x34
do_one_initcall+0x52/0x1a0
kernel_init_freeable+0x173/0x200
kernel_init+0xe/0x100
ret_from_fork+0x2c/0x40
The BUG_ON()s are intended to crash so no need to clean up after
ourselves on error there. This is also a kernel/ subsys_init() we don't
need a respective exit call here as this is never modular, so just white
list it.
Link: http://lkml.kernel.org/r/20170203211404.31458-1-mcgrof@kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nikolay Borisov <n.borisov.lkml@gmail.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cap the maximum (total) value size and bail out if larger than KMALLOC_MAX_SIZE
as otherwise it doesn't make any sense to proceed further, since we're
guaranteed to fail to allocate elements anyway in lpm_trie_node_alloc();
likleyhood of failure is still high for large values, though, similarly
as with htab case in non-prealloc.
Next, make sure that cost vars are really u64 instead of size_t, so that we
don't overflow on 32 bit and charge only tiny map.pages against memlock while
allowing huge max_entries; cap also the max cost like we do with other map
types.
Fixes: b95a5c4db0 ("bpf: add a longest prefix match trie map implementation")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
We do suppress_message_printing() check before we call
call_console_drivers() now, so `level' param is not needed
anymore.
Link: http://lkml.kernel.org/r/20161224140902.1962-2-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
This patch converts the rest of logbuf users (which are
out of printk recursion case, but can deadlock in printk).
To make printk-safe usage easier the patch introduces 4
helper macros:
- logbuf_lock_irq()/logbuf_unlock_irq()
lock/unlock the logbuf lock and disable/enable local IRQ
- logbuf_lock_irqsave(flags)/logbuf_unlock_irqrestore(flags)
lock/unlock the logbuf lock and saves/restores local IRQ state
Link: http://lkml.kernel.org/r/20161227141611.940-9-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
We use printk-safe now which makes printk-recursion detection code
in vprintk_emit() unreachable. The tricky thing here is that, apart
from detecting and reporting printk recursions, that code also used
to zap_locks() in case of panic() from the same CPU. However,
zap_locks() does not look to be needed anymore:
1) Since commit 08d78658f3 ("panic: release stale console lock to
always get the logbuf printed out") panic flushing of `logbuf' to
console ignores the state of `console_sem' by doing
panic()
console_trylock();
console_unlock();
2) Since commit cf9b1106c8 ("printk/nmi: flush NMI messages on the
system panic") panic attempts to zap the `logbuf_lock' spin_lock to
successfully flush nmi messages to `logbuf'.
Basically, it seems that we either already do what zap_locks() used to
do but in other places or we ignore the state of the lock. The only
reaming difference is that we don't re-init the console semaphore in
printk_safe_flush_on_panic(), but this is not necessary because we
don't call console drivers from printk_safe_flush_on_panic() due to
the fact that we are using a deferred printk() version (as was
suggested by Petr Mladek).
Link: http://lkml.kernel.org/r/20161227141611.940-8-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Use printk_safe per-CPU buffers in printk recursion-prone blocks:
-- around logbuf_lock protected sections in vprintk_emit() and
console_unlock()
-- around down_trylock_console_sem() and up_console_sem()
Note that this solution addresses deadlocks caused by printk()
recursive calls only. That is vprintk_emit() and console_unlock().
The rest will be converted in a followup patch.
Another thing to note is that we now keep lockdep enabled in printk,
because we are protected against the printk recursion caused by
lockdep in vprintk_emit() by the printk-safe mechanism - we first
switch to per-CPU buffers and only then access the deadlock-prone
locks.
Examples:
1) printk() from logbuf_lock spin_lock section
Assume the following code:
printk()
raw_spin_lock(&logbuf_lock);
WARN_ON(1);
raw_spin_unlock(&logbuf_lock);
which now produces:
------------[ cut here ]------------
WARNING: CPU: 0 PID: 366 at kernel/printk/printk.c:1811 vprintk_emit
CPU: 0 PID: 366 Comm: bash
Call Trace:
warn_slowpath_null+0x1d/0x1f
vprintk_emit+0x1cd/0x438
vprintk_default+0x1d/0x1f
printk+0x48/0x50
[..]
2) printk() from semaphore sem->lock spin_lock section
Assume the following code
printk()
console_trylock()
down_trylock()
raw_spin_lock_irqsave(&sem->lock, flags);
WARN_ON(1);
raw_spin_unlock_irqrestore(&sem->lock, flags);
which now produces:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 363 at kernel/locking/semaphore.c:141 down_trylock
CPU: 1 PID: 363 Comm: bash
Call Trace:
warn_slowpath_null+0x1d/0x1f
down_trylock+0x3d/0x62
? vprintk_emit+0x3f9/0x414
console_trylock+0x31/0xeb
vprintk_emit+0x3f9/0x414
vprintk_default+0x1d/0x1f
printk+0x48/0x50
[..]
3) printk() from console_unlock()
Assume the following code:
printk()
console_unlock()
raw_spin_lock(&logbuf_lock);
WARN_ON(1);
raw_spin_unlock(&logbuf_lock);
which now produces:
------------[ cut here ]------------
WARNING: CPU: 1 PID: 329 at kernel/printk/printk.c:2384 console_unlock
CPU: 1 PID: 329 Comm: bash
Call Trace:
warn_slowpath_null+0x18/0x1a
console_unlock+0x12d/0x559
? trace_hardirqs_on_caller+0x16d/0x189
? trace_hardirqs_on+0xd/0xf
vprintk_emit+0x363/0x374
vprintk_default+0x18/0x1a
printk+0x43/0x4b
[..]
4) printk() from try_to_wake_up()
Assume the following code:
printk()
console_unlock()
up()
try_to_wake_up()
raw_spin_lock_irqsave(&p->pi_lock, flags);
WARN_ON(1);
raw_spin_unlock_irqrestore(&p->pi_lock, flags);
which now produces:
------------[ cut here ]------------
WARNING: CPU: 3 PID: 363 at kernel/sched/core.c:2028 try_to_wake_up
CPU: 3 PID: 363 Comm: bash
Call Trace:
warn_slowpath_null+0x1d/0x1f
try_to_wake_up+0x7f/0x4f7
wake_up_process+0x15/0x17
__up.isra.0+0x56/0x63
up+0x32/0x42
__up_console_sem+0x37/0x55
console_unlock+0x21e/0x4c2
vprintk_emit+0x41c/0x462
vprintk_default+0x1d/0x1f
printk+0x48/0x50
[..]
5) printk() from call_console_drivers()
Assume the following code:
printk()
console_unlock()
call_console_drivers()
...
WARN_ON(1);
which now produces:
------------[ cut here ]------------
WARNING: CPU: 2 PID: 305 at kernel/printk/printk.c:1604 call_console_drivers
CPU: 2 PID: 305 Comm: bash
Call Trace:
warn_slowpath_null+0x18/0x1a
call_console_drivers.isra.6.constprop.16+0x3a/0xb0
console_unlock+0x471/0x48e
vprintk_emit+0x1f4/0x206
vprintk_default+0x18/0x1a
vprintk_func+0x6e/0x70
printk+0x3e/0x46
[..]
6) unsupported placeholder in printk() format now prints an actual
warning from vscnprintf(), instead of
'BUG: recent printk recursion!'.
------------[ cut here ]------------
WARNING: CPU: 5 PID: 337 at lib/vsprintf.c:1900 format_decode
Please remove unsupported %
in format string
CPU: 5 PID: 337 Comm: bash
Call Trace:
dump_stack+0x4f/0x65
__warn+0xc2/0xdd
warn_slowpath_fmt+0x4b/0x53
format_decode+0x22c/0x308
vsnprintf+0x89/0x3b7
vscnprintf+0xd/0x26
vprintk_emit+0xb4/0x238
vprintk_default+0x1d/0x1f
vprintk_func+0x6c/0x73
printk+0x43/0x4b
[..]
Link: http://lkml.kernel.org/r/20161227141611.940-7-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Account lost messages in pritk-safe and printk-safe-nmi
contexts and report those numbers during printk_safe_flush().
The patch also moves lost message counter to struct
`printk_safe_seq_buf' instead of having dedicated static
counters - this simplifies the code.
Link: http://lkml.kernel.org/r/20161227141611.940-6-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Always use printk_deferred() in printk_safe_flush_line().
Flushing can be done from NMI or printk_safe contexts (when
we are in panic), so we can't call console drivers, yet still
want to store the messages in the logbuf buffer. Therefore we
use a deferred printk version.
Link: http://lkml.kernel.org/r/20170206164253.GA463@tigerII.localdomain
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Suggested-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
This patch extends the idea of NMI per-cpu buffers to regions
that may cause recursive printk() calls and possible deadlocks.
Namely, printk() can't handle printk calls from schedule code
or printk() calls from lock debugging code (spin_dump() for instance);
because those may be called with `sem->lock' already taken or any
other `critical' locks (p->pi_lock, etc.). An example of deadlock
can be
vprintk_emit()
console_unlock()
up() << raw_spin_lock_irqsave(&sem->lock, flags);
wake_up_process()
try_to_wake_up()
ttwu_queue()
ttwu_activate()
activate_task()
enqueue_task()
enqueue_task_fair()
cfs_rq_of()
task_of()
WARN_ON_ONCE(!entity_is_task(se))
vprintk_emit()
console_trylock()
down_trylock()
raw_spin_lock_irqsave(&sem->lock, flags)
^^^^ deadlock
and some other cases.
Just like in NMI implementation, the solution uses a per-cpu
`printk_func' pointer to 'redirect' printk() calls to a 'safe'
callback, that store messages in a per-cpu buffer and flushes
them back to logbuf buffer later.
Usage example:
printk()
printk_safe_enter_irqsave(flags)
//
// any printk() call from here will endup in vprintk_safe(),
// that stores messages in a special per-CPU buffer.
//
printk_safe_exit_irqrestore(flags)
The 'redirection' mechanism, though, has been reworked, as suggested
by Petr Mladek. Instead of using a per-cpu @print_func callback we now
keep a per-cpu printk-context variable and call either default or nmi
vprintk function depending on its value. printk_nmi_entrer/exit and
printk_safe_enter/exit, thus, just set/celar corresponding bits in
printk-context functions.
The patch only adds printk_safe support, we don't use it yet.
Link: http://lkml.kernel.org/r/20161227141611.940-4-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
A preparation patch for printk_safe work. No functional change.
- rename nmi.c to print_safe.c
- add `printk_safe' prefix to some (which used both by printk-safe
and printk-nmi) of the exported functions.
Link: http://lkml.kernel.org/r/20161227141611.940-3-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
vprintk(), just like printk(), better be using per-cpu printk_func
instead of direct vprintk_emit() call. Just in case if vprintk()
will ever be called from NMI, or from any other context that can
deadlock in printk().
Link: http://lkml.kernel.org/r/20161227141611.940-2-sergey.senozhatsky@gmail.com
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Cc: Calvin Owens <calvinowens@fb.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Hurley <peter@hurleysoftware.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
The names are all 'autogroup', not 'auto_group' - so rename
the kernel/sched/auto_group.[ch] to match the existing
nomenclature.
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since KERN_CONT became meaningful again, lockdep stack traces have had
annoying extra newlines, like this:
[ 5.561122] -> #1 (B){+.+...}:
[ 5.561528]
[ 5.561532] [<ffffffff810d8873>] lock_acquire+0xc3/0x210
[ 5.562178]
[ 5.562181] [<ffffffff816f6414>] mutex_lock_nested+0x74/0x6d0
[ 5.562861]
[ 5.562880] [<ffffffffa01aa3c3>] init_btrfs_fs+0x21/0x196 [btrfs]
[ 5.563717]
[ 5.563721] [<ffffffff81000472>] do_one_initcall+0x52/0x1b0
[ 5.564554]
[ 5.564559] [<ffffffff811a3af6>] do_init_module+0x5f/0x209
[ 5.565357]
[ 5.565361] [<ffffffff81122f4d>] load_module+0x218d/0x2b80
[ 5.566020]
[ 5.566021] [<ffffffff81123beb>] SyS_finit_module+0xeb/0x120
[ 5.566694]
[ 5.566696] [<ffffffff816fd241>] entry_SYSCALL_64_fastpath+0x1f/0xc2
That's happening because each printk() call now gets printed on its own
line, and we do a separate call to print the spaces before the symbol.
Fix it by doing the printk() directly instead of using the
print_ip_sym() helper.
Additionally, the symbol address isn't very helpful, so let's get rid of
that, too. The final result looks like this:
[ 5.194518] -> #1 (B){+.+...}:
[ 5.195002] lock_acquire+0xc3/0x210
[ 5.195439] mutex_lock_nested+0x74/0x6d0
[ 5.196491] do_one_initcall+0x52/0x1b0
[ 5.196939] do_init_module+0x5f/0x209
[ 5.197355] load_module+0x218d/0x2b80
[ 5.197792] SyS_finit_module+0xeb/0x120
[ 5.198251] entry_SYSCALL_64_fastpath+0x1f/0xc2
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Omar Sandoval <osandov@fb.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Fixes: 4bcc595ccd ("printk: reinstate KERN_CONT for printing continuation lines")
Link: http://lkml.kernel.org/r/43b4e114724b2bdb0308fa86cb33aa07d3d67fad.1486510315.git.osandov@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The conflict was an interaction between a bug fix in the
netvsc driver in 'net' and an optimization of the RX path
in 'net-next'.
Signed-off-by: David S. Miller <davem@davemloft.net>
Both of these options are poorly named. The features they provide are
necessary for system security and should not be considered debug only.
Change the names to CONFIG_STRICT_KERNEL_RWX and
CONFIG_STRICT_MODULE_RWX to better describe what these options do.
Signed-off-by: Laura Abbott <labbott@redhat.com>
Acked-by: Jessica Yu <jeyu@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Over the years sched/core.c accumulated over 50 #include lines,
40 of which are superfluous. (!)
Removing them decreases the preprocessed .c file (.i) size noticeably:
triton:~/tip> wc -l kernel/sched/core.i
Before: 76387 kernel/sched/core.i
After: 75896 kernel/sched/core.i
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
update_rq_clock_task() and update_rq_clock() we unnecessarily
spread across core.c, requiring an extra prototype line.
Move them next to each other and in the proper order.
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Refresh the comments in the core scheduler code:
- Capitalize sentences consistently
- Capitalize 'CPU' consistently
- ... and other small details.
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The patch fixes the case when adding a zero value to the packet
pointer. The zero value could come from src_reg equals type
BPF_K or CONST_IMM. The patch fixes both, otherwise the verifer
reports the following error:
[...]
R0=imm0,min_value=0,max_value=0
R1=pkt(id=0,off=0,r=4)
R2=pkt_end R3=fp-12
R4=imm4,min_value=4,max_value=4
R5=pkt(id=0,off=4,r=4)
269: (bf) r2 = r0 // r2 becomes imm0
270: (77) r2 >>= 3
271: (bf) r4 = r1 // r4 becomes pkt ptr
272: (0f) r4 += r2 // r4 += 0
addition of negative constant to packet pointer is not allowed
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: Mihai Budiu <mbudiu@vmware.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Pull irq fixes from Thomas Gleixner:
- Prevent double activation of interrupt lines, which causes problems
on certain interrupt controllers
- Handle the fallout of the above because x86 (ab)uses the activation
function to reconfigure interrupts under the hood.
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/irq: Make irq activate operations symmetric
irqdomain: Avoid activating interrupts more than once
It was observed that on an Intel x86 system without the ARAT (Always
running APIC timer) feature and with fairly large number of CPUs as
well as CPUs coming in and out of intel_idle frequently, the lock
contention on the tick_broadcast_lock can become significant.
To reduce contention, the lock is put into its own cacheline and all
the cpumask_var_t variables are put into the __read_mostly section.
Running the SP benchmark of the NAS Parallel Benchmarks on a 4-socket
16-core 32-thread Nehalam system, the performance number improved
from 3353.94 Mop/s to 3469.31 Mop/s when this patch was applied on
a 4.9.6 kernel. This is a 3.4% improvement.
Signed-off-by: Waiman Long <longman@redhat.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1485799063-20857-1-git-send-email-longman@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Steven suggested to improve trace_print_hex_seq() a bit after commit
2acae0d5b0 ("trace: add variant without spacing in trace_print_hex_seq")
in two ways: i) by adding a kdoc comment for the helper function
itself and ii) by renaming 'spacing' argument into 'concatenate'
to better denote that we don't add spaces between each hex bytes.
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
The modversion symbol CRCs are emitted as ELF symbols, which allows us
to easily populate the kcrctab sections by relying on the linker to
associate each kcrctab slot with the correct value.
This has a couple of downsides:
- Given that the CRCs are treated as memory addresses, we waste 4 bytes
for each CRC on 64 bit architectures,
- On architectures that support runtime relocation, a R_<arch>_RELATIVE
relocation entry is emitted for each CRC value, which identifies it
as a quantity that requires fixing up based on the actual runtime
load offset of the kernel. This results in corrupted CRCs unless we
explicitly undo the fixup (and this is currently being handled in the
core module code)
- Such runtime relocation entries take up 24 bytes of __init space
each, resulting in a x8 overhead in [uncompressed] kernel size for
CRCs.
Switching to explicit 32 bit values on 64 bit architectures fixes most
of these issues, given that 32 bit values are not treated as quantities
that require fixing up based on the actual runtime load offset. Note
that on some ELF64 architectures [such as PPC64], these 32-bit values
are still emitted as [absolute] runtime relocatable quantities, even if
the value resolves to a build time constant. Since relative relocations
are always resolved at build time, this patch enables MODULE_REL_CRCS on
powerpc when CONFIG_RELOCATABLE=y, which turns the absolute CRC
references into relative references into .rodata where the actual CRC
value is stored.
So redefine all CRC fields and variables as u32, and redefine the
__CRC_SYMBOL() macro for 64 bit builds to emit the CRC reference using
inline assembler (which is necessary since 64-bit C code cannot use
32-bit types to hold memory addresses, even if they are ultimately
resolved using values that do not exceed 0xffffffff). To avoid
potential problems with legacy 32-bit architectures using legacy
toolchains, the equivalent C definition of the kcrctab entry is retained
for 32-bit architectures.
Note that this mostly reverts commit d4703aefdb ("module: handle ppc64
relocating kcrctabs when CONFIG_RELOCATABLE=y")
Acked-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If process forks some children when it has is_child_subreaper
flag enabled they will inherit has_child_subreaper flag - first
group, when is_child_subreaper is disabled forked children will
not inherit it - second group. So child-subreaper does not reparent
all his descendants when their parents die. Having these two
differently behaving groups can lead to confusion. Also it is
a problem for CRIU, as when we restore process tree we need to
somehow determine which descendants belong to which group and
much harder - to put them exactly to these group.
To simplify these we can add a propagation of has_child_subreaper
flag on PR_SET_CHILD_SUBREAPER, walking all descendants of child-
subreaper to setup has_child_subreaper flag.
In common cases when process like systemd first sets itself to
be a child-subreaper and only after that forks its services, we will
have zero-length list of descendants to walk. Testing with binary
subtree of 2^15 processes prctl took < 0.007 sec and has shown close
to linear dependency(~0.2 * n * usec) on lower numbers of processes.
Moreover, I doubt someone intentionaly pre-forks the children whitch
should reparent to init before becoming subreaper, because some our
ancestor migh have had is_child_subreaper flag while forking our
sub-tree and our childs will all inherit has_child_subreaper flag,
and we have no way to influence it. And only way to check if we have
no has_child_subreaper flag is to create some childs, kill them and
see where they will reparent to.
Using walk_process_tree helper to walk subtree, thanks to Oleg! Timing
seems to be the same.
Optimize:
a) When descendant already has has_child_subreaper flag all his subtree
has it too already.
* for a) to be true need to move has_child_subreaper inheritance under
the same tasklist_lock with adding task to its ->real_parent->children
as without it process can inherit zero has_child_subreaper, then we
set 1 to it's parent flag, check that parent has no more children, and
only after child with wrong flag is added to the tree.
* Also make these inheritance more clear by using real_parent instead of
current, as on clone(CLONE_PARENT) if current has is_child_subreaper
and real_parent has no is_child_subreaper or has_child_subreaper, child
will have has_child_subreaper flag set without actually having a
subreaper in it's ancestors.
b) When some descendant is child_reaper, it's subtree is in different
pidns from us(original child-subreaper) and processes from other pidns
will never reparent to us.
So we can skip their(a,b) subtree from walk.
v2: switch to walk_process_tree() general helper, move
has_child_subreaper inheritance
v3: remove csr_descendant leftover, change current to real_parent
in has_child_subreaper inheritance
v4: small commit message fix
Fixes: ebec18a6d3 ("prctl: add PR_{SET,GET}_CHILD_SUBREAPER to allow simple process supervision")
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Add the new helper to walk the process tree, the next patch adds a user.
Note that it visits the group leaders only, proc_visitor can do
for_each_thread itself or we can trivially extend walk_process_tree() to
do this.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Pull perf fixes from Ingo Molnar:
"Five kernel fixes:
- an mmap tracing ABI fix for certain mappings
- a use-after-free fix, found via KASAN
- three CPU hotplug related x86 PMU driver fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Make package handling more robust
perf/x86/intel/uncore: Clean up hotplug conversion fallout
perf/x86/intel/rapl: Make package handling more robust
perf/core: Fix PERF_RECORD_MMAP2 prot/flags for anonymous memory
perf/core: Fix use-after-free bug
We may already have a directory to put the blktrace stuff in if
1. The disk uses blk-mq
2. CONFIG_BLK_DEBUG_FS is enabled
3. We are tracing the whole disk and not a partition
Instead of hardcoding this very specific case, let's use the new
debugfs_lookup(). If the directory exists, we use it, otherwise we
create one and clean it up later.
Fixes: 07e4fead45 ("blk-mq: create debugfs directory tree")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
When I added the blk-mq debugging information to debugfs, I didn't
notice that blktrace also creates a "block" directory in debugfs. Make
them use the same dentry, now created in the core block code. Based on a
patch from Jens.
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
clang complains about "__init" being attached to a struct name:
kernel/trace/trace_kprobe.c:1375:15: error: '__section__' attribute only applies to functions and global variables
The intention must have been to mark the function as __init instead of
the type, so move the attribute there.
Link: http://lkml.kernel.org/r/20170201165826.2625888-1-arnd@arndb.de
Fixes: f18f97ac43 ("tracing/kprobes: Add a helper method to return number of probe hits")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
To support unprivileged users mounting filesystems two permission
checks have to be performed: a test to see if the user allowed to
create a mount in the mount namespace, and a test to see if
the user is allowed to access the specified filesystem.
The automount case is special in that mounting the original filesystem
grants permission to mount the sub-filesystems, to any user who
happens to stumble across the their mountpoint and satisfies the
ordinary filesystem permission checks.
Attempting to handle the automount case by using override_creds
almost works. It preserves the idea that permission to mount
the original filesystem is permission to mount the sub-filesystem.
Unfortunately using override_creds messes up the filesystems
ordinary permission checks.
Solve this by being explicit that a mount is a submount by introducing
vfs_submount, and using it where appropriate.
vfs_submount uses a new mount internal mount flags MS_SUBMOUNT, to let
sget and friends know that a mount is a submount so they can take appropriate
action.
sget and sget_userns are modified to not perform any permission checks
on submounts.
follow_automount is modified to stop using override_creds as that
has proven problemantic.
do_mount is modified to always remove the new MS_SUBMOUNT flag so
that we know userspace will never by able to specify it.
autofs4 is modified to stop using current_real_cred that was put in
there to handle the previous version of submount permission checking.
cifs is modified to pass the mountpoint all of the way down to vfs_submount.
debugfs is modified to pass the mountpoint all of the way down to
trace_automount by adding a new parameter. To make this change easier
a new typedef debugfs_automount_t is introduced to capture the type of
the debugfs automount function.
Cc: stable@vger.kernel.org
Fixes: 069d5ac9ae ("autofs: Fix automounts by using current_real_cred()->uid")
Fixes: aeaa4a79ff ("fs: Call d_automount with the filesystems creds")
Reviewed-by: Trond Myklebust <trond.myklebust@primarydata.com>
Reviewed-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
We added the 'sched_rr_timeslice_ms' SCHED_RR tuning knob in this commit:
ce0dbbbb30 ("sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice")
... which name suggests to users that it's in milliseconds, while in reality
it's being set in milliseconds but the result is shown in jiffies.
This is obviously confusing when HZ is not 1000, it makes it appear like the
value set failed, such as HZ=100:
root# echo 100 > /proc/sys/kernel/sched_rr_timeslice_ms
root# cat /proc/sys/kernel/sched_rr_timeslice_ms
10
Fix this to be milliseconds all around.
Signed-off-by: Shile Zhang <shile.zhang@nokia.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1485612049-20923-1-git-send-email-shile.zhang@nokia.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Turn the full dynticks cputime clock source to return nsec while keeping
its very internals jiffies based for performance reasons.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-27-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is the final step toward tick based cputime conversion. Now that
the whole cputime accounting engine accounts in nsecs, we can convert the
very source of the cputime to account in nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-26-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is one more step toward converting cputime accounting to pure nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-25-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is one more step toward converting cputime accounting to pure nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-24-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is one more step toward converting cputime accounting to pure nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-23-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is one more step toward converting cputime accounting to pure nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-22-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Also convert itimers to use nsec based internal counters. This simplifies
it and removes the whole game with error/inc_error which served to deal
with cputime_t random granularity.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-20-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Also convert posix-cpu-timers to use nsec based internal counters to
simplify it.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-19-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The irqtime is accounted is nsecs and stored in
cpu_irq_time.hardirq_time and cpu_irq_time.softirq_time. Once the
accumulated amount reaches a new jiffy, this one gets accounted to the
kcpustat.
This was necessary when kcpustat was stored in cputime_t, which could at
worst have jiffies granularity. But now kcpustat is stored in nsecs
so this whole discretization game with temporary irqtime storage has
become unnecessary.
We can now directly account the irqtime to the kcpustat.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-17-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-16-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-15-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-14-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new nsec based cputime accessors as part of the whole cputime
conversion from cputime_t to nsecs.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-13-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This API returns a task's cputime in cputime_t in order to ease the
conversion of cputime internals to use nsecs units instead. Blindly
converting all cputime readers to use this API now will later let us
convert more smoothly and step by step all these places to use the
new nsec based cputime.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-7-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cputime_t is being obsolete and replaced by nsecs units in order to make
internal timestamps less opaque and more granular.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-6-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kernel CPU stats are stored in cputime_t which is an architecture
defined type, and hence a bit opaque and requiring accessors and mutators
for any operation.
Converting them to nsecs simplifies the code and is one step toward
the removal of cputime_t in the core code.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-4-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This will be needed for the cputime_t to nsec conversion.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-2-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
NSEC_PER_JIFFY is an ad-hoc redefinition of TICK_NSEC. Let's rather
use a unique and well maintained version.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-1-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
find_new_reaper() checks same_thread_group(reaper, child_reaper) to
prevent the cross-namespace reparenting but this is not enough if the
exiting parent was injected by setns() + fork().
Suppose we have a process P in the root namespace and some namespace X.
P does setns() to enter the X namespace, and forks the child C.
C forks a grandchild G and exits.
The grandchild G should be re-parented to X->child_reaper, but in this
case the ->real_parent chain does not lead to ->child_reaper, so it will
be wrongly reparanted to P's sub-reaper or a global init.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
not migrate after the first instance. I found that there was as small
bug in the logic, and fixed it. It's minor, but should be fixed regardless.
There's not much impact outside the hwlat tracer.
-----BEGIN PGP SIGNATURE-----
iQExBAABCAAbBQJYkP7tFBxyb3N0ZWR0QGdvb2RtaXMub3JnAAoJEMm5BfJq2Y3L
srEH/iqi0mC4rJF57yLmwGXVAuwSRsxHlkGfyG3RBRnXOhiPLaM9Y6QZDWj8f99D
DEnVJ7IMpW7tjJB0pDQBCMBi+NX6GvR9Zu4ub8k69UzDlw81CPdMiJ/HfEFsP5XA
EOiDH/xGsYmbgGyqbU2VTb4lS8CijStht0jhydriLT1ga/W3bOl0/w6TeQW+AwqJ
0tcksaKwAcH6iN11FUQfWPlirQ/aCTvj8FelgbT7MjYnwxDdbJSfKG+GwJjytLNQ
Rj/STWD56OZpl91OuEYA50BZVxp3eF870xJXsLjg3mGIYBseW/T6ZTP8TOM+6WoH
u2J08p8Owdtdo+9gkw+IwEuoFP4=
=VCLk
-----END PGP SIGNATURE-----
Merge tag 'trace-4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"It was reported to me that the thread created by the hwlat tracer does
not migrate after the first instance. I found that there was as small
bug in the logic, and fixed it. It's minor, but should be fixed
regardless. There's not much impact outside the hwlat tracer"
* tag 'trace-4.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix hwlat kthread migration
Pull cgroup fix from Tejun Heo:
"The cgroup creation path was getting the order of operations wrong and
exposing cgroups which don't have their names set yet to controllers
which can lead to NULL derefs.
This contains the fix for the bug"
* 'for-4.10-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: don't online subsystems before cgroup_name/path() are operational
This can be used to check for fs vs non-fs requests and basically
removes all knowledge of BLOCK_PC specific from the block layer,
as well as preparing for removing the cmd_type field in struct request.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
The hwlat tracer creates a kernel thread at start of the tracer. It is
pinned to a single CPU and will move to the next CPU after each period of
running. If the user modifies the migration thread's affinity, it will not
change after that happens.
The original code created the thread at the first instance it was called,
but later was changed to destroy the thread after the tracer was finished,
and would not be created until the next instance of the tracer was
established. The code that initialized the affinity was only called on the
initial instantiation of the tracer. After that, it was not initialized, and
the previous affinity did not match the current newly created one, making
it appear that the user modified the thread's affinity when it did not, and
the thread failed to migrate again.
Cc: stable@vger.kernel.org
Fixes: 0330f7aa8e ("tracing: Have hwlat trace migrate across tracing_cpumask CPUs")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Add back the "tainting kernel with TAINT_LIVEPATCH" kernel log message
that commit 2992ef29ae ("livepatch/module: make TAINT_LIVEPATCH
module-specific") dropped. Now that it's a module-specific taint flag,
include the module name.
Signed-off-by: Joe Lawrence <joe.lawrence@redhat.com>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Jessica Yu <jeyu@redhat.com>
Since commit f3b0946d62 ("genirq/msi: Make sure PCI MSIs are
activated early"), we can end-up activating a PCI/MSI twice (once
at allocation time, and once at startup time).
This is normally of no consequences, except that there is some
HW out there that may misbehave if activate is used more than once
(the GICv3 ITS, for example, uses the activate callback
to issue the MAPVI command, and the architecture spec says that
"If there is an existing mapping for the EventID-DeviceID
combination, behavior is UNPREDICTABLE").
While this could be worked around in each individual driver, it may
make more sense to tackle the issue at the core level. In order to
avoid getting in that situation, let's have a per-interrupt flag
to remember if we have already activated that interrupt or not.
Fixes: f3b0946d62 ("genirq/msi: Make sure PCI MSIs are activated early")
Reported-and-tested-by: Andre Przywara <andre.przywara@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1484668848-24361-1-git-send-email-marc.zyngier@arm.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
perf has additional overhead when monitoring the task which
frequently generates child tasks.
perf_init_event() is one of the hotspots for the additional overhead:
Currently, to get the PMU, it tries to search the type in pmu_idr at
first. But it is not always successful, especially for the widely used
PERF_TYPE_HARDWARE and PERF_TYPE_HW_CACHE events. So it has to go to the
slow path which go through the whole PMUs list.
It will be a big performance issue, if the PMUs list is long (e.g. server
with many uncore boxes) and the task frequently generates child tasks.
The child event inherits its parent event. So the child event should
try its parent PMU first.
Here is some data from the overhead test on Broadwell server:
perf record -e $TEST_EVENTS -- ./loop.sh 50000
loop.sh
start=$(date +%s%N)
i=0
while [ "$i" -le "$1" ]
do
date > /dev/null
i=`expr $i + 1`
done
end=$(date +%s%N)
elapsed=`expr $end - $start`
Event# Original elapsed time Elapsed time with patch delta
1 196,573,192,397 189,162,029,998 -3.77%
2 257,567,753,013 241,620,788,683 -6.19%
4 398,730,726,971 370,518,938,714 -7.08%
8 824,983,761,120 740,702,489,329 -10.22%
16 1,883,411,923,498 1,672,027,508,355 -11.22%
... which shows a nice performance improvement.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1484745662-15928-2-git-send-email-kan.liang@intel.com
[ Tidied up the changelog and the code comment. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When new events are added to an active context, we go and reschedule
all cpu groups and all task groups in order to preserve the priority
(cpu pinned, task pinned, cpu flexible, task flexible), but in
reality we only need to reschedule groups of the same priority as
that of the events being added, and below.
This patch changes the behavior so that only groups that need to be
rescheduled are rescheduled.
Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170119164330.22887-3-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the sched-in path, we first remove a CPU's flexible events in order to
give priority to the task's pinned events. However, this step can be safely
skipped if the task doesn't have its own pinned events.
This patch implements this skipping.
Reported-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: vince@deater.net
Link: http://lkml.kernel.org/r/20170119164330.22887-2-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
cpuctx->unique_pmu was originally introduced as a way to identify cpuctxs
with shared pmus in order to avoid visiting the same cpuctx more than once
in a for_each_pmu loop.
cpuctx->unique_pmu == cpuctx->pmu in non-software task contexts since they
have only one pmu per cpuctx. Since perf_pmu_sched_task() is only called in
hw contexts, this patch replaces cpuctx->unique_pmu by cpuctx->pmu in it.
The change above, together with the previous patch in this series, removed
the remaining uses of cpuctx->unique_pmu, so we remove it altogether.
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170118192454.58008-3-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch follows from a conversation in CQM/CMT's last series about
speeding up the context switch for cgroup events:
https://patchwork.kernel.org/patch/9478617/
This is a low-hanging fruit optimization. It replaces the iteration over
the "pmus" list in cgroup switch by an iteration over a new list that
contains only cpuctxs with at least one cgroup event.
This is necessary because the number of PMUs have increased over the years
e.g modern x86 server systems have well above 50 PMUs.
The iteration over the full PMU list is unneccessary and can be costly in
heavy cache contention scenarios.
Below are some instrumentation measurements with 10, 50 and 90 percentiles
of the total cost of context switch before and after this optimization for
a simple array read/write microbenchark.
Contention
Level Nr events Before (us) After (us) Median
L2 L3 types (10%, 50%, 90%) (10%, 50%, 90% Speedup
--------------------------------------------------------------------------
Low Low 1 (1.72, 2.42, 5.85) (1.35, 1.64, 5.46) 29%
High Low 1 (2.08, 4.56, 19.8) (1720, 2.20, 13.7) 51%
High High 1 (2.86, 10.4, 12.7) (2.54, 4.32, 12.1) 58%
Low Low 2 (1.98, 3.20, 6.89) (1.68, 2.41, 8.89) 24%
High Low 2 (2.48, 5.28, 22.4) (2150, 3.69, 14.6) 30%
High High 2 (3.32, 8.09, 13.9) (2.80, 5.15, 13.7) 36%
where:
1 event type = cycles
2 event types = cycles,intel_cqm/llc_occupancy/
Contention L2 Low: workset < L2 cache size.
High: " >> L2 " " .
Contention L3 Low: workset of task on all sockets < L3 cache size.
High: " " " " " " >> L3 " " .
Median Speedup is (50%ile Before - 50%ile After) / 50%ile Before
Unsurprisingly, the benefits of this optimization decrease with the number
of cpuctxs with a cgroup events, yet, is never detrimental.
Tested-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: David Carrillo-Cisneros <davidcc@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Kan Liang <kan.liang@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Vince Weaver <vince@deater.net>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/20170118192454.58008-2-davidcc@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the change in commit:
fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks")
... we don't reschedule a task under certain circumstances:
Lets say task-A, SCHED_OTHER, is running on CPU0 (and it may run only on
CPU0) and holds a PI lock. This task is removed from the CPU because it
used up its time slice and another SCHED_OTHER task is running. Task-B on
CPU1 runs at RT priority and asks for the lock owned by task-A. This
results in a priority boost for task-A. Task-B goes to sleep until the
lock has been made available. Task-A is already runnable (but not active),
so it receives no wake up.
The reality now is that task-A gets on the CPU once the scheduler decides
to remove the current task despite the fact that a high priority task is
enqueued and waiting. This may take a long time.
The desired behaviour is that CPU0 immediately reschedules after the
priority boost which made task-A the task with the lowest priority.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: fd7a4bed18 ("sched, rt: Convert switched_{from, to}_rt() prio_changed_rt() to balance callbacks")
Link: http://lkml.kernel.org/r/20170124144006.29821-1-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>