By copying BPF related operation to uprobe processing path, this patch
allow users attach BPF programs to uprobes like what they are already
doing on kprobes.
After this patch, users are allowed to use PERF_EVENT_IOC_SET_BPF on a
uprobe perf event. Which make it possible to profile user space programs
and kernel events together using BPF.
Because of this patch, CONFIG_BPF_EVENTS should be selected by
CONFIG_UPROBE_EVENT to ensure trace_call_bpf() is compiled even if
KPROBE_EVENT is not set.
Signed-off-by: Wang Nan <wangnan0@huawei.com>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: He Kuang <hekuang@huawei.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kaixu Xia <xiakaixu@huawei.com>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Zefan Li <lizefan@huawei.com>
Cc: pi3orama@163.com
Link: http://lkml.kernel.org/r/1435716878-189507-3-git-send-email-wangnan0@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Currently, the PT driver zeroes out the status register every time before
starting the event. However, all the writable bits are already taken care
of in pt_handle_status() function, except the new PacketByteCnt field,
which in new versions of PT contains the number of packet bytes written
since the last sync (PSB) packet. Zeroing it out before enabling PT forces
a sync packet to be written. This means that, with the existing code, a
sync packet (PSB and PSBEND, 18 bytes in total) will be generated every
time a PT event is scheduled in.
To avoid these unnecessary syncs and save a WRMSR in the fast path, this
patch changes the default behavior to not clear PacketByteCnt field, so
that the sync packets will be generated with the period specified as
"psb_period" attribute config field. This has little impact on the trace
data as the other packets that are normally sent within PSB+ (between PSB
and PSBEND) have their own generation scenarios which do not depend on the
sync packets.
One exception where we do need to force PSB like this when tracing starts,
so that the decoder has a clear sync point in the trace. For this purpose
we aready have hw::itrace_started flag, which we are currently using to
output PERF_RECORD_ITRACE_START. This patch moves setting itrace_started
from perf core to the pmu::start, where it should still be 0 on the very
first run.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1438264104-16189-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Code on the kprobe blacklist doesn't want unexpected int3
exceptions. It probably doesn't want unexpected debug exceptions
either. Be safe: disallow breakpoints in nokprobes code.
On non-CONFIG_KPROBES kernels, there is no kprobe blacklist. In
that case, disallow kernel breakpoints entirely.
It will be particularly important to keep hw breakpoints out of the
entry and NMI code once we move debug exceptions off the IST stack.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/e14b152af99640448d895e3c2a8c2d5ee19a1325.1438312874.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The xol_free_insn_slot()->waitqueue_active() check is buggy. We
need mb() after we set the conditon for wait_event(), or
xol_take_insn_slot() can miss the wakeup.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134036.GA4799@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change xol_add_vma() to use _install_special_mapping(), this way
we can name the vma installed by uprobes. Currently it looks
like private anonymous mapping, this is confusing and
complicates the debugging. With this change /proc/$pid/maps
reports "[uprobes]".
As a side effect this will cause core dumps to include the XOL vma
and I think this is good; this can help to debug the problem if
the app crashed because it was probed.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134033.GA4796@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
install_special_mapping(pages) expects that "pages" is the zero-
terminated array while xol_add_vma() passes &area->page, this
means that special_mapping_fault() can wrongly use the next
member in xol_area (vaddr) as "struct page *".
Fortunately, this area is not expandable so pgoff != 0 isn't
possible (modulo bugs in special_mapping_vmops), but still this
does not look good.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Pratyush Anand <panand@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134031.GA4789@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The previous change documents that cleanup_return_instances()
can't always detect the dead frames, the stack can grow. But
there is one special case which imho worth fixing:
arch_uretprobe_is_alive() can return true when the stack didn't
actually grow, but the next "call" insn uses the already
invalidated frame.
Test-case:
#include <stdio.h>
#include <setjmp.h>
jmp_buf jmp;
int nr = 1024;
void func_2(void)
{
if (--nr == 0)
return;
longjmp(jmp, 1);
}
void func_1(void)
{
setjmp(jmp);
func_2();
}
int main(void)
{
func_1();
return 0;
}
If you ret-probe func_1() and func_2() prepare_uretprobe() hits
the MAX_URETPROBE_DEPTH limit and "return" from func_2() is not
reported.
When we know that the new call is not chained, we can do the
more strict check. In this case "sp" points to the new ret-addr,
so every frame which uses the same "sp" must be dead. The only
complication is that arch_uretprobe_is_alive() needs to know was
it chained or not, so we add the new RP_CHECK_CHAIN_CALL enum
and change prepare_uretprobe() to pass RP_CHECK_CALL only if
!chained.
Note: arch_uretprobe_is_alive() could also re-read *sp and check
if this word is still trampoline_vaddr. This could obviously
improve the logic, but I would like to avoid another
copy_from_user() especially in the case when we can't avoid the
false "alive == T" positives.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134028.GA4786@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
arch/x86 doesn't care (so far), but as Pratyush Anand pointed
out other architectures might want why arch_uretprobe_is_alive()
was called and use different checks depending on the context.
Add the new argument to distinguish 2 callers.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134026.GA4779@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change prepare_uretprobe() to flush the !arch_uretprobe_is_alive()
return_instance's. This is not needed correctness-wise, but can help
to avoid the failure caused by MAX_URETPROBE_DEPTH.
Note: in this case arch_uretprobe_is_alive() can be false
positive, the stack can grow after longjmp(). Unfortunately, the
kernel can't 100% solve this problem, but see the next patch.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134023.GA4776@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Test-case:
#include <stdio.h>
#include <setjmp.h>
jmp_buf jmp;
void func_2(void)
{
longjmp(jmp, 1);
}
void func_1(void)
{
if (setjmp(jmp))
return;
func_2();
printf("ERR!! I am running on the caller's stack\n");
}
int main(void)
{
func_1();
return 0;
}
fails if you probe func_1() and func_2() because
handle_trampoline() assumes that the probed function should must
return and hit the bp installed be prepare_uretprobe(). But in
this case func_2() does not return, so when func_1() returns the
kernel uses the no longer valid return_instance of func_2().
Change handle_trampoline() to unwind ->return_instances until we
know that the next chain is alive or NULL, this ensures that the
current chain is the last we need to report and free.
Alternatively, every return_instance could use unique
trampoline_vaddr, in this case we could use it as a key. And
this could solve the problem with sigaltstack() automatically.
But this approach needs more changes, and it puts the "hard"
limit on MAX_URETPROBE_DEPTH. Plus it can not solve another
problem partially fixed by the next patch.
Note: this change has no effect on !x86, the arch-agnostic
version of arch_uretprobe_is_alive() just returns "true".
TODO: as documented by the previous change, arch_uretprobe_is_alive()
can be fooled by sigaltstack/etc.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134021.GA4773@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the x86 specific version of arch_uretprobe_is_alive()
helper. It returns true if the stack frame mangled by
prepare_uretprobe() is still on stack. So if it returns false,
we know that the probed function has already returned.
We add the new return_instance->stack member and change the
generic code to initialize it in prepare_uretprobe, but it
should be equally useful for other architectures.
TODO: this assumes that the probed application can't use
multiple stacks (say sigaltstack). We will try to improve
this logic later.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134018.GA4766@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the new "weak" helper, arch_uretprobe_is_alive(), used by
the next patches. It should return true if this return_instance
is still valid. The arch agnostic version just always returns
true.
The patch exports "struct return_instance" for the architectures
which want to override this hook. We can also cleanup
prepare_uretprobe() if we pass the new return_instance to
arch_uretprobe_hijack_return_addr().
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134016.GA4762@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No functional changes, preparation.
Add the new helper, find_next_ret_chain(), which finds the first
!chained entry and returns its ->next. Yes, it is suboptimal. We
probably want to turn ->chained into ->start_of_this_chain
pointer and avoid another loop. But this needs the boring
changes in dup_utask(), so lets do this later.
Change the main loop in handle_trampoline() to unwind the stack
until ri is equal to the pointer returned by this new helper.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134013.GA4755@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Turn the last pr_warn() in uprobes.c into uprobe_warn().
While at it:
- s/kzalloc/kmalloc, we initialize every member of 'ri'
- remove the pointless comment above the obvious code
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134010.GA4752@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
1. It doesn't make sense to continue if handle_trampoline()
fails, change handle_swbp() to always return after this call.
2. Turn pr_warn() into uprobe_warn(), and change
handle_trampoline() to send SIGILL on failure. It is pointless to
return to user mode with the corrupted instruction_pointer() which
we can't restore.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134008.GA4745@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We can simplify uprobe_free_utask() and handle_uretprobe_chain()
if we add a simple helper which does put_uprobe/kfree and
returns the ->next return_instance.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134006.GA4740@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cosmetic. Add the new trivial helper, get_uprobe(). It matches
put_uprobe() we already have and we can simplify a couple of its
users.
Tested-by: Pratyush Anand <panand@redhat.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Anton Arapov <arapov@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150721134003.GA4736@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A recent fix to the shadow timestamp inadvertly broke the running time
accounting.
We must not update the running timestamp if we fail to schedule the
event, the event will not have ran. This can (and did) result in
negative total runtime because the stopped timestamp was before the
running timestamp (we 'started' but never stopped the event -- because
it never really started we didn't have to stop it either).
Reported-and-Tested-by: Vince Weaver <vincent.weaver@maine.edu>
Fixes: 72f669c008 ("perf: Update shadow timestamp before add event")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: stable@vger.kernel.org # 4.1
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fixes from Thomas Gleixner:
"This update contains:
- the manual revert of the SYSCALL32 changes which caused a
regression
- a fix for the MPX vma handling
- three fixes for the ioremap 'is ram' checks.
- PAT warning fixes
- a trivial fix for the size calculation of TLB tracepoints
- handle old EFI structures gracefully
This also contains a PAT fix from Jan plus a revert thereof. Toshi
explained why the code is correct"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/pat: Revert 'Adjust default caching mode translation tables'
x86/asm/entry/32: Revert 'Do not use R9 in SYSCALL32' commit
x86/mm: Fix newly introduced printk format warnings
mm: Fix bugs in region_is_ram()
x86/mm: Remove region_is_ram() call from ioremap
x86/mm: Move warning from __ioremap_check_ram() to the call site
x86/mm/pat, drivers/media/ivtv: Move the PAT warning and replace WARN() with pr_warn()
x86/mm/pat, drivers/infiniband/ipath: Replace WARN() with pr_warn()
x86/mm/pat: Adjust default caching mode translation tables
x86/fpu: Disable dependent CPU features on "noxsave"
x86/mpx: Do not set ->vm_ops on MPX VMAs
x86/mm: Add parenthesis for TLB tracepoint size calculation
efi: Handle memory error structures produced based on old versions of standard
double iteration list (one for registered ftrace ops, and one for
registered "global" ops), to just use one list. That simplified the code
but also broke the function tracing filtering on pid.
This updates the code to handle the filtering again with the new logic.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVsq1dAAoJEEjnJuOKh9ldfagH/39dixx3tMMElO7pGM9Q3DBE
WmJuGSOmeZA1O0balnfcV2SgegEGCSEP6sjBtyuYCfgcFDWyIdAvGrMLS/hKTOxR
pilaqyNQz7CROx1zco9gbu5pGwkSkcAfnqYzOg6IpJ0WHtiyH36GMe2wMu29u2VT
I7NgSC6ByA82N4pwvgetlUcIDcPTyrkkhGmGPBJGY+diKzeSSo8NlRbv3SNYs0ua
V072Oumu64RrZBMdn/Sb2pCF2hf6vhTXD6qS4dbpK/Rfnlblqer9SqUIx2kpg603
yDOmPY7wN9FJ94Te3EeXubLi0LqDJH4iPOndrRn1fYsgMbUpq1BlViF7W7Ajze8=
=LQ+S
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.2-rc2-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull ftrace fix from Steven Rostedt:
"Back in 3.16 the ftrace code was redesigned and cleaned up to remove
the double iteration list (one for registered ftrace ops, and one for
registered "global" ops), to just use one list. That simplified the
code but also broke the function tracing filtering on pid.
This updates the code to handle the filtering again with the new
logic"
* tag 'trace-v4.2-rc2-fix3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix breakage of set_ftrace_pid
Commit 4104d326b6 ("ftrace: Remove global function list and call function
directly") simplified the ftrace code by removing the global_ops list with a
new design. But this cleanup also broke the filtering of PIDs that are added
to the set_ftrace_pid file.
Add back the proper hooks to have pid filtering working once again.
Cc: stable@vger.kernel.org # 3.16+
Reported-by: Matt Fleming <matt@console-pimps.org>
Reported-by: Richard Weinberger <richard.weinberger@gmail.com>
Tested-by: Matt Fleming <matt@console-pimps.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There are already two events for context switches, namely the tracepoint
sched:sched_switch and the software event context_switches.
Unfortunately neither are suitable for use by non-privileged users for
the purpose of synchronizing hardware trace data (e.g. Intel PT) to the
context switch.
Tracepoints are no good at all for non-privileged users because they
need either CAP_SYS_ADMIN or /proc/sys/kernel/perf_event_paranoid <= -1.
On the other hand, kernel software events need either CAP_SYS_ADMIN or
/proc/sys/kernel/perf_event_paranoid <= 1.
Now many distributions do default perf_event_paranoid to 1 making
context_switches a contender, except it has another problem (which is
also shared with sched:sched_switch) which is that it happens before
perf schedules events out instead of after perf schedules events in.
Whereas a privileged user can see all the events anyway, a
non-privileged user only sees events for their own processes, in other
words they see when their process was scheduled out not when it was
scheduled in. That presents two problems to use the event:
1. the information comes too late, so tools have to look ahead in the
event stream to find out what the current state is
2. if they are unlucky tracing might have stopped before the
context-switches event is recorded.
This new PERF_RECORD_SWITCH event does not have those problems
and it also has a couple of other small advantages.
It is easier to use because it is an auxiliary event (like mmap, comm
and task events) which can be enabled by setting a single bit. It is
smaller than sched:sched_switch and easier to parse.
To make the event useful for privileged users also, if the
context is cpu-wide then the event record will be
PERF_RECORD_SWITCH_CPU_WIDE which is the same as
PERF_RECORD_SWITCH except it also provides the next or
previous pid/tid.
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
Cc: Pawel Moll <pawel.moll@arm.com>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1437471846-26995-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
region_is_ram() looks up the iomem_resource table to check if
a target range is in RAM. However, it always returns with -1
due to invalid range checks. It always breaks the loop at the
first entry of the table.
Another issue is that it compares p->flags and flags, but it always
fails. flags is declared as int, which makes it as a negative value
with IORESOURCE_BUSY (0x80000000) set while p->flags is unsigned long.
Fix the range check and flags so that region_is_ram() works as
advertised.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: Mike Travis <travis@sgi.com>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Roland Dreier <roland@purestorage.com>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/1437088996-28511-4-git-send-email-toshi.kani@hp.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull x86 fixes from Ingo Molnar:
"Two families of fixes:
- Fix an FPU context related boot crash on newer x86 hardware with
larger context sizes than what most people test. To fix this
without ugly kludges or extensive reverts we had to touch core task
allocator, to allow x86 to determine the task size dynamically, at
boot time.
I've tested it on a number of x86 platforms, and I cross-built it
to a handful of architectures:
(warns) (warns)
testing x86-64: -git: pass ( 0), -tip: pass ( 0)
testing x86-32: -git: pass ( 0), -tip: pass ( 0)
testing arm: -git: pass ( 1359), -tip: pass ( 1359)
testing cris: -git: pass ( 1031), -tip: pass ( 1031)
testing m32r: -git: pass ( 1135), -tip: pass ( 1135)
testing m68k: -git: pass ( 1471), -tip: pass ( 1471)
testing mips: -git: pass ( 1162), -tip: pass ( 1162)
testing mn10300: -git: pass ( 1058), -tip: pass ( 1058)
testing parisc: -git: pass ( 1846), -tip: pass ( 1846)
testing sparc: -git: pass ( 1185), -tip: pass ( 1185)
... so I hope the cross-arch impact 'none', as intended.
(by Dave Hansen)
- Fix various NMI handling related bugs unearthed by the big asm code
rewrite and generally make the NMI code more robust and more
maintainable while at it. These changes are a bit late in the
cycle, I hope they are still acceptable.
(by Andy Lutomirski)"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu, sched: Introduce CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT and use it on x86
x86/fpu, sched: Dynamically allocate 'struct fpu'
x86/entry/64, x86/nmi/64: Add CONFIG_DEBUG_ENTRY NMI testing code
x86/nmi/64: Make the "NMI executing" variable more consistent
x86/nmi/64: Minor asm simplification
x86/nmi/64: Use DF to avoid userspace RSP confusing nested NMI detection
x86/nmi/64: Reorder nested NMI checks
x86/nmi/64: Improve nested NMI comments
x86/nmi/64: Switch stacks on userspace NMI entry
x86/nmi/64: Remove asm code that saves CR2
x86/nmi: Enable nested do_nmi() handling for 64-bit kernels
Pull timer fix from Ingo Molnar:
"Fix for a misplaced export that can cause build failures in certain
(rare) Kconfig situations"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick: Move the export of tick_broadcast_oneshot_control to the proper place
Pull scheduler fix from Ingo Molnar:
"A oneliner rq throttling fix"
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Test list head instead of list entry in throttle_cfs_rq()
Pull irq fixes from Ingo Molnar:
"Misc irq fixes:
- two driver fixes
- a Xen regression fix
- a nested irq thread crash fix"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/gicv3-its: Fix mapping of LPIs to collections
genirq: Prevent resend to interrupts marked IRQ_NESTED_THREAD
genirq: Revert sparse irq locking around __cpu_up() and move it to x86 for now
gpio/davinci: Fix race in installing chained irq handler
Don't burden architectures without dynamic task_struct sizing
with the overhead of dynamic sizing.
Also optimize the x86 code a bit by caching task_struct_size.
Acked-and-Tested-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437128892-9831-3-git-send-email-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The FPU rewrite removed the dynamic allocations of 'struct fpu'.
But, this potentially wastes massive amounts of memory (2k per
task on systems that do not have AVX-512 for instance).
Instead of having a separate slab, this patch just appends the
space that we need to the 'task_struct' which we dynamically
allocate already. This saves from doing an extra slab
allocation at fork().
The only real downside here is that we have to stick everything
and the end of the task_struct. But, I think the
BUILD_BUG_ON()s I stuck in there should keep that from being too
fragile.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave@sr71.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1437128892-9831-2-git-send-email-mingo@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The resend mechanism happily calls the interrupt handler of interrupts
which are marked IRQ_NESTED_THREAD from softirq context. This can
result in crashes because the interrupt handler is not the proper way
to invoke the device handlers. They must be invoked via
handle_nested_irq.
Prevent the resend even if the interrupt has no valid parent irq
set. Its better to have a lost interrupt than a crashing machine.
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
tracer (traces unlikely and likely branches) when enabled with certain
debug options.
What happened was that various debug options like lockdep and DEBUG_PREEMPT
can cause parts of the branch tracer to recurse outside its recursion
protection. In fact, part of its recursion protection used these features
that caused the lockup. This cleans up the code a little and makes the
recursion protection a bit more robust.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVovjwAAoJEEjnJuOKh9ldYacH/3VVIcBhRbuUmANZJSYLXSUh
nEx6FOAfmla/7hH2dVSm8JmnbUzWyunttjbZ6/O8PFE1gumuvVrfWrIV8iwUe4J6
6Z2KdbKd+3FpaSKEnX61UQ1cfR+1eFLOLH9CH5O4twVPyLzvI+NpaJQJaNoX0ywq
vqsUMx63gKdGwhC6BhLi0t/xsTuuIgGjDAjuaF2yNZCuBw9UtziedxK5pveH2OFX
G8dAVfP18aqsXaRMj1LUrm6wRUP0BD7B2v99jdUu2UHYTtmBzy8Vm6RfhJ4Gk8d7
WnIkaBBU5iu75E9ec35MtA52zQ+8b8O/fpIQZgJgFRr7uaf+hvbXjF0UScWE6vU=
=ujzT
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fengguang Wu discovered a crash that happened to be because of the
branch tracer (traces unlikely and likely branches) when enabled with
certain debug options.
What happened was that various debug options like lockdep and
DEBUG_PREEMPT can cause parts of the branch tracer to recurse outside
its recursion protection. In fact, part of its recursion protection
used these features that caused the lockup. This cleans up the code a
little and makes the recursion protection a bit more robust"
* tag 'trace-v4.2-rc1-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Have branch tracer use recursive field of task struct
Boris reported that the sparse_irq protection around __cpu_up() in the
generic code causes a regression on Xen. Xen allocates interrupts and
some more in the xen_cpu_up() function, so it deadlocks on the
sparse_irq_lock.
There is no simple fix for this and we really should have the
protection for all architectures, but for now the only solution is to
move it to x86 where actual wreckage due to the lack of protection has
been observed.
Reported-and-tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Fixes: a899418167 'hotplug: Prevent alloc/free of irq descriptors during cpu up/down'
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: xiao jin <jin.xiao@intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Cc: xen-devel <xen-devel@lists.xenproject.org>
tick_broadcast_oneshot_control got moved from tick-broadcast to
tick-common, but the export stayed in the old place. Fix it up.
Fixes: f32dd11705 'tick/broadcast: Make idle check independent from mode and config'
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timer fixes from Thomas Gleixner:
"This update from the timer departement contains:
- A series of patches which address a shortcoming in the tick
broadcast code.
If the broadcast device is not available or an hrtimer emulated
broadcast device, some of the original assumptions lead to boot
failures. I rather plugged all of the corner cases instead of only
addressing the issue reported, so the change got a little larger.
Has been extensivly tested on x86 and arm.
- Get rid of the last holdouts using do_posix_clock_monotonic_gettime()
- A regression fix for the imx clocksource driver
- An update to the new state callbacks mechanism for clockevents.
This is required to simplify the conversion, which will take place
in 4.3"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tick/broadcast: Prevent NULL pointer dereference
time: Get rid of do_posix_clock_monotonic_gettime
cris: Replace do_posix_clock_monotonic_gettime()
tick/broadcast: Unbreak CONFIG_GENERIC_CLOCKEVENTS=n build
tick/broadcast: Handle spurious interrupts gracefully
tick/broadcast: Check for hrtimer broadcast active early
tick/broadcast: Return busy when IPI is pending
tick/broadcast: Return busy if periodic mode and hrtimer broadcast
tick/broadcast: Move the check for periodic mode inside state handling
tick/broadcast: Prevent deep idle if no broadcast device available
tick/broadcast: Make idle check independent from mode and config
tick/broadcast: Sanity check the shutdown of the local clock_event
tick/broadcast: Prevent hrtimer recursion
clockevents: Allow set-state callbacks to be optional
clocksource/imx: Define clocksource for mx27
Pull irq fix from Thomas Gleixner:
"A single fix for a cpu hotplug race vs. interrupt descriptors:
Prevent irq setup/teardown across the cpu starting/dying parts of cpu
hotplug so that the starting/dying cpu has a stable view of the
descriptor space. This has been an issue for all architectures in the
cpu dying phase, where interrupts are migrated away from the dying
cpu. In the starting phase its mostly a x86 issue vs the vector space
update"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
hotplug: Prevent alloc/free of irq descriptors during cpu up/down
Dan reported that the recent changes to the broadcast code introduced
a potential NULL dereference.
Add the proper check.
Fixes: e045431190 "tick/broadcast: Sanity check the shutdown of the local clock_event"
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The load_module() error path frees a module but forgot to take it out
of the mod_tree, leaving a dangling entry in the tree, causing havoc.
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Tested-by: Arthur Marsh <arthur.marsh@internode.on.net>
Fixes: 93c2e105f6 ("module: Optimize __module_address() using a latched RB-tree")
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The "fix" in commit 0b08c5e594 ("audit: Fix check of return value of
strnlen_user()") didn't fix anything, it broke things. As reported by
Steven Rostedt:
"Yes, strnlen_user() returns 0 on fault, but if you look at what len is
set to, than you would notice that on fault len would be -1"
because we just subtracted one from the return value. So testing
against 0 doesn't test for a fault condition, it tests against a
perfectly valid empty string.
Also fix up the usual braindamage wrt using WARN_ON() inside a
conditional - make it part of the conditional and remove the explicit
unlikely() (which is already part of the WARN_ON*() logic, exactly so
that you don't have to write unreadable code.
Reported-and-tested-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Paul Moore <pmoore@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fengguang Wu's tests triggered a bug in the branch tracer's start up
test when CONFIG_DEBUG_PREEMPT set. This was because that config
adds some debug logic in the per cpu field, which calls back into
the branch tracer.
The branch tracer has its own recursive checks, but uses a per cpu
variable to implement it. If retrieving the per cpu variable calls
back into the branch tracer, you can see how things will break.
Instead of using a per cpu variable, use the trace_recursion field
of the current task struct. Simply set a bit when entering the
branch tracing and clear it when leaving. If the bit is set on
entry, just don't do the tracing.
There's also the case with lockdep, as the local_irq_save() called
before the recursion can also trigger code that can call back into
the function. Changing that to a raw_local_irq_save() will protect
that as well.
This prevents the recursion and the inevitable crash that follows.
Link: http://lkml.kernel.org/r/20150630141803.GA28071@wfg-t540p.sh.intel.com
Cc: stable@vger.kernel.org # 3.10+
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When a cpu goes up some architectures (e.g. x86) have to walk the irq
space to set up the vector space for the cpu. While this needs extra
protection at the architecture level we can avoid a few race
conditions by preventing the concurrent allocation/free of irq
descriptors and the associated data.
When a cpu goes down it moves the interrupts which are targeted to
this cpu away by reassigning the affinities. While this happens
interrupts can be allocated and freed, which opens a can of race
conditions in the code which reassignes the affinities because
interrupt descriptors might be freed underneath.
Example:
CPU1 CPU2
cpu_up/down
irq_desc = irq_to_desc(irq);
remove_from_radix_tree(desc);
raw_spin_lock(&desc->lock);
free(desc);
We could protect the irq descriptors with RCU, but that would require
a full tree change of all accesses to interrupt descriptors. But
fortunately these kind of race conditions are rather limited to a few
things like cpu hotplug. The normal setup/teardown is very well
serialized. So the simpler and obvious solution is:
Prevent allocation and freeing of interrupt descriptors accross cpu
hotplug.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: xiao jin <jin.xiao@intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Link: http://lkml.kernel.org/r/20150705171102.063519515@linutronix.de
Andriy reported that on a virtual machine the warning about negative
expiry time in the clock events programming code triggered:
hpet: hpet0 irq 40 for MSI
hpet: hpet1 irq 41 for MSI
Switching to clocksource hpet
WARNING: at kernel/time/clockevents.c:239
[<ffffffff810ce6eb>] clockevents_program_event+0xdb/0xf0
[<ffffffff810cf211>] tick_handle_periodic_broadcast+0x41/0x50
[<ffffffff81016525>] timer_interrupt+0x15/0x20
When the second hpet is installed as a per cpu timer the broadcast
event is not longer required and stopped, which sets the next_evt of
the broadcast device to KTIME_MAX.
If after that a spurious interrupt happens on the broadcast device,
then the current code blindly handles it and tries to reprogram the
broadcast device afterwards, which adds the period to
next_evt. KTIME_MAX + period results in a negative expiry value
causing the WARN_ON in the clockevents code to trigger.
Add a proper check for the state of the broadcast device into the
interrupt handler and return if the interrupt is spurious.
[ Folded in pointer fix from Sudeep ]
Reported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Sudeep Holla <sudeep.holla@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Link: http://lkml.kernel.org/r/20150705205221.802094647@linutronix.de
If the current cpu is the one which has the hrtimer based broadcast
queued then we better return busy immediately instead of going through
loops and hoops to figure that out.
[ Split out from a larger combo patch ]
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Suzuki Poulose <Suzuki.Poulose@arm.com>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos
Tell the idle code not to go deep if the broadcast IPI is about to
arrive.
[ Split out from a larger combo patch ]
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Suzuki Poulose <Suzuki.Poulose@arm.com>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos
If the system is in periodic mode and the broadcast device is hrtimer
based, return busy as we have no proper handling for this.
[ Split out from a larger combo patch ]
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Suzuki Poulose <Suzuki.Poulose@arm.com>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos
We need to check more than the periodic mode for proper operation in
all runtime combinations. To avoid code duplication move the check
into the enter state handling.
No functional change.
[ Split out from a larger combo patch ]
Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Suzuki Poulose <Suzuki.Poulose@arm.com>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos
Add a check for a installed broadcast device to the oneshot control
function and return busy if not.
[ Split out from a larger combo patch ]
Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Suzuki Poulose <Suzuki.Poulose@arm.com>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos
Currently the broadcast busy check, which prevents the idle code from
going into deep idle, works only in one shot mode.
If NOHZ and HIGHRES are off (config or command line) there is no
sanity check at all, so under certain conditions cpus are allowed to
go into deep idle, where the local timer stops, and are not woken up
again because there is no broadcast timer installed or a hrtimer based
broadcast device is not evaluated.
Move tick_broadcast_oneshot_control() into the common code and provide
proper subfunctions for the various config combinations.
The common check in tick_broadcast_oneshot_control() is for the C3STOP
misfeature flag of the local clock event device. If its not set, idle
can proceed. If set, further checks are necessary.
Provide checks for the trivial cases:
- If broadcast is disabled in the config, then return busy
- If oneshot mode (NOHZ/HIGHES) is disabled in the config, return
busy if the broadcast device is hrtimer based.
- If oneshot mode is enabled in the config call the original
tick_broadcast_oneshot_control() function. That function needs
extra checks which will be implemented in seperate patches.
[ Split out from a larger combo patch ]
Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Suzuki Poulose <Suzuki.Poulose@arm.com>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos
The broadcast code shuts down the local clock event unconditionally
even if no broadcast device is installed or if the broadcast device is
hrtimer based.
Add proper sanity checks.
[ Split out from a larger combo patch ]
Reported-and-tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Suzuki Poulose <Suzuki.Poulose@arm.com>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos
The hrtimer based broadcast vehicle can cause a hrtimer recursion
which went unnoticed until we changed the hrtimer expiry code to keep
track of the currently running timer.
local_timer_interrupt()
local_handler()
hrtimer_interrupt()
expire_hrtimers()
broadcast_hrtimer()
send_ipis()
local_handler()
hrtimer_interrupt()
....
Solution is simple: Prevent the local handler call from the broadcast
code when the broadcast 'device' is hrtimer based.
[ Split out from a larger combo patch ]
Tested-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Suzuki Poulose <Suzuki.Poulose@arm.com>
Cc: Lorenzo Pieralisi <Lorenzo.Pieralisi@arm.com>
Cc: Catalin Marinas <Catalin.Marinas@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@kernel.org>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1507070929360.3916@nanos