Commit Graph

53 Commits

Author SHA1 Message Date
Jan Stancek
68de8867ea powerpc/perf: add missing put_cpu_var in power_pmu_event_init
One path in power_pmu_event_init() calls get_cpu_var(), but is
missing matching call to put_cpu_var(), which causes preemption
imbalance and crash in user-space:

  Page fault in user mode with in_atomic() = 1 mm = c000001fefa5a280
  NIP = 3fff9bf2cae0  MSR = 900000014280f032
  Oops: Weird page fault, sig: 11 [#23]
  SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in: <snip>
  CPU: 43 PID: 10285 Comm: a.out Tainted: G      D         4.0.0-rc5+ #1
  task: c000001fe82c9200 ti: c000001fe835c000 task.ti: c000001fe835c000
  NIP: 00003fff9bf2cae0 LR: 00003fff9bee4898 CTR: 00003fff9bf2cae0
  REGS: c000001fe835fea0 TRAP: 0401   Tainted: G      D          (4.0.0-rc5+)
  MSR: 900000014280f032 <SF,HV,VEC,VSX,EE,PR,FP,ME,IR,DR,RI>  CR: 22000028  XER: 00000000
  CFAR: 00003fff9bee4894 SOFTE: 1
   GPR00: 00003fff9bee494c 00003fffe01c2ee0 00003fff9c084410 0000000010020068
   GPR04: 0000000000000000 0000000000000002 0000000000000008 0000000000000001
   GPR08: 0000000000000001 00003fff9c074a30 00003fff9bf2cae0 00003fff9bf2cd70
   GPR12: 0000000052000022 00003fff9c10b700
  NIP [00003fff9bf2cae0] 0x3fff9bf2cae0
  LR [00003fff9bee4898] 0x3fff9bee4898
  Call Trace:
  ---[ end trace 5d3d952b5d4185d4 ]---

  BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41
  in_atomic(): 1, irqs_disabled(): 0, pid: 10285, name: a.out
  INFO: lockdep is turned off.
  CPU: 43 PID: 10285 Comm: a.out Tainted: G      D         4.0.0-rc5+ #1
  Call Trace:
  [c000001fe835f990] [c00000000089c014] .dump_stack+0x98/0xd4 (unreliable)
  [c000001fe835fa10] [c0000000000e4138] .___might_sleep+0x1d8/0x2e0
  [c000001fe835faa0] [c000000000888da8] .down_read+0x38/0x110
  [c000001fe835fb30] [c0000000000bf2f4] .exit_signals+0x24/0x160
  [c000001fe835fbc0] [c0000000000abde0] .do_exit+0xd0/0xe70
  [c000001fe835fcb0] [c00000000001f4c4] .die+0x304/0x450
  [c000001fe835fd60] [c00000000088e1f4] .do_page_fault+0x2d4/0x900
  [c000001fe835fe30] [c000000000008664] handle_page_fault+0x10/0x30
  note: a.out[10285] exited with preempt_count 1

Reproducer:
  #include <stdio.h>
  #include <unistd.h>
  #include <syscall.h>
  #include <sys/types.h>
  #include <sys/stat.h>
  #include <linux/perf_event.h>
  #include <linux/hw_breakpoint.h>

  static struct perf_event_attr event = {
          .type = PERF_TYPE_RAW,
          .size = sizeof(struct perf_event_attr),
          .sample_type = PERF_SAMPLE_BRANCH_STACK,
          .branch_sample_type = PERF_SAMPLE_BRANCH_ANY_RETURN,
  };

  int main()
  {
          syscall(__NR_perf_event_open, &event, 0, -1, -1, 0);
  }

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2015-03-27 20:07:01 +11:00
Christoph Lameter
69111bac42 powerpc: Replace __get_cpu_var uses
This still has not been merged and now powerpc is the only arch that does
not have this change. Sorry about missing linuxppc-dev before.

V2->V2
  - Fix up to work against 3.18-rc1

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e.  using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

	DEFINE_PER_CPU(int, y);
	int *x = &__get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

	DEFINE_PER_CPU(int, y[20]);
	int *x = __get_cpu_var(y);

    Converts to

	int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

	DEFINE_PER_CPU(int, y);
	int x = __get_cpu_var(y)

   Converts to

	int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

	DEFINE_PER_CPU(struct mystruct, y);
	struct mystruct x = __get_cpu_var(y);

   Converts to

	memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

	DEFINE_PER_CPU(int, y)
	__get_cpu_var(y) = x;

   Converts to

	__this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

	DEFINE_PER_CPU(int, y);
	__get_cpu_var(y)++

   Converts to

	__this_cpu_inc(y)

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: Paul Mackerras <paulus@samba.org>
Signed-off-by: Christoph Lameter <cl@linux.com>
[mpe: Fix build errors caused by set/or_softirq_pending(), and rework
      assignment in __set_breakpoint() to use memcpy().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-11-03 12:12:32 +11:00
Anton Blanchard
e51df2c170 powerpc: Make a bunch of things static
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
2014-09-25 23:14:41 +10:00
Michael Ellerman
9de5cb0f6d powerpc/perf: Add per-event excludes on Power8
Power8 has a new register (MMCR2), which contains individual freeze bits
for each counter. This is an improvement on previous chips as it means
we can have multiple events on the PMU at the same time with different
exclude_{user,kernel,hv} settings. Previously we had to ensure all
events on the PMU had the same exclude settings.

The core of the patch is fairly simple. We use the 207S feature flag to
indicate that the PMU backend supports per-event excludes, if it's set
we skip the generic logic that enforces the equality of excludes between
events. We also use that flag to skip setting the freeze bits in MMCR0,
the PMU backend is expected to have handled setting them in MMCR2.

The complication arises with EBB. The FCxP bits in MMCR2 are accessible
R/W to a task using EBB. Which means a task using EBB will be able to
see that we are using MMCR2 for freezing, whereas the old logic which
used MMCR0 is not user visible.

The task can not see or affect exclude_kernel & exclude_hv, so we only
need to consider exclude_user.

The table below summarises the behaviour both before and after this
commit is applied:

 exclude_user           true  false
 ------------------------------------
        | User visible |  N    N
 Before | Can freeze   |  Y    Y
        | Can unfreeze |  N    Y
 ------------------------------------
        | User visible |  Y    Y
  After | Can freeze   |  Y    Y
        | Can unfreeze |  Y/N  Y
 ------------------------------------

So firstly I assert that the simple visibility of the exclude_user
setting in MMCR2 is a non-issue. The event belongs to the task, and
was most likely created by the task. So the exclude_user setting is not
privileged information in any way.

Secondly, the behaviour in the exclude_user = false case is unchanged.
This is important as it is the case that is actually useful, ie. the
event is created with no exclude setting and the task uses MMCR2 to
implement exclusion manually.

For exclude_user = true there is no meaningful change to freezing the
event. Previously the task could use MMCR2 to freeze the event, though
it was already frozen with MMCR0. With the new code the task can use
MMCR2 to freeze the event, though it was already frozen with MMCR2.

The only real change is when exclude_user = true and the task tries to
use MMCR2 to unfreeze the event. Previously this had no effect, because
the event was already frozen in MMCR0. With the new code the task can
unfreeze the event in MMCR2, but at some indeterminate time in the
future the kernel will overwrite its setting and refreeze the event.

Therefore my final assertion is that any task using exclude_user = true
and also fiddling with MMCR2 was deeply confused before this change, and
remains so after it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-28 14:30:58 +10:00
Michael Ellerman
8abd818fc7 powerpc/perf: Pass the struct perf_events down to compute_mmcr()
To support per-event exclude settings on Power8 we need access to the
struct perf_events in compute_mmcr().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-28 14:30:47 +10:00
Michael Ellerman
79a4cb28a0 powerpc/perf: Clear all MMCR settings before calling compute_mmcr()
Because we reuse cpuhw->mmcr on each call to compute_mmcr() there's a
risk that we could forget to set one of the values and use whatever
value was in there previously.

Currently all the implementations are careful to set all the values, but
it's safer to clear them all before we call compute_mmcr().

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-28 14:11:34 +10:00
Michael Ellerman
8903461c9b powerpc/perf: Fix MMCR2 handling for EBB
In the recent commit b50a6c584b "Clear MMCR2 when enabling PMU", I
screwed up the handling of MMCR2 for tasks using EBB.

We must make sure we set MMCR2 *before* ebb_switch_in(), otherwise we
overwrite the value of MMCR2 that userspace may have written. That
potentially breaks a task that uses EBB and manually uses MMCR2 for
event freezing.

Fixes: b50a6c584b ("powerpc/perf: Clear MMCR2 when enabling PMU")
Cc: stable@vger.kernel.org
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-23 17:16:47 +10:00
Anton Blanchard
f56029410a powerpc/perf: Never program book3s PMCs with values >= 0x80000000
We are seeing a lot of PMU warnings on POWER8:

    Can't find PMC that caused IRQ

Looking closer, the active PMC is 0 at this point and we took a PMU
exception on the transition from negative to 0. Some versions of POWER8
have an issue where they edge detect and not level detect PMC overflows.

A number of places program the PMC with (0x80000000 - period_left),
where period_left can be negative. We can either fix all of these or
just ensure that period_left is always >= 1.

This patch takes the second option.

Cc: <stable@vger.kernel.org>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-11 13:50:47 +10:00
Joel Stanley
b50a6c584b powerpc/perf: Clear MMCR2 when enabling PMU
On POWER8 when switching to a KVM guest we set bits in MMCR2 to freeze
the PMU counters. Aside from on boot they are then never reset,
resulting in stuck perf counters for any user in the guest or host.

We now set MMCR2 to 0 whenever enabling the PMU, which provides a sane
state for perf to use the PMU counters under either the guest or the
host.

This was manifesting as a bug with ppc64_cpu --frequency:

    $ sudo ppc64_cpu --frequency
    WARNING: couldn't run on cpu 0
    WARNING: couldn't run on cpu 8
      ...
    WARNING: couldn't run on cpu 144
    WARNING: couldn't run on cpu 152
    min:    18446744073.710 GHz (cpu -1)
    max:    0.000 GHz (cpu -1)
    avg:    0.000 GHz

The command uses a perf counter to measure CPU cycles over a fixed
amount of time, in order to approximate the frequency of the machine.
The counters were returning zero once a guest was started, regardless of
weather it was still running or had been shut down.

By dumping the value of MMCR2, it was observed that once a guest is
running MMCR2 is set to 1s - which stops counters from running:

    $ sudo sh -c 'echo p > /proc/sysrq-trigger'
    CPU: 0 PMU registers, ppmu = POWER8 n_counters = 6
    PMC1:  5b635e38 PMC2: 00000000 PMC3: 00000000 PMC4: 00000000
    PMC5:  1bf5a646 PMC6: 5793d378 PMC7: deadbeef PMC8: deadbeef
    MMCR0: 0000000080000000 MMCR1: 000000001e000000 MMCRA: 0000040000000000
    MMCR2: fffffffffffffc00 EBBHR: 0000000000000000
    EBBRR: 0000000000000000 BESCR: 0000000000000000
    SIAR:  00000000000a51cc SDAR:  c00000000fc40000 SIER:  0000000001000000

This is done unconditionally in book3s_hv_interrupts.S upon entering the
guest, and the original value is only save/restored if the host has
indicated it was using the PMU. This is okay, however the user of the
PMU needs to ensure that it is in a defined state when it starts using
it.

Fixes: e05b9b9e5c ("powerpc/perf: Power8 PMU support")
Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-11 12:55:08 +10:00
Joel Stanley
4d9690dd56 powerpc/perf: Add PPMU_ARCH_207S define
Instead of separate bits for every POWER8 PMU feature, have a single one
for v2.07 of the architecture.

This saves us adding a MMCR2 define for a future patch.

Cc: stable@vger.kernel.org
Signed-off-by: Joel Stanley <joel@jms.id.au>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-07-11 12:55:07 +10:00
Michael Ellerman
76cb8a783a powerpc/perf: Enable BHRB access for EBB events
The previous commit added constraint and register handling to allow
processes using EBB (Event Based Branches) to request access to the BHRB
(Branch History Rolling Buffer).

With that in place we can allow processes using EBB to access the BHRB.
This is achieved by setting BHRBA in MMCR0 when we enable EBB access. We
must also clear BHRBA when we are disabling.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24 09:48:27 +11:00
Michael Ellerman
58b5fb0049 powerpc/perf: Reject EBB events which specify a sample_type
Although we already block EBB events which request sampling using
sample_period, technically it's possible for an event to set sample_type
but not sample_period.

Nothing terrible will happen if an EBB event does specify sample_type,
but it signals a major confusion on the part of userspace, and so we do
them the favor of rejecting it.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24 09:48:25 +11:00
Michael Ellerman
c2e37a2626 powerpc/perf: Add lost exception workaround
Some power8 revisions have a hardware bug where we can lose a PMU
exception, this commit adds a workaround to detect the bad condition and
rectify the situation.

See the comment in the commit for a full description.

Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24 09:48:25 +11:00
Anshuman Khandual
5f6d0380c6 powerpc/perf: Define perf_event_print_debug() to print PMU register values
Currently the sysrq ShowRegs command does not print any PMU registers as
we have an empty definition for perf_event_print_debug(). This patch
defines perf_event_print_debug() to print various PMU registers.

Example output:

CPU: 0 PMU registers, ppmu = POWER7 n_counters = 6
PMC1:  00000000 PMC2: 00000000 PMC3: 00000000 PMC4: 00000000
PMC5:  00000000 PMC6: 00000000 PMC7: deadbeef PMC8: deadbeef
MMCR0: 0000000080000000 MMCR1: 0000000000000000 MMCRA: 0f00000001000000
SIAR:  0000000000000000 SDAR:  0000000000000000 SIER:  0000000000000000

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Fix 32 bit build and rework formatting for compactness]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-03-24 09:48:23 +11:00
Anshuman Khandual
b4d6c06c8d powerpc/perf: Configure BHRB filter before enabling PMU interrupts
Right now the config_bhrb() PMU specific call happens after
write_mmcr0(), which actually enables the PMU for event counting and
interrupts. So there is a small window of time where the PMU and BHRB
runs without the required HW branch filter (if any) enabled in BHRB.

This can cause some of the branch samples to be collected through BHRB
without any filter applied and hence affects the correctness of
the results. This patch moves the BHRB config function call before
enabling interrupts.

Here are some data points captured via trace prints which depicts how we
could get PMU interrupts with BHRB filter NOT enabled with a standard
perf record command line (asking for branch record information as well).

    $ perf record -j any_call ls

Before the patch:-

    ls-1962  [003] d...  2065.299590: .perf_event_interrupt: MMCRA: 40000000000
    ls-1962  [003] d...  2065.299603: .perf_event_interrupt: MMCRA: 40000000000
    ...

    All the PMU interrupts before this point did not have the requested
    HW branch filter enabled in the MMCRA.

    ls-1962  [003] d...  2065.299647: .perf_event_interrupt: MMCRA: 40040000000
    ls-1962  [003] d...  2065.299662: .perf_event_interrupt: MMCRA: 40040000000

After the patch:-

    ls-1850  [008] d...   190.311828: .perf_event_interrupt: MMCRA: 40040000000
    ls-1850  [008] d...   190.311848: .perf_event_interrupt: MMCRA: 40040000000

    All the PMU interrupts have the requested HW BHRB branch filter
    enabled in MMCRA.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
[mpe: Fixed up whitespace and cleaned up changelog]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2014-02-11 11:24:50 +11:00
Anton Blanchard
b0d436c739 powerpc: Fix a number of sparse warnings
Address some of the trivial sparse warnings in arch/powerpc.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-14 11:50:24 +10:00
Michael Ellerman
8d7c55d01e powerpc/perf: Export PERF_EVENT_CONFIG_EBB_SHIFT to userspace
We use bit 63 of the event code for userspace to request that the event
be counted using EBB (Event Based Branches). Export this value, making
it part of the API - though only on processors that support EBB.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-08-01 13:11:46 +10:00
Anshuman Khandual
ff3d79dc12 powerpc/perf: BHRB filter configuration should follow the task
When the task moves around the system, the corresponding cpuhw
per cpu strcuture should be popullated with the BHRB filter
request value so that PMU could be configured appropriately with
that during the next call into power_pmu_enable().

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-24 14:42:34 +10:00
Michael Ellerman
330a1eb777 powerpc/perf: Core EBB support for 64-bit book3s
Add support for EBB (Event Based Branches) on 64-bit book3s. See the
included documentation for more details.

EBBs are a feature which allows the hardware to branch directly to a
specified user space address when a PMU event overflows. This can be
used by programs for self-monitoring with no kernel involvement in the
inner loop.

Most of the logic is in the generic book3s code, primarily to avoid a
proliferation of PMU callbacks.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01 11:50:10 +10:00
Michael Ellerman
4ea355b536 powerpc/perf: Don't enable if we have zero events
In power_pmu_enable() we still enable the PMU even if we have zero
events. This should have no effect but doesn't make much sense. Instead
just return after telling the hypervisor that we are not using the PMCs.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
CC: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01 11:50:03 +10:00
Michael Ellerman
0a48843d6c powerpc/perf: Use existing out label in power_pmu_enable()
In power_pmu_enable() we can use the existing out label to reduce the
number of return paths.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
CC: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01 11:50:00 +10:00
Michael Ellerman
7a7a41f9d5 powerpc/perf: Freeze PMC5/6 if we're not using them
On Power8 we can freeze PMC5 and 6 if we're not using them. Normally they
run all the time.

As noticed by Anshuman, we should unfreeze them when we disable the PMU
as there are legacy tools which expect them to run all the time.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
CC: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01 11:49:57 +10:00
Michael Ellerman
378a6ee99e powerpc/perf: Rework disable logic in pmu_disable()
In pmu_disable() we disable the PMU by setting the FC (Freeze Counters)
bit in MMCR0. In order to do this we have to read/modify/write MMCR0.

It's possible that we read a value from MMCR0 which has PMAO (PMU Alert
Occurred) set. When we write that value back it will cause an interrupt
to occur. We will then end up in the PMU interrupt handler even though
we are supposed to have just disabled the PMU.

We can avoid this by making sure we never write PMAO back. We should not
lose interrupts because when the PMU is re-enabled the overflowed values
will cause another interrupt.

We also reorder the clearing of SAMPLE_ENABLE so that is done after the
PMU is frozen. Otherwise there is a small window between the clearing of
SAMPLE_ENABLE and the setting of FC where we could take an interrupt and
incorrectly see SAMPLE_ENABLE not set. This would for example change the
logic in perf_read_regs().

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
CC: <stable@vger.kernel.org> [v3.10]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01 11:49:54 +10:00
Paul Gortmaker
061d19f279 powerpc: Delete __cpuinit usage from all users
The __cpuinit type of throwaway sections might have made sense
some time ago when RAM was more constrained, but now the savings
do not offset the cost and complications.  For example, the fix in
commit 5e427ec2d0 ("x86: Fix bit corruption at CPU resume time")
is a good example of the nasty type of bugs that can be created
with improper use of the various __init prefixes.

After a discussion on LKML[1] it was decided that cpuinit should go
the way of devinit and be phased out.  Once all the users are gone,
we can then finally remove the macros themselves from linux/init.h.

This removes all the powerpc uses of the __cpuinit macros.  There
are no __CPUINIT users in assembly files in powerpc.

[1] https://lkml.org/lkml/2013/5/20/589

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Josh Boyer <jwboyer@gmail.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Cc: Kumar Gala <galak@kernel.crashing.org>
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-07-01 11:10:36 +10:00
Michael Ellerman
6772faa1ba powerpc/perf: Fix deadlock caused by calling printk() in PMU exception
In commit bc09c21 "Fix finding overflowed PMC in interrupt" we added
a printk() to the PMU exception handler. Unfortunately that is not safe.

The problem is that the PMU exception may run even when interrupts are
soft disabled, aka NMI context. We do this so that we can profile parts
of the kernel that have interrupts soft-disabled.

But by calling printk() from the exception handler, we can potentially
deadlock in the printk code on logbuf_lock, eg:

  [c00000038ba575c0] c000000000081928 .vprintk_emit+0xa8/0x540
  [c00000038ba576a0] c0000000007bcde8 .printk+0x48/0x58
  [c00000038ba57710] c000000000076504 .perf_event_interrupt+0x2d4/0x490
  [c00000038ba57810] c00000000001f6f8 .performance_monitor_exception+0x48/0x60
  [c00000038ba57880] c0000000000032cc performance_monitor_common+0x14c/0x180
  --- Exception: f01 (Performance Monitor) at c0000000007b25d4 ._raw_spin_lock_irq
  +0x64/0xc0
  [c00000038ba57bf0] c00000000007ed90 .devkmsg_read+0xd0/0x5a0
  [c00000038ba57d00] c0000000001c2934 .vfs_read+0xc4/0x1e0
  [c00000038ba57d90] c0000000001c2cd8 .SyS_read+0x58/0xd0
  [c00000038ba57e30] c000000000009d54 syscall_exit+0x0/0x98
  --- Exception: c01 (System Call) at 00001fffffbf6f7c
  SP (3ffff6d4de10) is in userspace

Fix it by making sure we only call printk() when we are not in NMI
context.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Cc: <stable@vger.kernel.org> # 3.9
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-10 08:36:32 +10:00
Michael Ellerman
58a032c3b1 powerpc/perf: Add missing SIER support
Commit 8f61aa3 "Add support for SIER" missed updates to siar_valid()
and perf_get_data_addr().

In both cases we need to check the SIER instead of mmcra.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01 08:29:29 +10:00
Michael Ellerman
cbda6aa10b powerpc/perf: Revert to original NO_SIPR logic
This is a revert and then some of commit 860aad7 "Add regs_no_sipr()".
This workaround was only needed on early chip versions.

As before NO_SIPR becomes a static flag of the PMU struct.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-06-01 08:29:29 +10:00
Michael Neuling
691231846c powerpc/perf: Fix setting of "to" addresses for BHRB
Currently we only set the "to" address in the branch stack when the CPU
explicitly gives us a value.  Unfortunately it only does this for XL form
branches (eg blr, bctr, bctar) and not I and B form branches (eg b, bc).

Fortunately if we read the instruction from memory we can extract the offset of
a branch and calculate the target address.

This adds a function power_pmu_bhrb_to() to calculate the target/to address of
the corresponding I and B form branches.  It handles branches in both user and
kernel spaces.  It also plumbs this into the perf brhb reading code.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14 16:00:22 +10:00
Michael Neuling
506e70d132 powerpc/pmu: Fix order of interpreting BHRB target entries
The current Branch History Rolling Buffer (BHRB) code misinterprets the order
of entries in the hardware buffer.  It assumes that a branch target address
will be read _after_ its corresponding branch.  In reality the branch target
comes before (lower mfbhrb entry) it's corresponding branch.

This is a rewrite of the code to take this into account.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14 16:00:22 +10:00
Michael Neuling
d52f2dc40b powerpc/perf: Move BHRB code into CONFIG_PPC64 region
The new Branch History Rolling buffer (BHRB) code is only useful on 64bit
processors, so move it into the #ifdef CONFIG_PPC64 region.

This avoids code bloat on 32bit systems.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-05-14 16:00:21 +10:00
Anshuman Khandual
3925f46bb5 powerpc/perf: Enable branch stack sampling framework
Provides basic enablement for perf branch stack sampling framework on
POWER8 processor based platforms. Adds new BHRB related elements into
cpu_hw_event structure to represent current BHRB config, BHRB filter
configuration, manage context and to hold output BHRB buffer during
PMU interrupt before passing to the user space. This also enables
processing of BHRB data and converts them into generic perf branch
stack data format.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:13:02 +10:00
Michael Ellerman
8f61aa325f powerpc/perf: Add support for SIER
On power8 we have a new SIER (Sampled Instruction Event Register), which
captures information about instructions when we have random sampling
enabled.

Add support for loading the SIER into pt_regs, overloading regs->dar.
Also set the new NO_SIPR flag in regs->result if we don't have SIPR.

Update regs_sihv/sipr() to look for SIPR/SIHV in SIER.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:10 +10:00
Michael Ellerman
860aad71fc powerpc/perf: Add regs_no_sipr()
On power8 the presence or absence of SIPR depends on settings at runtime,
so convert to using a dynamic flag for NO_SIPR. Existing backends that
set NO_SIPR unconditionally set the dynamic flag obviously.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:09 +10:00
Michael Ellerman
33904054b4 powerpc/perf: Add an accessor for regs->result
Add an accessor for regs->result so we can use it to store more flags in
future.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:08 +10:00
Michael Ellerman
5682c46026 powerpc/perf: Convert mmcra_sipr/sihv() to regs_sipr/sihv()
On power8 the SIPR and SIHV are not in MMCRA, so convert the routines
to take regs and change the names accordingly.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:08 +10:00
Michael Ellerman
7a7868326d powerpc/perf: Add an explict flag indicating presence of SLOT field
In perf_ip_adjust() we potentially use the MMCRA[SLOT] field to adjust
the reported IP of a sampled instruction.

Currently the logic is written so that if the backend does NOT have
the PPMU_ALT_SIPR flag set then we assume MMCRA[SLOT] exists.

However on power8 we do not want to set ALT_SIPR (it's in a third
location), and we also do not have MMCRA[SLOT].

So add a new flag which only indicates whether MMCRA[SLOT] exists.

Naively we'd set it on everything except power6/7, because they set
ALT_SIPR, and we've reversed the polarity of the flag. But it's more
complicated than that.

mpc7450 is 32-bit, and uses its own version of perf_ip_adjust()
which doesn't use MMCRA[SLOT], so it doesn't need the new flag set and
the behaviour is unchanged.

PPC970 (and I assume power4) don't have MMCRA[SLOT], so shouldn't have
the new flag set. This is a behaviour change on those cpus, though we
were probably getting lucky and the bits in question were 0.

power5 and power5+ set the new flag, behaviour unchanged.

power6 & power7 do not set the new flag, behaviour unchanged.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-04-26 16:11:07 +10:00
Linus Torvalds
9d3cae26ac Merge branch 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc
Pull powerpc updates from Benjamin Herrenschmidt:
 "So from the depth of frozen Minnesota, here's the powerpc pull request
  for 3.9.  It has a few interesting highlights, in addition to the
  usual bunch of bug fixes, minor updates, embedded device tree updates
  and new boards:

   - Hand tuned asm implementation of SHA1 (by Paulus & Michael
     Ellerman)

   - Support for Doorbell interrupts on Power8 (kind of fast
     thread-thread IPIs) by Ian Munsie

   - Long overdue cleanup of the way we handle relocation of our open
     firmware trampoline (prom_init.c) on 64-bit by Anton Blanchard

   - Support for saving/restoring & context switching the PPR (Processor
     Priority Register) on server processors that support it.  This
     allows the kernel to preserve thread priorities established by
     userspace.  By Haren Myneni.

   - DAWR (new watchpoint facility) support on Power8 by Michael Neuling

   - Ability to change the DSCR (Data Stream Control Register) which
     controls cache prefetching on a running process via ptrace by
     Alexey Kardashevskiy

   - Support for context switching the TAR register on Power8 (new
     branch target register meant to be used by some new specific
     userspace perf event interrupt facility which is yet to be enabled)
     by Ian Munsie.

   - Improve preservation of the CFAR register (which captures the
     origin of a branch) on various exception conditions by Paulus.

   - Move the Bestcomm DMA driver from arch powerpc to drivers/dma where
     it belongs by Philippe De Muyter

   - Support for Transactional Memory on Power8 by Michael Neuling
     (based on original work by Matt Evans).  For those curious about
     the feature, the patch contains a pretty good description."

(See commit db8ff90702: "powerpc: Documentation for transactional
memory on powerpc" for the mentioned description added to the file
Documentation/powerpc/transactional_memory.txt)

* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (140 commits)
  powerpc/kexec: Disable hard IRQ before kexec
  powerpc/85xx: l2sram - Add compatible string for BSC9131 platform
  powerpc/85xx: bsc9131 - Correct typo in SDHC device node
  powerpc/e500/qemu-e500: enable coreint
  powerpc/mpic: allow coreint to be determined by MPIC version
  powerpc/fsl_pci: Store the pci ctlr device ptr in the pci ctlr struct
  powerpc/85xx: Board support for ppa8548
  powerpc/fsl: remove extraneous DIU platform functions
  arch/powerpc/platforms/85xx/p1022_ds.c: adjust duplicate test
  powerpc: Documentation for transactional memory on powerpc
  powerpc: Add transactional memory to pseries and ppc64 defconfigs
  powerpc: Add config option for transactional memory
  powerpc: Add transactional memory to POWER8 cpu features
  powerpc: Add new transactional memory state to the signal context
  powerpc: Hook in new transactional memory code
  powerpc: Routines for FP/VSX/VMX unavailable during a transaction
  powerpc: Add transactional memory unavaliable execption handler
  powerpc: Add reclaim and recheckpoint functions for context switching transactional memory processes
  powerpc: Add FP/VSX and VMX register load functions for transactional memory
  powerpc: Add helper functions for transactional memory context switching
  ...
2013-02-23 17:09:55 -08:00
Sukadev Bhattiprolu
1c53a27072 perf/POWER7: Make generic event translations available in sysfs
Make the generic perf events in POWER7 available via sysfs.

	$ ls /sys/bus/event_source/devices/cpu/events
	branch-instructions
	branch-misses
	cache-misses
	cache-references
	cpu-cycles
	instructions
	stalled-cycles-backend
	stalled-cycles-frontend

	$ cat /sys/bus/event_source/devices/cpu/events/cache-misses
	event=0x400f0

This patch is based on commits that implement this functionality on x86.
Eg:
	commit a47473939d
	Author: Jiri Olsa <jolsa@redhat.com>
	Date:   Wed Oct 10 14:53:11 2012 +0200

	    perf/x86: Make hardware event translations available in sysfs

Changelog:[v2]
	[Jiri Osla] Drop EVENT_ID() macro since it is only used once.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: linuxppc-dev@ozlabs.org
Link: http://lkml.kernel.org/r/20130123062454.GD13720@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
2013-01-31 13:07:50 -03:00
sukadev@linux.vnet.ibm.com
f53d168c02 perf/Power: PERF_EVENT_IOC_ENABLE does not reenable event
perf/Power: PERF_EVENT_IOC_ENABLE does not reenable event

If we disable a perf event because we exceeded the specified ->event_limit,
power_pmu_stop() sets the PERF_HES_STOPPED flag on the event.

If the application then re-enables the event using PERF_EVENT_IOC_ENABLE
ioctl, we don't ever clear this STOPPED flag. Consequently, the user space
is never notified of the event.

Following message has more background and test case.

    http://lists.eecs.utk.edu/pipermail/ptools-perfapi/2012-October/002528.html

Used the following test cases to verify that this patch works on latest PAPI.

	$ papi.git/src/ctests/nonthread PAPI_TOT_CYC@5000000

	$ papi.git/src/ctests/overflow_single_event

Changelog[v2]:
	- [Paul Mackerras] Also clear PERF_HES_UPTODATE flag since we are
	  restarting the event; cleanup comments and patch description.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-29 11:35:07 +11:00
Michael Neuling
e13e895f84 powerpc/perf: Fix for PMCs not making progress
On POWER7 when we have really small counts left before overflow, we can take a
PMU IRQ, but the PMC gets wound back to just before the overflow.

If the kernel is setting the PMC to a value just before the overflow, we can
get interrupted again without the PMC making any progress (ie another buggy
overflow).  In this case, we can end up making no forward progress, with the
PMC interrupt returning us to the same count over and over.

The below detects when we are making no forward progress (ie. delta = 0) and
then increases the amount left before the overflow.  This stops us from locking
up.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
cc: Paul Mackerras <paulus@samba.org>
cc: Anton Blanchard <anton@samba.org>
cc: Linux PPC dev <linuxppc-dev@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-10 17:02:04 +11:00
Michael Neuling
bc09c219b2 powerpc/perf: Fix finding overflowed PMC in interrupt
If a PMC is about to overflow on a counter that's on an active perf event
(ie. less than 256 from the end) and a _different_ PMC overflows just at this
time (a PMC that's not on an active perf event), we currently mark the event as
found, but in reality it's not as it's likely the other PMC that caused the
IRQ.  Since we mark it as found the second catch all for overflows doesn't run,
and we don't reset the overflowing PMC ever.  Hence we keep hitting that same
PMC IRQ over and over and don't reset the actual overflowing counter.

This is a rewrite of the perf interrupt handler for book3s to get around this.
We now check to see if any of the PMCs have actually overflowed (ie >=
0x80000000).  If yes, record it for active counters and just reset it for
inactive counters.  If it's not overflowed, then we check to see if it's one of
the buggy power7 counters and if it is, record it and continue.  If none of the
PMCs match this, then we make note that we couldn't find the PMC that caused
the IRQ.

Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
cc: Paul Mackerras <paulus@samba.org>
cc: Anton Blanchard <anton@samba.org>
cc: Linux PPC dev <linuxppc-dev@ozlabs.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2013-01-10 17:02:01 +11:00
Benjamin Herrenschmidt
72523d8082 Revert "powerpc/perf: Use pmc_overflow() to detect rolled back events"
This reverts commit 813312110b.

This revert was requested by the author of the patch as it seems
to cause system hangs with some low frequency events
2012-10-18 10:36:11 +11:00
sukadev@linux.vnet.ibm.com
e6878835ac powerpc/perf: Sample only if SIAR-Valid bit is set in P7+
powerpc/perf: Sample only if SIAR-Valid bit is set in P7+

On POWER7+ two new bits (mmcra[35] and mmcra[36]) indicate whether the
contents of SIAR and SDAR are valid.

For marked instructions on P7+, we must save the contents of SIAR and
SDAR registers only if these new bits are set.

This code/check for the SIAR-Valid bit is specific to P7+, so rather than
waste a CPU-feature bit use the PVR flag.

Note that Carl Love proposed a similar change for oprofile:

        https://lkml.org/lkml/2012/6/22/309

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-27 12:51:05 +10:00
Benjamin Herrenschmidt
fff34b3412 Merge branch 'merge' into next
Brings in various bug fixes from 3.6-rcX
2012-09-07 09:48:59 +10:00
Michael Ellerman
d3dbeef657 powerpc: Rename 64-bit PVR constants to PVR_foo
We have an old FIXME in reg.h which points out that we should standardise
on PVR_foo for our PVR #defines. Currently we use PVR_ on 32-bit and PV_
on 64-bit.

So do that rename and remove the FIXME.

Seeing as we're touching all but one usage of __is_processor(), rename it
to something less ugly and more indicative of what it does, which is
simply to check the PVR version.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-09-05 15:19:35 +10:00
Sukadev Bhattiprolu
813312110b powerpc/perf: Use pmc_overflow() to detect rolled back events
For certain speculative events on Power7, 'perf stat' reports far higher
event count than 'perf record' for the same event.

As described in following commit, a performance monitor exception is raised
even when the the performance events are rolled back.

        commit 0837e3242c
        Author: Anton Blanchard <anton@samba.org>
        Date:   Wed Mar 9 14:38:42 2011 +1100

perf_event_interrupt() records an event only when an overflow occurs. But
this check for overflow is a simple 'if (val < 0)'.

Because the events are rolled back, this check for overflow fails and the
event is not recorded. perf_event_interrupt() later uses pmc_overflow() to
detect the overflow and resets the counters and the events are lost completely.

To properly detect the overflow of rolled back events, use pmc_overflow()
even when recording events.

To reproduce:
        $ cat strcpy.c
        #include <stdio.h>
        #include <string.h>
        main()
        {
                char buf[256];

                alarm(5);
                while(1)
                        strcpy(buf, "string1");
        }

        $ perf record -e r20014 ./strcpy
        $ perf report -n > report.1
        $ perf stat -e r20014 > report.2
        # Compare report.1 and report.2

Reported-by: Maynard Johnson <mpjohn@us.ibm.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-08-24 20:26:10 +10:00
Anton Blanchard
5c093efa6f powerpc/perf: Always use pt_regs for userspace samples
At the moment we always use the SIAR if the PMU supports continuous
sampling. Unfortunately the SIAR and the PMU exception are not
synchronised for non marked events so we can end up with callchains
that dont make sense.

The following patch checks the HV and PR bits for samples coming from
userspace and always uses pt_regs for them. Userspace will never have
interrupts off so there is no real advantage to using the SIAR for
non marked events in userspace.

I had experimented with a patch that did a similar thing for kernel
samples but we lost a significant amount of information. I was
unable to profile any of our early exception code for example.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-10 19:18:43 +10:00
Anton Blanchard
75382aa72f powerpc/perf: Move code to select SIAR or pt_regs into perf_read_regs
The logic to choose whether to use the SIAR or get the information
out of pt_regs is going to get more complicated, so do it once in
perf_read_regs.

We overload regs->result which is gross but we are already doing it
with regs->dsisr.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-10 19:18:41 +10:00
Anton Blanchard
68b30bb9f0 powerpc/perf: Create mmcra_sihv/mmcra_sipv helpers
We want to access the MMCRA_SIHV and MMCRA_SIPR bits elsewhere so
create mmcra_sihv and mmcra_sipr which hide the differences between
the old and new layout of the bits.

Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
2012-07-10 19:18:38 +10:00
Robert Richter
fd0d000b2c perf: Pass last sampling period to perf_sample_data_init()
We always need to pass the last sample period to
perf_sample_data_init(), otherwise the event distribution will be
wrong. Thus, modifiyng the function interface with the required period
as argument. So basically a pattern like this:

        perf_sample_data_init(&data, ~0ULL);
        data.period = event->hw.last_period;

will now be like that:

        perf_sample_data_init(&data, ~0ULL, event->hw.last_period);

Avoids unininitialized data.period and simplifies code.

Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1333390758-10893-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-05-09 15:23:12 +02:00