perf events, powerpc: Add POWER7 stalled-cycles-frontend/backend events
Extent the POWER7 PMU driver with definitions for generic front-end and back-end
stall events.
As explained in Ingo's original comment(8f62242246
), the exact definitions of the stall events are very much processor specific as
different things mean different in their respective instruction pipeline. These
two Power7 raw events are the closest approximation to the concept detailed in
Ingo's comment.
[PERF_COUNT_HW_STALLED_CYCLES_FRONTEND] = 0x100f8, /* GCT_NOSLOT_CYC */
It means cycles when the Global Completion Table has no slots from this thread
[PERF_COUNT_HW_STALLED_CYCLES_BACKEND] = 0x4000a, /* CMPLU_STALL */
It means no groups completed and GCT not empty for this thread
Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The firmware doesn't wait after lifting the PCI reset. However it does
timestamp it in the device tree. We use that to ensure we wait long
enough (3s is our current arbitrary setting) from that timestamp to
actually probing the bus.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This implements support for MSIs on p5ioc2 PHBs. We only support
MSIs on the PCIe PHBs, not the PCI-X ones as the later hasn't been
properly verified in HW.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds support for PCI-X and PCIe on the p5ioc2 IO hub using
OPAL. This includes allocating & setting up TCE tables and config
space access routines.
This also supports fallbacks via RTAS when OPAL is absent, using
legacy TCE format pre-allocated via the device-tree (BML style)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
OPAL can handle various interrupt for us such as Machine Checks (it
performs all sorts of recovery tasks and passes back control to us with
informations about the error), Hardware Management Interrupts and Softpatch
interrupts.
This wires up the mechanisms and prints out specific informations returned
by HAL when a machine check occurs.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We do the minimum which is to "pass" interrupts to HAL, which
makes the console smoother and will allow us to implement
interrupt based completion and console.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
OPAL handles HW access to the various ICS or equivalent chips
for us (with the exception of p5ioc2 based HEA which uses a
different backend) similarily to what RTAS does on pSeries.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Implements OPAL RTC and NVRAM support and wire all that up to
the powernv platform.
We use RTAS for RTC as a fallback if available. Using RTAS for nvram
is not supported yet, pending some rework/cleanup and generalization
of the pSeries & CHRP code. We also use RTAS fallbacks for power off
and reboot
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This calls the respective HAL functions, and spin on hal_poll_event()
to ensure the HAL has a chance to communicate with the FSP to trigger
the reboot or shutdown operation
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds a udbg and an hvc console backend for supporting a console
using the OPAL console interfaces.
On OPAL v1 we have hvc0 mapped to whatever console the system was
configured for (network or hvsi serial port) via the service
processor.
On OPAL v2 we have hvcN mapped to the Nth console provided by OPAL
which generally corresponds to:
hvc0 : network console (raw protocol)
hvc1 : serial port S1 (hvsi)
hvc2 : serial port S2 (hvsi)
Note: At this point, early debug console only works with OPAL v1
and shouldn't be enabled in a normal kernel.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
OPAL v2 is instantiated in a way similar to RTAS using Open Firmware
client interface calls, and the resulting address and entry point are
put in the device-tree
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Add definition of OPAL interfaces along with the wrappers to call
into OPAL runtime and the early device-tree parsing hook to locate
the OPAL runtime firmware.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We stash it in boot_command_line which isn't in BSS and so won't
be overwritten. We then use that as a default cmd_line before
we walk the device-tree.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On machines supporting the OPAL firmware version 1, the system
is initially booted under pHyp. We then use a special hypercall
to verify if OPAL is available and if it is, we then trigger
a "takeover" which disables pHyp and loads the OPAL runtime
firmware, giving control to the kernel in hypervisor mode.
This patch add the necessary code to detect that the OPAL takeover
capability is present when running under PowerVM (aka pHyp) and
perform said takeover to get hypervisor control of the processor.
To perform the takeover, we must first use RTAS (within Open
Firmware runtime environment) to start all processors & threads,
in order to give control to OPAL on all of them. We then call
the takeover hypercall on everybody, OPAL will re-enter the kernel
main entry point passing it a flat device-tree.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We used to overwrite with CONFIG_CMDLINE if we found a chosen
node but failed to get bootargs out of it or they were empty,
unless CONFIG_CMDLINE_FORCE is set.
Instead change that to overwrite if "data" is non empty after
the bootargs check. It allows arch code to have other mechanisms
to retrieve the command line prior to parsing the device-tree.
Note: CONFIG_CMDLINE_FORCE case should ideally be handled elsewhere
as it won't work as it-is if the device-tree has no /chosen node
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
CC: devicetree-discuss@lists-ozlabs.org
CC: Grant Likely <grant.likely@secretlab.ca>
Acked-by: Grant Likely <grant.likely@secretlab.ca>
This adds a skeletton for the new Power "Non Virtualized"
platform which will be used by machines supporting running
without an hypervisor, for example in order to run KVM.
These machines will be using a new firmware called OPAL
for which the support will be provided by later patches.
The PowerNV platform is intended to be also usable under
the BML environment used internally for early CPU bringup
which is why the code also supports using RTAS instead of
OPAL in various places.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With OPAL, r8 and r9 will be used to pass the OPAL base and entry
for debugging purposes (those informations are also in the
device-tree). We don't want to clobber those registers that
early.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This new function is used to properly setup the PCI Express Max Payload Size
(and in some circumstances Max Read Request Size).
Some systems will not operate properly if these aren't set correctly and
the firmware doesn't always do it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This adds more generic support for doing CPU hotplug with a simple
idle loop and no actual reset of the processors. The generic
smp_generic_kick_cpu() does the hotplug bringup trick if the PACA
shows that the CPU has already been started at boot and we provide
an accessor for the CPU state.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
It was preventing the global early debug selection whenever KVM was enabled
instead of only preventing the 440 specific one.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The Power platform requires the partner info buffer to be page aligned
otherwise it will fail the partner info hcall with H_PARAMETER. Switch
from using kmalloc to allocate this buffer to __get_free_page to ensure
page alignment.
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The icswx code introduced an A-B B-A deadlock:
CPU0 CPU1
---- ----
lock(&anon_vma->mutex);
lock(&mm->mmap_sem);
lock(&anon_vma->mutex);
lock(&mm->mmap_sem);
Instead of using the mmap_sem to keep mm_users constant, take the
page table spinlock.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: <stable@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we echo an address the hypervisor doesn't like to
/sys/devices/system/memory/probe we oops the box:
# echo 0x10000000000 > /sys/devices/system/memory/probe
kernel BUG at arch/powerpc/mm/hash_utils_64.c:541!
The backtrace is:
create_section_mapping
arch_add_memory
add_memory
memory_probe_store
sysdev_class_store
sysfs_write_file
vfs_write
SyS_write
In create_section_mapping we BUG if htab_bolt_mapping returned
an error. A better approach is to return an error which will
propagate back to userspace.
Rerunning the test with this patch applied:
# echo 0x10000000000 > /sys/devices/system/memory/probe
-bash: echo: write error: Invalid argument
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
While converting code to use for_each_node_by_type I noticed a
number of coding style issues.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Use for_each_node_by_type instead of open coding it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
During memory hotplug testing, I got the following warning:
ERROR: Bad of_node_put() on /memory@0
of_node_release
kref_put
of_node_put
of_find_node_by_type
hot_add_node_scn_to_nid
hot_add_scn_to_nid
memory_add_physaddr_to_nid
...
of_find_node_by_type() loop does the of_node_put for us so we only
need the handle the case where we terminate the loop early.
As suggested by Stephen Rothwell we can do the of_node_put
unconditionally outside of the loop since of_node_put handles a
NULL argument fine.
Signed-off-by: Anton Blanchard <anton@samba.org>
Cc: stable@kernel.org
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have two identical definitions of RECLAIM_DISTANCE, looks like
the patch got applied twice. Remove one.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On big POWER7 boxes we see large amounts of CPU time in system
processes like workqueue and watchdog kernel threads.
We currently rebalance the entire machine each time a task goes
idle and this is very expensive on large machines. Disable newidle
balancing at the node level and rely on the scheduler tick to
rebalance across nodes.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The largest POWER7 boxes have 32 nodes. SD_NODES_PER_DOMAIN groups
nodes into chunks of 16 and adds a global balancing domain
(SD_ALLNODES) above it.
If we bump SD_NODES_PER_DOMAIN to 32, then we avoid this extra
level of balancing on our largest boxes.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We want to override the default value of SD_NODES_PER_DOMAIN on ppc64,
so move it into linux/topology.h.
Signed-off-by: Anton Blanchard <anton@samba.org>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When chasing a performance issue on ppc64, I noticed tasks
communicating via a pipe would often end up on different nodes.
It turns out SD_WAKE_AFFINE is not set in our node defition. Commit
9fcd18c9e6 (sched: re-tune balancing) enabled SD_WAKE_AFFINE
in the node definition for x86 and we need a similar change for
ppc64.
I used lmbench lat_ctx and perf bench pipe to verify this fix. Each
benchmark was run 10 times and the average taken.
lmbench lat_ctx:
before: 66565 ops/sec
after: 204700 ops/sec
3.1x faster
perf bench pipe:
before: 5.6570 usecs
after: 1.3470 usecs
4.2x faster
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
* 'irq-fixes-for-linus' of git://tesla.tglx.de/git/linux-2.6-tip:
x86, iommu: Mark DMAR IRQ as non-threaded
genirq: Make irq_shutdown() symmetric vs. irq_startup again
* 'for-linus' of git://github.com/chrismason/linux:
Btrfs: only clear the need lookup flag after the dentry is setup
BTRFS: Fix lseek return value for error
Btrfs: don't change inode flag of the dest clone file
Btrfs: don't make a file partly checksummed through file clone
Btrfs: fix pages truncation in btrfs_ioctl_clone()
btrfs: fix d_off in the first dirent
When a xHC host is unable to handle isochronous transfer in the
interval, it reports a Missed Service Error event and skips some tds.
Currently xhci driver handles MSE event in the following ways:
1. When encounter a MSE event, set ep->skip flag, update event ring
dequeue pointer and return.
2. When encounter the next event on this ep, the driver will run the
do-while loop, fetch td from ep's td_list to find the td
corresponding to this event. All tds missed are marked as short
transfer(-EXDEV).
The do-while loop will end in two ways:
1. If the td pointed by the event trb is found;
2. If the ep ring's td_list is empty.
However, if a buggy HW reports some unpredicted event (for example, an
overrun event following a MSE event while the ep ring is actually not
empty), the driver will never find the td, and it will loop until the
td_list is empty.
Unfortunately, the spinlock is dropped when give back a urb in the
do-while loop. During the spinlock released period, the class driver
may still submit urbs and add tds to the td_list. This may cause
disaster, since the td_list will never be empty and the loop never ends,
and the system hangs.
To fix this, count the number of TDs on the ep ring before skipping TDs,
and quit the loop when skipped that number of tds. This guarantees the
do-while loop will end after certain number of cycles, and driver will
not be trapped in an infinite loop.
Signed-off-by: Andiry Xu <andiry.xu@amd.com>
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sometimes, when a USB 3.0 device is disconnected, the Intel Panther
Point xHCI host controller will report a link state change with the
state set to "SS.Inactive". This causes the xHCI host controller to
issue a warm port reset, which doesn't finish before the USB core times
out while waiting for it to complete.
When the warm port reset does complete, and the xHC gives back a port
status change event, the xHCI driver kicks khubd. However, it fails to
set the bit indicating there is a change event for that port because the
logic in xhci-hub.c doesn't check for the warm port reset bit.
After that, the warm port status change bit is never cleared by the USB
core, and the xHC stops reporting port status change bits. (The xHCI
spec says it shouldn't report more port events until all change bits are
cleared.) This means any port changes when a new device is connected
will never be reported, and the port will seem "dead" until the xHCI
driver is unloaded and reloaded, or the computer is rebooted. Fix this
by making the xHCI driver set the port change bit when a warm port reset
change bit is set.
A better solution would be to make the USB core handle warm port reset
in differently, merging the current code with the standard port reset
code that does an incremental backoff on the timeout, and tries to
complete the port reset two more times before giving up. That more
complicated fix will be merged next window, and this fix will be
backported to stable.
This should be backported to kernels as old as 3.0, since that was the
first kernel with commit a11496ebf3 ("xHCI: warm reset support").
Signed-off-by: Sarah Sharp <sarah.a.sharp@linux.intel.com>
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix build when CONFIG_ISA_DMA_API is enabled but
CONFIG_COMEDI_PCI[_DRIVERS] is not enabled.
Fixes these build errors:
drivers/staging/comedi/drivers/ni_labpc.c: In function 'labpc_ai_cmd':
drivers/staging/comedi/drivers/ni_labpc.c:1351: error: implicit declaration of function 'labpc_suggest_transfer_size'
drivers/staging/comedi/drivers/ni_labpc.c: At top level:
drivers/staging/comedi/drivers/ni_labpc.c:1802: error: conflicting types for 'labpc_suggest_transfer_size'
drivers/staging/comedi/drivers/ni_labpc.c:1351: note: previous implicit declaration of 'labpc_suggest_transfer_size' was here
Signed-off-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Even with just the interface limited to admin, there really is little to
reason to give byte-per-byte counts for taskstats. So round it down to
something less intrusive.
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ok, this isn't optimal, since it means that 'iotop' needs admin
capabilities, and we may have to work on this some more. But at the
same time it is very much not acceptable to let anybody just read
anybody elses IO statistics quite at this level.
Use of the GENL_ADMIN_PERM suggested by Johannes Berg as an alternative
to checking the capabilities by hand.
Reported-by: Vasiliy Kulikov <segoon@openwall.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Acked-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a new udbg driver for the PS3 gelic Ehthernet device.
This driver shares only a few stucture and constant definitions with the
gelic Ethernet device driver, so is implemented as a stand-alone driver
with no dependencies on the gelic Ethernet device driver.
Signed-off-by: Hector Martin <hector@marcansoft.com>
Signed-off-by: Andre Heider <a.heider@gmail.com>
Signed-off-by: Geoff Levand <geoff@infradead.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since commit 188917e183, /proc/ppc64 is a
symlink to /proc/powerpc/. That means that creating /proc/ppc64/eeh will
end up with a unaccessible file, that is not listed under /proc/powerpc/
and, then, not listed under /proc/ppc64/.
Creating /proc/powerpc/eeh fixes that problem and maintain the
compatibility intended with the ppc64 symlink.
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@kernel.org> [3.x]
This should fix the following warning:
LD arch/powerpc/sysdev/xics/built-in.o
WARNING: arch/powerpc/sysdev/xics/built-in.o(.text+0x1310): Section mismatch in
reference from the function .icp_native_init() to the function
.init.text:.icp_native_init_one_node()
The function .icp_native_init() references
the function __init .icp_native_init_one_node().
This is often because .icp_native_init lacks a __init
annotation or the annotation of .icp_native_init_one_node is wrong.
icp_native_init() is only referenced in `arch/powerpc/sysdev/xics/xics-common.c'
by xics_init() which is itself marked with __init.
= not built-tested =
Reported-by: Timur Tabi <timur@freescale.com>
Signed-off-by: Arnaud Lacombe <lacombar@gmail.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
During hotplug CPU add we get the following error:
Unexpected Error (0) returned from configure-connector
ibm,configure-connector returns 0 for configuration complete, so
catch this and avoid the error.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: <stable@kernel.org>
In SMP mode, the kernel would produce call trace when resumed
from hibernation. The reason is when the function destroy_context
is called to drop the resuming mm context, the mm->context.active
is 1 which is wrong and should be zero.
We pass the current->active_mm as previous mm context to function
switch_mmu_context to decrease the context.active by 1.
In UP mode, there is no effect.
Signed-off-by: Tang Yuantian <b29983@freescale.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The various port_init_hw methods of ppc4xx_pciex_hwops should have been
marked __init and when I added ppc4xx_pciex_port_reset_sdr(), which is
__init. This added many section mismatch warnings like:
WARNING: arch/powerpc/sysdev/built-in.o(.text+0x5c68): Section mismatch in reference from the function ppc440spe_pciex_init_port_hw() to the function .init.text:ppc4xx_pciex_port_reset_sdr()
The function ppc440spe_pciex_init_port_hw() references
the function __init ppc4xx_pciex_port_reset_sdr().
This is often because ppc440spe_pciex_init_port_hw lacks a __init
annotation or the annotation of ppc4xx_pciex_port_reset_sdr is wrong.
Trivial patch to silence those warnings.
Reported-By: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
Yours Tony
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Based on a patch by Michael Ellerman <michael@ellerman.id.au>
Patch was simply forward ported upstream.
Jimi Xenidis <jimix@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Based on a patch by Benjamin Herrenschmidt <benh@kernel.crashing.org>
Modernized and slightly modified to not record erros into the nvram
log since we do not have that device driver just yet.
Jimi Xenidis <jimix@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Some config selections were applied to the platform (reference board)
when they actuall apply to the chip.
Signed-off-by: Jimi Xenidis <jimix@pobox.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>