The cxl driver will use infrastructure from pnv_php to handle device tree
updates when switching bi-modal CAPI cards into CAPI mode.
To enable this, export pnv_php_find_slot() and
pnv_php_set_slot_power_state(), and add corresponding declarations, as well
as the definition of struct pnv_php_slot, to asm/pnv-pci.h.
Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
Cc: linux-pci@vger.kernel.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The Mellanox CX4 in cxl mode uses a hybrid interrupt model, where
interrupts are routed from the networking hardware to the XSL using the
MSIX table, and from there will be transformed back into an MSIX
interrupt using the cxl style interrupts (i.e. using IVTE entries and
ranges to map a PE and AFU interrupt number to an MSIX address).
We want to hide the implementation details of cxl interrupts as much as
possible. To this end, we use a special version of the MSI setup &
teardown routines in the PHB while in cxl mode to allocate the cxl
interrupts and configure the IVTE entries in the process element.
This function does not configure the MSIX table - the CX4 card uses a
custom format in that table and it would not be appropriate to fill that
out in generic code. The rest of the functionality is similar to the
"Full MSI-X mode" described in the CAIA, and this could be easily
extended to support other adapters that use that mode in the future.
The interrupts will be associated with the default context. If the
maximum number of interrupts per context has been limited (e.g. by the
mlx5 driver), it will automatically allocate additional kernel contexts
to associate extra interrupts as required. These contexts will be
started using the same WED that was used to start the default context.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds support for the peer model of the cxl kernel api to the
PowerNV PHB, in which physical function 0 represents the cxl function on
the card (an XSL in the case of the CX4), which other physical functions
will use for memory access and interrupt services. It is referred to as
the peer model as these functions are peers of one another, as opposed
to the Virtual PHB model which forms a hierarchy.
This patch exports APIs to enable the peer mode, check if a PCI device
is attached to a PHB in this mode, and to set and get the peer AFU for
this mode.
The cxl driver will enable this mode for supported cards by calling
pnv_cxl_enable_phb_kernel_api(). This will set a flag in the PHB to note
that this mode is enabled, and switch out it's controller_ops for the
cxl version.
The cxl version of the controller_ops struct implements it's own
versions of the enable_device_hook and release_device to handle
refcounting on the peer AFU and to allocate a default context for the
device.
Once enabled, the cxl kernel API may not be disabled on a PHB. Currently
there is no safe way to disable cxl mode short of a reboot, so until
that changes there is no reason to support the disable path.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The support for using the Mellanox CX4 in cxl mode will require
additions to the PHB code. In preparation for this, move the existing
cxl code out of pci-ioda.c into a separate pci-cxl.c file to keep things
more organised.
Signed-off-by: Ian Munsie <imunsie@au1.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The array crash_shutdown_handles[] has size CRASH_HANDLER_MAX, thus when
we loop over the elements of the list we check crash_shutdown_handles[i]
&& i < CRASH_HANDLER_MAX. However this means that when we increment i to
CRASH_HANDLER_MAX we will perform an out of bound array access checking
the first condition before exiting on the second condition.
To avoid the out of bounds access, simply reorder the loop conditions.
Fixes: 1d1451655b ("powerpc: Add array bounds checking to crash_shutdown_handlers")
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The subsequent test for RTAS along with the LPAR test are sufficient
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The test is unnecessary, the FW_FEATURE_LPAR is sufficient as there
exist no other LPAR type that has RTAS.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
ge_imp3a_pic_init() is called way beyond the unflattening of
the tree, it shouldn't be using of_flat_dt_*
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some bit of SPU code was using the FDT rather than the expanded
device-tree. Fix it.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function is called by both 32-bit and 64-bit early setup right
after early_init_devtree(). All it does is run yet another early
DT parser which is precisely what early_init_devtree() is about,
so move it in there.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Anything in early_setup() needs to be justified to be there, in
this case, we need the trampolines before we can take exceptions
and thus before we turn on the MMU.
Also remove a pretty meaningless and misplaced debug message
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Fix comment formatting]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
early_init() is called in-place before kernel relocation and using
whatever MMU setup exists at the point the kernel is entered.
machine_init() is called after relocation and after some initial
mapping of PAGE_OFFSET has been established (typically using BATs
on 6xx/7xx/7xxx processors or some form of bolted TLB on others).
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
eeh_cache.c doesn't build cleanly with -DDEBUG when
CONFIG_PHYS_ADDR_T_64BIT is set, as a couple of pr_debug()s use "%lx" for
resource_size_t parameters.
Use "%pap" instead, as it's the correct format specifier for types deriving
from phys_addr_t.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On some environments (prototype machines, some simulators, etc...)
there is no functional interrupt source to signal completion, so
we rely on the fairly slow OPAL heartbeat.
In a number of cases, the calls complete very quickly or even
immediately. We've observed that it helps a lot to wakeup the OPAL
heartbeat thread before waiting for event in those cases, it will
call OPAL immediately to collect completions for anything that
finished fast enough.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This is so we can use the powernv_flash mtd driver as an block
device.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A strange behaviour is observed when comparing PCI hotplug in QEMU, between
x86 and pseries. If you consider the following steps:
- start a VM
- add a PCI device via the QEMU monitor before the rtasd has started (for
example starting the VM in paused state, or hotplug during FW or boot
loader)
- resume the VM execution
The x86 kernel detects the PCI device, but the pseries one does not.
This happens because the rtasd kernel worker is currently started under
device_initcall, while PCI probing happens earlier under subsys_initcall.
As a consequence, if we have a pending RTAS event at boot time, a message
is printed and the event is dropped.
This patch moves all the initialization of rtasd to arch_initcall, which is
run before subsys_call: this way, logging_enabled is true when the RTAS
event pops up and it is not lost anymore.
The proc fs bits stay at device_initcall because they cannot be run before
fs_initcall.
Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The domain/PHB field of PCI addresses has its value obtained from a
global variable, incremented each time a new domain (represented by
struct pci_controller) is added on the system. The domain addition
process happens during boot or due to PHB hotplug add.
As recent kernels are using predictable naming for network interfaces,
the network stack is more tied to PCI naming. This can be a problem in
hotplug scenarios, because PCI addresses will change if devices are
removed and then re-added. This situation seems unusual, but it can
happen if a user wants to replace a NIC without rebooting the machine,
for example.
This patch changes the way PCI domain values are generated: now, we use
device-tree properties to assign fixed PHB numbers to PCI addresses
when available (meaning pSeries and PowerNV cases). We also use a bitmap
to allow dynamic PHB numbering when device-tree properties are not
used. This bitmap keeps track of used PHB numbers and if a PHB is
released (by hotplug operations for example), it allows the reuse of
this PHB number, avoiding PCI address to change in case of device remove
and re-add soon after. No functional changes were introduced.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@linux.vnet.ibm.com>
Reviewed-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Ian Munsie <imunsie@au1.ibm.com>
Acked-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
[mpe: Drop unnecessary machine_is(pseries) test]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Despite attempting to fix this in commit fb36e90736 ("powerpc/pci: Fix
SRIOV not building without EEH enabled"), the build is still broken when
PCI_IOV=y and EEH=n (eg. g5_defconfig with PCI_IOV=y):
arch/powerpc/kernel/pci_dn.c: In function ‘remove_dev_pci_data’:
arch/powerpc/kernel/pci_dn.c:230:18: error: unused variable ‘edev’
Incorporate Ben's idea of using __maybe_unused to avoid so many #ifdefs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For memory hotplug to work, the MMU code needs to provide the functions
create_section_mapping() and remove_section_mapping() to respectively
map and unmap portions of the linear mapping.
At the moment only hash64 provides these, so we provide weak stubs that
just error out. This fixes the build with configurations such as 64-bit
BookE with CONFIG_MEMORY_HOTPLUG enabled.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds an OPAL console backend to the powerpc boot wrapper so
that decompression failures inside the wrapper can be reported to the
user. This is important since it typically indicates data corruption in
the firmware and other nasty things.
Currently this only works when building a little endian kernel. When
compiling a 64 bit BE kernel the wrapper is always build 32 bit to be
compatible with some 32 bit firmwares. BE support will be added at a
later date. Another limitation of this is that only the "raw" type of
OPAL console is supported, however machines that provide a hvsi console
also provide a raw console so this is not an issue in practice.
Actually-written-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
[mpe: Move #ifdef __powerpc64__ to avoid warnings on 32-bit]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds the kernel command line parameter "no_tb_segs" which
forces the kernel to use 256MB rather than 1TB segments. Forcing the use
of 256MB segments makes it considerably easier to test code that depends
on an SLB miss occurring.
Suggested-by: Michael Neuling <mikey@neuling.org>
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Power ISAv3 adds a large decrementer (LD) mode which increases the size
of the decrementer register. The size of the enlarged decrementer
register is between 32 and 64 bits with the exact size being dependent
on the implementation. When in LD mode, reads are sign extended to 64
bits and a decrementer exception is raised when the high bit is set (i.e
the value goes below zero). Writes however are truncated to the physical
register width so some care needs to be taken to ensure that the high
bit is not set when reloading the decrementer. This patch adds support
for using the LD inside the host kernel on processors that support it.
When LD mode is supported firmware will supply the ibm,dec-bits property
for CPU nodes to allow the kernel to determine the maximum decrementer
value. Enabling LD mode is a hypervisor privileged operation so the kernel
can only enable it manually when running in hypervisor mode. Guests that
support LD mode can request it using the "ibm,client-architecture-support"
firmware call (not implemented in this patch) or some other platform
specific method. If this property is not supplied then the traditional
decrementer width of 32 bit is assumed and LD mode will not be enabled.
This patch was based on initial work by Jack Miller.
Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Check the assembler supports -maltivec by wrapping it with
call as-option.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
cmm_mem_going_offline() is (only) called from cmm_memory_cb(), which
sends the return value through notifier_from_errno(). The latter
expects 0 or -errno (notifier_to_errno(notifier_from_errno(x)) is 0
for any x >= 0, so passing a positive value cannot make sense). Hence
negate ENOMEM.
Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If ppc_rtas() is called with args.nargs == 16 and args.nret == 0,
args.rets is set to point to &args.args[16], which is beyond the end of
the args.args array. This results in a minor read overrun of the array
when we check the first return code (which, per PAPR, is a required
output of all RTAS calls) to see if there's been a hardware error.
Change the nargs/nret check to ensure nargs is <= 15, allowing room for
the status code. Users shouldn't be calling with nret == 0, but there's
no real harm if they do, so we don't stop them.
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Calling ISA 3.0 instructions copy, copy_first, paste and paste_last
generates an alignment fault when copying or pasting unaligned
data (128 byte). We catch this and send SIGBUS to the userspace
process that caused it.
We do not emulate these because paste may contain additional metadata
when pasting to a co-processor and paste_last is the synchronisation
point for preceding copy/paste sequences.
Thanks to Michael Neuling <mikey@neuling.org> for his help.
Signed-off-by: Chris Smart <chris@distroguy.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Export the generic hardware and cache perf events for Power9 to sysfs,
so users can determine the PMU event monitored.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This patch adds base enablement for the power9 PMU.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add macros for the generic and cache events on Power9
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Factor out the power8 pmu init functions to share with
power9. Monitor Mode Control Register S(MMCRS) and
Monitor Mode Control Register H(MMCRH) registers are
dropped in Power9. These registers are added to new
function which are included for power8 init.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Factor out some of the power8 pmu functions
to new file "isa207-common.c" to share with
power9 pmu code. Only code movement and no
logic change
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Factor out some of the power8 pmu macros to
new a header file to share with power9 pmu code.
Just code movement and no logic change.
Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We spent so much time bike-shedding the printk() we missed that the next
line was missing a semi-colon. And it seems none of our defconfigs turn
on CONFIG_FA_DUMP.
Fixes: 4a03749f14 ("powerpc/fadump: Trivial fix of spelling mistake, clean up message")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement new character device driver to allow access from user space
to the operator panel display present on IBM Power Systems machines
with FSPs.
This will allow status information to be presented on the display which
is visible to a user.
The driver implements a character buffer which a user can read/write
by accessing the device (/dev/op_panel). This buffer is then displayed on
the operator panel display. Any attempt to write past the last character
position will have no effect and attempts to write more characters than
the size of the display will be truncated. The device may only be accessed
by a single process at a time.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
An opal_msg of type OPAL_MSG_ASYNC_COMP contains the return code in the
params[1] struct member. However this isn't intuitive or obvious when
reading the code and requires that a user look at the skiboot
documentation or opal-api.h to verify this.
Add an inline function to get the return code from an opal_msg and update
call sites accordingly.
Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Trivial fix to spelling mistake in pr_debug() message.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix trivial spelling mistake "rgistration". Also use pr_err()
instead of printk() and unsplit the string to keep it all on one
line.
Signed-off-by: Colin Ian King <colin.king@canonical.com>
[mpe: Keep rc on the same line, splitting it doesn't help]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If a PHB has no I/O space, there's no need to make it look like
something bad happened, a pr_debug() is plenty enough since this
is the case of all our modern POWER chips.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>