When running as a powernv "host" system on P8, we need to switch
the endianness of interrupt handlers. This does it via the appropriate
call to the OPAL firmware which may result in just switching HID0:HILE
but depending on the processor version might need to do a few more
things. This call must be done early before any other processor has
been brought out of firmware.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Since commit "efcac65 powerpc: Per process DSCR + some fixes (try#4)"
it is no longer possible to set the DSCR on a per-CPU basis.
The old behaviour was to minipulate the DSCR SPR directly but this is no
longer sufficient: the value is quickly overwritten by context switching.
This patch stores the per-CPU DSCR value in a kernel variable rather than
directly in the SPR and it is used whenever a process has not set the DSCR
itself. The sysfs interface (/sys/devices/system/cpu/cpuN/dscr) is unchanged.
Writes to the old global default (/sys/devices/system/cpu/dscr_default)
now set all of the per-CPU values and reads return the last written value.
The new per-CPU default is added to the paca_struct and is used everywhere
outside of sysfs.c instead of the old global default.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Commit 32e45ff43e changed the default value of
RECLAIM_DISTANCE to 30. However the comment around arch
specifc definition of RECLAIM_DISTANCE is not updated to
reflect the same. Correct the value mentioned in the comment.
Signed-off-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KOSAKI Motohiro <Kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Upcoming POWER8 chips support a concept called split core. This is where the
core can be split into subcores that although not full cores, are able to
appear as full cores to a guest.
The splitting & unsplitting procedure is mildly complicated, and explained at
length in the comments within the patch.
One notable detail is that when splitting or unsplitting we need to pull
offline cpus out of their offline state to do work as part of the procedure.
The interface for changing the split mode is via a sysfs file, eg:
$ echo 2 > /sys/devices/system/cpu/subcores_per_core
Currently supported values are '1', '2' and '4'. And indicate respectively that
the core should be unsplit, split in half, and split in quarters. These modes
correspond to threads_per_subcore of 8, 4 and 2.
We do not allow changing the split mode while KVM VMs are active. This is to
prevent the value changing while userspace is configuring the VM, and also to
prevent the mode being changed in such a way that existing guests are unable to
be run.
CPU hotplug fixes by Srivatsa. max_cpus fixes by Mahesh. cpuset fixes by
benh. Fix for irq race by paulus. The rest by mikey and mpe.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
On POWER8 we have a new concept of a subcore. This is what happens when
you take a regular core and split it. A subcore is a grouping of two or
four SMT threads, as well as a handfull of SPRs which allows the subcore
to appear as if it were a core from the point of view of a guest.
Unlike threads_per_core which is fixed at boot, threads_per_subcore can
change while the system is running. Most code will not want to use
threads_per_subcore.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To support split core we need to be able to force all secondaries into
nap, so the core can detect they are idle and do an unsplit.
Currently power7_nap() will return without napping if there is an irq
pending. We want to ignore the pending irq and nap anyway, we will deal
with the interrupt later.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
As part of the support for split core on POWER8, we want to be able to
block splitting of the core while KVM VMs are active.
The logic to do that would be exactly the same as the code we currently
have for inhibiting onlining of secondaries.
Instead of adding an identical mechanism to block split core, rework the
secondary inhibit code to be a "HV KVM is active" check. We can then use
that in both the cpu hotplug code and the upcoming split core code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Michael Neuling <mikey@neuling.org>
Acked-by: Alexander Graf <agraf@suse.de>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Based off 3bccd996 for ia64, convert powerpc to use the generic per-CPU
topology tracking, specifically:
initialize per cpu numa_node entry in start_secondary
remove the powerpc cpu_to_node()
define CONFIG_USE_PERCPU_NUMA_NODE_ID if NUMA
Signed-off-by: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
With binutils 2.24, various 64 bit builds fail with relocation errors
such as
arch/powerpc/kernel/built-in.o: In function `exc_debug_crit_book3e':
(.text+0x165ee): relocation truncated to fit: R_PPC64_ADDR16_HI
against symbol `interrupt_base_book3e' defined in .text section
in arch/powerpc/kernel/built-in.o
arch/powerpc/kernel/built-in.o: In function `exc_debug_crit_book3e':
(.text+0x16602): relocation truncated to fit: R_PPC64_ADDR16_HI
against symbol `interrupt_end_book3e' defined in .text section
in arch/powerpc/kernel/built-in.o
The assembler maintainer says:
I changed the ABI, something that had to be done but unfortunately
happens to break the booke kernel code. When building up a 64-bit
value with lis, ori, shl, oris, ori or similar sequences, you now
should use @high and @higha in place of @h and @ha. @h and @ha
(and their associated relocs R_PPC64_ADDR16_HI and R_PPC64_ADDR16_HA)
now report overflow if the value is out of 32-bit signed range.
ie. @h and @ha assume you're building a 32-bit value. This is needed
to report out-of-range -mcmodel=medium toc pointer offsets in @toc@h
and @toc@ha expressions, and for consistency I did the same for all
other @h and @ha relocs.
Replacing @h with @high in one strategic location fixes the relocation
errors. This has to be done conditionally since the assembler either
supports @h or @high but not both.
Cc: <stable@vger.kernel.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
powerpc: Clear ELF personality flag if ELFv2 is not requested.
The POWER kernel uses a personality flag to determine whether it should
be setting up function descriptors or not (per the updated ABI). This
flag wasn't being cleared on a new process but instead was being
inherited. The visible effect was that an ELFv2 binary could not execve
to an ELFv1 binary.
Signed-off-by: Jeff Bailey <jeffbailey@google.com>
arch/powerpc/include/asm/elf.h | 2 ++
1 file changed, 2 insertions(+)
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, on 8641D, which doesn't set CONFIG_HAVE_HW_BREAKPOINT
we get the following splat:
BUG: using smp_processor_id() in preemptible [00000000] code: login/1382
caller is set_breakpoint+0x1c/0xa0
CPU: 0 PID: 1382 Comm: login Not tainted 3.15.0-rc3-00041-g2aafe1a4d451 #1
Call Trace:
[decd5d80] [c0008dc4] show_stack+0x50/0x158 (unreliable)
[decd5dc0] [c03c6fa0] dump_stack+0x7c/0xdc
[decd5de0] [c01f8818] check_preemption_disabled+0xf4/0x104
[decd5e00] [c00086b8] set_breakpoint+0x1c/0xa0
[decd5e10] [c00d4530] flush_old_exec+0x2bc/0x588
[decd5e40] [c011c468] load_elf_binary+0x2ac/0x1164
[decd5ec0] [c00d35f8] search_binary_handler+0xc4/0x1f8
[decd5ef0] [c00d4ee8] do_execve+0x3d8/0x4b8
[decd5f40] [c001185c] ret_from_syscall+0x0/0x38
--- Exception: c01 at 0xfeee554
LR = 0xfeee7d4
The call path in this case is:
flush_thread
--> set_debug_reg_defaults
--> set_breakpoint
--> __get_cpu_var
Since preemption is enabled in the cleanup of flush thread, and
there is no need to disable it, introduce the distinction between
set_breakpoint and __set_breakpoint, leaving only the flush_thread
instance as the current user of set_breakpoint.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
None of the callers check the return value, so it might as
well not have one at all.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The arch/powerpc/include/asm/linkage.h is being unintentionally exported
in the kernel headers since commit e1b5bb6d12 (consolidate
cond_syscall and SYSCALL_ALIAS declarations) when
arch/powerpc/include/uapi/asm/linkage.h was deleted but the header-y not
removed from the Kbuild file. This happens because Makefile.headersinst
still checks the old asm/ directory if the specified header doesn't
exist in the uapi directory.
The asm/linkage.h shouldn't ever have been exported anyway. No other
arch does and it doesn't contain anything useful to userland, so remove
the header-y line from the Kbuild file which triggers the export.
Signed-off-by: James Hogan <james.hogan@imgtec.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This reverts commit b2b5efcf20.
This code was way too board specific, there are quirks as to how
the PERST line is wired on different boards, we'll have to revisit
this using/creating appropriate firmware interfaces.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
This series adds support for building the powerpc 64-bit
LE kernel using the new ABI v2. We already supported
running ABI v2 userspace programs but this adds support
for building the kernel itself using the new ABI.
Unaligned stores take alignment exceptions on POWER7 running in little-endian.
This is a dumb little-endian base memcpy that prevents unaligned stores.
Once booted the feature fixup code switches over to the VMX copy loops
(which are already endian safe).
The question is what we do before that switch over. The base 64bit
memcpy takes alignment exceptions on POWER7 so we can't use it as is.
Fixing the causes of alignment exception would slow it down, because
we'd need to ensure all loads and stores are aligned either through
rotate tricks or bytewise loads and stores. Either would be bad for
all other 64bit platforms.
[ I simplified the loop a bit - Anton ]
Signed-off-by: Philippe Bergheaud <felix@linux.vnet.ibm.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
If we do a treclaim and we are not in TM suspend mode, it results in a TM bad
thing (ie. a 0x700 program check). Similarly if we do a trechkpt and we have
an active transaction or TEXASR Failure Summary (FS) is not set, we also take a
TM bad thing.
This should never happen, but if it does (ie. a kernel bug), the cause is
almost impossible to debug as the GPR state is mostly userspace and hence we
don't get a call chain.
This adds some checks in these cases case a BUG_ON() (in asm) in case we ever
hit these cases. It moves the register saving around to preserve r1 till later
also.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Currently, the code in setup-common.c for powerpc assumes that all
clock rates are same in a smp system. This value is cached in the
variable named ppc_proc_freq and is the value that is reported in
/proc/cpuinfo.
However on the PowerNV platform, the clock rate is same only across
the threads of the same core. Hence the value that is reported in
/proc/cpuinfo is incorrect on PowerNV platforms. We need a better way
to query and report the correct value of the processor clock in
/proc/cpuinfo.
The patch achieves this by creating a machdep_call named
get_proc_freq() which is expected to returns the frequency in Hz. The
code in show_cpuinfo() can invoke this method to display the correct
clock rate on platforms that have implemented this method. On the
other powerpc platforms it can use the value cached in ppc_proc_freq.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Firmware update on PowerNV platform takes several minutes. During
this time one CPU is stuck in FW and the kernel complains about "soft
lockups".
This patch returns all secondary CPUs to firmware before starting
firmware update process.
[ Reworked a bit and cleaned up -- BenH ]
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The patch intends to support fundamental reset on PLX downstream
ports. If the PCI device matches any one of the internal table,
which includes PLX vendor ID, bridge device ID, register offset
for fundamental reset and bit, fundamental reset will be done
accordingly. Otherwise, it will fail back to hot reset.
Additional flag (EEH_DEV_FRESET) is introduced to record the last
reset type on the PCI bridge.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The problem was initially reported by Wendy who tried pass through
IPR adapter, which was connected to PHB root port directly, to KVM
based guest. When doing that, pci_reset_bridge_secondary_bus() was
called by VFIO driver and linkDown was detected by the root port.
That caused all PEs to be frozen.
The patch fixes the issue by routing the reset for the secondary bus
of root port to underly firmware. For that, one more weak function
pci_reset_secondary_bus() is introduced so that the individual platforms
can override that and do specific reset for bridge's secondary bus.
Reported-by: Wendy Xiong <wenxiong@linux.vnet.ibm.com>
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Basically, we have 3 types of resets to fulfil PE reset: fundamental,
hot and PHB reset. For the later 2 cases, we need PCI bus reset hold
and settlement delay as specified by PCI spec. PowerNV and pSeries
platforms are running on top of different firmware and some of the
delays have been covered by underly firmware (PowerNV).
The patch makes the delays unified to be done in backend, instead of
EEH core.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The issue was detected in a bit complicated test case where
we have multiple hierarchical PEs shown as following figure:
+-----------------+
| PE#3 p2p#0 |
| p2p#1 |
+-----------------+
|
+-----------------+
| PE#4 pdev#0 |
| pdev#1 |
+-----------------+
PE#4 (have 2 PCI devices) is the child of PE#3, which has 2 p2p
bridges. We accidentally had less-known scenario: PE#4 was removed
permanently from the system because of permanent failure (e.g.
exceeding the max allowd failure times in last hour), then we detects
EEH errors on PE#3 and tried to recover it. However, eeh_dev instances
for pdev#0/1 were not detached from PE#4, which was still connected to
PE#3. All of that was because of the fact that we rely on count-based
pcibios_release_device(), which isn't reliable enough. When doing
recovery for PE#3, we still apply hotplug on PE#4 and pdev#0/1, which
are not valid any more. Eventually, we run into kernel crash.
The patch fixes above issue from two aspects. For unplug, we simply
skip those permanently removed PE, whose state is (EEH_PE_STATE_ISOLATED
&& !EEH_PE_STATE_RECOVERING) and its frozen count should be greater
than EEH_MAX_ALLOWED_FREEZES. For plug, we marked all permanently
removed EEH devices with EEH_DEV_REMOVED and return 0xFF's on read
its PCI config so that PCI core will omit them.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
There're 2 EEH subsystem variables: eeh_subsystem_enabled and
eeh_probe_mode. We needn't maintain 2 variables and we can just
have one variable and introduce different flags. The patch also
introduces additional flag EEH_FORCE_DISABLE, which will be used
to disable EEH subsystem via boot parameter ("eeh=off") in future.
Besides, the patch also introduces flag EEH_ENABLED, which is
changed to disable or enable EEH functionality on the fly through
debugfs entry in future.
With the patch applied, the creteria to check the enabled EEH
functionality is changed to:
!EEH_FORCE_DISABLED && EEH_ENABLED : Enabled
Other cases : Disabled
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
When calling into eeh_gather_pci_data() on pSeries platform, we
possiblly don't have pci_dev instance yet, but eeh_dev is always
ready. So we use cached capability from eeh_dev instead of pci_dev
for log dump there. In order to keep things unified, we also cache
PCI capability positions to eeh_dev for PowerNV as well.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We've observed multiple PE reset failures because of PCI-CFG
access during that period. Potentially, some device drivers
can't support EEH very well and they can't put the device to
motionless state before PE reset. So those device drivers might
produce PCI-CFG accesses during PE reset. Also, we could have
PCI-CFG access from user space (e.g. "lspci"). Since access to
frozen PE should return 0xFF's, we can block PCI-CFG access
during the period of PE reset so that we won't get recrusive EEH
errors.
The patch adds flag EEH_PE_RESET, which is kept during PE reset.
The PowerNV/pSeries PCI-CFG accessors reuse the flag to block
PCI-CFG accordingly.
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
The PE state (for eeh_pe instance) EEH_PE_PHB_DEAD is duplicate to
EEH_PE_ISOLATED. Originally, those PHBs (PHB PE) with EEH_PE_PHB_DEAD
would be removed from the system. However, it's safe to replace
that with EEH_PE_ISOLATED.
The patch also clear EEH_PE_RECOVERING after fenced PHB has been handled,
either failure or success. It makes the PHB PE state consistent with:
PHB functions normally NONE
PHB has been removed EEH_PE_ISOLATED
PHB fenced, recovery in progress EEH_PE_ISOLATED | RECOVERING
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
I've had a report that the current limit is too small for
an automated network based installer. Bump it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We have two copies of code that creates an OPAL sg list. Consolidate
these into a common set of helpers and fix the endian issues.
The flash interface embedded a version number in the num_entries
field, whereas the dump interface did did not. Since versioning
wasn't added to the flash interface and it is impossible to add
this in a backwards compatible way, just remove it.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Fix little endian issues with the OPAL error log code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
We had some duplication of the internal OPAL functions.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Using size_t in our APIs is asking for trouble, especially
when some OPAL calls use size_t pointers.
Signed-off-by: Anton Blanchard <anton@samba.org>
Reviewed-by: Stewart Smith <stewart@linux.vnet.ibm.com>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
ftrace has way too much knowledge of our kernel module trampoline
layout hidden inside it. Create module_trampoline_target() that gives
the target address of a kernel module trampoline.
Signed-off-by: Anton Blanchard <anton@samba.org>
ftrace has way too much knowledge of our kernel module trampoline
layout hidden inside it. Create is_module_trampoline() that can
abstract this away inside the module loader code.
Signed-off-by: Anton Blanchard <anton@samba.org>
If an assembly function that calls back into c code is exported to
modules, we need to ensure r2 is setup correctly. There are only
two places crazy enough to do it (two of which are my fault).
Signed-off-by: Anton Blanchard <anton@samba.org>
The new ELF ABI tends to use R_PPC64_REL16_LO and R_PPC64_REL16_HA
relocations (PC-relative), so implement them.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
The kernel resolved the '.TOC.' to a fake symbol, so we need to fix it up
to point to our .toc section plus 0x8000.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
TRACE_WITH_FRAME_BUFFER creates 32 byte stack frames. On ppc64
ABIv1 this is too small and a callee could corrupt the stack by
writing to the parameter save area (starting at offset 48).
Signed-off-by: Anton Blanchard <anton@samba.org>
ABIv2 doesn't have function descriptors or dot symbols. One
new thing it does add is a function global and a local entry
point, so add that to our _GLOBAL macro.
Signed-off-by: Anton Blanchard <anton@samba.org>
There are a few places we have to use dot symbols with the
current ABI - the syscall table and the kvm hcall table.
Wrap both of these with a new macro called DOTSYM so it will
be easy to transition away from dot symbols in a future ABI.
Signed-off-by: Anton Blanchard <anton@samba.org>
binutils is smart enough to know that a branch to a function
descriptor is actually a branch to the functions text address.
Alan tells me that binutils has been doing this for 9 years.
Signed-off-by: Anton Blanchard <anton@samba.org>