Lazy save and restore of FP/Altivec means that a userspace process can
be sent to userspace with FP or Altivec disabled and loaded only as
required (by way of an FP/Altivec unavailable exception). Transactional
Memory complicates this situation as a transaction could be started
without FP/Altivec being loaded up. This causes the hardware to
checkpoint incorrect registers. Handling FP/Altivec unavailable
exceptions while a thread is transactional requires a reclaim and
recheckpoint to ensure the CPU has correct state for both sets of
registers.
tm_reclaim() has optimisations to not always save the FP/Altivec
registers to the checkpointed save area. This was originally done
because the caller might have information that the checkpointed
registers aren't valid due to lazy save and restore. We've also been a
little vague as to how tm_reclaim() leaves the FP/Altivec state since it
doesn't necessarily always save it to the thread struct. This has lead
to an (incorrect) assumption that it leaves the checkpointed state on
the CPU.
tm_recheckpoint() has similar optimisations in reverse. It may not
always reload the checkpointed FP/Altivec registers from the thread
struct before the trecheckpoint. It is therefore quite unclear where it
expects to get the state from. This didn't help with the assumption
made about tm_reclaim().
This patch is a minimal fix for ease of backporting. A more correct fix
which removes the msr parameter to tm_reclaim() and tm_recheckpoint()
altogether has been upstreamed to apply on top of this patch.
Fixes: dc3106690b ("powerpc: tm: Always use fp_state and vr_state to
store live registers")
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Lazy save and restore of FP/Altivec means that a userspace process can
be sent to userspace with FP or Altivec disabled and loaded only as
required (by way of an FP/Altivec unavailable exception). Transactional
Memory complicates this situation as a transaction could be started
without FP/Altivec being loaded up. This causes the hardware to
checkpoint incorrect registers. Handling FP/Altivec unavailable
exceptions while a thread is transactional requires a reclaim and
recheckpoint to ensure the CPU has correct state for both sets of
registers.
Lazy save and restore of FP/Altivec cannot be done if a process is
transactional. If a facility was enabled it must remain enabled whenever
a thread is transactional.
Commit dc16b553c9 ("powerpc: Always restore FPU/VEC/VSX if hardware
transactional memory in use") ensures that the facilities are always
enabled if a thread is transactional. A bug in the introduced code may
cause it to inadvertently enable a facility that was (and should remain)
disabled. The problem with this extraneous enablement is that the
registers for the erroneously enabled facility have not been correctly
recheckpointed - the recheckpointing code assumed the facility would
remain disabled.
Further compounding the issue, the transactional {fp,altivec,vsx}
unavailable code has been incorrectly using the MSR to enable
facilities. The presence of the {FP,VEC,VSX} bit in the regs->msr simply
means if the registers are live on the CPU, not if the kernel should
load them before returning to userspace. This has worked due to the bug
mentioned above.
This causes transactional threads which return to their failure handler
to observe incorrect checkpointed registers. Perhaps an example will
help illustrate the problem:
A userspace process is running and uses both FP and Altivec registers.
This process then continues to run for some time without touching
either sets of registers. The kernel subsequently disables the
facilities as part of lazy save and restore. The userspace process then
performs a tbegin and the CPU checkpoints 'junk' FP and Altivec
registers. The process then performs a floating point instruction
triggering a fp unavailable exception in the kernel.
The kernel then loads the FP registers - and only the FP registers.
Since the thread is transactional it must perform a reclaim and
recheckpoint to ensure both the checkpointed registers and the
transactional registers are correct. It then (correctly) enables
MSR[FP] for the process. Later (on exception exist) the kernel also
(inadvertently) enables MSR[VEC]. The process is then returned to
userspace.
Since the act of loading the FP registers doomed the transaction we know
CPU will fail the transaction, restore its checkpointed registers, and
return the process to its failure handler. The problem is that we're
now running with Altivec enabled and the 'junk' checkpointed registers
are restored. The kernel had only recheckpointed FP.
This patch solves this by only activating FP/Altivec if userspace was
using them when it entered the kernel and not simply if the process is
transactional.
Fixes: dc16b553c9 ("powerpc: Always restore FPU/VEC/VSX if hardware
transactional memory in use")
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This interface is inefficient and deprecated because of the y2038
overflow.
ktime_get_seconds() is an appropriate replacement here, since it
has sufficient granularity but is more efficient and uses monotonic
time.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently when we take a TM Bad Thing program check exception, we
search the bug table to see if the program check was generated by a
WARN/WARN_ON etc.
That makes no sense, the WARN macros use trap instructions, which
should never generate a TM Bad Thing exception. If they ever did that
would be a bug and we should oops.
We do have some hand-coded bugs in tm.S, using EMIT_BUG_ENTRY, but
those are all BUGs not WARNs, and they all use trap instructions
anyway. Almost certainly this check was incorrectly copied from the
REASON_TRAP handling in the same function.
Remove it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-By: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
CONFIG_PPC_STD_MMU_64 indicates support for the "standard" powerpc MMU
on 64-bit CPUs. The "standard" MMU refers to the hash page table MMU
found in "server" processors, from IBM mainly.
Currently CONFIG_PPC_STD_MMU_64 is == CONFIG_PPC_BOOK3S_64. While it's
annoying to have two symbols that always have the same value, it's not
quite annoying enough to bother removing one.
However with the arrival of Power9, we now have the situation where
CONFIG_PPC_STD_MMU_64 is enabled, but the kernel is running using the
Radix MMU - *not* the "standard" MMU. So it is now actively confusing
to use it, because it implies that code is disabled or inactive when
the Radix MMU is in use, however that is not necessarily true.
So s/CONFIG_PPC_STD_MMU_64/CONFIG_PPC_BOOK3S_64/, and do some minor
formatting updates of some of the affected lines.
This will be a pain for backports, but c'est la vie.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The last user of CPU_FTR_ICSWX was removed in commit
6ff4d3e966 ("powerpc: Remove old unused icswx based coprocessor
support"), so free the bit up for future use.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When returning from an exception to a soft-enabled context, pending
IRQs are replayed but IRQ tracing is not reset, so a number of them
can get chained together into the same IRQ-disabled trace.
Fix this by having __check_irq_replay re-set IRQ trace. This is
conceptually where we respond to the next interrupt, so it fits the
semantics of the IRQ tracer.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If the host takes a system reset interrupt while a guest is running,
the CPU must exit the guest before processing the host exception
handler.
After this patch, taking a sysrq+x with a CPU running in a guest
gives a trace like this:
cpu 0x27: Vector: 100 (System Reset) at [c000000fdf5776f0]
pc: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
lr: c008000010158b80: kvmppc_run_core+0x16b8/0x1ad0 [kvm_hv]
sp: c000000fdf577850
msr: 9000000002803033
current = 0xc000000fdf4b1e00
paca = 0xc00000000fd4d680 softe: 3 irq_happened: 0x01
pid = 6608, comm = qemu-system-ppc
Linux version 4.14.0-rc7-01489-g47e1893a404a-dirty #26 SMP
[c000000fdf577a00] c008000010159dd4 kvmppc_vcpu_run_hv+0x3dc/0x12d0 [kvm_hv]
[c000000fdf577b30] c0080000100a537c kvmppc_vcpu_run+0x44/0x60 [kvm]
[c000000fdf577b60] c0080000100a1ae0 kvm_arch_vcpu_ioctl_run+0x118/0x310 [kvm]
[c000000fdf577c00] c008000010093e98 kvm_vcpu_ioctl+0x530/0x7c0 [kvm]
[c000000fdf577d50] c000000000357bf8 do_vfs_ioctl+0xd8/0x8c0
[c000000fdf577df0] c000000000358448 SyS_ioctl+0x68/0x100
[c000000fdf577e30] c00000000000b220 system_call+0x58/0x6c
--- Exception: c01 (System Call) at 00007fff76868df0
SP (7fff7069baf0) is in userspace
Fixes: e36d0a2ed5 ("powerpc/powernv: Implement NMI IPI with OPAL_SIGNAL_SYSTEM_RESET")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Back in 2008 we added support for "fast little-endian switch" in the
syscall path. This added a special case syscall number 0x1ebe, which
is caught very early in the system call exception and switches endian
with as little overhead as possible. See commit 745a14cc26
("[POWERPC] Add fast little-endian switch system call") for full
details.
Although it is fast, it's also completely non standard. The "syscall
number" is out of the range of normal syscalls, it can't be traced or
audited, and it's a bit of a wart. To the best of our knowledge it was
only used by one program, now long since discontinued.
So in an effort to shake out any current users, put it behind a config
option, and make it default n. If anyone *is* using it they can
quickly reinstate it with a rebuild, and we can flip it to default y.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Unfortunately userspace can construct a sigcontext which enables
suspend. Thus userspace can force Linux into a path where trechkpt is
executed.
This patch blocks this from happening on POWER9 by sanity checking
sigcontexts passed in.
ptrace doesn't have this problem as only MSR SE and BE can be changed
via ptrace.
This patch also adds a number of WARN_ON()s in case we ever enter
suspend when we shouldn't. This should not happen, but if it does the
symptoms are soft lockup warnings which are not obviously TM related,
so the WARN_ON()s should make it obvious what's happening.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Some Power9 revisions can run in a mode where TM operates without
suspended state. If we find ourself on a CPU that might be in this
mode, we query OPAL to check, and if so we reenable TM in CPU
features, and enable a new user feature to signal to userspace that we
are in this mode.
We do not enable the "normal" user feature, PPC_FEATURE2_HTM, but we
do enable PPC_FEATURE2_HTM_NOSC because that indicates to userspace
that the kernel will abort transactions on syscall entry, which is
true regardless of the suspend mode.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the kernel relies on firmware to inform it whether or not the
CPU supports HTM and as long as the kernel was built with
CONFIG_PPC_TRANSACTIONAL_MEM=y then it will allow userspace to make
use of the facility.
There may be situations where it would be advantageous for the kernel
to not allow userspace to use HTM, currently the only way to achieve
this is to recompile the kernel with CONFIG_PPC_TRANSACTIONAL_MEM=n.
This patch adds a simple commandline option so that HTM can be
disabled at boot time.
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
[mpe: Simplify to a bool, move to prom.c, put doco in the right place.
Always disable, regardless of initial state, to avoid user confusion.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reverts commit 94a04bc25a.
In order to run HPT guests on a radix POWER9 host, we will have to run
the host in single-threaded mode, because POWER9 processors do not
currently support running some threads of a core in HPT mode while
others are in radix mode ("mixed mode").
That means that we will need the same mechanisms that are used on
POWER8 to make the secondary threads available to KVM, which were
disabled on POWER9 by commit 94a04bc25a.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If we are in user space and hit a UE error, we now have the
basic infrastructure to walk the page tables and find out
the effective address that was accessed, since the DAR
is not valid.
We use a work_queue content to hookup the bad pfn, any
other context causes problems, since memory_failure itself
can call into schedule() via lru_drain_ bits.
We could probably poison the struct page to avoid a race
between detection and taking corrective action.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Hookup instruction errors (UE) for memory offling via memory_failure()
in a manner similar to load/store errors (derror). Since we have access
to the NIP, the conversion is a one step process in this case.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Extract physical_address for UE errors by walking the page
tables for the mm and address at the NIP, to extract the
instruction. Then use the instruction to find the effective
address via analyse_instr().
We might have page table walking races, but we expect them to
be rare, the physical address extraction is best effort. The idea
is to then hook up this infrastructure to memory failure eventually.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the same alignment as Effective address.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are no users of get_mce_fault_addr() since commit
1363875bdb ("powerpc/64s: fix handling of non-synchronous machine
checks") removed the last usage.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use WARN_ON(), while running out of stubs in stub_for_addr()
and abort loading of the module instead of BUG_ON().
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a check for p->state == TASK_RUNNING so that any wake-ups on
task_struct p in the interim lead to 0 being returned by get_wchan().
Signed-off-by: Kautuk Consul <kautuk.consul.1980@gmail.com>
[mpe: Confirmed other architectures do similar]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Fix a circa 2005 FIXME by implementing a check to ensure that we
actually got into the jprobe break handler() due to the trap in
jprobe_return().
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Kamalesh pointed out that we are getting the below call traces with
livepatched functions when we enable CONFIG_PREEMPT:
[ 495.470721] BUG: using __this_cpu_read() in preemptible [00000000] code: cat/8394
[ 495.471167] caller is is_current_kprobe_addr+0x30/0x90
[ 495.471171] CPU: 4 PID: 8394 Comm: cat Tainted: G K 4.13.0-rc7-nnr+ #95
[ 495.471173] Call Trace:
[ 495.471178] [c00000008fd9b960] [c0000000009f039c] dump_stack+0xec/0x160 (unreliable)
[ 495.471184] [c00000008fd9b9a0] [c00000000059169c] check_preemption_disabled+0x15c/0x170
[ 495.471187] [c00000008fd9ba30] [c000000000046460] is_current_kprobe_addr+0x30/0x90
[ 495.471191] [c00000008fd9ba60] [c00000000004e9a0] ftrace_call+0x1c/0xb8
[ 495.471195] [c00000008fd9bc30] [c000000000376fd8] seq_read+0x238/0x5c0
[ 495.471199] [c00000008fd9bcd0] [c0000000003cfd78] proc_reg_read+0x88/0xd0
[ 495.471203] [c00000008fd9bd00] [c00000000033e5d4] __vfs_read+0x44/0x1b0
[ 495.471206] [c00000008fd9bd90] [c0000000003402ec] vfs_read+0xbc/0x1b0
[ 495.471210] [c00000008fd9bde0] [c000000000342138] SyS_read+0x68/0x110
[ 495.471214] [c00000008fd9be30] [c00000000000bc6c] system_call+0x58/0x6c
Commit c05b8c4474 ("powerpc/kprobes: Skip livepatch_handler() for
jprobes") introduced a helper is_current_kprobe_addr() to help determine
if the current function has been livepatched or if it has a jprobe
installed, both of which modify the NIP. This was subsequently renamed
to __is_active_jprobe().
In the case of a jprobe, kprobe_ftrace_handler() disables pre-emption
before calling into setjmp_pre_handler() which returns without disabling
pre-emption. This is done to ensure that the jprobe handler won't
disappear beneath us if the jprobe is unregistered between the
setjmp_pre_handler() and the subsequent longjmp_break_handler() called
from the jprobe handler. Due to this, we can use __this_cpu_read() in
__is_active_jprobe() with the pre-emption check as we know that
pre-emption will be disabled.
However, if this function has been livepatched, we are still doing this
check and when we do so, pre-emption won't necessarily be disabled. This
results in the call trace shown above.
Fix this by only invoking __is_active_jprobe() when pre-emption is
disabled. And since we now guard this within a pre-emption check, we can
instead use raw_cpu_read() to get the current_kprobe value skipping the
check done by __this_cpu_read().
Fixes: c05b8c4474 ("powerpc/kprobes: Skip livepatch_handler() for jprobes")
Reported-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In commit c05b8c4474 ("powerpc/kprobes: Skip livepatch_handler() for
jprobes"), we added a helper is_current_kprobe_addr() to help detect if
the modified regs->nip was due to a jprobe or livepatch. Masami felt
that the function name was not quite clear. To that end, this patch
renames is_current_kprobe_addr() to __is_active_jprobe() and adds a
comment to (hopefully) better clarify the purpose of this helper. The
helper has also now been moved to kprobes-ftrace.c so that it is only
available for KPROBES_ON_FTRACE.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, we disable instruction emulation if emulate_step() fails for
any reason. However, such failures could be transient and specific to a
particular run. Instead, only disable instruction emulation if we have
never been able to emulate this. If we had emulated this instruction
successfully at least once, then we single step only this probe hit and
continue to try emulating the instruction in subsequent probe hits.
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
1. This is only used in kprobes.c, so make it static.
2. Remove the un-necessary (ret == 0) comparison in the else clause.
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use setup_timer function instead of initializing timer with the
function and data fields.
Signed-off-by: Allen Pais <allen.lkml@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This code is used at boot and machine checks, so it should be using
early_radix_enabled() (which is usable any time).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It is possible to wake from idle due to a system reset exception, in
which case the CPU takes a system reset interrupt to wake from idle,
with system reset as the wakeup reason.
The regular (not idle wakeup) system reset interrupt handler must be
invoked in this case, otherwise the system reset interrupt is lost.
Handle the system reset interrupt immediately after CPU state has been
restored.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
In xmon, touch_nmi_watchdog() is not expected to be checking that
other CPUs have not touched the watchdog, so the code will just call
touch_nmi_watchdog() once before re-enabling hard interrupts.
Just update our CPU's state, and ignore apparently stuck SMP threads.
Arguably touch_nmi_watchdog should check for SMP lockups, and callers
should be fixed, but that's not trivial for the input code of xmon.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
If sysctl_hardlockup_all_cpu_backtrace is enabled, there is no need to
IPI stuck CPUs for backtrace before trigger_allbutself_cpu_backtrace(),
which does the same thing again.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The SMP watchdog will detect locked CPUs and IPI them to print a
backtrace and registers. If panic on hard lockup is enabled, do not
panic from this handler, because that can cause recursion into the IPI
layer during the panic.
The caller already panics in this case.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 DD2.1 and earlier has an issue where some cache inhibited
vector load will return bad data. The workaround is two part, one
firmware/microcode part triggers HMI interrupts when hitting such
loads, the other part is this patch which then emulates the
instructions in Linux.
The affected instructions are limited to lxvd2x, lxvw4x, lxvb16x and
lxvh8x.
When an instruction triggers the HMI, all threads in the core will be
sent to the HMI handler, not just the one running the vector load.
In general, these spurious HMIs are detected by the emulation code and
we just return back to the running process. Unfortunately, if a
spurious interrupt occurs on a vector load that's to normal memory we
have no way to detect that it's spurious (unless we walk the page
tables, which is very expensive). In this case we emulate the load but
we need do so using a vector load itself to ensure 128bit atomicity is
preserved.
Some additional debugfs emulated instruction counters are added also.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
[mpe: Switch CONFIG_PPC_BOOK3S_64 to CONFIG_VSX to unbreak the build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove the post_init callback which is only used
by powernv, we can just call it explicitly from
the powernv code.
This partially kills the ability to "disable" eeh at
runtime via debugfs as this was calling that same
callback again, but this is both unused and broken
in several ways. If we want to revive it, we need
to create a dedicated enable/disable callback on the
backend that does the right thing.
Let the bulk of eeh initialize normally at
core_initcall() like it does on pseries by removing
the hack in eeh_init() that delays it.
Instead we make sure our eeh->probe cleanly bails
out of the PEs haven't been created yet and we force
a re-probe where we used to call eeh_init() again.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Otherwise we end up not yet having computed the right diag data size
on powernv where EEH initialization is delayed, thus causing memory
corruption later on when calling OPAL.
Fixes: 5cb1f8fddd ("powerpc/powernv/pci: Dynamically allocate PHB diag data")
Cc: stable@vger.kernel.org # v4.13+
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Optprobes depended on an updated regs->nip from analyse_instr() to
identify the location to branch back from the optprobes trampoline.
However, since commit 3cdfcbfd32 ("powerpc: Change analyse_instr so
it doesn't modify *regs"), analyse_instr() doesn't update the registers
anymore. Due to this, we end up branching back from the optprobes
trampoline to the same branch into the trampoline resulting in a loop.
Fix this by calling out to emulate_update_regs() before using the nip.
Additionally, explicitly compare the return value from analyse_instr()
to 1, rather than just checking for !0 so as to guard against any
future changes to analyse_instr() that may result in -1 being returned
in more scenarios.
Fixes: 3cdfcbfd32 ("powerpc: Change analyse_instr so it doesn't modify *regs")
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit cd63f3c ("powerpc/tm: Fix saving of TM SPRs in core dump")
added code to access TM SPRs in flush_tmregs_to_thread(). However
flush_tmregs_to_thread() does not check if TM feature is available on
CPU before trying to access TM SPRs in order to copy live state to
thread structures. flush_tmregs_to_thread() is indeed guarded by
CONFIG_PPC_TRANSACTIONAL_MEM but it might be the case that kernel
was compiled with CONFIG_PPC_TRANSACTIONAL_MEM enabled and ran on
a CPU without TM feature available, thus rendering the execution
of TM instructions that are treated by the CPU as illegal instructions.
The fix is just to add proper checking in flush_tmregs_to_thread()
if CPU has the TM feature before accessing any TM-specific resource,
returning immediately if TM is no available on the CPU. Adding
that checking in flush_tmregs_to_thread() instead of in places
where it is called, like in vsr_get() and vsr_set(), is better because
avoids the same problem cropping up elsewhere.
Cc: stable@vger.kernel.org # v4.13+
Fixes: cd63f3c ("powerpc/tm: Fix saving of TM SPRs in core dump")
Signed-off-by: Gustavo Romero <gromero@linux.vnet.ibm.com>
Reviewed-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Just one fix, for the handling of alignment interrupts on dcbz instructions.
Thanks to:
Paul Mackerras, Christian Zigotzky, Michal Sojka.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJZu5HyAAoJEFHr6jzI4aWA+yoP+gOKF4obMWnvSJ5YWZZWQee/
T++Uhps7AF0G2y99rBVUKzbXzweDPpBrNQW5cFlF51WinNCA4KJ4E08xoNSrnlKl
ebc6Tn2cHbpWgydSRxWDig/Dp90U95toKp14irO2YvOVWoTgYBX7gOHhkMcpxdLH
MeVe6IAatOSNNNpJK9gLrNDLZaDhTGqXf64VIaw44llf0R9avNFhoS+8F94bllm2
VSsXizP1HZxWFcxJ1W/F54zkOvOkSXxjmd9Lf0IOxERPAshZBO7MfDaXF9MaYXwL
4AU5GZc2UmWIJpwmSQqRQPpLOPFvsJkhNnXeUb/8kQAG2W+HfZy6BvHpVh1lajm7
j+tyIYe+E7Jt63tQxyxSKpGnu20po78iimciYhl77A2RksGl9Tw5sHWj8gok30TI
pQFIRLPAO38MnnD3OrsboAnBrruS0Og+lkSWan7nV5n8ybHflvimjG+MfkNx9vzL
as7yJThQ8D8JBRAyaIHG3aAUPgBDxpZwqhDU3rac0qZChOc9DH4X7YFHR1UzGH8n
J8h9e7IsKGfvm4O9j4t3/nq0Qwh0LvpAlfu1/tPdJ0KXc9BnQTVoCpkFzhA+IFV6
JY2y4F1MfW8ZWS20iPHLlYEXYcTtcTq5b47JYQx97TgSp82ShXSLwbUxXNwSraDv
umUvJle++yyRm927njki
=F9yZ
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"Just one fix, for the handling of alignment interrupts on dcbz
instructions.
Thanks to Paul Mackerras, Christian Zigotzky, Michal Sojka"
* tag 'powerpc-4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc: Fix handling of alignment interrupt on dcbz instruction
This fixes the emulation of the dcbz instruction in the alignment
interrupt handler. The error was that we were comparing just the
instruction type field of op.type rather than the whole thing,
and therefore the comparison "type != CACHEOP + DCBZ" was always
true.
Fixes: 31bfdb036f ("powerpc: Use instruction emulation infrastructure to handle alignment faults")
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Tested-by: Michal Sojka <sojkam1@fel.cvut.cz>
Tested-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
GFP_TEMPORARY was introduced by commit e12ba74d8f ("Group short-lived
and reclaimable kernel allocations") along with __GFP_RECLAIMABLE. It's
primary motivation was to allow users to tell that an allocation is
short lived and so the allocator can try to place such allocations close
together and prevent long term fragmentation. As much as this sounds
like a reasonable semantic it becomes much less clear when to use the
highlevel GFP_TEMPORARY allocation flag. How long is temporary? Can the
context holding that memory sleep? Can it take locks? It seems there is
no good answer for those questions.
The current implementation of GFP_TEMPORARY is basically GFP_KERNEL |
__GFP_RECLAIMABLE which in itself is tricky because basically none of
the existing caller provide a way to reclaim the allocated memory. So
this is rather misleading and hard to evaluate for any benefits.
I have checked some random users and none of them has added the flag
with a specific justification. I suspect most of them just copied from
other existing users and others just thought it might be a good idea to
use without any measuring. This suggests that GFP_TEMPORARY just
motivates for cargo cult usage without any reasoning.
I believe that our gfp flags are quite complex already and especially
those with highlevel semantic should be clearly defined to prevent from
confusion and abuse. Therefore I propose dropping GFP_TEMPORARY and
replace all existing users to simply use GFP_KERNEL. Please note that
SLAB users with shrinkers will still get __GFP_RECLAIMABLE heuristic and
so they will be placed properly for memory fragmentation prevention.
I can see reasons we might want some gfp flag to reflect shorterm
allocations but I propose starting from a clear semantic definition and
only then add users with proper justification.
This was been brought up before LSF this year by Matthew [1] and it
turned out that GFP_TEMPORARY really doesn't have a clear semantic. It
seems to be a heuristic without any measured advantage for most (if not
all) its current users. The follow up discussion has revealed that
opinions on what might be temporary allocation differ a lot between
developers. So rather than trying to tweak existing users into a
semantic which they haven't expected I propose to simply remove the flag
and start from scratch if we really need a semantic for short term
allocations.
[1] http://lkml.kernel.org/r/20170118054945.GD18349@bombadil.infradead.org
[akpm@linux-foundation.org: fix typo]
[akpm@linux-foundation.org: coding-style fixes]
[sfr@canb.auug.org.au: drm/i915: fix up]
Link: http://lkml.kernel.org/r/20170816144703.378d4f4d@canb.auug.org.au
Link: http://lkml.kernel.org/r/20170728091904.14627-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Neil Brown <neilb@suse.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull namespace updates from Eric Biederman:
"Life has been busy and I have not gotten half as much done this round
as I would have liked. I delayed it so that a minor conflict
resolution with the mips tree could spend a little time in linux-next
before I sent this pull request.
This includes two long delayed user namespace changes from Kirill
Tkhai. It also includes a very useful change from Serge Hallyn that
allows the security capability attribute to be used inside of user
namespaces. The practical effect of this is people can now untar
tarballs and install rpms in user namespaces. It had been suggested to
generalize this and encode some of the namespace information
information in the xattr name. Upon close inspection that makes the
things that should be hard easy and the things that should be easy
more expensive.
Then there is my bugfix/cleanup for signal injection that removes the
magic encoding of the siginfo union member from the kernel internal
si_code. The mips folks reported the case where I had used FPE_FIXME
me is impossible so I have remove FPE_FIXME from mips, while at the
same time including a return statement in that case to keep gcc from
complaining about unitialized variables.
I almost finished the work to get make copy_siginfo_to_user a trivial
copy to user. The code is available at:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git neuter-copy_siginfo_to_user-v3
But I did not have time/energy to get the code posted and reviewed
before the merge window opened.
I was able to see that the security excuse for just copying fields
that we know are initialized doesn't work in practice there are buggy
initializations that don't initialize the proper fields in siginfo. So
we still sometimes copy unitialized data to userspace"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
Introduce v3 namespaced file capabilities
mips/signal: In force_fcr31_sig return in the impossible case
signal: Remove kernel interal si_code magic
fcntl: Don't use ambiguous SIG_POLL si_codes
prctl: Allow local CAP_SYS_ADMIN changing exe_file
security: Use user_namespace::level to avoid redundant iterations in cap_capable()
userns,pidns: Verify the userns for new pid namespaces
signal/testing: Don't look for __SI_FAULT in userspace
signal/mips: Document a conflict with SI_USER with SIGFPE
signal/sparc: Document a conflict with SI_USER with SIGFPE
signal/ia64: Document a conflict with SI_USER with SIGFPE
signal/alpha: Document a conflict with SI_USER for SIGTRAP
First, number of CPUs can't be negative number.
Second, different signnnedness leads to suboptimal code in the following
cases:
1)
kmalloc(nr_cpu_ids * sizeof(X));
"int" has to be sign extended to size_t.
2)
while (loff_t *pos < nr_cpu_ids)
MOVSXD is 1 byte longed than the same MOV.
Other cases exist as well. Basically compiler is told that nr_cpu_ids
can't be negative which can't be deduced if it is "int".
Code savings on allyesconfig kernel: -3KB
add/remove: 0/0 grow/shrink: 25/264 up/down: 261/-3631 (-3370)
function old new delta
coretemp_cpu_online 450 512 +62
rcu_init_one 1234 1272 +38
pci_device_probe 374 399 +25
...
pgdat_reclaimable_pages 628 556 -72
select_fallback_rq 446 369 -77
task_numa_find_cpu 1923 1807 -116
Link: http://lkml.kernel.org/r/20170819114959.GA30580@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On POWER9, the Client Architecture Support (CAS) negotiation process
determines whether the guest operates in XIVE Legacy compatibility or
in XIVE exploitation mode. Now that we have initial guest support for
the XIVE interrupt controller, let's inform the hypervisor what we can
do.
The platform advertises the XIVE Exploitation Mode support using the
property "ibm,arch-vec-5-platform-support-vec-5", byte 23 bits 0-1 :
- 0b00 XIVE legacy mode Only
- 0b01 XIVE exploitation mode Only
- 0b10 XIVE legacy or exploitation mode
The OS asks for XIVE Exploitation Mode support using the property
"ibm,architecture-vec-5", byte 23 bits 0-1:
- 0b00 XIVE legacy mode Only
- 0b01 XIVE exploitation mode Only
Signed-off-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use DEVICE_ATTR_RW for read-write attributes. This simplifies the
source code, improves readbility, and reduces the chance of
inconsistencies.
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Omit an extra message for a memory allocation failure in
eeh_dev_init().
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[mpe: Do not drop the message that can happen at runtime and lead to
an event not being handled]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
memset() is patched after initialisation to activate the
optimised part which uses cache instructions.
Today we have a 'b 2f' to skip the optimised patch, which then gets
replaced by a NOP, implying a useless cycle consumption.
As we have a 'bne 2f' just before, we could use that instruction
for the live patching, hence removing the need to have a
dedicated 'b 2f' to be replaced by a NOP.
This patch changes the 'bne 2f' by a 'b 2f'. During init, that
'b 2f' is then replaced by 'bne 2f'
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This replaces almost all of the instruction emulation code in
fix_alignment() with calls to analyse_instr(), emulate_loadstore()
and emulate_dcbz(). The only emulation code left is the SPE
emulation code; analyse_instr() etc. do not handle SPE instructions
at present.
One result of this is that we can now handle alignment faults on
all the new VSX load and store instructions that were added in POWER9.
VSX loads/stores will take alignment faults for unaligned accesses
to cache-inhibited memory.
Another effect is that we no longer rely on the DAR and DSISR values
set by the processor.
With this, we now need to include the instruction emulation code
unconditionally.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>