The kernel parameter disable_radix accepts several forms:
disable_radix=yes|no|1|0, or just disable_radix.
The prom_init parsing does not support these forms.
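As a rough illustration of the option forms listed above, here is a
user-space C sketch that accepts a bare disable_radix as well as
disable_radix=yes|no|1|0. The names value_is() and
parse_disable_radix() and the return convention are hypothetical; the
real prom_init code is not shown in this message and cannot use the C
library.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if val starts with want and the value ends right after it. */
static bool value_is(const char *val, const char *want)
{
	size_t n = strlen(want);

	return strncmp(val, want, n) == 0 &&
	       (val[n] == '\0' || val[n] == ' ');
}

/*
 * Hypothetical helper: scan a command line for "disable_radix" and
 * report whether radix should be disabled. Returns dflt if the option
 * is absent or its value is not recognised.
 */
static bool parse_disable_radix(const char *cmdline, bool dflt)
{
	static const char opt[] = "disable_radix";
	const size_t len = sizeof(opt) - 1;
	const char *p = cmdline;

	while ((p = strstr(p, opt)) != NULL) {
		const char *val = p + len;
		/* Only accept a match at the start of a word. */
		bool word_start = (p == cmdline) || (p[-1] == ' ');

		p = val;
		if (!word_start)
			continue;

		/* A bare "disable_radix" means "yes". */
		if (*val == '\0' || *val == ' ')
			return true;
		if (*val != '=')
			continue;
		val++;

		if (value_is(val, "yes") || value_is(val, "1"))
			return true;
		if (value_is(val, "no") || value_is(val, "0"))
			return false;
	}
	return dflt;
}
```

With this sketch, "disable_radix" and "disable_radix=1" both yield
true, "disable_radix=no" yields false, and a command line without the
option falls back to the supplied default.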
Fixes: 1fd6c02207 ("powerpc/mm: Add a CONFIG option to choose if radix is used by default")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When stop is executed with EC=ESL=0, it appears to execute like a
normal instruction (resuming from NIP when woken by interrupt). So all
the save/restore handling can be avoided completely. In particular NV
GPRs do not have to be saved, and MSR does not have to be switched
back to kernel MSR.
So move the test for EC=ESL=0 sleep states out to power9_idle_stop,
and return directly to the caller after stop in that case.
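The EC=ESL=0 test described above can be sketched as a simple bit
check. The bit values below are assumed to match the kernel's PSSCR
definitions and should be checked against the ISA before being relied
on; stop_is_lite() is a hypothetical name, not the code in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PSSCR bit values; verify against the ISA before use. */
#define PSSCR_ESL 0x00200000ULL /* Enable State Loss */
#define PSSCR_EC  0x00100000ULL /* Exit Criterion */

/*
 * Hypothetical predicate: true when both EC and ESL are clear, i.e.
 * stop behaves like an ordinary instruction and the caller needs no
 * register save/restore around it.
 */
static bool stop_is_lite(uint64_t psscr)
{
	return (psscr & (PSSCR_EC | PSSCR_ESL)) == 0;
}
```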
This improves performance for ping-pong benchmark with the stop0_lite
idle state by 2.54% for 2 threads in the same core, and 2.57% for
different cores. Performance with HV_POSSIBLE defined will be
improved further by avoiding the hwsync.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 3d4fbffdd7 ("powerpc/64s/idle: POWER9 implement a separate
idle stop function for hotplug") that added power9_offline_stop() was
written before commit 7672691a08 ("powerpc/powernv: Provide a way to
force a core into SMT4 mode").
When merging the former I failed to notice that it caused us to skip
the force-SMT4 logic for offline CPUs. The result is that offlined
CPUs will not correctly participate in the force-SMT4 logic, which
presumably will result in badness (not tested).
Reconcile the two commits by making power9_offline_stop() a precursor
to power9_idle_stop(), so that they share the force-SMT4 logic.
This is based on an original commit from Nick, all breakage is my own.
Fixes: 3d4fbffdd7 ("powerpc/64s/idle: POWER9 implement a separate idle stop function for hotplug")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Merge tag 'kbuild-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- add a shell script to get Clang version
- improve portability of build scripts
- drop always-enabled CONFIG_THIN_ARCHIVE and remove unused code
- rename built-in.o which is now thin archive to built-in.a
- process clean/build targets one by one to get along with -j option
- simplify ld-option
- improve building with CONFIG_TRIM_UNUSED_KSYMS
- define KBUILD_MODNAME even for objects shared among multiple modules
- avoid linking multiple instances of same objects from composite
objects
- move <linux/compiler_types.h> to c_flags to include it only for C
files
- clean-up various Makefiles
* tag 'kbuild-v4.17' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (29 commits)
kbuild: get <linux/compiler_types.h> out of <linux/kconfig.h>
kbuild: clean up link rule of composite modules
kbuild: clean up archive rule of built-in.a
kbuild: remove partial section mismatch detection for built-in.a
net: liquidio: clean up Makefile for simpler composite object handling
lib: zstd: clean up Makefile for simpler composite object handling
kbuild: link $(real-obj-y) instead of $(obj-y) into built-in.a
kbuild: rename real-objs-y/m to real-obj-y/m
kbuild: move modname and modname-multi close to modname_flags
kbuild: simplify modname calculation
kbuild: fix modname for composite modules
kbuild: define KBUILD_MODNAME even if multiple modules share objects
kbuild: remove unnecessary $(subst $(obj)/, , ...) in modname-multi
kbuild: Use ls(1) instead of stat(1) to obtain file size
kbuild: link vmlinux only once for CONFIG_TRIM_UNUSED_KSYMS
kbuild: move include/config/ksym/* to include/ksym/*
kbuild: move CONFIG_TRIM_UNUSED_KSYMS code unneeded for external module
kbuild: restore autoksyms.h touch to the top Makefile
kbuild: move 'scripts' target below
kbuild: remove wrong 'touch' in adjust_autoksyms.sh
...
Pull sparc updates from David Miller:
1) Add support for ADI (Application Data Integrity) found in more
recent sparc64 cpus. Essentially this is key-based access to
virtual memory, and if the key encoded in the virtual address is
wrong you get a trap.
The mm changes were reviewed by Andrew Morton and others.
Work by Khalid Aziz.
2) Validate DAX completion index range properly, from Rob Gardner.
3) Add proper Kconfig deps for DAX driver. From Guenter Roeck.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
sparc64: Make atomic_xchg() an inline function rather than a macro.
sparc64: Properly range check DAX completion index
sparc: Make auxiliary vectors for ADI available on 32-bit as well
sparc64: Oracle DAX driver depends on SPARC64
sparc64: Update signal delivery to use new helper functions
sparc64: Add support for ADI (Application Data Integrity)
mm: Allow arch code to override copy_highpage()
mm: Clear arch specific VM flags on protection change
mm: Add address parameter to arch_validate_prot()
sparc64: Add auxiliary vectors to report platform ADI properties
sparc64: Add handler for "Memory Corruption Detected" trap
sparc64: Add HV fault type handlers for ADI related faults
sparc64: Add support for ADI register fields, ASIs and traps
mm, swap: Add infrastructure for saving page metadata on swap
signals, sparc: Add signal codes for ADI violations
The hard lockup watchdog can fire under local_irq_disable
on platforms with irq soft masking.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Use the NMI IPI rather than smp_call_function for smp_send_stop.
Have stopped CPUs hard disable interrupts rather than just soft
disable.
This function is used in crash/panic/shutdown paths to bring other
CPUs down as quickly and reliably as possible, and to minimize their
potential to cause trouble.
Avoiding the Linux smp_call_function infrastructure and (if supported)
using true NMI IPIs makes this more robust.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The PSSCR value is not stored to PACA_REQ_PSSCR if the CPU does not
have the XER[SO] bug.
Fix this by storing up-front, outside the workaround code. The initial
test is not required because it is a slow path.
The workaround is made to depend on CONFIG_KVM_BOOK3S_HV_POSSIBLE, to
match pnv_power9_force_smt4_catch() where it is used. Drop the comment
on pnv_power9_force_smt4_catch() as it's no longer true.
Fixes: 7672691a08 ("powerpc/powernv: Provide a way to force a core into SMT4 mode")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves the definition of the default security feature flags
(i.e., enabled by default) closer to the security feature flags.
This can be used to restore current flags to the default flags.
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
flush_thread() calls __set_breakpoint() via set_debug_reg_defaults()
without checking ppc_breakpoint_available(). On Power8 or later CPUs
which have the DAWR feature disabled that will cause a write to the
DABR which is incorrect as those CPUs don't have a DABR.
Fix it two ways, by checking ppc_breakpoint_available() in
set_debug_reg_defaults(), and also by reworking __set_breakpoint() to
only write to DABR on Power7 or earlier.
Fixes: 9654153158 ("powerpc: Disable DAWR in the base POWER9 CPU features")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Rework the logic in __set_breakpoint()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit 8e0b634b13 ("powerpc/64s: Do not allocate lppaca if we are
not virtualized") removed allocation of lppaca on bare metal
platforms. But with CONFIG_PPC_SPLPAR enabled, we still access the
lppaca on bare metal in some code paths.
Fix this by adding runtime checks for SPLPAR (shared processor LPAR).
Fixes: 8e0b634b13 ("powerpc/64s: Do not allocate lppaca if we are not virtualized")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
"System calls are interaction points between userspace and the kernel.
Therefore, system call functions such as sys_xyzzy() or
compat_sys_xyzzy() should only be called from userspace via the
syscall table, but not from elsewhere in the kernel.
At least on 64-bit x86, it will likely be a hard requirement from
v4.17 onwards to not call system call functions in the kernel: It is
better to use a different calling convention for system calls
there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
which then hands processing over to the actual syscall function. This
means that only those parameters which are actually needed for a
specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the
time (which may cause serious trouble down the call chain). Those
x86-specific patches will be pushed through the x86 tree in the near
future.
Moreover, rules on how data may be accessed may differ between kernel
data and user data. This is another reason why calling sys_xyzzy() is
generally a bad idea, and -- at most -- acceptable in arch-specific
code.
This patchset removes all in-kernel calls to syscall functions in the
kernel with the exception of arch/. On top of this, it cleans up the
three places where many syscalls are referenced or prototyped, namely
kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
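The calling convention described in the quoted text can be pictured
with a small user-space sketch. Everything here is hypothetical
(struct sketch_pt_regs, do_xyzzy(), ksys_xyzzy() and wrap_sys_xyzzy()
are illustrative names, not the real x86 definitions): the entry
wrapper decodes the register block and forwards only the arguments the
call needs, while in-kernel users call the ksys_ helper instead of the
syscall function.

```c
/* Hypothetical stand-in for struct pt_regs: six argument registers. */
struct sketch_pt_regs {
	unsigned long arg[6];
};

/* The actual work; takes only the parameters it needs. */
static long do_xyzzy(int fd, unsigned long len)
{
	return fd < 0 ? -1 : (long)len;
}

/* In-kernel helper: other kernel code calls this, not the syscall. */
static long ksys_xyzzy(int fd, unsigned long len)
{
	return do_xyzzy(fd, len);
}

/*
 * Syscall entry wrapper: decodes the register block on the fly and
 * forwards only the two arguments this call uses. arg[2]..arg[5] may
 * hold arbitrary user-space values and are never read.
 */
static long wrap_sys_xyzzy(const struct sketch_pt_regs *regs)
{
	return ksys_xyzzy((int)regs->arg[0], regs->arg[1]);
}
```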
* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
bpf: whitelist all syscalls for error injection
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
kernel/sys_ni: sort cond_syscall() entries
syscalls/x86: auto-create compat_sys_*() prototypes
syscalls: sort syscall prototypes in include/linux/compat.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
kexec: move sys_kexec_load() prototype to syscalls.h
x86/sigreturn: use SYSCALL_DEFINE0
x86: fix sys_sigreturn() return type to be long, not unsigned long
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
...
Using this helper allows us to avoid the in-kernel calls to the
sys_readahead() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_readahead().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_mmap_pgoff() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_mmap_pgoff().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_fadvise64_64() helper allows us to avoid the in-kernel
calls to the sys_fadvise64_64() syscall. The ksys_ prefix denotes that
this function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_fadvise64_64().
Some compat stubs called sys_fadvise64(), which then just passed through
the arguments to sys_fadvise64_64(). Get rid of this indirection, and call
ksys_fadvise64_64() directly.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-mm@kvack.org
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_fallocate() wrapper allows us to get rid of in-kernel
calls to the sys_fallocate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_fallocate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_p{read,write}64() wrappers allows us to get rid of
in-kernel calls to the sys_pread64() and sys_pwrite64() syscalls.
The ksys_ prefix denotes that these functions are meant as drop-in
replacements for the syscalls. In particular, they use the same calling
convention as sys_p{read,write}64().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_truncate() wrapper allows us to get rid of in-kernel
calls to the sys_truncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_truncate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper allows us to avoid the in-kernel calls to the
sys_sync_file_range() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_sync_file_range().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the ksys_ftruncate() wrapper allows us to get rid of in-kernel
calls to the sys_ftruncate() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In
particular, it uses the same calling convention as sys_ftruncate().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
When using SIG_DBG_BRANCH_TRACING, MSR.BE is left enabled in the
user context when single_step_exception() prepares the SIGTRAP
delivery. The resulting branch-trap-within-the-SIGTRAP-handler
isn't healthy.
Commit 2538c2d08f broke this, by
replacing an MSR mask operation of ~(MSR_SE | MSR_BE) with a call
to clear_single_step() which only clears MSR_SE.
This patch adds a new helper, clear_br_trace(), which clears the
debug trap before invoking the signal handler. This helper is a
NOP for BookE as SIG_DBG_BRANCH_TRACING isn't supported on BookE.
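The difference between the two helpers can be sketched in user-space
C as follows. The bit values and struct sketch_regs are illustrative
assumptions rather than the kernel's definitions; only the helper
names come from this message.

```c
#include <stdint.h>

/* Illustrative bit values; the real ones live in the powerpc headers. */
#define MSR_SE (1ULL << 10) /* single-step trace enable */
#define MSR_BE (1ULL << 9)  /* branch trace enable */

/* Hypothetical stand-in for the saved user register state. */
struct sketch_regs {
	uint64_t msr;
};

/* Clears only MSR_SE, which is why MSR_BE used to survive. */
static void clear_single_step(struct sketch_regs *regs)
{
	regs->msr &= ~MSR_SE;
}

/* Sketch of the new helper: clears the branch-trace enable bit. */
static void clear_br_trace(struct sketch_regs *regs)
{
	regs->msr &= ~MSR_BE;
}
```

Calling clear_single_step() alone leaves MSR_BE set, which matches
the regression described above; clear_br_trace() clears the remaining
bit.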
Signed-off-by: Matt Evans <matt@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER4 has been broken since at least the change 49d09bf2a6
("powerpc/64s: Optimise MSR handling in exception handling"), which
requires mtmsrd L=1 support. This was introduced in ISA v2.01, and
POWER4 supports ISA v2.00.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The CPU_FTR_POWER9_DD2_1 flag is intended to be set for DD2.1 and
above (which is what the cputable setup does). Fix DT CPU features
quirk setup to match.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Merge with upstream changes]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Rather than override the machine type in .S code (which can hide wrong
or ambiguous code generation for the target), set the type to power4
for all assembly.
This also means we need to be careful not to build power4-only code
when we're not building for Book3S, such as the "power7" versions of
copyuser/page/memcpy.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix Book3E build, don't build the "power7" variants for non-Book3S]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When waking from a CPU idle instruction (e.g., nap or stop), the sync
for ordering the KVM secondary thread state can be avoided if the
wakeup is coming from a kernel context rather than KVM context.
This improves performance for ping-pong benchmark with the stop0 idle
state by 0.46% for 2 threads in the same core, and 1.02% for different
cores.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Implement a new function to invoke stop, power9_offline_stop, which is
like power9_idle_stop but used by the cpu hotplug code.
Move KVM secondary state manipulation code to the offline case.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
system_reset_exception does most of its own crash handling now,
invoking the debugger or crash dumps if they are registered. If not,
then it goes through to die() to print stack traces, and then is
supposed to panic (according to comments).
However after die() prints oopses, it does its own handling which
doesn't allow system_reset_exception to panic (e.g., it may just
kill the current process). This patch causes sreset exceptions to
return from die after it prints messages but before acting.
This also stops die from invoking the debugger on 0x100 crashes.
system_reset_exception similarly calls the debugger. It had been
thought this was harmless (because if the debugger was disabled,
neither call would fire, and if it was enabled the first call
would return). However in some cases like xmon 'X' command, the
debugger returns 0, which currently causes it to be entered
again (first in system_reset_exception, then in die), which is
confusing.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
System Reset, being an NMI, must return more carefully than other
interrupts. It has traditionally returned via the normal return
from exception path, but that has a number of problems.
- r13 does not get restored if returning to kernel. This is for
interrupts which may cause a context switch, which sreset will
never do. Interrupting OPAL (which uses a different r13) is one
place where this causes breakage.
- It may cause several other problems returning to kernel with
preempt or TIF_EMULATE_STACK_STORE if it hits at the wrong time.
It's safer just to have a simple restore and return, like machine
check which is the other NMI.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The current EEH callbacks can race with a driver unbind. This can
result in backtraces like this:
EEH: Frozen PHB#0-PE#1fc detected
EEH: PE location: S000009, PHB location: N/A
CPU: 2 PID: 2312 Comm: kworker/u258:3 Not tainted 4.15.6-openpower1 #2
Workqueue: nvme-wq nvme_reset_work [nvme]
Call Trace:
dump_stack+0x9c/0xd0 (unreliable)
eeh_dev_check_failure+0x420/0x470
eeh_check_failure+0xa0/0xa4
nvme_reset_work+0x138/0x1414 [nvme]
process_one_work+0x1ec/0x328
worker_thread+0x2e4/0x3a8
kthread+0x14c/0x154
ret_from_kernel_thread+0x5c/0xc8
nvme nvme1: Removing after probe failure status: -19
<snip>
cpu 0x23: Vector: 300 (Data Access) at [c000000ff50f3800]
pc: c0080000089a0eb0: nvme_error_detected+0x4c/0x90 [nvme]
lr: c000000000026564: eeh_report_error+0xe0/0x110
sp: c000000ff50f3a80
msr: 9000000000009033
dar: 400
dsisr: 40000000
current = 0xc000000ff507c000
paca = 0xc00000000fdc9d80 softe: 0 irq_happened: 0x01
pid = 782, comm = eehd
Linux version 4.15.6-openpower1 (smc@smc-desktop) (gcc version 6.4.0 (Buildroot 2017.11.2-00008-g4b6188e)) #2 SM P Tue Feb 27 12:33:27 PST 2018
enter ? for help
eeh_report_error+0xe0/0x110
eeh_pe_dev_traverse+0xc0/0xdc
eeh_handle_normal_event+0x184/0x4c4
eeh_handle_event+0x30/0x288
eeh_event_handler+0x124/0x170
kthread+0x14c/0x154
ret_from_kernel_thread+0x5c/0xc8
The first part is an EEH (on boot), the second half is the resulting
crash. nvme probe starts the nvme_reset_work() worker thread. This
worker thread starts touching the device, which sees a device error
(EEH) and hence queues up an event in the powerpc EEH worker
thread. nvme_reset_work() then continues and runs
nvme_remove_dead_ctrl_work() which results in unbinding the driver
from the device and hence releases all resources. At the same time,
the EEH worker thread starts doing the EEH .error_detected() driver
callback, which no longer works since the resources have been freed.
This fixes the problem in the same way the generic PCIe AER code (in
drivers/pci/pcie/aer/aerdrv_core.c) does. It makes the EEH code hold
the device_lock() while performing the driver EEH callbacks and
associated code. This ensures either the callbacks are no longer
registered, or if they are registered the driver will not be removed
from underneath us.
This has been broken forever. The EEH callbacks were first introduced
in 2005 (in 77bd741561) but it's not clear if a lock was needed back
then.
Fixes: 77bd741561 ("[PATCH] powerpc: PCI Error Recovery: PPC64 core recovery routines")
Cc: stable@vger.kernel.org # v2.6.16+
Signed-off-by: Michael Neuling <mikey@neuling.org>
Reviewed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
kexec_file_load() on powerpc doesn't support kdump kernels yet, so it
returns -ENOTSUPP in that case.
I've recently learned that this errno is internal to the kernel and
isn't supposed to be exposed to userspace. Therefore, change to
-EOPNOTSUPP which is defined in an uapi header.
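The distinction can be illustrated with a small user-space sketch.
The value 524 is taken from the "Unknown error 524" output quoted in
this message; KERNEL_ENOTSUPP and to_uapi_errno() are hypothetical
names. The patch itself simply returns a different constant, it does
not add a translation helper like this one.

```c
#include <errno.h>

/* Kernel-internal value, as seen in the "Unknown error 524" output. */
#define KERNEL_ENOTSUPP 524

/*
 * Hypothetical helper: translate the kernel-internal error into the
 * uapi-visible one that the C library can describe. Other values pass
 * through unchanged.
 */
static int to_uapi_errno(int err)
{
	return err == -KERNEL_ENOTSUPP ? -EOPNOTSUPP : err;
}
```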
This does indeed make kexec-tools happier. Before the patch, on
ppc64le:
# ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
kexec_file_load failed: Unknown error 524
After the patch:
# ~bauermann/src/kexec-tools/build/sbin/kexec -s -p /boot/vmlinuz
kexec_file_load failed: Operation not supported
Fixes: a0458284f0 ("powerpc: Add support code for kexec_file_load()")
Cc: stable@vger.kernel.org # v4.10+
Reported-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.vnet.ibm.com>
Reviewed-by: Simon Horman <horms@verge.net.au>
Reviewed-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On 64-bit Book3E systems, in setup_tlb_core_data() we reference other
CPUs' pacas. But in commit 59f577743d ("powerpc/64: Defer paca
allocation until memory topology is discovered") the allocation of
non-boot-CPU pacas was deferred until later in boot.
This leads to an oops:
CPU maps initialized for 1 thread per core
Unable to handle kernel paging request for data at address 0x8888888888888918
Faulting instruction address: 0xc000000000e2f0d0
Oops: Kernel access of bad area, sig: 11 [#1]
NIP .setup_tlb_core_data+0xdc/0x160
Call Trace:
.setup_tlb_core_data+0x5c/0x160 (unreliable)
.setup_arch+0x80/0x348
.start_kernel+0x7c/0x598
start_here_common+0x1c/0x40
Luckily setup_tlb_core_data() is called immediately prior to
smp_setup_pacas(). So simply switching their order is sufficient to
fix the oops and seems unlikely to have any other unwanted side
effects.
Fixes: 59f577743d ("powerpc/64: Defer paca allocation until memory topology is discovered")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Bring in yet another series that touches KVM code, and might need to
be merged into the kvm-ppc branch to resolve conflicts.
This required some changes in pnv_power9_force_smt4_catch/release()
due to the paca array becoming an array of pointers.
For addresses above 512TB we allocate additional mmu contexts. To make
it all easy, addresses above 512TB are handled with IR/DR=1 and with
stack frame setup.
The mmu_context_t is also updated to track the new extended_ids. To
support up to 4PB we need a total of 8 contexts.
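The count of 8 follows from the sizes: 512TB is 2^49 bytes and 4PB is
2^52 bytes, so 2^52 / 2^49 = 8. A minimal sketch of selecting a
context from an effective address is shown below; the struct, macro
and function names are hypothetical and are not the definitions added
by this patch.

```c
#include <stdint.h>

#define CONTEXT_SHIFT 49 /* 512TB == 2^49 bytes per context */
#define MAX_EA_SHIFT  52 /* 4PB == 2^52 bytes of address space */
#define NR_CONTEXTS   (1u << (MAX_EA_SHIFT - CONTEXT_SHIFT))

/* Hypothetical stand-in for the extended_ids kept in mmu_context_t. */
struct sketch_mm_context {
	uint64_t extended_id[NR_CONTEXTS];
};

/* Hypothetical lookup: pick the context id covering address ea. */
static uint64_t sketch_get_ea_context(const struct sketch_mm_context *ctx,
				      uint64_t ea)
{
	uint64_t index = ea >> CONTEXT_SHIFT;

	if (index >= NR_CONTEXTS)
		return 0; /* out of range: no valid context */
	return ctx->extended_id[index];
}
```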
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Minor formatting tweaks and comment wording, switch BUG to WARN
in get_ea_context().]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Michael Ellerman reported the following call trace when running
ftracetest:
BUG: using __this_cpu_write() in preemptible [00000000] code: ftracetest/6178
caller is opt_pre_handler+0xc4/0x110
CPU: 1 PID: 6178 Comm: ftracetest Not tainted 4.15.0-rc7-gcc6x-gb2cd1df #1
Call Trace:
[c0000000f9ec39c0] [c000000000ac4304] dump_stack+0xb4/0x100 (unreliable)
[c0000000f9ec3a00] [c00000000061159c] check_preemption_disabled+0x15c/0x170
[c0000000f9ec3a90] [c000000000217e84] opt_pre_handler+0xc4/0x110
[c0000000f9ec3af0] [c00000000004cf68] optimized_callback+0x148/0x170
[c0000000f9ec3b40] [c00000000004d954] optinsn_slot+0xec/0x10000
[c0000000f9ec3e30] [c00000000004bae0] kretprobe_trampoline+0x0/0x10
This is showing up since OPTPROBES is now enabled with CONFIG_PREEMPT.
trampoline_probe_handler() considers itself to be a special kprobe
handler for kretprobes. In doing so, it expects to be called from
kprobe_handler() on a trap, and re-enables preemption before returning a
non-zero return value so as to suppress any subsequent processing of the
trap by the kprobe_handler().
However, with optprobes, we don't deal with special handlers (we ignore
the return code) and just try to re-enable preemption causing the above
trace.
To address this, modify trampoline_probe_handler() to not be special.
The only additional processing done in kprobe_handler() is to emulate
the instruction (in this case, a 'nop'). We adjust the value of
regs->nip for the purpose and delegate the job of re-enabling
preemption and resetting current kprobe to the probe handlers
(kprobe_handler() or optimized_callback()).
Fixes: 8a2d71a3f2 ("powerpc/kprobes: Disable preemption before invoking probe handler for optprobes")
Cc: stable@vger.kernel.org # v4.15+
Reported-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Acked-by: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Per-node allocations are possible on 64s with radix, which does
not have the bolted SLB limitation.
Hash would be able to do the same if all CPUs had the bottom of
their node-local memory bolted as well. This is left as an
exercise for the reader.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Add dummy definition of boot_cpuid for !SMP]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Rename the dummy allocate_pacas() to fix 32-bit build]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Build an array that maps logical CPU number to hardware CPU number
during firmware CPU discovery. Use that, rather than setting the
paca of other CPUs directly, to begin with. A subsequent patch will
not have pacas allocated at this point.
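A minimal sketch of such a lookup table, assuming a fixed upper bound
on the number of CPUs. All names here are hypothetical and are not
the ones used by the patch.

```c
#include <stddef.h>

#define SKETCH_NR_CPUS 8

/* Hypothetical table filled in during firmware CPU discovery. */
static int sketch_cpu_to_phys_id[SKETCH_NR_CPUS];

/* Record the hardware id for a logical CPU; ignores bad indexes. */
static void sketch_set_hard_cpu_id(size_t logical, int hw_id)
{
	if (logical < SKETCH_NR_CPUS)
		sketch_cpu_to_phys_id[logical] = hw_id;
}

/* Returns -1 for an out-of-range logical CPU number. */
static int sketch_get_hard_cpu_id(size_t logical)
{
	return logical < SKETCH_NR_CPUS ? sketch_cpu_to_phys_id[logical] : -1;
}
```

Because the table lives outside the pacas, it can be filled in before
any paca has been allocated.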
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix SMP=n build by adding #ifdef in arch_match_cpu_phys_id()]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move this into the early setup code, and don't iterate over CPU masks.
We don't want to call into sysfs so early from setup, and a future patch
won't initialize CPU masks by the time this is called.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fold in incremental fix from Nick for DSCR handling]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Split sparsemem initialisation from basic numa topology discovery.
Move the parsing earlier in boot, before pacas are allocated.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
slb_shadow structures are avoided for radix environment.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We no longer allocate lppacas in an array, so this patch removes the
1kB static alignment for the structure, and enforces the PAPR
alignment requirements at allocation time. We cannot reduce the 1kB
allocation size however, due to existing KVM hypervisors.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Change the paca array into an array of pointers to pacas. Allocate
pacas individually.
This allows flexibility in where the PACAs are allocated. Future work
will allocate them node-local. Platforms that don't have address limits
on PACAs would be able to defer PACA allocations until later in boot
rather than allocating all possible ones up-front and then freeing the
unused ones.
This is slightly more overhead (one additional indirection) for cross
CPU paca references, but those aren't too common.
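The layout change can be sketched in user-space C as follows. The
struct and function names are hypothetical, and malloc()/calloc()
stand in for the early boot allocator.

```c
#include <stdlib.h>

/* Hypothetical, heavily reduced per-CPU structure. */
struct sketch_paca {
	int hw_cpu_id;
};

/* Array of pointers: slots for CPUs that never appear stay NULL. */
static struct sketch_paca **sketch_paca_ptrs;

/* Allocate only the pointer array up-front. */
static int sketch_allocate_paca_ptrs(size_t nr_cpus)
{
	sketch_paca_ptrs = calloc(nr_cpus, sizeof(*sketch_paca_ptrs));
	return sketch_paca_ptrs ? 0 : -1;
}

/* Allocate one paca on its own; it could come from node-local memory. */
static int sketch_allocate_paca(size_t cpu, int hw_cpu_id)
{
	struct sketch_paca *p = malloc(sizeof(*p));

	if (!p)
		return -1;
	p->hw_cpu_id = hw_cpu_id;
	sketch_paca_ptrs[cpu] = p;
	return 0;
}
```

A cross-CPU reference then reads sketch_paca_ptrs[cpu]->hw_cpu_id,
which is the one additional indirection mentioned above.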
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The "lppaca" is a structure registered with the hypervisor. This is
unnecessary when running on non-virtualised platforms. One field from
the lppaca (pmcregs_in_use) is also used by the host, so move the host
part out into the paca (lppaca field is still updated in
guest mode).
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
[mpe: Fix non-pseries build with some #ifdefs]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Merge our fixes branch from the 4.16 cycle.
There were a number of important fixes merged, in particular some Power9
workarounds that we want in next for testing purposes. There's also been
some conflicting changes in the CPU features code which are best merged
and tested before going upstream.
This disables the DAWR on all POWER9 CPUs via cpu feature quirk.
Using the DAWR on POWER9 can cause xstops, hence we need to disable
it.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This updates the ptrace code to use ppc_breakpoint_available().
We now advertise via PPC_PTRACE_GETHWDBGINFO zero breakpoints when the
DAWR is missing (ie. POWER9). This results in GDB falling back to
software emulation of the breakpoint (which is slow).
For the features advertised by PPC_PTRACE_GETHWDBGINFO, we keep
advertising DAWR because, if we don't, GDB assumes 1 breakpoint irrespective
of the number of breakpoints advertised. GDB then fails later when
trying to set this one breakpoint.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add ppc_breakpoint_available() to determine if a breakpoint is
available currently via the DAWR or DABR.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Checking for a "fully active" device state requires testing two flag
bits, which is open coded in several places, so add a function to do
it.
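The kind of helper described here could look roughly like the following sketch. The flag names and values are placeholders chosen for illustration, not copied from the EEH headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for this sketch only. */
#define SKETCH_STATE_MMIO_ACTIVE (1 << 2)
#define SKETCH_STATE_DMA_ACTIVE  (1 << 3)

/* True only when both the MMIO and the DMA bit are set. */
static bool state_fully_active_sketch(int state)
{
    const int both = SKETCH_STATE_MMIO_ACTIVE | SKETCH_STATE_DMA_ACTIVE;

    /* This sketch expects a valid state word, not a negative error code. */
    assert(state >= 0);
    return (state & both) == both;
}
```

Comparing against the combined mask, rather than testing for a non-zero result, is what distinguishes "both bits set" from "either bit set".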
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The caller will always pass NULL for 'rmv_data' when
'eeh_aware_driver' is true, so the first two calls to
eeh_pe_dev_traverse() can be combined without changing behaviour as
can the two arms of the final 'if' block.
This should not change behaviour.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
eeh_reset_device() tests the value of 'bus' more than once but the
only caller, eeh_handle_normal_device() does this test itself and will
never pass NULL.
So, remove the dead tests.
This should not change behaviour.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It is currently difficult to understand the behaviour of
eeh_reset_device() due to the way its parameters are used. In
particular, when 'bus' is NULL, its value is still necessary so the
same value is looked up again locally under a different name
('frozen_bus') but behaviour is changed.
To clarify this, add a new parameter 'driver_eeh_aware', and have the
caller set it when it would have passed NULL for 'bus' and always pass
a value for 'bus'. Then change any test that was on 'bus' to one on
'!driver_eeh_aware' and replace uses of 'frozen_bus' with 'bus'.
Also update the function's comment.
This should not change behaviour.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The name "frozen_bus" is misleading: it's not necessarily frozen, it's
just the PE's PCI bus.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Remove a test that checks if "frozen_bus" is NULL, because it cannot
have changed since it was tested at the start of the function and so
must be true here.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit "0ba178888b05 powerpc/eeh: Remove reference to PCI device"
removed a call to pci_dev_get() from __eeh_addr_cache_get_device() but
did not update the comment to match.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the EEH_PE_RECOVERING flag for a PE is managed by both the
caller and callee of eeh_handle_normal_event() (among other places not
considered here). This is complicated by the fact that the PE may
or may not have been invalidated by the call.
So move the callee's handling into eeh_handle_normal_event(), which
clarifies it and allows the return type to be changed to void (because
it no longer needs to indicate that the PE has been invalidated).
This should not change behaviour except in eeh_event_handler() where
it was previously possible to cause eeh_pe_state_clear() to be called
on an invalid PE, which is now avoided.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The function eeh_handle_event(pe) does nothing other than switching
between calling eeh_handle_normal_event(pe) and
eeh_handle_special_event(). However it is only called in two places,
one where pe can't be NULL and the other where it must be NULL (see
eeh_event_handler()) so it does nothing but obscure the flow of
control.
So, remove it.
Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the pseries kernel advertises radix MMU support even if
the actual support is disabled via the CONFIG_PPC_RADIX_MMU option.
This adds a check for CONFIG_PPC_RADIX_MMU to avoid advertising radix
to the hypervisor.
Suggested-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a definition for cpu_show_spectre_v2() to override the generic
version. This has several permutations; in practice some may not
occur, but we cater for any combination.
The most verbose is:
Mitigation: Indirect branch serialisation (kernel only), Indirect
branch cache disabled, ori31 speculation barrier enabled
We don't treat the ori31 speculation barrier as a mitigation on its
own, because it has to be *used* by code in order to be a mitigation
and we don't know if userspace is doing that. So if that's all we see
we say:
Vulnerable, ori31 speculation barrier enabled
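The combinations described above can be sketched in user-space C as follows. The three booleans stand in for the firmware-derived feature flags, and `format_spectre_v2_sketch()` is a hypothetical name; the real code writes to the sysfs buffer rather than a caller-supplied string.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Append s to the NUL-terminated string in buf (capacity len). */
static void append_sketch(char *buf, size_t len, const char *s)
{
    size_t used = strlen(buf);

    assert(used < len);
    snprintf(buf + used, len - used, "%s", s);
}

/*
 * bcs: indirect branches are serialised (kernel only)
 * ccd: indirect branch cache is disabled
 * ori: ori31 speculation barrier is enabled
 */
static void format_spectre_v2_sketch(char *buf, size_t len,
                                     bool bcs, bool ccd, bool ori)
{
    assert(len > 0);
    buf[0] = '\0';

    if (bcs || ccd) {
        append_sketch(buf, len, "Mitigation: ");
        if (bcs)
            append_sketch(buf, len, "Indirect branch serialisation (kernel only)");
        if (bcs && ccd)
            append_sketch(buf, len, ", ");
        if (ccd)
            append_sketch(buf, len, "Indirect branch cache disabled");
    } else {
        /* The barrier alone is not counted as a mitigation. */
        append_sketch(buf, len, "Vulnerable");
    }

    if (ori)
        append_sketch(buf, len, ", ori31 speculation barrier enabled");
}
```

Note that the ori31 text is appended in both branches, which is how the "Vulnerable, ori31 speculation barrier enabled" case arises.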
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a definition for cpu_show_spectre_v1() to override the generic
version. Currently this just prints "Not affected" or "Vulnerable"
based on the firmware flag.
Although the kernel does have array_index_nospec() in a few places, we
haven't yet audited all the powerpc code to see where it's necessary,
so for now we don't list that as a mitigation.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Now that we have the security feature flags we can make the
information displayed in the "meltdown" file more informative.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This landed in setup_64.c for no good reason other than we had nowhere
else to put it. Now that we have a security-related file, that is a
better place for it so move it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This commit adds security feature flags to reflect the settings we
receive from firmware regarding Spectre/Meltdown mitigations.
The feature names reflect the names we are given by firmware on bare
metal machines. See the hostboot source for details.
Arguably these could be firmware features, but that then requires them
to be read early in boot so they're available prior to asm feature
patching, but we don't actually want to use them for patching. We may
also want to dynamically update them in future, which would be
incompatible with the way firmware features work (at the moment at
least). So for now just make them separate flags.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently the rfi-flush messages print 'Using <type> flush' for all
enabled_flush_types, but that is not necessarily true -- as now the
fallback flush is always enabled on pseries, but the fixup function
overwrites its nop/branch slot with other flush types, if available.
So, replace the 'Using <type> flush' messages with '<type> flush is
available'.
Also, print the patched flush types in the fixup function, so users
can know what is (not) being used (e.g., the slower, fallback flush,
or no flush type at all if flush is disabled via the debugfs switch).
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
For PowerVM migration we want to be able to call setup_rfi_flush()
again after we've migrated the partition.
To support that we need to check that we're not trying to allocate the
fallback flush area after memblock has gone away (i.e., boot-time only).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
rfi_flush_enable() includes a check to see if we're already
enabled (or disabled), and in that case does nothing.
But that means calling setup_rfi_flush() a 2nd time doesn't actually
work, which is a bit confusing.
Move that check into the debugfs code, where it really belongs.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The SLB bad address handler's trap number fixup does not preserve the
low bit that indicates nonvolatile GPRs have not been saved. This
leads save_nvgprs to skip saving them, and subsequent functions and
return from interrupt will think they are saved.
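In C terms, the intended fixup behaves roughly like the sketch below. The real handler is assembly; `fixup_trap_sketch()` and the macro name are hypothetical, and the vector values used in any call are only examples.

```c
#include <assert.h>

/* Low bit of the trap word: set when non-volatile GPRs are NOT saved. */
#define TRAP_NVGPRS_NOT_SAVED_SKETCH 0x1UL

/* Replace the vector number but carry the low bit over unchanged. */
static unsigned long fixup_trap_sketch(unsigned long old_trap,
                                       unsigned long new_vector)
{
    /* The new vector itself must not claim anything about the NV GPRs. */
    assert((new_vector & TRAP_NVGPRS_NOT_SAVED_SKETCH) == 0);
    return new_vector | (old_trap & TRAP_NVGPRS_NOT_SAVED_SKETCH);
}
```

Overwriting the whole word with the new vector would silently clear the low bit, which is the bug described here.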
This causes kernel branch-to-garbage debugging to not have correct
registers, and can also cause userspace to have its registers clobbered
after a segfault.
Fixes: f0f558b131 ("powerpc/mm: Preserve CFAR value on SLB miss caused by access to bogus address")
Cc: stable@vger.kernel.org # v4.9+
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Incremental linking is gone, so rename built-in.o to built-in.a, which
is the usual extension for archive files.
This patch does two things, first is a simple search/replace:
git grep -l 'built-in\.o' | xargs sed -i 's/built-in\.o/built-in\.a/g'
The second is to invert nesting of nested text manipulations to avoid
filtering built-in.a out from libs-y2:
-libs-y2 := $(filter-out %.a, $(patsubst %/, %/built-in.a, $(libs-y)))
+libs-y2 := $(patsubst %/, %/built-in.a, $(filter-out %.a, $(libs-y)))
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
POWER9 has hardware bugs relating to transactional memory and thread
reconfiguration (changes to hardware SMT mode). Specifically, the core
does not have enough storage to store a complete checkpoint of all the
architected state for all four threads. The DD2.2 version of POWER9
includes hardware modifications designed to allow hypervisor software
to implement workarounds for these problems. This patch implements
those workarounds in KVM code so that KVM guests see a full, working
transactional memory implementation.
The problems center around the use of TM suspended state, where the
CPU has a checkpointed state but execution is not transactional. The
workaround is to implement a "fake suspend" state, which looks to the
guest like suspended state but the CPU does not store a checkpoint.
In this state, any instruction that would cause a transition to
transactional state (rfid, rfebb, mtmsrd, tresume) or would use the
checkpointed state (treclaim) causes a "soft patch" interrupt (vector
0x1500) to the hypervisor so that it can be emulated. The trechkpt
instruction also causes a soft patch interrupt.
On POWER9 DD2.2, we avoid returning to the guest in any state which
would require a checkpoint to be present. The trechkpt in the guest
entry path which would normally create that checkpoint is replaced by
either a transition to fake suspend state, if the guest is in suspend
state, or a rollback to the pre-transactional state if the guest is in
transactional state. Fake suspend state is indicated by a flag in the
PACA plus a new bit in the PSSCR. The new PSSCR bit is write-only and
reads back as 0.
On exit from the guest, if the guest is in fake suspend state, we still
do the treclaim instruction as we would in real suspend state, in order
to get into non-transactional state, but we do not save the resulting
register state since there was no checkpoint.
Emulation of the instructions that cause a softpatch interrupt is
handled in two paths. If the guest is in real suspend mode, we call
kvmhv_p9_tm_emulation_early() to handle the cases where the guest is
transitioning to transactional state. This is called before we do the
treclaim in the guest exit path; because we haven't done treclaim, we
can get back to the guest with the transaction still active. If the
instruction is a case that kvmhv_p9_tm_emulation_early() doesn't
handle, or if the guest is in fake suspend state, then we proceed to
do the complete guest exit path and subsequently call
kvmhv_p9_tm_emulation() in host context with the MMU on. This handles
all the cases including the cases that generate program interrupts
(illegal instruction or TM Bad Thing) and facility unavailable
interrupts.
The emulation is reasonably straightforward and is mostly concerned
with checking for exception conditions and updating the state of
registers such as MSR and CR0. The treclaim emulation takes care to
ensure that the TEXASR register gets updated as if it were the guest
treclaim instruction that had done failure recording, not the treclaim
done in hypervisor state in the guest exit path.
With this, the KVM_CAP_PPC_HTM capability returns true (1) even if
transactional memory is not available to host userspace.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
POWER9 processors up to and including "Nimbus" v2.2 have hardware
bugs relating to transactional memory and thread reconfiguration.
One of these bugs has a workaround which is to get the core into
SMT4 state temporarily. This workaround is only needed when
running bare-metal.
This patch provides a function which gets the core into SMT4 mode
by preventing threads from going to a stop state, and waking up
those which are already in a stop state. Once at least 3 threads
are not in a stop state, the core will be in SMT4 and we can
continue.
To do this, we add a "dont_stop" flag to the paca to tell the
thread not to go into a stop state. If this flag is set,
power9_idle_stop() just returns immediately with a return value
of 0. The pnv_power9_force_smt4_catch() function does the following:
1. Set the dont_stop flag for each thread in the core, except
ourselves (in fact we use an atomic_inc() in case more than
one thread is calling this function concurrently).
2. See how many threads are awake, indicated by their
requested_psscr field in the paca being 0. If this is at
least 3, skip to step 5.
3. Send a doorbell interrupt to each thread that was seen as
being in a stop state in step 2.
4. Until at least 3 threads are awake, scan the threads to which
we sent a doorbell interrupt and check if they are awake now.
This relies on the following properties:
- Once dont_stop is non-zero, requested_psscr can't go from zero to
non-zero, except transiently (and without the thread doing stop).
- requested_psscr being zero guarantees that the thread isn't in
a state-losing stop state where thread reconfiguration could occur.
- Doing stop with a PSSCR value of 0 won't be a state-losing stop
and thus won't allow thread reconfiguration.
- Once threads_per_core/2 + 1 (i.e. 3) threads are awake, the core
must be in SMT4 mode, since SMT modes are powers of 2.
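The steps and the threshold above can be modelled with a single-threaded sketch. This is an illustration only: `struct thread_sketch` and the function names are hypothetical, the real code uses atomics and per-thread pacas, and the doorbell here wakes its target immediately, so step 4 needs no polling loop.

```c
#include <assert.h>

#define THREADS_PER_CORE_SKETCH 4

struct thread_sketch {
    unsigned long requested_psscr; /* 0 means the thread is awake */
    int dont_stop;                 /* non-zero: must not enter a stop state */
};

/* Stand-in for a doorbell interrupt: the target wakes at once. */
static void send_doorbell_sketch(struct thread_sketch *t)
{
    t->requested_psscr = 0;
}

static int count_awake_sketch(const struct thread_sketch *core)
{
    int awake = 0;

    for (int i = 0; i < THREADS_PER_CORE_SKETCH; i++)
        if (core[i].requested_psscr == 0)
            awake++;
    return awake;
}

/* Returns the number of awake threads once SMT4 is guaranteed. */
static int force_smt4_catch_sketch(struct thread_sketch *core, int self)
{
    const int need = THREADS_PER_CORE_SKETCH / 2 + 1; /* 3 of 4 */

    assert(self >= 0 && self < THREADS_PER_CORE_SKETCH);

    /* Step 1: tell every other thread not to stop. */
    for (int i = 0; i < THREADS_PER_CORE_SKETCH; i++)
        if (i != self)
            core[i].dont_stop++;

    /* Step 2: enough threads awake already? */
    if (count_awake_sketch(core) >= need)
        return count_awake_sketch(core);

    /* Step 3: wake the threads that were seen in a stop state. */
    for (int i = 0; i < THREADS_PER_CORE_SKETCH; i++)
        if (i != self && core[i].requested_psscr != 0)
            send_doorbell_sketch(&core[i]);

    /* Step 4: real code polls here; in this model the wake is immediate. */
    assert(count_awake_sketch(core) >= need);
    return count_awake_sketch(core);
}

/* Undo step 1 so the other threads may stop again. */
static void force_smt4_release_sketch(struct thread_sketch *core, int self)
{
    for (int i = 0; i < THREADS_PER_CORE_SKETCH; i++)
        if (i != self)
            core[i].dont_stop--;
}
```

Using a counter rather than a boolean for dont_stop mirrors the remark in step 1 that several threads may call the function concurrently.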
This does add a sync to power9_idle_stop(), which is necessary to
provide the correct ordering between setting requested_psscr and
checking dont_stop. The overhead of the sync should be unnoticeable
compared to the latency of going into and out of a stop state.
Because some objected to incurring this extra latency on systems where
the XER[SO] bug is not relevant, I have put the test in
power9_idle_stop inside a feature section. This means that
pnv_power9_force_smt4_catch() WILL NOT WORK correctly on systems
without the CPU_FTR_P9_TM_XER_SO_BUG feature bit set, and will
probably hang the system.
In order to cater for uses where the caller has an operation that
has to be done while the core is in SMT4, the core continues to be
kept in SMT4 after pnv_power9_force_smt4_catch() function returns,
until the pnv_power9_force_smt4_release() function is called.
It undoes the effect of step 1 above and allows the other threads
to go into a stop state.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a CPU feature bit which is set for POWER9 "Nimbus" DD2.2
processors which will be used to enable the hypervisor to assist
hardware with the handling of checkpointed register values while the
CPU is in suspend state, in order to work around hardware bugs. The
hardware assistance for these workarounds introduced a new hardware
bug relating to the XER[SO] bit. We add a separate feature bit for
this bug in case future chips fix it while still requiring the
hypervisor assistance with suspend state.
When the dt_cpu_ftrs subsystem is in use, the software assistance can
be enabled using a "tm-suspend-hypervisor-assist" node in the device
tree, and a "tm-suspend-xer-so-bug" node enables the workarounds for
the XER[SO] bug. In the absence of such nodes, a quirk enables both
for POWER9 "Nimbus" DD2.2 processors.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This moves all the CPU feature bits that are only used on 32-bit
machines to the top 20 bits of the CPU feature word and arranges
for them to be defined only in 32-bit builds. The features that
are common to 32-bit and 64-bit machines are moved to bits 0-11
of the CPU feature word. This means that for 64-bit platforms,
bits 44-63 can now be used for new features that only exist on
64-bit machines. (These bit numbers are counting from the right,
i.e. the LSB is bit 0.)
Because CPU_FTR_L3_DISABLE_NAP moved from the low 16 bits to the high
16 bits, we have to adjust some assembly code. Also, CPU_FTR_EMB_HV
moved from the high 16 bits to the low 16 bits.
Note that CPU_FTR_REAL_LE only applies to 64-bit chips, because only
64-bit chips (POWER6, 7, 8, 9) have a true little-endian mode that is
a CPU execution mode as opposed to being a page attribute.
With this we now have 20 free CPU feature bits on 64-bit machines.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
All PowerPC CPUs other than the original PPC601 have a timebase
register rather than the "real-time clock" (RTC) register that the
PPC601 (and the original POWER and POWER2 CPUs) had. Currently
we have a CPU feature bit to indicate the presence of the timebase,
but it makes more sense to use a bit to indicate the unusual
situation rather than the common situation. This therefore defines
a CPU_FTR_USE_RTC bit in place of the CPU_FTR_USE_TB bit, and
arranges for it to be set on PPC601 systems.
Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On POWER9, under some circumstances, a broadcast TLB invalidation
might complete before all previous stores have drained, potentially
allowing stale stores to become visible after the invalidation.
This works around it by doubling up those TLB invalidations, which was
verified by HW to be sufficient to close the risk window.
This will be documented in a yet-to-be-published erratum.
Fixes: 1a472c9dba ("powerpc/mm/radix: Add tlbflush routines")
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
[mpe: Enable the feature in the DT CPU features code for all Power9,
rename the feature to CPU_FTR_P9_TLBIE_BUG per benh.]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
force_external_irq_replay() can be called in the do_IRQ path with
interrupts hard enabled and soft disabled if may_hard_irq_enable() set
MSR[EE]=1. It updates local_paca->irq_happened with a load, modify,
store sequence. If a maskable interrupt hits during this sequence, it
will go to the masked handler to be marked pending in irq_happened.
This update will be lost when the interrupt returns and the store
instruction executes. This can result in unpredictable latencies,
timeouts, lockups, etc.
Fix this by ensuring hard interrupts are disabled before modifying
irq_happened.
This could cause any maskable asynchronous interrupt to get lost, but
it was noticed on a P9 SMP system doing RDMA NVMe target over 100GbE,
so very high external interrupt rate and high IPI rate. The hang was
bisected down to enabling doorbell interrupts for IPIs. These provided
an interrupt type that could run at high rates in the do_IRQ path,
stressing the race.
Fixes: 1d607bb3bd ("powerpc/irq: Add mechanism to force a replay of interrupts")
Cc: stable@vger.kernel.org # v4.8+
Reported-by: Carol L. Soto <clsoto@us.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It's slightly less error prone to use sizeof(*foo) rather than
specifying the type.
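A small user-space illustration of the idiom, with a hypothetical structure and `malloc()` standing in for the kernel allocators:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct desc_sketch {
    int id;
    char name[64];
};

static struct desc_sketch *alloc_desc_sketch(int id)
{
    struct desc_sketch *d;

    assert(id >= 0);

    /*
     * sizeof(*d) follows the declared type of d, so the size stays
     * correct if that declaration is later changed to another type.
     */
    d = malloc(sizeof(*d));
    if (!d)
        return NULL;
    memset(d, 0, sizeof(*d));
    d->id = id;
    return d;
}
```

Had the size been written as sizeof(struct some_other_type), a later change to the type of d could leave the allocation too small without any compiler warning.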
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
[mpe: Consolidate into one patch, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The flush_dcache_phys_range() function is no longer used in the
kernel. The last usage was removed in c40785ad30 ("powerpc/dart: Use
a cachable DART").
This patch removes the function and declaration.
Signed-off-by: Matt Brown <matthew.brown.dev@gmail.com>
[mpe: Munge change log, include commit that removed last user]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A protection flag may not be valid across entire address space and
hence arch_validate_prot() might need the address a protection bit is
being set on to ensure it is a valid protection flag. For example, sparc
processors support memory corruption detection (as part of ADI feature)
flag on memory addresses mapped on to physical RAM but not on PFN mapped
pages or addresses mapped on to devices. This patch adds address to the
parameters being passed to arch_validate_prot() so protection bits can
be validated in the relevant context.
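To illustrate why the address matters, here is a sketch of an address-aware validation hook. Everything in it is hypothetical: the flag `SK_PROT_TAGGED`, the `struct ram_range_sketch` description of RAM-backed addresses, and the function name are placeholders rather than any architecture's real definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SK_PROT_READ   0x1UL
#define SK_PROT_WRITE  0x2UL
#define SK_PROT_EXEC   0x4UL
#define SK_PROT_TAGGED 0x10UL  /* hypothetical arch-specific flag */

/* Hypothetical description of the addresses backed by ordinary RAM. */
struct ram_range_sketch {
    unsigned long start; /* inclusive */
    unsigned long end;   /* exclusive */
};

static bool validate_prot_sketch(unsigned long prot, unsigned long addr,
                                 const struct ram_range_sketch *ram)
{
    const unsigned long generic = SK_PROT_READ | SK_PROT_WRITE | SK_PROT_EXEC;

    assert(ram != NULL && ram->start < ram->end);

    if (prot & ~(generic | SK_PROT_TAGGED))
        return false;                 /* unknown bits are never valid */
    if (prot & SK_PROT_TAGGED)        /* valid only on RAM-backed addresses */
        return addr >= ram->start && addr < ram->end;
    return true;
}
```

Without the address argument, the second check could not be expressed at all, which is the gap this patch closes.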
Signed-off-by: Khalid Aziz <khalid.aziz@oracle.com>
Cc: Khalid Aziz <khalid@gonehiking.org>
Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
The RTC core is always calling rtc_valid_tm after the read_time callback.
It is not necessary to call it just before returning from the callback.
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When running virtualised the powerpc kernel is able to run the system
in "compat mode" - which means the kernel and hardware are pretending
to userspace that the CPU is an older version than it actually is.
AT_BASE_PLATFORM is an AUXV entry that we export to userspace for use
when we're running in that mode, which tells userspace the "platform"
string for the real CPU version, as opposed to the faked version.
Although we don't support compat mode when using DT CPU features, and
arguably don't need to set AT_BASE_PLATFORM, the existing cputable
based code always sets it even when we're running bare metal. That
means the lack of AT_BASE_PLATFORM is a user-visible artifact of the
fact that the kernel is using DT CPU features, which we don't want.
So set it in the DT CPU features code also.
This results in eg:
$ LD_SHOW_AUXV=1 /bin/true | grep "AT_.*PLATFORM"
AT_PLATFORM: power9
AT_BASE_PLATFORM:power9
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
early_init() and machine_init() have no prototype, add one in
asm-prototypes.h.
Fixes the following warnings (treated as error in W=1):
arch/powerpc/kernel/setup_32.c:68:30: error: no previous prototype for ‘early_init’
arch/powerpc/kernel/setup_32.c:99:21: error: no previous prototype for ‘machine_init’
Signed-off-by: Mathieu Malaterre <malat@debian.org>
[mpe: Move them to asm-prototypes.h, drop other functions]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
These functions can all be static, make it so.
Signed-off-by: Mathieu Malaterre <malat@debian.org>
[mpe: Combine a patch of Mathieu's with some other static conversions]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
When neither CONFIG_ALTIVEC, nor CONFIG_VSX or CONFIG_PPC64 is
defined, the array feature_properties is defined as an empty array,
which in turn triggers the following warning (treated as error on
W=1):
arch/powerpc/kernel/prom.c: In function ‘check_cpu_feature_properties’:
arch/powerpc/kernel/prom.c:298:16: error: comparison of unsigned expression < 0 is always false
for (i = 0; i < ARRAY_SIZE(feature_properties); ++i, ++fp) {
^
Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Two functions did not have a prototype defined in signal.h header. Fix
the following two warnings (treated as errors in W=1):
arch/powerpc/kernel/signal_32.c:1135:6: error: no previous prototype for ‘sys_rt_sigreturn’
arch/powerpc/kernel/signal_32.c:1422:6: error: no previous prototype for ‘sys_sigreturn’
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
__giveup_fpu() is never called outside process.c, so it can be static.
That also means we don't need an empty definition in switch_to.h
Signed-off-by: Mathieu Malaterre <malat@debian.org>
[mpe: Also drop the empty version, rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Since the value of `tmp` is never intended to be read, declare both `tmp`
variables as unused. Fix warning (treated as error in W=1):
arch/powerpc/kernel/signal_32.c: In function ‘sys_swapcontext’:
arch/powerpc/kernel/signal_32.c:1048:16: error: variable ‘tmp’ set but not used
arch/powerpc/kernel/signal_32.c: In function ‘sys_debug_setcontext’:
arch/powerpc/kernel/signal_32.c:1234:16: error: variable ‘tmp’ set but not used
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
With ibm,dynamic-memory-v2 and ibm,drc-info coming around the same
time, byte22 in vector5 of ibm architecture vector table got set twice
separately. The end result is that guest kernel isn't advertising
support for ibm,dynamic-memory-v2.
Fix this by removing the duplicate assignment of byte22.
Fixes: 02ef6dd810 ("powerpc: Enable support for ibm,drc-info devtree property")
Signed-off-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
While the implementation of the "slices" address space allows
a significant amount of high slices, it limits the number of
low slices to 16 due to the use of a single u64 low_slices_psize
element in struct mm_context_t.
On the 8xx, the minimum slice size is the size of the area
covered by a single PMD entry, ie 4M in 4K pages mode and 64M in
16K pages mode. This means we could have at least 64 slices.
In order to override this limitation, this patch switches the
handling of low_slices_psize to char array as done already for
high_slices_psize.
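The difference between the two representations can be sketched as follows, assuming 4 bits of page-size index per slice (which is what limits a single u64 to 16 slices). The array name, the slice count and the helpers are hypothetical, not the kernel's definitions.

```c
#include <assert.h>

#define NUM_LOW_SLICES_SKETCH 64  /* hypothetical count for this sketch */

/* Two 4-bit page-size indices per byte, so 64 slices need 32 bytes. */
static unsigned char low_slices_psize_sketch[NUM_LOW_SLICES_SKETCH / 2];

static unsigned int get_slice_psize_sketch(const unsigned char *arr,
                                           unsigned int slice)
{
    unsigned int shift = (slice & 1) * 4;

    assert(slice < NUM_LOW_SLICES_SKETCH);
    return (arr[slice >> 1] >> shift) & 0xf;
}

static void set_slice_psize_sketch(unsigned char *arr, unsigned int slice,
                                   unsigned int psize)
{
    unsigned int shift = (slice & 1) * 4;
    unsigned int byte;

    assert(slice < NUM_LOW_SLICES_SKETCH);
    assert(psize <= 0xf);

    byte = arr[slice >> 1];
    byte &= ~(0xfU << shift);      /* clear the old nibble */
    byte |= psize << shift;        /* install the new one */
    arr[slice >> 1] = (unsigned char)byte;
}
```

With an array, the number of slices is bounded only by the array length, whereas 16 nibbles is all a 64-bit word can hold.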
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
On the 8xx, the page size is set in the PMD entry and applies to
all pages of the page table pointed by the said PMD entry.
When an app has some regular pages allocated (e.g. see below) and tries
to mmap() a huge page at a hint address covered by the same PMD entry,
the kernel accepts the hint although the 8xx cannot handle different
page sizes in the same PMD entry.
10000000-10001000 r-xp 00000000 00:0f 2597 /root/malloc
10010000-10011000 rwxp 00000000 00:0f 2597 /root/malloc
mmap(0x10080000, 524288, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|0x40000, -1, 0) = 0x10080000
This results in the app remaining forever in do_page_fault()/hugetlb_fault()
and when interrupting that app, we get the following warning:
[162980.035629] WARNING: CPU: 0 PID: 2777 at arch/powerpc/mm/hugetlbpage.c:354 hugetlb_free_pgd_range+0xc8/0x1e4
[162980.035699] CPU: 0 PID: 2777 Comm: malloc Tainted: G W 4.14.6 #85
[162980.035744] task: c67e2c00 task.stack: c668e000
[162980.035783] NIP: c000fe18 LR: c00e1eec CTR: c00f90c0
[162980.035830] REGS: c668fc20 TRAP: 0700 Tainted: G W (4.14.6)
[162980.035854] MSR: 00029032 <EE,ME,IR,DR,RI> CR: 24044224 XER: 20000000
[162980.036003]
[162980.036003] GPR00: c00e1eec c668fcd0 c67e2c00 00000010 c6869410 10080000 00000000 77fb4000
[162980.036003] GPR08: ffff0001 0683c001 00000000 ffffff80 44028228 10018a34 00004008 418004fc
[162980.036003] GPR16: c668e000 00040100 c668e000 c06c0000 c668fe78 c668e000 c6835ba0 c668fd48
[162980.036003] GPR24: 00000000 73ffffff 74000000 00000001 77fb4000 100fffff 10100000 10100000
[162980.036743] NIP [c000fe18] hugetlb_free_pgd_range+0xc8/0x1e4
[162980.036839] LR [c00e1eec] free_pgtables+0x12c/0x150
[162980.036861] Call Trace:
[162980.036939] [c668fcd0] [c00f0774] unlink_anon_vmas+0x1c4/0x214 (unreliable)
[162980.037040] [c668fd10] [c00e1eec] free_pgtables+0x12c/0x150
[162980.037118] [c668fd40] [c00eabac] exit_mmap+0xe8/0x1b4
[162980.037210] [c668fda0] [c0019710] mmput.part.9+0x20/0xd8
[162980.037301] [c668fdb0] [c001ecb0] do_exit+0x1f0/0x93c
[162980.037386] [c668fe00] [c001f478] do_group_exit+0x40/0xcc
[162980.037479] [c668fe10] [c002a76c] get_signal+0x47c/0x614
[162980.037570] [c668fe70] [c0007840] do_signal+0x54/0x244
[162980.037654] [c668ff30] [c0007ae8] do_notify_resume+0x34/0x88
[162980.037744] [c668ff40] [c000dae8] do_user_signal+0x74/0xc4
[162980.037781] Instruction dump:
[162980.037821] 7fdff378 81370000 54a3463a 80890020 7d24182e 7c841a14 712a0004 4082ff94
[162980.038014] 2f890000 419e0010 712a0ff0 408200e0 <0fe00000> 54a9000a 7f984840 419d0094
[162980.038216] ---[ end trace c0ceeca8e7a5800a ]---
[162980.038754] BUG: non-zero nr_ptes on freeing mm: 1
[162985.363322] BUG: non-zero nr_ptes on freeing mm: -1
In order to fix this, this patch uses the address space "slices"
implemented for BOOK3S/64 and enhanced to support PPC32 by the
preceding patch.
This patch modifies the context.id on the 8xx to be in the range
[1:16] instead of [0:15] in order to identify context.id == 0 as
not initialised contexts, as done on BOOK3S.
This patch activates CONFIG_PPC_MM_SLICES when CONFIG_HUGETLB_PAGE is
selected for the 8xx.
Although we could in theory have as many slices as PMD entries, the
current slices implementation limits the number of low slices to 16.
This limitation does not prevent us from fixing the initial issue, although
it is suboptimal. It will be cured in a subsequent patch.
Fixes: 4b91428699 ("powerpc/8xx: Implement support of hugepages")
Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit f719582435 ("PCI: Add pci_mmap_resource_range() and use it for
ARM64") added this generic function with the intent of using it everywhere
and ultimately killing the old arch-specific implementations.
Remove the powerpc-specific pci_mmap_page_range() and use the generic
pci_mmap_resource_range() instead.
Powerpc can mmap I/O port space, so supply the powerpc-specific
pci_iobar_pfn() required to make that work.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <helgaas@kernel.org>
This reverts commit 02ef6dd810.
The earlier patch tried to enable support for a new property
"ibm,drc-info" on powerpc systems.
Unfortunately, some errors in the associated patch set break things
in some of the DLPAR operations. In particular when attempting to
hot-add a new CPU or set of CPUs, the original patch failed to
properly calculate the available resources, and aborted the operation.
In addition, the original set missed several opportunities to compress
and reuse common code.
As the associated patch set was meant to provide an optimization of
storage and performance of a set of device-tree properties for future
systems with large amounts of resources, reverting just restores
the previous behavior for existing systems. It seems unnecessary
to enable this feature and introduce the consequent problems in the
field that it will cause at this time, so please revert it for now
until testing of the corrections is finished properly.
Fixes: 02ef6dd810 ("powerpc: Enable support for ibm,drc-info devtree property")
Signed-off-by: Michael W. Bringmann <mwb@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The notify_resume() callback in eeh_ops is NULL on powernv, leading to
crashes:
NIP (null)
LR eeh_report_resume+0x218/0x220
Call Trace:
eeh_report_resume+0x1f0/0x220 (unreliable)
eeh_pe_dev_traverse+0x98/0x170
eeh_handle_normal_event+0x3f4/0x650
eeh_handle_event+0x54/0x380
eeh_event_handler+0x14c/0x210
kthread+0x168/0x1b0
ret_from_kernel_thread+0x5c/0xb4
Fix it by adding a check before calling it.
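The shape of such a check can be sketched as follows. The types and names (struct ops_sketch, struct dev_sketch, report_resume(), mark_resumed()) are hypothetical stand-ins for illustration only, not the kernel's eeh_ops or eeh_report_resume() code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device being resumed. */
struct dev_sketch {
	int resumed;
};

/* Hypothetical ops table; a platform may leave the callback NULL. */
struct ops_sketch {
	int (*notify_resume)(struct dev_sketch *dev);
};

/*
 * Invoke the optional callback only when the platform provides it.
 * Returns 1 if the callback ran, 0 if it was skipped.
 */
int report_resume(const struct ops_sketch *ops, struct dev_sketch *dev)
{
	if (ops && ops->notify_resume) {
		ops->notify_resume(dev);
		return 1;
	}
	return 0;
}

/* Example callback used by a platform that does implement the hook. */
int mark_resumed(struct dev_sketch *dev)
{
	dev->resumed = 1;
	return 0;
}
```

Calling through a NULL function pointer is what produces the "NIP (null)" seen in the trace above; guarding the call makes the hook genuinely optional.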
Fixes: 856e1eb9bd ("PCI/AER: Add uevents in AER and EEH error/resume")
Signed-off-by: Juan J. Alvarez <jjalvare@linux.vnet.ibm.com>
Reviewed-by: Bryant G. Ly <bryantly@linux.vnet.ibm.com>
Tested-by: Carol L. Soto <clsoto@us.ibm.com>
Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Tested-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com>
Acked-by: Michael Neuling <mikey@neuling.org>
[mpe: Rewrite change log]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The TSCR can only be accessed in hypervisor mode.
Fixes: 88b5e12eeb11 ("powerpc: Expose TSCR via sysfs")
Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
A larger batch of fixes than we'd like. Roughly 1/3 fixes for new code, 1/3
fixes for stable and 1/3 minor things.
There's four commits fixing bugs when using 16GB huge pages on hash, caused by
some of the preparatory changes for pkeys.
Two fixes for bugs in the enhanced IRQ soft masking for local_t, one of which
broke KVM in some circumstances.
Four fixes for Power9. The most bizarre being a bug where futexes stopped
working because a NULL pointer dereference didn't trap during early boot (it
aliased the kernel mapping). A fix for memory hotplug when using the Radix MMU,
and a fix for live migration of guests using the Radix MMU.
Two fixes for hotplug on pseries machines. One where we weren't correctly
updating NUMA info when CPUs are added and removed. And the other fixes
crashes/hangs seen when doing memory hot remove during boot, which is apparently
a thing people do.
Finally a handful of build fixes for obscure configs and other minor fixes.
Thanks to:
Alexey Kardashevskiy, Aneesh Kumar K.V, Balbir Singh, Colin Ian King, Daniel
Henrique Barboza, Florian Weimer, Guenter Roeck, Harish, Laurent Vivier,
Madhavan Srinivasan, Mauricio Faria de Oliveira, Nathan Fontenot, Nicholas
Piggin, Sam Bobroff.
Merge tag 'powerpc-4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A larger batch of fixes than we'd like. Roughly 1/3 fixes for new
code, 1/3 fixes for stable and 1/3 minor things.
There's four commits fixing bugs when using 16GB huge pages on hash,
caused by some of the preparatory changes for pkeys.
Two fixes for bugs in the enhanced IRQ soft masking for local_t, one
of which broke KVM in some circumstances.
Four fixes for Power9. The most bizarre being a bug where futexes
stopped working because a NULL pointer dereference didn't trap during
early boot (it aliased the kernel mapping). A fix for memory hotplug
when using the Radix MMU, and a fix for live migration of guests using
the Radix MMU.
Two fixes for hotplug on pseries machines. One where we weren't
correctly updating NUMA info when CPUs are added and removed. And the
other fixes crashes/hangs seen when doing memory hot remove during
boot, which is apparently a thing people do.
Finally a handful of build fixes for obscure configs and other minor
fixes.
Thanks to: Alexey Kardashevskiy, Aneesh Kumar K.V, Balbir Singh, Colin
Ian King, Daniel Henrique Barboza, Florian Weimer, Guenter Roeck,
Harish, Laurent Vivier, Madhavan Srinivasan, Mauricio Faria de
Oliveira, Nathan Fontenot, Nicholas Piggin, Sam Bobroff"
* tag 'powerpc-4.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
selftests/powerpc: Fix to use ucontext_t instead of struct ucontext
powerpc/kdump: Fix powernv build break when KEXEC_CORE=n
powerpc/pseries: Fix build break for SPLPAR=n and CPU hotplug
powerpc/mm/hash64: Zero PGD pages on allocation
powerpc/mm/hash64: Store the slot information at the right offset for hugetlb
powerpc/mm/hash64: Allocate larger PMD table if hugetlb config is enabled
powerpc/mm: Fix crashes with 16G huge pages
powerpc/mm: Flush radix process translations when setting MMU type
powerpc/vas: Don't set uses_vas for kernel windows
powerpc/pseries: Enable RAS hotplug events later
powerpc/mm/radix: Split linear mapping on hot-unplug
powerpc/64s/radix: Boot-time NULL pointer protection using a guard-PID
ocxl: fix signed comparison with less than zero
powerpc/64s: Fix may_hard_irq_enable() for PMI soft masking
powerpc/64s: Fix MASKABLE_RELON_EXCEPTION_HV_OOL macro
powerpc/numa: Invalidate numa_cpu_lookup_table on cpu remove
This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:
for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
done
with de-mangling cleanups yet to come.
NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do. But the keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.
The next patch from Al will sort out the final differences, and we
should be all done.
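One way to see whether the two constant families agree on a given machine is to compare them from userspace. This is only a sketch: it assumes a Linux libc that provides <poll.h> and <sys/epoll.h>, it covers only a subset of the flags handled by the script, and the helper name count_poll_epoll_mismatches() is made up for illustration. It inspects the userspace headers, which are expected to mirror the kernel ABI.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* make sure the RD/WR NORM/BAND poll flags are visible */
#endif
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/epoll.h>

struct flag_pair {
	const char *name;
	unsigned int poll_bit;
	unsigned int epoll_bit;
};

/* Subset of the POLL and EPOLL pairs touched by the scripted replacement. */
static const struct flag_pair flag_pairs[] = {
	{ "IN",     POLLIN,     EPOLLIN     },
	{ "OUT",    POLLOUT,    EPOLLOUT    },
	{ "PRI",    POLLPRI,    EPOLLPRI    },
	{ "ERR",    POLLERR,    EPOLLERR    },
	{ "HUP",    POLLHUP,    EPOLLHUP    },
	{ "RDNORM", POLLRDNORM, EPOLLRDNORM },
	{ "RDBAND", POLLRDBAND, EPOLLRDBAND },
	{ "WRNORM", POLLWRNORM, EPOLLWRNORM },
	{ "WRBAND", POLLWRBAND, EPOLLWRBAND },
};

/* Count how many of the listed pairs have different numeric values. */
int count_poll_epoll_mismatches(void)
{
	int mismatches = 0;

	for (size_t i = 0; i < sizeof(flag_pairs) / sizeof(flag_pairs[0]); i++) {
		if (flag_pairs[i].poll_bit != flag_pairs[i].epoll_bit)
			mismatches++;
	}
	return mismatches;
}
```

A non-zero result on some architecture would be exactly the "almost" case: code that mixes the two families would then test the wrong bit.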
Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ARM:
- Include icache invalidation optimizations, improving VM startup time
- Support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- A small fix for power-management notifiers, and some cosmetic changes
PPC:
- Add MMIO emulation for vector loads and stores
- Allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- Improve the handling of escalation interrupts with the XIVE interrupt
controller
- Support decrement register migration
- Various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- Exitless interrupts for emulated devices
- Cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- Hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- Paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- Allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more AVX512
features
- Show vcpu id in its anonymous inode name
- Many fixes and cleanups
- Per-VCPU MSR bitmaps (already merged through x86/pti branch)
- Stable KVM clock when nesting on Hyper-V (merged through x86/hyperv)
Merge tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Radim Krčmář:
"ARM:
- icache invalidation optimizations, improving VM startup time
- support for forwarded level-triggered interrupts, improving
performance for timers and passthrough platform devices
- a small fix for power-management notifiers, and some cosmetic
changes
PPC:
- add MMIO emulation for vector loads and stores
- allow HPT guests to run on a radix host on POWER9 v2.2 CPUs without
requiring the complex thread synchronization of older CPU versions
- improve the handling of escalation interrupts with the XIVE
interrupt controller
- support decrement register migration
- various cleanups and bugfixes.
s390:
- Cornelia Huck passed maintainership to Janosch Frank
- exitless interrupts for emulated devices
- cleanup of cpuflag handling
- kvm_stat counter improvements
- VSIE improvements
- mm cleanup
x86:
- hypervisor part of SEV
- UMIP, RDPID, and MSR_SMI_COUNT emulation
- paravirtualized TLB shootdown using the new KVM_VCPU_PREEMPTED bit
- allow guests to see TOPOEXT, GFNI, VAES, VPCLMULQDQ, and more
AVX512 features
- show vcpu id in its anonymous inode name
- many fixes and cleanups
- per-VCPU MSR bitmaps (already merged through x86/pti branch)
- stable KVM clock when nesting on Hyper-V (merged through
x86/hyperv)"
* tag 'kvm-4.16-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (197 commits)
KVM: PPC: Book3S: Add MMIO emulation for VMX instructions
KVM: PPC: Book3S HV: Branch inside feature section
KVM: PPC: Book3S HV: Make HPT resizing work on POWER9
KVM: PPC: Book3S HV: Fix handling of secondary HPTEG in HPT resizing code
KVM: PPC: Book3S PR: Fix broken select due to misspelling
KVM: x86: don't forget vcpu_put() in kvm_arch_vcpu_ioctl_set_sregs()
KVM: PPC: Book3S PR: Fix svcpu copying with preemption enabled
KVM: PPC: Book3S HV: Drop locks before reading guest memory
kvm: x86: remove efer_reload entry in kvm_vcpu_stat
KVM: x86: AMD Processor Topology Information
x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO when running nested
kvm: embed vcpu id to dentry of vcpu anon inode
kvm: Map PFN-type memory regions as writable (if possible)
x86/kvm: Make it compile on 32bit and with HYPYERVISOR_GUEST=n
KVM: arm/arm64: Fixup userspace irqchip static key optimization
KVM: arm/arm64: Fix userspace_irqchip_in_use counting
KVM: arm/arm64: Fix incorrect timer_is_pending logic
MAINTAINERS: update KVM/s390 maintainers
MAINTAINERS: add Halil as additional vfio-ccw maintainer
MAINTAINERS: add David as a reviewer for KVM/s390
...
59f47eff03 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
replaced of_irq_parse_pci() + irq_create_of_mapping() with
of_irq_parse_and_map_pci(), but neglected to capture the virq
returned by irq_create_of_mapping(), so virq remained zero, which
caused INTx configuration to fail.
Save the virq value returned by of_irq_parse_and_map_pci() and correct
the virq declaration to match the of_irq_parse_and_map_pci() signature.
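The difference between the broken and the fixed pattern can be sketched as below. parse_and_map_sketch() is a hypothetical stand-in for of_irq_parse_and_map_pci() (it is not the kernel function and its return values are invented); virq is declared as int on the assumption that the real helper returns int, with 0 meaning that no mapping was created.

```c
#include <assert.h>

/*
 * Hypothetical stand-in for the combined parse-and-map helper:
 * returns a non-zero virtual IRQ number on success, 0 on failure.
 */
int parse_and_map_sketch(int pin)
{
	return pin > 0 ? 16 + pin : 0;
}

/* Broken shape: the helper is called but its result is thrown away. */
int read_irq_buggy(int pin)
{
	int virq = 0;

	parse_and_map_sketch(pin);	/* return value discarded */
	return virq;			/* always 0, so INTx setup fails */
}

/* Fixed shape: keep the returned virq. */
int read_irq_fixed(int pin)
{
	int virq = parse_and_map_sketch(pin);

	return virq;
}
```

The buggy variant still compiles cleanly, which is why this kind of regression is easy to miss when two calls are folded into one.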
Fixes: 59f47eff03 ("powerpc/pci: Use of_irq_parse_and_map_pci() helper")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>