- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica Gupta,
Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling, Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
- cxl: Fix possible idr warning when contexts are released from Vaibhav Jain
- cxl: use correct operator when writing pcie config space values from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma Krishnan
- Freescale updates from Scott: Highlights include moving QE code out of
arch/powerpc (to be shared with arm), device tree updates, and minor fixes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWmIxeAAoJEFHr6jzI4aWAA+cQAIXAw4WfVWJ2V4ZK+1eKfB57
fdXG71PuXG+WYIWy71ly8keLHdzzD1NQ2OUB64bUVRq202nRgVc15ZYKRJ/FE/sP
SkxaQ2AG/2kI2EflWshOi0Lu9qaZ+LMHJnszIqE/9lnGSB2kUI/cwsSXgziiMKXR
XNci9v14SdDd40YV/6BSZXoxApwyq9cUbZ7rnzFLmz4hrFuKmB/L3LABDF8QcpH7
sGt/YaHGOtqP0UX7h5KQTFLGe1OPvK6NWixSXeZKQ71ED6cho1iKUEOtBA9EZeIN
QM5JdHFWgX8MMRA0OHAgidkSiqO38BXjmjkVYWoIbYz7Zax3ThmrDHB4IpFwWnk3
l7WBykEXY7KEqpZzbh0GFGehZWzVZvLnNgDdvpmpk/GkPzeYKomBj7ZZfm3H1yGD
BTHPwuWCTX+/K75yEVNO8aJO12wBg7DRl4IEwBgqhwU8ga4FvUOCJkm+SCxA1Dnn
qlpS7qPwTXNIEfKMJcxp5X0KiwDY1EoOotd4glTN0jbeY5GEYcxe+7RQ302GrYxP
zcc8EGLn8h6BtQvV3ypNHF5l6QeTW/0ZlO9c236tIuUQ5gQU39SQci7jQKsYjSzv
BB1XdLHkbtIvYDkmbnr1elbeJCDbrWL9rAXRUTRyfuCzaFWTfZmfVNe8c8qwDMLk
TUxMR/38aI7bLcIQjwj9
=R5bX
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Core:
- Ground work for the new Power9 MMU from Aneesh Kumar K.V
- Optimise FP/VMX/VSX context switching from Anton Blanchard
Misc:
- Various cleanups from Krzysztof Kozlowski, John Ogness, Rashmica
Gupta, Russell Currey, Gavin Shan, Daniel Axtens, Michael Neuling,
Andrew Donnellan
- Allow wrapper to work on non-english system from Laurent Vivier
- Add rN aliases to the pt_regs_offset table from Rashmica Gupta
- Fix module autoload for rackmeter & axonram drivers from Luis de
Bethencourt
- Include KVM guest test in all interrupt vectors from Paul Mackerras
- Fix DSCR inheritance over fork() from Anton Blanchard
- Make value-returning atomics & {cmp}xchg* & their atomic_ versions
fully ordered from Boqun Feng
- Print MSR TM bits in oops messages from Michael Neuling
- Add TM signal return & invalid stack selftests from Michael Neuling
- Limit EPOW reset event warnings from Vipin K Parashar
- Remove the Cell QPACE code from Rashmica Gupta
- Append linux_banner to exception information in xmon from Rashmica
Gupta
- Add selftest to check if VSRs are corrupted from Rashmica Gupta
- Remove broken GregorianDay() from Daniel Axtens
- Import Anton's context_switch2 benchmark into selftests from
Michael Ellerman
- Add selftest script to test HMI functionality from Daniel Axtens
- Remove obsolete OPAL v2 support from Stewart Smith
- Make enter_rtas() private from Michael Ellerman
- PPR exception cleanups from Michael Ellerman
- Add page soft dirty tracking from Laurent Dufour
- Add support for Nvlink NPUs from Alistair Popple
- Add support for kexec on 476fpe from Alistair Popple
- Enable kernel CPU dlpar from sysfs from Nathan Fontenot
- Copy only required pieces of the mm_context_t to the paca from
Michael Neuling
- Add a kmsg_dumper that flushes OPAL console output on panic from
Russell Currey
- Implement save_stack_trace_regs() to enable kprobe stack tracing
from Steven Rostedt
- Add HWCAP bits for Power9 from Michael Ellerman
- Fix _PAGE_PTE breaking swapoff from Aneesh Kumar K.V
- Fix _PAGE_SWP_SOFT_DIRTY breaking swapoff from Hugh Dickins
- scripts/recordmcount.pl: support data in text section on powerpc
from Ulrich Weigand
- Handle R_PPC64_ENTRY relocations in modules from Ulrich Weigand
cxl:
- cxl: Fix possible idr warning when contexts are released from
Vaibhav Jain
- cxl: use correct operator when writing pcie config space values
from Andrew Donnellan
- cxl: Fix DSI misses when the context owning task exits from Vaibhav
Jain
- cxl: fix build for GCC 4.6.x from Brian Norris
- cxl: use -Werror only with CONFIG_PPC_WERROR from Brian Norris
- cxl: Enable PCI device ID for future IBM CXL adapter from Uma
Krishnan
Freescale:
- Freescale updates from Scott: Highlights include moving QE code out
of arch/powerpc (to be shared with arm), device tree updates, and
minor fixes"
* tag 'powerpc-4.5-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (149 commits)
powerpc/module: Handle R_PPC64_ENTRY relocations
scripts/recordmcount.pl: support data in text section on powerpc
powerpc/powernv: Fix OPAL_CONSOLE_FLUSH prototype and usages
powerpc/mm: fix _PAGE_SWP_SOFT_DIRTY breaking swapoff
powerpc/mm: Fix _PAGE_PTE breaking swapoff
cxl: Enable PCI device ID for future IBM CXL adapter
cxl: use -Werror only with CONFIG_PPC_WERROR
cxl: fix build for GCC 4.6.x
powerpc: Add HWCAP bits for Power9
powerpc/powernv: Reserve PE#0 on NPU
powerpc/powernv: Change NPU PE# assignment
powerpc/powernv: Fix update of NVLink DMA mask
powerpc/powernv: Remove misleading comment in pci.c
powerpc: Implement save_stack_trace_regs() to enable kprobe stack tracing
powerpc: Fix build break due to paca mm_context_t changes
cxl: Fix DSI misses when the context owning task exits
MAINTAINERS: Update Scott Wood's e-mail address
powerpc/powernv: Fix minor off-by-one error in opal_mce_check_early_recovery()
powerpc: Fix style of self-test config prompts
powerpc/powernv: Only delay opal_rtc_read() retry when necessary
...
Pull livepatching updates from Jiri Kosina:
- RO/NX attribute fixes for patch module relocations from Josh
Poimboeuf. As part of this effort, module.c has been cleaned up as
well and livepatching is piggy-backing on this cleanup. Rusty is OK
with this whole lot going through livepatching tree.
- symbol disambiguation support from Chris J Arges. That series is
also
Reviewed-by: Miroslav Benes <mbenes@suse.cz>
but this came in only after I've alredy pushed out. Didn't want to
rebase because of that, hence I am mentioning it here.
- symbol lookup fix from Miroslav Benes
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: Cleanup module page permission changes
module: keep percpu symbols in module's symtab
module: clean up RO/NX handling.
module: use a structure to encapsulate layout.
gcov: use within_module() helper.
module: Use the same logic for setting and unsetting RO/NX
livepatch: function,sympos scheme in livepatch sysfs directory
livepatch: add sympos as disambiguator field to klp_reloc
livepatch: add old_sympos as disambiguator field to klp_func
GCC 6 will include changes to generated code with -mcmodel=large,
which is used to build kernel modules on powerpc64le. This was
necessary because the large model is supposed to allow arbitrary
sizes and locations of the code and data sections, but the ELFv2
global entry point prolog still made the unconditional assumption
that the TOC associated with any particular function can be found
within 2 GB of the function entry point:
func:
addis r2,r12,(.TOC.-func)@ha
addi r2,r2,(.TOC.-func)@l
.localentry func, .-func
To remove this assumption, GCC will now generate instead this global
entry point prolog sequence when using -mcmodel=large:
.quad .TOC.-func
func:
.reloc ., R_PPC64_ENTRY
ld r2, -8(r12)
add r2, r2, r12
.localentry func, .-func
The new .reloc triggers an optimization in the linker that will
replace this new prolog with the original code (see above) if the
linker determines that the distance between .TOC. and func is in
range after all.
Since this new relocation is now present in module object files,
the kernel module loader is required to handle them too. This
patch adds support for the new relocation and implements the
same optimization done by the GNU linker.
Cc: stable@vger.kernel.org
Signed-off-by: Ulrich Weigand <ulrich.weigand@de.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It has come to my attention that kprobe event stack tracing does not
work on powerpc. You can see with the following:
# cd /sys/kernel/debug/tracing
# echo stacktrace > trace_options
# echo 'p kfree' > kprobe_events
# echo 1 > events/kprobes/enable
Will print the following warning:
save_stack_trace_regs() not implemented yet.
Although save_stack_trace() (which normal event stack traces use) is
implemented, save_stack_trace_regs() which kprobe events use is not.
This is a cheap attempt to implement that function.
Note, This may have issues if a task tries to get a stack trace from
another task with its regs, because it just passes in "current" to
save_context_stack(). But this does solve the issue with stack tracing
kprobe events.
Reported-by: Chunyu Hu <chuhu@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently we copy the whole mm_context_t to the paca but only access a
few bits of it. This is wasteful of space paca and also takes quite
some time in the hot path of context switching.
This patch pulls in only the required bits from the mm_context_t to
the paca and on context switch, copies only those.
Benchmarking this (On top of Anton's recent MSR context switching
changes [1]) using processes and yield shows an improvement of almost
3% on POWER8:
http://ozlabs.org/~anton/junkcode/context_switch2.c
./context_switch2 --test=yield --process 0 0
1. https://lists.ozlabs.org/pipermail/linuxppc-dev/2015-October/135700.html
Signed-off-by: Michael Neuling <mikey@neuling.org>
[mpe: Rename paca fields to be mm_ctx_foo rather than context_foo]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This adds a function to copy the mm->context to the paca. This is
only a basic conversion for now but will be used more extensively in
the next patch.
This also adds #ifdef CONFIG_PPC_BOOK3S around this code since it's
not used elsewhere.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
cppcheck picked up that there were a couple of missing va_end()
calls in functions using va_start().
Signed-off-by: Daniel Axtens <dja@axtens.net>
Reviewed-by: Russell Currey <ruscur@russell.cc>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
PPC476FPE has a different PVR from previous PPC476 processors. The
kexec code checks the PVR in order to correctly setup the MMU. When
the initial support for 476FPE processors was added the corresponding
change in the kexec code was missed. This patch simply adds the check
and solves the following bug on kexec:
kexec: Starting new kernel
Bye!
Unable to handle kernel paging request for instruction fetch
Faulting instruction address: 0xee9a50f8
cpu 0x0: Vector: 400 (Instruction Access) at [ee9d7d20]
pc: ee9a50f8
lr: ee9a50e4
sp: ee9d7dd0
msr: 21020
current = 0xee40f000
pid = 960, comm = kexec
enter ? for help
[link register ] ee9a50e4
[ee9d7dd0] c0013748 default_machine_kexec+0x58/0x70 (unreliable)
[ee9d7df0] c0012f04 machine_kexec+0x34/0x40
[ee9d7e00] c00aa1ec kernel_kexec+0x9c/0xb0
[ee9d7e20] c005d704 SyS_reboot+0x1f4/0x220
[ee9d7f40] c000db68 ret_from_syscall+0x0/0x3c
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The STD_EXCEPTION_PSERIES macro takes both a vector number, and a
location (memory address). However both are always identical, so combine
them to save repeating ourselves.
This does mean an exception handler must always exist at the location in
memory that matches its vector number. But that's OK because this is the
"STD" macro (standard), which does exactly that. We have other macros
for the other cases, eg. STD_EXCEPTION_PSERIES_OOL (out of line).
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
HMT_MEDIUM_PPR_DISCARD is a macro which is present at the start of most
of our first level exception handlers. It conditionally executes a
HMT_MEDIUM instruction, which sets the processor priority to medium.
On on modern systems, ie. Power7 and later, it is nop'ed out at boot.
All it does is make the exception vectors more cramped, and consume 4
bytes of icache.
On old systems it has the effect of boosting the processor priority at
the start of exception processing. If we were previously in the idle
loop for example, we may be at low or very low priority. This is
desirable as we want to process the exception as fast as possible.
However looking closely at the generated code, we see that in all cases
we execute another HMT_MEDIUM just four instructions later. With code
patching applied, the final code on an old (Power6) system will look
like, eg:
c000000000000300 <data_access_pSeries>:
c000000000000300: 7c 42 13 78 mr r2,r2 <-
c000000000000304: 7d b2 43 a6 mtsprg 2,r13
c000000000000308: 7d b1 42 a6 mfsprg r13,1
c00000000000030c: f9 2d 00 80 std r9,128(r13)
c000000000000310: 60 00 00 00 nop
c000000000000314: 7c 42 13 78 mr r2,r2 <-
So I suggest that the added code complexity of HMT_MEDIUM_PPR_DISCARD is
not justified by the benefit of boosting the processor priority for the
duration of four instructions, and therefore we drop it.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
There are no longer any users of enter_rtas() outside of rtas.c, so make
it "private", by moving the declaration inside rtas.c. Hopefully this
will encourage people to use one of the wrappers which takes the sharp
edges off the RTAS calling sequence.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Although call_rtas_display_status() does actually want to use the
regular RTAS locking, it doesn't want the extra logic that is in
rtas_call(), so currently it open codes the logic.
Instead we can use rtas_call_unlocked(), after taking the RTAS lock.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Most users of RTAS (Run-Time Abstraction Services) use rtas_call(),
which deals with locking as well as endian handling.
However we have two users outside of rtas.c that can't use rtas_call()
because they have different locking requirements.
The hotplug CPU code can't take the RTAS lock because the CPU would go
offline with the lock held and no other CPUs would be able to call RTAS
until the CPU came back online.
The xmon code doesn't want to take the lock because it would risk dead
locking when we are trying to recover from a crash.
Both sites required multiple patches when we added little endian
support, proving that programmers can't do endian right.
Although that ship has sailed, we can still clean the code up by
providing an unlocked version of rtas_call() which avoids the need to
open code the logic elsewhere.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
GregorianDay() is supposed to calculate the day of the week
(tm->tm_wday) for a given day/month/year. In that calcuation it
indexed into an array called MonthOffset using tm->tm_mon-1. However
tm_mon is zero-based, not one-based, so this is off-by-one. It also
means that every January, GregoiranDay() will access element -1 of
the MonthOffset array.
It also doesn't appear to be a correct algorithm either: see in
contrast kernel/time/timeconv.c's time_to_tm function.
It's been broken forever, which suggests no-one in userland uses
this. It looks like no-one in the kernel uses tm->tm_wday either
(see e.g. drivers/rtc/rtc-ds1305.c:319).
tm->tm_wday is conventionally set to -1 when not available in
hardware so we can simply set it to -1 and drop the function.
(There are over a dozen other drivers in drivers/rtc that do
this.)
Found using UBSAN.
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andrew Morton <akpm@linux-foundation.org> # as an example of what UBSan finds.
Cc: Alessandro Zummo <a.zummo@towertech.it>
Cc: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Cc: rtc-linux@googlegroups.com
Signed-off-by: Daniel Axtens <dja@axtens.net>
Acked-by: Alexandre Belloni <alexandre.belloni@free-electrons.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
- tm: Block signal return from setting invalid MSR state from Michael Neuling
- tm: Check for already reclaimed tasks from Michael Neuling
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJWVul+AAoJEFHr6jzI4aWAUZAQAK2s+E4WTtcExXSG1bqq05o6
5wCLtIq6M91h5HDpgBfN7S5OXpRd72ZeIVdzC0HkFeLBF3y7NSHSEezUw4g/GGfz
K2xGV1CCXC3Rb3qyHSdyi6+c1AnLVRPBVzVxPVmlXigrXeFiQ4613YW9rzf9b8fs
oktUciwW9aHbrIv7g8f82gpuk9jwwhp/sF+1H/7fGOozT4CFsKo4wj4HOOCBwH4y
ODEjs6Z+9Uwb6Kfvi/rn3k4XA1wC36WFq3ORI6KrmK/ZB1eR0Kwf0IELYpMj8cOX
q5ZtCH7t68f9vmEK2B34AUijf/amm+2vLwvF6xAuZJFPUPZtgMBdRcqkLalbtPAO
8hlyPPgoZcgR/Of+lEYxUobcL0SMNufXwmfwRO35ktkm9Z9Ee96C8NNbpybBSDXL
YRa6is5MeO4GL8Gbcc0TA50hGjok7o3acGE6HSAReyzf0guQ4xqcif7+6lfWZPkk
P3aM02ajp2qoqyjhT/Ei6JlMptAiuQY+HvELFneqn5s9nDbv6cGuYZNNap0c1fK+
74W0p7MiZh7+IF5HpyUIeYV836inXMDIoKzjA6H3OWitk/1lbcrbF34Qpz9zAWZn
YF3w786ZzzQLw0jcALaqZejm58MLGIakO4MNDB0/ZBh0nKKfEV8WvPOjev78OAp1
+pDrJh0iQzrujN6OMKQ0
=IL8A
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.4-3' into next
Merge the two TM fixes we merged in 4.4. We are about to merge selftests
for these, and without the fixes the selftests will oops.
powerpc fixes for 4.4 #2
- tm: Block signal return from setting invalid MSR state from Michael Neuling
- tm: Check for already reclaimed tasks from Michael Neuling
Print MSR TM bits in oops messages. This appends them to the end
like this:
MSR: 8000000502823031 <SF,VEC,VSX,FP,ME,IR,DR,LE,TM[TE]>
You get the TM[] only if at least one TM MSR bit is set. Inside the
TM[], E means Enabled (bit 32), S means Suspended (bit 33), and T
means Transactional (bit 34)
If no bits are set, you get no TM[] output.
Include rework of printbits() to handle this case.
Signed-off-by: Michael Neuling <mikey@neuling.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We should not expect pte bit position in asm code. Simply
by moving part of that to C
Acked-by: Scott Wood <scottwood@freescale.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
* pci/aspm:
PCI/ASPM: Make sysfs link_state_store() consistent with link_state_show()
* pci/hotplug:
PCI: pciehp: Always protect pciehp_disable_slot() with hotplug mutex
* pci/misc:
x86/PCI: Simplify pci_bios_{read,write}
PCI: Simplify config space size computation
PCI: Limit config space size for Netronome NFP6000 family
PCI: Add Netronome vendor and device IDs
PCI: Support PCIe devices with short cfg_size
x86/PCI: Clarify AMD Fam10h config access restrictions comment
PCI: Print warnings for all invalid expansion ROM headers
PCI: Check for PCI_HEADER_TYPE_BRIDGE equality, not bitmask
* pci/msi:
PCI/MSI: Remove empty pci_msi_init_pci_dev()
PCI/MSI: Initialize MSI capability for all architectures
Bit 7 of the "Header Type" register indicates a multi-function device when
set. Bits 0-6 contain encoded values, where 0x1 indicates a PCI-PCI
bridge. It is incorrect to test this as though it were a mask.
For example, while the PCI 3.0 spec only defines values 0x0, 0x1, and 0x2,
it's conceivable that a future spec could define 0x3 to mean something
else; then tests for "(hdr_type & 0x7f) & PCI_HEADER_TYPE_BRIDGE" would
incorrectly succeed for this new 0x3 header type.
Test bits 0-6 of the Header Type for equality with PCI_HEADER_TYPE_BRIDGE.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Two DSCR tests have a hack in them:
/*
* XXX: Force a context switch out so that DSCR
* current value is copied into the thread struct
* which is required for the child to inherit the
* changed value.
*/
sleep(1);
We should not be working around this in the testcase, it is a kernel bug.
Fix it by copying the current DSCR to the child, instead of what we
had in the thread struct at last context switch.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
commit 152d523e63 ("powerpc: Create context switch helpers save_sprs()
and restore_sprs()") moved the restore of SPRs after the call to _switch().
There is an issue with this approach - new tasks do not return through
_switch(), they are set up by copy_thread() to directly return through
ret_from_fork() or ret_from_kernel_thread(). This means restore_sprs() is
not getting called for new tasks.
Fix this by moving restore_sprs() before _switch().
Fixes: 152d523e63 ("powerpc: Create context switch helpers save_sprs() and restore_sprs()")
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Commit a0e72cf12b ("powerpc: Create msr_check_and_{set,clear}()")
removed a call to check_if_tm_restore_required() in the
enable_kernel_*() functions. Add them back in.
Fixes: a0e72cf12b ("powerpc: Create msr_check_and_{set,clear}()")
Reported-by: Rashmica Gupta <rashmicy@gmail.com>
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
This reverts commit 527d10ef3a.
The reverted commit breaks cxlflash devices following an EEH reset (and
possibly other cxl devices, however this has not been tested).
The reverted commit changed the behaviour of eeh_reset_device() so that PHB
PEs are not unfrozen following the completion of the reset. This should not
be problematic, as no device resources should have been associated with the
PHB PE.
However, when attempting to load the cxlflash driver after a reset, the
driver attempts to read Vital Product Data through a call to
pci_read_vpd() (which is called on the physical cxl device, not on the
virtual AFU device). pci_read_vpd() in turn attempts to read from the cxl
device's config space. This fails, as the PE it's trying to read from is
still frozen. In turn, the driver gets an -ENODEV and fails to initialise.
It appears this issue only affects some parts of the VPD area, as "lspci
-vvv", which only reads a subset of the VPD bytes, is not broken by the
original patch.
At this stage, we don't fully understand why we're trying to read a frozen
PE, and we don't know how this affects other cxl devices. It is possible
that there is an underlying bug in the cxl driver or the powerpc CAPI
support code, or alternatively a bug in the PCI resource allocation/mapping
code that is incorrectly mapping resources to PE#0.
As such, this fix is incomplete, however it is necessary to prevent a
serious regression in CAPI support.
In the meantime, revert the commit, especially as it was intended to be a
non-functional change.
Cc: Gavin Shan <gwshan@linux.vnet.ibm.com>
Cc: Ian Munsie <imunsie@au1.ibm.com>
Cc: Daniel Axtens <dja@axtens.net>
Signed-off-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Makes it easier to handle init vs core cleanly, though the change is
fairly invasive across random architectures.
It simplifies the rbtree code immediately, however, while keeping the
core data together in the same cachline (now iff the rbtree code is
enabled).
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Remove a bunch of unnecessary fallback functions and group
things in a more logical way.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Most of __switch_to() is housekeeping, TLB batching, timekeeping etc.
Move these away from the more complex and critical context switching
code.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Create a single function that flushes everything (FP, VMX, VSX, SPE).
Doing this all at once means we only do one MSR write.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Create a single function that gives everything up (FP, VMX, VSX, SPE).
Doing this all at once means we only do one MSR write.
A context switch microbenchmark using yield():
http://ozlabs.org/~anton/junkcode/context_switch2.c
./context_switch2 --test=yield --fp --altivec --vector 0 0
shows an improvement of 3% on POWER8.
Signed-off-by: Anton Blanchard <anton@samba.org>
[mpe: giveup_all() needs to be EXPORT_SYMBOL'ed]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
More consolidation of our MSR available bit handling.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Add a boot option that strictly manages the MSR unavailable bits.
This catches kernel uses of FP/Altivec/SPE that would otherwise
corrupt user state.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The enable_kernel_*() functions leave the relevant MSR bits enabled
until we exit the kernel sometime later. Create disable versions
that wrap the kernel use of FP, Altivec VSX or SPE.
While we don't want to disable it normally for performance reasons
(MSR writes are slow), it will be used for a debug boot option that
does this and catches bad uses in other areas of the kernel.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Create helper functions to set and clear MSR bits after first
checking if they are already set. Grouping them will make it
easy to avoid the MSR writes in a subsequent optimisation.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move the MSR modification into c. Removing it from the assembly
function will allow us to avoid costly MSR writes by batching them
up.
Check the FP and VMX bits before calling the relevant giveup_*()
function. This makes giveup_vsx() and flush_vsx_to_thread() perform
more like their sister functions, and allows us to use
flush_vsx_to_thread() in the signal code.
Move the check_if_tm_restore_required() check in.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move the MSR modification into new c functions. Removing it from
the low level functions will allow us to avoid costly MSR writes
by batching them up.
Move the check_if_tm_restore_required() check into these new functions.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
We used to allow giveup_*() to be called with a NULL task struct
pointer. Now those cases are handled in the caller we can remove
the checks. We can also remove giveup_altivec_notask() which is also
unused.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
mtmsrd_isync() will do an mtmsrd followed by an isync on older
processors. On newer processors we avoid the isync via a feature fixup.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Instead of having multiple giveup_*_maybe_transactional() functions,
separate out the TM check into a new function called
check_if_tm_restore_required().
This will make it easier to optimise the giveup_*() functions in a
subsequent patch.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
The UP only lazy floating point and vector optimisations were written
back when SMP was not common, and neither glibc nor gcc used vector
instructions. Now SMP is very common, glibc aggressively uses vector
instructions and gcc autovectorises.
We want to add new optimisations that apply to both UP and SMP, but
in preparation for that remove these UP only optimisations.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Move all our context switch SPR save and restore code into two
helpers. We do a few optimisations:
- Group all mfsprs and all mtsprs. In many cases an mtspr sets a
scoreboarding bit that an mfspr waits on, so the current practise of
mfspr A; mtspr A; mfpsr B; mtspr B is the worst scheduling we can
do.
- SPR writes are slow, so check that the value is changing before
writing it.
A context switch microbenchmark using yield():
http://ozlabs.org/~anton/junkcode/context_switch2.c
./context_switch2 --test=yield 0 0
shows an improvement of almost 10% on POWER8.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Similar to the non TM load_up_*() functions, don't disable the MSR
bits on the way out.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Writing the MSR is slow, so we want to avoid it whenever possible.
A subsequent patch will add a debug option that strictly manages the
FP/VMX/VSX unavailable bits. For now just remove it, matching what
we do in other areas of the kernel (eg enable_kernel_altivec()).
A context switch microbenchmark using yield():
http://ozlabs.org/~anton/junkcode/context_switch2.c
./context_switch2 --test=yield --fp 0 0
shows an improvement of almost 3% on POWER8.
Signed-off-by: Anton Blanchard <anton@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Currently, if HV KVM is configured but PR KVM isn't, we don't include
a test to see whether we were interrupted in KVM guest context for the
set of interrupts which get delivered directly to the guest by hardware
if they occur in the guest. This includes things like program
interrupts.
However, the recent bug where userspace could set the MSR for a VCPU
to have an illegal value in the TS field, and thus cause a TM Bad Thing
type of program interrupt on the hrfid that enters the guest, showed that
we can never be completely sure that these interrupts can never occur
in the guest entry/exit code. If one of these interrupts does happen
and we have HV KVM configured but not PR KVM, then we end up trying to
run the handler in the host with the MMU set to the guest MMU context,
which generally ends badly.
Thus, for robustness it is better to have the test in every interrupt
vector, so that if some way is found to trigger some interrupt in the
guest entry/exit path, we can handle it without immediately crashing
the host.
This means that the distinction between KVMTEST and KVMTEST_PR goes
away. Thus we delete KVMTEST_PR and associated macros and use KVMTEST
everywhere that we previously used either KVMTEST_PR or KVMTEST. It
also means that SOFTEN_TEST_HV_201 becomes the same as SOFTEN_TEST_PR,
so we deleted SOFTEN_TEST_HV_201 and use SOFTEN_TEST_PR instead.
Signed-off-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
It is common practice with powerpc to use 'rN' to refer to register 'N'. However
when using the pt_regs_offset table we have to use 'gprN'.
So add aliases such that both 'rN' and 'gprN' can be used.
For example, we can currently do:
$ su -
$ cd /sys/kernel/debug/tracing
$ echo "p:probe/sys_fchownat sys_fchownat %gpr3:s32 +0(%gpr4):string %gpr5:s32 %gpr6:s32 %gpr7:s32" > kprobe_events
$ echo 1 > events/probe/sys_fchownat/enable
$ touch /tmp/foo
$ chown root /tmp/foo
$ echo 0 > events/enable
$ cat trace
chown-2925 [014] d... 76.160657: sys_fchownat: (SyS_fchownat+0x8/0x1a0) arg1=-100 arg2="/tmp/foo" arg3=0 arg4=-1 arg5=0
Instead we'd like to be able to use:
$ echo "p:probe/sys_fchownat sys_fchownat %r3:s32 +0(%r4):string %r5:s32 %r6:s32 %r7:s32" > kprobe_events
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Most architectures use NR_syscalls as the #define for the number of syscalls.
We use __NR_syscalls, and then define NR_syscalls as __NR_syscalls.
__NR_syscalls is not used outside arch code, whereas NR_syscalls is. So as
NR_syscalls must be defined and __NR_syscalls does not, replace __NR_syscalls
with NR_syscalls.
Signed-off-by: Rashmica Gupta <rashmicy@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>