.. SPDX-License-Identifier: GPL-2.0

===========================
Hypercall Op-codes (hcalls)
===========================

Overview
=========

Virtualization on 64-bit Power Book3S platforms is based on the PAPR
specification [1]_, which describes the run-time environment for a guest
operating system and how it should interact with the hypervisor for
privileged operations. Currently there are two PAPR compliant hypervisors:

- **IBM PowerVM (PHYP)**: IBM's proprietary hypervisor that supports AIX,
  IBM-i and Linux as guests (termed Logical Partitions or LPARs). It
  supports the full PAPR specification.

- **Qemu/KVM**: Supports PPC64 Linux guests running on a PPC64 Linux host.
  It implements only a subset of the PAPR specification, called LoPAPR [2]_.

On the PPC64 arch, a guest kernel running on top of a PAPR hypervisor is called
a *pSeries guest*. A pseries guest runs in supervisor mode (HV=0) and must
issue hypercalls to the hypervisor whenever it needs to perform an action
that is hypervisor privileged [3]_ or for other services managed by the
hypervisor.

Hence a hypercall (hcall) is essentially a request by the pseries guest
asking the hypervisor to perform a privileged operation on its behalf. The
guest issues a hcall with the necessary input operands. After performing
the privileged operation, the hypervisor returns a status code and output
operands back to the guest.

HCALL ABI
=========
The ABI specification for a hcall between a pseries guest and PAPR hypervisor
is covered in section 14.5.3 of ref [2]_. The switch to the hypervisor context
is done via the instruction **HVSC**, which expects the opcode for the hcall to
be set in *r3* and any in-arguments for the hcall to be provided in registers
*r4-r12*. If values have to be passed through a memory buffer, the data stored
in that buffer should be in big-endian byte order.

Once control returns to the guest after the hypervisor has serviced the
'HVSC' instruction, the return value of the hcall is available in *r3* and any
out values are returned in registers *r4-r12*. As with in-arguments, any out
values stored in a memory buffer will be in big-endian byte order.

The powerpc arch code provides convenient wrappers named **plpar_hcall_xxx**,
defined in an arch-specific header [4]_, to issue hcalls from the Linux kernel
running as a pseries guest.

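As a minimal sketch of the byte-order requirement for memory buffers, the
helpers below store and read back a 64-bit value in big-endian order
regardless of host endianness. The names ``hcall_buf_put_be64`` and
``hcall_buf_get_be64`` are hypothetical and exist only for this illustration;
kernel code would normally rely on the existing ``cpu_to_be64()`` and
``be64_to_cpu()`` helpers instead.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helper: store 'val' at 'buf' in big-endian byte order. */
    void hcall_buf_put_be64(uint8_t *buf, uint64_t val)
    {
        /* Most significant byte goes to the lowest address. */
        for (size_t i = 0; i < 8; i++)
            buf[i] = (uint8_t)(val >> (56 - 8 * i));
    }

    /* Hypothetical helper: read a big-endian 64-bit value back from 'buf'. */
    uint64_t hcall_buf_get_be64(const uint8_t *buf)
    {
        uint64_t val = 0;

        for (size_t i = 0; i < 8; i++)
            val = (val << 8) | buf[i];
        return val;
    }

Because the helpers work byte by byte, the same code behaves identically on
little-endian and big-endian guests.
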
Register Conventions
====================

Any hcall should follow the same register conventions as described in section
2.2.1.1 of "64-Bit ELF V2 ABI Specification: Power Architecture" [5]_. The table
below summarizes these conventions:

+----------+----------+-------------------------------------------+
| Register |Volatile  |  Purpose                                  |
| Range    |(Y/N)     |                                           |
+==========+==========+===========================================+
|   r0     |    Y     |  Optional-usage                           |
+----------+----------+-------------------------------------------+
|   r1     |    N     |  Stack Pointer                            |
+----------+----------+-------------------------------------------+
|   r2     |    N     |  TOC                                      |
+----------+----------+-------------------------------------------+
|   r3     |    Y     |  hcall opcode/return value                |
+----------+----------+-------------------------------------------+
|  r4-r10  |    Y     |  in and out values                        |
+----------+----------+-------------------------------------------+
|   r11    |    Y     |  Optional-usage/Environmental pointer     |
+----------+----------+-------------------------------------------+
|   r12    |    Y     |  Optional-usage/Function entry address at |
|          |          |  global entry point                       |
+----------+----------+-------------------------------------------+
|   r13    |    N     |  Thread-Pointer                           |
+----------+----------+-------------------------------------------+
|  r14-r31 |    N     |  Local Variables                          |
+----------+----------+-------------------------------------------+
|    LR    |    Y     |  Link Register                            |
+----------+----------+-------------------------------------------+
|   CTR    |    Y     |  Loop Counter                             |
+----------+----------+-------------------------------------------+
|   XER    |    Y     |  Fixed-point exception register.          |
+----------+----------+-------------------------------------------+
|  CR0-1   |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  CR2-4   |    N     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  CR5-7   |    Y     |  Condition register fields.               |
+----------+----------+-------------------------------------------+
|  Others  |    N     |                                           |
+----------+----------+-------------------------------------------+

DRC & DRC Indexes
=================
::

     DR1                              Guest
     +--+        +------------+         +---------+
     |  | <----> |            |         |  User   |
     +--+  DRC1  |            |   DRC   |  Space  |
                 |    PAPR    |  Index  +---------+
     DR2         | Hypervisor |         |         |
     +--+        |            | <-----> |  Kernel |
     |  | <----> |            |  Hcall  |         |
     +--+  DRC2  +------------+         +---------+

The PAPR hypervisor terms shared hardware resources like PCI devices, NVDIMMs
etc. available for use by LPARs as Dynamic Resources (DR). When a DR is
allocated to an LPAR, PHYP creates a data structure called a Dynamic Resource
Connector (DRC) to manage LPAR access. An LPAR refers to a DRC via an opaque
32-bit number called the DRC-Index. The DRC-Index value is provided to the LPAR
via the device tree, where it is present as an attribute in the device tree
node associated with the DR.

HCALL Return-values
===================

After servicing the hcall, the hypervisor sets the return value in *r3*,
indicating success or failure of the hcall. In case of a failure, an error code
indicates the cause of the error. These codes are defined and documented in the
arch-specific header [4]_.

In some cases a hcall can take a long time and needs to be issued
multiple times in order to be completely serviced. These hcalls will usually
accept an opaque value *continue-token* within their argument list, and a
return value of *H_CONTINUE* indicates that the hypervisor has not yet finished
servicing the hcall.

To make such hcalls the guest needs to set *continue-token == 0* for the
initial call and use the hypervisor-returned value of *continue-token*
for each subsequent hcall until the hypervisor returns a non *H_CONTINUE*
return value.

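The continue-token protocol can be sketched as a simple loop. The example
below is only an illustration that runs in user space: ``demo_long_hcall`` is
a hypothetical stand-in for a long-running hcall, and the ``DEMO_H_*`` status
values are placeholders rather than the real codes from the arch-specific
header [4]_.

.. code-block:: c

    #include <stdint.h>

    /* Placeholder status codes; the real values live in the arch header. */
    enum { DEMO_H_SUCCESS = 0, DEMO_H_CONTINUE = 1 };

    /*
     * Hypothetical stand-in for a long-running hcall: each call performs one
     * unit of work and returns DEMO_H_CONTINUE until 'total_steps' units are
     * done. In this mock the token counts completed units; a guest must
     * treat it as opaque.
     */
    long demo_long_hcall(uint64_t token_in, uint64_t *token_out,
                         uint64_t total_steps)
    {
        uint64_t done = token_in + 1;

        *token_out = done;
        return done < total_steps ? DEMO_H_CONTINUE : DEMO_H_SUCCESS;
    }

    /* Issue the hcall with token 0 first, then feed back the returned token. */
    long demo_issue_until_done(uint64_t total_steps, unsigned int *calls)
    {
        uint64_t token = 0;
        long rc;

        *calls = 0;
        do {
            rc = demo_long_hcall(token, &token, total_steps);
            (*calls)++;
        } while (rc == DEMO_H_CONTINUE);

        return rc;
    }

The caller never interprets the token; it only passes back whatever the
previous call returned and stops at the first return value other than
``DEMO_H_CONTINUE``.
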
HCALL Op-codes
==============

Below is a partial list of hcalls that are supported by PHYP. For the
corresponding opcode values, see the arch-specific header [4]_:

**H_SCM_READ_METADATA**

| Input: *drcIndex, offset, buffer-address, numBytesToRead*
| Out: *numBytesRead*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_Hardware*

Given a DRC Index of an NVDIMM, read N bytes from the metadata area
associated with it, at the specified offset, and copy them to the provided
buffer. The metadata area stores configuration information such as label
information, bad blocks etc. The metadata area is located out-of-band of the
NVDIMM storage area, hence separate access semantics are provided.

**H_SCM_WRITE_METADATA**

| Input: *drcIndex, offset, data, numBytesToWrite*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P4, H_Hardware*

Given a DRC Index of an NVDIMM, write N bytes from the provided buffer to the
metadata area associated with it, at the specified offset.

**H_SCM_BIND_MEM**

| Input: *drcIndex, startingScmBlockIndex, numScmBlocksToBind,*
| *targetLogicalMemoryAddress, continue-token*
| Out: *continue-token, targetLogicalMemoryAddress, numScmBlocksToBound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_P4, H_Overlap,*
| *H_Too_Big, H_P5, H_Busy*

Given a DRC-Index of an NVDIMM, map a contiguous range of SCM blocks
*(startingScmBlockIndex, startingScmBlockIndex+numScmBlocksToBind)* to the guest
at *targetLogicalMemoryAddress* within the guest physical address space. If
*targetLogicalMemoryAddress == 0xFFFFFFFF_FFFFFFFF*, the hypervisor
assigns a target address to the guest. The hcall can fail if the guest has
an active PTE entry to the SCM block being bound.

**H_SCM_UNBIND_MEM**

| Input: *drcIndex, startingScmLogicalMemoryAddress, numScmBlocksToUnbind*
| Out: *numScmBlocksUnbound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Overlap,*
| *H_Busy, H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Given a DRC-Index of an NVDIMM, unmap *numScmBlocksToUnbind* SCM blocks starting
at *startingScmLogicalMemoryAddress* from the guest physical address space. The
hcall can fail if the guest has an active PTE entry to the SCM block being
unbound.

**H_SCM_QUERY_BLOCK_MEM_BINDING**

| Input: *drcIndex, scmBlockIndex*
| Out: *Guest-Physical-Address*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a DRC-Index and an SCM block index, return the guest physical address to
which the SCM block is mapped.

**H_SCM_QUERY_LOGICAL_MEM_BINDING**

| Input: *Guest-Physical-Address*
| Out: *drcIndex, scmBlockIndex*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a guest physical address, return the DRC Index and SCM block that are
mapped to that address.

**H_SCM_UNBIND_ALL**

| Input: *scmTargetScope, drcIndex*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Busy,*
| *H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Depending on the target scope, unmap from the LPAR memory either all SCM blocks
belonging to all NVDIMMs or all SCM blocks belonging to a single NVDIMM
identified by its drcIndex.

**H_SCM_HEALTH**

| Input: *drcIndex*
| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
| Return Value: *H_Success, H_Parameter, H_Hardware*

Given a DRC Index, return information on predictive failure and the overall
health of the PMEM device. The asserted bits in the health-bitmap indicate one
or more states (described in the table below) of the PMEM device, and
health-bit-valid-bitmap indicates which bits in health-bitmap are valid. The
bits are reported in reverse bit ordering; for example, a value of
0xC400000000000000 indicates that bits 0, 1, and 5 are valid.

Health Bitmap Flags:

+------+-----------------------------------------------------------------------+
| Bit  | Definition                                                            |
+======+=======================================================================+
|  00  | PMEM device is unable to persist memory contents.                     |
|      | If the system is powered down, nothing will be saved.                 |
+------+-----------------------------------------------------------------------+
|  01  | PMEM device failed to persist memory contents. Either contents were   |
|      | not saved successfully on power down or were not restored properly on |
|      | power up.                                                             |
+------+-----------------------------------------------------------------------+
|  02  | PMEM device contents are persisted from previous IPL. The data from   |
|      | the last boot were successfully restored.                             |
+------+-----------------------------------------------------------------------+
|  03  | PMEM device contents are not persisted from previous IPL. There was no|
|      | data to restore from the last boot.                                   |
+------+-----------------------------------------------------------------------+
|  04  | PMEM device memory life remaining is critically low                   |
+------+-----------------------------------------------------------------------+
|  05  | PMEM device will be garded off next IPL due to failure                |
+------+-----------------------------------------------------------------------+
|  06  | PMEM device contents cannot persist due to current platform health    |
|      | status. A hardware failure may prevent data from being saved or       |
|      | restored.                                                             |
+------+-----------------------------------------------------------------------+
|  07  | PMEM device is unable to persist memory contents in certain conditions|
+------+-----------------------------------------------------------------------+
|  08  | PMEM device is encrypted                                              |
+------+-----------------------------------------------------------------------+
|  09  | PMEM device has successfully completed a requested erase or secure    |
|      | erase procedure.                                                      |
+------+-----------------------------------------------------------------------+
|10:63 | Reserved / Unused                                                     |
+------+-----------------------------------------------------------------------+

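The reverse bit ordering means that bit 0 is the most significant bit of the
64-bit bitmap. A minimal sketch of how a flag from the table could be tested
is shown below; ``DEMO_HEALTH_BIT`` and ``demo_health_flag_set`` are
hypothetical names used only for this illustration (the kernel's
``PPC_BIT()`` macro expresses the same numbering).

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* Mask for bit 'n' (0..63), where bit 0 is the most significant bit. */
    #define DEMO_HEALTH_BIT(n) (UINT64_C(1) << (63 - (n)))

    /*
     * Report a health flag as set only when it is asserted in the health
     * bitmap and also marked valid in the valid bitmap.
     */
    bool demo_health_flag_set(uint64_t health, uint64_t valid, unsigned int bit)
    {
        uint64_t mask = DEMO_HEALTH_BIT(bit);

        return (valid & mask) && (health & mask);
    }

With this numbering, bits 0, 1 and 5 combine to 0xC400000000000000, matching
the example value given for the valid bitmap.
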
**H_SCM_PERFORMANCE_STATS**

| Input: *drcIndex, resultBuffer Addr*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_Unsupported, H_Hardware, H_Authority, H_Privilege*

Given a DRC Index, collect the performance statistics for the NVDIMM and copy
them to the resultBuffer.

**H_SCM_FLUSH**

| Input: *drcIndex, continue-token*
| Out: *continue-token*
| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*

Given a DRC Index, flush the data to the backend NVDIMM device.

The hcall returns H_BUSY when the flush takes a long time and the hcall needs
to be issued multiple times in order to be completely serviced. The
*continue-token* from the output must be passed in the argument list of
subsequent hcalls to the hypervisor until the hcall is completely serviced,
at which point H_SUCCESS or an error code is returned by the hypervisor.

References
==========
.. [1] "Power Architecture Platform Reference"
       https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
.. [2] "Linux on Power Architecture Platform Reference"
       https://members.openpowerfoundation.org/document/dl/469
.. [3] "Definitions and Notation" Book III-Section 14.5.3
       https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
.. [4] arch/powerpc/include/asm/hvcall.h
.. [5] "64-Bit ELF V2 ABI Specification: Power Architecture"
       https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture