.. SPDX-License-Identifier: GPL-2.0

===================================================================
The Definitive SEV Guest API Documentation
===================================================================

1. General description
======================

The SEV API is a set of ioctls that are used by the guest or hypervisor
to get or set a certain aspect of the SEV virtual machine. The ioctls belong
to the following classes:

 - Hypervisor ioctls: These query and set global attributes which affect the
   whole SEV firmware. These ioctls are used by platform provisioning tools.

 - Guest ioctls: These query and set attributes of the SEV virtual machine.

2. API description
==================

This section describes ioctls that are used for querying the SEV guest report
from the SEV firmware. For each ioctl, the following information is provided
along with a description:

  Technology:
      which SEV technology provides this ioctl. SEV, SEV-ES, SEV-SNP or all.

  Type:
      hypervisor or guest. The ioctl can be used inside the guest or the
      hypervisor.

  Parameters:
      what parameters are accepted by the ioctl.

  Returns:
      the return value. General error numbers (-ENOMEM, -EINVAL)
      are not detailed, but errors with specific meanings are.

The guest ioctl should be issued on a file descriptor of the /dev/sev-guest
device. The ioctl accepts struct snp_guest_request_ioctl. The input and
output structures are specified through the req_data and resp_data fields
respectively. If the ioctl fails to execute due to a firmware error, then
the fw_error code will be set, otherwise fw_error will be set to -1.

The firmware checks that the message sequence counter is one greater than
the guest's message sequence counter. If the guest driver fails to increment
the message counter (e.g. counter overflow), then -EIO will be returned.

::

        struct snp_guest_request_ioctl {
                /* Message version number */
                __u32 msg_version;

                /* Request and response structure address */
                __u64 req_data;
                __u64 resp_data;

                /* bits[63:32]: VMM error code, bits[31:0] firmware error code (see psp-sev.h) */
                union {
                        __u64 exitinfo2;
                        struct {
                                __u32 fw_error;
                                __u32 vmm_error;
                        };
                };
        };

The host ioctls are issued to a file descriptor of the /dev/sev device.
The ioctl accepts the command ID/input structure documented below.

::

        struct sev_issue_cmd {
                /* Command ID */
                __u32 cmd;

                /* Command request structure */
                __u64 data;

                /* Firmware error code on failure (see psp-sev.h) */
                __u32 error;
        };
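
The sketch below shows one way a provisioning tool might fill in the command
structure before issuing a host ioctl. It is an illustration under stated
assumptions: ``struct sev_issue_cmd_sketch`` is a local stand-in for the
structure above (real code would include ``<linux/psp-sev.h>``), the packed
attribute is an assumption about the uapi layout, and
``sev_issue_cmd_prepare`` is a hypothetical helper name. The command ID is
taken as a parameter rather than hard-coded.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for struct sev_issue_cmd; the packing is an assumption. */
struct sev_issue_cmd_sketch {
	uint32_t cmd;	/* Command ID */
	uint64_t data;	/* User-space address of the command request structure */
	uint32_t error;	/* Firmware error code on failure */
} __attribute__((packed));

/*
 * Hypothetical helper: zero the argument block, then store the command ID
 * and the address of the command-specific payload.
 */
static void sev_issue_cmd_prepare(struct sev_issue_cmd_sketch *arg,
				  uint32_t cmd_id, void *payload)
{
	assert(arg != NULL);

	memset(arg, 0, sizeof(*arg));
	arg->cmd = cmd_id;
	arg->data = (uint64_t)(uintptr_t)payload;
}
```

After this step the caller would pass the prepared structure to the ioctl on
the /dev/sev file descriptor and inspect the error field if the call fails.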

2.1 SNP_GET_REPORT
------------------

:Technology: sev-snp
:Type: guest ioctl
:Parameters (in): struct snp_report_req
:Returns (out): struct snp_report_resp on success, -negative on error

The SNP_GET_REPORT ioctl can be used to query the attestation report from the
SEV-SNP firmware. The ioctl uses the SNP_GUEST_REQUEST (MSG_REPORT_REQ) command
provided by the SEV-SNP firmware to query the attestation report.

On success, snp_report_resp.data contains the report. The report follows
the format described in the SEV-SNP specification. See the SEV-SNP
specification for further details.
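
A minimal sketch of preparing a report request follows. The structure layout
is an assumption modeled on ``struct snp_report_req`` in
``<linux/sev-guest.h>`` (a 64-byte user data field, a VMPL field, and reserved
bytes that must be zero); consult the header for the authoritative definition.
The names ``snp_report_req_sketch`` and ``snp_report_req_prepare`` are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_USER_DATA_SIZE 64	/* assumed size of the user data field */

/* Assumed layout; real code would use struct snp_report_req from the uapi. */
struct snp_report_req_sketch {
	uint8_t user_data[SKETCH_USER_DATA_SIZE]; /* caller data bound into the report */
	uint32_t vmpl;				  /* VMPL to include in the report */
	uint8_t rsvd[28];			  /* must be zero */
};

/*
 * Hypothetical helper: zero the request, copy in a caller-supplied nonce
 * (zero-padded to the field size), and set the VMPL.
 * Returns 0 on success, -1 if the nonce does not fit.
 */
static int snp_report_req_prepare(struct snp_report_req_sketch *req,
				  const void *nonce, size_t nonce_len,
				  uint32_t vmpl)
{
	assert(req != NULL);

	if (nonce_len > sizeof(req->user_data))
		return -1;

	memset(req, 0, sizeof(*req));
	if (nonce_len > 0)
		memcpy(req->user_data, nonce, nonce_len);
	req->vmpl = vmpl;
	return 0;
}
```

The address of the prepared request would then be stored in req_data of the
guest request structure, with resp_data pointing at a response buffer, before
issuing the ioctl on /dev/sev-guest.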

2.2 SNP_GET_DERIVED_KEY
-----------------------

:Technology: sev-snp
:Type: guest ioctl
:Parameters (in): struct snp_derived_key_req
:Returns (out): struct snp_derived_key_resp on success, -negative on error

The SNP_GET_DERIVED_KEY ioctl can be used to get a key derived from a root key.
The derived key can be used by the guest for any purpose, such as sealing keys
or communicating with external entities.

The ioctl uses the SNP_GUEST_REQUEST (MSG_KEY_REQ) command provided by the
SEV-SNP firmware to derive the key. See the SEV-SNP specification for further
details on the various fields passed in the key derivation request.

On success, the snp_derived_key_resp.data contains the derived key value. See
the SEV-SNP specification for further details.

2.3 SNP_GET_EXT_REPORT
----------------------

:Technology: sev-snp
:Type: guest ioctl
:Parameters (in/out): struct snp_ext_report_req
:Returns (out): struct snp_report_resp on success, -negative on error

The SNP_GET_EXT_REPORT ioctl is similar to SNP_GET_REPORT. The difference is
the additional certificate data that is returned with the report.
The certificate data returned is provided by the hypervisor through
SNP_SET_EXT_CONFIG.

The ioctl uses the SNP_GUEST_REQUEST (MSG_REPORT_REQ) command provided by the
SEV-SNP firmware to get the attestation report.

On success, snp_report_resp.data will contain the attestation report
and the buffer at snp_ext_report_req.certs_address will contain the
certificate blob. If the buffer length supplied in
snp_ext_report_req.certs_len is too small to hold the blob, certs_len will
be updated with the required length.

See the GHCB specification for further details on how to parse the
certificate blob.
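
The certs_len behaviour described for this ioctl lends itself to a two-pass
pattern: probe with a small buffer, then retry with the length that was
reported back. The sketch below illustrates only that pattern. The callback
type ``ext_report_fn`` is a hypothetical stand-in for the ioctl call, and
``fake_issue`` simulates a hypervisor holding an 8-byte blob; neither is part
of the driver. Any alignment or maximum-size constraints the real ioctl may
impose on the buffer are not modeled here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical stand-in for the ioctl. Returns 0 on success. If the buffer
 * is too small, returns nonzero and stores the required length in *certs_len.
 */
typedef int (*ext_report_fn)(void *certs, uint32_t *certs_len, void *ctx);

/* Simulated backend for demonstration: pretends the blob is 8 bytes long. */
static int fake_issue(void *certs, uint32_t *certs_len, void *ctx)
{
	const uint32_t need = 8;

	(void)ctx;
	if (*certs_len < need) {
		*certs_len = need;	/* report the required length */
		return -1;
	}
	memset(certs, 0xAB, need);
	*certs_len = need;
	return 0;
}

/*
 * Probe once with an empty buffer; if a required length is reported,
 * allocate that much and retry once. On success the caller owns
 * *certs_out and must free() it.
 */
static int fetch_certs(ext_report_fn issue, void *ctx,
		       void **certs_out, uint32_t *len_out)
{
	uint32_t len = 0;
	void *buf = NULL;
	int ret;

	assert(issue != NULL && certs_out != NULL && len_out != NULL);

	/* First pass: learn the required length. */
	ret = issue(NULL, &len, ctx);
	if (ret != 0 && len > 0) {
		/* Second pass: retry with a buffer of the reported size. */
		buf = calloc(1, len);
		if (buf == NULL)
			return -1;
		ret = issue(buf, &len, ctx);
	}
	if (ret != 0) {
		free(buf);
		return -1;
	}
	*certs_out = buf;
	*len_out = len;
	return 0;
}
```

Retrying only once keeps the sketch simple; a caller that expects the
certificate data to change between calls might loop a bounded number of times
instead.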

2.4 SNP_PLATFORM_STATUS
-----------------------

:Technology: sev-snp
:Type: hypervisor ioctl cmd
:Parameters (out): struct sev_user_data_snp_status
:Returns (out): 0 on success, -negative on error

The SNP_PLATFORM_STATUS command is used to query the SNP platform status. The
status includes the API major and minor versions, among other fields. See the
SEV-SNP specification for further details.

2.5 SNP_COMMIT
--------------

:Technology: sev-snp
:Type: hypervisor ioctl cmd
:Returns (out): 0 on success, -negative on error

SNP_COMMIT is used to commit the currently installed firmware using the
SEV-SNP firmware SNP_COMMIT command. This prevents roll-back to a previously
committed firmware version. This will also update the reported TCB to match
that of the currently installed firmware.

2.6 SNP_SET_CONFIG
------------------

:Technology: sev-snp
:Type: hypervisor ioctl cmd
:Parameters (in): struct sev_user_data_snp_config
:Returns (out): 0 on success, -negative on error

SNP_SET_CONFIG is used to set the system-wide configuration, such as the
reported TCB version in the attestation report. The command is similar
to the SNP_CONFIG command defined in the SEV-SNP spec. The current values of
the firmware parameters affected by this command can be queried via
SNP_PLATFORM_STATUS.

2.7 SNP_VLEK_LOAD
-----------------

:Technology: sev-snp
:Type: hypervisor ioctl cmd
:Parameters (in): struct sev_user_data_snp_vlek_load
:Returns (out): 0 on success, -negative on error

When requesting an attestation report a guest is able to specify whether
it wants SNP firmware to sign the report using either a Versioned Chip
Endorsement Key (VCEK), which is derived from chip-unique secrets, or a
Versioned Loaded Endorsement Key (VLEK) which is obtained from an AMD
Key Derivation Service (KDS) and derived from seeds allocated to
enrolled cloud service providers.

In the case of VLEK keys, the SNP_VLEK_LOAD SNP command is used to load
them into the system after obtaining them from the KDS, and corresponds
closely to the SNP_VLEK_LOAD firmware command specified in the SEV-SNP
spec.

3. SEV-SNP CPUID Enforcement
============================

SEV-SNP guests can access a special page that contains a table of CPUID values
that have been validated by the PSP as part of the SNP_LAUNCH_UPDATE firmware
command. It provides the following assurances regarding the validity of CPUID
values:

 - Its address is obtained via bootloader/firmware (via CC blob), and those
   binaries will be measured as part of the SEV-SNP attestation report.
 - Its initial state will be encrypted/pvalidated, so attempts to modify
   it during run-time will result in garbage being written, or #VC exceptions
   being generated due to changes in validation state if the hypervisor tries
   to swap the backing page.
 - Attempts to bypass PSP checks by the hypervisor by using a normal page, or
   a non-CPUID encrypted page will change the measurement provided by the
   SEV-SNP attestation report.
 - The CPUID page contents are *not* measured, but attempts to modify the
   expected contents of a CPUID page as part of guest initialization will be
   gated by the PSP CPUID enforcement policy checks performed on the page
   during SNP_LAUNCH_UPDATE, and noticeable later if the guest owner
   implements their own checks of the CPUID values.

It is important to note that this last assurance is only useful if the kernel
has taken care to make use of the SEV-SNP CPUID throughout all stages of boot.
Otherwise, guest owner attestation provides no assurance that the kernel wasn't
fed incorrect values at some point during boot.

4. SEV Guest Driver Communication Key
=====================================

Communication between an SEV guest and the SEV firmware in the AMD Secure
Processor (ASP, aka PSP) is protected by a VM Platform Communication Key
(VMPCK). By default, the sev-guest driver uses the VMPCK associated with the
VM Privilege Level (VMPL) at which the guest is running. Should this key be
wiped by the sev-guest driver (see the driver for reasons why a VMPCK can be
wiped), a different key can be used by reloading the sev-guest driver and
specifying the desired key using the vmpck_id module parameter.


Reference
---------

SEV-SNP and GHCB specification: developer.amd.com/sev

The driver is based on SEV-SNP firmware spec 0.9 and GHCB spec version 2.0.