Historically, signal.h defines MINSIGSTKSZ (2KB) and SIGSTKSZ (8KB), for
use by all architectures with sigaltstack(2). Over time, the hardware state
size grew, but these constants did not evolve. Today, literal use of these
constants on several architectures may result in signal stack overflow, and
thus user data corruption.
A few years ago, the ARM team addressed this issue by establishing
getauxval(AT_MINSIGSTKSZ). This enables the kernel to supply a value
at runtime that is an appropriate replacement on current and future
hardware.
Add getauxval(AT_MINSIGSTKSZ) support to x86, analogous to the support
added for ARM in
94b07c1f8c ("arm64: signal: Report signal frame size to userspace via auxv").
Also, include a documentation to describe x86-specific auxiliary vectors.
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Len Brown <len.brown@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20210518200320.17239-4-chang.seok.bae@intel.com
- Stage-2 isolation for the host kernel when running in protected mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation,
zap under read lock, enable/disably dirty page logging under
read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing
the architecture-specific code
- Some selftests improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmCJ13kUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM1HAgAqzPxEtiTPTFeFJV5cnPPJ3dFoFDK
y/juZJUQ1AOtvuWzzwuf175ewkv9vfmtG6rVohpNSkUlJYeoc6tw7n8BTTzCVC1b
c/4Dnrjeycr6cskYlzaPyV6MSgjSv5gfyj1LA5UEM16LDyekmaynosVWY5wJhju+
Bnyid8l8Utgz+TLLYogfQJQECCrsU0Wm//n+8TWQgLf1uuiwshU5JJe7b43diJrY
+2DX+8p9yWXCTz62sCeDWNahUv8AbXpMeJ8uqZPYcN1P0gSEUGu8xKmLOFf9kR7b
M4U1Gyz8QQbjd2lqnwiWIkvRLX6gyGVbq2zH0QbhUe5gg3qGUX7JjrhdDQ==
=AXUi
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"This is a large update by KVM standards, including AMD PSP (Platform
Security Processor, aka "AMD Secure Technology") and ARM CoreSight
(debug and trace) changes.
ARM:
- CoreSight: Add support for ETE and TRBE
- Stage-2 isolation for the host kernel when running in protected
mode
- Guest SVE support when running in nVHE mode
- Force W^X hypervisor mappings in nVHE mode
- ITS save/restore for guests using direct injection with GICv4.1
- nVHE panics now produce readable backtraces
- Guest support for PTP using the ptp_kvm driver
- Performance improvements in the S2 fault handler
x86:
- AMD PSP driver changes
- Optimizations and cleanup of nested SVM code
- AMD: Support for virtual SPEC_CTRL
- Optimizations of the new MMU code: fast invalidation, zap under
read lock, enable/disably dirty page logging under read lock
- /dev/kvm API for AMD SEV live migration (guest API coming soon)
- support SEV virtual machines sharing the same encryption context
- support SGX in virtual machines
- add a few more statistics
- improved directed yield heuristics
- Lots and lots of cleanups
Generic:
- Rework of MMU notifier interface, simplifying and optimizing the
architecture-specific code
- a handful of "Get rid of oprofile leftovers" patches
- Some selftests improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (379 commits)
KVM: selftests: Speed up set_memory_region_test
selftests: kvm: Fix the check of return value
KVM: x86: Take advantage of kvm_arch_dy_has_pending_interrupt()
KVM: SVM: Skip SEV cache flush if no ASIDs have been used
KVM: SVM: Remove an unnecessary prototype declaration of sev_flush_asids()
KVM: SVM: Drop redundant svm_sev_enabled() helper
KVM: SVM: Move SEV VMCB tracking allocation to sev.c
KVM: SVM: Explicitly check max SEV ASID during sev_hardware_setup()
KVM: SVM: Unconditionally invoke sev_hardware_teardown()
KVM: SVM: Enable SEV/SEV-ES functionality by default (when supported)
KVM: SVM: Condition sev_enabled and sev_es_enabled on CONFIG_KVM_AMD_SEV=y
KVM: SVM: Append "_enabled" to module-scoped SEV/SEV-ES control variables
KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features
KVM: SVM: Move SEV module params/variables to sev.c
KVM: SVM: Disable SEV/SEV-ES if NPT is disabled
KVM: SVM: Free sev_asid_bitmap during init if SEV setup fails
KVM: SVM: Zero out the VMCB array used to track SEV ASID association
x86/sev: Drop redundant and potentially misleading 'sev_enabled'
KVM: x86: Move reverse CPUID helpers to separate header file
KVM: x86: Rename GPR accessors to make mode-aware variants the defaults
...
Newer CPUs provide a second mechanism to detect operations with lock
prefix which go accross a cache line boundary. Such operations have to
take bus lock which causes a system wide performance degradation when
these operations happen frequently.
The new mechanism is not using the #AC exception. It triggers #DB and is
restricted to operations in user space. Kernel side split lock access can
only be detected by the #AC based variant. Contrary to the #AC based
mechanism the #DB based variant triggers _after_ the instruction was
executed. The mechanism is CPUID enumerated and contrary to the #AC
version which is based on the magic TEST_CTRL_MSR and model/family based
enumeration on the way to become architectural.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmCGkr8THHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodUKD/9tUXhInR7+1ykEHpMvdmSp48vqY3nc
sKmT22pPl+OchnJ62mw3T8gKpBYVleJmcCaY2qVx7hfaVcWApLGJvX4tmfXmv422
XDSJ6b8Os6wfgx5FR//I17z8ZtXnnuKkPrTMoRsQUw2qLq31y6fdQv+GW/cc1Kpw
mengjmPE+HnpaKbtuQfPdc4a+UvLjvzBMAlDZPTBPKYrP4FFqYVnUVwyTg5aLVDY
gHz4V8+b502RS/zPfTAtE3J848od+NmcUPdFlcG9DVA+hR0Rl0thvruCTFiD2vVh
i9DJ7INof5FoJDEzh0dGsD7x+MB6OY8GZyHdUMeGgIRPtWkqrG52feQQIn2YYlaL
fB3DlpNv7NIJ/0JMlALvh8S0tEoOcYdHqH+M/3K/zbzecg/FAo+lVo8WciGLPqWs
ykUG5/f/OnlTvgB8po1ebJu0h0jHnoK9heWWXk9zWIRVDPXHFOWKW3kSbTTb3icR
9hfjP/SNejpmt9Ju1OTwsgnV7NALIdVX+G5jyIEsjFl31Co1RZNYhHLFvi11FWlQ
/ssvFK9O5ZkliocGCAN9+yuOnM26VqWSCE4fis6/2aSgD2Y4Gpvb//cP96SrcNAH
u8eXNvGLlniJP3F3JImWIfIPQTrpvQhcU4eZ6NtviXqj/utQXX6c9PZ1PLYpcvUh
9AWF8rwhT8X4oA==
=lmi8
-----END PGP SIGNATURE-----
Merge tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 bus lock detection updates from Thomas Gleixner:
"Support for enhanced split lock detection:
Newer CPUs provide a second mechanism to detect operations with lock
prefix which go accross a cache line boundary. Such operations have to
take bus lock which causes a system wide performance degradation when
these operations happen frequently.
The new mechanism is not using the #AC exception. It triggers #DB and
is restricted to operations in user space. Kernel side split lock
access can only be detected by the #AC based variant.
Contrary to the #AC based mechanism the #DB based variant triggers
_after_ the instruction was executed. The mechanism is CPUID
enumerated and contrary to the #AC version which is based on the magic
TEST_CTRL_MSR and model/family based enumeration on the way to become
architectural"
* tag 'x86-splitlock-2021-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Documentation/admin-guide: Change doc for split_lock_detect parameter
x86/traps: Handle #DB for bus lock
x86/cpufeatures: Enumerate #DB for bus lock detection
Add support for handling VM-Exits that originate from a guest SGX
enclave. In SGX, an "enclave" is a new CPL3-only execution environment,
wherein the CPU and memory state is protected by hardware to make the
state inaccesible to code running outside of the enclave. When exiting
an enclave due to an asynchronous event (from the perspective of the
enclave), e.g. exceptions, interrupts, and VM-Exits, the enclave's state
is automatically saved and scrubbed (the CPU loads synthetic state), and
then reloaded when re-entering the enclave. E.g. after an instruction
based VM-Exit from an enclave, vmcs.GUEST_RIP will not contain the RIP
of the enclave instruction that trigered VM-Exit, but will instead point
to a RIP in the enclave's untrusted runtime (the guest userspace code
that coordinates entry/exit to/from the enclave).
To help a VMM recognize and handle exits from enclaves, SGX adds bits to
existing VMCS fields, VM_EXIT_REASON.VMX_EXIT_REASON_FROM_ENCLAVE and
GUEST_INTERRUPTIBILITY_INFO.GUEST_INTR_STATE_ENCLAVE_INTR. Define the
new architectural bits, and add a boolean to struct vcpu_vmx to cache
VMX_EXIT_REASON_FROM_ENCLAVE. Clear the bit in exit_reason so that
checks against exit_reason do not need to account for SGX, e.g.
"if (exit_reason == EXIT_REASON_EXCEPTION_NMI)" continues to work.
KVM is a largely a passive observer of the new bits, e.g. KVM needs to
account for the bits when propagating information to a nested VMM, but
otherwise doesn't need to act differently for the majority of VM-Exits
from enclaves.
The one scenario that is directly impacted is emulation, which is for
all intents and purposes impossible[1] since KVM does not have access to
the RIP or instruction stream that triggered the VM-Exit. The inability
to emulate is a non-issue for KVM, as most instructions that might
trigger VM-Exit unconditionally #UD in an enclave (before the VM-Exit
check. For the few instruction that conditionally #UD, KVM either never
sets the exiting control, e.g. PAUSE_EXITING[2], or sets it if and only
if the feature is not exposed to the guest in order to inject a #UD,
e.g. RDRAND_EXITING.
But, because it is still possible for a guest to trigger emulation,
e.g. MMIO, inject a #UD if KVM ever attempts emulation after a VM-Exit
from an enclave. This is architecturally accurate for instruction
VM-Exits, and for MMIO it's the least bad choice, e.g. it's preferable
to killing the VM. In practice, only broken or particularly stupid
guests should ever encounter this behavior.
Add a WARN in skip_emulated_instruction to detect any attempt to
modify the guest's RIP during an SGX enclave VM-Exit as all such flows
should either be unreachable or must handle exits from enclaves before
getting to skip_emulated_instruction.
[1] Impossible for all practical purposes. Not truly impossible
since KVM could implement some form of para-virtualization scheme.
[2] PAUSE_LOOP_EXITING only affects CPL0 and enclaves exist only at
CPL3, so we also don't need to worry about that interaction.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Message-Id: <315f54a8507d09c292463ef29104e1d4c62e9090.1618196135.git.kai.huang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bus locks degrade performance for the whole system, not just for the CPU
that requested the bus lock. Two CPU features "#AC for split lock" and
"#DB for bus lock" provide hooks so that the operating system may choose
one of several mitigation strategies.
#AC for split lock is already implemented. Add code to use the #DB for
bus lock feature to cover additional situations with new options to
mitigate.
split_lock_detect=
#AC for split lock #DB for bus lock
off Do nothing Do nothing
warn Kernel OOPs Warn once per task and
Warn once per task and and continues to run.
disable future checking
When both features are
supported, warn in #AC
fatal Kernel OOPs Send SIGBUS to user.
Send SIGBUS to user
When both features are
supported, fatal in #AC
ratelimit:N Do nothing Limit bus lock rate to
N per second in the
current non-root user.
Default option is "warn".
Hardware only generates #DB for bus lock detect when CPL>0 to avoid
nested #DB from multiple bus locks while the first #DB is being handled.
So no need to handle #DB for bus lock detected in the kernel.
#DB for bus lock is enabled by bus lock detection bit 2 in DEBUGCTL MSR
while #AC for split lock is enabled by split lock detection bit 29 in
TEST_CTRL MSR.
Both breakpoint and bus lock in the same instruction can trigger one #DB.
The bus lock is handled before the breakpoint in the #DB handler.
Delivery of #DB for bus lock in userspace clears DR6[11], which is set by
the #DB handler right after reading DR6.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20210322135325.682257-3-fenghua.yu@intel.com
Fix ~144 single-word typos in arch/x86/ code comments.
Doing this in a single commit should reduce the churn.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
- Support for userspace to emulate Xen hypercalls
- Raise the maximum number of user memslots
- Scalability improvements for the new MMU. Instead of the complex
"fast page fault" logic that is used in mmu.c, tdp_mmu.c uses an
rwlock so that page faults are concurrent, but the code that can run
against page faults is limited. Right now only page faults take the
lock for reading; in the future this will be extended to some
cases of page table destruction. I hope to switch the default MMU
around 5.12-rc3 (some testing was delayed due to Chinese New Year).
- Cleanups for MAXPHYADDR checks
- Use static calls for vendor-specific callbacks
- On AMD, use VMLOAD/VMSAVE to save and restore host state
- Stop using deprecated jump label APIs
- Workaround for AMD erratum that made nested virtualization unreliable
- Support for LBR emulation in the guest
- Support for communicating bus lock vmexits to userspace
- Add support for SEV attestation command
- Miscellaneous cleanups
PPC:
- Support for second data watchpoint on POWER10
- Remove some complex workarounds for buggy early versions of POWER9
- Guest entry/exit fixes
ARM64
- Make the nVHE EL2 object relocatable
- Cleanups for concurrent translation faults hitting the same page
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Simplification of the early init hypercall handling
Non-KVM changes (with acks):
- Detection of contended rwlocks (implemented only for qrwlocks,
because KVM only needs it for x86)
- Allow __DISABLE_EXPORTS from assembly code
- Provide a saner follow_pfn replacements for modules
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmApSRgUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroOc7wf9FnlinKoTFaSk7oeuuhF/CoCVwSFs
Z9+A2sNI99tWHQxFR6dyDkEFeQoXnqSxfLHtUVIdH/JnTg0FkEvFz3NK+0PzY1PF
PnGNbSoyhP58mSBG4gbBAxdF3ZJZMB8GBgYPeR62PvMX2dYbcHqVBNhlf6W4MQK4
5mAUuAnbf19O5N267sND+sIg3wwJYwOZpRZB7PlwvfKAGKf18gdBz5dQ/6Ej+apf
P7GODZITjqM5Iho7SDm/sYJlZprFZT81KqffwJQHWFMEcxFgwzrnYPx7J3gFwRTR
eeh9E61eCBDyCTPpHROLuNTVBqrAioCqXLdKOtO5gKvZI3zmomvAsZ8uXQ==
=uFZU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"x86:
- Support for userspace to emulate Xen hypercalls
- Raise the maximum number of user memslots
- Scalability improvements for the new MMU.
Instead of the complex "fast page fault" logic that is used in
mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent,
but the code that can run against page faults is limited. Right now
only page faults take the lock for reading; in the future this will
be extended to some cases of page table destruction. I hope to
switch the default MMU around 5.12-rc3 (some testing was delayed
due to Chinese New Year).
- Cleanups for MAXPHYADDR checks
- Use static calls for vendor-specific callbacks
- On AMD, use VMLOAD/VMSAVE to save and restore host state
- Stop using deprecated jump label APIs
- Workaround for AMD erratum that made nested virtualization
unreliable
- Support for LBR emulation in the guest
- Support for communicating bus lock vmexits to userspace
- Add support for SEV attestation command
- Miscellaneous cleanups
PPC:
- Support for second data watchpoint on POWER10
- Remove some complex workarounds for buggy early versions of POWER9
- Guest entry/exit fixes
ARM64:
- Make the nVHE EL2 object relocatable
- Cleanups for concurrent translation faults hitting the same page
- Support for the standard TRNG hypervisor call
- A bunch of small PMU/Debug fixes
- Simplification of the early init hypercall handling
Non-KVM changes (with acks):
- Detection of contended rwlocks (implemented only for qrwlocks,
because KVM only needs it for x86)
- Allow __DISABLE_EXPORTS from assembly code
- Provide a saner follow_pfn replacements for modules"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits)
KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes
KVM: selftests: Don't bother mapping GVA for Xen shinfo test
KVM: selftests: Fix hex vs. decimal snafu in Xen test
KVM: selftests: Fix size of memslots created by Xen tests
KVM: selftests: Ignore recently added Xen tests' build output
KVM: selftests: Add missing header file needed by xAPIC IPI tests
KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c
KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static
locking/arch: Move qrwlock.h include after qspinlock.h
KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests
KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries
KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2
KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
KVM: PPC: remove unneeded semicolon
KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB
KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest
KVM: PPC: Book3S HV: Fix radix guest SLB side channel
KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support
KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR
KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR
...
Virtual Machine can exploit bus locks to degrade the performance of
system. Bus lock can be caused by split locked access to writeback(WB)
memory or by using locks on uncacheable(UC) memory. The bus lock is
typically >1000 cycles slower than an atomic operation within a cache
line. It also disrupts performance on other cores (which must wait for
the bus lock to be released before their memory operations can
complete).
To address the threat, bus lock VM exit is introduced to notify the VMM
when a bus lock was acquired, allowing it to enforce throttling or other
policy based mitigations.
A VMM can enable VM exit due to bus locks by setting a new "Bus Lock
Detection" VM-execution control(bit 30 of Secondary Processor-based VM
execution controls). If delivery of this VM exit was preempted by a
higher priority VM exit (e.g. EPT misconfiguration, EPT violation, APIC
access VM exit, APIC write VM exit, exception bitmap exiting), bit 26 of
exit reason in vmcs field is set to 1.
In current implementation, the KVM exposes this capability through
KVM_CAP_X86_BUS_LOCK_EXIT. The user can get the supported mode bitmap
(i.e. off and exit) and enable it explicitly (disabled by default). If
bus locks in guest are detected by KVM, exit to user space even when
current exit reason is handled by KVM internally. Set a new field
KVM_RUN_BUS_LOCK in vcpu->run->flags to inform the user space that there
is a bus lock detected in guest.
Document for Bus Lock VM exit is now available at the latest "Intel
Architecture Instruction Set Extensions Programming Reference".
Document Link:
https://software.intel.com/content/www/us/en/develop/download/intel-architecture-instruction-set-extensions-programming-reference.html
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Message-Id: <20201106090315.18606-4-chenyi.qiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The implementation was rather buggy. It unconditionally marked PTEs
read-only, even for VM_SHARED mappings. I'm not sure whether this is
actually a problem, but it certainly seems unwise. More importantly, it
released the mmap lock before flushing the TLB, which could allow a racing
CoW operation to falsely believe that the underlying memory was not
writable.
I can't find any users at all of this mechanism, so just remove it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Stas Sergeev <stsp2@yandex.ru>
Link: https://lkml.kernel.org/r/f3086de0babcab36f69949b5780bde851f719bc8.1611078018.git.luto@kernel.org
* PSCI relay at EL2 when "protected KVM" is enabled
* New exception injection code
* Simplification of AArch32 system register handling
* Fix PMU accesses when no PMU is enabled
* Expose CSV3 on non-Meltdown hosts
* Cache hierarchy discovery fixes
* PV steal-time cleanups
* Allow function pointers at EL2
* Various host EL2 entry cleanups
* Simplification of the EL2 vector allocation
s390:
* memcg accouting for s390 specific parts of kvm and gmap
* selftest for diag318
* new kvm_stat for when async_pf falls back to sync
x86:
* Tracepoints for the new pagetable code from 5.10
* Catch VFIO and KVM irqfd events before userspace
* Reporting dirty pages to userspace with a ring buffer
* SEV-ES host support
* Nested VMX support for wait-for-SIPI activity state
* New feature flag (AVX512 FP16)
* New system ioctl to report Hyper-V-compatible paravirtualization features
Generic:
* Selftest improvements
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl/bdL4UHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNgQQgAnTH6rhXa++Zd5F0EM2NwXwz3iEGb
lOq1DZSGjs6Eekjn8AnrWbmVQr+CBCuGU9MrxpSSzNDK/awryo3NwepOWAZw9eqk
BBCVwGBbJQx5YrdgkGC0pDq2sNzcpW/VVB3vFsmOxd9eHblnuKSIxEsCCXTtyqIt
XrLpQ1UhvI4yu102fDNhuFw2EfpzXm+K0Lc0x6idSkdM/p7SyeOxiv8hD4aMr6+G
bGUQuMl4edKZFOWFigzr8NovQAvDHZGrwfihu2cLRYKLhV97QuWVmafv/yYfXcz2
drr+wQCDNzDOXyANnssmviazrhOX0QmTAhbIXGGX/kTxYKcfPi83ZLoI3A==
=ISud
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Much x86 work was pushed out to 5.12, but ARM more than made up for it.
ARM:
- PSCI relay at EL2 when "protected KVM" is enabled
- New exception injection code
- Simplification of AArch32 system register handling
- Fix PMU accesses when no PMU is enabled
- Expose CSV3 on non-Meltdown hosts
- Cache hierarchy discovery fixes
- PV steal-time cleanups
- Allow function pointers at EL2
- Various host EL2 entry cleanups
- Simplification of the EL2 vector allocation
s390:
- memcg accouting for s390 specific parts of kvm and gmap
- selftest for diag318
- new kvm_stat for when async_pf falls back to sync
x86:
- Tracepoints for the new pagetable code from 5.10
- Catch VFIO and KVM irqfd events before userspace
- Reporting dirty pages to userspace with a ring buffer
- SEV-ES host support
- Nested VMX support for wait-for-SIPI activity state
- New feature flag (AVX512 FP16)
- New system ioctl to report Hyper-V-compatible paravirtualization features
Generic:
- Selftest improvements"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
KVM: SVM: fix 32-bit compilation
KVM: SVM: Add AP_JUMP_TABLE support in prep for AP booting
KVM: SVM: Provide support to launch and run an SEV-ES guest
KVM: SVM: Provide an updated VMRUN invocation for SEV-ES guests
KVM: SVM: Provide support for SEV-ES vCPU loading
KVM: SVM: Provide support for SEV-ES vCPU creation/loading
KVM: SVM: Update ASID allocation to support SEV-ES guests
KVM: SVM: Set the encryption mask for the SVM host save area
KVM: SVM: Add NMI support for an SEV-ES guest
KVM: SVM: Guest FPU state save/restore not needed for SEV-ES guest
KVM: SVM: Do not report support for SMM for an SEV-ES guest
KVM: x86: Update __get_sregs() / __set_sregs() to support SEV-ES
KVM: SVM: Add support for CR8 write traps for an SEV-ES guest
KVM: SVM: Add support for CR4 write traps for an SEV-ES guest
KVM: SVM: Add support for CR0 write traps for an SEV-ES guest
KVM: SVM: Add support for EFER write traps for an SEV-ES guest
KVM: SVM: Support string IO operations for an SEV-ES guest
KVM: SVM: Support MMIO for an SEV-ES guest
KVM: SVM: Create trace events for VMGEXIT MSR protocol processing
KVM: SVM: Create trace events for VMGEXIT processing
...
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR8 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <5a01033f4c8b3106ca9374b7cadf8e33da852df1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES guests introduce new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR4 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c3880bf2db8693aa26f648528fbc6e967ab46e25.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of control register write access
is not recommended. Control register interception occurs prior to the
control register being modified and the hypervisor is unable to modify
the control register itself because the register is located in the
encrypted register state.
SEV-ES support introduces new control register write traps. These traps
provide intercept support of a control register write after the control
register has been modified. The new control register value is provided in
the VMCB EXITINFO1 field, allowing the hypervisor to track the setting
of the guest control registers.
Add support to track the value of the guest CR0 register using the control
register write trap so that the hypervisor understands the guest operating
mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <182c9baf99df7e40ad9617ff90b84542705ef0d7.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For SEV-ES guests, the interception of EFER write access is not
recommended. EFER interception occurs prior to EFER being modified and
the hypervisor is unable to modify EFER itself because the register is
located in the encrypted register state.
SEV-ES support introduces a new EFER write trap. This trap provides
intercept support of an EFER write after it has been modified. The new
EFER value is provided in the VMCB EXITINFO1 field, allowing the
hypervisor to track the setting of the guest EFER.
Add support to track the value of the guest EFER value using the EFER
write trap so that the hypervisor understands the guest operating mode.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <8993149352a3a87cd0625b3b61bfd31ab28977e1.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SEV-ES adds a new VMEXIT reason code, VMGEXIT. Initial support for a
VMGEXIT includes mapping the GHCB based on the guest GPA, which is
obtained from a new VMCB field, and then validating the required inputs
for the VMGEXIT exit reason.
Since many of the VMGEXIT exit reasons correspond to existing VMEXIT
reasons, the information from the GHCB is copied into the VMCB control
exit code areas and KVM register areas. The standard exit handlers are
invoked, similar to standard VMEXIT processing. Before restarting the
vCPU, the GHCB is updated with any registers that have been updated by
the hypervisor.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <c6a4ed4294a369bd75c44d03bd7ce0f0c3840e50.1607620209.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Expose tag address bits in siginfo. The original arm64 ABI did not
expose any of the bits 63:56 of a tagged address in siginfo. In the
presence of user ASAN or MTE, this information may be useful. The
implementation is generic to other architectures supporting tags (like
SPARC ADI, subject to wiring up the arch code). The user will have to
opt in via sigaction(SA_EXPOSE_TAGBITS) so that the extra bits, if
available, become visible in si_addr.
- Default to 32-bit wide ZONE_DMA. Previously, ZONE_DMA was set to the
lowest 1GB to cope with the Raspberry Pi 4 limitations, to the
detriment of other platforms. With these changes, the kernel scans the
Device Tree dma-ranges and the ACPI IORT information before deciding
on a smaller ZONE_DMA.
- Strengthen READ_ONCE() to acquire when CONFIG_LTO=y. When building
with LTO, there is an increased risk of the compiler converting an
address dependency headed by a READ_ONCE() invocation into a control
dependency and consequently allowing for harmful reordering by the
CPU.
- Add CPPC FFH support using arm64 AMU counters.
- set_fs() removal on arm64. This renders the User Access Override (UAO)
ARMv8 feature unnecessary.
- Perf updates: PMU driver for the ARM DMC-620 memory controller, sysfs
identifier file for SMMUv3, stop event counters support for i.MX8MP,
enable the perf events-based hard lockup detector.
- Reorganise the kernel VA space slightly so that 52-bit VA
configurations can use more virtual address space.
- Improve the robustness of the arm64 memory offline event notifier.
- Pad the Image header to 64K following the EFI header definition
updated recently to increase the section alignment to 64K.
- Support CONFIG_CMDLINE_EXTEND on arm64.
- Do not use tagged PC in the kernel (TCR_EL1.TBID1==1), freeing up 8
bits for PtrAuth.
- Switch to vmapped shadow call stacks.
- Miscellaneous clean-ups.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE5RElWfyWxS+3PLO2a9axLQDIXvEFAl/XcSgACgkQa9axLQDI
XvGkwg//SLknimELD/cphf2UzZm5RFuCU0x1UnIXs9XYo5BrOpgVLLA//+XkCrKN
0GLAdtBDfw1axWJudzgMBiHrv6wSGh4p3YWjLIW06u/PJu3m3U8oiiolvvF8d7Yq
UKDseKGQnQkrl97J0SyA+Da/u8D11GEzp52SWL5iRxzt6vInEC27iTOp9n1yoaoP
f3y7qdp9kv831ryUM3rXFYpc8YuMWXk+JpBSNaxqmjlvjMzipA5PhzBLmNzfc657
XcrRX5qsgjEeJW8UUnWUVNB42j7tVzN77yraoUpoVVCzZZeWOQxqq5EscKPfIhRt
AjtSIQNOs95ZVE0SFCTjXnUUb823coUs4dMCdftqlE62JNRwdR+3bkfa+QjPTg1F
O9ohW1AzX0/JB19QBxMaOgbheB8GFXh3DVJ6pizTgxJgyPvQQtFuEhT1kq8Cst0U
Pe+pEWsg9t41bUXNz+/l9tUWKWpeCfFNMTrBXLmXrNlTLeOvDh/0UiF0+2lYJYgf
YAboibQ5eOv2wGCcSDEbNMJ6B2/6GtubDJxH4du680F6Emb6pCSw0ntPwB7mSGLG
5dXz+9FJxDLjmxw7BXxQgc5MoYIrt5JQtaOQ6UxU8dPy53/+py4Ck6tXNkz0+Ap7
gPPaGGy1GqobQFu3qlHtOK1VleQi/sWcrpmPHrpiiFUf6N7EmcY=
=zXFk
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Catalin Marinas:
- Expose tag address bits in siginfo. The original arm64 ABI did not
expose any of the bits 63:56 of a tagged address in siginfo. In the
presence of user ASAN or MTE, this information may be useful. The
implementation is generic to other architectures supporting tags
(like SPARC ADI, subject to wiring up the arch code). The user will
have to opt in via sigaction(SA_EXPOSE_TAGBITS) so that the extra
bits, if available, become visible in si_addr.
- Default to 32-bit wide ZONE_DMA. Previously, ZONE_DMA was set to the
lowest 1GB to cope with the Raspberry Pi 4 limitations, to the
detriment of other platforms. With these changes, the kernel scans
the Device Tree dma-ranges and the ACPI IORT information before
deciding on a smaller ZONE_DMA.
- Strengthen READ_ONCE() to acquire when CONFIG_LTO=y. When building
with LTO, there is an increased risk of the compiler converting an
address dependency headed by a READ_ONCE() invocation into a control
dependency and consequently allowing for harmful reordering by the
CPU.
- Add CPPC FFH support using arm64 AMU counters.
- set_fs() removal on arm64. This renders the User Access Override
(UAO) ARMv8 feature unnecessary.
- Perf updates: PMU driver for the ARM DMC-620 memory controller, sysfs
identifier file for SMMUv3, stop event counters support for i.MX8MP,
enable the perf events-based hard lockup detector.
- Reorganise the kernel VA space slightly so that 52-bit VA
configurations can use more virtual address space.
- Improve the robustness of the arm64 memory offline event notifier.
- Pad the Image header to 64K following the EFI header definition
updated recently to increase the section alignment to 64K.
- Support CONFIG_CMDLINE_EXTEND on arm64.
- Do not use tagged PC in the kernel (TCR_EL1.TBID1==1), freeing up 8
bits for PtrAuth.
- Switch to vmapped shadow call stacks.
- Miscellaneous clean-ups.
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (78 commits)
perf/imx_ddr: Add system PMU identifier for userspace
bindings: perf: imx-ddr: add compatible string
arm64: Fix build failure when HARDLOCKUP_DETECTOR_PERF is enabled
arm64: mte: fix prctl(PR_GET_TAGGED_ADDR_CTRL) if TCF0=NONE
arm64: mark __system_matches_cap as __maybe_unused
arm64: uaccess: remove vestigal UAO support
arm64: uaccess: remove redundant PAN toggling
arm64: uaccess: remove addr_limit_user_check()
arm64: uaccess: remove set_fs()
arm64: uaccess cleanup macro naming
arm64: uaccess: split user/kernel routines
arm64: uaccess: refactor __{get,put}_user
arm64: uaccess: simplify __copy_user_flushcache()
arm64: uaccess: rename privileged uaccess routines
arm64: sdei: explicitly simulate PAN/UAO entry
arm64: sdei: move uaccess logic to arch/arm64/
arm64: head.S: always initialize PSTATE
arm64: head.S: cleanup SCTLR_ELx initialization
arm64: head.S: rename el2_setup -> init_kernel_el
arm64: add C wrappers for SET_PSTATE_*()
...
Most architectures with the exception of alpha, mips, parisc and
sparc use the same values for these flags. Move their definitions into
asm-generic/signal-defs.h and allow the architectures with non-standard
values to override them. Also, document the non-standard flag values
in order to make it easier to add new generic flags in the future.
A consequence of this change is that on powerpc and x86, the constants'
values aside from SA_RESETHAND change signedness from unsigned
to signed. This is not expected to impact realistic use of these
constants. In particular the typical use of the constants where they
are or'ed together and assigned to sa_flags (or another int variable)
would not be affected.
Signed-off-by: Peter Collingbourne <pcc@google.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Reviewed-by: Dave Martin <Dave.Martin@arm.com>
Link: https://linux-review.googlesource.com/id/Ia3849f18b8009bf41faca374e701cdca36974528
Link: https://lkml.kernel.org/r/b6d0d1ec34f9ee93e1105f14f288fba5f89d1f24.1605235762.git.pcc@google.com
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Enclaves encounter exceptions for lots of reasons: everything from enclave
page faults to NULL pointer dereferences, to system calls that must be
“proxied” to the kernel from outside the enclave.
In addition to the code contained inside an enclave, there is also
supporting code outside the enclave called an “SGX runtime”, which is
virtually always implemented inside a shared library. The runtime helps
build the enclave and handles things like *re*building the enclave if it
got destroyed by something like a suspend/resume cycle.
The rebuilding has traditionally been handled in SIGSEGV handlers,
registered by the library. But, being process-wide, shared state, signal
handling and shared libraries do not mix well.
Introduce a vDSO function call that wraps the enclave entry functions
(EENTER/ERESUME functions of the ENCLU instruciton) and returns information
about any exceptions to the caller in the SGX runtime.
Instead of generating a signal, the kernel places exception information in
RDI, RSI and RDX. The kernel-provided userspace portion of the vDSO handler
will place this information in a user-provided buffer or trigger a
user-provided callback at the time of the exception.
The vDSO function calling convention uses the standard RDI RSI, RDX, RCX,
R8 and R9 registers. This makes it possible to declare the vDSO as a C
prototype, but other than that there is no specific support for SystemV
ABI. Things like storing XSAVE are the responsibility of the enclave and
the runtime.
[ bp: Change vsgx.o build dependency to CONFIG_X86_SGX. ]
Suggested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Cedric Xing <cedric.xing@intel.com>
Signed-off-by: Cedric Xing <cedric.xing@intel.com>
Co-developed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-20-jarkko@kernel.org
The whole point of SGX is to create a hardware protected place to do
“stuff”. But, before someone is willing to hand over the keys to
the castle , an enclave must often prove that it is running on an
SGX-protected processor. Provisioning enclaves play a key role in
providing proof.
There are actually three different enclaves in play in order to make this
happen:
1. The application enclave. The familiar one we know and love that runs
the actual code that’s doing real work. There can be many of these on
a single system, or even in a single application.
2. The quoting enclave (QE). The QE is mentioned in lots of silly
whitepapers, but, for the purposes of kernel enabling, just pretend they
do not exist.
3. The provisioning enclave. There is typically only one of these
enclaves per system. Provisioning enclaves have access to a special
hardware key.
They can use this key to help to generate certificates which serve as
proof that enclaves are running on trusted SGX hardware. These
certificates can be passed around without revealing the special key.
Any user who can create a provisioning enclave can access the
processor-unique Provisioning Certificate Key which has privacy and
fingerprinting implications. Even if a user is permitted to create
normal application enclaves (via /dev/sgx_enclave), they should not be
able to create provisioning enclaves. That means a separate permissions
scheme is needed to control provisioning enclave privileges.
Implement a separate device file (/dev/sgx_provision) which allows
creating provisioning enclaves. This device will typically have more
strict permissions than the plain enclave device.
The actual device “driver” is an empty stub. Open file descriptors for
this device will represent a token which allows provisioning enclave duty.
This file descriptor can be passed around and ultimately given as an
argument to the /dev/sgx_enclave driver ioctl().
[ bp: Touchups. ]
Suggested-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: linux-security-module@vger.kernel.org
Link: https://lkml.kernel.org/r/20201112220135.165028-16-jarkko@kernel.org
Enclaves have two basic states. They are either being built and are
malleable and can be modified by doing things like adding pages. Or,
they are locked down and not accepting changes. They can only be run
after they have been locked down. The ENCLS[EINIT] function induces the
transition from being malleable to locked-down.
Add an ioctl() that performs ENCLS[EINIT]. After this, new pages can
no longer be added with ENCLS[EADD]. This is also the time where the
enclave can be measured to verify its integrity.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-15-jarkko@kernel.org
SGX enclave pages are inaccessible to normal software. They must be
populated with data by copying from normal memory with the help of the
EADD and EEXTEND functions of the ENCLS instruction.
Add an ioctl() which performs EADD that adds new data to an enclave, and
optionally EEXTEND functions that hash the page contents and use the
hash as part of enclave “measurement” to ensure enclave integrity.
The enclave author gets to decide which pages will be included in the
enclave measurement with EEXTEND. Measurement is very slow and has
sometimes has very little value. For instance, an enclave _could_
measure every page of data and code, but would be slow to initialize.
Or, it might just measure its code and then trust that code to
initialize the bulk of its data after it starts running.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-14-jarkko@kernel.org
Add an ioctl() that performs the ECREATE function of the ENCLS
instruction, which creates an SGX Enclave Control Structure (SECS).
Although the SECS is an in-memory data structure, it is present in
enclave memory and is not directly accessible by software.
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Jethro Beekman <jethro@fortanix.com>
Link: https://lkml.kernel.org/r/20201112220135.165028-13-jarkko@kernel.org
This patch is heavily based on previous work from Lei Cao
<lei.cao@stratus.com> and Paolo Bonzini <pbonzini@redhat.com>. [1]
KVM currently uses large bitmaps to track dirty memory. These bitmaps
are copied to userspace when userspace queries KVM for its dirty page
information. The use of bitmaps is mostly sufficient for live
migration, as large parts of memory are be dirtied from one log-dirty
pass to another. However, in a checkpointing system, the number of
dirty pages is small and in fact it is often bounded---the VM is
paused when it has dirtied a pre-defined number of pages. Traversing a
large, sparsely populated bitmap to find set bits is time-consuming,
as is copying the bitmap to user-space.
A similar issue will be there for live migration when the guest memory
is huge while the page dirty procedure is trivial. In that case for
each dirty sync we need to pull the whole dirty bitmap to userspace
and analyse every bit even if it's mostly zeros.
The preferred data structure for above scenarios is a dense list of
guest frame numbers (GFN). This patch series stores the dirty list in
kernel memory that can be memory mapped into userspace to allow speedy
harvesting.
This patch enables dirty ring for X86 only. However it should be
easily extended to other archs as well.
[1] https://patchwork.kernel.org/patch/10471409/
Signed-off-by: Lei Cao <lei.cao@stratus.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201001012222.5767-1-peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Background: We have a lightweight HV, it needs INIT-VMExit and
SIPI-VMExit to wake-up APs for guests since it do not monitor
the Local APIC. But currently virtual wait-for-SIPI(WFS) state
is not supported in nVMX, so when running on top of KVM, the L1
HV cannot receive the INIT-VMExit and SIPI-VMExit which cause
the L2 guest cannot wake up the APs.
According to Intel SDM Chapter 25.2 Other Causes of VM Exits,
SIPIs cause VM exits when a logical processor is in
wait-for-SIPI state.
In this patch:
1. introduce SIPI exit reason,
2. introduce wait-for-SIPI state for nVMX,
3. advertise wait-for-SIPI support to guest.
When L1 hypervisor is not monitoring Local APIC, L0 need to emulate
INIT-VMExit and SIPI-VMExit to L1 to emulate INIT-SIPI-SIPI for
L2. L2 LAPIC write would be traped by L0 Hypervisor(KVM), L0 should
emulate the INIT/SIPI vmexit to L1 hypervisor to set proper state
for L2's vcpu state.
Handle procdure:
Source vCPU:
L2 write LAPIC.ICR(INIT).
L0 trap LAPIC.ICR write(INIT): inject a latched INIT event to target
vCPU.
Target vCPU:
L0 emulate an INIT VMExit to L1 if is guest mode.
L1 set guest VMCS, guest_activity_state=WAIT_SIPI, vmresume.
L0 set vcpu.mp_state to INIT_RECEIVED if (vmcs12.guest_activity_state
== WAIT_SIPI).
Source vCPU:
L2 write LAPIC.ICR(SIPI).
L0 trap LAPIC.ICR write(INIT): inject a latched SIPI event to traget
vCPU.
Target vCPU:
L0 emulate an SIPI VMExit to L1 if (vcpu.mp_state == INIT_RECEIVED).
L1 set CS:IP, guest_activity_state=ACTIVE, vmresume.
L0 resume to L2.
L2 start-up.
Signed-off-by: Yadong Qi <yadong.qi@intel.com>
Message-Id: <20200922052343.84388-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20201106065122.403183-1-yadong.qi@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No functional change; just reserve the feature bit for now so that VMMs
can start to implement it.
This will allow the host to indicate that MSI emulation supports 15-bit
destination IDs, allowing up to 32768 CPUs without interrupt remapping.
cf. https://patchwork.kernel.org/patch/11816693/ for qemu
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <4cd59bed05f4b7410d3d1ffd1e997ab53683874d.camel@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- New page table code for both hypervisor and guest stage-2
- Introduction of a new EL2-private host context
- Allow EL2 to have its own private per-CPU variables
- Support of PMU event filtering
- Complete rework of the Spectre mitigation
PPC:
- Fix for running nested guests with in-kernel IRQ chip
- Fix race condition causing occasional host hard lockup
- Minor cleanups and bugfixes
x86:
- allow trapping unknown MSRs to userspace
- allow userspace to force #GP on specific MSRs
- INVPCID support on AMD
- nested AMD cleanup, on demand allocation of nested SVM state
- hide PV MSRs and hypercalls for features not enabled in CPUID
- new test for MSR_IA32_TSC writes from host and guest
- cleanups: MMU, CPUID, shared MSRs
- LAPIC latency optimizations ad bugfixes
For x86, also included in this pull request is a new alternative and
(in the future) more scalable implementation of extended page tables
that does not need a reverse map from guest physical addresses to
host physical addresses. For now it is disabled by default because
it is still lacking a few of the existing MMU's bells and whistles.
However it is a very solid piece of work and it is already available
for people to hammer on it.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl+S8dsUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroM40Af+M46NJmuS5rcwFfybvK/c42KT6svX
Co1NrZDwzSQ2mMy3WQzH9qeLvb+nbY4sT3n5BPNPNsT+aIDPOTDt//qJ2/Ip9UUs
tRNea0MAR96JWLE7MSeeRxnTaQIrw/AAZC0RXFzZvxcgytXwdqBExugw4im+b+dn
Dcz8QxX1EkwT+4lTm5HC0hKZAuo4apnK1QkqCq4SdD2QVJ1YE6+z7pgj4wX7xitr
STKD6q/Yt/0ndwqS0GSGbyg0jy6mE620SN6isFRkJYwqfwLJci6KnqvEK67EcNMu
qeE017K+d93yIVC46/6TfVHzLR/D1FpQ8LZ16Yl6S13OuGIfAWBkQZtPRg==
=AD6a
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"For x86, there is a new alternative and (in the future) more scalable
implementation of extended page tables that does not need a reverse
map from guest physical addresses to host physical addresses.
For now it is disabled by default because it is still lacking a few of
the existing MMU's bells and whistles. However it is a very solid
piece of work and it is already available for people to hammer on it.
Other updates:
ARM:
- New page table code for both hypervisor and guest stage-2
- Introduction of a new EL2-private host context
- Allow EL2 to have its own private per-CPU variables
- Support of PMU event filtering
- Complete rework of the Spectre mitigation
PPC:
- Fix for running nested guests with in-kernel IRQ chip
- Fix race condition causing occasional host hard lockup
- Minor cleanups and bugfixes
x86:
- allow trapping unknown MSRs to userspace
- allow userspace to force #GP on specific MSRs
- INVPCID support on AMD
- nested AMD cleanup, on demand allocation of nested SVM state
- hide PV MSRs and hypercalls for features not enabled in CPUID
- new test for MSR_IA32_TSC writes from host and guest
- cleanups: MMU, CPUID, shared MSRs
- LAPIC latency optimizations ad bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits)
kvm: x86/mmu: NX largepage recovery for TDP MMU
kvm: x86/mmu: Don't clear write flooding count for direct roots
kvm: x86/mmu: Support MMIO in the TDP MMU
kvm: x86/mmu: Support write protection for nesting in tdp MMU
kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
kvm: x86/mmu: Support dirty logging for the TDP MMU
kvm: x86/mmu: Support changed pte notifier in tdp MMU
kvm: x86/mmu: Add access tracking for tdp_mmu
kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
kvm: x86/mmu: Add TDP MMU PF handler
kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
KVM: Cache as_id in kvm_memory_slot
kvm: x86/mmu: Add functions to handle changed TDP SPTEs
kvm: x86/mmu: Allocate and free TDP MMU roots
kvm: x86/mmu: Init / Uninit the TDP MMU
kvm: x86/mmu: Introduce tdp_iter
KVM: mmu: extract spte.h and spte.c
KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp
...
It's not desireable to have all MSRs always handled by KVM kernel space. Some
MSRs would be useful to handle in user space to either emulate behavior (like
uCode updates) or differentiate whether they are valid based on the CPU model.
To allow user space to specify which MSRs it wants to see handled by KVM,
this patch introduces a new ioctl to push filter rules with bitmaps into
KVM. Based on these bitmaps, KVM can then decide whether to reject MSR access.
With the addition of KVM_CAP_X86_USER_SPACE_MSR it can also deflect the
denied MSR events to user space to operate on.
If no filter is populated, MSR handling stays identical to before.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-8-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the following commits we will add pieces of MSR filtering.
To ensure that code compiles even with the feature half-merged, let's add
a few stubs and struct definitions before the real patches start.
Signed-off-by: Alexander Graf <graf@amazon.com>
Message-Id: <20200925143422.21718-4-graf@amazon.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The following intercept bit has been added to support VMEXIT
for INVPCID instruction:
Code Name Cause
A2h VMEXIT_INVPCID INVPCID instruction
The following bit has been added to the VMCB layout control area
to control intercept of INVPCID:
Byte Offset Bit(s) Function
14h 2 intercept INVPCID
Enable the interceptions when the the guest is running with shadow
page table enabled and handle the tlbflush based on the invpcid
instruction type.
For the guests with nested page table (NPT) support, the INVPCID
feature works as running it natively. KVM does not need to do any
special handling in this case.
AMD documentation for INVPCID feature is available at "AMD64
Architecture Programmer’s Manual Volume 2: System Programming,
Pub. 24593 Rev. 3.34(or later)"
The documentation can be obtained at the links below:
Link: https://www.amd.com/system/files/TechDocs/24593.pdf
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537
Signed-off-by: Babu Moger <babu.moger@amd.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <159985255929.11252.17346684135277453258.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When running under SEV-ES, the kernel has to tell the hypervisor when to
open the NMI window again after an NMI was injected. This is done with
an NMI-complete message to the hypervisor.
Add code to the kernel's NMI handler to send this message right at the
beginning of do_nmi(). This always allows nesting NMIs.
[ bp: Mark __sev_es_nmi_complete() noinstr:
vmlinux.o: warning: objtool: exc_nmi()+0x17: call to __sev_es_nmi_complete()
leaves .noinstr.text section
While at it, use __pa_nodebug() for the same reason due to
CONFIG_DEBUG_VIRTUAL=y:
vmlinux.o: warning: objtool: __sev_es_nmi_complete()+0xd9: call to __phys_addr()
leaves .noinstr.text section ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-71-joro@8bytes.org
Add a play_dead handler when running under SEV-ES. This is needed
because the hypervisor can't deliver an SIPI request to restart the AP.
Instead, the kernel has to issue a VMGEXIT to halt the VCPU until the
hypervisor wakes it up again.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-70-joro@8bytes.org
As part of the GHCB specification, the booting of APs under SEV-ES
requires an AP jump table when transitioning from one layer of code to
another (e.g. when going from UEFI to the OS). As a result, each layer
that parks an AP must provide the physical address of an AP jump table
to the next layer via the hypervisor.
Upon booting of the kernel, read the AP jump table address from the
hypervisor. Under SEV-ES, APs are started using the INIT-SIPI-SIPI
sequence. Before issuing the first SIPI request for an AP, the start
CS and IP is programmed into the AP jump table. Upon issuing the SIPI
request, the AP will awaken and jump to that start CS:IP address.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: - Adapted to different code base
- Moved AP table setup from SIPI sending path to
real-mode setup code
- Fix sparse warnings ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-67-joro@8bytes.org
Add a handler for #VC exceptions caused by MMIO intercepts. These
intercepts come along as nested page faults on pages with reserved
bits set.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
[ jroedel@suse.de: Adapt to VC handling framework ]
Co-developed-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-50-joro@8bytes.org
Install an exception handler for #VC exception that uses a GHCB. Also
add the infrastructure for handling different exit-codes by decoding
the instruction that caused the exception and error handling.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20200907131613.12703-24-joro@8bytes.org
this has been brought into a shape which is maintainable and actually
works.
This final version was done by Sasha Levin who took it up after Intel
dropped the ball. Sasha discovered that the SGX (sic!) offerings out there
ship rogue kernel modules enabling FSGSBASE behind the kernels back which
opens an instantanious unpriviledged root hole.
The FSGSBASE instructions provide a considerable speedup of the context
switch path and enable user space to write GSBASE without kernel
interaction. This enablement requires careful handling of the exception
entries which go through the paranoid entry path as they cannot longer rely
on the assumption that user GSBASE is positive (as enforced via prctl() on
non FSGSBASE enabled systemn). All other entries (syscalls, interrupts and
exceptions) can still just utilize SWAPGS unconditionally when the entry
comes from user space. Converting these entries to use FSGSBASE has no
benefit as SWAPGS is only marginally slower than WRGSBASE and locating and
retrieving the kernel GSBASE value is not a free operation either. The real
benefit of RD/WRGSBASE is the avoidance of the MSR reads and writes.
The changes come with appropriate selftests and have held up in field
testing against the (sanitized) Graphene-SGX driver.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8pGnoTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYoTYJD/9873GkwvGcc/Vq/dJH1szGTgFftPyZ
c/Y9gzx7EGBPLo25BS820L+ZlynzXHDxExKfCEaD10TZfe5XIc1vYNR0J74M2NmK
IBgEDstJeW93ai+rHCFRXIevhpzU4GgGYJ1MeeOgbVMN3aGU1g6HfzMvtF0fPn8Y
n6fsLZa43wgnoTdjwjjikpDTrzoZbaL1mbODBzBVPAaTbim7IKKTge6r/iCKrOjz
Uixvm3g9lVzx52zidJ9kWa8esmbOM1j0EPe7/hy3qH9DFo87KxEzjHNH3T6gY5t6
NJhRAIfY+YyTHpPCUCshj6IkRudE6w/qjEAmKP9kWZxoJrvPCTWOhCzelwsFS9b9
gxEYfsnaKhsfNhB6fi0PtWlMzPINmEA7SuPza33u5WtQUK7s1iNlgHfvMbjstbwg
MSETn4SG2/ZyzUrSC06lVwV8kh0RgM3cENc/jpFfIHD0vKGI3qfka/1RY94kcOCG
AeJd0YRSU2RqL7lmxhHyG8tdb8eexns41IzbPCLXX2sF00eKNkVvMRYT2mKfKLFF
q8v1x7yuwmODdXfFR6NdCkGm9IU7wtL6wuQ8Nhu9UraFmcXo6X6FLJC18FqcvSb9
jvcRP4XY/8pNjjf44JB8yWfah0xGQsaMIKQGP4yLv4j6Xk1xAQKH1MqcC7l1D2HN
5Z24GibFqSK/vA==
=QaAN
-----END PGP SIGNATURE-----
Merge tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fsgsbase from Thomas Gleixner:
"Support for FSGSBASE. Almost 5 years after the first RFC to support
it, this has been brought into a shape which is maintainable and
actually works.
This final version was done by Sasha Levin who took it up after Intel
dropped the ball. Sasha discovered that the SGX (sic!) offerings out
there ship rogue kernel modules enabling FSGSBASE behind the kernels
back which opens an instantanious unpriviledged root hole.
The FSGSBASE instructions provide a considerable speedup of the
context switch path and enable user space to write GSBASE without
kernel interaction. This enablement requires careful handling of the
exception entries which go through the paranoid entry path as they
can no longer rely on the assumption that user GSBASE is positive (as
enforced via prctl() on non FSGSBASE enabled systemn).
All other entries (syscalls, interrupts and exceptions) can still just
utilize SWAPGS unconditionally when the entry comes from user space.
Converting these entries to use FSGSBASE has no benefit as SWAPGS is
only marginally slower than WRGSBASE and locating and retrieving the
kernel GSBASE value is not a free operation either. The real benefit
of RD/WRGSBASE is the avoidance of the MSR reads and writes.
The changes come with appropriate selftests and have held up in field
testing against the (sanitized) Graphene-SGX driver"
* tag 'x86-fsgsbase-2020-08-04' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/fsgsbase: Fix Xen PV support
x86/ptrace: Fix 32-bit PTRACE_SETREGS vs fsbase and gsbase
selftests/x86/fsgsbase: Add a missing memory constraint
selftests/x86/fsgsbase: Fix a comment in the ptrace_write_gsbase test
selftests/x86: Add a syscall_arg_fault_64 test for negative GSBASE
selftests/x86/fsgsbase: Test ptracer-induced GS base write with FSGSBASE
selftests/x86/fsgsbase: Test GS selector on ptracer-induced GS base write
Documentation/x86/64: Add documentation for GS/FS addressing mode
x86/elf: Enumerate kernel FSGSBASE capability in AT_HWCAP2
x86/cpu: Enable FSGSBASE on 64bit by default and add a chicken bit
x86/entry/64: Handle FSGSBASE enabled paranoid entry/exit
x86/entry/64: Introduce the FIND_PERCPU_BASE macro
x86/entry/64: Switch CR3 before SWAPGS in paranoid entry
x86/speculation/swapgs: Check FSGSBASE in enabling SWAPGS mitigation
x86/process/64: Use FSGSBASE instructions on thread copy and ptrace
x86/process/64: Use FSBSBASE in switch_to() if available
x86/process/64: Make save_fsgs_for_kvm() ready for FSGSBASE
x86/fsgsbase/64: Enable FSGSBASE instructions in helper functions
x86/fsgsbase/64: Add intrinsics for FSGSBASE instructions
x86/cpu: Add 'unsafe_fsgsbase' to enable CR4.FSGSBASE
...
Commit 850448f35a ("KVM: nVMX: Fix VMX preemption timer migration",
2020-06-01) accidentally broke nVMX live migration from older version
by changing the userspace ABI. Restore it and, while at it, ensure
that vmx->nested.has_preemption_timer_deadline is always initialized
according to the KVM_STATE_VMX_PREEMPTION_TIMER_DEADLINE flag.
Cc: Makarand Sonare <makarandsonare@google.com>
Fixes: 850448f35a ("KVM: nVMX: Fix VMX preemption timer migration")
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kernel needs to explicitly enable FSGSBASE. So, the application needs
to know if it can safely use these instructions. Just looking at the CPUID
bit is not enough because it may be running in a kernel that does not
enable the instructions.
One way for the application would be to just try and catch the SIGILL.
But that is difficult to do in libraries which may not want to overwrite
the signal handlers of the main application.
Enumerate the enabled FSGSBASE capability in bit 1 of AT_HWCAP2 in the ELF
aux vector. AT_HWCAP2 is already used by PPC for similar purposes.
The application can access it open coded or by using the getauxval()
function in newer versions of glibc.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1557309753-24073-18-git-send-email-chang.seok.bae@intel.com
Link: https://lkml.kernel.org/r/20200528201402.1708239-14-sashal@kernel.org
to fixup conflicts in arch/x86/kernel/cpu/mce/core.c so MCE specific follow
up patches can be applied without creating a horrible merge conflict
afterwards.
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested virtualization
- Nested AMD event injection facelift, building on the rework of generic code
and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page fault
work, will come next week.
-----BEGIN PGP SIGNATURE-----
iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAl7VJcYUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroPf6QgAq4wU5wdd1lTGz/i3DIhNVJNJgJlp
ozLzRdMaJbdbn5RpAK6PEBd9+pt3+UlojpFB3gpJh2Nazv2OzV4yLQgXXXyyMEx1
5Hg7b4UCJYDrbkCiegNRv7f/4FWDkQ9dx++RZITIbxeskBBCEI+I7GnmZhGWzuC4
7kj4ytuKAySF2OEJu0VQF6u0CvrNYfYbQIRKBXjtOwuRK4Q6L63FGMJpYo159MBQ
asg3B1jB5TcuGZ9zrjL5LkuzaP4qZZHIRs+4kZsH9I6MODHGUxKonrkablfKxyKy
CFK+iaHCuEXXty5K0VmWM3nrTfvpEjVjbMc7e1QGBQ5oXsDM0pqn84syRg==
=v7Wn
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm updates from Paolo Bonzini:
"ARM:
- Move the arch-specific code into arch/arm64/kvm
- Start the post-32bit cleanup
- Cherry-pick a few non-invasive pre-NV patches
x86:
- Rework of TLB flushing
- Rework of event injection, especially with respect to nested
virtualization
- Nested AMD event injection facelift, building on the rework of
generic code and fixing a lot of corner cases
- Nested AMD live migration support
- Optimization for TSC deadline MSR writes and IPIs
- Various cleanups
- Asynchronous page fault cleanups (from tglx, common topic branch
with tip tree)
- Interrupt-based delivery of asynchronous "page ready" events (host
side)
- Hyper-V MSRs and hypercalls for guest debugging
- VMX preemption timer fixes
s390:
- Cleanups
Generic:
- switch vCPU thread wakeup from swait to rcuwait
The other architectures, and the guest side of the asynchronous page
fault work, will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (256 commits)
KVM: selftests: fix rdtsc() for vmx_tsc_adjust_test
KVM: check userspace_addr for all memslots
KVM: selftests: update hyperv_cpuid with SynDBG tests
x86/kvm/hyper-v: Add support for synthetic debugger via hypercalls
x86/kvm/hyper-v: enable hypercalls regardless of hypercall page
x86/kvm/hyper-v: Add support for synthetic debugger interface
x86/hyper-v: Add synthetic debugger definitions
KVM: selftests: VMX preemption timer migration test
KVM: nVMX: Fix VMX preemption timer migration
x86/kvm/hyper-v: Explicitly align hcall param for kvm_hyperv_exit
KVM: x86/pmu: Support full width counting
KVM: x86/pmu: Tweak kvm_pmu_get_msr to pass 'struct msr_data' in
KVM: x86: announce KVM_FEATURE_ASYNC_PF_INT
KVM: x86: acknowledgment mechanism for async pf page ready notifications
KVM: x86: interrupt based APF 'page ready' event delivery
KVM: introduce kvm_read_guest_offset_cached()
KVM: rename kvm_arch_can_inject_async_page_present() to kvm_arch_can_dequeue_async_page_present()
KVM: x86: extend struct kvm_vcpu_pv_apf_data with token info
Revert "KVM: async_pf: Fix #DF due to inject "Page not Present" and "Page Ready" exceptions simultaneously"
KVM: VMX: Replace zero-length array with flexible-array
...
Add new field to hold preemption timer expiration deadline
appended to struct kvm_vmx_nested_state_hdr. This is to prevent
the first VM-Enter after migration from incorrectly restarting the timer
with the full timer value instead of partially decayed timer value.
KVM_SET_NESTED_STATE restarts timer using migrated state regardless
of whether L1 sets VM_EXIT_SAVE_VMX_PREEMPTION_TIMER.
Fixes: cf8b84f48a ("kvm: nVMX: Prepare for checkpointing L2 state")
Signed-off-by: Peter Shier <pshier@google.com>
Signed-off-by: Makarand Sonare <makarandsonare@google.com>
Message-Id: <20200526215107.205814-2-makarandsonare@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce new capability to indicate that KVM supports interrupt based
delivery of 'page ready' APF events. This includes support for both
MSR_KVM_ASYNC_PF_INT and MSR_KVM_ASYNC_PF_ACK.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-8-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If two page ready notifications happen back to back the second one is not
delivered and the only mechanism we currently have is
kvm_check_async_pf_completion() check in vcpu_run() loop. The check will
only be performed with the next vmexit when it happens and in some cases
it may take a while. With interrupt based page ready notification delivery
the situation is even worse: unlike exceptions, interrupts are not handled
immediately so we must check if the slot is empty. This is slow and
unnecessary. Introduce dedicated MSR_KVM_ASYNC_PF_ACK MSR to communicate
the fact that the slot is free and host should check its notification
queue. Mandate using it for interrupt based 'page ready' APF event
delivery.
As kvm_check_async_pf_completion() is going away from vcpu_run() we need
a way to communicate the fact that vcpu->async_pf.done queue has
transitioned from empty to non-empty state. Introduce
kvm_arch_async_page_present_queued() and KVM_REQ_APF_READY to do the job.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-7-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Concerns were expressed around APF delivery via synthetic #PF exception as
in some cases such delivery may collide with real page fault. For 'page
ready' notifications we can easily switch to using an interrupt instead.
Introduce new MSR_KVM_ASYNC_PF_INT mechanism and deprecate the legacy one.
One notable difference between the two mechanisms is that interrupt may not
get handled immediately so whenever we would like to deliver next event
(regardless of its type) we must be sure the guest had read and cleared
previous event in the slot.
While on it, get rid on 'type 1/type 2' names for APF events in the
documentation as they are causing confusion. Use 'page not present'
and 'page ready' everywhere instead.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-6-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, APF mechanism relies on the #PF abuse where the token is being
passed through CR2. If we switch to using interrupts to deliver page-ready
notifications we need a different way to pass the data. Extent the existing
'struct kvm_vcpu_pv_apf_data' with token information for page-ready
notifications.
While on it, rename 'reason' to 'flags'. This doesn't change the semantics
as we only have reasons '1' and '2' and these can be treated as bit flags
but KVM_PV_REASON_PAGE_READY is going away with interrupt based delivery
making 'reason' name misleading.
The newly introduced apf_put_user_ready() temporary puts both flags and
token information, this will be changed to put token only when we switch
to interrupt based notifications.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20200525144125.143875-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Similar to VMX, the state that is captured through the currently available
IOCTLs is a mix of L1 and L2 state, dependent on whether the L2 guest was
running at the moment when the process was interrupted to save its state.
In particular, the SVM-specific state for nested virtualization includes
the L1 saved state (including the interrupt flag), the cached L2 controls,
and the GIF.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>