linux

mirror of https://github.com/torvalds/linux.git synced 2024-12-14 07:02:23 +00:00

Author	SHA1	Message	Date
Vitaly Kuznetsov	0a31df6823	KVM: x86: Check the right feature bit for MSR_KVM_ASYNC_PF_ACK access MSR_KVM_ASYNC_PF_ACK MSR is part of interrupt based asynchronous page fault interface and not the original (deprecated) KVM_FEATURE_ASYNC_PF. This is stated in Documentation/virt/kvm/msr.rst. Fixes: `66570e966d` ("kvm: x86: only provide PV features if enabled in guest's CPUID") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20210722123018.260035-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-26 08:26:53 -04:00
Vitaly Kuznetsov	2bb16bea5f	KVM: nSVM: Swap the parameter order for svm_copy_vmrun_state()/svm_copy_vmloadsave_state() Make svm_copy_vmrun_state()/svm_copy_vmloadsave_state() interface match 'memcpy(dest, src)' to avoid any confusion. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210719090322.625277-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-26 08:09:46 -04:00
Vitaly Kuznetsov	9a9e74819b	KVM: nSVM: Rename nested_svm_vmloadsave() to svm_copy_vmloadsave_state() To match svm_copy_vmrun_state(), rename nested_svm_vmloadsave() to svm_copy_vmloadsave_state(). Opportunistically add missing braces to 'else' branch in vmload_vmsave_interception(). No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210716144104.465269-1-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-26 08:09:36 -04:00
Linus Torvalds	405386b021	* Allow again loading KVM on 32-bit non-PAE builds * Fixes for host SMIs on AMD * Fixes for guest SMIs on AMD * Fixes for selftests on s390 and ARM * Fix memory leak * Enforce no-instrumentation area on vmentry when hardware breakpoints are in use. -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDwRi4UHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOt4AgAl6xEkMwDC74d/QFIOA7s2GD3ugfa z5XqGN1qz/nmEMnuIg6/tjTXDPmn/dfLMqy8RGZfyUv6xbgPcv/7JuFMRILvwGTb SbOVrGnR/QOhMdlfWH34qDkXeEsthTXSgQgVm/iiED0TttvQYVcZ/E9mgzaWQXor T1yTug2uAUXJ1EBxY0ZBo2kbh+BvvdmhEF0pksZOuwqZdH3zn3QCXwAwkL/OtUYE M6nNn3j1LU38C4OK1niXOZZVOuMIdk/l7LyFpjUQTFlIqitQAPtBE5MD+K+A9oC2 Yocxyj2tId1e6o8bLic/oN8/LpdORTvA/wDMj5M1DcMzvxQuQIpGYkcVGg== =gjVA -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: - Allow again loading KVM on 32-bit non-PAE builds - Fixes for host SMIs on AMD - Fixes for guest SMIs on AMD - Fixes for selftests on s390 and ARM - Fix memory leak - Enforce no-instrumentation area on vmentry when hardware breakpoints are in use. * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits) KVM: selftests: smm_test: Test SMM enter from L2 KVM: nSVM: Restore nested control upon leaving SMM KVM: nSVM: Fix L1 state corruption upon return from SMM KVM: nSVM: Introduce svm_copy_vmrun_state() KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails KVM: SVM: add module param to control the #SMI interception KVM: SVM: remove INIT intercept handler KVM: SVM: #SMI interception must not skip the instruction KVM: VMX: Remove vmx_msr_index from vmx.h KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc kvm: debugfs: fix memory leak in kvm_create_vm_debugfs KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR ...	2021-07-15 11:56:07 -07:00
Vitaly Kuznetsov	bb00bd9c08	KVM: nSVM: Restore nested control upon leaving SMM If the VM was migrated while in SMM, no nested state was saved/restored, and therefore svm_leave_smm has to load both save and control area of the vmcb12. Save area is already loaded from HSAVE area, so now load the control area as well from the vmcb12. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210628104425.391276-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:44 -04:00
Vitaly Kuznetsov	37be407b2c	KVM: nSVM: Fix L1 state corruption upon return from SMM VMCB split commit `4995a3685f` ("KVM: SVM: Use a separate vmcb for the nested L2 guest") broke return from SMM when we entered there from guest (L2) mode. Gen2 WS2016/Hyper-V is known to do this on boot. The problem manifests itself like this: kvm_exit: reason EXIT_RSM rip 0x7ffbb280 info 0 0 kvm_emulate_insn: 0:7ffbb280: 0f aa kvm_smm_transition: vcpu 0: leaving SMM, smbase 0x7ffb3000 kvm_nested_vmrun: rip: 0x000000007ffbb280 vmcb: 0x0000000008224000 nrip: 0xffffffffffbbe119 int_ctl: 0x01020000 event_inj: 0x00000000 npt: on kvm_nested_intercepts: cr_read: 0000 cr_write: 0010 excp: 40060002 intercepts: fd44bfeb 0000217f 00000000 kvm_entry: vcpu 0, rip 0xffffffffffbbe119 kvm_exit: reason EXIT_NPF rip 0xffffffffffbbe119 info 200000006 1ab000 kvm_nested_vmexit: vcpu 0 reason npf rip 0xffffffffffbbe119 info1 0x0000000200000006 info2 0x00000000001ab000 intr_info 0x00000000 error_code 0x00000000 kvm_page_fault: address 1ab000 error_code 6 kvm_nested_vmexit_inject: reason EXIT_NPF info1 200000006 info2 1ab000 int_info 0 int_info_err 0 kvm_entry: vcpu 0, rip 0x7ffbb280 kvm_exit: reason EXIT_EXCP_GP rip 0x7ffbb280 info 0 0 kvm_emulate_insn: 0:7ffbb280: 0f aa kvm_inj_exception: #GP (0x0) Note: return to L2 succeeded but upon first exit to L1 its RIP points to 'RSM' instruction but we're not in SMM. The problem appears to be that VMCB01 gets irreversibly destroyed during SMM execution. Previously, we used to have 'hsave' VMCB where regular (pre-SMM) L1's state was saved upon nested_svm_vmexit() but now we just switch to VMCB01 from VMCB02. Pre-split (working) flow looked like: - SMM is triggered during L2's execution - L2's state is pushed to SMRAM - nested_svm_vmexit() restores L1's state from 'hsave' - SMM -> RSM - enter_svm_guest_mode() switches to L2 but keeps 'hsave' intact so we have pre-SMM (and pre L2 VMRUN) L1's state there - L2's state is restored from SMRAM - upon first exit L1's state is restored from L1. This was always broken with regards to svm_get_nested_state()/ svm_set_nested_state(): 'hsave' was never a part of what's being save and restored so migration happening during SMM triggered from L2 would never restore L1's state correctly. Post-split flow (broken) looks like: - SMM is triggered during L2's execution - L2's state is pushed to SMRAM - nested_svm_vmexit() switches to VMCB01 from VMCB02 - SMM -> RSM - enter_svm_guest_mode() switches from VMCB01 to VMCB02 but pre-SMM VMCB01 is already lost. - L2's state is restored from SMRAM - upon first exit L1's state is restored from VMCB01 but it is corrupted (reflects the state during 'RSM' execution). VMX doesn't have this problem because unlike VMCB, VMCS keeps both guest and host state so when we switch back to VMCS02 L1's state is intact there. To resolve the issue we need to save L1's state somewhere. We could've created a third VMCB for SMM but that would require us to modify saved state format. L1's architectural HSAVE area (pointed by MSR_VM_HSAVE_PA) seems appropriate: L0 is free to save any (or none) of L1's state there. Currently, KVM does 'none'. Note, for nested state migration to succeed, both source and destination hypervisors must have the fix. We, however, don't need to create a new flag indicating the fact that HSAVE area is now populated as migration during SMM triggered from L2 was always broken. Fixes: `4995a3685f` ("KVM: SVM: Use a separate vmcb for the nested L2 guest") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:44 -04:00
Vitaly Kuznetsov	0a75829076	KVM: nSVM: Introduce svm_copy_vmrun_state() Separate the code setting non-VMLOAD-VMSAVE state from svm_set_nested_state() into its own function. This is going to be re-used from svm_enter_smm()/svm_leave_smm(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210628104425.391276-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:43 -04:00
Vitaly Kuznetsov	fb79f566e4	KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN APM states that "The address written to the VM_HSAVE_PA MSR, which holds the address of the page used to save the host state on a VMRUN, must point to a hypervisor-owned page. If this check fails, the WRMSR will fail with a #GP(0) exception. Note that a value of 0 is not considered valid for the VM_HSAVE_PA MSR and a VMRUN that is attempted while the HSAVE_PA is 0 will fail with a #GP(0) exception." svm_set_msr() already checks that the supplied address is valid, so only check for '0' is missing. Add it to nested_svm_vmrun(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210628104425.391276-3-vkuznets@redhat.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:43 -04:00
Vitaly Kuznetsov	fce7e152ff	KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA APM states that #GP is raised upon write to MSR_VM_HSAVE_PA when the supplied address is not page-aligned or is outside of "maximum supported physical address for this implementation". page_address_valid() check seems suitable. Also, forcefully page-align the address when it's written from VMM. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210628104425.391276-2-vkuznets@redhat.com> Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> [Add comment about behavior for host-provided values. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:43 -04:00
Sean Christopherson	c7a1b2b678	KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities Use IS_ERR() instead of checking for a NULL pointer when querying for sev_pin_memory() failures. sev_pin_memory() always returns an error code cast to a pointer, or a valid pointer; it never returns NULL. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Steve Rutherford <srutherford@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Ashish Kalra <ashish.kalra@amd.com> Fixes: `d3d1af85e2` ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command") Fixes: `15fb7de1a7` ("KVM: SVM: Add KVM_SEV_RECEIVE_UPDATE_DATA command") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210506175826.2166383-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:43 -04:00
Sean Christopherson	b4a693924a	KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails Return -EFAULT if copy_to_user() fails; if accessing user memory faults, copy_to_user() returns the number of bytes remaining, not an error code. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Cc: Steve Rutherford <srutherford@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Ashish Kalra <ashish.kalra@amd.com> Fixes: `d3d1af85e2` ("KVM: SVM: Add KVM_SEND_UPDATE_DATA command") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210506175826.2166383-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:42 -04:00
Maxim Levitsky	4b639a9f82	KVM: SVM: add module param to control the #SMI interception In theory there are no side effects of not intercepting #SMI, because then #SMI becomes transparent to the OS and the KVM. Plus an observation on recent Zen2 CPUs reveals that these CPUs ignore #SMI interception and never deliver #SMI VMexits. This is also useful to test nested KVM to see that L1 handles #SMIs correctly in case when L1 doesn't intercept #SMI. Finally the default remains the same, the SMI are intercepted by default thus this patch doesn't have any effect unless non default module param value is used. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210707125100.677203-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:42 -04:00
Maxim Levitsky	896707c212	KVM: SVM: remove INIT intercept handler Kernel never sends real INIT even to CPUs, other than on boot. Thus INIT interception is an error which should be caught by a check for an unknown VMexit reason. On top of that, the current INIT VM exit handler skips the current instruction which is wrong. That was added in commit `5ff3a351f6` ("KVM: x86: Move trivial instruction-based exit handlers to common code"). Fixes: `5ff3a351f6` ("KVM: x86: Move trivial instruction-based exit handlers to common code") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210707125100.677203-3-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:42 -04:00
Maxim Levitsky	991afbbee8	KVM: SVM: #SMI interception must not skip the instruction Commit `5ff3a351f6` ("KVM: x86: Move trivial instruction-based exit handlers to common code"), unfortunately made a mistake of treating nop_on_interception and nop_interception in the same way. Former does truly nothing while the latter skips the instruction. SMI VM exit handler should do nothing. (SMI itself is handled by the host when we do STGI) Fixes: `5ff3a351f6` ("KVM: x86: Move trivial instruction-based exit handlers to common code") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210707125100.677203-2-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:42 -04:00
Yu Zhang	c0e1303ed4	KVM: VMX: Remove vmx_msr_index from vmx.h vmx_msr_index was used to record the list of MSRs which can be lazily restored when kvm returns to userspace. It is now reimplemented as kvm_uret_msrs_list, a common x86 list which is only used inside x86.c. So just remove the obsolete declaration in vmx.h. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Message-Id: <20210707235702.31595-1-yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:41 -04:00
Lai Jiangshan	f85d401606	KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run() When the host is using debug registers but the guest is not using them nor is the guest in guest-debug state, the kvm code does not reset the host debug registers before kvm_x86->run(). Rather, it relies on the hardware vmentry instruction to automatically reset the dr7 registers which ensures that the host breakpoints do not affect the guest. This however violates the non-instrumentable nature around VM entry and exit; for example, when a host breakpoint is set on vcpu->arch.cr2, Another issue is consistency. When the guest debug registers are active, the host breakpoints are reset before kvm_x86->run(). But when the guest debug registers are inactive, the host breakpoints are delayed to be disabled. The host tracing tools may see different results depending on what the guest is doing. To fix the problems, we clear %db7 unconditionally before kvm_x86->run() if the host has set any breakpoints, no matter if the guest is using them or not. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com> Cc: stable@vger.kernel.org [Only clear %db7 instead of reloading all debug registers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-15 10:19:41 -04:00
Like Xu	7234c362cc	KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM The AMD platform does not support the functions Ah CPUID leaf. The returned results for this entry should all remain zero just like the native does: AMD host: 0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 (uncanny) AMD guest: 0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00008000 Fixes: `cadbaa039b` ("perf/x86/intel: Make anythread filter support conditional") Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20210628074354.33848-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-14 12:17:56 -04:00
Sean Christopherson	76ff371b67	KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler Don't clear the C-bit in the #NPF handler, as it is a legal GPA bit for non-SEV guests, and for SEV guests the C-bit is dropped before the GPA hits the NPT in hardware. Clearing the bit for non-SEV guests causes KVM to mishandle #NPFs with that collide with the host's C-bit. Although the APM doesn't explicitly state that the C-bit is not reserved for non-SEV, Tom Lendacky confirmed that the following snippet about the effective reduction due to the C-bit does indeed apply only to SEV guests. Note that because guest physical addresses are always translated through the nested page tables, the size of the guest physical address space is not impacted by any physical address space reduction indicated in CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however, the guest physical address space is effectively reduced by 1 bit. And for SEV guests, the APM clearly states that the bit is dropped before walking the nested page tables. If the C-bit is an address bit, this bit is masked from the guest physical address when it is translated through the nested page tables. Consequently, the hypervisor does not need to be aware of which pages the guest has chosen to mark private. Note, the bogus C-bit clearing was removed from legacy #PF handler in commit `6d1b867d04` ("KVM: SVM: Don't strip the C-bit from CR2 on #PF interception"). Fixes: `0ede79e132` ("KVM: SVM: Clear C-bit from the page fault address") Cc: Peter Gonda <pgonda@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210625020354.431829-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-14 12:17:56 -04:00
Sean Christopherson	fc9bf2e087	KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs Ignore "dynamic" host adjustments to the physical address mask when generating the masks for guest PTEs, i.e. the guest PA masks. The host physical address space and guest physical address space are two different beasts, e.g. even though SEV's C-bit is the same bit location for both host and guest, disabling SME in the host (which clears shadow_me_mask) does not affect the guest PTE->GPA "translation". For non-SEV guests, not dropping bits is the correct behavior. Assuming KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits that are lost as collateral damage from memory encryption are treated as reserved bits, i.e. KVM will never get to the point where it attempts to generate a gfn using the affected bits. And if userspace wants to create a bogus vCPU, then userspace gets to deal with the fallout of hardware doing odd things with bad GPAs. For SEV guests, not dropping the C-bit is technically wrong, but it's a moot point because KVM can't read SEV guest's page tables in any case since they're always encrypted. Not to mention that the current KVM code is also broken since sme_me_mask does not have to be non-zero for SEV to be supported by KVM. The proper fix would be to teach all of KVM to correctly handle guest private memory, but that's a task for the future. Fixes: `d0ec49d4de` ("kvm/x86/svm: Support Secure Memory Encryption within KVM") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-5-seanjc@google.com> [Use a new header instead of adding header guards to paging_tmpl.h. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-14 12:17:56 -04:00
Sean Christopherson	e39f00f60e	KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR Use boot_cpu_data.x86_phys_bits instead of the raw CPUID information to enumerate the MAXPHYADDR for KVM guests when TDP is disabled (the guest version is only relevant to NPT/TDP). When using shadow paging, any reductions to the host's MAXPHYADDR apply to KVM and its guests as well, i.e. using the raw CPUID info will cause KVM to misreport the number of PA bits available to the guest. Unconditionally zero out the "Physical Address bit reduction" entry. For !TDP, the adjustment is already done, and for TDP enumerating the host's reduction is wrong as the reduction does not apply to GPAs. Fixes: `9af9b94068` ("x86/cpu/AMD: Handle SME reduction in physical address size") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-14 12:17:55 -04:00
Sean Christopherson	4bf48e3c0a	KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is enabled Ignore the guest MAXPHYADDR reported by CPUID.0x8000_0008 if TDP, i.e. NPT, is disabled, and instead use the host's MAXPHYADDR. Per AMD'S APM: Maximum guest physical address size in bits. This number applies only to guests using nested paging. When this field is zero, refer to the PhysAddrSize field for the maximum guest physical address size. Fixes: `24c82e576b` ("KVM: Sanitize cpuid") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-14 12:17:55 -04:00
Sean Christopherson	f0414b078d	Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not enabled" Let KVM load if EFER.NX=0 even if NX is supported, the analysis and testing (or lack thereof) for the non-PAE host case was garbage. If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S skips over the entire EFER sequence. Hopefully that can be changed in the future to allow KVM to require EFER.NX, but the motivation behind KVM's requirement isn't yet merged. Reverting and revisiting the mess at a later date is by far the safest approach. This reverts commit `8bbed95d2c`. Fixes: `8bbed95d2c` ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210625001853.318148-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-07-14 12:17:55 -04:00
Linus Torvalds	1423e2660c	Fixes and improvements for FPU handling on x86: - Prevent sigaltstack out of bounds writes. The kernel unconditionally writes the FPU state to the alternate stack without checking whether the stack is large enough to accomodate it. Check the alternate stack size before doing so and in case it's too small force a SIGSEGV instead of silently corrupting user space data. - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been updated despite the fact that the FPU state which is stored on the signal stack has grown over time which causes trouble in the field when AVX512 is available on a CPU. The kernel does not expose the minimum requirements for the alternate stack size depending on the available and enabled CPU features. ARM already added an aux vector AT_MINSIGSTKSZ for the same reason. Add it to x86 as well - A major cleanup of the x86 FPU code. The recent discoveries of XSTATE related issues unearthed quite some inconsistencies, duplicated code and other issues. The fine granular overhaul addresses this, makes the code more robust and maintainable, which allows to integrate upcoming XSTATE related features in sane ways. -----BEGIN PGP SIGNATURE----- iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmDlcpETHHRnbHhAbGlu dXRyb25peC5kZQAKCRCmGPVMDXSYoeP5D/4i+AgYYeiMLgGb+NS7iaKPfoWo6LIz y3qdTSA0DQaIYbYivWwRO/g0GYdDMXDWeZalFi7eGnVI8O3eOog+22Zrf/y0UINB KJHdYd4ApWHhs401022y5hexrWQvnV8w1yQCuj/zLm6eC+AVhdwt2AY+IBoRrdUj wqY97B/4rJNsBvvqTDn9EeDrJA2y0y0Suc7AhIp2BGMI+dpIdxys8RJDamXNWyDL gJf0YRgUoiIn3AHKb+fgv60AoxfC175NSg/5/y/scFNXqVlW0Up4YCb7pqG9o2Ga f3XvtWfbw1N5PmUYjFkALwEkzGUbM3v0RA3xLY2j2WlWm9fBPPy59dt+i/h/VKyA GrA7i7lcIqX8dfVH6XkrReZBkRDSB6t9SZTvV54jAz5fcIZO2Rg++UFUvI/R6GKK XCcxukYaArwo+IG62iqDszS3gfLGhcor/cviOeULRC5zMUIO4Jah+IhDnifmShtC M5s9QzrwIRD/XMewGRQmvkiN4kBfE7jFoBQr1J9leCXJKrM+2JQmMzVInuubTQIq SdlKOaAIn7xtekz+6XdFG9Gmhck0PCLMJMOLNvQkKWI3KqGLRZ+dAWKK0vsCizAx 0BA7ZeB9w9lFT+D8mQCX77JvW9+VNwyfwIOLIrJRHk3VqVpS5qvoiFTLGJJBdZx4 /TbbRZu7nXDN2w== =Mq1m -----END PGP SIGNATURE----- Merge tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fpu updates from Thomas Gleixner: "Fixes and improvements for FPU handling on x86: - Prevent sigaltstack out of bounds writes. The kernel unconditionally writes the FPU state to the alternate stack without checking whether the stack is large enough to accomodate it. Check the alternate stack size before doing so and in case it's too small force a SIGSEGV instead of silently corrupting user space data. - MINSIGSTKZ and SIGSTKSZ are constants in signal.h and have never been updated despite the fact that the FPU state which is stored on the signal stack has grown over time which causes trouble in the field when AVX512 is available on a CPU. The kernel does not expose the minimum requirements for the alternate stack size depending on the available and enabled CPU features. ARM already added an aux vector AT_MINSIGSTKSZ for the same reason. Add it to x86 as well. - A major cleanup of the x86 FPU code. The recent discoveries of XSTATE related issues unearthed quite some inconsistencies, duplicated code and other issues. The fine granular overhaul addresses this, makes the code more robust and maintainable, which allows to integrate upcoming XSTATE related features in sane ways" * tag 'x86-fpu-2021-07-07' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (74 commits) x86/fpu/xstate: Clear xstate header in copy_xstate_to_uabi_buf() again x86/fpu/signal: Let xrstor handle the features to init x86/fpu/signal: Handle #PF in the direct restore path x86/fpu: Return proper error codes from user access functions x86/fpu/signal: Split out the direct restore code x86/fpu/signal: Sanitize copy_user_to_fpregs_zeroing() x86/fpu/signal: Sanitize the xstate check on sigframe x86/fpu/signal: Remove the legacy alignment check x86/fpu/signal: Move initial checks into fpu__restore_sig() x86/fpu: Mark init_fpstate __ro_after_init x86/pkru: Remove xstate fiddling from write_pkru() x86/fpu: Don't store PKRU in xstate in fpu_reset_fpstate() x86/fpu: Remove PKRU handling from switch_fpu_finish() x86/fpu: Mask PKRU from kernel XRSTOR[S] operations x86/fpu: Hook up PKRU into ptrace() x86/fpu: Add PKRU storage outside of task XSAVE buffer x86/fpu: Dont restore PKRU in fpregs_restore_userspace() x86/fpu: Rename xfeatures_mask_user() to xfeatures_mask_uabi() x86/fpu: Move FXSAVE_LEAK quirk info __copy_kernel_to_fpregs() x86/fpu: Rename __fpregs_load_activate() to fpregs_restore_userregs() ...	2021-07-07 11:12:01 -07:00
Linus Torvalds	36824f198c	ARM: - Add MTE support in guests, complete with tag save/restore interface - Reduce the impact of CMOs by moving them in the page-table code - Allow device block mappings at stage-2 - Reduce the footprint of the vmemmap in protected mode - Support the vGIC on dumb systems such as the Apple M1 - Add selftest infrastructure to support multiple configuration and apply that to PMU/non-PMU setups - Add selftests for the debug architecture - The usual crop of PMU fixes PPC: - Support for the H_RPT_INVALIDATE hypercall - Conversion of Book3S entry/exit to C - Bug fixes S390: - new HW facilities for guests - make inline assembly more robust with KASAN and co x86: - Allow userspace to handle emulation errors (unknown instructions) - Lazy allocation of the rmap (host physical -> guest physical address) - Support for virtualizing TSC scaling on VMX machines - Optimizations to avoid shattering huge pages at the beginning of live migration - Support for initializing the PDPTRs without loading them from memory - Many TLB flushing cleanups - Refuse to load if two-stage paging is available but NX is not (this has been a requirement in practice for over a year) - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from CR0/CR4/EFER, using the MMU mode everywhere once it is computed from the CPU registers - Use PM notifier to notify the guest about host suspend or hibernate - Support for passing arguments to Hyper-V hypercalls using XMM registers - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on AMD processors - Hide Hyper-V hypercalls that are not included in the guest CPUID - Fixes for live migration of virtual machines that use the Hyper-V "enlightened VMCS" optimization of nested virtualization - Bugfixes (not many) Generic: - Support for retrieving statistics without debugfs - Cleanups for the KVM selftests API -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDV9UYUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroOIRgf/XX8fKLh24RnTOs2ldIu2AfRGVrT4 QMrr8MxhmtukBAszk2xKvBt8/6gkUjdaIC3xqEnVjxaDaUvZaEtP7CQlF5JV45rn iv1zyxUKucXrnIOr+gCioIT7qBlh207zV35ArKioP9Y83cWx9uAs22pfr6g+7RxO h8bJZlJbSG6IGr3voANCIb9UyjU1V/l8iEHqRwhmr/A5rARPfD7g8lfMEQeGkzX6 +/UydX2fumB3tl8e2iMQj6vLVdSOsCkehvpHK+Z33EpkKhan7GwZ2sZ05WmXV/nY QLAYfD10KegoNWl5Ay4GTp4hEAIYVrRJCLC+wnLdc0U8udbfCuTC31LK4w== =NcRh -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "This covers all architectures (except MIPS) so I don't expect any other feature pull requests this merge window. ARM: - Add MTE support in guests, complete with tag save/restore interface - Reduce the impact of CMOs by moving them in the page-table code - Allow device block mappings at stage-2 - Reduce the footprint of the vmemmap in protected mode - Support the vGIC on dumb systems such as the Apple M1 - Add selftest infrastructure to support multiple configuration and apply that to PMU/non-PMU setups - Add selftests for the debug architecture - The usual crop of PMU fixes PPC: - Support for the H_RPT_INVALIDATE hypercall - Conversion of Book3S entry/exit to C - Bug fixes S390: - new HW facilities for guests - make inline assembly more robust with KASAN and co x86: - Allow userspace to handle emulation errors (unknown instructions) - Lazy allocation of the rmap (host physical -> guest physical address) - Support for virtualizing TSC scaling on VMX machines - Optimizations to avoid shattering huge pages at the beginning of live migration - Support for initializing the PDPTRs without loading them from memory - Many TLB flushing cleanups - Refuse to load if two-stage paging is available but NX is not (this has been a requirement in practice for over a year) - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from CR0/CR4/EFER, using the MMU mode everywhere once it is computed from the CPU registers - Use PM notifier to notify the guest about host suspend or hibernate - Support for passing arguments to Hyper-V hypercalls using XMM registers - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap on AMD processors - Hide Hyper-V hypercalls that are not included in the guest CPUID - Fixes for live migration of virtual machines that use the Hyper-V "enlightened VMCS" optimization of nested virtualization - Bugfixes (not many) Generic: - Support for retrieving statistics without debugfs - Cleanups for the KVM selftests API" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits) KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled kvm: x86: disable the narrow guest module parameter on unload selftests: kvm: Allows userspace to handle emulation errors. kvm: x86: Allow userspace to handle emulation errors KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic KVM: x86: Enhance comments for MMU roles and nested transition trickiness KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU KVM: x86/mmu: Use MMU's role to determine PTTYPE KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers KVM: x86/mmu: Add a helper to calculate root from role_regs KVM: x86/mmu: Add helper to update paging metadata KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0 KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper KVM: x86/mmu: Get nested MMU's root level from the MMU's role ...	2021-06-28 15:40:51 -07:00
Linus Torvalds	8e4d7a78f0	Misc cleanups & removal of obsolete code. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZejQRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1hKCg//a0wiOyJBiWLAW0uiOucF2ICVQZj5rKgi M4HRJZ9jNkFUFVQ/eXYI7uedSqJ6B4hwoUqU6Yp6e05CF/Jgxe2OQXnknearjtDp xs8yBsnLolCrtHzWvuJAZL8InXwvUYrsxu1A8kWKd1ezZQ2V2aFEI4KtYcPVoBBi hRNMy1JVJbUoCG5s/CbsMpTKH0ehQFGsG46rCLQJ4s9H3rcYaCv9NY2q1EYKBrha ZiZjPSFBKaTAVEoc3tUbqsNZAqgyuwRcBQL0K5VDI9p92fudvKgsTI7erbmp+Lij mLhjjoPQK1C07kj0HpCPyoGMiTbJ2piag/jZnxSEiQnNxmZjqjRUhDuDhp6uc/SE 98CEYWPoVbU7N6QLEurHVRAfaQ/ZC7PfiR7lhkoJHizaszFY1NFRxplsU1rzTwGq YZdr+y49tJTCU1wIvWF2eFBZHBEgfA6fP4TRGgVsQ7r8IhugR1nCLcnTfMLYXt2t 9Fe57M7cBgZMgNf5AgvraowugJrTLX7240YPKxHnv5yLjIBt4bulm8X4Lq/MKgc+ UbRfB7Trd2c9T4EVDy26rQ7qk+VC8rIbzEp4kvlDpV8u7BtLYhVonxVz6qPong5b NxOczaFsfL5gWJmfGU+vfc+RFl2lNhQQMLo/gdEn89qZL8nxL/4byejwfCs0YfC2 wgDXNwRJb+g= =YqZp -----END PGP SIGNATURE----- Merge tag 'x86-cleanups-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "Misc cleanups & removal of obsolete code" * tag 'x86-cleanups-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/sgx: Correct kernel-doc's arg name in sgx_encl_release() doc: Remove references to IBM Calgary x86/setup: Document that Windows reserves the first MiB x86/crash: Remove crash_reserve_low_1M() x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options x86/alternative: Align insn bytes vertically x86: Fix leftover comment typos x86/asm: Simplify __smp_mb() definition x86/alternatives: Make the x86nops[] symbol static	2021-06-28 13:10:25 -07:00
Linus Torvalds	54a728dc5e	Scheduler udpates for this cycle: - Changes to core scheduling facilities: - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables coordinated scheduling across SMT siblings. This is a much requested feature for cloud computing platforms, to allow the flexible utilization of SMT siblings, without exposing untrusted domains to information leaks & side channels, plus to ensure more deterministic computing performance on SMT systems used by heterogenous workloads. There's new prctls to set core scheduling groups, which allows more flexible management of workloads that can share siblings. - Fix task->state access anti-patterns that may result in missed wakeups and rename it to ->__state in the process to catch new abuses. - Load-balancing changes: - Tweak newidle_balance for fair-sched, to improve 'memcache'-like workloads. - "Age" (decay) average idle time, to better track & improve workloads such as 'tbench'. - Fix & improve energy-aware (EAS) balancing logic & metrics. - Fix & improve the uclamp metrics. - Fix task migration (taskset) corner case on !CONFIG_CPUSET. - Fix RT and deadline utilization tracking across policy changes - Introduce a "burstable" CFS controller via cgroups, which allows bursty CPU-bound workloads to borrow a bit against their future quota to improve overall latencies & batching. Can be tweaked via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us. - Rework assymetric topology/capacity detection & handling. - Scheduler statistics & tooling: - Disable delayacct by default, but add a sysctl to enable it at runtime if tooling needs it. Use static keys and other optimizations to make it more palatable. - Use sched_clock() in delayacct, instead of ktime_get_ns(). - Misc cleanups and fixes. Signed-off-by: Ingo Molnar <mingo@kernel.org> -----BEGIN PGP SIGNATURE----- iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmDZcPoRHG1pbmdvQGtl cm5lbC5vcmcACgkQEnMQ0APhK1g3yw//WfhIqy7Psa9d/MBMjQDRGbTuO4+w22Dj vmWFU44Q4KJxQHWeIgUlrK+dzvYWvNmflUs2CUUOiDVzxFTHMIyBtL4qCBUbx4Ns vKAcB9wsWZge2o3WzZqpProRhdoRaSKw8egUr2q7rACVBkckY7eGP/OjWxXU8BdA b7D0LPWwuIBFfN4pFYeCDLn32Dqr9s6Chyj+ZecabdG7EE6Gu+f1diVcxy7JE/mc 4WWL0D1RqdgpGrBEuMJIxPYekdrZiuy4jtEbztz5gbTBteN1cj3BLfqn0Pc/e6rO Vyuc5mXCAmzRVi18z6g6bsVl+IA/nrbErENB2OHOhOYtqiZxqGTd4GPWZszMyY17 5AsEO5+5pcaBsy4gyp09qURggBu9zhJnMVmOI3rIHZkmkhwzc6uUJlyhDCTiFWOz 3ZF3LjbZEyCKodMD8qMHbs3axIBpIfZqjzkvSKyFnvfXEGVytVse7NUuWtQ36u92 GnURxVeYY1TDVXvE1Y8owNKMxknKQ6YRlypP7Dtbeo/qG6hShp0xmS7qDLDi0ybZ ZlK+bDECiVoDf3nvJo+8v5M82IJ3CBt4UYldeRJsa1YCK/FsbK8tp91fkEfnXVue +U6LPX0AmMpXacR5HaZfb3uBIKRw/QMdP/7RFtBPhpV6jqCrEmuqHnpPQiEVtxwO UmG7bt94Trk= =3VDr -----END PGP SIGNATURE----- Merge tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler udpates from Ingo Molnar: - Changes to core scheduling facilities: - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables coordinated scheduling across SMT siblings. This is a much requested feature for cloud computing platforms, to allow the flexible utilization of SMT siblings, without exposing untrusted domains to information leaks & side channels, plus to ensure more deterministic computing performance on SMT systems used by heterogenous workloads. There are new prctls to set core scheduling groups, which allows more flexible management of workloads that can share siblings. - Fix task->state access anti-patterns that may result in missed wakeups and rename it to ->__state in the process to catch new abuses. - Load-balancing changes: - Tweak newidle_balance for fair-sched, to improve 'memcache'-like workloads. - "Age" (decay) average idle time, to better track & improve workloads such as 'tbench'. - Fix & improve energy-aware (EAS) balancing logic & metrics. - Fix & improve the uclamp metrics. - Fix task migration (taskset) corner case on !CONFIG_CPUSET. - Fix RT and deadline utilization tracking across policy changes - Introduce a "burstable" CFS controller via cgroups, which allows bursty CPU-bound workloads to borrow a bit against their future quota to improve overall latencies & batching. Can be tweaked via /sys/fs/cgroup/cpu/<X>/cpu.cfs_burst_us. - Rework assymetric topology/capacity detection & handling. - Scheduler statistics & tooling: - Disable delayacct by default, but add a sysctl to enable it at runtime if tooling needs it. Use static keys and other optimizations to make it more palatable. - Use sched_clock() in delayacct, instead of ktime_get_ns(). - Misc cleanups and fixes. * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits) sched/doc: Update the CPU capacity asymmetry bits sched/topology: Rework CPU capacity asymmetry detection sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag psi: Fix race between psi_trigger_create/destroy sched/fair: Introduce the burstable CFS controller sched/uclamp: Fix uclamp_tg_restrict() sched/rt: Fix Deadline utilization tracking during policy change sched/rt: Fix RT utilization tracking during policy change sched: Change task_struct::state sched,arch: Remove unused TASK_STATE offsets sched,timer: Use __set_current_state() sched: Add get_current_state() sched,perf,kvm: Fix preemption condition sched: Introduce task_is_running() sched: Unbreak wakeups sched/fair: Age the average idle time sched/cpufreq: Consider reduced CPU capacity in energy calculation sched/fair: Take thermal pressure into account while estimating energy thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure sched/fair: Return early from update_tg_cfs_load() if delta == 0 ...	2021-06-28 12:14:19 -07:00
Maxim Levitsky	a01b45e9d3	KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled This better reflects the purpose of this variable on AMD, since on AMD the AVIC's memory slot can be enabled and disabled dynamically. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210623113002.111448-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:49 -04:00
Aaron Lewis	88213da235	kvm: x86: disable the narrow guest module parameter on unload When the kvm_intel module unloads the module parameter 'allow_smaller_maxphyaddr' is not cleared because the backing variable is defined in the kvm module. As a result, if the module parameter's state was set before kvm_intel unloads, it will also be set when it reloads. Explicitly clear the state in vmx_exit() to prevent this from happening. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Message-Id: <20210623203426.1891402-1-aaronlewis@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Jim Mattson <jmattson@google.com>	2021-06-24 18:00:49 -04:00
Aaron Lewis	19238e75bd	kvm: x86: Allow userspace to handle emulation errors Add a fallback mechanism to the in-kernel instruction emulator that allows userspace the opportunity to process an instruction the emulator was unable to. When the in-kernel instruction emulator fails to process an instruction it will either inject a #UD into the guest or exit to userspace with exit reason KVM_INTERNAL_ERROR. This is because it does not know how to proceed in an appropriate manner. This feature lets userspace get involved to see if it can figure out a better path forward. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: David Edmondson <david.edmondson@oracle.com> Message-Id: <20210510144834.658457-2-aaronlewis@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:48 -04:00
Sean Christopherson	27de925044	KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on Let the guest use 1g hugepages if TDP is enabled and the host supports GBPAGES, KVM can't actively prevent the guest from using 1g pages in this case since they can't be disabled in the hardware page walker. While injecting a page fault if a bogus 1g page is encountered during a software page walk is perfectly reasonable since KVM is simply honoring userspace's vCPU model, doing so arguably doesn't provide any meaningful value, and at worst will be horribly confusing as the guest will see inconsistent behavior and seemingly spurious page faults. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-55-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:48 -04:00
Sean Christopherson	9a65d0b70f	KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault Use the current MMU instead of vCPU state to query CR4.SMEP when handling a page fault. In the nested NPT case, the current CR4.SMEP reflects L2, whereas the page fault is shadowing L1's NPT, which uses L1's hCR4. Practically speaking, this is a nop a NPT walks are always user faults, i.e. this code will never be reached, but fix it up for consistency. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-54-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:48 -04:00
Sean Christopherson	fdaa293598	KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault Use the current MMU instead of vCPU state to query CR0.WP when handling a page fault. In the nested NPT case, the current CR0.WP reflects L2, whereas the page fault is shadowing L1's NPT. Practically speaking, this is a nop a NPT walks are always user faults, but fix it up for consistency. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-53-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:47 -04:00
Sean Christopherson	f82fdaf536	KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT Drop the extra reset of shadow_zero_bits in the nested NPT flow now that shadow_mmu_init_context computes the correct level for nested NPT. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-52-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:47 -04:00
Sean Christopherson	7cd138db5c	KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic Drop the pre-computed last_nonleaf_level, which is arguably wrong and at best confusing. Per the comment: Can have large pages at levels 2..last_nonleaf_level-1. the intent of the variable would appear to be to track what levels can _legally_ have large pages, but that intent doesn't align with reality. The computed value will be wrong for 5-level paging, or if 1gb pages are not supported. The flawed code is not a problem in practice, because except for 32-bit PSE paging, bit 7 is reserved if large pages aren't supported at the level. Take advantage of this invariant and simply omit the level magic math for 64-bit page tables (including PAE). For 32-bit paging (non-PAE), the adjustments are needed purely because bit 7 is ignored if PSE=0. Retain that logic as is, but make is_last_gpte() unique per PTTYPE so that the PSE check is avoided for PAE and EPT paging. In the spirit of avoiding branches, bump the "last nonleaf level" for 32-bit PSE paging by adding the PSE bit itself. Note, bit 7 is ignored or has other meaning in CR3/EPTP, but despite FNAME(walk_addr_generic) briefly grabbing CR3/EPTP in "pte", they are not PTEs and will blow up all the other gpte helpers. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-51-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:47 -04:00
Sean Christopherson	616007c866	KVM: x86: Enhance comments for MMU roles and nested transition trickiness Expand the comments for the MMU roles. The interactions with gfn_track PGD reuse in particular are hairy. Regarding PGD reuse, add comments in the nested virtualization flows to call out why kvm_init_mmu() is unconditionally called even when nested TDP is used. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-50-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:47 -04:00
Sean Christopherson	3b77daa5ef	KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE Replace make_spte()'s WARN on a collision with the magic MMIO value with a generic WARN on reserved bits being set (including EPT's reserved WX combination). Warning on any reserved bits covers MMIO, A/D tracking bits with PAE paging, and in theory any future goofs that are introduced. Opportunistically convert to ONCE behavior to avoid spamming the kernel log, odds are very good that if KVM screws up one SPTE, it will botch all SPTEs for the same MMU. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-49-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:46 -04:00
Sean Christopherson	961f84457c	KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU Extract the reserved SPTE check and print helpers in get_mmio_spte() to new helpers so that KVM can also WARN on reserved badness when making a SPTE. Tag the checking helper with __always_inline to improve the probability of the compiler generating optimal code for the checking loop, e.g. gcc appears to avoid using %rbp when the helper is tagged with a vanilla "inline". No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-48-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:46 -04:00
Sean Christopherson	36f267871e	KVM: x86/mmu: Use MMU's role to determine PTTYPE Use the MMU's role instead of vCPU state or role_regs to determine the PTTYPE, i.e. which helpers to wire up. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-47-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:46 -04:00
Sean Christopherson	fe660f7244	KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers Skip paging32E_init_context() and paging64_init_context_common() and go directly to paging64_init_context() (was the common version) now that the relevant flows don't need to distinguish between 64-bit PAE and 32-bit PAE for other reasons. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-46-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:46 -04:00
Sean Christopherson	f4bd6f7376	KVM: x86/mmu: Add a helper to calculate root from role_regs Add a helper to calculate the level for non-EPT page tables from the MMU's role_regs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-45-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:45 -04:00
Sean Christopherson	533f9a4b38	KVM: x86/mmu: Add helper to update paging metadata Consolidate MMU guest metadata updates into a common helper for TDP, shadow, and nested MMUs. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-44-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:45 -04:00
Sean Christopherson	af0eb17e99	KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0 Don't bother updating the bitmasks and last-leaf information if paging is disabled as the metadata will never be used. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-43-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:45 -04:00
Sean Christopherson	fa4b558802	KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls Move calls to reset_rsvds_bits_mask() out of the various mode statements and under a more generic CR0.PG=1 check. This will allow for additional code consolidation in the future. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-42-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:45 -04:00
Sean Christopherson	87e99d7d70	KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper Get LA57 from the role_regs, which are initialized from the vCPU even though TDP is enabled, instead of pulling the value directly from the vCPU when computing the guest's root_level for TDP MMUs. Note, the check is inside an is_long_mode() statement, so that requirement is not lost. Use role_regs even though the MMU's role is available and arguably "better". A future commit will consolidate the guest root level logic, and it needs access to EFER.LMA, which is not tracked in the role (it can't be toggled on VM-Exit, unlike LA57). Drop is_la57_mode() as there are no remaining users, and to discourage pulling MMU state from the vCPU (in the future). No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-41-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:45 -04:00
Sean Christopherson	5472fcd4c6	KVM: x86/mmu: Get nested MMU's root level from the MMU's role Initialize the MMU's (guest) root_level using its mmu_role instead of redoing the calculations. The role_regs used to calculate the mmu_role are initialized from the vCPU, i.e. this should be a complete nop. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-40-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:44 -04:00
Sean Christopherson	a4c93252fe	KVM: x86/mmu: Drop "nx" from MMU context now that there are no readers Drop kvm_mmu.nx as there no consumers left. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-39-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:44 -04:00
Sean Christopherson	90599c2801	KVM: x86/mmu: Use MMU's role to get EFER.NX during MMU configuration Get the MMU's effective EFER.NX from its role instead of using the one-off, dedicated flag. This will allow dropping said flag in a future commit. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-38-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:44 -04:00
Sean Christopherson	84a1622604	KVM: x86/mmu: Use MMU's role/role_regs to compute context's metadata Use the MMU's role and role_regs to calculate the MMU's guest root level and NX bit. For some flows, the vCPU state may not be correct (or relevant), e.g. EPT doesn't interact with EFER.NX and nested NPT will configure the guest_mmu with possibly-stale vCPU state. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-37-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:44 -04:00
Sean Christopherson	cd628f0f1e	KVM: x86/mmu: Use MMU's role to detect EFER.NX in guest page walk Use the NX bit from the MMU's role instead of the MMU itself so that the redundant, dedicated "nx" flag can be dropped. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-36-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:43 -04:00
Sean Christopherson	b67a93a87e	KVM: x86/mmu: Use MMU's roles to compute last non-leaf level Use the MMU's role to get CR4.PSE when determining the last level at which the guest _cannot_ create a non-leaf PTE, i.e. cannot create a huge page. Note, the existing logic is arguably wrong when considering 5-level paging and the case where 1gb pages aren't supported. In practice, the logic is confusing but not broken, because except for 32-bit non-PAE paging, bit 7 (_PAGE_PSE) bit is reserved when a huge page isn't supported at that level. I.e. setting bit 7 will terminate the guest walk one way or another. Furthermore, last_nonleaf_level is only consulted after KVM has verified there are no reserved bits set. All that confusion will be addressed in a future patch by dropping last_nonleaf_level entirely. For now, massage the code to continue the march toward using mmu_role for (almost) all MMU computations. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-35-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:43 -04:00
Sean Christopherson	2e4c06618d	KVM: x86/mmu: Use MMU's role to compute PKRU bitmask Use the MMU's role to calculate the Protection Keys (Restrict Userspace) bitmask instead of pulling bits from current vCPU state. For some flows, the vCPU state may not be correct (or relevant), e.g. EPT doesn't interact with PKRU. Case in point, the "ept" param simply disappears. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-34-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:43 -04:00
Sean Christopherson	c596f1470a	KVM: x86/mmu: Use MMU's role to compute permission bitmask Use the MMU's role to generate the permission bitmasks for the MMU. For some flows, the vCPU state may not be correct (or relevant), e.g. the nested NPT MMU can be initialized with incoherent vCPU state. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-33-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:43 -04:00
Sean Christopherson	b705a277b7	KVM: x86/mmu: Drop vCPU param from reserved bits calculator Drop the vCPU param from __reset_rsvds_bits_mask() as it's now unused, and ideally will remain unused in the future. Any information that's needed by the low level helper should be explicitly provided as it's used for both shadow/host MMUs and guest MMUs, i.e. vCPU state may be meaningless or simply wrong. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-32-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:42 -04:00
Sean Christopherson	4e9c0d80db	KVM: x86/mmu: Use MMU's role to get CR4.PSE for computing rsvd bits Use the MMU's role to get CR4.PSE when calculating reserved bits for the guest's PTEs. Practically speaking, this is a glorified nop as the role always come from vCPU state for the relevant flows, but converting to the roles will provide consistency once everything else is converted, and will Just Work if the "always comes from vCPU" behavior were ever to change (unlikely). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-31-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:42 -04:00
Sean Christopherson	8c985b2d8e	KVM: x86/mmu: Don't grab CR4.PSE for calculating shadow reserved bits Unconditionally pass pse=false when calculating reserved bits for shadow PTEs. CR4.PSE is only relevant for 32-bit non-PAE paging, which KVM does not use for shadow paging (including nested NPT). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-30-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:42 -04:00
Sean Christopherson	18db1b1790	KVM: x86/mmu: Always set new mmu_role immediately after checking old role Refactor shadow MMU initialization to immediately set its new mmu_role after verifying it differs from the old role, and so that all flavors of MMU initialization share the same check-and-set pattern. Immediately setting the role will allow future commits to use mmu_role to configure the MMU without consuming stale state. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-29-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:42 -04:00
Sean Christopherson	84c679f5f5	KVM: x86/mmu: Set CR4.PKE/LA57 in MMU role iff long mode is active Don't set cr4_pke or cr4_la57 in the MMU role if long mode isn't active, which is required for protection keys and 5-level paging to be fully enabled. Ignoring the bit avoids unnecessary reconfiguration on reuse, and also means consumers of mmu_role don't need to manually check for long mode. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-28-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:41 -04:00
Sean Christopherson	ca8d664f50	KVM: x86/mmu: Do not set paging-related bits in MMU role if CR0.PG=0 Don't set CR0/CR4/EFER bits in the MMU role if paging is disabled, paging modifiers are irrelevant if there is no paging in the first place. Somewhat arbitrarily clear gpte_is_8_bytes for shadow paging if paging is disabled in the guest. Again, there are no guest PTEs to process, so the size is meaningless. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-27-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:41 -04:00
Sean Christopherson	6066772455	KVM: x86/mmu: Add accessors to query mmu_role bits Add accessors via a builder macro for all mmu_role bits that track a CR0, CR4, or EFER bit, abstracting whether the bits are in the base or the extended role. Future commits will switch to using mmu_role instead of vCPU state to configure the MMU, i.e. there are about to be a large number of users. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-26-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:41 -04:00
Sean Christopherson	167f8a5cae	KVM: x86/mmu: Rename "nxe" role bit to "efer_nx" for macro shenanigans Rename "nxe" to "efer_nx" so that future macro magic can use the pattern <reg>_<bit> for all CR0, CR4, and EFER bits that included in the role. Using "efer_nx" also makes it clear that the role bit reflects EFER.NX, not the NX bit in the corresponding PTE. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:41 -04:00
Sean Christopherson	8626c120ba	KVM: x86/mmu: Use MMU's role_regs, not vCPU state, to compute mmu_role Use the provided role_regs to calculate the mmu_role instead of pulling bits from current vCPU state. For some flows, e.g. nested TDP, the vCPU state may not be correct (or relevant). Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-24-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:40 -04:00
Sean Christopherson	cd6767c334	KVM: x86/mmu: Ignore CR0 and CR4 bits in nested EPT MMU role Do not incorporate CR0/CR4 bits into the role for the nested EPT MMU, as EPT behavior is not influenced by CR0/CR4. Note, this is the guest_mmu, (L1's EPT), not nested_mmu (L2's IA32 paging); the nested_mmu does need CR0/CR4, and is initialized in a separate flow. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:40 -04:00
Sean Christopherson	af09897229	KVM: x86/mmu: Consolidate misc updates into shadow_mmu_init_context() Consolidate the MMU metadata update calls to deduplicate code, and to prep for future cleanup. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:40 -04:00
Sean Christopherson	594e91a100	KVM: x86/mmu: Add struct and helpers to retrieve MMU role bits from regs Introduce "struct kvm_mmu_role_regs" to hold the register state that is incorporated into the mmu_role. For nested TDP, the register state that is factored into the MMU isn't vCPU state; the dedicated struct will be used to propagate the correct state throughout the flows without having to pass multiple params, and also provides helpers for the various flag accessors. Intentionally make the new helpers cumbersome/ugly by prepending four underscores. In the not-too-distant future, it will be preferable to use the mmu_role to query bits as the mmu_role can drop irrelevant bits without creating contradictions, e.g. clearing CR4 bits when CR0.PG=0. Reserve the clean helper names (no underscores) for the mmu_role. Add a helper for vCPU conversion, which is the common case. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:40 -04:00
Sean Christopherson	d555f7057e	KVM: x86/mmu: Grab shadow root level from mmu_role for shadow MMUs Use the mmu_role to initialize shadow root level instead of assuming the level of KVM's shadow root (host) is the same as that of the guest root, or in the case of 32-bit non-PAE paging where KVM forces PAE paging. For nested NPT, the shadow root level cannot be adapted to L1's NPT root level and is instead always the TDP root level because NPT uses the current host CR0/CR4/EFER, e.g. 64-bit KVM can't drop into 32-bit PAE to shadow L1's NPT. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:39 -04:00
Sean Christopherson	16be1d1292	KVM: x86/mmu: Move nested NPT reserved bit calculation into MMU proper Move nested NPT's invocation of reset_shadow_zero_bits_mask() into the MMU proper and unexport said function. Aside from dropping an export, this is a baby step toward eliminating the call entirely by fixing the shadow_root_level confusion. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:39 -04:00
Sean Christopherson	20f632bd00	KVM: x86: Read and pass all CR0/CR4 role bits to shadow MMU helper Grab all CR0/CR4 MMU role bits from current vCPU state when initializing a non-nested shadow MMU. Extract the masks from kvm_post_set_cr{0,4}(), as the CR0/CR4 update masks must exactly match the mmu_role bits, with one exception (see below). The "full" CR0/CR4 will be used by future commits to initialize the MMU and its role, as opposed to the current approach of pulling everything from vCPU, which is incorrect for certain flows, e.g. nested NPT. CR4.LA57 is an exception, as it can be toggled on VM-Exit (for L1's MMU) but can't be toggled via MOV CR4 while long mode is active. I.e. LA57 needs to be in the mmu_role, but technically doesn't need to be checked by kvm_post_set_cr4(). However, the extra check is completely benign as the hardware restrictions simply mean LA57 will never be _the_ cause of a MMU reset during MOV CR4. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:39 -04:00
Sean Christopherson	18feaad3c6	KVM: x86/mmu: Drop smep_andnot_wp check from "uses NX" for shadow MMUs Drop the smep_andnot_wp role check from the "uses NX" calculation now that all non-nested shadow MMUs treat NX as used via the !TDP check. The shadow MMU for nested NPT, which shares the helper, does not need to deal with SMEP (or WP) as NPT walks are always "user" accesses and WP is explicitly noted as being ignored: Table walks for guest page tables are always treated as user writes at the nested page table level. A table walk for the guest page itself is always treated as a user access at the nested page table level The host hCR0.WP bit is ignored under nested paging. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:39 -04:00
Sean Christopherson	31e96bc636	KVM: nSVM: Add a comment to document why nNPT uses vmcb01, not vCPU state Add a comment in the nested NPT initialization flow to call out that it intentionally uses vmcb01 instead current vCPU state to get the effective hCR4 and hEFER for L1's NPT context. Note, despite nSVM's efforts to handle the case where vCPU state doesn't reflect L1 state, the MMU may still do the wrong thing due to pulling state from the vCPU instead of the passed in CR0/CR4/EFER values. This will be addressed in future commits. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:38 -04:00
Sean Christopherson	dbc4739b6b	KVM: x86: Fix sizes used to pass around CR0, CR4, and EFER When configuring KVM's MMU, pass CR0 and CR4 as unsigned longs, and EFER as a u64 in various flows (mostly MMU). Passing the params as u32s is functionally ok since all of the affected registers reserve bits 63:32 to zero (enforced by KVM), but it's technically wrong. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:38 -04:00
Sean Christopherson	0337f585f5	KVM: x86/mmu: Rename unsync helper and update related comments Rename mmu_need_write_protect() to mmu_try_to_unsync_pages() and update a variety of related, stale comments. Add several new comments to call out subtle details, e.g. that upper-level shadow pages are write-tracked, and that can_unsync is false iff KVM is in the process of synchronizing pages. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:38 -04:00
Sean Christopherson	479a1efc81	KVM: x86/mmu: Drop the intermediate "transient" __kvm_sync_page() Nove the kvm_unlink_unsync_page() call out of kvm_sync_page() and into it's sole caller, and fold __kvm_sync_page() into kvm_sync_page() since the latter becomes a pure pass-through. There really should be no reason for code to do a complete sync of a shadow page outside of the full kvm_mmu_sync_roots(), e.g. the one use case that creeped in turned out to be flawed and counter-productive. Drop the stale comment about @sp->gfn needing to be write-protected, as it directly contradicts the kvm_mmu_get_page() usage. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:38 -04:00
Sean Christopherson	07dc4f35a4	KVM: x86/mmu: comment on kvm_mmu_get_page's syncing of pages Explain the usage of sync_page() in kvm_mmu_get_page(), which is subtle in how and why it differs from mmu_sync_children(). Signed-off-by: Sean Christopherson <seanjc@google.com> [Split out of a different patch by Sean. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:37 -04:00
Sean Christopherson	2640b08653	KVM: x86/mmu: WARN and zap SP when sync'ing if MMU role mismatches When synchronizing a shadow page, WARN and zap the page if its mmu role isn't compatible with the current MMU context, where "compatible" is an exact match sans the bits that have no meaning in the overall MMU context or will be explicitly overwritten during the sync. Many of the helpers used by sync_page() are specific to the current context, updating a SMM vs. non-SMM shadow page would use the wrong memslots, updating L1 vs. L2 PTEs might work but would be extremely bizaree, and so on and so forth. Drop the guard with respect to 8-byte vs. 4-byte PTEs in __kvm_sync_page(), it was made useless when kvm_mmu_get_page() stopped trying to sync shadow pages irrespective of the current MMU context. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:37 -04:00
Sean Christopherson	00a669780f	KVM: x86/mmu: Use MMU role to check for matching guest page sizes Originally, __kvm_sync_page used to check the cr4_pae bit in the role to avoid zapping 4-byte kvm_mmu_pages when guest page size are 8-byte or the other way round. However, in commit `47c42e6b41` ("KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'", 2019-03-28) it was observed that this did not work for nested EPT, where the page table size would be 8 bytes even if CR4.PAE=0. (Note that the check still has to be done for nested NPT, so it is not possible to use tdp_enabled or similar). Therefore, a hack was introduced to identify nested EPT shadow pages and unconditionally call __kvm_sync_page() on them. However, it is possible to do without the hack to identify nested EPT shadow pages: if EPT is active, there will be no shadow pages in non-EPT format, and all of them will have gpte_is_8_bytes set to true; we can just check the MMU role directly, and the test will always be true. Even for non-EPT shadow MMUs, this test should really always be true now that __kvm_sync_page() is called if and only if the role is an exact match (kvm_mmu_get_page()) or is part of the current MMU context (kvm_mmu_sync_roots()). A future commit will convert the likely-pointless check into a meaningful WARN to enforce that the mmu_roles of the current context and the shadow page are compatible. Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:37 -04:00
Sean Christopherson	ddc16abbba	KVM: x86/mmu: Unconditionally zap unsync SPs when creating >4k SP at GFN When creating a new upper-level shadow page, zap unsync shadow pages at the same target gfn instead of attempting to sync the pages. This fixes a bug where an unsync shadow page could be sync'd with an incompatible context, e.g. wrong smm, is_guest, etc... flags. In practice, the bug is relatively benign as sync_page() is all but guaranteed to fail its check that the guest's desired gfn (for the to-be-sync'd page) matches the current gfn associated with the shadow page. I.e. kvm_sync_page() would end up zapping the page anyways. Alternatively, __kvm_sync_page() could be modified to explicitly verify the mmu_role of the unsync shadow page is compatible with the current MMU context. But, except for this specific case, __kvm_sync_page() is called iff the page is compatible, e.g. the transient sync in kvm_mmu_get_page() requires an exact role match, and the call from kvm_sync_mmu_roots() is only synchronizing shadow pages from the current MMU (which better be compatible or KVM has problems). And as described above, attempting to sync shadow pages when creating an upper-level shadow page is unlikely to succeed, e.g. zero successful syncs were observed when running Linux guests despite over a million attempts. Fixes: `9f1a122f97` ("KVM: MMU: allow more page become unsync at getting sp time") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-10-seanjc@google.com> [Remove WARN_ON after __kvm_sync_page. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:37 -04:00
Sean Christopherson	6c032f12dd	Revert "KVM: MMU: record maximum physical address width in kvm_mmu_extended_role" Drop MAXPHYADDR from mmu_role now that all MMUs have their role invalidated after a CPUID update. Invalidating the role forces all MMUs to re-evaluate the guest's MAXPHYADDR, and the guest's MAXPHYADDR can only be changed only through a CPUID update. This reverts commit `de3ccd26fa`. Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:36 -04:00
Sean Christopherson	63f5a1909f	KVM: x86: Alert userspace that KVM_SET_CPUID{,2} after KVM_RUN is broken Warn userspace that KVM_SET_CPUID{,2} after KVM_RUN "may" cause guest instability. Initialize last_vmentry_cpu to -1 and use it to detect if the vCPU has been run at least once when its CPUID model is changed. KVM does not correctly handle changes to paging related settings in the guest's vCPU model after KVM_RUN, e.g. MAXPHYADDR, GBPAGES, etc... KVM could theoretically zap all shadow pages, but actually making that happen is a mess due to lock inversion (vcpu->mutex is held). And even then, updating paging settings on the fly would only work if all vCPUs are stopped, updated in concert with identical settings, then restarted. To support running vCPUs with different vCPU models (that affect paging), KVM would need to track all relevant information in kvm_mmu_page_role. Note, that's the _page_ role, not the full mmu_role. Updating mmu_role isn't sufficient as a vCPU can reuse a shadow page translation that was created by a vCPU with different settings and thus completely skip the reserved bit checks (that are tied to CPUID). Tracking CPUID state in kvm_mmu_page_role is _extremely_ undesirable as it would require doubling gfn_track from a u16 to a u32, i.e. would increase KVM's memory footprint by 2 bytes for every 4kb of guest memory. E.g. MAXPHYADDR (6 bits), GBPAGES, AMD vs. INTEL = 1 bit, and SEV C-BIT would all need to be tracked. In practice, there is no remotely sane use case for changing any paging related CPUID entries on the fly, so just sweep it under the rug (after yelling at userspace). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:36 -04:00
Sean Christopherson	49c6f8756c	KVM: x86: Force all MMUs to reinitialize if guest CPUID is modified Invalidate all MMUs' roles after a CPUID update to force reinitizliation of the MMU context/helpers. Despite the efforts of commit `de3ccd26fa` ("KVM: MMU: record maximum physical address width in kvm_mmu_extended_role"), there are still a handful of CPUID-based properties that affect MMU behavior but are not incorporated into mmu_role. E.g. 1gb hugepage support, AMD vs. Intel handling of bit 8, and SEV's C-Bit location all factor into the guest's reserved PTE bits. The obvious alternative would be to add all such properties to mmu_role, but doing so provides no benefit over simply forcing a reinitialization on every CPUID update, as setting guest CPUID is a rare operation. Note, reinitializing all MMUs after a CPUID update does not fix all of KVM's woes. Specifically, kvm_mmu_page_role doesn't track the CPUID properties, which means that a vCPU can reuse shadow pages that should not exist for the new vCPU model, e.g. that map GPAs that are now illegal (due to MAXPHYADDR changes) or that set bits that are now reserved (PAGE_SIZE for 1gb pages), etc... Tracking the relevant CPUID properties in kvm_mmu_page_role would address the majority of problems, but fully tracking that much state in the shadow page role comes with an unpalatable cost as it would require a non-trivial increase in KVM's memory footprint. The GBPAGES case is even worse, as neither Intel nor AMD provides a way to disable 1gb hugepage support in the hardware page walker, i.e. it's a virtualization hole that can't be closed when using TDP. In other words, resetting the MMU after a CPUID update is largely a superficial fix. But, it will allow reverting the tracking of MAXPHYADDR in the mmu_role, and that case in particular needs to mostly work because KVM's shadow_root_level depends on guest MAXPHYADDR when 5-level paging is supported. For cases where KVM botches guest behavior, the damage is limited to that guest. But for the shadow_root_level, a misconfigured MMU can cause KVM to incorrectly access memory, e.g. due to walking off the end of its shadow page tables. Fixes: `7dcd575520` ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed") Cc: Yu Zhang <yu.c.zhang@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:36 -04:00
Sean Christopherson	f71a53d118	Revert "KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack" Restore CR4.LA57 to the mmu_role to fix an amusing edge case with nested virtualization. When KVM (L0) is using TDP, CR4.LA57 is not reflected in mmu_role.base.level because that tracks the shadow root level, i.e. TDP level. Normally, this is not an issue because LA57 can't be toggled while long mode is active, i.e. the guest has to first disable paging, then toggle LA57, then re-enable paging, thus ensuring an MMU reinitialization. But if L1 is crafty, it can load a new CR4 on VM-Exit and toggle LA57 without having to bounce through an unpaged section. L1 can also load a new CR3 on exit, i.e. it doesn't even need to play crazy paging games, a single entry PML5 is sufficient. Such shenanigans are only problematic if L0 and L1 use TDP, otherwise L1 and L2 share an MMU that gets reinitialized on nested VM-Enter/VM-Exit due to mmu_role.base.guest_mode. Note, in the L2 case with nested TDP, even though L1 can switch between L2s with different LA57 settings, thus bypassing the paging requirement, in that case KVM's nested_mmu will track LA57 in base.level. This reverts commit `8053f924ca`. Fixes: `8053f924ca` ("KVM: x86/mmu: Drop kvm_mmu_extended_role.cr4_la57 hack") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:36 -04:00
Sean Christopherson	ef318b9edf	KVM: x86/mmu: Use MMU's role to detect CR4.SMEP value in nested NPT walk Use the MMU's role to get its effective SMEP value when injecting a fault into the guest. When walking L1's (nested) NPT while L2 is active, vCPU state will reflect L2, whereas NPT uses the host's (L1 in this case) CR0, CR4, EFER, etc... If L1 and L2 have different settings for SMEP and L1 does not have EFER.NX=1, this can result in an incorrect PFEC.FETCH when injecting #NPF. Fixes: `e57d4a356a` ("KVM: Add instruction fetch checking when walking guest page table") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:35 -04:00
Sean Christopherson	0aa1837533	KVM: x86: Properly reset MMU context at vCPU RESET/INIT Reset the MMU context at vCPU INIT (and RESET for good measure) if CR0.PG was set prior to INIT. Simply re-initializing the current MMU is not sufficient as the current root HPA may not be usable in the new context. E.g. if TDP is disabled and INIT arrives while the vCPU is in long mode, KVM will fail to switch to the 32-bit pae_root and bomb on the next VM-Enter due to running with a 64-bit CR3 in 32-bit mode. This bug was papered over in both VMX and SVM, but still managed to rear its head in the MMU role on VMX. Because EFER.LMA=1 requires CR0.PG=1, kvm_calc_shadow_mmu_root_page_role() checks for EFER.LMA without first checking CR0.PG. VMX's RESET/INIT flow writes CR0 before EFER, and so an INIT with the vCPU in 64-bit mode will cause the hack-a-fix to generate the wrong MMU role. In VMX, the INIT issue is specific to running without unrestricted guest since unrestricted guest is available if and only if EPT is enabled. Commit `8668a3c468` ("KVM: VMX: Reset mmu context when entering real mode") resolved the issue by forcing a reset when entering emulated real mode. In SVM, commit `ebae871a50` ("kvm: svm: reset mmu on VCPU reset") forced a MMU reset on every INIT to workaround the flaw in common x86. Note, at the time the bug was fixed, the SVM problem was exacerbated by a complete lack of a CR4 update. The vendor resets will be reverted in future patches, primarily to aid bisection in case there are non-INIT flows that rely on the existing VMX logic. Because CR0.PG is unconditionally cleared on INIT, and because CR0.WP and all CR4/EFER paging bits are ignored if CR0.PG=0, simply checking that CR0.PG was '1' prior to INIT/RESET is sufficient to detect a required MMU context reset. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:35 -04:00
Sean Christopherson	112022bdb5	KVM: x86/mmu: Treat NX as used (not reserved) for all !TDP shadow MMUs Mark NX as being used for all non-nested shadow MMUs, as KVM will set the NX bit for huge SPTEs if the iTLB mutli-hit mitigation is enabled. Checking the mitigation itself is not sufficient as it can be toggled on at any time and KVM doesn't reset MMU contexts when that happens. KVM could reset the contexts, but that would require purging all SPTEs in all MMUs, for no real benefit. And, KVM already forces EFER.NX=1 when TDP is disabled (for WP=0, SMEP=1, NX=0), so technically NX is never reserved for shadow MMUs. Fixes: `b8e8c8303f` ("kvm: mmu: ITLB_MULTIHIT mitigation") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:35 -04:00
Sean Christopherson	f0d4379087	KVM: x86/mmu: Remove broken WARN that fires on 32-bit KVM w/ nested EPT Remove a misguided WARN that attempts to detect the scenario where using a special A/D tracking flag will set reserved bits on a non-MMIO spte. The WARN triggers false positives when using EPT with 32-bit KVM because of the !64-bit clause, which is just flat out wrong. The whole A/D tracking goo is specific to EPT, and one of the big selling points of EPT is that EPT is decoupled from the host's native paging mode. Drop the WARN instead of trying to salvage the check. Keeping a check specific to A/D tracking bits would essentially regurgitate the same code that led to KVM needed the tracking bits in the first place. A better approach would be to add a generic WARN on reserved bits being set, which would naturally cover the A/D tracking bits, work for all flavors of paging, and be self-documenting to some extent. Fixes: `8a406c8953` ("KVM: x86/mmu: Rename and document A/D scheme for TDP SPTEs") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622175739.3610207-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:35 -04:00
Jing Zhang	bc9e9e672d	KVM: debugfs: Reuse binary stats descriptors To remove code duplication, use the binary stats descriptors in the implementation of the debugfs interface for statistics. This unifies the definition of statistics for the binary and debugfs interfaces. Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-8-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:29 -04:00
Jing Zhang	ce55c04945	KVM: stats: Support binary stats retrieval for a VCPU Add a VCPU ioctl to get a statistics file descriptor by which a read functionality is provided for userspace to read out VCPU stats header, descriptors and data. Define VCPU statistics descriptors and header for all architectures. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-5-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:19 -04:00
Jing Zhang	fcfe1baedd	KVM: stats: Support binary stats retrieval for a VM Add a VM ioctl to get a statistics file descriptor by which a read functionality is provided for userspace to read out VM stats header, descriptors and data. Define VM statistics descriptors and header for all architectures. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-4-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 18:00:10 -04:00
Jing Zhang	cb082bfab5	KVM: stats: Add fd-based API to read binary stats data This commit defines the API for userspace and prepare the common functionalities to support per VM/VCPU binary stats data readings. The KVM stats now is only accessible by debugfs, which has some shortcomings this change series are supposed to fix: 1. The current debugfs stats solution in KVM could be disabled when kernel Lockdown mode is enabled, which is a potential rick for production. 2. The current debugfs stats solution in KVM is organized as "one stats per file", it is good for debugging, but not efficient for production. 3. The stats read/clear in current debugfs solution in KVM are protected by the global kvm_lock. Besides that, there are some other benefits with this change: 1. All KVM VM/VCPU stats can be read out in a bulk by one copy to userspace. 2. A schema is used to describe KVM statistics. From userspace's perspective, the KVM statistics are self-describing. 3. With the fd-based solution, a separate telemetry would be able to read KVM stats in a less privileged environment. 4. After the initial setup by reading in stats descriptors, a telemetry only needs to read the stats data itself, no more parsing or setup is needed. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> #arm64 Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-3-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 11:47:57 -04:00
Jing Zhang	0193cc908b	KVM: stats: Separate generic stats from architecture specific ones Generic KVM stats are those collected in architecture independent code or those supported by all architectures; put all generic statistics in a separate structure. This ensures that they are defined the same way in the statistics API which is being added, removing duplication among different architectures in the declaration of the descriptors. No functional change intended. Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Ricardo Koller <ricarkol@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Jing Zhang <jingzhangos@google.com> Message-Id: <20210618222709.1858088-2-jingzhangos@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 11:47:56 -04:00
Sean Christopherson	6c6e166b2c	KVM: x86/mmu: Don't WARN on a NULL shadow page in TDP MMU check Treat a NULL shadow page in the "is a TDP MMU" check as valid, non-TDP root. KVM uses a "direct" PAE paging MMU when TDP is disabled and the guest is running with paging disabled. In that case, root_hpa points at the pae_root page (of which only 32 bytes are used), not a standard shadow page, and the WARN fires (a lot). Fixes: `0b873fd7fb` ("KVM: x86/mmu: Remove redundant is_tdp_mmu_enabled check") Cc: David Matlack <dmatlack@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622072454.3449146-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 11:47:56 -04:00
Sean Christopherson	b33bb78a1f	KVM: nVMX: Handle split-lock #AC exceptions that happen in L2 Mark #ACs that won't be reinjected to the guest as wanted by L0 so that KVM handles split-lock #AC from L2 instead of forwarding the exception to L1. Split-lock #AC isn't yet virtualized, i.e. L1 will treat it like a regular #AC and do the wrong thing, e.g. reinject it into L2. Fixes: `e6f8b6c12f` ("KVM: VMX: Extend VMXs #AC interceptor to handle split lock #AC in guest") Cc: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622172244.3561540-1-seanjc@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 04:31:16 -04:00
Colin Ian King	31c6565700	KVM: x86/mmu: Fix uninitialized boolean variable flush In the case where kvm_memslots_have_rmaps(kvm) is false the boolean variable flush is not set and is uninitialized. If is_tdp_mmu_enabled(kvm) is true then the call to kvm_tdp_mmu_zap_collapsible_sptes passes the uninitialized value of flush into the call. Fix this by initializing flush to false. Addresses-Coverity: ("Uninitialized scalar variable") Fixes: `e2209710cc` ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210622150912.23429-1-colin.king@canonical.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 04:31:16 -04:00
Jim Mattson	18f63b15b0	KVM: x86: Print CPU of last attempted VM-entry when dumping VMCS/VMCB Failed VM-entry is often due to a faulty core. To help identify bad cores, print the id of the last logical processor that attempted VM-entry whenever dumping a VMCS or VMCB. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210621221648.1833148-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-24 04:31:13 -04:00
Thomas Gleixner	72a6c08c44	x86/pkru: Remove xstate fiddling from write_pkru() The PKRU value of a task is stored in task->thread.pkru when the task is scheduled out. PKRU is restored on schedule in from there. So keeping the XSAVE buffer up to date is a pointless exercise. Remove the xstate fiddling and cleanup all related functions. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121456.897372712@linutronix.de	2021-06-23 19:55:51 +02:00
Dave Hansen	784a46618f	x86/pkeys: Move read_pkru() and write_pkru() write_pkru() was originally used just to write to the PKRU register. It was mercifully short and sweet and was not out of place in pgtable.h with some other pkey-related code. But, later work included a requirement to also modify the task XSAVE buffer when updating the register. This really is more related to the XSAVE architecture than to paging. Move the read/write_pkru() to asm/pkru.h. pgtable.h won't miss them. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121455.102647114@linutronix.de	2021-06-23 18:52:57 +02:00
Thomas Gleixner	1c61fada30	x86/fpu: Rename copy_kernel_to_fpregs() to restore_fpregs_from_fpstate() This is not a copy functionality. It restores the register state from the supplied kernel buffer. No functional changes. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121454.716058365@linutronix.de	2021-06-23 18:36:42 +02:00
Thomas Gleixner	ebe7234b08	x86/fpu: Rename copy_fpregs_to_fpstate() to save_fpregs_to_fpstate() A copy is guaranteed to leave the source intact, which is not the case when FNSAVE is used as that reinitilizes the registers. Save does not make such guarantees and it matches what this is about, i.e. to save the state for a later restore. Rename it to save_fpregs_to_fpstate(). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121454.508853062@linutronix.de	2021-06-23 18:26:43 +02:00
Dave Hansen	71ef453355	x86/kvm: Avoid looking up PKRU in XSAVE buffer PKRU is being removed from the kernel XSAVE/FPU buffers. This removal will probably include warnings for code that look up PKRU in those buffers. KVM currently looks up the location of PKRU but doesn't even use the pointer that it gets back. Rework the code to avoid calling get_xsave_addr() except in cases where its result is actually used. This makes the code more clear and also avoids the inevitable PKRU warnings. This is probably a good cleanup and could go upstream idependently of any PKRU rework. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Link: https://lkml.kernel.org/r/20210623121453.541037562@linutronix.de	2021-06-23 17:49:47 +02:00
Paolo Bonzini	c3ab0e28a4	Merge branch 'topic/ppc-kvm' of https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux into HEAD - Support for the H_RPT_INVALIDATE hypercall - Conversion of Book3S entry/exit to C - Bug fixes	2021-06-23 07:30:41 -04:00
Sean Christopherson	ba1f82456b	KVM: nVMX: Dynamically compute max VMCS index for vmcs12 Calculate the max VMCS index for vmcs12 by walking the array to find the actual max index. Hardcoding the index is prone to bitrot, and the calculation is only done on KVM bringup (albeit on every CPU, but there aren't _that_ many null entries in the array). Fixes: `3c0f99366e` ("KVM: nVMX: Add a TSC multiplier field in VMCS12") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210618214658.2700765-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-21 12:58:55 -04:00
Jim Mattson	5140bc7d6b	KVM: VMX: Skip #PF(RSVD) intercepts when emulating smaller maxphyaddr As part of smaller maxphyaddr emulation, kvm needs to intercept present page faults to see if it needs to add the RSVD flag (bit 3) to the error code. However, there is no need to intercept page faults that already have the RSVD flag set. When setting up the page fault intercept, add the RSVD flag into the #PF error code mask field (but not the #PF error code match field) to skip the intercept when the RSVD flag is already set. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210618235941.1041604-1-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-21 12:58:55 -04:00
David Matlack	0485cf8dbe	KVM: x86/mmu: Remove redundant root_hpa checks The root_hpa checks below the top-level check in kvm_mmu_page_fault are theoretically redundant since there is no longer a way for the root_hpa to be reset during a page fault. The details of why are described in commit `ddce620821` ("KVM: x86/mmu: Move root_hpa validity checks to top of page fault handler") __direct_map, kvm_tdp_mmu_map, and get_mmio_spte are all only reachable through kvm_mmu_page_fault, therefore their root_hpa checks are redundant. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210617231948.2591431-5-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-18 06:45:47 -04:00
David Matlack	63c0cac938	KVM: x86/mmu: Refactor is_tdp_mmu_root into is_tdp_mmu This change simplifies the call sites slightly and also abstracts away the implementation detail of looking at root_hpa as the mechanism for determining if the mmu is the TDP MMU. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210617231948.2591431-4-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-18 06:45:46 -04:00
David Matlack	0b873fd7fb	KVM: x86/mmu: Remove redundant is_tdp_mmu_enabled check This check is redundant because the root shadow page will only be a TDP MMU page if is_tdp_mmu_enabled() returns true, and is_tdp_mmu_enabled() never changes for the lifetime of a VM. It's possible that this check was added for performance reasons but it is unlikely that it is useful in practice since to_shadow_page() is cheap. That being said, this patch also caches the return value of is_tdp_mmu_root() in direct_page_fault() since there's no reason to duplicate the call so many times, so performance is not a concern. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210617231948.2591431-3-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-18 06:45:46 -04:00
David Matlack	aa23c0ad14	KVM: x86/mmu: Remove redundant is_tdp_mmu_root check The check for is_tdp_mmu_root in kvm_tdp_mmu_map is redundant because kvm_tdp_mmu_map's only caller (direct_page_fault) already checks is_tdp_mmu_root. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Message-Id: <20210617231948.2591431-2-dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-18 06:45:46 -04:00
Paolo Bonzini	c62efff28b	KVM: x86: Stub out is_tdp_mmu_root on 32-bit hosts If is_tdp_mmu_root is not inlined, the elimination of TDP MMU calls as dead code might not work out. To avoid this, explicitly declare the stubbed is_tdp_mmu_root on 32-bit hosts. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-18 06:45:46 -04:00
Sean Christopherson	8bbed95d2c	KVM: x86: WARN and reject loading KVM if NX is supported but not enabled WARN if NX is reported as supported but not enabled in EFER. All flavors of the kernel, including non-PAE 32-bit kernels, set EFER.NX=1 if NX is supported, even if NX usage is disable via kernel command line. KVM relies on NX being enabled if it's supported, e.g. KVM will generate illegal NPT entries if nx_huge_pages is enabled and NX is supported but not enabled. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20210615164535.2146172-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-18 06:24:50 -04:00
Sean Christopherson	b26a71a1a5	KVM: SVM: Refuse to load kvm_amd if NX support is not available Refuse to load KVM if NX support is not available. Shadow paging has assumed NX support since commit `9167ab7993` ("KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active"), and NPT has assumed NX support since commit `b8e8c8303f` ("kvm: mmu: ITLB_MULTIHIT mitigation"). While the NX huge pages mitigation should not be enabled by default for AMD CPUs, it can be turned on by userspace at will. Unlike Intel CPUs, AMD does not provide a way for firmware to disable NX support, and Linux always sets EFER.NX=1 if it is supported. Given that it's extremely unlikely that a CPU supports NPT but not NX, making NX a formal requirement is far simpler than adding requirements to the mitigation flow. Fixes: `9167ab7993` ("KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active") Fixes: `b8e8c8303f` ("kvm: mmu: ITLB_MULTIHIT mitigation") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20210615164535.2146172-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-18 06:24:49 -04:00
Sean Christopherson	23f079c249	KVM: VMX: Refuse to load kvm_intel if EPT and NX are disabled Refuse to load KVM if NX support is not available and EPT is not enabled. Shadow paging has assumed NX support since commit `9167ab7993` ("KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active"), so for all intents and purposes this has been a de facto requirement for over a year. Do not require NX support if EPT is enabled purely because Intel CPUs let firmware disable NX support via MSR_IA32_MISC_ENABLES. If not for that, VMX (and KVM as a whole) could require NX support with minimal risk to breaking userspace. Fixes: `9167ab7993` ("KVM: vmx, svm: always run with EFER.NXE=1 when shadow paging is active") Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Jim Mattson <jmattson@google.com> Message-Id: <20210615164535.2146172-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-18 06:24:49 -04:00
Ingo Molnar	b2c0931a07	Merge branch 'sched/urgent' into sched/core, to resolve conflicts This commit in sched/urgent moved the cfs_rq_is_decayed() function: `a7b359fc6a`: ("sched/fair: Correctly insert cfs_rq's to list on unthrottle") and this fresh commit in sched/core modified it in the old location: `9e077b52d8`: ("sched/pelt: Check that _avg are null when _sum are") Merge the two variants. Conflicts: kernel/sched/fair.c Signed-off-by: Ingo Molnar <mingo@kernel.org>	2021-06-18 11:31:25 +02:00
Linus Torvalds	fd0aa1a456	Miscellaneous bugfixes. The main interesting one is a NULL pointer dereference reported by syzkaller ("KVM: x86: Immediately reset the MMU context when the SMM flag is cleared"). -----BEGIN PGP SIGNATURE----- iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmDLldwUHHBib256aW5p QHJlZGhhdC5jb20ACgkQv/vSX3jHroPTOgf/XpAehLdWlx2877ulcBD0Kjt0tLvH OFHRD1Ir0d1Ay3DX8qmxLquXHB4CoDGZBwi1d7AI165kUP/XLmV0bY6TZ74inI/P CaD8Bsbm8+iBl5jrovEPc+223bK+3OFOvo2pk6M/MlsO/ExRikaPDtHOnkfousbl nLX8v2qd7ihWyJOdLJMU9pV8E2iczQoCuH9yWBHdCrxRxWtPzkEekPWb0sujByiF 4tD7sqiEA3ugbF1Wm5keQV63NLplfxx+Zun0FV+tbpjjxQWAGl81dP+zmqok0sM/ qQCyZevt6jLLkL+Fn6hI6PP9OTeYreX2fgwhWXs71d2js33yNg5Veqx5Bw== =Gs/y -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm fixes from Paolo Bonzini: "Miscellaneous bugfixes. The main interesting one is a NULL pointer dereference reported by syzkaller ("KVM: x86: Immediately reset the MMU context when the SMM flag is cleared")" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: selftests: Fix kvm_check_cap() assertion KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU KVM: X86: Fix x86_emulator slab cache leak KVM: SVM: Call SEV Guest Decommission if ASID binding fails KVM: x86: Immediately reset the MMU context when the SMM flag is cleared KVM: x86: Fix fall-through warnings for Clang KVM: SVM: fix doc warnings KVM: selftests: Fix compiling errors when initializing the static structure kvm: LAPIC: Restore guard to prevent illegal APIC register access	2021-06-17 13:14:53 -07:00
Kai Huang	f1b8325508	KVM: x86/mmu: Fix TDP MMU page table level TDP MMU iterator's level is identical to page table's actual level. For instance, for the last level page table (whose entry points to one 4K page), iter->level is 1 (PG_LEVEL_4K), and in case of 5 level paging, the iter->level is mmu->shadow_root_level, which is 5. However, struct kvm_mmu_page's level currently is not set correctly when it is allocated in kvm_tdp_mmu_map(). When iterator hits non-present SPTE and needs to allocate a new child page table, currently iter->level, which is the level of the page table where the non-present SPTE belongs to, is used. This results in struct kvm_mmu_page's level always having its parent's level (excpet root table's level, which is initialized explicitly using mmu->shadow_root_level). This is kinda wrong, and not consistent with existing non TDP MMU code. Fortuantely sp->role.level is only used in handle_removed_tdp_mmu_page() and kvm_tdp_mmu_zap_sp(), and they are already aware of this and behave correctly. However to make it consistent with legacy MMU code (and fix the issue that both root page table and its child page table have shadow_root_level), use iter->level - 1 in kvm_tdp_mmu_map(), and change handle_removed_tdp_mmu_page() and kvm_tdp_mmu_zap_sp() accordingly. Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <bcb6569b6e96cb78aaa7b50640e6e6b53291a74e.1623717884.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 14:27:49 -04:00
Kai Huang	857f84743e	KVM: x86/mmu: Fix pf_fixed count in tdp_mmu_map_handle_target_level() Currently pf_fixed is not increased when prefault is true. This is not correct, since prefault here really means "async page fault completed". In that case, the original page fault from the guest was morphed into as async page fault and pf_fixed was not increased. So when prefault indicates async page fault is completed, pf_fixed should be increased. Additionally, currently pf_fixed is also increased even when page fault is spurious, while legacy MMU increases pf_fixed when page fault returns RET_PF_EMULATE or RET_PF_FIXED. To fix above two issues, change to increase pf_fixed when return value is not RET_PF_SPURIOUS (RET_PF_RETRY has already been ruled out by reaching here). More information: https://lore.kernel.org/kvm/cover.1620200410.git.kai.huang@intel.com/T/#mbb5f8083e58a2cd262231512b9211cbe70fc3bd5 Fixes: `bb18842e21` ("kvm: x86/mmu: Add TDP MMU PF handler") Reviewed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <2ea8b7f5d4f03c99b32bc56fc982e1e4e3d3fc6b.1623717884.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 14:27:16 -04:00
Kai Huang	57a3e96d6d	KVM: x86/mmu: Fix return value in tdp_mmu_map_handle_target_level() Currently tdp_mmu_map_handle_target_level() returns 0, which is RET_PF_RETRY, when page fault is actually fixed. This makes kvm_tdp_mmu_map() also return RET_PF_RETRY in this case, instead of RET_PF_FIXED. Fix by initializing ret to RET_PF_FIXED. Note that kvm_mmu_page_fault() resumes guest on both RET_PF_RETRY and RET_PF_FIXED, which means in practice returning the two won't make difference, so this fix alone won't be necessary for stable tree. Fixes: `bb18842e21` ("kvm: x86/mmu: Add TDP MMU PF handler") Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Ben Gardon <bgardon@google.com> Signed-off-by: Kai Huang <kai.huang@intel.com> Message-Id: <f9e8956223a586cd28c090879a8ff40f5eb6d609.1623717884.git.kai.huang@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 14:26:45 -04:00
Wanpeng Li	2735886c9e	KVM: LAPIC: Keep stored TMCCT register value 0 after KVM_SET_LAPIC KVM_GET_LAPIC stores the current value of TMCCT and KVM_SET_LAPIC's memcpy stores it in vcpu->arch.apic->regs, KVM_SET_LAPIC could store zero in vcpu->arch.apic->regs after it uses it, and then the stored value would always be zero. In addition, the TMCCT is always computed on-demand and never directly readable. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Message-Id: <1623223000-18116-1-git-send-email-wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 14:26:11 -04:00
Ashish Kalra	0dbb112304	KVM: X86: Introduce KVM_HC_MAP_GPA_RANGE hypercall This hypercall is used by the SEV guest to notify a change in the page encryption status to the hypervisor. The hypercall should be invoked only when the encryption attribute is changed from encrypted -> decrypted and vice versa. By default all guest pages are considered encrypted. The hypercall exits to userspace to manage the guest shared regions and integrate with the userspace VMM's migration code. Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: Borislav Petkov <bp@suse.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Reviewed-by: Steve Rutherford <srutherford@google.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Ashish Kalra <ashish.kalra@amd.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Co-developed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <90778988e1ee01926ff9cac447aacb745f954c8c.1623174621.git.ashish.kalra@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 14:25:39 -04:00
Sean Christopherson	ade74e1433	KVM: x86/mmu: Grab nx_lpage_splits as an unsigned long before division Snapshot kvm->stats.nx_lpage_splits into a local unsigned long to avoid 64-bit division on 32-bit kernels. Casting to an unsigned long is safe because the maximum number of shadow pages, n_max_mmu_pages, is also an unsigned long, i.e. KVM will start recycling shadow pages before the number of splits can exceed a 32-bit value. ERROR: modpost: "__udivdi3" [arch/x86/kvm/kvm.ko] undefined! Fixes: 7ee093d4f3f5 ("KVM: switch per-VM stats to u64") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210615162905.2132937-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:10:18 -04:00
Vitaly Kuznetsov	bca66dbcd2	KVM: x86: Check for pending interrupts when APICv is getting disabled When APICv is active, interrupt injection doesn't raise KVM_REQ_EVENT request (see __apic_accept_irq()) as the required work is done by hardware. In case KVM_REQ_APICV_UPDATE collides with such injection, the interrupt may never get delivered. Currently, the described situation is hardly possible: all kvm_request_apicv_update() calls normally happen upon VM creation when no interrupts are pending. We are, however, going to move unconditional kvm_request_apicv_update() call from kvm_hv_activate_synic() to synic_update_vector() and without this fix 'hyperv_connections' test from kvm-unit-tests gets stuck on IPI delivery attempt right after configuring a SynIC route which triggers APICv disablement. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210609150911.1471882-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:55 -04:00
Sean Christopherson	c5ffd408cd	KVM: nVMX: Drop redundant checks on vmcs12 in EPTP switching emulation Drop the explicit check on EPTP switching being enabled. The EPTP switching check is handled in the generic VMFUNC function check, while the underlying VMFUNC enablement check is done by hardware and redone by generic VMFUNC emulation. The vmcs12 EPT check is handled by KVM at VM-Enter in the form of a consistency check, keep it but add a WARN. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:54 -04:00
Sean Christopherson	546e8398bc	KVM: nVMX: WARN if subtly-impossible VMFUNC conditions occur WARN and inject #UD when emulating VMFUNC for L2 if the function is out-of-bounds or if VMFUNC is not enabled in vmcs12. Neither condition should occur in practice, as the CPU is supposed to prioritize the #UD over VM-Exit for out-of-bounds input and KVM is supposed to enable VMFUNC in vmcs02 if and only if it's enabled in vmcs12, but neither of those dependencies is obvious. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:54 -04:00
Sean Christopherson	c906066288	KVM: x86: Drop pointless @reset_roots from kvm_init_mmu() Remove the @reset_roots param from kvm_init_mmu(), the one user, kvm_mmu_reset_context() has already unloaded the MMU and thus freed and invalidated all roots. This also happens to be why the reset_roots=true paths doesn't leak roots; they're already invalid. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-14-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:54 -04:00
Sean Christopherson	e62f1aa8b9	KVM: x86: Defer MMU sync on PCID invalidation Defer the MMU sync on PCID invalidation so that multiple sync requests in a single VM-Exit are batched. This is a very minor optimization as checking for unsync'd children is quite cheap. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-13-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:54 -04:00
Sean Christopherson	39353ab579	KVM: nVMX: Use fast PGD switch when emulating VMFUNC[EPTP_SWITCH] Use __kvm_mmu_new_pgd() via kvm_init_shadow_ept_mmu() to emulate VMFUNC[EPTP_SWITCH] instead of nuking all MMUs. EPTP_SWITCH is the EPT equivalent of MOV to CR3, i.e. is a perfect fit for the common PGD flow, the only hiccup being that A/D enabling is buried in the EPTP. But, that is easily handled by bouncing through kvm_init_shadow_ept_mmu(). Explicitly request a guest TLB flush if VPID is disabled. Per Intel's SDM, if VPID is disabled, "an EPTP-switching VMFUNC invalidates combined mappings associated with VPID 0000H (for all PCIDs and for all EP4TA values, where EP4TA is the value of bits 51:12 of EPTP)". Note, this technically is a very bizarre bug fix of sorts if L2 is using PAE paging, as avoiding the full MMU reload also avoids incorrectly reloading the PDPTEs, which the SDM explicitly states are not touched: If PAE paging is in use, an EPTP-switching VMFUNC does not load the four page-directory-pointer-table entries (PDPTEs) from the guest-physical address in CR3. The logical processor continues to use the four guest-physical addresses already present in the PDPTEs. The guest-physical address in CR3 is not translated through the new EPT paging structures (until some operation that would load the PDPTEs). In addition to optimizing L2's MMU shenanigans, avoiding the full reload also optimizes L1's MMU as KVM_REQ_MMU_RELOAD wipes out all roots in both root_mmu and guest_mmu. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:53 -04:00
Sean Christopherson	28f28d453f	KVM: x86: Use KVM_REQ_TLB_FLUSH_GUEST to handle INVPCID(ALL) emulation Use KVM_REQ_TLB_FLUSH_GUEST instead of KVM_REQ_MMU_RELOAD when emulating INVPCID of all contexts. In the current code, this is a glorified nop as TLB_FLUSH_GUEST becomes kvm_mmu_unload(), same as MMU_RELOAD, when TDP is disabled, which is the only time INVPCID is only intercepted+emulated. In the future, reusing TLB_FLUSH_GUEST will simplify optimizing paths that emulate a guest TLB flush, e.g. by synchronizing as needed instead of completely unloading all MMUs. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-11-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:53 -04:00
Sean Christopherson	25b62c6274	KVM: nVMX: Free only guest_mode (L2) roots on INVVPID w/o EPT When emulating INVVPID for L1, free only L2+ roots, using the guest_mode tag in the MMU role to identify L2+ roots. From L1's perspective, its own TLB entries use VPID=0, and INVVPID is not requied to invalidate such entries. Per Intel's SDM, INVVPID _may_ invalidate entries with VPID=0, but it is not required to do so. Cc: Lai Jiangshan <laijs@linux.alibaba.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:53 -04:00
Sean Christopherson	50a417962a	KVM: nVMX: Consolidate VM-Enter/VM-Exit TLB flush and MMU sync logic Drop the dedicated nested_vmx_transition_mmu_sync() now that the MMU sync is handled via KVM_REQ_TLB_FLUSH_GUEST, and fold that flush into the all-encompassing nested_vmx_transition_tlb_flush(). Opportunistically add a comment explaning why nested EPT never needs to sync the MMU on VM-Enter. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:53 -04:00
Sean Christopherson	b512910039	KVM: x86: Drop skip MMU sync and TLB flush params from "new PGD" helpers Drop skip_mmu_sync and skip_tlb_flush from __kvm_mmu_new_pgd() now that all call sites unconditionally skip both the sync and flush. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:52 -04:00
Sean Christopherson	d2e5601907	KVM: nSVM: Move TLB flushing logic (or lack thereof) to dedicated helper Introduce nested_svm_transition_tlb_flush() and use it force an MMU sync and TLB flush on nSVM VM-Enter and VM-Exit instead of sneaking the logic into the __kvm_mmu_new_pgd() call sites. Add a partial todo list to document issues that need to be addressed before the unconditional sync and flush can be modified to look more like nVMX's logic. In addition to making nSVM's forced flushing more overt (guess who keeps losing track of it), the new helper brings further convergence between nSVM and nVMX, and also sets the stage for dropping the "skip" params from __kvm_mmu_new_pgd(). Cc: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:52 -04:00
Sean Christopherson	415b1a0105	KVM: x86: Uncondtionally skip MMU sync/TLB flush in MOV CR3's PGD switch Stop leveraging the MMU sync and TLB flush requested by the fast PGD switch helper now that kvm_set_cr3() manually handles the necessary sync, frees, and TLB flush. This will allow dropping the params from the fast PGD helpers since nested SVM is now the odd blob out. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:52 -04:00
Sean Christopherson	21823fbda5	KVM: x86: Invalidate all PGDs for the current PCID on MOV CR3 w/ flush Flush and sync all PGDs for the current/target PCID on MOV CR3 with a TLB flush, i.e. without PCID_NOFLUSH set. Paraphrasing Intel's SDM regarding the behavior of MOV to CR3: - If CR4.PCIDE = 0, invalidates all TLB entries associated with PCID 000H and all entries in all paging-structure caches associated with PCID 000H. - If CR4.PCIDE = 1 and NOFLUSH=0, invalidates all TLB entries associated with the PCID specified in bits 11:0, and all entries in all paging-structure caches associated with that PCID. It is not required to invalidate entries in the TLBs and paging-structure caches that are associated with other PCIDs. - If CR4.PCIDE=1 and NOFLUSH=1, is not required to invalidate any TLB entries or entries in paging-structure caches. Extract and reuse the logic for INVPCID(single) which is effectively the same flow and works even if CR4.PCIDE=0, as the current PCID will be '0' in that case, thus honoring the requirement of flushing PCID=0. Continue passing skip_tlb_flush to kvm_mmu_new_pgd() even though it _should_ be redundant; the clean up will be done in a future patch. The overhead of an unnecessary nop sync is minimal (especially compared to the actual sync), and the TLB flush is handled via request. Avoiding the the negligible overhead is not worth the risk of breaking kernels that backport the fix. Fixes: `956bf3531f` ("kvm: x86: Skip shadow page resync on CR3 switch when indicated by guest") Cc: Junaid Shahid <junaids@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:52 -04:00
Sean Christopherson	272b0a998d	KVM: nVMX: Don't clobber nested MMU's A/D status on EPTP switch Drop bogus logic that incorrectly clobbers the accessed/dirty enabling status of the nested MMU on an EPTP switch. When nested EPT is enabled, walk_mmu points at L2's _legacy_ page tables, not L1's EPT for L2. This is likely a benign bug, as mmu->ept_ad is never consumed (since the MMU is not a nested EPT MMU), and stuffing mmu_role.base.ad_disabled will never propagate into future shadow pages since the nested MMU isn't used to map anything, just to walk L2's page tables. Note, KVM also does a full MMU reload, i.e. the guest_mmu will be recreated using the new EPTP, and thus any change in A/D enabling will be properly recognized in the relevant MMU. Fixes: `41ab937274` ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:51 -04:00
Sean Christopherson	0e75225dfa	KVM: nVMX: Ensure 64-bit shift when checking VMFUNC bitmap Use BIT_ULL() instead of an open-coded shift to check whether or not a function is enabled in L1's VMFUNC bitmap. This is a benign bug as KVM supports only bit 0, and will fail VM-Enter if any other bits are set, i.e. bits 63:32 are guaranteed to be zero. Note, "function" is bounded by hardware as VMFUNC will #UD before taking a VM-Exit if the function is greater than 63. Before: if ((vmcs12->vm_function_control & (1 << function)) == 0) 0x000000000001a916 <+118>: mov $0x1,%eax 0x000000000001a91b <+123>: shl %cl,%eax 0x000000000001a91d <+125>: cltq 0x000000000001a91f <+127>: and 0x128(%rbx),%rax After: if (!(vmcs12->vm_function_control & BIT_ULL(function & 63))) 0x000000000001a955 <+117>: mov 0x128(%rbx),%rdx 0x000000000001a95c <+124>: bt %rax,%rdx Fixes: `27c42a1bb8` ("KVM: nVMX: Enable VMFUNC for the L1 hypervisor") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:51 -04:00
Sean Christopherson	07ffaf343e	KVM: nVMX: Sync all PGDs on nested transition with shadow paging Trigger a full TLB flush on behalf of the guest on nested VM-Enter and VM-Exit when VPID is disabled for L2. kvm_mmu_new_pgd() syncs only the current PGD, which can theoretically leave stale, unsync'd entries in a previous guest PGD, which could be consumed if L2 is allowed to load CR3 with PCID_NOFLUSH=1. Rename KVM_REQ_HV_TLB_FLUSH to KVM_REQ_TLB_FLUSH_GUEST so that it can be utilized for its obvious purpose of emulating a guest TLB flush. Note, there is no change the actual TLB flush executed by KVM, even though the fast PGD switch uses KVM_REQ_TLB_FLUSH_CURRENT. When VPID is disabled for L2, vpid02 is guaranteed to be '0', and thus nested_get_vpid02() will return the VPID that is shared by L1 and L2. Generate the request outside of kvm_mmu_new_pgd(), as getting the common helper to correctly identify which requested is needed is quite painful. E.g. using KVM_REQ_TLB_FLUSH_GUEST when nested EPT is in play is wrong as a TLB flush from the L1 kernel's perspective does not invalidate EPT mappings. And, by using KVM_REQ_TLB_FLUSH_GUEST, nVMX can do future simplification by moving the logic into nested_vmx_transition_tlb_flush(). Fixes: `41fab65e7c` ("KVM: nVMX: Skip MMU sync on nested VMX transition when possible") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609234235.1244004-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:51 -04:00
Vitaly Kuznetsov	8629b625e0	KVM: nVMX: Request to sync eVMCS from VMCS12 after migration VMCS12 is used to keep the authoritative state during nested state migration. In case 'need_vmcs12_to_shadow_sync' flag is set, we're in between L2->L1 vmexit and L1 guest run when actual sync to enlightened (or shadow) VMCS happens. Nested state, however, has no flag for 'need_vmcs12_to_shadow_sync' so vmx_set_nested_state()-> set_current_vmptr() always sets it. Enlightened vmptrld path, however, doesn't have the quirk so some VMCS12 changes may not get properly reflected to eVMCS and L1 will see an incorrect state. Note, during L2 execution or when need_vmcs12_to_shadow_sync is not set the change is effectively a nop: in the former case all changes will get reflected during the first L2->L1 vmexit and in the later case VMCS12 and eVMCS are already in sync (thanks to copy_enlightened_to_vmcs12() in vmx_get_nested_state()). Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-11-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov	dc31338552	KVM: nVMX: Reset eVMCS clean fields data from prepare_vmcs02() When nested state migration happens during L1's execution, it is incorrect to modify eVMCS as it is L1 who 'owns' it at the moment. At least genuine Hyper-V seems to not be very happy when 'clean fields' data changes underneath it. 'Clean fields' data is used in KVM twice: by copy_enlightened_to_vmcs12() and prepare_vmcs02_rare() so we can reset it from prepare_vmcs02() instead. While at it, update a comment stating why exactly we need to reset 'hv_clean_fields' data from L0. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-10-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov	b7685cfd5e	KVM: nVMX: Force enlightened VMCS sync from nested_vmx_failValid() 'need_vmcs12_to_shadow_sync' is used for both shadow and enlightened VMCS sync when we exit to L1. The comment in nested_vmx_failValid() validly states why shadow vmcs sync can be omitted but this doesn't apply to enlightened VMCS as it 'shadows' all VMCS12 fields. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-9-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:50 -04:00
Vitaly Kuznetsov	d6bf71a18c	KVM: nVMX: Ignore 'hv_clean_fields' data when eVMCS data is copied in vmx_get_nested_state() 'Clean fields' data from enlightened VMCS is only valid upon vmentry: L1 hypervisor is not obliged to keep it up-to-date while it is mangling L2's state, KVM_GET_NESTED_STATE request may come at a wrong moment when actual eVMCS changes are unsynchronized with 'hv_clean_fields'. As upon migration VMCS12 is used as a source of ultimate truth, we must make sure we pick all the changes to eVMCS and thus 'clean fields' data must be ignored. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov	3b19b81acf	KVM: nVMX: Release enlightened VMCS on VMCLEAR Unlike VMREAD/VMWRITE/VMPTRLD, VMCLEAR is a valid instruction when enlightened VMCS is in use. TLFS has the following brief description: "The L1 hypervisor can execute a VMCLEAR instruction to transition an enlightened VMCS from the active to the non-active state". Normally, this change can be ignored as unmapping active eVMCS can be postponed until the next VMLAUNCH instruction but in case nested state is migrated with KVM_GET_NESTED_STATE/KVM_SET_NESTED_STATE, keeping eVMCS mapped may result in its synchronization with VMCS12 and this is incorrect: L1 hypervisor is free to reuse inactive eVMCS memory for something else. Inactive eVMCS after VMCLEAR can just be unmapped. Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-7-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov	278499686b	KVM: nVMX: Introduce 'EVMPTR_MAP_PENDING' post-migration state Unlike regular set_current_vmptr(), nested_vmx_handle_enlightened_vmptrld() can not be called directly from vmx_set_nested_state() as KVM may not have all the information yet (e.g. HV_X64_MSR_VP_ASSIST_PAGE MSR may not be restored yet). Enlightened VMCS is mapped later while getting nested state pages. In the meantime, vmx->nested.hv_evmcs_vmptr remains 'EVMPTR_INVALID' and it's indistinguishable from 'evmcs is not in use' case. This leads to certain issues, in particular, if KVM_GET_NESTED_STATE is called right after KVM_SET_NESTED_STATE, KVM_STATE_NESTED_EVMCS flag in the resulting state will be unset (and such state will later fail to load). Introduce 'EVMPTR_MAP_PENDING' state to detect not-yet-mapped eVMCS after restore. With this, the 'is_guest_mode(vcpu)' hack in vmx_has_valid_vmcs12() is no longer needed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov	25641cafab	KVM: nVMX: Make copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() return 'void' copy_vmcs12_to_enlightened()/copy_enlightened_to_vmcs12() don't return any result, make them return 'void'. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:49 -04:00
Vitaly Kuznetsov	0276171680	KVM: nVMX: Release eVMCS when enlightened VMENTRY was disabled In theory, L1 can try to disable enlightened VMENTRY in VP assist page and try to issue VMLAUNCH/VMRESUME. While nested_vmx_handle_enlightened_vmptrld() properly handles this as 'EVMPTRLD_DISABLED', previously mapped eVMCS remains mapped and thus all evmptr_is_valid() checks will still pass and nested_vmx_run() will proceed when it shouldn't. Release eVMCS immediately when we detect that enlightened vmentry was disabled by L1. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:48 -04:00
Vitaly Kuznetsov	6a789ca5d5	KVM: nVMX: Don't set 'dirty_vmcs12' flag on enlightened VMPTRLD 'dirty_vmcs12' is only checked in prepare_vmcs02_early()/prepare_vmcs02() and both checks look like: 'vmx->nested.dirty_vmcs12 \|\| evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)' so for eVMCS case the flag changes nothing. Drop the assignment to avoid the confusion. No functional change intended. Reported-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:48 -04:00
Vitaly Kuznetsov	1e9dfbd748	KVM: nVMX: Use '-1' in 'hv_evmcs_vmptr' to indicate that eVMCS is not in use Instead of checking 'vmx->nested.hv_evmcs' use '-1' in 'vmx->nested.hv_evmcs_vmptr' to indicate 'evmcs is not in use' state. This matches how we check 'vmx->nested.current_vmptr'. Introduce EVMPTR_INVALID and evmptr_is_valid() and use it instead of raw '-1' check as a preparation to adding other 'special' values. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210526132026.270394-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:48 -04:00
Maxim Levitsky	158a48ecf7	KVM: x86: avoid loading PDPTRs after migration when possible if new KVM_*_SREGS2 ioctls are used, the PDPTRs are a part of the migration state and are correctly restored by those ioctls. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210607090203.133058-9-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:48 -04:00
Maxim Levitsky	6dba940352	KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2 This is a new version of KVM_GET_SREGS / KVM_SET_SREGS. It has the following changes: * Has flags for future extensions * Has vcpu's PDPTRs, allowing to save/restore them on migration. * Lacks obsolete interrupt bitmap (done now via KVM_SET_VCPU_EVENTS) New capability, KVM_CAP_SREGS2 is added to signal the userspace of this ioctl. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210607090203.133058-8-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:47 -04:00
Maxim Levitsky	329675dde9	KVM: x86: introduce kvm_register_clear_available Small refactoring that will be used in the next patch. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210607090203.133058-7-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:47 -04:00
Maxim Levitsky	0f85722341	KVM: nVMX: delay loading of PDPTRs to KVM_REQ_GET_NESTED_STATE_PAGES Similar to the rest of guest page accesses after a migration, this access should be delayed to KVM_REQ_GET_NESTED_STATE_PAGES. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210607090203.133058-6-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:47 -04:00
Maxim Levitsky	b222b0b881	KVM: nSVM: refactor the CR3 reload on migration Document the actual reason why we need to do it on migration and move the call to svm_set_nested_state to be closer to VMX code. To avoid loading the PDPTRs from possibly not up to date memory map, in nested_svm_load_cr3 after the move, move this code to .get_nested_state_pages. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210607090203.133058-5-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:47 -04:00
Sean Christopherson	c7313155bf	KVM: x86: Always load PDPTRs on CR3 load for SVM w/o NPT and a PAE guest Kill off pdptrs_changed() and instead go through the full kvm_set_cr3() for PAE guest, even if the new CR3 is the same as the current CR3. For VMX, and SVM with NPT enabled, the PDPTRs are unconditionally marked as unavailable after VM-Exit, i.e. the optimization is dead code except for SVM without NPT. In the unlikely scenario that anyone cares about SVM without NPT _and_ a PAE guest, they've got bigger problems if their guest is loading the same CR3 so frequently that the performance of kvm_set_cr3() is notable, especially since KVM's fast PGD switching means reloading the same CR3 does not require a full rebuild. Given that PAE and PCID are mutually exclusive, i.e. a sync and flush are guaranteed in any case, the actual benefits of the pdptrs_changed() optimization are marginal at best. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210607090203.133058-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:46 -04:00
Sean Christopherson	a36dbec67e	KVM: nSVM: Drop pointless pdptrs_changed() check on nested transition Remove the "PDPTRs unchanged" check to skip PDPTR loading during nested SVM transitions as it's not at all an optimization. Reading guest memory to get the PDPTRs isn't magically cheaper by doing it in pdptrs_changed(), and if the PDPTRs did change, KVM will end up doing the read twice. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210607090203.133058-3-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:46 -04:00
Sean Christopherson	bcb72d0627	KVM: nVMX: Drop obsolete (and pointless) pdptrs_changed() check Remove the pdptrs_changed() check when loading L2's CR3. The set of available registers is always reset when switching VMCSes (see commit `e5d03de593`, "KVM: nVMX: Reset register cache (available and dirty masks) on VMCS switch"), thus the "are PDPTRs available" check will always fail. And even if it didn't fail, reading guest memory to check the PDPTRs is just as expensive as reading guest memory to load 'em. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210607090203.133058-2-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:46 -04:00
Vitaly Kuznetsov	445caed021	KVM: x86: hyper-v: Honor HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED bit Hypercalls which use extended processor masks are only available when HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED privilege bit is exposed (and 'RECOMMENDED' is rather a misnomer). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-28-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:44 -04:00
Vitaly Kuznetsov	d264eb3c14	KVM: x86: hyper-v: Honor HV_X64_CLUSTER_IPI_RECOMMENDED bit Hyper-V partition must possess 'HV_X64_CLUSTER_IPI_RECOMMENDED' privilege ('recommended' is rather a misnomer) to issue HVCALL_SEND_IPI hypercalls. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-27-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:44 -04:00
Vitaly Kuznetsov	bb53ecb4d6	KVM: x86: hyper-v: Honor HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED bit Hyper-V partition must possess 'HV_X64_REMOTE_TLB_FLUSH_RECOMMENDED' privilege ('recommended' is rather a misnomer) to issue HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST/SPACE hypercalls. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-26-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:44 -04:00
Vitaly Kuznetsov	a921cf83cc	KVM: x86: hyper-v: Honor HV_DEBUGGING privilege bit Hyper-V partition must possess 'HV_DEBUGGING' privilege to issue HVCALL_POST_DEBUG_DATA/HVCALL_RETRIEVE_DEBUG_DATA/ HVCALL_RESET_DEBUG_SESSION hypercalls. Note, when SynDBG is disabled hv_check_hypercall_access() returns 'true' (like for any other unknown hypercall) so the result will be HV_STATUS_INVALID_HYPERCALL_CODE and not HV_STATUS_ACCESS_DENIED. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-25-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:44 -04:00
Vitaly Kuznetsov	a60b3c594e	KVM: x86: hyper-v: Honor HV_SIGNAL_EVENTS privilege bit Hyper-V partition must possess 'HV_SIGNAL_EVENTS' privilege to issue HVCALL_SIGNAL_EVENT hypercalls. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-24-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:43 -04:00
Vitaly Kuznetsov	4f532b7f96	KVM: x86: hyper-v: Honor HV_POST_MESSAGES privilege bit Hyper-V partition must possess 'HV_POST_MESSAGES' privilege to issue HVCALL_POST_MESSAGE hypercalls. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-23-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:43 -04:00
Vitaly Kuznetsov	34ef7d7b9c	KVM: x86: hyper-v: Check access to HVCALL_NOTIFY_LONG_SPIN_WAIT hypercall TLFS6.0b states that partition issuing HVCALL_NOTIFY_LONG_SPIN_WAIT must posess 'UseHypercallForLongSpinWait' privilege but there's no corresponding feature bit. Instead, we have "Recommended number of attempts to retry a spinlock failure before notifying the hypervisor about the failures. 0xFFFFFFFF indicates never notify." Use this to check access to the hypercall. Also, check against zero as the corresponding CPUID must be set (and '0' attempts before re-try is weird anyway). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-22-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:43 -04:00
Vitaly Kuznetsov	4ad81a9111	KVM: x86: hyper-v: Prepare to check access to Hyper-V hypercalls Introduce hv_check_hypercallr_access() to check if the particular hypercall should be available to guest, this will be used with KVM_CAP_HYPERV_ENFORCE_CPUID mode. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-21-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:43 -04:00
Vitaly Kuznetsov	1aa8a4184d	KVM: x86: hyper-v: Honor HV_STIMER_DIRECT_MODE_AVAILABLE privilege bit Synthetic timers can only be configured in 'direct' mode when HV_STIMER_DIRECT_MODE_AVAILABLE bit was exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-20-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:42 -04:00
Vitaly Kuznetsov	d66bfa36f9	KVM: x86: hyper-v: Inverse the default in hv_check_msr_access() Access to all MSRs is now properly checked. To avoid 'forgetting' to properly check access to new MSRs in the future change the default to 'false' meaning 'no access'. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-19-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:42 -04:00
Vitaly Kuznetsov	17b6d51771	KVM: x86: hyper-v: Honor HV_FEATURE_DEBUG_MSRS_AVAILABLE privilege bit Synthetic debugging MSRs (HV_X64_MSR_SYNDBG_CONTROL, HV_X64_MSR_SYNDBG_STATUS, HV_X64_MSR_SYNDBG_SEND_BUFFER, HV_X64_MSR_SYNDBG_RECV_BUFFER, HV_X64_MSR_SYNDBG_PENDING_BUFFER, HV_X64_MSR_SYNDBG_OPTIONS) are only available to guest when HV_FEATURE_DEBUG_MSRS_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-18-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:42 -04:00
Vitaly Kuznetsov	0a19c8992d	KVM: x86: hyper-v: Honor HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE privilege bit HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4, HV_X64_MSR_CRASH_CTL are only available to guest when HV_FEATURE_GUEST_CRASH_MSR_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-17-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:42 -04:00
Vitaly Kuznetsov	234d01baec	KVM: x86: hyper-v: Honor HV_ACCESS_REENLIGHTENMENT privilege bit HV_X64_MSR_REENLIGHTENMENT_CONTROL/HV_X64_MSR_TSC_EMULATION_CONTROL/ HV_X64_MSR_TSC_EMULATION_STATUS are only available to guest when HV_ACCESS_REENLIGHTENMENT bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-16-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:41 -04:00
Vitaly Kuznetsov	9442f3bd90	KVM: x86: hyper-v: Honor HV_ACCESS_FREQUENCY_MSRS privilege bit HV_X64_MSR_TSC_FREQUENCY/HV_X64_MSR_APIC_FREQUENCY are only available to guest when HV_ACCESS_FREQUENCY_MSRS bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-15-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:41 -04:00
Vitaly Kuznetsov	978b57475c	KVM: x86: hyper-v: Honor HV_MSR_APIC_ACCESS_AVAILABLE privilege bit HV_X64_MSR_EOI, HV_X64_MSR_ICR, HV_X64_MSR_TPR, and HV_X64_MSR_VP_ASSIST_PAGE are only available to guest when HV_MSR_APIC_ACCESS_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-14-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:41 -04:00
Vitaly Kuznetsov	eba60ddae7	KVM: x86: hyper-v: Honor HV_MSR_SYNTIMER_AVAILABLE privilege bit Synthetic timers MSRs (HV_X64_MSR_STIMER[0-3]_CONFIG, HV_X64_MSR_STIMER[0-3]_COUNT) are only available to guest when HV_MSR_SYNTIMER_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-13-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:41 -04:00
Vitaly Kuznetsov	9e2715ca20	KVM: x86: hyper-v: Honor HV_MSR_SYNIC_AVAILABLE privilege bit SynIC MSRs (HV_X64_MSR_SCONTROL, HV_X64_MSR_SVERSION, HV_X64_MSR_SIEFP, HV_X64_MSR_SIMP, HV_X64_MSR_EOM, HV_X64_MSR_SINT0 ... HV_X64_MSR_SINT15) are only available to guest when HV_MSR_SYNIC_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-12-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:40 -04:00
Vitaly Kuznetsov	a1ec661c3f	KVM: x86: hyper-v: Honor HV_MSR_REFERENCE_TSC_AVAILABLE privilege bit HV_X64_MSR_REFERENCE_TSC is only available to guest when HV_MSR_REFERENCE_TSC_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-11-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:40 -04:00
Vitaly Kuznetsov	679008e4bb	KVM: x86: hyper-v: Honor HV_MSR_RESET_AVAILABLE privilege bit HV_X64_MSR_RESET is only available to guest when HV_MSR_RESET_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-10-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:40 -04:00
Vitaly Kuznetsov	d2ac25d419	KVM: x86: hyper-v: Honor HV_MSR_VP_INDEX_AVAILABLE privilege bit HV_X64_MSR_VP_INDEX is only available to guest when HV_MSR_VP_INDEX_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-9-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:40 -04:00
Vitaly Kuznetsov	c2b32867f2	KVM: x86: hyper-v: Honor HV_MSR_TIME_REF_COUNT_AVAILABLE privilege bit HV_X64_MSR_TIME_REF_COUNT is only available to guest when HV_MSR_TIME_REF_COUNT_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-8-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:39 -04:00
Vitaly Kuznetsov	b80a92ff81	KVM: x86: hyper-v: Honor HV_MSR_VP_RUNTIME_AVAILABLE privilege bit HV_X64_MSR_VP_RUNTIME is only available to guest when HV_MSR_VP_RUNTIME_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-7-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:39 -04:00
Vitaly Kuznetsov	1561c2cb87	KVM: x86: hyper-v: Honor HV_MSR_HYPERCALL_AVAILABLE privilege bit HV_X64_MSR_GUEST_OS_ID/HV_X64_MSR_HYPERCALL are only available to guest when HV_MSR_HYPERCALL_AVAILABLE bit is exposed. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-6-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:39 -04:00
Vitaly Kuznetsov	b4128000e2	KVM: x86: hyper-v: Prepare to check access to Hyper-V MSRs Introduce hv_check_msr_access() to check if the particular MSR should be accessible by guest, this will be used with KVM_CAP_HYPERV_ENFORCE_CPUID mode. No functional change intended. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-5-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:39 -04:00
Vitaly Kuznetsov	10d7bf1e46	KVM: x86: hyper-v: Cache guest CPUID leaves determining features availability Limiting exposed Hyper-V features requires a fast way to check if the particular feature is exposed in guest visible CPUIDs or not. To aboid looping through all CPUID entries on every hypercall/MSR access cache the required leaves on CPUID update. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-4-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:38 -04:00
Vitaly Kuznetsov	644f706719	KVM: x86: hyper-v: Introduce KVM_CAP_HYPERV_ENFORCE_CPUID Modeled after KVM_CAP_ENFORCE_PV_FEATURE_CPUID, the new capability allows for limiting Hyper-V features to those exposed to the guest in Hyper-V CPUIDs (0x40000003, 0x40000004, ...). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20210521095204.2161214-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:38 -04:00
Vineeth Pillai	1183646a67	KVM: SVM: hyper-v: Direct Virtual Flush support From Hyper-V TLFS: "The hypervisor exposes hypercalls (HvFlushVirtualAddressSpace, HvFlushVirtualAddressSpaceEx, HvFlushVirtualAddressList, and HvFlushVirtualAddressListEx) that allow operating systems to more efficiently manage the virtual TLB. The L1 hypervisor can choose to allow its guest to use those hypercalls and delegate the responsibility to handle them to the L0 hypervisor. This requires the use of a partition assist page." Add the Direct Virtual Flush support for SVM. Related VMX changes: commit `6f6a657c99` ("KVM/Hyper-V/VMX: Add direct tlb flush support") Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com> Message-Id: <fc8d24d8eb7017266bb961e39a171b0caf298d7f.1622730232.git.viremana@linux.microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:38 -04:00
Vineeth Pillai	c4327f15df	KVM: SVM: hyper-v: Enlightened MSR-Bitmap support Enlightened MSR-Bitmap as per TLFS: "The L1 hypervisor may collaborate with the L0 hypervisor to make MSR accesses more efficient. It can enable enlightened MSR bitmaps by setting the corresponding field in the enlightened VMCS to 1. When enabled, L0 hypervisor does not monitor the MSR bitmaps for changes. Instead, the L1 hypervisor must invalidate the corresponding clean field after making changes to one of the MSR bitmaps." Enable this for SVM. Related VMX changes: commit `ceef7d10df` ("KVM: x86: VMX: hyper-v: Enlightened MSR-Bitmap support") Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com> Message-Id: <87df0710f95d28b91cc4ea014fc4d71056eebbee.1622730232.git.viremana@linux.microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:37 -04:00
Vineeth Pillai	1e0c7d4075	KVM: SVM: hyper-v: Remote TLB flush for SVM Enable remote TLB flush for SVM. Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com> Message-Id: <1ee364e397e142aed662d2920d198cd03772f1a5.1622730232.git.viremana@linux.microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:37 -04:00
Vineeth Pillai	59d21d67f3	KVM: SVM: Software reserved fields SVM added support for certain reserved fields to be used by software or hypervisor. Add the following reserved fields: - VMCB offset 0x3e0 - 0x3ff - Clean bit 31 - SVM intercept exit code 0xf0000000 Later patches will make use of this for supporting Hyper-V nested virtualization enhancements. Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com> Message-Id: <a1f17a43a8e9e751a1a9cc0281649d71bdbf721b.1622730232.git.viremana@linux.microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:37 -04:00
Vineeth Pillai	3c86c0d3db	KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx Currently the remote TLB flush logic is specific to VMX. Move it to a common place so that SVM can use it as well. Signed-off-by: Vineeth Pillai <viremana@linux.microsoft.com> Message-Id: <4f4e4ca19778437dae502f44363a38e99e3ef5d1.1622730232.git.viremana@linux.microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:36 -04:00
Krish Sadhukhan	d5a0483f9f	KVM: nVMX: nSVM: Add a new VCPU statistic to show if VCPU is in guest mode Add the following per-VCPU statistic to KVM debugfs to show if a given VCPU is in guest mode: guest_mode Also add this as a per-VM statistic to KVM debugfs to show the total number of VCPUs that are in guest mode in a given VM. Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com> Message-Id: <20210609180340.104248-3-krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:36 -04:00
Krish Sadhukhan	b93af02c67	KVM: nVMX: nSVM: 'nested_run' should count guest-entry attempts that make it to guest code Currently, the 'nested_run' statistic counts all guest-entry attempts, including those that fail during vmentry checks on Intel and during consistency checks on AMD. Convert this statistic to count only those guest-entries that make it past these state checks and make it to guest code. This will tell us the number of guest-entries that actually executed or tried to execute guest code. Signed-off-by: Krish Sadhukhan <Krish.Sadhukhan@oracle.com> Message-Id: <20210609180340.104248-2-krish.sadhukhan@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:35 -04:00
Sean Christopherson	ecc513e5bb	KVM: x86: Drop "pre_" from enter/leave_smm() helpers Now that .post_leave_smm() is gone, drop "pre_" from the remaining helpers. The helpers aren't invoked purely before SMI/RSM processing, e.g. both helpers are invoked after state is snapshotted (from regs or SMRAM), and the RSM helper is invoked after some amount of register state has been stuffed. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609185619.992058-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:35 -04:00
Sean Christopherson	0128116550	KVM: x86: Drop .post_leave_smm(), i.e. the manual post-RSM MMU reset Drop the .post_leave_smm() emulator callback, which at this point is just a wrapper to kvm_mmu_reset_context(). The manual context reset is unnecessary, because unlike enter_smm() which calls vendor MSR/CR helpers directly, em_rsm() bounces through the KVM helpers, e.g. kvm_set_cr4(), which are responsible for processing side effects. em_rsm() is already subtly relying on this behavior as it doesn't manually do kvm_update_cpuid_runtime(), e.g. to recognize CR4.OSXSAVE changes. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609185619.992058-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:35 -04:00
Sean Christopherson	1270e647c8	KVM: x86: Rename SMM tracepoint to make it reflect reality Rename the SMM tracepoint, which handles both entering and exiting SMM, from kvm_enter_smm to kvm_smm_transition. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609185619.992058-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:35 -04:00
Sean Christopherson	0d7ee6f4b5	KVM: x86: Move "entering SMM" tracepoint into kvm_smm_changed() Invoke the "entering SMM" tracepoint from kvm_smm_changed() instead of enter_smm(), effectively moving it from before reading vCPU state to after reading state (but still before writing it to SMRAM!). The primary motivation is to consolidate code, but calling the tracepoint from kvm_smm_changed() also makes its invocation consistent with respect to SMI and RSM, and with respect to KVM_SET_VCPU_EVENTS (which previously only invoked the tracepoint when forcing the vCPU out of SMM). Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609185619.992058-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:34 -04:00
Sean Christopherson	dc87275f47	KVM: x86: Move (most) SMM hflags modifications into kvm_smm_changed() Move the core of SMM hflags modifications into kvm_smm_changed() and use kvm_smm_changed() in enter_smm(). Clear HF_SMM_INSIDE_NMI_MASK for leaving SMM but do not set it for entering SMM. If the vCPU is executing outside of SMM, the flag should unequivocally be cleared, e.g. this technically fixes a benign bug where the flag could be left set after KVM_SET_VCPU_EVENTS, but the reverse is not true as NMI blocking depends on pre-SMM state or userspace input. Note, this adds an extra kvm_mmu_reset_context() to enter_smm(). The extra/early reset isn't strictly necessary, and in a way can never be necessary since the vCPU/MMU context is in a half-baked state until the final context reset at the end of the function. But, enter_smm() is not a hot path, and exploding on an invalid root_hpa is probably better than having a stale SMM flag in the MMU role; it's at least no worse. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609185619.992058-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:34 -04:00
Sean Christopherson	fa75e08bbe	KVM: x86: Invoke kvm_smm_changed() immediately after clearing SMM flag Move RSM emulation's call to kvm_smm_changed() from .post_leave_smm() to .exiting_smm(), leaving behind the MMU context reset. The primary motivation is to allow for future cleanup, but this also fixes a bug of sorts by queueing KVM_REQ_EVENT even if RSM causes shutdown, e.g. to let an INIT wake the vCPU from shutdown. Of course, KVM doesn't properly emulate a shutdown state, e.g. KVM doesn't block SMIs after shutdown, and immediately exits to userspace, so the event request is a moot point in practice. Moving kvm_smm_changed() also moves the RSM tracepoint. This isn't strictly necessary, but will allow consolidating the SMI and RSM tracepoints in a future commit (by also moving the SMI tracepoint). Invoking the tracepoint before loading SMRAM state also means the SMBASE that reported in the tracepoint will point that the state that will be used for RSM, as opposed to the SMBASE _after_ RSM completes, which is arguably a good thing if the tracepoint is being used to debug a RSM/SMM issue. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609185619.992058-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:34 -04:00
Sean Christopherson	edce46548b	KVM: x86: Replace .set_hflags() with dedicated .exiting_smm() helper Replace the .set_hflags() emulator hook with a dedicated .exiting_smm(), moving the SMM and SMM_INSIDE_NMI flag handling out of the emulator in the process. This is a step towards consolidating much of the logic in kvm_smm_changed(), including the SMM hflags updates. No functional change intended. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609185619.992058-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:34 -04:00
Sean Christopherson	25b17226cd	KVM: x86: Emulate triple fault shutdown if RSM emulation fails Use the recently introduced KVM_REQ_TRIPLE_FAULT to properly emulate shutdown if RSM from SMM fails. Note, entering shutdown after clearing the SMM flag and restoring NMI blocking is architecturally correct with respect to AMD's APM, which KVM also uses for SMRAM layout and RSM NMI blocking behavior. The APM says: An RSM causes a processor shutdown if an invalid-state condition is found in the SMRAM state-save area. Only an external reset, external processor-initialization, or non-maskable external interrupt (NMI) can cause the processor to leave the shutdown state. Of note is processor-initialization (INIT) as a valid shutdown wake event, as INIT is blocked by SMM, implying that entering shutdown also forces the CPU out of SMM. For recent Intel CPUs, restoring NMI blocking is technically wrong, but so is restoring NMI blocking in the first place, and Intel's RSM "architecture" is such a mess that just about anything is allowed and can be justified as micro-architectural behavior. Per the SDM: On Pentium 4 and later processors, shutdown will inhibit INTR and A20M but will not change any of the other inhibits. On these processors, NMIs will be inhibited if no action is taken in the SMI handler to uninhibit them (see Section 34.8). where Section 34.8 says: When the processor enters SMM while executing an NMI handler, the processor saves the SMRAM state save map but does not save the attribute to keep NMI interrupts disabled. Potentially, an NMI could be latched (while in SMM or upon exit) and serviced upon exit of SMM even though the previous NMI handler has still not completed. I.e. RSM unconditionally unblocks NMI, but shutdown on RSM does not, which is in direct contradiction of KVM's behavior. But, as mentioned above, KVM follows AMD architecture and restores NMI blocking on RSM, so that micro-architectural detail is already lost. And for Pentium era CPUs, SMI# can break shutdown, meaning that at least some Intel CPUs fully leave SMM when entering shutdown: In the shutdown state, Intel processors stop executing instructions until a RESET#, INIT# or NMI# is asserted. While Pentium family processors recognize the SMI# signal in shutdown state, P6 family and Intel486 processors do not. In other words, the fact that Intel CPUs have implemented the two extremes gives KVM carte blanche when it comes to honoring Intel's architecture for handling shutdown during RSM. Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210609185619.992058-3-seanjc@google.com> [Return X86EMUL_CONTINUE after triple fault. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:33 -04:00
Vitaly Kuznetsov	4651fc56ba	KVM: x86: Drop vendor specific functions for APICv/AVIC enablement Now that APICv/AVIC enablement is kept in common 'enable_apicv' variable, there's no need to call kvm_apicv_init() from vendor specific code. No functional change intended. Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210609150911.1471882-3-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:33 -04:00
Vitaly Kuznetsov	fdf513e37a	KVM: x86: Use common 'enable_apicv' variable for both APICv and AVIC Unify VMX and SVM code by moving APICv/AVIC enablement tracking to common 'enable_apicv' variable. Note: unlike APICv, AVIC is disabled by default. No functional change intended. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Message-Id: <20210609150911.1471882-2-vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:33 -04:00
Sergey Senozhatsky	7d62874f69	kvm: x86: implement KVM PM-notifier Implement PM hibernation/suspend prepare notifiers so that KVM can reliably set PVCLOCK_GUEST_STOPPED on VCPUs and properly suspend VMs. Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Message-Id: <20210606021045.14159-2-senozhatsky@chromium.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:33 -04:00
Jim Mattson	966eefb896	KVM: nVMX: Disable vmcs02 posted interrupts if vmcs12 PID isn't mappable Don't allow posted interrupts to modify a stale posted interrupt descriptor (including the initial value of 0). Empirical tests on real hardware reveal that a posted interrupt descriptor referencing an unbacked address has PCI bus error semantics (reads as all 1's; writes are ignored). However, kvm can't distinguish unbacked addresses from device-backed (MMIO) addresses, so it should really ask userspace for an MMIO completion. That's overly complicated, so just punt with KVM_INTERNAL_ERROR. Don't return the error until the posted interrupt descriptor is actually accessed. We don't want to break the existing kvm-unit-tests that assume they can launch an L2 VM with a posted interrupt descriptor that references MMIO space in L1. Fixes: `6beb7bd52e` ("kvm: nVMX: Refactor nested_get_vmcs12_pages()") Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-8-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:31 -04:00
Jim Mattson	0fe998b295	KVM: nVMX: Fail on MMIO completion for nested posted interrupts When the kernel has no mapping for the vmcs02 virtual APIC page, userspace MMIO completion is necessary to process nested posted interrupts. This is not a configuration that KVM supports. Rather than silently ignoring the problem, try to exit to userspace with KVM_INTERNAL_ERROR. Note that the event that triggers this error is consumed as a side-effect of a call to kvm_check_nested_events. On some paths (notably through kvm_vcpu_check_block), the error is dropped. In any case, this is an incremental improvement over always ignoring the error. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-7-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:31 -04:00
Jim Mattson	4fe09bcf14	KVM: x86: Add a return code to kvm_apic_accept_events No functional change intended. At present, the only negative value returned by kvm_check_nested_events is -EBUSY. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-6-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:31 -04:00
Jim Mattson	a5f6909a71	KVM: x86: Add a return code to inject_pending_event No functional change intended. At present, 'r' will always be -EBUSY on a control transfer to the 'out' label. Signed-off-by: Jim Mattson <jmattson@google.com> Message-Id: <20210604172611.281819-5-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:30 -04:00
Jim Mattson	650293c3de	KVM: nVMX: Add a return code to vmx_complete_nested_posted_interrupt No functional change intended. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Oliver Upton <oupton@google.com> Message-Id: <20210604172611.281819-4-jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2021-06-17 13:09:30 -04:00

... 2 3 4 5 6 ...

7621 Commits