kvm_tdp_mmu_put_root and kvm_tdp_mmu_free_root are always called
together, so merge the functions to simplify TDP MMU root refcounting /
freeing.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-5-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Minor cleanup to deduplicate the code used to free a struct kvm_mmu_page
in the TDP MMU.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-4-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The TDP MMU is almost the only user of kvm_mmu_get_root and
kvm_mmu_put_root. There is only one use of put_root in mmu.c for the
legacy / shadow MMU. Open code that one use and move the get / put
functions to the TDP MMU so they can be extended in future commits.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-3-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_tdp_mmu_zap_collapsible_sptes unnecessarily removes the const
qualifier from its memslot argument, leading to a compiler warning. Add
the const annotation and pass it to subsequent functions.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210401233736.638171-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let the TDP MMU yield when unmapping a range in response to a MMU
notification, if yielding is allowed by said notification. There is no
reason to disallow yielding in this case, and in theory the range being
invalidated could be quite large.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the hva->gfn lookup for MMU notifiers into common code. Every arch
does a similar lookup, and some arch code is all but identical across
multiple architectures.
In addition to consolidating code, this will allow introducing
optimizations that will benefit all architectures without incurring
multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
relevant range exists in the memslots.
The use of __always_inline to avoid indirect call retpolines, as done by
x86, may also benefit other architectures.
Consolidating the lookups also fixes a wart in x86, where the legacy MMU
and TDP MMU each do their own memslot walks.
Lastly, future enhancements to the memslot implementation, e.g. to add an
interval tree to track host address, will need to touch far less arch
specific code.
MIPS, PPC, and arm64 will be converted one at a time in future patches.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210402005658.3024832-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using manual protection of dirty pages, it is not necessary
to protect nested page tables down to the 4K level; instead KVM
can protect only hugepages in order to split them lazily, and
delay write protection at 4K-granularity until KVM_CLEAR_DIRTY_LOG.
This was overlooked in the TDP MMU, so do it there as well.
Fixes: a6a0b05da9 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <bgardon@google.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Store the supported bits into the KVM_GUESTDBG_VALID_MASK
macro, similar to how arm does this.
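As a sketch (the exact set of bits is defined by the patch itself, so treat
the list below as illustrative rather than authoritative), the macro simply
ORs together the x86 guest-debug flags KVM supports and lets the ioctl
reject anything else:

  #define KVM_GUESTDBG_VALID_MASK      \
          (KVM_GUESTDBG_ENABLE |       \
           KVM_GUESTDBG_SINGLESTEP |   \
           KVM_GUESTDBG_USE_HW_BP |    \
           KVM_GUESTDBG_USE_SW_BP |    \
           KVM_GUESTDBG_INJECT_BP |    \
           KVM_GUESTDBG_INJECT_DB)

  /* kvm_arch_vcpu_ioctl_set_guest_debug() can then do: */
  if (dbg->control & ~KVM_GUESTDBG_VALID_MASK)
          return -EINVAL;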
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401135451.1004564-4-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Injected interrupts/NMIs should not block a pending exception; instead
they should either be lost if the nested hypervisor doesn't intercept the
pending exception (as on bare-metal x86), or be delivered in the
exitintinfo/IDT_VECTORING_INFO field as part of the VMexit that
corresponds to the pending exception.
The only reason for an exception to be blocked is when a nested run is
pending (which can't really happen currently, but is still worth checking
for).
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401143817.1030695-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While KVM's MMU should be fully reset by loading of nested CR0/CR3/CR4
via KVM_SET_SREGS, we are not in nested mode yet when we do it, and
therefore only root_mmu is reset.
On regular nested entries we call nested_svm_load_cr3, which updates the
guest's CR3 in the MMU when needed and also re-initializes the MMU; this
initializes the walk_mmu as well when nested paging is enabled in both
host and guest.
Since we don't call nested_svm_load_cr3 on nested state load, the walk_mmu
can be left uninitialized. If a nested page fault happens right after
entering the nested guest for the first time after migration and KVM
decides to emulate it, the emulator tries to access walk_mmu->gva_to_gpa,
which is NULL, resulting in a NULL pointer dereference.
Therefore we should call this function on nested state load as well.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401141814.1029036-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When dumping the current VMCS state, include the MSRs that are being
automatically loaded/stored during VM entry/exit.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-6-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If EFER is not being loaded from the VMCS, show the effective value by
reference to the MSR autoload list or calculation.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-5-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When deciding whether to dump the GUEST_IA32_EFER and GUEST_IA32_PAT
fields of the VMCS, examine only the VM entry load controls, as saving
on VM exit has no effect on whether VM entry succeeds or fails.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-4-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Show EFER and PAT based on their individual entry/exit controls.
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-3-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the VM entry/exit controls for loading/saving MSR_EFER are either
not available (an older processor or explicitly disabled) or not
used (host and guest values are the same), reading GUEST_IA32_EFER
from the VMCS returns an inaccurate value.
Because of this, in dump_vmcs() don't use GUEST_IA32_EFER to decide
whether to print the PDPTRs - always do so if the fields exist.
Fixes: 4eb64dce8d ("KVM: x86: dump VMCS on invalid entry")
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Message-Id: <20210318120841.133123-2-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, to support Intel->AMD migration, if the CPU vendor is
GenuineIntel, we emulate the full 64-bit value of the
MSR_IA32_SYSENTER_{EIP|ESP} MSRs, and we also emulate the sysenter/sysexit
instructions in long mode. (The emulator still refuses to emulate sysenter
in 64-bit mode, on the grounds that the code for that wasn't tested and
likely has no users.)
However, when virtual vmload/vmsave is enabled, the vmload instruction will
update these 32-bit MSRs without triggering their MSR intercept, leaving
stale values in KVM's shadow copy of these MSRs, which relies on the
intercept to stay up to date.
Fix/optimize this by doing the following:
1. Enable the MSR intercepts for the SYSENTER MSRs iff vendor=GenuineIntel
(this is both a tiny optimization and also ensures that in case
the guest CPU vendor is AMD, the MSRs will be 32 bits wide, as
AMD defines them).
2. Store only the high 32-bit part of these MSRs on interception and combine
it with the hardware MSR value on intercepted reads/writes
iff vendor=GenuineIntel.
3. Disable vmload/vmsave virtualization if vendor=GenuineIntel.
(It is somewhat insane to set vendor=GenuineIntel and still enable
SVM for the guest, but well, whatever.)
Then zero the high 32-bit parts when KVM intercepts and emulates vmload.
Thanks a lot to Paolo Bonzini for helping me with fixing this in the most
correct way.
This patch fixes nested migration of 32-bit nested guests, which was
broken because incorrect cached values of the SYSENTER MSRs were stored in
the migration stream if L1 changed these MSRs with
vmload prior to L2 entry.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401111928.996871-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is similar to the existing 'guest_cpuid_is_amd_or_hygon'.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210401111928.996871-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Switch to GFP_KERNEL_ACCOUNT for a handful of allocations that are
clearly associated with a single task/VM.
Note, there are several SEV allocations that aren't accounted, but
those can (hopefully) be fixed by using the local stack for memory.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331023025.2485960-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reject KVM_SEV_INIT and KVM_SEV_ES_INIT if they are attempted after one
or more vCPUs have been created. KVM assumes a VM is tagged SEV/SEV-ES
prior to vCPU creation, e.g. init_vmcb() needs to mark the VMCB as SEV
enabled, and svm_create_vcpu() needs to allocate the VMSA. At best,
creating vCPUs before SEV/SEV-ES init will lead to unexpected errors
and/or behavior, and at worst it will crash the host, e.g.
sev_launch_update_vmsa() will dereference a null svm->vmsa pointer.
Fixes: 1654efcbc4 ("KVM: SVM: Add KVM_SEV_INIT command")
Fixes: ad73109ae7 ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331031936.2495277-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set sev->es_active only after the guts of KVM_SEV_ES_INIT succeeds. If
the command fails, e.g. because SEV is already active or there are no
available ASIDs, then es_active will be left set even though the VM is
not fully SEV-ES capable.
Refactor the code so that "es_active" is passed on the stack instead of
being prematurely shoved into sev_info, both to avoid having to unwind
sev_info and so that it's more obvious what actually consumes es_active
in sev_guest_init() and its helpers.
Fixes: ad73109ae7 ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331031936.2495277-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the kvm_for_each_vcpu() helper to iterate over vCPUs when encrypting
VMSAs for SEV, which effectively switches to use online_vcpus instead of
created_vcpus. This fixes a possible null-pointer dereference as
created_vcpus does not guarantee a vCPU exists, since it is updated at
the very beginning of KVM_CREATE_VCPU. created_vcpus exists to allow the
bulk of vCPU creation to run in parallel, while still correctly
restricting the maximum number of vCPUs.
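A minimal sketch of the new iteration (encrypt_vcpu_vmsa() below is a
hypothetical helper standing in for the LAUNCH_UPDATE_VMSA work):

  struct kvm_vcpu *vcpu;
  int i, ret;

  kvm_for_each_vcpu(i, vcpu, kvm) {
          /* Only online vCPUs are visited, so 'vcpu' is never NULL here,
           * unlike an index that merely satisfies i < kvm->created_vcpus.
           */
          ret = encrypt_vcpu_vmsa(vcpu);
          if (ret)
                  return ret;
  }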
Fixes: ad73109ae7 ("KVM: SVM: Provide support to launch and run an SEV-ES guest")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331031936.2495277-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use a basic NOT+AND sequence to clear the Accessed bit in TDP MMU SPTEs,
as opposed to the fancy ffs()+clear_bit() logic that was copied from the
legacy MMU. The legacy MMU uses clear_bit() because it is operating on
the SPTE itself, i.e. the clear needs to be atomic. The TDP MMU operates
on a local variable that it later writes to the SPTE, so the update doesn't
need to be atomic or even resident in memory.
Opportunistically drop unnecessary initialization of new_spte, it's
guaranteed to be written before being accessed.
Using NOT+AND instead of ffs()+clear_bit() reduces the sequence from:
0x0000000000058be6 <+134>: test %rax,%rax
0x0000000000058be9 <+137>: je 0x58bf4 <age_gfn_range+148>
0x0000000000058beb <+139>: test %rax,%rdi
0x0000000000058bee <+142>: je 0x58cdc <age_gfn_range+380>
0x0000000000058bf4 <+148>: mov %rdi,0x8(%rsp)
0x0000000000058bf9 <+153>: mov $0xffffffff,%edx
0x0000000000058bfe <+158>: bsf %eax,%edx
0x0000000000058c01 <+161>: movslq %edx,%rdx
0x0000000000058c04 <+164>: lock btr %rdx,0x8(%rsp)
0x0000000000058c0b <+171>: mov 0x8(%rsp),%r15
to:
0x0000000000058bdd <+125>: test %rax,%rax
0x0000000000058be0 <+128>: je 0x58beb <age_gfn_range+139>
0x0000000000058be2 <+130>: test %rax,%r8
0x0000000000058be5 <+133>: je 0x58cc0 <age_gfn_range+352>
0x0000000000058beb <+139>: not %rax
0x0000000000058bee <+142>: and %r8,%rax
0x0000000000058bf1 <+145>: mov %rax,%r15
thus eliminating several memory accesses, including a locked access.
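At the C level the change boils down to something like the following
sketch (shadow_accessed_mask is KVM's real mask; the surrounding lines are
illustrative):

  /* Legacy MMU: atomic clear on the SPTE that lives in the page tables. */
  clear_bit((ffs(shadow_accessed_mask) - 1), (unsigned long *)sptep);

  /* TDP MMU: plain NOT+AND on a local copy, written back afterwards. */
  new_spte = iter->old_spte & ~shadow_accessed_mask;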
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331004942.2444916-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't clear the dirty bit when aging a TDP MMU SPTE (in response to a MMU
notifier event). Prematurely clearing the dirty bit could cause spurious
PML updates if aging a page happened to coincide with dirty logging.
Note, tdp_mmu_set_spte_no_acc_track() flows into __handle_changed_spte(),
so the host PFN will be marked dirty, i.e. there is no potential for data
corruption.
Fixes: a6a0b05da9 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210331004942.2444916-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove x86's trace_kvm_age_page() tracepoint. It's mostly redundant with
the common trace_kvm_age_hva() tracepoint, and if there is a need for the
extra details, e.g. gfn, referenced, etc... those details should be added
to the common tracepoint so that all architectures and MMUs benefit from
the info.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the leaf-only TDP iterator when changing the SPTE in reaction to a
MMU notifier. Practically speaking, this is a nop since the guts of the
loop explicitly looks for 4k SPTEs, which are always leaf SPTEs. Switch
the iterator to match age_gfn_range() and test_age_gfn() so that a future
patch can consolidate the core iterating logic.
No real functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the address space ID check that is performed when iterating over
roots into the macro helpers to consolidate code.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pass the address space ID to TDP MMU's primary "zap gfn range" helper to
allow the MMU notifier paths to iterate over memslots exactly once.
Currently, both the legacy MMU and TDP MMU iterate over memslots when
looking for an overlapping hva range, which can be quite costly if there
are a large number of memslots.
Add a "flush" parameter so that iterating over multiple address spaces
in the caller will continue to do the right thing when yielding while a
flush is pending from a previous address space.
Note, this also has a functional change in the form of coalescing TLB
flushes across multiple address spaces in kvm_zap_gfn_range(), and also
optimizes the TDP MMU to utilize range-based flushing when running as L1
with Hyper-V enlightenments.
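A rough sketch of the resulting calling convention on the TDP MMU side
(illustrative, not the exact diff):

  bool flush = false;
  int i;

  for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
          flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
                                            gfn_end, flush);

  if (flush)
          kvm_flush_remote_tlbs(kvm);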
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-6-seanjc@google.com>
[Keep separate for loops to prepare for other incoming patches. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Gather pending TLB flushes across both address spaces when zapping a
given gfn range. This requires feeding "flush" back into subsequent
calls, but on the plus side sets the stage for further batching
between the legacy MMU and TDP MMU. It also allows refactoring the
address space iteration to cover the legacy and TDP MMUs without
introducing truly ugly code.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Gather pending TLB flushes across both the legacy and TDP MMUs when
zapping collapsible SPTEs to avoid multiple flushes if both the legacy
MMU (for nested guests) and TDP MMU have mappings for the memslot.
Note, this also optimizes the TDP MMU to flush only the relevant range
when running as L1 with Hyper-V enlightenments.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Place the onus on the caller of slot_handle_*() to flush the TLB, rather
than handling the flush in the helper, and rename parameters accordingly.
This will allow future patches to coalesce flushes between address spaces
and between the legacy and TDP MMUs.
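The new contract, roughly sketched (helper and parameter names as assumed
here; 'false' stands in for the renamed flush_on_yield argument):

  flush = slot_handle_level(kvm, memslot, slot_rmap_write_protect,
                            start_level, KVM_MAX_HUGEPAGE_LEVEL,
                            false /* flush_on_yield */);
  if (flush)
          kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);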
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When zapping collapsible SPTEs across multiple roots, gather pending
flushes and perform a single remote TLB flush at the end, as opposed to
flushing after processing every root.
Note, flush may be cleared by the result of zap_collapsible_spte_range().
This is intended and correct, e.g. yielding may have serviced a prior
pending flush.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210326021957.1424875-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR_F15H_PERF_CTL0-5 and MSR_F15H_PERF_CTR0-5 MSRs have a CPUID bit assigned
to them (X86_FEATURE_PERFCTR_CORE), and when it is not exposed to the guest
the correct behavior is to inject #GP and not just return zero.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210329124804.170173-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to APM, the #DB intercept for a single-stepped VMRUN must happen
after the completion of that instruction, when the guest does #VMEXIT to
the host. However, in the current implementation of KVM, the #DB intercept
for a single-stepped VMRUN happens after the completion of the instruction
that follows the VMRUN instruction. When the #DB intercept handler is
invoked, it shows the RIP of the instruction that follows VMRUN, instead
of VMRUN itself. This is an incorrect RIP as far as single-stepping VMRUN
is concerned.
This patch fixes the problem by checking, in nested_svm_vmexit(), whether
the VMRUN instruction is being single-stepped and, if so, queuing the
pending #DB intercept so that the #DB is accounted for before we execute
L1's next instruction.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20210323175006.73249-2-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On SVM, reading PDPTRs might access guest memory, which might fault
and thus might sleep. On the other hand, it is not possible to
release the lock after make_mmu_pages_available has been called.
Therefore, push the call to make_mmu_pages_available and the
mmu_lock critical section within mmu_alloc_direct_roots and
mmu_alloc_shadow_roots.
Reported-by: Wanpeng Li <wanpengli@tencent.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
__vmx_handle_exit() uses vcpu->run->internal.ndata as an index for an
array access. Since vcpu->run is (or can be) mapped into a user address
space with write permission, 'ndata' could be updated by the user process
at any time (the user process can set it to a value outside the bounds of
the array).
So it is not safe for __vmx_handle_exit() to use 'ndata' that way.
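The safe pattern, sketched below, is to build the exit information with a
local index and only publish the final count to the shared page (the
specific entries are illustrative):

  u32 ndata = 0;  /* local index, never re-read from vcpu->run */

  /* ... fill vcpu->run->internal.data[ndata++] with the exit details ... */
  vcpu->run->internal.data[ndata++] = vcpu->arch.last_vmentry_cpu;
  vcpu->run->internal.ndata = ndata;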
Fixes: 1aa561b1a4 ("kvm: x86: Add "last CPU" to some KVM_EXIT information")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20210413154739.490299-1-reijiw@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Right now, if a call to kvm_tdp_mmu_zap_sp returns false, the caller
will skip the TLB flush, which is wrong. There are two ways to fix
it:
- since kvm_tdp_mmu_zap_sp will not yield and therefore will not flush
the TLB itself, we could change the call to kvm_tdp_mmu_zap_sp to
use "flush |= ..."
- or we can chain the flush argument through kvm_tdp_mmu_zap_sp down
to __kvm_tdp_mmu_zap_gfn_range. Note that kvm_tdp_mmu_zap_sp will
neither yield nor flush, so flush would never go from true to
false.
This patch does the former to simplify application to stable kernels,
and to make it further clearer that kvm_tdp_mmu_zap_sp will not flush.
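Concretely, the NX recovery call site becomes something along these lines
(sketch of the first option):

  if (is_tdp_mmu_page(sp))
          flush |= kvm_tdp_mmu_zap_sp(kvm, sp);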
Cc: seanjc@google.com
Fixes: 048f49809c ("KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping")
Cc: <stable@vger.kernel.org> # 5.10.x: 048f49809c: KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
Cc: <stable@vger.kernel.org> # 5.10.x: 33a3164161: KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
Cc: <stable@vger.kernel.org>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a misc device /dev/sgx_vepc to allow userspace to allocate "raw"
Enclave Page Cache (EPC) without an associated enclave. The intended
and only known use case for raw EPC allocation is to expose EPC to a
KVM guest, hence the 'vepc' moniker, virt.{c,h} files and X86_SGX_KVM
Kconfig.
The SGX driver uses the misc device /dev/sgx_enclave to support
userspace in creating an enclave. Each file descriptor returned from
opening /dev/sgx_enclave represents an enclave. Unlike the SGX driver,
KVM doesn't control how the guest uses the EPC, therefore EPC allocated
to a KVM guest is not associated with an enclave, and /dev/sgx_enclave
is not suitable for allocating EPC for a KVM guest.
Having separate device nodes for the SGX driver and KVM virtual EPC also
allows separate permission control for running host SGX enclaves and KVM
SGX guests.
To use /dev/sgx_vepc to allocate a virtual EPC instance of a particular
size, the hypervisor opens /dev/sgx_vepc and uses mmap() with the
intended size to get an address range of virtual EPC. Then it may use
the address range to create one KVM memory slot as virtual EPC for
a guest.
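From the hypervisor's side the flow is roughly the following userspace
sketch (error handling omitted; wiring the range into a KVM memslot is not
shown):

  #include <fcntl.h>
  #include <sys/mman.h>

  void *alloc_vepc(size_t size)
  {
          int fd = open("/dev/sgx_vepc", O_RDWR);
          void *epc = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

          /* Hand this range to KVM via KVM_SET_USER_MEMORY_REGION. */
          return epc;
  }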
Implement the "raw" EPC allocation in the x86 core-SGX subsystem via
/dev/sgx_vepc rather than in KVM. Doing so has two major advantages:
- Does not require changes to KVM's uAPI, e.g. EPC gets handled as
just another memory backend for guests.
- EPC management is wholly contained in the SGX subsystem, e.g. SGX
does not have to export any symbols, changes to reclaim flows don't
need to be routed through KVM, SGX's dirty laundry doesn't have to
get aired out for the world to see, and so on and so forth.
The virtual EPC pages allocated to guests are currently not reclaimable.
Reclaiming an EPC page used by enclave requires a special reclaim
mechanism separate from normal page reclaim, and that mechanism is not
supported for virtual EPC pages. Due to the complications of handling
reclaim conflicts between guest and host, reclaiming virtual EPC pages
is significantly more complex than basic support for SGX virtualization.
[ bp:
- Massage commit message and comments
- use cpu_feature_enabled()
- vertically align struct members init
- massage Virtual EPC clarification text
- move Kconfig prompt to Virtualization ]
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Acked-by: Jarkko Sakkinen <jarkko@kernel.org>
Link: https://lkml.kernel.org/r/0c38ced8c8e5a69872db4d6a1c0dabd01e07cad7.1616136308.git.kai.huang@intel.com
Secure Encrypted Virtualization (SEV) and Secure Encrypted
Virtualization - Encrypted State (SEV-ES) ASIDs are used to encrypt KVM
guests on AMD platforms. These ASIDs are available in limited quantities
on a host.
Register their capacity and usage to the misc controller for tracking
via cgroups.
Signed-off-by: Vipin Sharma <vipinsh@google.com>
Reviewed-by: David Rientjes <rientjes@google.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
When guest time is reset with KVM_SET_CLOCK(0), it is possible for
'hv_clock->system_time' to become a small negative number. This happens
because in KVM_SET_CLOCK handling we set 'kvm->arch.kvmclock_offset' based
on get_kvmclock_ns(kvm) but when KVM_REQ_CLOCK_UPDATE is handled,
kvm_guest_time_update() does (masterclock in use case):
hv_clock.system_time = ka->master_kernel_ns + v->kvm->arch.kvmclock_offset;
And 'master_kernel_ns' represents the last time when the masterclock
got updated; it can precede the KVM_SET_CLOCK() call. Normally, this is not a
problem, the difference is very small, e.g. I'm observing
hv_clock.system_time = -70 ns. The issue comes from the fact that
'hv_clock.system_time' is stored as unsigned and 'system_time / 100' in
compute_tsc_page_parameters() becomes a very big number.
Use 'master_kernel_ns' instead of get_kvmclock_ns() when masterclock is in
use and get_kvmclock_base_ns() when it's not to prevent 'system_time' from
going negative.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210331124130.337992-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
pvclock_gtod_sync_lock can be taken with interrupts disabled if the
preempt notifier calls get_kvmclock_ns to update the Xen
runstate information:
spin_lock include/linux/spinlock.h:354 [inline]
get_kvmclock_ns+0x25/0x390 arch/x86/kvm/x86.c:2587
kvm_xen_update_runstate+0x3d/0x2c0 arch/x86/kvm/xen.c:69
kvm_xen_update_runstate_guest+0x74/0x320 arch/x86/kvm/xen.c:100
kvm_xen_runstate_set_preempted arch/x86/kvm/xen.h:96 [inline]
kvm_arch_vcpu_put+0x2d8/0x5a0 arch/x86/kvm/x86.c:4062
So change the users of the spinlock to spin_lock_irqsave and
spin_unlock_irqrestore.
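The conversion follows the standard pattern; a sketch for one of the
users (ka is &kvm->arch, the critical section body is elided):

  unsigned long flags;

  spin_lock_irqsave(&ka->pvclock_gtod_sync_lock, flags);
  /* ... read or update the kvmclock / masterclock state ... */
  spin_unlock_irqrestore(&ka->pvclock_gtod_sync_lock, flags);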
Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com
Fixes: 30b5c851af ("KVM: x86/xen: Add support for vCPU runstate information")
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no need to include changes to vcpu->requests into
the pvclock_gtod_sync_lock critical section. The changes to
the shared data structures (in pvclock_update_vm_gtod_copy)
already occur under the lock.
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixing nested_vmcb_check_save to avoid all TOC/TOU races
is a bit harder in released kernels, so do the bare minimum
by avoiding that EFER.SVME is cleared. This is problematic
because svm_set_efer frees the data structures for nested
virtualization if EFER.SVME is cleared.
Also check that EFER.SVME remains set after a nested vmexit;
clearing it could happen if the bit is zero in the save area
that is passed to KVM_SET_NESTED_STATE (the save area of the
nested state corresponds to the nested hypervisor's state
and is restored on the next nested vmexit).
Cc: stable@vger.kernel.org
Fixes: 2fcf4876ad ("KVM: nSVM: implement on demand allocation of the nested state")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid races between check and use of the nested VMCB controls. This
for example ensures that the VMRUN intercept is always reflected to the
nested hypervisor, instead of being processed by the host. Without this
patch, it is possible to end up with svm->nested.hsave pointing to
the MSR permission bitmap for nested guests.
This bug is CVE-2021-29657.
Reported-by: Felix Wilhelm <fwilhelm@google.com>
Cc: stable@vger.kernel.org
Fixes: 2fcf4876ad ("KVM: nSVM: implement on demand allocation of the nested state")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prevent the TDP MMU from yielding when zapping a gfn range during NX
page recovery. If a flush is pending from a previous invocation of the
zapping helper, either in the TDP MMU or the legacy MMU, but the TDP MMU
has not accumulated a flush for the current invocation, then yielding
will release mmu_lock with stale TLB entries.
That being said, this isn't technically a bug fix in the current code, as
the TDP MMU will never yield in this case. tdp_mmu_iter_cond_resched()
will yield if and only if it has made forward progress, as defined by the
current gfn vs. the last yielded (or starting) gfn. Because zapping a
single shadow page is guaranteed to (a) find that page and (b) step
sideways at the level of the shadow page, the TDP iter will break its loop
before getting a chance to yield.
But that is all very, very subtle, and will break at the slightest sneeze,
e.g. zapping while holding mmu_lock for read would break as the TDP MMU
wouldn't be guaranteed to see the present shadow page, and thus could step
sideways at a lower level.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-4-seanjc@google.com>
[Add lockdep assertion. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Honor the "flush needed" return from kvm_tdp_mmu_zap_gfn_range(), which
does the flush itself if and only if it yields (which it will never do in
this particular scenario), and otherwise expects the caller to do the
flush. If pages are zapped from the TDP MMU but not the legacy MMU, then
no flush will occur.
Fixes: 29cf0f5007 ("kvm: x86/mmu: NX largepage recovery for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-3-seanjc@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When flushing a range of GFNs across multiple roots, ensure any pending
flush from a previous root is honored before yielding while walking the
tables of the current root.
Note, kvm_tdp_mmu_zap_gfn_range() now intentionally overwrites its local
"flush" with the result to avoid redundant flushes. zap_gfn_range()
preserves and returns the incoming "flush", unless of course the flush was
performed prior to yielding and no new flush was triggered.
Fixes: 1af4a96025 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Cc: stable@vger.kernel.org
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210325200119.1359384-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Building the kvm module out-of-source with
make -C $SRC O=$BIN M=arch/x86/kvm
fails to find "irq.h" as the include dir passed to cflags-y does not
prefix the source dir. Fix this by prefixing $(srctree) to the include
dir path.
Signed-off-by: Siddharth Chandrasekaran <sidcha@amazon.de>
Message-Id: <20210324124347.18336-1-sidcha@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR_F15H_PERF_CTL0-5 and MSR_F15H_PERF_CTR0-5 MSRs are only available when
the X86_FEATURE_PERFCTR_CORE CPUID bit is exposed to the guest. KVM, however,
allows these MSRs unconditionally because kvm_pmu_is_valid_msr() ->
amd_msr_idx_to_pmc() check always passes and because kvm_pmu_set_msr() ->
amd_pmu_set_msr() doesn't fail.
In case of a counter (CTRn), no big harm is done as we only increase the
internal PMC's value, but in case of an eventsel (CTLn), we go deep into
perf internals with a non-existing counter.
Note, kvm_get_msr_common() just returns '0' when these MSRs don't exist
and this also seems to contradict architectural behavior which is #GP
(I did check one old Opteron host) but changing this status quo is a bit
scarier.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210323084515.1346540-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_write_tsc() was renamed and made static since commit 0c899c25d7
("KVM: x86: do not attempt TSC synchronization on guest writes"). Remove
its unused declaration.
Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
Message-Id: <20210326070334.12310-1-dongli.zhang@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm_msr_ignored_check() function never uses its vcpu argument. Clean
up the function and its invokers.
Signed-off-by: Haiwei Li <lihaiwei@tencent.com>
Message-Id: <20210313051032.4171-1-lihaiwei.kernel@gmail.com>
Reviewed-by: Keqian Zhu <zhukeqian1@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix another ~42 single-word typos in arch/x86/ code comments; the first
pass missed a few, in particular in .S files.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
In order to deal with noncoherent DMA, we should execute wbinvd on all
dirty pCPUs when a guest wbinvd exit occurs, to maintain data consistency.
smp_call_function_many() does not execute the provided function on the
local core, therefore replace it by on_each_cpu_mask().
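The change amounts to swapping the IPI helper; a sketch using the existing
wbinvd_ipi() callback:

  static void wbinvd_ipi(void *garbage)
  {
          wbinvd();
  }

  /* Before: the local CPU was skipped. */
  smp_call_function_many(vcpu->arch.wbinvd_dirty_mask, wbinvd_ipi, NULL, 1);

  /* After: the callback also runs on the local CPU. */
  on_each_cpu_mask(vcpu->arch.wbinvd_dirty_mask, wbinvd_ipi, NULL, 1);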
Reported-by: Nadav Amit <namit@vmware.com>
Cc: Nadav Amit <namit@vmware.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1615517151-7465-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a plethora of issues with MSR filtering by installing the resulting
filter as an atomic bundle instead of updating the live filter one range
at a time. The KVM_X86_SET_MSR_FILTER ioctl() isn't truly atomic, as
the hardware MSR bitmaps won't be updated until the next VM-Enter, but
the relevant software struct is atomically updated, which is what KVM
really needs.
Similar to the approach used for modifying memslots, make arch.msr_filter
a SRCU-protected pointer, do all the work configuring the new filter
outside of kvm->lock, and then acquire kvm->lock only when the new filter
has been vetted and created. That way vCPU readers either see the old
filter or the new filter in their entirety, not some half-baked state.
Yuan Yao pointed out a use-after-free in kvm_msr_allowed() due to a
TOCTOU bug, but that's just the tip of the iceberg...
- Nothing is __rcu annotated, making it nigh impossible to audit the
code for correctness.
- kvm_add_msr_filter() has an unpaired smp_wmb(). Violation of kernel
coding style aside, the lack of an smp_rmb() anywhere casts all code
into doubt.
- kvm_clear_msr_filter() has a double free TOCTOU bug, as it grabs
count before taking the lock.
- kvm_clear_msr_filter() also has a memory leak due to the same TOCTOU bug.
The entire approach of updating the live filter is also flawed. While
installing a new filter is inherently racy if vCPUs are running, fixing
the above issues also makes it trivial to ensure certain behavior is
deterministic, e.g. KVM can provide deterministic behavior for MSRs with
identical settings in the old and new filters. An atomic update of the
filter also prevents KVM from getting into a half-baked state, e.g. if
installing a filter fails, the existing approach would leave the filter
in a half-baked state, having already committed whatever bits of the
filter were already processed.
[*] https://lkml.kernel.org/r/20210312083157.25403-1-yaoyuan0329os@gmail.com
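The shape of the fix is the familiar RCU publish pattern, sketched here
with illustrative struct and helper names:

  struct kvm_x86_msr_filter *new, *old;

  new = kzalloc(sizeof(*new), GFP_KERNEL_ACCOUNT);
  /* ... validate the ioctl args and fill in 'new', outside kvm->lock ... */

  mutex_lock(&kvm->lock);
  old = rcu_replace_pointer(kvm->arch.msr_filter, new,
                            mutex_is_locked(&kvm->lock));
  mutex_unlock(&kvm->lock);

  synchronize_srcu(&kvm->srcu);
  kfree(old);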
Fixes: 1a155254ff ("KVM: x86: Introduce MSR filtering")
Cc: stable@vger.kernel.org
Cc: Alexander Graf <graf@amazon.com>
Reported-by: Yuan Yao <yaoyuan0329os@gmail.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210316184436.2544875-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix ~144 single-word typos in arch/x86/ code comments.
Doing this in a single commit should reduce the churn.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-kernel@vger.kernel.org
When a guest opts in for re-enlightenment notifications upon migration, it
is within its rights to assume that TSC page values never change (as they're only
supposed to change upon migration and the host has to keep things as they
are before it receives confirmation from the guest). This is mostly true
until the guest is migrated somewhere. KVM userspace (e.g. QEMU) will
trigger masterclock update by writing to HV_X64_MSR_REFERENCE_TSC, by
calling KVM_SET_CLOCK,... and as TSC value and kvmclock reading drift
apart (even slightly), the update causes TSC page values to change.
The issue at hand is that when Hyper-V is migrated, it uses stale (cached)
TSC page values to compute the difference between its own clocksource
(provided by KVM) and its guests' TSC pages to program synthetic timers
and in some cases, when TSC page is updated, this puts all stimer
expirations in the past. This, in turn, causes an interrupt storm and
leaves L2 guests unable to make much forward progress.
Note, KVM doesn't fully implement re-enlightenment notification. Basically,
the support for reenlightenment MSRs is just a stub and userspace is only
expected to expose the feature when TSC scaling on the expected destination
hosts is available. With TSC scaling, no real re-enlightenment is needed
as TSC frequency doesn't change. With TSC scaling becoming ubiquitous, it
likely makes little sense to fully implement re-enlightenment in KVM.
Prevent TSC page from being updated after migration. In case it's not the
guest who's initiating the change and when TSC page is already enabled,
just keep it as it is: TSC value is supposed to be preserved across
migration and TSC frequency can't change with re-enlightenment enabled.
The guest is doomed anyway if any of this is not true.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Create an infrastructure for tracking Hyper-V TSC page status, i.e. if it
was updated from guest/host side or if we've failed to set it up (because
e.g. guest wrote some garbage to HV_X64_MSR_REFERENCE_TSC) and there's no
need to retry.
Also, in a hypothetical situation when we are in 'always catchup' mode for
TSC we can now avoid contending 'hv->hv_lock' on every guest enter by
setting the state to HV_TSC_PAGE_BROKEN after compute_tsc_page_parameters()
returns false.
Check for HV_TSC_PAGE_SET state instead of '!hv->tsc_ref.tsc_sequence' in
get_time_ref_counter() to properly handle the situation when we failed to
write the updated TSC page values to the guest.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When KVM_REQ_MASTERCLOCK_UPDATE request is issued (e.g. after migration)
we need to make sure no vCPU sees stale values in PV clock structures and
thus all vCPUs are kicked with KVM_REQ_CLOCK_UPDATE. Hyper-V TSC page
clocksource is global and kvm_guest_time_update() only updates it on vCPU0,
but this is not entirely correct: nothing blocks some other vCPU from
entering the guest before we finish the update on CPU0 and it can read
stale values from the page.
Invalidate TSC page in kvm_gen_update_masterclock() to switch all vCPUs
to using MSR based clocksource (HV_X64_MSR_TIME_REF_COUNT).
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HV_X64_MSR_TSC_EMULATION_STATUS indicates whether TSC accesses are emulated
after migration (to accommodate a different host TSC frequency when TSC
scaling is not supported; we don't implement this in KVM). Guest can use
the same MSR to stop TSC access emulation by writing zero. Writing anything
else is forbidden.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20210316143736.964151-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Store the address space ID in the TDP iterator so that it can be
retrieved without having to bounce through the root shadow page. This
streamlines the code and fixes a Sparse warning about not properly using
rcu_dereference() when grabbing the ID from the root on the fly.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210315233803.2706477-5-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In tdp_mmu_iter_cond_resched there is a call to tdp_iter_start which
causes the iterator to continue its walk over the paging structure from
the root. This is needed after a yield, as the paging structure could have
been freed in the interim.
The tdp_iter_start call is not very clear and something of a hack. It
requires exposing tdp_iter fields not used elsewhere in tdp_mmu.c and
the effect is not obvious from the function name. Factor a more aptly
named function out of tdp_iter_start and call it from
tdp_mmu_iter_cond_resched and tdp_iter_start.
No functional change intended.
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210315233803.2706477-4-bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a missing rcu_dereference in tdp_mmu_zap_spte_atomic.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210315233803.2706477-3-bgardon@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The pt passed into handle_removed_tdp_mmu_page does not need RCU
protection, as it is not at any risk of being freed by another thread at
that point. However, the implicit cast from tdp_ptep_t to u64 * dropped
the __rcu annotation without a proper rcu_dereference. Fix this by
passing the pt as a tdp_ptep_t and then rcu_dereferencing it in
the function.
Suggested-by: Sean Christopherson <seanjc@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20210315233803.2706477-2-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set the PAE roots used as decrypted to play nice with SME when KVM is
using shadow paging. Explicitly skip setting the C-bit when loading
CR3 for PAE shadow paging, even though it's completely ignored by the
CPU. The extra documentation is nice to have.
Note, there are several subtleties at play with NPT. In addition to
legacy shadow paging, the PAE roots are used for SVM's NPT when either
KVM is 32-bit (uses PAE paging) or KVM is 64-bit and shadowing 32-bit
NPT. However, 32-bit Linux, and thus KVM, doesn't support SME. And
64-bit KVM can happily set the C-bit in CR3. This also means that
keeping __sme_set(root) for 32-bit KVM when NPT is enabled is
conceptually wrong, but functionally ok since SME is 64-bit only.
Leave it as is to avoid unnecessary pollution.
Fixes: d0ec49d4de ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210309224207.1218275-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use '0' as the one and only value to denote an invalid pae_root, instead
of using both '0' and INVALID_PAGE.
Unlike root_hpa, the pae_roots hold permission bits and thus are
guaranteed to be non-zero. Having to deal with both values leads to
bugs, e.g. failing to set back to INVALID_PAGE, warning on the wrong
value, etc...
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210309224207.1218275-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Track the address of the top-level EPT struct, a.k.a. the root HPA,
instead of the EPTP itself for Hyper-V's paravirt TLB flush. The
paravirt API takes only the address, not the full EPTP, and in theory
tracking the EPTP could lead to false negatives, e.g. if the HPA matched
but the attributes in the EPTP do not. In practice, such a mismatch is
extremely unlikely, if not flat out impossible, given how KVM generates
the EPTP.
Opportunistically rename the related fields to use the 'root'
nomenclature, and to prefix them with 'hv_' to connect them to Hyper-V's
paravirt TLB flushing.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip additional EPTP flushes if one fails when processing EPTPs for
Hyper-V's paravirt TLB flushing. If _any_ flush fails, KVM falls back
to a full global flush, i.e. additional flushes are unnecessary (and
will likely fail anyways).
Continue processing the loop unless a mismatch was already detected,
e.g. to handle the case where the first flush fails and there is a
yet-to-be-detected mismatch.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Ifdef away the Hyper-V specific fields in structs kvm_vmx and vcpu_vmx
as each field has only a single reference outside of the struct itself
that isn't already wrapped in ifdeffery (and both are initialization).
vcpu_vmx.ept_pointer in particular should be wrapped as it is valid if
and only if Hyper-V is active, i.e. non-Hyper-V code cannot rely on it
to actually track the current EPTP (without additional code changes).
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly check that kvm_x86_ops.tlb_remote_flush() points at Hyper-V's
implementation for PV flushing instead of assuming that a non-NULL
implementation means running on Hyper-V. Wrap the related logic in
ifdeffery as hv_remote_flush_tlb() is defined iff CONFIG_HYPERV!=n.
Short term, the explicit check makes it more obvious why a non-NULL
tlb_remote_flush() triggers EPTP shenanigans. Long term, this will
allow TDX to define its own implementation of tlb_remote_flush() without
running afoul of Hyper-V.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't invalidate the common EPTP, and thus trigger rechecking of EPTPs
across all vCPUs, if the new EPTP matches the old/common EPTP. In all
likelihood this is a meaningless optimization, but there are (uncommon)
scenarios where KVM can reload the same EPTP.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the dedicated 'ept_pointers_match' field in favor of stuffing
'hv_tlb_eptp' with INVALID_PAGE to mark it as invalid, i.e. to denote
that there is at least one EPTP mismatch. Use a local variable to
track whether or not a mismatch is detected so that hv_tlb_eptp can be
used to skip redundant flushes.
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Combine the for-loops for Hyper-V TLB EPTP checking and flushing, and in
doing so skip flushes for vCPUs whose EPTP matches the target EPTP.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fold check_ept_pointer_match() into hv_remote_flush_tlb_with_range() in
preparation for combining the kvm_for_each_vcpu loops of the ==CHECK and
!=MATCH statements.
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Capture kvm_vmx in a local variable instead of polluting
hv_remote_flush_tlb_with_range() with to_kvm_vmx(kvm).
No functional change intended.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly track the EPTP that is common to all vCPUs instead of
grabbing vCPU0's EPTP when invoking Hyper-V's paravirt TLB flush.
Tracking the EPTP will allow optimizing the checks when loading a new
EPTP and will also allow dropping ept_pointer_match, e.g. by marking
the common EPTP as invalid.
This also technically fixes a bug where KVM could theoretically flush an
invalid GPA if all vCPUs have an invalid root. In practice, it's likely
impossible to trigger a remote TLB flush in such a scenario. In any
case, the superfluous flush is completely benign.
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Retrieve the active PCID only when writing a guest CR3 value, i.e. don't
get the PCID when using EPT or NPT. The PCID is especially problematic
for EPT as the bits have a different meaning, and so the PCID must be
manually stripped, which is annoying and unnecessary. And on VMX,
getting the active PCID also involves reading the guest's CR3 and
CR4.PCIDE, i.e. may add pointless VMREADs.
Opportunistically rename the pgd/pgd_level params to root_hpa and
root_level to better reflect their new roles. Keep the function names,
as "load the guest PGD" is still accurate/correct.
Last, and probably least, pass root_hpa as a hpa_t/u64 instead of an
unsigned long. The EPTP holds a 64-bit value, even in 32-bit mode, so
in theory EPT could support HIGHMEM for 32-bit KVM. Never mind that
doing so would require changing the MMU page allocators and reworking
the MMU to use kmap().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305183123.3978098-2-seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Avoid a jump by moving exception fixups out of line.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Message-Id: <20210226125621.111723-1-ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Debugging unexpected reserved bit page faults sucks. Dump the reserved
bits that (likely) caused the page fault to make debugging suck a little
less.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-25-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use low "available" bits to tag REMOVED SPTEs. Using a high bit is
moderately costly as it often causes the compiler to generate a 64-bit
immediate. More importantly, this makes it very clear REMOVED_SPTE is
a value, not a flag.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-24-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the is_removed_spte() helper instead of open coding the check.
No functional change intended.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-23-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Tweak the MMU_WARN that guards against weirdness when querying A/D status
to fire on a !MMU_PRESENT SPTE, as opposed to a MMIO SPTE. Attempting to
query A/D status on any kind of !MMU_PRESENT SPTE, MMIO or otherwise,
indicates a KVM bug. Case in point, several now-fixed bugs were
identified by enabling this new WARN.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-22-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce MMU_PRESENT to explicitly track which SPTEs are "present" from
the MMU's perspective. Checking for shadow-present SPTEs is a very
common operation for the MMU, particularly in hot paths such as page
faults. With the addition of "removed" SPTEs for the TDP MMU,
identifying shadow-present SPTEs is quite costly especially since it
requires checking multiple 64-bit values.
On 64-bit KVM, this reduces the footprint of kvm.ko's .text by ~2k bytes.
On 32-bit KVM, this increases the footprint by ~200 bytes, but only
because gcc now inlines several more MMU helpers, e.g. drop_parent_pte().
We now need to drop bit 11, used for the MMU_PRESENT flag, from
the set of bits used to store the generation number in MMIO SPTEs.
Otherwise MMIO SPTEs with bit 11 set would get false positives for
is_shadow_present_spte() and lead to a variety of fireworks, from oopses
to likely hangs of the host kernel.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-21-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use bits 57 and 58 for HOST_WRITABLE and MMU_WRITABLE when using EPT.
This will allow using bit 11 as a constant MMU_PRESENT, which is
desirable as checking for a shadow-present SPTE is one of the most
common SPTE operations in KVM, particular in hot paths such as page
faults.
EPT is short on low available bits; currently bit 11 is the only
always-available bit. Bit 10 is also available, but only while KVM
doesn't support mode-based execution. On the other hand, PAE paging
doesn't have _any_ high available bits. Thus, using bit 11 is the only
feasible option for MMU_PRESENT.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-20-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make the location of the HOST_WRITABLE and MMU_WRITABLE configurable for
a given KVM instance. This will allow EPT to use high available bits,
which in turn will free up bit 11 for a constant MMU_PRESENT bit.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-19-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Let the MMU deal with the SPTE masks to avoid splitting the logic and
knowledge across the MMU and VMX.
The SPTE masks that are used for EPT are very, very tightly coupled to
the MMU implementation. The use of available bits, the existence of A/D
types, the fact that shadow_x_mask even exists, and so on and so forth
are all baked into the MMU implementation. Cross referencing the params
to the masks is also a nightmare, as pretty much every param is a u64.
A future patch will make the location of the MMU_WRITABLE and
HOST_WRITABLE bits MMU specific, to free up bit 11 for a MMU_PRESENT bit.
Doing that change with the current kvm_mmu_set_mask_ptes() would be an
absolute mess.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Squish all the code for (re)setting the various SPTE masks into one
location. With the split code, it's not at all clear that the masks are
set once during module initialization. This will allow a future patch to
clean up initialization of the masks without shuffling code all over
tarnation.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move kvm_mmu_set_mask_ptes() into mmu.c as prep for future cleanup of the
mask initialization code.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Document that SHADOW_ACC_TRACK_SAVED_BITS_SHIFT is directly dependent on
bits 53:52 being used to track the A/D type.
Remove PT64_SECOND_AVAIL_BITS_SHIFT as it is at best misleading, and at
worst wrong. For PAE paging, which arguably is a variant of PT64, the
bits are reserved. For MMIO SPTEs the bits are not available as they're
used for the MMIO generation. For access tracked SPTEs, they are also
not available as bits 56:54 are used to store the original RX bits.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use bits 53 and 52 for the MMIO generation now that they're not used to
identify MMIO SPTEs.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename the various A/D status defines to explicitly associate them with
TDP. There is a subtle dependency on the bits in question never being
set when using PAE paging, as those bits are reserved, not available.
I.e. using these bits outside of TDP (technically EPT) would cause
explosions.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a module param to disable MMIO caching so that it's possible to test
the related flows without access to the necessary hardware. When using
shadow paging with 64-bit KVM and 52 bits of physical address space,
MMIO caching must be disabled as there are no reserved bits to be had.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
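A minimal sketch of such a knob; the parameter and variable names here
are illustrative, not necessarily the ones KVM uses:

    #include <linux/moduleparam.h>

    /* Read-only at runtime; load with mmio_caching=N to exercise the
     * "no MMIO caching" fallback paths. */
    static bool example_enable_mmio_caching = true;
    module_param_named(mmio_caching, example_enable_mmio_caching, bool, 0444);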
Stop tagging MMIO SPTEs with specific available bits and instead detect
MMIO SPTEs by checking for their unique SPTE value. The value is
guaranteed to be unique on shadow paging and NPT as setting reserved
physical address bits on any other type of SPTE would constitute a KVM
bug. Ditto for EPT, as creating a WX non-MMIO SPTE would also be a bug.
Note, this approach is also future-compatible with TDX, which will need
to reflect MMIO EPT violations as #VEs into the guest. To create an EPT
violation instead of a misconfig, TDX EPTs will need to have RWX=0. But
MMIO SPTEs will also be the only case where KVM clears SUPPRESS_VE, so
MMIO SPTEs will still be guaranteed to have a unique value within a given
MMU context.
The main motivation is to make it easier to reason about which types of
SPTEs use which available bits. As a happy side effect, this frees up
two more bits for storing the MMIO generation.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
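Sketch of value-based MMIO SPTE detection; the variable and helper names
are illustrative:

    #include <linux/types.h>

    /* The mask/value pair is chosen so that no legal SPTE can match it,
     * e.g. reserved PA bits on shadow paging/NPT, or RWX=0 on EPT. */
    static u64 example_mmio_mask;
    static u64 example_mmio_value;

    static inline bool example_is_mmio_spte(u64 spte)
    {
            return (spte & example_mmio_mask) == example_mmio_value;
    }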
The value returned by make_mmio_spte() is a SPTE, it is not a mask.
Name it accordingly.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove TDP MMU's call to trace_kvm_mmu_set_spte() that is done for both
shadow-present SPTEs and MMIO SPTEs. It's fully redundant for the
former, and unnecessary for the latter. This aligns TDP MMU tracing
behavior with that of the legacy MMU.
Fixes: 33dd3574f5 ("kvm: x86/mmu: Add existing trace points to TDP MMU")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that it should be impossible to convert a valid SPTE to an MMIO SPTE,
handle MMIO SPTEs early in mmu_set_spte() without going through
set_spte() and all the logic for removing an existing, valid SPTE.
The other caller of set_spte(), FNAME(sync_page)(), explicitly handles
MMIO SPTEs prior to calling set_spte().
This simplifies mmu_set_spte() and set_spte(), and also "fixes" an oddity
where MMIO SPTEs are traced by both trace_kvm_mmu_set_spte() and
trace_mark_mmio_spte().
Note, mmu_spte_set() will WARN if this new approach causes KVM to create
an MMIO SPTE overtop a valid SPTE.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If MMIO caching is disabled, e.g. when using shadow paging on CPUs with
52 bits of PA space, go straight to MMIO emulation and don't install an
MMIO SPTE. The SPTE will just generate a !PRESENT #PF, i.e. can't
actually accelerate future MMIO.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Retry page faults (re-enter the guest) that hit an invalid memslot
instead of treating the memslot as not existing, i.e. handling the
page fault as an MMIO access. When deleting a memslot, SPTEs aren't
zapped and the TLBs aren't flushed until after the memslot has been
marked invalid.
Handling the invalid slot as MMIO means there's a small window where a
page fault could replace a valid SPTE with an MMIO SPTE. The legacy
MMU handles such a scenario cleanly, but the TDP MMU assumes such
behavior is impossible (see the BUG() in __handle_changed_spte()).
There's really no good reason why the legacy MMU should allow such a
scenario, and closing this hole allows for additional cleanups.
Fixes: 2f2fad0897 ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
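A sketch of the intended flow, assuming KVM's internal RET_PF_* codes,
KVM_MEMSLOT_INVALID and struct kvm_memory_slot are in scope; the helper
itself is hypothetical:

    static int example_handle_slot_fault(struct kvm_memory_slot *slot)
    {
            /*
             * A slot being deleted or moved is marked invalid before its
             * SPTEs are zapped; re-enter the guest instead of treating
             * the access as MMIO so that a stale fault cannot install an
             * MMIO SPTE over a still-valid mapping.
             */
            if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
                    return RET_PF_RETRY;

            return RET_PF_EMULATE; /* genuine no-slot access => MMIO */
    }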
Disable MMIO caching if the MMIO value collides with the L1TF mitigation
that usurps high PFN bits. In practice this should never happen as only
CPUs with SME support can generate such a collision (because the MMIO
value can theoretically get adjusted into legal memory), and no CPUs
exist that support SME and are susceptible to L1TF. But, closing the
hole is trivial.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Bail from fast_page_fault() if the SPTE is not a shadow-present SPTE.
Functionally, this is not strictly necessary as the !is_access_allowed()
check will eventually reject the fast path, but an early check on
shadow-present skips unnecessary checks and will allow a future patch to
tweak the A/D status auditing to warn if KVM attempts to query A/D bits
without first ensuring the SPTE is a shadow-present SPTE.
Note, is_shadow_present_pte() is quite expensive at this time, i.e. this
might be a net negative in the short term. A future patch will optimize
is_shadow_present_pte() to a single AND operation and remedy the issue.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
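A simplified sketch of the early-out; the real fast_page_fault() loop
carries more state, and this assumes KVM's is_shadow_present_pte() and
PT_WRITABLE_MASK are in scope (the wrapper name is hypothetical):

    static bool example_can_fix_fast(u64 spte, bool write_fault)
    {
            /* Bail before any A/D or access-allowed query on a SPTE that
             * is not shadow-present. */
            if (!is_shadow_present_pte(spte))
                    return false;

            return !write_fault || (spte & PT_WRITABLE_MASK);
    }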
When updating accessed and dirty bits, check that the new SPTE is present
before attempting to query its A/D bits. Failure to confirm the SPTE is
present can theoretically cause a false negative, e.g. if an MMIO SPTE
replaces a "real" SPTE and somehow the PFNs magically match.
Realistically, this is all but guaranteed to be a benign bug. Fix it up
primarily so that a future patch can tweak the MMU_WARN_ON checking A/D
status to fire if the SPTE is not-present.
Fixes: f8e144971c ("kvm: x86/mmu: Add access tracking for tdp_mmu")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210225204749.1512652-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
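Sketch of the ordering being enforced, assuming KVM's
is_shadow_present_pte() and is_accessed_spte() helpers are in scope;
the wrapper name is illustrative:

    /* Only trust accessed state once the new SPTE is known present. */
    static bool example_lost_accessed_bit(u64 old_spte, u64 new_spte)
    {
            if (!is_shadow_present_pte(new_spte))
                    return is_accessed_spte(old_spte);

            return is_accessed_spte(old_spte) && !is_accessed_spte(new_spte);
    }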
Add a TDP MMU helper to handle a single HVA hook; the name is a nice
reminder that the flow in question is operating on a single HVA.
No functional change intended.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add typedefs for the MMU handlers that are invoked when walking the MMU
SPTEs (rmaps in legacy MMU) to act on a host virtual address range.
No functional change intended.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the @end param when aging a GFN instead of hardcoding the walk to a
single GFN. Unlike tdp_set_spte(), which simply cannot work with more
than one GFN, aging multiple GFNs would not break, though admittedly it
would be weird. Be nice to the casual reader and don't make them puzzle
out why the end GFN is unused.
No functional change intended.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if set_tdp_spte() is invoked with multiple GFNs. It is specifically
a callback to handle a single host PTE being changed. Consuming the
@end parameter also eliminates the confusing 'unused' parameter.
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
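A sketch of the single-GFN contract described in the two changes above;
the signature is simplified (the real handler takes more parameters) and
assumes KVM's struct kvm and gfn_t types are in scope:

    /* change_pte()-style hook: by construction it covers exactly one GFN. */
    static int example_set_spte_gfn(struct kvm *kvm, gfn_t start, gfn_t end)
    {
            if (WARN_ON_ONCE(start + 1 != end))
                    return 0;

            /* ... update the single SPTE mapping @start ... */
            return 0;
    }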
Remove an unnecessary remote TLB flush from set_tdp_spte(), the TDP MMU's
hook for handling change_pte() invocations from the MMU notifier. If
the new host PTE is writable, the flush is completely redundant as there
are no further changes to the SPTE before the post-loop flush. If the
host PTE is read-only, then the primary MMU is responsible for ensuring
that the contents of the old and new pages are identical, thus it's safe
to let the guest continue reading both the old and new pages. KVM must
only ensure the old page cannot be referenced after returning from its
callback; this is handled by the post-loop flush.
Fixes: 1d8dd6b3f1 ("kvm: x86/mmu: Support changed pte notifier in tdp MMU")
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210226010329.1766033-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This field was left uninitialized by mistake.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210225154135.405125-3-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A page fault can be queued while the vCPU is in paged real mode on AMD,
and the AMD manual asks the user to always intercept it (otherwise the
result is undefined). The resulting VM exit does have an error code.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20210225154135.405125-2-mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the vmcb12 control clean field to determine which vmcb12.save
registers were marked dirty in order to minimize register copies
when switching from L1 to L2. Those vmcb12 registers marked as dirty need
to be copied to L0's vmcb02 as they will be used to update the vmcb
state cache for the L2 VMRUN. In the case where we have a different
vmcb12 from the last L2 VMRUN, all vmcb12.save registers must be
copied over to L2.save.
Tested:
kvm-unit-tests
kvm selftests
Fedora L1 L2
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210301200844.2000-1-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
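Sketch of how a clean bit gates a copy; the clean field and the VMCB_*
clean-bit indices are assumed from KVM's SVM headers, the helper itself
is illustrative:

    #include <linux/bits.h>

    /* A clear clean bit means L1 touched that register group since the
     * last VMRUN, so the corresponding vmcb12.save state must be copied. */
    static bool example_vmcb12_is_dirty(struct vmcb *vmcb12, int clean_bit)
    {
            return !(vmcb12->control.clean & BIT(clean_bit));
    }

    /* e.g.: if (example_vmcb12_is_dirty(vmcb12, VMCB_SEG)) copy segments */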
Newer AMD processors have a feature to virtualize the use of the
SPEC_CTRL MSR. Presence of this feature is indicated via CPUID
function 0x8000000A_EDX[20]: GuestSpecCtrl. Hypervisors are not
required to enable this feature since it is automatically enabled on
processors that support it.
A hypervisor may wish to impose speculation controls on guest
execution or a guest may want to impose its own speculation controls.
Therefore, the processor implements both host and guest
versions of SPEC_CTRL.
When in host mode, the host SPEC_CTRL value is in effect and writes
update only the host version of SPEC_CTRL. On a VMRUN, the processor
loads the guest version of SPEC_CTRL from the VMCB. When the guest
writes SPEC_CTRL, only the guest version is updated. On a VMEXIT,
the guest version is saved into the VMCB and the processor returns
to only using the host SPEC_CTRL for speculation control. The guest
SPEC_CTRL is located at offset 0x2E0 in the VMCB.
The effective SPEC_CTRL setting is the guest SPEC_CTRL setting or'ed
with the hypervisor SPEC_CTRL setting. This allows the hypervisor to
ensure a minimum SPEC_CTRL if desired.
This support also fixes an issue where a guest may sometimes see an
inconsistent value for the SPEC_CTRL MSR on processors that support
this feature. With the current SPEC_CTRL support, the first write to
SPEC_CTRL is intercepted and the virtualized version of the SPEC_CTRL
MSR is not updated. When the guest reads back the SPEC_CTRL MSR, it
will be 0x0, instead of the actual expected value. There isn't a
security concern here, because the host SPEC_CTRL value is or'ed with
the guest SPEC_CTRL value to generate the effective SPEC_CTRL value.
KVM writes the guest's virtualized SPEC_CTRL value to the SPEC_CTRL MSR
just before VMRUN, so the MSR will always hold the actual value even
though it doesn't appear that way in the guest. The guest will only see
the proper value for the SPEC_CTRL register if it writes to the
SPEC_CTRL register again. With Virtual SPEC_CTRL support, the save area
spec_ctrl is properly saved and restored, so the guest will always see
the proper value when it is read back.
Signed-off-by: Babu Moger <babu.moger@amd.com>
Message-Id: <161188100955.28787.11816849358413330720.stgit@bmoger-ubuntu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
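The "effective value" arithmetic above, as a one-line sketch:

    #include <linux/types.h>

    /* The CPU applies the OR of both values while the guest runs, so the
     * hypervisor's setting acts as a floor the guest cannot lower. */
    static inline u64 example_effective_spec_ctrl(u64 host, u64 guest)
    {
            return host | guest;
    }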
This allows KVM to avoid copying these fields between vmcb01
and vmcb02 on nested guest entry/exit.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Thanks to the new macros that handle exception handling for SVM
instructions, it is easier to just do the VMLOAD/VMSAVE in C.
This is safe, as shown by the fact that the host reload is
already done outside the assembly source.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Skip PAUSE after interception to avoid unnecessarily re-executing the
instruction in the guest, e.g. after regaining control post-yield.
The missed skip is a benign bug, as KVM disables PAUSE interception if filtering is
off, including the case where pause_filter_count is set to zero.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove bizarre code that causes KVM to run RDPMC through the emulator
when nrips is disabled. Accelerated emulation of RDPMC doesn't rely on
any additional data from the VMCB, and SVM has generic handling for
updating RIP to skip instructions when nrips is disabled.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the entirety of the accelerated RDPMC emulation to x86.c, and assign
the common handler directly to the exit handler array for VMX. SVM has
bizarre nrips behavior that prevents it from directly invoking the common
handler. The nrips goofiness will be addressed in a future patch.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the trivial exit handlers, e.g. for instructions that KVM
"emulates" as nops, to common x86 code. Assign the common handlers
directly to the exit handler arrays and drop the vendor trampolines.
Opportunistically use pr_warn_once() where appropriate.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move the entirety of XSETBV emulation to x86.c, and assign the
function directly to both VMX's and SVM's exit handlers, i.e. drop the
unnecessary trampolines.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add another helper layer for VMLOAD+VMSAVE; the code is identical except
for the one line that determines which VMCB is the source and which is
the destination.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a helper to consolidate boilerplate for nested VM-Exits that don't
provide any data in exit_info_*.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Synthesize a nested VM-Exit if L2 triggers an emulated triple fault
instead of exiting to userspace, which likely will kill L1. Any flow
that does KVM_REQ_TRIPLE_FAULT is suspect, but the most common scenario
for L2 killing L1 is if L0 (KVM) intercepts a contributory exception that
is _not_ intercepted by L1. E.g. if KVM is intercepting #GPs for the
VMware backdoor, a #GP that occurs in L2 while vectoring an injected #DF
will cause KVM to emulate triple fault.
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210302174515.2812275-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor the svm_exit_handlers API to pass @vcpu instead of @svm to
allow directly invoking common x86 exit handlers (in a future patch).
Opportunistically convert an absurd number of instances of 'svm->vcpu'
to direct uses of 'vcpu' to avoid pointless casting.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210205005750.3841462-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The logic of update_cr0_intercept is pointlessly complicated.
All svm_set_cr0 needs to do is compute the effective cr0 and compare
it with the guest value.
Inlining the function and simplifying the condition
clarifies what it is doing.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use trace_kvm_nested_vmenter_failed() and its macro magic to trace
consistency check failures on nested VMRUN. Tracing such failures by
running the buggy VMM as a KVM guest is often the only way to get a
precise explanation of why VMRUN failed.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Move KVM's CC() macro to x86.h so that it can be reused by nSVM.
Debugging VM-Enter is as painful on SVM as it is on VMX.
Rename the more visible macro to KVM_NESTED_VMENTER_CONSISTENCY_CHECK
to avoid any collisions with the uber-concise "CC".
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210204000117.3303214-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
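A hedged sketch of such a consistency-check macro; the real definition
in x86.h may differ in detail, and the macro name here is illustrative:

    /*
     * Evaluates the check, traces a failure with the stringified
     * expression, and yields the result so callers can use it in an 'if'.
     */
    #define EXAMPLE_VMENTER_CC(check)                                   \
    ({                                                                  \
            bool failed = (check);                                      \
            if (failed)                                                 \
                    trace_kvm_nested_vmenter_failed(#check, 0);         \
            failed;                                                     \
    })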
The path for SVM_SET_NESTED_STATE needs to have the same checks for the CPU
registers as we have in the VMRUN path for a nested guest. This patch adds
those missing checks to svm_set_nested_state().
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Message-Id: <20201006190654.32305-3-krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The VMLOAD/VMSAVE data is not taken from userspace, since it will
not be restored on VMEXIT (it will be copied from VMCB02 to VMCB01).
For clarity, replace the wholesale copy of the VMCB save area
with a copy of that state only.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since L1 and L2 now use different VMCBs, most of the fields remain the
same in VMCB02 from one L2 run to the next. Since KVM itself is not
looking at VMCB12's clean field, for now not much can be optimized.
However, in the future we could avoid more copies if the VMCB12's SEG
and DT sections are clean.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since L1 and L2 now use different VMCBs, most of the fields remain
the same from one L1 run to the next. svm_set_cr0 and other functions
called by nested_svm_vmexit already take care of clearing the
corresponding clean bits; only the TSC offset is special.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Most fields were going to be overwritten by vmcb12 control fields, or
do not matter at all because they are filled by the processor on vmexit.
Therefore, we need not copy them from vmcb01 to vmcb02 on vmentry.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that SVM is using a separate vmcb01 and vmcb02 (and also uses the vmcb12
naming) we can give clearer names to functions that write to and read
from those VMCBs. Likewise, variables and parameters can be renamed
from nested_vmcb to vmcb12.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch moves the asid_generation from the vcpu to the vmcb
in order to track the ASID generation that was active the last
time the vmcb was run. If sd->asid_generation changes between
two runs, the old ASID is invalid and must be changed.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210112164313.4204-3-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch moves the physical cpu tracking from the vcpu
to the vmcb in svm_switch_vmcb. If either vmcb01 or vmcb02
changes physical cpus from one vmrun to the next, the vmcb's
previous cpu is preserved for comparison with the current
cpu, and the vmcb is marked dirty if they differ. This prevents
the processor from using old cached data for a vmcb that may
have been updated on a prior run on a different processor.
It also moves the physical cpu check from svm_vcpu_load
to pre_svm_run as the check only needs to be done at run.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20210112164313.4204-2-cavery@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
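Sketch of the per-vmcb pCPU tracking; the field and helper names are
assumptions about the SVM code, not verbatim:

    /* Called around pre_svm_run(): if this vmcb last ran on a different
     * physical CPU, invalidate its cached state by marking it all dirty. */
    static void example_check_vmcb_cpu(struct vcpu_svm *svm, int cpu)
    {
            if (svm->current_vmcb->cpu != cpu) {
                    vmcb_mark_all_dirty(svm->current_vmcb->ptr);
                    svm->current_vmcb->cpu = cpu;
            }
    }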
svm->vmcb will now point to a separate vmcb for L1 (not nested) or L2
(nested).
The main advantages are removing get_host_vmcb and hsave, in favor of
concepts that are shared with VMX.
We no longer need to stash the L1 registers in hsave while L2
runs, but we need to copy the VMLOAD/VMSAVE registers from VMCB01 to
VMCB02 and back. This more or less has the same cost, but code-wise
nested_svm_vmloadsave can be reused.
This patch omits several optimizations that are possible:
- for simplicity there is some wholesale copying of vmcb.control areas
which can go away.
- we should be able to better use the VMCB01 and VMCB02 clean bits.
- another possibility is to always use VMCB01 for VMLOAD and VMSAVE,
thus avoiding the copy of VMLOAD/VMSAVE registers from VMCB01 to
VMCB02 and back.
Tested:
kvm-unit-tests
kvm self tests
Loaded fedora nested guest on fedora
Signed-off-by: Cathy Avery <cavery@redhat.com>
Message-Id: <20201011184818.3609-3-cavery@redhat.com>
[Fix conflicts; keep VMCB02 G_PAT up to date whenever guest writes the
PAT MSR; do not copy CR4 over from VMCB01 as it is not needed anymore; add
a few more comments. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Override the shadow root level in the MMU context when configuring
NPT for shadowing nested NPT. The level is always tied to the TDP level
of the host, not whatever level the guest happens to be using.
Fixes: 096586fda5 ("KVM: nSVM: Correctly set the shadow NPT root level in its MMU role")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Don't strip the C-bit from the faulting address on an intercepted #PF,
the address is a virtual address, not a physical address.
Fixes: 0ede79e132 ("KVM: SVM: Clear C-bit from the page fault address")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-13-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
WARN if KVM is about to dereference a NULL pae_root or lm_root when
loading an MMU, and convert the BUG() on a bad shadow_root_level into a
WARN (now that errors are handled cleanly). With nested NPT, botching
the level and sending KVM down the wrong path is all too easy, and the
on-demand allocation of pae_root and lm_root means bugs crash the host.
Obviously, KVM could unconditionally allocate the roots, but that's
arguably a worse failure mode as it would potentially corrupt the guest
instead of crashing it.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-18-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For clarity, explicitly skip syncing roots if the MMU load failed
instead of relying on the !VALID_PAGE check in kvm_mmu_sync_roots().
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-17-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unexport the MMU load and unload helpers now that they are no longer
used (incorrectly) in vendor code.
Opportunistically move the kvm_mmu_sync_roots() declaration into mmu.h,
it should not be exposed to vendor code.
No functional change intended.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-16-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Defer unloading the MMU after an INVPCID until the instruction emulation
has completed, i.e. until after RIP has been updated.
On VMX, this is a benign bug as VMX doesn't touch the MMU when skipping
an emulated instruction. However, on SVM, if nrip is disabled, the
emulator is used to skip an instruction, which would lead to fireworks
if the emulator were invoked without a valid MMU.
Fixes: eb4b248e15 ("kvm: vmx: Support INVPCID in shadow paging mode")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-15-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Defer reloading the MMU after a successful EPTP switch. The VMFUNC
instruction itself is executed in the previous EPTP context, so any side
effects, e.g. updating RIP, should occur in the old context. Practically
speaking, this bug is benign as VMX doesn't touch the MMU when skipping
an emulated instruction, nor does queuing a single-step #DB. No other
post-switch side effects exist.
Fixes: 41ab937274 ("KVM: nVMX: Emulate EPTP switching for the L1 hypervisor")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-14-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set the C-bit in SPTEs that are set outside of the normal MMU flows,
specifically the PDPTRs and the handful of special-cased "LM root"
entries, all of which are shadow paging only.
Note, the direct-mapped-root PDPTR handling is needed for the scenario
where paging is disabled in the guest, in which case KVM uses a direct
mapped MMU even though TDP is disabled.
Fixes: d0ec49d4de ("kvm/x86/svm: Support Secure Memory Encryption within KVM")
Cc: stable@vger.kernel.org
Cc: Brijesh Singh <brijesh.singh@amd.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Exempt NULL PAE roots from the check to detect leaks, since
kvm_mmu_free_roots() doesn't set them back to INVALID_PAGE. Stop hiding
the WARNs to detect PAE root leaks behind MMU_WARN_ON, the hidden WARNs
obviously didn't do their job given the hilarious number of bugs that
could lead to PAE roots being leaked, not to mention the above false
positive.
Opportunistically delete a warning on root_hpa being valid, there's
nothing special about 4/5-level shadow pages that warrants a WARN.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check the validity of the PDPTRs before allocating any of the PAE roots,
otherwise a bad PDPTR will cause KVM to leak any previously allocated
roots.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210305011101.3597423-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
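Sketch of the reordering; PT_PRESENT_MASK is assumed from KVM's headers,
and the example_read_pdptr()/example_pdptr_valid() helpers are
hypothetical stand-ins for the real PDPTR accessors:

    static int example_load_pae_roots(struct kvm_vcpu *vcpu)
    {
            u64 pdptrs[4];
            int i;

            /* Read and validate every PDPTR before touching any root. */
            for (i = 0; i < 4; i++) {
                    pdptrs[i] = example_read_pdptr(vcpu, i);
                    if ((pdptrs[i] & PT_PRESENT_MASK) &&
                        !example_pdptr_valid(pdptrs[i]))
                            return 1; /* fail before allocating anything */
            }

            /* ... only now allocate and install the PAE roots ... */
            return 0;
    }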