kvm_init_shadow_ept_mmu() doesn't set get_pdptr() hook and is this
not a problem just because MMU context is already initialized and this
hook points to kvm_pdptr_read(). As we're intended to use a dedicated
MMU for shadow EPT MMU set this hook explicitly.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
As a preparation to full MMU split between L1 and L2 make vcpu->arch.mmu
a pointer to the currently used mmu. For now, this is always
vcpu->arch.root_mmu. No functional change.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
The quote from the comment almost says it all: we are currently zeroing
the guest dr6 in kvm_arch_vcpu_put, because do_debug expects it. However,
the host %dr6 is either:
- zero because the guest hasn't run after kvm_arch_vcpu_load
- written from vcpu->arch.dr6 by vcpu_enter_guest
- written by the guest and copied to vcpu->arch.dr6 by ->sync_dirty_debug_regs().
Therefore, we can skip the write if vcpu->arch.dr6 is already zero. We
may do extra useless writes if vcpu->arch.dr6 is nonzero but the guest
hasn't run; however that is less important for performance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rewrite kvm_hv_flush_tlb()/send_ipi_vcpus_mask() making them cleaner and
somewhat more optimal.
hv_vcpu_in_sparse_set() is converted to sparse_set_to_vcpu_mask()
which copies sparse banks u64-at-a-time and then, depending on the
num_mismatched_vp_indexes value, returns immediately or does
vp index to vcpu index conversion by walking all vCPUs.
To support the change and make kvm_hv_send_ipi() look similar to
kvm_hv_flush_tlb() send_ipi_vcpus_mask() is introduced.
Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Regardless of whether your TLB is lush or not it still needs flushing.
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When early consistency checks are enabled, all VMFail conditions
should be caught by nested_vmx_check_vmentry_hw().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM defers many VMX consistency checks to the CPU, ostensibly for
performance reasons[1], including checks that result in VMFail (as
opposed to VMExit). This behavior may be undesirable for some users
since this means KVM detects certain classes of VMFail only after it
has processed guest state, e.g. emulated MSR load-on-entry. Because
there is a strict ordering between checks that cause VMFail and those
that cause VMExit, i.e. all VMFail checks are performed before any
checks that cause VMExit, we can detect (almost) all VMFail conditions
via a dry run of sorts. The almost qualifier exists because some
state in vmcs02 comes from L0, e.g. VPID, which means that hardware
will never detect an invalid VPID in vmcs12 because it never sees
said value. Software must (continue to) explicitly check such fields.
After preparing vmcs02 with all state needed to pass the VMFail
consistency checks, optionally do a "test" VMEnter with an invalid
GUEST_RFLAGS. If the VMEnter results in a VMExit (due to bad guest
state), then we can safely say that the nested VMEnter should not
VMFail, i.e. any VMFail encountered in nested_vmx_vmexit() must
be due to an L0 bug. GUEST_RFLAGS is used to induce VMExit as it
is unconditionally loaded on all implementations of VMX, has an
invalid value that is writable on a 32-bit system and its consistency
check is performed relatively early in all implementations (the exact
order of consistency checks is micro-architectural).
Unfortunately, since the "passing" case causes a VMExit, KVM must
be extra diligent to ensure that host state is restored, e.g. DR7
and RFLAGS are reset on VMExit. Failure to restore RFLAGS.IF is
particularly fatal.
And of course the extra VMEnter and VMExit impacts performance.
The raw overhead of the early consistency checks is ~6% on modern
hardware (though this could easily vary based on configuration),
while the added latency observed from the L1 VMM is ~10%. The
early consistency checks do not occur in a vacuum, e.g. spending
more time in L0 can lead to more interrupts being serviced while
emulating VMEnter, thereby increasing the latency observed by L1.
Add a module param, early_consistency_checks, to provide control
over whether or not VMX performs the early consistency checks.
In addition to standard on/off behavior, the param accepts a value
of -1, which is essentialy an "auto" setting whereby KVM does
the early checks only when it thinks it's running on bare metal.
When running nested, doing early checks is of dubious value since
the resulting behavior is heavily dependent on L0. In the future,
the "auto" setting could also be used to default to skipping the
early hardware checks for certain configurations/platforms if KVM
reaches a state where it has 100% coverage of VMFail conditions.
[1] To my knowledge no one has implemented and tested full software
emulation of the VMFail consistency checks. Until that happens,
one can only speculate about the actual performance overhead of
doing all VMFail consistency checks in software. Obviously any
code is slower than no code, but in the grand scheme of nested
virtualization it's entirely possible the overhead is negligible.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
EFER is constant in the host and writing it once during setup means
we can skip writing the host value in add_atomic_switch_msr_special().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
... as every invocation of nested_vmx_{fail,succeed} is immediately
followed by a call to kvm_skip_emulated_instruction(). This saves
a bit of code and eliminates some silly paths, e.g. nested_vmx_run()
ended up with a goto label purely used to call and return
kvm_skip_emulated_instruction().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
EFLAGS is set to a fixed value on VMExit, calling nested_vmx_succeed()
is unnecessary and wrong.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A successful VMEnter is essentially a fancy indirect branch that
pulls the target RIP from the VMCS. Skipping the instruction is
unnecessary (RIP will get overwritten by the VMExit handler) and
is problematic because it can incorrectly suppress a #DB due to
EFLAGS.TF when a VMFail is detected by hardware (happens after we
skip the instruction).
Now that vmx_nested_run() is not prematurely skipping the instr,
use the full kvm_skip_emulated_instruction() in the VMFail path
of nested_vmx_vmexit(). We also need to explicitly update the
GUEST_INTERRUPTIBILITY_INFO when loading vmcs12 host state.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In anticipation of using vmcs02 to do early consistency checks, move
the early preparation of vmcs02 prior to checking the postreqs. The
downside of this approach is that we'll unnecessary load vmcs02 in
the case that check_vmentry_postreqs() fails, but that is essentially
our slow path anyways (not actually slow, but it's the path we don't
really care about optimizing).
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a dedicated flag to track if vmcs02 has been initialized, i.e.
the constant state for vmcs02 has been written to the backing VMCS.
The launched flag (in struct loaded_vmcs) gets cleared on logical
CPU migration to mirror hardware behavior[1], i.e. using the launched
flag to determine whether or not vmcs02 constant state needs to be
initialized results in unnecessarily re-initializing the VMCS when
migrating between logical CPUS.
[1] The active VMCS needs to be VMCLEARed before it can be migrated
to a different logical CPU. Hardware's VMCS cache is per-CPU
and is not coherent between CPUs. VMCLEAR flushes the cache so
that any dirty data is written back to memory. A side effect
of VMCLEAR is that it also clears the VMCS's internal launch
flag, which KVM must mirror because VMRESUME must be used to
run a previously launched VMCS.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add prepare_vmcs02_early() and move pieces of prepare_vmcs02() to the
new function. prepare_vmcs02_early() writes the bits of vmcs02 that
a) must be in place to pass the VMFail consistency checks (assuming
vmcs12 is valid) and b) are needed recover from a VMExit, e.g. host
state that is loaded on VMExit. Splitting the functionality will
enable KVM to leverage hardware to do VMFail consistency checks via
a dry run of VMEnter and recover from a potential VMExit without
having to fully initialize vmcs02.
Add prepare_vmcs02_constant_state() to handle writing vmcs02 state that
comes from vmcs01 and never changes, i.e. we don't need to rewrite any
of the vmcs02 that is effectively constant once defined.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmx->pml_pg is allocated by vmx_create_vcpu() and is only nullified
when the vCPU is destroyed by vmx_free_vcpu(). Remove the ASSERTs
on vmx->pml_pg, there is no need to carry debug code that provides
no value to the current code base.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename 'fail' to 'vmentry_fail_vmexit_guest_mode' to make it more
obvious that it's simply a different entry point to the VMExit path,
whose purpose is unwind the updates done prior to calling
prepare_vmcs02().
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handling all VMExits due to failed consistency checks on VMEnter in
nested_vmx_enter_non_root_mode() consolidates all relevant code into
a single location, and removing nested_vmx_entry_failure() eliminates
a confusing function name and label. For a VMEntry, "fail" and its
derivatives has a very specific meaning due to the different behavior
of a VMEnter VMFail versus VMExit, i.e. it wasn't obvious that
nested_vmx_entry_failure() handled VMExit scenarios.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation of supporting checkpoint/restore for nested state,
commit ca0bde28f2 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()")
modified check_vmentry_postreqs() to only perform the guest EFER
consistency checks when nested_run_pending is true. But, in the
normal nested VMEntry flow, nested_run_pending is only set after
check_vmentry_postreqs(), i.e. the consistency check is being skipped.
Alternatively, nested_run_pending could be set prior to calling
check_vmentry_postreqs() in nested_vmx_run(), but placing the
consistency checks in nested_vmx_enter_non_root_mode() allows us
to split prepare_vmcs02() and interleave the preparation with
the consistency checks without having to change the call sites
of nested_vmx_enter_non_root_mode(). In other words, the rest
of the consistency check code in nested_vmx_run() will be joining
the postreqs checks in future patches.
Fixes: ca0bde28f2 ("kvm: nVMX: Split VMCS checks from nested_vmx_run()")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Jim Mattson <jmattson@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to be more consistent with the nested VMX nomenclature.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VM_ENTRY_IA32E_MODE and VM_{ENTRY,EXIT}_LOAD_IA32_EFER will be
explicitly set/cleared as needed by vmx_set_efer(), but attempt
to get the bits set correctly when intializing the control fields.
Setting the value correctly can avoid multiple VMWrites.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not unconditionally call clear_atomic_switch_msr() when updating
EFER. This adds up to four unnecessary VMWrites in the case where
guest_efer != host_efer, e.g. if the load_on_{entry,exit} bits were
already set.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reset the vm_{entry,exit}_controls_shadow variables as well as the
segment cache after loading a new VMCS in vmx_switch_vmcs(). The
shadows/cache track VMCS data, i.e. they're stale every time we
switch to a new VMCS regardless of reason.
This fixes a bug where stale control shadows would be consumed after
a nested VMExit due to a failed consistency check.
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Write VM_EXIT_CONTROLS using vm_exit_controls_init() when configuring
vmcs02, otherwise vm_exit_controls_shadow will be stale. EFER in
particular can be corrupted if VM_EXIT_LOAD_IA32_EFER is not updated
due to an incorrect shadow optimization, which can crash L0 due to
EFER not being loaded on exit. This does not occur with the current
code base simply because update_transition_efer() unconditionally
clears VM_EXIT_LOAD_IA32_EFER before conditionally setting it, and
because a nested guest always starts with VM_EXIT_LOAD_IA32_EFER
clear, i.e. we'll only ever unnecessarily clear the bit. That is,
until someone optimizes update_transition_efer()...
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
An invalid EPTP causes a VMFail(VMXERR_ENTRY_INVALID_CONTROL_FIELD),
not a VMExit.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Invalid host state related to loading EFER on VMExit causes a
VMFail(VMXERR_ENTRY_INVALID_HOST_STATE_FIELD), not a VMExit.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When bit 3 (corresponding to CR0.TS) of the VMCS12 cr0_guest_host_mask
field is clear, the VMCS12 guest_cr0 field does not necessarily hold
the current value of the L2 CR0.TS bit, so the code that checked for
L2's CR0.TS bit being set was incorrect. Moreover, I'm not sure that
the CR0.TS check was adequate. (What if L2's CR0.EM was set, for
instance?)
Fortunately, lazy FPU has gone away, so L0 has lost all interest in
intercepting #NM exceptions. See commit bd7e5b0899 ("KVM: x86:
remove code for lazy FPU handling"). Therefore, there is no longer any
question of which hypervisor gets first dibs. The #NM VM-exit should
always be reflected to L1. (Note that the corresponding bit must be
set in the VMCS12 exception_bitmap field for there to be an #NM
VM-exit at all.)
Fixes: ccf9844e5d ("kvm, vmx: Really fix lazy FPU on nested guest")
Reported-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Tested-by: Abhiroop Dabral <adabral@paloaltonetworks.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Using hypercall for sending IPIs is faster because this allows to specify
any number of vCPUs (even > 64 with sparse CPU set), the whole procedure
will take only one VMEXIT.
Current Hyper-V TLFS (v5.0b) claims that HvCallSendSyntheticClusterIpi
hypercall can't be 'fast' (passing parameters through registers) but
apparently this is not true, Windows always uses it as 'fast' so we need
to support that.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VP inedx almost always matches VCPU and when it does it's faster to walk
the sparse set instead of all vcpus.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This probably doesn't matter much (KVM_MAX_VCPUS is much lower nowadays)
but valid_bank_mask is really u64 and not unsigned long.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In most common cases VP index of a vcpu matches its vcpu index. Userspace
is, however, free to set any mapping it wishes and we need to account for
that when we need to find a vCPU with a particular VP index. To keep search
algorithms optimal in both cases introduce 'num_mismatched_vp_indexes'
counter showing how many vCPUs with mismatching VP index we have. In case
the counter is zero we can assume vp_index == vcpu_idx.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename 'hv' to 'hv_vcpu' in kvm_hv_set_msr/kvm_hv_get_msr(); 'hv' is
'reserved' for 'struct kvm_hv' variables across the file.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We can use 'NULL' to represent 'all cpus' case in
kvm_make_vcpus_request_mask() and avoid building vCPU mask with
all vCPUs.
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V TLFS (5.0b) states:
> Virtual processors are identified by using an index (VP index). The
> maximum number of virtual processors per partition supported by the
> current implementation of the hypervisor can be obtained through CPUID
> leaf 0x40000005. A virtual processor index must be less than the
> maximum number of virtual processors per partition.
Forbid userspace to set VP_INDEX above KVM_MAX_VCPUS. get_vcpu_by_vpidx()
can now be optimized to bail early when supplied vpidx is >= KVM_MAX_VCPUS.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If kvm_apic_map_get_dest_lapic() finds a disabled LAPIC,
it will return with bitmap==0 and (*r == -1) will be returned to
userspace.
QEMU may then record "KVM: injection failed, MSI lost
(Operation not permitted)" in its log, which is quite puzzling.
Reported-by: Peng Hao <penghao122@sina.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, there are two definitions related to huge page, but a little bit
far from each other and seems loosely connected:
* KVM_NR_PAGE_SIZES defines the number of different size a page could map
* PT_MAX_HUGEPAGE_LEVEL means the maximum level of huge page
The number of different size a page could map equals the maximum level
of huge page, which is implied by current definition.
While current implementation may not be kind to readers and further
developers:
* KVM_NR_PAGE_SIZES looks like a stand alone definition at first sight
* in case we need to support more level, two places need to change
This patch tries to make these two definition more close, so that reader
and developer would feel more comfortable to manipulate.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
is_external_interrupt() is not used now and so remove it.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The code tries to pre-allocate *min* number of objects, so it is ok to
return 0 when the kvm_mmu_memory_cache meets the requirement.
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A VMEnter that VMFails (as opposed to VMExits) does not touch host
state beyond registers that are explicitly noted in the VMFail path,
e.g. EFLAGS. Host state does not need to be loaded because VMFail
is only signaled for consistency checks that occur before the CPU
starts to load guest state, i.e. there is no need to restore any
state as nothing has been modified. But in the case where a VMFail
is detected by hardware and not by KVM (due to deferring consistency
checks to hardware), KVM has already loaded some amount of guest
state. Luckily, "loaded" only means loaded to KVM's software model,
i.e. vmcs01 has not been modified. So, unwind our software model to
the pre-VMEntry host state.
Not restoring host state in this VMFail path leads to a variety of
failures because we end up with stale data in vcpu->arch, e.g. CR0,
CR4, EFER, etc... will all be out of sync relative to vmcs01. Any
significant delta in the stale data is all but guaranteed to crash
L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.
An alternative to this "soft" reload would be to load host state from
vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
wildly inconsistent with respect to the VMX architecture, e.g. an L1
VMM with separate VMExit and VMFail paths would explode.
Note that this approach does not mean KVM is 100% accurate with
respect to VMX hardware behavior, even at an architectural level
(the exact order of consistency checks is microarchitecture specific).
But 100% emulation accuracy isn't the goal (with this patch), rather
the goal is to be consistent in the information delivered to L1, e.g.
a VMExit should not fall-through VMENTER, and a VMFail should not jump
to HOST_RIP.
This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
context after VMLAUNCH/VMRESUME failure)", but retains the core
aspects of that patch, just in an open coded form due to the need to
pull state from vmcs01 instead of vmcs12. Restoring host state
resolves a variety of issues introduced by commit "4f350c6dbcb9
(kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
which remedied the incorrect behavior of treating VMFail like VMExit
but in doing so neglected to restore arch state that had been modified
prior to attempting nested VMEnter.
A sample failure that occurs due to stale vcpu.arch state is a fault
of some form while emulating an LGDT (due to emulated UMIP) from L1
after a failed VMEntry to L3, in this case when running the KVM unit
test test_tpr_threshold_values in L1. L0 also hits a WARN in this
case due to a stale arch.cr4.UMIP.
L1:
BUG: unable to handle kernel paging request at ffffc90000663b9e
PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
Oops: 0009 [#1] SMP
CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G W 4.18.0-rc2+ #2
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:native_load_gdt+0x0/0x10
...
Call Trace:
load_fixmap_gdt+0x22/0x30
__vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
vmx_handle_exit+0x246/0x15a0 [kvm_intel]
kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
do_vfs_ioctl+0x9f/0x600
ksys_ioctl+0x66/0x70
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x4f/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
L0:
WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
...
CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
Hardware name: Intel Corporation Kabylake Client platform/KBL S
RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]
...
Call Trace:
kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
do_vfs_ioctl+0x9f/0x5e0
ksys_ioctl+0x66/0x70
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x49/0xf0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 5af4157388 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
Fixes: 4f350c6dbc (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim KrÄmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to volume 3 of the SDM, bits 63:15 and 12:4 of the exit
qualification field for debug exceptions are reserved (cleared to
0). However, the SDM is incorrect about bit 16 (corresponding to
DR6.RTM). This bit should be set if a debug exception (#DB) or a
breakpoint exception (#BP) occurred inside an RTM region while
advanced debugging of RTM transactional regions was enabled. Note that
this is the opposite of DR6.RTM, which "indicates (when clear) that a
debug exception (#DB) or breakpoint exception (#BP) occurred inside an
RTM region while advanced debugging of RTM transactional regions was
enabled."
There is still an issue with stale DR6 bits potentially being
misreported for the current debug exception. DR6 should not have been
modified before vectoring the #DB exception, and the "new DR6 bits"
should be available somewhere, but it was and they aren't.
Fixes: b96fb43977 ("KVM: nVMX: fixes to nested virt interrupt injection")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In cloud environment, lapic_timer_advance_ns is needed to be tuned for every CPU
generations, and every host kernel versions(the kvm-unit-tests/tscdeadline_latency.flat
is 5700 cycles for upstream kernel and 9600 cycles for our 3.10 product kernel,
both preemption_timer=N, Skylake server).
This patch adds the capability to automatically tune lapic_timer_advance_ns
step by step, the initial value is 1000ns as 'commit d0659d946b ("KVM: x86:
add option to advance tscdeadline hrtimer expiration")' recommended, it will be
reduced when it is too early, and increased when it is too late. The guest_tsc
and tsc_deadline are hard to equal, so we assume we are done when the delta
is within a small scope e.g. 100 cycles. This patch reduces latency
(kvm-unit-tests/tscdeadline_latency, busy waits, preemption_timer enabled)
from ~2600 cyles to ~1200 cyles on our Skylake server.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If L1 uses VPID, it expects TLB to not be flushed on L1<->L2
transitions. However, code currently flushes TLB nonetheless if we
didn't allocate a vpid02 for L2. As in this case,
vmcs02->vpid == vmcs01->vpid == vmx->vpid.
But, if L1 uses EPT, TLB entires populated by L2 are tagged with EPTP02
while TLB entries populated by L1 are tagged with EPTP01.
Therefore, we can also avoid TLB flush if L1 uses VPID and EPT.
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
All VPID12s used on a given L1 vCPU is translated to a single
VPID02 (vmx->nested.vpid02 or vmx->vpid). Therefore, on L1->L2 VMEntry,
we need to invalidate linear and combined mappings tagged by
VPID02 in case L1 uses VPID and vmcs12->vpid was changed since
last L1->L2 VMEntry.
However, current code invalidates the wrong mappings as it calls
__vmx_flush_tlb() with invalidate_gpa parameter set to true which will
result in invalidating combined and guest-physical mappings tagged with
active EPTP which is EPTP01.
Similarly, INVVPID emulation have the exact same issue.
Fix both issues by just setting invalidate_gpa parameter to false which
will result in invalidating linear and combined mappings tagged with
given VPID02 as required.
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In case L0 didn't allocate vmx->nested.vpid02 for L2,
vmcs02->vpid is set to vmx->vpid.
Consider this case when emulating L1 INVVPID in L0.
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If L1 and L2 share VPID (because L1 don't use VPID or we haven't allocated
a vpid02), we need to flush TLB on L1<->L2 transitions.
Before this patch, this TLB flushing was done by vmx_flush_tlb().
If L0 use EPT, this will translate into INVEPT(active_eptp);
However, if L1 use EPT, in L1->L2 VMEntry, active EPTP is EPTP01 but
TLB entries populated by L2 are tagged with EPTP02.
Therefore we should delay vmx_flush_tlb() until active_eptp is EPTP02.
To achieve this, instead of directly calling vmx_flush_tlb() we request
it to be called by KVM_REQ_TLB_FLUSH which is evaluated after
KVM_REQ_LOAD_CR3 which sets the active_eptp to EPTP02 as required.
Similarly, on L2->L1 VMExit, active EPTP is EPTP02 but TLB entries
populated by L1 are tagged with EPTP01 and therefore we should delay
vmx_flush_tlb() until active_eptp is EPTP01.
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM_GUEST_CR0_MASK macro tracks CR0 bits that are forced to zero
by the VMX architecture, i.e. CR0.{NW,CD} must always be zero in the
hardware CR0 post-VMXON. Rename the macro to clarify its purpose,
be consistent with KVM_VM_CR0_ALWAYS_ON and avoid confusion with the
CR0_GUEST_HOST_MASK field.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit b5861e5cf2 introduced a check on
the interrupt-window and NMI-window CPU execution controls in order to
inject an external interrupt vmexit before the first guest instruction
executes. However, when APIC virtualization is enabled the host does not
need a vmexit in order to inject an interrupt at the next interrupt window;
instead, it just places the interrupt vector in RVI and the processor will
inject it as soon as possible. Therefore, on machines with APICv it is
not enough to check the CPU execution controls: the same scenario can also
happen if RVI>vPPR.
Fixes: b5861e5cf2
Reviewed-by: Nikita Leshchenko <nikita.leshchenko@oracle.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As of commit 8d860bbeed ("kvm: vmx: Basic APIC virtualization controls
have three settings"), KVM will disable VIRTUALIZE_APIC_ACCESSES when
a nested guest writes APIC_BASE MSR and kvm-intel.flexpriority=0,
whereas previously KVM would allow a nested guest to enable
VIRTUALIZE_APIC_ACCESSES so long as it's supported in hardware. That is,
KVM now advertises VIRTUALIZE_APIC_ACCESSES to a guest but doesn't
(always) allow setting it when kvm-intel.flexpriority=0, and may even
initially allow the control and then clear it when the nested guest
writes APIC_BASE MSR, which is decidedly odd even if it doesn't cause
functional issues.
Hide the control completely when the module parameter is cleared.
reported-by: Sean Christopherson <sean.j.christopherson@intel.com>
Fixes: 8d860bbeed ("kvm: vmx: Basic APIC virtualization controls have three settings")
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return early from vmx_set_virtual_apic_mode() if the processor doesn't
support VIRTUALIZE_APIC_ACCESSES or VIRTUALIZE_X2APIC_MODE, both of
which reside in SECONDARY_VM_EXEC_CONTROL. This eliminates warnings
due to VMWRITEs to SECONDARY_VM_EXEC_CONTROL (VMCS field 401e) failing
on processors without secondary exec controls.
Remove the similar check for TPR shadowing as it is incorporated in the
flexpriority_enabled check and the APIC-related code in
vmx_update_msr_bitmap() is further gated by VIRTUALIZE_X2APIC_MODE.
Reported-by: Gerhard Wiesinger <redhat@wiesinger.com>
Fixes: 8d860bbeed ("kvm: vmx: Basic APIC virtualization controls have three settings")
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
One defense against L1TF in KVM is to always set the upper five bits
of the *legal* physical address in the SPTEs for non-present and
reserved SPTEs, e.g. MMIO SPTEs. In the MMIO case, the GFN of the
MMIO SPTE may overlap with the upper five bits that are being usurped
to defend against L1TF. To preserve the GFN, the bits of the GFN that
overlap with the repurposed bits are shifted left into the reserved
bits, i.e. the GFN in the SPTE will be split into high and low parts.
When retrieving the GFN from the MMIO SPTE, e.g. to check for an MMIO
access, get_mmio_spte_gfn() unshifts the affected bits and restores
the original GFN for comparison. Unfortunately, get_mmio_spte_gfn()
neglects to mask off the reserved bits in the SPTE that were used to
store the upper chunk of the GFN. As a result, KVM fails to detect
MMIO accesses whose GPA overlaps the repurprosed bits, which in turn
causes guest panics and hangs.
Fix the bug by generating a mask that covers the lower chunk of the
GFN, i.e. the bits that aren't shifted by the L1TF mitigation. The
alternative approach would be to explicitly zero the five reserved
bits that are used to store the upper chunk of the GFN, but that
requires additional run-time computation and makes an already-ugly
bit of code even more inscrutable.
I considered adding a WARN_ON_ONCE(low_phys_bits-1 <= PAGE_SHIFT) to
warn if GENMASK_ULL() generated a nonsensical value, but that seemed
silly since that would mean a system that supports VMX has less than
18 bits of physical address space...
Reported-by: Sakari Ailus <sakari.ailus@iki.fi>
Fixes: d9b47449c1a1 ("kvm: x86: Set highest physical address bits in non-present/reserved SPTEs")
Cc: Junaid Shahid <junaids@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Junaid Shahid <junaids@google.com>
Tested-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>