The FPU is always active now when running KVM.
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick"
a VCPU out of KVM_RUN through a POSIX signal. A signal is attached
to a dummy signal handler; by blocking the signal outside KVM_RUN and
unblocking it inside, this possible race is closed:
     VCPU thread                        service thread
--------------------------------------------------------------
     check flag
                                        set flag
                                        raise signal
     (signal handler does nothing)
     KVM_RUN
However, one issue with KVM_SET_SIGNAL_MASK is that it has to take
tsk->sighand->siglock on every KVM_RUN. This lock is often on a
remote NUMA node, because it is on the node of a thread's creator.
Taking this lock can be very expensive if there are many userspace
exits (as is the case for SMP Windows VMs without Hyper-V reference
time counter).
As an alternative, we can put the flag directly in kvm_run so that
KVM can see it:
     VCPU thread                        service thread
--------------------------------------------------------------
                                        raise signal
     signal handler
       set run->immediate_exit
     KVM_RUN
       check run->immediate_exit
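The flow in the second diagram can be modeled in plain userspace C. This is
only a sketch under stated assumptions: struct fake_kvm_run,
install_kick_handler() and fake_kvm_run_ioctl() are hypothetical stand-ins
(not the KVM UAPI), and SIGUSR1 is an arbitrary choice of kick signal. It
shows that once the handler has set the flag, the next entry attempt bails
out with -EINTR without any signal mask manipulation.

```c
#include <errno.h>
#include <signal.h>

/* Hypothetical stand-in for the shared kvm_run page; only the flag is modeled. */
struct fake_kvm_run {
	volatile sig_atomic_t immediate_exit;
};

static struct fake_kvm_run run_page;

/* Signal handler on the VCPU thread: record the kick in the shared page. */
static void kick_handler(int sig)
{
	(void)sig;
	run_page.immediate_exit = 1;
}

/* Install the handler for the kick signal; returns 0 on success. */
static int install_kick_handler(void)
{
	return signal(SIGUSR1, kick_handler) == SIG_ERR ? -1 : 0;
}

/* Model of KVM_RUN entry: refuse to enter the guest if a kick was recorded. */
static int fake_kvm_run_ioctl(struct fake_kvm_run *run)
{
	if (run->immediate_exit)
		return -EINTR;
	/* ... the guest would run here ... */
	return 0;
}
```

In this model the caller is expected to clear immediate_exit itself before
it tries to enter the guest again.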
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Provide versions of struct gfn_to_hva_cache functions that
take vcpu as a parameter instead of struct kvm. The existing functions
are not needed anymore, so delete them. This allows dirty pages to
be logged in the vcpu dirty ring, instead of the global dirty ring,
for ring-based dirty memory tracking.
Signed-off-by: Lei Cao <lei.cao@stratus.com>
Message-Id: <CY1PR08MB19929BD2AC47A291FD680E83F04F0@CY1PR08MB1992.namprd08.prod.outlook.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The hashtable and guarding spinlock are global data structures, so
we can initialize them statically.
Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20170124212116.4568-1-david@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Nested_vmx_run is split into two parts: the part that handles the
VMLAUNCH/VMRESUME instruction, and the part that modifies the vcpu state
to transition from VMX root mode to VMX non-root mode. The latter will
be used when restoring the checkpointed state of a vCPU that was in VMX
operation when a snapshot was taken.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The checks performed on the contents of the vmcs12 are extracted from
nested_vmx_run so that they can be used to validate a vmcs12 that has
been restored from a checkpoint.
Signed-off-by: Jim Mattson <jmattson@google.com>
[Change prepare_vmcs02 and nested_vmx_load_cr3's last argument to u32,
to match check_vmentry_postreqs. Update comments for singlestep
handling. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Perform the checks on vmcs12 state early, but defer the gpa->hpa lookups
until after prepare_vmcs02. Later, when we restore the checkpointed
state of a vCPU in guest mode, we will not be able to do the gpa->hpa
lookups when the restore is done.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle_vmptrld is split into two parts: the part that handles the
VMPTRLD instruction, and the part that establishes the current VMCS
pointer. The latter will be used when restoring the checkpointed state
of a vCPU that had a valid VMCS pointer when a snapshot was taken.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Handle_vmon is split into two parts: the part that handles the VMXON
instruction, and the part that modifies the vcpu state to transition
from legacy mode to VMX operation. The latter will be used when
restoring the checkpointed state of a vCPU that was in VMX operation
when a snapshot was taken.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split prepare_vmcs12 into two parts: the part that stores the current L2
guest state and the part that sets up the exit information fields. The
former will be used when checkpointing the vCPU's VMX state.
Modify prepare_vmcs02 so that it can construct a vmcs02 midway through
L2 execution, using the checkpointed L2 guest state saved into the
cached vmcs12 above.
Signed-off-by: Jim Mattson <jmattson@google.com>
[Rebasing: add from_vmentry argument to prepare_vmcs02 instead of using
vmx->nested.nested_run_pending, because it is no longer 1 at the
point prepare_vmcs02 is called. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since bf9f6ac8d7 ("KVM: Update Posted-Interrupts Descriptor when vCPU
is blocked", 2015-09-18) the posted interrupt descriptor is checked
unconditionally for PIR.ON. Therefore we don't need KVM_REQ_EVENT to
trigger the scan and, if NMIs or SMIs are not involved, we can avoid
the complicated event injection path.
Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
there since APICv was introduced.
However, without the KVM_REQ_EVENT safety net KVM needs to be much
more careful about races between vmx_deliver_posted_interrupt and
vcpu_enter_guest. First, the IPI for posted interrupts may be issued
between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
If that happens, kvm_trigger_posted_interrupt returns true, but
smp_kvm_posted_intr_ipi doesn't do anything about it. The guest is
entered with PIR.ON, but the posted interrupt IPI has not been sent
and the interrupt is only delivered to the guest on the next vmentry
(if any). To fix this, disable interrupts before setting vcpu->mode.
This ensures that the IPI is delayed until the guest enters non-root mode;
it is then trapped by the processor causing the interrupt to be injected.
Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
and vcpu->mode = IN_GUEST_MODE. In this case, kvm_vcpu_kick is called
but it (correctly) doesn't do anything because it sees vcpu->mode ==
OUTSIDE_GUEST_MODE. Again, the guest is entered with PIR.ON but no
posted interrupt IPI is pending; this time, the fix for this is to move
the RVI update after IN_GUEST_MODE.
Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT,
though the second could actually happen with VT-d posted interrupts.
In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
in another vmentry which would inject the interrupt.
This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calls to apic_find_highest_irr are scanning IRR twice, once
in vmx_sync_pir_to_irr and once in apic_search_irr. Change
sync_pir_to_irr to get the new maximum IRR from kvm_apic_update_irr;
now that it does the computation, it can also do the RVI write.
In order to avoid complications in svm.c, make the callback optional.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vcpu_run calls kvm_vcpu_running, not kvm_arch_vcpu_runnable,
and the former does not call check_nested_events.
Once KVM_REQ_EVENT is removed from the APICv interrupt injection
path, however, this would leave no place to trigger a vmexit
from L2 to L1, causing a missed interrupt delivery while in guest
mode. This is caught by the "ack interrupt on exit" test in
vmx.flat.
[This does not change the calls to check_nested_events in
inject_pending_event. That is material for a separate cleanup.]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pending interrupts might be in the PI descriptor when the
LAPIC is restored from an external state; we do not want
them to be injected.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As in the SVM patch, the guest physical address is passed by
VMX to x86_emulate_instruction already, so mark the GPA as available
in vcpu->arch.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The newly added hypercall doesn't work on x86-32:
arch/x86/kvm/x86.c: In function 'kvm_pv_clock_pairing':
arch/x86/kvm/x86.c:6163:6: error: implicit declaration of function 'kvm_get_walltime_and_clockread'; did you mean 'kvm_get_time_scale'? [-Werror=implicit-function-declaration]
This adds an #ifdef around it, matching the one around the related
functions that are also only implemented on 64-bit systems.
Fixes: 55dd00a73a ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix rebase breakage from commit 55dd00a73a ("KVM: x86: add
KVM_HC_CLOCK_PAIRING hypercall", 2017-01-24), courtesy of the
"I could have sworn I had pushed the right branch" department.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a hypercall to retrieve the host realtime clock and the TSC value
used to calculate that clock read.
Used to implement clock synchronization between host and guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmx_complete_nested_posted_interrupt() can't fail, let's turn it into
a void function.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kmap() can't fail, therefore it will always return a valid pointer. Let's
just get rid of the unnecessary checks.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Saving unsupported state prevents migration when the new host does not
support a XSAVE feature of the original host, even if the feature is not
exposed to the guest.
We've masked host features with guest-visible features before, with
4344ee981e ("KVM: x86: only copy XSAVE state for the supported
features") and dropped it when implementing XSAVES. Do it again.
Fixes: df1daba7d1 ("KVM: x86: support XSAVES usage in the host")
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Now that most cputime readers use the transition API, which returns the
task cputime in old-style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back-and-forth conversions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before fast page fault restores an access track PTE back to a regular PTE,
it now also verifies that the restored PTE would grant the necessary
permissions for the faulting access to succeed. If not, it falls back
to the slow page fault path.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Redo the page table walk in fast_page_fault when retrying so that we are
working on the latest PTE even if the hierarchy changes.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reword the comment to hopefully make it more clear.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of the caller including the SPTE_SPECIAL_MASK in the masks being
supplied to kvm_mmu_set_mmio_spte_mask() and kvm_mmu_set_mask_ptes(),
those functions now themselves include the SPTE_SPECIAL_MASK.
Note that bit 63 is now reset in the default MMIO mask.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename the EPT_VIOLATION_READ/WRITE/INSTR constants to
EPT_VIOLATION_ACC_READ/WRITE/INSTR to more clearly indicate that these
signify the type of the memory access as opposed to the permissions
granted by the PTE.
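A minimal sketch of how the two groups of exit qualification bits can be
told apart. The bit positions follow the EPT violation exit qualification
layout in the Intel SDM; the names of the permission constants and both
helper functions are illustrative assumptions, not necessarily what the
kernel uses.

```c
#include <stdbool.h>
#include <stdint.h>

/* Access-type bits: what the guest access was trying to do. */
#define EPT_VIOLATION_ACC_READ   (1ULL << 0)
#define EPT_VIOLATION_ACC_WRITE  (1ULL << 1)
#define EPT_VIOLATION_ACC_INSTR  (1ULL << 2)

/* Permission bits: what the translation allowed (illustrative names). */
#define EPT_VIOLATION_READABLE   (1ULL << 3)
#define EPT_VIOLATION_WRITABLE   (1ULL << 4)
#define EPT_VIOLATION_EXECUTABLE (1ULL << 5)

/* True if the faulting access was a data write. */
static bool ept_violation_is_write(uint64_t exit_qual)
{
	return (exit_qual & EPT_VIOLATION_ACC_WRITE) != 0;
}

/* True if the translation granted at least one permission, i.e. was present. */
static bool ept_violation_pte_present(uint64_t exit_qual)
{
	return (exit_qual & (EPT_VIOLATION_READABLE |
			     EPT_VIOLATION_WRITABLE |
			     EPT_VIOLATION_EXECUTABLE)) != 0;
}
```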
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit bc6134942d.
A CPUID instruction executed in VMX non-root mode always causes a
VM-exit, regardless of the leaf being queried.
Fixes: bc6134942d ("KVM: nested VMX: disable perf cpuid reporting")
Signed-off-by: Jim Mattson <jmattson@google.com>
[The issue solved by bc6134942d has been resolved with ff651cb613
("KVM: nVMX: Add nested msr load/restore algorithm").]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Vector population count instructions for dwords and qwords are to be
used in future Intel Xeon & Xeon Phi processors. The bit 14 of
CPUID[level:0x07, ECX] indicates that the new instructions are
supported by a processor.
The spec can be found in the Intel Software Developer Manual (SDM)
or in the Instruction Set Extensions Programming Reference (ISE).
Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
emulator_fix_hypercall() replaces the hypercall with a vmcall instruction,
but it does not handle a GP exception properly when writing the new
instruction. It can return X86EMUL_PROPAGATE_FAULT without setting
exception information.
This leads to incorrect emulation and triggers
WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn()
as discovered by syzkaller fuzzer:
WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558
Call Trace:
warn_slowpath_null+0x2c/0x40 kernel/panic.c:582
x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572
x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618
emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline]
handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762
vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625
vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline]
vcpu_run arch/x86/kvm/x86.c:6947 [inline]
Set exception information when the write in emulator_fix_hypercall() fails.
Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: kvm@vger.kernel.org
Cc: syzkaller@googlegroups.com
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This is CVE-2017-2583. On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.
The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.
Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.
Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd
Cc: stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduces segmented_write_std.
Switches from emulated reads/writes to standard read/writes in fxsave,
fxrstor, sgdt, and sidt. This fixes CVE-2017-2584, a longstanding
kernel memory leak.
Since commit 283c95d0e3 ("KVM: x86: emulate FXSAVE and FXRSTOR",
2016-11-09), which is luckily not yet in any final release, this would
also be an exploitable kernel memory *write*!
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 96051572c8
Fixes: 283c95d0e3
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.
Use the new static_key_deferred_flush() API to flush pending updates on
module unload.
Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Checks on the operand to VMXON are performed after the check for
legacy mode operation and the #GP checks, according to the pseudo-code
in Intel's SDM.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
On interrupt delivery the PPR can only grow (except for auto-EOI),
so it is impossible that non-auto-EOI interrupt delivery results
in KVM_REQ_EVENT. We can therefore use __apic_update_ppr.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
On PPR update, we set KVM_REQ_EVENT unconditionally anytime PPR is lowered.
But we can take into account IRR here already.
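A rough model of the idea, assuming the usual APIC rule that PPR is TPR
when TPR's priority class is at least that of the highest in-service
vector, and the in-service class otherwise. The function names and the
convention that a negative highest_irr means "nothing pending" are
assumptions of this sketch, not the lapic.c interfaces.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compute PPR from TPR and the highest in-service vector (ISRV). */
static uint8_t compute_ppr(uint8_t tpr, uint8_t isrv)
{
	if ((tpr & 0xf0) >= (isrv & 0xf0))
		return tpr;
	return isrv & 0xf0;
}

/* A pending vector is deliverable only if its class is above PPR. */
static bool irr_deliverable(int highest_irr, uint8_t ppr)
{
	return highest_irr >= 0 && (highest_irr & 0xf0) > ppr;
}

/*
 * Request an event only if PPR was lowered AND the lowered PPR actually
 * unmasks the highest pending interrupt.
 */
static bool ppr_update_needs_event(uint8_t old_ppr, uint8_t new_ppr,
				   int highest_irr)
{
	return new_ppr < old_ppr && irr_deliverable(highest_irr, new_ppr);
}
```

With old PPR 0x50 and pending vector 0x31, lowering PPR to 0x40 needs no
event in this model, while lowering it to 0x20 does.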
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PPR needs to be updated on every IRR read because we
may have missed TPR writes that _increased_ PPR. However, these
writes need not generate KVM_REQ_EVENT, because either KVM_REQ_EVENT
has been set already in __apic_accept_irq, or we are going to
process the interrupt right away.
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since we're already in VCPU context, all we have to do here is recompute
the PPR value. That will in turn generate a KVM_REQ_EVENT if necessary.
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This statistic can be useful to estimate the cost of an IRQ injection
scenario, by comparing it with irq_injections. For example the stat
shows that sti;hlt triggers more KVM_REQ_EVENT than sti;nop.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a guest causes a NPF which requires emulation, KVM sometimes walks
the guest page tables to translate the GVA to a GPA. This is unnecessary
most of the time on AMD hardware since the hardware provides the GPA in
EXITINFO2.
The only exception cases involve string operations involving rep or
operations that use two memory locations. With rep, the GPA will only be
the value of the initial NPF and with dual memory locations we won't know
which memory address was translated into EXITINFO2.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
LAPIC after reset is in xAPIC mode, which poses a problem for hotplug of
VCPUs with high APIC ID, because reset VCPU is waiting for INIT/SIPI,
but there is no way to uniquely address it using xAPIC.
From many possible options, we chose the one that also works on real
hardware: accepting interrupts addressed to LAPIC's x2APIC ID even in
xAPIC mode.
KVM intentionally differs from real hardware, because real hardware
(Knights Landing) does just "x2apic_id & 0xff" to decide whether to
accept the interrupt in xAPIC mode and it can deliver one interrupt to
more than one physical destination, e.g. 0x123 to 0x123 and 0x23.
Fixes: 682f732ecf ("KVM: x86: bump MAX_VCPUS to 288")
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Slow path tried to prevent IPIs from x2APIC VCPUs from being delivered
to xAPIC VCPUs and vice-versa. Make slow path behave like fast path,
which never distinguished that.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There were three call sites:
- recalculate_apic_map and kvm_apic_match_physical_addr, where it would
only complicate implementation of x2APIC hotplug;
- in apic_debug, where it was still somewhat preserved, but keeping the
old function just for apic_debug was not worth it
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Interrupt to self can be sent without knowing the APIC ID.
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.
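The mark/restore round trip can be sketched as below. This is a simplified
model, not the MMU code: it assumes the low three bits are the
read/write/execute permissions, that bits 52-54 and bit 62 are otherwise
unused in the PTE, and it stashes all three permission bits, which may
differ from what the real implementation saves. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_RWX_MASK      0x7ULL        /* read/write/execute permissions */
#define SAVED_BITS_SHIFT  52            /* assumed spare-bit position */
#define SPECIAL_MASK      (1ULL << 62)  /* tags the PTE as special */

/* A tracked PTE is special and grants no permissions (not-present). */
static bool is_access_track_pte(uint64_t pte)
{
	return (pte & SPECIAL_MASK) && !(pte & PTE_RWX_MASK);
}

/* Make the PTE not-present, keeping the permissions in spare bits. */
static uint64_t mark_for_access_track(uint64_t pte)
{
	uint64_t saved = (pte & PTE_RWX_MASK) << SAVED_BITS_SHIFT;

	pte &= ~PTE_RWX_MASK;
	return pte | saved | SPECIAL_MASK;
}

/* Undo mark_for_access_track() when the page is touched again. */
static uint64_t restore_access_track(uint64_t pte)
{
	uint64_t saved = (pte >> SAVED_BITS_SHIFT) & PTE_RWX_MASK;

	pte &= ~(SPECIAL_MASK | (PTE_RWX_MASK << SAVED_BITS_SHIFT));
	return pte | saved;
}
```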
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
Since MMIO SPTEs use an EPT misconfiguration, using bit 63 for them is
acceptable. However, the upcoming fast access tracking feature adds another
type of special tracking PTE, which uses not-Present PTEs and hence should
not set bit 63.
In order to use common bits to distinguish both types of special PTEs, we
now use only bit 62 as the special bit.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_spte_update() tracks changes in the accessed/dirty state of
the SPTE being updated and calls kvm_set_pfn_accessed/dirty
appropriately. However, in some cases (e.g. when aging the SPTE),
this shouldn't be done. mmu_spte_update_no_track() is introduced
for use in such cases.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This simplifies mmu_spte_update() a little bit.
The checks for clearing of accessed and dirty bits are refactored into
separate functions, which are used inside both mmu_spte_update() and
mmu_spte_clear_track_bits(), as well as kvm_test_age_rmapp(). The new
helper functions handle both the case when A/D bits are supported in
hardware and the case when they are not.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change adds retries into the Fast Page Fault path. Without the
retries, the code still works, but if a retry does end up being needed,
then it will result in a second page fault for the same memory access,
which will cause much more overhead compared to just retrying within the
original fault.
This would be especially useful with the upcoming fast access tracking
change, as that would make it more likely for retries to be needed
(e.g. due to read and write faults happening on different CPUs at
the same time).
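The retry idea can be illustrated with a bounded compare-and-swap loop.
This is a userspace sketch using C11 atomics; PTE_WRITABLE, MAX_RETRIES and
fast_make_writable() are assumptions made for the example and do not
mirror the actual fast_page_fault() code.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define PTE_WRITABLE  (1ULL << 1)
#define MAX_RETRIES   4	/* arbitrary bound chosen for this sketch */

/*
 * Try to set the writable bit without taking a lock. If another CPU
 * changed the PTE between our read and the compare-and-swap, re-read
 * and try again instead of returning and taking a second fault.
 */
static bool fast_make_writable(_Atomic uint64_t *sptep)
{
	for (int i = 0; i < MAX_RETRIES; i++) {
		uint64_t old = atomic_load(sptep);
		uint64_t updated = old | PTE_WRITABLE;

		if (old & PTE_WRITABLE)
			return true;	/* already fixed by someone else */
		if (atomic_compare_exchange_strong(sptep, &old, updated))
			return true;	/* our update won */
		/* lost the race; loop and work on the latest value */
	}
	return false;	/* give up and let the slow path handle it */
}
```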
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change renames spte_is_locklessly_modifiable() to
spte_can_locklessly_be_made_writable() to distinguish it from other
forms of lockless modifications. The full set of lockless modifications
is covered by spte_has_volatile_bits().
Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This change adds some symbolic constants for VM Exit Qualifications
related to EPT Violations and updates handle_ept_violation() to use
these constants instead of hard-coded numbers.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs) becomes imbalanced, with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.
The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadowed. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.
hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):
hash            max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits     49847
hash_64         105
perfect         97
While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).
Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").
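The imbalance is easy to reproduce outside the kernel. The sketch below
feeds gfns that are multiples of 512 (the 2MB-aligned level=1 case) to
both hash functions and counts how many of the 1024 buckets get used.
hash_mul() is a userspace re-implementation of the multiplicative hash
with the 64-bit golden ratio constant, and the helper names are made up
for this example.

```c
#include <stdint.h>
#include <string.h>

#define HASH_BITS        10
#define HASH_BUCKETS     (1u << HASH_BITS)
#define GOLDEN_RATIO_64  0x61C8864680B583EBull

typedef unsigned (*hash_fn)(uint64_t gfn);

/* Old scheme: index with the low 10 bits of the gfn. */
static unsigned hash_low_bits(uint64_t gfn)
{
	return (unsigned)(gfn & (HASH_BUCKETS - 1));
}

/* Multiplicative hash: keep the top HASH_BITS bits of the product. */
static unsigned hash_mul(uint64_t gfn)
{
	return (unsigned)((gfn * GOLDEN_RATIO_64) >> (64 - HASH_BITS));
}

/* Count distinct buckets hit by n page tables with 2MB-aligned gfns. */
static unsigned buckets_used(hash_fn hash, unsigned n)
{
	unsigned char seen[HASH_BUCKETS];
	unsigned used = 0;

	memset(seen, 0, sizeof(seen));
	for (unsigned i = 0; i < n; i++) {
		unsigned b = hash((uint64_t)i * 512);

		if (!seen[b]) {
			seen[b] = 1;
			used++;
		}
	}
	return used;
}
```

With the low-bits hash every gfn lands in bucket 0 or bucket 512, so
buckets_used(hash_low_bits, n) never exceeds 2, while the multiplicative
hash spreads the same inputs over more buckets.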
Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Report the maximum number of mmu_page_hash collisions as a per-VM stat.
This will make it easy to identify problems with the mmu_page_hash in
the future.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The check in kvm_set_pic_irq() and kvm_set_ioapic_irq() was just a
temporary measure until the code improved enough for us to do this.
This changes the API in the case where KVM_SET_GSI_ROUTING is called to set up pic
and ioapic routes before KVM_CREATE_IRQCHIP. Those rules would get overwritten
by KVM_CREATE_IRQCHIP at best, so it is pointless to allow it. Userspaces
hopefully noticed that things don't work if they do that and don't do that.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We don't treat kvm->arch.vpic specially anymore, so the setup can look
like ioapic. This gets a bit more information out of return values.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
just complicated the code.
Add a separate state for the irqchip mode.
Reviewed-by: David Hildenbrand <david@redhat.com>
[Used Paolo's version of condition in irqchip_in_kernel().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Split irqchip cannot be created after creating the kernel irqchip, but
we forgot to restrict the other way. This is an API change.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MIPS: (both for stable)
- fix host kernel crashes when receiving a signal with 64-bit userspace
- flush instruction cache on all vcpus after generating entry code
x86:
- fix NULL dereference in MMU caused by SMM transitions (for stable)
- correct guest instruction pointer after emulating some VMX errors
- minor cleanup
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"MIPS:
- fix host kernel crashes when receiving a signal with 64-bit
userspace
- flush instruction cache on all vcpus after generating entry code
(both for stable)
x86:
- fix NULL dereference in MMU caused by SMM transitions (for stable)
- correct guest instruction pointer after emulating some VMX errors
- minor cleanup"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: VMX: remove duplicated declaration
KVM: MIPS: Flush KVM entry code from icache globally
KVM: MIPS: Don't clobber CP0_Status.UX
KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS
KVM: nVMX: fix instruction skipping during emulated vm-entry
The declaration of VMX_VPID_EXTENT_SUPPORTED_MASK occurs twice in the code.
This probably happened after an unsuccessful merge.
Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Pull timer type cleanups from Thomas Gleixner:
"This series does a tree wide cleanup of types related to
timers/timekeeping.
- Get rid of cycles_t and use a plain u64. The type is not really
helpful and caused more confusion than clarity
- Get rid of the ktime union. The union has become useless as we use
the scalar nanoseconds storage unconditionally now. The 32bit
timespec alike storage got removed due to the Y2038 limitations
some time ago.
That leaves the odd union access around for no reason. Clean it up.
Both changes have been done with coccinelle and a small amount of
manual mopping up"
* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
ktime: Get rid of ktime_equal()
ktime: Cleanup ktime_set() usage
ktime: Get rid of the union
clocksource: Use a plain u64 instead of cycle_t
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.
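Why the assignment is equivalent can be seen from a small model of the
scalar representation. ktime_ns and ktime_set_model() are stand-ins
written for this example; the model ignores any overflow clamping the real
helper may do.

```c
#include <stdint.h>

typedef int64_t ktime_ns;	/* stand-in for the scalar nanosecond type */

#define NSEC_PER_SEC 1000000000LL

/* Model of ktime_set(S, N): combine seconds and nanoseconds. */
static ktime_ns ktime_set_model(int64_t secs, unsigned long nsecs)
{
	return secs * NSEC_PER_SEC + (int64_t)nsecs;
}
```

For a zero seconds argument the result is just the nanoseconds value, so
"t = ktime_set_model(0, 500);" and "t = 500;" store the same thing.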
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
There is no point in having an extra type for extra confusion. u64 is
unambiguous.
Conversion was done with the following coccinelle script:
@rem@
@@
-typedef u64 cycle_t;
@fix@
typedef cycle_t;
@@
-cycle_t
+u64
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
When the state names got added a script was used to add the extra argument
to the calls. The script basically converted the state constant to a
string, but the cleanup to convert these strings into meaningful ones did
not happen.
Replace all the useless strings with 'subsys/xxx/yyy:state' strings which
are used in all the other places already.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Otherwise, mismatch between the smm bit in hflags and the MMU role
can cause a NULL pointer dereference.
Cc: stable@vger.kernel.org
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_skip_emulated_instruction() should not be called after emulating
a VM-entry failure during or after loading guest state
(nested_vmx_entry_failure()). Otherwise the L1 hypervisor is resumed
some number of bytes past vmcs->host_rip.
Fixes: eb27756217
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When L2 exits to L0 due to "exception or NMI", software exceptions
(#BP and #OF) for which L1 has requested an intercept should be
handled by L1 rather than L0. Previously, only hardware exceptions
were forwarded to L1.
Signed-off-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_memslots() will be called by kvm_write_guest_offset_cached() so
take the srcu lock.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_steal_time_set_preempted() does not disable page faults before
calling __copy_to_user(), and the kernel's debug checks notice.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Introduce a new mutex to avoid an AB-BA deadlock between kvm->lock and
vcpu->mutex. Protect accesses in kvm_hv_setup_tsc_page too, as suggested
by Roman.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the operand passed to VMPTRLD matches the address of the VMXON
region, the VMX instruction error code should be
VMXERR_VMPTRLD_VMXON_POINTER rather than VMXERR_VMCLEAR_VMXON_POINTER.
Signed-off-by: GanShun <ganshun@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86: userspace can now hide nested VMX features from guests; nested
VMX can now run Hyper-V in a guest; support for AVX512_4VNNIW and
AVX512_FMAPS in KVM; infrastructure support for virtual Intel GPUs.
PPC: support for KVM guests on POWER9; improved support for interrupt
polling; optimizations and cleanups.
s390: two small optimizations, more stuff is in flight and will be
in 4.11.
ARM: support for the GICv3 ITS on 32bit platforms.
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Small release, the most interesting stuff is x86 nested virt
improvements.
x86:
- userspace can now hide nested VMX features from guests
- nested VMX can now run Hyper-V in a guest
- support for AVX512_4VNNIW and AVX512_FMAPS in KVM
- infrastructure support for virtual Intel GPUs.
PPC:
- support for KVM guests on POWER9
- improved support for interrupt polling
- optimizations and cleanups.
s390:
- two small optimizations, more stuff is in flight and will be in
4.11.
ARM:
- support for the GICv3 ITS on 32bit platforms"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest
KVM: arm/arm64: timer: Check for properly initialized timer on init
KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs
KVM: x86: Handle the kthread worker using the new API
KVM: nVMX: invvpid handling improvements
KVM: nVMX: check host CR3 on vmentry and vmexit
KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
KVM: nVMX: propagate errors from prepare_vmcs02
KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
KVM: nVMX: support restore of VMX capability MSRs
KVM: nVMX: generate non-true VMX MSRs based on true versions
KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
KVM: x86: Add kvm_skip_emulated_instruction and use it.
KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
KVM: VMX: Reorder some skip_emulated_instruction calls
KVM: x86: Add a return value to kvm_emulate_cpuid
KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
...
Merge tag 'drm-for-v4.10' of git://people.freedesktop.org/~airlied/linux
Pull drm updates from Dave Airlie:
"This is the main pull request for drm for 4.10 kernel.
New drivers:
- ZTE VOU display driver (zxdrm)
- Amlogic Meson Graphic Controller GXBB/GXL/GXM SoCs (meson)
- MXSFB support (mxsfb)
Core:
- Format handling has been reworked
- Better atomic state debugging
- drm_mm leak debugging
- Atomic explicit fencing support
- fbdev helper ops
- Documentation updates
- MST fbcon fixes
Bridge:
- Silicon Image SiI8620 driver
Panel:
- Add support for new simple panels
i915:
- GVT Device model
- Better HDMI2.0 support on skylake
- More watermark fixes
- GPU idling rework for suspend/resume
- DP Audio workarounds
- Scheduler prep-work
- Opregion CADL handling
- GPU scheduler and priority boosting
amdgfx/radeon:
- Support for virtual devices
- New VM manager for non-contig VRAM buffers
- UVD powergating
- SI register header cleanup
- Cursor fixes
- Power management fixes
nouveau:
- Power management reworks for better voltage/clock changes
- Atomic modesetting support
- DisplayPort Multistream (MST) support.
- GP102/104 hang and cursor fixes
- GP106 support
hisilicon:
- hibmc support (BMC chip for aarch64 servers)
armada:
- add tracing support for overlay change
- refactor plane support
- de-midlayer the driver
omapdrm:
- Timing code cleanups
rcar-du:
- R8A7792/R8A7796 support
- Misc fixes.
sunxi:
- A31 SoC display engine support
imx-drm:
- YUV format support
- Cleanup plane atomic update
mali-dp:
- Misc fixes
dw-hdmi:
- Add support for HDMI i2c master controller
tegra:
- IOMMU support fixes
- Error handling fixes
tda998x:
- Fix connector registration
- Improved robustness
- Fix infoframe/audio compliance
virtio:
- fix busid issues
- allocate more vbufs
qxl:
- misc fixes and cleanups.
vc4:
- Fragment shader threading
- ETC1 support
- VEC (tv-out) support
msm:
- A5XX GPU support
- Lots of atomic changes
tilcdc:
- Misc fixes and cleanups.
etnaviv:
- Fix dma-buf export path
- DRAW_INSTANCED support
- fix driver on i.MX6SX
exynos:
- HDMI refactoring
fsl-dcu:
- fbdev changes"
* tag 'drm-for-v4.10' of git://people.freedesktop.org/~airlied/linux: (1343 commits)
drm/nouveau/kms/nv50: fix atomic regression on original G80
drm/nouveau/bl: Do not register interface if Apple GMUX detected
drm/nouveau/bl: Assign different names to interfaces
drm/nouveau/bios/dp: fix handling of LevelEntryTableIndex on DP table 4.2
drm/nouveau/ltc: protect clearing of comptags with mutex
drm/nouveau/gr/gf100-: handle GPC/TPC/MPC trap
drm/nouveau/core: recognise GP106 chipset
drm/nouveau/ttm: wait for bo fence to signal before unmapping vmas
drm/nouveau/gr/gf100-: FECS intr handling is not relevant on proprietary ucode
drm/nouveau/gr/gf100-: properly ack all FECS error interrupts
drm/nouveau/fifo/gf100-: recover from host mmu faults
drm: Add fake controlD* symlinks for backwards compat
drm/vc4: Don't use drm_put_dev
drm/vc4: Document VEC DT binding
drm/vc4: Add support for the VEC (Video Encoder) IP
drm: Add TV connector states to drm_connector_state
drm: Turn DRM_MODE_SUBCONNECTOR_xx definitions into an enum
drm/vc4: Fix ->clock_select setting for the VEC encoder
drm/amdgpu/dce6: Set MASTER_UPDATE_MODE to 0 in resume_mc_access as well
drm/amdgpu: use pin rather than pin_restricted in a few cases
...
Pull x86 FPU updates from Ingo Molnar:
"The main changes in this cycle were:
- do a large round of simplifications after all CPUs do 'eager' FPU
context switching in v4.9: remove CR0 twiddling, remove leftover
eager/lazy bits, etc (Andy Lutomirski)
- more FPU code simplifications: remove struct fpu::counter, clarify
nomenclature, remove unnecessary arguments/functions and better
structure the code (Rik van Riel)"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Remove clts()
x86/fpu: Remove stts()
x86/fpu: Handle #NM without FPU emulation as an error
x86/fpu, lguest: Remove CR0.TS support
x86/fpu, kvm: Remove host CR0.TS manipulation
x86/fpu: Remove irq_ts_save() and irq_ts_restore()
x86/fpu: Stop saving and restoring CR0.TS in fpu__init_check_bugs()
x86/fpu: Get rid of two redundant clts() calls
x86/fpu: Finish excising 'eagerfpu'
x86/fpu: Split old_fpu & new_fpu handling into separate functions
x86/fpu: Remove 'cpu' argument from __cpu_invalidate_fpregs_state()
x86/fpu: Split old & new FPU code paths
x86/fpu: Remove __fpregs_(de)activate()
x86/fpu: Rename lazy restore functions to "register state valid"
x86/fpu, kvm: Remove KVM vcpu->fpu_counter
x86/fpu: Remove struct fpu::counter
x86/fpu: Remove use_eager_fpu()
x86/fpu: Remove the XFEATURE_MASK_EAGER/LAZY distinction
x86/fpu: Hard-disable lazy FPU mode
x86/crypto, x86/fpu: Remove X86_FEATURE_EAGER_FPU #ifdef from the crc32c code
Use the new API to create and destroy the "kvm-pit" kthread
worker. The API hides some implementation details.
In particular, kthread_create_worker() allocates and initializes
struct kthread_worker. It runs the kthread the right way
and stores task_struct into the worker structure.
kthread_destroy_worker() flushes all pending work items, stops
the kthread and frees the structure.
This patch does not change the existing behavior except for
dynamically allocating struct kthread_worker and storing
only the pointer of this structure.
It is compile-tested only because I did not find an easy
way to run the code. It should be pretty safe, though,
given the nature of the change.
Signed-off-by: Petr Mladek <pmladek@suse.com>
Message-Id: <1476877847-11217-1-git-send-email-pmladek@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Expose all invalidation types to the L1
- Reject invvpid instruction, if L1 passed zero vpid value to single
context invalidations
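A minimal sketch of the operand check described by the two points above.
The helper name and the supported_types bitmap are assumptions made for
this example and are not taken from the patch; the type numbering follows
the Intel SDM:

```c
#include <stdbool.h>
#include <stdint.h>

/* INVVPID types as numbered in the Intel SDM. */
enum invvpid_type {
	INVVPID_INDIVIDUAL_ADDR   = 0,
	INVVPID_SINGLE_CONTEXT    = 1,
	INVVPID_ALL_CONTEXT       = 2,
	INVVPID_SINGLE_NON_GLOBAL = 3,
};

/*
 * Hypothetical helper: returns true if an emulated INVVPID with this
 * type and VPID should be accepted. supported_types has bit N set when
 * type N is exposed to L1.
 */
static bool invvpid_operand_ok(uint32_t supported_types, unsigned int type,
			       uint16_t vpid)
{
	if (type > INVVPID_SINGLE_NON_GLOBAL ||
	    !(supported_types & (1u << type)))
		return false;

	/* Every type except all-context names one VPID, which must be non-zero. */
	if (type != INVVPID_ALL_CONTEXT && vpid == 0)
		return false;

	return true;
}
```

With all four types exposed (supported_types = 0xf), a single-context
invalidation with VPID 0 is rejected, while an all-context invalidation
ignores the VPID operand.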
Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This commit adds missing host CR3 checks. Before entering guest mode, the value
of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
called to set the new CR3 value and check and load PDPTRs.
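The reserved-bit check can be sketched as follows. This is an
illustration only: the helper name is hypothetical, and it only models
the rule that CR3 bits at or above the guest's physical address width
(MAXPHYADDR) must be zero:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if no CR3 bit at or above maxphyaddr is set.
 * maxphyaddr is the physical address width reported by the guest's CPUID.
 */
static bool cr3_reserved_bits_clear(uint64_t cr3, unsigned int maxphyaddr)
{
	uint64_t invalid_mask;

	/* Shifting a 64-bit value by 64 or more is undefined in C. */
	if (maxphyaddr >= 64)
		return true;

	invalid_mask = ~0ULL << maxphyaddr;
	return (cr3 & invalid_mask) == 0;
}
```

For a 36-bit physical address width, 0x12345000 passes and a CR3 with
bit 40 set fails.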
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.
* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary
This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
It is possible that prepare_vmcs02 fails to load the guest state. This
patch adds the proper error handling for such a case. L1 will receive
an INVALID_STATE vmexit with the appropriate exit qualification if it
happens.
A failure to set guest CR3 is the only error propagated from prepare_vmcs02
at the moment.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
(see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
related issues:
1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
overwritten in ept_load_pdptrs because the registers are believed to have
been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
running with PAE enabled but PDPTRs have been set up by L1.
2) When L2 is about to enable paging and loads its CR3, we, again, attempt
to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
that this will succeed (it's just a CR3 load, paging is not enabled yet) and
if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
then lost and L2 crashes right after it enables paging.
This patch replaces the kvm_set_cr3 call with a simple register write if PAE
and EPT are both on. CR3 is not to be interpreted in this case.
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
is more faithful to VMCS12.
This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
"IA-32e mode guest" would silently be disabled by KVM.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
MSR_IA32_CR{0,4}_FIXED1 define which bits in CR0 and CR4 are allowed to
be 1 during VMX operation. Since the set of allowed-1 bits is the same
in and out of VMX operation, we can generate these MSRs entirely from
the guest's CPUID. This lets userspace avoid having to save/restore
these MSRs.
This patch also initializes MSR_IA32_CR{0,4}_FIXED1 from the CPU's MSRs
by default. This is saner than the current default of -1ull, which
includes bits that the host CPU does not support.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
all CR0 and CR4 bits are allowed to be 1 during VMX operation.
This does not match real hardware, which disallows the high 32 bits of
CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
which are defined in the SDM but missing according to CPUID). A guest
can induce a VM-entry failure by setting these bits in GUEST_CR0 and
GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
valid.
Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
checks on these registers do not verify must-be-0 bits. Fix these checks
to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.
This patch should introduce no change in behavior in KVM, since these
MSRs are still -1ULL.
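A minimal userspace sketch of a combined must-be-1/must-be-0 check,
assuming the usual FIXED0/FIXED1 convention (a bit set in FIXED0 must
be 1, a bit clear in FIXED1 must be 0). The helper name and the sample
values in the note below are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * val is acceptable if every bit set in fixed0 is set in val and every
 * bit clear in fixed1 is clear in val. Masking with fixed1 drops
 * disallowed 1 bits and OR-ing fixed0 adds required 1 bits, so any
 * violation changes the value.
 */
static bool fixed_bits_valid(uint64_t val, uint64_t fixed0, uint64_t fixed1)
{
	return ((val & fixed1) | fixed0) == val;
}
```

With fixed1 = -1ULL, a CR0 value with bit 32 set is accepted, which is
the mismatch with real hardware described above. With fixed1 =
0xffffffff the same value is rejected.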
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The VMX capability MSRs advertise the set of features the KVM virtual
CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to address both sources of
differences, allowing VMs to be migrated across CPUs and KVM versions
without guest-visible changes to these MSRs. Note that cross-KVM-
version migration is only supported from this point forward.
When the VMX capability MSRs are restored, they are audited to check
that the set of features advertised is a subset of what KVM and the
CPU support.
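The audit amounts to a bitwise subset test. A minimal sketch, where the
helper name and the mask parameter (selecting which bits of the MSR are
feature bits) are assumptions made for this example:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if, within mask, every bit set in subset is also set in
 * superset. Here superset would be what KVM and the CPU support and
 * subset would be the value userspace is trying to restore.
 */
static bool is_bitwise_subset(uint64_t superset, uint64_t subset,
			      uint64_t mask)
{
	superset &= mask;
	subset &= mask;

	return (superset | subset) == superset;
}
```

A restore would be refused when the test fails, since the guest would
otherwise see a feature bit that cannot be backed.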
Since the VMX capability MSRs are read-only, they do not need to be on
the default MSR save/restore lists. The userspace hypervisor can set
the values of these MSRs or read them from KVM at VCPU creation time,
and restore the same value after every save/restore.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The "non-true" VMX capability MSRs can be generated from their "true"
counterparts, by OR-ing the default1 bits. The default1 bits are fixed
and defined in the SDM.
Since we can generate the non-true VMX MSRs from the true versions,
there's no need to store both in struct nested_vmx. This also lets
userspace avoid having to restore the non-true MSRs.
Note this does not preclude emulating MSR_IA32_VMX_BASIC[55]=0. To do so,
we simply need to set all the default1 bits in the true MSRs (such that
the true MSRs and the generated non-true MSRs are equal).
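A minimal sketch of that derivation, assuming the usual control MSR
layout (allowed-0 settings in bits 31:0, allowed-1 settings in bits
63:32). The helper names are hypothetical, and 0x16 (bits 1, 2 and 4) is
used as the pin-based default1 value purely for illustration:

```c
#include <stdint.h>

/* Illustrative default1 bits for the pin-based controls: bits 1, 2, 4. */
#define PIN_BASED_DEFAULT1 0x00000016u

/* Pack the allowed-0 (low) and allowed-1 (high) halves into one MSR value. */
static uint64_t vmx_control_msr(uint32_t low, uint32_t high)
{
	return low | ((uint64_t)high << 32);
}

/* Derive the non-true MSR by forcing the default1 bits to must-be-1. */
static uint64_t non_true_ctls_msr(uint64_t true_msr, uint32_t default1)
{
	return true_msr | default1;
}
```

If the true MSR already has every default1 bit set in its low half, the
derived value equals the true value, which is the MSR_IA32_VMX_BASIC[55]=0
case mentioned above.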
Signed-off-by: David Matlack <dmatlack@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.
Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:
- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
take precedence. The singlestep will be triggered again on the next
instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
which do not trigger singlestep traps as mentioned previously.
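The skip-then-check sequence can be modelled in userspace as below. This
is a toy sketch, not the KVM implementation: struct toy_vcpu and all of
its fields are hypothetical stand-ins, and the return convention (1 to
keep running the guest, 0 to exit to userspace) is assumed for the
example:

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_TF (1u << 8)	/* trap flag */

/* Toy stand-in for a VCPU; every field here is hypothetical. */
struct toy_vcpu {
	uint64_t rip;
	uint32_t rflags;
	bool userspace_singlestep;	/* models KVM_GUESTDBG_SINGLESTEP */
	bool db_trap_pending;		/* models a #DB queued for the guest */
};

/* Returns 1 to keep running the guest, 0 to exit to userspace. */
static int toy_skip_emulated_instruction(struct toy_vcpu *vcpu,
					 unsigned int insn_len)
{
	uint32_t rflags = vcpu->rflags;	/* sample TF before skipping */

	vcpu->rip += insn_len;		/* skip the emulated instruction */

	if (!(rflags & X86_EFLAGS_TF))
		return 1;		/* no single-stepping in effect */
	if (vcpu->userspace_singlestep)
		return 0;		/* debugger in userspace wants the trap */

	vcpu->db_trap_pending = true;	/* guest is stepping itself */
	return 1;
}
```

The exceptions listed above are the callers that would keep using the
plain skip without the single-step check.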
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
We can't return both the pass/fail boolean for the vmcs and the upcoming
continue/exit-to-userspace boolean for skip_emulated_instruction out of
nested_vmx_check_vmcs12, so move skip_emulated_instruction out of it instead.
Additionally, VMLAUNCH/VMRESUME only trigger singlestep exceptions when
they advance the IP to the following instruction, not when they a) succeed,
b) fail MSR validation or c) throw an exception. Add a separate call to
skip_emulated_instruction that will later not be converted to the variant
that checks the singlestep flag.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The functions being moved ahead of skip_emulated_instruction here don't
need updated IPs, and skipping the emulated instruction at the end will
make it easier to return its value.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Once skipping the emulated instruction can potentially trigger an exit to
userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
propagate a return value.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Backmerge tag 'v4.9-rc8' into drm-next
Linux 4.9-rc8
Daniel requested this so we could apply some follow-on fixes cleanly to -next.