- support -y option for merge_config.sh to avoid downgrading =y to =m
- remove S_OTHER symbol type, and touch include/config/*.h files correctly
- fix file name and line number in lexer warnings
- fix memory leak when EOF is encountered in quotation
- resolve all shift/reduce conflicts of the parser
- warn no new line at end of file
- make 'source' statement more strict to take only string literal
- rewrite the lexer and remove the keyword lookup table
- convert to SPDX License Identifier
- compile C files independently instead of including them from zconf.y
- fix various warnings of gconfig
- misc cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJcJieuAAoJED2LAQed4NsGHlIP/1s0fQ86XD9dIMyHzAO0gh2f
7rylfe2kEXJgIzJ0DyZdLu4iZtwbkEUqTQrRS1abriNGVemPkfBAnZdM5d92lOQX
3iREa700AJ2xo7V7gYZ6AbhZoG3p0S9U9Q2qE5S+tFTe8c2Gy4xtjnODF+Vel85r
S0P8tF5sE1/d00lm+yfMI/CJVfDjyNaMm+aVEnL0kZTPiRkaktjWgo6Fc2p4z1L5
HFmMMP6/iaXmRZ+tHJGPQ2AT70GFVZw5ePxPcl50EotUP25KHbuUdzs8wDpYm3U/
rcESVsIFpgqHWmTsdBk6dZk0q8yFZNkMlkaP/aYukVZpUn/N6oAXgTFckYl8dmQL
fQBkQi6DTfr9EBPVbj18BKm7xI3Y4DdQ2fzTfYkJ2XwNRGFA5r9N3sjd7ZTVGjxC
aeeMHCwvGdSx1x8PeZAhZfsUHW8xVDMSQiT713+ljBY+6cwzA+2NF0kP7B6OAqwr
ETFzd4Xu2/lZcL7gQRH8WU3L2S5iedmDG6RnZgJMXI0/9V4qAA+nlsWaCgnl1TgA
mpxYlLUMrd6AUJevE34FlnyFdk8IMn9iKRFsvF0f3doO5C7QzTVGqFdJu5a0CuWO
4NBJvZjFT8/4amoWLfnDlfApWXzTfwLbKG+r6V2F30fLuXpYg5LxWhBoGRPYLZSq
oi4xN1Mpx3TvXz6WcKVZ
=r3Fl
-----END PGP SIGNATURE-----
Merge tag 'kconfig-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kconfig updates from Masahiro Yamada:
- support -y option for merge_config.sh to avoid downgrading =y to =m
- remove S_OTHER symbol type, and touch include/config/*.h files correctly
- fix file name and line number in lexer warnings
- fix memory leak when EOF is encountered in quotation
- resolve all shift/reduce conflicts of the parser
- warn no new line at end of file
- make 'source' statement more strict to take only string literal
- rewrite the lexer and remove the keyword lookup table
- convert to SPDX License Identifier
- compile C files independently instead of including them from zconf.y
- fix various warnings of gconfig
- misc cleanups
* tag 'kconfig-v4.21' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (39 commits)
kconfig: surround dbg_sym_flags with #ifdef DEBUG to fix gconf warning
kconfig: split images.c out of qconf.cc/gconf.c to fix gconf warnings
kconfig: add static qualifiers to fix gconf warnings
kconfig: split the lexer out of zconf.y
kconfig: split some C files out of zconf.y
kconfig: convert to SPDX License Identifier
kconfig: remove keyword lookup table entirely
kconfig: update current_pos in the second lexer
kconfig: switch to ASSIGN_VAL state in the second lexer
kconfig: stop associating kconf_id with yylval
kconfig: refactor end token rules
kconfig: stop supporting '.' and '/' in unquoted words
treewide: surround Kconfig file paths with double quotes
microblaze: surround string default in Kconfig with double quotes
kconfig: use T_WORD instead of T_VARIABLE for variables
kconfig: use specific tokens instead of T_ASSIGN for assignments
kconfig: refactor scanning and parsing "option" properties
kconfig: use distinct tokens for type and default properties
kconfig: remove redundant token defines
kconfig: rename depends_list to comment_option_list
...
single-stepping fixes, improved tracing, various timer and vGIC
fixes
* x86: Processor Tracing virtualization, STIBP support, some correctness fixes,
refactorings and splitting of vmx.c, use the Hyper-V range TLB flush hypercall,
reduce order of vcpu struct, WBNOINVD support, do not use -ftrace for __noclone
functions, nested guest support for PAUSE filtering on AMD, more Hyper-V
enlightenments (direct mode for synthetic timers)
* PPC: nested VFIO
* s390: bugfixes only this time
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJcH0vFAAoJEL/70l94x66Dw/wH/2FZp1YOM5OgiJzgqnXyDbyf
dNEfWo472MtNiLsuf+ZAfJojVIu9cv7wtBfXNzW+75XZDfh/J88geHWNSiZDm3Fe
aM4MOnGG0yF3hQrRQyEHe4IFhGFNERax8Ccv+OL44md9CjYrIrsGkRD08qwb+gNh
P8T/3wJEKwUcVHA/1VHEIM8MlirxNENc78p6JKd/C7zb0emjGavdIpWFUMr3SNfs
CemabhJUuwOYtwjRInyx1y34FzYwW3Ejuc9a9UoZ+COahUfkuxHE8u+EQS7vLVF6
2VGVu5SA0PqgmLlGhHthxLqVgQYo+dB22cRnsLtXlUChtVAq8q9uu5sKzvqEzuE=
=b4Jx
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"ARM:
- selftests improvements
- large PUD support for HugeTLB
- single-stepping fixes
- improved tracing
- various timer and vGIC fixes
x86:
- Processor Tracing virtualization
- STIBP support
- some correctness fixes
- refactorings and splitting of vmx.c
- use the Hyper-V range TLB flush hypercall
- reduce order of vcpu struct
- WBNOINVD support
- do not use -ftrace for __noclone functions
- nested guest support for PAUSE filtering on AMD
- more Hyper-V enlightenments (direct mode for synthetic timers)
PPC:
- nested VFIO
s390:
- bugfixes only this time"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (171 commits)
KVM: x86: Add CPUID support for new instruction WBNOINVD
kvm: selftests: ucall: fix exit mmio address guessing
Revert "compiler-gcc: disable -ftracer for __noclone functions"
KVM: VMX: Move VM-Enter + VM-Exit handling to non-inline sub-routines
KVM: VMX: Explicitly reference RCX as the vmx_vcpu pointer in asm blobs
KVM: x86: Use jmp to invoke kvm_spurious_fault() from .fixup
MAINTAINERS: Add arch/x86/kvm sub-directories to existing KVM/x86 entry
KVM/x86: Use SVM assembly instruction mnemonics instead of .byte streams
KVM/MMU: Flush tlb directly in the kvm_zap_gfn_range()
KVM/MMU: Flush tlb directly in kvm_set_pte_rmapp()
KVM/MMU: Move tlb flush in kvm_set_pte_rmapp() to kvm_mmu_notifier_change_pte()
KVM: Make kvm_set_spte_hva() return int
KVM: Replace old tlb flush function with new one to flush a specified range.
KVM/MMU: Add tlb flush with range helper function
KVM/VMX: Add hv tlb range flush support
x86/hyper-v: Add HvFlushGuestAddressList hypercall support
KVM: Add tlb_remote_flush_with_range callback in kvm_x86_ops
KVM: x86: Disable Intel PT when VMXON in L1 guest
KVM: x86: Set intercept for Intel PT MSRs read/write
KVM: x86: Implement Intel PT MSRs read/write emulation
...
The Kconfig lexer supports special characters such as '.' and '/' in
the parameter context. In my understanding, the reason is just to
support bare file paths in the source statement.
I do not see a good reason to complicate Kconfig for the room of
ambiguity.
The majority of code already surrounds file paths with double quotes,
and it makes sense since file paths are constant string literals.
Make it treewide consistent now.
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Acked-by: Wolfram Sang <wsa@the-dreams.de>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Ingo Molnar <mingo@kernel.org>
Transitioning to/from a VMX guest requires KVM to manually save/load
the bulk of CPU state that the guest is allowed to direclty access,
e.g. XSAVE state, CR2, GPRs, etc... For obvious reasons, loading the
guest's GPR snapshot prior to VM-Enter and saving the snapshot after
VM-Exit is done via handcoded assembly. The assembly blob is written
as inline asm so that it can easily access KVM-defined structs that
are used to hold guest state, e.g. moving the blob to a standalone
assembly file would require generating defines for struct offsets.
The other relevant aspect of VMX transitions in KVM is the handling of
VM-Exits. KVM doesn't employ a separate VM-Exit handler per se, but
rather treats the VMX transition as a mega instruction (with many side
effects), i.e. sets the VMCS.HOST_RIP to a label immediately following
VMLAUNCH/VMRESUME. The label is then exposed to C code via a global
variable definition in the inline assembly.
Because of the global variable, KVM takes steps to (attempt to) ensure
only a single instance of the owning C function, e.g. vmx_vcpu_run, is
generated by the compiler. The earliest approach placed the inline
assembly in a separate noinline function[1]. Later, the assembly was
folded back into vmx_vcpu_run() and tagged with __noclone[2][3], which
is still used today.
After moving to __noclone, an edge case was encountered where GCC's
-ftracer optimization resulted in the inline assembly blob being
duplicated. This was "fixed" by explicitly disabling -ftracer in the
__noclone definition[4].
Recently, it was found that disabling -ftracer causes build warnings
for unsuspecting users of __noclone[5], and more importantly for KVM,
prevents the compiler for properly optimizing vmx_vcpu_run()[6]. And
perhaps most importantly of all, it was pointed out that there is no
way to prevent duplication of a function with 100% reliability[7],
i.e. more edge cases may be encountered in the future.
So to summarize, the only way to prevent the compiler from duplicating
the global variable definition is to move the variable out of inline
assembly, which has been suggested several times over[1][7][8].
Resolve the aforementioned issues by moving the VMLAUNCH+VRESUME and
VM-Exit "handler" to standalone assembly sub-routines. Moving only
the core VMX transition codes allows the struct indexing to remain as
inline assembly and also allows the sub-routines to be used by
nested_vmx_check_vmentry_hw(). Reusing the sub-routines has a happy
side-effect of eliminating two VMWRITEs in the nested_early_check path
as there is no longer a need to dynamically change VMCS.HOST_RIP.
Note that callers to vmx_vmenter() must account for the CALL modifying
RSP, e.g. must subtract op-size from RSP when synchronizing RSP with
VMCS.HOST_RSP and "restore" RSP prior to the CALL. There are no great
alternatives to fudging RSP. Saving RSP in vmx_enter() is difficult
because doing so requires a second register (VMWRITE does not provide
an immediate encoding for the VMCS field and KVM supports Hyper-V's
memory-based eVMCS ABI). The other more drastic alternative would be
to use eschew VMCS.HOST_RSP and manually save/load RSP using a per-cpu
variable (which can be encoded as e.g. gs:[imm]). But because a valid
stack is needed at the time of VM-Exit (NMIs aren't blocked and a user
could theoretically insert INT3/INT1ICEBRK at the VM-Exit handler), a
dedicated per-cpu VM-Exit stack would be required. A dedicated stack
isn't difficult to implement, but it would require at least one page
per CPU and knowledge of the stack in the dumpstack routines. And in
most cases there is essentially zero overhead in dynamically updating
VMCS.HOST_RSP, e.g. the VMWRITE can be avoided for all but the first
VMLAUNCH unless nested_early_check=1, which is not a fast path. In
other words, avoiding the VMCS.HOST_RSP by using a dedicated stack
would only make the code marginally less ugly while requiring at least
one page per CPU and forcing the kernel to be aware (and approve) of
the VM-Exit stack shenanigans.
[1] cea15c24ca39 ("KVM: Move KVM context switch into own function")
[2] a3b5ba49a8 ("KVM: VMX: add the __noclone attribute to vmx_vcpu_run")
[3] 104f226bfd ("KVM: VMX: Fold __vmx_vcpu_run() into vmx_vcpu_run()")
[4] 95272c2937 ("compiler-gcc: disable -ftracer for __noclone functions")
[5] https://lkml.kernel.org/r/20181218140105.ajuiglkpvstt3qxs@treble
[6] https://patchwork.kernel.org/patch/8707981/#21817015
[7] https://lkml.kernel.org/r/ri6y38lo23g.fsf@suse.cz
[8] https://lkml.kernel.org/r/20181218212042.GE25620@tassilo.jf.intel.com
Suggested-by: Andi Kleen <ak@linux.intel.com>
Suggested-by: Martin Jambor <mjambor@suse.cz>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Nadav Amit <namit@vmware.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Martin Jambor <mjambor@suse.cz>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use '%% " _ASM_CX"' instead of '%0' to dereference RCX, i.e. the
'struct vcpu_vmx' pointer, in the VM-Enter asm blobs of vmx_vcpu_run()
and nested_vmx_check_vmentry_hw(). Using the symbolic name means that
adding/removing an output parameter(s) requires "rewriting" almost all
of the asm blob, which makes it nearly impossible to understand what's
being changed in even the most minor patches.
Opportunistically improve the code comments.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Recently the minimum required version of binutils was changed to 2.20,
which supports all SVM instruction mnemonics. The patch removes
all .byte #defines and uses real instruction mnemonics instead.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Originally, flush tlb is done by slot_handle_level_range(). This patch
moves the flush directly to kvm_zap_gfn_range() when range flush is
available, so that only the requested range can be flushed.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to flush tlb directly in kvm_set_pte_rmapp()
function when Hyper-V remote TLB flush is available, returning 0
so that kvm_mmu_notifier_change_pte() does not flush again.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to move tlb flush in kvm_set_pte_rmapp() to
kvm_mmu_notifier_change_pte() in order to avoid redundant tlb flush.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The patch is to make kvm_set_spte_hva() return int and caller can
check return value to determine flush tlb or not.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: Paul Mackerras <paulus@ozlabs.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to replace kvm_flush_remote_tlbs() with kvm_flush_
remote_tlbs_with_address() in some functions without logic change.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to add wrapper functions for tlb_remote_flush_with_range
callback and flush tlb directly in kvm_mmu_zap_collapsible_spte().
kvm_mmu_zap_collapsible_spte() returns flush request to the
slot_handle_leaf() and the latter does flush on demand. When
range flush is available, make kvm_mmu_zap_collapsible_spte()
to flush tlb with range directly to avoid returning range back
to slot_handle_leaf().
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to register tlb_remote_flush_with_range callback with
hv tlb range flush interface.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, Intel Processor Trace do not support tracing in L1 guest
VMX operation(IA32_VMX_MISC[bit 14] is 0). As mentioned in SDM,
on these type of processors, execution of the VMXON instruction will
clears IA32_RTIT_CTL.TraceEn and any attempt to write IA32_RTIT_CTL
causes a general-protection exception (#GP).
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To save performance overhead, disable intercept Intel PT MSRs
read/write when Intel PT is enabled in guest.
MSR_IA32_RTIT_CTL is an exception that will always be intercepted.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch implement Intel Processor Trace MSRs read/write
emulation.
Intel PT MSRs read/write need to be emulated when Intel PT
MSRs is intercepted in guest and during live migration.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize the Intel PT configuration when cpuid update.
Include cpuid inforamtion, rtit_ctl bit mask and the number of
address ranges.
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Load/Store Intel Processor Trace register in context switch.
MSR IA32_RTIT_CTL is loaded/stored automatically from VMCS.
In Host-Guest mode, we need load/resore PT MSRs only when PT
is enabled in guest.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose Intel Processor Trace to guest only when
the PT works in Host-Guest mode.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Intel Processor Trace virtualization can be work in one
of 2 possible modes:
a. System-Wide mode (default):
When the host configures Intel PT to collect trace packets
of the entire system, it can leave the relevant VMX controls
clear to allow VMX-specific packets to provide information
across VMX transitions.
KVM guest will not aware this feature in this mode and both
host and KVM guest trace will output to host buffer.
b. Host-Guest mode:
Host can configure trace-packet generation while in
VMX non-root operation for guests and root operation
for native executing normally.
Intel PT will be exposed to KVM guest in this mode, and
the trace output to respective buffer of host and guest.
In this mode, tht status of PT will be saved and disabled
before VM-entry and restored after VM-exit if trace
a virtual machine.
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
[Preserved the iff and a probably intentional weird bracket notation.
Also dropped the style change to make a single-purpose patch. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Although well-intentioned, keeping the KF() definition as a hint for
handling scattered CPUID features may be counter-productive. Simply
redefining the bit position only works for directly manipulating the
guest's CPUID leafs, e.g. it doesn't make guest_cpuid_has() magically
work. Taking an alternative approach, e.g. ensuring the bit position
is identical between the Linux-defined and hardware-defined features,
may be a simpler and/or more effective method of exposing scattered
features to the guest.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Let the guest read the IA32_TSC MSR with the generic RDMSR instruction
as well as the specific RDTSC(P) instructions. Note that the hardware
applies the TSC multiplier and offset (when applicable) to the result of
RDMSR(IA32_TSC), just as it does to the result of RDTSC(P).
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
According to the SDM, "NMI-window exiting" VM-exits wake a logical
processor from the same inactive states as would an NMI and
"interrupt-window exiting" VM-exits wake a logical processor from the
same inactive states as would an external interrupt. Specifically, they
wake a logical processor from the shutdown state and from the states
entered using the HLT and MWAIT instructions.
Fixes: 6dfacadd58 ("KVM: nVMX: Add support for activity state HLT")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com>
[Squashed comments of two Jim's patches and used the simplified code
hunk provided by Sean. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Currently, the nested guest's PAUSE intercept intentions are not being
honored. Instead, since the L0 hypervisor's pause_filter_count and
pause_filter_thresh values are still in place, these values are used
instead of those programmed in the VMCB by the L1 hypervisor.
To honor the desired PAUSE intercept support of the L1 hypervisor, the L0
hypervisor must use the PAUSE filtering fields of the L1 hypervisor. This
requires saving and restoring of both the L0 and L1 hypervisor's PAUSE
filtering fields.
Signed-off-by: William Tambe <william.tambe@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
AMD doesn't seem to implement MSR_IA32_MCG_EXT_CTL and svm code in kvm
knows nothing about it, however, this MSR is among emulated_msrs and
thus returned with KVM_GET_MSR_INDEX_LIST. The consequent KVM_GET_MSRS,
of course, fails.
Report the MSR as unsupported to not confuse userspace.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
The memory allocation in b666a4b697 ("kvm: x86: Dynamically allocate
guest_fpu", 2018-11-06) is wrong, there are other members in struct fpu
before the fpregs_state union and the patch should be doing something
similar to the code in fpu__init_task_struct_size. It's enough to run
a guest and then rmmod kvm to see slub errors which are actually caused
by memory corruption.
For now let's revert it to sizeof(struct fpu), which is conservative.
I have plans to move fsave/fxsave/xsave directly in KVM, without using
the kernel FPU helpers, and once it's done, the size of the object in
the cache will be something like kvm_xstate_size.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Recent optimizations in MMU code broke nested SVM with NPT in L1
completely: when we do nested_svm_{,un}init_mmu_context() we want
to switch from TDP MMU to shadow MMU, both init_kvm_tdp_mmu() and
kvm_init_shadow_mmu() check if re-configuration is needed by looking
at cache source data. The data, however, doesn't change - it's only
the type of the MMU which changes. We end up not re-initializing
guest MMU as shadow and everything goes off the rails.
The issue could have been fixed by putting MMU type into extended MMU
role but this is not really needed. We can just split root and guest MMUs
the exact same way we did for nVMX, their types never change in the
lifetime of a vCPU.
There is still room for improvement: currently, we reset all MMU roots
when switching from L1 to L2 and back and this is not needed.
Fixes: 7dcd575520 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some guests OSes (including Windows 10) write to MSR 0xc001102c
on some cases (possibly while trying to apply a CPU errata).
Make KVM ignore reads and writes to that MSR, so the guest won't
crash.
The MSR is documented as "Execution Unit Configuration (EX_CFG)",
at AMD's "BIOS and Kernel Developer's Guide (BKDG) for AMD Family
15h Models 00h-0Fh Processors".
Cc: stable@vger.kernel.org
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported by syzkaller:
CPU: 1 PID: 5962 Comm: syz-executor118 Not tainted 4.20.0-rc6+ #374
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:kvm_apic_hw_enabled arch/x86/kvm/lapic.h:169 [inline]
RIP: 0010:vcpu_scan_ioapic arch/x86/kvm/x86.c:7449 [inline]
RIP: 0010:vcpu_enter_guest arch/x86/kvm/x86.c:7602 [inline]
RIP: 0010:vcpu_run arch/x86/kvm/x86.c:7874 [inline]
RIP: 0010:kvm_arch_vcpu_ioctl_run+0x5296/0x7320 arch/x86/kvm/x86.c:8074
Call Trace:
kvm_vcpu_ioctl+0x5c8/0x1150 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2596
vfs_ioctl fs/ioctl.c:46 [inline]
file_ioctl fs/ioctl.c:509 [inline]
do_vfs_ioctl+0x1de/0x1790 fs/ioctl.c:696
ksys_ioctl+0xa9/0xd0 fs/ioctl.c:713
__do_sys_ioctl fs/ioctl.c:720 [inline]
__se_sys_ioctl fs/ioctl.c:718 [inline]
__x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
The reason is that the testcase writes hyperv synic HV_X64_MSR_SINT14 msr
and triggers scan ioapic logic to load synic vectors into EOI exit bitmap.
However, irqchip is not initialized by this simple testcase, ioapic/apic
objects should not be accessed.
This patch fixes it by also considering whether or not apic is present.
Reported-by: syzbot+39810e6c400efadfef71@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
nested_get_vmcs12_pages() processes the posted_intr address in vmcs12. It
caches the kmap()ed page object and pointer, however, it doesn't handle
errors correctly: it's possible to cache a valid pointer, then release
the page and later dereference the dangling pointer.
I was able to reproduce with the following steps:
1. Call vmlaunch with valid posted_intr_desc_addr but an invalid
MSR_EFER. This causes nested_get_vmcs12_pages() to cache the kmap()ed
pi_desc_page and pi_desc. Later the invalid EFER value fails
check_vmentry_postreqs() which fails the first vmlaunch.
2. Call vmlanuch with a valid EFER but an invalid posted_intr_desc_addr
(I set it to 2G - 0x80). The second time we call nested_get_vmcs12_pages
pi_desc_page is unmapped and released and pi_desc_page is set to NULL
(the "shouldn't happen" clause). Due to the invalid
posted_intr_desc_addr, kvm_vcpu_gpa_to_page() fails and
nested_get_vmcs12_pages() returns. It doesn't return an error value so
vmlaunch proceeds. Note that at this time we have a dangling pointer in
vmx->nested.pi_desc and POSTED_INTR_DESC_ADDR in L0's vmcs.
3. Issue an IPI in L2 guest code. This triggers a call to
vmx_complete_nested_posted_interrupt() and pi_test_and_clear_on() which
dereferences the dangling pointer.
Vulnerable code requires nested and enable_apicv variables to be set to
true. The host CPU must also support posted interrupts.
Fixes: 5e2f30b756 "KVM: nVMX: get rid of nested_get_page()"
Cc: stable@vger.kernel.org
Reviewed-by: Andy Honig <ahonig@google.com>
Signed-off-by: Cfir Cohen <cfir@google.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Previously, the guest_fpu field was embedded in the kvm_vcpu_arch
struct. Unfortunately, the field is quite large, (e.g., 4352 bytes on my
current setup). This bloats the kvm_vcpu_arch struct for x86 into an
order 3 memory allocation, which can become a problem on overcommitted
machines. Thus, this patch moves the fpu state outside of the
kvm_vcpu_arch struct.
With this patch applied, the kvm_vcpu_arch struct is reduced to 15168
bytes for vmx on my setup when building the kernel with kvmconfig.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Previously, x86's instantiation of 'struct kvm_vcpu_arch' added an fpu
field to save/restore fpu-related architectural state, which will differ
from kvm's fpu state. However, this is redundant to the 'struct fpu'
field, called fpu, embedded in the task struct, via the thread field.
Thus, this patch removes the user_fpu field from the kvm_vcpu_arch
struct and replaces it with the task struct's fpu field.
This change is significant because the fpu struct is actually quite
large. For example, on the system used to develop this patch, this
change reduces the size of the vcpu_vmx struct from 23680 bytes down to
19520 bytes, when building the kernel with kvmconfig. This reduction in
the size of the vcpu_vmx struct moves us closer to being able to
allocate the struct at order 2, rather than order 3.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
.. to improve readability and maintainability, and to align the code as per
the layout of the checks in chapter "VM Entries" in Intel SDM vol 3C.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
.. to improve readability and maintainability, and to align the code as per
the layout of the checks in chapter "VM Entries" in Intel SDM vol 3C.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
.. to improve readability and maintainability, and to align the code as per
the layout of the checks in chapter "VM Entries" in Intel SDM vol 3C.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
.. to improve readability and maintainability, and to align the code as per
the layout of the checks in chapter "VM Entries" in Intel SDM vol 3C.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Passing the enum and doing an indirect lookup is silly when we can
simply pass the field directly. Remove the "fast path" code in
nested_vmx_check_msr_switch_controls() as it's now nothing more than a
redundant check.
Remove the debug message rather than continue passing the enum for the
address field. Having debug messages for the MSRs themselves is useful
as MSR legality is a huge space, whereas messing up a physical address
means the VMM is fundamentally broken.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
.. to improve readability and maintainability, and to align the code as per
the layout of the checks in chapter "VM Entries" in Intel SDM vol 3C.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
.. as they are used only in nested vmx context.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Mark Kanda <mark.kanda@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to initialize ept_pointer to INVALID_PAGE and check it
before flushing ept tlb. If ept_pointer is invalid, bypass the flush
request.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "VM-entry Failures During or After Loading Guest State"
in Intel SDM vol 3C,
"No MSRs are saved into the VM-exit MSR-store area."
when bit 31 of the exit reason is set.
Reported-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Suggested-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If userspace has provided a different value for this MSR (e.g with the
turbo bits set), the userspace-provided value should survive a vCPU
reset. For backwards compatibility, MSR_PLATFORM_INFO is initialized
in kvm_arch_vcpu_setup.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Drew Schmitt <dasch@google.com>
Cc: Abhiroop Dabral <adabral@paloaltonetworks.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds Intel "Xeon CPU E3-1220 V2", with CPUID.01H.EAX=0x000306A8,
into the list of known broken CPUs which fail to support VMX preemption
timer. This bug was found while running the APIC timer test of
kvm-unit-test on this specific CPU, even though the errata info can't be
located in the public domain for this CPU.
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Months ago, we have added code to allow direct access to MSR_IA32_SPEC_CTRL
to the guest, which makes STIBP available to guests. This was implemented
by commits d28b387fb7 ("KVM/VMX: Allow direct access to
MSR_IA32_SPEC_CTRL") and b2ac58f905 ("KVM/SVM: Allow direct access to
MSR_IA32_SPEC_CTRL").
However, we never updated GET_SUPPORTED_CPUID to let userspace know that
STIBP can be enabled in CPUID. Fix that by updating
kvm_cpuid_8000_0008_ebx_x86_features and kvm_cpuid_7_0_edx_x86_features.
Signed-off-by: Eduardo Habkost <ehabkost@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Turns out we over-engineered Direct Mode for stimers a bit: unlike
traditional stimers where we may want to try to re-inject the message upon
EOI, Direct Mode stimers just set the irq in APIC and kvm_apic_set_irq()
fails only when APIC is disabled (see APIC_DM_FIXED case in
__apic_accept_irq()). Remove the redundant part.
Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
stimers_pending optimization only helps us to avoid multiple
kvm_make_request() calls. This doesn't happen very often and these
calls are very cheap in the first place, remove open-coded version of
stimer_mark_pending() from kvm_hv_notify_acked_sint().
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Turns out Hyper-V on KVM (as of 2016) will only use synthetic timers
if direct mode is available. With direct mode we notify the guest by
asserting APIC irq instead of sending a SynIC message.
The implementation uses existing vec_bitmap for letting lapic code
know that we're interested in the particular IRQ's EOI request. We assume
that the same APIC irq won't be used by the guest for both direct mode
stimer and as sint source (especially with AutoEOI semantics). It is
unclear how things should be handled if that's not true.
Direct mode is also somewhat less expensive; in my testing
stimer_send_msg() takes not less than 1500 cpu cycles and
stimer_notify_direct() can usually be done in 300-400. WS2016 without
Hyper-V, however, always sticks to non-direct version.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As a preparation to implementing Direct Mode for Hyper-V synthetic
timers switch to using stimer config definition from hyperv-tlfs.h.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
With every new Hyper-V Enlightenment we implement we're forced to add a
KVM_CAP_HYPERV_* capability. While this approach works it is fairly
inconvenient: the majority of the enlightenments we do have corresponding
CPUID feature bit(s) and userspace has to know this anyways to be able to
expose the feature to the guest.
Add KVM_GET_SUPPORTED_HV_CPUID ioctl (backed by KVM_CAP_HYPERV_CPUID, "one
cap to rule them all!") returning all Hyper-V CPUID feature leaves.
Using the existing KVM_GET_SUPPORTED_CPUID doesn't seem to be possible:
Hyper-V CPUID feature leaves intersect with KVM's (e.g. 0x40000000,
0x40000001) and we would probably confuse userspace in case we decide to
return these twice.
KVM_CAP_HYPERV_CPUID's number is interim: we're intended to drop
KVM_CAP_HYPERV_STIMER_DIRECT and use its number instead.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The upcoming KVM_GET_SUPPORTED_HV_CPUID ioctl will need to return
Enlightened VMCS version in HYPERV_CPUID_NESTED_FEATURES.EAX when
it was enabled.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
hyperv-tlfs.h is a bit messy: CPUID feature bits are not always sorted,
it's hard to get which CPUID they belong to, some items are duplicated
(e.g. HV_X64_MSR_CRASH_CTL_NOTIFY/HV_CRASH_CTL_CRASH_NOTIFY).
Do some housekeeping work. While on it, replace all (1 << X) with BIT(X)
macro.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The SynIC message delivery protocol allows the message originator to
request, should the message slot be busy, to be notified when it's free.
However, this is unnecessary and even undesirable for messages generated
by SynIC timers in periodic mode: if the period is short enough compared
to the time the guest spends in the timer interrupt handler, so the
timer ticks start piling up, the excessive interactions due to this
notification and retried message delivery only makes the things worse.
[This was observed, in particular, with Windows L2 guests setting
(temporarily) the periodic timer to 2 kHz, and spending hundreds of
microseconds in the timer interrupt handler due to several L2->L1 exits;
under some load in L0 this could exceed 500 us so the timer ticks
started to pile up and the guest livelocked.]
Relieve the situation somewhat by not retrying message delivery for
periodic SynIC timers. This appears to remain within the "lazy" lost
ticks policy for SynIC timers as implemented in KVM.
Note that it doesn't solve the fundamental problem of livelocking the
guest with a periodic timer whose period is smaller than the time needed
to process a tick, but it makes it a bit less likely to be triggered.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SynIC message delivery is somewhat overengineered: it pretends to follow
the ordering rules when grabbing the message slot, using atomic
operations and all that, but does it incorrectly and unnecessarily.
The correct order would be to first set .msg_pending, then atomically
replace .message_type if it was zero, and then clear .msg_pending if
the previous step was successful. But this all is done in vcpu context
so the whole update looks atomic to the guest (it's assumed to only
access the message page from this cpu), and therefore can be done in
whatever order is most convenient (and is also the reason why the
incorrect order didn't trigger any bugs so far).
While at this, also switch to kvm_vcpu_{read,write}_guest_page, and drop
the no longer needed synic_clear_sint_msg_pending.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In the previous code, the variable apic_sw_disabled influences
recalculate_apic_map. But in "KVM: x86: simplify kvm_apic_map"
(commit: 3b5a5ffa92),
the access to apic_sw_disabled in recalculate_apic_map has been
deleted.
Signed-off-by: Peng Hao <peng.hao2@zte.com.cn>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Like IA32_STAR, IA32_LSTAR and IA32_FMASK only need to contain guest
values on VM-entry when the guest is in long mode and EFER.SCE is set.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SYSCALL raises #UD in compatibility mode on Intel CPUs, so it's
pointless to load the guest's IA32_CSTAR value into the hardware MSR.
IA32_CSTAR still provides 48 bits of storage on Intel CPUs that have
CPUID.80000001:EDX.LM[bit 29] set, so we cannot remove it from the
vmx_msr_index[] array.
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a comment explaining why MSR_STAR must be included in
vmx_msr_index[] even for i386 builds.
The elided comment has not been relevant since move_msr_up() was
introduced in commit a75beee6e4 ("KVM: VMX: Avoid saving and
restoring msrs on lightweight vmexit").
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
RDTSCP is supported in legacy mode as well as long mode. The
IA32_TSC_AUX MSR should be set to the correct guest value before
entering any guest that supports RDTSCP.
Fixes: 4e47c7a6d7 ("KVM: VMX: Add instruction rdtscp support for guest")
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Marc Orr <marcorr@google.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
From a functional perspective, this is (supposed to be) a straight
copy-paste of code. Code was moved piecemeal to nested.c as not all
code that could/should be moved was obviously nested-only. The nested
code was then re-ordered as needed to compile, i.e. stats may not show
this is being a "pure" move despite there not being any intended changes
in functionality.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Exposing only the function allows @nested, i.e. the module param, to be
statically defined in vmx.c, ensuring we aren't unnecessarily checking
said variable in the nested code. nested_vmx_allowed() is exposed due
to the need to verify nested support in vmx_{get,set}_nested_state().
The downside is that nested_vmx_allowed() likely won't be inlined in
vmx_{get,set}_nested_state(), but that should be a non-issue as they're
not a hot path. Keeping vmx_{get,set}_nested_state() in vmx.c isn't a
viable option as they need access to several nested-only functions.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...as they're used directly by the nested code. This will allow
moving the bulk of the nested code out of vmx.c without concurrent
changes to vmx.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Exposed vmx_msr_index, vmx_return and host_efer via vmx.h so that the
nested code can be moved out of vmx.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that the function doesn't need to be created when moving the
nested code out of vmx.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that it doesn't need access to @nested. The only case where the
provided struct isn't already zeroed is the call from vmx_create_vcpu()
as setup_vmcs_config() zeroes the struct in the other use cases. This
will allow @nested to be statically defined in vmx.c, i.e. this removes
the last direct reference from nested code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...in nested-specific code so that they can eventually be moved out of
vmx.c, e.g. into nested.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that future patches can reference e.g. @kvm_vmx_exit_handlers
without having to simultaneously move a big chunk of code. Speaking
from experience, resolving merge conflicts is an absolute nightmare
without pre-moving the code.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...so that it can conditionally set by the VMX code, i.e. iff @nested is
true. This will in turn allow it to be moved out of vmx.c and into a
nested-specified file.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Eventually this will allow us to move the nested VMX code out of vmx.c.
Note that this also effectively wraps @enable_shadow_vmcs with @nested
so that it too can be moved out of vmx.c.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX has a few hundred lines of code just to wrap various VMX specific
instructions, e.g. VMWREAD, INVVPID, etc... Move them to a dedicated
header so it's easier to find/isolate the boilerplate.
With this change, more inlines can be moved from vmx.c to vmx.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The header, evmcs.h, already exists and contains a fair amount of code,
but there are a few pieces in vmx.c that can be moved verbatim. In
addition, move an array definition to evmcs.c to prepare for multiple
consumers of evmcs.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vmcs12 is the KVM-defined struct used to track a nested VMCS, e.g. a
VMCS created by L1 for L2.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This isn't intended to be a pure reflection of hardware, e.g. struct
loaded_vmcs and struct vmcs_host_state are KVM-defined constructs.
Similar to capabilities.h, this is a standalone file to avoid circular
dependencies between yet-to-be-created vmx.h and nested.h files.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose the variables associated with various module params that are
needed by the nested VMX code. There is no ulterior logic for what
variables are/aren't exposed, this is purely "what's needed by the
nested code".
Note that @nested is intentionally not exposed.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Defining a separate capabilities.h as opposed to putting this code in
e.g. vmx.h avoids circular dependencies between (the yet-to-be-added)
vmx.h and nested.h. The aforementioned circular dependencies are why
struct nested_vmx_msrs also resides in capabilities instead of e.g.
nested.h.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...instead of referencing the global struct. This will allow moving
setup_vmcs_config() to a separate file that may not have access to
the global variable. Modify nested_vmx_setup_ctls_msrs() appropriately
since vmx_capability.ept may not be accurate when called by
vmx_check_processor_compat().
No functional change intended.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
EFER and PERF_GLOBAL_CTRL MSRs have dedicated VM Entry/Exit controls
that KVM dynamically toggles based on whether or not the guest's value
for each MSRs differs from the host. Handle the dynamic behavior by
adding a helper that clears the dynamic bits so the bits aren't set
when initializing the VMCS field outside of the dynamic toggling flow.
This makes the handling consistent with similar behavior for other
controls, e.g. pin, exec and sec_exec. More importantly, it eliminates
two global bools that are stealthily modified by setup_vmcs_config.
Opportunistically clean up a comment and print related to errata for
IA32_PERF_GLOBAL_CTRL.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR_IA32_XSS has no relation to the VMCS whatsoever, it doesn't belong
in setup_vmcs_config() and its reference to host_xss prevents moving
setup_vmcs_config() to a dedicated file.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX specific files now reside in a dedicated subdirectory, i.e. the
file name prefix is redundant.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VMX specific files now reside in a dedicated subdirectory. Drop the
"vmx" prefix, which is redundant, and add a "vmcs" prefix to clarify
that the file is referring to VMCS shadow fields.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for shattering vmx.c into multiple files without having
to prepend "vmx_" to all new files.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Until this point vmx.c has been the only consumer and included the
file after many others. Prepare for multiple consumers, i.e. the
shattering of vmx.c
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Until this point vmx.c has been the only consumer and included the
file after many others. Prepare for multiple consumers, i.e. the
shattering of vmx.c
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...to prepare for the creation of a "vmx" subdirectory that will contain
a variety of headers. Clean things up now to avoid making a bigger mess
in the future.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
...and make enable_shadow_vmcs depend on nested. Aside from the obvious
memory savings, this will allow moving the relevant code out of vmx.c in
the future, e.g. to a nested specific file.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 34a1cd60d1 ("kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are two problems with KVM_GET_DIRTY_LOG. First, and less important,
it can take kvm->mmu_lock for an extended period of time. Second, its user
can actually see many false positives in some cases. The latter is due
to a benign race like this:
1. KVM_GET_DIRTY_LOG returns a set of dirty pages and write protects
them.
2. The guest modifies the pages, causing them to be marked ditry.
3. Userspace actually copies the pages.
4. KVM_GET_DIRTY_LOG returns those pages as dirty again, even though
they were not written to since (3).
This is especially a problem for large guests, where the time between
(1) and (3) can be substantial. This patch introduces a new
capability which, when enabled, makes KVM_GET_DIRTY_LOG not
write-protect the pages it returns. Instead, userspace has to
explicitly clear the dirty log bits just before using the content
of the page. The new KVM_CLEAR_DIRTY_LOG ioctl can also operate on a
64-page granularity rather than requiring to sync a full memslot;
this way, the mmu_lock is taken for small amounts of time, and
only a small amount of time will pass between write protection
of pages and the sending of their content.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When manual dirty log reprotect will be enabled, kvm_get_dirty_log_protect's
pointer argument will always be false on exit, because no TLB flush is needed
until the manual re-protection operation. Rename it from "is_dirty" to "flush",
which more accurately tells the caller what they have to do with it.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Go over arch/x86/ and fix common typos in comments,
and a typo in an actual function argument name.
No change in functionality intended.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is never correct for a VMX instruction to fail with "invalid VMCS"
if there is, in fact, a current VMCS. Reads from unbacked addresses
return all 1's, which means that an unbacked VMCS will not have the
correct VMCS revision ID (i.e. VMCS12_REVISION).
Fixes: 63846663ea ("KVM: nVMX: Implement VMPTRLD")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to section "Checks on VMX Controls" in Intel SDM vol 3C, the
following check needs to be enforced on vmentry of L2 guests:
If the "activate VMX-preemption timer" VM-execution control is 0, the
the "save VMX-preemption timer value" VM-exit control must also be 0.
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no need to jump just after the jump insn itself. Also, make
code similar to entering guest mode in vmx_vcpu_run.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As specified in Intel's SDM, do not allow the L1 hypervisor to launch
an L2 guest with the VM-execution controls for "unrestricted guest" or
"mode-based execute control for EPT" set and the VM-execution control
for "enable EPT" clear.
Note that the VM-execution control for "mode-based execute control for
EPT" is not yet virtualized by kvm.
Reported-by: Andrew Thornton <andrewth@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Shier <pshier@google.com>
Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Reviewed-by: Wanpeng Li <wanpengli@tencent.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For some reason, kvm_x86_ops->write_l1_tsc_offset() skipped trace
of change to active TSC offset in case vCPU is in guest-mode.
This patch changes write_l1_tsc_offset() behavior to trace any change
to active TSC offset to aid debugging. The VMX code is changed to
look more similar to SVM, which is in my opinion nicer.
Based on a patch by Liran Alon.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>