Let's add some output here so that the user has some feedback
about what is being run.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-Id: <20221004093131.40392-4-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The kvm_binary_stats_test test currently does not have any output (unless
one of the TEST_ASSERT statement fails), so it's hard to say for a user
how far it did proceed already. Thus let's make this a little bit more
user-friendly and include some TAP output via the kselftest.h interface.
Signed-off-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Andrew Jones <andrew.jones@linux.dev>
Message-Id: <20221004093131.40392-2-thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the commit 65855ed8b0 ("KVM: X86: Synchronize the shadow
pagetable before link it"), no sp would be linked with
sp->unsync_children = 1.
So make it WARN if it is the case.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20221212090106.378206-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 9bcb90650e ("KVM: VMX: Get rid of eVMCS specific VMX controls
sanitization") dropped 'vmcs_conf' sanitization for KVM-on-Hyper-V because
there's no known Hyper-V version which would expose a feature
unsupported in eVMCS in VMX feature MSRs. This works well for all
currently existing Hyper-V version, however, future Hyper-V versions
may add features which are supported by KVM and are currently missing
in eVMCSv1 definition (e.g. APIC virtualization, PML,...). When this
happens, existing KVMs will get broken. With the inverted 'unsupported
by eVMCSv1' checks, we can resurrect vmcs_conf sanitization and make
KVM future proof.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-5-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In preparation to restoring vmcs_conf sanitization for KVM-on-Hyper-V,
(and for completeness) add tertiary VM-execution controls to
'evmcs_supported_ctrls'.
No functional change intended as KVM doesn't yet expose
MSR_IA32_VMX_PROCBASED_CTLS3 to its guests.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-4-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a new feature gets implemented in KVM, EVMCS1_UNSUPPORTED_* defines
need to be adjusted to avoid the situation when the feature is exposed
to the guest but there's no corresponding eVMCS field[s] for it. This
is not obvious and fragile. Invert 'unsupported by eVMCSv1' check and
make it 'supported by eVMCSv1' instead, this way it's much harder to
make a mistake. New features will get added to EVMCS1_SUPPORTED_*
defines when the corresponding fields are added to eVMCS definition.
No functional change intended. EVMCS1_SUPPORTED_* defines are composed
by taking KVM_{REQUIRED,OPTIONAL}_VMX_ defines and filtering out what
was previously known as EVMCS1_UNSUPPORTED_*.
From all the controls, SECONDARY_EXEC_TSC_SCALING requires special
handling as it's actually present in eVMCSv1 definition but is not
currently supported for Hyper-V-on-KVM, just for KVM-on-Hyper-V. As
evmcs_supported_ctrls will be used for both scenarios, just add it
there instead of EVMCS1_SUPPORTED_2NDEXEC.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The only unsupported primary processor-based VM-execution control at the
moment is CPU_BASED_ACTIVATE_TERTIARY_CONTROLS and KVM doesn't expose it
in nested VMX feature MSRs anyway (see nested_vmx_setup_ctls_msrs())
but in preparation to inverting "unsupported with eVMCS" checks (and
for completeness) it's better to sanitize MSR_IA32_VMX_PROCBASED_CTLS/
MSR_IA32_VMX_TRUE_PROCBASED_CTLS too.
No functional change intended.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20221104144708.435865-2-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 8fda37cf3d ("KVM: selftests: Stuff RAX/RCX with 'safe' values
in vmmcall()/vmcall()", 2022-11-21) broke the svm_nested_soft_inject_test
because it placed a "pop rbp" instruction after vmmcall. While this is
correct and mimics what is done in the VMX case, this particular test
expects a ud2 instruction right after the vmmcall, so that it can skip
over it in the L1 part of the test.
Inline a suitably-modified version of vmmcall() to restore the
functionality of the test.
Fixes: 8fda37cf3d ("KVM: selftests: Stuff RAX/RCX with 'safe' values in vmmcall()/vmcall()"
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20221130181147.9911-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently only the locking order of SRCU vs kvm->slots_arch_lock
and kvm->slots_lock is documented. Extend this to kvm->lock
since Xen emulation got it terribly wrong.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While KVM_XEN_EVTCHN_RESET is usually called with no vCPUs running,
if that happened it could cause a deadlock. This is due to
kvm_xen_eventfd_reset() doing a synchronize_srcu() inside
a kvm->lock critical section.
To avoid this, first collect all the evtchnfd objects in an
array and free all of them once the kvm->lock critical section
is over and th SRCU grace period has expired.
Reported-by: Michal Luczaj <mhal@rbox.co>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Most notably, the KVM_XEN_EVTCHN_RESET feature had escaped documentation
entirely. Along with how to turn most stuff off on SHUTDOWN_soft_reset.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-6-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These are (uint64_t)-1 magic values are a userspace ABI, allowing the
shared info pages and other enlightenments to be disabled. This isn't
a Xen ABI because Xen doesn't let the guest turn these off except with
the full SHUTDOWN_soft_reset mechanism. Under KVM, the userspace VMM is
expected to handle soft reset, and tear down the kernel parts of the
enlightenments accordingly.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-5-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Port number is validated in kvm_xen_setattr_evtchn().
Remove superfluous checks in kvm_xen_eventfd_assign() and
kvm_xen_eventfd_update().
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Message-Id: <20221222203021.1944101-3-mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-4-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The evtchnfd structure itself must be protected by either kvm->lock or
SRCU. Use the former in kvm_xen_eventfd_update(), since the lock is
being taken anyway; kvm_xen_hcall_evtchn_send() instead is a reader and
does not need kvm->lock, and is called in SRCU critical section from the
kvm_x86_handle_exit function.
It is also important to use rcu_read_{lock,unlock}() in
kvm_xen_hcall_evtchn_send(), because idr_remove() will *not*
use synchronize_srcu() to wait for readers to complete.
Remove a superfluous if (kvm) check before calling synchronize_srcu()
in kvm_xen_eventfd_deassign() where kvm has been dereferenced already.
Co-developed-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-3-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In particular, we shouldn't assume that being contiguous in guest virtual
address space means being contiguous in guest *physical* address space.
In dropping the manual calls to kvm_mmu_gva_to_gpa_system(), also drop
the srcu_read_lock() that was around them. All call sites are reached
from kvm_xen_hypercall() which is called from the handle_exit function
with the read lock already held.
536395260 ("KVM: x86/xen: handle PV timers oneshot mode")
1a65105a5 ("KVM: x86/xen: handle PV spinlocks slowpath")
Fixes: 2fd6df2f2 ("KVM: x86/xen: intercept EVTCHNOP_send from guests")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20221226120320.1125390-2-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Delete an extra block of code/documentation that snuck in when KVM's
documentation was converted to ReST format.
Fixes: 106ee47dc6 ("docs: kvm: Convert api.txt to ReST format")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221207003637.2041211-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
No code is using KVM_MMU_READ_LOCK() or KVM_MMU_READ_UNLOCK(). They
used to be in virt/kvm/pfncache.c:
KVM_MMU_READ_LOCK(kvm);
retry = mmu_notifier_retry_hva(kvm, mmu_seq, uhva);
KVM_MMU_READ_UNLOCK(kvm);
However, since 58cd407ca4 ("KVM: Fix multiple races in gfn=>pfn cache
refresh", 2022-05-25) the code is only relying on the MMU notifier's
invalidation count and sequence number.
Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com>
Message-Id: <20221207120617.9409-1-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit a789aeba41 ("KVM: VMX: Rename "vmx/evmcs.{ch}" to
"vmx/hyperv.{ch}"") renames the VMX specific Hyper-V files, but does not
adjust the entry in MAINTAINERS.
Hence, ./scripts/get_maintainer.pl --self-test=patterns complains about a
broken reference.
Repair this file reference in KVM X86 HYPER-V (KVM/hyper-v).
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Fixes: a789aeba41 ("KVM: VMX: Rename "vmx/evmcs.{ch}" to "vmx/hyperv.{ch}"")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221205082044.10141-1-lukas.bulwahn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The loop marks vaddr as mapped after incrementing it by page size,
thereby marking the *next* page as mapped. Set the bit in vpages_mapped
first instead.
Fixes: 56fc773203 ("KVM: selftests: Fill in vm->vpages_mapped bitmap in virt_map() too")
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Message-Id: <20221209015307.1781352-4-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently the ucall MMIO hole is placed immediately after slot0, which
is a relatively safe address in the PA space. However, it is possible
that the same address has already been used for something else (like the
guest program image) in the VA space. At least in my own testing,
building the vgic_irq test with clang leads to the MMIO hole appearing
underneath gicv3_ops.
Stop identity mapping the MMIO hole and instead find an unused VA to map
to it. Yet another subtle detail of the KVM selftests library is that
virt_pg_map() does not update vm->vpages_mapped. Switch over to
virt_map() instead to guarantee that the chosen VA isn't to something
else.
Signed-off-by: Oliver Upton <oliver.upton@linux.dev>
Message-Id: <20221209015307.1781352-6-oliver.upton@linux.dev>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explain the meaning of the bit manipulations of vm_vaddr_populate_bitmap.
These correspond to the "canonical addresses" of x86 and other
architectures, but that is not obvious.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use a magic value to signal a ucall_alloc() failure instead of simply
doing GUEST_ASSERT(). GUEST_ASSERT() relies on ucall_alloc() and so a
failure puts the guest into an infinite loop.
Use -1 as the magic value, as a real ucall struct should never wrap.
Reported-by: Oliver Upton <oliver.upton@linux.dev>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Disable gnu-variable-sized-type-not-at-end so that tests and libraries
can create overlays of variable sized arrays at the end of structs when
using a fixed number of entries, e.g. to get/set a single MSR.
It's possible to fudge around the warning, e.g. by defining a custom
struct that hardcodes the number of entries, but that is a burden for
both developers and readers of the code.
lib/x86_64/processor.c:664:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct kvm_msrs header;
^
lib/x86_64/processor.c:772:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct kvm_msrs header;
^
lib/x86_64/processor.c:787:19: warning: field 'header' with variable sized type 'struct kvm_msrs'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct kvm_msrs header;
^
3 warnings generated.
x86_64/hyperv_tlb_flush.c:54:18: warning: field 'hv_vp_set' with variable sized type 'struct hv_vpset'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct hv_vpset hv_vp_set;
^
1 warning generated.
x86_64/xen_shinfo_test.c:137:25: warning: field 'info' with variable sized type 'struct kvm_irq_routing'
not at the end of a struct or class is a GNU extension [-Wgnu-variable-sized-type-not-at-end]
struct kvm_irq_routing info;
^
1 warning generated.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-12-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Include lib.mk before consuming $(CC) and document that lib.mk overwrites
$(CC) unless make was invoked with -e or $(CC) was specified after make
(which makes the environment override the Makefile). Including lib.mk
after using it for probing, e.g. for -no-pie, can lead to weirdness.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-11-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly disable the compiler's builtin memcmp(), memcpy(), and
memset(). Because only lib/string_override.c is built with -ffreestanding,
the compiler reserves the right to do what it wants and can try to link the
non-freestanding code to its own crud.
/usr/bin/x86_64-linux-gnu-ld: /lib/x86_64-linux-gnu/libc.a(memcmp.o): in function `memcmp_ifunc':
(.text+0x0): multiple definition of `memcmp'; tools/testing/selftests/kvm/lib/string_override.o:
tools/testing/selftests/kvm/lib/string_override.c:15: first defined here
clang: error: linker command failed with exit code 1 (use -v to see invocation)
Fixes: 6b6f71484b ("KVM: selftests: Implement memcmp(), memcpy(), and memset() for guest use")
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reported-by: Raghavendra Rao Ananta <rananta@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-10-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Probe -no-pie with the actual set of CFLAGS used to compile the tests,
clang whines about -no-pie being unused if the tests are compiled with
-static.
clang: warning: argument unused during compilation: '-no-pie'
[-Wunused-command-line-argument]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make the main() functions in the probing code proper prototypes so that
compiling the probing code with more strict flags won't generate false
negatives.
<stdin>:1:5: error: function declaration isn’t a prototype [-Werror=strict-prototypes]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-8-seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename UNAME_M to ARCH_DIR and explicitly set it directly for x86. At
this point, the name of the arch directory really doesn't have anything
to do with `uname -m`, and UNAME_M is unnecessarily confusing given that
its purpose is purely to identify the arch specific directory.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-7-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fix a == vs. = typo in kvm_get_cpu_address_width() that results in
@pa_bits being left unset if the CPU doesn't support enumerating its
MAX_PHY_ADDR. Flagged by clang's unusued-value warning.
lib/x86_64/processor.c:1034:51: warning: expression result unused [-Wunused-value]
*pa_bits == kvm_cpu_has(X86_FEATURE_PAE) ? 36 : 32;
Fixes: 3bd396353d ("KVM: selftests: Add X86_FEATURE_PAE and use it calc "fallback" MAXPHYADDR")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-6-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use pattern matching to exclude everything except .c, .h, .S, and .sh
files from Git. Manually adding every test target has an absurd
maintenance cost, is comically error prone, and leads to bikeshedding
over whether or not the targets should be listed in alphabetical order.
Deliberately do not include the one-off assets, e.g. config, settings,
.gitignore itself, etc as Git doesn't ignore files that are already in
the repository. Adding the one-off assets won't prevent mistakes where
developers forget to --force add files that don't match the "allowed".
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Check that the number of pages per slot is non-zero in get_max_slots()
prior to computing the remaining number of pages. clang generates code
that uses an actual DIV for calculating the remaining, which causes a #DE
if the total number of pages is less than the number of slots.
traps: memslot_perf_te[97611] trap divide error ip:4030c4 sp:7ffd18ae58f0
error:0 in memslot_perf_test[401000+cb000]
Fixes: a69170c65a ("KVM: selftests: memslot_perf_test: Report optimal memory slots")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Delete an unused struct definition in x86_64/vmx_tsc_adjust_test.c.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213001653.3852042-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Define a literal '0' asm input constraint to aarch64/page_fault_test's
guest_cas() as an unsigned long to make clang happy.
tools/testing/selftests/kvm/aarch64/page_fault_test.c:120:16: error:
value size does not match register size specified by the constraint
and modifier [-Werror,-Wasm-operand-widths]
:: "r" (0), "r" (TEST_DATA), "r" (guest_test_memory));
^
tools/testing/selftests/kvm/aarch64/page_fault_test.c:119:15: note:
use constraint modifier "w"
"casal %0, %1, [%2]\n"
^~
%w0
Fixes: 35c5810157 ("KVM: selftests: aarch64: Add aarch64/page_fault_test")
Cc: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20221213001653.3852042-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Re-check sp->nx_huge_page_disallowed under the tdp_mmu_pages_lock spinlock
when adding a new shadow page in the TDP MMU. To ensure the NX reclaim
kthread can't see a not-yet-linked shadow page, the page fault path links
the new page table prior to adding the page to possible_nx_huge_pages.
If the page is zapped by different task, e.g. because dirty logging is
disabled, between linking the page and adding it to the list, KVM can end
up triggering use-after-free by adding the zapped SP to the aforementioned
list, as the zapped SP's memory is scheduled for removal via RCU callback.
The bug is detected by the sanity checks guarded by CONFIG_DEBUG_LIST=y,
i.e. the below splat is just one possible signature.
------------[ cut here ]------------
list_add corruption. prev->next should be next (ffffc9000071fa70), but was ffff88811125ee38. (prev=ffff88811125ee38).
WARNING: CPU: 1 PID: 953 at lib/list_debug.c:30 __list_add_valid+0x79/0xa0
Modules linked in: kvm_intel
CPU: 1 PID: 953 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #71
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__list_add_valid+0x79/0xa0
RSP: 0018:ffffc900006efb68 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff888116cae8a0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 0000000100001872 RDI: ffff888277c5b4c8
RBP: ffffc90000717000 R08: ffff888277c5b4c0 R09: ffffc900006efa08
R10: 0000000000199998 R11: 0000000000199a20 R12: ffff888116cae930
R13: ffff88811125ee38 R14: ffffc9000071fa70 R15: ffff88810b794f90
FS: 00007fc0415d2740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000115201006 CR4: 0000000000172ea0
Call Trace:
<TASK>
track_possible_nx_huge_page+0x53/0x80
kvm_tdp_mmu_map+0x242/0x2c0
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Fixes: 61f9447854 ("KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE")
Reported-by: Greg Thelen <gthelen@google.com>
Analyzed-by: David Matlack <dmatlack@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Map the leaf SPTE when handling a TDP MMU page fault if and only if the
target level is reached. A recent commit reworked the retry logic and
incorrectly assumed that walking SPTEs would never "fail", as the loop
either bails (retries) or installs parent SPs. However, the iterator
itself will bail early if it detects a frozen (REMOVED) SPTE when
stepping down. The TDP iterator also rereads the current SPTE before
stepping down specifically to avoid walking into a part of the tree that
is being removed, which means it's possible to terminate the loop without
the guts of the loop observing the frozen SPTE, e.g. if a different task
zaps a parent SPTE between the initial read and try_step_down()'s refresh.
Mapping a leaf SPTE at the wrong level results in all kinds of badness as
page table walkers interpret the SPTE as a page table, not a leaf, and
walk into the weeds.
------------[ cut here ]------------
WARNING: CPU: 1 PID: 1025 at arch/x86/kvm/mmu/tdp_mmu.c:1070 kvm_tdp_mmu_map+0x481/0x510
Modules linked in: kvm_intel
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:kvm_tdp_mmu_map+0x481/0x510
RSP: 0018:ffffc9000072fba8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffffc9000072fcc0 RCX: 0000000000000027
RDX: 0000000000000027 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: ffff888107d45a10 R08: ffff888277c5b4c0 R09: ffffc9000072fa48
R10: 0000000000000001 R11: 0000000000000001 R12: ffffc9000073a0e0
R13: ffff88810fc54800 R14: ffff888107d1ae60 R15: ffff88810fc54f90
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
---[ end trace 0000000000000000 ]---
Invalid SPTE change: cannot replace a present leaf
SPTE with another present leaf SPTE mapping a
different PFN!
as_id: 0 gfn: 100200 old_spte: 600000112400bf3 new_spte: 6000001126009f3 level: 2
------------[ cut here ]------------
kernel BUG at arch/x86/kvm/mmu/tdp_mmu.c:559!
invalid opcode: 0000 [#1] SMP
CPU: 1 PID: 1025 Comm: nx_huge_pages_t Tainted: G W 6.1.0-rc4+ #64
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
RIP: 0010:__handle_changed_spte.cold+0x95/0x9c
RSP: 0018:ffffc9000072faf8 EFLAGS: 00010246
RAX: 00000000000000c1 RBX: ffffc90000731000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: 00000000ffffdfff RDI: ffff888277c5b4c8
RBP: 0600000112400bf3 R08: ffff888277c5b4c0 R09: ffffc9000072f9a0
R10: 0000000000000001 R11: 0000000000000001 R12: 06000001126009f3
R13: 0000000000000002 R14: 0000000012600901 R15: 0000000012400b01
FS: 00007fba9f853740(0000) GS:ffff888277c40000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000010aa7a003 CR4: 0000000000172ea0
Call Trace:
<TASK>
kvm_tdp_mmu_map+0x3b0/0x510
kvm_tdp_page_fault+0x10c/0x130
kvm_mmu_page_fault+0x103/0x680
vmx_handle_exit+0x132/0x5a0 [kvm_intel]
vcpu_enter_guest+0x60c/0x16f0
kvm_arch_vcpu_ioctl_run+0x1e2/0x9d0
kvm_vcpu_ioctl+0x271/0x660
__x64_sys_ioctl+0x80/0xb0
do_syscall_64+0x2b/0x50
entry_SYSCALL_64_after_hwframe+0x46/0xb0
</TASK>
Modules linked in: kvm_intel
---[ end trace 0000000000000000 ]---
Fixes: 63d28a25e0 ("KVM: x86/mmu: simplify kvm_tdp_mmu_map flow when guest has to retry")
Cc: Robert Hoo <robert.hu@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213033030.83345-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When stuffing the allowed secondary execution controls for nested VMX in
response to CPUID updates, don't set the allowed-1 bit for a feature that
isn't supported by KVM, i.e. isn't allowed by the canonical vmcs_config.
WARN if KVM attempts to manipulate a feature that isn't supported. All
features that are currently stuffed are always advertised to L1 for
nested VMX if they are supported in KVM's base configuration, and no
additional features should ever be added to the CPUID-induced stuffing
(updating VMX MSRs in response to CPUID updates is a long-standing KVM
flaw that is slowly being fixed).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221213062306.667649-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Set ENABLE_USR_WAIT_PAUSE in KVM's supported VMX MSR configuration if the
feature is supported in hardware and enabled in KVM's base, non-nested
configuration, i.e. expose ENABLE_USR_WAIT_PAUSE to L1 if it's supported.
This fixes a bug where saving/restoring, i.e. migrating, a vCPU will fail
if WAITPKG (the associated CPUID feature) is enabled for the vCPU, and
obviously allows L1 to enable the feature for L2.
KVM already effectively exposes ENABLE_USR_WAIT_PAUSE to L1 by stuffing
the allowed-1 control ina vCPU's virtual MSR_IA32_VMX_PROCBASED_CTLS2 when
updating secondary controls in response to KVM_SET_CPUID(2), but (a) that
depends on flawed code (KVM shouldn't touch VMX MSRs in response to CPUID
updates) and (b) runs afoul of vmx_restore_control_msr()'s restriction
that the guest value must be a strict subset of the supported host value.
Although no past commit explicitly enabled nested support for WAITPKG,
doing so is safe and functionally correct from an architectural
perspective as no additional KVM support is needed to virtualize TPAUSE,
UMONITOR, and UMWAIT for L2 relative to L1, and KVM already forwards
VM-Exits to L1 as necessary (commit bf653b78f9, "KVM: vmx: Introduce
handle_unexpected_vmexit and handle WAITPKG vmexit").
Note, KVM always keeps the hosts MSR_IA32_UMWAIT_CONTROL resident in
hardware, i.e. always runs both L1 and L2 with the host's power management
settings for TPAUSE and UMWAIT. See commit bf09fb6cba ("KVM: VMX: Stop
context switching MSR_IA32_UMWAIT_CONTROL") for more details.
Fixes: e69e72faa3 ("KVM: x86: Add support for user wait instructions")
Cc: stable@vger.kernel.org
Reported-by: Aaron Lewis <aaronlewis@google.com>
Reported-by: Yu Zhang <yu.c.zhang@linux.intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Message-Id: <20221213062306.667649-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explicitly drop the result of kvm_vcpu_write_guest() when writing the
"launch state" as part of VMCLEAR emulation, and add a comment to call
out that KVM's behavior is architecturally valid. Intel's pseudocode
effectively says that VMCLEAR is a nop if the target VMCS address isn't
in memory, e.g. if the address points at MMIO.
Add a FIXME to call out that suppressing failures on __copy_to_user() is
wrong, as memory (a memslot) does exist in that case. Punt the issue to
the future as open coding kvm_vcpu_write_guest() just to make sure the
guest dies with -EFAULT isn't worth the extra complexity. The flaw will
need to be addressed if KVM ever does something intelligent on uaccess
failures, e.g. to support post-copy demand paging, but in that case KVM
will need a more thorough overhaul, i.e. VMCLEAR shouldn't need to open
code a core KVM helper.
No functional change intended.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527765 ("Error handling issues")
Fixes: 587d7e72ae ("kvm: nVMX: VMCLEAR should not cause the vCPU to shut down")
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221220154224.526568-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Zero out the valid_bank_mask when using the fast variant of
HVCALL_SEND_IPI_EX to send IPIs to all vCPUs. KVM requires the "var_cnt"
and "valid_bank_mask" inputs to be consistent even when targeting all
vCPUs. See commit bd1ba5732b ("KVM: x86: Get the number of Hyper-V
sparse banks from the VARHEAD field").
Fixes: 998489245d ("KVM: selftests: Hyper-V PV IPI selftest")
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221219220416.395329-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a sanity check in kvm_handle_memory_failure() to assert that a valid
x86_exception structure is provided if the memory "failure" wants to
propagate a fault into the guest. If a memory failure happens during a
direct guest physical memory access, e.g. for nested VMX, KVM hardcodes
the failure to X86EMUL_IO_NEEDED and doesn't provide an exception pointer
(because the exception struct would just be filled with garbage).
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221220153427.514032-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_apic_hw_enabled() only needs to return bool, there is no place
to use the return value of MSR_IA32_APICBASE_ENABLE.
Signed-off-by: Peng Hao <flyingpeng@tencent.com>
Message-Id: <CAPm50aJ=BLXNWT11+j36Dd6d7nz2JmOBk4u7o_NPQ0N61ODu1g@mail.gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In kvm_hv_flush_tlb(), 'data_offset' and 'consumed_xmm_halves' variables
are used in a mutually exclusive way: in 'hc->fast' we count in 'XMM
halves' and increase 'data_offset' otherwise. Coverity discovered, that in
one case both variables are incremented unconditionally. This doesn't seem
to cause any issues as the only user of 'data_offset'/'consumed_xmm_halves'
data is kvm_hv_get_tlb_flush_entries() -> kvm_hv_get_hc_data() which also
takes into account 'hc->fast' but is still worth fixing.
To make things explicit, put 'data_offset' and 'consumed_xmm_halves' to
'struct kvm_hv_hcall' as a union and use at call sites. This allows to
remove explicit 'data_offset'/'consumed_xmm_halves' parameters from
kvm_hv_get_hc_data()/kvm_get_sparse_vp_set()/kvm_hv_get_tlb_flush_entries()
helpers.
Note: 'struct kvm_hv_hcall' is allocated on stack in kvm_hv_hypercall() and
is not zeroed, consumers are supposed to initialize the appropriate field
if needed.
Reported-by: coverity-bot <keescook+coverity-bot@chromium.org>
Addresses-Coverity-ID: 1527764 ("Uninitialized variables")
Fixes: 260970862c ("KVM: x86: hyper-v: Handle HVCALL_FLUSH_VIRTUAL_ADDRESS_LIST{,EX} calls gently")
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221208102700.959630-1-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When scanning userspace I/OAPIC entries, intercept EOI for level-triggered
IRQs if the current vCPU has a pending and/or in-service IRQ for the
vector in its local API, even if the vCPU doesn't match the new entry's
destination. This fixes a race between userspace I/OAPIC reconfiguration
and IRQ delivery that results in the vector's bit being left set in the
remote IRR due to the eventual EOI not being forwarded to the userspace
I/OAPIC.
Commit 0fc5a36dd6 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC
reconfigure race") fixed the in-kernel IOAPIC, but not the userspace
IOAPIC configuration, which has a similar race.
Fixes: 0fc5a36dd6 ("KVM: x86: ioapic: Fix level-triggered EOI and IOAPIC reconfigure race")
Signed-off-by: Adamos Ttofari <attofari@amazon.de>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221208094415.12723-1-attofari@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The current vPMU can reuse the same pmc->perf_event for the same
hardware event via pmc_pause/resume_counter(), but this optimization
does not apply to a portion of the TSX events (e.g., "event=0x3c,in_tx=1,
in_tx_cp=1"), where event->attr.sample_period is legally zero at creation,
thus making the perf call to perf_event_period() meaningless (no need to
adjust sample period in this case), and instead causing such reusable
perf_events to be repeatedly released and created.
Avoid releasing zero sample_period events by checking is_sampling_event()
to follow the previously enable/disable optimization.
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20221207071506.15733-2-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add ReST formatting to the set of userspace MSR exits/flags so that the
resulting HTML docs generate a table instead of malformed gunk. This
also fixes a warning that was introduced by a recent cleanup of the
relevant documentation (yay copy+paste).
>> Documentation/virt/kvm/api.rst:7287: WARNING: Block quote ends
without a blank line; unexpected unindent.
Fixes: 1ae099540e ("KVM: x86: Allow deflecting unknown MSR accesses to user space")
Fixes: 1f15814718 ("KVM: x86: Clean up KVM_CAP_X86_USER_SPACE_MSR documentation")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20221207000959.2035098-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86 Xen-for-KVM:
* Allow the Xen runstate information to cross a page boundary
* Allow XEN_RUNSTATE_UPDATE flag behaviour to be configured
* add support for 32-bit guests in SCHEDOP_poll
x86 fixes:
* One-off fixes for various emulation flows (SGX, VMXON, NRIPS=0).
* Reinstate IBPB on emulated VM-Exit that was incorrectly dropped a few
years back when eliminating unnecessary barriers when switching between
vmcs01 and vmcs02.
* Clean up the MSR filter docs.
* Clean up vmread_error_trampoline() to make it more obvious that params
must be passed on the stack, even for x86-64.
* Let userspace set all supported bits in MSR_IA32_FEAT_CTL irrespective
of the current guest CPUID.
* Fudge around a race with TSC refinement that results in KVM incorrectly
thinking a guest needs TSC scaling when running on a CPU with a
constant TSC, but no hardware-enumerated TSC frequency.
* Advertise (on AMD) that the SMM_CTL MSR is not supported
* Remove unnecessary exports
Selftests:
* Fix an inverted check in the access tracking perf test, and restore
support for asserting that there aren't too many idle pages when
running on bare metal.
* Fix an ordering issue in the AMX test introduced by recent conversions
to use kvm_cpu_has(), and harden the code to guard against similar bugs
in the future. Anything that tiggers caching of KVM's supported CPUID,
kvm_cpu_has() in this case, effectively hides opt-in XSAVE features if
the caching occurs before the test opts in via prctl().
* Fix build errors that occur in certain setups (unsure exactly what is
unique about the problematic setup) due to glibc overriding
static_assert() to a variant that requires a custom message.
* Introduce actual atomics for clear/set_bit() in selftests
Documentation:
* Remove deleted ioctls from documentation
* Various fixes