Commit Graph

1043894 Commits

Author SHA1 Message Date
Anup Patel
0a86512dc1 RISC-V: KVM: Factor-out FP virtualization into separate sources
The timer and SBI virtualization is already in separate sources.
In future, we will have vector and AIA virtualization also added
as separate sources.

To align with above described modularity, we factor-out FP
virtualization into separate sources.

Signed-off-by: Anup Patel <anup.patel@wdc.com>
Message-Id: <20211026170136.2147619-3-anup.patel@wdc.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-31 02:45:43 -04:00
Paolo Bonzini
4e33868433 KVM/arm64 updates for Linux 5.16
- More progress on the protected VM front, now with the full
   fixed feature set as well as the limitation of some hypercalls
   after initialisation.
 
 - Cleanup of the RAZ/WI sysreg handling, which was pointlessly
   complicated
 
 - Fixes for the vgic placement in the IPA space, together with a
   bunch of selftests
 
 - More memcg accounting of the memory allocated on behalf of a guest
 
 - Timer and vgic selftests
 
 - Workarounds for the Apple M1 broken vgic implementation
 
 - KConfig cleanups
 
 - New kvmarm.mode=none option, for those who really dislike us
 -----BEGIN PGP SIGNATURE-----
 
 iQJDBAABCgAtFiEEn9UcU+C1Yxj9lZw9I9DQutE9ekMFAmF7u5YPHG1hekBrZXJu
 ZWwub3JnAAoJECPQ0LrRPXpD6w8QAIKDLJCTqkxv5Vh4ZSmtXxg4gTZMBlg8oSQ8
 sVL639aqBvFe3A6Vmz6IwBm+NT7Sm1zxkuH9qHzVR1gmXq0oLYNrIuyrzRW8PvqO
 hIkSRRoVsf03755TmkxwR7/2jAFxb6FhEVAy6VWdQyI44orihIPvMp8aTIq+jvU+
 XoNGb/rPf9HpSUtvuaHYvZhSZBhoi5dRnkr33R1+VR69n7Axs8lm905xcl6Pt0a0
 QqYZWQvFu/BXPyNflG7LUsegRF/iiV2vNTbNNowkzlV5suqxBpJAp6ApDL/gWrHv
 ya/6cMqicSjBIkWnawhXY98w6/5xfzK4IV/zc00FNWOlUdVP89Thqrgc8EkigS9R
 BGcxFFqj41snr+ensSBBIkNtV+dBX52H3rUE0F9seiTXm8QWI86JobdeNadT8tUP
 TXdOeCUcA+cp4Ngln18lsbOEaBkPA5H1po1nUFPHbKnVOxnqXScB7E/xF6rAbryV
 m+Z+oidU7MyS/Ev/Da0ww/XFx7cs2ez9EgeQvjcdFAvUMqS6kcXEExvgGYlm+KRQ
 GBMKPLCNHKdflMANoSpol7MZUmPJ45XoWKW1rntj2r9X+oJW2Z2hEx32xrWDJdqK
 ixnbjog5kNZb0CjLGsUC90lo2hpRJecaLhAjgTLYaNC1QxGPrt92eat6gnwuMTBc
 mpADqi7w
 =qBAO
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/arm64 updates for Linux 5.16

- More progress on the protected VM front, now with the full
  fixed feature set as well as the limitation of some hypercalls
  after initialisation.

- Cleanup of the RAZ/WI sysreg handling, which was pointlessly
  complicated

- Fixes for the vgic placement in the IPA space, together with a
  bunch of selftests

- More memcg accounting of the memory allocated on behalf of a guest

- Timer and vgic selftests

- Workarounds for the Apple M1 broken vgic implementation

- KConfig cleanups

- New kvmarm.mode=none option, for those who really dislike us
2021-10-31 02:28:48 -04:00
Paolo Bonzini
e59f3e5d45 Merge branch 'kvm-pvclock-raw-spinlock' into HEAD
pvclock_gtod_sync_lock is completely gone in Linux 5.16.  Include this
fix into the kvm/next history to record that the syzkaller report is
not valid there.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-25 08:40:24 -04:00
David Woodhouse
8228c77d8b KVM: x86: switch pvclock_gtod_sync_lock to a raw spinlock
On the preemption path when updating a Xen guest's runstate times, this
lock is taken inside the scheduler rq->lock, which is a raw spinlock.
This was shown in a lockdep warning:

[   89.138354] =============================
[   89.138356] [ BUG: Invalid wait context ]
[   89.138358] 5.15.0-rc5+ #834 Tainted: G S        I E
[   89.138360] -----------------------------
[   89.138361] xen_shinfo_test/2575 is trying to lock:
[   89.138363] ffffa34a0364efd8 (&kvm->arch.pvclock_gtod_sync_lock){....}-{3:3}, at: get_kvmclock_ns+0x1f/0x130 [kvm]
[   89.138442] other info that might help us debug this:
[   89.138444] context-{5:5}
[   89.138445] 4 locks held by xen_shinfo_test/2575:
[   89.138447]  #0: ffff972bdc3b8108 (&vcpu->mutex){+.+.}-{4:4}, at: kvm_vcpu_ioctl+0x77/0x6f0 [kvm]
[   89.138483]  #1: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_ioctl_run+0xdc/0x8b0 [kvm]
[   89.138526]  #2: ffff97331fdbac98 (&rq->__lock){-.-.}-{2:2}, at: __schedule+0xff/0xbd0
[   89.138534]  #3: ffffa34a03662e90 (&kvm->srcu){....}-{0:0}, at: kvm_arch_vcpu_put+0x26/0x170 [kvm]
...
[   89.138695]  get_kvmclock_ns+0x1f/0x130 [kvm]
[   89.138734]  kvm_xen_update_runstate+0x14/0x90 [kvm]
[   89.138783]  kvm_xen_update_runstate_guest+0x15/0xd0 [kvm]
[   89.138830]  kvm_arch_vcpu_put+0xe6/0x170 [kvm]
[   89.138870]  kvm_sched_out+0x2f/0x40 [kvm]
[   89.138900]  __schedule+0x5de/0xbd0

Cc: stable@vger.kernel.org
Reported-by: syzbot+b282b65c2c68492df769@syzkaller.appspotmail.com
Fixes: 30b5c851af ("KVM: x86/xen: Add support for vCPU runstate information")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <1b02a06421c17993df337493a68ba923f3bd5c0f.camel@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-25 08:14:38 -04:00
David Edmondson
0d7d84498f KVM: x86: SGX must obey the KVM_INTERNAL_ERROR_EMULATION protocol
When passing the failing address and size out to user space, SGX must
ensure not to trample on the earlier fields of the emulation_failure
sub-union of struct kvm_run.

Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-5-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-25 06:48:25 -04:00
David Edmondson
e615e35589 KVM: x86: On emulation failure, convey the exit reason, etc. to userspace
Should instruction emulation fail, include the VM exit reason, etc. in
the emulation_failure data passed to userspace, in order that the VMM
can report it as a debugging aid when describing the failure.

Suggested-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-4-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-25 06:48:24 -04:00
David Edmondson
0a62a0319a KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info
Extend the get_exit_info static call to provide the reason for the VM
exit. Modify relevant trace points to use this rather than extracting
the reason in the caller.

Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-3-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-25 06:48:24 -04:00
David Edmondson
a9d496d8e0 KVM: x86: Clarify the kvm_run.emulation_failure structure layout
Until more flags for kvm_run.emulation_failure flags are defined, it
is undetermined whether new payload elements corresponding to those
flags will be additive or alternative. As a hint to userspace that an
alternative is possible, wrap the current payload elements in a union.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Edmondson <david.edmondson@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210920103737.2696756-2-david.edmondson@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-25 06:48:24 -04:00
Jim Mattson
ed290e1c20 KVM: selftests: Fix nested SVM tests when built with clang
Though gcc conveniently compiles a simple memset to "rep stos," clang
prefers to call the libc version of memset. If a test is dynamically
linked, the libc memset isn't available in L1 (nor is the PLT or the
GOT, for that matter). Even if the test is statically linked, the libc
memset may choose to use some CPU features, like AVX, which may not be
enabled in L1. Note that __builtin_memset doesn't solve the problem,
because (a) the compiler is free to call memset anyway, and (b)
__builtin_memset may also choose to use features like AVX, which may
not be available in L1.

To avoid a myriad of problems, use an explicit "rep stos" to clear the
VMCB in generic_svm_setup(), which is called both from L0 and L1.

Reported-by: Ricardo Koller <ricarkol@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Fixes: 20ba262f86 ("selftests: KVM: AMD Nested test infrastructure")
Message-Id: <20210930003649.4026553-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 12:46:37 -04:00
Jim Mattson
dfd3c713a9 kvm: x86: Remove stale declaration of kvm_no_apic_vcpu
This variable was renamed to kvm_has_noapic_vcpu in commit
6e4e3b4df4 ("KVM: Stop using deprecated jump label APIs").

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <20211021185449.3471763-1-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 12:46:37 -04:00
Sean Christopherson
ec5a4919fa KVM: VMX: Unregister posted interrupt wakeup handler on hardware unsetup
Unregister KVM's posted interrupt wakeup handler during unsetup so that a
spurious interrupt that arrives after kvm_intel.ko is unloaded doesn't
call into freed memory.

Fixes: bf9f6ac8d7 ("KVM: Update Posted-Interrupts Descriptor when vCPU is blocked")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009001107.3936588-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 12:46:37 -04:00
Sean Christopherson
6ff53f6a43 x86/irq: Ensure PI wakeup handler is unregistered before module unload
Add a synchronize_rcu() after clearing the posted interrupt wakeup handler
to ensure all readers, i.e. in-flight IRQ handlers, see the new handler
before returning to the caller.  If the caller is an exiting module and
is unregistering its handler, failure to wait could result in the IRQ
handler jumping into an unloaded module.

The registration path doesn't require synchronization, as it's the
caller's responsibility to not generate interrupts it cares about until
after its handler is registered.

Fixes: f6b3c72c23 ("x86/irq: Define a global vector for VT-d Posted-Interrupts")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211009001107.3936588-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 12:45:35 -04:00
Sean Christopherson
187c8833de KVM: x86: Use rw_semaphore for APICv lock to allow vCPU parallelism
Use a rw_semaphore instead of a mutex to coordinate APICv updates so that
vCPUs responding to requests can take the lock for read and run in
parallel.  Using a mutex forces serialization of vCPUs even though
kvm_vcpu_update_apicv() only touches data local to that vCPU or is
protected by a different lock, e.g. SVM's ir_list_lock.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022004927.1448382-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 11:20:16 -04:00
Sean Christopherson
ee49a89329 KVM: x86: Move SVM's APICv sanity check to common x86
Move SVM's assertion that vCPU's APICv state is consistent with its VM's
state out of svm_vcpu_run() and into x86's common inner run loop.  The
assertion and underlying logic is not unique to SVM, it's just that SVM
has more inhibiting conditions and thus is more likely to run headfirst
into any KVM bugs.

Add relevant comments to document exactly why the update path has unusual
ordering between the update the kick, why said ordering is safe, and also
the basic rules behind the assertion in the run loop.

Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022004927.1448382-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 11:20:16 -04:00
Lukas Bulwahn
9b4eb77099 riscv: do not select non-existing config ANON_INODES
Commit 99cdc6c18c ("RISC-V: Add initial skeletal KVM support") selects
the config ANON_INODES in config KVM, but the config ANON_INODES is removed
since commit 5dd50aaeb1 ("Make anon_inodes unconditional") in 2018.

Hence, ./scripts/checkkconfigsymbols.py warns on non-existing symbols:

  ANON_INODES
  Referencing files: arch/riscv/kvm/Kconfig

Remove selecting the non-existing config ANON_INODES.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Message-Id: <20211022061514.25946-1-lukas.bulwahn@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:53:37 -04:00
Sean Christopherson
21fa324654 KVM: x86/mmu: Extract zapping of rmaps for gfn range to separate helper
Extract the zapping of rmaps, a.k.a. legacy MMU, for a gfn range to a
separate helper to clean up the unholy mess that kvm_zap_gfn_range() has
become.  In addition to deep nesting, the rmaps zapping spreads out the
declaration of several variables and is generally a mess.  Clean up the
mess now so that future work to improve the memslots implementation
doesn't need to deal with it.

Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022010005.1454978-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:51:52 -04:00
Sean Christopherson
e8be2a5ba8 KVM: x86/mmu: Drop a redundant remote TLB flush in kvm_zap_gfn_range()
Remove an unnecessary remote TLB flush in kvm_zap_gfn_range() now that
said function holds mmu_lock for write for its entire duration.  The
flush was added by the now-reverted commit to allow TDP MMU to flush while
holding mmu_lock for read, as the transition from write=>read required
dropping the lock and thus a pending flush needed to be serviced.

Fixes: 5a324c24b6 ("Revert "KVM: x86/mmu: Allow zap gfn range to operate under the mmu read lock"")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022010005.1454978-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:51:46 -04:00
Sean Christopherson
bc3b3c1002 KVM: x86/mmu: Drop a redundant, broken remote TLB flush
A recent commit to fix the calls to kvm_flush_remote_tlbs_with_address()
in kvm_zap_gfn_range() inadvertantly added yet another flush instead of
fixing the existing flush.  Drop the redundant flush, and fix the params
for the existing flush.

Cc: stable@vger.kernel.org
Fixes: 2822da4466 ("KVM: x86/mmu: fix parameters to kvm_flush_remote_tlbs_with_address")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211022010005.1454978-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:51:30 -04:00
Lai Jiangshan
61b05a9fd4 KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest()
kvm_mmu_unload() destroys all the PGD caches.  Use the lighter
kvm_mmu_sync_roots() and kvm_mmu_sync_prev_roots() instead.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-5-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:44:43 -04:00
Lai Jiangshan
264d3dc1d3 KVM: X86: pair smp_wmb() of mmu_try_to_unsync_pages() with smp_rmb()
The commit 578e1c4db2 ("kvm: x86: Avoid taking MMU lock
in kvm_mmu_sync_roots if no sync is needed") added smp_wmb() in
mmu_try_to_unsync_pages(), but the corresponding smp_load_acquire() isn't
used on the load of SPTE.W.  smp_load_acquire() orders _subsequent_
loads after sp->is_unsync; it does not order _earlier_ loads before
the load of sp->is_unsync.

This has no functional change; smp_rmb() is a NOP on x86, and no
compiler barrier is required because there is a VMEXIT between the
load of SPTE.W and kvm_mmu_snc_roots.

Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-4-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:43:21 -04:00
Lai Jiangshan
509bfe3d97 KVM: X86: Cache CR3 in prev_roots when PCID is disabled
The commit 21823fbda5 ("KVM: x86: Invalidate all PGDs for the
current PCID on MOV CR3 w/ flush") invalidates all PGDs for the specific
PCID and in the case of PCID is disabled, it includes all PGDs in the
prev_roots and the commit made prev_roots totally unused in this case.

Not using prev_roots fixes a problem when CR4.PCIDE is changed 0 -> 1
before the said commit:

	(CR4.PCIDE=0, CR4.PGE=1; CR3=cr3_a; the page for the guest
	 RIP is global; cr3_b is cached in prev_roots)

	modify page tables under cr3_b
		the shadow root of cr3_b is unsync in kvm
	INVPCID single context
		the guest expects the TLB is clean for PCID=0
	change CR4.PCIDE 0 -> 1
	switch to cr3_b with PCID=0,NOFLUSH=1
		No sync in kvm, cr3_b is still unsync in kvm
	jump to the page that was modified in step 1
		shadow page tables point to the wrong page

It is a very unlikely case, but it shows that stale prev_roots can be
a problem after CR4.PCIDE changes from 0 to 1.  However, to fix this
case, the commit disabled caching CR3 in prev_roots altogether when PCID
is disabled.  Not all CPUs have PCID; especially the PCID support
for AMD CPUs is kind of recent.  To restore the prev_roots optimization
for CR4.PCIDE=0, flush the whole MMU (including all prev_roots) when
CR4.PCIDE changes.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:30 -04:00
Lai Jiangshan
e45e9e3998 KVM: X86: Fix tlb flush for tdp in kvm_invalidate_pcid()
The KVM doesn't know whether any TLB for a specific pcid is cached in
the CPU when tdp is enabled.  So it is better to flush all the guest
TLB when invalidating any single PCID context.

The case is very rare or even impossible since KVM generally doesn't
intercept CR3 write or INVPCID instructions when tdp is enabled, so the
fix is mostly for the sake of overall robustness.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211019110154.4091-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:30 -04:00
Lai Jiangshan
a91a7c7096 KVM: X86: Don't reset mmu context when toggling X86_CR4_PGE
X86_CR4_PGE doesn't participate in kvm_mmu_role, so the mmu context
doesn't need to be reset.  It is only required to flush all the guest
tlb.

It is also inconsistent that X86_CR4_PGE is in KVM_MMU_CR4_ROLE_BITS
while kvm_mmu_role doesn't use X86_CR4_PGE.  So X86_CR4_PGE is also
removed from KVM_MMU_CR4_ROLE_BITS.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210919024246.89230-3-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:29 -04:00
Lai Jiangshan
552617382c KVM: X86: Don't reset mmu context when X86_CR4_PCIDE 1->0
X86_CR4_PCIDE doesn't participate in kvm_mmu_role, so the mmu context
doesn't need to be reset.  It is only required to flush all the guest
tlb.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20210919024246.89230-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:29 -04:00
Michael Roth
413eaa4ecd KVM: selftests: set CPUID before setting sregs in vcpu creation
Recent kernels have checks to ensure the GPA values in special-purpose
registers like CR3 are within the maximum physical address range and
don't overlap with anything in the upper/reserved range. In the case of
SEV kselftest guests booting directly into 64-bit mode, CR3 needs to be
initialized to the GPA of the page table root, with the encryption bit
set. The kernel accounts for this encryption bit by removing it from
reserved bit range when the guest advertises the bit position via
KVM_SET_CPUID*, but kselftests currently call KVM_SET_SREGS as part of
vm_vcpu_add_default(), before KVM_SET_CPUID*.

As a result, KVM_SET_SREGS will return an error in these cases.
Address this by moving vcpu_set_cpuid() (which calls KVM_SET_CPUID*)
ahead of vcpu_setup() (which calls KVM_SET_SREGS).

While there, address a typo in the assertion that triggers when
KVM_SET_SREGS fails.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Michael Roth <michael.roth@amd.com>
Message-Id: <20211006203617.13045-1-michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Nathan Tempelman <natet@google.com>
2021-10-22 05:19:29 -04:00
Wanpeng Li
9ae7f6c9b5 KVM: emulate: Comment on difference between RDPMC implementation and manual
SDM mentioned that, RDPMC:

  IF (((CR4.PCE = 1) or (CPL = 0) or (CR0.PE = 0)) and (ECX indicates a supported counter))
      THEN
          EAX := counter[31:0];
          EDX := ZeroExtend(counter[MSCB:32]);
      ELSE (* ECX is not valid or CR4.PCE is 0 and CPL is 1, 2, or 3 and CR0.PE is 1 *)
          #GP(0);
  FI;

Let's add a comment why CR0.PE isn't tested since it's impossible for CPL to be >0 if
CR0.PE=0.

Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1634724836-73721-1-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:29 -04:00
Sean Christopherson
9dadfc4a61 KVM: x86: Add vendor name to kvm_x86_ops, use it for error messages
Paul pointed out the error messages when KVM fails to load are unhelpful
in understanding exactly what went wrong if userspace probes the "wrong"
module.

Add a mandatory kvm_x86_ops field to track vendor module names, kvm_intel
and kvm_amd, and use the name for relevant error message when KVM fails
to load so that the user knows which module failed to load.

Opportunistically tweak the "disabled by bios" error message to clarify
that _support_ was disabled, not that the module itself was magically
disabled by BIOS.

Suggested-by: Paul Menzel <pmenzel@molgen.mpg.de>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211018183929.897461-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:28 -04:00
Junaid Shahid
4dfe4f40d8 kvm: x86: mmu: Make NX huge page recovery period configurable
Currently, the NX huge page recovery thread wakes up every minute and
zaps 1/nx_huge_pages_recovery_ratio of the total number of split NX
huge pages at a time. This is intended to ensure that only a
relatively small number of pages get zapped at a time. But for very
large VMs (or more specifically, VMs with a large number of
executable pages), a period of 1 minute could still result in this
number being too high (unless the ratio is changed significantly,
but that can result in split pages lingering on for too long).

This change makes the period configurable instead of fixing it at
1 minute. Users of large VMs can then adjust the period and/or the
ratio to reduce the number of pages zapped at one time while still
maintaining the same overall duration for cycling through the
entire list. By default, KVM derives a period from the ratio such
that a page will remain on the list for 1 hour on average.

Signed-off-by: Junaid Shahid <junaids@google.com>
Message-Id: <20211020010627.305925-1-junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:28 -04:00
Wanpeng Li
540c7abe61 KVM: vPMU: Fill get_msr MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0
SDM section 18.2.3 mentioned that:

  "IA32_PERF_GLOBAL_OVF_CTL MSR allows software to clear overflow indicator(s) of
   any general-purpose or fixed-function counters via a single WRMSR."

It is R/W mentioned by SDM, we read this msr on bare-metal during perf testing,
the value is always 0 for ICX/SKX boxes on hands. Let's fill get_msr
MSR_CORE_PERF_GLOBAL_OVF_CTRL w/ 0 as hardware behavior and drop
global_ovf_ctrl variable.

Tested-by: Like Xu <likexu@tencent.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Message-Id: <1634631160-67276-2-git-send-email-wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:28 -04:00
David Matlack
610265ea3d KVM: x86/mmu: Rename slot_handle_leaf to slot_handle_level_4k
slot_handle_leaf is a misnomer because it only operates on 4K SPTEs
whereas "leaf" is used to describe any valid terminal SPTE (4K or
large page). Rename slot_handle_leaf to slot_handle_level_4k to
avoid confusion.

Making this change makes it more obvious there is a benign discrepency
between the legacy MMU and the TDP MMU when it comes to dirty logging.
The legacy MMU only iterates through 4K SPTEs when zapping for
collapsing and when clearing D-bits. The TDP MMU, on the other hand,
iterates through SPTEs on all levels.

The TDP MMU behavior of zapping SPTEs at all levels is technically
overkill for its current dirty logging implementation, which always
demotes to 4k SPTES, but both the TDP MMU and legacy MMU zap if and only
if the SPTE can be replaced by a larger page, i.e. will not spuriously
zap 2m (or larger) SPTEs. Opportunistically add comments to explain this
discrepency in the code.

Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20211019162223.3935109-1-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:27 -04:00
Xiaoyao Li
e099f3eb0e KVM: VMX: RTIT_CTL_BRANCH_EN has no dependency on other CPUID bit
Per Intel SDM, RTIT_CTL_BRANCH_EN bit has no dependency on any CPUID
leaf 0x14.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20210827070249.924633-5-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:27 -04:00
Xiaoyao Li
f4d3a902a5 KVM: VMX: Rename pt_desc.addr_range to pt_desc.num_address_ranges
To better self explain the meaning of this field and match the
PT_CAP_num_address_ranges constatn.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20210827070249.924633-4-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:27 -04:00
Xiaoyao Li
ba51d62723 KVM: VMX: Use precomputed vmx->pt_desc.addr_range
The number of valid PT ADDR MSRs for the guest is precomputed in
vmx->pt_desc.addr_range. Use it instead of calculating again.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20210827070249.924633-3-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:26 -04:00
Xiaoyao Li
2e6e0d683b KVM: VMX: Restore host's MSR_IA32_RTIT_CTL when it's not zero
A minor optimization to WRMSR MSR_IA32_RTIT_CTL when necessary.

Opportunistically refine the comment to call out that KVM requires
VM_EXIT_CLEAR_IA32_RTIT_CTL to expose PT to the guest.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-Id: <20210827070249.924633-2-xiaoyao.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:26 -04:00
Paolo Bonzini
2839180ce5 KVM: x86/mmu: clean up prefetch/prefault/speculative naming
"prefetch", "prefault" and "speculative" are used throughout KVM to mean
the same thing.  Use a single name, standardizing on "prefetch" which
is already used by various functions such as direct_pte_prefetch,
FNAME(prefetch_gpte), FNAME(pte_prefetch), etc.

Suggested-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:26 -04:00
David Stevens
1e76a3ce0d KVM: cleanup allocation of rmaps and page tracking data
Unify the flags for rmaps and page tracking data, using a
single flag in struct kvm_arch and a single loop to go
over all the address spaces and memslots.  This avoids
code duplication between alloc_all_memslots_rmaps and
kvm_page_track_enable_mmu_write_tracking.

Signed-off-by: David Stevens <stevensd@chromium.org>
[This patch is the delta between David's v2 and v3, with conflicts
 fixed and my own commit message. - Paolo]
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-22 05:19:25 -04:00
Marc Zyngier
5a2acbbb01 Merge branch kvm/selftests/memslot into kvmarm-master/next
* kvm/selftests/memslot:
  : .
  : Enable KVM memslot selftests on arm64, making them less
  : x86 specific.
  : .
  KVM: selftests: Build the memslot tests for arm64
  KVM: selftests: Make memslot_perf_test arch independent

Signed-off-by: Marc Zyngier <maz@kernel.org>
2021-10-21 11:40:03 +01:00
Ricardo Koller
358928fd52 KVM: selftests: Build the memslot tests for arm64
Add memslot_perf_test and memslot_modification_stress_test to the list
of aarch64 selftests.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210907180957.609966-3-ricarkol@google.com
2021-10-21 11:36:41 +01:00
Ricardo Koller
ffb4ce3c49 KVM: selftests: Make memslot_perf_test arch independent
memslot_perf_test uses ucalls for synchronization between guest and
host. Ucalls API is architecture independent: tests do not need to know
details like what kind of exit they generate on a specific arch.  More
specifically, there is no need to check whether an exit is KVM_EXIT_IO
in x86 for the host to know that the exit is ucall related, as
get_ucall() already makes that check.

Change memslot_perf_test to not require specifying what exit does a
ucall generate. Also add a missing ucall_init.

Signed-off-by: Ricardo Koller <ricarkol@google.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Oliver Upton <oupton@google.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://lore.kernel.org/r/20210907180957.609966-2-ricarkol@google.com
2021-10-21 11:36:34 +01:00
Oliver Upton
3f9808cac0 selftests: KVM: Introduce system counter offset test
Introduce a KVM selftest to verify that userspace manipulation of the
TSC (via the new vCPU attribute) results in the correct behavior within
the guest.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181555.973085-6-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:46 -04:00
Oliver Upton
c895513453 selftests: KVM: Add helpers for vCPU device attributes
vCPU file descriptors are abstracted away from test code in KVM
selftests, meaning that tests cannot directly access a vCPU's device
attributes. Add helpers that tests can use to get at vCPU device
attributes.

Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181555.973085-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:46 -04:00
Oliver Upton
c1901feef5 selftests: KVM: Fix kvm device helper ioctl assertions
The KVM_CREATE_DEVICE and KVM_{GET,SET}_DEVICE_ATTR ioctls are defined
to return a value of zero on success. As such, tighten the assertions in
the helper functions to only pass if the return code is zero.

Suggested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181555.973085-4-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:46 -04:00
Oliver Upton
61fb1c5485 selftests: KVM: Add test for KVM_{GET,SET}_CLOCK
Add a selftest for the new KVM clock UAPI that was introduced. Ensure
that the KVM clock is consistent between userspace and the guest, and
that the difference in realtime will only ever cause the KVM clock to
advance forward.

Cc: Andrew Jones <drjones@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181555.973085-3-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:45 -04:00
Oliver Upton
5000653934 tools: arch: x86: pull in pvclock headers
Copy over approximately clean versions of the pvclock headers into
tools. Reconcile headers/symbols missing in tools that are unneeded.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181555.973085-2-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:45 -04:00
Oliver Upton
828ca89628 KVM: x86: Expose TSC offset controls to userspace
To date, VMM-directed TSC synchronization and migration has been a bit
messy. KVM has some baked-in heuristics around TSC writes to infer if
the VMM is attempting to synchronize. This is problematic, as it depends
on host userspace writing to the guest's TSC within 1 second of the last
write.

A much cleaner approach to configuring the guest's views of the TSC is to
simply migrate the TSC offset for every vCPU. Offsets are idempotent,
and thus not subject to change depending on when the VMM actually
reads/writes values from/to KVM. The VMM can then read the TSC once with
KVM_GET_CLOCK to capture a (realtime, host_tsc) pair at the instant when
the guest is paused.

Cc: David Matlack <dmatlack@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210916181538.968978-8-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:45 -04:00
Oliver Upton
58d4277be9 KVM: x86: Refactor tsc synchronization code
Refactor kvm_synchronize_tsc to make a new function that allows callers
to specify TSC parameters (offset, value, nanoseconds, etc.) explicitly
for the sake of participating in TSC synchronization.

Signed-off-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181538.968978-7-oupton@google.com>
[Make sure kvm->arch.cur_tsc_generation and vcpu->arch.this_tsc_generation are
 equal at the end of __kvm_synchronize_tsc, if matched is false. Reported by
 Maxim Levitsky. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:45 -04:00
Paolo Bonzini
869b44211a kvm: x86: protect masterclock with a seqcount
Protect the reference point for kvmclock with a seqcount, so that
kvmclock updates for all vCPUs can proceed in parallel.  Xen runstate
updates will also run in parallel and not bounce the kvmclock cacheline.

Of the variables that were protected by pvclock_gtod_sync_lock,
nr_vcpus_matched_tsc is different because it is updated outside
pvclock_update_vm_gtod_copy and read inside it.  Therefore, we
need to keep it protected by a spinlock.  In fact it must now
be a raw spinlock, because pvclock_update_vm_gtod_copy, being the
write-side of a seqcount, is non-preemptible.  Since we already
have tsc_write_lock which is a raw spinlock, we can just use
tsc_write_lock as the lock that protects the write-side of the
seqcount.

Co-developed-by: Oliver Upton <oupton@google.com>
Message-Id: <20210916181538.968978-6-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:44 -04:00
Oliver Upton
c68dc1b577 KVM: x86: Report host tsc and realtime values in KVM_GET_CLOCK
Handling the migration of TSCs correctly is difficult, in part because
Linux does not provide userspace with the ability to retrieve a (TSC,
realtime) clock pair for a single instant in time. In lieu of a more
convenient facility, KVM can report similar information in the kvm_clock
structure.

Provide userspace with a host TSC & realtime pair iff the realtime clock
is based on the TSC. If userspace provides KVM_SET_CLOCK with a valid
realtime value, advance the KVM clock by the amount of elapsed time. Do
not step the KVM clock backwards, though, as it is a monotonic
oscillator.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <20210916181538.968978-5-oupton@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:44 -04:00
Paolo Bonzini
3d5e7a28b1 KVM: x86: avoid warning with -Wbitwise-instead-of-logical
This is a new warning in clang top-of-tree (will be clang 14):

In file included from arch/x86/kvm/mmu/mmu.c:27:
arch/x86/kvm/mmu/spte.h:318:9: error: use of bitwise '|' with boolean operands [-Werror,-Wbitwise-instead-of-logical]
        return __is_bad_mt_xwr(rsvd_check, spte) |
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
                                                 ||
arch/x86/kvm/mmu/spte.h:318:9: note: cast one or both operands to int to silence this warning

The code is fine, but change it anyway to shut up this clever clogs
of a compiler.

Reported-by: torvic9@mailbox.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-10-18 14:43:42 -04:00
Paolo Bonzini
a25c78d04c Merge commit 'kvm-pagedata-alloc-fixes' into HEAD 2021-10-18 14:13:37 -04:00