Commit Graph

7889 Commits

Author SHA1 Message Date
Emanuele Giuseppe Esposito
f2740a8d85 KVM: nSVM: introduce svm->nested.save to cache save area before checks
This is useful in the next patch, to keep a saved copy
of vmcb12 registers and pass it around more easily.

Instead of blindly copying everything, we just copy EFER, CR0, CR3, CR4,
DR6 and DR7 which are needed by the VMRUN checks.  If more fields will
need to be checked, it will be quite obvious to see that they must be added
in struct vmcb_save_area_cached and in nested_copy_vmcb_save_to_cache().

__nested_copy_vmcb_save_to_cache() takes a vmcb_save_area_cached
parameter, which is useful in order to save the state to a local
variable.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Message-Id: <20211103140527.752797-3-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:24:38 -05:00
Emanuele Giuseppe Esposito
907afa48e9 KVM: nSVM: move nested_vmcb_check_cr3_cr4 logic in nested_vmcb_valid_sregs
Inline nested_vmcb_check_cr3_cr4 as it is not called by anyone else.
Doing so simplifies next patches.

Signed-off-by: Emanuele Giuseppe Esposito <eesposit@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Message-Id: <20211103140527.752797-2-eesposit@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:24:38 -05:00
Maciej S. Szmigiero
f4209439b5 KVM: Optimize gfn lookup in kvm_zap_gfn_range()
Introduce a memslots gfn upper bound operation and use it to optimize
kvm_zap_gfn_range().
This way this handler can do a quick lookup for intersecting gfns and won't
have to do a linear scan of the whole memslot set.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <ef242146a87a335ee93b441dcf01665cb847c902.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:35 -05:00
Maciej S. Szmigiero
a54d806688 KVM: Keep memslots in tree-based structures instead of array-based ones
The current memslot code uses a (reverse gfn-ordered) memslot array for
keeping track of them.

Because the memslot array that is currently in use cannot be modified
every memslot management operation (create, delete, move, change flags)
has to make a copy of the whole array so it has a scratch copy to work on.

Strictly speaking, however, it is only necessary to make copy of the
memslot that is being modified, copying all the memslots currently present
is just a limitation of the array-based memslot implementation.

Two memslot sets, however, are still needed so the VM continues to run
on the currently active set while the requested operation is being
performed on the second, currently inactive one.

In order to have two memslot sets, but only one copy of actual memslots
it is necessary to split out the memslot data from the memslot sets.

The memslots themselves should be also kept independent of each other
so they can be individually added or deleted.

These two memslot sets should normally point to the same set of
memslots. They can, however, be desynchronized when performing a
memslot management operation by replacing the memslot to be modified
by its copy.  After the operation is complete, both memslot sets once
again point to the same, common set of memslot data.

This commit implements the aforementioned idea.

For tracking of gfns an ordinary rbtree is used since memslots cannot
overlap in the guest address space and so this data structure is
sufficient for ensuring that lookups are done quickly.

The "last used slot" mini-caches (both per-slot set one and per-vCPU one),
that keep track of the last found-by-gfn memslot, are still present in the
new code.

Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <17c0cf3663b760a0d3753d4ac08c0753e941b811.1638817641.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:34 -05:00
Maciej S. Szmigiero
ed922739c9 KVM: Use interval tree to do fast hva lookup in memslots
The current memslots implementation only allows quick binary search by gfn,
quick lookup by hva is not possible - the implementation has to do a linear
scan of the whole memslots array, even though the operation being performed
might apply just to a single memslot.

This significantly hurts performance of per-hva operations with higher
memslot counts.

Since hva ranges can overlap between memslots an interval tree is needed
for tracking them.

[sean: handle interval tree updates in kvm_replace_memslot()]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <d66b9974becaa9839be9c4e1a5de97b177b4ac20.1638817640.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:32 -05:00
Maciej S. Szmigiero
f5756029ee KVM: x86: Use nr_memslot_pages to avoid traversing the memslots array
There is no point in recalculating from scratch the total number of pages
in all memslots each time a memslot is created or deleted.  Use KVM's
cached nr_memslot_pages to compute the default max number of MMU pages.

Note that even with nr_memslot_pages capped at ULONG_MAX we can't safely
multiply it by KVM_PERMILLE_MMU_PAGES (20) since this operation can
possibly overflow an unsigned long variable.

Write this "* 20 / 1000" operation as "/ 50" instead to avoid such
overflow.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[sean: use common KVM field and rework changelog accordingly]
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <d14c5a24535269606675437d5602b7dac4ad8c0e.1638817640.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:29 -05:00
Maciej S. Szmigiero
e0c2b6338a KVM: x86: Don't call kvm_mmu_change_mmu_pages() if the count hasn't changed
There is no point in calling kvm_mmu_change_mmu_pages() for memslot
operations that don't change the total page count, so do it just for
KVM_MR_CREATE and KVM_MR_DELETE.

Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <9e56b7616a11f5654e4ab486b3237366b7ba9f2a.1638817640.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:28 -05:00
Sean Christopherson
77aedf26fe KVM: x86: Don't assume old/new memslots are non-NULL at memslot commit
Play nice with a NULL @old or @new when handling memslot updates so that
common KVM can pass NULL for one or the other in CREATE and DELETE cases
instead of having to synthesize a dummy memslot.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <2eb7788adbdc2bc9a9c5f86844dd8ee5c8428732.1638817640.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:26 -05:00
Sean Christopherson
6a99c6e3f5 KVM: Stop passing kvm_userspace_memory_region to arch memslot hooks
Drop the @mem param from kvm_arch_{prepare,commit}_memory_region() now
that its use has been removed in all architectures.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <aa5ed3e62c27e881d0d8bc0acbc1572bc336dc19.1638817640.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:25 -05:00
Sean Christopherson
9d7d18ee3f KVM: x86: Use "new" memslot instead of userspace memory region
Get the number of pages directly from the new memslot instead of
computing the same from the userspace memory region when allocating
memslot metadata.  This will allow a future patch to drop @mem.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <ef44892eb615f5c28e682bbe06af96aff9ce2a9f.1638817639.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:23 -05:00
Sean Christopherson
537a17b314 KVM: Let/force architectures to deal with arch specific memslot data
Pass the "old" slot to kvm_arch_prepare_memory_region() and force arch
code to handle propagating arch specific data from "new" to "old" when
necessary.  This is a baby step towards dynamically allocating "new" from
the get go, and is a (very) minor performance boost on x86 due to not
unnecessarily copying arch data.

For PPC HV, copy the rmap in the !CREATE and !DELETE paths, i.e. for MOVE
and FLAGS_ONLY.  This is functionally a nop as the previous behavior
would overwrite the pointer for CREATE, and eventually discard/ignore it
for DELETE.

For x86, copy the arch data only for FLAGS_ONLY changes.  Unlike PPC HV,
x86 needs to reallocate arch data in the MOVE case as the size of x86's
allocations depend on the alignment of the memslot's gfn.

Opportunistically tweak kvm_arch_prepare_memory_region()'s param order to
match the "commit" prototype.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
[mss: add missing RISCV kvm_arch_prepare_memory_region() change]
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <67dea5f11bbcfd71e3da5986f11e87f5dd4013f9.1638817639.git.maciej.szmigiero@oracle.com>
2021-12-08 04:24:20 -05:00
Marc Zyngier
46808a4cb8 KVM: Use 'unsigned long' as kvm_for_each_vcpu()'s index
Everywhere we use kvm_for_each_vpcu(), we use an int as the vcpu
index. Unfortunately, we're about to move rework the iterator,
which requires this to be upgrade to an unsigned long.

Let's bite the bullet and repaint all of it in one go.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-7-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:24:15 -05:00
Marc Zyngier
27592ae8db KVM: Move wiping of the kvm->vcpus array to common code
All architectures have similar loops iterating over the vcpus,
freeing one vcpu at a time, and eventually wiping the reference
off the vcpus array. They are also inconsistently taking
the kvm->lock mutex when wiping the references from the array.

Make this code common, which will simplify further changes.
The locking is dropped altogether, as this should only be called
when there is no further references on the kvm structure.

Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com>
Signed-off-by: Marc Zyngier <maz@kernel.org>
Message-Id: <20211116160403.4074052-2-maz@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-08 04:24:13 -05:00
Tom Lendacky
ad5b353240 KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure
Currently, an SEV-ES guest is terminated if the validation of the VMGEXIT
exit code or exit parameters fails.

The VMGEXIT instruction can be issued from userspace, even though
userspace (likely) can't update the GHCB. To prevent userspace from being
able to kill the guest, return an error through the GHCB when validation
fails rather than terminating the guest. For cases where the GHCB can't be
updated (e.g. the GHCB can't be mapped, etc.), just return back to the
guest.

The new error codes are documented in the lasest update to the GHCB
specification.

Fixes: 291bd20d5d ("KVM: SVM: Add initial support for a VMGEXIT VMEXIT")
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Message-Id: <b57280b5562893e2616257ac9c2d4525a9aeeb42.1638471124.git.thomas.lendacky@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-05 03:02:04 -05:00
Sean Christopherson
a655276a59 KVM: SEV: Fall back to vmalloc for SEV-ES scratch area if necessary
Use kvzalloc() to allocate KVM's buffer for SEV-ES's GHCB scratch area so
that KVM falls back to __vmalloc() if physically contiguous memory isn't
available.  The buffer is purely a KVM software construct, i.e. there's
no need for it to be physically contiguous.

Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-05 03:02:03 -05:00
Sean Christopherson
75236f5f22 KVM: SEV: Return appropriate error codes if SEV-ES scratch setup fails
Return appropriate error codes if setting up the GHCB scratch area for an
SEV-ES guest fails.  In particular, returning -EINVAL instead of -ENOMEM
when allocating the kernel buffer could be confusing as userspace would
likely suspect a guest issue.

Fixes: 8f423a80d2 ("KVM: SVM: Support MMIO for an SEV-ES guest")
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211109222350.2266045-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-05 03:02:03 -05:00
Sean Christopherson
a955cad84c KVM: x86/mmu: Retry page fault if root is invalidated by memslot update
Bail from the page fault handler if the root shadow page was obsoleted by
a memslot update.  Do the check _after_ acuiring mmu_lock, as the TDP MMU
doesn't rely on the memslot/MMU generation, and instead relies on the
root being explicit marked invalid by kvm_mmu_zap_all_fast(), which takes
mmu_lock for write.

For the TDP MMU, inserting a SPTE into an obsolete root can leak a SP if
kvm_tdp_mmu_zap_invalidated_roots() has already zapped the SP, i.e. has
moved past the gfn associated with the SP.

For other MMUs, the resulting behavior is far more convoluted, though
unlikely to be truly problematic.  Installing SPs/SPTEs into the obsolete
root isn't directly problematic, as the obsolete root will be unloaded
and dropped before the vCPU re-enters the guest.  But because the legacy
MMU tracks shadow pages by their role, any SP created by the fault can
can be reused in the new post-reload root.  Again, that _shouldn't_ be
problematic as any leaf child SPTEs will be created for the current/valid
memslot generation, and kvm_mmu_get_page() will not reuse child SPs from
the old generation as they will be flagged as obsolete.  But, given that
continuing with the fault is pointess (the root will be unloaded), apply
the check to all MMUs.

Fixes: b7cccd397f ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: stable@vger.kernel.org
Cc: Ben Gardon <bgardon@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:12:12 -05:00
Dan Carpenter
bfbb307c62 KVM: VMX: Set failure code in prepare_vmcs02()
The error paths in the prepare_vmcs02() function are supposed to set
*entry_failure_code but this path does not.  It leads to using an
uninitialized variable in the caller.

Fixes: 71f7347025 ("KVM: nVMX: Load GUEST_IA32_PERF_GLOBAL_CTRL MSR on VM-Entry")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Message-Id: <20211130125337.GB24578@kili>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:12:11 -05:00
Paolo Bonzini
ef8b4b7203 KVM: ensure APICv is considered inactive if there is no APIC
kvm_vcpu_apicv_active() returns false if a virtual machine has no in-kernel
local APIC, however kvm_apicv_activated might still be true if there are
no reasons to disable APICv; in fact it is quite likely that there is none
because APICv is inhibited by specific configurations of the local APIC
and those configurations cannot be programmed.  This triggers a WARN:

   WARN_ON_ONCE(kvm_apicv_activated(vcpu->kvm) != kvm_vcpu_apicv_active(vcpu));

To avoid this, introduce another cause for APICv inhibition, namely the
absence of an in-kernel local APIC.  This cause is enabled by default,
and is dropped by either KVM_CREATE_IRQCHIP or the enabling of
KVM_CAP_IRQCHIP_SPLIT.

Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Fixes: ee49a89329 ("KVM: x86: Move SVM's APICv sanity check to common x86", 2021-10-22)
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Tested-by: Ignat Korchagin <ignat@cloudflare.com>
Message-Id: <20211130123746.293379-1-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:12:11 -05:00
Like Xu
cb1d220da0 KVM: x86/pmu: Fix reserved bits for AMD PerfEvtSeln register
If we run the following perf command in an AMD Milan guest:

  perf stat \
  -e cpu/event=0x1d0/ \
  -e cpu/event=0x1c7/ \
  -e cpu/umask=0x1f,event=0x18e/ \
  -e cpu/umask=0x7,event=0x18e/ \
  -e cpu/umask=0x18,event=0x18e/ \
  ./workload

dmesg will report a #GP warning from an unchecked MSR access
error on MSR_F15H_PERF_CTLx.

This is because according to APM (Revision: 4.03) Figure 13-7,
the bits [35:32] of AMD PerfEvtSeln register is a part of the
event select encoding, which extends the EVENT_SELECT field
from 8 bits to 12 bits.

Opportunistically update pmu->reserved_bits for reserved bit 19.

Reported-by: Jim Mattson <jmattson@google.com>
Fixes: ca724305a2 ("KVM: x86/vPMU: Implement AMD vPMU code for KVM")
Signed-off-by: Like Xu <likexu@tencent.com>
Message-Id: <20211118130320.95997-1-likexu@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-02 04:11:50 -05:00
Paolo Bonzini
7cfc5c653b KVM: fix avic_set_running for preemptable kernels
avic_set_running() passes the current CPU to avic_vcpu_load(), albeit
via vcpu->cpu rather than smp_processor_id().  If the thread is migrated
while avic_set_running runs, the call to avic_vcpu_load() can use a stale
value for the processor id.  Avoid this by blocking preemption over the
entire execution of avic_set_running().

Reported-by: Sean Christopherson <seanjc@google.com>
Fixes: 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 07:40:48 -05:00
Paolo Bonzini
e90e51d5f0 KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled
There is nothing to synchronize if APICv is disabled, since neither
other vCPUs nor assigned devices can set PIR.ON.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 07:40:47 -05:00
Paolo Bonzini
c9d61dcb0b KVM: SEV: accept signals in sev_lock_two_vms
Generally, kvm->lock is not taken for a long time, but
sev_lock_two_vms is different: it takes vCPU locks
inside, so userspace can hold it back just by calling
a vCPU ioctl.  Play it safe and use mutex_lock_killable.

Message-Id: <20211123005036.2954379-13-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:15 -05:00
Paolo Bonzini
10a37929ef KVM: SEV: do not take kvm->lock when destroying
Taking the lock is useless since there are no other references,
and there are already accesses (e.g. to sev->enc_context_owner)
that do not take it.  So get rid of it.

Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-12-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:14 -05:00
Paolo Bonzini
17d44a96f0 KVM: SEV: Prohibit migration of a VM that has mirrors
VMs that mirror an encryption context rely on the owner to keep the
ASID allocated.  Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
would cause a dangling ASID:

1. copy context from A to B (gets ref to A)
2. move context from A to L (moves ASID from A to L)
3. close L (releases ASID from L, B still references it)

The right way to do the handoff instead is to create a fresh mirror VM
on the destination first:

1. copy context from A to B (gets ref to A)
[later] 2. close B (releases ref to A)
3. move context from A to L (moves ASID from A to L)
4. copy context from L to M

So, catch the situation by adding a count of how many VMs are
mirroring this one's encryption context.

Fixes: 0b020f5af0 ("KVM: SEV: Add support for SEV-ES intra host migration")
Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:14 -05:00
Paolo Bonzini
bf42b02b19 KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked
Now that we have a facility to lock two VMs with deadlock
protection, use it for the creation of mirror VMs as well.  One of
COPY_ENC_CONTEXT_FROM(dst, src) and COPY_ENC_CONTEXT_FROM(src, dst)
would always fail, so the combination is nonsensical and it is okay to
return -EBUSY if it is attempted.

This sidesteps the question of what happens if a VM is
MOVE_ENC_CONTEXT_FROM'd at the same time as it is
COPY_ENC_CONTEXT_FROM'd: the locking prevents that from
happening.

Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-10-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:13 -05:00
Paolo Bonzini
642525e3bd KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
Allow intra-host migration of a mirror VM; the destination VM will be
a mirror of the same ASID as the source.

Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-8-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:12 -05:00
Paolo Bonzini
2b347a3878 KVM: SEV: initialize regions_list of a mirror VM
This was broken before the introduction of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM,
but technically harmless because the region list was unused for a mirror
VM.  However, it is untidy and it now causes a NULL pointer access when
attempting to move the encryption context of a mirror VM.

Fixes: 54526d1fd5 ("KVM: x86: Support KVM VMs sharing SEV context")
Message-Id: <20211123005036.2954379-7-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:12 -05:00
Paolo Bonzini
501b580c02 KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
Encapsulate the handling of the migration_in_progress flag for both VMs in
two functions sev_lock_two_vms and sev_unlock_two_vms.  It does not matter
if KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM locks the destination struct kvm a bit
later, and this change 1) keeps the cleanup chain of labels smaller 2)
makes it possible for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM to reuse the logic.

Cc: Peter Gonda <pgonda@google.com>
Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-6-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:11 -05:00
Paolo Bonzini
4674164f0a KVM: SEV: do not use list_replace_init on an empty list
list_replace_init cannot be used if the source is an empty list,
because "new->next->prev = new" will overwrite "old->next":

				new				old
				prev = new, next = new		prev = old, next = old
new->next = old->next		prev = new, next = old		prev = old, next = old
new->next->prev = new		prev = new, next = old		prev = old, next = new
new->prev = old->prev		prev = old, next = old		prev = old, next = old
new->next->prev = new		prev = old, next = old		prev = new, next = new

The desired outcome instead would be to leave both old and new the same
as they were (two empty circular lists).  Use list_cut_before, which
already has the necessary check and is documented to discard the
previous contents of the list that will hold the result.

Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:54:11 -05:00
Paolo Bonzini
53b7ca1a35 KVM: x86: Use a stable condition around all VT-d PI paths
Currently, checks for whether VT-d PI can be used refer to the current
status of the feature in the current vCPU; or they more or less pick
vCPU 0 in case a specific vCPU is not available.

However, these checks do not attempt to synchronize with changes to
the IRTE.  In particular, there is no path that updates the IRTE when
APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU
that has APICv disabled, if the wakeup occurs because of an IRTE
that points to a posted interrupt.

To fix this, always go through the VT-d PI path as long as there are
assigned devices and APICv is available on both the host and the VM side.
Since the relevant condition was copied over three times, take the hint
and factor it into a separate function.

Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: stable@vger.kernel.org
Reviewed-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-5-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:53:14 -05:00
Paolo Bonzini
37c4dbf337 KVM: x86: check PIR even for vCPUs with disabled APICv
The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even
if APICv is disabled on the vCPU that receives it.  In that case, the
interrupt will just cause a vmexit and leave the ON bit set together
with the PIR bit corresponding to the interrupt.

Right now, the interrupt would not be delivered until APICv is re-enabled.
However, fixing this is just a matter of always doing the PIR->IRR
synchronization, even if the vCPU has temporarily disabled APICv.

This is not a problem for performance, or if anything it is an
improvement.  First, in the common case where vcpu->arch.apicv_active is
true, one fewer check has to be performed.  Second, static_call_cond will
elide the function call if APICv is not present or disabled.  Finally,
in the case for AMD hardware we can remove the sync_pir_to_irr callback:
it is only needed for apic_has_interrupt_for_ppr, and that function
already has a fallback for !APICv.

Cc: stable@vger.kernel.org
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:52:39 -05:00
Paolo Bonzini
7e1901f6c8 KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
If APICv is disabled for this vCPU, assigned devices may still attempt to
post interrupts.  In that case, we need to cancel the vmentry and deliver
the interrupt with KVM_REQ_EVENT.  Extend the existing code that handles
injection of L1 interrupts into L2 to cover this case as well.

vmx_hwapic_irr_update is only called when APICv is active so it would be
confusing to add a check for vcpu->arch.apicv_active in there.  Instead,
just use vmx_set_rvi directly in vmx_sync_pir_to_irr.

Cc: stable@vger.kernel.org
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123004311.2954158-3-pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:51:58 -05:00
Sean Christopherson
f47491d7f3 KVM: x86/mmu: Handle "default" period when selectively waking kthread
Account for the '0' being a default, "let KVM choose" period, when
determining whether or not the recovery worker needs to be awakened in
response to userspace reducing the period.  Failure to do so results in
the worker not being awakened properly, e.g. when changing the period
from '0' to any small-ish value.

Fixes: 4dfe4f40d8 ("kvm: x86: mmu: Make NX huge page recovery period configurable")
Cc: stable@vger.kernel.org
Cc: Junaid Shahid <junaids@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015706.3830341-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:09:27 -05:00
Paolo Bonzini
28f091bc2f KVM: MMU: shadow nested paging does not have PKU
Initialize the mask for PKU permissions as if CR4.PKE=0, avoiding
incorrect interpretations of the nested hypervisor's page tables.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:09:26 -05:00
Sean Christopherson
4b85c921cd KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path
Drop the "flush" param and return values to/from the TDP MMU's helper for
zapping collapsible SPTEs.  Because the helper runs with mmu_lock held
for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic
zap handles the necessary remote TLB flush.

Similarly, because mmu_lock is dropped and re-acquired between zapping
legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must
handle remote TLB flushes from the legacy MMU before calling into the TDP
MMU.

Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:09:25 -05:00
Sean Christopherson
7533377215 KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
Use the yield-safe variant of the TDP MMU iterator when handling an
unmapping event from the MMU notifier, as most occurences of the event
allow yielding.

Fixes: e1eed5847b ("KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015008.3780032-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-30 03:09:25 -05:00
Lai Jiangshan
05b29633c7 KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()
INVLPG operates on guest virtual address, which are represented by
vcpu->arch.walk_mmu.  In nested virtualization scenarios,
kvm_mmu_invlpg() was using the wrong MMU structure; if L2's invlpg were
emulated by L0 (in practice, it hardly happen) when nested two-dimensional
paging is enabled, the call to ->tlb_flush_gva() would be skipped and
the hardware TLB entry would not be invalidated.

Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-5-jiangshanlai@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 08:14:21 -05:00
Lai Jiangshan
12ec33a705 KVM: X86: Fix when shadow_root_level=5 && guest root_level<4
If the is an L1 with nNPT in 32bit, the shadow walk starts with
pae_root.

Fixes: a717a780fc ("KVM: x86/mmu: Support shadowing NPT when 5-level paging is enabled in host)
Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com>
Message-Id: <20211124122055.64424-2-jiangshanlai@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 08:14:20 -05:00
Vitaly Kuznetsov
feb627e8d6 KVM: x86: Forbid KVM_SET_CPUID{,2} after KVM_RUN
Commit 63f5a1909f ("KVM: x86: Alert userspace that KVM_SET_CPUID{,2}
after KVM_RUN is broken") officially deprecated KVM_SET_CPUID{,2} ioctls
after first successful KVM_RUN and promissed to make this sequence forbiden
in 5.16. It's time to fulfil the promise.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Message-Id: <20211122175818.608220-3-vkuznets@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 08:14:20 -05:00
Sean Christopherson
712494de96 KVM: nVMX: Emulate guest TLB flush on nested VM-Enter with new vpid12
Fully emulate a guest TLB flush on nested VM-Enter which changes vpid12,
i.e. L2's VPID, instead of simply doing INVVPID to flush real hardware's
TLB entries for vpid02.  From L1's perspective, changing L2's VPID is
effectively a TLB flush unless "hardware" has previously cached entries
for the new vpid12.  Because KVM tracks only a single vpid12, KVM doesn't
know if the new vpid12 has been used in the past and so must treat it as
a brand new, never been used VPID, i.e. must assume that the new vpid12
represents a TLB flush from L1's perspective.

For example, if L1 and L2 share a CR3, the first VM-Enter to L2 (with a
VPID) is effectively a TLB flush as hardware/KVM has never seen vpid12
and thus can't have cached entries in the TLB for vpid12.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 5c614b3583 ("KVM: nVMX: nested VPID emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 07:11:29 -05:00
Sean Christopherson
40e5f90804 KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit
Like KVM_REQ_TLB_FLUSH_CURRENT, the GUEST variant needs to be serviced at
nested transitions, as KVM doesn't track requests for L1 vs L2.  E.g. if
there's a pending flush when a nested VM-Exit occurs, then the flush was
requested in the context of L2 and needs to be handled before switching
to L1, otherwise the flush for L2 would effectiely be lost.

Opportunistically add a helper to handle CURRENT and GUEST as a pair, the
logic for when they need to be serviced is identical as both requests are
tied to L1 vs. L2, the only difference is the scope of the flush.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 07:07:49 -05:00
Sean Christopherson
2b4a5a5d56 KVM: nVMX: Flush current VPID (L1 vs. L2) for KVM_REQ_TLB_FLUSH_GUEST
Flush the current VPID when handling KVM_REQ_TLB_FLUSH_GUEST instead of
always flushing vpid01.  Any TLB flush that is triggered when L2 is
active is scoped to L2's VPID (if it has one), e.g. if L2 toggles CR4.PGE
and L1 doesn't intercept PGE writes, then KVM's emulation of the TLB
flush needs to be applied to L2's VPID.

Reported-by: Lai Jiangshan <jiangshanlai+lkml@gmail.com>
Fixes: 07ffaf343e ("KVM: nVMX: Sync all PGDs on nested transition with shadow paging")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211125014944.536398-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 07:06:58 -05:00
Paolo Bonzini
30d7c5d60a KVM: SEV: expose KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM capability
The capability, albeit present, was never exposed via KVM_CHECK_EXTENSION.

Fixes: b56639318b ("KVM: SEV: Add support for SEV intra host migration")
Cc: Peter Gonda <pgonda@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 06:43:30 -05:00
Paolo Bonzini
78311a5140 KVM: x86: ignore APICv if LAPIC is not enabled
Synchronize the two calls to kvm_x86_sync_pir_to_irr.  The one
in the reenter-guest fast path invoked the callback unconditionally
even if LAPIC is present but disabled.  In this case, there are
no interrupts to deliver, and therefore posted interrupts can
be ignored.

Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 06:43:29 -05:00
Paolo Bonzini
8503fea676 KVM: VMX: do not use uninitialized gfn_to_hva_cache
An uninitialized gfn_to_hva_cache has ghc->len == 0, which causes
the accessors to croak very loudly.  While a BUG_ON is definitely
_too_ loud and a bug on its own, there is indeed an issue of using
the caches in such a way that they could not have been initialized,
because ghc->gpa == 0 might match and thus kvm_gfn_to_hva_cache_init
would not be called.

For the vmcs12_cache, the solution is simply to invoke
kvm_gfn_to_hva_cache_init unconditionally: we already know
that the cache does not match the current VMCS pointer.
For the shadow_vmcs12_cache, there is no similar condition
that checks the VMCS link pointer, so invalidate the cache
on VMXON.

Fixes: cee66664dc ("KVM: nVMX: Use a gfn_to_hva_cache for vmptrld")
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Reported-by: syzbot+7b7db8bb4db6fd5e157b@syzkaller.appspotmail.com
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-26 06:43:28 -05:00
Hou Wenlong
8ed716ca7d KVM: x86/mmu: Pass parameter flush as false in kvm_tdp_mmu_zap_collapsible_sptes()
Since tlb flush has been done for legacy MMU before
kvm_tdp_mmu_zap_collapsible_sptes(), so the parameter flush
should be false for kvm_tdp_mmu_zap_collapsible_sptes().

Fixes: e2209710cc ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <21453a1d2533afb6e59fb6c729af89e771ff2e76.1637140154.git.houwenlong93@linux.alibaba.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 07:05:58 -05:00
Hou Wenlong
c7785d85b6 KVM: x86/mmu: Skip tlb flush if it has been done in zap_gfn_range()
If the parameter flush is set, zap_gfn_range() would flush remote tlb
when yield, then tlb flush is not needed outside. So use the return
value of zap_gfn_range() directly instead of OR on it in
kvm_unmap_gfn_range() and kvm_tdp_mmu_unmap_gfn_range().

Fixes: 3039bcc744 ("KVM: Move x86's MMU notifier memslot walkers to generic code")
Signed-off-by: Hou Wenlong <houwenlong93@linux.alibaba.com>
Message-Id: <5e16546e228877a4d974f8c0e448a93d52c7a5a9.1637140154.git.houwenlong93@linux.alibaba.com>
Reviewed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 07:05:57 -05:00
Juergen Gross
9dba4d24cb x86/kvm: remove unused ack_notifier callbacks
Commit f52447261b ("KVM: irq ack notification") introduced an
ack_notifier() callback in struct kvm_pic and in struct kvm_ioapic
without using them anywhere. Remove those callbacks again.

Signed-off-by: Juergen Gross <jgross@suse.com>
Message-Id: <20211117071617.19504-1-jgross@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 07:05:57 -05:00
Ben Gardon
574c3c55e9 KVM: x86/mmu: Fix TLB flush range when handling disconnected pt
When recursively clearing out disconnected pts, the range based TLB
flush in handle_removed_tdp_mmu_page uses the wrong starting GFN,
resulting in the flush mostly missing the affected range. Fix this by
using base_gfn for the flush.

In response to feedback from David Matlack on the RFC version of this
patch, also move a few definitions into the for loop in the function to
prevent unintended references to them in the future.

Fixes: a066e61f13 ("KVM: x86/mmu: Factor out handling of removed page tables")
CC: stable@vger.kernel.org
Signed-off-by: Ben Gardon <bgardon@google.com>
Message-Id: <20211115211704.2621644-1-bgardon@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-11-18 02:15:19 -05:00