KVK generic changes for 6.12:
- Fix a bug that results in KVM prematurely exiting to userspace for coalesced
MMIO/PIO in many cases, clean up the related code, and add a testcase.
- Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_
the gpa+len crosses a page boundary, which thankfully is guaranteed to not
happen in the current code base. Add WARNs in more helpers that read/write
guest memory to detect similar bugs.
When reading or writing a guest page, WARN and bail if offset+len would
result in a read to a different page so that KVM bugs are more likely to
be detected, and so that any such bugs are less likely to escalate to an
out-of-bounds access. E.g. if userspace isn't using guard pages and the
target page is at the end of a memslot.
Note, KVM already hardens itself in similar APIs, e.g. in the "cached"
variants, it's just the vanilla APIs that are playing with fire.
Link: https://lore.kernel.org/r/20240829191413.900740-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Pass "seg" instead of "len" when writing guest memory in kvm_clear_guest(),
as "seg" holds the number of bytes to write for the current page, while
"len" holds the total bytes remaining.
Luckily, all users of kvm_clear_guest() are guaranteed to not cross a page
boundary, and so the bug is unhittable in the current code base.
Fixes: 2f5414423e ("KVM: remove kvm_clear_guest_page")
Reported-by: zyr_ms@outlook.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219104
Link: https://lore.kernel.org/r/20240829191413.900740-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add arch hooks that are invoked when KVM enables/disable virtualization.
x86 will use the hooks to register an "emergency disable" callback, which
is essentially an x86-specific shutdown notifier that is used when the
kernel is doing an emergency reboot/shutdown/kexec.
Add comments for the declarations to help arch code understand exactly
when the callbacks are invoked. Alternatively, the APIs themselves could
communicate most of the same info, but kvm_arch_pre_enable_virtualization()
and kvm_arch_post_disable_virtualization() are a bit cumbersome, and make
it a bit less obvious that they are intended to be implemented as a pair.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-9-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add an on-by-default module param, enable_virt_at_load, to let userspace
force virtualization to be enabled in hardware when KVM is initialized,
i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization
during KVM initialization allows userspace to avoid the additional latency
when creating/destroying the first/last VM (or more specifically, on the
0=>1 and 1=>0 edges of creation/destruction).
Now that KVM uses the cpuhp framework to do per-CPU enabling, the latency
could be non-trivial as the cpuhup bringup/teardown is serialized across
CPUs, e.g. the latency could be problematic for use case that need to spin
up VMs quickly.
Prior to commit 10474ae894 ("KVM: Activate Virtualization On Demand"),
KVM _unconditionally_ enabled virtualization during load, i.e. there's no
fundamental reason KVM needs to dynamically toggle virtualization. These
days, the only known argument for not enabling virtualization is to allow
KVM to be autoloaded without blocking other out-of-tree hypervisors, and
such use cases can simply change the module param, e.g. via command line.
Note, the aforementioned commit also mentioned that enabling SVM (AMD's
virtualization extensions) can result in "using invalid TLB entries".
It's not clear whether the changelog was referring to a KVM bug, a CPU
bug, or something else entirely. Regardless, leaving virtualization off
by default is not a robust "fix", as any protection provided is lost the
instant userspace creates the first VM.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-8-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename the per-CPU hooks used to enable virtualization in hardware to
align with the KVM-wide helpers in kvm_main.c, and to better capture that
the callbacks are invoked on every online CPU.
No functional change intended.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Message-ID: <20240830043600.127750-5-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rename the various functions (and a variable) that enable virtualization
to prepare for upcoming changes, and to clean up artifacts of KVM's
previous behavior, which required manually juggling locks around
kvm_usage_count.
Drop the "nolock" qualifier from per-CPU functions now that there are no
"nolock" implementations of the "all" variants, i.e. now that calling a
non-nolock function from a nolock function isn't confusing (unlike this
sentence).
Drop "all" from the outer helpers as they no longer manually iterate
over all CPUs, and because it might not be obvious what "all" refers to.
In lieu of the above qualifiers, append "_cpu" to the end of the functions
that are per-CPU helpers for the outer APIs.
Opportunistically prepend "kvm" to all functions to help make it clear
that they are KVM helpers, but mostly because there's no reason not to.
Lastly, use "virtualization" instead of "hardware", because while the
functions do enable virtualization in hardware, there are a _lot_ of
things that KVM enables in hardware.
Defer renaming the arch hooks to future patches, purely to reduce the
amount of churn in a single commit.
Reviewed-by: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-4-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Register KVM's cpuhp and syscore callback when enabling virtualization
in hardware instead of registering the callbacks during initialization,
and let the CPU up/down framework invoke the inner enable/disable
functions. Registering the callbacks during initialization makes things
more complex than they need to be, as KVM needs to be very careful about
handling races between enabling CPUs being onlined/offlined and hardware
being enabled/disabled.
Intel TDX support will require KVM to enable virtualization during KVM
initialization, i.e. will add another wrinkle to things, at which point
sorting out the potential races with kvm_usage_count would become even
more complex.
Note, using the cpuhp framework has a subtle behavioral change: enabling
will be done serially across all CPUs, whereas KVM currently sends an IPI
to all CPUs in parallel. While serializing virtualization enabling could
create undesirable latency, the issue is limited to the 0=>1 transition of
VM creation. And even that can be mitigated, e.g. by letting userspace
force virtualization to be enabled when KVM is initialized.
Cc: Chao Gao <chao.gao@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-3-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock
on x86 due to a chain of locks and SRCU synchronizations. Translating the
below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on
CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the
fairness of r/w semaphores).
CPU0 CPU1 CPU2
1 lock(&kvm->slots_lock);
2 lock(&vcpu->mutex);
3 lock(&kvm->srcu);
4 lock(cpu_hotplug_lock);
5 lock(kvm_lock);
6 lock(&kvm->slots_lock);
7 lock(cpu_hotplug_lock);
8 sync(&kvm->srcu);
Note, there are likely more potential deadlocks in KVM x86, e.g. the same
pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with
__kvmclock_cpufreq_notifier():
cpuhp_cpufreq_online()
|
-> cpufreq_online()
|
-> cpufreq_gov_performance_limits()
|
-> __cpufreq_driver_target()
|
-> __target_index()
|
-> cpufreq_freq_transition_begin()
|
-> cpufreq_notify_transition()
|
-> ... __kvmclock_cpufreq_notifier()
But, actually triggering such deadlocks is beyond rare due to the
combination of dependencies and timings involved. E.g. the cpufreq
notifier is only used on older CPUs without a constant TSC, mucking with
the NX hugepage mitigation while VMs are running is very uncommon, and
doing so while also onlining/offlining a CPU (necessary to generate
contention on cpu_hotplug_lock) would be even more unusual.
The most robust solution to the general cpu_hotplug_lock issue is likely
to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq
notifier doesn't to take kvm_lock. For now, settle for fixing the most
blatant deadlock, as switching to an RCU-protected list is a much more
involved change, but add a comment in locking.rst to call out that care
needs to be taken when walking holding kvm_lock and walking vm_list.
======================================================
WARNING: possible circular locking dependency detected
6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S O
------------------------------------------------------
tee/35048 is trying to acquire lock:
ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm]
but task is already holding lock:
ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm]
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #3 (kvm_lock){+.+.}-{3:3}:
__mutex_lock+0x6a/0xb40
mutex_lock_nested+0x1f/0x30
kvm_dev_ioctl+0x4fb/0xe50 [kvm]
__se_sys_ioctl+0x7b/0xd0
__x64_sys_ioctl+0x21/0x30
x64_sys_call+0x15d0/0x2e60
do_syscall_64+0x83/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #2 (cpu_hotplug_lock){++++}-{0:0}:
cpus_read_lock+0x2e/0xb0
static_key_slow_inc+0x16/0x30
kvm_lapic_set_base+0x6a/0x1c0 [kvm]
kvm_set_apic_base+0x8f/0xe0 [kvm]
kvm_set_msr_common+0x9ae/0xf80 [kvm]
vmx_set_msr+0xa54/0xbe0 [kvm_intel]
__kvm_set_msr+0xb6/0x1a0 [kvm]
kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm]
kvm_vcpu_ioctl+0x485/0x5b0 [kvm]
__se_sys_ioctl+0x7b/0xd0
__x64_sys_ioctl+0x21/0x30
x64_sys_call+0x15d0/0x2e60
do_syscall_64+0x83/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (&kvm->srcu){.+.+}-{0:0}:
__synchronize_srcu+0x44/0x1a0
synchronize_srcu_expedited+0x21/0x30
kvm_swap_active_memslots+0x110/0x1c0 [kvm]
kvm_set_memslot+0x360/0x620 [kvm]
__kvm_set_memory_region+0x27b/0x300 [kvm]
kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm]
kvm_vm_ioctl+0x295/0x650 [kvm]
__se_sys_ioctl+0x7b/0xd0
__x64_sys_ioctl+0x21/0x30
x64_sys_call+0x15d0/0x2e60
do_syscall_64+0x83/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&kvm->slots_lock){+.+.}-{3:3}:
__lock_acquire+0x15ef/0x2e30
lock_acquire+0xe0/0x260
__mutex_lock+0x6a/0xb40
mutex_lock_nested+0x1f/0x30
set_nx_huge_pages+0x179/0x1e0 [kvm]
param_attr_store+0x93/0x100
module_attr_store+0x22/0x40
sysfs_kf_write+0x81/0xb0
kernfs_fop_write_iter+0x133/0x1d0
vfs_write+0x28d/0x380
ksys_write+0x70/0xe0
__x64_sys_write+0x1f/0x30
x64_sys_call+0x281b/0x2e60
do_syscall_64+0x83/0x160
entry_SYSCALL_64_after_hwframe+0x76/0x7e
Cc: Chao Gao <chao.gao@intel.com>
Fixes: 0bf50497f0 ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock")
Cc: stable@vger.kernel.org
Reviewed-by: Kai Huang <kai.huang@intel.com>
Acked-by: Kai Huang <kai.huang@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240830043600.127750-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Fold coalesced_mmio_has_room() into its sole caller, coalesced_mmio_write(),
as it's really just a single line of code, has a goofy return value, and
is unnecessarily brittle.
E.g. if coalesced_mmio_has_room() were to check ring->last directly, or
the caller failed to use READ_ONCE(), KVM would be susceptible to TOCTOU
attacks from userspace.
Opportunistically add a comment explaining why on earth KVM leaves one
entry free, which may not be obvious to readers that aren't familiar with
ring buffers.
No functional change intended.
Reviewed-by: Ilias Stamatis <ilstam@amazon.com>
Cc: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240828181446.652474-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
The following calculation used in coalesced_mmio_has_room() to check
whether the ring buffer is full is wrong and results in premature exits if
the start of the valid entries is in the first half of the ring buffer.
avail = (ring->first - last - 1) % KVM_COALESCED_MMIO_MAX;
if (avail == 0)
/* full */
Because negative values are handled using two's complement, and KVM
computes the result as an unsigned value, the above will get a false
positive if "first < last" and the ring is half-full.
The above might have worked as expected in python for example:
>>> (-86) % 170
84
However it doesn't work the same way in C.
printf("avail: %d\n", (-86) % 170);
printf("avail: %u\n", (-86) % 170);
printf("avail: %u\n", (-86u) % 170u);
Using gcc-11 these print:
avail: -86
avail: 4294967210
avail: 0
For illustration purposes, given a 4-bit integer and a ring size of 0xA
(unsigned), 0xA == 0x1010 == -6, and thus (-6u % 0xA) == 0.
Fix the calculation and allow all but one entries in the buffer to be
used as originally intended.
Note, KVM's behavior is self-healing to some extent, as KVM will allow the
entire buffer to be used if ring->first is beyond the halfway point. In
other words, in the unlikely scenario that a use case benefits from being
able to coalesce more than 86 entries at once, KVM will still provide such
behavior, sometimes.
Note #2, the % operator in C is not the modulo operator but the remainder
operator. Modulo and remainder operators differ with respect to negative
values. But, the relevant values in KVM are all unsigned, so it's a moot
point in this case anyway.
Note #3, this is almost a pure revert of the buggy commit, plus a
READ_ONCE() to provide additional safety. Thue buggy commit justified the
change with "it paves the way for making this function lockless", but it's
not at all clear what was intended, nor is there any evidence that the
buggy code was somehow safer. (a) the fields in question were already
accessed locklessly, from the perspective that they could be modified by
userspace at any time, and (b) the lock guarding the ring itself was
changed, but never dropped, i.e. whatever lockless scheme (SRCU?) was
planned never landed.
Fixes: 105f8d40a7 ("KVM: Calculate available entries in coalesced mmio ring")
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240718193543.624039-2-ilstam@amazon.com
[sean: rework changelog to clarify behavior, call out weirdness of buggy commit]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't
directly emulate instructions for ES/SNP, and instead the guest must
explicitly request emulation. Unless the guest explicitly requests
emulation without accessing memory, ES/SNP relies on KVM creating an MMIO
SPTE, with the subsequent #NPF being reflected into the guest as a #VC.
But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs,
because except for ES/SNP, doing so requires setting reserved bits in the
SPTE, i.e. the SPTE can't be readable while also generating a #VC on
writes. Because KVM never creates MMIO SPTEs and jumps directly to
emulation, the guest never gets a #VC. And since KVM simply resumes the
guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU
into an infinite #NPF loop if the vCPU attempts to write read-only memory.
Disallow read-only memory for all VMs with protected state, i.e. for
upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible
to support read-only memory, as TDX uses EPT Violation #VE to reflect the
fault into the guest, e.g. KVM could configure read-only SPTEs with RX
protections and SUPPRESS_VE=0. But there is no strong use case for
supporting read-only memslots on TDX, e.g. the main historical usage is
to emulate option ROMs, but TDX disallows executing from shared memory.
And if someone comes along with a legitimate, strong use case, the
restriction can always be lifted for TDX.
Don't bother trying to retroactively apply the restriction to SEV-ES
VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't
possibly work for SEV-ES, i.e. disallowing such memslots is really just
means reporting an error to userspace instead of silently hanging vCPUs.
Trying to deal with the ordering between KVM_SEV_INIT and memslot creation
isn't worth the marginal benefit it would provide userspace.
Fixes: 26c44aa9e0 ("KVM: SEV: define VM types for SEV and SEV-ES")
Fixes: 1dfe571c12 ("KVM: SEV: Add initial SEV-SNP support")
Cc: Peter Gonda <pgonda@google.com>
Cc: Michael Roth <michael.roth@amd.com>
Cc: Vishal Annapurve <vannapurve@google.com>
Cc: Ackerly Tng <ackerleytng@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20240809190319.1710470-2-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When hot-unplug a device which has many queues, and guest CPU will has
huge jitter, and unplugging is very slow.
It turns out synchronize_srcu() in irqfd_shutdown() caused the guest
jitter and unplugging latency, so replace synchronize_srcu() with
synchronize_srcu_expedited(), to accelerate the unplugging, and reduce
the guest OS jitter, this accelerates the VM reboot too.
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Message-ID: <20240711121130.38917-1-lirongqing@baidu.com>
[Call it just once in irqfd_resampler_shutdown. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Right now, large folios are not supported in guest_memfd, and therefore the order
used by kvm_gmem_populate() is always 0. In this scenario, using the up-to-date
bit to track prepared-ness is nice and easy because we have one bit available
per page.
In the future, however, we might have large pages that are partially populated;
for example, in the case of SEV-SNP, if a large page has both shared and private
areas inside, it is necessary to populate it at a granularity that is smaller
than that of the guest_memfd's backing store. In that case we will have
to track preparedness at a 4K level, probably as a bitmap.
In preparation for that, do not use explicitly folio_test_uptodate() and
folio_mark_uptodate(). Return the state of the page directly from
__kvm_gmem_get_pfn(), so that it is expected to apply to 2^N pages
with N=*max_order. The function to mark a range as prepared for now
takes just a folio, but is expected to take also an index and order
(or something like that) when large pages are introduced.
Thanks to Michael Roth for pointing out the issue with large pages.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This check is currently performed by sev_gmem_post_populate(), but it
applies to all callers of kvm_gmem_populate(): the point of the function
is that the memory is being encrypted and some work has to be done
on all the gfns in order to encrypt them.
Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior
to invoking the callback, and stop the operation if a shared page
is encountered. Because CONFIG_KVM_PRIVATE_MEM in principle does
not require attributes, this makes kvm_gmem_populate() depend on
CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them).
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE,
KVM code such as kvm_mem_is_private() is written to expect their existence.
Allow using kvm_range_has_memory_attributes() as a multi-page version of
kvm_mem_is_private(), without it breaking later when more attributes are
introduced.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use a guard to simplify early returns, and add two more easy
shortcuts. If the requested attributes are invalid, the attributes
xarray will never show them as set. And if testing a single page,
kvm_get_memory_attributes() is more efficient.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Do not allow populating the same page twice with startup data. In the
case of SEV-SNP, for example, the firmware does not allow it anyway,
since the launch-update operation is only possible on pages that are
still shared in the RMP.
Even if it worked, kvm_gmem_populate()'s callback is meant to have side
effects such as updating launch measurements, and updating the same
page twice is unlikely to have the desired results.
Races between calls to the ioctl are not possible because
kvm_gmem_populate() holds slots_lock and the VM should not be running.
But again, even if this worked on other confidential computing technology,
it doesn't matter to guest_memfd.c whether this is something fishy
such as missing synchronization in userspace, or rather something
intentional. One of the racers wins, and the page is initialized by
either kvm_gmem_prepare_folio() or kvm_gmem_populate().
Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use
the same errno that kvm_gmem_populate() is using.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is enough to return 0 if a guest need not do any preparation.
This is in fact how sev_gmem_prepare() works for non-SNP guests,
and it extends naturally to Intel hosts: the x86 callback for
gmem_prepare is optional and returns 0 if not defined.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is now possible because preparation is done by kvm_gmem_get_pfn()
instead of fallocate(). In practice this is not a limitation, because
even though guest_memfd can be bound to multiple struct kvm, for
hardware implementations of confidential computing only one guest
(identified by an ASID on SEV-SNP, or an HKID on TDX) will be able
to access it.
In the case of intra-host migration (not implemented yet for SEV-SNP,
but we can use SEV-ES as an idea of how it will work), the new struct
kvm inherits the same ASID and preparation need not be repeated.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initializing the contents of the folio on fallocate() is unnecessarily
restrictive. It means that the page is registered with the firmware and
then it cannot be touched anymore. In particular, this loses the
possibility of using fallocate() to pre-allocate the page for SEV-SNP
guests, because kvm_arch_gmem_prepare() then fails.
It's only when the guest actually accesses the page (and therefore
kvm_gmem_get_pfn() is called) that the page must be cleared from any
stale host data and registered with the firmware. The up-to-date flag
is clear if this has to be done (i.e. it is the first access and
kvm_gmem_populate() has not been called).
All in all, there are enough differences between kvm_gmem_get_pfn() and
kvm_gmem_populate(), that it's better to separate the two flows completely.
Extract the bulk of kvm_gmem_get_folio(), which take a folio and end up
setting its up-to-date flag, to a new function kvm_gmem_prepare_folio();
these are now done only by the non-__-prefixed kvm_gmem_get_pfn().
As a bonus, __kvm_gmem_get_pfn() loses its ugly "bool prepare" argument.
One difference is that fallocate(PUNCH_HOLE) can now race with a
page fault. Potentially this causes a page to be prepared and into the
filemap even after fallocate(PUNCH_HOLE). This is harmless, as it can be
fixed by another hole punching operation, and can be avoided by clearing
the private-page attribute prior to invoking fallocate(PUNCH_HOLE).
This way, the page fault will cause an exit to user space.
The previous semantics, where fallocate() could be used to prepare
the pages in advance of running the guest, can be accessed with
KVM_PRE_FAULT_MEMORY.
For now, accessing a page in one VM will attempt to call
kvm_arch_gmem_prepare() in all of those that have bound the guest_memfd.
Cleaning this up is left to a separate patch.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Allow testing the up-to-date flag in the caller without taking the
lock again.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add "ARCH" to the symbols; shortly, the "prepare" phase will include both
the arch-independent step to clear out contents left in the page by the
host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE.
For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We have a perfectly usable folio, use it to retrieve the pfn and order.
All that's needed is a version of folio_file_page that returns a pfn.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The up-to-date flag as is now is not too useful; it tells guest_memfd not
to overwrite the contents of a folio, but it doesn't say that the page
is ready to be mapped into the guest. For encrypted guests, mapping
a private page requires that the "preparation" phase has succeeded,
and at the same time the same page cannot be prepared twice.
So, ensure that folio_mark_uptodate() is only called on a prepared page. If
kvm_gmem_prepare_folio() or the post_populate callback fail, the folio
will not be marked up-to-date; it's not a problem to call clear_highpage()
again on such a page prior to the next preparation attempt.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Right now this is simply more consistent and avoids use of pfn_to_page()
and put_page(). It will be put to more use in upcoming patches, to
ensure that the up-to-date flag is set at the very end of both the
kvm_gmem_get_pfn() and kvm_gmem_populate() flows.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
- Setup empty IRQ routing when creating a VM to avoid having to synchronize
SRCU when creating a split IRQCHIP on x86.
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
that arch code can use for hooking both sched_in() and sched_out().
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace detect bugs.
- Mark a vCPU as preempted if and only if it's scheduled out while in the
KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
memory when retrieving guest state during live migration blackout.
- A few minor cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRuOYACgkQOlYIJqCj
N/1UnQ/8CI5Qfr+/0gzYgtWmtEMczGG+rMNpzD3XVqPjJjXcMcBiQnplnzUVLhha
vlPdYVK7vgmEt003XGzV55mik46LHL+DX/v4hI3HEdblfyCeNLW3fKEWVRB44qJe
o+YUQwSK42SORUp9oXuQINxhA//U9EnI7CQxlJ8w8wenv5IJKfIGr01DefmfGPAV
PKm9t6WLcNqvhZMEyy/zmzM3KVPCJL0NcwI97x6sHxFpQYIDtL0E/VexA4AFqMoT
QK7cSDC/2US41Zvem/r/GzM/ucdF6vb9suzZYBohwhxtVhwJe2CDeYQZvtNKJ1U7
GOHPaKL6nBWdZCm/yyWbbX2nstY1lHqxhN3JD0X8wqU5rNcwm2b8Vfyav0Ehc7H+
jVbDTshOx4YJmIgajoKjgM050rdBK59TdfVL+l+AAV5q/TlHocalYtvkEBdGmIDg
2td9UHSime6sp20vQfczUEz4bgrQsh4l2Fa/qU2jFwLievnBw0AvEaMximkSGMJe
b8XfjmdTjlOesWAejANKtQolfrq14+1wYw0zZZ8PA+uNVpKdoovmcqSOcaDC9bT8
GO/NFUvoG+lkcvJcIlo1SSl81SmGLosijwxWfGvFAqsgpR3/3l3dYp0QtztoCNJO
d3+HnjgYn5o5FwufuTD3eUOXH4AFjG108DH0o25XrIkb2Kymy0o=
=BalU
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEAD
KVM generic changes for 6.11
- Enable halt poll shrinking by default, as Intel found it to be a clear win.
- Setup empty IRQ routing when creating a VM to avoid having to synchronize
SRCU when creating a split IRQCHIP on x86.
- Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag
that arch code can use for hooking both sched_in() and sched_out().
- Take the vCPU @id as an "unsigned long" instead of "u32" to avoid
truncating a bogus value from userspace, e.g. to help userspace detect bugs.
- Mark a vCPU as preempted if and only if it's scheduled out while in the
KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest
memory when retrieving guest state during live migration blackout.
- A few minor cleanups
Fix a bug where KVM fails to check the validity of an incoming userspace
virtual address and tries to activate a gfn_to_pfn_cache with a kernel address.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmaRtxMACgkQOlYIJqCj
N/0B9A/+PeiWgW+AZ5Bl3YLLTMUi2Q1pKapT5hdNAthdDC72ewDDqtfXQ/Lcaubq
j1ElrxXP5IjMBq7U65hT9hc/f6AxZat2LkTSBRM7uGrhE2UTF+/GmfJve8vUHZsA
XtH9JRqNhBfr5EOkyh4AwvzvmzaiuJemPeLtQuEwlgaECs9byG1ILhudD/KiVStw
3Tw4Zw2oKzpo3hiW+5/YZBEkpO+CM50ByP9/SEPUhnG8sCqlGzMEI27bOvDgEDQs
vV2fIj992lrSa9OX6oA/xz3/UE1t3ruo8mHjP/r3+sxr1SnnwbmR0YbiGEVdAjJI
Pllwx6E9Fxi8Nqq/6Dy4XxUhNG+G8+ozwvr8wbLHzXFIpCGZFN5xgweuHtdEg0x8
mNxekTZrsLE58t5MpvTUMMHiHsn0KqtPqg1g3c2A7znZnzYxMHrIYyUm+uEOIZ/w
Q93WT7s3SiaELgENsx3uda3Q0i2gKK3x7gbLk/9N2ciFZXgFZMwdZa7FivJr5OoT
wp8r/btKTTaVGyn3x3y/Tum5XpyMNsKdxEKQ4n3aaIkfS2APfiHjQFyzkI2uLePG
Tz/SZ/PT2rxpRuYL54mHzovPsIJ3NqBA+OTWCHKI/EwyWeyW9g05rZnmJar4ZYQ5
pdHqWgGi0wMrPWG+GQv0FMBj7EIM4Z3dA4/BRyRObnMmp1XtIb4=
=Vgz5
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-fixes-6.10-11' of https://github.com/kvm-x86/linux into HEAD
KVM Xen:
Fix a bug where KVM fails to check the validity of an incoming userspace
virtual address and tries to activate a gfn_to_pfn_cache with a kernel address.
1. Add ParaVirt steal time support.
2. Add some VM migration enhancement.
3. Add perf kvm-stat support for loongarch.
-----BEGIN PGP SIGNATURE-----
iQJKBAABCAA0FiEEzOlt8mkP+tbeiYy5AoYrw/LiJnoFAmaOS6UWHGNoZW5odWFj
YWlAa2VybmVsLm9yZwAKCRAChivD8uImehejD/9pACGe3h3krXLcFVWXOFIu5Hpc
5kQLP0lSPJ/o5Xs8t/oPLrnDX70z90wXI1LOmltc7h32MSwFa2l8COQh+sN5eJBQ
PNyt7u7bMipp0yJS4Gl3LQQ5vklcGOSpQc/gbeXnVx8J/tz+Mo9YGGLIXVRXRM6W
Ri8D2VVFiwzQQYeTpPo1u1Ob8C6mA4KOppwvhscMTM3vj4NMbsinBzRnR0lG0Tdw
meFhxDPly1Ksxsbnj9UGO6UnEY0A2SLONs6MiO4y4DtoqoDlw/lbqFJuYo4vvbx1
pxtjyirD/PX/wjslQFWUOuU0hMfAodera+JupZ5BZWfcG8FltA4DQfDsm/U9RjK/
7gGNnr8Xk2/tp6+4AVV+HU2iTgRvq+mXCL72zSy2Y4r7ElBAANDfk4n+Zn/PWisn
U9wwV8Ue7tVB15BRpRsg77NzBidiCFEe/6flWYiX2y24ke71gwDJBGUy8hMdKt6t
4Cq8atsU0MvDAzfYMsK9JjskJp4UFq6wb1tXbbuADM4TDhnzlK6s6h3vM+pFlh/f
my7fDH8/2qsCWhBDM4pmsJskVp+I1GOk/80RjTQISwx7iHktJWvxNYTaisK2fvD5
Qs1IUWfNFbDX0Lr0QpN6j6X4rZkghR4R6XoFkd4nkicwi+UHVn3oK9GSqv24QJn9
7+Ev3dfRTUYLd6mC4Q==
=DpIK
-----END PGP SIGNATURE-----
Merge tag 'loongarch-kvm-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD
LoongArch KVM changes for v6.11
1. Add ParaVirt steal time support.
2. Add some VM migration enhancement.
3. Add perf kvm-stat support for loongarch.
Pre-population has been requested several times to mitigate KVM page faults
during guest boot or after live migration. It is also required by TDX
before filling in the initial guest memory with measured contents.
Introduce it as a generic API.
Add a new ioctl KVM_PRE_FAULT_MEMORY in the KVM common code. It iterates on the
memory range and calls the arch-specific function. The implementation is
optional and enabled by a Kconfig symbol.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Message-ID: <819322b8f25971f2b9933bfa4506e618508ad782.1712785629.git.isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The flags AS_UNMOVABLE and AS_INACCESSIBLE were both added just for guest_memfd;
AS_UNMOVABLE is already in existing versions of Linux, while AS_INACCESSIBLE was
acked for inclusion in 6.11.
But really, they are the same thing: only guest_memfd uses them, at least for
now, and guest_memfd pages are unmovable because they should not be
accessed by the CPU.
So merge them into one; use the AS_INACCESSIBLE name which is more comprehensive.
At the same time, this fixes an embarrassing bug where AS_INACCESSIBLE was used
as a bit mask, despite it being just a bit index.
The bug was mostly benign, because AS_INACCESSIBLE's bit representation (1010)
corresponded to setting AS_UNEVICTABLE (which is already set) and AS_ENOSPC
(except no async writes can happen on the guest_memfd). So the AS_INACCESSIBLE
flag simply had no effect.
Fixes: 1d23040caa ("KVM: guest_memfd: Use AS_INACCESSIBLE when creating guest_memfd inode")
Fixes: c72ceafbd1 ("mm: Introduce AS_INACCESSIBLE for encrypted/confidential memory")
Cc: linux-mm@kvack.org
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Tested-by: Michael Roth <michael.roth@amd.com>
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a module description for kvm.ko to fix a 'make W=1' warning:
WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm.o
Opportunistically update kvm_main.c's comically stale file comment to
match the module description.
Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com
[sean: split x86 changes to a separate commit, remove stale VT-x comment]
Signed-off-by: Sean Christopherson <seanjc@google.com>
- Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest.
- Compute the max mappable gfn for KVM selftests on x86 using GuestMaxPhyAddr
from KVM's supported CPUID (if it's available).
- Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are atomic.
- Fix technically benign bug in __kvm_handle_hva_range() where KVM consumes
the return from a void-returning function as if it were a boolean.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEKTobbabEP7vbhhN9OlYIJqCjN/0FAmZ0tUIACgkQOlYIJqCj
N/0bNQ//etfWk8SWCeOQ2ir83es04/i57Rz0L5L+d1C58IznwwbuRZdYaMpldb/B
Wx8J4mhfmjdd1Q3HeqWqkpDATNBIkTx3Cp0ydyM41mCMgOuL2uz7o9CDf0VG6IPN
j+9X91IEbfZ/h2k7qlrHVePY6P4HASsGFkYnc/3q7A8nA3jhZPMUwlzX4v02V3Ib
x5MvtLxtuA4V8feAETNMVwFk2DxPXZV8NQAi6RNnPKF8ui8hmkaMPk1ysj3JaqGN
bgsfAJQz3+uL5IR/cQQvjKGDFwL6TkE2mHnuziYQAMR9+ir7EIN+88xW/PZYkCHN
Bh1pgtv6quCP33MlC2gjUwDLxbLWPuS0zsGe/QOABRrY+95gngS6/DgYIA7tN/ye
VjWS3LHEQDaOa6AumKJqhi90WYNICRI3wi/4Bk3so3Oj/lvMisnizeTMKKGSPyU1
FhW6JUYQlcbmTS6aKGz7WoZxCv73Pild9Vz9ZqsW93aKIgJqeUEpfpMeHVg1DO8n
/YXBCkYqm2ni6yTeoDxHiXJt+ecwKrZdjOe0Rwhmcybyux82ig98ISq+ZEtptSQW
rEpa7wJ6Vb9Kv5Tzf5bKjb2MIzRkMFJgnRjr97taf4LLL4z1WyQm90OSBtqTgU8i
1R6Fy/M8hgE5D/fHOy8SZ63osLVlnnxbX6Fu1LebqxaQcrmKzcU=
=Qdo9
-----END PGP SIGNATURE-----
Merge tag 'kvm-x86-fixes-6.10-rcN' of https://github.com/kvm-x86/linux into HEAD
KVM fixes for 6.10
- Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest.
- Compute the max mappable gfn for KVM selftests on x86 using GuestMaxPhyAddr
from KVM's supported CPUID (if it's available).
- Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are atomic.
- Fix technically benign bug in __kvm_handle_hva_range() where KVM consumes
the return from a void-returning function as if it were a boolean.
kvm_gmem_populate() is a potentially lengthy operation that can involve
multiple calls to the firmware. Interrupt it if a signal arrives.
Fixes: 1f6c06b177 ("KVM: guest_memfd: Add interface for populating gmem pages with user data")
Cc: Isaku Yamahata <isaku.yamahata@intel.com>
Cc: Michael Roth <michael.roth@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Function kvm_reset_dirty_gfn may be called with parameters cur_slot /
cur_offset / mask are all zero, it does not represent real dirty page.
It is not necessary to clear dirty page in this condition. Also return
value of macro __fls() is undefined if mask is zero which is called in
funciton kvm_reset_dirty_gfn(). Here just return.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Message-ID: <20240613122803.1031511-1-maobibo@loongson.cn>
[Move the conditional inside kvm_reset_dirty_gfn; suggested by
Sean Christopherson. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If kvm_gmem_get_pfn() detects an hwpoisoned page, it returns -EHWPOISON
but it does not put back the reference that kvm_gmem_get_folio() had
grabbed. Add the forgotten folio_put().
Fixes: a7800aa80e ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
Cc: stable@vger.kernel.org
Reviewed-by: Liam Merwick <liam.merwick@oracle.com>
Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Some allocations done by KVM are temporary, they are created as result
of program actions, but can't exists for arbitrary long times.
They should have been GFP_TEMPORARY (rip!).
OTOH, kvm-nx-lpage-recovery and kvm-pit kernel threads exist for as long
as VM exists but their task_struct memory is not accounted.
This is story for another day.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Message-ID: <c0122f66-f428-417e-a360-b25fc0f154a0@p183>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Mark a vCPU as preempted/ready if-and-only-if it's scheduled out while
running. i.e. Do not mark a vCPU preempted/ready if it's scheduled out
during a non-KVM_RUN ioctl() or when userspace is doing KVM_RUN with
immediate_exit.
Commit 54aa83c901 ("KVM: x86: do not set st->preempted when going back
to user space") stopped marking a vCPU as preempted when returning to
userspace, but if userspace then invokes a KVM vCPU ioctl() that gets
preempted, the vCPU will be marked preempted/ready. This is arguably
incorrect behavior since the vCPU was not actually preempted while the
guest was running, it was preempted while doing something on behalf of
userspace.
Marking a vCPU preempted iff its running also avoids KVM dirtying guest
memory after userspace has paused vCPUs, e.g. for live migration, which
allows userspace to collect the final dirty bitmap before or in parallel
with saving vCPU state, without having to worry about saving vCPU state
triggering writes to guest memory.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240503181734.1467938-4-dmatlack@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Ensure that any new KVM code that references immediate_exit gets extra
scrutiny by renaming it to immediate_exit__unsafe in kernel code.
All fields in struct kvm_run are subject to TOCTOU races since they are
mapped into userspace, which may be malicious or buggy. To protect KVM,
introduces a new macro that appends __unsafe to select field names in
struct kvm_run, hinting to developers and reviewers that accessing such
fields must be done carefully.
Apply the new macro to immediate_exit, since userspace can make
immediate_exit inconsistent with vcpu->wants_to_run, i.e. accessing
immediate_exit directly could lead to unexpected bugs in the future.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240503181734.1467938-3-dmatlack@google.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run
loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit
was not set.
Replace all references to vcpu->run->immediate_exit with
!vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a
malicious userspace could invoked KVM_RUN with immediate_exit=true and
then after KVM reads it to set wants_to_run=false, flip it to false.
This would result in the vCPU running in KVM_RUN with
wants_to_run=false. This wouldn't cause any real bugs today but is a
dangerous landmine.
Signed-off-by: David Matlack <dmatlack@google.com>
Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
If, on a 64 bit system, a vCPU ID is provided that has the upper 32 bits
set to a non-zero value, it may get accepted if the truncated to 32 bits
integer value is below KVM_MAX_VCPU_IDS and 'max_vcpus'. This feels very
wrong and triggered the reporting logic of PaX's SIZE_OVERFLOW plugin.
Instead of silently truncating and accepting such values, pass the full
value to kvm_vm_ioctl_create_vcpu() and make the existing limit checks
return an error.
Even if this is a userland ABI breaking change, no sane userland could
have ever relied on that behaviour.
Reported-by: PaX's SIZE_OVERFLOW plugin running on grsecurity's syzkaller
Fixes: 6aa8b732ca ("[PATCH] kvm: userspace interface")
Cc: Emese Revfy <re.emese@gmail.com>
Cc: PaX Team <pageexec@freemail.hu>
Signed-off-by: Mathias Krause <minipli@grsecurity.net>
Link: https://lore.kernel.org/r/20240614202859.3597745-2-minipli@grsecurity.net
[sean: tweak comment about INT_MAX assertion]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Bail from outer address space loop, not just the inner memslot loop, when
a "null" handler is encountered by __kvm_handle_hva_range(), which is the
intended behavior. On x86, which has multiple address spaces thanks to
SMM emulation, breaking from just the memslot loop results in undefined
behavior due to assigning the non-existent return value from kvm_null_fn()
to a bool.
In practice, the bug is benign as kvm_mmu_notifier_invalidate_range_end()
is the only caller that passes handler=kvm_null_fn, and it doesn't set
flush_on_ret, i.e. assigning garbage to r.ret is ultimately ignored. And
for most configuration the compiler elides the entire sequence, i.e. there
is no undefined behavior at runtime.
------------[ cut here ]------------
UBSAN: invalid-load in arch/x86/kvm/../../../virt/kvm/kvm_main.c:655:10
load of value 160 is not a valid value for type '_Bool'
CPU: 370 PID: 8246 Comm: CPU 0/KVM Not tainted 6.8.2-amdsos-build58-ubuntu-22.04+ #1
Hardware name: AMD Corporation Sh54p/Sh54p, BIOS WPC4429N 04/25/2024
Call Trace:
<TASK>
dump_stack_lvl+0x48/0x60
ubsan_epilogue+0x5/0x30
__ubsan_handle_load_invalid_value+0x79/0x80
kvm_mmu_notifier_invalidate_range_end.cold+0x18/0x4f [kvm]
__mmu_notifier_invalidate_range_end+0x63/0xe0
__split_huge_pmd+0x367/0xfc0
do_huge_pmd_wp_page+0x1cc/0x380
__handle_mm_fault+0x8ee/0xe50
handle_mm_fault+0xe4/0x4a0
__get_user_pages+0x190/0x840
get_user_pages_unlocked+0xe0/0x590
hva_to_pfn+0x114/0x550 [kvm]
kvm_faultin_pfn+0xed/0x5b0 [kvm]
kvm_tdp_page_fault+0x123/0x170 [kvm]
kvm_mmu_page_fault+0x244/0xaa0 [kvm]
vcpu_enter_guest+0x592/0x1070 [kvm]
kvm_arch_vcpu_ioctl_run+0x145/0x8a0 [kvm]
kvm_vcpu_ioctl+0x288/0x6d0 [kvm]
__x64_sys_ioctl+0x8f/0xd0
do_syscall_64+0x77/0x120
entry_SYSCALL_64_after_hwframe+0x6e/0x76
</TASK>
---[ end trace ]---
Fixes: 071064f14d ("KVM: Don't take mmu_lock for range invalidation unless necessary")
Signed-off-by: Babu Moger <babu.moger@amd.com>
Link: https://lore.kernel.org/r/b8723d39903b64c241c50f5513f804390c7b5eec.1718203311.git.babu.moger@amd.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
The error path for OOM when allocating buses used to return -ENOMEM using
the local variable 'r', where 'r' was initialized at the top of the
function. But a new "r = kvm_init_irq_routing(kvm);" was introduced in
the middle of the function, so now the error code is not set and it
eventually leads to a NULL dereference due to kvm_dev_ioctl_create_vm()
thinking kvm_create_vm() succeeded. Set the error code back to -ENOMEM.
Opportunistically tweak the logic to pre-set "r = -ENOMEM" immediately
before the flows that can fail due to memory allocation failure to make
it less likely that the bug recurs in the future.
Fixes: fbe4a7e881 ("KVM: Setup empty IRQ routing when creating a VM")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/r/02051e0a-09d8-49a2-917f-7c2f278a1ba1@moroto.mountain
[sean: tweak all of the "r = -ENOMEM" sites, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Delete kvm_arch_sched_in() now that all implementations are nops.
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Add a kvm_vcpu.scheduled_out flag to track if a vCPU is in the process of
being scheduled out (vCPU put path), or if the vCPU is being reloaded
after being scheduled out (vCPU load path). In the short term, this will
allow dropping kvm_arch_sched_in(), as arch code can query scheduled_out
during kvm_arch_vcpu_load().
Longer term, scheduled_out opens up other potential optimizations, without
creating subtle/brittle dependencies. E.g. it allows KVM to keep guest
state (that is managed via kvm_arch_vcpu_{load,put}()) loaded across
kvm_sched_{out,in}(), if KVM knows the state isn't accessed by the host
kernel. Forcing arch code to coordinate between kvm_arch_sched_{in,out}()
and kvm_arch_vcpu_{load,put}() is awkward, not reusable, and relies on the
exact ordering of calls into arch code.
Adding scheduled_out also obviates the need for a kvm_arch_sched_out()
hook, e.g. if arch code needs to do something novel when putting vCPU
state.
And even if KVM never uses scheduled_out for anything beyond dropping
kvm_arch_sched_in(), just being able to remove all of the arch stubs makes
it worth adding the flag.
Link: https://lore.kernel.org/all/20240430224431.490139-1-seanjc@google.com
Cc: Oliver Upton <oliver.upton@linux.dev>
Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
Acked-by: Kai Huang <kai.huang@intel.com>
Link: https://lore.kernel.org/r/20240522014013.1672962-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Setup empty IRQ routing during VM creation so that x86 and s390 don't need
to set empty/dummy IRQ routing during KVM_CREATE_IRQCHIP (in future
patches). Initializing IRQ routing before there are any potential readers
allows KVM to avoid the synchronize_srcu() in kvm_set_irq_routing(), which
can introduces 20+ milliseconds of latency in the VM creation path.
Ensuring that all VMs have non-NULL IRQ routing also hardens KVM against
misbehaving userspace VMMs, e.g. RISC-V dynamically instantiates its
interrupt controller, but doesn't override kvm_arch_intc_initialized() or
kvm_arch_irqfd_allowed(), and so can likely reach kvm_irq_map_gsi()
without fully initialized IRQ routing.
Signed-off-by: Yi Wang <foxywang@tencent.com>
Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com>
Link: https://lore.kernel.org/r/20240506101751.3145407-2-foxywang@tencent.com
[sean: init refcount after IRQ routing, fix stub, massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
loads and stores are atomic. In the extremely unlikely scenario the
compiler tears the stores, it's theoretically possible for KVM to attempt
to get a vCPU using an out-of-bounds index, e.g. if the write is split
into multiple 8-bit stores, and is paired with a 32-bit load on a VM with
257 vCPUs:
CPU0 CPU1
last_boosted_vcpu = 0xff;
(last_boosted_vcpu = 0x100)
last_boosted_vcpu[15:8] = 0x01;
i = (last_boosted_vcpu = 0x1ff)
last_boosted_vcpu[7:0] = 0x00;
vcpu = kvm->vcpu_array[0x1ff];
As detected by KCSAN:
BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:?
arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
value changed: 0x00000012 -> 0x00000000
Fixes: 217ece6129 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
Cc: stable@vger.kernel.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Link: https://lore.kernel.org/r/20240510092353.2261824-1-leitao@debian.org
Signed-off-by: Sean Christopherson <seanjc@google.com>