mirror of
https://github.com/torvalds/linux.git
synced 2024-11-29 15:41:36 +00:00
bf328e22e4
The legacy API for setting the TSC is fundamentally broken, and only allows userspace to set a TSC "now", without any way to account for time lost between the calculation of the value, and the kernel eventually handling the ioctl. To work around this, KVM has a hack which, if a TSC is set with a value which is within a second's worth of the last TSC "written" to any vCPU in the VM, assumes that userspace actually intended the two TSC values to be in sync and adjusts the newly-written TSC value accordingly. Thus, when a VMM restores a guest after suspend or migration using the legacy API, the TSCs aren't necessarily *right*, but at least they're in sync. This trick falls down when restoring a guest which genuinely has been running for less time than the 1 second of imprecision KVM allows for in in the legacy API. On *creation*, the first vCPU starts its TSC counting from zero, and the subsequent vCPUs synchronize to that. But then when the VMM tries to restore a vCPU's intended TSC, because the VM has been alive for less than 1 second and KVM's default TSC value for new vCPU's is '0', the intended TSC is within a second of the last "written" TSC and KVM incorrectly adjusts the intended TSC in an attempt to synchronize. But further hacks can be piled onto KVM's existing hackish ABI, and declare that the *first* value written by *userspace* (on any vCPU) should not be subject to this "correction", i.e. KVM can assume that the first write from userspace is not an attempt to sync up with TSC values that only come from the kernel's default vCPU creation. To that end: Add a flag, kvm->arch.user_set_tsc, protected by kvm->arch.tsc_write_lock, to record that a TSC for at least one vCPU in the VM *has* been set by userspace, and make the 1-second slop hack only trigger if user_set_tsc is already set. Note that userspace can explicitly request a *synchronization* of the TSC by writing zero. For the purpose of user_set_tsc, an explicit synchronization counts as "setting" the TSC, i.e. if userspace then subsequently writes an explicit non-zero value which happens to be within 1 second of the previous value, the new value will be "corrected". This behavior is deliberate, as treating explicit synchronization as "setting" the TSC preserves KVM's existing behaviour inasmuch as possible (KVM always applied the 1-second "correction" regardless of whether the write came from userspace vs. the kernel). Reported-by: Yong He <alexyonghe@tencent.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217423 Suggested-by: Oliver Upton <oliver.upton@linux.dev> Original-by: Oliver Upton <oliver.upton@linux.dev> Original-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Like Xu <likexu@tencent.com> Tested-by: Yong He <alexyonghe@tencent.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231008025335.7419-1-likexu@tencent.com Signed-off-by: Sean Christopherson <seanjc@google.com> |
||
---|---|---|
.. | ||
mmu | ||
svm | ||
vmx | ||
.gitignore | ||
cpuid.c | ||
cpuid.h | ||
debugfs.c | ||
emulate.c | ||
fpu.h | ||
governed_features.h | ||
hyperv.c | ||
hyperv.h | ||
i8254.c | ||
i8254.h | ||
i8259.c | ||
ioapic.c | ||
ioapic.h | ||
irq_comm.c | ||
irq.c | ||
irq.h | ||
Kconfig | ||
kvm_cache_regs.h | ||
kvm_emulate.h | ||
kvm_onhyperv.c | ||
kvm_onhyperv.h | ||
kvm-asm-offsets.c | ||
lapic.c | ||
lapic.h | ||
Makefile | ||
mmu.h | ||
mtrr.c | ||
pmu.c | ||
pmu.h | ||
reverse_cpuid.h | ||
smm.c | ||
smm.h | ||
trace.h | ||
tss.h | ||
x86.c | ||
x86.h | ||
xen.c | ||
xen.h |