2007-10-10 15:16:19 +00:00
/*
 * Kernel-based Virtual Machine driver for Linux
 *
 * derived from drivers/kvm/kvm_main.c
 *
 * Copyright (C) 2006 Qumranet, Inc.
2008-07-28 16:26:26 +00:00
 * Copyright (C) 2008 Qumranet, Inc.
 * Copyright IBM Corporation, 2008
2010-10-06 12:23:22 +00:00
 * Copyright 2010 Red Hat, Inc. and/or its affiliates.
2007-10-10 15:16:19 +00:00
 *
 * Authors:
 *   Avi Kivity   <avi@qumranet.com>
 *   Yaniv Kamay  <yaniv@qumranet.com>
2008-07-28 16:26:26 +00:00
 *   Amit Shah    <amit.shah@qumranet.com>
 *   Ben-Ami Yassour <benami@il.ibm.com>
2007-10-10 15:16:19 +00:00
 *
 * This work is licensed under the terms of the GNU GPL, version 2.  See
 * the COPYING file in the top-level directory.
 *
 */

2007-12-16 09:02:48 +00:00
#include <linux/kvm_host.h>
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into architecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interesting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
#include "irq.h"
2007-12-14 01:35:10 +00:00
#include "mmu.h"
2008-01-27 21:10:22 +00:00
#include "i8254.h"
2008-03-24 21:14:53 +00:00
#include "tss.h"
2008-06-27 17:58:02 +00:00
#include "kvm_cache_regs.h"
2008-07-03 11:59:22 +00:00
#include "x86.h"
2011-11-23 14:30:32 +00:00
#include "cpuid.h"
2015-06-19 11:54:23 +00:00
#include "pmu.h"
2015-07-03 12:01:34 +00:00
#include "hyperv.h"
2007-10-11 17:16:52 +00:00

2008-02-15 19:52:47 +00:00
#include <linux/clocksource.h>
2008-07-28 16:26:26 +00:00
#include <linux/interrupt.h>
2007-10-11 17:16:52 +00:00
#include <linux/kvm.h>
#include <linux/fs.h>
#include <linux/vmalloc.h>
2016-07-14 00:19:00 +00:00
#include <linux/export.h>
#include <linux/moduleparam.h>
2007-11-20 08:25:04 +00:00
#include <linux/mman.h>
2007-12-12 15:46:12 +00:00
#include <linux/highmem.h>
2008-12-03 13:43:34 +00:00
#include <linux/iommu.h>
2008-09-14 00:48:28 +00:00
#include <linux/intel-iommu.h>
2009-02-04 16:52:04 +00:00
#include <linux/cpufreq.h>
2009-09-07 08:12:18 +00:00
#include <linux/user-return-notifier.h>
2009-12-23 16:35:23 +00:00
#include <linux/srcu.h>
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities to include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the following.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and tries to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
widely available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build tests were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24 08:04:11 +00:00
#include <linux/slab.h>
2010-04-19 05:32:45 +00:00
#include <linux/perf_event.h>
2010-06-02 09:06:03 +00:00
#include <linux/uaccess.h>
2010-10-14 09:22:46 +00:00
#include <linux/hash.h>
2011-09-06 16:46:34 +00:00
#include <linux/pci.h>
2012-11-28 01:29:00 +00:00
#include <linux/timekeeper_internal.h>
#include <linux/pvclock_gtod.h>
2015-09-18 14:29:40 +00:00
#include <linux/kvm_irqfd.h>
#include <linux/irqbypass.h>
2017-02-05 11:07:04 +00:00
#include <linux/sched/stat.h>
2017-07-17 21:10:27 +00:00
#include <linux/mem_encrypt.h>
2017-02-05 11:07:04 +00:00

2009-07-01 13:01:02 +00:00
#include <trace/events/kvm.h>
2010-03-10 11:00:43 +00:00

2009-09-09 17:22:48 +00:00
#include <asm/debugreg.h>
2007-11-14 12:08:51 +00:00
#include <asm/msr.h>
2008-02-20 15:57:21 +00:00
#include <asm/desc.h>
2009-05-11 08:48:15 +00:00
#include <asm/mce.h>
2015-04-22 08:58:10 +00:00
#include <linux/kernel_stat.h>
2015-04-24 00:54:44 +00:00
#include <asm/fpu/internal.h> /* Ugh! */
2010-08-20 08:07:30 +00:00
#include <asm/pvclock.h>
2010-08-26 10:38:03 +00:00
#include <asm/div64.h>
2015-09-18 14:29:51 +00:00
#include <asm/irq_remapping.h>
2018-01-24 13:23:36 +00:00
#include <asm/mshyperv.h>
2018-01-24 13:23:37 +00:00
#include <asm/hypervisor.h>
2007-10-10 15:16:19 +00:00

2016-06-01 17:42:20 +00:00
#define CREATE_TRACE_POINTS
#include "trace.h"

2007-10-11 17:16:52 +00:00
#define MAX_IO_MSRS 256
2009-05-11 08:48:15 +00:00
#define KVM_MAX_MCE_BANKS 32
2016-06-22 06:59:56 +00:00
u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;
EXPORT_SYMBOL_GPL(kvm_mce_cap_supported);
2009-05-11 08:48:15 +00:00

2011-04-20 10:37:53 +00:00
#define emul_to_vcpu(ctxt) \
	container_of(ctxt, struct kvm_vcpu, arch.emulate_ctxt)
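emul_to_vcpu() walks back from the emulator context embedded in a vCPU to the enclosing struct kvm_vcpu. The userspace sketch below reproduces that container_of() pattern with offsetof(); the demo_* structures are hypothetical stand-ins rather than the real KVM layouts, and the simplified macro omits the kernel version's type check.

```c
#include <stddef.h>

/* Simplified container_of(): step back from a pointer to a member to the
 * start of the structure that embeds it. Unlike the kernel macro, this
 * version performs no type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-ins for x86_emulate_ctxt, kvm_vcpu_arch and kvm_vcpu. */
struct demo_emulate_ctxt {
	int mode;
};

struct demo_vcpu_arch {
	unsigned long regs[16];
	struct demo_emulate_ctxt emulate_ctxt;
};

struct demo_vcpu {
	int vcpu_id;
	struct demo_vcpu_arch arch;
};

/* Same shape as emul_to_vcpu(): the member designator may be nested. */
#define demo_emul_to_vcpu(ctxt) \
	container_of(ctxt, struct demo_vcpu, arch.emulate_ctxt)
```

Because the emulator context lives inside the vCPU structure, no back-pointer has to be stored: the offset is a compile-time constant.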

2008-01-31 13:57:38 +00:00
/* EFER defaults:
 * - enable SYSCALL by default because it is emulated by KVM
 * - enable LME and LMA by default on 64-bit KVM
 */
#ifdef CONFIG_X86_64
2011-02-21 03:51:35 +00:00
static
u64 __read_mostly efer_reserved_bits = ~((u64)(EFER_SCE | EFER_LME | EFER_LMA));
2008-01-31 13:57:38 +00:00
#else
2011-02-21 03:51:35 +00:00
static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
2008-01-31 13:57:38 +00:00
#endif
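The EFER defaults above are expressed as a reserved-bit mask: every bit other than SCE (plus LME and LMA on 64-bit hosts) is treated as reserved. A minimal sketch of how such a mask can reject a guest write, assuming the architectural bit positions SCE = bit 0, LME = bit 8 and LMA = bit 10; the demo_efer_write_ok() helper is hypothetical and models only this one check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural IA32_EFER bit positions. */
#define DEMO_EFER_SCE (1ULL << 0)	/* SYSCALL enable */
#define DEMO_EFER_LME (1ULL << 8)	/* long mode enable */
#define DEMO_EFER_LMA (1ULL << 10)	/* long mode active */

/* 64-bit case: everything except SCE/LME/LMA counts as reserved. */
static const uint64_t demo_efer_reserved_bits =
	~(DEMO_EFER_SCE | DEMO_EFER_LME | DEMO_EFER_LMA);

/* Hypothetical helper: a write is acceptable only if it sets no reserved bit. */
static bool demo_efer_write_ok(uint64_t new_efer)
{
	return (new_efer & demo_efer_reserved_bits) == 0;
}
```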
2007-10-11 17:16:52 +00:00

2007-11-18 14:24:12 +00:00
#define VM_STAT(x) offsetof(struct kvm, stat.x), KVM_STAT_VM
#define VCPU_STAT(x) offsetof(struct kvm_vcpu, stat.x), KVM_STAT_VCPU
2007-10-31 22:24:23 +00:00

2016-07-12 20:09:28 +00:00
#define KVM_X2APIC_API_VALID_FLAGS (KVM_X2APIC_API_USE_32BIT_IDS | \
				    KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK)
2016-07-12 20:09:27 +00:00

2009-08-09 12:17:40 +00:00
static void update_cr8_intercept(struct kvm_vcpu *vcpu);
2011-09-20 10:43:14 +00:00
static void process_nmi(struct kvm_vcpu *vcpu);
2016-06-01 20:26:01 +00:00
static void enter_smm(struct kvm_vcpu *vcpu);
2014-03-27 10:29:28 +00:00
static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
2018-02-01 00:03:36 +00:00
static void store_regs(struct kvm_vcpu *vcpu);
static int sync_regs(struct kvm_vcpu *vcpu);
2008-02-11 16:37:23 +00:00

2015-11-06 10:46:24 +00:00
struct kvm_x86_ops *kvm_x86_ops __read_mostly;
2008-06-27 17:58:02 +00:00
EXPORT_SYMBOL_GPL(kvm_x86_ops);
2007-11-14 12:09:30 +00:00

2015-11-06 10:46:24 +00:00
static bool __read_mostly ignore_msrs = 0;
2012-01-12 23:02:18 +00:00
module_param(ignore_msrs, bool, S_IRUGO | S_IWUSR);
2009-06-25 10:36:49 +00:00

2017-11-08 12:32:08 +00:00
static bool __read_mostly report_ignored_msrs = true;
module_param(report_ignored_msrs, bool, S_IRUGO | S_IWUSR);

2018-05-05 11:02:32 +00:00
unsigned int min_timer_period_us = 200;
2014-01-06 14:00:02 +00:00
module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);

2015-05-13 01:42:04 +00:00
static bool __read_mostly kvmclock_periodic_sync = true;
module_param(kvmclock_periodic_sync, bool, S_IRUGO);

2015-11-06 10:46:24 +00:00
bool __read_mostly kvm_has_tsc_control;
2011-03-25 08:44:51 +00:00
EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
2015-11-06 10:46:24 +00:00
u32 __read_mostly kvm_max_guest_tsc_khz;
2011-03-25 08:44:51 +00:00
EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
2015-10-20 07:39:01 +00:00
u8 __read_mostly kvm_tsc_scaling_ratio_frac_bits;
EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits);
u64 __read_mostly kvm_max_tsc_scaling_ratio;
EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio);
2016-06-13 21:19:59 +00:00
u64 __read_mostly kvm_default_tsc_scaling_ratio;
EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
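The declarations above describe the TSC multiplier as a fixed-point number with kvm_tsc_scaling_ratio_frac_bits fractional bits. A minimal sketch of that arithmetic, assuming 48 fractional bits (in KVM the real value comes from the vendor module), assuming the default ratio is 1 << frac_bits (that is, 1.0), and using the GCC/Clang unsigned __int128 type for the wide intermediate products; the demo_* names are hypothetical.

```c
#include <stdint.h>

/* Assumption: 48 fractional bits. */
#define DEMO_FRAC_BITS 48

/* 1.0 in this fixed-point format: scaling by it leaves the TSC unchanged. */
static const uint64_t demo_default_ratio = 1ULL << DEMO_FRAC_BITS;

/* ratio = guest_khz / host_khz in fixed point, rounded down. */
static uint64_t demo_tsc_ratio(uint32_t guest_khz, uint32_t host_khz)
{
	return (uint64_t)(((unsigned __int128)guest_khz << DEMO_FRAC_BITS) /
			  host_khz);
}

/* guest_tsc = host_tsc * ratio, keeping only the integer part. */
static uint64_t demo_scale_tsc(uint64_t host_tsc, uint64_t ratio)
{
	return (uint64_t)(((unsigned __int128)host_tsc * ratio) >> DEMO_FRAC_BITS);
}
```

In this format a guest running at 1.5 times the host rate gets the ratio 3 << 47, so a host count of 1000 becomes a guest count of 1500.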
2011-03-25 08:44:51 +00:00

KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate. Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code. This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.
There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed. Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used. The default is 250ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.
In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.
[avi: fix 64-bit division on i386]
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-03 17:43:50 +00:00
/* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
2015-11-06 10:46:24 +00:00
static u32 __read_mostly tsc_tolerance_ppm = 250;
2012-02-03 17:43:50 +00:00
module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
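tsc_tolerance_ppm sets how far a requested guest TSC rate may differ from the host rate before rate scaling is used; the commit message above gives 250 ppm as half the NTP threshold. A minimal sketch of such a window check, in which the helper names and the integer rounding are assumptions rather than KVM's exact code. For a 2,000,000 kHz host and 250 ppm, the window runs from 1,999,500 to 2,000,500 kHz.

```c
#include <stdbool.h>
#include <stdint.h>

/* khz adjusted by ppm parts per million; ppm may be negative. */
static uint32_t demo_adjust_khz(uint32_t khz, int32_t ppm)
{
	uint64_t v = (uint64_t)khz * (uint64_t)(1000000 + ppm);

	return (uint32_t)(v / 1000000);
}

/* Scaling is needed only when the requested rate falls outside
 * host_khz +/- tol_ppm. */
static bool demo_needs_scaling(uint32_t host_khz, uint32_t user_khz,
			       uint32_t tol_ppm)
{
	uint32_t lo = demo_adjust_khz(host_khz, -(int32_t)tol_ppm);
	uint32_t hi = demo_adjust_khz(host_khz, (int32_t)tol_ppm);

	return user_khz < lo || user_khz > hi;
}
```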

2014-12-16 14:08:15 +00:00
/* lapic timer advance (tscdeadline mode only) in nanoseconds */
2015-11-06 10:46:24 +00:00
unsigned int __read_mostly lapic_timer_advance_ns = 0;
2014-12-16 14:08:15 +00:00
module_param(lapic_timer_advance_ns, uint, S_IRUGO | S_IWUSR);
2018-05-29 06:53:17 +00:00
EXPORT_SYMBOL_GPL(lapic_timer_advance_ns);
2014-12-16 14:08:15 +00:00

2016-01-25 08:53:33 +00:00
static bool __read_mostly vector_hashing = true;
module_param(vector_hashing, bool, S_IRUGO);

2018-03-12 11:12:47 +00:00
bool __read_mostly enable_vmware_backdoor = false;
module_param(enable_vmware_backdoor, bool, S_IRUGO);
EXPORT_SYMBOL_GPL(enable_vmware_backdoor);

KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
There is no easy way to force KVM to run an instruction through the emulator
(by design as that will expose the x86 emulator as a significant attack-surface).
However, we do wish to expose the x86 emulator in case we are testing it
(e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation prefix"
that is designed to raise #UD, which KVM will trap, and its #UD exit-handler will
match the "force emulation prefix" and run the instruction after the prefix through the x86 emulator.
To not expose the x86 emulator by default, we add a module parameter that should
be off by default.
A simple testcase here:
#include <stdio.h>
#include <string.h>

#define HYPERVISOR_INFO 0x40000000

/* Input "1" places idx in EAX, the register CPUID reads its leaf from. */
#define CPUID(idx, eax, ebx, ecx, edx) \
	asm volatile (\
	"ud2a; .ascii \"kvm\"; cpuid" \
	:"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \
	:"1"(idx) );

int main(void)
{
	unsigned int eax, ebx, ecx, edx;
	char string[13];

	CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
	*(unsigned int *)(string + 0) = ebx;
	*(unsigned int *)(string + 4) = ecx;
	*(unsigned int *)(string + 8) = edx;

	string[12] = 0;
	if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
		printf("kvm guest\n");
	else
		printf("bare hardware\n");
	return 0;
}
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Correctly handle usermode exits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-03 23:28:49 +00:00
static bool __read_mostly force_emulation_prefix = false;
module_param(force_emulation_prefix, bool, S_IRUGO);

2009-09-07 08:12:18 +00:00
#define KVM_NR_SHARED_MSRS 16

struct kvm_shared_msrs_global {
	int nr;
2009-12-18 08:48:44 +00:00
	u32 msrs[KVM_NR_SHARED_MSRS];
2009-09-07 08:12:18 +00:00
};

struct kvm_shared_msrs {
	struct user_return_notifier urn;
	bool registered;
2009-12-18 08:48:44 +00:00
	struct kvm_shared_msr_values {
		u64 host;
		u64 curr;
	} values[KVM_NR_SHARED_MSRS];
2009-09-07 08:12:18 +00:00
};

static struct kvm_shared_msrs_global __read_mostly shared_msrs_global;
2013-01-03 13:41:39 +00:00
static struct kvm_shared_msrs __percpu *shared_msrs;
2009-09-07 08:12:18 +00:00

2007-10-31 22:24:23 +00:00
struct kvm_stats_debugfs_item debugfs_entries[] = {
2007-11-18 14:24:12 +00:00
	{ "pf_fixed", VCPU_STAT(pf_fixed) },
	{ "pf_guest", VCPU_STAT(pf_guest) },
	{ "tlb_flush", VCPU_STAT(tlb_flush) },
	{ "invlpg", VCPU_STAT(invlpg) },
	{ "exits", VCPU_STAT(exits) },
	{ "io_exits", VCPU_STAT(io_exits) },
	{ "mmio_exits", VCPU_STAT(mmio_exits) },
	{ "signal_exits", VCPU_STAT(signal_exits) },
	{ "irq_window", VCPU_STAT(irq_window_exits) },
2008-05-15 10:23:25 +00:00
	{ "nmi_window", VCPU_STAT(nmi_window_exits) },
2007-11-18 14:24:12 +00:00
	{ "halt_exits", VCPU_STAT(halt_exits) },
kvm: add halt_poll_ns module parameter
This patch introduces a new module parameter for the KVM module; when it
is present, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.
This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
parameters avoid that the guest sends pointless flush requests, and
at the same time they are not slow because of the battery-backed cache).
The bad performance happens because on every halt the host CPU decides
to halt itself too. When the interrupt comes, the vCPU thread is then
migrated to a new physical CPU, and in general the latency is horrible
because the vCPU thread has to be scheduled back in.
With this patch performance reaches 60-65% of bare metal and, more
important, 99% of what you get if you use idle=poll in the guest. This
means that the tunable gets rid of this particular bottleneck, and more
work can be done to improve performance in the kernel or QEMU.
Of course there is some price to pay; every time an otherwise idle vCPU
is interrupted by an interrupt, it will poll unnecessarily and thus
impose a little load on the host. The above results were obtained with
a mostly random value of the parameter (500000), and the load was around
1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU.
The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll avoids that Linux schedules the VCPU thread out and back
in, and may also avoid a likely trip to C1 and back for the physical CPU.
While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second.
Of these halts, almost all are failed polls. During the benchmark,
instead, basically all halts end within the polling period, except a more
or less constant stream of 50 per second coming from vCPUs that are not
running the benchmark. The wasted time is thus very low. Things may
be slightly different for Windows VMs, which have a ~10 ms timer tick.
The effect is also visible on Marcelo's recently-introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000 clock
cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC
deadline timer, thus, the effect is both a smaller average latency and
a smaller variance.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-02-04 17:20:58 +00:00
	{ "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
2015-09-15 16:27:57 +00:00
	{ "halt_attempted_poll", VCPU_STAT(halt_attempted_poll) },
2016-05-13 10:16:35 +00:00
	{ "halt_poll_invalid", VCPU_STAT(halt_poll_invalid) },
2007-11-18 14:24:12 +00:00
	{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
2008-02-20 19:30:30 +00:00
	{ "hypercalls", VCPU_STAT(hypercalls) },
2007-11-18 14:24:12 +00:00
	{ "request_irq", VCPU_STAT(request_irq_exits) },
	{ "irq_exits", VCPU_STAT(irq_exits) },
	{ "host_state_reload", VCPU_STAT(host_state_reload) },
	{ "fpu_reload", VCPU_STAT(fpu_reload) },
	{ "insn_emulation", VCPU_STAT(insn_emulation) },
	{ "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
2008-09-01 12:57:51 +00:00
	{ "irq_injections", VCPU_STAT(irq_injections) },
2008-09-26 07:30:55 +00:00
	{ "nmi_injections", VCPU_STAT(nmi_injections) },
2016-12-17 15:05:19 +00:00
	{ "req_event", VCPU_STAT(req_event) },
x86/KVM/VMX: Add L1D flush logic
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flag is set:
- Always, if the flush module parameter is 'always'
- Conditionally at:
- Entry to vcpu_run(), i.e. after executing user space
- From the sched_in notifier, i.e. when switching to a vCPU thread.
- From vmexit handlers which are considered unsafe, i.e. where
sensitive data can be brought into L1D:
- The emulator, which could be a good target for other speculative
execution-based threats,
- The MMU, which can bring host page tables in the L1 cache.
- External interrupts
- Nested operations that require the MMU (see above). That is
vmptrld, vmptrst, vmclear,vmwrite,vmread.
- When handling invept,invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-02 11:07:14 +00:00
	{ "l1d_flush", VCPU_STAT(l1d_flush) },
2007-11-18 14:37:07 +00:00
	{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
	{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
	{ "mmu_pde_zapped", VM_STAT(mmu_pde_zapped) },
	{ "mmu_flooded", VM_STAT(mmu_flooded) },
	{ "mmu_recycled", VM_STAT(mmu_recycled) },
2007-12-18 17:47:18 +00:00
	{ "mmu_cache_miss", VM_STAT(mmu_cache_miss) },
2008-09-23 16:18:39 +00:00
	{ "mmu_unsync", VM_STAT(mmu_unsync) },
2007-11-20 21:01:14 +00:00
	{ "remote_tlb_flush", VM_STAT(remote_tlb_flush) },
2008-02-23 14:44:30 +00:00
	{ "largepages", VM_STAT(lpages) },
2016-12-20 23:25:57 +00:00
	{ "max_mmu_page_hash_collisions",
		VM_STAT(max_mmu_page_hash_collisions) },
2007-10-31 22:24:23 +00:00
	{ NULL }
};
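Each debugfs_entries[] element pairs a file name with what VM_STAT() or VCPU_STAT() expands to (an offsetof() into struct kvm or struct kvm_vcpu plus a kind tag), and the { NULL } entry terminates the table. A simplified userspace sketch of walking such a table generically; the demo_* types and fields are hypothetical, and the counters are plain unsigned long values here.

```c
#include <stddef.h>
#include <string.h>

enum demo_stat_kind { DEMO_STAT_VM, DEMO_STAT_VCPU };

struct demo_vm_stat { unsigned long remote_tlb_flush; };
struct demo_vcpu_stat { unsigned long exits; unsigned long halt_exits; };

struct demo_vm { struct demo_vm_stat stat; };
struct demo_vcpu { int id; struct demo_vcpu_stat stat; };

struct demo_stats_item {
	const char *name;
	size_t offset;
	enum demo_stat_kind kind;
};

/* Same trick as VM_STAT()/VCPU_STAT(): expand to "offset, kind". */
#define DEMO_VM_STAT(x)   offsetof(struct demo_vm, stat.x), DEMO_STAT_VM
#define DEMO_VCPU_STAT(x) offsetof(struct demo_vcpu, stat.x), DEMO_STAT_VCPU

static const struct demo_stats_item demo_entries[] = {
	{ "exits", DEMO_VCPU_STAT(exits) },
	{ "halt_exits", DEMO_VCPU_STAT(halt_exits) },
	{ "remote_tlb_flush", DEMO_VM_STAT(remote_tlb_flush) },
	{ NULL }			/* terminator, as in debugfs_entries[] */
};

/* Find a counter by name and read it from the VM or vCPU object. */
static int demo_read_stat(const char *name, const struct demo_vm *vm,
			  const struct demo_vcpu *vcpu, unsigned long *out)
{
	const struct demo_stats_item *p;

	for (p = demo_entries; p->name; p++) {
		const char *base;

		if (strcmp(p->name, name) != 0)
			continue;
		base = p->kind == DEMO_STAT_VM ? (const char *)vm
					       : (const char *)vcpu;
		*out = *(const unsigned long *)(base + p->offset);
		return 0;
	}
	return -1;			/* unknown name */
}
```

Storing offsets instead of pointers lets one static table serve every VM and vCPU instance.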

2010-06-10 03:27:12 +00:00
u64 __read_mostly host_xcr0;

2012-09-20 05:43:17 +00:00
static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt);
2011-04-20 12:47:13 +00:00

2010-10-14 09:22:46 +00:00
static inline void kvm_async_pf_hash_reset(struct kvm_vcpu *vcpu)
{
	int i;
	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU); i++)
		vcpu->arch.apf.gfns[i] = ~0;
}
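kvm_async_pf_hash_reset() marks every slot of a power-of-two-sized gfn table as empty by storing ~0, which then serves as the empty marker. A minimal open-addressing sketch built on that convention, assuming 64 slots, a multiplicative hash of the low 32 bits in the spirit of hash_32() from <linux/hash.h> (0x61C88647 is the kernel's 32-bit golden-ratio constant), and linear probing. All names are hypothetical, and deletion is left out because it needs extra care to keep probe chains intact.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_APF_BITS  6			/* assumption: 64 slots */
#define DEMO_APF_SLOTS (1u << DEMO_APF_BITS)
#define DEMO_EMPTY     (~(uint64_t)0)		/* same sentinel as the reset loop */

static uint64_t demo_gfns[DEMO_APF_SLOTS];

static void demo_hash_reset(void)
{
	for (unsigned i = 0; i < DEMO_APF_SLOTS; i++)
		demo_gfns[i] = DEMO_EMPTY;
}

/* Multiplicative hash: keep the top DEMO_APF_BITS bits of the product. */
static uint32_t demo_hash(uint64_t gfn)
{
	return ((uint32_t)gfn * 0x61C88647u) >> (32 - DEMO_APF_BITS);
}

/* Power-of-two size: wrapping to the next slot is a single AND. */
static uint32_t demo_next(uint32_t key)
{
	return (key + 1) & (DEMO_APF_SLOTS - 1);
}

static bool demo_add_gfn(uint64_t gfn)
{
	uint32_t key = demo_hash(gfn);

	for (unsigned n = 0; n < DEMO_APF_SLOTS; n++, key = demo_next(key)) {
		if (demo_gfns[key] == DEMO_EMPTY) {
			demo_gfns[key] = gfn;
			return true;
		}
	}
	return false;				/* table full */
}

static bool demo_find_gfn(uint64_t gfn)
{
	uint32_t key = demo_hash(gfn);

	for (unsigned n = 0; n < DEMO_APF_SLOTS; n++, key = demo_next(key)) {
		if (demo_gfns[key] == gfn)
			return true;
		if (demo_gfns[key] == DEMO_EMPTY)
			return false;		/* end of the probe chain */
	}
	return false;
}
```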

2009-09-07 08:12:18 +00:00
static void kvm_on_user_return(struct user_return_notifier *urn)
{
	unsigned slot;
	struct kvm_shared_msrs *locals
		= container_of(urn, struct kvm_shared_msrs, urn);
2009-12-18 08:48:44 +00:00
	struct kvm_shared_msr_values *values;
2016-11-04 19:15:55 +00:00
	unsigned long flags;

	/*
	 * Disabling irqs at this point since the following code could be
	 * interrupted and executed through kvm_arch_hardware_disable()
	 */
	local_irq_save(flags);
	if (locals->registered) {
		locals->registered = false;
		user_return_notifier_unregister(urn);
	}
	local_irq_restore(flags);
2009-09-07 08:12:18 +00:00
	for (slot = 0; slot < shared_msrs_global.nr; ++slot) {
2009-12-18 08:48:44 +00:00
		values = &locals->values[slot];
		if (values->host != values->curr) {
			wrmsrl(shared_msrs_global.msrs[slot], values->host);
			values->curr = values->host;
2009-09-07 08:12:18 +00:00
		}
	}
}

2009-12-18 08:48:44 +00:00
static void shared_msr_update(unsigned slot, u32 msr)
2009-09-07 08:12:18 +00:00
{
	u64 value;
2013-01-03 13:41:39 +00:00
	unsigned int cpu = smp_processor_id();
	struct kvm_shared_msrs *smsr = per_cpu_ptr(shared_msrs, cpu);
2009-09-07 08:12:18 +00:00

2009-12-18 08:48:44 +00:00
	/* Only read here, and nobody should modify it at this time,
	 * so no lock is needed. */
	if (slot >= shared_msrs_global.nr) {
		printk(KERN_ERR "kvm: invalid MSR slot!\n");
		return;
	}
	rdmsrl_safe(msr, &value);
	smsr->values[slot].host = value;
	smsr->values[slot].curr = value;
}

void kvm_define_shared_msr(unsigned slot, u32 msr)
{
2014-07-24 12:06:56 +00:00
	BUG_ON(slot >= KVM_NR_SHARED_MSRS);
2015-07-29 09:06:34 +00:00
	shared_msrs_global.msrs[slot] = msr;
2009-09-07 08:12:18 +00:00
	if (slot >= shared_msrs_global.nr)
		shared_msrs_global.nr = slot + 1;
}
EXPORT_SYMBOL_GPL(kvm_define_shared_msr);

static void kvm_shared_msr_cpu_online(void)
{
	unsigned i;

	for (i = 0; i < shared_msrs_global.nr; ++i)
2009-12-18 08:48:44 +00:00
		shared_msr_update(i, shared_msrs_global.msrs[i]);
2009-09-07 08:12:18 +00:00
}

2014-08-27 18:16:44 +00:00
int kvm_set_shared_msr(unsigned slot, u64 value, u64 mask)
2009-09-07 08:12:18 +00:00
{
2013-01-03 13:41:39 +00:00
	unsigned int cpu = smp_processor_id();
	struct kvm_shared_msrs *smsr = per_cpu_ptr(shared_msrs, cpu);
2014-08-27 18:16:44 +00:00
	int err;
2009-09-07 08:12:18 +00:00

2009-12-18 08:48:44 +00:00
	if (((value ^ smsr->values[slot].curr) & mask) == 0)
2014-08-27 18:16:44 +00:00
		return 0;
2009-12-18 08:48:44 +00:00
	smsr->values[slot].curr = value;
2014-08-27 18:16:44 +00:00
	err = wrmsrl_safe(shared_msrs_global.msrs[slot], value);
	if (err)
		return 1;

2009-09-07 08:12:18 +00:00
	if (!smsr->registered) {
		smsr->urn.on_user_return = kvm_on_user_return;
		user_return_notifier_register(&smsr->urn);
		smsr->registered = true;
	}
2014-08-27 18:16:44 +00:00
	return 0;
2009-09-07 08:12:18 +00:00
}
EXPORT_SYMBOL_GPL(kvm_set_shared_msr);
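The shared-MSR helpers above switch MSRs lazily: kvm_set_shared_msr() touches the hardware only when the bits selected by `mask` differ from the cached `curr` value, and kvm_on_user_return() writes `host` back only for slots whose cached value drifted. The following is a minimal user-space sketch of that caching logic, not KVM code. It assumes a plain array stands in for the hardware MSRs, counts writes instead of calling wrmsrl(), ignores wrmsrl_safe() failures and per-CPU handling, and uses hypothetical names (`model_*`, `NR_SLOTS`).

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_SLOTS 4 /* hypothetical; KVM sizes this with KVM_NR_SHARED_MSRS */

struct model_slot {
	uint64_t host; /* value to restore on return to user space */
	uint64_t curr; /* value currently believed to be in hardware */
};

struct model_cpu {
	struct model_slot values[NR_SLOTS];
	uint64_t hw[NR_SLOTS]; /* stand-in for the real MSRs */
	bool registered;       /* is the "user-return notifier" armed? */
	int writes;            /* number of simulated MSR writes */
};

static void model_write(struct model_cpu *c, unsigned slot, uint64_t v)
{
	c->hw[slot] = v;
	c->writes++;
}

/* Like shared_msr_update(): snapshot the host value into host and curr. */
static void model_init_slot(struct model_cpu *c, unsigned slot, uint64_t host)
{
	c->hw[slot] = host;
	c->values[slot].host = host;
	c->values[slot].curr = host;
}

/* Like kvm_set_shared_msr(): skip the write if the masked bits already match. */
static void model_set(struct model_cpu *c, unsigned slot, uint64_t value,
		      uint64_t mask)
{
	if (((value ^ c->values[slot].curr) & mask) == 0)
		return;
	c->values[slot].curr = value;
	model_write(c, slot, value);
	c->registered = true;
}

/* Like kvm_on_user_return(): disarm, then restore only the changed slots. */
static void model_user_return(struct model_cpu *c)
{
	c->registered = false;
	for (unsigned slot = 0; slot < NR_SLOTS; ++slot) {
		struct model_slot *v = &c->values[slot];

		if (v->host != v->curr) {
			model_write(c, slot, v->host);
			v->curr = v->host;
		}
	}
}
```

In this model, rewriting an unchanged value costs nothing, and one return to user space restores every modified slot in a single pass, which appears to be the motivation for deferring the restore to the user-return notifier.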
2014-08-28 13:13:03 +00:00
static void drop_user_return_notifiers(void)
2009-11-28 12:18:47 +00:00
{
2013-01-03 13:41:39 +00:00
	unsigned int cpu = smp_processor_id();
	struct kvm_shared_msrs *smsr = per_cpu_ptr(shared_msrs, cpu);
2009-11-28 12:18:47 +00:00

	if (smsr->registered)
		kvm_on_user_return(&smsr->urn);
}

2007-10-29 15:09:10 +00:00
u64 kvm_get_apic_base(struct kvm_vcpu *vcpu)
{
2012-08-05 12:58:26 +00:00
	return vcpu->arch.apic_base;
2007-10-29 15:09:10 +00:00
}
EXPORT_SYMBOL_GPL(kvm_get_apic_base);

2018-05-09 20:56:04 +00:00
enum lapic_mode kvm_get_apic_mode(struct kvm_vcpu *vcpu)
{
	return kvm_apic_mode(kvm_get_apic_base(vcpu));
}
EXPORT_SYMBOL_GPL(kvm_get_apic_mode);

2014-01-24 15:48:44 +00:00
int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
2018-05-09 20:56:04 +00:00
	enum lapic_mode old_mode = kvm_get_apic_mode(vcpu);
	enum lapic_mode new_mode = kvm_apic_mode(msr_info->data);
2017-08-04 22:12:49 +00:00
	u64 reserved_bits = ((~0ULL) << cpuid_maxphyaddr(vcpu)) | 0x2ff |
		(guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) ? 0 : X2APIC_ENABLE);
2014-01-24 15:48:44 +00:00

2018-05-09 20:56:04 +00:00
	if ((msr_info->data & reserved_bits) != 0 || new_mode == LAPIC_MODE_INVALID)
2014-01-24 15:48:44 +00:00
		return 1;
2018-05-09 20:56:04 +00:00
	if (!msr_info->host_initiated) {
		if (old_mode == LAPIC_MODE_X2APIC && new_mode == LAPIC_MODE_XAPIC)
			return 1;
		if (old_mode == LAPIC_MODE_DISABLED && new_mode == LAPIC_MODE_X2APIC)
			return 1;
	}
2014-01-24 15:48:44 +00:00

	kvm_lapic_set_base(vcpu, msr_info->data);
	return 0;
2007-10-29 15:09:10 +00:00
}
EXPORT_SYMBOL_GPL(kvm_set_apic_base);
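kvm_set_apic_base() rejects a new IA32_APIC_BASE value that sets reserved bits or encodes the invalid mode (x2APIC enable without global enable), and, for guest-initiated writes only, also refuses the x2APIC to xAPIC and disabled to x2APIC transitions. The sketch below reproduces that decision table in plain C so it can be exercised outside the kernel. It assumes the architectural bit positions (global enable at bit 11, x2APIC enable at bit 10), treats the mode as just those two bits as kvm_apic_mode() does, takes MAXPHYADDR as a parameter, and uses hypothetical `sketch_*`/`SK_*` names.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_APICBASE_ENABLE (1ULL << 11) /* IA32_APIC_BASE global enable */
#define SK_APICBASE_X2APIC (1ULL << 10) /* IA32_APIC_BASE x2APIC enable */

enum sketch_lapic_mode {
	SK_LAPIC_DISABLED = 0,
	SK_LAPIC_INVALID = 1 << 10,          /* x2APIC bit without global enable */
	SK_LAPIC_XAPIC = 1 << 11,            /* global enable only */
	SK_LAPIC_X2APIC = (1 << 11) | (1 << 10),
};

static enum sketch_lapic_mode sketch_apic_mode(uint64_t apic_base)
{
	return (enum sketch_lapic_mode)
		(apic_base & (SK_APICBASE_ENABLE | SK_APICBASE_X2APIC));
}

/* Returns 0 if the write is accepted, 1 if it must fail. */
static int sketch_set_apic_base(uint64_t old_base, uint64_t new_base,
				int maxphyaddr, bool guest_has_x2apic,
				bool host_initiated)
{
	enum sketch_lapic_mode old_mode = sketch_apic_mode(old_base);
	enum sketch_lapic_mode new_mode = sketch_apic_mode(new_base);
	/* Bits above MAXPHYADDR, bits 0-7 and 9, and bit 10 without x2APIC. */
	uint64_t reserved = (~0ULL << maxphyaddr) | 0x2ff |
			    (guest_has_x2apic ? 0 : SK_APICBASE_X2APIC);

	if ((new_base & reserved) != 0 || new_mode == SK_LAPIC_INVALID)
		return 1;
	if (!host_initiated) {
		if (old_mode == SK_LAPIC_X2APIC && new_mode == SK_LAPIC_XAPIC)
			return 1;
		if (old_mode == SK_LAPIC_DISABLED && new_mode == SK_LAPIC_X2APIC)
			return 1;
	}
	return 0;
}
```

The `host_initiated` escape hatch matters for state restore: userspace (for example during migration) may need to load a combination that a guest is not allowed to reach in one step.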
2014-05-01 22:44:37 +00:00
asmlinkage __visible void kvm_spurious_fault(void)
2013-04-05 19:20:30 +00:00
{
	/* Fault while not rebooting. We want the trace. */
	BUG();
}
EXPORT_SYMBOL_GPL(kvm_spurious_fault);

2009-11-19 15:54:07 +00:00
#define EXCPT_BENIGN		0
#define EXCPT_CONTRIBUTORY	1
#define EXCPT_PF		2

static int exception_class(int vector)
{
	switch (vector) {
	case PF_VECTOR:
		return EXCPT_PF;
	case DE_VECTOR:
	case TS_VECTOR:
	case NP_VECTOR:
	case SS_VECTOR:
	case GP_VECTOR:
		return EXCPT_CONTRIBUTORY;
	default:
		break;
	}
	return EXCPT_BENIGN;
}

2014-07-24 11:51:24 +00:00
#define EXCPT_FAULT		0
#define EXCPT_TRAP		1
#define EXCPT_ABORT		2
#define EXCPT_INTERRUPT		3

static int exception_type(int vector)
{
	unsigned int mask;

	if (WARN_ON(vector > 31 || vector == NMI_VECTOR))
		return EXCPT_INTERRUPT;

	mask = 1 << vector;

	/* #DB is trap, as instruction watchpoints are handled elsewhere */
	if (mask & ((1 << DB_VECTOR) | (1 << BP_VECTOR) | (1 << OF_VECTOR)))
		return EXCPT_TRAP;

	if (mask & ((1 << DF_VECTOR) | (1 << MC_VECTOR)))
		return EXCPT_ABORT;

	/* Reserved exceptions will result in fault */
	return EXCPT_FAULT;
}
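exception_type() classifies a vector with a single bitmask test: #DB, #BP and #OF are traps, #DF and #MC are aborts, and every other vector below 32 (reserved ones included) is treated as a fault. A standalone sketch of the same test follows. It assumes the architectural vector numbers (DB=1, NMI=2, BP=3, OF=4, DF=8, MC=18), replaces WARN_ON() with a plain check that also rejects negative vectors, uses an unsigned shift so that vector 31 does not shift into the sign bit, and uses hypothetical `SK_*`/`sketch_*` names.

```c
enum { SK_FAULT, SK_TRAP, SK_ABORT, SK_INTERRUPT };

#define SK_DB_VECTOR 1
#define SK_NMI_VECTOR 2
#define SK_BP_VECTOR 3
#define SK_OF_VECTOR 4
#define SK_DF_VECTOR 8
#define SK_MC_VECTOR 18

static int sketch_exception_type(int vector)
{
	unsigned int mask;

	/* NMI and anything outside 0..31 are not exceptions. */
	if (vector < 0 || vector > 31 || vector == SK_NMI_VECTOR)
		return SK_INTERRUPT;

	mask = 1u << vector;
	if (mask & ((1u << SK_DB_VECTOR) | (1u << SK_BP_VECTOR) |
		    (1u << SK_OF_VECTOR)))
		return SK_TRAP;
	if (mask & ((1u << SK_DF_VECTOR) | (1u << SK_MC_VECTOR)))
		return SK_ABORT;
	return SK_FAULT; /* includes reserved vectors */
}
```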
2009-11-19 15:54:07 +00:00
static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
2010-04-22 10:33:13 +00:00
		unsigned nr, bool has_error, u32 error_code,
		bool reinject)
2009-11-19 15:54:07 +00:00
{
	u32 prev_nr;
	int class1, class2;

2010-07-27 09:30:24 +00:00
	kvm_make_request(KVM_REQ_EVENT, vcpu);

2017-08-24 10:35:09 +00:00
	if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
2009-11-19 15:54:07 +00:00
	queue:
2014-11-02 09:54:42 +00:00
		if (has_error && !is_protmode(vcpu))
			has_error = false;
2017-08-24 10:35:09 +00:00
		if (reinject) {
			/*
			 * On vmentry, vcpu->arch.exception.pending is only
			 * true if an event injection was blocked by
			 * nested_run_pending. In that case, however,
			 * vcpu_enter_guest requests an immediate exit,
			 * and the guest shouldn't proceed far enough to
			 * need reinjection.
			 */
			WARN_ON_ONCE(vcpu->arch.exception.pending);
			vcpu->arch.exception.injected = true;
		} else {
			vcpu->arch.exception.pending = true;
			vcpu->arch.exception.injected = false;
		}
2009-11-19 15:54:07 +00:00
		vcpu->arch.exception.has_error_code = has_error;
		vcpu->arch.exception.nr = nr;
		vcpu->arch.exception.error_code = error_code;
		return;
	}

	/* An exception is already queued: decide how to combine the two */
	prev_nr = vcpu->arch.exception.nr;
	if (prev_nr == DF_VECTOR) {
		/* triple fault -> shutdown */
2010-05-10 09:34:53 +00:00
		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
2009-11-19 15:54:07 +00:00
		return;
	}
	class1 = exception_class(prev_nr);
	class2 = exception_class(nr);
	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
		|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
2017-08-24 10:35:09 +00:00
		/*
		 * Generate double fault per SDM Table 5-5. Set
		 * exception.pending = true so that the double fault
		 * can trigger a nested vmexit.
		 */
2009-11-19 15:54:07 +00:00
		vcpu->arch.exception.pending = true;
2017-08-24 10:35:09 +00:00
		vcpu->arch.exception.injected = false;
2009-11-19 15:54:07 +00:00
		vcpu->arch.exception.has_error_code = true;
		vcpu->arch.exception.nr = DF_VECTOR;
		vcpu->arch.exception.error_code = 0;
	} else
		/* replace the previous exception with the new one in the hope
		   that instruction re-execution will regenerate the lost
		   exception */
		goto queue;
}
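kvm_multiple_exception() folds a newly raised exception into one that is already pending or injected: a second exception while #DF is outstanding means a triple fault (shutdown); contributory after contributory, or any non-benign exception after #PF, becomes #DF; otherwise the new exception replaces the old one. The sketch below isolates that decision as a pure function over vector numbers. It assumes the architectural vector numbers, reuses the grouping of exception_class() above, and returns a hypothetical `SK_TRIPLE_FAULT` sentinel instead of raising KVM_REQ_TRIPLE_FAULT; error codes and the pending/injected bookkeeping are left out.

```c
#define SK_DE 0
#define SK_DF 8
#define SK_TS 10
#define SK_NP 11
#define SK_SS 12
#define SK_GP 13
#define SK_PF 14
#define SK_TRIPLE_FAULT (-1) /* hypothetical sentinel for shutdown */

enum { SK_BENIGN, SK_CONTRIBUTORY, SK_PAGE_FAULT };

/* Same grouping as exception_class(). */
static int sketch_class(int vector)
{
	switch (vector) {
	case SK_PF:
		return SK_PAGE_FAULT;
	case SK_DE:
	case SK_TS:
	case SK_NP:
	case SK_SS:
	case SK_GP:
		return SK_CONTRIBUTORY;
	default:
		return SK_BENIGN;
	}
}

/* Vector to deliver when 'next' is raised while 'prev' is outstanding. */
static int sketch_merge(int prev, int next)
{
	int c1, c2;

	if (prev == SK_DF)
		return SK_TRIPLE_FAULT;
	c1 = sketch_class(prev);
	c2 = sketch_class(next);
	if ((c1 == SK_CONTRIBUTORY && c2 == SK_CONTRIBUTORY) ||
	    (c1 == SK_PAGE_FAULT && c2 != SK_BENIGN))
		return SK_DF;
	return next; /* replace; re-execution should regenerate the old one */
}
```

Note the asymmetry: #GP followed by #PF is delivered as #PF, while #PF followed by #GP becomes #DF.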
2007-11-25 11:41:11 +00:00
void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
{
2010-04-22 10:33:13 +00:00
	kvm_multiple_exception(vcpu, nr, false, 0, false);
2007-11-25 11:41:11 +00:00
}
EXPORT_SYMBOL_GPL(kvm_queue_exception);

2010-04-22 10:33:13 +00:00
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr)
{
	kvm_multiple_exception(vcpu, nr, false, 0, true);
}
EXPORT_SYMBOL_GPL(kvm_requeue_exception);
KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.
Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:
- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
take precedence. The singlestep will be triggered again on the next
instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
which do not trigger singlestep traps as mentioned previously.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-29 20:40:40 +00:00
int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err)
2007-11-25 12:04:58 +00:00
{
2010-12-21 10:12:01 +00:00
	if (err)
		kvm_inject_gp(vcpu, 0);
	else
2016-11-29 20:40:40 +00:00
		return kvm_skip_emulated_instruction(vcpu);

	return 1;
2010-12-21 10:12:01 +00:00
}
EXPORT_SYMBOL_GPL(kvm_complete_insn_gp);
2010-09-10 15:30:46 +00:00

2010-11-29 14:12:30 +00:00
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
2007-11-25 12:04:58 +00:00
{
	++vcpu->stat.pf_guest;
2017-07-14 01:30:41 +00:00
	vcpu->arch.exception.nested_apf =
		is_guest_mode(vcpu) && fault->async_page_fault;
	if (vcpu->arch.exception.nested_apf)
		vcpu->arch.apf.nested_apf_token = fault->address;
	else
		vcpu->arch.cr2 = fault->address;
2010-11-29 14:12:30 +00:00
	kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
2007-11-25 12:04:58 +00:00
}
2011-05-25 20:06:59 +00:00
EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
2007-11-25 12:04:58 +00:00

2014-09-04 17:46:15 +00:00
static bool kvm_propagate_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
2010-09-10 15:30:55 +00:00
{
2010-11-29 14:12:30 +00:00
	if (mmu_is_nested(vcpu) && !fault->nested_page_fault)
		vcpu->arch.nested_mmu.inject_page_fault(vcpu, fault);
2010-09-10 15:30:55 +00:00
	else
2010-11-29 14:12:30 +00:00
		vcpu->arch.mmu.inject_page_fault(vcpu, fault);
2014-09-04 17:46:15 +00:00

	return fault->nested_page_fault;
2010-09-10 15:30:55 +00:00
}

2008-05-15 01:52:48 +00:00
void kvm_inject_nmi(struct kvm_vcpu *vcpu)
{
2011-09-20 10:43:14 +00:00
	atomic_inc(&vcpu->arch.nmi_queued);
	kvm_make_request(KVM_REQ_NMI, vcpu);
2008-05-15 01:52:48 +00:00
}
EXPORT_SYMBOL_GPL(kvm_inject_nmi);

2007-11-25 11:41:11 +00:00
void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
{
2010-04-22 10:33:13 +00:00
	kvm_multiple_exception(vcpu, nr, true, error_code, false);
2007-11-25 11:41:11 +00:00
}
EXPORT_SYMBOL_GPL(kvm_queue_exception_e);

2010-04-22 10:33:13 +00:00
void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
{
	kvm_multiple_exception(vcpu, nr, true, error_code, true);
}
EXPORT_SYMBOL_GPL(kvm_requeue_exception_e);

2009-09-01 09:03:25 +00:00
/*
 * Checks if cpl <= required_cpl; if true, return true. Otherwise queue
 * a #GP and return false.
 */
bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl)
2007-11-25 11:41:11 +00:00
{
2009-09-01 09:03:25 +00:00
	if (kvm_x86_ops->get_cpl(vcpu) <= required_cpl)
		return true;
	kvm_queue_exception_e(vcpu, GP_VECTOR, 0);
	return false;
2007-11-25 11:41:11 +00:00
}
2009-09-01 09:03:25 +00:00
EXPORT_SYMBOL_GPL(kvm_require_cpl);
2007-11-25 11:41:11 +00:00

2014-10-02 22:10:05 +00:00
bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr)
{
	if ((dr != 4 && dr != 5) || !kvm_read_cr4_bits(vcpu, X86_CR4_DE))
		return true;

	kvm_queue_exception(vcpu, UD_VECTOR);
	return false;
}
EXPORT_SYMBOL_GPL(kvm_require_dr);
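kvm_require_dr() only decides whether an access to DR4 or DR5 is legal: with CR4.DE set the access raises #UD, otherwise it is allowed. Architecturally, DR4 and DR5 alias DR6 and DR7 while CR4.DE is clear. The hypothetical helper below (`sketch_effective_dr`) folds both rules into one function that returns the register actually accessed, or -1 when #UD must be raised. The aliasing half is an assumption drawn from the x86 architecture, not something the function above does itself.

```c
#include <stdbool.h>

/*
 * Returns the debug register actually accessed, or -1 if the access
 * must raise #UD. DR4/DR5 alias DR6/DR7 only while CR4.DE is clear.
 */
static int sketch_effective_dr(int dr, bool cr4_de)
{
	if (dr == 4 || dr == 5)
		return cr4_de ? -1 : dr + 2;
	return dr;
}
```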
2010-09-10 15:30:51 +00:00
/*
 * This function will be used to read from the physical memory of the currently
2015-04-08 13:39:23 +00:00
 * running guest. The difference from kvm_vcpu_read_guest_page is that this function
2010-09-10 15:30:51 +00:00
 * can read from guest physical memory or from the guest's guest physical memory.
 */
int kvm_read_guest_page_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
			    gfn_t ngfn, void *data, int offset, int len,
			    u32 access)
{
2014-09-02 11:23:06 +00:00
	struct x86_exception exception;
2010-09-10 15:30:51 +00:00
	gfn_t real_gfn;
	gpa_t ngpa;

	ngpa = gfn_to_gpa(ngfn);
2014-09-02 11:23:06 +00:00
	real_gfn = mmu->translate_gpa(vcpu, ngpa, access, &exception);
2010-09-10 15:30:51 +00:00
	if (real_gfn == UNMAPPED_GVA)
		return -EFAULT;

	real_gfn = gpa_to_gfn(real_gfn);

2015-04-08 13:39:23 +00:00
	return kvm_vcpu_read_guest_page(vcpu, real_gfn, data, offset, len);
2010-09-10 15:30:51 +00:00
}
EXPORT_SYMBOL_GPL(kvm_read_guest_page_mmu);

2015-01-19 14:33:39 +00:00
static int kvm_read_nested_guest_page(struct kvm_vcpu *vcpu, gfn_t gfn,
2010-09-10 15:30:53 +00:00
			       void *data, int offset, int len, u32 access)
{
	return kvm_read_guest_page_mmu(vcpu, vcpu->arch.walk_mmu, gfn,
				       data, offset, len, access);
}
KVM: Portability: Move control register helper functions to x86.c
This patch moves the definitions of CR0_RESERVED_BITS,
CR4_RESERVED_BITS, and CR8_RESERVED_BITS along with the following
functions from kvm_main.c to x86.c:
set_cr0(), set_cr3(), set_cr4(), set_cr8(), get_cr8(), lmsw(),
load_pdptrs()
The static function wrapper inject_gp is duplicated in kvm_main.c and
x86.c for now, the version in kvm_main.c should disappear once the last
user of it is gone too.
The function load_pdptrs is no longer static, and now defined in x86.h
for the time being, until the last user of it is gone from kvm_main.c.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-29 15:09:35 +00:00
/*
 * Load the PAE PDPTRs. Return 1 if they are all valid, 0 otherwise.
 */
2010-09-10 15:30:57 +00:00
int load_pdptrs(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu, unsigned long cr3)
2007-10-29 15:09:35 +00:00
{
	gfn_t pdpt_gfn = cr3 >> PAGE_SHIFT;
	unsigned offset = ((cr3 & (PAGE_SIZE-1)) >> 5) << 2;
	int i;
	int ret;
2010-09-10 15:30:57 +00:00
	u64 pdpte[ARRAY_SIZE(mmu->pdptrs)];
2007-10-29 15:09:35 +00:00

2010-09-10 15:30:57 +00:00
	ret = kvm_read_guest_page_mmu(vcpu, mmu, pdpt_gfn, pdpte,
				      offset * sizeof(u64), sizeof(pdpte),
				      PFERR_USER_MASK|PFERR_WRITE_MASK);
2007-10-29 15:09:35 +00:00
	if (ret < 0) {
		ret = 0;
		goto out;
	}
	for (i = 0; i < ARRAY_SIZE(pdpte); ++i) {
2016-07-12 22:18:50 +00:00
		if ((pdpte[i] & PT_PRESENT_MASK) &&
2015-08-05 04:04:21 +00:00
		    (pdpte[i] &
		     vcpu->arch.mmu.guest_rsvd_check.rsvd_bits_mask[0][2])) {
2007-10-29 15:09:35 +00:00
			ret = 0;
			goto out;
		}
	}
	ret = 1;

2010-09-10 15:30:57 +00:00
	memcpy(mmu->pdptrs, pdpte, sizeof(mmu->pdptrs));
2009-05-31 19:58:47 +00:00
	__set_bit(VCPU_EXREG_PDPTR,
		  (unsigned long *)&vcpu->arch.regs_avail);
	__set_bit(VCPU_EXREG_PDPTR,
		  (unsigned long *)&vcpu->arch.regs_dirty);
2007-10-29 15:09:35 +00:00
out:

	return ret;
}
2008-02-07 12:47:43 +00:00
EXPORT_SYMBOL_GPL(load_pdptrs);
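load_pdptrs() locates the 32-byte-aligned PDPT inside the page addressed by CR3: `((cr3 & (PAGE_SIZE-1)) >> 5) << 2` is the index, in 64-bit entries, of the first of the four PDPTEs, and the read starts at `offset * sizeof(u64)` bytes. The small worked example below, assuming 4 KiB pages and hypothetical `sketch_*`/`SK_*` names, checks that this byte offset is simply CR3 masked with 0xfe0, which is also what pdptrs_changed() computes with `(cr3 & 0xffffffe0ul) & (PAGE_SIZE - 1)`.

```c
#include <stdint.h>

#define SK_PAGE_SIZE 4096u /* assumes 4 KiB pages */

/* Index (in 64-bit entries) of the first PDPTE, as computed in load_pdptrs(). */
static unsigned sketch_pdpte_index(uint32_t cr3)
{
	return ((cr3 & (SK_PAGE_SIZE - 1)) >> 5) << 2;
}

/* Byte offset of the PDPT within its page. */
static unsigned sketch_pdpt_byte_offset(uint32_t cr3)
{
	return sketch_pdpte_index(cr3) * (unsigned)sizeof(uint64_t);
}
```

For example, CR3 = 0x12345fe7 gives index 508 and byte offset 0xfe0: the low five bits of CR3 are ignored, and the four 8-byte PDPTEs (32 bytes, exactly `sizeof(pdpte)`) start at the 32-byte boundary CR3 points into.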
2007-10-29 15:09:35 +00:00

2016-11-30 15:03:10 +00:00
bool pdptrs_changed(struct kvm_vcpu *vcpu)
2007-11-21 00:57:59 +00:00
{
2010-09-10 15:30:57 +00:00
	u64 pdpte[ARRAY_SIZE(vcpu->arch.walk_mmu->pdptrs)];
2007-11-21 00:57:59 +00:00
	bool changed = true;
2010-09-10 15:30:53 +00:00
	int offset;
	gfn_t gfn;
2007-11-21 00:57:59 +00:00
	int r;

	if (is_long_mode(vcpu) || !is_pae(vcpu))
		return false;

2009-05-31 19:58:47 +00:00
	if (!test_bit(VCPU_EXREG_PDPTR,
		      (unsigned long *)&vcpu->arch.regs_avail))
		return true;

2017-07-24 16:54:38 +00:00
	gfn = (kvm_read_cr3(vcpu) & 0xffffffe0ul) >> PAGE_SHIFT;
	offset = (kvm_read_cr3(vcpu) & 0xffffffe0ul) & (PAGE_SIZE - 1);
2010-09-10 15:30:53 +00:00
	r = kvm_read_nested_guest_page(vcpu, gfn, pdpte, offset, sizeof(pdpte),
				       PFERR_USER_MASK | PFERR_WRITE_MASK);
2007-11-21 00:57:59 +00:00
	if (r < 0)
		goto out;
2010-09-10 15:30:57 +00:00
	changed = memcmp(pdpte, vcpu->arch.walk_mmu->pdptrs, sizeof(pdpte)) != 0;
2007-11-21 00:57:59 +00:00
out:

	return changed;
}
2016-11-30 15:03:10 +00:00
EXPORT_SYMBOL_GPL(pdptrs_changed);
2007-11-21 00:57:59 +00:00

2010-06-10 14:02:14 +00:00
int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
2007-10-29 15:09:35 +00:00
{
2010-05-12 08:40:42 +00:00
	unsigned long old_cr0 = kvm_read_cr0(vcpu);
2015-05-13 06:42:28 +00:00
	unsigned long update_bits = X86_CR0_PG | X86_CR0_WP;
2010-05-12 08:40:42 +00:00

2010-01-06 17:10:22 +00:00
	cr0 |= X86_CR0_ET;

2010-01-21 13:28:46 +00:00
#ifdef CONFIG_X86_64
2010-04-28 16:15:31 +00:00
	if (cr0 & 0xffffffff00000000UL)
		return 1;
2010-01-21 13:28:46 +00:00
#endif

	cr0 &= ~CR0_RESERVED_BITS;
2007-10-29 15:09:35 +00:00

2010-04-28 16:15:31 +00:00
	if ((cr0 & X86_CR0_NW) && !(cr0 & X86_CR0_CD))
		return 1;
2007-10-29 15:09:35 +00:00

2010-04-28 16:15:31 +00:00
	if ((cr0 & X86_CR0_PG) && !(cr0 & X86_CR0_PE))
		return 1;
2007-10-29 15:09:35 +00:00

	if (!is_paging(vcpu) && (cr0 & X86_CR0_PG)) {
#ifdef CONFIG_X86_64
2010-01-21 13:31:50 +00:00
		if ((vcpu->arch.efer & EFER_LME)) {
2007-10-29 15:09:35 +00:00
			int cs_db, cs_l;

2010-04-28 16:15:31 +00:00
			if (!is_pae(vcpu))
				return 1;
2007-10-29 15:09:35 +00:00
			kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l);
2010-04-28 16:15:31 +00:00
			if (cs_l)
				return 1;
2007-10-29 15:09:35 +00:00
		} else
#endif
2010-09-10 15:30:57 +00:00
		if (is_pae(vcpu) && !load_pdptrs(vcpu, vcpu->arch.walk_mmu,
2010-12-05 15:30:00 +00:00
						 kvm_read_cr3(vcpu)))
2010-04-28 16:15:31 +00:00
			return 1;
2007-10-29 15:09:35 +00:00
	}

2012-07-02 01:18:48 +00:00
	if (!(cr0 & X86_CR0_PG) && kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE))
		return 1;
2007-10-29 15:09:35 +00:00
	kvm_x86_ops->set_cr0(vcpu, cr0);

2011-02-21 03:21:30 +00:00
	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
2010-11-12 06:47:01 +00:00
		kvm_clear_async_pf_completion_queue(vcpu);
2011-02-21 03:21:30 +00:00
		kvm_async_pf_hash_reset(vcpu);
	}
2010-11-12 06:47:01 +00:00

2010-05-12 08:40:42 +00:00
	if ((cr0 ^ old_cr0) & update_bits)
		kvm_mmu_reset_context(vcpu);
2015-06-15 08:55:21 +00:00

2015-11-04 11:54:41 +00:00
	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
2015-06-15 08:55:21 +00:00
		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);

2010-04-28 16:15:31 +00:00
	return 0;
}
2008-02-24 09:20:43 +00:00
EXPORT_SYMBOL_GPL(kvm_set_cr0);
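The tail of kvm_set_cr0() above XORs the old and new CR0 values to decide which follow-up work a write needs. The sketch below pulls that decision out into plain C so it can be run outside the kernel. It makes a few assumptions. update_bits is taken to be X86_CR0_PG | X86_CR0_WP, since its definition sits above this excerpt. The CR0 bit positions are the architectural ones. cr0_write_effects() and struct cr0_effects are hypothetical names. The noncoherent-DMA and KVM_X86_QUIRK_CD_NW_CLEARED inputs are modelled as booleans rather than the real helper calls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR0 bit positions (Intel SDM Vol. 3A, "Control Registers"). */
#define CR0_PE (1ULL << 0)
#define CR0_WP (1ULL << 16)
#define CR0_CD (1ULL << 30)
#define CR0_PG (1ULL << 31)

/* Hypothetical summary of the follow-up work kvm_set_cr0() performs. */
struct cr0_effects {
	bool reset_async_pf; /* PG flipped: flush async page-fault state */
	bool reset_mmu;      /* PG or WP flipped (assumed update_bits) */
	bool zap_all_gfns;   /* CD flipped, noncoherent DMA, quirk disabled */
};

static struct cr0_effects cr0_write_effects(uint64_t old_cr0, uint64_t cr0,
					    bool noncoherent_dma,
					    bool cd_nw_quirk)
{
	uint64_t changed = old_cr0 ^ cr0; /* bits that differ */
	struct cr0_effects e;

	e.reset_async_pf = (changed & CR0_PG) != 0;
	e.reset_mmu = (changed & (CR0_PG | CR0_WP)) != 0;
	e.zap_all_gfns = (changed & CR0_CD) != 0 && noncoherent_dma &&
			 !cd_nw_quirk;
	return e;
}
```

For example, enabling paging (PE to PE|PG) sets both reset_async_pf and reset_mmu. Toggling WP alone only rebuilds the MMU context.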
2007-10-29 15:09:35 +00:00

2008-02-24 09:20:43 +00:00
void kvm_lmsw(struct kvm_vcpu *vcpu, unsigned long msw)
2007-10-29 15:09:35 +00:00
{
2010-06-10 14:02:14 +00:00
	(void)kvm_set_cr0(vcpu, kvm_read_cr0_bits(vcpu, ~0x0eul) | (msw & 0x0f));
2007-10-29 15:09:35 +00:00
}
2008-02-24 09:20:43 +00:00
EXPORT_SYMBOL_GPL(kvm_lmsw);
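kvm_lmsw() keeps every CR0 bit except MP, EM and TS (mask 0x0e), then ORs in the low four bits of the operand. PE (bit 0) is carried over from the old value, so LMSW can set protected mode but never clear it. Below is a minimal sketch of just that arithmetic. It uses a hypothetical helper name and plain integers in place of the vcpu.

```c
#include <stdint.h>

/* Hypothetical helper: the value kvm_lmsw() hands to kvm_set_cr0().
 * Clear MP/EM/TS (bits 1-3), keep everything else (including PE),
 * then OR in PE/MP/EM/TS from the machine status word. */
static uint64_t lmsw_new_cr0(uint64_t cr0, uint16_t msw)
{
	return (cr0 & ~(uint64_t)0x0e) | (uint64_t)(msw & 0x0f);
}
```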
2007-10-29 15:09:35 +00:00

2013-04-16 02:30:13 +00:00
static void kvm_load_guest_xcr0(struct kvm_vcpu *vcpu)
{
	if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
			!vcpu->guest_xcr0_loaded) {
		/* kvm_set_xcr() also depends on this */
2017-12-13 12:51:32 +00:00
		if (vcpu->arch.xcr0 != host_xcr0)
			xsetbv(XCR_XFEATURE_ENABLED_MASK, vcpu->arch.xcr0);
2013-04-16 02:30:13 +00:00
		vcpu->guest_xcr0_loaded = 1;
	}
}

static void kvm_put_guest_xcr0(struct kvm_vcpu *vcpu)
{
	if (vcpu->guest_xcr0_loaded) {
		if (vcpu->arch.xcr0 != host_xcr0)
			xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
		vcpu->guest_xcr0_loaded = 0;
	}
}
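kvm_load_guest_xcr0() and kvm_put_guest_xcr0() swap XCR0 lazily. They do nothing unless the guest has CR4.OSXSAVE set, and they skip XSETBV entirely when the guest and host values already match. The sketch below models that state machine in user space. fake_xsetbv(), struct fake_vcpu and the write counter are hypothetical stand-ins, because the real XSETBV instruction cannot run outside ring 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the hardware register and XSETBV. */
static uint64_t hw_xcr0;
static unsigned int xsetbv_calls;

static void fake_xsetbv(uint64_t value)
{
	hw_xcr0 = value;
	xsetbv_calls++;
}

struct fake_vcpu {
	bool cr4_osxsave;    /* guest CR4.OSXSAVE */
	bool xcr0_loaded;    /* mirrors vcpu->guest_xcr0_loaded */
	uint64_t guest_xcr0; /* mirrors vcpu->arch.xcr0 */
};

static void load_guest_xcr0(struct fake_vcpu *v, uint64_t host_xcr0)
{
	if (v->cr4_osxsave && !v->xcr0_loaded) {
		if (v->guest_xcr0 != host_xcr0) /* skip a redundant write */
			fake_xsetbv(v->guest_xcr0);
		v->xcr0_loaded = true;
	}
}

static void put_guest_xcr0(struct fake_vcpu *v, uint64_t host_xcr0)
{
	if (v->xcr0_loaded) {
		if (v->guest_xcr0 != host_xcr0)
			fake_xsetbv(host_xcr0); /* restore the host value */
		v->xcr0_loaded = false;
	}
}
```

When guest and host XCR0 agree, a full load/put cycle performs no XSETBV at all. That is the cost the != host_xcr0 test above avoids.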

2015-01-19 14:33:39 +00:00
static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
2010-06-10 03:27:12 +00:00
{
2014-02-21 17:39:02 +00:00
	u64 xcr0 = xcr;
	u64 old_xcr0 = vcpu->arch.xcr0;
2013-10-17 14:50:46 +00:00
	u64 valid_bits;
2010-06-10 03:27:12 +00:00

	/* Only support XCR_XFEATURE_ENABLED_MASK(xcr0) now */
	if (index != XCR_XFEATURE_ENABLED_MASK)
		return 1;
2015-09-02 23:31:26 +00:00
	if (!(xcr0 & XFEATURE_MASK_FP))
2010-06-10 03:27:12 +00:00
		return 1;
2015-09-02 23:31:26 +00:00
	if ((xcr0 & XFEATURE_MASK_YMM) && !(xcr0 & XFEATURE_MASK_SSE))
2010-06-10 03:27:12 +00:00
		return 1;
2013-10-17 14:50:46 +00:00

	/*
	 * Do not allow the guest to set bits that we do not support
	 * saving. However, xcr0 bit 0 is always set, even if the
	 * emulated CPU does not support XSAVE (see fx_init).
	 */
2015-09-02 23:31:26 +00:00
	valid_bits = vcpu->arch.guest_supported_xcr0 | XFEATURE_MASK_FP;
2013-10-17 14:50:46 +00:00
	if (xcr0 & ~valid_bits)
2010-06-10 03:27:12 +00:00
		return 1;
2013-10-17 14:50:46 +00:00

2015-09-02 23:31:26 +00:00
	if ((!(xcr0 & XFEATURE_MASK_BNDREGS)) !=
	    (!(xcr0 & XFEATURE_MASK_BNDCSR)))
2014-02-24 10:58:09 +00:00
		return 1;

2015-09-02 23:31:26 +00:00
	if (xcr0 & XFEATURE_MASK_AVX512) {
		if (!(xcr0 & XFEATURE_MASK_YMM))
2014-10-22 09:35:24 +00:00
			return 1;
2015-09-02 23:31:26 +00:00
		if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
2014-10-22 09:35:24 +00:00
			return 1;
	}
2010-06-10 03:27:12 +00:00
	vcpu->arch.xcr0 = xcr0;
2014-02-21 17:39:02 +00:00

2015-09-02 23:31:26 +00:00
	if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)
2014-02-21 17:39:02 +00:00
		kvm_update_cpuid(vcpu);
2010-06-10 03:27:12 +00:00
	return 0;
}

int kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
{
2013-06-14 07:36:13 +00:00
	if (kvm_x86_ops->get_cpl(vcpu) != 0 ||
	    __kvm_set_xcr(vcpu, index, xcr)) {
2010-06-10 03:27:12 +00:00
		kvm_inject_gp(vcpu, 0);
		return 1;
	}
	return 0;
}
EXPORT_SYMBOL_GPL(kvm_set_xcr);
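Together, kvm_set_xcr() and __kvm_set_xcr() reject an XCR0 write with #GP unless two conditions hold: the guest is at CPL 0, and the value forms a consistent set of XSAVE state components. The sketch below restates those rules as a single predicate. The bit positions are the architectural XSAVE state-component numbers: x87 = 0, SSE = 1, AVX = 2, MPX BNDREGS/BNDCSR = 3/4, AVX-512 = 5..7. xsetbv_faults() and the XF_* names are hypothetical, and guest_supported stands in for vcpu->arch.guest_supported_xcr0.

```c
#include <stdbool.h>
#include <stdint.h>

/* XSAVE state-component bits (Intel SDM Vol. 1, chapter 13). */
#define XF_FP      (1ULL << 0)
#define XF_SSE     (1ULL << 1)
#define XF_YMM     (1ULL << 2)
#define XF_BNDREGS (1ULL << 3)
#define XF_BNDCSR  (1ULL << 4)
#define XF_AVX512  (0x7ULL << 5) /* opmask, ZMM_Hi256, Hi16_ZMM */

#define XCR0_INDEX 0 /* only XCR0 is accepted */

/* Hypothetical combined model: returns true when the write raises #GP. */
static bool xsetbv_faults(int cpl, uint32_t index, uint64_t xcr0,
			  uint64_t guest_supported)
{
	if (cpl != 0 || index != XCR0_INDEX)
		return true;
	if (!(xcr0 & XF_FP))                     /* x87 must stay enabled */
		return true;
	if ((xcr0 & XF_YMM) && !(xcr0 & XF_SSE)) /* AVX needs SSE */
		return true;
	if (xcr0 & ~(guest_supported | XF_FP))   /* unsupported component */
		return true;
	if (!(xcr0 & XF_BNDREGS) != !(xcr0 & XF_BNDCSR)) /* MPX: both or none */
		return true;
	if (xcr0 & XF_AVX512) {
		if (!(xcr0 & XF_YMM))
			return true;
		if ((xcr0 & XF_AVX512) != XF_AVX512) /* all three or none */
			return true;
	}
	return false;
}
```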

2010-06-10 14:02:15 +00:00
int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
2007-10-29 15:09:35 +00:00
{
2009-12-07 10:16:48 +00:00
	unsigned long old_cr4 = kvm_read_cr4(vcpu);
2015-05-11 14:55:21 +00:00
	unsigned long pdptr_bits = X86_CR4_PGE | X86_CR4_PSE | X86_CR4_PAE |
2016-03-22 08:51:21 +00:00
				   X86_CR4_SMEP | X86_CR4_SMAP | X86_CR4_PKE;
2015-05-11 14:55:21 +00:00

2010-04-28 16:15:31 +00:00
	if (cr4 & CR4_RESERVED_BITS)
		return 1;
2007-10-29 15:09:35 +00:00

2017-08-04 22:12:49 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) && (cr4 & X86_CR4_OSXSAVE))
2010-06-10 03:27:12 +00:00
		return 1;

2017-08-04 22:12:49 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_SMEP) && (cr4 & X86_CR4_SMEP))
2010-06-10 03:27:12 +00:00
		return 1;

2017-08-04 22:12:49 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_SMAP) && (cr4 & X86_CR4_SMAP))
2011-06-03 03:13:42 +00:00
		return 1;

2017-08-04 22:12:49 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_FSGSBASE) && (cr4 & X86_CR4_FSGSBASE))
2014-04-01 09:46:34 +00:00
		return 1;

2017-08-04 22:12:49 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_PKU) && (cr4 & X86_CR4_PKE))
2011-06-14 12:10:18 +00:00
		return 1;

2017-08-24 12:27:56 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_LA57) && (cr4 & X86_CR4_LA57))
2016-03-22 08:51:21 +00:00
		return 1;

2016-07-12 08:36:41 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_UMIP) && (cr4 & X86_CR4_UMIP))
		return 1;

2007-10-29 15:09:35 +00:00
	if (is_long_mode(vcpu)) {
2010-04-28 16:15:31 +00:00
		if (!(cr4 & X86_CR4_PAE))
			return 1;
2009-05-24 19:19:00 +00:00
	} else if (is_paging(vcpu) && (cr4 & X86_CR4_PAE)
		   && ((cr4 ^ old_cr4) & pdptr_bits)
2010-12-05 15:30:00 +00:00
		   && !load_pdptrs(vcpu, vcpu->arch.walk_mmu,
				   kvm_read_cr3(vcpu)))
2010-04-28 16:15:31 +00:00
		return 1;

2012-07-02 01:18:48 +00:00
	if ((cr4 & X86_CR4_PCIDE) && !(old_cr4 & X86_CR4_PCIDE)) {
2017-08-04 22:12:49 +00:00
		if (!guest_cpuid_has(vcpu, X86_FEATURE_PCID))
2012-07-02 01:18:48 +00:00
			return 1;

		/* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */
		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu))
			return 1;
	}

2011-05-25 20:03:24 +00:00
	if (kvm_x86_ops->set_cr4(vcpu, cr4))
2010-04-28 16:15:31 +00:00
		return 1;
2007-10-29 15:09:35 +00:00

2012-07-02 01:18:48 +00:00
	if (((cr4 ^ old_cr4) & pdptr_bits) ||
	    (!(cr4 & X86_CR4_PCIDE) && (old_cr4 & X86_CR4_PCIDE)))
2010-05-12 08:40:42 +00:00
		kvm_mmu_reset_context(vcpu);
2010-04-28 16:15:31 +00:00

2016-03-22 08:51:21 +00:00
	if ((cr4 ^ old_cr4) & (X86_CR4_OSXSAVE | X86_CR4_PKE))
2011-11-23 14:30:32 +00:00
		kvm_update_cpuid(vcpu);
2010-06-10 03:27:12 +00:00

2010-04-28 16:15:31 +00:00
	return 0;
}
2008-02-24 09:20:43 +00:00
EXPORT_SYMBOL_GPL(kvm_set_cr4);
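Most of kvm_set_cr4() is a list of CR4 bits that are legal only when the matching CPUID feature is exposed to the guest. On top of that come extra preconditions for turning on PCIDE. The sketch below expresses the CPUID gating as a table and keeps the long-mode PAE and PCIDE rules. Reserved-bit, PDPTR and vendor (kvm_x86_ops->set_cr4) checks are deliberately left out. struct guest_cpuid and cr4_write_rejected() are hypothetical; the CR4 bit positions are architectural.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions. */
#define CR4_PAE      (1ULL << 5)
#define CR4_UMIP     (1ULL << 11)
#define CR4_LA57     (1ULL << 12)
#define CR4_FSGSBASE (1ULL << 16)
#define CR4_PCIDE    (1ULL << 17)
#define CR4_OSXSAVE  (1ULL << 18)
#define CR4_SMEP     (1ULL << 20)
#define CR4_SMAP     (1ULL << 21)
#define CR4_PKE      (1ULL << 22)

/* Hypothetical guest CPUID view: one flag per feature gating a CR4 bit. */
struct guest_cpuid {
	bool xsave, smep, smap, fsgsbase, pku, la57, umip, pcid;
};

/* True if kvm_set_cr4()-style feature checks would reject the write. */
static bool cr4_write_rejected(const struct guest_cpuid *c, uint64_t old_cr4,
			       uint64_t cr4, uint64_t cr3, bool long_mode)
{
	const struct { uint64_t bit; bool present; } gate[] = {
		{ CR4_OSXSAVE, c->xsave },   { CR4_SMEP, c->smep },
		{ CR4_SMAP, c->smap },       { CR4_FSGSBASE, c->fsgsbase },
		{ CR4_PKE, c->pku },         { CR4_LA57, c->la57 },
		{ CR4_UMIP, c->umip },
	};

	for (unsigned int i = 0; i < sizeof(gate) / sizeof(gate[0]); i++)
		if ((cr4 & gate[i].bit) && !gate[i].present)
			return true;

	if (long_mode && !(cr4 & CR4_PAE)) /* long mode requires PAE */
		return true;

	/* Enabling PCIDE needs CPUID.PCID, CR3[11:0] == 0 and long mode. */
	if ((cr4 & CR4_PCIDE) && !(old_cr4 & CR4_PCIDE))
		if (!c->pcid || (cr3 & 0xfff) || !long_mode)
			return true;

	return false;
}
```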
2007-10-29 15:09:35 +00:00

2010-06-10 14:02:16 +00:00
int kvm_set_cr3(struct kvm_vcpu *vcpu, unsigned long cr3)
2007-10-29 15:09:35 +00:00
{
2018-06-27 21:59:15 +00:00
	bool skip_tlb_flush = false;
2014-11-10 12:53:25 +00:00
#ifdef CONFIG_X86_64
2018-05-04 18:37:13 +00:00
	bool pcid_enabled = kvm_read_cr4_bits(vcpu, X86_CR4_PCIDE);

2018-06-27 21:59:15 +00:00
	if (pcid_enabled) {
2018-06-27 21:59:21 +00:00
		skip_tlb_flush = cr3 & X86_CR3_PCID_NOFLUSH;
		cr3 &= ~X86_CR3_PCID_NOFLUSH;
2018-06-27 21:59:15 +00:00
	}
2014-11-10 12:53:25 +00:00
#endif
2014-11-02 09:54:52 +00:00

2010-12-05 15:30:00 +00:00
	if (cr3 == kvm_read_cr3(vcpu) && !pdptrs_changed(vcpu)) {
2018-06-27 21:59:18 +00:00
		if (!skip_tlb_flush) {
			kvm_mmu_sync_roots(vcpu);
2018-06-27 21:59:15 +00:00
			kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
2018-06-27 21:59:18 +00:00
		}
2010-04-28 16:15:31 +00:00
		return 0;
2007-11-21 00:57:59 +00:00
	}

2017-08-24 12:27:53 +00:00
	if (is_long_mode(vcpu) &&
2018-05-13 09:24:47 +00:00
	    (cr3 & rsvd_bits(cpuid_maxphyaddr(vcpu), 63)))
2017-08-24 12:27:53 +00:00
		return 1;
	else if (is_pae(vcpu) && is_paging(vcpu) &&
2014-05-10 07:24:34 +00:00
		   !load_pdptrs(vcpu, vcpu->arch.walk_mmu, cr3))
2014-04-18 00:35:09 +00:00
		return 1;
2007-10-29 15:09:35 +00:00

2018-06-27 21:59:15 +00:00
	kvm_mmu_new_cr3(vcpu, cr3, skip_tlb_flush);
2010-04-28 16:15:31 +00:00
	vcpu->arch.cr3 = cr3;
2010-12-05 16:56:11 +00:00
	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
2018-06-27 21:59:06 +00:00

2010-04-28 16:15:31 +00:00
	return 0;
}
2008-02-24 09:20:43 +00:00
EXPORT_SYMBOL_GPL(kvm_set_cr3);
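With CR4.PCIDE set, kvm_set_cr3() treats bit 63 of the written value as a "no flush" hint. It records the hint and clears the bit. In long mode it then rejects any value with bits set at or above the guest's MAXPHYADDR. The sketch below reproduces only those steps and assumes PCIDE = 1 and long mode. cr3_write_ok() is a hypothetical name. bit_range() follows the shape of the kernel's rsvd_bits() helper, which is defined outside this excerpt.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR3_PCID_NOFLUSH (1ULL << 63) /* "keep this PCID's TLB entries" */

/* Mask of bits s..e inclusive; requires 1 <= s <= e <= 63. */
static uint64_t bit_range(int s, int e)
{
	return ((1ULL << (e - s + 1)) - 1) << s;
}

/* Hypothetical model, assuming CR4.PCIDE = 1 and long mode. Returns false
 * where the kernel returns 1 (the caller then injects #GP). */
static bool cr3_write_ok(uint64_t cr3_in, int maxphyaddr,
			 uint64_t *cr3_out, bool *skip_flush)
{
	uint64_t cr3 = cr3_in & ~CR3_PCID_NOFLUSH;

	*skip_flush = (cr3_in & CR3_PCID_NOFLUSH) != 0;
	if (cr3 & bit_range(maxphyaddr, 63)) /* beyond guest physical width */
		return false;
	*cr3_out = cr3;
	return true;
}
```

The NOFLUSH bit is stripped before the reserved-bit test, so a hint in bit 63 never counts as a reserved-bit violation.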
2007-10-29 15:09:35 +00:00

2010-12-21 10:12:00 +00:00
int kvm_set_cr8(struct kvm_vcpu *vcpu, unsigned long cr8)
2007-10-29 15:09:35 +00:00
{
2010-04-28 16:15:31 +00:00
	if (cr8 & CR8_RESERVED_BITS)
		return 1;
2015-07-29 10:05:37 +00:00
	if (lapic_in_kernel(vcpu))
2007-10-29 15:09:35 +00:00
		kvm_lapic_set_tpr(vcpu, cr8);
	else
2007-12-13 15:50:52 +00:00
		vcpu->arch.cr8 = cr8;
2010-04-28 16:15:31 +00:00
	return 0;
}
2008-02-24 09:20:43 +00:00
EXPORT_SYMBOL_GPL(kvm_set_cr8);
2007-10-29 15:09:35 +00:00

2008-02-24 09:20:43 +00:00
unsigned long kvm_get_cr8(struct kvm_vcpu *vcpu)
2007-10-29 15:09:35 +00:00
{
2015-07-29 10:05:37 +00:00
	if (lapic_in_kernel(vcpu))
2007-10-29 15:09:35 +00:00
		return kvm_lapic_get_cr8(vcpu);
	else
2007-12-13 15:50:52 +00:00
		return vcpu->arch.cr8;
2007-10-29 15:09:35 +00:00
}
2008-02-24 09:20:43 +00:00
EXPORT_SYMBOL_GPL(kvm_get_cr8);
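kvm_set_cr8() and kvm_get_cr8() either forward to the in-kernel local APIC or keep CR8 in vcpu->arch.cr8. Architecturally, CR8[3:0] mirrors the APIC task-priority class TPR[7:4]. The sketch below shows that mapping and the reserved-bit test. It assumes CR8_RESERVED_BITS covers every bit above bit 3, since its definition is not in this excerpt. The helper names are hypothetical, and the kernel's own kvm_lapic_set_tpr() and kvm_lapic_get_cr8() may handle additional TPR bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define CR8_TPR_MASK 0xfULL /* only CR8[3:0] is defined */

/* CR8[3:0] maps to the priority class in TPR[7:4]. */
static uint32_t cr8_to_tpr(uint64_t cr8)
{
	return (uint32_t)(cr8 & CR8_TPR_MASK) << 4;
}

static uint64_t tpr_to_cr8(uint32_t tpr)
{
	return (tpr & 0xf0) >> 4; /* the sub-class TPR[3:0] is dropped */
}

/* Mirrors the reserved-bit test at the top of kvm_set_cr8(). */
static bool cr8_write_valid(uint64_t cr8)
{
	return (cr8 & ~CR8_TPR_MASK) == 0;
}
```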
2007-10-29 15:09:35 +00:00

2015-04-02 00:10:37 +00:00
static void kvm_update_dr0123(struct kvm_vcpu *vcpu)
{
	int i;

	if (!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)) {
		for (i = 0; i < KVM_NR_DB_REGS; i++)
			vcpu->arch.eff_db[i] = vcpu->arch.db[i];
		vcpu->arch.switch_db_regs |= KVM_DEBUGREG_RELOAD;
	}
}

2014-01-04 17:47:16 +00:00
static void kvm_update_dr6(struct kvm_vcpu *vcpu)
{
	if (!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP))
		kvm_x86_ops->set_dr6(vcpu, vcpu->arch.dr6);
}

2012-09-21 03:42:55 +00:00
static void kvm_update_dr7(struct kvm_vcpu *vcpu)
{
	unsigned long dr7;

	if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
		dr7 = vcpu->arch.guest_debug_dr7;
	else
		dr7 = vcpu->arch.dr7;
	kvm_x86_ops->set_dr7(vcpu, dr7);
2014-02-21 08:55:56 +00:00
	vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_BP_ENABLED;
	if (dr7 & DR7_BP_EN_MASK)
		vcpu->arch.switch_db_regs |= KVM_DEBUGREG_BP_ENABLED;
2012-09-21 03:42:55 +00:00
}

2014-07-15 14:37:46 +00:00
static u64 kvm_dr6_fixed(struct kvm_vcpu *vcpu)
{
	u64 fixed = DR6_FIXED_1;

2017-08-04 22:12:49 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_RTM))
2014-07-15 14:37:46 +00:00
		fixed |= DR6_RTM;
	return fixed;
}

2010-04-28 16:15:32 +00:00
static int __kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
2010-04-13 07:05:23 +00:00
{
	switch (dr) {
	case 0 ... 3:
		vcpu->arch.db[dr] = val;
		if (!(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP))
			vcpu->arch.eff_db[dr] = val;
		break;
	case 4:
		/* fall through */
	case 6:
2010-04-28 16:15:32 +00:00
		if (val & 0xffffffff00000000ULL)
			return -1; /* #GP */
2014-07-15 14:37:46 +00:00
		vcpu->arch.dr6 = (val & DR6_VOLATILE) | kvm_dr6_fixed(vcpu);
2014-01-04 17:47:16 +00:00
		kvm_update_dr6(vcpu);
2010-04-13 07:05:23 +00:00
		break;
	case 5:
		/* fall through */
	default: /* 7 */
2010-04-28 16:15:32 +00:00
		if (val & 0xffffffff00000000ULL)
			return -1; /* #GP */
2010-04-13 07:05:23 +00:00
		vcpu->arch.dr7 = (val & DR7_VOLATILE) | DR7_FIXED_1;
2012-09-21 03:42:55 +00:00
		kvm_update_dr7(vcpu);
2010-04-13 07:05:23 +00:00
		break;
	}

	return 0;
}
2010-04-28 16:15:32 +00:00

int kvm_set_dr(struct kvm_vcpu *vcpu, int dr, unsigned long val)
{
2014-10-02 22:10:05 +00:00
	if (__kvm_set_dr(vcpu, dr, val)) {
2010-04-28 16:15:32 +00:00
		kvm_inject_gp(vcpu, 0);
2014-10-02 22:10:05 +00:00
		return 1;
	}
	return 0;
2010-04-28 16:15:32 +00:00
}
2010-04-13 07:05:23 +00:00
EXPORT_SYMBOL_GPL(kvm_set_dr);
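__kvm_set_dr() does three things to a debug-register write. It aliases DR4/DR5 to DR6/DR7 via fall-through, rejects values with any of the upper 32 bits set, and forces the architecturally fixed bits of DR6 and DR7. kvm_update_dr7() then tests the breakpoint-enable bits. The sketch below models those value transformations. The DR6_*/DR7_* constant values are assumptions recalled from asm/kvm_host.h of this period, since they are not defined in this excerpt. All function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED values (recalled from asm/kvm_host.h of this era, not shown here). */
#define DR6_RTM        (1ULL << 16)
#define DR6_FIXED_1    0xfffe0ff0ULL
#define DR6_VOLATILE   0x0001e00fULL
#define DR7_FIXED_1    0x00000400ULL
#define DR7_VOLATILE   0xffff2bffULL
#define DR7_BP_EN_MASK 0x000000ffULL /* L0-L3 and G0-G3 enable bits */

/* DR4/DR5 fall through to the DR6/DR7 cases in __kvm_set_dr(). */
static int canonical_dr(int dr)
{
	return dr == 4 ? 6 : dr == 5 ? 7 : dr;
}

/* Model of the DR6 case; false means the kernel returns -1 (#GP). */
static bool set_dr6_value(uint64_t val, bool guest_has_rtm, uint64_t *dr6)
{
	uint64_t fixed = DR6_FIXED_1;

	if (val >> 32)
		return false;
	if (!guest_has_rtm)
		fixed |= DR6_RTM; /* DR6.RTM is active-low: pin it to 1 */
	*dr6 = (val & DR6_VOLATILE) | fixed;
	return true;
}

/* Model of the DR7 case. */
static bool set_dr7_value(uint64_t val, uint64_t *dr7)
{
	if (val >> 32)
		return false;
	*dr7 = (val & DR7_VOLATILE) | DR7_FIXED_1;
	return true;
}

/* kvm_update_dr7() sets KVM_DEBUGREG_BP_ENABLED when this is true. */
static bool dr7_has_enabled_bp(uint64_t dr7)
{
	return (dr7 & DR7_BP_EN_MASK) != 0;
}
```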

2014-10-02 22:10:05 +00:00
int kvm_get_dr(struct kvm_vcpu *vcpu, int dr, unsigned long *val)
2010-04-13 07:05:23 +00:00
{
	switch (dr) {
	case 0 ... 3:
		*val = vcpu->arch.db[dr];
		break;
	case 4:
		/* fall through */
	case 6:
2014-01-04 17:47:16 +00:00
		if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP)
			*val = vcpu->arch.dr6;
		else
			*val = kvm_x86_ops->get_dr6(vcpu);
2010-04-13 07:05:23 +00:00
		break;
	case 5:
		/* fall through */
	default: /* 7 */
		*val = vcpu->arch.dr7;
		break;
	}
2010-04-28 16:15:32 +00:00
	return 0;
}
2010-04-13 07:05:23 +00:00
EXPORT_SYMBOL_GPL(kvm_get_dr);

2011-11-10 12:57:23 +00:00
bool kvm_rdpmc(struct kvm_vcpu *vcpu)
{
	u32 ecx = kvm_register_read(vcpu, VCPU_REGS_RCX);
	u64 data;
	int err;

2015-06-19 11:44:45 +00:00
	err = kvm_pmu_rdpmc(vcpu, ecx, &data);
2011-11-10 12:57:23 +00:00
	if (err)
		return err;
	kvm_register_write(vcpu, VCPU_REGS_RAX, (u32)data);
	kvm_register_write(vcpu, VCPU_REGS_RDX, data >> 32);
	return err;
}
EXPORT_SYMBOL_GPL(kvm_rdpmc);
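kvm_rdpmc() returns the 64-bit counter value the same way the RDPMC instruction does. The low 32 bits go in EAX and the high 32 bits in EDX, and the (u32) cast leaves the upper half of RAX zero. Below is a minimal sketch of that split, using a hypothetical struct and helper.

```c
#include <stdint.h>

/* Hypothetical container for the two 32-bit halves RDPMC returns. */
struct edx_eax {
	uint32_t eax; /* bits 31:0 of the counter */
	uint32_t edx; /* bits 63:32 of the counter */
};

static struct edx_eax split_counter(uint64_t data)
{
	struct edx_eax r;

	r.eax = (uint32_t)data;
	r.edx = (uint32_t)(data >> 32);
	return r;
}
```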

2007-10-10 15:16:19 +00:00
/*
 * List of msr numbers which we expose to userspace through KVM_GET_MSRS
 * and KVM_SET_MSRS, and KVM_GET_MSR_INDEX_LIST.
 *
 * This list is modified at module load time to reflect the
2009-10-06 17:24:50 +00:00
 * capabilities of the host cpu. This capabilities test skips MSRs that are
2015-05-05 10:08:55 +00:00
 * kvm-specific. Those are put in emulated_msrs; filtering of emulated_msrs
 * may depend on host virtualization features rather than host cpu features.
2007-10-10 15:16:19 +00:00
 */
2009-10-06 17:24:50 +00:00

2007-10-10 15:16:19 +00:00
static u32 msrs_to_save[] = {
	MSR_IA32_SYSENTER_CS, MSR_IA32_SYSENTER_ESP, MSR_IA32_SYSENTER_EIP,
2010-07-17 13:03:26 +00:00
	MSR_STAR,
2007-10-10 15:16:19 +00:00
#ifdef CONFIG_X86_64
	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
#endif
2013-07-08 11:12:35 +00:00
	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
2015-11-12 13:49:17 +00:00
	MSR_IA32_FEATURE_CONTROL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
2018-02-01 21:59:45 +00:00
	MSR_IA32_SPEC_CTRL, MSR_IA32_ARCH_CAPABILITIES
2007-10-10 15:16:19 +00:00
};

static unsigned num_msrs_to_save;
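The comment above says msrs_to_save is trimmed at module load time to match what the host CPU supports, and num_msrs_to_save records how many entries survive. One plausible way to do that trimming is an in-place compaction that preserves order. In the sketch below, the probe callback is a hypothetical stand-in for the host capability check, and filter_msr_list() is not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical probe: returns non-zero if the host supports this MSR. */
typedef int (*msr_probe_fn)(uint32_t index);

/* Compact 'msrs' in place, keeping only the entries the probe accepts,
 * and return the new length (the role of num_msrs_to_save).
 * Surviving entries keep their relative order. */
static size_t filter_msr_list(uint32_t *msrs, size_t n, msr_probe_fn probe)
{
	size_t i, j = 0;

	for (i = 0; i < n; i++) {
		if (!probe(msrs[i]))
			continue;
		if (j < i)
			msrs[j] = msrs[i];
		j++;
	}
	return j;
}

/* Toy probe for the example: pretend only even indices exist. */
static int probe_even(uint32_t index)
{
	return (index & 1) == 0;
}
```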

2015-05-05 10:08:55 +00:00
static u32 emulated_msrs[] = {
	MSR_KVM_SYSTEM_TIME, MSR_KVM_WALL_CLOCK,
	MSR_KVM_SYSTEM_TIME_NEW, MSR_KVM_WALL_CLOCK_NEW,
	HV_X64_MSR_GUEST_OS_ID, HV_X64_MSR_HYPERCALL,
	HV_X64_MSR_TIME_REF_COUNT, HV_X64_MSR_REFERENCE_TSC,
2017-07-26 11:32:59 +00:00
	HV_X64_MSR_TSC_FREQUENCY, HV_X64_MSR_APIC_FREQUENCY,
2015-07-03 12:01:37 +00:00
	HV_X64_MSR_CRASH_P0, HV_X64_MSR_CRASH_P1, HV_X64_MSR_CRASH_P2,
	HV_X64_MSR_CRASH_P3, HV_X64_MSR_CRASH_P4, HV_X64_MSR_CRASH_CTL,
2015-09-16 09:29:48 +00:00
	HV_X64_MSR_RESET,
2015-09-16 09:29:49 +00:00
	HV_X64_MSR_VP_INDEX,
2015-09-16 09:29:50 +00:00
	HV_X64_MSR_VP_RUNTIME,
2015-11-10 12:36:34 +00:00
	HV_X64_MSR_SCONTROL,
2015-11-30 16:22:21 +00:00
	HV_X64_MSR_STIMER0_CONFIG,
2018-03-20 14:02:07 +00:00
	HV_X64_MSR_VP_ASSIST_PAGE,
2018-03-01 14:15:12 +00:00
	HV_X64_MSR_REENLIGHTENMENT_CONTROL, HV_X64_MSR_TSC_EMULATION_CONTROL,
	HV_X64_MSR_TSC_EMULATION_STATUS,

	MSR_KVM_ASYNC_PF_EN, MSR_KVM_STEAL_TIME,
2015-05-05 10:08:55 +00:00
	MSR_KVM_PV_EOI_EN,

2012-11-29 20:42:50 +00:00
	MSR_IA32_TSC_ADJUST,
2011-09-22 08:55:52 +00:00
	MSR_IA32_TSCDEADLINE,
2007-10-10 15:16:19 +00:00
	MSR_IA32_MISC_ENABLE,
2010-07-07 11:09:38 +00:00
	MSR_IA32_MCG_STATUS,
	MSR_IA32_MCG_CTL,
2016-06-22 06:59:56 +00:00
	MSR_IA32_MCG_EXT_CTL,
2015-05-07 09:36:11 +00:00
	MSR_IA32_SMBASE,
2017-11-15 11:43:14 +00:00
	MSR_SMI_COUNT,
2017-03-20 08:16:28 +00:00
	MSR_PLATFORM_INFO,
	MSR_MISC_FEATURES_ENABLES,
2018-05-10 20:06:39 +00:00
	MSR_AMD64_VIRT_SPEC_CTRL,
2007-10-10 15:16:19 +00:00
};

2015-05-05 10:08:55 +00:00
static unsigned num_emulated_msrs;

2018-02-21 19:39:51 +00:00
/*
 * List of msr numbers which are used to expose MSR-based features that
 * can be used by a hypervisor to validate requested CPU features.
 */
static u32 msr_based_features[] = {
2018-02-26 12:40:09 +00:00
	MSR_IA32_VMX_BASIC,
	MSR_IA32_VMX_TRUE_PINBASED_CTLS,
	MSR_IA32_VMX_PINBASED_CTLS,
	MSR_IA32_VMX_TRUE_PROCBASED_CTLS,
	MSR_IA32_VMX_PROCBASED_CTLS,
	MSR_IA32_VMX_TRUE_EXIT_CTLS,
	MSR_IA32_VMX_EXIT_CTLS,
	MSR_IA32_VMX_TRUE_ENTRY_CTLS,
	MSR_IA32_VMX_ENTRY_CTLS,
	MSR_IA32_VMX_MISC,
	MSR_IA32_VMX_CR0_FIXED0,
	MSR_IA32_VMX_CR0_FIXED1,
	MSR_IA32_VMX_CR4_FIXED0,
	MSR_IA32_VMX_CR4_FIXED1,
	MSR_IA32_VMX_VMCS_ENUM,
	MSR_IA32_VMX_PROCBASED_CTLS2,
	MSR_IA32_VMX_EPT_VPID_CAP,
	MSR_IA32_VMX_VMFUNC,

2018-02-23 23:18:20 +00:00
	MSR_F10H_DECFG,
2018-02-28 06:03:31 +00:00
	MSR_IA32_UCODE_REV,
2018-06-25 12:04:37 +00:00
	MSR_IA32_ARCH_CAPABILITIES,
2018-02-21 19:39:51 +00:00
};

static unsigned int num_msr_based_features;

2018-08-05 14:07:47 +00:00
u64 kvm_get_arch_capabilities(void)
{
	u64 data;

	rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);

	/*
	 * If we're doing cache flushes (either "always" or "cond")
	 * we will do one whenever the guest does a vmlaunch/vmresume.
	 * If an outer hypervisor is doing the cache flush for us
	 * (VMENTER_L1D_FLUSH_NESTED_VM), we can safely pass that
	 * capability to the guest too, and if EPT is disabled we're not
	 * vulnerable. Overall, only VMENTER_L1D_FLUSH_NEVER will
	 * require a nested hypervisor to do a flush of its own.
	 */
	if (l1tf_vmx_mitigation != VMENTER_L1D_FLUSH_NEVER)
		data |= ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;

	return data;
}
EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);
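kvm_get_arch_capabilities() starts from the host's IA32_ARCH_CAPABILITIES value. It also sets ARCH_CAP_SKIP_VMENTRY_L1DFLUSH whenever KVM itself flushes the L1D on VM entry, which means any mode other than VMENTER_L1D_FLUSH_NEVER. The sketch below models just that decision. The bit position (bit 3) is assumed to match the kernel's definition, and the enum is a simplified, hypothetical subset of the real mitigation modes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror the kernel's ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (bit 3);
 * check against msr-index.h before relying on it. */
#define SKETCH_ARCH_CAP_SKIP_VMENTRY_L1DFLUSH (1ULL << 3)

/* Simplified, hypothetical subset of the L1D flush mitigation modes. */
enum sketch_l1d_flush {
	SKETCH_L1D_FLUSH_NEVER,
	SKETCH_L1D_FLUSH_COND,
	SKETCH_L1D_FLUSH_ALWAYS,
	SKETCH_L1D_FLUSH_EPT_DISABLED,
};

/* Start from the host value and advertise "no flush needed on entry"
 * unless KVM never flushes on its own. */
static uint64_t sketch_arch_capabilities(uint64_t host_caps,
					 enum sketch_l1d_flush mode)
{
	uint64_t data = host_caps;

	if (mode != SKETCH_L1D_FLUSH_NEVER)
		data |= SKETCH_ARCH_CAP_SKIP_VMENTRY_L1DFLUSH;
	return data;
}
```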

2018-02-28 06:03:30 +00:00
static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
{
	switch (msr->index) {
2018-06-25 12:04:37 +00:00
	case MSR_IA32_ARCH_CAPABILITIES:
2018-08-05 14:07:47 +00:00
		msr->data = kvm_get_arch_capabilities();
		break;
	case MSR_IA32_UCODE_REV:
2018-06-25 12:04:37 +00:00
		rdmsrl_safe(msr->index, &msr->data);
2018-02-28 06:03:31 +00:00
		break;
2018-02-28 06:03:30 +00:00
	default:
		if (kvm_x86_ops->get_msr_feature(msr))
			return 1;
	}
	return 0;
}

2018-02-21 19:39:51 +00:00
static int do_get_msr_feature(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
{
	struct kvm_msr_entry msr;
2018-02-28 06:03:30 +00:00
	int r;
2018-02-21 19:39:51 +00:00

	msr.index = index;
2018-02-28 06:03:30 +00:00
	r = kvm_get_msr_feature(&msr);
	if (r)
		return r;
2018-02-21 19:39:51 +00:00

	*data = msr.data;

	return 0;
}

2013-04-20 08:52:36 +00:00
bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer)
2007-10-30 17:44:17 +00:00
{
2010-05-06 09:38:43 +00:00
	if (efer & efer_reserved_bits)
2013-04-20 08:52:36 +00:00
		return false;
2007-10-30 17:44:17 +00:00

2017-08-04 22:12:50 +00:00
	if (efer & EFER_FFXSR && !guest_cpuid_has(vcpu, X86_FEATURE_FXSR_OPT))
2013-04-20 08:52:36 +00:00
		return false;
2009-02-02 15:23:51 +00:00

2017-08-04 22:12:50 +00:00
	if (efer & EFER_SVME && !guest_cpuid_has(vcpu, X86_FEATURE_SVM))
2013-04-20 08:52:36 +00:00
		return false;
2008-11-25 19:17:11 +00:00

2013-04-20 08:52:36 +00:00
	return true;
}
EXPORT_SYMBOL_GPL(kvm_valid_efer);

static int set_efer(struct kvm_vcpu *vcpu, u64 efer)
{
	u64 old_efer = vcpu->arch.efer;

	if (!kvm_valid_efer(vcpu, efer))
		return 1;

	if (is_paging(vcpu)
	    && (vcpu->arch.efer & EFER_LME) != (efer & EFER_LME))
		return 1;

2007-10-30 17:44:17 +00:00
	efer &= ~EFER_LMA;
2010-01-21 13:31:50 +00:00
	efer |= vcpu->arch.efer & EFER_LMA;
2007-10-30 17:44:17 +00:00

2010-05-12 08:40:40 +00:00
	kvm_x86_ops->set_efer(vcpu, efer);

2010-05-12 08:40:42 +00:00
	/* Update reserved bits */
	if ((efer ^ old_efer) & EFER_NX)
		kvm_mmu_reset_context(vcpu);

2010-05-06 09:38:43 +00:00
	return 0;
2007-10-30 17:44:17 +00:00
}
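set_efer() refuses to toggle EFER.LME while paging is enabled. It keeps the current LMA bit regardless of what was written, because LMA is set by the CPU rather than by software. It also resets the MMU context when NX changes, because NX alters which page-table bits are reserved. The sketch below models only those three rules on a hypothetical, trimmed-down vCPU struct. Reserved-bit validation and the vendor set_efer hook are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural EFER bit positions used here. */
#define EFER_LME (1ULL << 8)
#define EFER_LMA (1ULL << 10)
#define EFER_NX  (1ULL << 11)

/* Hypothetical vCPU state for the sketch. */
struct vcpu_sketch {
	uint64_t efer;
	bool paging;		/* CR0.PG is set */
	int mmu_resets;		/* counts kvm_mmu_reset_context() stand-ins */
};

/* Returns 0 on success, 1 if the write must be refused
 * (the caller would then inject #GP). */
static int set_efer_sketch(struct vcpu_sketch *v, uint64_t efer)
{
	uint64_t old_efer = v->efer;

	/* LME may not change while paging is enabled. */
	if (v->paging && ((old_efer ^ efer) & EFER_LME))
		return 1;

	/* LMA is owned by the CPU: ignore the written value, keep the old one. */
	efer &= ~EFER_LMA;
	efer |= old_efer & EFER_LMA;
	v->efer = efer;

	/* A change in NX changes which page-table bits are reserved. */
	if ((efer ^ old_efer) & EFER_NX)
		v->mmu_resets++;
	return 0;
}
```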

2008-01-31 13:57:37 +00:00
void kvm_enable_efer_bits(u64 mask)
{
	efer_reserved_bits &= ~mask;
}
EXPORT_SYMBOL_GPL(kvm_enable_efer_bits);
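kvm_valid_efer() rejects any bit still set in efer_reserved_bits, and kvm_enable_efer_bits() clears bits from that mask as host features are discovered. Even an unreserved bit is refused if the guest's CPUID does not advertise the matching feature. The sketch below is a self-contained version of both pieces. The EFER bit positions are architectural. The initial reserved mask (everything except SCE, LME and LMA, as assumed for a 64-bit host) and struct guest_features are assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural EFER bit positions. */
#define EFER_SCE   (1ULL << 0)
#define EFER_LME   (1ULL << 8)
#define EFER_LMA   (1ULL << 10)
#define EFER_NX    (1ULL << 11)
#define EFER_SVME  (1ULL << 12)
#define EFER_FFXSR (1ULL << 14)

/* Assumed starting point: everything except SCE/LME/LMA is reserved
 * until a feature is enabled. */
static uint64_t reserved = ~(EFER_SCE | EFER_LME | EFER_LMA);

static void enable_efer_bits(uint64_t mask)
{
	reserved &= ~mask;
}

/* Hypothetical view of the guest's CPUID feature bits. */
struct guest_features {
	bool fxsr_opt;	/* X86_FEATURE_FXSR_OPT */
	bool svm;	/* X86_FEATURE_SVM */
};

static bool valid_efer(const struct guest_features *f, uint64_t efer)
{
	if (efer & reserved)
		return false;
	/* Host-supported bits can still be hidden from this guest. */
	if ((efer & EFER_FFXSR) && !f->fxsr_opt)
		return false;
	if ((efer & EFER_SVME) && !f->svm)
		return false;
	return true;
}
```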

2007-10-30 17:44:17 +00:00
/*
 * Writes msr value into the appropriate "register".
 * Returns 0 on success, non-0 otherwise.
 * Assumes vcpu_load() was already called.
 */
2012-11-29 20:42:12 +00:00
int kvm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
2007-10-30 17:44:17 +00:00
{
KVM: x86: Check non-canonical addresses upon WRMSR
Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is
written to certain MSRs. The behavior is "almost" identical for AMD and Intel
(ignoring MSRs that are not implemented in either architecture since they would
anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if
non-canonical address is written on Intel but not on AMD (which ignores the top
32-bits).
Accordingly, this patch injects a #GP on the MSRs which behave identically on
Intel and AMD. To eliminate the differences between the architectures, the
value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to
canonical value before writing instead of injecting a #GP.
Some references from Intel and AMD manuals:
According to Intel SDM description of WRMSR instruction #GP is expected on
WRMSR "If the source register contains a non-canonical address and ECX
specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE,
IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP."
According to the AMD instruction manual:
LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the
LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical
form, a general-protection exception (#GP) occurs."
IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the
base field must be in canonical form or a #GP fault will occur."
IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must
be in canonical form."
This patch fixes CVE-2014-3610.
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2014-09-16 00:24:05 +00:00
	switch (msr->index) {
	case MSR_FS_BASE:
	case MSR_GS_BASE:
	case MSR_KERNEL_GS_BASE:
	case MSR_CSTAR:
	case MSR_LSTAR:
2017-08-24 12:27:56 +00:00
		if (is_noncanonical_address(msr->data, vcpu))
2014-09-16 00:24:05 +00:00
			return 1;
		break;
	case MSR_IA32_SYSENTER_EIP:
	case MSR_IA32_SYSENTER_ESP:
		/*
		 * IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if a
		 * non-canonical address is written on Intel but not on
		 * AMD (which ignores the top 32 bits, because it does
		 * not implement 64-bit SYSENTER).
		 *
		 * 64-bit code should hence be able to write a non-canonical
		 * value on AMD. Making the address canonical ensures that
		 * vmentry does not fail on Intel after writing a non-canonical
		 * value, and that something deterministic happens if the guest
		 * invokes 64-bit SYSENTER.
		 */
2017-08-24 12:27:56 +00:00
		msr->data = get_canonical(msr->data, vcpu_virt_addr_bits(vcpu));
2014-09-16 00:24:05 +00:00
	}
2012-11-29 20:42:12 +00:00
	return kvm_x86_ops->set_msr(vcpu, msr);
2007-10-30 17:44:17 +00:00
}
2014-09-16 00:24:05 +00:00
EXPORT_SYMBOL_GPL(kvm_set_msr);
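For the MSRs in the first case group, kvm_set_msr() fails the write when the value is not canonical. For SYSENTER_EIP/ESP it canonicalizes the value instead. An address is canonical when bits 63 down to vaddr_bits - 1 are all copies of the same bit. The sketch below implements both operations and uses 48-bit virtual addresses (4-level paging) in the examples. canonicalize() and noncanonical() are illustrative stand-ins for get_canonical() and is_noncanonical_address(), not their actual definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend bit (vaddr_bits - 1) into the upper bits; vaddr_bits must be
 * in [1, 64]. Relies on two implementation-defined behaviors that GCC and
 * Clang define as expected: uint64_t -> int64_t wraps, and >> on a negative
 * value is an arithmetic shift. */
static uint64_t canonicalize(uint64_t la, unsigned int vaddr_bits)
{
	unsigned int shift = 64 - vaddr_bits;

	return (uint64_t)((int64_t)(la << shift) >> shift);
}

/* An address is non-canonical if sign extension would change it. */
static bool noncanonical(uint64_t la, unsigned int vaddr_bits)
{
	return canonicalize(la, vaddr_bits) != la;
}
```

With 57-bit addresses (5-level paging), the same value 0x0000800000000000 is canonical, which is why the bit count comes from the vCPU rather than being a constant.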
2007-10-30 17:44:17 +00:00

2007-10-11 17:16:52 +00:00
/*
 * Adapt kvm_get_msr() and kvm_set_msr() to msr_io()'s calling convention
 */
2015-04-08 13:30:38 +00:00
static int do_get_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
{
	struct msr_data msr;
	int r;

	msr.index = index;
	msr.host_initiated = true;
	r = kvm_get_msr(vcpu, &msr);
	if (r)
		return r;

	*data = msr.data;
	return 0;
}

2007-10-11 17:16:52 +00:00
static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
{
2012-11-29 20:42:12 +00:00
	struct msr_data msr;

	msr.data = *data;
	msr.index = index;
	msr.host_initiated = true;
	return kvm_set_msr(vcpu, &msr);
2007-10-11 17:16:52 +00:00
}
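do_get_msr() and do_set_msr() share one signature so that a single loop over a userspace array of index/value pairs can drive either direction. KVM_GET_MSRS and KVM_SET_MSRS report how many entries were processed. The sketch below assumes that loop stops at the first failing entry. struct msr_entry, msr_io_sketch() and the toy backing store are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct kvm_msr_entry: an index/value pair. */
struct msr_entry {
	uint32_t index;
	uint64_t data;
};

/* Same shape as do_get_msr()/do_set_msr(): non-zero return means failure. */
typedef int (*msr_op_fn)(void *ctx, unsigned int index, uint64_t *data);

/* Apply 'op' to each entry in order, stopping at the first failure.
 * Returns the number of entries processed successfully. */
static size_t msr_io_sketch(void *ctx, struct msr_entry *entries, size_t n,
			    msr_op_fn op)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (op(ctx, entries[i].index, &entries[i].data))
			break;
	return i;
}

/* Toy "get" operation over four registers indexed 0..3. */
static int toy_get(void *ctx, unsigned int index, uint64_t *data)
{
	const uint64_t *regs = ctx;

	if (index >= 4)
		return 1;
	*data = regs[index];
	return 0;
}
```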

2012-11-28 01:29:00 +00:00
#ifdef CONFIG_X86_64
struct pvclock_gtod_data {
	seqcount_t seq;

	struct { /* extract of a clocksource struct */
		int vclock_mode;
2016-12-21 19:32:01 +00:00
		u64 cycle_last;
		u64 mask;
2012-11-28 01:29:00 +00:00
		u32 mult;
		u32 shift;
	} clock;

2014-07-16 21:04:54 +00:00
	u64 boot_ns;
	u64 nsec_base;
2017-01-24 17:09:39 +00:00
	u64 wall_time_sec;
2012-11-28 01:29:00 +00:00
};

static struct pvclock_gtod_data pvclock_gtod_data;

static void update_pvclock_gtod(struct timekeeper *tk)
{
	struct pvclock_gtod_data *vdata = &pvclock_gtod_data;
2014-07-16 21:04:54 +00:00
	u64 boot_ns;

2015-03-19 09:09:06 +00:00
	boot_ns = ktime_to_ns(ktime_add(tk->tkr_mono.base, tk->offs_boot));
2012-11-28 01:29:00 +00:00

	write_seqcount_begin(&vdata->seq);

	/* copy pvclock gtod data */
2015-03-19 09:09:06 +00:00
	vdata->clock.vclock_mode = tk->tkr_mono.clock->archdata.vclock_mode;
	vdata->clock.cycle_last = tk->tkr_mono.cycle_last;
	vdata->clock.mask = tk->tkr_mono.mask;
	vdata->clock.mult = tk->tkr_mono.mult;
	vdata->clock.shift = tk->tkr_mono.shift;
2012-11-28 01:29:00 +00:00

2014-07-16 21:04:54 +00:00
	vdata->boot_ns = boot_ns;
2015-03-19 09:09:06 +00:00
	vdata->nsec_base = tk->tkr_mono.xtime_nsec;
2012-11-28 01:29:00 +00:00

2017-01-24 17:09:39 +00:00
	vdata->wall_time_sec = tk->xtime_sec;

2012-11-28 01:29:00 +00:00
	write_seqcount_end(&vdata->seq);
}
#endif
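update_pvclock_gtod() wraps its copy in write_seqcount_begin()/write_seqcount_end() so that a reader never consumes a half-updated snapshot. The sequence count is odd while an update is in progress, and a reader retries if the count was odd or changed underneath it. The sketch below implements the same protocol in userspace with C11 atomics. It follows the common fence-based seqlock pattern rather than the kernel's seqcount_t API. struct gtod_sketch is a hypothetical two-field stand-in for pvclock_gtod_data, and a single writer is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical snapshot protected by a sequence counter. */
struct gtod_sketch {
	atomic_uint seq;
	_Atomic uint64_t boot_ns;
	_Atomic uint64_t nsec_base;
};

/* Writer (single writer assumed): seq is odd while fields are changing. */
static void gtod_write(struct gtod_sketch *g, uint64_t boot_ns,
		       uint64_t nsec_base)
{
	unsigned int s = atomic_load_explicit(&g->seq, memory_order_relaxed);

	atomic_store_explicit(&g->seq, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&g->boot_ns, boot_ns, memory_order_relaxed);
	atomic_store_explicit(&g->nsec_base, nsec_base, memory_order_relaxed);
	atomic_store_explicit(&g->seq, s + 2, memory_order_release);
}

/* Reader: retry until the same even sequence is seen before and after. */
static void gtod_read(struct gtod_sketch *g, uint64_t *boot_ns,
		      uint64_t *nsec_base)
{
	unsigned int s1, s2;

	do {
		s1 = atomic_load_explicit(&g->seq, memory_order_acquire);
		*boot_ns = atomic_load_explicit(&g->boot_ns, memory_order_relaxed);
		*nsec_base = atomic_load_explicit(&g->nsec_base, memory_order_relaxed);
		atomic_thread_fence(memory_order_acquire);
		s2 = atomic_load_explicit(&g->seq, memory_order_relaxed);
	} while ((s1 & 1) || s1 != s2);
}
```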

2015-01-02 03:05:18 +00:00
void kvm_set_pending_timer(struct kvm_vcpu *vcpu)
{
	/*
	 * Note: KVM_REQ_PENDING_TIMER is implicitly checked in
	 * vcpu_enter_guest. This function is only called from
	 * the physical CPU that is running vcpu.
	 */
	kvm_make_request(KVM_REQ_PENDING_TIMER, vcpu);
}
2012-11-28 01:29:00 +00:00

2008-02-15 19:52:47 +00:00
static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock)
{
2010-05-04 12:00:37 +00:00
	int version;
	int r;
2008-06-03 14:17:31 +00:00
	struct pvclock_wall_clock wc;
2016-06-17 15:48:56 +00:00
	struct timespec64 boot;
2008-02-15 19:52:47 +00:00

	if (!wall_clock)
		return;

2010-05-04 12:00:37 +00:00
	r = kvm_read_guest(kvm, wall_clock, &version, sizeof(version));
	if (r)
		return;

	if (version & 1)
		++version;  /* first time write, random junk */

	++version;
2008-02-15 19:52:47 +00:00

2015-12-30 18:08:46 +00:00
	if (kvm_write_guest(kvm, wall_clock, &version, sizeof(version)))
		return;
2008-02-15 19:52:47 +00:00

2008-06-03 14:17:31 +00:00
	/*
	 * The guest calculates current wall clock time by adding
2010-09-19 00:38:14 +00:00
	 * system time (updated by kvm_guest_time_update below) to the
2008-06-03 14:17:31 +00:00
	 * wall clock specified here. Guest system time equals host
	 * system time for us, thus we must fill in host boot time here.
	 */
2016-06-17 15:48:56 +00:00
	getboottime64(&boot);
2008-06-03 14:17:31 +00:00

2012-07-20 16:44:24 +00:00
	if (kvm->arch.kvmclock_offset) {
2016-06-17 15:48:56 +00:00
		struct timespec64 ts = ns_to_timespec64(kvm->arch.kvmclock_offset);
		boot = timespec64_sub(boot, ts);
2012-07-20 16:44:24 +00:00
	}
2016-06-17 15:48:56 +00:00
	wc.sec = (u32)boot.tv_sec; /* overflow in 2106 guest time */
2008-06-03 14:17:31 +00:00
	wc.nsec = boot.tv_nsec;
	wc.version = version;
2008-02-15 19:52:47 +00:00

	kvm_write_guest(kvm, wall_clock, &wc, sizeof(wc));

	version++;
	kvm_write_guest(kvm, wall_clock, &version, sizeof(version));
}
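kvm_write_wall_clock() uses the version field as a seqlock-style flag. It forces the version odd before rewriting the structure and makes it even again afterwards, so a guest that sees an odd or changed version knows to re-read. The single-threaded sketch below covers only the version arithmetic. struct wall_clock_sketch is hypothetical, guest memory accesses are plain stores, and the memory barriers a concurrent reader would need are not modeled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical guest-visible layout: version, then seconds/nanoseconds. */
struct wall_clock_sketch {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
};

/* Follows kvm_write_wall_clock(): make the version odd (update in
 * progress), write the fields, then publish an even version. */
static void host_write_wall_clock(struct wall_clock_sketch *wc,
				  uint32_t sec, uint32_t nsec)
{
	uint32_t version = wc->version;

	if (version & 1)
		++version;	/* first time write, random junk */
	++version;		/* now odd */
	wc->version = version;

	wc->sec = sec;
	wc->nsec = nsec;

	wc->version = version + 1;	/* even again: snapshot consistent */
}
```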

2008-06-03 14:17:31 +00:00
static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
{
2016-01-22 10:39:22 +00:00
	do_shl32_div32(dividend, divisor);
	return dividend;
2008-06-03 14:17:31 +00:00
}

2016-02-08 14:11:15 +00:00
static void kvm_get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
2010-09-19 00:38:13 +00:00
			       s8 *pshift, u32 *pmultiplier)
2008-06-03 14:17:31 +00:00
{
2010-09-19 00:38:13 +00:00
	uint64_t scaled64;
2008-06-03 14:17:31 +00:00
	int32_t shift = 0;
	uint64_t tps64;
	uint32_t tps32;

2016-02-08 14:11:15 +00:00
	tps64 = base_hz;
	scaled64 = scaled_hz;
2010-09-26 11:00:53 +00:00
	while (tps64 > scaled64*2 || tps64 & 0xffffffff00000000ULL) {
2008-06-03 14:17:31 +00:00
		tps64 >>= 1;
		shift--;
	}

	tps32 = (uint32_t)tps64;
2010-09-26 11:00:53 +00:00
	while (tps32 <= scaled64 || scaled64 & 0xffffffff00000000ULL) {
		if (scaled64 & 0xffffffff00000000ULL || tps32 & 0x80000000)
2010-09-19 00:38:13 +00:00
			scaled64 >>= 1;
		else
			tps32 <<= 1;
2008-06-03 14:17:31 +00:00
		shift++;
	}

2010-09-19 00:38:13 +00:00
	*pshift = shift;
	*pmultiplier = div_frac(scaled64, tps32);
2008-06-03 14:17:31 +00:00

2016-02-08 14:11:15 +00:00
	pr_debug("%s: base_hz %llu => %llu, shift %d, mul %u\n",
		 __func__, base_hz, scaled_hz, shift, *pmultiplier);
2008-06-03 14:17:31 +00:00
}
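The pair produced by kvm_get_time_scale() is a binary shift plus a 32.32 fixed-point multiplier. A duration counted at base_hz is converted to scaled_hz units by shifting it by *pshift and then multiplying by *pmultiplier / 2^32. The sketch below is a userspace transcription that lets the arithmetic be checked by hand. div_frac() is rewritten without do_shl32_div32(), and scale_delta() is a hypothetical helper that applies the result the way pvclock readers do.

```c
#include <assert.h>
#include <stdint.h>

/* (dividend << 32) / divisor: the 32.32 fraction dividend/divisor.
 * The loops below guarantee dividend < divisor, so it fits in 32 bits. */
static uint32_t div_frac_sketch(uint32_t dividend, uint32_t divisor)
{
	return (uint32_t)(((uint64_t)dividend << 32) / divisor);
}

/* Same algorithm as kvm_get_time_scale() above. */
static void get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
			   int8_t *pshift, uint32_t *pmultiplier)
{
	uint64_t scaled64 = scaled_hz;
	uint64_t tps64 = base_hz;
	int32_t shift = 0;
	uint32_t tps32;

	/* Bring the base rate below 2^32 and within 2x of the target. */
	while (tps64 > scaled64 * 2 || tps64 & 0xffffffff00000000ULL) {
		tps64 >>= 1;
		shift--;
	}

	/* Make the base rate strictly larger than the (32-bit) target. */
	tps32 = (uint32_t)tps64;
	while (tps32 <= scaled64 || scaled64 & 0xffffffff00000000ULL) {
		if (scaled64 & 0xffffffff00000000ULL || tps32 & 0x80000000)
			scaled64 >>= 1;
		else
			tps32 <<= 1;
		shift++;
	}

	*pshift = (int8_t)shift;
	*pmultiplier = div_frac_sketch((uint32_t)scaled64, tps32);
}

/* Apply (mult, shift): shift first, then multiply by mult / 2^32,
 * keeping bits 32..95 of the 96-bit product. */
static uint64_t scale_delta(uint64_t delta, uint32_t mult, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return ((delta & 0xffffffffULL) * mult >> 32) + (delta >> 32) * mult;
}
```

For example, converting nanoseconds (base 1 GHz) to a 3 GHz TSC gives shift 2 and a multiplier of 0.75 * 2^32, so 1000 ns becomes 1000 * 4 * 0.75 = 3000 cycles.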

2012-11-28 01:29:01 +00:00
#ifdef CONFIG_X86_64
2012-11-28 01:29:00 +00:00
static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
2012-11-28 01:29:01 +00:00
#endif
2012-11-28 01:29:00 +00:00

2009-02-04 16:52:04 +00:00
static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
2015-01-19 14:33:39 +00:00
static unsigned long max_tsc_khz;
2009-02-04 16:52:04 +00:00

KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate. Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code. This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.
There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed. Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used. The default is 250ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.
In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.
[avi: fix 64-bit division on i386]
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-03 17:43:50 +00:00
static u32 adjust_tsc_khz(u32 khz, s32 ppm)
2011-03-25 08:44:47 +00:00
{
2012-02-03 17:43:50 +00:00
	u64 v = (u64)khz * (1000000 + ppm);
	do_div(v, 1000000);
	return v;
2011-03-25 08:44:47 +00:00
}
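adjust_tsc_khz() scales a frequency by (1e6 + ppm) / 1e6. According to the commit message above, the default tolerance is 250 ppm, and rate scaling is only meant to be used when the requested rate falls outside that band around the host rate. The sketch below shows the arithmetic. needs_tsc_scaling() is a hypothetical helper showing how lower and upper thresholds can be derived with negative and positive ppm; the comparison KVM actually performs lies beyond this excerpt.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as adjust_tsc_khz(): khz * (1e6 + ppm) / 1e6,
 * with plain 64-bit division instead of do_div(). */
static uint32_t adjust_tsc_khz_sketch(uint32_t khz, int32_t ppm)
{
	uint64_t v = (uint64_t)khz * (uint64_t)(1000000 + ppm);

	return (uint32_t)(v / 1000000);
}

/* Hypothetical: scaling is needed if the requested rate is outside
 * [host - tol_ppm, host + tol_ppm]. */
static int needs_tsc_scaling(uint32_t host_khz, uint32_t user_khz,
			     int32_t tol_ppm)
{
	uint32_t lo = adjust_tsc_khz_sketch(host_khz, -tol_ppm);
	uint32_t hi = adjust_tsc_khz_sketch(host_khz, tol_ppm);

	return user_khz < lo || user_khz > hi;
}
```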

2015-10-20 07:39:04 +00:00
static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
{
	u64 ratio;

	/* Guest TSC same frequency as host TSC? */
	if (!scale) {
		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
		return 0;
	}

	/* TSC scaling supported? */
	if (!kvm_has_tsc_control) {
		if (user_tsc_khz > tsc_khz) {
			vcpu->arch.tsc_catchup = 1;
			vcpu->arch.tsc_always_catchup = 1;
			return 0;
		} else {
			WARN(1, "user requested TSC rate below hardware speed\n");
			return -1;
		}
	}

	/* TSC scaling required - calculate ratio */
	ratio = mul_u64_u32_div(1ULL << kvm_tsc_scaling_ratio_frac_bits,
				user_tsc_khz, tsc_khz);

	if (ratio == 0 || ratio >= kvm_max_tsc_scaling_ratio) {
		WARN_ONCE(1, "Invalid TSC scaling ratio - virtual-tsc-khz=%u\n",
			  user_tsc_khz);
		return -1;
	}

	vcpu->arch.tsc_scaling_ratio = ratio;
	return 0;
}
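When hardware scaling is available, set_tsc_khz() stores user_tsc_khz / tsc_khz as a fixed-point ratio with kvm_tsc_scaling_ratio_frac_bits fractional bits. Because 1 << frac_bits multiplied by a kHz value can overflow 64 bits, the product needs a wider intermediate, which is what mul_u64_u32_div() provides. The sketch below substitutes the GCC/Clang unsigned __int128 extension for that helper and uses 48 fractional bits purely as an example value.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for mul_u64_u32_div(a, mul, div): (a * mul) / div with a
 * 128-bit intermediate (GCC/Clang extension). */
static uint64_t mul_u64_u32_div_sketch(uint64_t a, uint32_t mul, uint32_t div)
{
	return (uint64_t)(((unsigned __int128)a * mul) / div);
}

/* Fixed-point ratio user_khz / host_khz with 'frac_bits' fractional bits,
 * computed as in set_tsc_khz(). */
static uint64_t tsc_scaling_ratio(unsigned int frac_bits,
				  uint32_t user_khz, uint32_t host_khz)
{
	return mul_u64_u32_div_sketch(1ULL << frac_bits, user_khz, host_khz);
}
```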

2016-02-08 13:51:12 +00:00
static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
2010-08-20 08:07:25 +00:00
{
2012-02-03 17:43:50 +00:00
	u32 thresh_lo, thresh_hi;
	int use_scaling = 0;
2010-08-26 10:38:03 +00:00

2013-03-12 02:10:24 +00:00
	/* tsc_khz can be zero if TSC calibration fails */
2016-02-08 13:51:12 +00:00
	if (user_tsc_khz == 0) {
2015-10-20 07:39:02 +00:00
		/* set tsc_scaling_ratio to a safe value */
		vcpu->arch.tsc_scaling_ratio = kvm_default_tsc_scaling_ratio;
2015-10-20 07:39:04 +00:00
		return -1;
2015-10-20 07:39:02 +00:00
	}
2013-03-12 02:10:24 +00:00

2010-09-19 00:38:15 +00:00
	/* Compute a scale to convert nanoseconds into TSC cycles */
2016-02-08 14:11:15 +00:00
	kvm_get_time_scale(user_tsc_khz * 1000LL, NSEC_PER_SEC,
2012-02-03 17:43:50 +00:00
|
|
|
			   &vcpu->arch.virtual_tsc_shift,
			   &vcpu->arch.virtual_tsc_mult);
	vcpu->arch.virtual_tsc_khz = user_tsc_khz;

	/*
	 * Compute the variation in TSC rate which is acceptable
	 * within the range of tolerance and decide if the
	 * rate being applied is within those bounds of the hardware
	 * rate.  If so, no scaling or compensation need be done.
	 */
	thresh_lo = adjust_tsc_khz(tsc_khz, -tsc_tolerance_ppm);
	thresh_hi = adjust_tsc_khz(tsc_khz, tsc_tolerance_ppm);
	if (user_tsc_khz < thresh_lo || user_tsc_khz > thresh_hi) {
		pr_debug("kvm: requested TSC rate %u falls outside tolerance [%u,%u]\n", user_tsc_khz, thresh_lo, thresh_hi);
		use_scaling = 1;
	}
	return set_tsc_khz(vcpu, user_tsc_khz, use_scaling);
}

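The tolerance test in kvm_set_tsc_khz() can be modeled outside the kernel. The sketch below is a hypothetical user-space stand-in, not kernel code: `sketch_adjust_khz()` assumes that adjust_tsc_khz() scales a kHz value by (1 + ppm / 10^6), and the example uses the documented default tolerance of 250 ppm.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for adjust_tsc_khz(): khz * (1e6 + ppm) / 1e6. */
static uint32_t sketch_adjust_khz(uint32_t khz, int32_t ppm)
{
	uint64_t v = (uint64_t)khz * (uint64_t)(1000000 + ppm);

	return (uint32_t)(v / 1000000);
}

/* True when user_khz falls outside [lo, hi], i.e. rate scaling is needed. */
static bool sketch_needs_scaling(uint32_t host_khz, uint32_t user_khz,
				 uint32_t tol_ppm)
{
	uint32_t lo = sketch_adjust_khz(host_khz, -(int32_t)tol_ppm);
	uint32_t hi = sketch_adjust_khz(host_khz, (int32_t)tol_ppm);

	return user_khz < lo || user_khz > hi;
}
```

With a 2,000,000 kHz host and 250 ppm, the accepted window is [1,999,500, 2,000,500] kHz, so a request for 2,000,400 kHz runs unscaled while 2,000,600 kHz enables scaling.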
static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
{
	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
				      vcpu->arch.virtual_tsc_mult,
				      vcpu->arch.virtual_tsc_shift);
	tsc += vcpu->arch.this_tsc_write;
	return tsc;
}

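compute_guest_tsc() converts the nanoseconds elapsed since the start of the vCPU's TSC generation into guest cycles, then adds the TSC value written at that point. The user-space sketch below mirrors that arithmetic. It assumes pvclock_scale_delta() pre-shifts the delta by `shift` and then multiplies by a 32.32 fixed-point `mult`, and it uses the GCC/Clang `unsigned __int128` extension for the wide product. All `sketch_*` names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed pvclock-style scaling: pre-shift, then multiply by mult / 2^32. */
static uint64_t sketch_scale_delta(uint64_t delta, uint32_t mul_frac, int shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (uint64_t)(((unsigned __int128)delta * mul_frac) >> 32);
}

/* Guest TSC = TSC written at generation start + scaled elapsed nanoseconds. */
static uint64_t sketch_guest_tsc(uint64_t this_tsc_write, int64_t this_tsc_nsec,
				 int64_t kernel_ns, uint32_t mult, int shift)
{
	return this_tsc_write +
	       sketch_scale_delta((uint64_t)(kernel_ns - this_tsc_nsec), mult, shift);
}
```

With shift = 2 and mult = 2^31 (0.5 in 32.32 fixed point), each nanosecond becomes two cycles, i.e. a 2 GHz virtual TSC.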
static inline int gtod_is_based_on_tsc(int mode)
{
	return mode == VCLOCK_TSC || mode == VCLOCK_HVCLOCK;
}

static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_X86_64
	bool vcpus_matched;
	struct kvm_arch *ka = &vcpu->kvm->arch;
	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;

	vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
			 atomic_read(&vcpu->kvm->online_vcpus));

	/*
	 * Once the masterclock is enabled, always perform the request in
	 * order to update it.
	 *
	 * In order to enable masterclock, the host clocksource must be TSC
	 * and the vCPUs need to have matched TSCs.  When that happens,
	 * perform the request to enable masterclock.
	 */
	if (ka->use_master_clock ||
	    (gtod_is_based_on_tsc(gtod->clock.vclock_mode) && vcpus_matched))
		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
			    atomic_read(&vcpu->kvm->online_vcpus),
			    ka->use_master_clock, gtod->clock.vclock_mode);
#endif
}

static void update_ia32_tsc_adjust_msr(struct kvm_vcpu *vcpu, s64 offset)
{
	u64 curr_offset = kvm_x86_ops->read_l1_tsc_offset(vcpu);
	vcpu->arch.ia32_tsc_adjust_msr += offset - curr_offset;
}

/*
 * Multiply tsc by a fixed point number represented by ratio.
 *
 * The most significant 64-N bits (mult) of ratio represent the
 * integral part of the fixed point number; the remaining N bits
 * (frac) represent the fractional part, i.e. ratio represents a fixed
 * point number (mult + frac * 2^(-N)).
 *
 * N equals kvm_tsc_scaling_ratio_frac_bits.
 */
static inline u64 __scale_tsc(u64 ratio, u64 tsc)
{
	return mul_u64_u64_shr(tsc, ratio, kvm_tsc_scaling_ratio_frac_bits);
}

u64 kvm_scale_tsc(struct kvm_vcpu *vcpu, u64 tsc)
{
	u64 _tsc = tsc;
	u64 ratio = vcpu->arch.tsc_scaling_ratio;

	if (ratio != kvm_default_tsc_scaling_ratio)
		_tsc = __scale_tsc(ratio, tsc);

	return _tsc;
}
EXPORT_SYMBOL_GPL(kvm_scale_tsc);

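__scale_tsc() multiplies by a fixed-point ratio with kvm_tsc_scaling_ratio_frac_bits fractional bits. The sketch below assumes N = 48 purely as an example width and emulates mul_u64_u64_shr() with the GCC/Clang `unsigned __int128` extension. `sketch_make_ratio()` is a hypothetical helper that builds guest_khz / host_khz in the same format.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FRAC_BITS 48	/* assumption: example fractional width N */

/* (tsc * ratio) >> N using a 128-bit intermediate, like mul_u64_u64_shr(). */
static uint64_t sketch_scale_tsc(uint64_t ratio, uint64_t tsc)
{
	return (uint64_t)(((unsigned __int128)tsc * ratio) >> SKETCH_FRAC_BITS);
}

/* Hypothetical helper: guest_khz / host_khz as an N-bit fixed-point ratio. */
static uint64_t sketch_make_ratio(uint64_t guest_khz, uint64_t host_khz)
{
	return (uint64_t)(((unsigned __int128)guest_khz << SKETCH_FRAC_BITS) /
			  host_khz);
}
```

A ratio of exactly 1 << N is the identity. kvm_scale_tsc() above skips the multiply entirely whenever the ratio equals kvm_default_tsc_scaling_ratio.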
static u64 kvm_compute_tsc_offset(struct kvm_vcpu *vcpu, u64 target_tsc)
{
	u64 tsc;

	tsc = kvm_scale_tsc(vcpu, rdtsc());

	return target_tsc - tsc;
}

u64 kvm_read_l1_tsc(struct kvm_vcpu *vcpu, u64 host_tsc)
{
	u64 tsc_offset = kvm_x86_ops->read_l1_tsc_offset(vcpu);

	return tsc_offset + kvm_scale_tsc(vcpu, host_tsc);
}
EXPORT_SYMBOL_GPL(kvm_read_l1_tsc);

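kvm_compute_tsc_offset() and kvm_read_l1_tsc() work as a pair. The offset is chosen so that offset plus the scaled host TSC equals the requested guest value at the moment of the write, and later reads keep advancing with the host. The sketch below assumes a scaling ratio of 1 and relies on well-defined unsigned wraparound, as the u64 arithmetic above does. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Offset that makes the guest read target_tsc right now (may wrap). */
static uint64_t sketch_compute_offset(uint64_t target_tsc, uint64_t host_tsc)
{
	return target_tsc - host_tsc;
}

/* Guest-visible TSC for a given host TSC reading. */
static uint64_t sketch_read_guest_tsc(uint64_t offset, uint64_t host_tsc)
{
	return offset + host_tsc;
}
```

Writing 0 while the host TSC is large yields an offset that is "negative" in two's complement. Reads still come out right because both operations wrap modulo 2^64.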
static void kvm_vcpu_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
{
	kvm_x86_ops->write_tsc_offset(vcpu, offset);
	vcpu->arch.tsc_offset = offset;
}

static inline bool kvm_check_tsc_unstable(void)
{
#ifdef CONFIG_X86_64
	/*
	 * The TSC is marked unstable when we're running on Hyper-V,
	 * but the 'TSC page' clocksource is good.
	 */
	if (pvclock_gtod_data.clock.vclock_mode == VCLOCK_HVCLOCK)
		return false;
#endif
	return check_tsc_unstable();
}

void kvm_write_tsc(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
	struct kvm *kvm = vcpu->kvm;
	u64 offset, ns, elapsed;
	unsigned long flags;
	bool matched;
	bool already_matched;
	u64 data = msr->data;
	bool synchronizing = false;

	raw_spin_lock_irqsave(&kvm->arch.tsc_write_lock, flags);
	offset = kvm_compute_tsc_offset(vcpu, data);
	ns = ktime_get_boot_ns();
	elapsed = ns - kvm->arch.last_tsc_nsec;
KVM: Improve TSC offset matching
There are a few improvements that can be made to the TSC offset
matching code. First, we don't need to call the 128-bit multiply
(especially on a constant number), the code works much nicer to
do computation in nanosecond units.
Second, the way everything is setup with software TSC rate scaling,
we currently have per-cpu rates. Obviously this isn't too desirable
to use in practice, but if for some reason we do change the rate of
all VCPUs at runtime, then reset the TSCs, we will only want to
match offsets for VCPUs running at the same rate.
Finally, for the case where we have an unstable host TSC, but
rate scaling is being done in hardware, we should call the platform
code to compute the TSC offset, so the math is reorganized to recompute
the base instead, then transform the base into an offset using the
existing API.
[avi: fix 64-bit division on i386]
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
KVM: Fix 64-bit division in kvm_write_tsc()
Breaks i386 build.
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-03 17:43:51 +00:00

	if (vcpu->arch.virtual_tsc_khz) {
		if (data == 0 && msr->host_initiated) {
			/*
			 * Detection of vCPU initialization: we need to sync
			 * with the other vCPUs. This particularly helps to keep
			 * kvm_clock stable after CPU hotplug.
			 */
			synchronizing = true;
		} else {
			u64 tsc_exp = kvm->arch.last_tsc_write +
						nsec_to_cycles(vcpu, elapsed);
			u64 tsc_hz = vcpu->arch.virtual_tsc_khz * 1000LL;
			/*
			 * Special case: TSC write with a small delta (1 second)
			 * of virtual cycle time against real time is
			 * interpreted as an attempt to synchronize the CPU.
			 */
			synchronizing = data < tsc_exp + tsc_hz &&
					data + tsc_hz > tsc_exp;
		}
	}

	/*
	 * For a reliable TSC, we can match TSC offsets, and for an unstable
	 * TSC, we add elapsed time in this computation.  We could let the
	 * compensation code attempt to catch up if we fall behind, but
	 * it's better to try to match offsets from the beginning.
	 */
	if (synchronizing &&
	    vcpu->arch.virtual_tsc_khz == kvm->arch.last_tsc_khz) {
		if (!kvm_check_tsc_unstable()) {
			offset = kvm->arch.cur_tsc_offset;
			pr_debug("kvm: matched tsc offset for %llu\n", data);
		} else {
			u64 delta = nsec_to_cycles(vcpu, elapsed);
			data += delta;
			offset = kvm_compute_tsc_offset(vcpu, data);
			pr_debug("kvm: adjusted tsc offset by %llu\n", delta);
		}
		matched = true;
		already_matched = (vcpu->arch.this_tsc_generation == kvm->arch.cur_tsc_generation);
	} else {
		/*
		 * We split periods of matched TSC writes into generations.
		 * For each generation, we track the original measured
		 * nanosecond time, offset, and write, so if TSCs are in
		 * sync, we can match exact offset, and if not, we can match
		 * exact software computation in compute_guest_tsc().
		 *
		 * These values are tracked in kvm->arch.cur_xxx variables.
		 */
		kvm->arch.cur_tsc_generation++;
		kvm->arch.cur_tsc_nsec = ns;
		kvm->arch.cur_tsc_write = data;
		kvm->arch.cur_tsc_offset = offset;
		matched = false;
		pr_debug("kvm: new tsc generation %llu, clock %llu\n",
			 kvm->arch.cur_tsc_generation, data);
	}

	/*
	 * We also track the most recent recorded KHZ, write and time to
	 * allow the matching interval to be extended at each write.
	 */
	kvm->arch.last_tsc_nsec = ns;
	kvm->arch.last_tsc_write = data;
	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;

KVM: Fix last_guest_tsc / tsc_offset semantics
The variable last_guest_tsc was being used as an ad-hoc indicator
that guest TSC has been initialized and recorded correctly. However,
it may not have been, it could be that guest TSC has been set to some
large value, the back to a small value (by, say, a software reboot).
This defeats the logic and causes KVM to falsely assume that the
guest TSC has gone backwards, marking the host TSC unstable, which
is undesirable behavior.
In addition, rather than try to compute an offset adjustment for the
TSC on unstable platforms, just recompute the whole offset. This
allows us to get rid of one callsite for adjust_tsc_offset, which
is problematic because the units it takes are in guest units, but
here, the computation was originally being done in host units.
Doing this, and also recording last_guest_tsc when the TSC is written
allow us to remove the tricky logic which depended on last_guest_tsc
being zero to indicate a reset of uninitialized value.
Instead, we now have the guarantee that the guest TSC offset is
always at least something which will get us last_guest_tsc.
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-03 17:43:53 +00:00
	vcpu->arch.last_guest_tsc = data;

	/* Keep track of which generation this VCPU has synchronized to */
	vcpu->arch.this_tsc_generation = kvm->arch.cur_tsc_generation;
	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;

	if (!msr->host_initiated && guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST))
		update_ia32_tsc_adjust_msr(vcpu, offset);

	kvm_vcpu_write_tsc_offset(vcpu, offset);
	raw_spin_unlock_irqrestore(&kvm->arch.tsc_write_lock, flags);

	spin_lock(&kvm->arch.pvclock_gtod_sync_lock);
	if (!matched) {
		kvm->arch.nr_vcpus_matched_tsc = 0;
	} else if (!already_matched) {
		kvm->arch.nr_vcpus_matched_tsc++;
	}

	kvm_track_tsc_matching(vcpu);
	spin_unlock(&kvm->arch.pvclock_gtod_sync_lock);
}

EXPORT_SYMBOL_GPL(kvm_write_tsc);

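Two ideas in kvm_write_tsc() lend themselves to a small model: the one-second window that classifies a write as an attempt to synchronize, and the generation counter behind nr_vcpus_matched_tsc. The sketch below is a simplified, hypothetical model. It ignores locking, the host-initiated write of 0, the virtual_tsc_khz comparison and the unstable-TSC path, and none of its names exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A write counts as "synchronizing" if it lies strictly within tsc_hz of tsc_exp. */
static bool sketch_is_sync_write(uint64_t data, uint64_t tsc_exp, uint64_t tsc_hz)
{
	return data < tsc_exp + tsc_hz && data + tsc_hz > tsc_exp;
}

/* Hypothetical per-VM bookkeeping mirroring cur_tsc_generation / nr_vcpus_matched_tsc. */
struct sketch_vm {
	uint64_t generation;
	unsigned int nr_matched;
};

/* Apply one TSC write from a vCPU whose last seen generation is *vcpu_gen. */
static void sketch_record_write(struct sketch_vm *vm, uint64_t *vcpu_gen,
				bool synchronizing)
{
	if (!synchronizing) {
		vm->generation++;	/* start a new generation */
		vm->nr_matched = 0;	/* the writer itself is not counted */
	} else if (*vcpu_gen != vm->generation) {
		vm->nr_matched++;	/* first match for this vCPU in this generation */
	}
	*vcpu_gen = vm->generation;
}
```

Once nr_matched + 1 equals the number of online vCPUs, every vCPU has joined the current generation, which is the vcpus_matched condition tested in kvm_track_tsc_matching().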
static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
					   s64 adjustment)
{
	kvm_vcpu_write_tsc_offset(vcpu, vcpu->arch.tsc_offset + adjustment);
}

static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
{
	if (vcpu->arch.tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
		WARN_ON(adjustment < 0);
	adjustment = kvm_scale_tsc(vcpu, (u64) adjustment);
	adjust_tsc_offset_guest(vcpu, adjustment);
}

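adjust_tsc_offset_host() passes a signed adjustment through kvm_scale_tsc(), which takes a u64. The sketch below shows why a negative value only survives the identity ratio: any other ratio scales the two's-complement bit pattern as a huge unsigned number, which is the case the WARN_ON() flags. It assumes 32 fractional bits as an example width and uses the GCC/Clang `unsigned __int128` extension. `sketch_scale()` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_FRAC_BITS 32	/* assumption: example fractional width */

/* Unsigned fixed-point scaling, as applied to the (u64)-cast adjustment. */
static uint64_t sketch_scale(uint64_t ratio, uint64_t tsc)
{
	return (uint64_t)(((unsigned __int128)tsc * ratio) >> SKETCH_FRAC_BITS);
}
```

With a ratio of 0.5 (1 << 31 here), an adjustment of -1000 comes back as roughly 2^63, a large positive offset rather than -500.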
#ifdef CONFIG_X86_64

static u64 read_tsc(void)
{
	u64 ret = (u64)rdtsc_ordered();
	u64 last = pvclock_gtod_data.clock.cycle_last;

	if (likely(ret >= last))
		return ret;

	/*
	 * GCC likes to generate cmov here, but this branch is extremely
	 * predictable (it's just a function of time and the likely is
	 * very likely) and there's a data dependence, so force GCC
	 * to generate a branch instead.  I don't use barrier() because
	 * we don't actually need a barrier, and if this function
	 * ever gets inlined it will generate worse code.
	 */
	asm volatile ("");
	return last;
}

static inline u64 vgettsc(u64 *tsc_timestamp, int *mode)
{
	long v;
	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
	u64 tsc_pg_val;

	switch (gtod->clock.vclock_mode) {
	case VCLOCK_HVCLOCK:
		tsc_pg_val = hv_read_tsc_page_tsc(hv_get_tsc_page(),
						  tsc_timestamp);
		if (tsc_pg_val != U64_MAX) {
			/* TSC page valid */
			*mode = VCLOCK_HVCLOCK;
			v = (tsc_pg_val - gtod->clock.cycle_last) &
				gtod->clock.mask;
		} else {
			/* TSC page invalid */
			*mode = VCLOCK_NONE;
		}
		break;
	case VCLOCK_TSC:
		*mode = VCLOCK_TSC;
		*tsc_timestamp = read_tsc();
		v = (*tsc_timestamp - gtod->clock.cycle_last) &
			gtod->clock.mask;
		break;
	default:
		*mode = VCLOCK_NONE;
	}

	if (*mode == VCLOCK_NONE)
		*tsc_timestamp = v = 0;

	return v * gtod->clock.mult;
}

static int do_monotonic_boot(s64 *t, u64 *tsc_timestamp)
{
	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
	unsigned long seq;
	int mode;
	u64 ns;

	do {
		seq = read_seqcount_begin(&gtod->seq);
		ns = gtod->nsec_base;
		ns += vgettsc(tsc_timestamp, &mode);
		ns >>= gtod->clock.shift;
		ns += gtod->boot_ns;
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
	*t = ns;

	return mode;
}

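vgettsc() and do_monotonic_boot() perform the usual clocksource conversion. They take the cycles since cycle_last (masked to the counter width), multiply by mult, add the shifted nanosecond base, shift right, then add the boot offset. The sketch below reproduces that arithmetic on a hypothetical snapshot struct, including read_tsc()'s clamp to cycle_last. It leaves out the seqcount retry loop and the Hyper-V TSC page path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical snapshot of the fields vgettsc()/do_monotonic_boot() read. */
struct sketch_gtod {
	uint64_t cycle_last;	/* counter value at the last timekeeping update */
	uint64_t mask;		/* counter width mask */
	uint32_t mult;		/* cycles -> shifted-ns multiplier */
	uint32_t shift;		/* right shift applied after accumulation */
	uint64_t nsec_base;	/* shifted ns accumulated at cycle_last */
	int64_t boot_ns;	/* offset from monotonic to boot-based time */
};

/* Same clamp as read_tsc(): never report a value behind cycle_last. */
static uint64_t sketch_read(uint64_t raw, uint64_t cycle_last)
{
	return raw >= cycle_last ? raw : cycle_last;
}

static int64_t sketch_boot_ns(const struct sketch_gtod *g, uint64_t raw_tsc)
{
	uint64_t cycles = sketch_read(raw_tsc, g->cycle_last);
	uint64_t ns = g->nsec_base;

	ns += ((cycles - g->cycle_last) & g->mask) * g->mult;
	ns >>= g->shift;
	return (int64_t)ns + g->boot_ns;
}
```

In the test values, mult / 2^shift = 128 / 256 = 0.5 ns per cycle (a 2 GHz counter), so 2000 cycles past cycle_last add 1000 ns to the 5000 ns base.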
static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
{
	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
	unsigned long seq;
	int mode;
	u64 ns;

	do {
		seq = read_seqcount_begin(&gtod->seq);
		ts->tv_sec = gtod->wall_time_sec;
		ns = gtod->nsec_base;
		ns += vgettsc(tsc_timestamp, &mode);
		ns >>= gtod->clock.shift;
	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));

	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
	ts->tv_nsec = ns;

	return mode;
}

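do_realtime() normalizes the nanosecond total with __iter_div_u64_rem(), which divides by repeated subtraction. That is cheap when the quotient is tiny, as it should be here, on the assumption that the accumulated nanoseconds rarely exceed a second or two. The following is a hypothetical user-space equivalent, not the kernel helper itself.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC 1000000000ULL

/* Quotient by repeated subtraction; the remainder is stored in *rem. */
static uint32_t sketch_iter_div_rem(uint64_t dividend, uint64_t divisor,
				    uint64_t *rem)
{
	uint32_t q = 0;

	while (dividend >= divisor) {
		dividend -= divisor;
		q++;
	}
	*rem = dividend;
	return q;
}
```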
/* returns true if host is using TSC based clocksource */
static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
{
	/* checked again under seqlock below */
	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
		return false;

	return gtod_is_based_on_tsc(do_monotonic_boot(kernel_ns,
						      tsc_timestamp));
}

/* returns true if host is using TSC based clocksource */
static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
					   u64 *tsc_timestamp)
{
	/* checked again under seqlock below */
	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
		return false;

	return gtod_is_based_on_tsc(do_realtime(ts, tsc_timestamp));
}
#endif

/*
 *
 * Assuming a stable TSC across physical CPUs, and a stable TSC
 * across virtual CPUs, the following condition is possible.
 * Each numbered line represents an event visible to both
 * CPUs at the next numbered event.
 *
 * "timespecX" represents host monotonic time. "tscX" represents
 * RDTSC value.
 *
 *      VCPU0 on CPU0                   |    VCPU1 on CPU1
 *
 * 1.  read timespec0,tsc0
 * 2.                                   | timespec1 = timespec0 + N
 *                                      | tsc1 = tsc0 + M
 * 3. transition to guest               | transition to guest
 * 4. ret0 = timespec0 + (rdtsc - tsc0) |
 * 5.                                   | ret1 = timespec1 + (rdtsc - tsc1)
 *                                      | ret1 = timespec0 + N + (rdtsc - (tsc0 + M))
 *
 * Since ret0 update is visible to VCPU1 at time 5, to obey monotonicity:
 *
 *	- ret0 < ret1
 *	- timespec0 + (rdtsc - tsc0) < timespec0 + N + (rdtsc - (tsc0 + M))
 *		...
 *	- 0 < N - M => M < N
 *
 * That is, when timespec0 != timespec1, M < N. Unfortunately that is not
 * always the case (the difference between two distinct xtime instances
 * might be smaller than the difference between corresponding TSC reads,
 * when updating guest vcpus pvclock areas).
 *
 * To avoid that problem, do not allow visibility of distinct
 * system_timestamp/tsc_timestamp values simultaneously: use a master
 * copy of host monotonic time values. Update that master copy
 * in lockstep.
 *
 * Rely on synchronization of host TSCs and guest TSCs for monotonicity.
 *
 */
static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
|
|
|
|
{
|
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
struct kvm_arch *ka = &kvm->arch;
|
|
|
|
int vclock_mode;
|
2012-11-28 01:29:03 +00:00
|
|
|
bool host_tsc_clocksource, vcpus_matched;
|
|
|
|
|
|
|
|
vcpus_matched = (ka->nr_vcpus_matched_tsc + 1 ==
|
|
|
|
atomic_read(&kvm->online_vcpus));
|
2012-11-28 01:29:01 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the host uses TSC clock, then passthrough TSC as stable
|
|
|
|
* to the guest.
|
|
|
|
*/
|
2012-11-28 01:29:03 +00:00
|
|
|
host_tsc_clocksource = kvm_get_time_and_clockread(
|
2012-11-28 01:29:01 +00:00
|
|
|
&ka->master_kernel_ns,
|
|
|
|
&ka->master_cycle_now);
|
|
|
|
|
2014-05-14 15:43:24 +00:00
|
|
|
ka->use_master_clock = host_tsc_clocksource && vcpus_matched
|
2017-06-26 07:56:43 +00:00
|
|
|
&& !ka->backwards_tsc_observed
|
2015-01-20 17:54:52 +00:00
|
|
|
&& !ka->boot_vcpu_runs_old_kvmclock;
|
2012-11-28 01:29:03 +00:00
|
|
|
|
2012-11-28 01:29:01 +00:00
|
|
|
if (ka->use_master_clock)
|
|
|
|
atomic_set(&kvm_guest_has_master_clock, 1);
|
|
|
|
|
|
|
|
vclock_mode = pvclock_gtod_data.clock.vclock_mode;
|
2012-11-28 01:29:03 +00:00
|
|
|
trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
|
|
|
|
vcpus_matched);
|
2012-11-28 01:29:01 +00:00
|
|
|
#endif
|
|
|
|
}
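
The comment before pvclock_update_vm_gtod_copy() shows that VCPU1 only sees time move forward when M < N. The following user-space sketch makes that argument concrete with plain integers. It assumes one TSC tick equals one nanosecond, so no scaling is applied. The names `timespec0`, `tsc0`, `N` and `M` follow the comment. `read_clock()` and `is_monotonic()` are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of a guest pvclock read with a 1 ns TSC tick:
 * the result is system_timestamp + (rdtsc - tsc_timestamp).
 */
static int64_t read_clock(int64_t system_timestamp, int64_t tsc_timestamp,
			  int64_t rdtsc)
{
	return system_timestamp + (rdtsc - tsc_timestamp);
}

/*
 * VCPU0 reads with the old pair (timespec0, tsc0) and VCPU1 with the
 * updated pair (timespec0 + N, tsc0 + M), both at the same TSC value.
 * Since ret1 - ret0 == N - M, the clock only moves forward when M < N.
 */
static int is_monotonic(int64_t timespec0, int64_t tsc0, int64_t N,
			int64_t M, int64_t rdtsc)
{
	int64_t ret0 = read_clock(timespec0, tsc0, rdtsc);
	int64_t ret1 = read_clock(timespec0 + N, tsc0 + M, rdtsc);

	assert(ret1 - ret0 == N - M);
	return ret0 < ret1;
}
```

Consider `is_monotonic(1000, 500, 5, 10, 600)`. Here ret0 is 1100 and ret1 is 1095, so VCPU1 observes time going backwards by 5 ns. Keeping a single master copy of the (system_timestamp, tsc_timestamp) pair is meant to rule out exactly this case.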

void kvm_make_mclock_inprogress_request(struct kvm *kvm)
{
	kvm_make_all_cpus_request(kvm, KVM_REQ_MCLOCK_INPROGRESS);
}

static void kvm_gen_update_masterclock(struct kvm *kvm)
{
#ifdef CONFIG_X86_64
	int i;
	struct kvm_vcpu *vcpu;
	struct kvm_arch *ka = &kvm->arch;

	spin_lock(&ka->pvclock_gtod_sync_lock);
	kvm_make_mclock_inprogress_request(kvm);
	/* no guest entries from this point */
	pvclock_update_vm_gtod_copy(kvm);

	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

	/* guest entries allowed */
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);

	spin_unlock(&ka->pvclock_gtod_sync_lock);
#endif
}

u64 get_kvmclock_ns(struct kvm *kvm)
{
	struct kvm_arch *ka = &kvm->arch;
	struct pvclock_vcpu_time_info hv_clock;
	u64 ret;

	spin_lock(&ka->pvclock_gtod_sync_lock);
	if (!ka->use_master_clock) {
		spin_unlock(&ka->pvclock_gtod_sync_lock);
		return ktime_get_boot_ns() + ka->kvmclock_offset;
	}

	hv_clock.tsc_timestamp = ka->master_cycle_now;
	hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
	spin_unlock(&ka->pvclock_gtod_sync_lock);

	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
	get_cpu();

	if (__this_cpu_read(cpu_tsc_khz)) {
		kvm_get_time_scale(NSEC_PER_SEC, __this_cpu_read(cpu_tsc_khz) * 1000LL,
				   &hv_clock.tsc_shift,
				   &hv_clock.tsc_to_system_mul);
		ret = __pvclock_read_cycles(&hv_clock, rdtsc());
	} else
		ret = ktime_get_boot_ns() + ka->kvmclock_offset;

	put_cpu();

	return ret;
}
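
get_kvmclock_ns() converts a TSC delta to nanoseconds using the (tsc_shift, tsc_to_system_mul) pair that kvm_get_time_scale() derives from the TSC frequency. The following standalone sketch is modeled on that helper and on the pvclock scaling step. It was written for illustration and is not copied from the kernel. `get_time_scale()`, `scale_delta()` and `div_frac()` are local stand-ins. The product `(delta * mul) >> 32` is computed in two halves so that no 128-bit type is needed.

```c
#include <assert.h>
#include <stdint.h>

/* (dividend << 32) / divisor; the caller guarantees dividend < divisor. */
static uint32_t div_frac(uint32_t dividend, uint32_t divisor)
{
	assert(dividend < divisor);
	return (uint32_t)(((uint64_t)dividend << 32) / divisor);
}

/* Choose shift and a 32.32 multiplier: ns ~= ((cycles << shift) * mul) >> 32. */
static void get_time_scale(uint64_t scaled_hz, uint64_t base_hz,
			   int8_t *pshift, uint32_t *pmul)
{
	uint64_t scaled64 = scaled_hz;
	uint64_t tps64 = base_hz;
	uint32_t tps32;
	int32_t shift = 0;

	/* Bring the base rate into 32 bits and to at most twice the target. */
	while (tps64 > scaled64 * 2 || (tps64 & 0xffffffff00000000ULL)) {
		tps64 >>= 1;
		shift--;
	}

	tps32 = (uint32_t)tps64;
	/* Make the base rate strictly larger than a 32-bit target rate. */
	while (tps32 <= scaled64 || (scaled64 & 0xffffffff00000000ULL)) {
		if ((scaled64 & 0xffffffff00000000ULL) || (tps32 & 0x80000000u))
			scaled64 >>= 1;
		else
			tps32 <<= 1;
		shift++;
	}

	*pshift = (int8_t)shift;
	*pmul = div_frac((uint32_t)scaled64, tps32);
}

/* Apply the shift, then compute (delta * mul) >> 32 in two 64-bit halves. */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	return (delta >> 32) * mul + (((delta & 0xffffffffULL) * mul) >> 32);
}
```

For a 2 GHz TSC the result is shift 0 and mul 2^31, which means "multiply by one half". For a 1 GHz TSC the multiplier is the same and the shift is 1.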

static void kvm_setup_pvclock_page(struct kvm_vcpu *v)
{
	struct kvm_vcpu_arch *vcpu = &v->arch;
	struct pvclock_vcpu_time_info guest_hv_clock;

	if (unlikely(kvm_read_guest_cached(v->kvm, &vcpu->pv_time,
		&guest_hv_clock, sizeof(guest_hv_clock))))
		return;

	/* This VCPU is paused, but it's legal for a guest to read another
	 * VCPU's kvmclock, so we really have to follow the specification where
	 * it says that version is odd if data is being modified, and even after
	 * it is consistent.
	 *
	 * Version field updates must be kept separate. This is because
	 * kvm_write_guest_cached might use a "rep movs" instruction, and
	 * writes within a string instruction are weakly ordered. So there
	 * are three writes overall.
	 *
	 * As a small optimization, only write the version field in the first
	 * and third write. The vcpu->pv_time cache is still valid, because the
	 * version field is the first in the struct.
	 */
	BUILD_BUG_ON(offsetof(struct pvclock_vcpu_time_info, version) != 0);

	if (guest_hv_clock.version & 1)
		++guest_hv_clock.version;  /* first time write, random junk */

	vcpu->hv_clock.version = guest_hv_clock.version + 1;
	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
				&vcpu->hv_clock,
				sizeof(vcpu->hv_clock.version));

	smp_wmb();

	/* retain PVCLOCK_GUEST_STOPPED if set in guest copy */
	vcpu->hv_clock.flags |= (guest_hv_clock.flags & PVCLOCK_GUEST_STOPPED);

	if (vcpu->pvclock_set_guest_stopped_request) {
		vcpu->hv_clock.flags |= PVCLOCK_GUEST_STOPPED;
		vcpu->pvclock_set_guest_stopped_request = false;
	}

	trace_kvm_pvclock_update(v->vcpu_id, &vcpu->hv_clock);

	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
				&vcpu->hv_clock,
				sizeof(vcpu->hv_clock));

	smp_wmb();

	vcpu->hv_clock.version++;
	kvm_write_guest_cached(v->kvm, &vcpu->pv_time,
				&vcpu->hv_clock,
				sizeof(vcpu->hv_clock.version));
}
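
The version rule that kvm_setup_pvclock_page() follows (odd while data is being modified, even once it is consistent) is a sequence-count protocol. The following single-threaded sketch shows only the counter logic. The kernel's smp_wmb() calls and the guest's matching read barriers are deliberately omitted, so this is not a concurrency-safe implementation. `struct time_record`, `write_begin()`, `write_end()` and `read_valid()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical shared record; version comes first, as in pvclock_vcpu_time_info. */
struct time_record {
	uint32_t version;
	uint64_t system_time;
};

/* Writer: start an update, leaving version odd. */
static void write_begin(struct time_record *r)
{
	if (r->version & 1)
		r->version++;	/* first write over random junk: make it even */
	r->version++;		/* odd: update in progress */
}

/* Writer: finish the update, leaving version even. */
static void write_end(struct time_record *r)
{
	assert(r->version & 1);	/* must follow write_begin() */
	r->version++;
}

/*
 * Reader: data read after sampling 'start' is usable only if 'start'
 * was even and the version has not changed since.
 */
static int read_valid(const struct time_record *r, uint32_t start)
{
	return !(start & 1) && r->version == start;
}
```

Starting from a junk version of 7, write_begin() moves the version to 9 (first to 8, then to odd). A reader that samples during the update is rejected. After write_end() the version is 10 and reads succeed until the next update starts.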

static int kvm_guest_time_update(struct kvm_vcpu *v)
{
	unsigned long flags, tgt_tsc_khz;
	struct kvm_vcpu_arch *vcpu = &v->arch;
	struct kvm_arch *ka = &v->kvm->arch;
	s64 kernel_ns;
	u64 tsc_timestamp, host_tsc;
	u8 pvclock_flags;
	bool use_master_clock;

	kernel_ns = 0;
	host_tsc = 0;

	/*
	 * If the host uses TSC clock, then passthrough TSC as stable
	 * to the guest.
	 */
	spin_lock(&ka->pvclock_gtod_sync_lock);
	use_master_clock = ka->use_master_clock;
	if (use_master_clock) {
		host_tsc = ka->master_cycle_now;
		kernel_ns = ka->master_kernel_ns;
	}
	spin_unlock(&ka->pvclock_gtod_sync_lock);

	/* Keep irq disabled to prevent changes to the clock */
	local_irq_save(flags);
	tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz);
	if (unlikely(tgt_tsc_khz == 0)) {
		local_irq_restore(flags);
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
		return 1;
	}
	if (!use_master_clock) {
		host_tsc = rdtsc();
		kernel_ns = ktime_get_boot_ns();
	}

	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);

	/*
	 * We may have to catch up the TSC to match elapsed wall clock
	 * time for two reasons, even if kvmclock is used.
	 *   1) CPU could have been running below the maximum TSC rate
	 *   2) Broken TSC compensation resets the base at each VCPU
	 *      entry to avoid unknown leaps of TSC even when running
	 *      again on the same CPU. This may cause apparent elapsed
	 *      time to disappear, and the guest to stand still or run
	 *      very slowly.
	 */
	if (vcpu->tsc_catchup) {
		u64 tsc = compute_guest_tsc(v, kernel_ns);
		if (tsc > tsc_timestamp) {
			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
			tsc_timestamp = tsc;
		}
	}

	local_irq_restore(flags);

	/* With all the info we got, fill in the values */

	if (kvm_has_tsc_control)
		tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);

	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
				   &vcpu->hv_clock.tsc_shift,
				   &vcpu->hv_clock.tsc_to_system_mul);
		vcpu->hw_tsc_khz = tgt_tsc_khz;
	}

	vcpu->hv_clock.tsc_timestamp = tsc_timestamp;
	vcpu->hv_clock.system_time = kernel_ns + v->kvm->arch.kvmclock_offset;
	vcpu->last_guest_tsc = tsc_timestamp;

	/* If the host uses TSC clocksource, then it is stable */
	pvclock_flags = 0;
	if (use_master_clock)
		pvclock_flags |= PVCLOCK_TSC_STABLE_BIT;

	vcpu->hv_clock.flags = pvclock_flags;

	if (vcpu->pv_time_enabled)
		kvm_setup_pvclock_page(v);
	if (v == kvm_get_vcpu(v->kvm, 0))
		kvm_hv_setup_tsc_page(v->kvm, &vcpu->hv_clock);
	return 0;
}

/*
 * kvmclock updates which are isolated to a given vcpu, such as
 * vcpu->cpu migration, should not allow system_timestamp from
 * the rest of the vcpus to remain static. Otherwise ntp frequency
 * correction applies to one vcpu's system_timestamp but not
 * the others.
 *
 * So in those cases, request a kvmclock update for all vcpus.
 * We need to rate-limit these requests though, as they can
 * considerably slow guests that have a large number of vcpus.
 * The time for a remote vcpu to update its kvmclock is bounded
 * by the delay we use to rate-limit the updates.
 */

#define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)

static void kvmclock_update_fn(struct work_struct *work)
{
	int i;
	struct delayed_work *dwork = to_delayed_work(work);
	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
					   kvmclock_update_work);
	struct kvm *kvm = container_of(ka, struct kvm, arch);
	struct kvm_vcpu *vcpu;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
		kvm_vcpu_kick(vcpu);
	}
}

static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
{
	struct kvm *kvm = v->kvm;

	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
					KVMCLOCK_UPDATE_DELAY);
}
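
kvm_gen_kvmclock_update() updates the requesting vCPU immediately but defers the all-vCPU broadcast to delayed work. Because schedule_delayed_work() does not requeue a work item that is already pending, every request that arrives within the 100 ms window collapses into one broadcast. The following single-threaded sketch simulates that coalescing with a millisecond clock. `struct delayed_update`, `request_update()`, `tick()` and `UPDATE_DELAY_MS` are hypothetical names, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

#define UPDATE_DELAY_MS 100	/* stands in for KVMCLOCK_UPDATE_DELAY */

struct delayed_update {
	bool pending;		/* work queued but not yet run */
	long fire_at_ms;	/* when the queued work will run */
	int broadcasts;		/* all-vCPU clock updates actually performed */
};

/* Like schedule_delayed_work(): does nothing while the work is pending. */
static void request_update(struct delayed_update *d, long now_ms)
{
	if (!d->pending) {
		d->pending = true;
		d->fire_at_ms = now_ms + UPDATE_DELAY_MS;
	}
}

/* Advance simulated time; run the work once its delay has elapsed. */
static void tick(struct delayed_update *d, long now_ms)
{
	if (d->pending && now_ms >= d->fire_at_ms) {
		d->pending = false;
		d->broadcasts++;	/* kvmclock_update_fn() would kick every vCPU here */
	}
}
```

With requests at 0, 10 and 50 ms, there is exactly one broadcast, at 100 ms. A request at 150 ms produces a second broadcast at 250 ms. This bounds how long a remote vCPU can go without a kvmclock refresh.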

#define KVMCLOCK_SYNC_PERIOD (300 * HZ)

static void kvmclock_sync_fn(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);
	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
					   kvmclock_sync_work);
	struct kvm *kvm = container_of(ka, struct kvm, arch);

	if (!kvmclock_periodic_sync)
		return;

	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
					KVMCLOCK_SYNC_PERIOD);
}

static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
	u64 mcg_cap = vcpu->arch.mcg_cap;
	unsigned bank_num = mcg_cap & 0xff;
	u32 msr = msr_info->index;
	u64 data = msr_info->data;

	switch (msr) {
	case MSR_IA32_MCG_STATUS:
		vcpu->arch.mcg_status = data;
		break;
	case MSR_IA32_MCG_CTL:
		if (!(mcg_cap & MCG_CTL_P) &&
		    (data || !msr_info->host_initiated))
			return 1;
		if (data != 0 && data != ~(u64)0)
			return 1;
		vcpu->arch.mcg_ctl = data;
		break;
	default:
		if (msr >= MSR_IA32_MC0_CTL &&
		    msr < MSR_IA32_MCx_CTL(bank_num)) {
			u32 offset = msr - MSR_IA32_MC0_CTL;
			/* Only 0 or all 1s can be written to IA32_MCi_CTL.
			 * Some Linux kernels, though, clear bit 10 in bank 4 to
			 * work around a BIOS/GART TBL issue on AMD K8s; ignore
			 * this to avoid an uncaught #GP in the guest.
			 */
			if ((offset & 0x3) == 0 &&
			    data != 0 && (data | (1 << 10)) != ~(u64)0)
				return -1;
			if (!msr_info->host_initiated &&
				(offset & 0x3) == 1 && data != 0)
				return -1;
			vcpu->arch.mce_banks[offset] = data;
			break;
		}
		return 1;
	}
	return 0;
}
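
Within the bank range, set_msr_mce() applies two checks. It derives the register type from `offset & 0x3`: each bank has four consecutive MSRs (CTL, STATUS, ADDR, MISC). The following sketch isolates those checks as a pure function so that the accepted values are easy to see. `mce_bank_write_ok()` is a hypothetical helper, and it ignores the bank-count bound that the real code checks first.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Would a write of 'data' to bank register 'offset' (msr - MC0_CTL) be accepted? */
static bool mce_bank_write_ok(uint32_t offset, uint64_t data, bool host_initiated)
{
	/* offset & 3: 0 = MCi_CTL, 1 = MCi_STATUS, 2 = MCi_ADDR, 3 = MCi_MISC */
	if ((offset & 0x3) == 0 &&
	    data != 0 && (data | (1ULL << 10)) != ~(uint64_t)0)
		return false;	/* CTL: only 0 or all ones, bit 10 may be clear */
	if (!host_initiated && (offset & 0x3) == 1 && data != 0)
		return false;	/* STATUS: the guest may only clear it */
	return true;
}
```

Like the kernel code, this check accepts a clear bit 10 in any bank's CTL register, not only in bank 4 as the comment describes. Only the host (for example, during migration) may write a nonzero MCi_STATUS.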

static int xen_hvm_config(struct kvm_vcpu *vcpu, u64 data)
{
	struct kvm *kvm = vcpu->kvm;
	int lm = is_long_mode(vcpu);
	u8 *blob_addr = lm ? (u8 *)(long)kvm->arch.xen_hvm_config.blob_addr_64
		: (u8 *)(long)kvm->arch.xen_hvm_config.blob_addr_32;
	u8 blob_size = lm ? kvm->arch.xen_hvm_config.blob_size_64
		: kvm->arch.xen_hvm_config.blob_size_32;
	u32 page_num = data & ~PAGE_MASK;
	u64 page_addr = data & PAGE_MASK;
	u8 *page;
	int r;

	r = -E2BIG;
	if (page_num >= blob_size)
		goto out;
	r = -ENOMEM;
	page = memdup_user(blob_addr + (page_num * PAGE_SIZE), PAGE_SIZE);
	if (IS_ERR(page)) {
		r = PTR_ERR(page);
		goto out;
	}
	if (kvm_vcpu_write_guest(vcpu, page_addr, page, PAGE_SIZE))
		goto out_free;
	r = 0;
out_free:
	kfree(page);
out:
	return r;
}
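
xen_hvm_config() decodes a single MSR value into two parts. The bits below PAGE_MASK select which page of the userspace hypercall blob to copy. The page-aligned bits give the guest-physical destination. The following sketch shows the decoding under the assumption of 4 KiB pages. `SKETCH_PAGE_SIZE`, `struct xen_msr_req` and `decode_xen_msr()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumes 4 KiB pages */
#define SKETCH_PAGE_MASK (~(SKETCH_PAGE_SIZE - 1))

struct xen_msr_req {
	uint32_t page_num;	/* which page of the hypercall blob to copy */
	uint64_t page_addr;	/* guest-physical destination page */
	uint64_t blob_off;	/* byte offset of that page within the blob */
};

/* Returns 0 on success, -1 where xen_hvm_config() would return -E2BIG. */
static int decode_xen_msr(uint64_t data, uint8_t blob_size, struct xen_msr_req *req)
{
	req->page_num = (uint32_t)(data & ~SKETCH_PAGE_MASK);
	req->page_addr = data & SKETCH_PAGE_MASK;
	if (req->page_num >= blob_size)
		return -1;
	req->blob_off = (uint64_t)req->page_num * SKETCH_PAGE_SIZE;
	return 0;
}
```

For example, writing 0x12345002 asks for page 2 of the blob (byte offset 8192) to be copied to guest page 0x12345000.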

static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
{
	gpa_t gpa = data & ~0x3f;

	/* Bits 3:5 are reserved and should be zero */
	if (data & 0x38)
		return 1;

	vcpu->arch.apf.msr_val = data;

	if (!(data & KVM_ASYNC_PF_ENABLED)) {
		kvm_clear_async_pf_completion_queue(vcpu);
		kvm_async_pf_hash_reset(vcpu);
		return 0;
	}

	if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.apf.data, gpa,
					sizeof(u32)))
		return 1;

	vcpu->arch.apf.send_user_only = !(data & KVM_ASYNC_PF_SEND_ALWAYS);
	vcpu->arch.apf.delivery_as_pf_vmexit = data & KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT;
	kvm_async_pf_wakeup_all(vcpu);
	return 0;
}
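
kvm_pv_enable_async_pf() interprets the MSR_KVM_ASYNC_PF_EN value as flag bits 0 to 2, reserved bits 3 to 5, and a 64-byte-aligned guest address in the remaining bits. The following sketch decodes such a value. The flag values are assumed to match `KVM_ASYNC_PF_ENABLED`, `KVM_ASYNC_PF_SEND_ALWAYS` and `KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT` from the KVM uapi header of this period. The `SKETCH_*` macros, `struct apf_cfg` and `decode_async_pf_msr()` are hypothetical. Unlike the kernel, which returns early when the enable bit is clear, the sketch always decodes every field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the KVM_ASYNC_PF_* bits in <asm/kvm_para.h>. */
#define SKETCH_APF_ENABLED		(1ULL << 0)
#define SKETCH_APF_SEND_ALWAYS		(1ULL << 1)
#define SKETCH_APF_AS_PF_VMEXIT		(1ULL << 2)
#define SKETCH_APF_RESERVED		0x38ULL		/* bits 3..5 */

struct apf_cfg {
	bool enabled;
	bool send_user_only;	/* deliver only while the guest runs in user mode */
	bool as_pf_vmexit;
	uint64_t gpa;		/* guest address of the shared u32 token */
};

/* Returns 1 (MSR write fails) if reserved bits are set, else 0. */
static int decode_async_pf_msr(uint64_t data, struct apf_cfg *cfg)
{
	if (data & SKETCH_APF_RESERVED)
		return 1;
	cfg->gpa = data & ~0x3fULL;
	cfg->enabled = data & SKETCH_APF_ENABLED;
	cfg->send_user_only = !(data & SKETCH_APF_SEND_ALWAYS);
	cfg->as_pf_vmexit = data & SKETCH_APF_AS_PF_VMEXIT;
	return 0;
}
```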

static void kvmclock_reset(struct kvm_vcpu *vcpu)
{
	vcpu->arch.pv_time_enabled = false;
}

static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu, bool invalidate_gpa)
{
	++vcpu->stat.tlb_flush;
	kvm_x86_ops->tlb_flush(vcpu, invalidate_gpa);
}

static void record_steal_time(struct kvm_vcpu *vcpu)
{
	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
		return;

	if (unlikely(kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
		return;

	/*
	 * Doing a TLB flush here, on the guest's behalf, can avoid
	 * expensive IPIs.
	 */
	if (xchg(&vcpu->arch.st.steal.preempted, 0) & KVM_VCPU_FLUSH_TLB)
		kvm_vcpu_flush_tlb(vcpu, false);

	if (vcpu->arch.st.steal.version & 1)
		vcpu->arch.st.steal.version += 1;  /* first time write, random junk */

	vcpu->arch.st.steal.version += 1;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));

	smp_wmb();

	vcpu->arch.st.steal.steal += current->sched_info.run_delay -
		vcpu->arch.st.last_steal;
	vcpu->arch.st.last_steal = current->sched_info.run_delay;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));

	smp_wmb();

	vcpu->arch.st.steal.version += 1;

	kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.st.stime,
		&vcpu->arch.st.steal, sizeof(struct kvm_steal_time));
}
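
record_steal_time() does two things. First, it atomically clears the `preempted` byte and flushes the TLB if the guest had left a flush request there, which spares the guest an IPI to a preempted vCPU. Second, it adds only the growth of the task's runqueue delay since the previous sample. The following single-threaded sketch models both steps with C11 atomics. The flag values are assumed to mirror `KVM_VCPU_PREEMPTED` and `KVM_VCPU_FLUSH_TLB`. The structs and `record_steal()` are hypothetical, and the version handling is left out because it follows the same odd/even protocol as the pvclock page.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SKETCH_VCPU_PREEMPTED	(1u << 0)	/* assumed to mirror KVM_VCPU_PREEMPTED */
#define SKETCH_VCPU_FLUSH_TLB	(1u << 1)	/* assumed to mirror KVM_VCPU_FLUSH_TLB */

struct steal_area {			/* shared with the guest */
	_Atomic uint8_t preempted;	/* guest may OR in FLUSH_TLB */
	uint64_t steal;			/* cumulative stolen time, ns */
};

struct host_steal_state {
	uint64_t last_steal;		/* run_delay at the previous update */
	int tlb_flushes;		/* flushes done on the guest's behalf */
};

static void record_steal(struct steal_area *sa, struct host_steal_state *hs,
			 uint64_t run_delay)
{
	/* Exchange, not load + store, so a concurrent guest OR is never lost. */
	if (atomic_exchange(&sa->preempted, 0) & SKETCH_VCPU_FLUSH_TLB)
		hs->tlb_flushes++;

	/* Report only the runqueue wait accumulated since the last sample. */
	sa->steal += run_delay - hs->last_steal;
	hs->last_steal = run_delay;
}
```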

int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
{
	bool pr = false;
	u32 msr = msr_info->index;
	u64 data = msr_info->data;

	switch (msr) {
	case MSR_AMD64_NB_CFG:
	case MSR_IA32_UCODE_WRITE:
	case MSR_VM_HSAVE_PA:
	case MSR_AMD64_PATCH_LOADER:
	case MSR_AMD64_BU_CFG2:
	case MSR_AMD64_DC_CFG:
		break;

	case MSR_IA32_UCODE_REV:
		if (msr_info->host_initiated)
			vcpu->arch.microcode_version = data;
		break;
	case MSR_EFER:
		return set_efer(vcpu, data);
	case MSR_K7_HWCR:
		data &= ~(u64)0x40;	/* ignore flush filter disable */
		data &= ~(u64)0x100;	/* ignore ignne emulation enable */
		data &= ~(u64)0x8;	/* ignore TLB cache disable */
		data &= ~(u64)0x40000;	/* ignore Mc status write enable */
		if (data != 0) {
			vcpu_unimpl(vcpu, "unimplemented HWCR wrmsr: 0x%llx\n",
				    data);
			return 1;
		}
		break;
	case MSR_FAM10H_MMIO_CONF_BASE:
		if (data != 0) {
			vcpu_unimpl(vcpu, "unimplemented MMIO_CONF_BASE wrmsr: "
				    "0x%llx\n", data);
			return 1;
		}
		break;
	case MSR_IA32_DEBUGCTLMSR:
		if (!data) {
			/* We support the non-activated case already */
			break;
		} else if (data & ~(DEBUGCTLMSR_LBR | DEBUGCTLMSR_BTF)) {
			/* Values other than LBR and BTF are vendor-specific,
			   thus reserved and should throw a #GP */
			return 1;
		}
		vcpu_unimpl(vcpu, "%s: MSR_IA32_DEBUGCTLMSR 0x%llx, nop\n",
			    __func__, data);
		break;
	case 0x200 ... 0x2ff:
		return kvm_mtrr_set_msr(vcpu, msr, data);
	case MSR_IA32_APICBASE:
		return kvm_set_apic_base(vcpu, msr_info);
	case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff:
		return kvm_x2apic_msr_write(vcpu, msr, data);
	case MSR_IA32_TSCDEADLINE:
		kvm_set_lapic_tscdeadline_msr(vcpu, data);
		break;
	case MSR_IA32_TSC_ADJUST:
		if (guest_cpuid_has(vcpu, X86_FEATURE_TSC_ADJUST)) {
			if (!msr_info->host_initiated) {
				s64 adj = data - vcpu->arch.ia32_tsc_adjust_msr;
				adjust_tsc_offset_guest(vcpu, adj);
			}
			vcpu->arch.ia32_tsc_adjust_msr = data;
		}
		break;
	case MSR_IA32_MISC_ENABLE:
		vcpu->arch.ia32_misc_enable_msr = data;
		break;
	case MSR_IA32_SMBASE:
		if (!msr_info->host_initiated)
			return 1;
		vcpu->arch.smbase = data;
		break;
	case MSR_IA32_TSC:
		kvm_write_tsc(vcpu, msr_info);
		break;
	case MSR_SMI_COUNT:
		if (!msr_info->host_initiated)
			return 1;
		vcpu->arch.smi_count = data;
		break;
	case MSR_KVM_WALL_CLOCK_NEW:
	case MSR_KVM_WALL_CLOCK:
		vcpu->kvm->arch.wall_clock = data;
		kvm_write_wall_clock(vcpu->kvm, data);
		break;
	case MSR_KVM_SYSTEM_TIME_NEW:
	case MSR_KVM_SYSTEM_TIME: {
		struct kvm_arch *ka = &vcpu->kvm->arch;

		kvmclock_reset(vcpu);

		if (vcpu->vcpu_id == 0 && !msr_info->host_initiated) {
			bool tmp = (msr == MSR_KVM_SYSTEM_TIME);

			if (ka->boot_vcpu_runs_old_kvmclock != tmp)
				kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);

			ka->boot_vcpu_runs_old_kvmclock = tmp;
		}

		vcpu->arch.time = data;
		kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);

		/* we verify if the enable bit is set... */
		if (!(data & 1))
			break;

		if (kvm_gfn_to_hva_cache_init(vcpu->kvm,
		     &vcpu->arch.pv_time, data & ~1ULL,
		     sizeof(struct pvclock_vcpu_time_info)))
			vcpu->arch.pv_time_enabled = false;
		else
			vcpu->arch.pv_time_enabled = true;

		break;
	}
	case MSR_KVM_ASYNC_PF_EN:
		if (kvm_pv_enable_async_pf(vcpu, data))
			return 1;
		break;
	case MSR_KVM_STEAL_TIME:

		if (unlikely(!sched_info_on()))
			return 1;

		if (data & KVM_STEAL_RESERVED_MASK)
			return 1;

		if (kvm_gfn_to_hva_cache_init(vcpu->kvm, &vcpu->arch.st.stime,
						data & KVM_STEAL_VALID_BITS,
						sizeof(struct kvm_steal_time)))
			return 1;

		vcpu->arch.st.msr_val = data;

		if (!(data & KVM_MSR_ENABLED))
			break;

		kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);

		break;
	case MSR_KVM_PV_EOI_EN:
		if (kvm_lapic_enable_pv_eoi(vcpu, data))
			return 1;
		break;

	case MSR_IA32_MCG_CTL:
	case MSR_IA32_MCG_STATUS:
	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
		return set_msr_mce(vcpu, msr_info);

	case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
	case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
		pr = true; /* fall through */
	case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
	case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
		if (kvm_pmu_is_valid_msr(vcpu, msr))
			return kvm_pmu_set_msr(vcpu, msr_info);

		if (pr || data != 0)
			vcpu_unimpl(vcpu, "disabled perfctr wrmsr: "
				    "0x%x data 0x%llx\n", msr, data);
		break;
	case MSR_K7_CLK_CTL:
		/*
		 * Ignore all writes to this no-longer-documented MSR.
		 * Writes are only relevant for old K7 processors,
		 * all pre-dating SVM, but are a recommended workaround from
		 * AMD for these chips. It is possible to specify the
		 * affected processor models on the command line, hence
		 * the need to ignore the workaround.
		 */
		break;
	case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15:
	case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
	case HV_X64_MSR_CRASH_CTL:
	case HV_X64_MSR_STIMER0_CONFIG ... HV_X64_MSR_STIMER3_COUNT:
	case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
	case HV_X64_MSR_TSC_EMULATION_CONTROL:
	case HV_X64_MSR_TSC_EMULATION_STATUS:
		return kvm_hv_set_msr_common(vcpu, msr, data,
					     msr_info->host_initiated);
	case MSR_IA32_BBL_CR_CTL3:
		/* Drop writes to this legacy MSR -- see rdmsr
		 * counterpart for further detail.
		 */
		if (report_ignored_msrs)
			vcpu_unimpl(vcpu, "ignored wrmsr: 0x%x data 0x%llx\n",
				msr, data);
		break;
	case MSR_AMD64_OSVW_ID_LENGTH:
		if (!guest_cpuid_has(vcpu, X86_FEATURE_OSVW))
			return 1;
		vcpu->arch.osvw.length = data;
		break;
	case MSR_AMD64_OSVW_STATUS:
		if (!guest_cpuid_has(vcpu, X86_FEATURE_OSVW))
			return 1;
		vcpu->arch.osvw.status = data;
		break;
	case MSR_PLATFORM_INFO:
		if (!msr_info->host_initiated ||
		    data & ~MSR_PLATFORM_INFO_CPUID_FAULT ||
		    (!(data & MSR_PLATFORM_INFO_CPUID_FAULT) &&
		     cpuid_fault_enabled(vcpu)))
			return 1;
		vcpu->arch.msr_platform_info = data;
		break;
	case MSR_MISC_FEATURES_ENABLES:
		if (data & ~MSR_MISC_FEATURES_ENABLES_CPUID_FAULT ||
		    (data & MSR_MISC_FEATURES_ENABLES_CPUID_FAULT &&
		     !supports_cpuid_fault(vcpu)))
			return 1;
		vcpu->arch.msr_misc_features_enables = data;
		break;
	default:
		if (msr && (msr == vcpu->kvm->arch.xen_hvm_config.msr))
			return xen_hvm_config(vcpu, data);
		if (kvm_pmu_is_valid_msr(vcpu, msr))
			return kvm_pmu_set_msr(vcpu, msr_info);
		if (!ignore_msrs) {
			vcpu_debug_ratelimited(vcpu, "unhandled wrmsr: 0x%x data 0x%llx\n",
2012-06-03 18:17:48 +00:00
				    msr, data);
2009-06-25 10:36:49 +00:00
			return 1;
		} else {
2017-11-08 12:32:08 +00:00
			if (report_ignored_msrs)
				vcpu_unimpl(vcpu,
					"ignored wrmsr: 0x%x data 0x%llx\n",
					msr, data);
2009-06-25 10:36:49 +00:00
			break;
		}
2007-10-30 17:44:17 +00:00
	}
	return 0;
}
EXPORT_SYMBOL_GPL(kvm_set_msr_common);
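The MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLES write handlers above accept only values that touch nothing but the CPUID-faulting bit, and they cross-check each other: faulting cannot be enabled unless the platform advertises it, and the advertisement cannot be withdrawn while faulting is enabled. The following is a standalone sketch of those two predicates, not kernel code. The struct and all sketch_* helpers are hypothetical, and the bit positions (bit 31 of MSR_PLATFORM_INFO, bit 0 of MSR_MISC_FEATURES_ENABLES) are assumptions based on our reading of msr-index.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, mirroring the kernel's msr-index.h definitions. */
#define PLATFORM_INFO_CPUID_FAULT	(1ULL << 31)
#define MISC_FEATURES_CPUID_FAULT	(1ULL << 0)

/* Hypothetical stand-in for the relevant per-vCPU state. */
struct sketch_vcpu {
	uint64_t platform_info;	/* models vcpu->arch.msr_platform_info */
	uint64_t misc_features;	/* models vcpu->arch.msr_misc_features_enables */
};

/* CPUID faulting is available if the platform advertises it. */
static bool sketch_supports_cpuid_fault(const struct sketch_vcpu *v)
{
	return v->platform_info & PLATFORM_INFO_CPUID_FAULT;
}

/* CPUID faulting is in effect if the guest has turned it on. */
static bool sketch_cpuid_fault_enabled(const struct sketch_vcpu *v)
{
	return v->misc_features & MISC_FEATURES_CPUID_FAULT;
}

/* Returns 0 on success, 1 to reject the write (the caller injects #GP). */
static int sketch_set_platform_info(struct sketch_vcpu *v, uint64_t data,
				    bool host_initiated)
{
	if (!host_initiated ||
	    (data & ~PLATFORM_INFO_CPUID_FAULT) ||
	    (!(data & PLATFORM_INFO_CPUID_FAULT) && sketch_cpuid_fault_enabled(v)))
		return 1;
	v->platform_info = data;
	return 0;
}

static int sketch_set_misc_features(struct sketch_vcpu *v, uint64_t data)
{
	if ((data & ~MISC_FEATURES_CPUID_FAULT) ||
	    ((data & MISC_FEATURES_CPUID_FAULT) && !sketch_supports_cpuid_fault(v)))
		return 1;
	v->misc_features = data;
	return 0;
}
```

Only the host (host_initiated) may write MSR_PLATFORM_INFO in this model, so a guest write is rejected even when the value itself would be valid.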


/*
 * Reads an msr value (of 'msr->index') into 'msr->data'.
 * Returns 0 on success, non-0 otherwise.
 * Assumes vcpu_load() was already called.
 */
2015-04-08 13:30:38 +00:00
int kvm_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
2007-10-30 17:44:17 +00:00
{
2015-04-08 13:30:38 +00:00
	return kvm_x86_ops->get_msr(vcpu, msr);
2007-10-30 17:44:17 +00:00
}
2014-12-11 05:52:58 +00:00
EXPORT_SYMBOL_GPL(kvm_get_msr);
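kvm_get_msr() holds no MSR knowledge of its own: it forwards to the vendor module (VMX or SVM) through the kvm_x86_ops function-pointer table, and, as we understand it, the vendor handlers fall back to kvm_get_msr_common() for MSRs they do not own. A minimal sketch of that layering follows. All names are hypothetical. The 0x3a "vendor" MSR and its value are made up for illustration, while 0xcd returning 3 mirrors the FSB-frequency case in kvm_get_msr_common().

```c
#include <assert.h>
#include <stdint.h>

struct sketch_msr_data {
	uint32_t index;
	uint64_t data;
};

/* Hypothetical, cut-down analogue of struct kvm_x86_ops. */
struct sketch_x86_ops {
	int (*get_msr)(struct sketch_msr_data *msr);
};

/* Stand-in for kvm_get_msr_common(): vendor-neutral MSRs live here. */
static int sketch_get_msr_common(struct sketch_msr_data *msr)
{
	switch (msr->index) {
	case 0xcd:		/* FSB frequency, as in kvm_get_msr_common() */
		msr->data = 3;
		return 0;
	default:
		return 1;	/* unhandled: the caller would inject #GP */
	}
}

/* Stand-in for a vendor handler such as the VMX or SVM get_msr. */
static int fake_vendor_get_msr(struct sketch_msr_data *msr)
{
	switch (msr->index) {
	case 0x3a:		/* pretend vendor-specific MSR */
		msr->data = 5;
		return 0;
	default:
		return sketch_get_msr_common(msr);
	}
}

static const struct sketch_x86_ops fake_vendor_ops = {
	.get_msr = fake_vendor_get_msr,
};

/* Filled in once when the vendor module registers, like kvm_x86_ops. */
static const struct sketch_x86_ops *sketch_ops = &fake_vendor_ops;

/* Common entry point: no vendor knowledge, it just forwards. */
static int sketch_get_msr(struct sketch_msr_data *msr)
{
	return sketch_ops->get_msr(msr);
}
```

The indirection keeps the generic x86 code free of VMX/SVM details while still letting each vendor override individual MSRs.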
2007-10-30 17:44:17 +00:00

2018-07-26 11:01:52 +00:00
static int get_msr_mce(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
2007-10-30 17:44:17 +00:00
{
	u64 data;
2009-05-11 08:48:15 +00:00
	u64 mcg_cap = vcpu->arch.mcg_cap;
	unsigned bank_num = mcg_cap & 0xff;
2007-10-30 17:44:17 +00:00

	switch (msr) {
	case MSR_IA32_P5_MC_ADDR:
	case MSR_IA32_P5_MC_TYPE:
2009-05-11 08:48:15 +00:00
		data = 0;
		break;
2007-10-30 17:44:17 +00:00
	case MSR_IA32_MCG_CAP:
2009-05-11 08:48:15 +00:00
		data = vcpu->arch.mcg_cap;
		break;
2008-02-11 19:28:27 +00:00
	case MSR_IA32_MCG_CTL:
2018-07-26 11:01:52 +00:00
		if (!(mcg_cap & MCG_CTL_P) && !host)
2009-05-11 08:48:15 +00:00
			return 1;
		data = vcpu->arch.mcg_ctl;
		break;
	case MSR_IA32_MCG_STATUS:
		data = vcpu->arch.mcg_status;
		break;
	default:
		if (msr >= MSR_IA32_MC0_CTL &&
2014-09-23 02:44:35 +00:00
		    msr < MSR_IA32_MCx_CTL(bank_num)) {
2009-05-11 08:48:15 +00:00
			u32 offset = msr - MSR_IA32_MC0_CTL;
			data = vcpu->arch.mce_banks[offset];
			break;
		}
		return 1;
	}
	*pdata = data;
	return 0;
}
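get_msr_mce() treats the machine-check bank registers as one flat array. Each bank owns four consecutive MSRs (CTL, STATUS, ADDR, MISC) starting at MSR_IA32_MC0_CTL, the low byte of MCG_CAP gives the bank count, and MCG_CTL is readable by the guest only when MCG_CAP advertises MCG_CTL_P (the host may always read it). The sketch below reproduces that arithmetic outside the kernel. The constants (MC0_CTL at 0x400, a stride of 4 MSRs per bank, MCG_CTL_P as bit 8, at most 32 banks) are assumptions matching our reading of the kernel headers, and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, mirroring msr-index.h and the KVM MCE limits. */
#define SK_MC0_CTL	0x400u
#define SK_MCx_CTL(x)	(SK_MC0_CTL + 4u * (x))
#define SK_MCG_CTL_P	(1ULL << 8)
#define SK_MAX_BANKS	32u

struct sketch_mce {
	uint64_t mcg_cap;			/* low byte = number of banks */
	uint64_t mcg_ctl;
	uint64_t banks[SK_MAX_BANKS * 4];	/* CTL, STATUS, ADDR, MISC per bank */
};

/* Read one bank MSR; returns 0 on success, 1 if out of range. */
static int sketch_get_bank_msr(const struct sketch_mce *m, uint32_t msr,
			       uint64_t *pdata)
{
	unsigned bank_num = m->mcg_cap & 0xff;

	/* KVM bounds the bank count when MCE is set up; clamp here instead. */
	if (bank_num > SK_MAX_BANKS)
		bank_num = SK_MAX_BANKS;

	if (msr >= SK_MC0_CTL && msr < SK_MCx_CTL(bank_num)) {
		*pdata = m->banks[msr - SK_MC0_CTL];
		return 0;
	}
	return 1;
}

/* MCG_CTL exists for the guest only if MCG_CAP advertises it. */
static int sketch_get_mcg_ctl(const struct sketch_mce *m, uint64_t *pdata,
			      bool host)
{
	if (!(m->mcg_cap & SK_MCG_CTL_P) && !host)
		return 1;
	*pdata = m->mcg_ctl;
	return 0;
}
```

For example, with two banks the valid range is 0x400 to 0x407, and MSR 0x405 (bank 1 STATUS) maps to array slot 5.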

2015-04-08 13:30:38 +00:00
int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
2009-05-11 08:48:15 +00:00
{
2015-04-08 13:30:38 +00:00
	switch (msr_info->index) {
2009-05-11 08:48:15 +00:00
	case MSR_IA32_PLATFORM_ID:
2007-10-30 17:44:17 +00:00
	case MSR_IA32_EBL_CR_POWERON:
2008-07-22 06:00:45 +00:00
	case MSR_IA32_DEBUGCTLMSR:
	case MSR_IA32_LASTBRANCHFROMIP:
	case MSR_IA32_LASTBRANCHTOIP:
	case MSR_IA32_LASTINTFROMIP:
	case MSR_IA32_LASTINTTOIP:
2009-05-14 05:30:10 +00:00
	case MSR_K8_SYSCFG:
2015-09-18 15:33:04 +00:00
	case MSR_K8_TSEG_ADDR:
	case MSR_K8_TSEG_MASK:
2009-05-14 05:30:10 +00:00
	case MSR_K7_HWCR:
2008-12-29 15:32:28 +00:00
	case MSR_VM_HSAVE_PA:
2009-06-24 10:44:34 +00:00
	case MSR_K8_INT_PENDING_MSG:
2009-06-24 13:37:05 +00:00
	case MSR_AMD64_NB_CFG:
2009-07-02 13:04:14 +00:00
	case MSR_FAM10H_MMIO_CONF_BASE:
2013-02-19 18:33:13 +00:00
	case MSR_AMD64_BU_CFG2:
2016-05-31 14:38:24 +00:00
	case MSR_IA32_PERF_CTL:
2017-04-06 13:22:20 +00:00
	case MSR_AMD64_DC_CFG:
2015-04-08 13:30:38 +00:00
		msr_info->data = 0;
2007-10-30 17:44:17 +00:00
		break;
2018-02-05 19:24:52 +00:00
	case MSR_F15H_PERF_CTL0 ... MSR_F15H_PERF_CTR5:
2015-06-12 05:34:56 +00:00
	case MSR_K7_EVNTSEL0 ... MSR_K7_EVNTSEL3:
	case MSR_K7_PERFCTR0 ... MSR_K7_PERFCTR3:
	case MSR_P6_PERFCTR0 ... MSR_P6_PERFCTR1:
	case MSR_P6_EVNTSEL0 ... MSR_P6_EVNTSEL1:
2015-06-19 11:44:45 +00:00
		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
2015-04-08 13:30:38 +00:00
			return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
		msr_info->data = 0;
2012-01-15 12:17:22 +00:00
		break;
2011-07-29 22:44:21 +00:00
	case MSR_IA32_UCODE_REV:
2018-02-28 06:03:31 +00:00
		msr_info->data = vcpu->arch.microcode_version;
2011-07-29 22:44:21 +00:00
		break;
2018-04-13 09:38:35 +00:00
	case MSR_IA32_TSC:
		msr_info->data = kvm_scale_tsc(vcpu, rdtsc()) + vcpu->arch.tsc_offset;
		break;
2008-05-26 17:06:35 +00:00
	case MSR_MTRRcap:
	case 0x200 ... 0x2ff:
2015-06-15 08:55:22 +00:00
		return kvm_mtrr_get_msr(vcpu, msr_info->index, &msr_info->data);
2007-10-30 17:44:17 +00:00
	case 0xcd: /* fsb frequency */
2015-04-08 13:30:38 +00:00
		msr_info->data = 3;
2007-10-30 17:44:17 +00:00
		break;
2010-09-09 10:06:46 +00:00
		/*
		 * MSR_EBC_FREQUENCY_ID
		 * Conservative value valid for even the basic CPU models.
		 * Models 0,1: 000 in bits 23:21 indicating a bus speed of
		 * 100MHz, model 2 000 in bits 18:16 indicating 100MHz,
		 * and 266MHz for model 3, or 4. Set Core Clock
		 * Frequency to System Bus Frequency Ratio to 1 (bits
		 * 31:24) even though these are only valid for CPU
		 * models > 2, however guests may end up dividing or
		 * multiplying by zero otherwise.
		 */
	case MSR_EBC_FREQUENCY_ID:
2015-04-08 13:30:38 +00:00
		msr_info->data = 1 << 24;
2010-09-09 10:06:46 +00:00
		break;
2007-10-30 17:44:17 +00:00
	case MSR_IA32_APICBASE:
2015-04-08 13:30:38 +00:00
		msr_info->data = kvm_get_apic_base(vcpu);
2007-10-30 17:44:17 +00:00
		break;
2009-07-05 14:39:36 +00:00
	case APIC_BASE_MSR ... APIC_BASE_MSR + 0x3ff:
2015-04-08 13:30:38 +00:00
		return kvm_x2apic_msr_read(vcpu, msr_info->index, &msr_info->data);
2009-07-05 14:39:36 +00:00
		break;
2011-09-22 08:55:52 +00:00
	case MSR_IA32_TSCDEADLINE:
2015-04-08 13:30:38 +00:00
		msr_info->data = kvm_get_lapic_tscdeadline_msr(vcpu);
2011-09-22 08:55:52 +00:00
		break;
2012-11-29 20:42:50 +00:00
	case MSR_IA32_TSC_ADJUST:
2015-04-08 13:30:38 +00:00
		msr_info->data = (u64)vcpu->arch.ia32_tsc_adjust_msr;
2012-11-29 20:42:50 +00:00
		break;
2007-10-30 17:44:17 +00:00
	case MSR_IA32_MISC_ENABLE:
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->arch.ia32_misc_enable_msr;
2007-10-30 17:44:17 +00:00
		break;
2015-05-07 09:36:11 +00:00
	case MSR_IA32_SMBASE:
		if (!msr_info->host_initiated)
			return 1;
		msr_info->data = vcpu->arch.smbase;
2007-10-30 17:44:17 +00:00
		break;
2017-11-15 11:43:14 +00:00
	case MSR_SMI_COUNT:
		msr_info->data = vcpu->arch.smi_count;
		break;
2008-02-21 11:11:01 +00:00
	case MSR_IA32_PERF_STATUS:
		/* TSC increment by tick */
2015-04-08 13:30:38 +00:00
		msr_info->data = 1000ULL;
2008-02-21 11:11:01 +00:00
		/* CPU multiplier */
2015-06-29 10:39:23 +00:00
		msr_info->data |= (((uint64_t)4ULL) << 40);
2008-02-21 11:11:01 +00:00
		break;
2007-10-30 17:44:17 +00:00
	case MSR_EFER:
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->arch.efer;
2007-10-30 17:44:17 +00:00
		break;
2008-02-15 19:52:47 +00:00
	case MSR_KVM_WALL_CLOCK:
2010-05-11 16:17:41 +00:00
	case MSR_KVM_WALL_CLOCK_NEW:
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->kvm->arch.wall_clock;
2008-02-15 19:52:47 +00:00
		break;
	case MSR_KVM_SYSTEM_TIME:
2010-05-11 16:17:41 +00:00
	case MSR_KVM_SYSTEM_TIME_NEW:
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->arch.time;
2008-02-15 19:52:47 +00:00
		break;
2010-10-14 09:22:50 +00:00
	case MSR_KVM_ASYNC_PF_EN:
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->arch.apf.msr_val;
2010-10-14 09:22:50 +00:00
		break;
2011-07-11 19:28:14 +00:00
	case MSR_KVM_STEAL_TIME:
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->arch.st.msr_val;
2011-07-11 19:28:14 +00:00
		break;
2012-08-26 15:00:29 +00:00
	case MSR_KVM_PV_EOI_EN:
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->arch.pv_eoi.msr_val;
2012-08-26 15:00:29 +00:00
		break;
2009-05-11 08:48:15 +00:00
	case MSR_IA32_P5_MC_ADDR:
	case MSR_IA32_P5_MC_TYPE:
	case MSR_IA32_MCG_CAP:
	case MSR_IA32_MCG_CTL:
	case MSR_IA32_MCG_STATUS:
2014-09-23 02:44:35 +00:00
	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
2018-07-26 11:01:52 +00:00
		return get_msr_mce(vcpu, msr_info->index, &msr_info->data,
				   msr_info->host_initiated);
2010-09-01 09:42:04 +00:00
	case MSR_K7_CLK_CTL:
		/*
		 * Provide expected ramp-up count for K7. All others
		 * are set to zero, indicating minimum divisors for
		 * every field.
		 *
		 * This prevents guest kernels on AMD host with CPU
		 * type 6, model 8 and higher from exploding due to
		 * the rdmsr failing.
		 */
2015-04-08 13:30:38 +00:00
		msr_info->data = 0x20000000;
2010-09-01 09:42:04 +00:00
		break;
2010-01-17 13:51:22 +00:00
	case HV_X64_MSR_GUEST_OS_ID ... HV_X64_MSR_SINT15:
2015-07-03 12:01:37 +00:00
	case HV_X64_MSR_CRASH_P0 ... HV_X64_MSR_CRASH_P4:
	case HV_X64_MSR_CRASH_CTL:
2015-11-30 16:22:21 +00:00
	case HV_X64_MSR_STIMER0_CONFIG ... HV_X64_MSR_STIMER3_COUNT:
2018-03-01 14:15:12 +00:00
	case HV_X64_MSR_REENLIGHTENMENT_CONTROL:
	case HV_X64_MSR_TSC_EMULATION_CONTROL:
	case HV_X64_MSR_TSC_EMULATION_STATUS:
2015-07-03 12:01:34 +00:00
		return kvm_hv_get_msr_common(vcpu,
2018-07-26 11:01:52 +00:00
					     msr_info->index, &msr_info->data,
					     msr_info->host_initiated);
2010-01-17 13:51:22 +00:00
		break;
2011-01-21 05:21:00 +00:00
	case MSR_IA32_BBL_CR_CTL3:
		/* This legacy MSR exists but isn't fully documented in current
		 * silicon. It is however accessed by winxp in very narrow
		 * scenarios where it sets bit #19, itself documented as
		 * a "reserved" bit. Best effort attempt to source coherent
		 * read data here should the balance of the register be
		 * interpreted by the guest:
		 *
		 * L2 cache control register 3: 64GB range, 256KB size,
		 * enabled, latency 0x1, configured
		 */
2015-04-08 13:30:38 +00:00
		msr_info->data = 0xbe702111;
2011-01-21 05:21:00 +00:00
		break;
2012-01-09 19:00:35 +00:00
	case MSR_AMD64_OSVW_ID_LENGTH:
2017-08-04 22:12:49 +00:00
		if (!guest_cpuid_has(vcpu, X86_FEATURE_OSVW))
2012-01-09 19:00:35 +00:00
			return 1;
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->arch.osvw.length;
2012-01-09 19:00:35 +00:00
		break;
	case MSR_AMD64_OSVW_STATUS:
2017-08-04 22:12:49 +00:00
		if (!guest_cpuid_has(vcpu, X86_FEATURE_OSVW))
2012-01-09 19:00:35 +00:00
			return 1;
2015-04-08 13:30:38 +00:00
		msr_info->data = vcpu->arch.osvw.status;
2012-01-09 19:00:35 +00:00
		break;
2017-03-20 08:16:28 +00:00
	case MSR_PLATFORM_INFO:
		msr_info->data = vcpu->arch.msr_platform_info;
		break;
	case MSR_MISC_FEATURES_ENABLES:
		msr_info->data = vcpu->arch.msr_misc_features_enables;
		break;
2007-10-30 17:44:17 +00:00
	default:
2015-06-19 11:44:45 +00:00
		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
2015-04-08 13:30:38 +00:00
			return kvm_pmu_get_msr(vcpu, msr_info->index, &msr_info->data);
2009-06-25 10:36:49 +00:00
		if (!ignore_msrs) {
2016-11-15 06:36:18 +00:00
			vcpu_debug_ratelimited(vcpu, "unhandled rdmsr: 0x%x\n",
					       msr_info->index);
2009-06-25 10:36:49 +00:00
			return 1;
		} else {
2017-11-08 12:32:08 +00:00
			if (report_ignored_msrs)
				vcpu_unimpl(vcpu, "ignored rdmsr: 0x%x\n",
					msr_info->index);
2015-04-08 13:30:38 +00:00
			msr_info->data = 0;
2009-06-25 10:36:49 +00:00
		}
		break;
2007-10-30 17:44:17 +00:00
	}
	return 0;
}
EXPORT_SYMBOL_GPL(kvm_get_msr_common);

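For MSR_IA32_TSC, kvm_get_msr_common() returns the host TSC converted by kvm_scale_tsc() plus the per-vCPU tsc_offset. As we understand it, the scaling is a fixed-point multiply: the ratio guest_khz / host_khz is stored with a fixed number of fractional bits and applied as (tsc * ratio) >> frac_bits with a 128-bit intermediate so the product cannot overflow. The sketch below works under those assumptions, using 48 fractional bits (the format we believe VMX uses; SVM uses 32) and the GCC/Clang unsigned __int128 extension. The helper names are hypothetical, and the offset is modeled as a signed value for readability.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 48 fractional bits, matching the VMX TSC multiplier format. */
#define SK_FRAC_BITS 48

/* Fixed-point ratio guest_khz / host_khz. */
static uint64_t sketch_ratio(uint64_t guest_khz, uint64_t host_khz)
{
	return (uint64_t)(((unsigned __int128)guest_khz << SK_FRAC_BITS) / host_khz);
}

/* (tsc * ratio) >> frac_bits, using a 128-bit intermediate product. */
static uint64_t sketch_scale_tsc(uint64_t tsc, uint64_t ratio)
{
	return (uint64_t)(((unsigned __int128)tsc * ratio) >> SK_FRAC_BITS);
}

/* What a guest RDMSR of IA32_TSC would observe in this model. */
static uint64_t sketch_guest_tsc(uint64_t host_tsc, uint64_t ratio,
				 int64_t offset)
{
	/* Unsigned wrap-around makes a negative offset subtract. */
	return sketch_scale_tsc(host_tsc, ratio) + (uint64_t)offset;
}
```

A ratio of exactly 1 << 48 leaves the host TSC unchanged, so on hosts without scaling only the offset matters.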
2007-10-11 17:16:52 +00:00
/*
 * Read or write a bunch of msrs. All parameters are kernel addresses.
 *
 * @return number of msrs processed successfully.
 */
static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
		    struct kvm_msr_entry *entries,
		    int (*do_msr)(struct kvm_vcpu *vcpu,
				  unsigned index, u64 *data))
{
2018-02-21 19:39:51 +00:00
	int i;
2007-10-11 17:16:52 +00:00

	for (i = 0; i < msrs->nmsrs; ++i)
		if (do_msr(vcpu, entries[i].index, &entries[i].data))
			break;

	return i;
}
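__msr_io() walks the entry array and stops at the first MSR the callback rejects, returning how many entries succeeded rather than an error code; that is how KVM_GET_MSRS and KVM_SET_MSRS report partial success. A userspace sketch of the same loop follows. The entry layout follows struct kvm_msr_entry (index, reserved, data), while the handler and its "only indices below 0x100 exist" rule are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field layout as struct kvm_msr_entry. */
struct sk_msr_entry {
	uint32_t index;
	uint32_t reserved;
	uint64_t data;
};

typedef int (*sk_do_msr)(void *ctx, unsigned index, uint64_t *data);

/* Mirrors __msr_io(): returns how many leading entries succeeded. */
static int sk_msr_io(void *ctx, struct sk_msr_entry *entries, unsigned nmsrs,
		     sk_do_msr do_msr)
{
	unsigned i;

	for (i = 0; i < nmsrs; ++i)
		if (do_msr(ctx, entries[i].index, &entries[i].data))
			break;
	return (int)i;
}

/* Hypothetical handler: indices below 0x100 "exist" and read as index * 2. */
static int sk_fake_get(void *ctx, unsigned index, uint64_t *data)
{
	(void)ctx;
	if (index >= 0x100)
		return 1;
	*data = (uint64_t)index * 2;
	return 0;
}
```

A caller therefore has to compare the result with nmsrs: entries after the first failing one are never touched.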

/*
 * Read or write a bunch of msrs. Parameters are user addresses.
 *
 * @return number of msrs processed successfully, or a negative errno.
 */
static int msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs __user *user_msrs,
		  int (*do_msr)(struct kvm_vcpu *vcpu,
				unsigned index, u64 *data),
		  int writeback)
{
	struct kvm_msrs msrs;
	struct kvm_msr_entry *entries;
	int r, n;
	unsigned size;

	r = -EFAULT;
	if (copy_from_user(&msrs, user_msrs, sizeof msrs))
		goto out;

	r = -E2BIG;
	if (msrs.nmsrs >= MAX_IO_MSRS)
		goto out;

	size = sizeof(struct kvm_msr_entry) * msrs.nmsrs;
2011-12-04 17:36:29 +00:00
	entries = memdup_user(user_msrs->entries, size);
	if (IS_ERR(entries)) {
		r = PTR_ERR(entries);
2007-10-11 17:16:52 +00:00
		goto out;
2011-12-04 17:36:29 +00:00
	}
2007-10-11 17:16:52 +00:00

	r = n = __msr_io(vcpu, &msrs, entries, do_msr);
	if (r < 0)
		goto out_free;

	r = -EFAULT;
	if (writeback && copy_to_user(user_msrs->entries, entries, size))
		goto out_free;

	r = n;

out_free:
2010-07-22 20:24:52 +00:00
	kfree(entries);
2007-10-11 17:16:52 +00:00
out:
	return r;
}

2018-03-12 11:53:02 +00:00
static inline bool kvm_can_mwait_in_guest(void)
{
	return boot_cpu_has(X86_FEATURE_MWAIT) &&
2018-04-11 09:16:03 +00:00
		!boot_cpu_has_bug(X86_BUG_MONITOR) &&
		boot_cpu_has(X86_FEATURE_ARAT);
2018-03-12 11:53:02 +00:00
}
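KVM_CAP_X86_DISABLE_EXITS is reported as a bitmask rather than a boolean: HLT and PAUSE exits can always be disabled, while MWAIT is offered only when kvm_can_mwait_in_guest() holds, that is, MWAIT is present, the host has no MONITOR erratum, and the CPU has ARAT (an always-running local APIC timer, even in deep C-states). The sketch below models that computation with hypothetical host-feature flags in place of boot_cpu_has(). The bit values (MWAIT = 1, HLT = 2, PAUSE = 4) are assumed to match the KVM_X86_DISABLE_EXITS_* UAPI constants.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the UAPI KVM_X86_DISABLE_EXITS_* values. */
#define SK_DISABLE_EXITS_MWAIT	(1 << 0)
#define SK_DISABLE_EXITS_HLT	(1 << 1)
#define SK_DISABLE_EXITS_PAUSE	(1 << 2)

/* Hypothetical host feature flags standing in for boot_cpu_has() et al. */
struct sk_host {
	bool has_mwait;		/* X86_FEATURE_MWAIT */
	bool monitor_bug;	/* X86_BUG_MONITOR */
	bool has_arat;		/* X86_FEATURE_ARAT */
};

static bool sk_can_mwait_in_guest(const struct sk_host *h)
{
	return h->has_mwait && !h->monitor_bug && h->has_arat;
}

/* Value this model reports for KVM_CAP_X86_DISABLE_EXITS. */
static int sk_disable_exits_cap(const struct sk_host *h)
{
	int r = SK_DISABLE_EXITS_HLT | SK_DISABLE_EXITS_PAUSE;

	if (sk_can_mwait_in_guest(h))
		r |= SK_DISABLE_EXITS_MWAIT;
	return r;
}
```

Userspace would then enable some subset of the reported bits on the VM through KVM_ENABLE_CAP.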

2014-07-14 16:27:35 +00:00
int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
2007-11-15 15:07:47 +00:00
{
2018-03-12 11:53:02 +00:00
	int r = 0;
2007-11-15 15:07:47 +00:00

	switch (ext) {
	case KVM_CAP_IRQCHIP:
	case KVM_CAP_HLT:
	case KVM_CAP_MMU_SHADOW_CACHE_CONTROL:
	case KVM_CAP_SET_TSS_ADDR:
2007-11-21 15:10:04 +00:00
	case KVM_CAP_EXT_CPUID:
2013-09-22 14:44:50 +00:00
	case KVM_CAP_EXT_EMUL_CPUID:
2009-02-04 16:52:04 +00:00
	case KVM_CAP_CLOCKSOURCE:
2008-01-27 21:10:22 +00:00
	case KVM_CAP_PIT:
2008-02-22 17:21:36 +00:00
	case KVM_CAP_NOP_IO_DELAY:
2008-04-11 16:24:45 +00:00
	case KVM_CAP_MP_STATE:
2008-07-29 08:30:57 +00:00
	case KVM_CAP_SYNC_MMU:
2010-12-14 09:57:47 +00:00
	case KVM_CAP_USER_NMI:
2008-12-30 17:55:06 +00:00
	case KVM_CAP_REINJECT_CONTROL:
2009-02-04 15:28:14 +00:00
	case KVM_CAP_IRQ_INJECT_STATUS:
KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest. Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.
Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.
However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc). For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asynchronously and
return as quickly as possible. All the synchronous infrastructure to ensure
proper data-dependencies are met in the normal IO case are just unnecessary
overhead for signalling. This adds additional computational load on the
system, as well as latency to the signalling path.
Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd. This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and peripheral components.
To test this theory, we built a test-harness called "doorbell". This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled. It supports signalling
from either an eventfd, or an ioctl().
We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl(). The other is direct via
ioeventfd.
You can download this test harness here:
ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2
The measured results are as follows:
qemu-mmio: 110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio: 367300 iops, 2.72us rtt
I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy. However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:
qemu-pio: 153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt
these are just for fun, for now, until I can gather more data.
Here is a graph for your convenience:
http://developer.novell.com/wiki/images/7/76/Iofd-chart.png
The conclusion to draw is that we save about 4us by skipping the userspace
hop.
--------------------
Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-07-07 21:08:49 +00:00
	case KVM_CAP_IOEVENTFD:
2014-03-31 18:50:38 +00:00
	case KVM_CAP_IOEVENTFD_NO_LENGTH:
2009-05-14 20:42:53 +00:00
	case KVM_CAP_PIT2:
2009-07-07 15:50:38 +00:00
	case KVM_CAP_PIT_STATE2:
2009-07-21 02:42:48 +00:00
	case KVM_CAP_SET_IDENTITY_MAP_ADDR:
2009-10-15 22:21:43 +00:00
	case KVM_CAP_XEN_HVM:
2009-11-12 00:04:25 +00:00
	case KVM_CAP_VCPU_EVENTS:
2010-01-17 13:51:22 +00:00
	case KVM_CAP_HYPERV:
2010-01-17 13:51:23 +00:00
	case KVM_CAP_HYPERV_VAPIC:
2010-01-17 13:51:24 +00:00
	case KVM_CAP_HYPERV_SPIN:
2015-11-10 12:36:34 +00:00
	case KVM_CAP_HYPERV_SYNIC:
kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2
There is a flaw in the Hyper-V SynIC implementation in KVM: when message
page or event flags page is enabled by setting the corresponding msr,
KVM zeroes it out. This is problematic because on migration the
corresponding MSRs are loaded on the destination, so the content of
those pages is lost.
This went unnoticed so far because the only user of those pages was
in-KVM hyperv synic timers, which could continue working despite that
zeroing.
Newer QEMU uses those pages for Hyper-V VMBus implementation, and
zeroing them breaks the migration.
Besides, in newer QEMU the content of those pages is fully managed by
QEMU, so zeroing them is undesirable even when writing the MSRs from the
guest side.
To support this new scheme, introduce a new capability,
KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic
pages aren't zeroed out in KVM.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-22 13:51:01 +00:00
	case KVM_CAP_HYPERV_SYNIC2:
2017-07-14 14:13:20 +00:00
	case KVM_CAP_HYPERV_VP_INDEX:
2018-02-01 13:48:32 +00:00
	case KVM_CAP_HYPERV_EVENTFD:
2018-05-16 15:21:31 +00:00
	case KVM_CAP_HYPERV_TLBFLUSH:
2010-01-29 06:38:44 +00:00
	case KVM_CAP_PCI_SEGMENT:
2010-02-15 09:45:43 +00:00
	case KVM_CAP_DEBUGREGS:
2010-02-23 16:47:57 +00:00
	case KVM_CAP_X86_ROBUST_SINGLESTEP:
2010-06-13 09:29:39 +00:00
	case KVM_CAP_XSAVE:
2010-10-14 09:22:50 +00:00
	case KVM_CAP_ASYNC_PF:
2011-03-25 08:44:51 +00:00
	case KVM_CAP_GET_TSC_KHZ:
2012-03-10 19:37:27 +00:00
	case KVM_CAP_KVMCLOCK_CTRL:
2012-08-21 03:02:51 +00:00
	case KVM_CAP_READONLY_MEM:
2014-01-29 17:10:45 +00:00
	case KVM_CAP_HYPERV_TIME:
2014-02-28 04:06:17 +00:00
	case KVM_CAP_IOAPIC_POLARITY_IGNORED:
2015-01-08 14:59:30 +00:00
	case KVM_CAP_TSC_DEADLINE_TIMER:
2015-04-12 22:53:41 +00:00
	case KVM_CAP_ENABLE_CAP_VM:
	case KVM_CAP_DISABLE_QUIRKS:
2015-07-29 09:56:48 +00:00
	case KVM_CAP_SET_BOOT_CPU_ID:
2015-07-30 06:21:40 +00:00
	case KVM_CAP_SPLIT_IRQCHIP:
2017-02-08 10:50:15 +00:00
	case KVM_CAP_IMMEDIATE_EXIT:
2018-02-21 19:39:51 +00:00
	case KVM_CAP_GET_MSR_FEATURES:
2007-11-15 15:07:47 +00:00
		r = 1;
		break;
2018-02-01 00:03:36 +00:00
	case KVM_CAP_SYNC_REGS:
		r = KVM_SYNC_X86_VALID_FIELDS;
		break;
2016-11-09 16:48:15 +00:00
	case KVM_CAP_ADJUST_CLOCK:
		r = KVM_CLOCK_TSC_STABLE;
		break;
2018-03-12 11:53:02 +00:00
	case KVM_CAP_X86_DISABLE_EXITS:
2018-06-07 23:19:53 +00:00
		r |= KVM_X86_DISABLE_EXITS_HLT | KVM_X86_DISABLE_EXITS_PAUSE;
2018-03-12 11:53:02 +00:00
		if (kvm_can_mwait_in_guest())
			r |= KVM_X86_DISABLE_EXITS_MWAIT;
2017-04-21 10:27:17 +00:00
		break;
2015-04-01 12:25:33 +00:00
	case KVM_CAP_X86_SMM:
		/* SMBASE is usually relocated above 1M on modern chipsets,
		 * and SMM handlers might indeed rely on 4G segment limits,
		 * so do not report SMM to be available if real mode is
		 * emulated via vm86 mode. Still, do not go to great lengths
		 * to avoid userspace's usage of the feature, because it is a
		 * fringe case that is not enabled except via specific settings
		 * of the module parameters.
		 */
2018-05-10 20:06:39 +00:00
		r = kvm_x86_ops->has_emulated_msr(MSR_IA32_SMBASE);
2015-04-01 12:25:33 +00:00
		break;
2007-12-26 11:57:04 +00:00
	case KVM_CAP_VAPIC:
		r = !kvm_x86_ops->cpu_has_accelerated_tpr();
		break;
2008-02-20 09:53:16 +00:00
	case KVM_CAP_NR_VCPUS:
2011-07-18 14:17:15 +00:00
		r = KVM_SOFT_MAX_VCPUS;
		break;
	case KVM_CAP_MAX_VCPUS:
2008-02-20 09:53:16 +00:00
		r = KVM_MAX_VCPUS;
		break;
2008-02-20 09:59:20 +00:00
	case KVM_CAP_NR_MEMSLOTS:
2012-12-10 17:33:09 +00:00
		r = KVM_USER_MEM_SLOTS;
2008-02-20 09:59:20 +00:00
		break;
2009-10-01 22:28:39 +00:00
	case KVM_CAP_PV_MMU:	/* obsolete */
		r = 0;
2008-02-22 17:21:37 +00:00
		break;
2009-05-11 08:48:15 +00:00
	case KVM_CAP_MCE:
		r = KVM_MAX_MCE_BANKS;
		break;
2010-06-13 09:29:39 +00:00
	case KVM_CAP_XCRS:
2016-04-04 20:25:02 +00:00
		r = boot_cpu_has(X86_FEATURE_XSAVE);
2010-06-13 09:29:39 +00:00
		break;
2011-03-25 08:44:51 +00:00
	case KVM_CAP_TSC_CONTROL:
		r = kvm_has_tsc_control;
		break;
2016-07-12 20:09:27 +00:00
	case KVM_CAP_X2APIC_API:
		r = KVM_X2APIC_API_VALID_FLAGS;
		break;
2018-07-10 09:27:20 +00:00
	case KVM_CAP_NESTED_STATE:
		r = kvm_x86_ops->get_nested_state ?
			kvm_x86_ops->get_nested_state(NULL, 0, 0) : 0;
		break;
2007-11-15 15:07:47 +00:00
	default:
		break;
	}
	return r;

}
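The ioeventfd commit quoted in the capability list above (backing KVM_CAP_IOEVENTFD) relies on eventfd counter semantics: each signal adds to a 64-bit counter, and a single read returns the accumulated value and resets it to zero, so several guest kicks can be coalesced into one wake-up for the consumer. A minimal userspace sketch of those semantics follows, using the glibc eventfd_read() and eventfd_write() wrappers. Both ends run in one process here, whereas with ioeventfd it is KVM that signals the descriptor (registered through the KVM_IOEVENTFD ioctl) when the guest writes to the registered address. The sk_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal the "doorbell": add 1 to the eventfd counter. 0 on success. */
static int sk_ring(int efd)
{
	return eventfd_write(efd, 1);
}

/* Collect all pending signals: the read returns the sum and resets it. */
static int sk_drain(int efd, uint64_t *count)
{
	eventfd_t v;

	if (eventfd_read(efd, &v) != 0)
		return -1;	/* e.g. EAGAIN on a non-blocking, empty eventfd */
	*count = v;
	return 0;
}
```

Three signals followed by one read yield a count of 3, and a second read on a non-blocking descriptor fails because nothing is pending.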
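The KVM_GET_MSR_INDEX_LIST and KVM_GET_MSR_FEATURE_INDEX_LIST handlers in kvm_arch_dev_ioctl() use a two-call protocol: the handler always writes the required count back to nmsrs, then fails with -E2BIG if the caller's array was too small. A caller can therefore probe with nmsrs = 0, allocate, and call again. The sketch simulates both sides in userspace. struct sk_msr_list follows the shape of struct kvm_msr_list, and the three MSR indices are arbitrary examples, not KVM's actual list.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same shape as struct kvm_msr_list: a count followed by the indices. */
struct sk_msr_list {
	uint32_t nmsrs;
	uint32_t indices[];
};

/* Hypothetical stand-in for msrs_to_save + emulated_msrs. */
static const uint32_t sk_supported[] = { 0x10, 0x1b, 0xc0000080 };
#define SK_NSUPPORTED (sizeof(sk_supported) / sizeof(sk_supported[0]))

/* Models the handler: report the real count first, then fail if the
 * caller's buffer is too small. */
static int sk_get_msr_index_list(struct sk_msr_list *list)
{
	uint32_t n = list->nmsrs;

	list->nmsrs = SK_NSUPPORTED;
	if (n < list->nmsrs)
		return -E2BIG;
	memcpy(list->indices, sk_supported, sizeof(sk_supported));
	return 0;
}

/* Caller side: probe with nmsrs = 0, allocate, then retry. */
static struct sk_msr_list *sk_fetch_list(void)
{
	struct sk_msr_list probe = { .nmsrs = 0 };
	struct sk_msr_list *list;

	if (sk_get_msr_index_list(&probe) != -E2BIG)
		return NULL;
	list = malloc(sizeof(*list) + probe.nmsrs * sizeof(uint32_t));
	if (!list)
		return NULL;
	list->nmsrs = probe.nmsrs;
	if (sk_get_msr_index_list(list) != 0) {
		free(list);
		return NULL;
	}
	return list;
}
```

Against a real /dev/kvm descriptor, the probe is ioctl(kvm_fd, KVM_GET_MSR_INDEX_LIST, &probe), which returns -1 with errno set to E2BIG while still filling in probe.nmsrs.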

2007-10-10 15:16:19 +00:00
long kvm_arch_dev_ioctl(struct file *filp,
			unsigned int ioctl, unsigned long arg)
{
	void __user *argp = (void __user *)arg;
	long r;

	switch (ioctl) {
	case KVM_GET_MSR_INDEX_LIST: {
		struct kvm_msr_list __user *user_msr_list = argp;
		struct kvm_msr_list msr_list;
		unsigned n;

		r = -EFAULT;
		if (copy_from_user(&msr_list, user_msr_list, sizeof msr_list))
			goto out;
		n = msr_list.nmsrs;
2015-05-05 10:08:55 +00:00
		msr_list.nmsrs = num_msrs_to_save + num_emulated_msrs;
2007-10-10 15:16:19 +00:00
		if (copy_to_user(user_msr_list, &msr_list, sizeof msr_list))
			goto out;
		r = -E2BIG;
2009-07-02 19:45:47 +00:00
		if (n < msr_list.nmsrs)
2007-10-10 15:16:19 +00:00
			goto out;
		r = -EFAULT;
		if (copy_to_user(user_msr_list->indices, &msrs_to_save,
				 num_msrs_to_save * sizeof(u32)))
			goto out;
2009-07-02 19:45:47 +00:00
		if (copy_to_user(user_msr_list->indices + num_msrs_to_save,
2007-10-10 15:16:19 +00:00
				 &emulated_msrs,
2015-05-05 10:08:55 +00:00
				 num_emulated_msrs * sizeof(u32)))
2007-10-10 15:16:19 +00:00
			goto out;
		r = 0;
		break;
	}
2013-09-22 14:44:50 +00:00
	case KVM_GET_SUPPORTED_CPUID:
	case KVM_GET_EMULATED_CPUID: {
2008-02-11 16:37:23 +00:00
		struct kvm_cpuid2 __user *cpuid_arg = argp;
		struct kvm_cpuid2 cpuid;

		r = -EFAULT;
		if (copy_from_user(&cpuid, cpuid_arg, sizeof cpuid))
			goto out;
2013-09-22 14:44:50 +00:00

		r = kvm_dev_ioctl_get_cpuid(&cpuid, cpuid_arg->entries,
					    ioctl);
2008-02-11 16:37:23 +00:00
		if (r)
			goto out;

		r = -EFAULT;
		if (copy_to_user(cpuid_arg, &cpuid, sizeof cpuid))
			goto out;
		r = 0;
		break;
	}
2009-05-11 08:48:15 +00:00
	case KVM_X86_GET_MCE_CAP_SUPPORTED: {
		r = -EFAULT;
2016-06-22 06:59:56 +00:00
		if (copy_to_user(argp, &kvm_mce_cap_supported,
				 sizeof(kvm_mce_cap_supported)))
2009-05-11 08:48:15 +00:00
			goto out;
		r = 0;
		break;
2018-02-21 19:39:51 +00:00
	case KVM_GET_MSR_FEATURE_INDEX_LIST: {
		struct kvm_msr_list __user *user_msr_list = argp;
		struct kvm_msr_list msr_list;
		unsigned int n;

		r = -EFAULT;
		if (copy_from_user(&msr_list, user_msr_list, sizeof(msr_list)))
			goto out;
		n = msr_list.nmsrs;
		msr_list.nmsrs = num_msr_based_features;
		if (copy_to_user(user_msr_list, &msr_list, sizeof(msr_list)))
			goto out;
		r = -E2BIG;
		if (n < msr_list.nmsrs)
			goto out;
		r = -EFAULT;
		if (copy_to_user(user_msr_list->indices, &msr_based_features,
				 num_msr_based_features * sizeof(u32)))
			goto out;
		r = 0;
		break;
	}
	case KVM_GET_MSRS:
		r = msr_io(NULL, argp, do_get_msr_feature, 1);
		break;
2009-05-11 08:48:15 +00:00
	}
2007-10-10 15:16:19 +00:00
|
default:
|
|
|
|
r = -EINVAL;
|
|
|
|
}
|
|
|
|
out:
|
|
|
|
return r;
|
|
|
|
}
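KVM_GET_MSR_INDEX_LIST and KVM_GET_MSR_FEATURE_INDEX_LIST share a size-negotiation protocol. The handler always copies the required count back into nmsrs, and it fails with -E2BIG when the caller's nmsrs was smaller. Userspace can therefore probe with nmsrs = 0 and then allocate exactly enough.

The sketch below models both sides in plain C so it runs without /dev/kvm. struct msr_list, model_get_msr_index_list(), fetch_msr_list() and the MSR numbers are illustrative assumptions. Through the real ioctl(2) interface, the failure shows up as a return value of -1 with errno set to E2BIG, rather than as a negative return code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in for struct kvm_msr_list: a count plus a flexible array. */
struct msr_list {
	uint32_t nmsrs;
	uint32_t indices[];
};

/* Example MSR indices the hypothetical "kernel" side reports. */
static const uint32_t model_msrs[] = { 0x10, 0x174, 0x175, 0x176, 0xc0000080 };
#define MODEL_NMSRS (sizeof(model_msrs) / sizeof(model_msrs[0]))

/* Kernel-side model: always report the full count, then fail with
 * -E2BIG if the caller's buffer was too small. */
static int model_get_msr_index_list(struct msr_list *l)
{
	uint32_t n = l->nmsrs;		/* capacity supplied by the caller */

	l->nmsrs = MODEL_NMSRS;		/* written back even on failure */
	if (n < MODEL_NMSRS)
		return -E2BIG;
	memcpy(l->indices, model_msrs, sizeof(model_msrs));
	return 0;
}

/* Caller side: probe with nmsrs = 0, then allocate exactly enough. */
static struct msr_list *fetch_msr_list(void)
{
	struct msr_list probe = { .nmsrs = 0 };
	struct msr_list *l;

	if (model_get_msr_index_list(&probe) != -E2BIG)
		return NULL;
	l = malloc(sizeof(*l) + probe.nmsrs * sizeof(uint32_t));
	if (!l)
		return NULL;
	l->nmsrs = probe.nmsrs;
	if (model_get_msr_index_list(l) != 0) {
		free(l);
		return NULL;
	}
	return l;
}
```

Because the count is written back before the size check, one failed probe is enough. No retry loop with a growing buffer is needed for these two ioctls.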

2010-06-30 04:25:15 +00:00
static void wbinvd_ipi(void *garbage)
{
	wbinvd();
}

static bool need_emulate_wbinvd(struct kvm_vcpu *vcpu)
{
2013-10-30 17:02:30 +00:00
	return kvm_arch_has_noncoherent_dma(vcpu->kvm);
2010-06-30 04:25:15 +00:00
}
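KVM only cares about guest WBINVD when the guest has non-coherent DMA (need_emulate_wbinvd()). kvm_arch_vcpu_load(), which follows, then takes one of two routes:

- If the vendor code can intercept WBINVD, it merely records the new CPU in wbinvd_dirty_mask.
- Otherwise, it immediately flushes the CPU the vCPU is leaving, using an IPI that runs wbinvd_ipi().

The model below restates that decision in plain C. struct vcpu_model, the 64-bit mask standing in for a cpumask, and on_vcpu_load() are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-vCPU state; a 64-bit word stands in for a cpumask. */
struct vcpu_model {
	int cpu;                 /* last physical CPU, -1 if never loaded */
	uint64_t dirty_mask;     /* CPUs whose caches may hold guest data */
};

enum wbinvd_action { WBINVD_NONE, WBINVD_MARK_DIRTY, WBINVD_IPI_OLD_CPU };

/* Mirrors the branch structure of the WBINVD block in kvm_arch_vcpu_load().
 * Like the real code at this point, it does not update v->cpu itself. */
static enum wbinvd_action on_vcpu_load(struct vcpu_model *v, int new_cpu,
				       bool noncoherent_dma, bool wbinvd_exits)
{
	assert(new_cpu >= 0 && new_cpu < 64);
	if (!noncoherent_dma)
		return WBINVD_NONE;
	if (wbinvd_exits) {
		/* Defer: a later intercepted WBINVD flushes every marked CPU. */
		v->dirty_mask |= 1ULL << new_cpu;
		return WBINVD_MARK_DIRTY;
	}
	if (v->cpu != -1 && v->cpu != new_cpu)
		return WBINVD_IPI_OLD_CPU;  /* flush the CPU being left behind */
	return WBINVD_NONE;
}
```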

2007-10-11 17:16:52 +00:00
void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
{
2010-06-30 04:25:15 +00:00
	/* Address the case where the guest may execute WBINVD */
	if (need_emulate_wbinvd(vcpu)) {
		if (kvm_x86_ops->has_wbinvd_exit())
			cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
		else if (vcpu->cpu != -1 && vcpu->cpu != cpu)
			smp_call_function_single(vcpu->cpu,
					wbinvd_ipi, NULL, 1);
	}

2007-10-11 17:16:52 +00:00
	kvm_x86_ops->vcpu_load(vcpu, cpu);
2011-03-25 08:44:48 +00:00

2012-02-03 17:43:56 +00:00
	/* Apply any externally detected TSC adjustments (due to suspend) */
	if (unlikely(vcpu->arch.tsc_offset_adjustment)) {
		adjust_tsc_offset_host(vcpu, vcpu->arch.tsc_offset_adjustment);
		vcpu->arch.tsc_offset_adjustment = 0;
2014-09-12 05:43:19 +00:00
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
2012-02-03 17:43:56 +00:00
	}
2011-03-25 08:44:48 +00:00

2018-01-24 13:23:36 +00:00
	if (unlikely(vcpu->cpu != cpu) || kvm_check_tsc_unstable()) {
2012-02-03 17:43:54 +00:00
		s64 tsc_delta = !vcpu->arch.last_host_tsc ? 0 :
2015-06-25 16:44:07 +00:00
				rdtsc() - vcpu->arch.last_host_tsc;
2010-08-20 08:07:23 +00:00
		if (tsc_delta < 0)
			mark_tsc_unstable("KVM discovered backwards TSC");
2016-06-13 21:20:01 +00:00

2018-01-24 13:23:36 +00:00
		if (kvm_check_tsc_unstable()) {
2015-10-20 07:39:05 +00:00
			u64 offset = kvm_compute_tsc_offset(vcpu,
KVM: Fix last_guest_tsc / tsc_offset semantics
The variable last_guest_tsc was being used as an ad-hoc indicator
that guest TSC has been initialized and recorded correctly. However,
it may not have been, it could be that guest TSC has been set to some
large value, then back to a small value (by, say, a software reboot).
This defeats the logic and causes KVM to falsely assume that the
guest TSC has gone backwards, marking the host TSC unstable, which
is undesirable behavior.
In addition, rather than try to compute an offset adjustment for the
TSC on unstable platforms, just recompute the whole offset. This
allows us to get rid of one callsite for adjust_tsc_offset, which
is problematic because the units it takes are in guest units, but
here, the computation was originally being done in host units.
Doing this, and also recording last_guest_tsc when the TSC is written
allow us to remove the tricky logic which depended on last_guest_tsc
being zero to indicate a reset of uninitialized value.
Instead, we now have the guarantee that the guest TSC offset is
always at least something which will get us last_guest_tsc.
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-03 17:43:53 +00:00
						vcpu->arch.last_guest_tsc);
2016-09-07 18:47:19 +00:00
			kvm_vcpu_write_tsc_offset(vcpu, offset);
2010-09-19 00:38:15 +00:00
			vcpu->arch.tsc_catchup = 1;
		}
2017-06-29 15:14:50 +00:00

		if (kvm_lapic_hv_timer_in_use(vcpu))
			kvm_lapic_restart_hv_timer(vcpu);

2012-11-28 01:29:04 +00:00
		/*
		 * On a host with synchronized TSC, there is no need to update
		 * kvmclock on vcpu->cpu migration
		 */
		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
2013-05-09 23:21:41 +00:00
			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
2010-09-19 00:38:15 +00:00
		if (vcpu->cpu != cpu)
2017-04-26 20:32:20 +00:00
			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
2010-08-20 08:07:23 +00:00
		vcpu->cpu = cpu;
2009-10-10 02:26:08 +00:00
	}
2011-07-11 19:28:14 +00:00

	kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
2007-10-11 17:16:52 +00:00
}
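When a vCPU lands on a different CPU, or the host TSC is considered unstable, kvm_arch_vcpu_load() compares rdtsc() against the value saved by the last kvm_arch_vcpu_put(). A negative difference means the host TSC went backwards, so it is marked unstable. With an unstable TSC, the offset is recomputed from last_guest_tsc so that the guest's TSC continues from where it was last observed instead of jumping.

Below is a simplified model. It assumes a TSC scaling ratio of 1, whereas kvm_compute_tsc_offset() scales the host value first. struct tsc_model and its helpers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the values vcpu_load() consults. */
struct tsc_model {
	uint64_t last_host_tsc;   /* host TSC sampled at the last vcpu_put, 0 if none */
	uint64_t last_guest_tsc;  /* guest TSC last observed */
	uint64_t offset;          /* guest_tsc = host_tsc + offset (mod 2^64) */
};

/* True if the host TSC was seen going backwards since the last put. */
static bool tsc_went_backwards(const struct tsc_model *m, uint64_t host_now)
{
	int64_t delta = m->last_host_tsc ?
			(int64_t)(host_now - m->last_host_tsc) : 0;

	return delta < 0;
}

/* Pick an offset so the guest resumes at last_guest_tsc.
 * Assumes a scaling ratio of 1; the subtraction wraps modulo 2^64. */
static void resync_offset(struct tsc_model *m, uint64_t host_now)
{
	m->offset = m->last_guest_tsc - host_now;
}
```

The zero check on last_host_tsc mirrors the kernel's `!vcpu->arch.last_host_tsc ? 0 : ...`. On the very first load, there is nothing to compare against.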

2016-11-02 09:08:35 +00:00
static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
{
	if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
		return;

2017-12-13 01:33:01 +00:00
	vcpu->arch.st.steal.preempted = KVM_VCPU_PREEMPTED;
2016-11-02 09:08:35 +00:00

2017-05-02 14:20:18 +00:00
	kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
2016-11-02 09:08:35 +00:00
			&vcpu->arch.st.steal.preempted,
			offsetof(struct kvm_steal_time, preempted),
			sizeof(vcpu->arch.st.steal.preempted));
}
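kvm_steal_time_set_preempted() updates only the one-byte preempted field of the guest's steal-time record. It does this by passing offsetof(struct kvm_steal_time, preempted) and the field size to kvm_write_guest_offset_cached(), so the steal and version fields are not rewritten.

The sketch below performs that partial write against a plain byte buffer. It rests on these assumptions:

- struct steal_time_mirror is a local copy of the layout we believe struct kvm_steal_time had at this point.
- VCPU_PREEMPTED assumes KVM_VCPU_PREEMPTED is bit 0.
- write_guest_offset() is a hypothetical stand-in for the cached guest write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_steal_time (layout assumed, 64 bytes). */
struct steal_time_mirror {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

#define VCPU_PREEMPTED (1 << 0)   /* assumed value of KVM_VCPU_PREEMPTED */

/* Hypothetical stand-in for kvm_write_guest_offset_cached(): copy 'len'
 * bytes into the guest page at 'offset', touching nothing else. */
static void write_guest_offset(uint8_t *guest_page, const void *src,
			       size_t offset, size_t len)
{
	memcpy(guest_page + offset, src, len);
}

/* Publish "this vCPU was preempted" without rewriting the whole record. */
static void set_preempted(uint8_t *guest_page)
{
	uint8_t val = VCPU_PREEMPTED;

	write_guest_offset(guest_page, &val,
			   offsetof(struct steal_time_mirror, preempted),
			   sizeof(val));
}
```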

2007-10-11 17:16:52 +00:00
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
2016-12-17 18:13:32 +00:00
	int idx;
2017-08-08 04:05:33 +00:00

	if (vcpu->preempted)
		vcpu->arch.preempted_in_kernel = !kvm_x86_ops->get_cpl(vcpu);

2016-12-17 17:43:52 +00:00
	/*
	 * Disable page faults because we're in atomic context here.
	 * kvm_write_guest_offset_cached() would call might_fault(),
	 * which relies on pagefault_disable() to tell whether there is a
	 * bug. NOTE: the write to guest memory may not go through during
	 * postcopy live migration or under heavy guest paging.
	 */
	pagefault_disable();
2016-12-17 18:13:32 +00:00
	/*
	 * kvm_memslots() will be called by
	 * kvm_write_guest_offset_cached(), so take the srcu lock.
	 */
	idx = srcu_read_lock(&vcpu->kvm->srcu);
2016-11-02 09:08:35 +00:00
	kvm_steal_time_set_preempted(vcpu);
2016-12-17 18:13:32 +00:00
	srcu_read_unlock(&vcpu->kvm->srcu, idx);
2016-12-17 17:43:52 +00:00
	pagefault_enable();
2009-12-30 10:40:26 +00:00
	kvm_x86_ops->vcpu_put(vcpu);
2015-06-25 16:44:07 +00:00
	vcpu->arch.last_host_tsc = rdtsc();
2017-12-13 09:46:40 +00:00
	/*
	 * If userspace has set any breakpoints or watchpoints, dr6 is restored
	 * on every vmexit, but if not, we might have a stale dr6 from the
	 * guest. do_debug expects dr6 to be cleared after it runs; do the same.
	 */
	set_debugreg(0, 6);
2007-10-11 17:16:52 +00:00
}

static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
				    struct kvm_lapic_state *s)
{
2017-12-24 16:12:53 +00:00
	if (vcpu->arch.apicv_active)
2015-11-10 12:36:33 +00:00
		kvm_x86_ops->sync_pir_to_irr(vcpu);

2016-07-12 20:09:22 +00:00
	return kvm_apic_get_state(vcpu, s);
2007-10-11 17:16:52 +00:00
}
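With APICv active, other CPUs can post interrupts into the posted-interrupt request bitmap (PIR) before the virtual APIC's IRR has been updated. This is why kvm_vcpu_ioctl_get_lapic() calls sync_pir_to_irr() before taking its snapshot. Without the sync, the saved state could miss pending vectors.

The sketch below is a single-threaded model of that merge over a 256-bit vector space. The real vendor callbacks perform the exchange atomically and do more bookkeeping. sync_pir_to_irr_model() and vector_pending() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 256 interrupt vectors as eight 32-bit words, like the APIC's IRR. */
#define VEC_WORDS 8

/* Move every posted vector into the IRR and clear it from the PIR.
 * Returns true if anything new arrived. Not atomic: a model only. */
static bool sync_pir_to_irr_model(uint32_t pir[VEC_WORDS],
				  uint32_t irr[VEC_WORDS])
{
	bool changed = false;

	for (int i = 0; i < VEC_WORDS; i++) {
		uint32_t posted = pir[i];

		if (!posted)
			continue;
		pir[i] = 0;
		irr[i] |= posted;
		changed = true;
	}
	return changed;
}

/* Is 'vec' (0..255) requested in the IRR? */
static bool vector_pending(const uint32_t irr[VEC_WORDS], unsigned vec)
{
	return irr[vec / 32] & (1u << (vec % 32));
}
```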

static int kvm_vcpu_ioctl_set_lapic(struct kvm_vcpu *vcpu,
				    struct kvm_lapic_state *s)
{
2016-07-12 20:09:22 +00:00
	int r;

	r = kvm_apic_set_state(vcpu, s);
	if (r)
		return r;
2009-08-09 12:17:40 +00:00
	update_cr8_intercept(vcpu);
2007-10-11 17:16:52 +00:00

	return 0;
}

2015-11-17 16:32:05 +00:00
static int kvm_cpu_accept_dm_intr(struct kvm_vcpu *vcpu)
{
	return (!lapic_in_kernel(vcpu) ||
		kvm_apic_accept_pic_intr(vcpu));
}

2015-11-16 23:26:00 +00:00
/*
 * If userspace requested an interrupt window, check that the
 * interrupt window is open.
 *
 * No need to exit to userspace if we already have an interrupt queued.
 */
static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
{
	return kvm_arch_interrupt_allowed(vcpu) &&
		!kvm_cpu_has_interrupt(vcpu) &&
		!kvm_event_needs_reinjection(vcpu) &&
		kvm_cpu_accept_dm_intr(vcpu);
}

2007-11-20 20:36:41 +00:00
static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
				    struct kvm_interrupt *irq)
{
2013-02-27 03:33:25 +00:00
	if (irq->irq >= KVM_NR_INTERRUPTS)
2007-11-20 20:36:41 +00:00
		return -EINVAL;
KVM: x86: Add support for local interrupt requests from userspace
In order to enable userspace PIC support, the userspace PIC needs to
be able to inject local interrupts even when the APICs are in the
kernel.
KVM_INTERRUPT now supports sending local interrupts to an APIC when
APICs are in the kernel.
The ready_for_interrupt_request flag is now only set when the CPU/APIC
will immediately accept and inject an interrupt (i.e. APIC has not
masked the PIC).
When the PIC wishes to initiate an INTA cycle with, say, CPU0, it
kicks CPU0 out of the guest, and rendezvouses with CPU0 once it arrives
in userspace.
When the CPU/APIC unmasks the PIC, a KVM_EXIT_IRQ_WINDOW_OPEN is
triggered, so that userspace has a chance to inject a PIC interrupt
if it had been pending.
Overall, this design can lead to a small number of spurious userspace
rendezvous. In particular, whenever the PIC transitions from low to
high while it is masked and whenever the PIC becomes unmasked while
it is low.
Note: this does not buffer more than one local interrupt in the
kernel, so the VMM needs to enter the guest in order to complete
interrupt injection before injecting an additional interrupt.
Compiles for x86.
Can pass the KVM Unit Tests.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-07-30 09:27:16 +00:00

	if (!irqchip_in_kernel(vcpu->kvm)) {
		kvm_queue_interrupt(vcpu, irq->irq, false);
		kvm_make_request(KVM_REQ_EVENT, vcpu);
		return 0;
	}

	/*
	 * With in-kernel LAPIC, we only use this to inject EXTINT, so
	 * fail for in-kernel 8259.
	 */
	if (pic_in_kernel(vcpu->kvm))
2007-11-20 20:36:41 +00:00
		return -ENXIO;

2015-07-30 09:27:16 +00:00
	if (vcpu->arch.pending_external_vector != -1)
		return -EEXIST;
2007-11-20 20:36:41 +00:00

2015-07-30 09:27:16 +00:00
	vcpu->arch.pending_external_vector = irq->irq;
2015-11-16 23:26:05 +00:00
	kvm_make_request(KVM_REQ_EVENT, vcpu);
2007-11-20 20:36:41 +00:00
	return 0;
}
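When the LAPIC is in the kernel but the PIC is emulated in userspace, KVM_INTERRUPT is only used for EXTINT. As the commit message above notes, the kernel buffers at most one such interrupt, in pending_external_vector. A second request made before the first one is taken fails with -EEXIST.

Below is a minimal model of that single-slot buffer. struct extint_slot, queue_extint() and deliver_extint() are hypothetical. The 256 bound assumes KVM_NR_INTERRUPTS is 256.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the in-kernel-LAPIC / userspace-PIC path:
 * only one EXTINT vector may be outstanding at a time. */
struct extint_slot {
	int pending_vector;	/* -1 when empty */
};

static int queue_extint(struct extint_slot *s, unsigned vector)
{
	if (vector >= 256)
		return -EINVAL;		/* mirrors the KVM_NR_INTERRUPTS check */
	if (s->pending_vector != -1)
		return -EEXIST;		/* previous one not yet delivered */
	s->pending_vector = (int)vector;
	return 0;
}

/* Called when the vCPU actually takes the interrupt; empties the slot. */
static int deliver_extint(struct extint_slot *s)
{
	int v = s->pending_vector;

	s->pending_vector = -1;
	return v;
}
```

A VMM driving this path therefore has to let the vCPU run, and so consume the vector, before it queues the next one.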

2008-09-26 07:30:55 +00:00
static int kvm_vcpu_ioctl_nmi(struct kvm_vcpu *vcpu)
{
	kvm_inject_nmi(vcpu);

	return 0;
}

2015-04-01 13:06:40 +00:00
static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
{
2015-05-07 09:36:11 +00:00
	kvm_make_request(KVM_REQ_SMI, vcpu);

2015-04-01 13:06:40 +00:00
	return 0;
}

2007-10-22 14:50:39 +00:00
static int vcpu_ioctl_tpr_access_reporting(struct kvm_vcpu *vcpu,
					   struct kvm_tpr_access_ctl *tac)
{
	if (tac->flags)
		return -EINVAL;
	vcpu->arch.tpr_access_reporting = !!tac->enabled;
	return 0;
}

2009-05-11 08:48:15 +00:00
static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
					u64 mcg_cap)
{
	int r;
	unsigned bank_num = mcg_cap & 0xff, bank;

	r = -EINVAL;
2009-10-23 07:37:00 +00:00
	if (!bank_num || bank_num >= KVM_MAX_MCE_BANKS)
2009-05-11 08:48:15 +00:00
		goto out;
2016-06-22 06:59:56 +00:00
	if (mcg_cap & ~(kvm_mce_cap_supported | 0xff | 0xff0000))
2009-05-11 08:48:15 +00:00
		goto out;
	r = 0;
	vcpu->arch.mcg_cap = mcg_cap;
	/* Init IA32_MCG_CTL to all 1s */
	if (mcg_cap & MCG_CTL_P)
		vcpu->arch.mcg_ctl = ~(u64)0;
	/* Init IA32_MCi_CTL to all 1s */
	for (bank = 0; bank < bank_num; bank++)
		vcpu->arch.mce_banks[bank*4] = ~(u64)0;
2016-06-22 06:59:56 +00:00

	if (kvm_x86_ops->setup_mce)
		kvm_x86_ops->setup_mce(vcpu);
2009-05-11 08:48:15 +00:00
out:
	return r;
}
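kvm_vcpu_ioctl_x86_setup_mce() accepts an MCG_CAP value only if two conditions hold:

- Bits 7:0, the bank count, are non-zero and below KVM_MAX_MCE_BANKS.
- Every other set bit is either advertised in kvm_mce_cap_supported or lies in bits 23:16, the extended-register count field.

The sketch below restates those two checks as a standalone predicate. MAX_MCE_BANKS assumes KVM_MAX_MCE_BANKS is 32, and mcg_cap_acceptable() is a hypothetical name. A supported mask of MCG_CTL_P (bit 8) | MCG_SER_P (bit 24) is an assumed typical input.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_MCE_BANKS 32	/* assumed value of KVM_MAX_MCE_BANKS */

/* Same two checks as the top of kvm_vcpu_ioctl_x86_setup_mce(). */
static bool mcg_cap_acceptable(uint64_t mcg_cap, uint64_t supported)
{
	unsigned bank_num = mcg_cap & 0xff;	/* bits 7:0 = bank count */

	if (!bank_num || bank_num >= MAX_MCE_BANKS)
		return false;
	/* Remaining bits must be supported or in the 23:16 count field. */
	if (mcg_cap & ~(supported | 0xff | 0xff0000))
		return false;
	return true;
}
```

For example, with supported = 0x1000100, a request of 0x10a (ten banks plus MCG_CTL_P) passes. A request of 0x201 fails, because bit 9 is neither supported nor part of bits 23:16.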

static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
				      struct kvm_x86_mce *mce)
{
	u64 mcg_cap = vcpu->arch.mcg_cap;
	unsigned bank_num = mcg_cap & 0xff;
	u64 *banks = vcpu->arch.mce_banks;

	if (mce->bank >= bank_num || !(mce->status & MCI_STATUS_VAL))
		return -EINVAL;
	/*
	 * If IA32_MCG_CTL is not all 1s, uncorrected error
	 * reporting is disabled.
	 */
	if ((mce->status & MCI_STATUS_UC) && (mcg_cap & MCG_CTL_P) &&
	    vcpu->arch.mcg_ctl != ~(u64)0)
		return 0;
	banks += 4 * mce->bank;
	/*
	 * If IA32_MCi_CTL is not all 1s, uncorrected error
	 * reporting is disabled for the bank.
	 */
	if ((mce->status & MCI_STATUS_UC) && banks[0] != ~(u64)0)
		return 0;
	if (mce->status & MCI_STATUS_UC) {
		if ((vcpu->arch.mcg_status & MCG_STATUS_MCIP) ||
2009-12-07 10:16:48 +00:00
		    !kvm_read_cr4_bits(vcpu, X86_CR4_MCE)) {
2010-05-10 09:34:53 +00:00
			kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
2009-05-11 08:48:15 +00:00
			return 0;
		}
		if (banks[1] & MCI_STATUS_VAL)
			mce->status |= MCI_STATUS_OVER;
		banks[2] = mce->addr;
		banks[3] = mce->misc;
		vcpu->arch.mcg_status = mce->mcg_status;
		banks[1] = mce->status;
		kvm_queue_exception(vcpu, MC_VECTOR);
	} else if (!(banks[1] & MCI_STATUS_VAL)
		   || !(banks[1] & MCI_STATUS_UC)) {
		if (banks[1] & MCI_STATUS_VAL)
			mce->status |= MCI_STATUS_OVER;
		banks[2] = mce->addr;
		banks[3] = mce->misc;
		banks[1] = mce->status;
	} else
		banks[1] |= MCI_STATUS_OVER;
	return 0;
}
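For a corrected (non-UC) error, kvm_vcpu_ioctl_x86_set_mce() handles the target bank as follows:

- If the bank is empty, the new record overwrites it.
- If the bank already holds a corrected error, the new record overwrites it and gets MCI_STATUS_OVER set, because an older valid record is lost.
- If the bank holds an uncorrected record, that record is kept and only gains the OVER bit.

The sketch below models just that branch. The three status bit positions mirror MCI_STATUS_VAL, MCI_STATUS_OVER and MCI_STATUS_UC (bits 63, 62 and 61). struct mce_bank and log_corrected() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_VAL  (1ULL << 63)	/* MCI_STATUS_VAL */
#define STATUS_OVER (1ULL << 62)	/* MCI_STATUS_OVER */
#define STATUS_UC   (1ULL << 61)	/* MCI_STATUS_UC */

/* Hypothetical model of one bank's STATUS/ADDR/MISC registers. */
struct mce_bank {
	uint64_t status, addr, misc;
};

/* Mirrors the non-UC branch: a corrected error may replace an empty bank
 * or an older corrected error, but never a logged uncorrected one. */
static void log_corrected(struct mce_bank *b, uint64_t status,
			  uint64_t addr, uint64_t misc)
{
	assert(status & STATUS_VAL);
	assert(!(status & STATUS_UC));

	if (!(b->status & STATUS_VAL) || !(b->status & STATUS_UC)) {
		if (b->status & STATUS_VAL)
			status |= STATUS_OVER;	/* previous record lost */
		b->addr = addr;
		b->misc = misc;
		b->status = status;
	} else {
		b->status |= STATUS_OVER;	/* keep the UC record, note overflow */
	}
}
```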

2009-11-12 00:04:25 +00:00
static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
					       struct kvm_vcpu_events *events)
{
2011-09-20 10:43:14 +00:00
	process_nmi(vcpu);
2017-08-24 10:35:09 +00:00
	/*
	 * FIXME: pass injected and pending separately. This is only
	 * needed for nested virtualization, whose state cannot be
	 * migrated yet. For now we can combine them.
	 */
2010-02-15 09:45:41 +00:00
	events->exception.injected =
2017-08-24 10:35:09 +00:00
		(vcpu->arch.exception.pending ||
		 vcpu->arch.exception.injected) &&
2010-02-15 09:45:41 +00:00
		!kvm_exception_is_soft(vcpu->arch.exception.nr);
2009-11-12 00:04:25 +00:00
	events->exception.nr = vcpu->arch.exception.nr;
	events->exception.has_error_code = vcpu->arch.exception.has_error_code;
2010-10-30 18:54:47 +00:00
	events->exception.pad = 0;
2009-11-12 00:04:25 +00:00
	events->exception.error_code = vcpu->arch.exception.error_code;

2010-02-15 09:45:41 +00:00
	events->interrupt.injected =
KVM: x86: Rename interrupt.pending to interrupt.injected
For exceptions & NMIs events, KVM code use the following
coding convention:
*) "pending" represents an event that should be injected to guest at
some point but its side-effects have not yet occurred.
*) "injected" represents an event whose side-effects have already
occurred.
However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when its side-effects
have already taken place (For example, bit moved from LAPIC IRR to
ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.
This change follows logic of previous commit 664f8e26b00c ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exception to follow this coding convention as well.
It is important to note that in case !lapic_in_kernel(vcpu),
interrupt.pending usage was, and still is, incorrect.
In this case, interrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt
from QEMU's emulated LAPIC which reset bit in IRR and set bit in ISR
before sending ioctl to KVM. So it seems that indeed "interrupt.pending"
in this case is also supposed to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
is misusing (now named) interrupt.injected in order to return if
there is a pending interrupt.
This leads to nVMX/nSVM not being able to distinguish whether it should exit
from L2 to L1 on EXTERNAL_INTERRUPT on pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.
This patch introduces no semantic change.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-23 00:01:31 +00:00
		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
2009-11-12 00:04:25 +00:00
	events->interrupt.nr = vcpu->arch.interrupt.nr;
2010-02-15 09:45:41 +00:00
	events->interrupt.soft = 0;
2014-05-20 12:29:47 +00:00
	events->interrupt.shadow = kvm_x86_ops->get_interrupt_shadow(vcpu);
2009-11-12 00:04:25 +00:00

	events->nmi.injected = vcpu->arch.nmi_injected;
2011-09-20 10:43:14 +00:00
	events->nmi.pending = vcpu->arch.nmi_pending != 0;
2009-11-12 00:04:25 +00:00
	events->nmi.masked = kvm_x86_ops->get_nmi_mask(vcpu);
2010-10-30 18:54:47 +00:00
	events->nmi.pad = 0;
2009-11-12 00:04:25 +00:00

2013-03-13 11:42:34 +00:00
	events->sipi_vector = 0; /* never valid when reporting to user space */
2009-11-12 00:04:25 +00:00

2015-04-01 13:06:40 +00:00
	events->smi.smm = is_smm(vcpu);
	events->smi.pending = vcpu->arch.smi_pending;
	events->smi.smm_inside_nmi =
		!!(vcpu->arch.hflags & HF_SMM_INSIDE_NMI_MASK);
	events->smi.latched_init = kvm_lapic_latched_init(vcpu);

2009-12-06 17:24:15 +00:00
	events->flags = (KVM_VCPUEVENT_VALID_NMI_PENDING
2015-04-01 13:06:40 +00:00
			 | KVM_VCPUEVENT_VALID_SHADOW
			 | KVM_VCPUEVENT_VALID_SMM);
2010-10-30 18:54:47 +00:00
	memset(&events->reserved, 0, sizeof(events->reserved));
2009-11-12 00:04:25 +00:00
}
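KVM_GET_VCPU_EVENTS reports NMI_PENDING, SHADOW and SMM in events->flags. The SET handler that follows rejects any flag outside those three plus SIPI_VECTOR. Because the reported set is a subset of the accepted set, a VMM can save the structure on one host and restore it unchanged on another.

The sketch below captures that relationship. The EV_* constants are local copies that we assume match the uapi KVM_VCPUEVENT_VALID_* values. check_event_flags() is a hypothetical helper.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local copies of the KVM_VCPUEVENT_VALID_* values (assumed). */
#define EV_VALID_NMI_PENDING  0x00000001u
#define EV_VALID_SIPI_VECTOR  0x00000002u
#define EV_VALID_SHADOW       0x00000004u
#define EV_VALID_SMM          0x00000008u

/* What the GET handler reports in this version (no SIPI bit). */
#define EV_GET_FLAGS (EV_VALID_NMI_PENDING | EV_VALID_SHADOW | EV_VALID_SMM)

/* What the SET handler accepts; any other bit makes it fail. */
#define EV_SET_ALLOWED (EV_GET_FLAGS | EV_VALID_SIPI_VECTOR)

/* Mirrors the first check in the SET handler. */
static int check_event_flags(uint32_t flags)
{
	return (flags & ~EV_SET_ALLOWED) ? -EINVAL : 0;
}
```

Rejecting unknown bits up front leaves room to define new flags later. An older kernel then fails loudly instead of silently ignoring state it does not understand.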

2016-12-24 09:00:42 +00:00
static void kvm_set_hflags(struct kvm_vcpu *vcpu, unsigned emul_flags);

2009-11-12 00:04:25 +00:00
static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
					      struct kvm_vcpu_events *events)
{
2009-12-06 17:24:15 +00:00
	if (events->flags & ~(KVM_VCPUEVENT_VALID_NMI_PENDING
2010-02-19 18:38:07 +00:00
			      | KVM_VCPUEVENT_VALID_SIPI_VECTOR
2015-04-01 13:06:40 +00:00
			      | KVM_VCPUEVENT_VALID_SHADOW
			      | KVM_VCPUEVENT_VALID_SMM))
2009-11-12 00:04:25 +00:00
		return -EINVAL;

KVM: fail KVM_SET_VCPU_EVENTS with invalid exception number
This cannot be returned by KVM_GET_VCPU_EVENTS, so it is okay to return
EINVAL. It causes a WARN from exception_type:
WARNING: CPU: 3 PID: 16732 at arch/x86/kvm/x86.c:345 exception_type+0x49/0x50 [kvm]()
CPU: 3 PID: 16732 Comm: a.out Tainted: G W 4.4.6-300.fc23.x86_64 #1
Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
0000000000000286 000000006308a48b ffff8800bec7fcf8 ffffffff813b542e
0000000000000000 ffffffffa0966496 ffff8800bec7fd30 ffffffff810a40f2
ffff8800552a8000 0000000000000000 00000000002c267c 0000000000000001
Call Trace:
[<ffffffff813b542e>] dump_stack+0x63/0x85
[<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
[<ffffffff810a423a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa0924809>] exception_type+0x49/0x50 [kvm]
[<ffffffffa0934622>] kvm_arch_vcpu_ioctl_run+0x10a2/0x14e0 [kvm]
[<ffffffffa091c04d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
[<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
[<ffffffff812414a9>] SyS_ioctl+0x79/0x90
[<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
---[ end trace b1a0391266848f50 ]---
Testcase (beautified/reduced from syzkaller output):
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdint.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
long r[31];
int main()
{
memset(r, -1, sizeof(r));
r[2] = open("/dev/kvm", O_RDONLY);
r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
r[7] = ioctl(r[3], KVM_CREATE_VCPU, 0);
struct kvm_vcpu_events ve = {
.exception.injected = 1,
.exception.nr = 0xd4
};
r[27] = ioctl(r[7], KVM_SET_VCPU_EVENTS, &ve);
r[30] = ioctl(r[7], KVM_RUN, 0);
return 0;
}
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-01 12:09:20 +00:00
	if (events->exception.injected &&
kvm: nVMX: Disallow userspace-injected exceptions in guest mode
The userspace exception injection API and code path are entirely
unprepared for exceptions that might cause a VM-exit from L2 to L1, so
the best course of action may be to simply disallow this for now.
1. The API provides no mechanism for userspace to specify the new DR6
bits for a #DB exception or the new CR2 value for a #PF
exception. Presumably, userspace is expected to modify these registers
directly with KVM_SET_SREGS before the next KVM_RUN ioctl. However, in
the event that L1 intercepts the exception, these registers should not
be changed. Instead, the new values should be provided in the
exit_qualification field of vmcs12 (Intel SDM vol 3, section 27.1).
2. In the case of a userspace-injected #DB, inject_pending_event()
clears DR7.GD before calling vmx_queue_exception(). However, in the
event that L1 intercepts the exception, this is too early, because
DR7.GD should not be modified by a #DB that causes a VM-exit directly
(Intel SDM vol 3, section 27.1).
3. If the injected exception is a #PF, nested_vmx_check_exception()
doesn't properly check whether or not L1 is interested in the
associated error code (using the #PF error code mask and match fields
from vmcs12). It may either return 0 when it should call
nested_vmx_vmexit() or vice versa.
4. nested_vmx_check_exception() assumes that it is dealing with a
hardware-generated exception intercept from L2, with some of the
relevant details (the VM-exit interruption-information and the exit
qualification) living in vmcs02. For userspace-injected exceptions, this
is not the case.
5. prepare_vmcs12() assumes that when its exit_intr_info argument
specifies valid information with a valid error code that it can VMREAD
the VM-exit interruption error code from vmcs02. For
userspace-injected exceptions, this is not the case.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-05 16:14:40 +00:00
	    (events->exception.nr > 31 || events->exception.nr == NMI_VECTOR ||
	     is_guest_mode(vcpu)))
2016-06-01 12:09:20 +00:00
		return -EINVAL;

2017-03-23 10:46:03 +00:00
	/* INITs are latched while in SMM */
	if (events->flags & KVM_VCPUEVENT_VALID_SMM &&
	    (events->smi.smm || events->smi.pending) &&
	    vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED)
		return -EINVAL;

2011-09-20 10:43:14 +00:00
	process_nmi(vcpu);
2017-08-24 10:35:09 +00:00
	vcpu->arch.exception.injected = false;
2009-11-12 00:04:25 +00:00
	vcpu->arch.exception.pending = events->exception.injected;
	vcpu->arch.exception.nr = events->exception.nr;
	vcpu->arch.exception.has_error_code = events->exception.has_error_code;
	vcpu->arch.exception.error_code = events->exception.error_code;

KVM: x86: Rename interrupt.pending to interrupt.injected
For exception and NMI events, KVM code uses the following
coding convention:
*) "pending" represents an event that should be injected to guest at
some point but whose side-effects have not yet occurred.
*) "injected" represents an event whose side-effects have already
occurred.
However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when its side-effects
have already taken place (for example, bit moved from LAPIC IRR to
ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.
This change follows logic of previous commit 664f8e26b00c ("KVM: X86:
Fix loss of exception which has not yet been injected") which changed
exception to follow this coding convention as well.
It is important to note that in case !lapic_in_kernel(vcpu),
interrupt.pending usage was and still is incorrect.
In this case, interrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or by dispatching a new interrupt
from QEMU's emulated LAPIC, which resets the bit in IRR and sets the bit in ISR
before sending the ioctl to KVM. So it seems that indeed "interrupt.pending"
in this case is also supposed to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
misuse (now named) interrupt.injected in order to return if
there is a pending interrupt.
This leaves nVMX/nSVM unable to distinguish whether it should exit
from L2 to L1 on EXTERNAL_INTERRUPT on pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.
This patch introduces no semantics change.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-23 00:01:31 +00:00
	vcpu->arch.interrupt.injected = events->interrupt.injected;
2009-11-12 00:04:25 +00:00
	vcpu->arch.interrupt.nr = events->interrupt.nr;
	vcpu->arch.interrupt.soft = events->interrupt.soft;
2010-02-19 18:38:07 +00:00
	if (events->flags & KVM_VCPUEVENT_VALID_SHADOW)
		kvm_x86_ops->set_interrupt_shadow(vcpu,
						  events->interrupt.shadow);
2009-11-12 00:04:25 +00:00

	vcpu->arch.nmi_injected = events->nmi.injected;
2009-12-06 17:24:15 +00:00
	if (events->flags & KVM_VCPUEVENT_VALID_NMI_PENDING)
		vcpu->arch.nmi_pending = events->nmi.pending;
2009-11-12 00:04:25 +00:00
	kvm_x86_ops->set_nmi_mask(vcpu, events->nmi.masked);

2013-03-13 11:42:34 +00:00
	if (events->flags & KVM_VCPUEVENT_VALID_SIPI_VECTOR &&
2016-01-08 12:48:51 +00:00
	    lapic_in_kernel(vcpu))
2013-03-13 11:42:34 +00:00
		vcpu->arch.apic->sipi_vector = events->sipi_vector;
2009-11-12 00:04:25 +00:00

2015-04-01 13:06:40 +00:00
	if (events->flags & KVM_VCPUEVENT_VALID_SMM) {
2016-12-24 09:00:42 +00:00
		u32 hflags = vcpu->arch.hflags;
2015-04-01 13:06:40 +00:00
		if (events->smi.smm)
2016-12-24 09:00:42 +00:00
			hflags |= HF_SMM_MASK;
2015-04-01 13:06:40 +00:00
		else
2016-12-24 09:00:42 +00:00
			hflags &= ~HF_SMM_MASK;
		kvm_set_hflags(vcpu, hflags);

2015-04-01 13:06:40 +00:00
		vcpu->arch.smi_pending = events->smi.pending;
2017-08-01 23:05:25 +00:00

		if (events->smi.smm) {
			if (events->smi.smm_inside_nmi)
				vcpu->arch.hflags |= HF_SMM_INSIDE_NMI_MASK;
2015-04-01 13:06:40 +00:00
			else
2017-08-01 23:05:25 +00:00
				vcpu->arch.hflags &= ~HF_SMM_INSIDE_NMI_MASK;
			if (lapic_in_kernel(vcpu)) {
				if (events->smi.latched_init)
					set_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
				else
					clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
			}
2015-04-01 13:06:40 +00:00
		}
	}

2010-07-27 09:30:24 +00:00
	kvm_make_request(KVM_REQ_EVENT, vcpu);

2009-11-12 00:04:25 +00:00
	return 0;
}

2010-02-15 09:45:43 +00:00
static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu,
					     struct kvm_debugregs *dbgregs)
{
2014-01-04 17:47:16 +00:00
	unsigned long val;

2010-02-15 09:45:43 +00:00
	memcpy(dbgregs->db, vcpu->arch.db, sizeof(vcpu->arch.db));
2014-10-02 22:10:05 +00:00
	kvm_get_dr(vcpu, 6, &val);
2014-01-04 17:47:16 +00:00
	dbgregs->dr6 = val;
2010-02-15 09:45:43 +00:00
	dbgregs->dr7 = vcpu->arch.dr7;
	dbgregs->flags = 0;
2010-10-30 18:54:47 +00:00
	memset(&dbgregs->reserved, 0, sizeof(dbgregs->reserved));
2010-02-15 09:45:43 +00:00
}

static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
					    struct kvm_debugregs *dbgregs)
{
	if (dbgregs->flags)
		return -EINVAL;

KVM: x86: fix OOPS after invalid KVM_SET_DEBUGREGS
MOV to DR6 or DR7 causes a #GP if an attempt is made to write a 1 to
any of bits 63:32. However, this is not detected at KVM_SET_DEBUGREGS
time, and the next KVM_RUN oopses:
general protection fault: 0000 [#1] SMP
CPU: 2 PID: 14987 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1
Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
[...]
Call Trace:
[<ffffffffa072c93d>] kvm_arch_vcpu_ioctl_run+0x141d/0x14e0 [kvm]
[<ffffffffa071405d>] kvm_vcpu_ioctl+0x33d/0x620 [kvm]
[<ffffffff81241648>] do_vfs_ioctl+0x298/0x480
[<ffffffff812418a9>] SyS_ioctl+0x79/0x90
[<ffffffff817a0f2e>] entry_SYSCALL_64_fastpath+0x12/0x71
Code: 55 83 ff 07 48 89 e5 77 27 89 ff ff 24 fd 90 87 80 81 0f 23 fe 5d c3 0f 23 c6 5d c3 0f 23 ce 5d c3 0f 23 d6 5d c3 0f 23 de 5d c3 <0f> 23 f6 5d c3 0f 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00
RIP [<ffffffff810639eb>] native_set_debugreg+0x2b/0x40
RSP <ffff88005836bd50>
Testcase (beautified/reduced from syzkaller output):
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdint.h>
#include <linux/kvm.h>
#include <fcntl.h>
#include <sys/ioctl.h>
long r[8];
int main()
{
struct kvm_debugregs dr = { 0 };
r[2] = open("/dev/kvm", O_RDONLY);
r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
memcpy(&dr,
"\x5d\x6a\x6b\xe8\x57\x3b\x4b\x7e\xcf\x0d\xa1\x72"
"\xa3\x4a\x29\x0c\xfc\x6d\x44\x00\xa7\x52\xc7\xd8"
"\x00\xdb\x89\x9d\x78\xb5\x54\x6b\x6b\x13\x1c\xe9"
"\x5e\xd3\x0e\x40\x6f\xb4\x66\xf7\x5b\xe3\x36\xcb",
48);
r[7] = ioctl(r[4], KVM_SET_DEBUGREGS, &dr);
r[6] = ioctl(r[4], KVM_RUN, 0);
}
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-01 12:09:23 +00:00
	if (dbgregs->dr6 & ~0xffffffffull)
		return -EINVAL;
	if (dbgregs->dr7 & ~0xffffffffull)
		return -EINVAL;

2010-02-15 09:45:43 +00:00
	memcpy(vcpu->arch.db, dbgregs->db, sizeof(vcpu->arch.db));
2015-04-02 00:10:37 +00:00
	kvm_update_dr0123(vcpu);
2010-02-15 09:45:43 +00:00
	vcpu->arch.dr6 = dbgregs->dr6;
2014-01-04 17:47:16 +00:00
	kvm_update_dr6(vcpu);
2010-02-15 09:45:43 +00:00
	vcpu->arch.dr7 = dbgregs->dr7;
2014-01-04 17:47:15 +00:00
	kvm_update_dr7(vcpu);
2010-02-15 09:45:43 +00:00

	return 0;
}

2014-11-21 18:05:07 +00:00
#define XSTATE_COMPACTION_ENABLED (1ULL << 63)

static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
{
2015-04-30 15:15:32 +00:00
	struct xregs_state *xsave = &vcpu->arch.guest_fpu.state.xsave;
2015-04-24 08:19:47 +00:00
	u64 xstate_bv = xsave->header.xfeatures;
2014-11-21 18:05:07 +00:00
	u64 valid;

	/*
	 * Copy legacy XSAVE area, to avoid complications with CPUID
	 * leaves 0 and 1 in the loop below.
	 */
	memcpy(dest, xsave, XSAVE_HDR_OFFSET);

	/* Set XSTATE_BV */
2017-02-01 13:19:53 +00:00
	xstate_bv &= vcpu->arch.guest_supported_xcr0 | XFEATURE_MASK_FPSSE;
2014-11-21 18:05:07 +00:00
	*(u64 *)(dest + XSAVE_HDR_OFFSET) = xstate_bv;

	/*
	 * Copy each region from the possibly compacted offset to the
	 * non-compacted offset.
	 */
2015-09-02 23:31:26 +00:00
	valid = xstate_bv & ~XFEATURE_MASK_FPSSE;
2014-11-21 18:05:07 +00:00
	while (valid) {
		u64 feature = valid & -valid;
		int index = fls64(feature) - 1;
		void *src = get_xsave_addr(xsave, feature);

		if (src) {
			u32 size, offset, ecx, edx;
			cpuid_count(XSTATE_CPUID, index,
				    &size, &offset, &ecx, &edx);
2017-08-23 21:16:29 +00:00
			if (feature == XFEATURE_MASK_PKRU)
				memcpy(dest + offset, &vcpu->arch.pkru,
				       sizeof(vcpu->arch.pkru));
			else
				memcpy(dest + offset, src, size);

2014-11-21 18:05:07 +00:00
		}

		valid -= feature;
	}
}

static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
{
2015-04-30 15:15:32 +00:00
	struct xregs_state *xsave = &vcpu->arch.guest_fpu.state.xsave;
2014-11-21 18:05:07 +00:00
	u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
	u64 valid;

	/*
	 * Copy legacy XSAVE area, to avoid complications with CPUID
	 * leaves 0 and 1 in the loop below.
	 */
	memcpy(xsave, src, XSAVE_HDR_OFFSET);

	/* Set XSTATE_BV and possibly XCOMP_BV.  */
2015-04-24 08:19:47 +00:00
	xsave->header.xfeatures = xstate_bv;
2016-04-04 20:25:03 +00:00
	if (boot_cpu_has(X86_FEATURE_XSAVES))
2015-04-24 08:14:36 +00:00
		xsave->header.xcomp_bv = host_xcr0 | XSTATE_COMPACTION_ENABLED;
2014-11-21 18:05:07 +00:00

	/*
	 * Copy each region from the non-compacted offset to the
	 * possibly compacted offset.
	 */
2015-09-02 23:31:26 +00:00
	valid = xstate_bv & ~XFEATURE_MASK_FPSSE;
2014-11-21 18:05:07 +00:00
	while (valid) {
		u64 feature = valid & -valid;
		int index = fls64(feature) - 1;
		void *dest = get_xsave_addr(xsave, feature);

		if (dest) {
			u32 size, offset, ecx, edx;
			cpuid_count(XSTATE_CPUID, index,
				    &size, &offset, &ecx, &edx);
2017-08-23 21:16:29 +00:00
			if (feature == XFEATURE_MASK_PKRU)
				memcpy(&vcpu->arch.pkru, src + offset,
				       sizeof(vcpu->arch.pkru));
			else
				memcpy(dest, src + offset, size);
2015-07-09 07:44:52 +00:00
		}
2014-11-21 18:05:07 +00:00

		valid -= feature;
	}
}

2010-06-13 09:29:39 +00:00
static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
					 struct kvm_xsave *guest_xsave)
{
2016-04-04 20:25:02 +00:00
	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
2014-11-21 18:05:07 +00:00
		memset(guest_xsave, 0, sizeof(struct kvm_xsave));
		fill_xsave((u8 *) guest_xsave->region, vcpu);
2013-10-02 14:06:16 +00:00
	} else {
2010-06-13 09:29:39 +00:00
		memcpy(guest_xsave->region,
x86/fpu: Simplify FPU handling by embedding the fpstate in task_struct (again)
So 6 years ago we made the FPU fpstate dynamically allocated:
aa283f49276e ("x86, fpu: lazy allocation of FPU area - v5")
61c4628b5386 ("x86, fpu: split FPU state from task struct - v5")
In hindsight this was a mistake:
- it complicated context allocation failure handling, such as:
/* kthread execs. TODO: cleanup this horror. */
if (WARN_ON(fpstate_alloc_init(fpu)))
force_sig(SIGKILL, tsk);
- it caused us to enable irqs in fpu__restore():
local_irq_enable();
/*
* does a slab alloc which can sleep
*/
if (fpstate_alloc_init(fpu)) {
/*
* ran out of memory!
*/
do_group_exit(SIGKILL);
return;
}
local_irq_disable();
- it (slightly) slowed down task creation/destruction by adding
slab allocation/free patterns.
- it made access to context contents (slightly) slower by adding
one more pointer dereference.
The motivation for the dynamic allocation was two-fold:
- reduce memory consumption by non-FPU tasks
- allocate and handle only the necessary amount of context for
various XSAVE processors that have varying hardware frame
sizes.
These days, with glibc using SSE memcpy by default and GCC optimizing
for SSE/AVX by default, the scope of FPU using apps on an x86 system is
much larger than it was 6 years ago.
For example on a freshly installed Fedora 21 desktop system, with a
recent kernel, all non-kthread tasks have used the FPU shortly after
bootup.
Also, even modern embedded x86 CPUs try to support the latest vector
instruction set - so they'll too often use the larger xstate frame
sizes.
So remove the dynamic allocation complication by embedding the FPU
fpstate in task_struct again. This should make the FPU a lot more
accessible to all sorts of atomic contexts.
We could still optimize for the xstate frame size in the future,
by moving the state structure to the last element of task_struct,
and allocating only a part of that.
This change is kept minimal by still keeping the ctx_alloc()/free()
routines (that now do nothing substantial) - we'll remove them in
the following patches.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-04-27 02:19:39 +00:00
			&vcpu->arch.guest_fpu.state.fxsave,
2015-04-30 15:15:32 +00:00
			sizeof(struct fxregs_state));
2010-06-13 09:29:39 +00:00
		*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)] =
2015-09-02 23:31:26 +00:00
			XFEATURE_MASK_FPSSE;
2010-06-13 09:29:39 +00:00
	}
}

2017-05-11 09:58:55 +00:00
#define XSAVE_MXCSR_OFFSET 24

2010-06-13 09:29:39 +00:00
static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
					struct kvm_xsave *guest_xsave)
{
	u64 xstate_bv =
		*(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)];
2017-05-11 09:58:55 +00:00
	u32 mxcsr = *(u32 *)&guest_xsave->region[XSAVE_MXCSR_OFFSET / sizeof(u32)];
2010-06-13 09:29:39 +00:00

2016-04-04 20:25:02 +00:00
	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
2013-10-02 14:06:15 +00:00
		/*
		 * Here we allow setting states that are not present in
		 * CPUID leaf 0xD, index 0, EDX:EAX.  This is for compatibility
		 * with old userspace.
		 */
2017-05-11 09:58:55 +00:00
		if (xstate_bv & ~kvm_supported_xcr0() ||
			mxcsr & ~mxcsr_feature_mask)
2013-10-02 14:06:15 +00:00
			return -EINVAL;
2014-11-21 18:05:07 +00:00
		load_xsave(vcpu, (u8 *)guest_xsave->region);
2013-10-02 14:06:15 +00:00
	} else {
2017-05-11 09:58:55 +00:00
		if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
			mxcsr & ~mxcsr_feature_mask)
2010-06-13 09:29:39 +00:00
			return -EINVAL;
2015-04-27 02:19:39 +00:00
		memcpy(&vcpu->arch.guest_fpu.state.fxsave,
2015-04-30 15:15:32 +00:00
			guest_xsave->region, sizeof(struct fxregs_state));
2010-06-13 09:29:39 +00:00
	}
	return 0;
}

static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
					struct kvm_xcrs *guest_xcrs)
{
2016-04-04 20:25:02 +00:00
	if (!boot_cpu_has(X86_FEATURE_XSAVE)) {
2010-06-13 09:29:39 +00:00
		guest_xcrs->nr_xcrs = 0;
		return;
	}

	guest_xcrs->nr_xcrs = 1;
	guest_xcrs->flags = 0;
	guest_xcrs->xcrs[0].xcr = XCR_XFEATURE_ENABLED_MASK;
	guest_xcrs->xcrs[0].value = vcpu->arch.xcr0;
}

static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu,
				       struct kvm_xcrs *guest_xcrs)
{
	int i, r = 0;

2016-04-04 20:25:02 +00:00
	if (!boot_cpu_has(X86_FEATURE_XSAVE))
2010-06-13 09:29:39 +00:00
		return -EINVAL;

	if (guest_xcrs->nr_xcrs > KVM_MAX_XCRS || guest_xcrs->flags)
		return -EINVAL;

	for (i = 0; i < guest_xcrs->nr_xcrs; i++)
		/* Only support XCR0 currently */
2013-10-17 14:50:47 +00:00
		if (guest_xcrs->xcrs[i].xcr == XCR_XFEATURE_ENABLED_MASK) {
2010-06-13 09:29:39 +00:00
			r = __kvm_set_xcr(vcpu, XCR_XFEATURE_ENABLED_MASK,
2013-10-17 14:50:47 +00:00
				guest_xcrs->xcrs[i].value);
2010-06-13 09:29:39 +00:00
			break;
		}
	if (r)
		r = -EINVAL;
	return r;
}

2012-03-10 19:37:27 +00:00
/*
 * kvm_set_guest_paused() indicates to the guest kernel that it has been
 * stopped by the hypervisor.  This function will be called from the host only.
 * EINVAL is returned when the host attempts to set the flag for a guest that
 * does not support pv clocks.
 */
static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
{
2013-02-20 22:48:10 +00:00
	if (!vcpu->arch.pv_time_enabled)
2012-03-10 19:37:27 +00:00
		return -EINVAL;
2012-08-03 18:57:49 +00:00
	vcpu->arch.pvclock_set_guest_stopped_request = true;
2012-03-10 19:37:27 +00:00
	kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
	return 0;
}

2015-11-10 12:36:34 +00:00
static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
				     struct kvm_enable_cap *cap)
{
	if (cap->flags)
		return -EINVAL;

	switch (cap->cap) {
kvm: x86: hyperv: add KVM_CAP_HYPERV_SYNIC2
There is a flaw in the Hyper-V SynIC implementation in KVM: when message
page or event flags page is enabled by setting the corresponding msr,
KVM zeroes it out. This is problematic because on migration the
corresponding MSRs are loaded on the destination, so the content of
those pages is lost.
This went unnoticed so far because the only user of those pages was
in-KVM hyperv synic timers, which could continue working despite that
zeroing.
Newer QEMU uses those pages for Hyper-V VMBus implementation, and
zeroing them breaks the migration.
Besides, in newer QEMU the content of those pages is fully managed by
QEMU, so zeroing them is undesirable even when writing the MSRs from the
guest side.
To support this new scheme, introduce a new capability,
KVM_CAP_HYPERV_SYNIC2, which, when enabled, makes sure that the synic
pages aren't zeroed out in KVM.
Signed-off-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-06-22 13:51:01 +00:00
	case KVM_CAP_HYPERV_SYNIC2:
		if (cap->args[0])
			return -EINVAL;
2015-11-10 12:36:34 +00:00
	case KVM_CAP_HYPERV_SYNIC:
KVM: x86: fix NULL deref in vcpu_scan_ioapic
Reported by syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
IP: _raw_spin_lock+0xc/0x30
PGD 3e28eb067
PUD 3f0ac6067
PMD 0
Oops: 0002 [#1] SMP
CPU: 0 PID: 2431 Comm: test Tainted: G OE 4.10.0-rc1+ #3
Call Trace:
? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
? pick_next_task_fair+0xe1/0x4e0
? kvm_arch_vcpu_load+0xea/0x260 [kvm]
kvm_vcpu_ioctl+0x33a/0x600 [kvm]
? hrtimer_try_to_cancel+0x29/0x130
? do_nanosleep+0x97/0xf0
do_vfs_ioctl+0xa1/0x5d0
? __hrtimer_init+0x90/0x90
? do_nanosleep+0x5b/0xf0
SyS_ioctl+0x79/0x90
do_syscall_64+0x6e/0x180
entry_SYSCALL64_slow_path+0x25/0x25
RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0
The syzkaller folks reported a NULL pointer dereference due to
ENABLE_CAP succeeding even without an irqchip. The Hyper-V
synthetic interrupt controller is activated, resulting in a
wrong request to rescan the ioapic and a NULL pointer dereference.
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <linux/kvm.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#ifndef KVM_CAP_HYPERV_SYNIC
#define KVM_CAP_HYPERV_SYNIC 123
#endif
void* thr(void* arg)
{
struct kvm_enable_cap cap;
cap.flags = 0;
cap.cap = KVM_CAP_HYPERV_SYNIC;
ioctl((long)arg, KVM_ENABLE_CAP, &cap);
return 0;
}
int main()
{
void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
int kvmfd = open("/dev/kvm", 0);
int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
struct kvm_userspace_memory_region memreg;
memreg.slot = 0;
memreg.flags = 0;
memreg.guest_phys_addr = 0;
memreg.memory_size = 0x1000;
memreg.userspace_addr = (unsigned long)host_mem;
host_mem[0] = 0xf4;
ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
struct kvm_sregs sregs;
ioctl(cpufd, KVM_GET_SREGS, &sregs);
sregs.cr0 = 0;
sregs.cr4 = 0;
sregs.efer = 0;
sregs.cs.selector = 0;
sregs.cs.base = 0;
ioctl(cpufd, KVM_SET_SREGS, &sregs);
struct kvm_regs regs = { .rflags = 2 };
ioctl(cpufd, KVM_SET_REGS, &regs);
ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
pthread_t th;
pthread_create(&th, 0, thr, (void*)(long)cpufd);
usleep(rand() % 10000);
ioctl(cpufd, KVM_RUN, 0);
pthread_join(th, 0);
return 0;
}
This patch fixes it by failing ENABLE_CAP if there is no in-kernel irqchip.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 5c919412fe61 (kvm/x86: Hyper-V synthetic interrupt controller)
Cc: stable@vger.kernel.org # 4.5+
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-04 02:56:19 +00:00
		if (!irqchip_in_kernel(vcpu->kvm))
			return -EINVAL;
2017-06-22 13:51:01 +00:00
		return kvm_hv_activate_synic(vcpu, cap->cap ==
					     KVM_CAP_HYPERV_SYNIC2);
2015-11-10 12:36:34 +00:00
	default:
		return -EINVAL;
	}
}
2007-10-11 17:16:52 +00:00
long kvm_arch_vcpu_ioctl(struct file *filp,
			 unsigned int ioctl, unsigned long arg)
{
	struct kvm_vcpu *vcpu = filp->private_data;
	void __user *argp = (void __user *)arg;
	int r;
2010-06-20 12:54:43 +00:00
	union {
		struct kvm_lapic_state *lapic;
		struct kvm_xsave *xsave;
		struct kvm_xcrs *xcrs;
		void *buffer;
	} u;

2017-12-04 20:35:36 +00:00
	vcpu_load(vcpu);

2010-06-20 12:54:43 +00:00
	u.buffer = NULL;
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
switch (ioctl) {
|
|
|
|
case KVM_GET_LAPIC: {
|
2009-10-29 15:44:16 +00:00
|
|
|
r = -EINVAL;
|
2016-01-08 12:48:51 +00:00
|
|
|
if (!lapic_in_kernel(vcpu))
|
2009-10-29 15:44:16 +00:00
|
|
|
goto out;
|
2010-06-20 12:54:43 +00:00
|
|
|
u.lapic = kzalloc(sizeof(struct kvm_lapic_state), GFP_KERNEL);
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
|
2008-08-11 17:01:47 +00:00
|
|
|
r = -ENOMEM;
|
2010-06-20 12:54:43 +00:00
|
|
|
if (!u.lapic)
|
2008-08-11 17:01:47 +00:00
|
|
|
goto out;
|
2010-06-20 12:54:43 +00:00
|
|
|
r = kvm_vcpu_ioctl_get_lapic(vcpu, u.lapic);
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
if (r)
|
|
|
|
goto out;
|
|
|
|
r = -EFAULT;
|
2010-06-20 12:54:43 +00:00
|
|
|
if (copy_to_user(argp, u.lapic, sizeof(struct kvm_lapic_state)))
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
goto out;
|
|
|
|
r = 0;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case KVM_SET_LAPIC: {
|
2009-10-29 15:44:16 +00:00
|
|
|
r = -EINVAL;
|
2016-01-08 12:48:51 +00:00
|
|
|
if (!lapic_in_kernel(vcpu))
|
2009-10-29 15:44:16 +00:00
|
|
|
goto out;
|
2011-12-04 17:36:29 +00:00
|
|
|
u.lapic = memdup_user(argp, sizeof(*u.lapic));
|
2017-12-04 20:35:36 +00:00
|
|
|
if (IS_ERR(u.lapic)) {
|
|
|
|
r = PTR_ERR(u.lapic);
|
|
|
|
goto out_nofree;
|
|
|
|
}
|
2011-12-04 17:36:29 +00:00
|
|
|
|
2010-06-20 12:54:43 +00:00
|
|
|
r = kvm_vcpu_ioctl_set_lapic(vcpu, u.lapic);
|
KVM: Portability: split kvm_vcpu_ioctl
This patch splits kvm_vcpu_ioctl into archtecture independent parts, and
x86 specific parts which go to kvm_arch_vcpu_ioctl in x86.c.
Common ioctls for all architectures are:
KVM_RUN, KVM_GET/SET_(S-)REGS, KVM_TRANSLATE, KVM_INTERRUPT,
KVM_DEBUG_GUEST, KVM_SET_SIGNAL_MASK, KVM_GET/SET_FPU
Note that some PPC chips don't have an FPU, so we might need an #ifdef
around KVM_GET/SET_FPU one day.
x86 specific ioctls are:
KVM_GET/SET_LAPIC, KVM_SET_CPUID, KVM_GET/SET_MSRS
An interresting aspect is vcpu_load/vcpu_put. We now have a common
vcpu_load/put which does the preemption stuff, and an architecture
specific kvm_arch_vcpu_load/put. In the x86 case, this one calls the
vmx/svm function defined in kvm_x86_ops.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-11 17:16:52 +00:00
|
|
|
break;
|
|
|
|
}
2007-11-20 20:36:41 +00:00
	case KVM_INTERRUPT: {
		struct kvm_interrupt irq;

		r = -EFAULT;
		if (copy_from_user(&irq, argp, sizeof irq))
			goto out;
		r = kvm_vcpu_ioctl_interrupt(vcpu, &irq);
		break;
	}
2008-09-26 07:30:55 +00:00
	case KVM_NMI: {
		r = kvm_vcpu_ioctl_nmi(vcpu);
		break;
	}
2015-04-01 13:06:40 +00:00
	case KVM_SMI: {
		r = kvm_vcpu_ioctl_smi(vcpu);
		break;
	}
2007-10-11 17:16:52 +00:00
	case KVM_SET_CPUID: {
		struct kvm_cpuid __user *cpuid_arg = argp;
		struct kvm_cpuid cpuid;

		r = -EFAULT;
		if (copy_from_user(&cpuid, cpuid_arg, sizeof cpuid))
			goto out;
		r = kvm_vcpu_ioctl_set_cpuid(vcpu, &cpuid, cpuid_arg->entries);
		break;
	}
2007-11-21 15:10:04 +00:00
	case KVM_SET_CPUID2: {
		struct kvm_cpuid2 __user *cpuid_arg = argp;
		struct kvm_cpuid2 cpuid;

		r = -EFAULT;
		if (copy_from_user(&cpuid, cpuid_arg, sizeof cpuid))
			goto out;
		r = kvm_vcpu_ioctl_set_cpuid2(vcpu, &cpuid,
2009-01-14 16:56:00 +00:00
					      cpuid_arg->entries);
2007-11-21 15:10:04 +00:00
		break;
	}
	case KVM_GET_CPUID2: {
		struct kvm_cpuid2 __user *cpuid_arg = argp;
		struct kvm_cpuid2 cpuid;

		r = -EFAULT;
		if (copy_from_user(&cpuid, cpuid_arg, sizeof cpuid))
			goto out;
		r = kvm_vcpu_ioctl_get_cpuid2(vcpu, &cpuid,
2009-01-14 16:56:00 +00:00
					      cpuid_arg->entries);
2007-11-21 15:10:04 +00:00
		if (r)
			goto out;
		r = -EFAULT;
		if (copy_to_user(cpuid_arg, &cpuid, sizeof cpuid))
			goto out;
		r = 0;
		break;
	}
2018-02-21 19:39:51 +00:00
	case KVM_GET_MSRS: {
		int idx = srcu_read_lock(&vcpu->kvm->srcu);
2015-04-08 13:30:38 +00:00
		r = msr_io(vcpu, argp, do_get_msr, 1);
2018-02-21 19:39:51 +00:00
		srcu_read_unlock(&vcpu->kvm->srcu, idx);
2007-10-11 17:16:52 +00:00
		break;
2018-02-21 19:39:51 +00:00
	}
	case KVM_SET_MSRS: {
		int idx = srcu_read_lock(&vcpu->kvm->srcu);
2007-10-11 17:16:52 +00:00
		r = msr_io(vcpu, argp, do_set_msr, 0);
2018-02-21 19:39:51 +00:00
		srcu_read_unlock(&vcpu->kvm->srcu, idx);
2007-10-11 17:16:52 +00:00
		break;
2018-02-21 19:39:51 +00:00
	}
2007-10-22 14:50:39 +00:00
	case KVM_TPR_ACCESS_REPORTING: {
		struct kvm_tpr_access_ctl tac;

		r = -EFAULT;
		if (copy_from_user(&tac, argp, sizeof tac))
			goto out;
		r = vcpu_ioctl_tpr_access_reporting(vcpu, &tac);
		if (r)
			goto out;
		r = -EFAULT;
		if (copy_to_user(argp, &tac, sizeof tac))
			goto out;
		r = 0;
		break;
	};
2007-10-25 14:52:32 +00:00
	case KVM_SET_VAPIC_ADDR: {
		struct kvm_vapic_addr va;
2016-11-17 14:55:46 +00:00
		int idx;
2007-10-25 14:52:32 +00:00

		r = -EINVAL;
2015-07-29 10:05:37 +00:00
		if (!lapic_in_kernel(vcpu))
2007-10-25 14:52:32 +00:00
			goto out;
		r = -EFAULT;
		if (copy_from_user(&va, argp, sizeof va))
			goto out;
2016-11-17 14:55:46 +00:00
		idx = srcu_read_lock(&vcpu->kvm->srcu);
2013-11-20 18:23:22 +00:00
		r = kvm_lapic_set_vapic_addr(vcpu, va.vapic_addr);
2016-11-17 14:55:46 +00:00
		srcu_read_unlock(&vcpu->kvm->srcu, idx);
2007-10-25 14:52:32 +00:00
		break;
	}
2009-05-11 08:48:15 +00:00
	case KVM_X86_SETUP_MCE: {
		u64 mcg_cap;

		r = -EFAULT;
		if (copy_from_user(&mcg_cap, argp, sizeof mcg_cap))
			goto out;
		r = kvm_vcpu_ioctl_x86_setup_mce(vcpu, mcg_cap);
		break;
	}
	case KVM_X86_SET_MCE: {
		struct kvm_x86_mce mce;

		r = -EFAULT;
		if (copy_from_user(&mce, argp, sizeof mce))
			goto out;
		r = kvm_vcpu_ioctl_x86_set_mce(vcpu, &mce);
		break;
	}
2009-11-12 00:04:25 +00:00
	case KVM_GET_VCPU_EVENTS: {
		struct kvm_vcpu_events events;

		kvm_vcpu_ioctl_x86_get_vcpu_events(vcpu, &events);

		r = -EFAULT;
		if (copy_to_user(argp, &events, sizeof(struct kvm_vcpu_events)))
			break;
		r = 0;
		break;
	}
	case KVM_SET_VCPU_EVENTS: {
		struct kvm_vcpu_events events;

		r = -EFAULT;
		if (copy_from_user(&events, argp, sizeof(struct kvm_vcpu_events)))
			break;

		r = kvm_vcpu_ioctl_x86_set_vcpu_events(vcpu, &events);
		break;
	}
2010-02-15 09:45:43 +00:00
	case KVM_GET_DEBUGREGS: {
		struct kvm_debugregs dbgregs;

		kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs);

		r = -EFAULT;
		if (copy_to_user(argp, &dbgregs,
				 sizeof(struct kvm_debugregs)))
			break;
		r = 0;
		break;
	}
	case KVM_SET_DEBUGREGS: {
		struct kvm_debugregs dbgregs;

		r = -EFAULT;
		if (copy_from_user(&dbgregs, argp,
				   sizeof(struct kvm_debugregs)))
			break;

		r = kvm_vcpu_ioctl_x86_set_debugregs(vcpu, &dbgregs);
		break;
	}
2010-06-13 09:29:39 +00:00
	case KVM_GET_XSAVE: {
2010-06-20 12:54:43 +00:00
		u.xsave = kzalloc(sizeof(struct kvm_xsave), GFP_KERNEL);
2010-06-13 09:29:39 +00:00
		r = -ENOMEM;
2010-06-20 12:54:43 +00:00
		if (!u.xsave)
2010-06-13 09:29:39 +00:00
			break;

2010-06-20 12:54:43 +00:00
		kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave);
2010-06-13 09:29:39 +00:00

		r = -EFAULT;
2010-06-20 12:54:43 +00:00
		if (copy_to_user(argp, u.xsave, sizeof(struct kvm_xsave)))
2010-06-13 09:29:39 +00:00
			break;
		r = 0;
		break;
	}
	case KVM_SET_XSAVE: {
2011-12-04 17:36:29 +00:00
		u.xsave = memdup_user(argp, sizeof(*u.xsave));
2017-12-04 20:35:36 +00:00
		if (IS_ERR(u.xsave)) {
			r = PTR_ERR(u.xsave);
			goto out_nofree;
		}
2010-06-13 09:29:39 +00:00

2010-06-20 12:54:43 +00:00
		r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
2010-06-13 09:29:39 +00:00
		break;
	}
	case KVM_GET_XCRS: {
2010-06-20 12:54:43 +00:00
		u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL);
2010-06-13 09:29:39 +00:00
		r = -ENOMEM;
2010-06-20 12:54:43 +00:00
		if (!u.xcrs)
2010-06-13 09:29:39 +00:00
			break;

2010-06-20 12:54:43 +00:00
		kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs);
2010-06-13 09:29:39 +00:00

		r = -EFAULT;
2010-06-20 12:54:43 +00:00
		if (copy_to_user(argp, u.xcrs,
2010-06-13 09:29:39 +00:00
				 sizeof(struct kvm_xcrs)))
			break;
		r = 0;
		break;
	}
	case KVM_SET_XCRS: {
2011-12-04 17:36:29 +00:00
		u.xcrs = memdup_user(argp, sizeof(*u.xcrs));
2017-12-04 20:35:36 +00:00
		if (IS_ERR(u.xcrs)) {
			r = PTR_ERR(u.xcrs);
			goto out_nofree;
		}
2010-06-13 09:29:39 +00:00

2010-06-20 12:54:43 +00:00
		r = kvm_vcpu_ioctl_x86_set_xcrs(vcpu, u.xcrs);
2010-06-13 09:29:39 +00:00
		break;
	}
2011-03-25 08:44:51 +00:00
	case KVM_SET_TSC_KHZ: {
		u32 user_tsc_khz;

		r = -EINVAL;
		user_tsc_khz = (u32)arg;

		if (user_tsc_khz >= kvm_max_guest_tsc_khz)
			goto out;

KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate. Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code. This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.
There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed. Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used. The default is 250ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.
In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.
[avi: fix 64-bit division on i386]
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-03 17:43:50 +00:00
		if (user_tsc_khz == 0)
			user_tsc_khz = tsc_khz;

2015-10-20 07:39:04 +00:00
		if (!kvm_set_tsc_khz(vcpu, user_tsc_khz))
			r = 0;
2011-03-25 08:44:51 +00:00

		goto out;
	}
	case KVM_GET_TSC_KHZ: {
2012-02-03 17:43:50 +00:00
		r = vcpu->arch.virtual_tsc_khz;
2011-03-25 08:44:51 +00:00
		goto out;
	}
2012-03-10 19:37:27 +00:00
	case KVM_KVMCLOCK_CTRL: {
		r = kvm_set_guest_paused(vcpu);
		goto out;
	}
2015-11-10 12:36:34 +00:00
	case KVM_ENABLE_CAP: {
		struct kvm_enable_cap cap;

		r = -EFAULT;
		if (copy_from_user(&cap, argp, sizeof(cap)))
			goto out;
		r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
		break;
	}
2018-07-10 09:27:20 +00:00
	case KVM_GET_NESTED_STATE: {
		struct kvm_nested_state __user *user_kvm_nested_state = argp;
		u32 user_data_size;

		r = -EINVAL;
		if (!kvm_x86_ops->get_nested_state)
			break;

		BUILD_BUG_ON(sizeof(user_data_size) != sizeof(user_kvm_nested_state->size));
		/* Fail with "break", not "return", so vcpu_put() still runs. */
		r = -EFAULT;
		if (get_user(user_data_size, &user_kvm_nested_state->size))
			break;

		r = kvm_x86_ops->get_nested_state(vcpu, user_kvm_nested_state,
						  user_data_size);
		if (r < 0)
			break;

		if (r > user_data_size) {
			if (put_user(r, &user_kvm_nested_state->size))
				r = -EFAULT;
			else
				r = -E2BIG;
			break;
		}
		r = 0;
		break;
	}
	case KVM_SET_NESTED_STATE: {
		struct kvm_nested_state __user *user_kvm_nested_state = argp;
		struct kvm_nested_state kvm_state;

		r = -EINVAL;
		if (!kvm_x86_ops->set_nested_state)
			break;

		r = -EFAULT;
		if (copy_from_user(&kvm_state, user_kvm_nested_state, sizeof(kvm_state)))
			break;

		r = -EINVAL;
		if (kvm_state.size < sizeof(kvm_state))
			break;

		if (kvm_state.flags &
		    ~(KVM_STATE_NESTED_RUN_PENDING | KVM_STATE_NESTED_GUEST_MODE))
			break;

		/* nested_run_pending implies guest_mode. */
		if (kvm_state.flags == KVM_STATE_NESTED_RUN_PENDING)
			break;

		r = kvm_x86_ops->set_nested_state(vcpu, user_kvm_nested_state, &kvm_state);
		break;
	}
2007-10-11 17:16:52 +00:00
	default:
		r = -EINVAL;
	}
out:
2010-06-20 12:54:43 +00:00
	kfree(u.buffer);
2017-12-04 20:35:36 +00:00
out_nofree:
	vcpu_put(vcpu);
2007-10-11 17:16:52 +00:00
	return r;
}
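kvm_arch_vcpu_ioctl() above uses a single-exit cleanup pattern: vcpu_load() runs before the switch, every case leaves through `break` or `goto out`, the union `u` lets one kfree(u.buffer) release whichever buffer a case allocated, and memdup_user() failures jump to `out_nofree` because the union then holds an ERR_PTR() value that must not be passed to kfree(). The userspace sketch below reproduces only that control flow. The model_* names and the MODEL_GET/MODEL_SET commands are hypothetical, malloc()/calloc()/free() stand in for the kernel allocators, and the two counters simply record that every path pairs a load with a put.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for vcpu_load()/vcpu_put(); they only count calls. */
static int load_calls, put_calls;
static void model_vcpu_load(void) { load_calls++; }
static void model_vcpu_put(void) { put_calls++; }

enum { MODEL_GET = 1, MODEL_SET = 2 };  /* hypothetical ioctl numbers */

static long model_vcpu_ioctl(unsigned int ioctl, const void *src, size_t len,
			     int fail_copy)
{
	union {
		char *state;   /* per-case buffer, like u.lapic or u.xsave */
		void *buffer;  /* generic view used by the shared free */
	} u;
	long r;

	model_vcpu_load();
	u.buffer = NULL;       /* makes the free at "out" safe for every case */

	switch (ioctl) {
	case MODEL_GET:
		u.state = calloc(1, len);   /* kzalloc() analogue */
		r = -ENOMEM;
		if (!u.state)
			goto out;
		r = 0;         /* a real handler would fill and copy out here */
		break;
	case MODEL_SET:
		/* memdup_user() analogue; in the kernel a failure leaves an
		 * ERR_PTR() in the union, so this error path skips the free. */
		if (fail_copy || !(u.state = malloc(len))) {
			r = fail_copy ? -EFAULT : -ENOMEM;
			goto out_nofree;
		}
		memcpy(u.state, src, len);
		r = 0;         /* a real handler would pass the copy on here */
		break;
	default:
		r = -EINVAL;
	}
out:
	free(u.buffer);
out_nofree:
	model_vcpu_put();
	return r;
}
```

Any `return` placed inside the switch would skip the put, which is why the handlers set `r` and leave through the labels.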

2018-04-18 19:19:58 +00:00
vm_fault_t kvm_arch_vcpu_fault(struct kvm_vcpu *vcpu, struct vm_fault *vmf)
2012-01-04 09:25:23 +00:00
{
	return VM_FAULT_SIGBUS;
}
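The KVM_GET_NESTED_STATE case in kvm_arch_vcpu_ioctl() implements a size handshake: userspace passes its buffer size in the `size` header field, and when the vendor callback reports that more space is needed, the ioctl writes the required size back into that field and fails with -E2BIG. KVM_SET_NESTED_STATE, in turn, rejects unknown flag bits and RUN_PENDING without GUEST_MODE. The sketch below models both sides in userspace under stated assumptions: `struct model_state`, MODEL_REQUIRED_SIZE and the model_* functions are hypothetical, the flag values are assumed to match KVM_STATE_NESTED_GUEST_MODE (0x1) and KVM_STATE_NESTED_RUN_PENDING (0x2) from the uapi header, and retrying once on -E2BIG is just one possible caller strategy.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical header that, like struct kvm_nested_state, starts with a size. */
struct model_state {
	uint32_t size;         /* in: buffer size; out: required size on -E2BIG */
	uint32_t flags;
	unsigned char data[];  /* variable-length payload */
};

#define MODEL_REQUIRED_SIZE      256u  /* hypothetical size the "kernel" needs */
#define MODEL_NESTED_GUEST_MODE  0x1u  /* assumed KVM_STATE_NESTED_GUEST_MODE */
#define MODEL_NESTED_RUN_PENDING 0x2u  /* assumed KVM_STATE_NESTED_RUN_PENDING */

/* Kernel side of the handshake, modeled on the KVM_GET_NESTED_STATE case. */
static int model_get_state(struct model_state *user)
{
	if (MODEL_REQUIRED_SIZE > user->size) {
		user->size = MODEL_REQUIRED_SIZE;  /* like put_user(r, &...->size) */
		return -E2BIG;
	}
	user->flags = 0;                           /* fill in the header */
	return 0;
}

/* Caller side: start with a header-only buffer and grow it once on -E2BIG. */
static struct model_state *model_fetch_state(void)
{
	uint32_t size = sizeof(struct model_state);
	struct model_state *buf = calloc(1, size);
	int r;

	if (!buf)
		return NULL;
	buf->size = size;
	r = model_get_state(buf);
	if (r == -E2BIG) {
		struct model_state *bigger;

		size = buf->size;                  /* required size reported back */
		bigger = realloc(buf, size);
		if (!bigger) {
			free(buf);
			return NULL;
		}
		buf = bigger;
		buf->size = size;
		r = model_get_state(buf);
	}
	if (r != 0) {
		free(buf);
		return NULL;
	}
	return buf;
}

/* Flag checks modeled on the KVM_SET_NESTED_STATE case. */
static int model_check_flags(uint32_t flags)
{
	if (flags & ~(MODEL_NESTED_RUN_PENDING | MODEL_NESTED_GUEST_MODE))
		return -EINVAL;  /* unknown flag bits */
	if (flags == MODEL_NESTED_RUN_PENDING)
		return -EINVAL;  /* run_pending implies guest_mode */
	return 0;
}
```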

2007-10-29 15:08:35 +00:00
static int kvm_vm_ioctl_set_tss_addr(struct kvm *kvm, unsigned long addr)
{
	int ret;

	if (addr > (unsigned int)(-3 * PAGE_SIZE))
2012-11-02 10:33:22 +00:00
		return -EINVAL;
2007-10-29 15:08:35 +00:00
	ret = kvm_x86_ops->set_tss_addr(kvm, addr);
	return ret;
}
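The bound in kvm_vm_ioctl_set_tss_addr() is easier to read once the cast is evaluated. KVM_SET_TSS_ADDR describes a three-page region that must lie within the first 4 GiB of guest physical memory; with 4 KiB pages, (unsigned int)(-3 * PAGE_SIZE) wraps to 0xFFFFD000, and 0xFFFFD000 + 3 * 0x1000 is exactly 4 GiB. The helper below copies the same expression into standalone code; `tss_addr_ok()` and MODEL_PAGE_SIZE are hypothetical names, and the sketch assumes 4 KiB pages.

```c
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL  /* assumes 4 KiB pages, as on x86 */

/*
 * Same comparison as kvm_vm_ioctl_set_tss_addr(). The unsigned long
 * product -3 * 4096 wraps, and truncating it to unsigned int leaves
 * 2^32 - 3 * 4096 = 0xFFFFD000, the highest start address whose
 * three pages still end at or below 4 GiB.
 */
static bool tss_addr_ok(unsigned long addr)
{
	return !(addr > (unsigned int)(-3 * MODEL_PAGE_SIZE));
}
```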

2009-07-21 02:42:48 +00:00
static int kvm_vm_ioctl_set_identity_map_addr(struct kvm *kvm,
					      u64 ident_addr)
{
2018-03-20 19:17:19 +00:00
	return kvm_x86_ops->set_identity_map_addr(kvm, ident_addr);
2009-07-21 02:42:48 +00:00
}

2007-10-29 15:08:35 +00:00
static int kvm_vm_ioctl_set_nr_mmu_pages(struct kvm *kvm,
					  u32 kvm_nr_mmu_pages)
{
	if (kvm_nr_mmu_pages < KVM_MIN_ALLOC_MMU_PAGES)
		return -EINVAL;

2009-12-23 16:35:26 +00:00
	mutex_lock(&kvm->slots_lock);
2007-10-29 15:08:35 +00:00

	kvm_mmu_change_mmu_pages(kvm, kvm_nr_mmu_pages);
2007-12-14 02:01:48 +00:00
	kvm->arch.n_requested_mmu_pages = kvm_nr_mmu_pages;
2007-10-29 15:08:35 +00:00

2009-12-23 16:35:26 +00:00
	mutex_unlock(&kvm->slots_lock);
2007-10-29 15:08:35 +00:00
	return 0;
}

static int kvm_vm_ioctl_get_nr_mmu_pages(struct kvm *kvm)
{
2010-08-20 01:11:14 +00:00
	return kvm->arch.n_max_mmu_pages;
2007-10-29 15:08:35 +00:00
}

static int kvm_vm_ioctl_get_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
{
2017-04-07 08:50:23 +00:00
	struct kvm_pic *pic = kvm->arch.vpic;
2007-10-29 15:08:35 +00:00
	int r;

	r = 0;
	switch (chip->chip_id) {
	case KVM_IRQCHIP_PIC_MASTER:
2017-04-07 08:50:23 +00:00
		memcpy(&chip->chip.pic, &pic->pics[0],
2007-10-29 15:08:35 +00:00
			sizeof(struct kvm_pic_state));
		break;
	case KVM_IRQCHIP_PIC_SLAVE:
2017-04-07 08:50:23 +00:00
		memcpy(&chip->chip.pic, &pic->pics[1],
2007-10-29 15:08:35 +00:00
			sizeof(struct kvm_pic_state));
		break;
	case KVM_IRQCHIP_IOAPIC:
2017-04-07 08:50:27 +00:00
		kvm_get_ioapic(kvm, &chip->chip.ioapic);
2007-10-29 15:08:35 +00:00
		break;
	default:
		r = -EINVAL;
		break;
	}
	return r;
}

static int kvm_vm_ioctl_set_irqchip(struct kvm *kvm, struct kvm_irqchip *chip)
{
2017-04-07 08:50:23 +00:00
	struct kvm_pic *pic = kvm->arch.vpic;
2007-10-29 15:08:35 +00:00
	int r;

	r = 0;
	switch (chip->chip_id) {
	case KVM_IRQCHIP_PIC_MASTER:
2017-04-07 08:50:23 +00:00
		spin_lock(&pic->lock);
		memcpy(&pic->pics[0], &chip->chip.pic,
2007-10-29 15:08:35 +00:00
			sizeof(struct kvm_pic_state));
2017-04-07 08:50:23 +00:00
		spin_unlock(&pic->lock);
2007-10-29 15:08:35 +00:00
		break;
	case KVM_IRQCHIP_PIC_SLAVE:
2017-04-07 08:50:23 +00:00
		spin_lock(&pic->lock);
		memcpy(&pic->pics[1], &chip->chip.pic,
2007-10-29 15:08:35 +00:00
			sizeof(struct kvm_pic_state));
2017-04-07 08:50:23 +00:00
		spin_unlock(&pic->lock);
2007-10-29 15:08:35 +00:00
		break;
	case KVM_IRQCHIP_IOAPIC:
2017-04-07 08:50:27 +00:00
		kvm_set_ioapic(kvm, &chip->chip.ioapic);
2007-10-29 15:08:35 +00:00
		break;
	default:
		r = -EINVAL;
		break;
	}
2017-04-07 08:50:23 +00:00
	kvm_pic_update_irq(pic);
2007-10-29 15:08:35 +00:00
	return r;
}

2008-03-03 16:50:59 +00:00
static int kvm_vm_ioctl_get_pit(struct kvm *kvm, struct kvm_pit_state *ps)
{
2016-03-02 21:56:50 +00:00
	struct kvm_kpit_state *kps = &kvm->arch.vpit->pit_state;

	BUILD_BUG_ON(sizeof(*ps) != sizeof(kps->channels));

	mutex_lock(&kps->lock);
	memcpy(ps, &kps->channels, sizeof(*ps));
	mutex_unlock(&kps->lock);
2015-10-30 07:26:11 +00:00
	return 0;
2008-03-03 16:50:59 +00:00
}
|
|
|
|
|
|
|
|
static int kvm_vm_ioctl_set_pit(struct kvm *kvm, struct kvm_pit_state *ps)
|
|
|
|
{
|
2015-11-18 22:50:23 +00:00
|
|
|
int i;
|
2016-03-02 21:56:43 +00:00
|
|
|
struct kvm_pit *pit = kvm->arch.vpit;
|
|
|
|
|
|
|
|
mutex_lock(&pit->pit_state.lock);
|
2016-03-02 21:56:50 +00:00
|
|
|
memcpy(&pit->pit_state.channels, ps, sizeof(*ps));
|
2015-11-18 22:50:23 +00:00
|
|
|
for (i = 0; i < 3; i++)
|
2016-03-02 21:56:43 +00:00
|
|
|
kvm_pit_load_count(pit, i, ps->channels[i].count, 0);
|
|
|
|
mutex_unlock(&pit->pit_state.lock);
|
2015-10-30 07:26:11 +00:00
|
|
|
return 0;
|
2009-07-07 15:50:38 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int kvm_vm_ioctl_get_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
|
|
|
|
{
|
|
|
|
mutex_lock(&kvm->arch.vpit->pit_state.lock);
|
|
|
|
memcpy(ps->channels, &kvm->arch.vpit->pit_state.channels,
|
|
|
|
sizeof(ps->channels));
|
|
|
|
ps->flags = kvm->arch.vpit->pit_state.flags;
|
|
|
|
mutex_unlock(&kvm->arch.vpit->pit_state.lock);
|
2010-10-30 18:54:47 +00:00
|
|
|
memset(&ps->reserved, 0, sizeof(ps->reserved));
|
2015-10-30 07:26:11 +00:00
|
|
|
return 0;
|
2009-07-07 15:50:38 +00:00
|
|
|
}

static int kvm_vm_ioctl_set_pit2(struct kvm *kvm, struct kvm_pit_state2 *ps)
{
	int start = 0;
	int i;
	u32 prev_legacy, cur_legacy;
	struct kvm_pit *pit = kvm->arch.vpit;

	mutex_lock(&pit->pit_state.lock);
	prev_legacy = pit->pit_state.flags & KVM_PIT_FLAGS_HPET_LEGACY;
	cur_legacy = ps->flags & KVM_PIT_FLAGS_HPET_LEGACY;
	if (!prev_legacy && cur_legacy)
		start = 1;
	memcpy(&pit->pit_state.channels, &ps->channels,
	       sizeof(pit->pit_state.channels));
	pit->pit_state.flags = ps->flags;
	for (i = 0; i < 3; i++)
		kvm_pit_load_count(pit, i, pit->pit_state.channels[i].count,
				   start && i == 0);
	mutex_unlock(&pit->pit_state.lock);
	return 0;
}
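The HPET legacy handling in kvm_vm_ioctl_set_pit2() reduces to an edge test on a single flag bit. The sketch below isolates that test as plain C, assuming a hypothetical DEMO_FLAG_HPET_LEGACY in place of the real KVM_PIT_FLAGS_HPET_LEGACY value. The helper names are illustrative only, and the sketch does not model what kvm_pit_load_count() does with its final argument.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for KVM_PIT_FLAGS_HPET_LEGACY. */
#define DEMO_FLAG_HPET_LEGACY 0x1u

/* True only on the off -> on transition of HPET legacy mode. */
static bool demo_legacy_switched_on(uint32_t prev_flags, uint32_t new_flags)
{
	uint32_t prev_legacy = prev_flags & DEMO_FLAG_HPET_LEGACY;
	uint32_t cur_legacy = new_flags & DEMO_FLAG_HPET_LEGACY;

	return !prev_legacy && cur_legacy;
}

/* Mirrors the per-channel `start && i == 0` argument (PIT has 3 channels). */
static bool demo_start_arg(int channel, uint32_t prev_flags, uint32_t new_flags)
{
	assert(channel >= 0 && channel < 3);
	return demo_legacy_switched_on(prev_flags, new_flags) && channel == 0;
}
```

Only the off-to-on transition yields true, and only for channel 0; setting legacy mode when it is already on, or clearing it, yields false.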

static int kvm_vm_ioctl_reinject(struct kvm *kvm,
				 struct kvm_reinject_control *control)
{
	struct kvm_pit *pit = kvm->arch.vpit;

	if (!pit)
		return -ENXIO;

	/* pit->pit_state.lock was overloaded to prevent userspace from getting
	 * an inconsistent state after running multiple KVM_REINJECT_CONTROL
	 * ioctls in parallel. Use a separate lock if that ioctl isn't rare.
	 */
	mutex_lock(&pit->pit_state.lock);
	kvm_pit_set_reinject(pit, control->pit_reinject);
	mutex_unlock(&pit->pit_state.lock);

	return 0;
}

/**
 * kvm_vm_ioctl_get_dirty_log - get and clear the log of dirty pages in a slot
 * @kvm: kvm instance
 * @log: slot id and address to which we copy the log
 *
 * Steps 1-4 below give a general overview of dirty page logging. See the
 * kvm_get_dirty_log_protect() function description for additional details.
 *
 * We call kvm_get_dirty_log_protect() to handle steps 1-3. On return, we
 * flush the TLB (step 4) whenever any page was write-protected, even if a
 * later step failed and the dirty bitmap may be corrupt. Whatever the
 * outcome, the KVM logging API does not prevent user space from reading the
 * dirty log again, so flushing the TLB ensures that subsequent writes are
 * marked dirty for the next log read.
 *
 * 1. Take a snapshot of the bit and clear it if needed.
 * 2. Write protect the corresponding page.
 * 3. Copy the snapshot to userspace.
 * 4. Flush TLBs if needed.
 */
int kvm_vm_ioctl_get_dirty_log(struct kvm *kvm, struct kvm_dirty_log *log)
{
	bool is_dirty = false;
	int r;

	mutex_lock(&kvm->slots_lock);

	/*
	 * Flush potentially hardware-cached dirty pages to dirty_bitmap.
	 */
	if (kvm_x86_ops->flush_log_dirty)
		kvm_x86_ops->flush_log_dirty(kvm);

	r = kvm_get_dirty_log_protect(kvm, log, &is_dirty);

	/*
	 * All the TLBs can be flushed out of mmu lock, see the comments in
	 * kvm_mmu_slot_remove_write_access().
	 */
	lockdep_assert_held(&kvm->slots_lock);
	if (is_dirty)
		kvm_flush_remote_tlbs(kvm);

	mutex_unlock(&kvm->slots_lock);
	return r;
}
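Steps 1-4 in the comment above kvm_vm_ioctl_get_dirty_log() can be illustrated with a toy, single-threaded model. Everything in it is hypothetical (struct demo_slot, its per-page bool arrays, and the helpers): real KVM uses bitmaps, page-table write protection and hardware TLBs, and holds locks that this sketch omits. The tlb_writable array stands in for stale writable translations, which is what makes step 4 necessary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PAGES 16

struct demo_slot {
	bool dirty[DEMO_PAGES];        /* written since the last log read */
	bool wprotect[DEMO_PAGES];     /* write-protected in the page tables */
	bool tlb_writable[DEMO_PAGES]; /* writable translation cached in a TLB */
};

/* Dirty logging starts with every page write-protected and nothing cached. */
static void demo_slot_init(struct demo_slot *s)
{
	memset(s, 0, sizeof(*s));
	for (size_t i = 0; i < DEMO_PAGES; i++)
		s->wprotect[i] = true;
}

/* A guest write; a cached writable translation bypasses tracking. */
static void demo_guest_write(struct demo_slot *s, size_t page)
{
	assert(page < DEMO_PAGES);
	if (s->tlb_writable[page])
		return;                        /* write goes through unnoticed */
	if (s->wprotect[page]) {
		s->dirty[page] = true;         /* write fault: log the page */
		s->wprotect[page] = false;
	}
	s->tlb_writable[page] = true;
}

/* Steps 1-4: snapshot and clear, re-protect, copy out, flush if needed. */
static bool demo_get_dirty_log(struct demo_slot *s, bool out[DEMO_PAGES])
{
	bool is_dirty = false;

	for (size_t i = 0; i < DEMO_PAGES; i++) {
		out[i] = s->dirty[i];          /* steps 1 and 3 */
		if (s->dirty[i]) {
			s->dirty[i] = false;
			s->wprotect[i] = true;     /* step 2 */
			is_dirty = true;
		}
	}
	if (is_dirty)                      /* step 4 */
		memset(s->tlb_writable, 0, sizeof(s->tlb_writable));
	return is_dirty;
}
```

If the step 4 flush were skipped, a second write to the same page would hit the stale writable translation and never be logged, which is the failure mode the comment above warns about.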

int kvm_vm_ioctl_irq_line(struct kvm *kvm, struct kvm_irq_level *irq_event,
			bool line_status)
{
	if (!irqchip_in_kernel(kvm))
		return -ENXIO;

	irq_event->status = kvm_set_irq(kvm, KVM_USERSPACE_IRQ_SOURCE_ID,
					irq_event->irq, irq_event->level,
					line_status);
	return 0;
}

static int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
				   struct kvm_enable_cap *cap)
{
	int r;

	if (cap->flags)
		return -EINVAL;

	switch (cap->cap) {
	case KVM_CAP_DISABLE_QUIRKS:
		kvm->arch.disabled_quirks = cap->args[0];
		r = 0;
		break;
	case KVM_CAP_SPLIT_IRQCHIP: {
		mutex_lock(&kvm->lock);
		r = -EINVAL;
		if (cap->args[0] > MAX_NR_RESERVED_IOAPIC_PINS)
			goto split_irqchip_unlock;
		r = -EEXIST;
		if (irqchip_in_kernel(kvm))
			goto split_irqchip_unlock;
		if (kvm->created_vcpus)
			goto split_irqchip_unlock;
		r = kvm_setup_empty_irq_routing(kvm);
		if (r)
			goto split_irqchip_unlock;
		/* Pairs with irqchip_in_kernel. */
		smp_wmb();
		kvm->arch.irqchip_mode = KVM_IRQCHIP_SPLIT;
		kvm->arch.nr_reserved_ioapic_pins = cap->args[0];
		r = 0;
split_irqchip_unlock:
		mutex_unlock(&kvm->lock);
		break;
	}
	case KVM_CAP_X2APIC_API:
		r = -EINVAL;
		if (cap->args[0] & ~KVM_X2APIC_API_VALID_FLAGS)
			break;

		if (cap->args[0] & KVM_X2APIC_API_USE_32BIT_IDS)
			kvm->arch.x2apic_format = true;
		if (cap->args[0] & KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK)
			kvm->arch.x2apic_broadcast_quirk_disabled = true;

		r = 0;
		break;
	case KVM_CAP_X86_DISABLE_EXITS:
		r = -EINVAL;
		if (cap->args[0] & ~KVM_X86_DISABLE_VALID_EXITS)
			break;

		if ((cap->args[0] & KVM_X86_DISABLE_EXITS_MWAIT) &&
			kvm_can_mwait_in_guest())
			kvm->arch.mwait_in_guest = true;
		if (cap->args[0] & KVM_X86_DISABLE_EXITS_HLT)
			kvm->arch.hlt_in_guest = true;
		if (cap->args[0] & KVM_X86_DISABLE_EXITS_PAUSE)
			kvm->arch.pause_in_guest = true;
		r = 0;
		break;
	default:
		r = -EINVAL;
		break;
	}
	return r;
}

long kvm_arch_vm_ioctl(struct file *filp,
		       unsigned int ioctl, unsigned long arg)
{
	struct kvm *kvm = filp->private_data;
	void __user *argp = (void __user *)arg;
	int r = -ENOTTY;
	/*
	 * This union makes it completely explicit to gcc-3.x
	 * that these variables' stack usage should be
	 * combined, not added together.
	 */
	union {
		struct kvm_pit_state ps;
		struct kvm_pit_state2 ps2;
		struct kvm_pit_config pit_config;
	} u;

	switch (ioctl) {
	case KVM_SET_TSS_ADDR:
		r = kvm_vm_ioctl_set_tss_addr(kvm, arg);
		break;
	case KVM_SET_IDENTITY_MAP_ADDR: {
		u64 ident_addr;

		mutex_lock(&kvm->lock);
		r = -EINVAL;
		if (kvm->created_vcpus)
			goto set_identity_unlock;
		r = -EFAULT;
		if (copy_from_user(&ident_addr, argp, sizeof ident_addr))
			goto set_identity_unlock;
		r = kvm_vm_ioctl_set_identity_map_addr(kvm, ident_addr);
set_identity_unlock:
		mutex_unlock(&kvm->lock);
		break;
	}
	case KVM_SET_NR_MMU_PAGES:
		r = kvm_vm_ioctl_set_nr_mmu_pages(kvm, arg);
		break;
	case KVM_GET_NR_MMU_PAGES:
		r = kvm_vm_ioctl_get_nr_mmu_pages(kvm);
		break;
	case KVM_CREATE_IRQCHIP: {
		mutex_lock(&kvm->lock);

		r = -EEXIST;
		if (irqchip_in_kernel(kvm))
			goto create_irqchip_unlock;

		r = -EINVAL;
		if (kvm->created_vcpus)
			goto create_irqchip_unlock;

		r = kvm_pic_init(kvm);
		if (r)
			goto create_irqchip_unlock;

		r = kvm_ioapic_init(kvm);
		if (r) {
			kvm_pic_destroy(kvm);
			goto create_irqchip_unlock;
		}

		r = kvm_setup_default_irq_routing(kvm);
		if (r) {
			kvm_ioapic_destroy(kvm);
			kvm_pic_destroy(kvm);
			goto create_irqchip_unlock;
		}
		/* Write kvm->irq_routing before enabling irqchip_in_kernel. */
		smp_wmb();
		kvm->arch.irqchip_mode = KVM_IRQCHIP_KERNEL;
create_irqchip_unlock:
		mutex_unlock(&kvm->lock);
		break;
	}
	case KVM_CREATE_PIT:
		u.pit_config.flags = KVM_PIT_SPEAKER_DUMMY;
		goto create_pit;
	case KVM_CREATE_PIT2:
		r = -EFAULT;
		if (copy_from_user(&u.pit_config, argp,
				   sizeof(struct kvm_pit_config)))
			goto out;
create_pit:
KVM: x86: protect KVM_CREATE_PIT/KVM_CREATE_PIT2 with kvm->lock
The syzkaller folks reported a NULL pointer dereference that seems
to be caused by a race between KVM_CREATE_IRQCHIP and KVM_CREATE_PIT2.
The former takes kvm->lock (except when registering the devices,
which needs kvm->slots_lock); the latter takes kvm->slots_lock only.
Change KVM_CREATE_PIT2 to follow the same model as KVM_CREATE_IRQCHIP.
Testcase:
#include <pthread.h>
#include <linux/kvm.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
long r[23];
void* thr1(void* arg)
{
struct kvm_pit_config pitcfg = { .flags = 4 };
switch ((long)arg) {
case 0: r[2] = open("/dev/kvm", O_RDONLY|O_ASYNC); break;
case 1: r[3] = ioctl(r[2], KVM_CREATE_VM, 0); break;
case 2: r[4] = ioctl(r[3], KVM_CREATE_IRQCHIP, 0); break;
case 3: r[22] = ioctl(r[3], KVM_CREATE_PIT2, &pitcfg); break;
}
return 0;
}
int main(int argc, char **argv)
{
long i;
pthread_t th[4];
memset(r, -1, sizeof(r));
for (i = 0; i < 4; i++) {
pthread_create(&th[i], 0, thr1, (void*)i);
if (argc > 1 && rand()%2) usleep(rand()%1000);
}
usleep(20000);
return 0;
}
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
		mutex_lock(&kvm->lock);
		r = -EEXIST;
		if (kvm->arch.vpit)
			goto create_pit_unlock;
		r = -ENOMEM;
		kvm->arch.vpit = kvm_create_pit(kvm, u.pit_config.flags);
		if (kvm->arch.vpit)
			r = 0;
create_pit_unlock:
		mutex_unlock(&kvm->lock);
		break;
	case KVM_GET_IRQCHIP: {
		/* 0: PIC master, 1: PIC slave, 2: IOAPIC */
		struct kvm_irqchip *chip;

		chip = memdup_user(argp, sizeof(*chip));
		if (IS_ERR(chip)) {
			r = PTR_ERR(chip);
			goto out;
		}

		r = -ENXIO;
		if (!irqchip_kernel(kvm))
			goto get_irqchip_out;
		r = kvm_vm_ioctl_get_irqchip(kvm, chip);
		if (r)
			goto get_irqchip_out;
		r = -EFAULT;
		if (copy_to_user(argp, chip, sizeof *chip))
			goto get_irqchip_out;
		r = 0;
get_irqchip_out:
		kfree(chip);
		break;
	}
	case KVM_SET_IRQCHIP: {
		/* 0: PIC master, 1: PIC slave, 2: IOAPIC */
		struct kvm_irqchip *chip;

		chip = memdup_user(argp, sizeof(*chip));
		if (IS_ERR(chip)) {
			r = PTR_ERR(chip);
			goto out;
		}

		r = -ENXIO;
		if (!irqchip_kernel(kvm))
			goto set_irqchip_out;
		r = kvm_vm_ioctl_set_irqchip(kvm, chip);
		if (r)
			goto set_irqchip_out;
		r = 0;
set_irqchip_out:
		kfree(chip);
		break;
	}
	case KVM_GET_PIT: {
		r = -EFAULT;
		if (copy_from_user(&u.ps, argp, sizeof(struct kvm_pit_state)))
			goto out;
		r = -ENXIO;
		if (!kvm->arch.vpit)
			goto out;
		r = kvm_vm_ioctl_get_pit(kvm, &u.ps);
		if (r)
			goto out;
		r = -EFAULT;
		if (copy_to_user(argp, &u.ps, sizeof(struct kvm_pit_state)))
			goto out;
		r = 0;
		break;
	}
	case KVM_SET_PIT: {
		r = -EFAULT;
		if (copy_from_user(&u.ps, argp, sizeof u.ps))
			goto out;
		r = -ENXIO;
		if (!kvm->arch.vpit)
			goto out;
		r = kvm_vm_ioctl_set_pit(kvm, &u.ps);
		break;
	}
	case KVM_GET_PIT2: {
		r = -ENXIO;
		if (!kvm->arch.vpit)
			goto out;
		r = kvm_vm_ioctl_get_pit2(kvm, &u.ps2);
		if (r)
			goto out;
		r = -EFAULT;
		if (copy_to_user(argp, &u.ps2, sizeof(u.ps2)))
			goto out;
		r = 0;
		break;
	}
	case KVM_SET_PIT2: {
		r = -EFAULT;
		if (copy_from_user(&u.ps2, argp, sizeof(u.ps2)))
			goto out;
		r = -ENXIO;
		if (!kvm->arch.vpit)
			goto out;
		r = kvm_vm_ioctl_set_pit2(kvm, &u.ps2);
		break;
	}
	case KVM_REINJECT_CONTROL: {
		struct kvm_reinject_control control;
		r = -EFAULT;
		if (copy_from_user(&control, argp, sizeof(control)))
			goto out;
		r = kvm_vm_ioctl_reinject(kvm, &control);
		break;
	}
	case KVM_SET_BOOT_CPU_ID:
		r = 0;
		mutex_lock(&kvm->lock);
		if (kvm->created_vcpus)
			r = -EBUSY;
		else
			kvm->arch.bsp_vcpu_id = arg;
		mutex_unlock(&kvm->lock);
		break;
	case KVM_XEN_HVM_CONFIG: {
		struct kvm_xen_hvm_config xhc;
		r = -EFAULT;
		if (copy_from_user(&xhc, argp, sizeof(xhc)))
			goto out;
		r = -EINVAL;
		if (xhc.flags)
			goto out;
		memcpy(&kvm->arch.xen_hvm_config, &xhc, sizeof(xhc));
		r = 0;
		break;
	}
	case KVM_SET_CLOCK: {
		struct kvm_clock_data user_ns;
		u64 now_ns;

		r = -EFAULT;
		if (copy_from_user(&user_ns, argp, sizeof(user_ns)))
			goto out;

		r = -EINVAL;
		if (user_ns.flags)
			goto out;

		r = 0;
		/*
		 * TODO: userspace has to take care of races with VCPU_RUN, so
		 * kvm_gen_update_masterclock() can be cut down to locked
		 * pvclock_update_vm_gtod_copy().
		 */
		kvm_gen_update_masterclock(kvm);
KVM: x86: remove irq disablement around KVM_SET_CLOCK/KVM_GET_CLOCK
The disablement of interrupts at KVM_SET_CLOCK/KVM_GET_CLOCK
attempts to disable software suspend from causing "non atomic behaviour" of
the operation:
Add a helper function to compute the kernel time and convert nanoseconds
back to CPU specific cycles. Note that these must not be called in preemptible
context, as that would mean the kernel could enter software suspend state,
which would cause non-atomic operation.
However, assume the kernel can enter software suspend at the following 2 points:
ktime_get_ts(&ts);
1.
hypothetical_ktime_get_ts(&ts)
monotonic_to_bootbased(&ts);
2.
monotonic_to_bootbased() should be correct relative to a ktime_get_ts(&ts)
performed after point 1 (that is after resuming from software suspend),
hypothetical_ktime_get_ts()
Therefore it is also correct for the ktime_get_ts(&ts) before point 1,
which is
ktime_get_ts(&ts) = hypothetical_ktime_get_ts(&ts) + time-to-execute-suspend-code
Note CLOCK_MONOTONIC does not count during suspension.
So remove the irq disablement, which causes the following warning on
-RT kernels:
With this reasoning, and the -RT bug that the irq disablement causes
(because spin_lock is now a sleeping lock), remove the IRQ protection as it
causes:
[ 1064.668109] in_atomic(): 0, irqs_disabled(): 1, pid: 15296, name:m
[ 1064.668110] INFO: lockdep is turned off.
[ 1064.668110] irq event stamp: 0
[ 1064.668112] hardirqs last enabled at (0): [< (null)>] )
[ 1064.668116] hardirqs last disabled at (0): [] c0
[ 1064.668118] softirqs last enabled at (0): [] c0
[ 1064.668118] softirqs last disabled at (0): [< (null)>] )
[ 1064.668121] CPU: 13 PID: 15296 Comm: qemu-kvm Not tainted 3.10.0-1
[ 1064.668121] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 5
[ 1064.668123] ffff8c1796b88000 00000000afe7344c ffff8c179abf3c68 f3
[ 1064.668125] ffff8c179abf3c90 ffffffff930ccb3d ffff8c1b992b3610 f0
[ 1064.668126] 00007ffc1a26fbc0 ffff8c179abf3cb0 ffffffff9375f694 f0
[ 1064.668126] Call Trace:
[ 1064.668132] [] dump_stack+0x19/0x1b
[ 1064.668135] [] __might_sleep+0x12d/0x1f0
[ 1064.668138] [] rt_spin_lock+0x24/0x60
[ 1064.668155] [] __get_kvmclock_ns+0x36/0x110 [k]
[ 1064.668159] [] ? futex_wait_queue_me+0x103/0x10
[ 1064.668171] [] kvm_arch_vm_ioctl+0xa2/0xd70 [k]
[ 1064.668173] [] ? futex_wait+0x1ac/0x2a0
v2: notice get_kvmclock_ns with the same problem (Pankaj).
v3: remove useless helper function (Pankaj).
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
		now_ns = get_kvmclock_ns(kvm);
		kvm->arch.kvmclock_offset += user_ns.clock - now_ns;
		kvm_make_all_cpus_request(kvm, KVM_REQ_CLOCK_UPDATE);
		break;
	}
	case KVM_GET_CLOCK: {
		struct kvm_clock_data user_ns;
		u64 now_ns;

		now_ns = get_kvmclock_ns(kvm);
		user_ns.clock = now_ns;
		user_ns.flags = kvm->arch.use_master_clock ? KVM_CLOCK_TSC_STABLE : 0;
		memset(&user_ns.pad, 0, sizeof(user_ns.pad));

		r = -EFAULT;
		if (copy_to_user(argp, &user_ns, sizeof(user_ns)))
			goto out;
		r = 0;
		break;
	}
	case KVM_ENABLE_CAP: {
		struct kvm_enable_cap cap;

		r = -EFAULT;
		if (copy_from_user(&cap, argp, sizeof(cap)))
			goto out;
		r = kvm_vm_ioctl_enable_cap(kvm, &cap);
		break;
	}
	case KVM_MEMORY_ENCRYPT_OP: {
		r = -ENOTTY;
		if (kvm_x86_ops->mem_enc_op)
			r = kvm_x86_ops->mem_enc_op(kvm, argp);
		break;
	}
	case KVM_MEMORY_ENCRYPT_REG_REGION: {
		struct kvm_enc_region region;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = -ENOTTY;
		if (kvm_x86_ops->mem_enc_reg_region)
			r = kvm_x86_ops->mem_enc_reg_region(kvm, &region);
		break;
	}
	case KVM_MEMORY_ENCRYPT_UNREG_REGION: {
		struct kvm_enc_region region;

		r = -EFAULT;
		if (copy_from_user(&region, argp, sizeof(region)))
			goto out;

		r = -ENOTTY;
		if (kvm_x86_ops->mem_enc_unreg_region)
			r = kvm_x86_ops->mem_enc_unreg_region(kvm, &region);
		break;
	}
	case KVM_HYPERV_EVENTFD: {
		struct kvm_hyperv_eventfd hvevfd;

		r = -EFAULT;
		if (copy_from_user(&hvevfd, argp, sizeof(hvevfd)))
			goto out;
		r = kvm_vm_ioctl_hv_eventfd(kvm, &hvevfd);
		break;
	}
	default:
		r = -ENOTTY;
	}
out:
	return r;
}
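The KVM_SET_CLOCK case adds `user_ns.clock - now_ns` to kvm->arch.kvmclock_offset so that the guest clock reads the requested value immediately afterwards. The toy model below shows just that arithmetic, assuming for simplicity that the guest clock is a host nanosecond counter plus the offset; the real get_kvmclock_ns() is more involved (for example when the master clock is in use). struct demo_clock and its helpers are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model: guest clock = host ns + offset (simplified, hypothetical). */
struct demo_clock {
	uint64_t offset;
};

static uint64_t demo_read(const struct demo_clock *c, uint64_t host_ns)
{
	return host_ns + c->offset;
}

/* Mirrors: kvmclock_offset += user_ns.clock - now_ns; */
static void demo_set(struct demo_clock *c, uint64_t host_ns, uint64_t target)
{
	uint64_t now_ns = demo_read(c, host_ns);

	c->offset += target - now_ns;
	assert(demo_read(c, host_ns) == target);
}
```

Because this unsigned sketch works modulo 2^64, moving the clock backwards behaves the same as moving it forwards, and the guest clock keeps advancing at the host rate after the adjustment.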

static void kvm_init_msr_list(void)
{
	u32 dummy[2];
	unsigned i, j;

	for (i = j = 0; i < ARRAY_SIZE(msrs_to_save); i++) {
		if (rdmsr_safe(msrs_to_save[i], &dummy[0], &dummy[1]) < 0)
			continue;

		/*
		 * Even MSRs that are valid in the host may not be exposed
		 * to the guests in some cases.
		 */
		switch (msrs_to_save[i]) {
		case MSR_IA32_BNDCFGS:
			if (!kvm_x86_ops->mpx_supported())
				continue;
			break;
		case MSR_TSC_AUX:
			if (!kvm_x86_ops->rdtscp_supported())
				continue;
			break;
		default:
			break;
		}

		if (j < i)
			msrs_to_save[j] = msrs_to_save[i];
		j++;
	}
	num_msrs_to_save = j;

	for (i = j = 0; i < ARRAY_SIZE(emulated_msrs); i++) {
		if (!kvm_x86_ops->has_emulated_msr(emulated_msrs[i]))
			continue;

		if (j < i)
			emulated_msrs[j] = emulated_msrs[i];
		j++;
	}
	num_emulated_msrs = j;

	for (i = j = 0; i < ARRAY_SIZE(msr_based_features); i++) {
		struct kvm_msr_entry msr;

		msr.index = msr_based_features[i];
		if (kvm_get_msr_feature(&msr))
			continue;

		if (j < i)
			msr_based_features[j] = msr_based_features[i];
		j++;
	}
	num_msr_based_features = j;
}
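Each of the three loops in kvm_init_msr_list() filters an array in place with two indices: i scans every entry, j counts the survivors, and an entry is copied down only when earlier entries were dropped (j < i). A generic sketch of that pattern follows, with a hypothetical predicate (demo_is_even) standing in for rdmsr_safe(), has_emulated_msr() or kvm_get_msr_feature().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Keep entries for which keep() is true, preserving order; return new count. */
static size_t demo_compact(uint32_t *arr, size_t n, bool (*keep)(uint32_t))
{
	size_t i, j;

	assert(arr != NULL || n == 0);
	for (i = j = 0; i < n; i++) {
		if (!keep(arr[i]))
			continue;
		if (j < i)          /* only copy once a gap has opened up */
			arr[j] = arr[i];
		j++;
	}
	return j;
}

/* Hypothetical predicate used for the example. */
static bool demo_is_even(uint32_t v)
{
	return (v % 2) == 0;
}
```

Order is preserved and no second buffer is needed. Entries past the returned count are left as stale data, which is why the code above records j in num_msrs_to_save and its siblings instead of relying on ARRAY_SIZE().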

static int vcpu_mmio_write(struct kvm_vcpu *vcpu, gpa_t addr, int len,
			   const void *v)
KVM: Portability: Move x86 emulation and mmio device hook to x86.c
This patch moves the following functions from kvm_main.c to x86.c:
emulator_read/write_std, vcpu_find_pervcpu_dev, vcpu_find_mmio_dev,
emulator_read/write_emulated, emulator_write_phys,
emulator_write_emulated_onepage, emulator_cmpxchg_emulated,
get_setment_base, emulate_invlpg, emulate_clts, emulator_get/set_dr,
kvm_report_emulation_failure, emulate_instruction
The following data type is moved to x86.c:
struct x86_emulate_ops emulate_ops
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
{
	int handled = 0;
	int n;

	do {
		n = min(len, 8);
		if (!(lapic_in_kernel(vcpu) &&
		      !kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, addr, n, v))
		    && kvm_io_bus_write(vcpu, KVM_MMIO_BUS, addr, n, v))
			break;
		handled += n;
		addr += n;
		len -= n;
		v += n;
	} while (len);

	return handled;
}
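vcpu_mmio_write() splits an access into chunks of at most 8 bytes, offers each chunk to the in-kernel local APIC and then to the MMIO bus, and stops at the first chunk that nobody claims, returning the number of bytes handled. The sketch below keeps that loop shape but replaces both devices with a single hypothetical callback type (demo_mmio_fn) and uses a byte pointer instead of GNU C void-pointer arithmetic. demo_ctx and demo_handler form an illustrative device that refuses accesses ending past a limit.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device callback: returns 0 if it accepted the access. */
typedef int (*demo_mmio_fn)(unsigned long addr, int len, const void *v,
			    void *ctx);

/* Same loop shape as vcpu_mmio_write(): 8-byte chunks, stop on refusal. */
static int demo_mmio_write(unsigned long addr, int len, const void *v,
			   demo_mmio_fn handler, void *ctx)
{
	const unsigned char *p = v;
	int handled = 0;
	int n;

	assert(len >= 0);
	do {
		n = len < 8 ? len : 8;
		if (handler(addr, n, p, ctx))
			break;
		handled += n;
		addr += n;
		len -= n;
		p += n;
	} while (len);

	return handled;
}

/* Illustrative device: counts calls, refuses accesses that end past limit. */
struct demo_ctx {
	int calls;
	unsigned long limit;
};

static int demo_handler(unsigned long addr, int len, const void *v, void *ctx)
{
	struct demo_ctx *c = ctx;

	(void)v;
	c->calls++;
	return addr + (unsigned long)len > c->limit;
}
```

A 20-byte write is issued as 8 + 8 + 4 bytes. With a limit of 10, the second chunk is refused and the function reports 8 bytes handled, leaving the remainder to the caller.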
|
|
|
|
|
2009-06-29 19:24:32 +00:00
static int vcpu_mmio_read(struct kvm_vcpu *vcpu, gpa_t addr, int len, void *v)
2007-10-30 17:44:21 +00:00
{
2010-01-19 10:51:22 +00:00
	int handled = 0;
	int n;

	do {
		n = min(len, 8);
2016-01-08 12:48:51 +00:00
		if (!(lapic_in_kernel(vcpu) &&
2015-03-26 14:39:28 +00:00
		      !kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev,
					 addr, n, v))
		    && kvm_io_bus_read(vcpu, KVM_MMIO_BUS, addr, n, v))
2010-01-19 10:51:22 +00:00
			break;
2017-12-15 01:40:50 +00:00
		trace_kvm_mmio(KVM_TRACE_MMIO_READ, n, addr, v);
2010-01-19 10:51:22 +00:00
		handled += n;
		addr += n;
		len -= n;
		v += n;
	} while (len);
2007-10-30 17:44:21 +00:00

2010-01-19 10:51:22 +00:00
	return handled;
2007-10-30 17:44:21 +00:00
}
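`vcpu_mmio_read()` hands devices at most 8 bytes per step (`n = min(len, 8)`) and stops at the first chunk that neither the in-kernel local APIC nor a device on `KVM_MMIO_BUS` claims, returning the number of bytes satisfied so the caller can forward the remainder to user space. The user-space sketch below reproduces only that chunking; `mmio_handler_fn`, `chunked_mmio_read()` and `demo_device()` are hypothetical stand-ins for the APIC and bus lookups, and it uses a `while` loop, so a zero-length request does nothing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device callback: returns 0 if it handled the access. */
typedef int (*mmio_handler_fn)(uint64_t addr, int len, void *out);

/*
 * Split a read into chunks of at most 8 bytes, as vcpu_mmio_read() does.
 * Returns the number of bytes handled before the first unclaimed chunk.
 */
int chunked_mmio_read(mmio_handler_fn handler, uint64_t addr, int len, void *v)
{
	unsigned char *p = v;
	int handled = 0;

	while (len > 0) {
		int n = len < 8 ? len : 8;

		if (handler(addr, n, p) != 0)
			break;	/* unclaimed: the rest would go to user space */
		handled += n;
		addr += n;
		len -= n;
		p += n;
	}
	return handled;
}

/* Demo device: claims only the 16 bytes at 0x1000..0x100f and returns 0xab. */
int demo_device(uint64_t addr, int len, void *out)
{
	if (addr < 0x1000 || addr + len > 0x1010)
		return -1;
	memset(out, 0xab, len);
	return 0;
}
```

For example, a 16-byte read starting 8 bytes before the end of the demo device's window returns 8: the first chunk is claimed and the second is not.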
2010-03-18 13:20:16 +00:00
static void kvm_set_segment(struct kvm_vcpu *vcpu,
			struct kvm_segment *var, int seg)
{
	kvm_x86_ops->set_segment(vcpu, var, seg);
}

void kvm_get_segment(struct kvm_vcpu *vcpu,
		     struct kvm_segment *var, int seg)
{
	kvm_x86_ops->get_segment(vcpu, var, seg);
}

2014-09-02 11:23:06 +00:00
gpa_t translate_nested_gpa(struct kvm_vcpu *vcpu, gpa_t gpa, u32 access,
			   struct x86_exception *exception)
2010-09-10 15:30:54 +00:00
{
	gpa_t t_gpa;

	BUG_ON(!mmu_is_nested(vcpu));

	/* NPT walks are always user-walks */
	access |= PFERR_USER_MASK;
2014-09-02 11:23:06 +00:00
	t_gpa = vcpu->arch.mmu.gva_to_gpa(vcpu, gpa, access, exception);
2010-09-10 15:30:54 +00:00

	return t_gpa;
}

2010-11-22 15:53:26 +00:00
gpa_t kvm_mmu_gva_to_gpa_read(struct kvm_vcpu *vcpu, gva_t gva,
			      struct x86_exception *exception)
2010-02-10 12:21:32 +00:00
{
	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
2010-11-22 15:53:26 +00:00
	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
2010-02-10 12:21:32 +00:00
}

2010-11-22 15:53:26 +00:00
gpa_t kvm_mmu_gva_to_gpa_fetch(struct kvm_vcpu *vcpu, gva_t gva,
			       struct x86_exception *exception)
2010-02-10 12:21:32 +00:00
{
	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
	access |= PFERR_FETCH_MASK;
2010-11-22 15:53:26 +00:00
	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
2010-02-10 12:21:32 +00:00
}

2010-11-22 15:53:26 +00:00
gpa_t kvm_mmu_gva_to_gpa_write(struct kvm_vcpu *vcpu, gva_t gva,
			       struct x86_exception *exception)
2010-02-10 12:21:32 +00:00
{
	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
	access |= PFERR_WRITE_MASK;
2010-11-22 15:53:26 +00:00
	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);
2010-02-10 12:21:32 +00:00
}

/* Use this to access any guest's mapped memory without checking CPL. */
2010-11-22 15:53:26 +00:00
gpa_t kvm_mmu_gva_to_gpa_system(struct kvm_vcpu *vcpu, gva_t gva,
				struct x86_exception *exception)
2010-02-10 12:21:32 +00:00
{
2010-11-22 15:53:26 +00:00
	return vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, 0, exception);
2010-02-10 12:21:32 +00:00
}
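The four `kvm_mmu_gva_to_gpa_*()` wrappers differ only in the page-fault error-code bits they pass to `walk_mmu->gva_to_gpa()`: `PFERR_USER_MASK` when the vCPU runs at CPL 3, plus `PFERR_FETCH_MASK` or `PFERR_WRITE_MASK` for fetches and writes, while the `_system` variant passes 0 and skips the CPL check. A minimal sketch of that mask construction, assuming the architectural x86 bit positions (W/R = bit 1, U/S = bit 2, I/D = bit 4); `enum access_kind` and `compute_access()` are hypothetical, and the `system` flag mirrors the one taken by `emulator_read_std()` and `emulator_write_std()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural x86 page-fault error-code bits (mirroring PFERR_*_MASK). */
#define PF_WRITE (1u << 1)
#define PF_USER  (1u << 2)
#define PF_FETCH (1u << 4)

/* Hypothetical: which kind of access the caller is translating for. */
enum access_kind { ACCESS_READ, ACCESS_FETCH, ACCESS_WRITE };

/*
 * Build the access mask handed to the page-table walker.  A "system"
 * access ignores CPL, so it never carries the USER bit.
 */
uint32_t compute_access(int cpl, enum access_kind kind, bool system)
{
	uint32_t access = (!system && cpl == 3) ? PF_USER : 0;

	if (kind == ACCESS_FETCH)
		access |= PF_FETCH;
	else if (kind == ACCESS_WRITE)
		access |= PF_WRITE;
	return access;
}
```

With `system` set and a read, the result is 0, which matches what `kvm_mmu_gva_to_gpa_system()` passes.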
static int kvm_read_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
				      struct kvm_vcpu *vcpu, u32 access,
2010-11-22 15:53:22 +00:00
				      struct x86_exception *exception)
2007-10-30 17:44:21 +00:00
{
	void *data = val;
2007-12-21 00:18:22 +00:00
	int r = X86EMUL_CONTINUE;
2007-10-30 17:44:21 +00:00

	while (bytes) {
2010-09-10 15:30:49 +00:00
		gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access,
2010-11-22 15:53:26 +00:00
							    exception);
2007-10-30 17:44:21 +00:00
		unsigned offset = addr & (PAGE_SIZE-1);
2008-12-28 23:42:19 +00:00
		unsigned toread = min(bytes, (unsigned)PAGE_SIZE - offset);
2007-10-30 17:44:21 +00:00
		int ret;

2010-11-22 15:53:22 +00:00
		if (gpa == UNMAPPED_GVA)
2010-11-22 15:53:26 +00:00
			return X86EMUL_PROPAGATE_FAULT;
2015-04-08 13:39:23 +00:00
		ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, data,
					       offset, toread);
2007-12-21 00:18:22 +00:00
		if (ret < 0) {
2010-04-28 16:15:35 +00:00
			r = X86EMUL_IO_NEEDED;
2007-12-21 00:18:22 +00:00
			goto out;
		}
2007-10-30 17:44:21 +00:00

2008-12-28 23:42:19 +00:00
		bytes -= toread;
		data += toread;
		addr += toread;
2007-10-30 17:44:21 +00:00
	}
2007-12-21 00:18:22 +00:00
out:
	return r;
2007-10-30 17:44:21 +00:00
}
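`kvm_read_guest_virt_helper()` copies one page fragment per iteration because adjacent guest-virtual pages may map to unrelated guest-physical frames: it translates `addr`, reads at most up to the end of that page, and returns `X86EMUL_PROPAGATE_FAULT` as soon as a page is unmapped. The sketch below models that loop in user space under stated assumptions: 4 KiB pages, a hypothetical four-entry `page_map` table in place of the MMU walk, a flat `guest_ram` array in place of `kvm_vcpu_read_guest_page()`, and local `EMUL_*` codes rather than the kernel's `X86EMUL_*` values.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GUEST_PAGE_SIZE 4096u
#define NPAGES 4
#define UNMAPPED UINT64_MAX

enum { EMUL_CONTINUE = 0, EMUL_FAULT = 1 };	/* local stand-ins for X86EMUL_* */

/* Toy guest: physical memory plus a virtual-page -> physical-page table. */
unsigned char guest_ram[NPAGES * GUEST_PAGE_SIZE];
const int64_t page_map[NPAGES] = { 2, 0, -1, 1 };	/* -1: not mapped */

uint64_t translate(uint64_t gva)
{
	uint64_t vpn = gva / GUEST_PAGE_SIZE;

	if (vpn >= NPAGES || page_map[vpn] < 0)
		return UNMAPPED;
	return (uint64_t)page_map[vpn] * GUEST_PAGE_SIZE + gva % GUEST_PAGE_SIZE;
}

/* Copy 'bytes' starting at guest virtual 'addr', one page fragment at a time. */
int read_guest_virt_sketch(uint64_t addr, void *val, unsigned int bytes)
{
	unsigned char *data = val;

	while (bytes) {
		uint64_t gpa = translate(addr);
		unsigned int offset = addr & (GUEST_PAGE_SIZE - 1);
		unsigned int left = GUEST_PAGE_SIZE - offset;
		unsigned int toread = bytes < left ? bytes : left;

		if (gpa == UNMAPPED)
			return EMUL_FAULT;
		memcpy(data, &guest_ram[gpa], toread);
		bytes -= toread;
		data += toread;
		addr += toread;
	}
	return EMUL_CONTINUE;
}
```

With the table shown, a 2-byte read at virtual address 4095 takes its first byte from physical page 2 and its second from physical page 0.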
2008-12-28 23:42:19 +00:00

2010-02-10 12:21:32 +00:00
/* used for instruction fetching */
2011-04-20 10:37:53 +00:00
static int kvm_fetch_guest_virt(struct x86_emulate_ctxt *ctxt,
				gva_t addr, void *val, unsigned int bytes,
2010-11-22 15:53:22 +00:00
				struct x86_exception *exception)
2010-02-10 12:21:32 +00:00
{
2011-04-20 10:37:53 +00:00
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
2010-02-10 12:21:32 +00:00
	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
2014-05-13 12:02:13 +00:00
	unsigned offset;
	int ret;
2011-04-20 10:37:53 +00:00

2014-05-13 12:02:13 +00:00
	/* Inline kvm_read_guest_virt_helper for speed. */
	gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr, access|PFERR_FETCH_MASK,
						    exception);
	if (unlikely(gpa == UNMAPPED_GVA))
		return X86EMUL_PROPAGATE_FAULT;

	offset = addr & (PAGE_SIZE-1);
	if (WARN_ON(offset + bytes > PAGE_SIZE))
		bytes = (unsigned)PAGE_SIZE - offset;
2015-04-08 13:39:23 +00:00
	ret = kvm_vcpu_read_guest_page(vcpu, gpa >> PAGE_SHIFT, val,
				       offset, bytes);
2014-05-13 12:02:13 +00:00
	if (unlikely(ret < 0))
		return X86EMUL_IO_NEEDED;

	return X86EMUL_CONTINUE;
2010-02-10 12:21:32 +00:00
}
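Unlike the generic helper, `kvm_fetch_guest_virt()` reads from a single page only; a fetch that would run past the end of the page triggers `WARN_ON` and is trimmed to the bytes remaining in that page. The arithmetic, sketched with a hypothetical `clamp_fetch_len()` and assuming 4 KiB pages:

```c
#include <assert.h>
#include <stdint.h>

#define GUEST_PAGE_SIZE 4096u

/*
 * Mirror of the check in kvm_fetch_guest_virt(): a single-page fetch may
 * not run past the end of the page containing addr, so an over-long
 * request is trimmed to the bytes left in that page.
 */
unsigned int clamp_fetch_len(uint64_t addr, unsigned int bytes)
{
	unsigned int offset = addr & (GUEST_PAGE_SIZE - 1);

	if (offset + bytes > GUEST_PAGE_SIZE)	/* the kernel also WARNs here */
		bytes = GUEST_PAGE_SIZE - offset;
	return bytes;
}
```

For instance, a 15-byte fetch at 0x1ffa has offset 0xffa (4090) within its page, so only 6 bytes are read.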
2018-06-06 15:37:49 +00:00
int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
2011-04-20 10:37:53 +00:00
			gva_t addr, void *val, unsigned int bytes,
2010-11-22 15:53:22 +00:00
			struct x86_exception *exception)
2010-02-10 12:21:32 +00:00
{
	u32 access = (kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0;
2011-04-20 10:37:53 +00:00

2010-02-10 12:21:32 +00:00
	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access,
2010-11-22 15:53:22 +00:00
					  exception);
2010-02-10 12:21:32 +00:00
}
2011-05-25 20:04:56 +00:00
EXPORT_SYMBOL_GPL(kvm_read_guest_virt);
2010-02-10 12:21:32 +00:00

2018-06-06 15:37:49 +00:00
static int emulator_read_std(struct x86_emulate_ctxt *ctxt,
			     gva_t addr, void *val, unsigned int bytes,
2018-06-06 15:38:09 +00:00
			     struct x86_exception *exception, bool system)
2010-02-10 12:21:32 +00:00
{
2011-04-20 10:37:53 +00:00
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
2018-06-06 15:38:09 +00:00
	u32 access = 0;

	if (!system && kvm_x86_ops->get_cpl(vcpu) == 3)
		access |= PFERR_USER_MASK;

	return kvm_read_guest_virt_helper(addr, val, bytes, vcpu, access, exception);
2010-02-10 12:21:32 +00:00
}

2015-10-30 15:36:24 +00:00
static int kvm_read_guest_phys_system(struct x86_emulate_ctxt *ctxt,
		unsigned long addr, void *val, unsigned int bytes)
{
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
	int r = kvm_vcpu_read_guest(vcpu, addr, val, bytes);

	return r < 0 ? X86EMUL_IO_NEEDED : X86EMUL_CONTINUE;
}

2018-06-06 15:37:49 +00:00
static int kvm_write_guest_virt_helper(gva_t addr, void *val, unsigned int bytes,
				       struct kvm_vcpu *vcpu, u32 access,
				       struct x86_exception *exception)
2008-12-28 23:42:19 +00:00
{
	void *data = val;
	int r = X86EMUL_CONTINUE;

	while (bytes) {
2010-09-10 15:30:49 +00:00
		gpa_t gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, addr,
2018-06-06 15:37:49 +00:00
							    access,
2010-11-22 15:53:26 +00:00
							    exception);
2008-12-28 23:42:19 +00:00
		unsigned offset = addr & (PAGE_SIZE-1);
		unsigned towrite = min(bytes, (unsigned)PAGE_SIZE - offset);
		int ret;

2010-11-22 15:53:22 +00:00
		if (gpa == UNMAPPED_GVA)
2010-11-22 15:53:26 +00:00
			return X86EMUL_PROPAGATE_FAULT;
2015-04-08 13:39:23 +00:00
		ret = kvm_vcpu_write_guest(vcpu, gpa, data, towrite);
2008-12-28 23:42:19 +00:00
		if (ret < 0) {
2010-04-28 16:15:35 +00:00
			r = X86EMUL_IO_NEEDED;
2008-12-28 23:42:19 +00:00
			goto out;
		}

		bytes -= towrite;
		data += towrite;
		addr += towrite;
	}
out:
	return r;
}
2018-06-06 15:37:49 +00:00

static int emulator_write_std(struct x86_emulate_ctxt *ctxt, gva_t addr, void *val,
2018-06-06 15:38:09 +00:00
			      unsigned int bytes, struct x86_exception *exception,
			      bool system)
2018-06-06 15:37:49 +00:00
{
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
2018-06-06 15:38:09 +00:00
	u32 access = PFERR_WRITE_MASK;

	if (!system && kvm_x86_ops->get_cpl(vcpu) == 3)
		access |= PFERR_USER_MASK;
2018-06-06 15:37:49 +00:00

	return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
2018-06-06 15:38:09 +00:00
					   access, exception);
2018-06-06 15:37:49 +00:00
}

int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu, gva_t addr, void *val,
				unsigned int bytes, struct x86_exception *exception)
{
x86/KVM/VMX: Add L1D flush logic
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flags is set:
- Always, if the flush module parameter is 'always'
- Conditionally at:
- Entry to vcpu_run(), i.e. after executing user space
- From the sched_in notifier, i.e. when switching to a vCPU thread.
- From vmexit handlers which are considered unsafe, i.e. where
sensitive data can be brought into L1D:
- The emulator, which could be a good target for other speculative
execution-based threats,
- The MMU, which can bring host page tables in the L1 cache.
- External interrupts
- Nested operations that require the MMU (see above). That is
vmptrld, vmptrst, vmclear,vmwrite,vmread.
- When handling invept,invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-02 11:07:14 +00:00
	/* kvm_write_guest_virt_system can pull in tons of pages. */
	vcpu->arch.l1tf_flush_l1d = true;

2018-06-06 15:37:49 +00:00
	return kvm_write_guest_virt_helper(addr, val, bytes, vcpu,
					   PFERR_WRITE_MASK, exception);
}
2011-05-25 20:08:00 +00:00
EXPORT_SYMBOL_GPL(kvm_write_guest_virt_system);
2008-12-28 23:42:19 +00:00

2018-04-03 23:28:48 +00:00
int handle_ud(struct kvm_vcpu *vcpu)
{
KVM: X86: Add Force Emulation Prefix for "emulate the next instruction"
There is no easy way to force KVM to run an instruction through the emulator
(by design as that will expose the x86 emulator as a significant attack-surface).
However, we do wish to expose the x86 emulator in case we are testing it
(e.g. via kvm-unit-tests). Therefore, this patch adds a "force emulation prefix"
that is designed to raise #UD which KVM will trap and it's #UD exit-handler will
match "force emulation prefix" to run instruction after prefix by the x86 emulator.
To not expose the x86 emulator by default, we add a module parameter that should
be off by default.
A simple testcase here:
#include <stdio.h>
#include <string.h>
#define HYPERVISOR_INFO 0x40000000
#define CPUID(idx, eax, ebx, ecx, edx) \
asm volatile (\
"ud2a; .ascii \"kvm\"; cpuid" \
:"=b" (*ebx), "=a" (*eax), "=c" (*ecx), "=d" (*edx) \
:"0"(idx) );
void main()
{
unsigned int eax, ebx, ecx, edx;
char string[13];
CPUID(HYPERVISOR_INFO, &eax, &ebx, &ecx, &edx);
*(unsigned int *)(string + 0) = ebx;
*(unsigned int *)(string + 4) = ecx;
*(unsigned int *)(string + 8) = edx;
string[12] = 0;
if (strncmp(string, "KVMKVMKVM\0\0\0", 12) == 0)
printf("kvm guest\n");
else
printf("bare hardware\n");
}
Suggested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Correctly handle usermode exits. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-04-03 23:28:49 +00:00
	int emul_type = EMULTYPE_TRAP_UD;
2018-04-03 23:28:48 +00:00
	enum emulation_result er;
2018-04-03 23:28:49 +00:00
	char sig[5]; /* ud2; .ascii "kvm" */
	struct x86_exception e;

	if (force_emulation_prefix &&
2018-06-06 15:38:09 +00:00
	    kvm_read_guest_virt(vcpu, kvm_get_linear_rip(vcpu),
				sig, sizeof(sig), &e) == 0 &&
2018-04-03 23:28:49 +00:00
	    memcmp(sig, "\xf\xbkvm", sizeof(sig)) == 0) {
		kvm_rip_write(vcpu, kvm_rip_read(vcpu) + sizeof(sig));
		emul_type = 0;
	}
2018-04-03 23:28:48 +00:00

2018-04-03 23:28:49 +00:00
	er = emulate_instruction(vcpu, emul_type);
2018-04-03 23:28:48 +00:00
	if (er == EMULATE_USER_EXIT)
		return 0;
	if (er != EMULATE_DONE)
		kvm_queue_exception(vcpu, UD_VECTOR);
	return 1;
}
EXPORT_SYMBOL_GPL(handle_ud);
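`handle_ud()` recognizes the force-emulation prefix, the 5-byte sequence `ud2` (0F 0B) followed by ASCII `kvm`, only when the `force_emulation_prefix` module parameter is set. On a match it advances RIP past the prefix and clears `EMULTYPE_TRAP_UD`, so the instruction after the prefix is emulated. A minimal sketch of the match-and-skip step; `struct fake_vcpu` and `check_force_emulation()` are hypothetical, and guest code is modeled as a plain byte array indexed by RIP rather than read through `kvm_read_guest_virt()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* ud2 (0F 0B) followed by the ASCII bytes "kvm". */
const unsigned char kvm_emul_prefix[5] = { 0x0f, 0x0b, 'k', 'v', 'm' };

/* Hypothetical, stripped-down vCPU: guest code bytes plus an instruction pointer. */
struct fake_vcpu {
	const unsigned char *code;
	size_t code_len;
	uint64_t rip;
};

/*
 * Returns true (and advances rip past the prefix) if the bytes at rip are
 * the force-emulation prefix; the caller would then emulate the following
 * instruction instead of only trapping on #UD.
 */
bool check_force_emulation(struct fake_vcpu *vcpu, bool enabled)
{
	if (!enabled || vcpu->rip + sizeof(kvm_emul_prefix) > vcpu->code_len)
		return false;
	if (memcmp(vcpu->code + vcpu->rip, kvm_emul_prefix,
		   sizeof(kvm_emul_prefix)) != 0)
		return false;
	vcpu->rip += sizeof(kvm_emul_prefix);
	return true;
}
```

In the test sequence the bytes after the prefix are `0F A2` (`cpuid`), matching the instruction used in the commit message's example.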
2016-12-14 19:59:23 +00:00
static int vcpu_is_mmio_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
			    gpa_t gpa, bool write)
{
	/* For APIC access vmexit */
	if ((gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
		return 1;

	if (vcpu_match_mmio_gpa(vcpu, gpa)) {
		trace_vcpu_match_mmio(gva, gpa, write, true);
		return 1;
	}

	return 0;
}
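`vcpu_is_mmio_gpa()` treats any guest-physical address inside the page at `APIC_DEFAULT_PHYS_BASE` as MMIO, so APIC-access VM exits are routed to the emulator. A sketch of the page comparison, assuming 4 KiB pages and the architectural default xAPIC base 0xFEE00000; `is_apic_page()` and the macro names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GUEST_PAGE_SHIFT 12
#define GUEST_PAGE_MASK  (~((UINT64_C(1) << GUEST_PAGE_SHIFT) - 1))
#define APIC_BASE_DEFAULT UINT64_C(0xfee00000)	/* default xAPIC base */

/* True if gpa lies anywhere in the 4 KiB default local-APIC page. */
bool is_apic_page(uint64_t gpa)
{
	return (gpa & GUEST_PAGE_MASK) == APIC_BASE_DEFAULT;
}
```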
2011-07-11 19:22:46 +00:00
static int vcpu_mmio_gva_to_gpa(struct kvm_vcpu *vcpu, unsigned long gva,
				gpa_t *gpa, struct x86_exception *exception,
				bool write)
{
KVM: MMU: Optimize pte permission checks
walk_addr_generic() permission checks are a maze of branchy code, which is
performed four times per lookup. It depends on the type of access, efer.nxe,
cr0.wp, cr4.smep, and in the near future, cr4.smap.
Optimize this away by precalculating all variants and storing them in a
bitmap. The bitmap is recalculated when rarely-changing variables change
(cr0, cr4) and is indexed by the often-changing variables (page fault error
code, pte access permissions).
The permission check is moved to the end of the loop, otherwise an SMEP
fault could be reported as a false positive, when PDE.U=1 but PTE.U=0.
Noted by Xiao Guangrong.
The result is short, branch-free code.
Reviewed-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-12 11:52:00 +00:00
	u32 access = ((kvm_x86_ops->get_cpl(vcpu) == 3) ? PFERR_USER_MASK : 0)
		| (write ? PFERR_WRITE_MASK : 0);
2011-07-11 19:22:46 +00:00

2016-03-22 08:51:20 +00:00
	/*
	 * Currently PKRU is only applied to EPT-enabled guests, so there
	 * is no pkey in the EPT page table for an L1 guest or in the EPT
	 * shadow page table for an L2 guest.
	 */
2012-09-12 11:52:00 +00:00
	if (vcpu_match_mmio_gva(vcpu, gva)
2014-04-01 09:46:34 +00:00
	    && !permission_fault(vcpu, vcpu->arch.walk_mmu,
2016-03-22 08:51:20 +00:00
				 vcpu->arch.access, 0, access)) {
2011-07-11 19:23:20 +00:00
		*gpa = vcpu->arch.mmio_gfn << PAGE_SHIFT |
					(gva & (PAGE_SIZE - 1));
2011-07-11 19:34:24 +00:00
		trace_vcpu_match_mmio(gva, *gpa, write, false);
2011-07-11 19:23:20 +00:00
		return 1;
	}

2011-07-11 19:22:46 +00:00
	*gpa = vcpu->arch.walk_mmu->gva_to_gpa(vcpu, gva, access, exception);

	if (*gpa == UNMAPPED_GVA)
		return -1;

2016-12-14 19:59:23 +00:00
	return vcpu_is_mmio_gpa(vcpu, gva, *gpa, write);
2011-07-11 19:22:46 +00:00
}
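The fast path in `vcpu_mmio_gva_to_gpa()` reuses the cached `mmio_gfn` only if `permission_fault()` agrees that the cached access rights allow the current access; the "Optimize pte permission checks" commit quoted above describes turning that check into a lookup in a bitmap that is recomputed only when CR0/CR4 change. The sketch below illustrates the idea under simplifying assumptions: only CR0.WP is modeled (no NX, SMEP, SMAP or protection keys), the 3-bit PTE-access encoding is this sketch's own, and `update_permission_bitmap()`, `permission_fault_sketch()` and `compose_gpa()` are hypothetical. `compose_gpa()` shows the `gfn << PAGE_SHIFT | (gva & (PAGE_SIZE - 1))` step with 4 KiB pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* x86 page-fault error-code bits used as the lookup index. */
#define PF_WRITE (1u << 1)
#define PF_USER  (1u << 2)
#define PF_FETCH (1u << 4)

/* This sketch's PTE-access encoding (3 bits -> 8 combinations). */
#define ACC_EXEC  1u
#define ACC_WRITE 2u
#define ACC_USER  4u

/* perm[pfec >> 1] has bit 'pte_access' set when that access would fault. */
uint8_t perm[16];

/* Recompute only when the rarely-changing state (here just CR0.WP) changes. */
void update_permission_bitmap(bool cr0_wp)
{
	for (unsigned idx = 0; idx < 16; idx++) {
		uint32_t pfec = idx << 1;
		bool write = pfec & PF_WRITE;
		bool user = pfec & PF_USER;
		bool fetch = pfec & PF_FETCH;
		uint8_t map = 0;

		for (unsigned acc = 0; acc < 8; acc++) {
			bool fault = (fetch && !(acc & ACC_EXEC)) ||
				     (user && !(acc & ACC_USER)) ||
				     (write && !(acc & ACC_WRITE) && (user || cr0_wp));
			if (fault)
				map |= 1u << acc;
		}
		perm[idx] = map;
	}
}

/* Branch-free check at fault time; pfec bits above bit 4 are ignored. */
bool permission_fault_sketch(unsigned pte_access, uint32_t pfec)
{
	return (perm[(pfec >> 1) & 15] >> pte_access) & 1;
}

/* Fast-path translation: cached frame number plus the offset within the page. */
uint64_t compose_gpa(uint64_t gfn, uint64_t gva)
{
	return gfn << 12 | (gva & 0xfff);
}
```

The design trades a 16-byte table rebuilt on rare control-register changes for a single shift-and-mask on every lookup.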
2008-03-29 23:17:59 +00:00
int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
2010-11-22 15:53:22 +00:00
			const void *val, int bytes)
2007-10-30 17:44:21 +00:00
{
	int ret;

2015-04-08 13:39:23 +00:00
	ret = kvm_vcpu_write_guest(vcpu, gpa, val, bytes);
2008-03-02 12:06:05 +00:00
	if (ret < 0)
2007-10-30 17:44:21 +00:00
		return 0;
2016-02-24 09:51:13 +00:00
	kvm_page_track_write(vcpu, gpa, val, bytes);
2007-10-30 17:44:21 +00:00
	return 1;
}
2011-07-13 06:31:50 +00:00
struct read_write_emulator_ops {
	int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
				  int bytes);
	int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
				  void *val, int bytes);
	int (*read_write_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
			       int bytes, void *val);
	int (*read_write_exit_mmio)(struct kvm_vcpu *vcpu, gpa_t gpa,
				    void *val, int bytes);
	bool write;
};

static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
{
	if (vcpu->mmio_read_completed) {
		trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
2017-12-15 01:40:50 +00:00
			       vcpu->mmio_fragments[0].gpa, val);
2011-07-13 06:31:50 +00:00
		vcpu->mmio_read_completed = 0;
		return 1;
	}

	return 0;
}

static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
|
|
|
|
void *val, int bytes)
|
|
|
|
{
|
2015-04-08 13:39:23 +00:00
|
|
|
return !kvm_vcpu_read_guest(vcpu, gpa, val, bytes);
|
2011-07-13 06:31:50 +00:00
|
|
|
}

static int write_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
			 void *val, int bytes)
{
	return emulator_write_phys(vcpu, gpa, val, bytes);
}

static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
{
2017-12-15 01:40:50 +00:00
	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, bytes, gpa, val);
2011-07-13 06:31:50 +00:00
	return vcpu_mmio_write(vcpu, gpa, bytes, val);
}

static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
			  void *val, int bytes)
{
2017-12-15 01:40:50 +00:00
	trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, bytes, gpa, NULL);
2011-07-13 06:31:50 +00:00
	return X86EMUL_IO_NEEDED;
}

static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
			   void *val, int bytes)
{
2012-04-18 16:22:47 +00:00
	struct kvm_mmio_fragment *frag = &vcpu->mmio_fragments[0];

2012-10-24 06:07:59 +00:00
	memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len));
2011-07-13 06:31:50 +00:00
	return X86EMUL_CONTINUE;
}

2012-08-29 23:30:17 +00:00
static const struct read_write_emulator_ops read_emultor = {
2011-07-13 06:31:50 +00:00
	.read_write_prepare = read_prepare,
	.read_write_emulate = read_emulate,
	.read_write_mmio = vcpu_mmio_read,
	.read_write_exit_mmio = read_exit_mmio,
};

2012-08-29 23:30:17 +00:00
static const struct read_write_emulator_ops write_emultor = {
2011-07-13 06:31:50 +00:00
	.read_write_emulate = write_emulate,
	.read_write_mmio = write_mmio,
	.read_write_exit_mmio = write_exit_mmio,
	.write = true,
};
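read_emultor and write_emultor let one generic routine serve both directions: the direction-specific steps live in a const table of function pointers, the prepare hook is optional (only the read table sets it), and the write flag records the direction. The sketch below shows the same dispatch pattern on a toy model. struct rw_ops, do_access(), the 16-byte ram[] array and the single device register are hypothetical, and unlike emulator_read_write_onepage() this sketch simply tries RAM first instead of deciding up front whether the address is MMIO.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define RAM_SIZE 16 /* hypothetical guest RAM size */

static unsigned char ram[RAM_SIZE];
static unsigned char dev_reg; /* one hypothetical MMIO register at RAM_SIZE */

struct rw_ops {
	int (*prepare)(void *val, int bytes);             /* optional hook */
	int (*emulate)(size_t gpa, void *val, int bytes); /* nonzero on success */
	int (*mmio)(size_t gpa, int bytes, void *val);    /* bytes handled */
	bool write;                                       /* direction flag */
};

static int ram_read(size_t gpa, void *val, int bytes)
{
	if (gpa + bytes > RAM_SIZE)
		return 0;
	memcpy(val, ram + gpa, bytes);
	return 1;
}

static int ram_write(size_t gpa, void *val, int bytes)
{
	if (gpa + bytes > RAM_SIZE)
		return 0;
	memcpy(ram + gpa, val, bytes);
	return 1;
}

static int dev_read(size_t gpa, int bytes, void *val)
{
	if (gpa != RAM_SIZE || bytes != 1)
		return 0;
	*(unsigned char *)val = dev_reg;
	return 1;
}

static int dev_write(size_t gpa, int bytes, void *val)
{
	if (gpa != RAM_SIZE || bytes != 1)
		return 0;
	dev_reg = *(unsigned char *)val;
	return 1;
}

/* Unset members are NULL/false, just as read_prepare is absent for writes. */
static const struct rw_ops read_ops = {
	.emulate = ram_read, .mmio = dev_read,
};
static const struct rw_ops write_ops = {
	.emulate = ram_write, .mmio = dev_write, .write = true,
};

/* Direction-agnostic access: 0 = done, -1 = would exit to userspace. */
static int do_access(size_t gpa, void *val, int bytes, const struct rw_ops *ops)
{
	if (ops->prepare && ops->prepare(val, bytes))
		return 0;
	if (ops->emulate(gpa, val, bytes))
		return 0;
	if (ops->mmio(gpa, bytes, val) == bytes)
		return 0;
	return -1;
}
```

Adding a new access kind then means adding a table, not another copy of the control flow, which is the design choice the two tables above reflect.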

2011-07-13 06:32:31 +00:00
static int emulator_read_write_onepage(unsigned long addr, void *val,
				       unsigned int bytes,
				       struct x86_exception *exception,
				       struct kvm_vcpu *vcpu,
2012-08-29 23:30:17 +00:00
				       const struct read_write_emulator_ops *ops)
2007-10-30 17:44:21 +00:00
{
2011-07-11 19:22:46 +00:00
	gpa_t gpa;
	int handled, ret;
2011-07-13 06:32:31 +00:00
	bool write = ops->write;
2012-04-18 16:22:47 +00:00
	struct kvm_mmio_fragment *frag;
2016-12-14 19:59:23 +00:00
	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;

	/*
	 * If the exit was due to an NPF, we may already have a GPA.
	 * If the GPA is present, use it to avoid the GVA to GPA table walk.
	 * Note, this cannot be used on string operations, since a string
	 * operation using rep will only have the initial GPA from when the
	 * NPF occurred.
	 */
	if (vcpu->arch.gpa_available &&
	    emulator_can_use_gpa(ctxt) &&
2017-08-17 16:36:57 +00:00
	    (addr & ~PAGE_MASK) == (vcpu->arch.gpa_val & ~PAGE_MASK)) {
		gpa = vcpu->arch.gpa_val;
		ret = vcpu_is_mmio_gpa(vcpu, addr, gpa, write);
	} else {
		ret = vcpu_mmio_gva_to_gpa(vcpu, addr, &gpa, exception, write);
		if (ret < 0)
			return X86EMUL_PROPAGATE_FAULT;
2016-12-14 19:59:23 +00:00
	}
2007-12-21 00:18:22 +00:00

2017-08-17 16:36:57 +00:00
	if (!ret && ops->read_write_emulate(vcpu, gpa, val, bytes))
2007-10-30 17:44:21 +00:00
		return X86EMUL_CONTINUE;

	/*
	 * Is this MMIO handled locally?
	 */
2011-07-13 06:32:31 +00:00
	handled = ops->read_write_mmio(vcpu, gpa, bytes, val);
2010-01-19 10:51:22 +00:00
	if (handled == bytes)
2007-10-30 17:44:21 +00:00
		return X86EMUL_CONTINUE;

2010-01-19 10:51:22 +00:00
	gpa += handled;
	bytes -= handled;
	val += handled;

2012-10-24 06:07:59 +00:00
	WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
	frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
	frag->gpa = gpa;
	frag->data = val;
	frag->len = bytes;
2012-04-18 16:22:47 +00:00
	return X86EMUL_CONTINUE;
2007-10-30 17:44:21 +00:00
}
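When the in-kernel MMIO handler accepts only part of an access, emulator_read_write_onepage() advances gpa, val and bytes by the handled amount and records the remainder as an MMIO fragment for userspace to complete. A minimal sketch of that bookkeeping follows. struct frag, struct frag_list, MAX_FRAGMENTS and queue_remainder() are hypothetical, and where the code above only WARN_ONs when the fragment array is full, this sketch refuses to queue and returns -1.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FRAGMENTS 2 /* hypothetical limit, playing the role of KVM_MAX_MMIO_FRAGMENTS */

struct frag {
	uint64_t gpa;     /* guest physical address of the unhandled tail */
	void *data;       /* where that tail lives in the emulator buffer */
	unsigned int len; /* bytes still to be completed by userspace */
};

struct frag_list {
	struct frag f[MAX_FRAGMENTS];
	int nr;
};

/*
 * After an in-kernel handler consumed `handled` of `bytes` bytes, queue the
 * remainder. Returns 0 if nothing was left, 1 if a fragment was queued,
 * -1 if the list is already full.
 */
static int queue_remainder(struct frag_list *fl, uint64_t gpa, void *val,
			   unsigned int bytes, unsigned int handled)
{
	if (handled == bytes)
		return 0;
	if (fl->nr >= MAX_FRAGMENTS)
		return -1;
	fl->f[fl->nr].gpa = gpa + handled;
	fl->f[fl->nr].data = (char *)val + handled;
	fl->f[fl->nr].len = bytes - handled;
	fl->nr++;
	return 1;
}
```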

2015-03-13 09:39:45 +00:00
static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
			unsigned long addr,
2011-07-13 06:32:31 +00:00
			void *val, unsigned int bytes,
			struct x86_exception *exception,
2012-08-29 23:30:17 +00:00
			const struct read_write_emulator_ops *ops)
2007-10-30 17:44:21 +00:00
{
2011-04-20 10:37:53 +00:00
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
2012-04-18 16:22:47 +00:00
	gpa_t gpa;
	int rc;

	if (ops->read_write_prepare &&
		  ops->read_write_prepare(vcpu, val, bytes))
		return X86EMUL_CONTINUE;

	vcpu->mmio_nr_fragments = 0;
2011-04-20 10:37:53 +00:00

2007-10-30 17:44:21 +00:00
	/* Crossing a page boundary? */
	if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
2012-04-18 16:22:47 +00:00
		int now;
2007-10-30 17:44:21 +00:00

		now = -addr & ~PAGE_MASK;
2011-07-13 06:32:31 +00:00
		rc = emulator_read_write_onepage(addr, val, now, exception,
						 vcpu, ops);

2007-10-30 17:44:21 +00:00
		if (rc != X86EMUL_CONTINUE)
			return rc;
		addr += now;
2015-01-26 07:32:26 +00:00
		if (ctxt->mode != X86EMUL_MODE_PROT64)
			addr = (u32)addr;
2007-10-30 17:44:21 +00:00
		val += now;
		bytes -= now;
	}
2011-07-13 06:32:31 +00:00

2012-04-18 16:22:47 +00:00
	rc = emulator_read_write_onepage(addr, val, bytes, exception,
					 vcpu, ops);
	if (rc != X86EMUL_CONTINUE)
		return rc;

	if (!vcpu->mmio_nr_fragments)
		return rc;

	gpa = vcpu->mmio_fragments[0].gpa;

	vcpu->mmio_needed = 1;
	vcpu->mmio_cur_fragment = 0;

2012-10-24 06:07:59 +00:00
	vcpu->run->mmio.len = min(8u, vcpu->mmio_fragments[0].len);
2012-04-18 16:22:47 +00:00
	vcpu->run->mmio.is_write = vcpu->mmio_is_write = ops->write;
	vcpu->run->exit_reason = KVM_EXIT_MMIO;
	vcpu->run->mmio.phys_addr = gpa;

	return ops->read_write_exit_mmio(vcpu, gpa, val, bytes);
2011-07-13 06:32:31 +00:00
}
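The page-crossing test in emulator_read_write() XORs the addresses of the first and last byte: any set bit above the page offset means the access spans two pages. -addr & ~PAGE_MASK is then the number of bytes left before the next page boundary, and outside 64-bit mode the advanced address is truncated to 32 bits. The helpers below reproduce that arithmetic in userspace; PAGE_SHIFT = 12 (4 KiB pages) and the names crosses_page(), bytes_to_page_end() and split_access() are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define PAGE_SIZE (1UL << PAGE_SHIFT)
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Nonzero if [addr, addr + bytes) touches two pages (bytes >= 1). */
static int crosses_page(unsigned long addr, unsigned int bytes)
{
	return (((addr + bytes - 1) ^ addr) & PAGE_MASK) != 0;
}

/* Bytes from addr to the end of its page (0 if addr is page aligned). */
static unsigned long bytes_to_page_end(unsigned long addr)
{
	return -addr & ~PAGE_MASK;
}

/*
 * Split an access the way emulator_read_write() does: at most one split,
 * since an access no larger than a page can cross only one boundary.
 * Fills starts[] and lens[] and returns the number of chunks (1 or 2).
 */
static int split_access(unsigned long addr, unsigned int bytes,
			int long_mode, unsigned long starts[2],
			unsigned int lens[2])
{
	int n = 0;

	if (crosses_page(addr, bytes)) {
		unsigned long now = bytes_to_page_end(addr);

		starts[n] = addr;
		lens[n++] = now;
		addr += now;
		if (!long_mode)
			addr = (uint32_t)addr; /* wrap at 4 GiB outside 64-bit mode */
		bytes -= now;
	}
	starts[n] = addr;
	lens[n++] = bytes;
	return n;
}
```

For example, a 4-byte access at 0x1ffe is split into 2 bytes at 0x1ffe and 2 bytes at 0x2000, while an 8-byte access at 0x1000 stays in one piece.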

static int emulator_read_emulated(struct x86_emulate_ctxt *ctxt,
				  unsigned long addr,
				  void *val,
				  unsigned int bytes,
				  struct x86_exception *exception)
{
	return emulator_read_write(ctxt, addr, val, bytes,
				   exception, &read_emultor);
}

2015-03-13 09:39:45 +00:00
static int emulator_write_emulated(struct x86_emulate_ctxt *ctxt,
2011-07-13 06:32:31 +00:00
			    unsigned long addr,
			    const void *val,
			    unsigned int bytes,
			    struct x86_exception *exception)
{
	return emulator_read_write(ctxt, addr, (void *)val, bytes,
				   exception, &write_emultor);
2007-10-30 17:44:21 +00:00
}

2010-03-15 11:59:54 +00:00
#define CMPXCHG_TYPE(t, ptr, old, new) \
	(cmpxchg((t *)(ptr), *(t *)(old), *(t *)(new)) == *(t *)(old))

#ifdef CONFIG_X86_64
# define CMPXCHG64(ptr, old, new) CMPXCHG_TYPE(u64, ptr, old, new)
#else
# define CMPXCHG64(ptr, old, new) \
2010-03-20 09:14:13 +00:00
	(cmpxchg64((u64 *)(ptr), *(u64 *)(old), *(u64 *)(new)) == *(u64 *)(old))
2010-03-15 11:59:54 +00:00
#endif
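CMPXCHG_TYPE() relies on cmpxchg() returning the value that was in memory before the operation, so the exchange succeeded exactly when that value equals the expected old value. The sketch below expresses the same idea in userspace with the GCC/Clang builtin __sync_val_compare_and_swap(), which also returns the prior contents. CAS_TYPE() and cas_sized() are hypothetical names, and the builtin stands in for the kernel's cmpxchg()/cmpxchg64().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace analogue of CMPXCHG_TYPE(): the compare-and-swap returns the
 * value that was in memory, and the exchange happened iff that value
 * equals the expected old value.
 */
#define CAS_TYPE(t, ptr, old, new) \
	(__sync_val_compare_and_swap((t *)(ptr), *(t *)(old), *(t *)(new)) == \
	 *(t *)(old))

/* Size dispatch in the style of the switch in emulator_cmpxchg_emulated(). */
static bool cas_sized(void *ptr, const void *old, const void *new,
		      unsigned int bytes)
{
	switch (bytes) {
	case 1: return CAS_TYPE(uint8_t, ptr, old, new);
	case 2: return CAS_TYPE(uint16_t, ptr, old, new);
	case 4: return CAS_TYPE(uint32_t, ptr, old, new);
	case 8: return CAS_TYPE(uint64_t, ptr, old, new);
	default: return false; /* the kernel code BUG()s here instead */
	}
}
```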

2011-04-20 10:37:53 +00:00
static int emulator_cmpxchg_emulated(struct x86_emulate_ctxt *ctxt,
				     unsigned long addr,
2007-10-30 17:44:21 +00:00
				     const void *old,
				     const void *new,
				     unsigned int bytes,
2011-04-20 10:37:53 +00:00
				     struct x86_exception *exception)
2007-10-30 17:44:21 +00:00
{
2011-04-20 10:37:53 +00:00
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
2010-03-15 11:59:54 +00:00
	gpa_t gpa;
	struct page *page;
	char *kaddr;
	bool exchanged;
2007-12-12 15:46:12 +00:00

2010-03-15 11:59:54 +00:00
	/* A guest's cmpxchg8b has to be emulated atomically. */
	if (bytes > 8 || (bytes & (bytes - 1)))
		goto emul_write;
2007-12-21 00:18:22 +00:00

2010-03-15 11:59:54 +00:00
	gpa = kvm_mmu_gva_to_gpa_write(vcpu, addr, NULL);
2007-12-12 15:46:12 +00:00

2010-03-15 11:59:54 +00:00
	if (gpa == UNMAPPED_GVA ||
	    (gpa & PAGE_MASK) == APIC_DEFAULT_PHYS_BASE)
		goto emul_write;
2007-12-12 15:46:12 +00:00

2010-03-15 11:59:54 +00:00
	if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK))
		goto emul_write;
2008-02-10 16:04:15 +00:00

2015-04-08 13:39:23 +00:00
	page = kvm_vcpu_gfn_to_page(vcpu, gpa >> PAGE_SHIFT);
2012-08-03 07:42:52 +00:00
	if (is_error_page(page))
2010-07-15 00:51:58 +00:00
		goto emul_write;
2008-02-10 16:04:15 +00:00

2011-11-25 15:14:17 +00:00
	kaddr = kmap_atomic(page);
2010-03-15 11:59:54 +00:00
	kaddr += offset_in_page(gpa);
	switch (bytes) {
	case 1:
		exchanged = CMPXCHG_TYPE(u8, kaddr, old, new);
		break;
	case 2:
		exchanged = CMPXCHG_TYPE(u16, kaddr, old, new);
		break;
	case 4:
		exchanged = CMPXCHG_TYPE(u32, kaddr, old, new);
		break;
	case 8:
		exchanged = CMPXCHG64(kaddr, old, new);
		break;
	default:
		BUG();
2007-12-12 15:46:12 +00:00
	}
2011-11-25 15:14:17 +00:00
	kunmap_atomic(kaddr);
2010-03-15 11:59:54 +00:00
	kvm_release_page_dirty(page);

	if (!exchanged)
		return X86EMUL_CMPXCHG_FAILED;

2015-04-08 13:39:23 +00:00
	kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
2016-02-24 09:51:13 +00:00
	kvm_page_track_write(vcpu, gpa, new, bytes);
2010-04-13 07:21:56 +00:00

	return X86EMUL_CONTINUE;
2010-03-15 11:59:55 +00:00

2008-03-29 23:17:59 +00:00
emul_write:
2010-03-15 11:59:54 +00:00
	printk_once(KERN_WARNING "kvm: emulating exchange as write\n");
2007-12-12 15:46:12 +00:00

2011-04-20 10:37:53 +00:00
	return emulator_write_emulated(ctxt, addr, new, bytes, exception);
2007-10-30 17:44:21 +00:00
}
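emulator_cmpxchg_emulated() only performs a real atomic exchange when the operand is 1, 2, 4 or 8 bytes, the target is not the local APIC page, and the operand does not straddle a page; otherwise it falls back to a plain emulated write. The predicate below collects those address checks in one place. It is a sketch: it assumes 4 KiB pages, uses 0xfee00000 as the value of APIC_DEFAULT_PHYS_BASE, omits the UNMAPPED_GVA case, and additionally rejects bytes == 0, which the code above does not test. can_cmpxchg_atomically() is a hypothetical name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */
#define PAGE_MASK (~((UINT64_C(1) << PAGE_SHIFT) - 1))
#define APIC_BASE UINT64_C(0xfee00000) /* APIC_DEFAULT_PHYS_BASE on x86 */

/* True if an exchange at gpa of `bytes` bytes could be done atomically. */
static bool can_cmpxchg_atomically(uint64_t gpa, unsigned int bytes)
{
	/* Only 1, 2, 4 or 8 bytes: a nonzero power of two no larger than 8. */
	if (bytes == 0 || bytes > 8 || (bytes & (bytes - 1)))
		return false;
	/* The APIC page is emulated, not backed by ordinary guest memory. */
	if ((gpa & PAGE_MASK) == APIC_BASE)
		return false;
	/* The operand must sit entirely within one page. */
	if (((gpa + bytes - 1) & PAGE_MASK) != (gpa & PAGE_MASK))
		return false;
	return true;
}
```

An 8-byte operand at 0x1ff8 qualifies, while the same operand at 0x1ffc spills into the next page and takes the write fallback.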

2010-03-18 13:20:23 +00:00
static int kernel_pio(struct kvm_vcpu *vcpu, void *pd)
{
KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation
Huawei folks reported a read out-of-bounds vulnerability in kvm pio emulation.
- An "inb" instruction in the guest that accesses the PIT Mode/Command register
(ioport 0x43, write-only, so a read should be ignored) can return a random number.
- A "rep insb" instruction that accesses PIT register port 0x43 can make the memcpy()
in emulator_pio_in_emulated() copy up to 0x400 bytes even though only 1 byte was
read, which discloses (unimportant) host kernel memory but does not crash the host.
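A test program like the one below can reproduce the read out-of-bounds vulnerability:
#include <ctype.h>
#include <err.h>
#include <stdio.h>
#include <string.h>
#include <sys/io.h>
#define HEXDUMP_COLS 8
void hexdump(void *mem, unsigned int len)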
{
unsigned int i, j;
for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++)
{
/* print offset */
if(i % HEXDUMP_COLS == 0)
{
printf("0x%06x: ", i);
}
/* print hex data */
if(i < len)
{
printf("%02x ", 0xFF & ((char*)mem)[i]);
}
else /* end of block, just aligning for ASCII dump */
{
printf(" ");
}
/* print ASCII dump */
if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1))
{
for(j = i - (HEXDUMP_COLS - 1); j <= i; j++)
{
if(j >= len) /* end of block, not really printing */
{
putchar(' ');
}
else if(isprint(((char*)mem)[j])) /* printable char */
{
putchar(0xFF & ((char*)mem)[j]);
}
else /* other char */
{
putchar('.');
}
}
putchar('\n');
}
}
}
int main(void)
{
int i;
if (iopl(3))
{
err(1, "set iopl unsuccessfully\n");
return -1;
}
static char buf[0x40];
/* test ioport 0x40,0x41,0x42,0x43,0x44,0x45 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x40, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x41, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x42, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x43, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x44, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x45, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
/* ins port 0x40 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x20, %rcx;");
asm volatile ("mov $0x40, %rdx;");
asm volatile ("rep insb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
/* ins port 0x43 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x20, %rcx;");
asm volatile ("mov $0x43, %rdx;");
asm volatile ("rep insb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
return 0;
}
The vcpu->arch.pio_data buffer is used by the emulation of both in and out
instructions and is not cleared after use, so stale data can be left over in
the buffer. A guest read of port 0x43 is ignored because the port is write-only;
however, kernel_pio() cannot distinguish this ignored read from a successful
read of the device's ioport. No new data fills the buffer from port 0x43, yet
emulator_pio_in_emulated() copies the stale data in the buffer to the guest
unconditionally. This patch fixes that by clearing the buffer before emulating
an in instruction, so the guest is never handed stale data.
In addition, string I/O is not supported for in-kernel devices, so there is no
loop that reads the ioport %RCX times. kernel_pio() reads only one round and
then copies io size * %RCX bytes to the guest unconditionally; that is, it copies
one round of ioport data together with whatever stale data is left over in the
vcpu->arch.pio_data buffer. This patch fixes that by adding string I/O support
for in-kernel devices, so that the guest receives the correct ioport data.
Before the patch:
0x000000: fe 38 93 93 ff ff ab ab .8......
0x000008: ab ab ab ab ab ab ab ab ........
0x000010: ab ab ab ab ab ab ab ab ........
0x000018: ab ab ab ab ab ab ab ab ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: f6 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
0x000018: 30 30 20 33 20 20 20 20 00 3
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: f6 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
0x000018: 30 30 20 33 20 20 20 20 00 3
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
After the patch:
0x000000: 1e 02 f8 00 ff ff ab ab ........
0x000008: ab ab ab ab ab ab ab ab ........
0x000010: ab ab ab ab ab ab ab ab ........
0x000018: ab ab ab ab ab ab ab ab ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: d2 e2 d2 df d2 db d2 d7 ........
0x000008: d2 d3 d2 cf d2 cb d2 c7 ........
0x000010: d2 c4 d2 c0 d2 bc d2 b8 ........
0x000018: d2 b4 d2 b0 d2 ac d2 a8 ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: 00 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 00 00 00 00 ........
0x000018: 00 00 00 00 00 00 00 00 ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
Reported-by: Moguofang <moguofang@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Moguofang <moguofang@huawei.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-05-19 09:46:56 +00:00
	int r = 0, i;
2010-03-18 13:20:23 +00:00

2017-05-19 09:46:56 +00:00
	for (i = 0; i < vcpu->arch.pio.count; i++) {
		if (vcpu->arch.pio.in)
			r = kvm_io_bus_read(vcpu, KVM_PIO_BUS, vcpu->arch.pio.port,
					    vcpu->arch.pio.size, pd);
		else
			r = kvm_io_bus_write(vcpu, KVM_PIO_BUS,
					     vcpu->arch.pio.port, vcpu->arch.pio.size,
					     pd);
		if (r)
			break;
		pd += vcpu->arch.pio.size;
	}
2010-03-18 13:20:23 +00:00
	return r;
}
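For string I/O, kernel_pio() issues one bus access per element, advances the buffer by the element size each time, and stops at the first access that no in-kernel device claims. The userspace sketch below mirrors that loop for the "in" direction. pio_read_fn, pio_in_string() and the fake_pit_read() down-counter on port 0x40 are hypothetical; the device values are made up and do not model the real PIT.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical in-kernel device hook: 0 on success, nonzero if unclaimed. */
typedef int (*pio_read_fn)(unsigned short port, int size, void *dst);

/*
 * Sketch of the string-I/O loop in kernel_pio(): one bus access per
 * element, advancing the buffer by `size` and stopping at the first failure.
 */
static int pio_in_string(pio_read_fn rd, unsigned short port, int size,
			 unsigned int count, void *buf)
{
	char *pd = buf;
	int r = 0;
	unsigned int i;

	for (i = 0; i < count; i++) {
		r = rd(port, size, pd);
		if (r)
			break;
		pd += size;
	}
	return r;
}

/* Example device: a byte-wide down-counter on port 0x40 (made-up values). */
static unsigned char counter = 10;

static int fake_pit_read(unsigned short port, int size, void *dst)
{
	if (port != 0x40 || size != 1)
		return -1;
	*(unsigned char *)dst = counter--;
	return 0;
}
```

Each element gets its own value (10, 9, 8), and elements beyond count are left untouched, which is the behavior the fix above restores for in-kernel devices.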

2011-09-22 08:55:10 +00:00
static int emulator_pio_in_out(struct kvm_vcpu *vcpu, int size,
			       unsigned short port, void *val,
			       unsigned int count, bool in)
2010-03-18 13:20:23 +00:00
{
	vcpu->arch.pio.port = port;
2011-09-22 08:55:10 +00:00
	vcpu->arch.pio.in = in;
2010-03-18 13:20:24 +00:00
	vcpu->arch.pio.count = count;
2010-03-18 13:20:23 +00:00
	vcpu->arch.pio.size = size;

	if (!kernel_pio(vcpu, vcpu->arch.pio_data)) {
2010-03-18 13:20:24 +00:00
		vcpu->arch.pio.count = 0;
2010-03-18 13:20:23 +00:00
		return 1;
	}

	vcpu->run->exit_reason = KVM_EXIT_IO;
2011-09-22 08:55:10 +00:00
	vcpu->run->io.direction = in ? KVM_EXIT_IO_IN : KVM_EXIT_IO_OUT;
2010-03-18 13:20:23 +00:00
	vcpu->run->io.size = size;
	vcpu->run->io.data_offset = KVM_PIO_PAGE_OFFSET * PAGE_SIZE;
	vcpu->run->io.count = count;
	vcpu->run->io.port = port;

	return 0;
}
|
|
|
|
|
2011-09-22 08:55:10 +00:00
|
|
|
static int emulator_pio_in_emulated(struct x86_emulate_ctxt *ctxt,
|
|
|
|
int size, unsigned short port, void *val,
|
|
|
|
unsigned int count)
|
2010-03-18 13:20:23 +00:00
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
|
2011-09-22 08:55:10 +00:00
|
|
|
int ret;
|
2011-04-20 10:37:53 +00:00
|
|
|
|
2011-09-22 08:55:10 +00:00
|
|
|
if (vcpu->arch.pio.count)
|
|
|
|
goto data_avail;
|
2010-03-18 13:20:23 +00:00
|
|
|
|
KVM: X86: Fix read out-of-bounds vulnerability in kvm pio emulation
Huawei engineers reported an out-of-bounds read vulnerability in KVM's PIO emulation.
- An "inb" instruction in the guest that reads the PIT Mode/Command register (ioport 0x43, which is write-only, so a read should be ignored) returns a random value.
- A "rep insb" instruction that reads PIT port 0x43 lets the guest drive the memcpy() in emulator_pio_in_emulated() to copy up to 0x400 bytes although only 1 byte was read, which discloses some (unimportant) host kernel memory but does not crash the host.
A test program similar to the one below reproduces the out-of-bounds read:
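#include <ctype.h>
#include <err.h>
#include <stdio.h>
#include <string.h>
#include <sys/io.h>
#define HEXDUMP_COLS 8
void hexdump(void *mem, unsigned int len)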
{
unsigned int i, j;
for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++)
{
/* print offset */
if(i % HEXDUMP_COLS == 0)
{
printf("0x%06x: ", i);
}
/* print hex data */
if(i < len)
{
printf("%02x ", 0xFF & ((char*)mem)[i]);
}
else /* end of block, just aligning for ASCII dump */
{
printf(" ");
}
/* print ASCII dump */
if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1))
{
for(j = i - (HEXDUMP_COLS - 1); j <= i; j++)
{
if(j >= len) /* end of block, not really printing */
{
putchar(' ');
}
else if(isprint(((char*)mem)[j])) /* printable char */
{
putchar(0xFF & ((char*)mem)[j]);
}
else /* other char */
{
putchar('.');
}
}
putchar('\n');
}
}
}
int main(void)
{
int i;
if (iopl(3))
{
err(1, "iopl(3) failed"); /* err() prints the errno message and exits */
}
static char buf[0x40];
/* test ioport 0x40,0x41,0x42,0x43,0x44,0x45 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x40, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x41, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x42, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x43, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x44, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("mov $0x45, %rdx;");
asm volatile ("in %dx,%al;");
asm volatile ("stosb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
/* ins port 0x40 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x20, %rcx;");
asm volatile ("mov $0x40, %rdx;");
asm volatile ("rep insb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
/* ins port 0x43 */
memset(buf, 0xab, sizeof(buf));
asm volatile("push %rdi;");
asm volatile("mov %0, %%rdi;"::"q"(buf));
asm volatile ("mov $0x20, %rcx;");
asm volatile ("mov $0x43, %rdx;");
asm volatile ("rep insb;");
asm volatile ("pop %rdi;");
hexdump(buf, 0x40);
printf("\n");
return 0;
}
The vcpu->arch.pio_data buffer is shared by the emulation of both IN and OUT
instructions and is not cleared after use, so stale data is left over in the
buffer. A guest read of port 0x43 is ignored because the port is write-only;
however, kernel_pio() cannot distinguish this ignored read from a successful
read of the device's ioport. Although no new data is written into the buffer for
port 0x43, emulator_pio_in_emulated() unconditionally copies the stale buffer
contents to the guest. This patch fixes it by clearing the buffer before
emulating an IN instruction, so that the guest never receives stale data from
the buffer.
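A minimal user-space sketch of the stale-buffer problem and the memset() fix, assuming a one-byte IN/OUT and a device callback that silently ignores reads of write-only port 0x43 while still reporting success. The names (pio_model, model_out, model_in, model_device_read) are hypothetical and only mirror the data flow described above; they are not KVM functions.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of a per-vCPU PIO bounce buffer (not KVM code). */
#define PIO_BUF_SIZE 64

struct pio_model {
	unsigned char data[PIO_BUF_SIZE];
};

/* OUT: the guest's value is staged in the shared buffer and left there. */
static void model_out(struct pio_model *m, unsigned char val)
{
	m->data[0] = val;
}

/* Device read: port 0x43 is write-only, so the read is ignored and the
 * buffer is left untouched, yet success (0) is still reported. */
static int model_device_read(unsigned short port, unsigned char *buf)
{
	if (port != 0x43)
		buf[0] = 0x5a;	/* pretend the device returned 0x5a */
	return 0;
}

/* IN: optionally clear the buffer first (the fix), then copy the
 * buffer to the guest unconditionally, as the emulator does. */
static unsigned char model_in(struct pio_model *m, unsigned short port,
			      bool clear_first)
{
	unsigned char out;

	if (clear_first)
		memset(m->data, 0, 1);
	model_device_read(port, m->data);
	memcpy(&out, m->data, 1);
	return out;
}
```

With clear_first set to false, the value staged by the previous OUT comes back for port 0x43; with it set to true, the guest sees zero instead.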
In addition, string I/O is not supported for in-kernel devices, so there is no
loop that reads the ioport %RCX times. kernel_pio() reads only one element, and
then io size * %RCX bytes are copied to the guest unconditionally; in other words,
the single element of ioport data is copied together with whatever stale data is
left over in the vcpu->arch.pio_data buffer. This patch fixes it by adding string
I/O support for in-kernel devices, so that the guest receives the correct ioport
data.
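A sketch of the string I/O loop, assuming a hypothetical counter_dev device whose read callback returns consecutive byte values. Like the kernel_pio() loop earlier in this file, it performs one device access per element and advances the destination by size bytes, stopping at the first failed access; string_pio_in() and counter_read() are illustrative names, not kernel APIs.

```c
/* Hypothetical device: each read returns the next value of a counter. */
struct counter_dev {
	unsigned char next;
};

/* Fill `size` bytes; 0 means success, matching the kvm_io_bus_read()
 * return convention used by kernel_pio(). */
static int counter_read(struct counter_dev *dev, int size, void *val)
{
	unsigned char *p = val;

	for (int i = 0; i < size; i++)
		p[i] = dev->next++;
	return 0;
}

/* String input: one device access per element instead of a single
 * access followed by copying size * count bytes from the buffer. */
static int string_pio_in(struct counter_dev *dev, int size,
			 unsigned int count, void *buf)
{
	unsigned char *pd = buf;
	int r = 0;

	for (unsigned int i = 0; i < count; i++) {
		r = counter_read(dev, size, pd);
		if (r)
			break;
		pd += size;	/* next element goes right after this one */
	}
	return r;
}
```

Only count * size bytes of the destination are written; anything beyond that is left as it was.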
Before the patch:
0x000000: fe 38 93 93 ff ff ab ab .8......
0x000008: ab ab ab ab ab ab ab ab ........
0x000010: ab ab ab ab ab ab ab ab ........
0x000018: ab ab ab ab ab ab ab ab ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: f6 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
0x000018: 30 30 20 33 20 20 20 20 00 3
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: f6 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 4d 51 30 30 ....MQ00
0x000018: 30 30 20 33 20 20 20 20 00 3
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
After the patch:
0x000000: 1e 02 f8 00 ff ff ab ab ........
0x000008: ab ab ab ab ab ab ab ab ........
0x000010: ab ab ab ab ab ab ab ab ........
0x000018: ab ab ab ab ab ab ab ab ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: d2 e2 d2 df d2 db d2 d7 ........
0x000008: d2 d3 d2 cf d2 cb d2 c7 ........
0x000010: d2 c4 d2 c0 d2 bc d2 b8 ........
0x000018: d2 b4 d2 b0 d2 ac d2 a8 ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
0x000000: 00 00 00 00 00 00 00 00 ........
0x000008: 00 00 00 00 00 00 00 00 ........
0x000010: 00 00 00 00 00 00 00 00 ........
0x000018: 00 00 00 00 00 00 00 00 ........
0x000020: ab ab ab ab ab ab ab ab ........
0x000028: ab ab ab ab ab ab ab ab ........
0x000030: ab ab ab ab ab ab ab ab ........
0x000038: ab ab ab ab ab ab ab ab ........
Reported-by: Moguofang <moguofang@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Moguofang <moguofang@huawei.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-05-19 09:46:56 +00:00
|
|
|
memset(vcpu->arch.pio_data, 0, size * count);
|
|
|
|
|
2011-09-22 08:55:10 +00:00
|
|
|
ret = emulator_pio_in_out(vcpu, size, port, val, count, true);
|
|
|
|
if (ret) {
|
|
|
|
data_avail:
|
|
|
|
memcpy(val, vcpu->arch.pio_data, size * count);
|
2014-05-02 15:57:47 +00:00
|
|
|
trace_kvm_pio(KVM_PIO_IN, port, size, count, vcpu->arch.pio_data);
|
2010-03-18 13:20:24 +00:00
|
|
|
vcpu->arch.pio.count = 0;
|
2010-03-18 13:20:23 +00:00
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
2011-09-22 08:55:10 +00:00
|
|
|
static int emulator_pio_out_emulated(struct x86_emulate_ctxt *ctxt,
|
|
|
|
int size, unsigned short port,
|
|
|
|
const void *val, unsigned int count)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
|
|
|
|
|
|
|
|
memcpy(vcpu->arch.pio_data, val, size * count);
|
2014-05-02 15:57:47 +00:00
|
|
|
trace_kvm_pio(KVM_PIO_OUT, port, size, count, vcpu->arch.pio_data);
|
2011-09-22 08:55:10 +00:00
|
|
|
return emulator_pio_in_out(vcpu, size, port, (void *)val, count, false);
|
|
|
|
}
|
|
|
|
|
KVM: Portability: Move x86 emulation and mmio device hook to x86.c
This patch moves the following functions from kvm_main.c to x86.c:
emulator_read/write_std, vcpu_find_pervcpu_dev, vcpu_find_mmio_dev,
emulator_read/write_emulated, emulator_write_phys,
emulator_write_emulated_onepage, emulator_cmpxchg_emulated,
get_segment_base, emulate_invlpg, emulate_clts, emulator_get/set_dr,
kvm_report_emulation_failure, emulate_instruction
The following data type is moved to x86.c:
struct x86_emulate_ops emulate_ops
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Hollis Blanchard <hollisb@us.ibm.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2007-10-30 17:44:21 +00:00
|
|
|
static unsigned long get_segment_base(struct kvm_vcpu *vcpu, int seg)
|
|
|
|
{
|
|
|
|
return kvm_x86_ops->get_segment_base(vcpu, seg);
|
|
|
|
}
|
|
|
|
|
2011-04-20 12:38:44 +00:00
|
|
|
static void emulator_invlpg(struct x86_emulate_ctxt *ctxt, ulong address)
|
2007-10-30 17:44:21 +00:00
|
|
|
{
|
2011-04-20 12:38:44 +00:00
|
|
|
kvm_mmu_invlpg(emul_to_vcpu(ctxt), address);
|
2007-10-30 17:44:21 +00:00
|
|
|
}
|
|
|
|
|
2016-11-07 00:54:51 +00:00
|
|
|
static int kvm_emulate_wbinvd_noskip(struct kvm_vcpu *vcpu)
|
2010-06-30 04:25:15 +00:00
|
|
|
{
|
|
|
|
if (!need_emulate_wbinvd(vcpu))
|
|
|
|
return X86EMUL_CONTINUE;
|
|
|
|
|
|
|
|
if (kvm_x86_ops->has_wbinvd_exit()) {
|
2010-11-01 13:01:29 +00:00
|
|
|
int cpu = get_cpu();
|
|
|
|
|
|
|
|
cpumask_set_cpu(cpu, vcpu->arch.wbinvd_dirty_mask);
|
2010-06-30 04:25:15 +00:00
|
|
|
smp_call_function_many(vcpu->arch.wbinvd_dirty_mask,
|
|
|
|
wbinvd_ipi, NULL, 1);
|
2010-11-01 13:01:29 +00:00
|
|
|
put_cpu();
|
2010-06-30 04:25:15 +00:00
|
|
|
cpumask_clear(vcpu->arch.wbinvd_dirty_mask);
|
2010-11-01 13:01:29 +00:00
|
|
|
} else
|
|
|
|
wbinvd();
|
2010-06-30 04:25:15 +00:00
|
|
|
return X86EMUL_CONTINUE;
|
|
|
|
}
|
2015-03-02 19:43:31 +00:00
|
|
|
|
|
|
|
int kvm_emulate_wbinvd(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.
Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:
- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
take precedence. The singlestep will be triggered again on the next
instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
which do not trigger singlestep traps as mentioned previously.
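A simplified model of the skip-then-single-step logic described above. It assumes RFLAGS.TF is bit 8 and samples TF before advancing RIP; model_vcpu, host_singlestep and model_skip_emulated_instruction() are hypothetical stand-ins, and the real helper goes through kvm_x86_ops and kvm_vcpu_check_singlestep rather than touching fields directly.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_EFLAGS_TF (1u << 8)	/* x86 RFLAGS.TF (trap flag) */

/* Hypothetical, heavily simplified vCPU state for illustration only. */
struct model_vcpu {
	uint64_t rip;
	uint64_t rflags;
	bool host_singlestep;	/* stands in for KVM_GUESTDBG_SINGLESTEP */
	bool db_pending;	/* a #DB was queued for the guest */
	bool exit_to_user;	/* a debug exit to userspace was requested */
};

/* Skip the emulated instruction, then honour single-stepping.
 * Returns 1 to keep running the guest, 0 to exit to userspace. */
static int model_skip_emulated_instruction(struct model_vcpu *v,
					   unsigned int insn_len)
{
	uint64_t old_rflags = v->rflags;	/* TF as it was before the skip */

	v->rip += insn_len;
	if (!(old_rflags & MODEL_EFLAGS_TF))
		return 1;
	if (v->host_singlestep) {
		v->exit_to_user = true;	/* debugger owns the single-step */
		return 0;
	}
	v->db_pending = true;	/* guest-visible single-step trap */
	return 1;
}
```

A return value of 0 corresponds to the KVM_GUESTDBG_SINGLESTEP exit to userspace; otherwise either nothing happens or a #DB is queued and the guest keeps running.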
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-29 20:40:40 +00:00
|
|
|
kvm_emulate_wbinvd_noskip(vcpu);
|
|
|
|
return kvm_skip_emulated_instruction(vcpu);
|
2015-03-02 19:43:31 +00:00
|
|
|
}
|
2010-06-30 04:25:15 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_emulate_wbinvd);
|
|
|
|
|
2015-03-02 19:43:31 +00:00
|
|
|
|
|
|
|
|
2011-04-20 12:53:23 +00:00
|
|
|
static void emulator_wbinvd(struct x86_emulate_ctxt *ctxt)
|
|
|
|
{
|
2015-03-02 19:43:31 +00:00
|
|
|
kvm_emulate_wbinvd_noskip(emul_to_vcpu(ctxt));
|
2011-04-20 12:53:23 +00:00
|
|
|
}
|
|
|
|
|
2015-03-13 09:39:45 +00:00
|
|
|
static int emulator_get_dr(struct x86_emulate_ctxt *ctxt, int dr,
|
|
|
|
unsigned long *dest)
|
2007-10-30 17:44:21 +00:00
|
|
|
{
|
2014-10-02 22:10:05 +00:00
|
|
|
return kvm_get_dr(emul_to_vcpu(ctxt), dr, dest);
|
2007-10-30 17:44:21 +00:00
|
|
|
}
|
|
|
|
|
2015-03-13 09:39:45 +00:00
|
|
|
static int emulator_set_dr(struct x86_emulate_ctxt *ctxt, int dr,
|
|
|
|
unsigned long value)
|
2007-10-30 17:44:21 +00:00
|
|
|
{
|
2010-04-28 16:15:32 +00:00
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
return __kvm_set_dr(emul_to_vcpu(ctxt), dr, value);
|
2007-10-30 17:44:21 +00:00
|
|
|
}
|
|
|
|
|
2010-03-18 13:20:03 +00:00
|
|
|
static u64 mk_cr_64(u64 curr_cr, u32 new_val)
|
2008-06-27 17:58:02 +00:00
|
|
|
{
|
2010-03-18 13:20:03 +00:00
|
|
|
/* keep the upper 32 bits of curr_cr and replace the lower 32 with new_val */
return (curr_cr & ~((1ULL << 32) - 1)) | new_val;
|
2008-06-27 17:58:02 +00:00
|
|
|
}
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
static unsigned long emulator_get_cr(struct x86_emulate_ctxt *ctxt, int cr)
|
2007-10-30 17:44:21 +00:00
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
|
2010-03-18 13:20:03 +00:00
|
|
|
unsigned long value;
|
|
|
|
|
|
|
|
switch (cr) {
|
|
|
|
case 0:
|
|
|
|
value = kvm_read_cr0(vcpu);
|
|
|
|
break;
|
|
|
|
case 2:
|
|
|
|
value = vcpu->arch.cr2;
|
|
|
|
break;
|
|
|
|
case 3:
|
2010-12-05 15:30:00 +00:00
|
|
|
value = kvm_read_cr3(vcpu);
|
2010-03-18 13:20:03 +00:00
|
|
|
break;
|
|
|
|
case 4:
|
|
|
|
value = kvm_read_cr4(vcpu);
|
|
|
|
break;
|
|
|
|
case 8:
|
|
|
|
value = kvm_get_cr8(vcpu);
|
|
|
|
break;
|
|
|
|
default:
|
KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers
Introduces a couple of print functions, which are essentially wrappers
around standard printk functions, with a KVM: prefix.
Functions introduced or modified are:
- kvm_err(fmt, ...)
- kvm_info(fmt, ...)
- kvm_debug(fmt, ...)
- kvm_pr_unimpl(fmt, ...)
- pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)
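In the kernel these helpers wrap printk-family functions; the sketch below is a hypothetical user-space analogue showing the technique: the prefix is pasted onto the caller's format string at compile time and the variadic arguments are forwarded unchanged. model_kvm_err() formats into a buffer so the result can be checked, and model_kvm_err_print() prints to stderr; both names are illustrative, and ##__VA_ARGS__ is a GNU extension supported by GCC and Clang.

```c
#include <stdio.h>

/* "kvm: " fmt relies on string-literal concatenation, so fmt must be a
 * literal; ## drops the trailing comma when no arguments are passed. */
#define model_kvm_err(buf, len, fmt, ...) \
	snprintf((buf), (len), "kvm: " fmt, ##__VA_ARGS__)

#define model_kvm_err_print(fmt, ...) \
	fprintf(stderr, "kvm: " fmt, ##__VA_ARGS__)
```

For example, model_kvm_err_print("%s: unexpected cr %u\n", __func__, 5u) writes a line starting with "kvm: " to stderr.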
Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-03 18:17:48 +00:00
|
|
|
kvm_err("%s: unexpected cr %u\n", __func__, cr);
|
2010-03-18 13:20:03 +00:00
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
return value;
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
static int emulator_set_cr(struct x86_emulate_ctxt *ctxt, int cr, ulong val)
|
2010-03-18 13:20:03 +00:00
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
|
2010-04-28 16:15:31 +00:00
|
|
|
int res = 0;
|
|
|
|
|
2010-03-18 13:20:03 +00:00
|
|
|
switch (cr) {
|
|
|
|
case 0:
|
2010-06-10 14:02:14 +00:00
|
|
|
res = kvm_set_cr0(vcpu, mk_cr_64(kvm_read_cr0(vcpu), val));
|
2010-03-18 13:20:03 +00:00
|
|
|
break;
|
|
|
|
case 2:
|
|
|
|
vcpu->arch.cr2 = val;
|
|
|
|
break;
|
|
|
|
case 3:
|
2010-06-10 14:02:16 +00:00
|
|
|
res = kvm_set_cr3(vcpu, val);
|
2010-03-18 13:20:03 +00:00
|
|
|
break;
|
|
|
|
case 4:
|
2010-06-10 14:02:15 +00:00
|
|
|
res = kvm_set_cr4(vcpu, mk_cr_64(kvm_read_cr4(vcpu), val));
|
2010-03-18 13:20:03 +00:00
|
|
|
break;
|
|
|
|
case 8:
|
2010-12-21 10:12:00 +00:00
|
|
|
res = kvm_set_cr8(vcpu, val);
|
2010-03-18 13:20:03 +00:00
|
|
|
break;
|
|
|
|
default:
|
2012-06-03 18:17:48 +00:00
|
|
|
kvm_err("%s: unexpected cr %u\n", __func__, cr);
|
2010-04-28 16:15:31 +00:00
|
|
|
res = -1;
|
2010-03-18 13:20:03 +00:00
|
|
|
}
|
2010-04-28 16:15:31 +00:00
|
|
|
|
|
|
|
return res;
|
2010-03-18 13:20:03 +00:00
|
|
|
}
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
static int emulator_get_cpl(struct x86_emulate_ctxt *ctxt)
|
2010-03-18 13:20:05 +00:00
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
return kvm_x86_ops->get_cpl(emul_to_vcpu(ctxt));
|
2010-03-18 13:20:05 +00:00
|
|
|
}
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
static void emulator_get_gdt(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt)
|
2010-03-18 13:20:16 +00:00
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
kvm_x86_ops->get_gdt(emul_to_vcpu(ctxt), dt);
|
2010-03-18 13:20:16 +00:00
|
|
|
}
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
static void emulator_get_idt(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt)
|
2010-08-04 02:44:24 +00:00
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
kvm_x86_ops->get_idt(emul_to_vcpu(ctxt), dt);
|
2010-08-04 02:44:24 +00:00
|
|
|
}
|
|
|
|
|
2011-04-20 12:12:00 +00:00
|
|
|
static void emulator_set_gdt(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt)
|
|
|
|
{
|
|
|
|
kvm_x86_ops->set_gdt(emul_to_vcpu(ctxt), dt);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void emulator_set_idt(struct x86_emulate_ctxt *ctxt, struct desc_ptr *dt)
|
|
|
|
{
|
|
|
|
kvm_x86_ops->set_idt(emul_to_vcpu(ctxt), dt);
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
static unsigned long emulator_get_cached_segment_base(
|
|
|
|
struct x86_emulate_ctxt *ctxt, int seg)
|
2010-04-28 16:15:29 +00:00
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
return get_segment_base(emul_to_vcpu(ctxt), seg);
|
2010-04-28 16:15:29 +00:00
|
|
|
}
|
|
|
|
|
2011-04-27 10:20:30 +00:00
|
|
|
static bool emulator_get_segment(struct x86_emulate_ctxt *ctxt, u16 *selector,
|
|
|
|
struct desc_struct *desc, u32 *base3,
|
|
|
|
int seg)
|
2010-03-18 13:20:16 +00:00
|
|
|
{
|
|
|
|
struct kvm_segment var;
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
kvm_get_segment(emul_to_vcpu(ctxt), &var, seg);
|
2011-04-27 10:20:30 +00:00
|
|
|
*selector = var.selector;
|
2010-03-18 13:20:16 +00:00
|
|
|
|
2013-01-21 13:36:48 +00:00
|
|
|
if (var.unusable) {
|
|
|
|
memset(desc, 0, sizeof(*desc));
|
2017-05-18 17:37:30 +00:00
|
|
|
if (base3)
|
|
|
|
*base3 = 0;
|
2010-03-18 13:20:16 +00:00
|
|
|
return false;
|
2013-01-21 13:36:48 +00:00
|
|
|
}
|
2010-03-18 13:20:16 +00:00
|
|
|
|
|
|
|
/* G=1: the descriptor limit counts 4 KiB units, so drop the low 12 bits */
if (var.g)
|
|
|
|
var.limit >>= 12;
|
|
|
|
set_desc_limit(desc, var.limit);
|
|
|
|
set_desc_base(desc, (unsigned long)var.base);
|
2011-03-07 12:55:06 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
if (base3)
|
|
|
|
*base3 = var.base >> 32;
|
|
|
|
#endif
|
2010-03-18 13:20:16 +00:00
|
|
|
desc->type = var.type;
|
|
|
|
desc->s = var.s;
|
|
|
|
desc->dpl = var.dpl;
|
|
|
|
desc->p = var.present;
|
|
|
|
desc->avl = var.avl;
|
|
|
|
desc->l = var.l;
|
|
|
|
desc->d = var.db;
|
|
|
|
desc->g = var.g;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
2011-04-27 10:20:30 +00:00
|
|
|
static void emulator_set_segment(struct x86_emulate_ctxt *ctxt, u16 selector,
|
|
|
|
struct desc_struct *desc, u32 base3,
|
|
|
|
int seg)
|
2010-03-18 13:20:16 +00:00
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
|
2010-03-18 13:20:16 +00:00
|
|
|
struct kvm_segment var;
|
|
|
|
|
2011-04-27 10:20:30 +00:00
|
|
|
var.selector = selector;
|
2010-03-18 13:20:16 +00:00
|
|
|
var.base = get_desc_base(desc);
|
2011-03-07 12:55:06 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
var.base |= ((u64)base3) << 32;
|
|
|
|
#endif
|
2010-03-18 13:20:16 +00:00
|
|
|
var.limit = get_desc_limit(desc);
|
|
|
|
/* G=1: convert the 4 KiB-unit limit back to bytes, filling the low 12 bits */
if (desc->g)
|
|
|
|
var.limit = (var.limit << 12) | 0xfff;
|
|
|
|
var.type = desc->type;
|
|
|
|
var.dpl = desc->dpl;
|
|
|
|
var.db = desc->d;
|
|
|
|
var.s = desc->s;
|
|
|
|
var.l = desc->l;
|
|
|
|
var.g = desc->g;
|
|
|
|
var.avl = desc->avl;
|
|
|
|
var.present = desc->p;
|
|
|
|
var.unusable = !var.present;
|
|
|
|
var.padding = 0;
|
|
|
|
|
|
|
|
kvm_set_segment(vcpu, &var, seg);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
|
|
|
|
u32 msr_index, u64 *pdata)
|
|
|
|
{
|
2015-04-08 13:30:38 +00:00
|
|
|
struct msr_data msr;
|
|
|
|
int r;
|
|
|
|
|
|
|
|
msr.index = msr_index;
|
|
|
|
msr.host_initiated = false;
|
|
|
|
r = kvm_get_msr(emul_to_vcpu(ctxt), &msr);
|
|
|
|
if (r)
|
|
|
|
return r;
|
|
|
|
|
|
|
|
*pdata = msr.data;
|
|
|
|
return 0;
|
2011-04-20 10:37:53 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int emulator_set_msr(struct x86_emulate_ctxt *ctxt,
|
|
|
|
u32 msr_index, u64 data)
|
|
|
|
{
|
2012-11-29 20:42:12 +00:00
|
|
|
struct msr_data msr;
|
|
|
|
|
|
|
|
msr.data = data;
|
|
|
|
msr.index = msr_index;
|
|
|
|
msr.host_initiated = false;
|
|
|
|
return kvm_set_msr(emul_to_vcpu(ctxt), &msr);
|
2011-04-20 10:37:53 +00:00
|
|
|
}
|
|
|
|
|
2015-05-07 09:36:11 +00:00
|
|
|
static u64 emulator_get_smbase(struct x86_emulate_ctxt *ctxt)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
|
|
|
|
|
|
|
|
return vcpu->arch.smbase;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void emulator_set_smbase(struct x86_emulate_ctxt *ctxt, u64 smbase)
|
|
|
|
{
|
|
|
|
struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
|
|
|
|
|
|
|
|
vcpu->arch.smbase = smbase;
|
|
|
|
}
|
|
|
|
|
2014-06-02 15:34:09 +00:00
|
|
|
static int emulator_check_pmc(struct x86_emulate_ctxt *ctxt,
|
|
|
|
u32 pmc)
|
|
|
|
{
|
2015-06-19 11:44:45 +00:00
|
|
|
return kvm_pmu_is_valid_msr_idx(emul_to_vcpu(ctxt), pmc);
|
2014-06-02 15:34:09 +00:00
|
|
|
}
|
|
|
|
|
2011-11-10 12:57:30 +00:00
|
|
|
static int emulator_read_pmc(struct x86_emulate_ctxt *ctxt,
|
|
|
|
u32 pmc, u64 *pdata)
|
|
|
|
{
|
2015-06-19 11:44:45 +00:00
|
|
|
return kvm_pmu_rdpmc(emul_to_vcpu(ctxt), pmc, pdata);
|
2011-11-10 12:57:30 +00:00
|
|
|
}
|
|
|
|
|
2011-04-20 12:43:05 +00:00
|
|
|
static void emulator_halt(struct x86_emulate_ctxt *ctxt)
|
|
|
|
{
|
|
|
|
emul_to_vcpu(ctxt)->arch.halt_request = 1;
|
|
|
|
}
|
|
|
|
|
2011-04-20 10:37:53 +00:00
|
|
|
static int emulator_intercept(struct x86_emulate_ctxt *ctxt,
|
2011-04-04 10:39:27 +00:00
|
|
|
struct x86_instruction_info *info,
|
2011-04-04 10:39:22 +00:00
|
|
|
enum x86_intercept_stage stage)
|
|
|
|
{
|
2011-04-20 10:37:53 +00:00
|
|
|
return kvm_x86_ops->check_intercept(emul_to_vcpu(ctxt), info, stage);
|
2011-04-04 10:39:22 +00:00
|
|
|
}
|
|
|
|
|
2017-08-24 12:27:52 +00:00
|
|
|
static bool emulator_get_cpuid(struct x86_emulate_ctxt *ctxt,
|
|
|
|
u32 *eax, u32 *ebx, u32 *ecx, u32 *edx, bool check_limit)
|
2012-01-12 15:43:03 +00:00
|
|
|
{
|
2017-08-24 12:27:52 +00:00
|
|
|
return kvm_cpuid(emul_to_vcpu(ctxt), eax, ebx, ecx, edx, check_limit);
|
2012-01-12 15:43:03 +00:00
|
|
|
}
|
|
|
|
|
2012-08-27 20:46:17 +00:00
|
|
|
static ulong emulator_read_gpr(struct x86_emulate_ctxt *ctxt, unsigned reg)
|
|
|
|
{
|
|
|
|
return kvm_register_read(emul_to_vcpu(ctxt), reg);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void emulator_write_gpr(struct x86_emulate_ctxt *ctxt, unsigned reg, ulong val)
|
|
|
|
{
|
|
|
|
kvm_register_write(emul_to_vcpu(ctxt), reg, val);
|
|
|
|
}
|
|
|
|
|
2015-01-26 07:32:23 +00:00
|
|
|
static void emulator_set_nmi_mask(struct x86_emulate_ctxt *ctxt, bool masked)
|
|
|
|
{
|
|
|
|
kvm_x86_ops->set_nmi_mask(emul_to_vcpu(ctxt), masked);
|
|
|
|
}
|
|
|
|
|
KVM: x86: fix emulation of RSM and IRET instructions
On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
on hflags is reverted later on in x86_emulate_instruction where hflags are
overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
an instruction is emulated, this commit deletes emul_flags altogether and
makes the emulator access vcpu->arch.hflags using two new accessors. This
way all changes, on the emulator side as well as in functions called from
the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
More details on the bug and its manifestation with Windows and OVMF:
It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
I believe that the SMM part explains why we started seeing this only with
OVMF.
KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
later on in x86_emulate_instruction we overwrite arch.hflags with
ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
The AMD-specific hflag of interest here is HF_NMI_MASK.
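A toy model of the lost update described above, assuming a single hflags word and arbitrary bit values for the NMI and SMM masks. model_rsm_copy_back() imitates the old flow, where the emulator works on a private emul_flags copy that is written back after the vendor hook has already changed vcpu->hflags; model_rsm_accessors() imitates the fix, where get/set accessors read and write vcpu->hflags directly. All names are hypothetical.

```c
#include <stdbool.h>

#define MODEL_HF_NMI_MASK (1u << 3)	/* arbitrary bit for this model */
#define MODEL_HF_SMM_MASK (1u << 6)	/* arbitrary bit for this model */

struct hf_vcpu { unsigned hflags; };
struct hf_ctxt { struct hf_vcpu *vcpu; unsigned emul_flags; };

/* Vendor hook reached through the vcpu: updates vcpu state directly. */
static void model_set_nmi_mask(struct hf_vcpu *v, bool masked)
{
	if (masked)
		v->hflags |= MODEL_HF_NMI_MASK;
	else
		v->hflags &= ~MODEL_HF_NMI_MASK;
}

/* Old flow: snapshot, modify the copy, call the hook, write back. */
static void model_rsm_copy_back(struct hf_ctxt *c)
{
	c->emul_flags = c->vcpu->hflags;	/* snapshot */
	c->emul_flags &= ~MODEL_HF_SMM_MASK;	/* leave SMM in the copy */
	model_set_nmi_mask(c->vcpu, false);	/* lands on vcpu->hflags */
	c->vcpu->hflags = c->emul_flags;	/* write-back undoes it */
}

/* New flow: accessors operate on vcpu->hflags, so nothing is lost. */
static unsigned model_get_hflags(struct hf_ctxt *c) { return c->vcpu->hflags; }
static void model_set_hflags(struct hf_ctxt *c, unsigned f) { c->vcpu->hflags = f; }

static void model_rsm_accessors(struct hf_ctxt *c)
{
	model_set_hflags(c, model_get_hflags(c) & ~MODEL_HF_SMM_MASK);
	model_set_nmi_mask(c->vcpu, false);
}
```

In the copy-back flow NMIs stay masked after RSM, which matches the state that left the other vCPUs unable to receive the reboot NMI.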
When rebooting the system, Windows sends an NMI IPI to all but the current
cpu to shut them down. Only after all of them are parked in HLT will the
initiating cpu finish the restart. If NMI is masked, other cpus never get
the memo and the initiating cpu spins forever, waiting for
hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
Fixes: a584539b24b8 ("KVM: x86: pass the whole hflags field to emulator and back")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-25 14:42:44 +00:00
|
|
|
static unsigned emulator_get_hflags(struct x86_emulate_ctxt *ctxt)
|
|
|
|
{
|
|
|
|
return emul_to_vcpu(ctxt)->arch.hflags;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void emulator_set_hflags(struct x86_emulate_ctxt *ctxt, unsigned emul_flags)
|
|
|
|
{
|
|
|
|
kvm_set_hflags(emul_to_vcpu(ctxt), emul_flags);
|
|
|
|
}
|
|
|
|
|
2017-10-11 14:54:40 +00:00
|
|
|
static int emulator_pre_leave_smm(struct x86_emulate_ctxt *ctxt, u64 smbase)
|
|
|
|
{
|
|
|
|
return kvm_x86_ops->pre_leave_smm(emul_to_vcpu(ctxt), smbase);
|
|
|
|
}
|
|
|
|
|
2012-08-29 23:30:16 +00:00
|
|
|
static const struct x86_emulate_ops emulate_ops = {
|
2012-08-27 20:46:17 +00:00
|
|
|
.read_gpr = emulator_read_gpr,
|
|
|
|
.write_gpr = emulator_write_gpr,
|
2018-06-06 15:37:49 +00:00
|
|
|
.read_std = emulator_read_std,
|
|
|
|
.write_std = emulator_write_std,
|
2015-10-30 15:36:24 +00:00
|
|
|
.read_phys = kvm_read_guest_phys_system,
|
2010-02-10 12:21:32 +00:00
|
|
|
.fetch = kvm_fetch_guest_virt,
|
2007-10-30 17:44:21 +00:00
	.read_emulated = emulator_read_emulated,
	.write_emulated = emulator_write_emulated,
	.cmpxchg_emulated = emulator_cmpxchg_emulated,
2011-04-20 12:38:44 +00:00
	.invlpg = emulator_invlpg,
2010-03-18 13:20:23 +00:00
	.pio_in_emulated = emulator_pio_in_emulated,
	.pio_out_emulated = emulator_pio_out_emulated,
2011-04-27 10:20:30 +00:00
	.get_segment = emulator_get_segment,
	.set_segment = emulator_set_segment,
2010-04-28 16:15:29 +00:00
	.get_cached_segment_base = emulator_get_cached_segment_base,
2010-03-18 13:20:16 +00:00
	.get_gdt = emulator_get_gdt,
2010-08-04 02:44:24 +00:00
	.get_idt = emulator_get_idt,
2011-04-20 12:12:00 +00:00
	.set_gdt = emulator_set_gdt,
	.set_idt = emulator_set_idt,
2010-03-18 13:20:03 +00:00
	.get_cr = emulator_get_cr,
	.set_cr = emulator_set_cr,
2010-03-18 13:20:05 +00:00
	.cpl = emulator_get_cpl,
2010-04-28 16:15:27 +00:00
	.get_dr = emulator_get_dr,
	.set_dr = emulator_set_dr,
2015-05-07 09:36:11 +00:00
	.get_smbase = emulator_get_smbase,
	.set_smbase = emulator_set_smbase,
2011-04-20 10:37:53 +00:00
	.set_msr = emulator_set_msr,
	.get_msr = emulator_get_msr,
2014-06-02 15:34:09 +00:00
	.check_pmc = emulator_check_pmc,
2011-11-10 12:57:30 +00:00
	.read_pmc = emulator_read_pmc,
2011-04-20 12:43:05 +00:00
	.halt = emulator_halt,
2011-04-20 12:53:23 +00:00
	.wbinvd = emulator_wbinvd,
2011-04-20 12:47:13 +00:00
	.fix_hypercall = emulator_fix_hypercall,
2011-04-04 10:39:22 +00:00
	.intercept = emulator_intercept,
2012-01-12 15:43:03 +00:00
	.get_cpuid = emulator_get_cpuid,
2015-01-26 07:32:23 +00:00
	.set_nmi_mask = emulator_set_nmi_mask,
KVM: x86: fix emulation of RSM and IRET instructions
On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
on hflags is reverted later on in x86_emulate_instruction where hflags are
overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.
Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
an instruction is emulated, this commit deletes emul_flags altogether and
makes the emulator access vcpu->arch.hflags using two new accessors. This
way all changes, on the emulator side as well as in functions called from
the emulator and accessing vcpu state with emul_to_vcpu, are preserved.
More details on the bug and its manifestation with Windows and OVMF:
It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
I believe that the SMM part explains why we started seeing this only with
OVMF.
KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
later on in x86_emulate_instruction we overwrite arch.hflags with
ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
The AMD-specific hflag of interest here is HF_NMI_MASK.
When rebooting the system, Windows sends an NMI IPI to all but the current
cpu to shut them down. Only after all of them are parked in HLT will the
initiating cpu finish the restart. If NMI is masked, other cpus never get
the memo and the initiating cpu spins forever, waiting for
hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.
Fixes: a584539b24b8 ("KVM: x86: pass the whole hflags field to emulator and back")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-25 14:42:44 +00:00
	.get_hflags = emulator_get_hflags,
	.set_hflags = emulator_set_hflags,
2017-10-11 14:54:40 +00:00
	.pre_leave_smm = emulator_pre_leave_smm,
2007-10-30 17:44:21 +00:00
};

2010-04-28 16:15:43 +00:00
static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
{
2014-05-20 12:29:47 +00:00
	u32 int_shadow = kvm_x86_ops->get_interrupt_shadow(vcpu);
2010-04-28 16:15:43 +00:00
	/*
	 * An sti; sti; sequence only disables interrupts for the first
	 * instruction. So, if the last instruction, be it emulated or
	 * not, left the system with the INT_STI flag enabled, it
	 * means that the last instruction is an sti. We should not
	 * leave the flag on in this case. The same goes for mov ss.
	 */
2014-05-20 12:29:47 +00:00
	if (int_shadow & mask)
		mask = 0;
2014-03-27 10:29:28 +00:00
	if (unlikely(int_shadow || mask)) {
2010-04-28 16:15:43 +00:00
		kvm_x86_ops->set_interrupt_shadow(vcpu, mask);
2014-03-27 10:29:28 +00:00
		if (!mask)
			kvm_make_request(KVM_REQ_EVENT, vcpu);
	}
2010-04-28 16:15:43 +00:00
}

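The interrupt-shadow bookkeeping in toggle_interruptibility() reduces to a small pure decision. The sketch below restates it in userspace C under stated assumptions: SKETCH_SHADOW_STI and SKETCH_SHADOW_MOV_SS are hypothetical stand-in bits, not KVM's own shadow constants, and the function returns what the vCPU callbacks would have been asked to do instead of calling them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in bits; NOT the values used by KVM. */
#define SKETCH_SHADOW_STI    (1u << 0)
#define SKETCH_SHADOW_MOV_SS (1u << 1)

struct shadow_update {
	bool write_shadow;   /* would set_interrupt_shadow() be called? */
	uint32_t new_shadow; /* value that would be written */
	bool request_event;  /* would KVM_REQ_EVENT be raised? */
};

/* Mirrors the decision logic of toggle_interruptibility(). */
static struct shadow_update toggle_shadow_sketch(uint32_t int_shadow, uint32_t mask)
{
	struct shadow_update u = { false, 0, false };

	/* sti; sti: a shadow that already holds this bit is not re-armed. */
	if (int_shadow & mask)
		mask = 0;

	if (int_shadow || mask) {
		u.write_shadow = true;
		u.new_shadow = mask;
		/* Shadow dropped: pending interrupts may now be deliverable. */
		u.request_event = (mask == 0);
	}
	return u;
}
```

Calling it with a shadow that already contains the requested bit (an `sti; sti` pair) yields an update that clears the shadow and asks for an event re-check, which is the case the kernel comment describes.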
2014-09-04 17:46:15 +00:00
static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
2010-04-28 16:15:44 +00:00
{
	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
2010-11-22 15:53:21 +00:00
	if (ctxt->exception.vector == PF_VECTOR)
2014-09-04 17:46:15 +00:00
		return kvm_propagate_fault(vcpu, &ctxt->exception);

	if (ctxt->exception.error_code_valid)
2010-11-22 15:53:21 +00:00
		kvm_queue_exception_e(vcpu, ctxt->exception.vector,
				      ctxt->exception.error_code);
2010-04-28 16:15:44 +00:00
	else
2010-11-22 15:53:21 +00:00
		kvm_queue_exception(vcpu, ctxt->exception.vector);
2014-09-04 17:46:15 +00:00
	return false;
2010-04-28 16:15:44 +00:00
}

2010-08-15 21:47:01 +00:00
static void init_emulate_ctxt(struct kvm_vcpu *vcpu)
{
2011-05-25 02:06:16 +00:00
	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
2010-08-15 21:47:01 +00:00
	int cs_db, cs_l;

	kvm_x86_ops->get_cs_db_l_bits(vcpu, &cs_db, &cs_l);

2011-05-25 02:06:16 +00:00
	ctxt->eflags = kvm_get_rflags(vcpu);
2017-06-07 13:13:14 +00:00
	ctxt->tf = (ctxt->eflags & X86_EFLAGS_TF) != 0;

2011-05-25 02:06:16 +00:00
	ctxt->eip = kvm_rip_read(vcpu);
	ctxt->mode = (!is_protmode(vcpu)) ? X86EMUL_MODE_REAL :
		     (ctxt->eflags & X86_EFLAGS_VM) ? X86EMUL_MODE_VM86 :
2014-04-18 04:11:34 +00:00
		     (cs_l && is_long_mode(vcpu)) ? X86EMUL_MODE_PROT64 :
2011-05-25 02:06:16 +00:00
		     cs_db ? X86EMUL_MODE_PROT32 :
			     X86EMUL_MODE_PROT16;
2015-04-01 16:18:53 +00:00
	BUILD_BUG_ON(HF_GUEST_MASK != X86EMUL_GUEST_MASK);
2015-05-07 09:36:11 +00:00
	BUILD_BUG_ON(HF_SMM_MASK != X86EMUL_SMM_MASK);
	BUILD_BUG_ON(HF_SMM_INSIDE_NMI_MASK != X86EMUL_SMM_INSIDE_NMI_MASK);
2011-05-25 02:06:16 +00:00

2012-08-27 20:46:17 +00:00
	init_decode_cache(ctxt);
2011-03-31 10:06:41 +00:00
	vcpu->arch.emulate_regs_need_sync_from_vcpu = false;
2010-08-15 21:47:01 +00:00
}

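init_emulate_ctxt() picks the emulator mode with a nested ternary whose order matters: real mode first, then virtual-8086, then 64-bit only when both CS.L and long mode are set, and finally CS.D for 32-bit versus 16-bit. A minimal sketch of the same precedence, using a hypothetical enum in place of the X86EMUL_MODE_* values and plain booleans in place of the vCPU queries:

```c
#include <stdbool.h>

/* Hypothetical enum; the names only echo X86EMUL_MODE_* for readability. */
enum sketch_mode { MODE_REAL, MODE_VM86, MODE_PROT16, MODE_PROT32, MODE_PROT64 };

/* Same precedence as the ternary chain in init_emulate_ctxt(). */
static enum sketch_mode pick_mode(bool protmode, bool eflags_vm, bool cs_l,
				  bool long_mode, bool cs_db)
{
	if (!protmode)
		return MODE_REAL;
	if (eflags_vm)
		return MODE_VM86;
	if (cs_l && long_mode)
		return MODE_PROT64;
	return cs_db ? MODE_PROT32 : MODE_PROT16;
}
```

CS.L without long mode falls through to the CS.D check, so such a segment is treated as 32-bit or 16-bit code.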
2011-04-13 14:12:54 +00:00
int kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip)
2010-09-19 12:34:06 +00:00
{
2011-05-29 12:53:48 +00:00
	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
2010-09-19 12:34:06 +00:00
	int ret;

	init_emulate_ctxt(vcpu);

2011-06-01 12:34:25 +00:00
	ctxt->op_bytes = 2;
	ctxt->ad_bytes = 2;
	ctxt->_eip = ctxt->eip + inc_eip;
2011-05-29 12:53:48 +00:00
	ret = emulate_int_real(ctxt, irq);
2010-09-19 12:34:06 +00:00

	if (ret != X86EMUL_CONTINUE)
		return EMULATE_FAIL;

2011-06-01 12:34:25 +00:00
	ctxt->eip = ctxt->_eip;
2011-05-29 12:53:48 +00:00
	kvm_rip_write(vcpu, ctxt->eip);
	kvm_set_rflags(vcpu, ctxt->eflags);
2010-09-19 12:34:06 +00:00

	return EMULATE_DONE;
}
EXPORT_SYMBOL_GPL(kvm_inject_realmode_interrupt);

2018-03-12 11:12:49 +00:00
static int handle_emulation_failure(struct kvm_vcpu *vcpu, int emulation_type)
2010-05-10 08:16:56 +00:00
{
2010-11-29 16:51:49 +00:00
	int r = EMULATE_DONE;

2010-05-10 08:16:56 +00:00
	++vcpu->stat.insn_emulation_fail;
	trace_kvm_emulate_insn_failed(vcpu);
2018-03-12 11:12:49 +00:00

	if (emulation_type & EMULTYPE_NO_UD_ON_FAIL)
		return EMULATE_FAIL;

2014-09-16 23:50:50 +00:00
	if (!is_guest_mode(vcpu) && kvm_x86_ops->get_cpl(vcpu) == 0) {
2010-11-29 16:51:49 +00:00
		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
		vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
		vcpu->run->internal.ndata = 0;
2017-11-05 14:56:33 +00:00
		r = EMULATE_USER_EXIT;
2010-11-29 16:51:49 +00:00
	}
2018-03-12 11:12:49 +00:00

2010-05-10 08:16:56 +00:00
	kvm_queue_exception(vcpu, UD_VECTOR);
2010-11-29 16:51:49 +00:00

	return r;
2010-05-10 08:16:56 +00:00
}

2013-01-13 15:49:07 +00:00
static bool reexecute_instruction(struct kvm_vcpu *vcpu, gva_t cr2,
2013-04-11 09:10:51 +00:00
				  bool write_fault_to_shadow_pgtable,
				  int emulation_type)
2010-07-08 09:41:12 +00:00
{
2013-01-13 15:46:52 +00:00
	gpa_t gpa = cr2;
kvm: rename pfn_t to kvm_pfn_t
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
Finally, this need for "struct page" for persistent memory requires
memory capacity to store the memmap array. Given that the memmap array for a
large pool of persistent memory may exhaust available DRAM, introduce a
mechanism to allocate the memmap from persistent memory. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-01-16 00:56:11 +00:00
	kvm_pfn_t pfn;
2010-07-08 09:41:12 +00:00

2018-08-23 20:56:49 +00:00
	if (!(emulation_type & EMULTYPE_ALLOW_RETRY))
2013-04-11 09:10:51 +00:00
		return false;

2013-01-13 15:46:52 +00:00
	if (!vcpu->arch.mmu.direct_map) {
		/*
		 * Write permission should be allowed since only
		 * write accesses need to be emulated.
		 */
		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);
2010-07-08 09:41:12 +00:00

2013-01-13 15:46:52 +00:00
		/*
		 * If the mapping is invalid in the guest, let the CPU retry
		 * it to generate a fault.
		 */
		if (gpa == UNMAPPED_GVA)
			return true;
	}
2010-07-08 09:41:12 +00:00

2012-08-21 02:57:42 +00:00
	/*
	 * Do not retry the unhandleable instruction if it faults on
	 * read-only host memory; otherwise it will go into an infinite loop:
	 * retry instruction -> write #PF -> emulation fail -> retry
	 * instruction -> ...
	 */
	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
2013-01-13 15:46:52 +00:00

	/*
	 * If the instruction failed on the error pfn, it cannot be fixed;
	 * report the error to userspace.
	 */
	if (is_error_noslot_pfn(pfn))
		return false;

	kvm_release_pfn_clean(pfn);

	/* The instructions are well-emulated on direct mmu. */
	if (vcpu->arch.mmu.direct_map) {
		unsigned int indirect_shadow_pages;

		spin_lock(&vcpu->kvm->mmu_lock);
		indirect_shadow_pages = vcpu->kvm->arch.indirect_shadow_pages;
		spin_unlock(&vcpu->kvm->mmu_lock);

		if (indirect_shadow_pages)
			kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));

2010-07-08 09:41:12 +00:00
		return true;
2012-08-21 02:57:42 +00:00
	}
2010-07-08 09:41:12 +00:00

2013-01-13 15:46:52 +00:00
	/*
	 * If emulation was due to an access to a shadowed page table
	 * and it failed, try to unshadow the page and re-enter the
	 * guest to let the CPU execute the instruction.
	 */
	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
2013-01-13 15:49:07 +00:00

	/*
	 * If the access faults on its page table, it cannot
	 * be fixed by unprotecting the shadow page and it should
	 * be reported to userspace.
	 */
	return !write_fault_to_shadow_pgtable;
2010-07-08 09:41:12 +00:00
}

2011-09-22 09:02:48 +00:00
static bool retry_instruction(struct x86_emulate_ctxt *ctxt,
			      unsigned long cr2, int emulation_type)
{
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
	unsigned long last_retry_eip, last_retry_addr, gpa = cr2;

	last_retry_eip = vcpu->arch.last_retry_eip;
	last_retry_addr = vcpu->arch.last_retry_addr;

	/*
	 * If the emulation is caused by #PF and the instruction does not
	 * write page tables, the VM exit was caused by shadow page
	 * protection, so we can zap the shadow page and retry the
	 * instruction directly.
	 *
	 * Note: if the guest uses a non-page-table modifying instruction
	 * on the PDE that points to the instruction, then we will unmap
	 * the instruction and go to an infinite loop. So, we cache the
	 * last retried eip and the last fault address; if we see the same
	 * eip and address again, we can break out of the potential
	 * infinite loop.
	 */
	vcpu->arch.last_retry_eip = vcpu->arch.last_retry_addr = 0;

2018-08-23 20:56:49 +00:00
	if (!(emulation_type & EMULTYPE_ALLOW_RETRY))
2011-09-22 09:02:48 +00:00
		return false;

	if (x86_page_table_writing_insn(ctxt))
		return false;

	if (ctxt->eip == last_retry_eip && last_retry_addr == cr2)
		return false;

	vcpu->arch.last_retry_eip = ctxt->eip;
	vcpu->arch.last_retry_addr = cr2;

	if (!vcpu->arch.mmu.direct_map)
		gpa = kvm_mmu_gva_to_gpa_write(vcpu, cr2, NULL);

2013-01-13 15:44:12 +00:00
	kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(gpa));
2011-09-22 09:02:48 +00:00

	return true;
}

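The loop breaker in retry_instruction() is easy to misread: the cached (eip, address) pair is cleared on every call and only re-armed when a retry is granted. The hypothetical should_retry() below models just that cache; writes_page_table stands in for x86_page_table_writing_insn(), and the EMULTYPE_ALLOW_RETRY and MMU steps are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical per-vCPU cache, modelled on last_retry_eip/last_retry_addr. */
struct retry_cache {
	unsigned long last_eip;
	unsigned long last_addr;
};

/*
 * Returns true if the faulting instruction should simply be retried.
 * As in retry_instruction(), the cache is cleared on every call and is
 * only re-armed when a retry is granted, so an identical (eip, addr)
 * pair seen twice in a row is refused the second time.
 */
static bool should_retry(struct retry_cache *c, unsigned long eip,
			 unsigned long addr, bool writes_page_table)
{
	unsigned long prev_eip = c->last_eip, prev_addr = c->last_addr;

	c->last_eip = c->last_addr = 0;

	if (writes_page_table)
		return false;
	if (eip == prev_eip && addr == prev_addr)
		return false;

	c->last_eip = eip;
	c->last_addr = addr;
	return true;
}
```

One consequence shows up in the tests: the same faulting pair is retried, refused, and then retried again, because a refusal leaves the cache empty.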
2012-09-03 12:24:26 +00:00
static int complete_emulated_mmio(struct kvm_vcpu *vcpu);
static int complete_emulated_pio(struct kvm_vcpu *vcpu);

2015-05-07 09:36:11 +00:00
static void kvm_smm_changed(struct kvm_vcpu *vcpu)
2015-04-01 16:18:53 +00:00
{
2015-05-07 09:36:11 +00:00
	if (!(vcpu->arch.hflags & HF_SMM_MASK)) {
2015-05-05 09:50:23 +00:00
		/* This is a good place to trace that we are exiting SMM. */
		trace_kvm_enter_smm(vcpu->vcpu_id, vcpu->arch.smbase, false);

2016-06-01 20:26:00 +00:00
		/* Process a latched INIT or SMI, if any. */
		kvm_make_request(KVM_REQ_EVENT, vcpu);
2015-05-07 09:36:11 +00:00
	}
2015-05-18 13:03:39 +00:00

	kvm_mmu_reset_context(vcpu);
2015-05-07 09:36:11 +00:00
}

static void kvm_set_hflags(struct kvm_vcpu *vcpu, unsigned emul_flags)
{
	unsigned changed = vcpu->arch.hflags ^ emul_flags;

2015-04-01 16:18:53 +00:00
	vcpu->arch.hflags = emul_flags;
2015-05-07 09:36:11 +00:00

	if (changed & HF_SMM_MASK)
		kvm_smm_changed(vcpu);
2015-04-01 16:18:53 +00:00
}

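kvm_set_hflags() detects an SMM transition by XOR-ing the old and new flag words, so a single test catches both entry and exit. A minimal sketch of that pattern, with a hypothetical SKETCH_SMM_MASK bit standing in for HF_SMM_MASK (whose real value is not reproduced here) and a return value in place of the kvm_smm_changed() call:

```c
#include <stdbool.h>

/* Hypothetical bit position; not the real HF_SMM_MASK value. */
#define SKETCH_SMM_MASK (1u << 6)

/*
 * XOR the old and new flag words to find every bit that flipped, store
 * the new flags, then report whether the SMM bit was among them.
 */
static bool set_flags_sketch(unsigned *flags, unsigned new_flags)
{
	unsigned changed = *flags ^ new_flags;

	*flags = new_flags;
	return (changed & SKETCH_SMM_MASK) != 0;
}
```

Changes to unrelated bits leave the result false, which is why kvm_smm_changed() only runs on a real SMM transition.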
2013-05-30 09:48:30 +00:00
static int kvm_vcpu_check_hw_bp(unsigned long addr, u32 type, u32 dr7,
				unsigned long *db)
{
	u32 dr6 = 0;
	int i;
	u32 enable, rwlen;

	enable = dr7;
	rwlen = dr7 >> 16;
	for (i = 0; i < 4; i++, enable >>= 2, rwlen >>= 4)
		if ((enable & 3) && (rwlen & 15) == type && db[i] == addr)
			dr6 |= (1 << i);
	return dr6;
}

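kvm_vcpu_check_hw_bp() walks DR7 two enable bits and four R/W+LEN bits at a time. The userspace restatement below keeps the same loop so the bit layout can be exercised directly; the function name and the sample DR0..DR3 addresses in the tests are illustrative only.

```c
#include <stdint.h>

/*
 * For breakpoint i, DR7 bits 2i..2i+1 are the local/global enables and
 * bits 16+4i..19+4i hold R/W (low two bits) and LEN (high two bits).
 * "type" is compared against that whole 4-bit field, so an instruction
 * breakpoint (R/W = 00, LEN = 00) matches type 0.
 */
static uint32_t check_hw_bp_sketch(unsigned long addr, uint32_t type,
				   uint32_t dr7, const unsigned long db[4])
{
	uint32_t dr6 = 0;
	uint32_t enable = dr7, rwlen = dr7 >> 16;

	for (int i = 0; i < 4; i++, enable >>= 2, rwlen >>= 4)
		if ((enable & 3) && (rwlen & 15) == type && db[i] == addr)
			dr6 |= 1u << i; /* B0..B3 hit bits */
	return dr6;
}
```

With G0 and G2 both set and DR0 = DR2 = the faulting address, both hit bits are reported (0x5), matching how the kernel loop can flag several breakpoints at once.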
2017-06-07 13:13:14 +00:00
static void kvm_vcpu_do_singlestep(struct kvm_vcpu *vcpu, int *r)
2013-06-25 16:32:07 +00:00
{
	struct kvm_run *kvm_run = vcpu->run;

2017-06-07 13:13:14 +00:00
	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP) {
		kvm_run->debug.arch.dr6 = DR6_BS | DR6_FIXED_1 | DR6_RTM;
		kvm_run->debug.arch.pc = vcpu->arch.singlestep_rip;
		kvm_run->debug.arch.exception = DB_VECTOR;
		kvm_run->exit_reason = KVM_EXIT_DEBUG;
		*r = EMULATE_USER_EXIT;
	} else {
		/*
		 * "Certain debug exceptions may clear bit 0-3. The
		 * remaining contents of the DR6 register are never
		 * cleared by the processor".
		 */
		vcpu->arch.dr6 &= ~15;
		vcpu->arch.dr6 |= DR6_BS | DR6_RTM;
		kvm_queue_exception(vcpu, DB_VECTOR);
2013-06-25 16:32:07 +00:00
	}
}

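When the single-step trap is delivered to the guest, kvm_vcpu_do_singlestep() clears only B0..B3 and then sets BS and RTM, leaving every other DR6 bit alone. A minimal sketch of that update, assuming the Intel SDM DR6 layout (B0..B3 in bits 0-3, BS in bit 14, RTM in bit 16) with locally defined constants rather than KVM's DR6_* macros:

```c
#include <stdint.h>

/* DR6 bit positions per the Intel SDM; defined locally as assumptions. */
#define SKETCH_DR6_B0_B3 0xfu       /* breakpoint-hit bits B0..B3 */
#define SKETCH_DR6_BS    (1u << 14) /* single-step */
#define SKETCH_DR6_RTM   (1u << 16) /* clear only for #DB inside an RTM region */

/* Same update as the guest-injection branch of kvm_vcpu_do_singlestep(). */
static uint32_t singlestep_dr6_sketch(uint32_t dr6)
{
	dr6 &= ~SKETCH_DR6_B0_B3;
	dr6 |= SKETCH_DR6_BS | SKETCH_DR6_RTM;
	return dr6;
}
```

A previously latched BD bit (bit 13) survives the update, as the quoted SDM sentence in the kernel comment requires.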
KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.
Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:
- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
take precedence. The singlestep will be triggered again on the next
instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
which do not trigger singlestep traps as mentioned previously.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-29 20:40:40 +00:00
int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
{
	unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
	int r = EMULATE_DONE;

	kvm_x86_ops->skip_emulated_instruction(vcpu);
2017-06-07 13:13:14 +00:00

	/*
	 * rflags is the old, "raw" value of the flags. The new value has
	 * not been saved yet.
	 *
	 * This is correct even for TF set by the guest, because "the
	 * processor will not generate this exception after the instruction
	 * that sets the TF flag".
	 */
	if (unlikely(rflags & X86_EFLAGS_TF))
		kvm_vcpu_do_singlestep(vcpu, &r);
2016-11-29 20:40:40 +00:00
	return r == EMULATE_DONE;
}
EXPORT_SYMBOL_GPL(kvm_skip_emulated_instruction);

2013-05-30 09:48:30 +00:00
static bool kvm_vcpu_check_breakpoint(struct kvm_vcpu *vcpu, int *r)
{
	if (unlikely(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) &&
	    (vcpu->arch.guest_debug_dr7 & DR7_BP_EN_MASK)) {
2014-11-02 09:54:45 +00:00
		struct kvm_run *kvm_run = vcpu->run;
		unsigned long eip = kvm_get_linear_rip(vcpu);
		u32 dr6 = kvm_vcpu_check_hw_bp(eip, 0,
2013-05-30 09:48:30 +00:00
					   vcpu->arch.guest_debug_dr7,
					   vcpu->arch.eff_db);

		if (dr6 != 0) {
2014-07-15 14:37:46 +00:00
			kvm_run->debug.arch.dr6 = dr6 | DR6_FIXED_1 | DR6_RTM;
2014-11-02 09:54:45 +00:00
			kvm_run->debug.arch.pc = eip;
2013-05-30 09:48:30 +00:00
			kvm_run->debug.arch.exception = DB_VECTOR;
			kvm_run->exit_reason = KVM_EXIT_DEBUG;
			*r = EMULATE_USER_EXIT;
			return true;
		}
	}

2014-07-16 22:19:31 +00:00
	if (unlikely(vcpu->arch.dr7 & DR7_BP_EN_MASK) &&
	    !(kvm_get_rflags(vcpu) & X86_EFLAGS_RF)) {
2014-11-02 09:54:45 +00:00
		unsigned long eip = kvm_get_linear_rip(vcpu);
		u32 dr6 = kvm_vcpu_check_hw_bp(eip, 0,
2013-05-30 09:48:30 +00:00
					   vcpu->arch.dr7,
					   vcpu->arch.db);

		if (dr6 != 0) {
			vcpu->arch.dr6 &= ~15;
2014-07-15 14:37:46 +00:00
			vcpu->arch.dr6 |= dr6 | DR6_RTM;
2013-05-30 09:48:30 +00:00
			kvm_queue_exception(vcpu, DB_VECTOR);
			*r = EMULATE_DONE;
			return true;
		}
	}

	return false;
}

2018-03-12 11:12:50 +00:00
static bool is_vmware_backdoor_opcode(struct x86_emulate_ctxt *ctxt)
{
2018-03-12 11:12:53 +00:00
	switch (ctxt->opcode_len) {
	case 1:
		switch (ctxt->b) {
		case 0xe4: /* IN */
		case 0xe5:
		case 0xec:
		case 0xed:
		case 0xe6: /* OUT */
		case 0xe7:
		case 0xee:
		case 0xef:
		case 0x6c: /* INS */
		case 0x6d:
		case 0x6e: /* OUTS */
		case 0x6f:
			return true;
		}
		break;
	case 2:
		switch (ctxt->b) {
		case 0x33: /* RDPMC */
			return true;
		}
		break;
2018-03-12 11:12:50 +00:00
	}

	return false;
}

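is_vmware_backdoor_opcode() accepts the port-I/O opcodes and RDPMC. Because the one-byte cases come in contiguous groups of four, the same set can be tested with ranges. The sketch below does that in userspace C, taking the opcode length and final opcode byte as plain arguments, a simplification of the x86_emulate_ctxt fields opcode_len and b.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Range-check rewrite of is_vmware_backdoor_opcode(): opcode_len is the
 * number of opcode bytes (1 = one-byte map, 2 = 0x0F-prefixed map) and
 * b is the final opcode byte.
 */
static bool vmware_backdoor_opcode_sketch(int opcode_len, uint8_t b)
{
	if (opcode_len == 1)
		return (b >= 0xe4 && b <= 0xe7) || /* IN/OUT, imm8 port */
		       (b >= 0xec && b <= 0xef) || /* IN/OUT, port in DX */
		       (b >= 0x6c && b <= 0x6f);   /* INS/OUTS */
	if (opcode_len == 2)
		return b == 0x33;                  /* 0F 33: RDPMC */
	return false;
}
```

The explicit case list in the kernel reads more like an opcode table; the range form is only shorter, and both accept exactly the same twelve one-byte opcodes plus 0F 33.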
2010-12-21 10:12:02 +00:00
int x86_emulate_instruction(struct kvm_vcpu *vcpu,
			    unsigned long cr2,
2010-12-21 10:12:07 +00:00
			    int emulation_type,
			    void *insn,
			    int insn_len)
2007-10-30 17:44:21 +00:00
{
2010-04-28 16:15:43 +00:00
	int r;
2011-05-29 12:53:48 +00:00
	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
2011-03-31 10:06:41 +00:00
	bool writeback = true;
2013-01-13 15:49:07 +00:00
	bool write_fault_to_spt = vcpu->arch.write_fault_to_shadow_pgtable;
2007-10-30 17:44:21 +00:00
x86/KVM/VMX: Add L1D flush logic
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flag is set:
- Always, if the flush module parameter is 'always'
- Conditionally at:
- Entry to vcpu_run(), i.e. after executing user space
- From the sched_in notifier, i.e. when switching to a vCPU thread.
- From vmexit handlers which are considered unsafe, i.e. where
sensitive data can be brought into L1D:
- The emulator, which could be a good target for other speculative
execution-based threats,
- The MMU, which can bring host page tables in the L1 cache.
- External interrupts
- Nested operations that require the MMU (see above). That is
vmptrld, vmptrst, vmclear,vmwrite,vmread.
- When handling invept,invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-02 11:07:14 +00:00
	vcpu->arch.l1tf_flush_l1d = true;

2013-01-13 15:49:07 +00:00
	/*
	 * Clear write_fault_to_shadow_pgtable here to ensure it is
	 * never reused.
	 */
	vcpu->arch.write_fault_to_shadow_pgtable = false;
2008-07-03 11:59:22 +00:00
	kvm_clear_exception_queue(vcpu);
2011-04-12 09:36:21 +00:00

KVM: x86 emulator: Only allow VMCALL/VMMCALL trapped by #UD
When executing a test program called "crashme", we found the KVM guest cannot
survive more than ten seconds, then encountered a kernel panic. The basic concept
of "crashme" is generating random assembly code and trying to execute it.
After some fixes on emulator insn validity judgment, we found it's hard to
get the current emulator handle the invalid instructions correctly, for the
#UD trap for hypercall patching caused troubles. The problem is, if the opcode
itself was OK, but combination of opcode and modrm_reg was invalid, and one
operand of the opcode was memory (SrcMem or DstMem), the emulator will fetch
the memory operand first rather than checking the validity, and may encounter
an error there. For example, ".byte 0xfe, 0x34, 0xcd" has this problem.
In the patch, we simply check that if the invalid opcode wasn't vmcall/vmmcall,
then return from emulate_instruction() and inject a #UD to guest. With the
patch, the guest had been running for more than 12 hours.
Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-02 06:49:22 +00:00
	if (!(emulation_type & EMULTYPE_NO_DECODE)) {
2010-08-15 21:47:01 +00:00
		init_emulate_ctxt(vcpu);
2013-05-30 09:48:30 +00:00

		/*
		 * We will reenter on the same instruction since
		 * we do not set complete_userspace_io. This does not
		 * handle watchpoints yet; those would be handled in
		 * the emulate_ops.
		 */
2018-01-25 15:37:07 +00:00
		if (!(emulation_type & EMULTYPE_SKIP) &&
		    kvm_vcpu_check_breakpoint(vcpu, &r))
2013-05-30 09:48:30 +00:00
			return r;

2011-05-29 12:53:48 +00:00
		ctxt->interruptibility = 0;
		ctxt->have_exception = false;
2014-08-20 08:08:23 +00:00
		ctxt->exception.vector = -1;
2011-05-29 12:53:48 +00:00
		ctxt->perm_ok = false;
2007-10-30 17:44:21 +00:00

2013-09-22 14:44:52 +00:00
		ctxt->ud = emulation_type & EMULTYPE_TRAP_UD;
2011-02-01 14:32:04 +00:00

2011-05-29 12:53:48 +00:00
		r = x86_decode_insn(ctxt, insn, insn_len);
2007-10-30 17:44:21 +00:00

2010-04-11 10:05:16 +00:00
		trace_kvm_emulate_insn_start(vcpu);
2007-11-18 13:17:51 +00:00
		++vcpu->stat.insn_emulation;
2011-07-30 09:03:34 +00:00
		if (r != EMULATION_OK) {
2011-02-01 14:32:04 +00:00
			if (emulation_type & EMULTYPE_TRAP_UD)
				return EMULATE_FAIL;
2013-04-11 09:10:51 +00:00
			if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
						  emulation_type))
2007-10-30 17:44:21 +00:00
				return EMULATE_DONE;
2017-11-10 09:49:38 +00:00
			if (ctxt->have_exception && inject_emulated_exception(vcpu))
				return EMULATE_DONE;
2010-05-10 08:16:56 +00:00
			if (emulation_type & EMULTYPE_SKIP)
				return EMULATE_FAIL;
2018-03-12 11:12:49 +00:00
			return handle_emulation_failure(vcpu, emulation_type);
2007-10-30 17:44:21 +00:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-03-12 11:12:50 +00:00
|
|
|
if ((emulation_type & EMULTYPE_VMWARE) &&
|
|
|
|
!is_vmware_backdoor_opcode(ctxt))
|
|
|
|
return EMULATE_FAIL;
|
|
|
|
|
2009-04-12 10:36:57 +00:00
|
|
|
if (emulation_type & EMULTYPE_SKIP) {
|
2011-06-01 12:34:25 +00:00
|
|
|
kvm_rip_write(vcpu, ctxt->_eip);
|
2014-07-21 11:37:26 +00:00
|
|
|
if (ctxt->eflags & X86_EFLAGS_RF)
|
|
|
|
kvm_set_rflags(vcpu, ctxt->eflags & ~X86_EFLAGS_RF);
|
2009-04-12 10:36:57 +00:00
|
|
|
return EMULATE_DONE;
|
|
|
|
}
|
|
|
|
|
2011-09-22 09:02:48 +00:00
|
|
|
if (retry_instruction(ctxt, cr2, emulation_type))
|
|
|
|
return EMULATE_DONE;
|
|
|
|
|
2011-03-31 10:06:41 +00:00
|
|
|
/* this is needed for vmware backdoor interface to work since it
|
2010-04-28 16:15:42 +00:00
|
|
|
changes register values during I/O operations */
|
2011-03-31 10:06:41 +00:00
|
|
|
if (vcpu->arch.emulate_regs_need_sync_from_vcpu) {
|
|
|
|
vcpu->arch.emulate_regs_need_sync_from_vcpu = false;
|
2012-08-27 20:46:17 +00:00
|
|
|
emulator_invalidate_register_cache(ctxt);
|
2011-03-31 10:06:41 +00:00
|
|
|
}
|
2010-04-28 16:15:42 +00:00
|
|
|
|
2010-03-18 13:20:26 +00:00
|
|
|
restart:
|
2016-12-14 19:59:23 +00:00
|
|
|
/* Save the faulting GPA (cr2) in the address field */
|
|
|
|
ctxt->exception.address = cr2;
|
|
|
|
|
2011-05-29 12:53:48 +00:00
|
|
|
r = x86_emulate_insn(ctxt);
|
2007-10-30 17:44:21 +00:00
|
|
|
|
2011-04-04 10:39:24 +00:00
|
|
|
if (r == EMULATION_INTERCEPTED)
|
|
|
|
return EMULATE_DONE;
|
|
|
|
|
2010-08-25 09:47:43 +00:00
|
|
|
if (r == EMULATION_FAILED) {
|
2013-04-11 09:10:51 +00:00
|
|
|
if (reexecute_instruction(vcpu, cr2, write_fault_to_spt,
|
|
|
|
emulation_type))
|
2010-04-28 16:15:35 +00:00
|
|
|
return EMULATE_DONE;
|
|
|
|
|
2018-03-12 11:12:49 +00:00
|
|
|
return handle_emulation_failure(vcpu, emulation_type);
|
2007-10-30 17:44:21 +00:00
|
|
|
}
|
|
|
|
|
2011-05-29 12:53:48 +00:00
|
|
|
if (ctxt->have_exception) {
|
2010-08-25 09:47:43 +00:00
|
|
|
r = EMULATE_DONE;
|
2014-09-04 17:46:15 +00:00
|
|
|
if (inject_emulated_exception(vcpu))
|
|
|
|
return r;
|
2010-08-25 09:47:43 +00:00
|
|
|
} else if (vcpu->arch.pio.count) {
|
2013-08-27 13:41:43 +00:00
|
|
|
if (!vcpu->arch.pio.in) {
|
|
|
|
/* FIXME: return into emulator if single-stepping. */
|
2010-04-28 16:15:38 +00:00
|
|
|
vcpu->arch.pio.count = 0;
|
2013-08-27 13:41:43 +00:00
|
|
|
} else {
|
2011-03-31 10:06:41 +00:00
|
|
|
writeback = false;
|
2012-09-03 12:24:26 +00:00
|
|
|
vcpu->arch.complete_userspace_io = complete_emulated_pio;
|
|
|
|
}
|
2013-06-25 16:24:41 +00:00
|
|
|
r = EMULATE_USER_EXIT;
|
2011-03-31 10:06:41 +00:00
|
|
|
} else if (vcpu->mmio_needed) {
|
|
|
|
if (!vcpu->mmio_is_write)
|
|
|
|
writeback = false;
|
2013-06-25 16:24:41 +00:00
|
|
|
r = EMULATE_USER_EXIT;
|
2012-09-03 12:24:26 +00:00
|
|
|
vcpu->arch.complete_userspace_io = complete_emulated_mmio;
|
2011-03-31 10:06:41 +00:00
|
|
|
} else if (r == EMULATION_RESTART)
|
2010-03-18 13:20:26 +00:00
|
|
|
goto restart;
|
2010-08-25 09:47:43 +00:00
|
|
|
else
|
|
|
|
r = EMULATE_DONE;
|
2010-02-10 12:21:33 +00:00
|
|
|
|
2011-03-31 10:06:41 +00:00
|
|
|
if (writeback) {
|
2014-03-27 10:29:28 +00:00
|
|
|
unsigned long rflags = kvm_x86_ops->get_rflags(vcpu);
|
2011-05-29 12:53:48 +00:00
|
|
|
toggle_interruptibility(vcpu, ctxt->interruptibility);
|
2011-03-31 10:06:41 +00:00
|
|
|
vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
|
2011-05-29 12:53:48 +00:00
|
|
|
kvm_rip_write(vcpu, ctxt->eip);
|
2017-06-07 13:13:14 +00:00
|
|
|
if (r == EMULATE_DONE &&
|
|
|
|
(ctxt->tf || (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)))
|
|
|
|
kvm_vcpu_do_singlestep(vcpu, &r);
|
2014-11-02 09:54:53 +00:00
|
|
|
if (!ctxt->have_exception ||
|
|
|
|
exception_type(ctxt->exception.vector) == EXCPT_TRAP)
|
|
|
|
__kvm_set_rflags(vcpu, ctxt->eflags);
|
2014-03-27 10:29:28 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* For STI, interrupts are shadowed; so KVM_REQ_EVENT will
|
|
|
|
* do nothing, and it will be requested again as soon as
|
|
|
|
* the shadow expires. But we still need to check here,
|
|
|
|
* because POPF has no interrupt shadow.
|
|
|
|
*/
|
|
|
|
if (unlikely((ctxt->eflags & ~rflags) & X86_EFLAGS_IF))
|
|
|
|
kvm_make_request(KVM_REQ_EVENT, vcpu);
|
2011-03-31 10:06:41 +00:00
|
|
|
} else
|
|
|
|
vcpu->arch.emulate_regs_need_sync_to_vcpu = true;
|
2010-07-29 12:11:52 +00:00
|
|
|
|
|
|
|
return r;
|
2007-10-30 17:44:25 +00:00
|
|
|
}
|
2010-12-21 10:12:02 +00:00
|
|
|
EXPORT_SYMBOL_GPL(x86_emulate_instruction);
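In the writeback path of x86_emulate_instruction(), KVM_REQ_EVENT is requested only when the emulated instruction turned EFLAGS.IF on; the expression (ctxt->eflags & ~rflags) & X86_EFLAGS_IF is nonzero exactly for a 0-to-1 transition of IF. The EMULTYPE_SKIP path similarly strips EFLAGS.RF before writing RFLAGS back. The following is a minimal standalone sketch of just that bit arithmetic, assuming the architectural bit positions (IF is bit 9, RF is bit 16); the model_* names are hypothetical and are not KVM functions.

```c
#include <stdbool.h>

#define MODEL_EFLAGS_IF (1UL << 9)   /* EFLAGS.IF, 0x200 */
#define MODEL_EFLAGS_RF (1UL << 16)  /* EFLAGS.RF, 0x10000 */

/* True only when IF was clear in the old RFLAGS and is set in the new
 * EFLAGS, i.e. the case in which the writeback path raises KVM_REQ_EVENT
 * (POPF has no interrupt shadow, so pending interrupts must be rechecked). */
static bool model_if_newly_set(unsigned long old_rflags,
			       unsigned long new_eflags)
{
	return ((new_eflags & ~old_rflags) & MODEL_EFLAGS_IF) != 0;
}

/* Mirrors "ctxt->eflags & ~X86_EFLAGS_RF" from the EMULTYPE_SKIP path. */
static unsigned long model_clear_rf(unsigned long eflags)
{
	return eflags & ~MODEL_EFLAGS_RF;
}
```

Using the "new & ~old" form rather than just testing the new value avoids redundant event requests when IF was already set before the instruction ran.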
|
2007-10-30 17:44:25 +00:00
|
|
|
|
2018-03-08 16:57:27 +00:00
|
|
|
static int kvm_fast_pio_out(struct kvm_vcpu *vcpu, int size,
|
|
|
|
unsigned short port)
|
2007-10-30 17:44:25 +00:00
|
|
|
{
|
2010-03-18 13:20:23 +00:00
|
|
|
unsigned long val = kvm_register_read(vcpu, VCPU_REGS_RAX);
|
2011-04-20 10:37:53 +00:00
|
|
|
int ret = emulator_pio_out_emulated(&vcpu->arch.emulate_ctxt,
|
|
|
|
size, port, &val, 1);
|
2010-03-18 13:20:23 +00:00
|
|
|
/* do not return to emulator after return from userspace */
|
2010-03-18 13:20:24 +00:00
|
|
|
vcpu->arch.pio.count = 0;
|
2007-10-30 17:44:25 +00:00
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
2016-11-23 17:01:50 +00:00
|
|
|
static int complete_fast_pio_in(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
unsigned long val;
|
|
|
|
|
|
|
|
/* We should only ever be called with arch.pio.count equal to 1 */
|
|
|
|
BUG_ON(vcpu->arch.pio.count != 1);
|
|
|
|
|
|
|
|
/* For size less than 4 we merge, else we zero extend */
|
|
|
|
val = (vcpu->arch.pio.size < 4) ? kvm_register_read(vcpu, VCPU_REGS_RAX)
|
|
|
|
: 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Since vcpu->arch.pio.count == 1, let emulator_pio_in_emulated perform
|
|
|
|
* the copy and tracing
|
|
|
|
*/
|
|
|
|
emulator_pio_in_emulated(&vcpu->arch.emulate_ctxt, vcpu->arch.pio.size,
|
|
|
|
vcpu->arch.pio.port, &val, 1);
|
|
|
|
kvm_register_write(vcpu, VCPU_REGS_RAX, val);
|
|
|
|
|
|
|
|
return 1;
|
|
|
|
}
|
|
|
|
|
2018-03-08 16:57:27 +00:00
|
|
|
static int kvm_fast_pio_in(struct kvm_vcpu *vcpu, int size,
|
|
|
|
unsigned short port)
|
2016-11-23 17:01:50 +00:00
|
|
|
{
|
|
|
|
unsigned long val;
|
|
|
|
int ret;
|
|
|
|
|
|
|
|
/* For size less than 4 we merge, else we zero extend */
|
|
|
|
val = (size < 4) ? kvm_register_read(vcpu, VCPU_REGS_RAX) : 0;
|
|
|
|
|
|
|
|
ret = emulator_pio_in_emulated(&vcpu->arch.emulate_ctxt, size, port,
|
|
|
|
&val, 1);
|
|
|
|
if (ret) {
|
|
|
|
kvm_register_write(vcpu, VCPU_REGS_RAX, val);
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
vcpu->arch.complete_userspace_io = complete_fast_pio_in;
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
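Both kvm_fast_pio_in() and complete_fast_pio_in() follow the x86 rule that 8- and 16-bit writes to AL/AX leave the rest of RAX untouched, while a 32-bit write to EAX zero-extends into RAX. In KVM the merge happens because emulator_pio_in_emulated() copies size bytes over the low end of val, which was preloaded with RAX (size < 4) or 0. The sketch below expresses the same result with masks so that it does not depend on host byte order; model_pio_in_result is a hypothetical name, and size is assumed to be 1, 2 or 4 as for port I/O.

```c
#include <stdint.h>

/* Value RAX ends up holding after an IN of 'size' bytes that returned 'data'. */
static uint64_t model_pio_in_result(uint64_t rax, uint32_t data, int size)
{
	uint64_t mask = (1ULL << (8 * size)) - 1;  /* low 'size' bytes */
	uint64_t val = (size < 4) ? rax : 0;       /* merge vs. zero-extend */

	return (val & ~mask) | ((uint64_t)data & mask);
}
```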
|
2018-03-08 16:57:27 +00:00
|
|
|
|
|
|
|
int kvm_fast_pio(struct kvm_vcpu *vcpu, int size, unsigned short port, int in)
|
|
|
|
{
|
|
|
|
int ret = kvm_skip_emulated_instruction(vcpu);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* TODO: we might be squashing a KVM_GUESTDBG_SINGLESTEP-triggered
|
|
|
|
* KVM_EXIT_DEBUG here.
|
|
|
|
*/
|
|
|
|
if (in)
|
|
|
|
return kvm_fast_pio_in(vcpu, size, port) && ret;
|
|
|
|
else
|
|
|
|
return kvm_fast_pio_out(vcpu, size, port) && ret;
|
|
|
|
}
|
|
|
|
EXPORT_SYMBOL_GPL(kvm_fast_pio);
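kvm_fast_pio() and kvm_emulate_halt() share a convention: an exit handler returns 1 to re-enter the guest and 0 to return to userspace, so "handler() && ret" stays in the guest only if both the instruction skip and the handler agree. Because the skip runs first and the handler writes its own exit reason afterwards, a single-step KVM_EXIT_DEBUG can be overwritten by a userspace I/O exit, which is what the TODO comment and the "Add kvm_skip_emulated_instruction" commit message describe. The following is a toy model of that ordering; every name in it (model_run, MODEL_EXIT_*, model_skip, model_pio) is hypothetical, and the enum values are not the KVM ABI exit codes.

```c
#include <stdbool.h>

enum model_exit { MODEL_EXIT_NONE, MODEL_EXIT_DEBUG, MODEL_EXIT_IO };

struct model_run {
	enum model_exit exit_reason;
};

/* Stands in for kvm_skip_emulated_instruction(): when single-stepping it
 * records a debug exit and returns 0, otherwise it returns 1. */
static int model_skip(struct model_run *run, bool singlestep)
{
	if (singlestep) {
		run->exit_reason = MODEL_EXIT_DEBUG;
		return 0;
	}
	return 1;
}

/* Stands in for kvm_fast_pio_in/out(): returns 1 if the port was handled
 * in the kernel, else records an I/O exit and returns 0. */
static int model_pio(struct model_run *run, bool handled_in_kernel)
{
	if (handled_in_kernel)
		return 1;
	run->exit_reason = MODEL_EXIT_IO;
	return 0;
}

/* Same shape as kvm_fast_pio(): skip first, handle second, then AND. */
static int model_fast_pio(struct model_run *run, bool singlestep,
			  bool handled_in_kernel)
{
	int ret = model_skip(run, singlestep);

	return model_pio(run, handled_in_kernel) && ret;
}
```

With single-stepping on and a port that must go to userspace, the model returns 0 with MODEL_EXIT_IO: the debug exit is squashed, and per the commit message it fires again on the next instruction.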
|
2016-11-23 17:01:50 +00:00
|
|
|
|
2016-07-13 17:16:33 +00:00
|
|
|
static int kvmclock_cpu_down_prep(unsigned int cpu)
|
2010-08-20 08:07:21 +00:00
|
|
|
{
|
2010-12-18 15:28:55 +00:00
|
|
|
__this_cpu_write(cpu_tsc_khz, 0);
|
2016-07-13 17:16:33 +00:00
|
|
|
return 0;
|
2010-08-20 08:07:21 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static void tsc_khz_changed(void *data)
|
2009-02-04 16:52:04 +00:00
|
|
|
{
|
2010-08-20 08:07:21 +00:00
|
|
|
struct cpufreq_freqs *freq = data;
|
|
|
|
unsigned long khz = 0;
|
|
|
|
|
|
|
|
if (data)
|
|
|
|
khz = freq->new;
|
|
|
|
else if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
|
|
|
|
khz = cpufreq_quick_get(raw_smp_processor_id());
|
|
|
|
if (!khz)
|
|
|
|
khz = tsc_khz;
|
2010-12-18 15:28:55 +00:00
|
|
|
__this_cpu_write(cpu_tsc_khz, khz);
|
2009-02-04 16:52:04 +00:00
|
|
|
}
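tsc_khz_changed() chooses the per-CPU TSC frequency through a short fallback chain: the new frequency from a cpufreq notification if one was passed, otherwise cpufreq_quick_get() when the TSC is not constant, and finally the boot-time tsc_khz if the result is still 0. The sketch below isolates that decision as a pure function; its name and parameters are hypothetical, with the kernel calls replaced by plain values.

```c
#include <stdbool.h>
#include <stddef.h>

/* notified_new_khz: freq->new when called from the notifier, else NULL.
 * quick_get_khz:    what cpufreq_quick_get() would report (may be 0).
 * boot_tsc_khz:     the global tsc_khz fallback. */
static unsigned long model_tsc_khz(const unsigned long *notified_new_khz,
				   bool constant_tsc,
				   unsigned long quick_get_khz,
				   unsigned long boot_tsc_khz)
{
	unsigned long khz = 0;

	if (notified_new_khz)
		khz = *notified_new_khz;
	else if (!constant_tsc)
		khz = quick_get_khz;
	if (!khz)
		khz = boot_tsc_khz;
	return khz;
}
```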
|
|
|
|
|
2018-01-31 08:41:40 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
2018-01-24 13:23:37 +00:00
|
|
|
static void kvm_hyperv_tsc_notifier(void)
|
|
|
|
{
|
|
|
|
struct kvm *kvm;
|
|
|
|
struct kvm_vcpu *vcpu;
|
|
|
|
int cpu;
|
|
|
|
|
|
|
|
spin_lock(&kvm_lock);
|
|
|
|
list_for_each_entry(kvm, &vm_list, vm_list)
|
|
|
|
kvm_make_mclock_inprogress_request(kvm);
|
|
|
|
|
|
|
|
hyperv_stop_tsc_emulation();
|
|
|
|
|
|
|
|
/* TSC frequency always matches when on Hyper-V */
|
|
|
|
for_each_present_cpu(cpu)
|
|
|
|
per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
|
|
|
|
kvm_max_guest_tsc_khz = tsc_khz;
|
|
|
|
|
|
|
|
list_for_each_entry(kvm, &vm_list, vm_list) {
|
|
|
|
struct kvm_arch *ka = &kvm->arch;
|
|
|
|
|
|
|
|
spin_lock(&ka->pvclock_gtod_sync_lock);
|
|
|
|
|
|
|
|
pvclock_update_vm_gtod_copy(kvm);
|
|
|
|
|
|
|
|
kvm_for_each_vcpu(cpu, vcpu, kvm)
|
|
|
|
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
|
|
|
|
|
|
|
|
kvm_for_each_vcpu(cpu, vcpu, kvm)
|
|
|
|
kvm_clear_request(KVM_REQ_MCLOCK_INPROGRESS, vcpu);
|
|
|
|
|
|
|
|
spin_unlock(&ka->pvclock_gtod_sync_lock);
|
|
|
|
}
|
|
|
|
spin_unlock(&kvm_lock);
|
|
|
|
}
|
2018-01-31 08:41:40 +00:00
|
|
|
#endif
|
2018-01-24 13:23:37 +00:00
|
|
|
|
2009-02-04 16:52:04 +00:00
|
|
|
static int kvmclock_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
|
|
|
|
void *data)
|
|
|
|
{
|
|
|
|
struct cpufreq_freqs *freq = data;
|
|
|
|
struct kvm *kvm;
|
|
|
|
struct kvm_vcpu *vcpu;
|
|
|
|
int i, send_ipi = 0;
|
|
|
|
|
2010-08-20 08:07:21 +00:00
|
|
|
/*
|
|
|
|
* We allow guests to temporarily run on slowing clocks,
|
|
|
|
* provided we notify them after, or to run on accelerating
|
|
|
|
* clocks, provided we notify them before. Thus time never
|
|
|
|
* goes backwards.
|
|
|
|
*
|
|
|
|
* However, we have a problem. We can't atomically update
|
|
|
|
* the frequency of a given CPU from this function; it is
|
|
|
|
* merely a notifier, which can be called from any CPU.
|
|
|
|
* Changing the TSC frequency at arbitrary points in time
|
|
|
|
* requires a recomputation of local variables related to
|
|
|
|
* the TSC for each VCPU. We must flag these local variables
|
|
|
|
* to be updated and be sure the update takes place with the
|
|
|
|
* new frequency before any guests proceed.
|
|
|
|
*
|
|
|
|
* Unfortunately, the combination of hotplug CPU and frequency
|
|
|
|
* change creates an intractable locking scenario; the order
|
|
|
|
* of when these callouts happen is undefined with respect to
|
|
|
|
* CPU hotplug, and they can race with each other. As such,
|
|
|
|
* merely setting per_cpu(cpu_tsc_khz) = X during a hotadd is
|
|
|
|
* undefined; you can actually have a CPU frequency change take
|
|
|
|
* place in between the computation of X and the setting of the
|
|
|
|
* variable. To protect against this problem, all updates of
|
|
|
|
* the per_cpu tsc_khz variable are done in an interrupt
|
|
|
|
* protected IPI, and all callers wishing to update the value
|
|
|
|
* must wait for a synchronous IPI to complete (which is trivial
|
|
|
|
* if the caller is on the CPU already). This establishes the
|
|
|
|
* necessary total order on variable updates.
|
|
|
|
*
|
|
|
|
* Note that because a guest time update may take place
|
|
|
|
* anytime after the setting of the VCPU's request bit, the
|
|
|
|
* correct TSC value must be set before the request. However,
|
|
|
|
* to ensure the update actually makes it to any guest which
|
|
|
|
* starts running in hardware virtualization between the set
|
|
|
|
* and the acquisition of the spinlock, we must also ping the
|
|
|
|
* CPU after setting the request bit.
|
|
|
|
*
|
|
|
|
*/
|
|
|
|
|
2009-02-04 16:52:04 +00:00
|
|
|
if (val == CPUFREQ_PRECHANGE && freq->old > freq->new)
|
|
|
|
return 0;
|
|
|
|
if (val == CPUFREQ_POSTCHANGE && freq->old < freq->new)
|
|
|
|
return 0;
|
2010-08-20 08:07:21 +00:00
|
|
|
|
|
|
|
smp_call_function_single(freq->cpu, tsc_khz_changed, freq, 1);
|
2009-02-04 16:52:04 +00:00
|
|
|
|
2013-09-25 11:53:07 +00:00
|
|
|
spin_lock(&kvm_lock);
|
2009-02-04 16:52:04 +00:00
|
|
|
list_for_each_entry(kvm, &vm_list, vm_list) {
|
2009-06-09 12:56:29 +00:00
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm) {
|
2009-02-04 16:52:04 +00:00
|
|
|
if (vcpu->cpu != freq->cpu)
|
|
|
|
continue;
|
2010-09-19 00:38:15 +00:00
|
|
|
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
|
2009-02-04 16:52:04 +00:00
|
|
|
if (vcpu->cpu != smp_processor_id())
|
2010-08-20 08:07:21 +00:00
|
|
|
send_ipi = 1;
|
2009-02-04 16:52:04 +00:00
|
|
|
}
|
|
|
|
}
|
2013-09-25 11:53:07 +00:00
|
|
|
spin_unlock(&kvm_lock);
|
2009-02-04 16:52:04 +00:00
|
|
|
|
|
|
|
if (freq->old < freq->new && send_ipi) {
|
|
|
|
/*
|
|
|
|
* We upscale the frequency. Must make sure the guest
|
|
|
|
* doesn't see old kvmclock values while running with
|
|
|
|
* the new frequency, otherwise we risk the guest seeing
|
|
|
|
* time go backwards.
|
|
|
|
*
|
|
|
|
* In case we update the frequency for another cpu
|
|
|
|
* (which might be in guest context) send an interrupt
|
|
|
|
* to kick the cpu out of guest context. Next time
|
|
|
|
* guest context is entered kvmclock will be updated,
|
|
|
|
* so the guest will not see stale values.
|
|
|
|
*/
|
2010-08-20 08:07:21 +00:00
|
|
|
smp_call_function_single(freq->cpu, tsc_khz_changed, freq, 1);
|
2009-02-04 16:52:04 +00:00
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
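The two early returns in kvmclock_cpufreq_notifier() encode the rule from its long comment: an increase in frequency is acted on at CPUFREQ_PRECHANGE (before the clock speeds up), a decrease at CPUFREQ_POSTCHANGE (after it has slowed down). A transition whose old and new frequencies are equal passes both checks and is processed in both phases. The sketch below restates just that predicate; the enum and function names are hypothetical stand-ins for the cpufreq constants.

```c
#include <stdbool.h>

enum model_phase { MODEL_PRECHANGE, MODEL_POSTCHANGE };

/* True when the notifier would go on to refresh cpu_tsc_khz and request
 * KVM_REQ_CLOCK_UPDATE in this phase. */
static bool model_update_in_phase(enum model_phase phase,
				  unsigned int old_khz, unsigned int new_khz)
{
	if (phase == MODEL_PRECHANGE && old_khz > new_khz)
		return false;  /* slowing down: wait for POSTCHANGE */
	if (phase == MODEL_POSTCHANGE && old_khz < new_khz)
		return false;  /* speeding up: already handled at PRECHANGE */
	return true;
}
```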
|
|
|
|
|
|
|
|
static struct notifier_block kvmclock_cpufreq_notifier_block = {
|
2010-08-20 08:07:21 +00:00
|
|
|
.notifier_call = kvmclock_cpufreq_notifier
|
|
|
|
};
|
|
|
|
|
2016-07-13 17:16:33 +00:00
|
|
|
static int kvmclock_cpu_online(unsigned int cpu)
|
2010-08-20 08:07:21 +00:00
|
|
|
{
|
2016-07-13 17:16:33 +00:00
|
|
|
tsc_khz_changed(NULL);
|
|
|
|
return 0;
|
2010-08-20 08:07:21 +00:00
|
|
|
}
|
|
|
|
|
2009-09-29 21:38:34 +00:00
|
|
|
static void kvm_timer_init(void)
|
|
|
|
{
|
2010-09-19 00:38:15 +00:00
|
|
|
max_tsc_khz = tsc_khz;
|
2014-03-10 20:39:01 +00:00
|
|
|
|
2009-09-29 21:38:34 +00:00
|
|
|
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
|
2010-09-19 00:38:15 +00:00
|
|
|
#ifdef CONFIG_CPU_FREQ
|
|
|
|
struct cpufreq_policy policy;
|
2016-09-04 17:13:57 +00:00
|
|
|
int cpu;
|
|
|
|
|
2010-09-19 00:38:15 +00:00
|
|
|
memset(&policy, 0, sizeof(policy));
|
2010-12-16 10:16:34 +00:00
|
|
|
cpu = get_cpu();
|
|
|
|
cpufreq_get_policy(&policy, cpu);
|
2010-09-19 00:38:15 +00:00
|
|
|
if (policy.cpuinfo.max_freq)
|
|
|
|
max_tsc_khz = policy.cpuinfo.max_freq;
|
2010-12-16 10:16:34 +00:00
|
|
|
put_cpu();
|
2010-09-19 00:38:15 +00:00
|
|
|
#endif
|
2009-09-29 21:38:34 +00:00
|
|
|
cpufreq_register_notifier(&kvmclock_cpufreq_notifier_block,
|
|
|
|
CPUFREQ_TRANSITION_NOTIFIER);
|
|
|
|
}
|
2010-09-19 00:38:15 +00:00
|
|
|
pr_debug("kvm: max_tsc_khz = %ld\n", max_tsc_khz);
|
2014-03-10 20:39:01 +00:00
|
|
|
|
2016-12-21 19:19:54 +00:00
|
|
|
cpuhp_setup_state(CPUHP_AP_X86_KVM_CLK_ONLINE, "x86/kvm/clk:online",
|
2016-07-13 17:16:33 +00:00
|
|
|
kvmclock_cpu_online, kvmclock_cpu_down_prep);
|
2009-09-29 21:38:34 +00:00
|
|
|
}
|
|
|
|
|
2017-07-26 00:20:32 +00:00
|
|
|
DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
|
|
|
|
EXPORT_PER_CPU_SYMBOL_GPL(current_vcpu);
|
2010-04-19 05:32:45 +00:00
|
|
|
|
2011-11-10 12:57:22 +00:00
|
|
|
int kvm_is_in_guest(void)
|
2010-04-19 05:32:45 +00:00
|
|
|
{
|
2011-10-20 07:34:01 +00:00
|
|
|
return __this_cpu_read(current_vcpu) != NULL;
|
2010-04-19 05:32:45 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static int kvm_is_user_mode(void)
|
|
|
|
{
|
|
|
|
int user_mode = 3;
|
2010-04-20 02:13:58 +00:00
|
|
|
|
2011-10-20 07:34:01 +00:00
|
|
|
if (__this_cpu_read(current_vcpu))
|
|
|
|
user_mode = kvm_x86_ops->get_cpl(__this_cpu_read(current_vcpu));
|
2010-04-20 02:13:58 +00:00
|
|
|
|
2010-04-19 05:32:45 +00:00
|
|
|
return user_mode != 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static unsigned long kvm_get_guest_ip(void)
|
|
|
|
{
|
|
|
|
unsigned long ip = 0;
|
2010-04-20 02:13:58 +00:00
|
|
|
|
2011-10-20 07:34:01 +00:00
|
|
|
if (__this_cpu_read(current_vcpu))
|
|
|
|
ip = kvm_rip_read(__this_cpu_read(current_vcpu));
|
2010-04-20 02:13:58 +00:00
|
|
|
|
2010-04-19 05:32:45 +00:00
|
|
|
return ip;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct perf_guest_info_callbacks kvm_guest_cbs = {
|
|
|
|
.is_in_guest = kvm_is_in_guest,
|
|
|
|
.is_user_mode = kvm_is_user_mode,
|
|
|
|
.get_guest_ip = kvm_get_guest_ip,
|
|
|
|
};
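The perf callbacks above all key off the per-CPU current_vcpu pointer: a non-NULL value means the sample was taken while a vCPU was current, CPL decides guest user versus kernel mode, and RIP gives the guest IP. The following sketch swaps the kernel's DEFINE_PER_CPU variable for a C11 _Thread_local pointer purely to make the pattern runnable in userspace; that substitution, model_vcpu and the model_* functions are all assumptions for illustration. Note that, like kvm_is_user_mode(), the model reports "user mode" when no vCPU is current because of its default of 3.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct kvm_vcpu with only the fields read here. */
struct model_vcpu {
	int cpl;            /* guest current privilege level */
	unsigned long rip;  /* guest instruction pointer */
};

/* Thread-local analogue of DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu). */
static _Thread_local struct model_vcpu *model_current_vcpu;

static int model_is_in_guest(void)
{
	return model_current_vcpu != NULL;
}

static int model_is_user_mode(void)
{
	int user_mode = 3;  /* same default as kvm_is_user_mode() */

	if (model_current_vcpu)
		user_mode = model_current_vcpu->cpl;
	return user_mode != 0;
}

static unsigned long model_get_guest_ip(void)
{
	return model_current_vcpu ? model_current_vcpu->rip : 0;
}
```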
|
|
|
|
|
2011-07-11 19:33:44 +00:00
|
|
|
static void kvm_set_mmio_spte_mask(void)
|
|
|
|
{
|
|
|
|
u64 mask;
|
|
|
|
int maxphyaddr = boot_cpu_data.x86_phys_bits;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set the reserved bits and the present bit of a paging-structure
|
|
|
|
* entry to generate a page fault with PFERR.RSVD = 1.
|
|
|
|
*/
|
2018-08-14 17:15:34 +00:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Mask the uppermost physical address bit, which would be reserved as
|
|
|
|
* long as the supported physical address width is less than 52.
|
|
|
|
*/
|
|
|
|
mask = 1ull << 51;
|
2013-06-07 08:51:23 +00:00
|
|
|
|
|
|
|
/* Set the present bit. */
|
2011-07-11 19:33:44 +00:00
|
|
|
mask |= 1ull;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the reserved bit is not supported, clear the present bit to disable
|
|
|
|
* MMIO page faults.
|
|
|
|
*/
|
2018-08-20 21:37:50 +00:00
|
|
|
if (IS_ENABLED(CONFIG_X86_64) && maxphyaddr == 52)
|
2011-07-11 19:33:44 +00:00
|
|
|
mask &= ~1ull;
|
|
|
|
|
2017-07-01 00:26:30 +00:00
|
|
|
kvm_mmu_set_mmio_spte_mask(mask, mask);
|
2011-07-11 19:33:44 +00:00
|
|
|
}
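kvm_set_mmio_spte_mask() builds the MMIO SPTE mask from physical-address bit 51 plus the present bit, then drops the present bit on 64-bit hosts whose MAXPHYADDR is 52, because bit 51 is no longer reserved there and could not trigger a reserved-bit fault. In this sketch boot_cpu_data.x86_phys_bits and IS_ENABLED(CONFIG_X86_64) are replaced by parameters; the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

static uint64_t model_mmio_spte_mask(int maxphyaddr, bool is_x86_64)
{
	uint64_t mask = 1ULL << 51;  /* uppermost physical-address bit */

	mask |= 1ULL;                /* present bit */
	if (is_x86_64 && maxphyaddr == 52)
		mask &= ~1ULL;       /* bit 51 not reserved: disable MMIO #PF */
	return mask;
}
```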
|
|
|
|
|
2012-11-28 01:29:00 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
static void pvclock_gtod_update_fn(struct work_struct *work)
|
|
|
|
{
|
2012-11-28 01:29:01 +00:00
|
|
|
struct kvm *kvm;
|
|
|
|
|
|
|
|
struct kvm_vcpu *vcpu;
|
|
|
|
int i;
|
|
|
|
|
2013-09-25 11:53:07 +00:00
|
|
|
spin_lock(&kvm_lock);
|
2012-11-28 01:29:01 +00:00
|
|
|
list_for_each_entry(kvm, &vm_list, vm_list)
|
|
|
|
kvm_for_each_vcpu(i, vcpu, kvm)
|
2014-09-12 05:43:19 +00:00
|
|
|
kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
|
2012-11-28 01:29:01 +00:00
|
|
|
atomic_set(&kvm_guest_has_master_clock, 0);
|
2013-09-25 11:53:07 +00:00
|
|
|
spin_unlock(&kvm_lock);
|
2012-11-28 01:29:00 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
static DECLARE_WORK(pvclock_gtod_work, pvclock_gtod_update_fn);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Notification about pvclock gtod data update.
|
|
|
|
*/
|
|
|
|
static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
|
|
|
|
void *priv)
|
|
|
|
{
|
|
|
|
struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
|
|
|
|
struct timekeeper *tk = priv;
|
|
|
|
|
|
|
|
update_pvclock_gtod(tk);
|
|
|
|
|
|
|
|
/* disable master clock if host does not trust, or does not
|
2018-01-24 13:23:36 +00:00
|
|
|
* use, TSC based clocksource.
|
2012-11-28 01:29:00 +00:00
|
|
|
*/
|
2018-01-24 13:23:36 +00:00
|
|
|
if (!gtod_is_based_on_tsc(gtod->clock.vclock_mode) &&
|
2012-11-28 01:29:00 +00:00
|
|
|
atomic_read(&kvm_guest_has_master_clock) != 0)
|
|
|
|
queue_work(system_long_wq, &pvclock_gtod_work);
|
|
|
|
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
static struct notifier_block pvclock_gtod_notifier = {
|
|
|
|
.notifier_call = pvclock_gtod_notify,
|
|
|
|
};
|
|
|
|
#endif
|
|
|
|
|
2007-11-14 12:40:21 +00:00
|
|
|
int kvm_arch_init(void *opaque)
|
2007-10-10 15:16:19 +00:00
|
|
|
{
|
2009-09-29 21:38:34 +00:00
|
|
|
int r;
|
2013-06-26 18:36:23 +00:00
|
|
|
struct kvm_x86_ops *ops = opaque;
|
2007-11-14 12:40:21 +00:00
|
|
|
|
|
|
|
if (kvm_x86_ops) {
|
|
|
|
printk(KERN_ERR "kvm: already loaded the other module\n");
|
2007-11-18 12:43:21 +00:00
|
|
|
r = -EEXIST;
|
|
|
|
goto out;
|
2007-11-14 12:40:21 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (!ops->cpu_has_kvm_support()) {
|
|
|
|
printk(KERN_ERR "kvm: no hardware support\n");
|
2007-11-18 12:43:21 +00:00
|
|
|
r = -EOPNOTSUPP;
|
|
|
|
goto out;
|
2007-11-14 12:40:21 +00:00
|
|
|
}
|
|
|
|
if (ops->disabled_by_bios()) {
|
|
|
|
printk(KERN_ERR "kvm: disabled by bios\n");
|
2007-11-18 12:43:21 +00:00
|
|
|
r = -EOPNOTSUPP;
|
|
|
|
goto out;
|
2007-11-14 12:40:21 +00:00
|
|
|
}
|
|
|
|
|
2013-01-03 13:41:39 +00:00
|
|
|
r = -ENOMEM;
|
|
|
|
shared_msrs = alloc_percpu(struct kvm_shared_msrs);
|
|
|
|
if (!shared_msrs) {
|
|
|
|
printk(KERN_ERR "kvm: failed to allocate percpu kvm_shared_msrs\n");
|
|
|
|
goto out;
|
|
|
|
}
|
|
|
|
|
2008-01-13 11:23:56 +00:00
|
|
|
r = kvm_mmu_module_init();
|
|
|
|
if (r)
|
2013-01-03 13:41:39 +00:00
|
|
|
goto out_free_percpu;
|
2008-01-13 11:23:56 +00:00
|
|
|
|
2011-07-11 19:33:44 +00:00
|
|
|
kvm_set_mmio_spte_mask();
|
2008-01-13 11:23:56 +00:00
|
|
|
|
2007-11-14 12:40:21 +00:00
|
|
|
kvm_x86_ops = ops;
|
2014-03-26 14:54:00 +00:00
|
|
|
|
2008-04-25 13:13:50 +00:00
|
|
|
kvm_mmu_set_mask_ptes(PT_USER_MASK, PT_ACCESSED_MASK,
|
2016-07-12 22:18:49 +00:00
|
|
|
PT_DIRTY_MASK, PT64_NX_MASK, 0,
|
2017-07-17 21:10:27 +00:00
|
|
|
PT_PRESENT_MASK, 0, sme_me_mask);
|
2009-09-29 21:38:34 +00:00
|
|
|
kvm_timer_init();
|
2009-02-04 16:52:04 +00:00
|
|
|
|
2010-04-19 05:32:45 +00:00
|
|
|
perf_register_guest_info_callbacks(&kvm_guest_cbs);
|
|
|
|
|
2016-04-04 20:25:02 +00:00
|
|
|
if (boot_cpu_has(X86_FEATURE_XSAVE))
|
2010-06-10 03:27:12 +00:00
|
|
|
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
|
|
|
|
|
2012-08-05 12:58:30 +00:00
|
|
|
kvm_lapic_init();
|
2012-11-28 01:29:00 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
|
2018-01-24 13:23:37 +00:00
|
|
|
|
2018-01-31 08:41:40 +00:00
|
|
|
if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
|
2018-01-24 13:23:37 +00:00
|
|
|
set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
|
2012-11-28 01:29:00 +00:00
|
|
|
#endif
|
|
|
|
|
2007-11-14 12:40:21 +00:00
|
|
|
return 0;
|
2007-11-18 12:43:21 +00:00
|
|
|
|
2013-01-03 13:41:39 +00:00
|
|
|
out_free_percpu:
|
|
|
|
free_percpu(shared_msrs);
|
2007-11-18 12:43:21 +00:00
|
|
|
out:
|
|
|
|
return r;
|
2007-10-10 15:16:19 +00:00
|
|
|
}
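kvm_arch_init() uses the kernel's goto-based unwinding: each failure jumps to a label that releases only what has been acquired so far, in reverse order (out_free_percpu frees shared_msrs, out just returns). The sketch below reproduces that structure in userspace with two hypothetical resources, a heap buffer standing in for the per-CPU shared_msrs allocation and a flag standing in for kvm_mmu_module_init(); none of these names come from KVM.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

static int *model_percpu_buf;  /* stands in for shared_msrs */
static bool model_mmu_ready;   /* stands in for MMU module state */

static int model_mmu_init(bool fail)
{
	if (fail)
		return -ENOMEM;
	model_mmu_ready = true;
	return 0;
}

static int model_arch_init(bool already_loaded, bool mmu_fails)
{
	int r;

	if (already_loaded) {
		r = -EEXIST;
		goto out;            /* nothing acquired yet */
	}

	r = -ENOMEM;
	model_percpu_buf = calloc(4, sizeof(*model_percpu_buf));
	if (!model_percpu_buf)
		goto out;

	r = model_mmu_init(mmu_fails);
	if (r)
		goto out_free_buf;   /* undo only the allocation above */

	return 0;

out_free_buf:
	free(model_percpu_buf);
	model_percpu_buf = NULL;
out:
	return r;
}
```

Placing the labels in reverse acquisition order lets each later failure fall through every earlier cleanup step without duplicating code.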
|
2007-10-31 22:24:24 +00:00
|
|
|
|
2007-11-14 12:40:21 +00:00
|
|
|
void kvm_arch_exit(void)
|
|
|
|
{
|
2018-01-24 13:23:37 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
2018-01-31 08:41:40 +00:00
|
|
|
if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
|
2018-01-24 13:23:37 +00:00
|
|
|
clear_hv_tscchange_cb();
|
|
|
|
#endif
|
2016-12-16 22:30:36 +00:00
|
|
|
kvm_lapic_exit();
|
2010-04-19 05:32:45 +00:00
|
|
|
perf_unregister_guest_info_callbacks(&kvm_guest_cbs);
|
|
|
|
|
2009-04-17 17:24:58 +00:00
|
|
|
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
|
|
|
|
cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
|
|
|
|
CPUFREQ_TRANSITION_NOTIFIER);
|
2016-07-13 17:16:33 +00:00
|
|
|
cpuhp_remove_state_nocalls(CPUHP_AP_X86_KVM_CLK_ONLINE);
|
2012-11-28 01:29:00 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
|
|
|
pvclock_gtod_unregister_notifier(&pvclock_gtod_notifier);
|
|
|
|
#endif
|
2007-11-14 12:40:21 +00:00
|
|
|
kvm_x86_ops = NULL;
|
2007-11-18 12:43:21 +00:00
|
|
|
kvm_mmu_module_exit();
|
2013-01-03 13:41:39 +00:00
|
|
|
free_percpu(shared_msrs);
|
2007-11-18 12:43:21 +00:00
|
|
|
}
|
2007-11-14 12:40:21 +00:00
|
|
|
|
2015-03-02 19:43:31 +00:00
|
|
|
int kvm_vcpu_halt(struct kvm_vcpu *vcpu)
|
2007-10-31 22:24:24 +00:00
|
|
|
{
|
|
|
|
++vcpu->stat.halt_exits;
|
2015-07-29 10:05:37 +00:00
|
|
|
if (lapic_in_kernel(vcpu)) {
|
2008-04-13 14:54:35 +00:00
|
|
|
vcpu->arch.mp_state = KVM_MP_STATE_HALTED;
|
2007-10-31 22:24:24 +00:00
|
|
|
return 1;
|
|
|
|
} else {
|
|
|
|
vcpu->run->exit_reason = KVM_EXIT_HLT;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
}
|
2015-03-02 19:43:31 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_vcpu_halt);
|
|
|
|
|
|
|
|
int kvm_emulate_halt(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.
Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:
- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
take precedence. The singlestep will be triggered again on the next
instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
which do not trigger singlestep traps as mentioned previously.
Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-29 20:40:40 +00:00
|
|
|
int ret = kvm_skip_emulated_instruction(vcpu);
|
|
|
|
/*
|
|
|
|
* TODO: we might be squashing a GUESTDBG_SINGLESTEP-triggered
|
|
|
|
* KVM_EXIT_DEBUG here.
|
|
|
|
*/
|
|
|
|
return kvm_vcpu_halt(vcpu) && ret;
|
2015-03-02 19:43:31 +00:00
|
|
|
}
|
2007-10-31 22:24:24 +00:00
|
|
|
EXPORT_SYMBOL_GPL(kvm_emulate_halt);
|
|
|
|
|
2017-02-09 15:10:42 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
2017-01-24 17:09:39 +00:00
|
|
|
static int kvm_pv_clock_pairing(struct kvm_vcpu *vcpu, gpa_t paddr,
|
|
|
|
unsigned long clock_type)
|
|
|
|
{
|
|
|
|
struct kvm_clock_pairing clock_pairing;
|
2018-04-23 08:04:26 +00:00
|
|
|
struct timespec64 ts;
|
2017-02-08 09:57:24 +00:00
|
|
|
u64 cycle;
|
2017-01-24 17:09:39 +00:00
|
|
|
int ret;
|
|
|
|
|
|
|
|
if (clock_type != KVM_CLOCK_PAIRING_WALLCLOCK)
|
|
|
|
return -KVM_EOPNOTSUPP;
|
|
|
|
|
|
|
|
if (kvm_get_walltime_and_clockread(&ts, &cycle) == false)
|
|
|
|
return -KVM_EOPNOTSUPP;
|
|
|
|
|
|
|
|
clock_pairing.sec = ts.tv_sec;
|
|
|
|
clock_pairing.nsec = ts.tv_nsec;
|
|
|
|
clock_pairing.tsc = kvm_read_l1_tsc(vcpu, cycle);
|
|
|
|
clock_pairing.flags = 0;
|
|
|
|
|
|
|
|
ret = 0;
|
|
|
|
if (kvm_write_guest(vcpu->kvm, paddr, &clock_pairing,
|
|
|
|
sizeof(struct kvm_clock_pairing)))
|
|
|
|
ret = -KVM_EFAULT;
|
|
|
|
|
|
|
|
return ret;
|
|
|
|
}
|
2017-02-09 15:10:42 +00:00
|
|
|
#endif
|
2017-01-24 17:09:39 +00:00
|
|
|
|
2013-08-26 08:48:34 +00:00
|
|
|
/*
|
|
|
|
* kvm_pv_kick_cpu_op: Kick a vcpu.
|
|
|
|
*
|
|
|
|
* @apicid - apicid of vcpu to be kicked.
|
|
|
|
*/
|
|
|
|
static void kvm_pv_kick_cpu_op(struct kvm *kvm, unsigned long flags, int apicid)
|
|
|
|
{
|
2013-08-26 08:48:35 +00:00
|
|
|
struct kvm_lapic_irq lapic_irq;
|
2013-08-26 08:48:34 +00:00
|
|
|
|
2013-08-26 08:48:35 +00:00
|
|
|
lapic_irq.shorthand = 0;
|
|
|
|
lapic_irq.dest_mode = 0;
|
2017-08-02 03:20:51 +00:00
|
|
|
lapic_irq.level = 0;
|
2013-08-26 08:48:35 +00:00
|
|
|
lapic_irq.dest_id = apicid;
|
2015-03-19 01:26:03 +00:00
|
|
|
lapic_irq.msi_redir_hint = false;
|
2013-08-26 08:48:34 +00:00
|
|
|
|
2013-08-26 08:48:35 +00:00
|
|
|
lapic_irq.delivery_mode = APIC_DM_REMRD;
|
2015-03-13 09:39:44 +00:00
|
|
|
kvm_irq_delivery_to_apic(kvm, NULL, &lapic_irq, NULL);
|
2013-08-26 08:48:34 +00:00
|
|
|
}
|
|
|
|
|
2015-11-10 12:36:33 +00:00
|
|
|
void kvm_vcpu_deactivate_apicv(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
vcpu->arch.apicv_active = false;
|
|
|
|
kvm_x86_ops->refresh_apicv_exec_ctrl(vcpu);
|
|
|
|
}
|
|
|
|
|
2007-10-31 22:24:24 +00:00
|
|
|
int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
|
|
|
unsigned long nr, a0, a1, a2, a3, ret;
|
2018-04-30 09:23:01 +00:00
|
|
|
int op_64_bit;
|
2007-10-31 22:24:24 +00:00
|
|
|
|
2018-05-24 15:50:56 +00:00
|
|
|
if (kvm_hv_hypercall_enabled(vcpu->kvm))
|
|
|
|
return kvm_hv_hypercall(vcpu);
|
2010-01-17 13:51:22 +00:00
|
|
|
|
2008-06-27 17:58:02 +00:00
|
|
|
nr = kvm_register_read(vcpu, VCPU_REGS_RAX);
|
|
|
|
a0 = kvm_register_read(vcpu, VCPU_REGS_RBX);
|
|
|
|
a1 = kvm_register_read(vcpu, VCPU_REGS_RCX);
|
|
|
|
a2 = kvm_register_read(vcpu, VCPU_REGS_RDX);
|
|
|
|
a3 = kvm_register_read(vcpu, VCPU_REGS_RSI);
|
2007-10-31 22:24:24 +00:00
|
|
|
|
2009-06-17 12:22:14 +00:00
|
|
|
trace_kvm_hypercall(nr, a0, a1, a2, a3);
|
2008-04-10 19:31:10 +00:00
|
|
|
|
2014-06-18 14:19:24 +00:00
|
|
|
op_64_bit = is_64_bit_mode(vcpu);
|
|
|
|
if (!op_64_bit) {
|
2007-10-31 22:24:24 +00:00
|
|
|
nr &= 0xFFFFFFFF;
|
|
|
|
a0 &= 0xFFFFFFFF;
|
|
|
|
a1 &= 0xFFFFFFFF;
|
|
|
|
a2 &= 0xFFFFFFFF;
|
|
|
|
a3 &= 0xFFFFFFFF;
|
|
|
|
}
|
|
|
|
|
2009-08-03 16:43:28 +00:00
|
|
|
if (kvm_x86_ops->get_cpl(vcpu) != 0) {
|
|
|
|
ret = -KVM_EPERM;
|
2018-05-24 15:50:56 +00:00
|
|
|
goto out;
|
2009-08-03 16:43:28 +00:00
|
|
|
}
|
|
|
|
|
2007-10-31 22:24:24 +00:00
|
|
|
switch (nr) {
|
2007-10-25 14:52:32 +00:00
|
|
|
case KVM_HC_VAPIC_POLL_IRQ:
|
|
|
|
ret = 0;
|
|
|
|
break;
|
2013-08-26 08:48:34 +00:00
|
|
|
case KVM_HC_KICK_CPU:
|
|
|
|
kvm_pv_kick_cpu_op(vcpu->kvm, a0, a1);
|
|
|
|
ret = 0;
|
|
|
|
break;
|
2017-02-09 15:10:42 +00:00
|
|
|
#ifdef CONFIG_X86_64
|
2017-01-24 17:09:39 +00:00
|
|
|
case KVM_HC_CLOCK_PAIRING:
|
|
|
|
ret = kvm_pv_clock_pairing(vcpu, a0, a1);
|
|
|
|
break;
|
KVM: X86: Implement "send IPI" hypercall
Use a hypercall to send IPIs with a single vmexit instead of one vmexit
per destination for xAPIC/x2APIC physical mode, and one vmexit per cluster
for x2APIC cluster mode. An Intel guest can enter x2APIC cluster mode when
interrupt remapping is enabled in QEMU; however, the latest AMD EPYC still
supports only xAPIC mode, which benefits greatly from exit-less IPIs. This patchset
lets a guest send multicast IPIs, with at most 128 destinations per
hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode.
Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):
x2apic cluster mode, vanilla
Dry-run: 0, 2392199 ns
Self-IPI: 6907514, 15027589 ns
Normal IPI: 223910476, 251301666 ns
Broadcast IPI: 0, 9282161150 ns
Broadcast lock: 0, 8812934104 ns
x2apic cluster mode, pv-ipi
Dry-run: 0, 2449341 ns
Self-IPI: 6720360, 15028732 ns
Normal IPI: 228643307, 255708477 ns
Broadcast IPI: 0, 7572293590 ns => 22% performance boost
Broadcast lock: 0, 8316124651 ns
x2apic physical mode, vanilla
Dry-run: 0, 3135933 ns
Self-IPI: 8572670, 17901757 ns
Normal IPI: 226444334, 255421709 ns
Broadcast IPI: 0, 19845070887 ns
Broadcast lock: 0, 19827383656 ns
x2apic physical mode, pv-ipi
Dry-run: 0, 2446381 ns
Self-IPI: 6788217, 15021056 ns
Normal IPI: 219454441, 249583458 ns
Broadcast IPI: 0, 7806540019 ns => 154% performance boost
Broadcast lock: 0, 9143618799 ns
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-07-23 06:39:54 +00:00
	case KVM_HC_SEND_IPI:
		ret = kvm_pv_send_ipi(vcpu->kvm, a0, a1, a2, a3, op_64_bit);
		break;
#endif
	default:
		ret = -KVM_ENOSYS;
		break;
	}
out:
	if (!op_64_bit)
		ret = (u32)ret;
	kvm_register_write(vcpu, VCPU_REGS_RAX, ret);

	++vcpu->stat.hypercalls;
	return kvm_skip_emulated_instruction(vcpu);
}
EXPORT_SYMBOL_GPL(kvm_emulate_hypercall);
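
The register handling in kvm_emulate_hypercall() (truncate the inputs outside 64-bit mode, refuse CPL > 0, dispatch on nr, then truncate the result written to RAX) can be modelled in plain user-space C. The sketch below is a hypothetical approximation, not kernel code: `model_hypercall()` and the `MODEL_*` constants are invented for illustration, and their values are assumed to match `include/uapi/linux/kvm_para.h` (KVM_ENOSYS = 1000, KVM_EPERM = EPERM, KVM_HC_VAPIC_POLL_IRQ = 1, KVM_HC_KICK_CPU = 5); check them against the header before relying on them.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed values; verify against include/uapi/linux/kvm_para.h. */
#define MODEL_KVM_ENOSYS        1000
#define MODEL_KVM_EPERM         EPERM
#define MODEL_HC_VAPIC_POLL_IRQ 1
#define MODEL_HC_KICK_CPU       5

/* Hypothetical model of the nr/return handling in kvm_emulate_hypercall(). */
static uint64_t model_hypercall(uint64_t nr, int cpl, int op_64_bit)
{
	uint64_t ret;

	assert(cpl >= 0 && cpl <= 3);

	/* Outside 64-bit mode only the low 32 bits of RAX are meaningful. */
	if (!op_64_bit)
		nr &= 0xFFFFFFFFu;

	/* Hypercalls are only accepted from guest ring 0. */
	if (cpl != 0) {
		ret = (uint64_t)-MODEL_KVM_EPERM;
		goto out;
	}

	switch (nr) {
	case MODEL_HC_VAPIC_POLL_IRQ:
	case MODEL_HC_KICK_CPU:	/* side effects omitted in this model */
		ret = 0;
		break;
	default:
		ret = (uint64_t)-MODEL_KVM_ENOSYS;
		break;
	}
out:
	/* A 32-bit caller gets the result truncated to 32 bits. */
	if (!op_64_bit)
		ret = (uint32_t)ret;
	return ret;
}
```

Because the result is truncated rather than sign-extended, in this model a 32-bit caller of an unknown hypercall sees 0xFFFFFC18 (-1000 as a 32-bit value).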

static int emulator_fix_hypercall(struct x86_emulate_ctxt *ctxt)
{
	struct kvm_vcpu *vcpu = emul_to_vcpu(ctxt);
	char instruction[3];
	unsigned long rip = kvm_rip_read(vcpu);

	kvm_x86_ops->patch_hypercall(vcpu, instruction);

	return emulator_write_emulated(ctxt, rip, instruction, 3,
		&ctxt->exception);
}
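
emulator_fix_hypercall() lets a guest built with the other vendor's hypercall instruction keep running: the vendor callback supplies the native 3-byte encoding, which is then written over the instruction at RIP. A minimal sketch, assuming VMCALL (0F 01 C1) on Intel VMX and VMMCALL (0F 01 D9) on AMD SVM; `enum model_vendor`, `model_patch_hypercall()` and `model_fix_hypercall()` are hypothetical helpers that patch a flat byte array instead of going through the emulator's guest-memory write path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum model_vendor { MODEL_VMX, MODEL_SVM };

/* Fill in the 3-byte native hypercall instruction for the host vendor. */
static void model_patch_hypercall(enum model_vendor v, uint8_t insn[3])
{
	insn[0] = 0x0f;
	insn[1] = 0x01;
	insn[2] = (v == MODEL_VMX) ? 0xc1 /* VMCALL */ : 0xd9 /* VMMCALL */;
}

/*
 * Overwrite the instruction at 'rip' inside a flat code buffer.
 * Returns 0 on success, -1 if the 3 bytes would not fit.
 */
static int model_fix_hypercall(enum model_vendor v, uint8_t *code,
			       size_t code_len, size_t rip)
{
	uint8_t insn[3];

	assert(code != NULL);
	if (rip > code_len || code_len - rip < sizeof(insn))
		return -1;
	model_patch_hypercall(v, insn);
	memcpy(code + rip, insn, sizeof(insn));
	return 0;
}
```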

static int dm_request_for_irq_injection(struct kvm_vcpu *vcpu)
{
	return vcpu->run->request_interrupt_window &&
		likely(!pic_in_kernel(vcpu->kvm));
}

static void post_kvm_run_save(struct kvm_vcpu *vcpu)
{
	struct kvm_run *kvm_run = vcpu->run;

	kvm_run->if_flag = (kvm_get_rflags(vcpu) & X86_EFLAGS_IF) != 0;
	kvm_run->flags = is_smm(vcpu) ? KVM_RUN_X86_SMM : 0;
	kvm_run->cr8 = kvm_get_cr8(vcpu);
	kvm_run->apic_base = kvm_get_apic_base(vcpu);
	kvm_run->ready_for_interrupt_injection =
		pic_in_kernel(vcpu->kvm) ||
		kvm_vcpu_ready_for_interrupt_injection(vcpu);
}

static void update_cr8_intercept(struct kvm_vcpu *vcpu)
{
	int max_irr, tpr;

	if (!kvm_x86_ops->update_cr8_intercept)
		return;

	if (!lapic_in_kernel(vcpu))
		return;

	if (vcpu->arch.apicv_active)
		return;

	if (!vcpu->arch.apic->vapic_addr)
		max_irr = kvm_lapic_find_highest_irr(vcpu);
	else
		max_irr = -1;

	if (max_irr != -1)
		max_irr >>= 4;

	tpr = kvm_lapic_get_cr8(vcpu);

	kvm_x86_ops->update_cr8_intercept(vcpu, tpr, max_irr);
}
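
update_cr8_intercept() hands the vendor callback the guest TPR as seen through CR8 (TPR[7:4]) and the priority class of the highest pending interrupt (vector >> 4), or -1 when there is none or a vAPIC page is in use. The sketch below shows how a VMX-style backend might turn those two numbers into a TPR threshold. The rule used (threshold 0 when nothing is pending or the pending class already exceeds the TPR, otherwise the pending class) is an assumption about the vendor code, and `model_priority_class()` and `model_tpr_threshold()` are hypothetical names.

```c
#include <assert.h>

/* Priority class of a vector, mirroring 'max_irr >>= 4'; -1 means none. */
static int model_priority_class(int max_irr_vector)
{
	assert(max_irr_vector >= -1 && max_irr_vector <= 255);
	return max_irr_vector == -1 ? -1 : max_irr_vector >> 4;
}

/*
 * Hypothetical threshold rule. 0 means no CR8 write needs to cause an
 * exit; a non-zero value requests an exit once the guest lowers its TPR
 * below that class, at which point the pending interrupt is deliverable.
 */
static int model_tpr_threshold(int tpr, int irr_class)
{
	assert(tpr >= 0 && tpr <= 15);
	if (irr_class == -1 || tpr < irr_class)
		return 0;
	return irr_class;
}
```

For example, vector 0x51 has class 5: with TPR 3 it is already deliverable, while with TPR 7 the guest must first drop its TPR below 5.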

static int inject_pending_event(struct kvm_vcpu *vcpu, bool req_int_win)
{
	int r;

	/* try to reinject previous events if any */

	if (vcpu->arch.exception.injected)
		kvm_x86_ops->queue_exception(vcpu);
	/*
	 * Do not inject an NMI or interrupt if there is a pending
	 * exception. Exceptions and interrupts are recognized at
	 * instruction boundaries, i.e. the start of an instruction.
	 * Trap-like exceptions, e.g. #DB, have higher priority than
	 * NMIs and interrupts, i.e. traps are recognized before an
	 * NMI/interrupt that's pending on the same instruction.
	 * Fault-like exceptions, e.g. #GP and #PF, are the lowest
	 * priority, but are only generated (pended) during instruction
	 * execution, i.e. a pending fault-like exception means the
	 * fault occurred on the *previous* instruction and must be
	 * serviced prior to recognizing any new events in order to
	 * fully complete the previous instruction.
	 */
	else if (!vcpu->arch.exception.pending) {
		if (vcpu->arch.nmi_injected)
			kvm_x86_ops->set_nmi(vcpu);
		else if (vcpu->arch.interrupt.injected)
			kvm_x86_ops->set_irq(vcpu);
	}

	/*
	 * Call check_nested_events() even if we reinjected a previous event
	 * in order for caller to determine if it should require immediate-exit
	 * from L2 to L1 due to pending L1 events which require exit
	 * from L2 to L1.
	 */
	if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
		r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
		if (r != 0)
			return r;
	}

	/* try to inject new event if pending */
	if (vcpu->arch.exception.pending) {
		trace_kvm_inj_exception(vcpu->arch.exception.nr,
					vcpu->arch.exception.has_error_code,
					vcpu->arch.exception.error_code);

		WARN_ON_ONCE(vcpu->arch.exception.injected);
		vcpu->arch.exception.pending = false;
		vcpu->arch.exception.injected = true;

		if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
			__kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
					     X86_EFLAGS_RF);

		if (vcpu->arch.exception.nr == DB_VECTOR &&
		    (vcpu->arch.dr7 & DR7_GD)) {
			vcpu->arch.dr7 &= ~DR7_GD;
			kvm_update_dr7(vcpu);
		}

		kvm_x86_ops->queue_exception(vcpu);
	}

	/* Don't consider new event if we re-injected an event */
	if (kvm_event_needs_reinjection(vcpu))
		return 0;

	if (vcpu->arch.smi_pending && !is_smm(vcpu) &&
	    kvm_x86_ops->smi_allowed(vcpu)) {
		vcpu->arch.smi_pending = false;
		++vcpu->arch.smi_count;
		enter_smm(vcpu);
	} else if (vcpu->arch.nmi_pending && kvm_x86_ops->nmi_allowed(vcpu)) {
KVM: x86: Inject pending interrupt even if pending nmi exist
Non-maskable interrupts (NMIs) are preferred to interrupts in the current
implementation. If an NMI is pending and NMI is blocked by the result
of nmi_allowed(), the pending interrupt is not injected and
enable_irq_window() is not executed, even if interrupt injection is
allowed.
In old kernels (e.g. 2.6.32), schedule() is often called in NMI context.
In this case, interrupts are needed to execute the iret that ends
the NMI. The flag blocking new NMIs is not cleared until the guest
executes the iret, and interrupts are blocked by the pending NMI. Due to
this, iret can't be invoked in the guest, and the guest is starved
until the block is cleared by some event (e.g. canceling injection).
This patch injects pending interrupts, when allowed, even if NMI
is blocked. And if an interrupt is pending after executing
inject_pending_event(), enable_irq_window() is executed regardless of
the NMI pending counter.
Cc: stable@vger.kernel.org
Signed-off-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-24 05:17:03 +00:00
		--vcpu->arch.nmi_pending;
		vcpu->arch.nmi_injected = true;
		kvm_x86_ops->set_nmi(vcpu);
	} else if (kvm_cpu_has_injectable_intr(vcpu)) {
		/*
		 * Because interrupts can be injected asynchronously, we are
		 * calling check_nested_events again here to avoid a race condition.
		 * See https://lkml.org/lkml/2014/7/2/60 for discussion about this
		 * proposal and current concerns. Perhaps we should be setting
		 * KVM_REQ_EVENT only on certain events and not unconditionally?
		 */
		if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events) {
			r = kvm_x86_ops->check_nested_events(vcpu, req_int_win);
			if (r != 0)
				return r;
		}
		if (kvm_x86_ops->interrupt_allowed(vcpu)) {
			kvm_queue_interrupt(vcpu, kvm_cpu_get_interrupt(vcpu),
					    false);
			kvm_x86_ops->set_irq(vcpu);
		}
	}

	return 0;
}
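
The order in inject_pending_event() is: re-inject whatever was already in flight, deliver a newly pending exception, and only when nothing needs re-injection consider an SMI, then an NMI, then a maskable interrupt, each gated by its own *_allowed() check. The sketch below reduces that decision to a pure function over a hypothetical `struct model_events`; `model_pick_event()` and the `MODEL_ACT_*` values are invented names, and the model deliberately ignores nested (L2) handling, tracing and the RF/DR7 side effects. Note that a blocked NMI no longer starves a deliverable interrupt, which is the behaviour described in the commit message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum model_action {
	MODEL_ACT_NONE,
	MODEL_ACT_REINJECT_EXC,
	MODEL_ACT_REINJECT_NMI,
	MODEL_ACT_REINJECT_IRQ,
	MODEL_ACT_EXC,
	MODEL_ACT_SMI,
	MODEL_ACT_NMI,
	MODEL_ACT_IRQ,
};

/* Hypothetical flattening of the vcpu->arch event state used above. */
struct model_events {
	bool exc_injected, exc_pending;	/* exception.injected / .pending */
	bool nmi_injected, irq_injected;	/* nmi_injected / interrupt.injected */
	bool smi_pending, in_smm, smi_allowed;
	unsigned nmi_pending;
	bool nmi_allowed;
	bool irq_pending, irq_allowed;	/* injectable intr / interrupt_allowed() */
};

static enum model_action model_pick_event(const struct model_events *e)
{
	assert(e != NULL);

	/* 1. Re-inject whatever was already in flight at the last VM exit. */
	if (e->exc_injected)
		return MODEL_ACT_REINJECT_EXC;
	if (!e->exc_pending) {
		if (e->nmi_injected)
			return MODEL_ACT_REINJECT_NMI;
		if (e->irq_injected)
			return MODEL_ACT_REINJECT_IRQ;
	}

	/* 2. A pending exception completes the previous instruction first. */
	if (e->exc_pending)
		return MODEL_ACT_EXC;

	/* 3. New events: SMI, then NMI, then a maskable interrupt. */
	if (e->smi_pending && !e->in_smm && e->smi_allowed)
		return MODEL_ACT_SMI;
	if (e->nmi_pending > 0 && e->nmi_allowed)
		return MODEL_ACT_NMI;
	if (e->irq_pending && e->irq_allowed)
		return MODEL_ACT_IRQ;
	return MODEL_ACT_NONE;
}
```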

static void process_nmi(struct kvm_vcpu *vcpu)
{
	unsigned limit = 2;

	/*
	 * x86 is limited to one NMI running, and one NMI pending after it.
	 * If an NMI is already in progress, limit further NMIs to just one.
	 * Otherwise, allow two (and we'll inject the first one immediately).
	 */
	if (kvm_x86_ops->get_nmi_mask(vcpu) || vcpu->arch.nmi_injected)
		limit = 1;

	vcpu->arch.nmi_pending += atomic_xchg(&vcpu->arch.nmi_queued, 0);
	vcpu->arch.nmi_pending = min(vcpu->arch.nmi_pending, limit);
	kvm_make_request(KVM_REQ_EVENT, vcpu);
}
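
process_nmi() folds NMIs queued from other contexts (the atomic nmi_queued counter) into nmi_pending and clamps the total, because x86 holds at most one NMI in service plus one latched behind it. A user-space sketch using C11 <stdatomic.h>; `struct model_nmi_state` and `model_process_nmi()` are hypothetical stand-ins for the vcpu->arch fields, and atomic_exchange() plays the role of the kernel's atomic_xchg().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct model_nmi_state {
	atomic_uint queued;	/* incremented by senders, like nmi_queued */
	unsigned pending;	/* consumed by the injection path */
	bool masked;		/* stands in for get_nmi_mask() */
	bool injected;		/* an NMI is already being delivered */
};

static void model_process_nmi(struct model_nmi_state *s)
{
	unsigned limit = 2;

	assert(s != NULL);
	/* With one NMI already in service, only one more can be latched. */
	if (s->masked || s->injected)
		limit = 1;

	/* Atomically take everything queued so far and reset the queue. */
	s->pending += atomic_exchange(&s->queued, 0u);
	if (s->pending > limit)
		s->pending = limit;
}
```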

static u32 enter_smm_get_segment_flags(struct kvm_segment *seg)
{
	u32 flags = 0;
	flags |= seg->g << 23;
	flags |= seg->db << 22;
	flags |= seg->l << 21;
	flags |= seg->avl << 20;
	flags |= seg->present << 15;
	flags |= seg->dpl << 13;
	flags |= seg->s << 12;
	flags |= seg->type << 8;
	return flags;
}
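
enter_smm_get_segment_flags() places each attribute at the bit position it occupies in the high dword of a legacy segment descriptor: type at bits 8-11, S at 12, DPL at 13-14, P at 15, AVL at 20, L at 21, D/B at 22, G at 23. The 64-bit save area stores the same value shifted right by 8 as a 16-bit field. The sketch reproduces the packing on a hypothetical `struct model_seg` (the real struct kvm_segment has more fields) and checks it against a flat 32-bit code segment.

```c
#include <assert.h>
#include <stdint.h>

struct model_seg {
	uint8_t type, s, dpl, present, avl, l, db, g;
};

static uint32_t model_segment_flags(const struct model_seg *seg)
{
	uint32_t flags = 0;

	assert(seg->type <= 0xf && seg->dpl <= 3);
	flags |= (uint32_t)seg->g       << 23;
	flags |= (uint32_t)seg->db      << 22;
	flags |= (uint32_t)seg->l       << 21;
	flags |= (uint32_t)seg->avl     << 20;
	flags |= (uint32_t)seg->present << 15;
	flags |= (uint32_t)seg->dpl     << 13;
	flags |= (uint32_t)seg->s       << 12;
	flags |= (uint32_t)seg->type    << 8;
	return flags;
}
```

For the flat code segment in the test, the low byte of the 16-bit form is the familiar access byte 0x9B and the top nibble 0xC holds G and D/B.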

static void enter_smm_save_seg_32(struct kvm_vcpu *vcpu, char *buf, int n)
{
	struct kvm_segment seg;
	int offset;

	kvm_get_segment(vcpu, &seg, n);
	put_smstate(u32, buf, 0x7fa8 + n * 4, seg.selector);

	if (n < 3)
		offset = 0x7f84 + n * 12;
	else
		offset = 0x7f2c + (n - 3) * 12;

	put_smstate(u32, buf, offset + 8, seg.base);
	put_smstate(u32, buf, offset + 4, seg.limit);
	put_smstate(u32, buf, offset, enter_smm_get_segment_flags(&seg));
}
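
The state-save area is assembled in a 512-byte host buffer that stands for SMRAM offsets 0x7e00-0x7fff, so every offset given to put_smstate() is rebased by 0x7e00. In the sketch below, `model_put_u32()` is a hypothetical stand-in that uses memcpy() in host byte order instead of a pointer cast (avoiding alignment and aliasing concerns in portable C), and `model_seg32_offset()` repeats the layout rule of enter_smm_save_seg_32(). It assumes KVM's VCPU_SREG_* order ES, CS, SS, DS, FS, GS, so ES/CS/SS land at 0x7f84 + n * 12 and DS/FS/GS at 0x7f2c + (n - 3) * 12.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MODEL_SMSTATE_BASE 0x7e00u	/* buf[0] corresponds to SMRAM offset 0x7e00 */
#define MODEL_SMSTATE_SIZE 512u

/* Store a 32-bit value at an SMRAM state-save offset (host byte order). */
static void model_put_u32(uint8_t *buf, unsigned offset, uint32_t val)
{
	assert(offset >= MODEL_SMSTATE_BASE &&
	       offset + sizeof(val) <= MODEL_SMSTATE_BASE + MODEL_SMSTATE_SIZE);
	memcpy(buf + (offset - MODEL_SMSTATE_BASE), &val, sizeof(val));
}

/* Offset of the flags word of segment n's cached descriptor (n = 0..5). */
static unsigned model_seg32_offset(int n)
{
	assert(n >= 0 && n < 6);
	return n < 3 ? 0x7f84u + n * 12u : 0x7f2cu + (n - 3) * 12u;
}
```

The three descriptors at 0x7f84 end exactly where the selector slots at 0x7fa8 begin, so the two regions do not overlap.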

#ifdef CONFIG_X86_64
static void enter_smm_save_seg_64(struct kvm_vcpu *vcpu, char *buf, int n)
{
	struct kvm_segment seg;
	int offset;
	u16 flags;

	kvm_get_segment(vcpu, &seg, n);
	offset = 0x7e00 + n * 16;

	flags = enter_smm_get_segment_flags(&seg) >> 8;
	put_smstate(u16, buf, offset, seg.selector);
	put_smstate(u16, buf, offset + 2, flags);
	put_smstate(u32, buf, offset + 4, seg.limit);
	put_smstate(u64, buf, offset + 8, seg.base);
}
#endif

static void enter_smm_save_state_32(struct kvm_vcpu *vcpu, char *buf)
{
	struct desc_ptr dt;
	struct kvm_segment seg;
	unsigned long val;
	int i;

	put_smstate(u32, buf, 0x7ffc, kvm_read_cr0(vcpu));
	put_smstate(u32, buf, 0x7ff8, kvm_read_cr3(vcpu));
	put_smstate(u32, buf, 0x7ff4, kvm_get_rflags(vcpu));
	put_smstate(u32, buf, 0x7ff0, kvm_rip_read(vcpu));

	for (i = 0; i < 8; i++)
		put_smstate(u32, buf, 0x7fd0 + i * 4, kvm_register_read(vcpu, i));

	kvm_get_dr(vcpu, 6, &val);
	put_smstate(u32, buf, 0x7fcc, (u32)val);
	kvm_get_dr(vcpu, 7, &val);
	put_smstate(u32, buf, 0x7fc8, (u32)val);

	kvm_get_segment(vcpu, &seg, VCPU_SREG_TR);
	put_smstate(u32, buf, 0x7fc4, seg.selector);
	put_smstate(u32, buf, 0x7f64, seg.base);
	put_smstate(u32, buf, 0x7f60, seg.limit);
	put_smstate(u32, buf, 0x7f5c, enter_smm_get_segment_flags(&seg));

	kvm_get_segment(vcpu, &seg, VCPU_SREG_LDTR);
	put_smstate(u32, buf, 0x7fc0, seg.selector);
	put_smstate(u32, buf, 0x7f80, seg.base);
	put_smstate(u32, buf, 0x7f7c, seg.limit);
	put_smstate(u32, buf, 0x7f78, enter_smm_get_segment_flags(&seg));

	kvm_x86_ops->get_gdt(vcpu, &dt);
	put_smstate(u32, buf, 0x7f74, dt.address);
	put_smstate(u32, buf, 0x7f70, dt.size);

	kvm_x86_ops->get_idt(vcpu, &dt);
	put_smstate(u32, buf, 0x7f58, dt.address);
	put_smstate(u32, buf, 0x7f54, dt.size);

	for (i = 0; i < 6; i++)
		enter_smm_save_seg_32(vcpu, buf, i);

	put_smstate(u32, buf, 0x7f14, kvm_read_cr4(vcpu));

	/* revision id */
	put_smstate(u32, buf, 0x7efc, 0x00020000);
	put_smstate(u32, buf, 0x7ef8, vcpu->arch.smbase);
}

static void enter_smm_save_state_64(struct kvm_vcpu *vcpu, char *buf)
{
#ifdef CONFIG_X86_64
	struct desc_ptr dt;
	struct kvm_segment seg;
	unsigned long val;
	int i;

	for (i = 0; i < 16; i++)
		put_smstate(u64, buf, 0x7ff8 - i * 8, kvm_register_read(vcpu, i));

	put_smstate(u64, buf, 0x7f78, kvm_rip_read(vcpu));
	put_smstate(u32, buf, 0x7f70, kvm_get_rflags(vcpu));

	kvm_get_dr(vcpu, 6, &val);
	put_smstate(u64, buf, 0x7f68, val);
	kvm_get_dr(vcpu, 7, &val);
	put_smstate(u64, buf, 0x7f60, val);

	put_smstate(u64, buf, 0x7f58, kvm_read_cr0(vcpu));
	put_smstate(u64, buf, 0x7f50, kvm_read_cr3(vcpu));
	put_smstate(u64, buf, 0x7f48, kvm_read_cr4(vcpu));

	put_smstate(u32, buf, 0x7f00, vcpu->arch.smbase);

	/* revision id */
	put_smstate(u32, buf, 0x7efc, 0x00020064);

	put_smstate(u64, buf, 0x7ed0, vcpu->arch.efer);

	kvm_get_segment(vcpu, &seg, VCPU_SREG_TR);
	put_smstate(u16, buf, 0x7e90, seg.selector);
	put_smstate(u16, buf, 0x7e92, enter_smm_get_segment_flags(&seg) >> 8);
	put_smstate(u32, buf, 0x7e94, seg.limit);
	put_smstate(u64, buf, 0x7e98, seg.base);

	kvm_x86_ops->get_idt(vcpu, &dt);
	put_smstate(u32, buf, 0x7e84, dt.size);
	put_smstate(u64, buf, 0x7e88, dt.address);

	kvm_get_segment(vcpu, &seg, VCPU_SREG_LDTR);
	put_smstate(u16, buf, 0x7e70, seg.selector);
	put_smstate(u16, buf, 0x7e72, enter_smm_get_segment_flags(&seg) >> 8);
	put_smstate(u32, buf, 0x7e74, seg.limit);
	put_smstate(u64, buf, 0x7e78, seg.base);

	kvm_x86_ops->get_gdt(vcpu, &dt);
	put_smstate(u32, buf, 0x7e64, dt.size);
	put_smstate(u64, buf, 0x7e68, dt.address);

	for (i = 0; i < 6; i++)
		enter_smm_save_seg_64(vcpu, buf, i);
#else
	WARN_ON_ONCE(1);
#endif
}

static void enter_smm(struct kvm_vcpu *vcpu)
{
	struct kvm_segment cs, ds;
	struct desc_ptr dt;
	char buf[512];
	u32 cr0;

	trace_kvm_enter_smm(vcpu->vcpu_id, vcpu->arch.smbase, true);
	memset(buf, 0, 512);
	if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
		enter_smm_save_state_64(vcpu, buf);
	else
		enter_smm_save_state_32(vcpu, buf);

	/*
	 * Give pre_enter_smm() a chance to make ISA-specific changes to the
	 * vCPU state (e.g. leave guest mode) after we've saved the state into
	 * the SMM state-save area.
	 */
	kvm_x86_ops->pre_enter_smm(vcpu, buf);

	vcpu->arch.hflags |= HF_SMM_MASK;
	kvm_vcpu_write_guest(vcpu, vcpu->arch.smbase + 0xfe00, buf, sizeof(buf));

	if (kvm_x86_ops->get_nmi_mask(vcpu))
		vcpu->arch.hflags |= HF_SMM_INSIDE_NMI_MASK;
	else
		kvm_x86_ops->set_nmi_mask(vcpu, true);

	kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
	kvm_rip_write(vcpu, 0x8000);

	cr0 = vcpu->arch.cr0 & ~(X86_CR0_PE | X86_CR0_EM | X86_CR0_TS | X86_CR0_PG);
	kvm_x86_ops->set_cr0(vcpu, cr0);
	vcpu->arch.cr0 = cr0;

	kvm_x86_ops->set_cr4(vcpu, 0);

	/* Undocumented: IDT limit is set to zero on entry to SMM. */
	dt.address = dt.size = 0;
	kvm_x86_ops->set_idt(vcpu, &dt);

	__kvm_set_dr(vcpu, 7, DR7_FIXED_1);

	cs.selector = (vcpu->arch.smbase >> 4) & 0xffff;
	cs.base = vcpu->arch.smbase;

	ds.selector = 0;
	ds.base = 0;

	cs.limit = ds.limit = 0xffffffff;
	cs.type = ds.type = 0x3;
	cs.dpl = ds.dpl = 0;
	cs.db = ds.db = 0;
	cs.s = ds.s = 1;
	cs.l = ds.l = 0;
	cs.g = ds.g = 1;
	cs.avl = ds.avl = 0;
	cs.present = ds.present = 1;
	cs.unusable = ds.unusable = 0;
	cs.padding = ds.padding = 0;

	kvm_set_segment(vcpu, &cs, VCPU_SREG_CS);
	kvm_set_segment(vcpu, &ds, VCPU_SREG_DS);
	kvm_set_segment(vcpu, &ds, VCPU_SREG_ES);
	kvm_set_segment(vcpu, &ds, VCPU_SREG_FS);
	kvm_set_segment(vcpu, &ds, VCPU_SREG_GS);
	kvm_set_segment(vcpu, &ds, VCPU_SREG_SS);

	if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
		kvm_x86_ops->set_efer(vcpu, 0);

	kvm_update_cpuid(vcpu);
	kvm_mmu_reset_context(vcpu);
}
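
enter_smm() writes the 512-byte save area at SMBASE + 0xfe00 and restarts the vCPU at CS.base = SMBASE, RIP = 0x8000, with a CS selector of SMBASE >> 4. A small worked sketch of those address computations; `struct model_smm_entry` and `model_compute_smm_entry()` are hypothetical names, and the example SMBASE of 0x30000 is the architectural reset default as I understand the Intel SDM.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical summary of the addresses enter_smm() derives from SMBASE. */
struct model_smm_entry {
	uint16_t cs_selector;		/* (smbase >> 4) & 0xffff */
	uint32_t cs_base;		/* smbase */
	uint32_t entry_linear;		/* cs_base + RIP, with RIP = 0x8000 */
	uint32_t save_area_start;	/* smbase + 0xfe00 */
	uint32_t save_area_end;		/* one past the 512-byte save area */
};

static struct model_smm_entry model_compute_smm_entry(uint32_t smbase)
{
	struct model_smm_entry e;

	/* Keep the 64 KiB SMRAM window addressable in 32 bits. */
	assert(smbase <= UINT32_MAX - 0xffffu);

	e.cs_selector = (uint16_t)((smbase >> 4) & 0xffff);
	e.cs_base = smbase;
	e.entry_linear = smbase + 0x8000u;
	e.save_area_start = smbase + 0xfe00u;
	e.save_area_end = e.save_area_start + 512u;
	return e;
}
```

With SMBASE = 0x30000 the handler starts at 0x38000 and the save area occupies 0x3fe00-0x3ffff; buffer offsets 0 to 511 thus correspond to the 0x7e00-0x7fff offsets used with put_smstate(), measured from SMBASE + 0x8000.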

static void process_smi(struct kvm_vcpu *vcpu)
{
	vcpu->arch.smi_pending = true;
	kvm_make_request(KVM_REQ_EVENT, vcpu);
}

void kvm_make_scan_ioapic_request(struct kvm *kvm)
{
	kvm_make_all_cpus_request(kvm, KVM_REQ_SCAN_IOAPIC);
}
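
process_smi(), process_nmi() and kvm_make_scan_ioapic_request() communicate with vcpu_enter_guest() through per-vCPU request bits: a producer sets a bit, and the vCPU later consumes it with kvm_check_request(). A user-space sketch of that pattern with C11 atomics; the `model_*` names and bit numbers are hypothetical, the consumer here uses a single atomic fetch-and (the kernel tests and then clears), and the kernel's memory-barrier details and vCPU kicks are omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

enum { MODEL_REQ_EVENT = 0, MODEL_REQ_SMI = 1, MODEL_REQ_SCAN_IOAPIC = 2 };

struct model_vcpu {
	_Atomic uint64_t requests;
};

/* Producer side: post a request; posting twice coalesces into one bit. */
static void model_make_request(int req, struct model_vcpu *v)
{
	assert(req >= 0 && req < 64);
	atomic_fetch_or(&v->requests, (uint64_t)1 << req);
}

/* Consumer side: atomically clear the bit and report whether it was set. */
static bool model_check_request(int req, struct model_vcpu *v)
{
	uint64_t bit = (uint64_t)1 << req;

	assert(req >= 0 && req < 64);
	return (atomic_fetch_and(&v->requests, ~bit) & bit) != 0;
}
```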

static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)
{
	if (!kvm_apic_hw_enabled(vcpu->arch.apic))
		return;

	bitmap_zero(vcpu->arch.ioapic_handled_vectors, 256);

	if (irqchip_split(vcpu->kvm))
		kvm_scan_ioapic_routes(vcpu, vcpu->arch.ioapic_handled_vectors);
	else {
		if (vcpu->arch.apicv_active)
			kvm_x86_ops->sync_pir_to_irr(vcpu);
		kvm_ioapic_scan_entry(vcpu, vcpu->arch.ioapic_handled_vectors);
	}
KVM: nVMX: Do not load EOI-exitmap while running L2
When L1 IOAPIC redirection-table is written, a request of
KVM_REQ_SCAN_IOAPIC is set on all vCPUs. This is done such that
all vCPUs will now recalc their IOAPIC handled vectors and load
it to their EOI-exitmap.
However, it could be that one of the vCPUs is currently running
L2. In this case, load_eoi_exitmap() will be called which would
write to vmcs02->eoi_exit_bitmap, which is wrong because
vmcs02->eoi_exit_bitmap should always be equal to
vmcs12->eoi_exit_bitmap. Furthermore, at this point
KVM_REQ_SCAN_IOAPIC was already consumed and therefore we will
never update vmcs01->eoi_exit_bitmap. This could lead to remote_irr
of some IOAPIC level-triggered entry to remain set forever.
Fix this issue by delaying the load of EOI-exitmap to when vCPU
is running L1.
One may wonder why not just delay the entire KVM_REQ_SCAN_IOAPIC
processing to when vCPU is running L1. This is done in order to handle
correctly the case where the LAPIC & IO-APIC of L1 are passed through into
L2. In this case, vmcs12->virtual_interrupt_delivery should be 0. In the
current nVMX implementation, that results in
vmcs02->virtual_interrupt_delivery also being 0. Thus,
vmcs02->eoi_exit_bitmap is not used. Therefore, every L2 EOI causes
a #VMExit into L0 (either on MSR_WRITE to x2APIC MSR or
APIC_ACCESS/APIC_WRITE/EPT_MISCONFIG to APIC MMIO page).
In order for such an L2 EOI to be broadcast, if needed, from LAPIC
to IO-APIC, vcpu->arch.ioapic_handled_vectors must be updated
while L2 is running. Therefore, the patch makes sure to delay only the
loading of EOI-exitmap but not the update of
vcpu->arch.ioapic_handled_vectors.
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-21 00:50:31 +00:00

	if (is_guest_mode(vcpu))
		vcpu->arch.load_eoi_exitmap_pending = true;
	else
		kvm_make_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu);
}

static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
{
	u64 eoi_exit_bitmap[4];

	if (!kvm_apic_hw_enabled(vcpu->arch.apic))
		return;

	bitmap_or((ulong *)eoi_exit_bitmap, vcpu->arch.ioapic_handled_vectors,
		  vcpu_to_synic(vcpu)->vec_bitmap, 256);
	kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
}
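
vcpu_load_eoi_exitmap() ORs two 256-bit vector bitmaps (the IOAPIC-handled vectors and the SynIC vectors) into four u64 words, one bit per vector, while vcpu_scan_ioapic() defers that load whenever the vCPU is in guest mode (running L2), as described in the nVMX commit message. A sketch of both steps with plain arrays; all `model_*` names are hypothetical, and the `load_pending` flag stands in for vcpu->arch.load_eoi_exitmap_pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_VECTORS 256

/* Mark one vector in a 256-bit bitmap stored as four u64 words. */
static void model_set_vector(uint64_t map[4], unsigned vec)
{
	assert(vec < MODEL_NR_VECTORS);
	map[vec / 64] |= (uint64_t)1 << (vec % 64);
}

/* eoi_exit = ioapic | synic, word by word (what bitmap_or() computes). */
static void model_build_eoi_exitmap(uint64_t out[4], const uint64_t ioapic[4],
				    const uint64_t synic[4])
{
	for (int i = 0; i < 4; i++)
		out[i] = ioapic[i] | synic[i];
}

/* Returns true if the bitmap may be loaded now; otherwise marks it pending. */
static bool model_request_eoi_load(bool in_guest_mode, bool *load_pending)
{
	if (in_guest_mode) {
		*load_pending = true;
		return false;
	}
	return true;
}
```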

int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
		unsigned long start, unsigned long end,
		bool blockable)
{
	unsigned long apic_address;

	/*
	 * The physical address of apic access page is stored in the VMCS.
	 * Update it when it becomes invalid.
	 */
	apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
	if (start <= apic_address && apic_address < end)
		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);

	return 0;
}
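
kvm_arch_mmu_notifier_invalidate_range() only checks whether the host virtual address backing the APIC access page falls inside the half-open range [start, end) being invalidated; if so, every vCPU is asked to reload the page. Because kvm_vcpu_reload_apic_access_page() drops its page reference instead of pinning the page, the notifier can fire again on a later migration or swap-out. A sketch of the range test and the broadcast, with a hypothetical `struct model_vm` in place of struct kvm and one flag per vCPU in place of KVM_REQ_APIC_PAGE_RELOAD.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_VCPUS 4

struct model_vm {
	unsigned long apic_hva;		/* stands in for gfn_to_hva(...) */
	size_t nr_vcpus;
	bool reload_requested[MODEL_MAX_VCPUS];
};

/* Half-open interval test: 'end' itself is not part of the range. */
static bool model_range_covers(unsigned long start, unsigned long end,
			       unsigned long addr)
{
	return start <= addr && addr < end;
}

static void model_invalidate_range(struct model_vm *vm, unsigned long start,
				   unsigned long end)
{
	assert(vm->nr_vcpus <= MODEL_MAX_VCPUS);
	if (!model_range_covers(start, end, vm->apic_hva))
		return;
	for (size_t i = 0; i < vm->nr_vcpus; i++)
		vm->reload_requested[i] = true;
}
```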

void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
{
	struct page *page = NULL;

	if (!lapic_in_kernel(vcpu))
		return;

	if (!kvm_x86_ops->set_apic_access_page_addr)
		return;

	page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
	if (is_error_page(page))
		return;
	kvm_x86_ops->set_apic_access_page_addr(vcpu, page_to_phys(page));

	/*
	 * Do not pin apic access page in memory, the MMU notifier
	 * will call us again if it is migrated or swapped out.
	 */
	put_page(page);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_reload_apic_access_page);

/*
 * Returns 1 to let vcpu_run() continue the guest execution loop without
 * exiting to the userspace. Otherwise, the value will be returned to the
 * userspace.
 */
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
	int r;
	bool req_int_win =
		dm_request_for_irq_injection(vcpu) &&
		kvm_cpu_accept_dm_intr(vcpu);

	bool req_immediate_exit = false;

	if (kvm_request_pending(vcpu)) {
		if (kvm_check_request(KVM_REQ_GET_VMCS12_PAGES, vcpu))
			kvm_x86_ops->get_vmcs12_pages(vcpu);
		if (kvm_check_request(KVM_REQ_MMU_RELOAD, vcpu))
			kvm_mmu_unload(vcpu);
		if (kvm_check_request(KVM_REQ_MIGRATE_TIMER, vcpu))
			__kvm_migrate_timers(vcpu);
		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
			kvm_gen_update_masterclock(vcpu->kvm);
		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
			kvm_gen_kvmclock_update(vcpu);
		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
			r = kvm_guest_time_update(vcpu);
			if (unlikely(r))
				goto out;
		}
		if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
			kvm_mmu_sync_roots(vcpu);
		if (kvm_check_request(KVM_REQ_LOAD_CR3, vcpu))
			kvm_mmu_load_cr3(vcpu);
		if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
			kvm_vcpu_flush_tlb(vcpu, true);
		if (kvm_check_request(KVM_REQ_REPORT_TPR_ACCESS, vcpu)) {
			vcpu->run->exit_reason = KVM_EXIT_TPR_ACCESS;
			r = 0;
			goto out;
		}
		if (kvm_check_request(KVM_REQ_TRIPLE_FAULT, vcpu)) {
			vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
KVM: X86: Fix residual mmio emulation request to userspace
Reported by syzkaller:
The kvm-intel.unrestricted_guest=0
WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
CPU: 5 PID: 1014 Comm: warn_test Tainted: G W OE 4.13.0-rc3+ #8
RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm]
Call Trace:
? put_pid+0x3a/0x50
? rcu_read_lock_sched_held+0x79/0x80
? kmem_cache_free+0x2f2/0x350
kvm_vcpu_ioctl+0x340/0x700 [kvm]
? kvm_vcpu_ioctl+0x340/0x700 [kvm]
? __fget+0xfc/0x210
do_vfs_ioctl+0xa4/0x6a0
? __fget+0x11d/0x210
SyS_ioctl+0x79/0x90
entry_SYSCALL_64_fastpath+0x23/0xc2
? __this_cpu_preempt_check+0x13/0x20
The syzkaller folks reported a residual mmio emulation request to userspace
because vm86 fails to emulate injecting a real-mode interrupt (it fails to read CS) and
incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true
and KVM_EXIT_SHUTDOWN exit reason. However, the syzkaller testcase constructs
several threads to launch the same vCPU; the thread which launches this vCPU after
the thread which gets vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will
trigger the warning.
#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
int kvmcpu;
struct kvm_run *run;
void* thr(void* arg)
{
int res;
res = ioctl(kvmcpu, KVM_RUN, 0);
printf("ret1=%d exit_reason=%d suberror=%d\n",
res, run->exit_reason, run->internal.suberror);
return 0;
}
void test()
{
int i, kvm, kvmvm;
pthread_t th[4];
kvm = open("/dev/kvm", O_RDWR);
kvmvm = ioctl(kvm, KVM_CREATE_VM, 0);
kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0);
run = (struct kvm_run*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, kvmcpu, 0);
srand(getpid());
for (i = 0; i < 4; i++) {
pthread_create(&th[i], 0, thr, 0);
usleep(rand() % 10000);
}
for (i = 0; i < 4; i++)
pthread_join(th[i], 0);
}
int main()
{
for (;;) {
int pid = fork();
if (pid < 0)
exit(1);
if (pid == 0) {
test();
exit(0);
}
int status;
while (waitpid(pid, &status, __WALL) != pid) {}
}
return 0;
}
This patch fixes it by resetting the vcpu->mmio_needed once we receive
the triple fault to avoid the residue.
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Tested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-08-10 05:33:12 +00:00
			vcpu->mmio_needed = 0;
			r = 0;
			goto out;
		}
		if (kvm_check_request(KVM_REQ_APF_HALT, vcpu)) {
			/* Page is swapped out. Do synthetic halt */
			vcpu->arch.apf.halted = true;
			r = 1;
			goto out;
		}
		if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
			record_steal_time(vcpu);
		if (kvm_check_request(KVM_REQ_SMI, vcpu))
			process_smi(vcpu);
		if (kvm_check_request(KVM_REQ_NMI, vcpu))
			process_nmi(vcpu);
		if (kvm_check_request(KVM_REQ_PMU, vcpu))
			kvm_pmu_handle_event(vcpu);
		if (kvm_check_request(KVM_REQ_PMI, vcpu))
			kvm_pmu_deliver_pmi(vcpu);
		if (kvm_check_request(KVM_REQ_IOAPIC_EOI_EXIT, vcpu)) {
			BUG_ON(vcpu->arch.pending_ioapic_eoi > 255);
			if (test_bit(vcpu->arch.pending_ioapic_eoi,
				     vcpu->arch.ioapic_handled_vectors)) {
				vcpu->run->exit_reason = KVM_EXIT_IOAPIC_EOI;
				vcpu->run->eoi.vector =
						vcpu->arch.pending_ioapic_eoi;
				r = 0;
				goto out;
			}
		}
		if (kvm_check_request(KVM_REQ_SCAN_IOAPIC, vcpu))
			vcpu_scan_ioapic(vcpu);
KVM: nVMX: Do not load EOI-exitmap while running L2
When the L1 IOAPIC redirection table is written, a
KVM_REQ_SCAN_IOAPIC request is set on all vCPUs. This is done so that
all vCPUs recalculate their IOAPIC handled vectors and load
them into their EOI-exitmap.
However, one of the vCPUs could currently be running
L2. In this case, load_eoi_exitmap() will be called, which would
write to vmcs02->eoi_exit_bitmap. That is wrong because
vmcs02->eoi_exit_bitmap should always be equal to
vmcs12->eoi_exit_bitmap. Furthermore, at this point
KVM_REQ_SCAN_IOAPIC was already consumed and therefore we will
never update vmcs01->eoi_exit_bitmap. This could cause remote_irr
of some IOAPIC level-triggered entry to remain set forever.
Fix this issue by delaying the load of the EOI-exitmap until the vCPU
is running L1.
One may wonder why not just delay the entire KVM_REQ_SCAN_IOAPIC
processing until the vCPU is running L1. The split is needed to correctly
handle the case where the LAPIC and IO-APIC of L1 are passed through to
L2. In this case, vmcs12->virtual_interrupt_delivery should be 0. In the
current nVMX implementation, that results in
vmcs02->virtual_interrupt_delivery also being 0. Thus,
vmcs02->eoi_exit_bitmap is not used. Therefore, every L2 EOI causes
a #VMExit into L0 (either on MSR_WRITE to x2APIC MSR or
APIC_ACCESS/APIC_WRITE/EPT_MISCONFIG to APIC MMIO page).
In order for such an L2 EOI to be broadcast, if needed, from LAPIC
to IO-APIC, vcpu->arch.ioapic_handled_vectors must be updated
while L2 is running. Therefore, this patch delays only the
loading of the EOI-exitmap but not the update of
vcpu->arch.ioapic_handled_vectors.
Reviewed-by: Arbel Moshe <arbel.moshe@oracle.com>
Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-03-21 00:50:31 +00:00
		if (kvm_check_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu))
			vcpu_load_eoi_exitmap(vcpu);
2014-09-24 07:57:54 +00:00
		if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
			kvm_vcpu_reload_apic_access_page(vcpu);
2015-07-03 12:01:41 +00:00
		if (kvm_check_request(KVM_REQ_HV_CRASH, vcpu)) {
			vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
			vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
			r = 0;
			goto out;
		}
2015-09-16 09:29:48 +00:00
		if (kvm_check_request(KVM_REQ_HV_RESET, vcpu)) {
			vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
			vcpu->run->system_event.type = KVM_SYSTEM_EVENT_RESET;
			r = 0;
			goto out;
		}
2015-11-10 12:36:35 +00:00
		if (kvm_check_request(KVM_REQ_HV_EXIT, vcpu)) {
			vcpu->run->exit_reason = KVM_EXIT_HYPERV;
			vcpu->run->hyperv = vcpu->arch.hyperv.exit;
			r = 0;
			goto out;
		}
2015-12-28 15:27:24 +00:00

		/*
		 * KVM_REQ_HV_STIMER has to be processed after
		 * KVM_REQ_CLOCK_UPDATE, because Hyper-V SynIC timers
		 * depend on the guest clock being up-to-date
		 */
2015-11-30 16:22:21 +00:00
		if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
			kvm_hv_process_stimers(vcpu);
2008-01-16 10:49:30 +00:00
	}
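The request checks above share one pattern: another context sets a bit in the vCPU's request word, and the vCPU thread consumes it with a test-and-clear before entry. The "Do not load EOI-exitmap while running L2" commit splits one step into two: the handled-vector set is recomputed at once, while loading the exit bitmap is deferred until the vCPU is back in L1. The following is a minimal user-space sketch of that flow, assuming C11 atomics stand in for the kernel's bitops. `struct toy_vcpu`, the `REQ_*` numbers, `leave_l2()` and the `load_pending` flag are hypothetical and only approximate how the real code tracks the deferred load.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical request bit numbers; not KVM's real values. */
enum { REQ_SCAN_IOAPIC = 0, REQ_LOAD_EOI_EXITMAP = 1 };

struct toy_vcpu {
	atomic_ulong requests;   /* one bit per pending request */
	bool in_l2;              /* stands in for is_guest_mode() */
	bool load_pending;       /* hypothetical: exit-bitmap load deferred */
	int vectors_recomputed;  /* times the handled-vector set was rebuilt */
	int exitmap_loads;       /* times the exit bitmap was loaded */
};

static void make_request(struct toy_vcpu *v, int req)
{
	atomic_fetch_or(&v->requests, 1UL << req);
}

/* Test-and-clear: returns true only for the caller that consumes the bit. */
static bool check_request(struct toy_vcpu *v, int req)
{
	unsigned long bit = 1UL << req;

	return (atomic_fetch_and(&v->requests, ~bit) & bit) != 0;
}

/* Recompute immediately; load the exit bitmap only while running L1. */
static void scan_ioapic(struct toy_vcpu *v)
{
	v->vectors_recomputed++;
	if (v->in_l2)
		v->load_pending = true;
	else
		make_request(v, REQ_LOAD_EOI_EXITMAP);
}

/* Hypothetical nested-exit hook: back in L1, re-raise the deferred load. */
static void leave_l2(struct toy_vcpu *v)
{
	v->in_l2 = false;
	if (v->load_pending) {
		v->load_pending = false;
		make_request(v, REQ_LOAD_EOI_EXITMAP);
	}
}

/* Same order as the two checks in the request block above. */
static void process_requests(struct toy_vcpu *v)
{
	if (check_request(v, REQ_SCAN_IOAPIC))
		scan_ioapic(v);
	if (check_request(v, REQ_LOAD_EOI_EXITMAP))
		v->exitmap_loads++;
}
```

The sketch leaves out the memory barriers that KVM pairs with setting and checking a request; it only shows which step runs when.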
2007-10-25 14:52:32 +00:00

2010-07-20 12:06:17 +00:00
	if (kvm_check_request(KVM_REQ_EVENT, vcpu) || req_int_win) {
2016-12-17 15:05:19 +00:00
		++vcpu->stat.req_event;
2013-03-13 11:42:34 +00:00
		kvm_apic_accept_events(vcpu);
		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
			r = 1;
			goto out;
		}

2014-03-07 19:03:12 +00:00
		if (inject_pending_event(vcpu, req_int_win) != 0)
			req_immediate_exit = true;
KVM: x86: Inject pending interrupt even if pending nmi exist
Non-maskable interrupts (NMIs) are preferred to interrupts in the current
implementation. If an NMI is pending and NMI is blocked according to
nmi_allowed(), the pending interrupt is not injected and
enable_irq_window() is not executed, even if interrupt injection is
allowed.
In old kernels (e.g. 2.6.32), schedule() is often called in NMI context.
In this case, interrupts are needed to reach the iret that ends the
NMI. The flag blocking new NMIs is not cleared until the guest
executes the iret, and interrupts are blocked by the pending NMI. Due to
this, iret can't be invoked in the guest, and the guest is starved
until the block is cleared by some event (e.g. canceling injection).
This patch injects pending interrupts, when allowed, even if NMI
is blocked. And, if an interrupt is pending after executing
inject_pending_event(), enable_irq_window() is executed regardless of
the NMI pending counter.
Cc: stable@vger.kernel.org
Signed-off-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-03-24 05:17:03 +00:00
		else {
2017-10-17 14:02:39 +00:00
			/* Enable SMI/NMI/IRQ window open exits if needed.
2016-06-01 20:26:00 +00:00
			 *
2017-10-17 14:02:39 +00:00
			 * SMIs have three cases:
			 * 1) They can be nested, and then there is nothing to
			 *    do here because RSM will cause a vmexit anyway.
			 * 2) There is an ISA-specific reason why SMI cannot be
			 *    injected, and the moment when this changes can be
			 *    intercepted.
			 * 3) Or the SMI can be pending because
			 *    inject_pending_event has completed the injection
			 *    of an IRQ or NMI from the previous vmexit, and
			 *    then we request an immediate exit to inject the
			 *    SMI.
2016-06-01 20:26:00 +00:00
			 */
			if (vcpu->arch.smi_pending && !is_smm(vcpu))
2017-10-17 14:02:39 +00:00
				if (!kvm_x86_ops->enable_smi_window(vcpu))
					req_immediate_exit = true;
2016-03-24 05:17:03 +00:00
			if (vcpu->arch.nmi_pending)
				kvm_x86_ops->enable_nmi_window(vcpu);
			if (kvm_cpu_has_injectable_intr(vcpu) || req_int_win)
				kvm_x86_ops->enable_irq_window(vcpu);
2017-08-24 10:35:09 +00:00
			WARN_ON(vcpu->arch.exception.pending);
2016-03-24 05:17:03 +00:00
		}
2010-07-20 12:06:17 +00:00

		if (kvm_lapic_enabled(vcpu)) {
			update_cr8_intercept(vcpu);
			kvm_lapic_sync_to_vapic(vcpu);
		}
	}

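The branch above, as changed by "Inject pending interrupt even if pending nmi exist", arms the NMI and IRQ windows independently, so a blocked NMI no longer keeps the IRQ window closed. On the SMI path, an immediate exit is requested only when the vendor enable_smi_window hook returns 0. Below is a minimal sketch of that decision as a pure function, assuming the inputs are sampled once; `struct event_state` and the result flags are hypothetical names, not KVM types, and the meaning of a nonzero inject_pending_event() return is taken as given rather than modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state the branch above consults. */
struct event_state {
	bool inject_ret_nonzero; /* inject_pending_event() returned != 0 */
	bool smi_pending;
	bool in_smm;             /* is_smm() */
	bool smi_hook_ret;       /* return value of enable_smi_window() */
	bool nmi_pending;
	bool injectable_intr;    /* kvm_cpu_has_injectable_intr() */
	bool req_int_win;        /* userspace asked for an IRQ window */
};

/* Hypothetical result flags. */
enum {
	IMMEDIATE_EXIT  = 1 << 0,
	OPEN_NMI_WINDOW = 1 << 1,
	OPEN_IRQ_WINDOW = 1 << 2,
};

static unsigned decide_windows(const struct event_state *s)
{
	unsigned out = 0;

	if (s->inject_ret_nonzero)
		return IMMEDIATE_EXIT;

	/* SMI: immediate exit only if the vendor hook could not arm a window. */
	if (s->smi_pending && !s->in_smm && !s->smi_hook_ret)
		out |= IMMEDIATE_EXIT;
	/* NMI and IRQ windows are decided independently of each other. */
	if (s->nmi_pending)
		out |= OPEN_NMI_WINDOW;
	if (s->injectable_intr || s->req_int_win)
		out |= OPEN_IRQ_WINDOW;
	return out;
}
```

With a pending NMI and an injectable interrupt, both windows are requested; before the commit, the pending NMI would have suppressed the IRQ window.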
2012-05-14 15:07:56 +00:00
	r = kvm_mmu_reload(vcpu);
	if (unlikely(r)) {
2012-06-24 16:25:00 +00:00
		goto cancel_injection;
2012-05-14 15:07:56 +00:00
	}

2007-11-01 19:16:10 +00:00
	preempt_disable();

	kvm_x86_ops->prepare_guest_switch(vcpu);
kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection
Since bf9f6ac8d749 ("KVM: Update Posted-Interrupts Descriptor when vCPU
is blocked", 2015-09-18) the posted interrupt descriptor is checked
unconditionally for PIR.ON. Therefore we don't need KVM_REQ_EVENT to
trigger the scan and, if NMIs or SMIs are not involved, we can avoid
the complicated event injection path.
Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
there since APICv was introduced.
However, without the KVM_REQ_EVENT safety net KVM needs to be much
more careful about races between vmx_deliver_posted_interrupt and
vcpu_enter_guest. First, the IPI for posted interrupts may be issued
between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
If that happens, kvm_trigger_posted_interrupt returns true, but
smp_kvm_posted_intr_ipi doesn't do anything about it. The guest is
entered with PIR.ON, but the posted interrupt IPI has not been sent
and the interrupt is only delivered to the guest on the next vmentry
(if any). To fix this, disable interrupts before setting vcpu->mode.
This ensures that the IPI is delayed until the guest enters non-root mode;
it is then trapped by the processor causing the interrupt to be injected.
Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
and vcpu->mode = IN_GUEST_MODE. In this case, kvm_vcpu_kick is called
but it (correctly) doesn't do anything because it sees vcpu->mode ==
OUTSIDE_GUEST_MODE. Again, the guest is entered with PIR.ON but no
posted interrupt IPI is pending; this time, the fix for this is to move
the RVI update after IN_GUEST_MODE.
Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT,
though the second could actually happen with VT-d posted interrupts.
In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
in another vmentry which would inject the interrupt.
This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 12:57:33 +00:00

	/*
	 * Disable IRQs before setting IN_GUEST_MODE. Posted interrupt
	 * IPI are then delayed after guest entry, which ensures that they
	 * result in virtual interrupt delivery.
	 */
	local_irq_disable();
2011-01-12 07:40:31 +00:00
	vcpu->mode = IN_GUEST_MODE;

2013-11-04 20:36:25 +00:00
	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

2016-03-13 03:10:29 +00:00
	/*
2016-12-19 12:57:33 +00:00
	 * 1) We should set ->mode before checking ->requests. Please see
2017-04-26 20:32:24 +00:00
	 * the comment in kvm_vcpu_exiting_guest_mode().
2016-12-19 12:57:33 +00:00
	 *
	 * 2) For APICv, we should set ->mode before checking PIR.ON. This
	 * pairs with the memory barrier implicit in pi_test_and_set_on
	 * (see vmx_deliver_posted_interrupt).
	 *
	 * 3) This also orders the write to mode from any reads to the page
	 * tables done while the VCPU is running. Please see the comment
	 * in kvm_flush_remote_tlbs.
2011-01-12 07:40:31 +00:00
	 */
2013-11-04 20:36:25 +00:00
	smp_mb__after_srcu_read_unlock();
2007-11-01 19:16:10 +00:00

2016-12-19 12:57:33 +00:00
	/*
	 * This handles the case where a posted interrupt was
	 * notified with kvm_vcpu_kick.
	 */
2017-12-24 16:12:53 +00:00
	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active)
		kvm_x86_ops->sync_pir_to_irr(vcpu);
2009-05-07 20:55:12 +00:00

2017-06-04 12:43:52 +00:00
	if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)
2010-05-03 13:54:48 +00:00
	    || need_resched() || signal_pending(current)) {
2011-01-12 07:40:31 +00:00
		vcpu->mode = OUTSIDE_GUEST_MODE;
2010-05-03 13:54:48 +00:00
		smp_wmb();
2008-01-15 16:27:32 +00:00
		local_irq_enable();
		preempt_enable();
2013-11-04 20:36:25 +00:00
		vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
2008-01-15 16:27:32 +00:00
		r = 1;
2012-06-24 16:25:00 +00:00
		goto cancel_injection;
2008-01-15 16:27:32 +00:00
	}

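Point 1) of the comment above, and the first race described in the "do not use KVM_REQ_EVENT for APICv interrupt injection" commit, rely on a store-buffering pattern. The vCPU writes ->mode and then reads ->requests; a requester writes ->requests and then reads ->mode. With a full barrier between each write and read, at least one side sees the other's write. Here is a minimal user-space sketch with C11 seq_cst atomics. The mode values, `struct toy_vcpu` and both functions are hypothetical, the requester side is only loosely modeled on the kick path the comment points to (kvm_vcpu_exiting_guest_mode()), and posted interrupts and IRQ masking are not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical mode values named after the ones used above. */
enum { OUTSIDE_GUEST = 0, IN_GUEST = 1, EXITING_GUEST = 2 };

struct toy_vcpu {
	atomic_int mode;
	atomic_ulong requests;
	int ipis_sent;            /* kicks that would send an IPI */
};

/* Requester: publish the request, then try to flip IN_GUEST to EXITING. */
static void make_request_and_kick(struct toy_vcpu *v, int req)
{
	int expected = IN_GUEST;

	atomic_fetch_or(&v->requests, 1UL << req);
	/* seq_cst compare-exchange: ordered after the request store */
	if (atomic_compare_exchange_strong(&v->mode, &expected, EXITING_GUEST))
		v->ipis_sent++;   /* vCPU may already be in the guest: interrupt it */
}

/* vCPU: returns true if it would go on to enter the guest. */
static bool try_enter_guest(struct toy_vcpu *v)
{
	atomic_store(&v->mode, IN_GUEST);
	atomic_thread_fence(memory_order_seq_cst); /* mode store before requests load */
	if (atomic_load(&v->mode) == EXITING_GUEST ||
	    atomic_load(&v->requests) != 0) {
		atomic_store(&v->mode, OUTSIDE_GUEST);
		return false;     /* back out and process requests again */
	}
	return true;
}
```

The test below drives both orders on one thread. In a real concurrent run, the barriers exist to rule out the third outcome, in which the vCPU enters while the requester still reads OUTSIDE_GUEST, leaving the request unnoticed until some later exit.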
2016-03-30 19:24:47 +00:00
	kvm_load_guest_xcr0(vcpu);

2016-06-01 20:26:00 +00:00
	if (req_immediate_exit) {
		kvm_make_request(KVM_REQ_EVENT, vcpu);
KVM: nVMX: Add KVM_REQ_IMMEDIATE_EXIT
This patch adds a new vcpu->requests bit, KVM_REQ_IMMEDIATE_EXIT.
This bit requests that when next entering the guest, we should run it only
for as little as possible, and exit again.
We use this new option in nested VMX: When L1 launches L2, but L0 wishes L1
to continue running so it can inject an event to it, we unfortunately cannot
just pretend to have run L2 for a little while - We must really launch L2,
otherwise certain one-off vmcs12 parameters (namely, L1 injection into L2)
will be lost. So the existing code runs L2 in this case.
But L2 could potentially run for a long time until it exits, and the
injection into L1 will be delayed. The new KVM_REQ_IMMEDIATE_EXIT allows us
to request that L2 will be entered, as necessary, but will exit as soon as
possible after entry.
Our implementation of this request uses smp_send_reschedule() to send a
self-IPI, with interrupts disabled. The interrupts remain disabled until the
guest is entered, and then, after the entry is complete (often including
processing an injection and jumping to the relevant handler), the physical
interrupt is noticed and causes an exit.
On recent Intel processors, we could have achieved the same goal by using
MTF instead of a self-IPI. Another technique worth considering in the future
is to use VM_EXIT_ACK_INTR_ON_EXIT and a highest-priority vector IPI - to
slightly improve performance by avoiding the useless interrupt handler
which ends up being called when smp_send_reschedule() is used.
Signed-off-by: Nadav Har'El <nyh@il.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2011-09-22 10:52:56 +00:00
		smp_send_reschedule(vcpu->cpu);
2016-06-01 20:26:00 +00:00
	}
2011-09-22 10:52:56 +00:00

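The "Add KVM_REQ_IMMEDIATE_EXIT" commit forces an exit right after entry by sending a self-IPI while interrupts are disabled: the interrupt stays pending until guest entry and then causes a vmexit. The same hold-then-deliver behavior can be shown in user space with POSIX signals. This is an analogy, not KVM code: SIGUSR1 stands in for the reschedule IPI, a blocked signal mask for local_irq_disable(), and unblocking for guest entry. POSIX requires a pending, newly unblocked signal to be delivered before sigprocmask() returns in a single-threaded process. `run_with_immediate_exit()` is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Counts "vmexits" caused by the pending self-interrupt. */
static volatile sig_atomic_t exits;

static void on_self_ipi(int sig)
{
	(void)sig;
	exits++;
}

/*
 * Hypothetical helper.  Returns the number of exits seen once the mask
 * is lifted, or -1 if setup failed or delivery happened too early.
 */
static int run_with_immediate_exit(int request_exit)
{
	struct sigaction sa;
	sigset_t irqs;
	int before;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_self_ipi;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGUSR1, &sa, NULL) != 0)
		return -1;

	sigemptyset(&irqs);
	sigaddset(&irqs, SIGUSR1);

	exits = 0;
	sigprocmask(SIG_BLOCK, &irqs, NULL);   /* like local_irq_disable() */
	if (request_exit)
		raise(SIGUSR1);                /* like a self-IPI: stays pending */
	before = exits;                        /* still 0 while blocked */
	sigprocmask(SIG_UNBLOCK, &irqs, NULL); /* like entering the guest */
	return before == 0 ? (int)exits : -1;
}
```

In the kernel code above, the immediate-exit path also re-raises KVM_REQ_EVENT, so the next pass through vcpu_enter_guest() re-examines pending events after the forced exit.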
2015-12-10 17:37:32 +00:00
	trace_kvm_entry(vcpu->vcpu_id);
2017-12-01 08:15:10 +00:00
	if (lapic_timer_advance_ns)
		wait_lapic_expire(vcpu);
2016-06-15 13:18:26 +00:00
	guest_enter_irqoff();
2007-11-01 19:16:10 +00:00

2008-12-15 12:52:10 +00:00
	if (unlikely(vcpu->arch.switch_db_regs)) {
		set_debugreg(0, 7);
		set_debugreg(vcpu->arch.eff_db[0], 0);
		set_debugreg(vcpu->arch.eff_db[1], 1);
		set_debugreg(vcpu->arch.eff_db[2], 2);
		set_debugreg(vcpu->arch.eff_db[3], 3);
2014-02-21 09:17:24 +00:00
		set_debugreg(vcpu->arch.dr6, 6);
2015-04-02 00:10:37 +00:00
		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
2008-12-15 12:52:10 +00:00
	}
2007-11-01 19:16:10 +00:00

2009-08-24 08:10:17 +00:00
	kvm_x86_ops->run(vcpu);
2007-11-01 19:16:10 +00:00

2014-02-21 09:17:24 +00:00
	/*
	 * Do this here before restoring debug registers on the host. And
	 * since we do this before handling the vmexit, a DR access vmexit
	 * can (a) read the correct value of the debug registers, (b) set
	 * KVM_DEBUGREG_WONT_EXIT again.
	 */
	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
		WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
		kvm_x86_ops->sync_dirty_debug_regs(vcpu);
2016-02-26 11:28:40 +00:00
		kvm_update_dr0123(vcpu);
		kvm_update_dr6(vcpu);
		kvm_update_dr7(vcpu);
		vcpu->arch.switch_db_regs &= ~KVM_DEBUGREG_RELOAD;
2014-02-21 09:17:24 +00:00
	}

2009-09-09 17:22:48 +00:00
	/*
	 * If the guest has used debug registers, at least dr7
	 * will be disabled while returning to the host.
	 * If we don't have active breakpoints in the host, we don't
	 * care about the messed up debug address registers. But if
	 * we have some of them active, restore the old state.
	 */
2009-11-10 10:03:12 +00:00
	if (hw_breakpoint_active())
2009-09-09 17:22:48 +00:00
		hw_breakpoint_restore();
2008-12-15 12:52:10 +00:00

2015-10-20 07:39:07 +00:00
	vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
2010-08-20 08:07:30 +00:00

2011-01-12 07:40:31 +00:00
	vcpu->mode = OUTSIDE_GUEST_MODE;
2010-05-03 13:54:48 +00:00
	smp_wmb();
2013-04-11 11:25:10 +00:00

2016-03-30 19:24:47 +00:00
	kvm_put_guest_xcr0(vcpu);

2017-07-26 00:20:32 +00:00
	kvm_before_interrupt(vcpu);
2013-04-11 11:25:10 +00:00
	kvm_x86_ops->handle_external_intr(vcpu);
2017-07-26 00:20:32 +00:00
	kvm_after_interrupt(vcpu);
2007-11-01 19:16:10 +00:00

	++vcpu->stat.exits;

2016-06-15 13:23:11 +00:00
	guest_exit_irqoff();
2007-11-01 19:16:10 +00:00

2016-06-15 13:23:11 +00:00
	local_irq_enable();
2007-11-01 19:16:10 +00:00
	preempt_enable();

2009-12-23 16:35:25 +00:00
	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
2008-03-29 23:17:59 +00:00

2007-11-01 19:16:10 +00:00
	/*
	 * Profile KVM exit RIPs:
	 */
	if (unlikely(prof_on == KVM_PROFILING)) {
2008-06-27 17:58:02 +00:00
		unsigned long rip = kvm_rip_read(vcpu);
		profile_hit(KVM_PROFILING, (void *)rip);
2007-11-01 19:16:10 +00:00
	}

KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate. Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code. This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.
There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed. Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used. The default is 250ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.
In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.
[avi: fix 64-bit division on i386]
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-02-03 17:43:50 +00:00
	if (unlikely(vcpu->arch.tsc_always_catchup))
		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
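The TSC rate-scaling commit describes a tolerance check. Scaling is used only when the requested rate differs from the host rate by more than tsc_tolerance_ppm (250 ppm by default). Without hardware scaling, a faster-than-host request falls back to always-on catch-up, which is what the tsc_always_catchup test above reacts to on every exit. What follows is a minimal sketch of that arithmetic in user space. The helper names are hypothetical, and where the commit mentions a fix for 64-bit division on i386, this sketch simply uses uint64_t division.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Scale khz by (1e6 + ppm) / 1e6 using 64-bit intermediates. */
static uint32_t adjust_khz(uint32_t khz, int32_t ppm)
{
	uint64_t v = (uint64_t)khz * (uint64_t)(1000000 + ppm);

	return (uint32_t)(v / 1000000);
}

/* True when the requested rate falls outside host_khz +/- tolerance. */
static bool needs_tsc_scaling(uint32_t user_khz, uint32_t host_khz,
			      uint32_t tolerance_ppm)
{
	uint32_t lo = adjust_khz(host_khz, -(int32_t)tolerance_ppm);
	uint32_t hi = adjust_khz(host_khz, (int32_t)tolerance_ppm);

	return user_khz < lo || user_khz > hi;
}

/*
 * Without hardware scaling, only a faster-than-host rate can be
 * approximated (by catch-up); a slower one has no fallback here.
 */
static bool needs_catchup(uint32_t user_khz, uint32_t host_khz,
			  uint32_t tolerance_ppm, bool hw_scaling)
{
	return !hw_scaling &&
	       needs_tsc_scaling(user_khz, host_khz, tolerance_ppm) &&
	       user_khz > host_khz;
}
```

For a 2 GHz host (2,000,000 kHz), 250 ppm gives an accepted band of 1,999,500 to 2,000,500 kHz.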
2007-11-25 11:41:11 +00:00

2012-06-24 16:24:54 +00:00
	if (vcpu->arch.apic_attention)
		kvm_lapic_sync_from_vapic(vcpu);
2007-10-25 14:52:32 +00:00

2017-08-17 16:36:57 +00:00
	vcpu->arch.gpa_available = false;
2009-08-24 08:10:17 +00:00
	r = kvm_x86_ops->handle_exit(vcpu);
2012-06-24 16:25:00 +00:00
	return r;

cancel_injection:
	kvm_x86_ops->cancel_injection(vcpu);
2012-06-24 16:25:07 +00:00
	if (unlikely(vcpu->arch.apic_attention))
		kvm_lapic_sync_from_vapic(vcpu);
2008-09-08 18:23:48 +00:00
out:
	return r;
}
2007-11-01 19:16:10 +00:00

2015-02-06 11:48:04 +00:00
static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
{
2015-09-18 14:29:55 +00:00
	if (!kvm_arch_vcpu_runnable(vcpu) &&
	    (!kvm_x86_ops->pre_block || kvm_x86_ops->pre_block(vcpu) == 0)) {
2015-02-06 11:58:42 +00:00
		srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
		kvm_vcpu_block(vcpu);
		vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
2015-09-18 14:29:55 +00:00

		if (kvm_x86_ops->post_block)
			kvm_x86_ops->post_block(vcpu);

2015-02-06 11:58:42 +00:00
		if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
			return 1;
	}
2015-02-06 11:48:04 +00:00

	kvm_apic_accept_events(vcpu);
	switch(vcpu->arch.mp_state) {
	case KVM_MP_STATE_HALTED:
		vcpu->arch.pv.pv_unhalted = false;
		vcpu->arch.mp_state =
			KVM_MP_STATE_RUNNABLE;
	case KVM_MP_STATE_RUNNABLE:
		vcpu->arch.apf.halted = false;
		break;
	case KVM_MP_STATE_INIT_RECEIVED:
		break;
	default:
		return -EINTR;
		break;
	}
	return 1;
}
2009-03-23 13:11:44 +00:00

2015-10-13 08:18:53 +00:00
static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
{
2016-12-19 14:23:54 +00:00
	if (is_guest_mode(vcpu) && kvm_x86_ops->check_nested_events)
		kvm_x86_ops->check_nested_events(vcpu, false);

2015-10-13 08:18:53 +00:00
	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
		!vcpu->arch.apf.halted);
}

2015-02-06 11:48:04 +00:00
static int vcpu_run(struct kvm_vcpu *vcpu)
2008-09-08 18:23:48 +00:00
{
	int r;
2009-12-23 16:35:25 +00:00
	struct kvm *kvm = vcpu->kvm;
2008-09-08 18:23:48 +00:00

2009-12-23 16:35:25 +00:00
	vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
x86/KVM/VMX: Add L1D flush logic
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flag is set:
 - Always, if the flush module parameter is 'always'
 - Conditionally at:
   - Entry to vcpu_run(), i.e. after executing user space
   - From the sched_in notifier, i.e. when switching to a vCPU thread.
   - From vmexit handlers which are considered unsafe, i.e. where
     sensitive data can be brought into L1D:
     - The emulator, which could be a good target for other speculative
       execution-based threats,
     - The MMU, which can bring host page tables in the L1 cache.
     - External interrupts
     - Nested operations that require the MMU (see above). That is
       vmptrld, vmptrst, vmclear, vmwrite, vmread.
     - When handling invept, invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-02 11:07:14 +00:00
	vcpu->arch.l1tf_flush_l1d = true;
2008-09-08 18:23:48 +00:00

2015-02-06 11:48:04 +00:00
	for (;;) {
2015-10-13 19:32:50 +00:00
		if (kvm_vcpu_running(vcpu)) {
2009-08-24 08:10:17 +00:00
			r = vcpu_enter_guest(vcpu);
2015-09-18 14:29:55 +00:00
		} else {
2015-02-06 11:48:04 +00:00
			r = vcpu_block(kvm, vcpu);
2015-09-18 14:29:55 +00:00
		}

2009-03-23 13:11:44 +00:00
		if (r <= 0)
			break;

2017-04-26 20:32:19 +00:00
		kvm_clear_request(KVM_REQ_PENDING_TIMER, vcpu);
2009-03-23 13:11:44 +00:00
		if (kvm_cpu_has_pending_timer(vcpu))
			kvm_inject_pending_timer_irqs(vcpu);

2015-11-16 23:26:00 +00:00
		if (dm_request_for_irq_injection(vcpu) &&
		    kvm_vcpu_ready_for_interrupt_injection(vcpu)) {
2015-07-30 08:32:16 +00:00
			r = 0;
			vcpu->run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
2009-03-23 13:11:44 +00:00
			++vcpu->stat.request_irq_exits;
2015-02-06 11:48:04 +00:00
			break;
2009-03-23 13:11:44 +00:00
		}
2010-10-14 09:22:46 +00:00

		kvm_check_async_pf_completion(vcpu);

2009-03-23 13:11:44 +00:00
		if (signal_pending(current)) {
			r = -EINTR;
2009-08-24 08:10:17 +00:00
			vcpu->run->exit_reason = KVM_EXIT_INTR;
2009-03-23 13:11:44 +00:00
			++vcpu->stat.signal_exits;
2015-02-06 11:48:04 +00:00
			break;
2009-03-23 13:11:44 +00:00
		}
		if (need_resched()) {
2009-12-23 16:35:25 +00:00
			srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
2013-12-13 06:07:21 +00:00
			cond_resched();
2009-12-23 16:35:25 +00:00
			vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
2008-09-08 18:23:48 +00:00
		}
2007-11-01 19:16:10 +00:00
	}

2009-12-23 16:35:25 +00:00
	srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
2007-11-01 19:16:10 +00:00

	return r;
}

2012-09-03 12:24:26 +00:00
static inline int complete_emulated_io(struct kvm_vcpu *vcpu)
{
	int r;
	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
	r = emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
	if (r != EMULATE_DONE)
		return 0;
	return 1;
}

static int complete_emulated_pio(struct kvm_vcpu *vcpu)
{
	BUG_ON(!vcpu->arch.pio.count);

	return complete_emulated_io(vcpu);
}

2012-04-18 16:22:47 +00:00
/*
 * Implements the following, as a state machine:
 *
 * read:
 *   for each fragment
2012-10-24 06:07:59 +00:00
 *     for each mmio piece in the fragment
 *       write gpa, len
 *       exit
 *       copy data
2012-04-18 16:22:47 +00:00
 *   execute insn
 *
 * write:
 *   for each fragment
2012-10-24 06:07:59 +00:00
 *     for each mmio piece in the fragment
 *       write gpa, len
 *       copy data
 *       exit
2012-04-18 16:22:47 +00:00
 */
2012-09-03 12:24:26 +00:00
static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
2010-01-19 12:20:10 +00:00
{
	struct kvm_run *run = vcpu->run;
2012-04-18 16:22:47 +00:00
	struct kvm_mmio_fragment *frag;
2012-10-24 06:07:59 +00:00
	unsigned len;
2010-01-19 12:20:10 +00:00

2012-09-03 12:24:26 +00:00
	BUG_ON(!vcpu->mmio_needed);
2010-01-19 12:20:10 +00:00

2012-09-03 12:24:26 +00:00
	/* Complete previous fragment */
2012-10-24 06:07:59 +00:00
	frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment];
	len = min(8u, frag->len);
2012-09-03 12:24:26 +00:00
	if (!vcpu->mmio_is_write)
2012-10-24 06:07:59 +00:00
		memcpy(frag->data, run->mmio.data, len);

	if (frag->len <= 8) {
		/* Switch to the next fragment. */
		frag++;
		vcpu->mmio_cur_fragment++;
	} else {
		/* Go forward to the next mmio piece. */
		frag->data += len;
		frag->gpa += len;
		frag->len -= len;
	}

2014-02-27 18:35:14 +00:00
	if (vcpu->mmio_cur_fragment >= vcpu->mmio_nr_fragments) {
2012-09-03 12:24:26 +00:00
		vcpu->mmio_needed = 0;
2013-08-27 13:41:43 +00:00

		/* FIXME: return into emulator if single-stepping. */
2010-01-20 10:01:20 +00:00
		if (vcpu->mmio_is_write)
2012-09-03 12:24:26 +00:00
			return 1;
		vcpu->mmio_read_completed = 1;
		return complete_emulated_io(vcpu);
	}
2012-10-24 06:07:59 +00:00

2012-09-03 12:24:26 +00:00
	run->exit_reason = KVM_EXIT_MMIO;
	run->mmio.phys_addr = frag->gpa;
	if (vcpu->mmio_is_write)
2012-10-24 06:07:59 +00:00
		memcpy(run->mmio.data, frag->data, min(8u, frag->len));
	run->mmio.len = min(8u, frag->len);
2012-09-03 12:24:26 +00:00
	run->mmio.is_write = vcpu->mmio_is_write;
	vcpu->arch.complete_userspace_io = complete_emulated_mmio;
	return 0;
2010-01-19 12:20:10 +00:00
}
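
The piece-by-piece walk in complete_emulated_mmio() can be exercised outside the kernel. The sketch below is a hypothetical, simplified model (struct mmio_frag, struct mmio_piece and split_mmio() are illustrative names, not kernel API). Like the code above, it assumes each KVM_EXIT_MMIO carries at most 8 bytes. It only records the (gpa, len) pairs userspace would see and leaves out the data copies and the re-entry into the emulator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct kvm_mmio_fragment. */
struct mmio_frag {
	uint64_t gpa;
	unsigned len;
};

/* One userspace exit: the address and size reported in run->mmio. */
struct mmio_piece {
	uint64_t gpa;
	unsigned len;
};

static unsigned min_u(unsigned a, unsigned b)
{
	return a < b ? a : b;
}

/*
 * Walk the fragments the way complete_emulated_mmio() advances them:
 * each exit moves at most 8 bytes. Fragments are modified in place,
 * as in the kernel. Returns the number of pieces stored in out[].
 */
static size_t split_mmio(struct mmio_frag *frags, size_t nr_frags,
			 struct mmio_piece *out, size_t max_out)
{
	size_t cur = 0, n = 0;

	while (cur < nr_frags && n < max_out) {
		struct mmio_frag *frag = &frags[cur];
		unsigned len = min_u(8u, frag->len);

		out[n].gpa = frag->gpa;	/* run->mmio.phys_addr */
		out[n].len = len;	/* run->mmio.len */
		n++;

		if (frag->len <= 8) {
			cur++;			/* switch to the next fragment */
		} else {
			frag->gpa += len;	/* next piece of this fragment */
			frag->len -= len;
		}
	}
	return n;
}
```

A 20-byte access at 0x1000 followed by a 4-byte fragment at 0x2000 therefore costs four userspace exits: 8, 8 and 4 bytes for the first fragment, then 4 bytes for the second.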

2007-11-01 19:16:10 +00:00
int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
{
	int r;

2017-12-04 20:35:25 +00:00
	vcpu_load(vcpu);
2017-11-24 21:39:01 +00:00
	kvm_sigset_activate(vcpu);
2017-12-12 16:15:02 +00:00
	kvm_load_guest_fpu(vcpu);

2008-04-13 14:54:35 +00:00
	if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED)) {
2017-09-06 16:34:06 +00:00
		if (kvm_run->immediate_exit) {
			r = -EINTR;
			goto out;
		}
2007-11-01 19:16:10 +00:00
		kvm_vcpu_block(vcpu);
2013-03-13 11:42:34 +00:00
		kvm_apic_accept_events(vcpu);
2017-04-26 20:32:19 +00:00
		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
2008-07-06 12:48:31 +00:00
		r = -EAGAIN;
2017-09-05 22:27:19 +00:00
		if (signal_pending(current)) {
			r = -EINTR;
			vcpu->run->exit_reason = KVM_EXIT_INTR;
			++vcpu->stat.signal_exits;
		}
2008-07-06 12:48:31 +00:00
		goto out;
2007-11-01 19:16:10 +00:00
	}

2018-02-01 00:03:36 +00:00
	if (vcpu->run->kvm_valid_regs & ~KVM_SYNC_X86_VALID_FIELDS) {
		r = -EINVAL;
		goto out;
	}

	if (vcpu->run->kvm_dirty_regs) {
		r = sync_regs(vcpu);
		if (r != 0)
			goto out;
	}

2007-11-01 19:16:10 +00:00
	/* re-sync apic's tpr */
2015-07-29 10:05:37 +00:00
	if (!lapic_in_kernel(vcpu)) {
2010-12-21 10:12:00 +00:00
		if (kvm_set_cr8(vcpu, kvm_run->cr8) != 0) {
			r = -EINVAL;
			goto out;
		}
	}
2007-11-01 19:16:10 +00:00

2012-09-03 12:24:26 +00:00
	if (unlikely(vcpu->arch.complete_userspace_io)) {
		int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
		vcpu->arch.complete_userspace_io = NULL;
		r = cui(vcpu);
		if (r <= 0)
2017-12-12 16:15:02 +00:00
			goto out;
2012-09-03 12:24:26 +00:00
	} else
		WARN_ON(vcpu->arch.pio.count || vcpu->mmio_needed);
2010-01-19 12:20:10 +00:00

2017-02-08 10:50:15 +00:00
	if (kvm_run->immediate_exit)
		r = -EINTR;
	else
		r = vcpu_run(vcpu);
2007-11-01 19:16:10 +00:00

out:
2017-12-12 16:15:02 +00:00
	kvm_put_guest_fpu(vcpu);
2018-02-01 00:03:36 +00:00
	if (vcpu->run->kvm_valid_regs)
		store_regs(vcpu);
2010-05-04 02:04:27 +00:00
	post_kvm_run_save(vcpu);
2017-11-24 21:39:01 +00:00
	kvm_sigset_deactivate(vcpu);
2007-11-01 19:16:10 +00:00

2017-12-04 20:35:25 +00:00
	vcpu_put(vcpu);
2007-11-01 19:16:10 +00:00
	return r;
}
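
kvm_arch_vcpu_ioctl_run() consumes vcpu->arch.complete_userspace_io as a one-shot callback. It clears the pointer before calling it, so a callback such as complete_emulated_mmio() can re-arm itself when another userspace exit is needed. A minimal userspace model of that pattern follows. The names (struct vcpu_model, run_once(), rearm_until_third()) are hypothetical, and the fake callback simply needs three rounds before it completes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the struct kvm_vcpu fields used here. */
struct vcpu_model {
	int (*complete_userspace_io)(struct vcpu_model *);
	int completions;	/* times the callback has run */
	int runs;		/* times we "entered the guest" */
};

/* Mirrors only the callback handling of kvm_arch_vcpu_ioctl_run(). */
static int run_once(struct vcpu_model *v)
{
	if (v->complete_userspace_io) {
		int (*cui)(struct vcpu_model *) = v->complete_userspace_io;
		int r;

		/* Clear first, so the callback is free to re-arm itself. */
		v->complete_userspace_io = NULL;
		r = cui(v);
		if (r <= 0)
			return r;	/* back to userspace for another exit */
	}
	v->runs++;			/* stands in for vcpu_run() */
	return 1;
}

/* Fake completion that needs two more userspace exits before finishing. */
static int rearm_until_third(struct vcpu_model *v)
{
	assert(v->complete_userspace_io == NULL);	/* run_once() cleared it */
	v->completions++;
	if (v->completions < 3) {
		v->complete_userspace_io = rearm_until_third;
		return 0;
	}
	return 1;
}
```

If run_once() cleared the pointer after the call instead, the re-armed callback would be overwritten and the third round would never happen.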

2018-02-01 00:03:36 +00:00
static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
2007-11-01 19:16:10 +00:00
{
2011-03-31 10:06:41 +00:00
	if (vcpu->arch.emulate_regs_need_sync_to_vcpu) {
		/*
		 * We are here if userspace calls get_regs() in the middle of
		 * instruction emulation. Register state needs to be copied
2012-06-28 07:17:27 +00:00
		 * back from the emulation context to the vcpu. Userspace shouldn't do
2011-03-31 10:06:41 +00:00
		 * that usually, but some badly designed PV devices (the VMware
		 * backdoor interface) need this to work.
		 */
2012-08-27 20:46:17 +00:00
		emulator_writeback_register_cache(&vcpu->arch.emulate_ctxt);
2011-03-31 10:06:41 +00:00
		vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
	}
2008-06-27 17:58:02 +00:00
	regs->rax = kvm_register_read(vcpu, VCPU_REGS_RAX);
	regs->rbx = kvm_register_read(vcpu, VCPU_REGS_RBX);
	regs->rcx = kvm_register_read(vcpu, VCPU_REGS_RCX);
	regs->rdx = kvm_register_read(vcpu, VCPU_REGS_RDX);
	regs->rsi = kvm_register_read(vcpu, VCPU_REGS_RSI);
	regs->rdi = kvm_register_read(vcpu, VCPU_REGS_RDI);
	regs->rsp = kvm_register_read(vcpu, VCPU_REGS_RSP);
	regs->rbp = kvm_register_read(vcpu, VCPU_REGS_RBP);
2007-11-01 19:16:10 +00:00
#ifdef CONFIG_X86_64
2008-06-27 17:58:02 +00:00
	regs->r8 = kvm_register_read(vcpu, VCPU_REGS_R8);
	regs->r9 = kvm_register_read(vcpu, VCPU_REGS_R9);
	regs->r10 = kvm_register_read(vcpu, VCPU_REGS_R10);
	regs->r11 = kvm_register_read(vcpu, VCPU_REGS_R11);
	regs->r12 = kvm_register_read(vcpu, VCPU_REGS_R12);
	regs->r13 = kvm_register_read(vcpu, VCPU_REGS_R13);
	regs->r14 = kvm_register_read(vcpu, VCPU_REGS_R14);
	regs->r15 = kvm_register_read(vcpu, VCPU_REGS_R15);
2007-11-01 19:16:10 +00:00
#endif

2008-06-27 17:58:02 +00:00
	regs->rip = kvm_rip_read(vcpu);
2009-10-05 11:07:21 +00:00
	regs->rflags = kvm_get_rflags(vcpu);
2018-02-01 00:03:36 +00:00
}
2007-11-01 19:16:10 +00:00

2018-02-01 00:03:36 +00:00
int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
{
	vcpu_load(vcpu);
	__get_regs(vcpu, regs);
2017-12-04 20:35:26 +00:00
	vcpu_put(vcpu);
2007-11-01 19:16:10 +00:00
	return 0;
}

2018-02-01 00:03:36 +00:00
static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
2007-11-01 19:16:10 +00:00
{
2011-03-31 10:06:41 +00:00
	vcpu->arch.emulate_regs_need_sync_from_vcpu = true;
	vcpu->arch.emulate_regs_need_sync_to_vcpu = false;

2008-06-27 17:58:02 +00:00
	kvm_register_write(vcpu, VCPU_REGS_RAX, regs->rax);
	kvm_register_write(vcpu, VCPU_REGS_RBX, regs->rbx);
	kvm_register_write(vcpu, VCPU_REGS_RCX, regs->rcx);
	kvm_register_write(vcpu, VCPU_REGS_RDX, regs->rdx);
	kvm_register_write(vcpu, VCPU_REGS_RSI, regs->rsi);
	kvm_register_write(vcpu, VCPU_REGS_RDI, regs->rdi);
	kvm_register_write(vcpu, VCPU_REGS_RSP, regs->rsp);
	kvm_register_write(vcpu, VCPU_REGS_RBP, regs->rbp);
2007-11-01 19:16:10 +00:00
#ifdef CONFIG_X86_64
2008-06-27 17:58:02 +00:00
	kvm_register_write(vcpu, VCPU_REGS_R8, regs->r8);
	kvm_register_write(vcpu, VCPU_REGS_R9, regs->r9);
	kvm_register_write(vcpu, VCPU_REGS_R10, regs->r10);
	kvm_register_write(vcpu, VCPU_REGS_R11, regs->r11);
	kvm_register_write(vcpu, VCPU_REGS_R12, regs->r12);
	kvm_register_write(vcpu, VCPU_REGS_R13, regs->r13);
	kvm_register_write(vcpu, VCPU_REGS_R14, regs->r14);
	kvm_register_write(vcpu, VCPU_REGS_R15, regs->r15);
2007-11-01 19:16:10 +00:00
#endif

2008-06-27 17:58:02 +00:00
	kvm_rip_write(vcpu, regs->rip);
KVM: X86: Fix load RFLAGS w/o the fixed bit
 *** Guest State ***
 CR0: actual=0x0000000000000030, shadow=0x0000000060000010, gh_mask=fffffffffffffff7
 CR4: actual=0x0000000000002050, shadow=0x0000000000000000, gh_mask=ffffffffffffe871
 CR3 = 0x00000000fffbc000
 RSP = 0x0000000000000000  RIP = 0x0000000000000000
 RFLAGS=0x00000000         DR7 = 0x0000000000000400
        ^^^^^^^^^^
The failed vmentry is triggered by the following testcase when ept=Y:
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <stdint.h>
    #include <linux/kvm.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    long r[5];
    int main()
    {
    	r[2] = open("/dev/kvm", O_RDONLY);
    	r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
    	r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
    	struct kvm_regs regs = {
    		.rflags = 0,
    	};
    	ioctl(r[4], KVM_SET_REGS, &regs);
    	ioctl(r[4], KVM_RUN, 0);
    }
Bit 1 of x86 RFLAGS is fixed to 1, but userspace can simply clear it
with the KVM_SET_REGS ioctl, which makes vmentry fail.
This patch fixes it by ORing in X86_EFLAGS_FIXED during the ioctl.
Cc: stable@vger.kernel.org
Suggested-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Quan Xu <quan.xu0@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-12-07 08:30:08 +00:00
	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);
2007-11-01 19:16:10 +00:00

2008-04-30 15:59:04 +00:00
	vcpu->arch.exception.pending = false;

2010-07-27 09:30:24 +00:00
	kvm_make_request(KVM_REQ_EVENT, vcpu);
2018-02-01 00:03:36 +00:00
}
2010-07-27 09:30:24 +00:00

2018-02-01 00:03:36 +00:00
int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
{
	vcpu_load(vcpu);
	__set_regs(vcpu, regs);
2017-12-04 20:35:27 +00:00
	vcpu_put(vcpu);
2007-11-01 19:16:10 +00:00
	return 0;
}
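
The commit message above explains why __set_regs() ORs X86_EFLAGS_FIXED into the RFLAGS value supplied by userspace. A minimal sketch of the same sanitising step, assuming a local RFLAGS_FIXED_BIT constant that stands in for X86_EFLAGS_FIXED (bit 1, value 0x2); sanitize_user_rflags() is a hypothetical name:

```c
#include <assert.h>
#include <stdint.h>

/* Bit 1 of RFLAGS always reads as 1; same value as X86_EFLAGS_FIXED. */
#define RFLAGS_FIXED_BIT	(UINT64_C(1) << 1)

/*
 * Whatever userspace passes, force the reserved-as-one bit on so the
 * value is acceptable to VM entry. Other bits are left untouched.
 */
static uint64_t sanitize_user_rflags(uint64_t user_rflags)
{
	uint64_t rflags = user_rflags | RFLAGS_FIXED_BIT;

	assert(rflags & RFLAGS_FIXED_BIT);
	return rflags;
}
```

With the testcase's `.rflags = 0`, the guest ends up with RFLAGS = 0x2, which is also the architectural reset value.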

void kvm_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
{
	struct kvm_segment cs;

2008-05-27 08:18:46 +00:00
	kvm_get_segment(vcpu, &cs, VCPU_SREG_CS);
2007-11-01 19:16:10 +00:00
	*db = cs.db;
	*l = cs.l;
}
EXPORT_SYMBOL_GPL(kvm_get_cs_db_l_bits);
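
kvm_get_cs_db_l_bits() only reports the raw CS.D/B and CS.L bits; callers combine them with other CPU state to decide how instructions are decoded. The sketch below is loosely modelled on how KVM's emulator setup chooses an execution mode, but the enum and pick_mode() are hypothetical stand-ins rather than the kernel's definitions. CS.L only matters while long mode is active; otherwise D/B alone selects between 16-bit and 32-bit code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical mirror of the emulator's execution modes. */
enum cpu_mode { MODE_REAL, MODE_VM86, MODE_PROT16, MODE_PROT32, MODE_PROT64 };

static enum cpu_mode pick_mode(bool protmode, bool vm86, bool long_mode,
			       int cs_db, int cs_l)
{
	if (!protmode)
		return MODE_REAL;
	if (vm86)
		return MODE_VM86;
	if (cs_l && long_mode)
		return MODE_PROT64;	/* 64-bit code segment */
	/* Legacy protected mode or compatibility mode. */
	return cs_db ? MODE_PROT32 : MODE_PROT16;
}
```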

2018-02-01 00:03:36 +00:00
static void __get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
2007-11-01 19:16:10 +00:00
{
2010-02-16 08:51:48 +00:00
	struct desc_ptr dt;
2007-11-01 19:16:10 +00:00

2008-05-27 08:18:46 +00:00
	kvm_get_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
	kvm_get_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
	kvm_get_segment(vcpu, &sregs->es, VCPU_SREG_ES);
	kvm_get_segment(vcpu, &sregs->fs, VCPU_SREG_FS);
	kvm_get_segment(vcpu, &sregs->gs, VCPU_SREG_GS);
	kvm_get_segment(vcpu, &sregs->ss, VCPU_SREG_SS);
2007-11-01 19:16:10 +00:00

2008-05-27 08:18:46 +00:00
	kvm_get_segment(vcpu, &sregs->tr, VCPU_SREG_TR);
	kvm_get_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR);
2007-11-01 19:16:10 +00:00

	kvm_x86_ops->get_idt(vcpu, &dt);
2010-02-16 08:51:48 +00:00
	sregs->idt.limit = dt.size;
	sregs->idt.base = dt.address;
2007-11-01 19:16:10 +00:00
	kvm_x86_ops->get_gdt(vcpu, &dt);
2010-02-16 08:51:48 +00:00
	sregs->gdt.limit = dt.size;
	sregs->gdt.base = dt.address;
2007-11-01 19:16:10 +00:00

2009-12-29 16:07:30 +00:00
	sregs->cr0 = kvm_read_cr0(vcpu);
2007-12-13 15:50:52 +00:00
	sregs->cr2 = vcpu->arch.cr2;
2010-12-05 15:30:00 +00:00
	sregs->cr3 = kvm_read_cr3(vcpu);
2009-12-07 10:16:48 +00:00
	sregs->cr4 = kvm_read_cr4(vcpu);
2008-02-24 09:20:43 +00:00
	sregs->cr8 = kvm_get_cr8(vcpu);
2010-01-21 13:31:50 +00:00
	sregs->efer = vcpu->arch.efer;
2007-11-01 19:16:10 +00:00
	sregs->apic_base = kvm_get_apic_base(vcpu);

2009-05-11 10:35:48 +00:00
	memset(sregs->interrupt_bitmap, 0, sizeof sregs->interrupt_bitmap);
2007-11-01 19:16:10 +00:00

KVM: x86: Rename interrupt.pending to interrupt.injected
For exception and NMI events, KVM code uses the following
coding convention:
*) "pending" represents an event that should be injected into the guest at
some point but whose side effects have not yet occurred.
*) "injected" represents an event whose side effects have already
occurred.
However, interrupts don't conform to this coding convention.
All current code flows mark interrupt.pending when its side effects
have already taken place (for example, the bit has moved from the LAPIC IRR
to the ISR). Therefore, it makes sense to just rename
interrupt.pending to interrupt.injected.
This change follows the logic of previous commit 664f8e26b00c ("KVM: X86:
Fix loss of exception which has not yet been injected"), which changed
exceptions to follow this coding convention as well.
It is important to note that in the !lapic_in_kernel(vcpu) case,
interrupt.pending usage was, and still is, incorrect.
In this case, interrupt.pending can only be set using one of the
following ioctls: KVM_INTERRUPT, KVM_SET_VCPU_EVENTS and
KVM_SET_SREGS. Looking at how QEMU uses these ioctls, one can see that
QEMU uses them either to re-set an "interrupt.pending" state it has
received from KVM (via KVM_GET_VCPU_EVENTS interrupt.pending or
via KVM_GET_SREGS interrupt_bitmap) or to dispatch a new interrupt
from QEMU's emulated LAPIC, which resets the bit in IRR and sets the bit
in ISR before sending the ioctl to KVM. So it seems that "interrupt.pending"
in this case is indeed also supposed to represent "interrupt.injected".
However, kvm_cpu_has_interrupt() & kvm_cpu_has_injectable_intr()
misuse (the now-renamed) interrupt.injected to report whether
there is a pending interrupt.
This leaves nVMX/nSVM unable to distinguish whether it should exit
from L2 to L1 on EXTERNAL_INTERRUPT for a pending interrupt or should
re-inject an injected interrupt.
Therefore, add a FIXME at these functions for handling this issue.
This patch introduces no semantic change.
Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2018-03-23 00:01:31 +00:00
	if (vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft)
2009-04-21 14:45:11 +00:00
		set_bit(vcpu->arch.interrupt.nr,
			(unsigned long *)sregs->interrupt_bitmap);
2018-02-01 00:03:36 +00:00
}
2009-04-21 14:45:10 +00:00

2018-02-01 00:03:36 +00:00
int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
				  struct kvm_sregs *sregs)
{
	vcpu_load(vcpu);
	__get_sregs(vcpu, sregs);
2017-12-04 20:35:28 +00:00
	vcpu_put(vcpu);
2007-11-01 19:16:10 +00:00
	return 0;
}
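
__get_sregs() reports at most one injected, non-soft interrupt by setting its vector in interrupt_bitmap, and __set_sregs() recovers it with find_first_bit() and re-queues it. A portable sketch of that round trip follows. It assumes the same 256-vector layout as struct kvm_sregs (four 64-bit words), and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NR_VECTORS 256	/* same value as KVM_NR_INTERRUPTS */

/* Same shape as the __u64 interrupt_bitmap[] array in struct kvm_sregs. */
struct irq_bitmap {
	uint64_t w[(NR_VECTORS + 63) / 64];
};

/* Encode side: what __get_sregs() does with set_bit(). */
static void bitmap_set_vector(struct irq_bitmap *bm, unsigned vec)
{
	assert(vec < NR_VECTORS);
	bm->w[vec / 64] |= UINT64_C(1) << (vec % 64);
}

/* Decode side: like find_first_bit(), returns max_bits if no bit is set. */
static unsigned bitmap_first_vector(const struct irq_bitmap *bm,
				    unsigned max_bits)
{
	for (unsigned i = 0; i < max_bits; i++)
		if ((bm->w[i / 64] >> (i % 64)) & 1)
			return i;
	return max_bits;
}
```

Only the lowest set bit is re-queued on the way back in, so a bitmap with several vectors set loses all but the first.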

2008-04-11 16:24:45 +00:00
int kvm_arch_vcpu_ioctl_get_mpstate(struct kvm_vcpu *vcpu,
				    struct kvm_mp_state *mp_state)
{
2017-12-04 20:35:30 +00:00
	vcpu_load(vcpu);

2013-03-13 11:42:34 +00:00
	kvm_apic_accept_events(vcpu);
2013-08-26 08:48:34 +00:00
	if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED &&
					vcpu->arch.pv.pv_unhalted)
		mp_state->mp_state = KVM_MP_STATE_RUNNABLE;
	else
		mp_state->mp_state = vcpu->arch.mp_state;

2017-12-04 20:35:30 +00:00
	vcpu_put(vcpu);
2008-04-11 16:24:45 +00:00
	return 0;
}

int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
				    struct kvm_mp_state *mp_state)
{
2017-12-04 20:35:31 +00:00
	int ret = -EINVAL;

	vcpu_load(vcpu);

2016-01-08 12:48:51 +00:00
	if (!lapic_in_kernel(vcpu) &&
2013-03-13 11:42:34 +00:00
	    mp_state->mp_state != KVM_MP_STATE_RUNNABLE)
2017-12-04 20:35:31 +00:00
		goto out;
2013-03-13 11:42:34 +00:00

2017-03-23 10:46:03 +00:00
	/* INITs are latched while in SMM */
	if ((is_smm(vcpu) || vcpu->arch.smi_pending) &&
	    (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED ||
	     mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
2017-12-04 20:35:31 +00:00
		goto out;
2017-03-23 10:46:03 +00:00

2013-03-13 11:42:34 +00:00
	if (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
		vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
		set_bit(KVM_APIC_SIPI, &vcpu->arch.apic->pending_events);
	} else
		vcpu->arch.mp_state = mp_state->mp_state;
2010-07-27 09:30:24 +00:00
	kvm_make_request(KVM_REQ_EVENT, vcpu);
2017-12-04 20:35:31 +00:00

	ret = 0;
out:
	vcpu_put(vcpu);
	return ret;
2008-04-11 16:24:45 +00:00
}
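
The rules enforced by kvm_arch_vcpu_ioctl_set_mpstate() can be summarised as a small state function. This is a hypothetical model: the enum values and struct fields are local stand-ins for the KVM_MP_STATE_* constants, lapic_in_kernel(), is_smm()/smi_pending and the KVM_APIC_SIPI pending event, and they are chosen only to show the order of the checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical local names; the real constants are KVM_MP_STATE_*. */
enum mp_state { MP_RUNNABLE, MP_UNINITIALIZED, MP_INIT_RECEIVED,
		MP_HALTED, MP_SIPI_RECEIVED };

struct mp_model {
	bool lapic_in_kernel;
	bool in_smm, smi_pending;
	bool sipi_pending;	/* stands in for the KVM_APIC_SIPI event bit */
	enum mp_state state;
};

static int set_mp_state(struct mp_model *m, enum mp_state req)
{
	/* Without an in-kernel LAPIC only RUNNABLE is accepted. */
	if (!m->lapic_in_kernel && req != MP_RUNNABLE)
		return -EINVAL;
	/* INITs are latched while in SMM, so INIT/SIPI states are refused. */
	if ((m->in_smm || m->smi_pending) &&
	    (req == MP_SIPI_RECEIVED || req == MP_INIT_RECEIVED))
		return -EINVAL;
	if (req == MP_SIPI_RECEIVED) {
		/* Stored as INIT_RECEIVED plus a pending SIPI event. */
		m->state = MP_INIT_RECEIVED;
		m->sipi_pending = true;
	} else {
		m->state = req;
	}
	return 0;
}
```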

2012-02-08 13:34:38 +00:00
int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
		    int reason, bool has_error_code, u32 error_code)
2007-11-01 19:16:10 +00:00
{
2011-05-29 12:53:48 +00:00
	struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;
2010-08-15 21:47:01 +00:00
	int ret;
2010-01-25 10:01:04 +00:00

2010-08-15 21:47:01 +00:00
	init_emulate_ctxt(vcpu);
2010-02-18 10:15:01 +00:00

2012-02-08 13:34:38 +00:00
	ret = emulator_task_switch(ctxt, tss_selector, idt_index, reason,
2011-05-29 12:53:48 +00:00
				   has_error_code, error_code);
2010-02-18 10:15:01 +00:00

	if (ret)
2010-04-15 09:29:50 +00:00
		return EMULATE_FAIL;
2008-03-24 21:14:53 +00:00

2011-05-29 12:53:48 +00:00
	kvm_rip_write(vcpu, ctxt->eip);
	kvm_set_rflags(vcpu, ctxt->eflags);
2010-07-27 09:30:24 +00:00
	kvm_make_request(KVM_REQ_EVENT, vcpu);
2010-04-15 09:29:50 +00:00
	return EMULATE_DONE;
2008-03-24 21:14:53 +00:00
}
EXPORT_SYMBOL_GPL(kvm_task_switch);

2018-04-02 01:15:32 +00:00
static int kvm_valid_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
2017-12-14 08:01:52 +00:00
{
2018-07-23 12:31:21 +00:00
	if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVE) &&
			(sregs->cr4 & X86_CR4_OSXSAVE))
		return -EINVAL;

2018-01-16 09:34:07 +00:00
	if ((sregs->efer & EFER_LME) && (sregs->cr0 & X86_CR0_PG)) {
2017-12-14 08:01:52 +00:00
		/*
		 * When EFER.LME and CR0.PG are set, the processor is in
		 * 64-bit mode (though maybe in a 32-bit code segment).
		 * CR4.PAE and EFER.LMA must be set.
		 */
2018-01-16 09:34:07 +00:00
		if (!(sregs->cr4 & X86_CR4_PAE)
2017-12-14 08:01:52 +00:00
		    || !(sregs->efer & EFER_LMA))
			return -EINVAL;
	} else {
		/*
		 * Not in 64-bit mode: EFER.LMA is clear and the code
		 * segment cannot be 64-bit.
		 */
		if (sregs->efer & EFER_LMA || sregs->cs.l)
			return -EINVAL;
	}

	return 0;
}
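
kvm_valid_sregs() reduces to a few architectural consistency checks between EFER, CR0, CR4 and CS.L. A standalone sketch with the bit positions spelled out (CR0.PG is bit 31, CR4.PAE bit 5, CR4.OSXSAVE bit 18, EFER.LME bit 8, EFER.LMA bit 10). check_sregs() and its flattened parameter list are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural bit positions (Intel SDM / AMD APM). */
#define CR0_PG		(UINT64_C(1) << 31)
#define CR4_PAE		(UINT64_C(1) << 5)
#define CR4_OSXSAVE	(UINT64_C(1) << 18)
#define EFER_LME_BIT	(UINT64_C(1) << 8)
#define EFER_LMA_BIT	(UINT64_C(1) << 10)

/* Hypothetical, flattened version of the checks in kvm_valid_sregs(). */
static int check_sregs(bool guest_has_xsave, uint64_t cr0, uint64_t cr4,
		       uint64_t efer, bool cs_l)
{
	/* OSXSAVE cannot be enabled if the guest CPUID lacks XSAVE. */
	if (!guest_has_xsave && (cr4 & CR4_OSXSAVE))
		return -EINVAL;

	if ((efer & EFER_LME_BIT) && (cr0 & CR0_PG)) {
		/* Long mode active: PAE paging and EFER.LMA are required. */
		if (!(cr4 & CR4_PAE) || !(efer & EFER_LMA_BIT))
			return -EINVAL;
	} else {
		/* Not in long mode: LMA must be clear, CS cannot be 64-bit. */
		if ((efer & EFER_LMA_BIT) || cs_l)
			return -EINVAL;
	}
	return 0;
}
```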

2018-02-01 00:03:36 +00:00
static int __set_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs)
2007-11-01 19:16:10 +00:00
{
2014-01-24 15:48:44 +00:00
	struct msr_data apic_base_msr;
2007-11-01 19:16:10 +00:00
	int mmu_reset_needed = 0;
2018-05-01 14:49:54 +00:00
	int cpuid_update_needed = 0;
2011-01-12 07:39:18 +00:00
	int pending_vec, max_bits, idx;
2010-02-16 08:51:48 +00:00
	struct desc_ptr dt;
2017-12-04 20:35:29 +00:00
	int ret = -EINVAL;

2017-12-14 08:01:52 +00:00
	if (kvm_valid_sregs(vcpu, sregs))
2017-12-21 00:24:27 +00:00
		goto out;
2017-12-14 08:01:52 +00:00

2017-08-10 17:14:13 +00:00
	apic_base_msr.data = sregs->apic_base;
	apic_base_msr.host_initiated = true;
	if (kvm_set_apic_base(vcpu, &apic_base_msr))
2017-12-04 20:35:29 +00:00
		goto out;
2012-11-06 18:24:07 +00:00

2010-02-16 08:51:48 +00:00
	dt.size = sregs->idt.limit;
	dt.address = sregs->idt.base;
2007-11-01 19:16:10 +00:00
	kvm_x86_ops->set_idt(vcpu, &dt);
2010-02-16 08:51:48 +00:00
	dt.size = sregs->gdt.limit;
	dt.address = sregs->gdt.base;
2007-11-01 19:16:10 +00:00
	kvm_x86_ops->set_gdt(vcpu, &dt);

2007-12-13 15:50:52 +00:00
	vcpu->arch.cr2 = sregs->cr2;
2010-12-05 15:30:00 +00:00
	mmu_reset_needed |= kvm_read_cr3(vcpu) != sregs->cr3;
2009-07-01 18:52:03 +00:00
	vcpu->arch.cr3 = sregs->cr3;
2010-12-05 16:56:11 +00:00
	__set_bit(VCPU_EXREG_CR3, (ulong *)&vcpu->arch.regs_avail);
2007-11-01 19:16:10 +00:00

2008-02-24 09:20:43 +00:00
	kvm_set_cr8(vcpu, sregs->cr8);
2007-11-01 19:16:10 +00:00

2010-01-21 13:31:50 +00:00
	mmu_reset_needed |= vcpu->arch.efer != sregs->efer;
2007-11-01 19:16:10 +00:00
	kvm_x86_ops->set_efer(vcpu, sregs->efer);

2009-12-29 16:07:30 +00:00
	mmu_reset_needed |= kvm_read_cr0(vcpu) != sregs->cr0;
2007-11-01 19:16:10 +00:00
	kvm_x86_ops->set_cr0(vcpu, sregs->cr0);
2008-02-06 11:02:35 +00:00
	vcpu->arch.cr0 = sregs->cr0;
2007-11-01 19:16:10 +00:00

2009-12-07 10:16:48 +00:00
	mmu_reset_needed |= kvm_read_cr4(vcpu) != sregs->cr4;
2018-05-01 14:49:54 +00:00
	cpuid_update_needed |= ((kvm_read_cr4(vcpu) ^ sregs->cr4) &
				(X86_CR4_OSXSAVE | X86_CR4_PKE));
2007-11-01 19:16:10 +00:00
	kvm_x86_ops->set_cr4(vcpu, sregs->cr4);
2018-05-01 14:49:54 +00:00
	if (cpuid_update_needed)
2011-11-23 14:30:32 +00:00
		kvm_update_cpuid(vcpu);
2011-01-12 07:39:18 +00:00

	idx = srcu_read_lock(&vcpu->kvm->srcu);
2009-10-26 18:48:33 +00:00
	if (!is_long_mode(vcpu) && is_pae(vcpu)) {
2010-12-05 15:30:00 +00:00
		load_pdptrs(vcpu, vcpu->arch.walk_mmu, kvm_read_cr3(vcpu));
2009-10-26 18:48:33 +00:00
		mmu_reset_needed = 1;
	}
2011-01-12 07:39:18 +00:00
	srcu_read_unlock(&vcpu->kvm->srcu, idx);
2007-11-01 19:16:10 +00:00

	if (mmu_reset_needed)
		kvm_mmu_reset_context(vcpu);

2012-09-05 17:00:52 +00:00
	max_bits = KVM_NR_INTERRUPTS;
2009-05-11 10:35:48 +00:00
	pending_vec = find_first_bit(
		(const unsigned long *)sregs->interrupt_bitmap, max_bits);
	if (pending_vec < max_bits) {
2009-05-11 10:35:50 +00:00
		kvm_queue_interrupt(vcpu, pending_vec, false);
2009-05-11 10:35:48 +00:00
		pr_debug("Set back pending irq %d\n", pending_vec);
2007-11-01 19:16:10 +00:00
	}

2008-05-27 08:18:46 +00:00
	kvm_set_segment(vcpu, &sregs->cs, VCPU_SREG_CS);
	kvm_set_segment(vcpu, &sregs->ds, VCPU_SREG_DS);
	kvm_set_segment(vcpu, &sregs->es, VCPU_SREG_ES);
	kvm_set_segment(vcpu, &sregs->fs, VCPU_SREG_FS);
	kvm_set_segment(vcpu, &sregs->gs, VCPU_SREG_GS);
	kvm_set_segment(vcpu, &sregs->ss, VCPU_SREG_SS);
2007-11-01 19:16:10 +00:00

2008-05-27 08:18:46 +00:00
	kvm_set_segment(vcpu, &sregs->tr, VCPU_SREG_TR);
	kvm_set_segment(vcpu, &sregs->ldt, VCPU_SREG_LDTR);
2007-11-01 19:16:10 +00:00

2009-08-03 11:58:25 +00:00
	update_cr8_intercept(vcpu);

2008-09-10 19:40:55 +00:00
	/* Older userspace won't unhalt the vcpu on reset. */
2009-06-09 12:56:26 +00:00
	if (kvm_vcpu_is_bsp(vcpu) && kvm_rip_read(vcpu) == 0xfff0 &&
2008-09-10 19:40:55 +00:00
	    sregs->cs.selector == 0xf000 && sregs->cs.base == 0xffff0000 &&
2010-01-21 13:31:48 +00:00
	    !is_protmode(vcpu))
2008-09-10 19:40:55 +00:00
		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;

2010-07-27 09:30:24 +00:00
	kvm_make_request(KVM_REQ_EVENT, vcpu);

2017-12-04 20:35:29 +00:00
	ret = 0;
out:
2018-02-01 00:03:36 +00:00
	return ret;
}

int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
				  struct kvm_sregs *sregs)
{
	int ret;

	vcpu_load(vcpu);
	ret = __set_sregs(vcpu, sregs);
2017-12-04 20:35:29 +00:00
	vcpu_put(vcpu);
	return ret;
2007-11-01 19:16:10 +00:00
}

2008-12-15 12:52:10 +00:00
int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
					struct kvm_guest_debug *dbg)
2007-11-01 19:16:10 +00:00
{
2009-10-02 22:31:21 +00:00
	unsigned long rflags;
2008-12-15 12:52:10 +00:00
	int i, r;
2007-11-01 19:16:10 +00:00

2017-12-04 20:35:33 +00:00
	vcpu_load(vcpu);

2009-10-30 11:46:59 +00:00
	if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
		r = -EBUSY;
		if (vcpu->arch.exception.pending)
2010-05-13 08:25:04 +00:00
			goto out;
2009-10-30 11:46:59 +00:00
		if (dbg->control & KVM_GUESTDBG_INJECT_DB)
			kvm_queue_exception(vcpu, DB_VECTOR);
		else
			kvm_queue_exception(vcpu, BP_VECTOR);
	}

2009-10-05 11:07:21 +00:00
	/*
	 * Read rflags as long as potentially injected trace flags are still
	 * filtered out.
	 */
	rflags = kvm_get_rflags(vcpu);
2009-10-02 22:31:21 +00:00

	vcpu->guest_debug = dbg->control;
	if (!(vcpu->guest_debug & KVM_GUESTDBG_ENABLE))
		vcpu->guest_debug = 0;

	if (vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) {
2008-12-15 12:52:10 +00:00
		for (i = 0; i < KVM_NR_DB_REGS; ++i)
			vcpu->arch.eff_db[i] = dbg->arch.debugreg[i];
2012-09-21 03:42:55 +00:00
		vcpu->arch.guest_debug_dr7 = dbg->arch.debugreg[7];
2008-12-15 12:52:10 +00:00
	} else {
		for (i = 0; i < KVM_NR_DB_REGS; i++)
			vcpu->arch.eff_db[i] = vcpu->arch.db[i];
	}
2012-09-21 03:42:55 +00:00
	kvm_update_dr7(vcpu);
2008-12-15 12:52:10 +00:00

2010-02-23 16:47:55 +00:00
	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
		vcpu->arch.singlestep_rip = kvm_rip_read(vcpu) +
			get_segment_base(vcpu, VCPU_SREG_CS);
2009-10-18 11:24:44 +00:00

2009-10-05 11:07:21 +00:00
	/*
	 * Trigger an rflags update that will inject or remove the trace
	 * flags.
	 */
	kvm_set_rflags(vcpu, rflags);
2007-11-01 19:16:10 +00:00

2015-11-10 10:55:36 +00:00
	kvm_x86_ops->update_bp_intercept(vcpu);
2007-11-01 19:16:10 +00:00

2009-10-30 11:46:59 +00:00
	r = 0;
2008-12-15 12:52:10 +00:00

2010-05-13 08:25:04 +00:00
out:
2017-12-04 20:35:33 +00:00
	vcpu_put(vcpu);
2007-11-01 19:16:10 +00:00
	return r;
}

2007-11-16 05:05:55 +00:00
/*
 * Translate a guest virtual address to a guest physical address.
 */
int kvm_arch_vcpu_ioctl_translate(struct kvm_vcpu *vcpu,
				    struct kvm_translation *tr)
{
	unsigned long vaddr = tr->linear_address;
	gpa_t gpa;
2009-12-23 16:35:25 +00:00
	int idx;
2007-11-16 05:05:55 +00:00

2017-12-04 20:35:32 +00:00
	vcpu_load(vcpu);

2009-12-23 16:35:25 +00:00
	idx = srcu_read_lock(&vcpu->kvm->srcu);
2010-02-10 12:21:32 +00:00
	gpa = kvm_mmu_gva_to_gpa_system(vcpu, vaddr, NULL);
2009-12-23 16:35:25 +00:00
	srcu_read_unlock(&vcpu->kvm->srcu, idx);
2007-11-16 05:05:55 +00:00
	tr->physical_address = gpa;
	tr->valid = gpa != UNMAPPED_GVA;
	tr->writeable = 1;
	tr->usermode = 0;

2017-12-04 20:35:32 +00:00
	vcpu_put(vcpu);
2007-11-16 05:05:55 +00:00
	return 0;
}

2007-10-31 22:24:25 +00:00
int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
2017-12-04 20:35:34 +00:00
	struct fxregs_state *fxsave;
2007-10-31 22:24:25 +00:00

2017-12-04 20:35:34 +00:00
	vcpu_load(vcpu);
2007-10-31 22:24:25 +00:00

2017-12-04 20:35:34 +00:00
	fxsave = &vcpu->arch.guest_fpu.state.fxsave;
2007-10-31 22:24:25 +00:00
	memcpy(fpu->fpr, fxsave->st_space, 128);
	fpu->fcw = fxsave->cwd;
	fpu->fsw = fxsave->swd;
	fpu->ftwx = fxsave->twd;
	fpu->last_opcode = fxsave->fop;
	fpu->last_ip = fxsave->rip;
	fpu->last_dp = fxsave->rdp;
	memcpy(fpu->xmm, fxsave->xmm_space, sizeof fxsave->xmm_space);

2017-12-04 20:35:34 +00:00
	vcpu_put(vcpu);
2007-10-31 22:24:25 +00:00
	return 0;
}

int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
{
2017-12-04 20:35:35 +00:00
	struct fxregs_state *fxsave;

	vcpu_load(vcpu);

	fxsave = &vcpu->arch.guest_fpu.state.fxsave;
2007-10-31 22:24:25 +00:00

	memcpy(fxsave->st_space, fpu->fpr, 128);
	fxsave->cwd = fpu->fcw;
	fxsave->swd = fpu->fsw;
	fxsave->twd = fpu->ftwx;
	fxsave->fop = fpu->last_opcode;
	fxsave->rip = fpu->last_ip;
	fxsave->rdp = fpu->last_dp;
	memcpy(fxsave->xmm_space, fpu->xmm, sizeof fxsave->xmm_space);

2017-12-04 20:35:35 +00:00
	vcpu_put(vcpu);
2007-10-31 22:24:25 +00:00
	return 0;
}
|
|
|
|
|
2018-02-01 00:03:36 +00:00
|
|
|
/*
 * Copy the register sets that userspace flagged in kvm_valid_regs into
 * the shared kvm_run sync area.
 */
static void store_regs(struct kvm_vcpu *vcpu)
{
	BUILD_BUG_ON(sizeof(struct kvm_sync_regs) > SYNC_REGS_SIZE_BYTES);

	if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_REGS)
		__get_regs(vcpu, &vcpu->run->s.regs.regs);

	if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_SREGS)
		__get_sregs(vcpu, &vcpu->run->s.regs.sregs);

	if (vcpu->run->kvm_valid_regs & KVM_SYNC_X86_EVENTS)
		kvm_vcpu_ioctl_x86_get_vcpu_events(
				vcpu, &vcpu->run->s.regs.events);
}

/*
 * Load the register sets that userspace flagged in kvm_dirty_regs from
 * the shared kvm_run sync area, clearing each flag once it is applied.
 * Unknown flags, or sregs/events rejected by their setters, yield -EINVAL.
 */
static int sync_regs(struct kvm_vcpu *vcpu)
{
	if (vcpu->run->kvm_dirty_regs & ~KVM_SYNC_X86_VALID_FIELDS)
		return -EINVAL;

	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_REGS) {
		__set_regs(vcpu, &vcpu->run->s.regs.regs);
		vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_REGS;
	}
	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_SREGS) {
		if (__set_sregs(vcpu, &vcpu->run->s.regs.sregs))
			return -EINVAL;
		vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_SREGS;
	}
	if (vcpu->run->kvm_dirty_regs & KVM_SYNC_X86_EVENTS) {
		if (kvm_vcpu_ioctl_x86_set_vcpu_events(
				vcpu, &vcpu->run->s.regs.events))
			return -EINVAL;
		vcpu->run->kvm_dirty_regs &= ~KVM_SYNC_X86_EVENTS;
	}

	return 0;
}

static void fx_init(struct kvm_vcpu *vcpu)
{
	fpstate_init(&vcpu->arch.guest_fpu.state);
	if (boot_cpu_has(X86_FEATURE_XSAVES))
		vcpu->arch.guest_fpu.state.xsave.header.xcomp_bv =
			host_xcr0 | XSTATE_COMPACTION_ENABLED;

	/*
	 * Ensure guest xcr0 is valid for loading
	 */
	vcpu->arch.xcr0 = XFEATURE_MASK_FP;

	vcpu->arch.cr0 |= X86_CR0_ET;
}

/* Swap (qemu) user FPU context for the guest FPU context. */
void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
{
	preempt_disable();
	copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
	/* PKRU is separately restored in kvm_x86_ops->run. */
	__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
				~XFEATURE_MASK_PKRU);
	preempt_enable();
	trace_kvm_fpu(1);
}

/* When vcpu_run ends, restore user space FPU context. */
void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
{
	preempt_disable();
	copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
	copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
	preempt_enable();
	++vcpu->stat.fpu_reload;
	trace_kvm_fpu(0);
}

void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
{
	void *wbinvd_dirty_mask = vcpu->arch.wbinvd_dirty_mask;

	kvmclock_reset(vcpu);

	kvm_x86_ops->vcpu_free(vcpu);
	free_cpumask_var(wbinvd_dirty_mask);
}

struct kvm_vcpu *kvm_arch_vcpu_create(struct kvm *kvm,
						unsigned int id)
{
	struct kvm_vcpu *vcpu;

	if (kvm_check_tsc_unstable() && atomic_read(&kvm->online_vcpus) != 0)
		printk_once(KERN_WARNING
		"kvm: SMP vm created on host with unstable TSC; "
		"guest TSC will not be reliable\n");

	vcpu = kvm_x86_ops->vcpu_create(kvm, id);

	return vcpu;
}

int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
{
	kvm_vcpu_mtrr_init(vcpu);
	vcpu_load(vcpu);
	kvm_vcpu_reset(vcpu, false);
	kvm_mmu_setup(vcpu);
	vcpu_put(vcpu);
	return 0;
}

void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
{
	struct msr_data msr;
	struct kvm *kvm = vcpu->kvm;

	kvm_hv_vcpu_postcreate(vcpu);

	if (mutex_lock_killable(&vcpu->mutex))
		return;
	vcpu_load(vcpu);
	msr.data = 0x0;
	msr.index = MSR_IA32_TSC;
	msr.host_initiated = true;
	kvm_write_tsc(vcpu, &msr);
	vcpu_put(vcpu);
	mutex_unlock(&vcpu->mutex);

	if (!kvmclock_periodic_sync)
		return;

	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
					KVMCLOCK_SYNC_PERIOD);
}

void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
|
2007-11-14 12:38:21 +00:00
|
|
|
{
|
2010-10-14 09:22:50 +00:00
|
|
|
vcpu->arch.apf.msr_val = 0;
|
|
|
|
|
2017-12-04 20:35:23 +00:00
|
|
|
vcpu_load(vcpu);
|
2007-11-14 12:38:21 +00:00
|
|
|
kvm_mmu_unload(vcpu);
|
|
|
|
vcpu_put(vcpu);
|
|
|
|
|
|
|
|
kvm_x86_ops->vcpu_free(vcpu);
|
|
|
|
}
|
|
|
|
|
KVM: x86: INIT and reset sequences are different
x86 architecture defines differences between the reset and INIT sequences.
INIT does not initialize the FPU (including MMX, XMM, YMM, etc.), TSC, PMU,
MSRs (in general), MTRRs machine-check, APIC ID, APIC arbitration ID and BSP.
References (from Intel SDM):
"If the MP protocol has completed and a BSP is chosen, subsequent INITs (either
to a specific processor or system wide) do not cause the MP protocol to be
repeated." [8.4.2: MP Initialization Protocol Requirements and Restrictions]
[Table 9-1. IA-32 Processor States Following Power-up, Reset, or INIT]
"If the processor is reset by asserting the INIT# pin, the x87 FPU state is not
changed." [9.2: X87 FPU INITIALIZATION]
"The state of the local APIC following an INIT reset is the same as it is after
a power-up or hardware reset, except that the APIC ID and arbitration ID
registers are not affected." [10.4.7.3: Local APIC State After an INIT Reset
("Wait-for-SIPI" State)]
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Message-Id: <1428924848-28212-1-git-send-email-namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-04-13 11:34:08 +00:00
void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
2007-11-14 12:38:21 +00:00
{
2018-03-01 14:24:25 +00:00
kvm_lapic_reset(vcpu, init_event);

2015-06-04 08:44:44 +00:00
vcpu->arch.hflags = 0;

2016-06-01 20:26:00 +00:00
vcpu->arch.smi_pending = 0;
2017-11-15 11:43:14 +00:00
vcpu->arch.smi_count = 0;
2011-09-20 10:43:14 +00:00
atomic_set(&vcpu->arch.nmi_queued, 0);
vcpu->arch.nmi_pending = 0;
2008-09-26 07:30:48 +00:00
vcpu->arch.nmi_injected = false;
2014-06-30 09:03:02 +00:00
kvm_clear_interrupt_queue(vcpu);
kvm_clear_exception_queue(vcpu);
2017-08-24 10:35:09 +00:00
vcpu->arch.exception.pending = false;
2008-09-26 07:30:48 +00:00

2008-12-15 12:52:10 +00:00
memset(vcpu->arch.db, 0, sizeof(vcpu->arch.db));
2015-04-02 00:10:37 +00:00
kvm_update_dr0123(vcpu);
2014-07-15 14:37:46 +00:00
vcpu->arch.dr6 = DR6_INIT;
2014-01-04 17:47:16 +00:00
kvm_update_dr6(vcpu);
2008-12-15 12:52:10 +00:00
vcpu->arch.dr7 = DR7_FIXED_1;
2012-09-21 03:42:55 +00:00
kvm_update_dr7(vcpu);
2008-12-15 12:52:10 +00:00

2015-04-02 00:10:38 +00:00
vcpu->arch.cr2 = 0;

2010-07-27 09:30:24 +00:00
kvm_make_request(KVM_REQ_EVENT, vcpu);
2010-10-14 09:22:50 +00:00
vcpu->arch.apf.msr_val = 0;
2011-07-11 19:28:14 +00:00
vcpu->arch.st.msr_val = 0;
2010-07-27 09:30:24 +00:00

2011-02-01 19:16:40 +00:00
kvmclock_reset(vcpu);

2010-10-14 09:22:46 +00:00
kvm_clear_async_pf_completion_queue(vcpu);
kvm_async_pf_hash_reset(vcpu);
vcpu->arch.apf.halted = false;
2010-07-27 09:30:24 +00:00

2017-10-11 12:10:19 +00:00
if (kvm_mpx_supported()) {
void *mpx_state_buffer;

/*
* The INIT path from kvm_apic_has_events() runs with the guest FPU
* loaded and does not let userspace fix the state, so on that path
* unload the FPU before editing the saved xsave area and reload it
* afterwards.
*/
x86,kvm: move qemu/guest FPU switching out to vcpu_run
Currently, every time a VCPU is scheduled out, the host kernel will
first save the guest FPU/xstate context, then load the qemu userspace
FPU context, only to then immediately save the qemu userspace FPU
context back to memory. When scheduling in a VCPU, the same extraneous
FPU loads and saves are done.
This could be avoided by moving from a model where the guest FPU is
loaded and stored with preemption disabled, to a model where the
qemu userspace FPU is swapped out for the guest FPU context for
the duration of the KVM_RUN ioctl.
This is done under the VCPU mutex, which is also taken when other
tasks inspect the VCPU FPU context, so the code should already be
safe for this change. That should come as no surprise, given that
s390 already has this optimization.
This can fix a bug where KVM calls get_user_pages while owning the
FPU, and the file system ends up requesting the FPU again:
[258270.527947] __warn+0xcb/0xf0
[258270.527948] warn_slowpath_null+0x1d/0x20
[258270.527951] kernel_fpu_disable+0x3f/0x50
[258270.527953] __kernel_fpu_begin+0x49/0x100
[258270.527955] kernel_fpu_begin+0xe/0x10
[258270.527958] crc32c_pcl_intel_update+0x84/0xb0
[258270.527961] crypto_shash_update+0x3f/0x110
[258270.527968] crc32c+0x63/0x8a [libcrc32c]
[258270.527975] dm_bm_checksum+0x1b/0x20 [dm_persistent_data]
[258270.527978] node_prepare_for_write+0x44/0x70 [dm_persistent_data]
[258270.527985] dm_block_manager_write_callback+0x41/0x50 [dm_persistent_data]
[258270.527988] submit_io+0x170/0x1b0 [dm_bufio]
[258270.527992] __write_dirty_buffer+0x89/0x90 [dm_bufio]
[258270.527994] __make_buffer_clean+0x4f/0x80 [dm_bufio]
[258270.527996] __try_evict_buffer+0x42/0x60 [dm_bufio]
[258270.527998] dm_bufio_shrink_scan+0xc0/0x130 [dm_bufio]
[258270.528002] shrink_slab.part.40+0x1f5/0x420
[258270.528004] shrink_node+0x22c/0x320
[258270.528006] do_try_to_free_pages+0xf5/0x330
[258270.528008] try_to_free_pages+0xe9/0x190
[258270.528009] __alloc_pages_slowpath+0x40f/0xba0
[258270.528011] __alloc_pages_nodemask+0x209/0x260
[258270.528014] alloc_pages_vma+0x1f1/0x250
[258270.528017] do_huge_pmd_anonymous_page+0x123/0x660
[258270.528021] handle_mm_fault+0xfd3/0x1330
[258270.528025] __get_user_pages+0x113/0x640
[258270.528027] get_user_pages+0x4f/0x60
[258270.528063] __gfn_to_pfn_memslot+0x120/0x3f0 [kvm]
[258270.528108] try_async_pf+0x66/0x230 [kvm]
[258270.528135] tdp_page_fault+0x130/0x280 [kvm]
[258270.528149] kvm_mmu_page_fault+0x60/0x120 [kvm]
[258270.528158] handle_ept_violation+0x91/0x170 [kvm_intel]
[258270.528162] vmx_handle_exit+0x1ca/0x1400 [kvm_intel]
No performance changes were detected in quick ping-pong tests on
my 4 socket system, which is expected since an FPU+xstate load is
on the order of 0.1us, while ping-ponging between CPUs is on the
order of 20us, and somewhat noisy.
Cc: stable@vger.kernel.org
Signed-off-by: Rik van Riel <riel@redhat.com>
Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Fixed a bug where reset_vcpu called put_fpu without a preceding load_fpu,
which happened inside the KVM_CREATE_VCPU ioctl. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-11-14 21:54:23 +00:00
if (init_event)
kvm_put_guest_fpu(vcpu);
2017-10-11 12:10:19 +00:00
mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
XFEATURE_MASK_BNDREGS);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndreg_state));
mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu.state.xsave,
XFEATURE_MASK_BNDCSR);
if (mpx_state_buffer)
memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
2017-11-14 21:54:23 +00:00
if (init_event)
kvm_load_guest_fpu(vcpu);
2017-10-11 12:10:19 +00:00
}
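The `if (init_event)` pair around the MPX clearing follows the rule from the FPU-switching commit: on the INIT path the guest FPU is live in registers, so its in-memory copy must be written back before it is edited and reloaded afterwards, while on the KVM_CREATE_VCPU path nothing is loaded and a put would be exactly the bug noted by Radim. A minimal user-space model of that rule, assuming hypothetical types and names (`fpu_model`, `model_put`, `model_load`, `clear_bnd_state`) that are not KVM APIs; for simplicity it clears the whole buffer rather than only the BNDREGS/BNDCSR components:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model: "regs" stands in for the live FPU registers and
 * "saved" for the in-memory xsave area (guest_fpu.state.xsave in KVM). */
struct fpu_model {
	bool loaded;            /* guest FPU currently live in registers */
	unsigned long regs[4];  /* live register contents */
	unsigned long saved[4]; /* in-memory copy */
};

/* Save live registers to memory; only legal while loaded. */
static void model_put(struct fpu_model *f)
{
	assert(f->loaded);
	memcpy(f->saved, f->regs, sizeof(f->saved));
	f->loaded = false;
}

/* Restore registers from memory; only legal while not loaded. */
static void model_load(struct fpu_model *f)
{
	assert(!f->loaded);
	memcpy(f->regs, f->saved, sizeof(f->regs));
	f->loaded = true;
}

/* Same shape as the reset code: put/load only on the INIT path. */
static void clear_bnd_state(struct fpu_model *f, bool init_event)
{
	if (init_event)
		model_put(f);
	memset(f->saved, 0, sizeof(f->saved)); /* edit the in-memory copy */
	if (init_event)
		model_load(f);
}
```

Calling `clear_bnd_state(&f, false)` on a loaded model would leave the live registers untouched, which is why the put/load pair is tied to `init_event`.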
2015-05-07 09:36:11 +00:00
if (!init_event) {
2015-04-13 11:34:08 +00:00
kvm_pmu_reset(vcpu);
2015-05-07 09:36:11 +00:00
vcpu->arch.smbase = 0x30000;
2017-03-20 08:16:28 +00:00

vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
vcpu->arch.msr_misc_features_enables = 0;
2017-10-11 12:10:19 +00:00

vcpu->arch.xcr0 = XFEATURE_MASK_FP;
2015-05-07 09:36:11 +00:00
}
2011-11-10 12:57:22 +00:00

2012-12-05 14:26:19 +00:00
memset(vcpu->arch.regs, 0, sizeof(vcpu->arch.regs));
vcpu->arch.regs_avail = ~0;
vcpu->arch.regs_dirty = ~0;

2017-10-11 12:10:19 +00:00
vcpu->arch.ia32_xss = 0;
2015-04-13 11:34:08 +00:00
kvm_x86_ops->vcpu_reset(vcpu, init_event);
2007-11-14 12:38:21 +00:00
}
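The INIT-versus-reset split in kvm_vcpu_reset() comes down to one flag: the fields set under `if (!init_event)` (PMU, SMBASE, XCR0 and the platform MSRs) are only re-initialized on a full reset, while everything else is cleared on both. A toy model with a hypothetical `toy_vcpu` struct; the field names echo the code above and the reset values (SMBASE 0x30000, XCR0 = x87 bit only) come from it, but the struct itself is illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_XFEATURE_MASK_FP 0x1ULL /* bit 0 (x87) of XCR0 */
#define TOY_DEFAULT_SMBASE   0x30000u

/* Hypothetical subset of per-vCPU state. */
struct toy_vcpu {
	uint64_t cr2;     /* cleared by both INIT and reset */
	bool nmi_pending; /* cleared by both INIT and reset */
	uint32_t smbase;  /* left alone by INIT */
	uint64_t xcr0;    /* left alone by INIT */
};

static void toy_vcpu_reset(struct toy_vcpu *v, bool init_event)
{
	v->cr2 = 0;
	v->nmi_pending = false;

	if (!init_event) {
		v->smbase = TOY_DEFAULT_SMBASE;
		v->xcr0 = TOY_XFEATURE_MASK_FP;
	}
}
```

Passing the flag down, rather than writing two functions, keeps the shared clearing in one place, which is the same choice the KVM code makes.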
2014-11-24 13:35:24 +00:00
void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
2013-03-13 11:42:34 +00:00
{
struct kvm_segment cs;

kvm_get_segment(vcpu, &cs, VCPU_SREG_CS);
cs.selector = vector << 8;
cs.base = vector << 12;
kvm_set_segment(vcpu, &cs, VCPU_SREG_CS);
kvm_rip_write(vcpu, 0);
2007-11-14 12:38:21 +00:00
}
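kvm_vcpu_deliver_sipi_vector() starts the AP in real mode at CS:IP = (vector << 8):0000. With the real-mode rule base = selector * 16, the first instruction is fetched from physical address vector * 4096, i.e. the start of the 4 KiB page the vector names. A stand-alone sketch of that arithmetic; the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Real-mode CS selector loaded for a given SIPI vector. */
static uint16_t sipi_cs_selector(uint8_t vector)
{
	return (uint16_t)(vector << 8);
}

/* Real-mode CS base: selector * 16, which equals vector << 12. */
static uint32_t sipi_cs_base(uint8_t vector)
{
	uint32_t base = (uint32_t)vector << 12;

	assert(base == (uint32_t)sipi_cs_selector(vector) << 4);
	return base;
}
```

With RIP set to 0, vector 0x9a makes the AP begin executing at 0x9a000.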
2014-08-28 13:13:03 +00:00
int kvm_arch_hardware_enable(void)
2007-11-14 12:38:21 +00:00
{
2010-08-20 08:07:28 +00:00
struct kvm *kvm;
struct kvm_vcpu *vcpu;
int i;
2012-02-03 17:43:56 +00:00
int ret;
u64 local_tsc;
u64 max_tsc = 0;
bool stable, backwards_tsc = false;
2009-09-07 08:12:18 +00:00

kvm_shared_msr_cpu_online();
2014-08-28 13:13:03 +00:00
ret = kvm_x86_ops->hardware_enable();
2012-02-03 17:43:56 +00:00
if (ret != 0)
return ret;

2015-06-25 16:44:07 +00:00
local_tsc = rdtsc();
2018-01-24 13:23:36 +00:00
stable = !kvm_check_tsc_unstable();
2012-02-03 17:43:56 +00:00
list_for_each_entry(kvm, &vm_list, vm_list) {
kvm_for_each_vcpu(i, vcpu, kvm) {
if (!stable && vcpu->cpu == smp_processor_id())
2014-09-12 05:43:19 +00:00
kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
2012-02-03 17:43:56 +00:00
if (stable && vcpu->arch.last_host_tsc > local_tsc) {
backwards_tsc = true;
if (vcpu->arch.last_host_tsc > max_tsc)
max_tsc = vcpu->arch.last_host_tsc;
}
}
}

/*
* Sometimes, even reliable TSCs go backwards. This happens on
* platforms that reset TSC during suspend or hibernate actions, but
* maintain synchronization. We must compensate. Fortunately, we can
* detect that condition here, which happens early in CPU bringup,
* before any KVM threads can be running. Unfortunately, we can't
* bring the TSCs fully up to date with real time, as we aren't yet far
* enough into CPU bringup that we know how much real time has actually
2016-09-01 12:21:03 +00:00
* elapsed; our helper function, ktime_get_boot_ns(), will be using boot
2012-02-03 17:43:56 +00:00
* variables that haven't been updated yet.
*
* So we simply find the maximum observed TSC above, then record the
* adjustment to TSC in each VCPU. When the VCPU later gets loaded,
* the adjustment will be applied. Note that we accumulate
* adjustments, in case multiple suspend cycles happen before some VCPU
* gets a chance to run again. In the event that no KVM threads get a
* chance to run, we will miss the entire elapsed period, as we'll have
* reset last_host_tsc, so VCPUs will not have the TSC adjusted and may
* lose cycle time. This isn't too big a deal, since the loss will be
* uniform across all VCPUs (not to mention the scenario is extremely
* unlikely). It is possible that a second hibernate recovery happens
* much faster than a first, causing the observed TSC here to be
* smaller; this would require additional padding adjustment, which is
* why we set last_host_tsc to the local tsc observed here.
*
* N.B. - this code below runs only on platforms with reliable TSC,
* as that is the only way backwards_tsc is set above. Also note
* that this runs for ALL vcpus, which is not a bug; all VCPUs should
* have the same delta_cyc adjustment applied if backwards_tsc
* is detected. Note further, this adjustment is only done once,
* as we reset last_host_tsc on all VCPUs to stop this from being
* called multiple times (once for each physical CPU bringup).
*
2012-06-28 07:17:27 +00:00
* Platforms with unreliable TSCs don't have to deal with this; they
2012-02-03 17:43:56 +00:00
* will be compensated by the logic in vcpu_load, which sets the TSC to
* catch-up mode. This will catch up all VCPUs to real time, but cannot
* guarantee that they stay in perfect synchronization.
*/
if (backwards_tsc) {
u64 delta_cyc = max_tsc - local_tsc;
list_for_each_entry(kvm, &vm_list, vm_list) {
2017-06-26 07:56:43 +00:00
kvm->arch.backwards_tsc_observed = true;
2012-02-03 17:43:56 +00:00
kvm_for_each_vcpu(i, vcpu, kvm) {
vcpu->arch.tsc_offset_adjustment += delta_cyc;
vcpu->arch.last_host_tsc = local_tsc;
2014-09-12 05:43:19 +00:00
kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
2012-02-03 17:43:56 +00:00
}

/*
* We have to disable TSC offset matching. If you were
* booting a VM while issuing an S4 host suspend,
* you may have some problems. Solving this issue is
* left as an exercise to the reader.
*/
kvm->arch.last_tsc_nsec = 0;
kvm->arch.last_tsc_write = 0;
}

}
return 0;
2007-11-14 12:38:21 +00:00
}
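The compensation in kvm_arch_hardware_enable() reduces to three steps: find the largest last_host_tsc that is ahead of the freshly read local TSC, add the difference to every vCPU's pending offset adjustment, and re-base last_host_tsc so a later CPU bringup does not add it again. A simplified single-VM sketch, assuming a hypothetical `toy_vcpu_tsc` struct and an explicit `stable` flag in place of kvm_check_tsc_unstable():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_vcpu_tsc {
	uint64_t last_host_tsc;         /* host TSC when the vCPU last ran */
	uint64_t tsc_offset_adjustment; /* accumulated correction */
};

/* Returns true if a backwards TSC was detected and compensated. */
static bool compensate_backwards_tsc(struct toy_vcpu_tsc *v, size_t n,
				     uint64_t local_tsc, bool stable)
{
	uint64_t max_tsc = 0;
	bool backwards = false;
	size_t i;

	if (!stable)
		return false; /* unstable TSCs rely on catch-up mode instead */

	for (i = 0; i < n; i++) {
		if (v[i].last_host_tsc > local_tsc) {
			backwards = true;
			if (v[i].last_host_tsc > max_tsc)
				max_tsc = v[i].last_host_tsc;
		}
	}
	if (!backwards)
		return false;

	/* Same delta for ALL vCPUs, then re-base so it is applied only once. */
	for (i = 0; i < n; i++) {
		v[i].tsc_offset_adjustment += max_tsc - local_tsc;
		v[i].last_host_tsc = local_tsc;
	}
	return true;
}
```

A second call with the same `local_tsc` finds nothing ahead of it and changes no adjustment, which is the "only done once" property the comment describes.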
2014-08-28 13:13:03 +00:00
void kvm_arch_hardware_disable(void)
2007-11-14 12:38:21 +00:00
{
2014-08-28 13:13:03 +00:00
kvm_x86_ops->hardware_disable();
drop_user_return_notifiers();
2007-11-14 12:38:21 +00:00
}

int kvm_arch_hardware_setup(void)
{
2015-04-12 18:47:15 +00:00
int r;

r = kvm_x86_ops->hardware_setup();
if (r != 0)
return r;

2015-10-20 07:39:03 +00:00
if (kvm_has_tsc_control) {
/*
* Make sure the user can only configure tsc_khz values that
* fit into a signed integer.
2018-06-11 17:12:10 +00:00
* A min value is not calculated because it will always
2015-10-20 07:39:03 +00:00
* be 1 on all machines.
*/
u64 max = min(0x7fffffffULL,
__scale_tsc(kvm_max_tsc_scaling_ratio, tsc_khz));
kvm_max_guest_tsc_khz = max;

2015-10-20 07:39:02 +00:00
kvm_default_tsc_scaling_ratio = 1ULL << kvm_tsc_scaling_ratio_frac_bits;
2015-10-20 07:39:03 +00:00
}
2015-10-20 07:39:02 +00:00

2015-04-12 18:47:15 +00:00
kvm_init_msr_list();
return 0;
2007-11-14 12:38:21 +00:00
}
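kvm_max_guest_tsc_khz is the host tsc_khz scaled by the largest ratio the hardware supports, clamped to 0x7fffffff so it fits a signed 32-bit integer. The ratio is fixed-point with kvm_tsc_scaling_ratio_frac_bits fractional bits, which is why the identity ratio is `1ULL << frac_bits`. A sketch assuming __scale_tsc() computes (value * ratio) >> frac_bits with a wide intermediate; it relies on the GCC/Clang `unsigned __int128` extension, and the 48 fractional bits used in the test are only an example value:

```c
#include <assert.h>
#include <stdint.h>

/* (value * ratio) >> frac_bits without overflowing 64 bits.
 * Assumes a compiler with unsigned __int128 (GCC, Clang). */
static uint64_t scale_fixed(uint64_t ratio, uint64_t value, unsigned int frac_bits)
{
	return (uint64_t)(((unsigned __int128)value * ratio) >> frac_bits);
}

/* Largest guest tsc_khz userspace may request, clamped to INT32_MAX. */
static uint64_t max_guest_tsc_khz(uint64_t max_ratio, uint64_t host_tsc_khz,
				  unsigned int frac_bits)
{
	uint64_t scaled = scale_fixed(max_ratio, host_tsc_khz, frac_bits);

	return scaled < 0x7fffffffULL ? scaled : 0x7fffffffULL;
}
```

The 128-bit intermediate matters: 2,500,000 kHz times a ratio of 4.0 in 48-bit fixed point is about 2.8e21, well past the 64-bit range, even though the final result is only 10,000,000.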
void kvm_arch_hardware_unsetup(void)
{
kvm_x86_ops->hardware_unsetup();
}

void kvm_arch_check_processor_compat(void *rtn)
{
kvm_x86_ops->check_processor_compatibility(rtn);
2015-07-29 09:56:48 +00:00
}

bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
{
return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
}
EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp);

bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
{
return (vcpu->arch.apic_base & MSR_IA32_APICBASE_BSP) != 0;
2007-11-14 12:38:21 +00:00
}
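kvm_vcpu_is_bsp() reads the BSP flag straight out of the vCPU's IA32_APIC_BASE MSR value. For reference, a sketch decoding that MSR's architectural fields: bit 8 is BSP, bit 10 enables x2APIC mode, bit 11 is the global APIC enable, and bits 12 and up hold the APIC base physical address. The mask assumes a 52-bit maximum physical-address width, and the struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define APICBASE_BSP    (1ULL << 8)
#define APICBASE_X2APIC (1ULL << 10)
#define APICBASE_ENABLE (1ULL << 11)
/* Bits 12..51: assumes a 52-bit maximum physical address width. */
#define APICBASE_ADDR   (((1ULL << 52) - 1) & ~0xfffULL)

struct apic_base_fields {
	bool bsp, x2apic, enabled;
	uint64_t base;
};

static struct apic_base_fields decode_apic_base(uint64_t msr)
{
	struct apic_base_fields f = {
		.bsp = (msr & APICBASE_BSP) != 0,
		.x2apic = (msr & APICBASE_X2APIC) != 0,
		.enabled = (msr & APICBASE_ENABLE) != 0,
		.base = msr & APICBASE_ADDR,
	};
	return f;
}
```

0xfee00900 is the familiar value on a BSP with the xAPIC enabled at its default base: enable (0x800) plus BSP (0x100) on top of base 0xfee00000.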
2012-08-05 12:58:32 +00:00
struct static_key kvm_no_apic_vcpu __read_mostly;
2016-01-08 12:48:51 +00:00
EXPORT_SYMBOL_GPL(kvm_no_apic_vcpu);
2012-08-05 12:58:32 +00:00

2007-11-14 12:38:21 +00:00
int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
{
struct page *page;
int r;

2017-09-12 15:42:41 +00:00
vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu);
2010-07-29 12:11:50 +00:00
vcpu->arch.emulate_ctxt.ops = &emulate_ops;
2017-08-24 18:51:25 +00:00
if (!irqchip_in_kernel(vcpu->kvm) || kvm_vcpu_is_reset_bsp(vcpu))
2008-04-13 14:54:35 +00:00
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
2007-11-14 12:38:21 +00:00
else
2008-04-13 14:54:35 +00:00
vcpu->arch.mp_state = KVM_MP_STATE_UNINITIALIZED;
2007-11-14 12:38:21 +00:00

page = alloc_page(GFP_KERNEL | __GFP_ZERO);
if (!page) {
r = -ENOMEM;
goto fail;
}
2007-12-13 15:50:52 +00:00
vcpu->arch.pio_data = page_address(page);
2007-11-14 12:38:21 +00:00
KVM: Infrastructure for software and hardware based TSC rate scaling
This requires some restructuring; rather than use 'virtual_tsc_khz'
to indicate whether hardware rate scaling is in effect, we consider
each VCPU to always have a virtual TSC rate. Instead, there is new
logic above the vendor-specific hardware scaling that decides whether
it is even necessary to use and updates all rate variables used by
common code. This means we can simply query the virtual rate at
any point, which is needed for software rate scaling.
There is also now a threshold added to the TSC rate scaling; minor
differences and variations of measured TSC rate can accidentally
provoke rate scaling to be used when it is not needed. Instead,
we have a tolerance variable called tsc_tolerance_ppm, which is
the maximum variation from user requested rate at which scaling
will be used. The default is 250ppm, which is half the
threshold for NTP adjustment, allowing for some hardware variation.
In the event that hardware rate scaling is not available, we can
kludge a bit by forcing TSC catchup to turn on when a faster than
hardware speed has been requested, but there is nothing available
yet for the reverse case; this requires a trap and emulate software
implementation for RDTSC, which is still forthcoming.
[avi: fix 64-bit division on i386]
Signed-off-by: Zachary Amsden <zamsden@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
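The tolerance rule from this rate-scaling commit can be written as a window check: scaling is used only when the requested rate falls outside host_khz * (1 ± ppm / 10^6). A sketch with hypothetical helper names (`adjust_khz`, `needs_tsc_scaling`); it follows the commit's description rather than quoting kvm_set_tsc_khz():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* khz * (1e6 + ppm) / 1e6, computed in 64 bits to avoid overflow. */
static uint32_t adjust_khz(uint32_t khz, int32_t ppm)
{
	uint64_t v = (uint64_t)khz * (uint64_t)(1000000 + ppm);

	return (uint32_t)(v / 1000000);
}

/* True when the requested rate is outside the +/- ppm tolerance window. */
static bool needs_tsc_scaling(uint32_t host_khz, uint32_t user_khz,
			      uint32_t tolerance_ppm)
{
	uint32_t lo = adjust_khz(host_khz, -(int32_t)tolerance_ppm);
	uint32_t hi = adjust_khz(host_khz, (int32_t)tolerance_ppm);

	return user_khz < lo || user_khz > hi;
}
```

On a 2 GHz host with the default 250 ppm, requests from 1,999,500 to 2,000,500 kHz run unscaled.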
2012-02-03 17:43:50 +00:00
kvm_set_tsc_khz(vcpu, max_tsc_khz);
2010-09-19 00:38:15 +00:00

2007-11-14 12:38:21 +00:00
r = kvm_mmu_create(vcpu);
if (r < 0)
goto fail_free_pio_data;

2017-08-24 18:51:25 +00:00
if (irqchip_in_kernel(vcpu->kvm)) {
2007-11-14 12:38:21 +00:00
r = kvm_create_lapic(vcpu);
if (r < 0)
goto fail_mmu_destroy;
2012-08-05 12:58:32 +00:00
} else
static_key_slow_inc(&kvm_no_apic_vcpu);
2007-11-14 12:38:21 +00:00

2009-05-11 08:48:15 +00:00
vcpu->arch.mce_banks = kzalloc(KVM_MAX_MCE_BANKS * sizeof(u64) * 4,
GFP_KERNEL);
if (!vcpu->arch.mce_banks) {
r = -ENOMEM;
2010-01-22 06:21:29 +00:00
goto fail_free_lapic;
2009-05-11 08:48:15 +00:00
}
vcpu->arch.mcg_cap = KVM_MAX_MCE_BANKS;

2013-04-17 23:41:00 +00:00
if (!zalloc_cpumask_var(&vcpu->arch.wbinvd_dirty_mask, GFP_KERNEL)) {
r = -ENOMEM;
2010-06-30 04:25:15 +00:00
goto fail_free_mce_banks;
2013-04-17 23:41:00 +00:00
}
2010-06-30 04:25:15 +00:00

2015-04-27 04:58:22 +00:00
fx_init(vcpu);
2012-12-05 14:26:19 +00:00

2013-10-02 14:06:16 +00:00
vcpu->arch.guest_xstate_size = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET;
2013-10-02 14:06:15 +00:00

2015-03-29 20:56:12 +00:00
vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);

2015-04-27 13:11:25 +00:00
vcpu->arch.pat = MSR_IA32_CR_PAT_DEFAULT;

2010-10-14 09:22:46 +00:00
kvm_async_pf_hash_reset(vcpu);
2011-11-10 12:57:22 +00:00
kvm_pmu_init(vcpu);
2010-10-14 09:22:46 +00:00
KVM: x86: Add support for local interrupt requests from userspace
In order to enable userspace PIC support, the userspace PIC needs to
be able to inject local interrupts even when the APICs are in the
kernel.
KVM_INTERRUPT now supports sending local interrupts to an APIC when
APICs are in the kernel.
The ready_for_interrupt_request flag is now only set when the CPU/APIC
will immediately accept and inject an interrupt (i.e. APIC has not
masked the PIC).
When the PIC wishes to initiate an INTA cycle with, say, CPU0, it
kicks CPU0 out of the guest, and rendezvouses with CPU0 once it arrives
in userspace.
When the CPU/APIC unmasks the PIC, a KVM_EXIT_IRQ_WINDOW_OPEN is
triggered, so that userspace has a chance to inject a PIC interrupt
if it had been pending.
Overall, this design can lead to a small number of spurious userspace
rendezvous, in particular whenever the PIC transitions from low to
high while it is masked and whenever the PIC becomes unmasked while
it is low.
Note: this does not buffer more than one local interrupt in the
kernel, so the VMM needs to enter the guest in order to complete
interrupt injection before injecting an additional interrupt.
Compiles for x86.
Can pass the KVM Unit Tests.
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-07-30 09:27:16 +00:00
vcpu->arch.pending_external_vector = -1;
2017-08-08 04:05:33 +00:00
vcpu->arch.preempted_in_kernel = false;
2015-07-30 09:27:16 +00:00

2015-11-10 12:36:34 +00:00
kvm_hv_vcpu_init(vcpu);

2007-11-14 12:38:21 +00:00
return 0;
2015-04-27 04:58:22 +00:00

2010-06-30 04:25:15 +00:00
fail_free_mce_banks:
kfree(vcpu->arch.mce_banks);
2010-01-22 06:21:29 +00:00
fail_free_lapic:
kvm_free_lapic(vcpu);
2007-11-14 12:38:21 +00:00
fail_mmu_destroy:
kvm_mmu_destroy(vcpu);
fail_free_pio_data:
2007-12-13 15:50:52 +00:00
free_page((unsigned long)vcpu->arch.pio_data);
2007-11-14 12:38:21 +00:00
fail:
return r;
}
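kvm_arch_vcpu_init() sets pending_external_vector to -1, the "empty" sentinel behind the local-interrupt commit's note that only one local interrupt is buffered in the kernel. A sketch of such a one-slot buffer; the names are hypothetical, and returning -EEXIST for a busy slot is an assumption about how a second injection would be refused:

```c
#include <assert.h>
#include <errno.h>

#define NO_PENDING_VECTOR (-1)

/* Hypothetical one-slot buffer for a userspace-injected interrupt. */
struct irq_slot {
	int pending_vector; /* NO_PENDING_VECTOR when empty */
};

static void irq_slot_init(struct irq_slot *s)
{
	s->pending_vector = NO_PENDING_VECTOR;
}

/* Queue a vector; refuse while an earlier one is still undelivered. */
static int irq_slot_inject(struct irq_slot *s, unsigned int vector)
{
	if (vector > 255)
		return -EINVAL;
	if (s->pending_vector != NO_PENDING_VECTOR)
		return -EEXIST;
	s->pending_vector = (int)vector;
	return 0;
}

/* Called on guest entry: hand the vector over and empty the slot. */
static int irq_slot_take(struct irq_slot *s)
{
	int v = s->pending_vector;

	s->pending_vector = NO_PENDING_VECTOR;
	return v;
}
```

This is why the commit says the VMM must enter the guest to complete one injection before queueing the next: only `irq_slot_take()` (guest entry) frees the slot.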
void kvm_arch_vcpu_uninit(struct kvm_vcpu *vcpu)
{
2009-12-23 16:35:25 +00:00
int idx;

2015-11-30 16:22:21 +00:00
kvm_hv_vcpu_uninit(vcpu);
2011-11-10 12:57:22 +00:00
kvm_pmu_destroy(vcpu);
2010-01-22 06:18:47 +00:00
kfree(vcpu->arch.mce_banks);
2007-11-14 12:38:21 +00:00
kvm_free_lapic(vcpu);
2009-12-23 16:35:25 +00:00
idx = srcu_read_lock(&vcpu->kvm->srcu);
2007-11-14 12:38:21 +00:00
kvm_mmu_destroy(vcpu);
2009-12-23 16:35:25 +00:00
srcu_read_unlock(&vcpu->kvm->srcu, idx);
2007-12-13 15:50:52 +00:00
free_page((unsigned long)vcpu->arch.pio_data);
2015-07-29 10:05:37 +00:00
if (!lapic_in_kernel(vcpu))
2012-08-05 12:58:32 +00:00
static_key_slow_dec(&kvm_no_apic_vcpu);
2007-11-14 12:38:21 +00:00
}
2007-11-18 10:43:45 +00:00

2014-08-21 16:08:05 +00:00
void kvm_arch_sched_in(struct kvm_vcpu *vcpu, int cpu)
{
x86/KVM/VMX: Add L1D flush logic
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flag is set:
- Always, if the flush module parameter is 'always'
- Conditionally at:
- Entry to vcpu_run(), i.e. after executing user space
- From the sched_in notifier, i.e. when switching to a vCPU thread.
- From vmexit handlers which are considered unsafe, i.e. where
sensitive data can be brought into L1D:
- The emulator, which could be a good target for other speculative
execution-based threats,
- The MMU, which can bring host page tables into the L1 cache.
- External interrupts
- Nested operations that require the MMU (see above). That is
vmptrld, vmptrst, vmclear, vmwrite, vmread.
- When handling invept, invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2018-07-02 11:07:14 +00:00
vcpu->arch.l1tf_flush_l1d = true;
2014-08-21 16:08:06 +00:00
kvm_x86_ops->sched_in(vcpu, cpu);
2014-08-21 16:08:05 +00:00
}
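kvm_arch_sched_in() is one of the "conditional" points listed in the L1D flush commit: it only sets l1tf_flush_l1d, and the flush itself happens at the next VM entry. A sketch of that set-then-consume pattern, with hypothetical names, modes standing in for the module parameter's settings, and a counter standing in for the real cache flush:

```c
#include <assert.h>
#include <stdbool.h>

enum flush_mode { FLUSH_NEVER, FLUSH_COND, FLUSH_ALWAYS };

struct toy_vcpu_l1d {
	bool flush_l1d;       /* set at "unsafe" points, consumed on entry */
	unsigned int flushes; /* stands in for actually flushing L1D */
};

/* e.g. sched_in, entry to vcpu_run, emulator or MMU exits */
static void mark_unsafe(struct toy_vcpu_l1d *v)
{
	v->flush_l1d = true;
}

static void vmenter(struct toy_vcpu_l1d *v, enum flush_mode mode)
{
	bool flush = mode == FLUSH_ALWAYS ||
		     (mode == FLUSH_COND && v->flush_l1d);

	v->flush_l1d = false;
	if (flush)
		v->flushes++;
}
```

Clearing the flag on every entry means a run of safe exits costs no flushes, while any single unsafe event forces exactly one.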
2012-01-04 09:25:20 +00:00
int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
2007-11-18 10:43:45 +00:00
{
2012-01-04 09:25:20 +00:00
if (type)
return -EINVAL;

2014-11-20 12:45:31 +00:00
INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
2007-12-14 02:01:48 +00:00
INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
2013-05-31 00:36:29 +00:00
INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
2008-07-28 16:26:26 +00:00
INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
2013-10-30 17:02:30 +00:00
atomic_set(&kvm->arch.noncoherent_dma_count, 0);
2007-11-18 10:43:45 +00:00

2008-10-15 12:15:06 +00:00
/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
set_bit(KVM_USERSPACE_IRQ_SOURCE_ID, &kvm->arch.irq_sources_bitmap);
2012-09-21 17:58:03 +00:00
/* Reserve bit 1 of irq_sources_bitmap for irqfd-resampler */
set_bit(KVM_IRQFD_RESAMPLE_IRQ_SOURCE_ID,
&kvm->arch.irq_sources_bitmap);
2008-10-15 12:15:06 +00:00

2011-02-04 09:49:11 +00:00
raw_spin_lock_init(&kvm->arch.tsc_write_lock);
2012-09-13 14:19:24 +00:00
mutex_init(&kvm->arch.apic_map_lock);
2012-11-28 01:29:01 +00:00
spin_lock_init(&kvm->arch.pvclock_gtod_sync_lock);

2016-09-01 12:21:03 +00:00
kvm->arch.kvmclock_offset = -ktime_get_boot_ns();
2012-11-28 01:29:01 +00:00
pvclock_update_vm_gtod_copy(kvm);
2008-12-11 19:45:05 +00:00

2014-02-28 11:52:54 +00:00
INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
2014-02-28 11:52:55 +00:00
INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
2014-02-28 11:52:54 +00:00

2018-02-01 13:48:31 +00:00
kvm_hv_init_vm(kvm);
2016-02-24 09:51:13 +00:00
kvm_page_track_init(kvm);
2016-02-24 09:51:16 +00:00
kvm_mmu_init_vm(kvm);
2016-02-24 09:51:13 +00:00

2016-05-04 19:09:42 +00:00
if (kvm_x86_ops->vm_init)
return kvm_x86_ops->vm_init(kvm);

2010-11-09 16:02:49 +00:00
return 0;
2007-11-18 10:43:45 +00:00
}
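kvm_arch_init_vm() reserves bits 0 and 1 of irq_sources_bitmap so that later requests for an IRQ source ID never receive the userspace or irqfd-resampler IDs. A sketch of that reserve-then-allocate pattern over a single `unsigned long`; the allocator is a hypothetical stand-in, not the kernel's helper, and it takes the bit numbers 0 and 1 from the comments above:

```c
#include <assert.h>
#include <limits.h>

#define USERSPACE_SRC_ID      0 /* bit 0, per the comment above */
#define IRQFD_RESAMPLE_SRC_ID 1 /* bit 1, per the comment above */
#define SRC_ID_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Claim the lowest free bit; returns -1 when all IDs are in use. */
static int alloc_irq_source_id(unsigned long *bitmap)
{
	unsigned int i;

	for (i = 0; i < SRC_ID_BITS; i++) {
		if (!(*bitmap & (1UL << i))) {
			*bitmap |= 1UL << i;
			return (int)i;
		}
	}
	return -1;
}

static void free_irq_source_id(unsigned long *bitmap, int id)
{
	*bitmap &= ~(1UL << id);
}
```

Because the two reserved bits are set before any allocation, the first dynamically allocated source ID is 2.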
|
|
|
|
|
|
|
|
static void kvm_unload_vcpu_mmu(struct kvm_vcpu *vcpu)
|
|
|
|
{
|
2017-12-04 20:35:23 +00:00
|
|
|
vcpu_load(vcpu);
|
2007-11-18 10:43:45 +00:00
|
|
|
kvm_mmu_unload(vcpu);
|
|
|
|
vcpu_put(vcpu);
|
|
|
|
}

static void kvm_free_vcpus(struct kvm *kvm)
{
	unsigned int i;
	struct kvm_vcpu *vcpu;

	/*
	 * Unpin any mmu pages first.
	 */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		kvm_clear_async_pf_completion_queue(vcpu);
		kvm_unload_vcpu_mmu(vcpu);
	}
	kvm_for_each_vcpu(i, vcpu, kvm)
		kvm_arch_vcpu_free(vcpu);

	mutex_lock(&kvm->lock);
	for (i = 0; i < atomic_read(&kvm->online_vcpus); i++)
		kvm->vcpus[i] = NULL;

	atomic_set(&kvm->online_vcpus, 0);
	mutex_unlock(&kvm->lock);
}

void kvm_arch_sync_events(struct kvm *kvm)
{
	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
	kvm_free_pit(kvm);
}

int __x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
{
	int i, r;
	unsigned long hva;
	struct kvm_memslots *slots = kvm_memslots(kvm);
	struct kvm_memory_slot *slot, old;

	/* Called with kvm->slots_lock held.  */
	if (WARN_ON(id >= KVM_MEM_SLOTS_NUM))
		return -EINVAL;

	slot = id_to_memslot(slots, id);
	if (size) {
kvm: x86: avoid warning on repeated KVM_SET_TSS_ADDR
Found by syzkaller:
WARNING: CPU: 3 PID: 15175 at arch/x86/kvm/x86.c:7705 __x86_set_memory_region+0x1dc/0x1f0 [kvm]()
CPU: 3 PID: 15175 Comm: a.out Tainted: G W 4.4.6-300.fc23.x86_64 #1
Hardware name: LENOVO 2325F51/2325F51, BIOS G2ET32WW (1.12 ) 05/30/2012
0000000000000286 00000000950899a7 ffff88011ab3fbf0 ffffffff813b542e
0000000000000000 ffffffffa0966496 ffff88011ab3fc28 ffffffff810a40f2
00000000000001fd 0000000000003000 ffff88014fc50000 0000000000000000
Call Trace:
[<ffffffff813b542e>] dump_stack+0x63/0x85
[<ffffffff810a40f2>] warn_slowpath_common+0x82/0xc0
[<ffffffff810a423a>] warn_slowpath_null+0x1a/0x20
[<ffffffffa09251cc>] __x86_set_memory_region+0x1dc/0x1f0 [kvm]
[<ffffffffa092521b>] x86_set_memory_region+0x3b/0x60 [kvm]
[<ffffffffa09bb61c>] vmx_set_tss_addr+0x3c/0x150 [kvm_intel]
[<ffffffffa092f4d4>] kvm_arch_vm_ioctl+0x654/0xbc0 [kvm]
[<ffffffffa091d31a>] kvm_vm_ioctl+0x9a/0x6f0 [kvm]
[<ffffffff81241248>] do_vfs_ioctl+0x298/0x480
[<ffffffff812414a9>] SyS_ioctl+0x79/0x90
[<ffffffff817a04ee>] entry_SYSCALL_64_fastpath+0x12/0x71
Testcase:
#include <unistd.h>
#include <sys/ioctl.h>
#include <fcntl.h>
#include <string.h>
#include <linux/kvm.h>
long r[8];
int main()
{
memset(r, -1, sizeof(r));
r[2] = open("/dev/kvm", O_RDONLY|O_TRUNC);
r[3] = ioctl(r[2], KVM_CREATE_VM, 0x0ul);
r[5] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul);
r[7] = ioctl(r[3], KVM_SET_TSS_ADDR, 0x20000000ul);
return 0;
}
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-06-01 12:09:18 +00:00
		if (slot->npages)
			return -EEXIST;

		/*
		 * MAP_SHARED to prevent internal slot pages from being moved
		 * by fork()/COW.
		 */
		hva = vm_mmap(NULL, 0, size, PROT_READ | PROT_WRITE,
			      MAP_SHARED | MAP_ANONYMOUS, 0);
		if (IS_ERR((void *)hva))
			return PTR_ERR((void *)hva);
	} else {
		if (!slot->npages)
			return 0;

		hva = 0;
	}

	old = *slot;
	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
		struct kvm_userspace_memory_region m;

		m.slot = id | (i << 16);
		m.flags = 0;
		m.guest_phys_addr = gpa;
		m.userspace_addr = hva;
		m.memory_size = size;
		r = __kvm_set_memory_region(kvm, &m);
		if (r < 0)
			return r;
	}

	if (!size)
		vm_munmap(old.userspace_addr, old.npages * PAGE_SIZE);

	return 0;
}
EXPORT_SYMBOL_GPL(__x86_set_memory_region);
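__x86_set_memory_region() treats a non-zero size as "create" and a zero size as "delete". Two early returns make repeated calls safe: creating a slot that already has pages returns -EEXIST, and deleting a slot with no pages returns 0 without doing anything. The model below reproduces only that bookkeeping. struct private_slot and model_set_region() are hypothetical names, no memory is actually mapped, and a 4096-byte page size is assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of one private memslot: npages == 0 means absent. */
struct private_slot {
	uint64_t npages;
};

/* Mirrors the create/delete early returns of __x86_set_memory_region(). */
static int model_set_region(struct private_slot *slot, uint32_t size)
{
	if (size) {
		if (slot->npages)
			return -EEXIST;		/* already created */
		slot->npages = size / 4096;	/* assumed PAGE_SIZE */
	} else {
		if (!slot->npages)
			return 0;		/* nothing to delete */
		slot->npages = 0;
	}
	return 0;
}
```

In this model a second create fails cleanly rather than re-registering the slot, and a second delete does nothing.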

int x86_set_memory_region(struct kvm *kvm, int id, gpa_t gpa, u32 size)
{
	int r;

	mutex_lock(&kvm->slots_lock);
	r = __x86_set_memory_region(kvm, id, gpa, size);
	mutex_unlock(&kvm->slots_lock);

	return r;
}
EXPORT_SYMBOL_GPL(x86_set_memory_region);
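The loop in __x86_set_memory_region() registers the same private slot once for each address space. It does this by placing the address-space index in bits 16 and above of m.slot and the slot id in the low 16 bits. The helpers below are hypothetical and show both the packing and the corresponding unpacking. That the consumer splits the value with `slot >> 16` and `(u16)slot` is an assumption; this file does not show it.

```c
#include <assert.h>
#include <stdint.h>

/* Pack as the loop does: m.slot = id | (i << 16). Hypothetical helper. */
static uint32_t pack_slot(uint16_t id, uint16_t as_id)
{
	return (uint32_t)id | ((uint32_t)as_id << 16);
}

/* Assumed inverse: address space in the high half, slot id in the low. */
static uint16_t slot_as_id(uint32_t slot) { return (uint16_t)(slot >> 16); }
static uint16_t slot_id(uint32_t slot)    { return (uint16_t)slot; }
```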

void kvm_arch_destroy_vm(struct kvm *kvm)
{
	if (current->mm == kvm->mm) {
		/*
		 * Free memory regions allocated on behalf of userspace,
		 * unless the memory map has changed due to process exit
		 * or fd copying.
		 */
		x86_set_memory_region(kvm, APIC_ACCESS_PAGE_PRIVATE_MEMSLOT, 0, 0);
		x86_set_memory_region(kvm, IDENTITY_PAGETABLE_PRIVATE_MEMSLOT, 0, 0);
		x86_set_memory_region(kvm, TSS_PRIVATE_MEMSLOT, 0, 0);
	}
	if (kvm_x86_ops->vm_destroy)
		kvm_x86_ops->vm_destroy(kvm);
	kvm_pic_destroy(kvm);
	kvm_ioapic_destroy(kvm);
	kvm_free_vcpus(kvm);
	kvfree(rcu_dereference_check(kvm->arch.apic_map, 1));
	kvm_mmu_uninit_vm(kvm);
	kvm_page_track_cleanup(kvm);
	kvm_hv_destroy_vm(kvm);
}

void kvm_arch_free_memslot(struct kvm *kvm, struct kvm_memory_slot *free,
			   struct kvm_memory_slot *dont)
{
	int i;

	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
		if (!dont || free->arch.rmap[i] != dont->arch.rmap[i]) {
			kvfree(free->arch.rmap[i]);
			free->arch.rmap[i] = NULL;
		}
		if (i == 0)
			continue;

		if (!dont || free->arch.lpage_info[i - 1] !=
			     dont->arch.lpage_info[i - 1]) {
			kvfree(free->arch.lpage_info[i - 1]);
			free->arch.lpage_info[i - 1] = NULL;
		}
	}
KVM: page track: add the framework of guest page tracking
The array, gfn_track[mode][gfn], is introduced in memory slot for every
guest page, this is the tracking count for the gust page on different
modes. If the page is tracked then the count is increased, the page is
not tracked after the count reaches zero
We use 'unsigned short' as the tracking count which should be enough as
shadow page table only can use 2^14 (2^3 for level, 2^1 for cr4_pae, 2^2
for quadrant, 2^3 for access, 2^1 for nxe, 2^1 for cr0_wp, 2^1 for
smep_andnot_wp, 2^1 for smap_andnot_wp, and 2^1 for smm) at most, there
is enough room for other trackers
Two callbacks, kvm_page_track_create_memslot() and
kvm_page_track_free_memslot() are implemented in this patch, they are
internally used to initialize and reclaim the memory of the array
Currently, only write track mode is supported
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-02-24 09:51:09 +00:00

	kvm_page_track_free_memslot(free, dont);
}
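kvm_arch_free_memslot() frees an rmap or lpage_info array only if `dont` is NULL or does not point at that same array. This lets a caller release an old slot while keeping the arrays a replacement slot still uses. Below is a reduced sketch of that ownership rule. It uses a hypothetical struct slot_arrays for the per-level arrays and assumes NR_LEVELS is 3. The parameter is named free_s so that it does not shadow free().

```c
#include <assert.h>
#include <stdlib.h>

#define NR_LEVELS 3	/* assumed stand-in for KVM_NR_PAGE_SIZES */

struct slot_arrays {
	void *rmap[NR_LEVELS];
};

/* Release every array in @free_s that @dont does not share.
 * Returns how many entries were freed and cleared. */
static int free_unshared(struct slot_arrays *free_s,
			 const struct slot_arrays *dont)
{
	int i, freed = 0;

	for (i = 0; i < NR_LEVELS; i++) {
		if (!dont || free_s->rmap[i] != dont->rmap[i]) {
			free(free_s->rmap[i]);
			free_s->rmap[i] = NULL;
			freed++;
		}
	}
	return freed;
}
```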

int kvm_arch_create_memslot(struct kvm *kvm, struct kvm_memory_slot *slot,
			    unsigned long npages)
{
	int i;

	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
		struct kvm_lpage_info *linfo;
		unsigned long ugfn;
		int lpages;
		int level = i + 1;

		lpages = gfn_to_index(slot->base_gfn + npages - 1,
				      slot->base_gfn, level) + 1;

		slot->arch.rmap[i] =
treewide: kvzalloc() -> kvcalloc()
The kvzalloc() function has a 2-factor argument form, kvcalloc(). This
patch replaces cases of:
kvzalloc(a * b, gfp)
with:
kvcalloc(a * b, gfp)
as well as handling cases of:
kvzalloc(a * b * c, gfp)
with:
kvzalloc(array3_size(a, b, c), gfp)
as it's slightly less ugly than:
kvcalloc(array_size(a, b), c, gfp)
This does, however, attempt to ignore constant size factors like:
kvzalloc(4 * 1024, gfp)
though any constants defined via macros get caught up in the conversion.
Any factors with a sizeof() of "unsigned char", "char", and "u8" were
dropped, since they're redundant.
The Coccinelle script used for this was:
// Fix redundant parens around sizeof().
@@
type TYPE;
expression THING, E;
@@
(
kvzalloc(
- (sizeof(TYPE)) * E
+ sizeof(TYPE) * E
, ...)
|
kvzalloc(
- (sizeof(THING)) * E
+ sizeof(THING) * E
, ...)
)
// Drop single-byte sizes and redundant parens.
@@
expression COUNT;
typedef u8;
typedef __u8;
@@
(
kvzalloc(
- sizeof(u8) * (COUNT)
+ COUNT
, ...)
|
kvzalloc(
- sizeof(__u8) * (COUNT)
+ COUNT
, ...)
|
kvzalloc(
- sizeof(char) * (COUNT)
+ COUNT
, ...)
|
kvzalloc(
- sizeof(unsigned char) * (COUNT)
+ COUNT
, ...)
|
kvzalloc(
- sizeof(u8) * COUNT
+ COUNT
, ...)
|
kvzalloc(
- sizeof(__u8) * COUNT
+ COUNT
, ...)
|
kvzalloc(
- sizeof(char) * COUNT
+ COUNT
, ...)
|
kvzalloc(
- sizeof(unsigned char) * COUNT
+ COUNT
, ...)
)
// 2-factor product with sizeof(type/expression) and identifier or constant.
@@
type TYPE;
expression THING;
identifier COUNT_ID;
constant COUNT_CONST;
@@
(
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * (COUNT_ID)
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * COUNT_ID
+ COUNT_ID, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * (COUNT_CONST)
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * COUNT_CONST
+ COUNT_CONST, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * (COUNT_ID)
+ COUNT_ID, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * COUNT_ID
+ COUNT_ID, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * (COUNT_CONST)
+ COUNT_CONST, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * COUNT_CONST
+ COUNT_CONST, sizeof(THING)
, ...)
)
// 2-factor product, only identifiers.
@@
identifier SIZE, COUNT;
@@
- kvzalloc
+ kvcalloc
(
- SIZE * COUNT
+ COUNT, SIZE
, ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed.
@@
expression THING;
identifier STRIDE, COUNT;
type TYPE;
@@
(
kvzalloc(
- sizeof(TYPE) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc(
- sizeof(TYPE) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc(
- sizeof(TYPE) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc(
- sizeof(TYPE) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(TYPE))
, ...)
|
kvzalloc(
- sizeof(THING) * (COUNT) * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc(
- sizeof(THING) * (COUNT) * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc(
- sizeof(THING) * COUNT * (STRIDE)
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
|
kvzalloc(
- sizeof(THING) * COUNT * STRIDE
+ array3_size(COUNT, STRIDE, sizeof(THING))
, ...)
)
// 3-factor product with 2 sizeof(variable), with redundant parens removed.
@@
expression THING1, THING2;
identifier COUNT;
type TYPE1, TYPE2;
@@
(
kvzalloc(
- sizeof(TYPE1) * sizeof(TYPE2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
, ...)
|
kvzalloc(
- sizeof(THING1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvzalloc(
- sizeof(THING1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(THING1), sizeof(THING2))
, ...)
|
kvzalloc(
- sizeof(TYPE1) * sizeof(THING2) * COUNT
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
|
kvzalloc(
- sizeof(TYPE1) * sizeof(THING2) * (COUNT)
+ array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
, ...)
)
// 3-factor product, only identifiers, with redundant parens removed.
@@
identifier STRIDE, SIZE, COUNT;
@@
(
kvzalloc(
- (COUNT) * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- COUNT * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- COUNT * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- (COUNT) * (STRIDE) * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- COUNT * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- (COUNT) * STRIDE * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- (COUNT) * (STRIDE) * (SIZE)
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
|
kvzalloc(
- COUNT * STRIDE * SIZE
+ array3_size(COUNT, STRIDE, SIZE)
, ...)
)
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants...
@@
expression E1, E2, E3;
constant C1, C2, C3;
@@
(
kvzalloc(C1 * C2 * C3, ...)
|
kvzalloc(
- (E1) * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
|
kvzalloc(
- (E1) * (E2) * E3
+ array3_size(E1, E2, E3)
, ...)
|
kvzalloc(
- (E1) * (E2) * (E3)
+ array3_size(E1, E2, E3)
, ...)
|
kvzalloc(
- E1 * E2 * E3
+ array3_size(E1, E2, E3)
, ...)
)
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument.
@@
expression THING, E1, E2;
type TYPE;
constant C1, C2, C3;
@@
(
kvzalloc(sizeof(THING) * C2, ...)
|
kvzalloc(sizeof(TYPE) * C2, ...)
|
kvzalloc(C1 * C2 * C3, ...)
|
kvzalloc(C1 * C2, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * (E2)
+ E2, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(TYPE) * E2
+ E2, sizeof(TYPE)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * (E2)
+ E2, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- sizeof(THING) * E2
+ E2, sizeof(THING)
, ...)
|
- kvzalloc
+ kvcalloc
(
- (E1) * E2
+ E1, E2
, ...)
|
- kvzalloc
+ kvcalloc
(
- (E1) * (E2)
+ E1, E2
, ...)
|
- kvzalloc
+ kvcalloc
(
- E1 * E2
+ E1, E2
, ...)
)
Signed-off-by: Kees Cook <keescook@chromium.org>
2018-06-12 21:04:48 +00:00
			kvcalloc(lpages, sizeof(*slot->arch.rmap[i]),
				 GFP_KERNEL);
		if (!slot->arch.rmap[i])
			goto out_free;
		if (i == 0)
			continue;

		linfo = kvcalloc(lpages, sizeof(*linfo), GFP_KERNEL);
		if (!linfo)
			goto out_free;

		slot->arch.lpage_info[i - 1] = linfo;

		if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
			linfo[0].disallow_lpage = 1;
		if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
			linfo[lpages - 1].disallow_lpage = 1;
		ugfn = slot->userspace_addr >> PAGE_SHIFT;
		/*
		 * If the gfn and userspace address are not aligned wrt each
		 * other, or if explicitly asked to, disable large page
		 * support for this slot
		 */
		if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1) ||
		    !kvm_largepages_enabled()) {
			unsigned long j;

			for (j = 0; j < lpages; ++j)
				linfo[j].disallow_lpage = 1;
		}
	}

	if (kvm_page_track_create_memslot(slot, npages))
		goto out_free;

	return 0;

out_free:
	for (i = 0; i < KVM_NR_PAGE_SIZES; ++i) {
		kvfree(slot->arch.rmap[i]);
		slot->arch.rmap[i] = NULL;
		if (i == 0)
			continue;

		kvfree(slot->arch.lpage_info[i - 1]);
		slot->arch.lpage_info[i - 1] = NULL;
	}
	return -ENOMEM;
}
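kvm_arch_create_memslot() sizes each level's arrays using gfn_to_index(). If the slot does not begin or end on a large-page boundary, it also marks the first or last large-page entry as unusable. The sketch below re-derives that arithmetic in userspace. It assumes 9 address bits per level, as on x86 (level 1 = 4 KiB, level 2 = 2 MiB, level 3 = 1 GiB). All names are hypothetical stand-ins for gfn_to_index(), KVM_HPAGE_GFN_SHIFT() and KVM_PAGES_PER_HPAGE().

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86 layout: 9 gfn bits per paging level. */
#define HPAGE_GFN_SHIFT(level) (((level) - 1) * 9)
#define PAGES_PER_HPAGE(level) (1ULL << HPAGE_GFN_SHIFT(level))

/* Index of @gfn's large page relative to @base_gfn's large page. */
static uint64_t gfn_index(uint64_t gfn, uint64_t base_gfn, int level)
{
	return (gfn >> HPAGE_GFN_SHIFT(level)) -
	       (base_gfn >> HPAGE_GFN_SHIFT(level));
}

/* Number of entries allocated per level for a slot of @npages pages. */
static uint64_t count_lpages(uint64_t base_gfn, uint64_t npages, int level)
{
	assert(level >= 1 && npages > 0);
	return gfn_index(base_gfn + npages - 1, base_gfn, level) + 1;
}

/* Head entry is only partly covered: linfo[0].disallow_lpage = 1. */
static int head_misaligned(uint64_t base_gfn, int level)
{
	return (base_gfn & (PAGES_PER_HPAGE(level) - 1)) != 0;
}

/* Tail entry is only partly covered: linfo[lpages - 1].disallow_lpage = 1. */
static int tail_misaligned(uint64_t base_gfn, uint64_t npages, int level)
{
	return ((base_gfn + npages) & (PAGES_PER_HPAGE(level) - 1)) != 0;
}
```

Take a slot at gfn 256 with 1024 pages. At level 2 it needs 3 entries: a partial head (gfns 256 to 511), one fully covered 2 MiB page (512 to 1023), and a partial tail (1024 to 1279). Entries 0 and 2 therefore get disallow_lpage = 1.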

void kvm_arch_memslots_updated(struct kvm *kvm, struct kvm_memslots *slots)
{
	/*
	 * memslots->generation has been incremented.
	 * mmio generation may have reached its maximum value.
	 */
	kvm_mmu_invalidate_mmio_sptes(kvm, slots);
}

int kvm_arch_prepare_memory_region(struct kvm *kvm,
				struct kvm_memory_slot *memslot,
				const struct kvm_userspace_memory_region *mem,
				enum kvm_mr_change change)
{
	return 0;
}

static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
				     struct kvm_memory_slot *new)
{
	/* Still write protect RO slot */
	if (new->flags & KVM_MEM_READONLY) {
		kvm_mmu_slot_remove_write_access(kvm, new);
		return;
	}

	/*
	 * Call kvm_x86_ops dirty logging hooks when they are valid.
	 *
	 * kvm_x86_ops->slot_disable_log_dirty is called when:
	 *
	 *  - KVM_MR_CREATE with dirty logging disabled
	 *  - KVM_MR_FLAGS_ONLY with dirty logging disabled in the new flags
	 *
	 * The reason is, in case of PML, we need to set the D-bit for any slot
	 * with dirty logging disabled in order to eliminate unnecessary GPA
	 * logging in the PML buffer (and a potential PML-buffer-full VMEXIT).
	 * This guarantees that leaving PML enabled during the guest's lifetime
	 * adds no overhead from PML while the guest is running with dirty
	 * logging disabled for its memory slots.
	 *
	 * kvm_x86_ops->slot_enable_log_dirty is called when switching the new
	 * slot to dirty logging mode.
	 *
	 * If the kvm_x86_ops dirty logging hooks are invalid, use write protect.
	 *
	 * In case of write protect:
	 *
	 * Write protect all pages for dirty logging.
	 *
	 * All the sptes including the large sptes which point to this
	 * slot are set to readonly. We cannot create any new large
	 * spte on this slot until the end of the logging.
	 *
	 * See the comments in fast_page_fault().
	 */
	if (new->flags & KVM_MEM_LOG_DIRTY_PAGES) {
		if (kvm_x86_ops->slot_enable_log_dirty)
			kvm_x86_ops->slot_enable_log_dirty(kvm, new);
		else
			kvm_mmu_slot_remove_write_access(kvm, new);
	} else {
		if (kvm_x86_ops->slot_disable_log_dirty)
			kvm_x86_ops->slot_disable_log_dirty(kvm, new);
	}
}
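The branches in kvm_mmu_slot_apply_flags() amount to a small decision table. Its inputs are the slot flags and whether the vendor module provides the dirty-logging hooks. The enum and function below are hypothetical. They do not perform the action; they only report which one the real code would take.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for the three calls the real function can make. */
enum slot_action {
	ACT_WRITE_PROTECT,	/* kvm_mmu_slot_remove_write_access() */
	ACT_ENABLE_LOG_HOOK,	/* kvm_x86_ops->slot_enable_log_dirty() */
	ACT_DISABLE_LOG_HOOK,	/* kvm_x86_ops->slot_disable_log_dirty() */
	ACT_NONE,
};

static enum slot_action slot_flags_action(bool readonly, bool log_dirty,
					  bool has_enable_hook,
					  bool has_disable_hook)
{
	if (readonly)
		return ACT_WRITE_PROTECT;	/* RO slots are always protected */
	if (log_dirty)
		return has_enable_hook ? ACT_ENABLE_LOG_HOOK : ACT_WRITE_PROTECT;
	return has_disable_hook ? ACT_DISABLE_LOG_HOOK : ACT_NONE;
}
```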

void kvm_arch_commit_memory_region(struct kvm *kvm,
				const struct kvm_userspace_memory_region *mem,
				const struct kvm_memory_slot *old,
				const struct kvm_memory_slot *new,
				enum kvm_mr_change change)
{
	int nr_mmu_pages = 0;

	if (!kvm->arch.n_requested_mmu_pages)
		nr_mmu_pages = kvm_mmu_calculate_mmu_pages(kvm);

	if (nr_mmu_pages)
		kvm_mmu_change_mmu_pages(kvm, nr_mmu_pages);

	/*
	 * Dirty logging tracks sptes in 4k granularity, meaning that large
	 * sptes have to be split. If live migration is successful, the guest
	 * in the source machine will be destroyed and large sptes will be
	 * created in the destination. However, if the guest continues to run
	 * in the source machine (for example if live migration fails), small
	 * sptes will remain around and cause bad performance.
	 *
	 * Scan sptes if dirty logging has been stopped, dropping those
	 * which can be collapsed into a single large-page spte. Later
	 * page faults will create the large-page sptes.
	 */
	if ((change != KVM_MR_DELETE) &&
	    (old->flags & KVM_MEM_LOG_DIRTY_PAGES) &&
	    !(new->flags & KVM_MEM_LOG_DIRTY_PAGES))
		kvm_mmu_zap_collapsible_sptes(kvm, new);

	/*
	 * Set up write protection and/or dirty logging for the new slot.
	 *
	 * For KVM_MR_DELETE and KVM_MR_MOVE, the shadow pages of the old slot
	 * have been zapped, so no dirty logging work is needed for the old
	 * slot. For KVM_MR_FLAGS_ONLY, the old slot is essentially the same
	 * as the new one and is covered when dealing with the new slot.
	 *
	 * FIXME: const-ify all uses of struct kvm_memory_slot.
	 */
	if (change != KVM_MR_DELETE)
		kvm_mmu_slot_apply_flags(kvm, (struct kvm_memory_slot *) new);
}
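kvm_arch_commit_memory_region() runs the collapsible-spte scan only when dirty logging goes from on to off for a slot that is not being deleted. The predicate below is hypothetical but uses the same condition. Its enum stands in for enum kvm_mr_change.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the KVM_MR_* change types. */
enum mr_change { MR_CREATE, MR_DELETE, MR_MOVE, MR_FLAGS_ONLY };

/* True exactly when kvm_mmu_zap_collapsible_sptes() would be called. */
static bool should_zap_collapsible(enum mr_change change,
				   bool old_log_dirty, bool new_log_dirty)
{
	return change != MR_DELETE && old_log_dirty && !new_log_dirty;
}
```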

void kvm_arch_flush_shadow_all(struct kvm *kvm)
{
	kvm_mmu_invalidate_zap_all_pages(kvm);
}

void kvm_arch_flush_shadow_memslot(struct kvm *kvm,
				   struct kvm_memory_slot *slot)
{
	kvm_page_track_flush_slot(kvm, slot);
}

static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
{
	if (!list_empty_careful(&vcpu->async_pf.done))
		return true;

	if (kvm_apic_has_events(vcpu))
		return true;

	if (vcpu->arch.pv.pv_unhalted)
		return true;

	if (vcpu->arch.exception.pending)
		return true;

	if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
	    (vcpu->arch.nmi_pending &&
	     kvm_x86_ops->nmi_allowed(vcpu)))
		return true;

	if (kvm_test_request(KVM_REQ_SMI, vcpu) ||
	    (vcpu->arch.smi_pending && !is_smm(vcpu)))
		return true;

	if (kvm_arch_interrupt_allowed(vcpu) &&
	    kvm_cpu_has_interrupt(vcpu))
		return true;

	if (kvm_hv_has_stimer_pending(vcpu))
		return true;

	return false;
}

int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
{
	return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
}

bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu)
{
	return vcpu->arch.preempted_in_kernel;
}

int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
{
	return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
}

int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu)
{
	return kvm_x86_ops->interrupt_allowed(vcpu);
}
2009-06-17 12:22:14 +00:00

2014-11-02 09:54:45 +00:00
unsigned long kvm_get_linear_rip(struct kvm_vcpu *vcpu)
2010-02-23 16:47:55 +00:00
{
2014-11-02 09:54:45 +00:00
	if (is_64_bit_mode(vcpu))
		return kvm_rip_read(vcpu);
	return (u32)(get_segment_base(vcpu, VCPU_SREG_CS) +
		     kvm_rip_read(vcpu));
}
EXPORT_SYMBOL_GPL(kvm_get_linear_rip);
2010-02-23 16:47:55 +00:00

2014-11-02 09:54:45 +00:00
bool kvm_is_linear_rip(struct kvm_vcpu *vcpu, unsigned long linear_rip)
{
	return kvm_get_linear_rip(vcpu) == linear_rip;
2010-02-23 16:47:55 +00:00
}
EXPORT_SYMBOL_GPL(kvm_is_linear_rip);

2009-10-18 11:24:44 +00:00
unsigned long kvm_get_rflags(struct kvm_vcpu *vcpu)
{
	unsigned long rflags;

	rflags = kvm_x86_ops->get_rflags(vcpu);
	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
2010-02-23 16:47:58 +00:00
		rflags &= ~X86_EFLAGS_TF;
2009-10-18 11:24:44 +00:00
	return rflags;
}
EXPORT_SYMBOL_GPL(kvm_get_rflags);

2014-03-27 10:29:28 +00:00
static void __kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
2009-10-18 11:24:44 +00:00
{
	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP &&
2010-02-23 16:47:55 +00:00
	    kvm_is_linear_rip(vcpu, vcpu->arch.singlestep_rip))
2010-02-23 16:47:58 +00:00
		rflags |= X86_EFLAGS_TF;
2009-10-18 11:24:44 +00:00
	kvm_x86_ops->set_rflags(vcpu, rflags);
2014-03-27 10:29:28 +00:00
}

void kvm_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
{
	__kvm_set_rflags(vcpu, rflags);
2010-07-27 09:30:24 +00:00
	kvm_make_request(KVM_REQ_EVENT, vcpu);
2009-10-18 11:24:44 +00:00
}
EXPORT_SYMBOL_GPL(kvm_set_rflags);

2010-10-17 16:13:42 +00:00
void kvm_arch_async_page_ready(struct kvm_vcpu *vcpu, struct kvm_async_pf *work)
{
	int r;

2010-12-07 02:35:25 +00:00
	if ((vcpu->arch.mmu.direct_map != work->arch.direct_map) ||
2013-10-14 14:22:33 +00:00
	      work->wakeup_all)
2010-10-17 16:13:42 +00:00
		return;

	r = kvm_mmu_reload(vcpu);
	if (unlikely(r))
		return;

2010-12-07 02:35:25 +00:00
	if (!vcpu->arch.mmu.direct_map &&
	      work->arch.cr3 != vcpu->arch.mmu.get_cr3(vcpu))
		return;

2010-10-17 16:13:42 +00:00
	vcpu->arch.mmu.page_fault(vcpu, work->gva, 0, true);
}

2010-10-14 09:22:46 +00:00
static inline u32 kvm_async_pf_hash_fn(gfn_t gfn)
{
	return hash_32(gfn & 0xffffffff, order_base_2(ASYNC_PF_PER_VCPU));
}

static inline u32 kvm_async_pf_next_probe(u32 key)
{
	return (key + 1) & (roundup_pow_of_two(ASYNC_PF_PER_VCPU) - 1);
}

static void kvm_add_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	u32 key = kvm_async_pf_hash_fn(gfn);

	while (vcpu->arch.apf.gfns[key] != ~0)
		key = kvm_async_pf_next_probe(key);

	vcpu->arch.apf.gfns[key] = gfn;
}

static u32 kvm_async_pf_gfn_slot(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	int i;
	u32 key = kvm_async_pf_hash_fn(gfn);

	for (i = 0; i < roundup_pow_of_two(ASYNC_PF_PER_VCPU) &&
2010-11-01 09:00:30 +00:00
		     (vcpu->arch.apf.gfns[key] != gfn &&
		      vcpu->arch.apf.gfns[key] != ~0); i++)
2010-10-14 09:22:46 +00:00
		key = kvm_async_pf_next_probe(key);

	return key;
}

bool kvm_find_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	return vcpu->arch.apf.gfns[kvm_async_pf_gfn_slot(vcpu, gfn)] == gfn;
}

static void kvm_del_async_pf_gfn(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	u32 i, j, k;

	i = j = kvm_async_pf_gfn_slot(vcpu, gfn);
	while (true) {
		vcpu->arch.apf.gfns[i] = ~0;
		do {
			j = kvm_async_pf_next_probe(j);
			if (vcpu->arch.apf.gfns[j] == ~0)
				return;
			k = kvm_async_pf_hash_fn(vcpu->arch.apf.gfns[j]);
			/*
			 * k lies cyclically in ]i,j]
			 * |    i.k.j |
			 * |....j i.k.| or  |.k..j i...|
			 */
		} while ((i <= j) ? (i < k && k <= j) : (i < k || k <= j));
		vcpu->arch.apf.gfns[i] = vcpu->arch.apf.gfns[j];
		i = j;
	}
}
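
/*
 * Worked example for kvm_del_async_pf_gfn() above (illustrative only:
 * the 8-slot table and the hash values are assumptions, not what
 * hash_32() actually produces).  Suppose gfns A and B both hash to
 * slot 2 and D hashes to slot 4, and they were added in that order:
 * kvm_add_async_pf_gfn() leaves A in slot 2, B in slot 3 (probed past
 * A) and D in its home slot 4.
 *
 * Deleting A: i = j = 2 and slot 2 is emptied.  j = 3 holds B with
 * k = 2, which is not in ]2,3], so B is moved back into slot 2 and
 * i becomes 3.  Slot 3 is emptied; j = 4 holds D with k = 4, which
 * lies in ]3,4], so D stays put.  Slot 5 is empty and the function
 * returns, leaving B in slot 2, slot 3 empty and D in slot 4; both
 * remain reachable from their home slots via kvm_async_pf_gfn_slot().
 */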

2010-10-14 09:22:53 +00:00
static int apf_put_user(struct kvm_vcpu *vcpu, u32 val)
{
2017-05-02 14:20:18 +00:00
	return kvm_write_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, &val,
				      sizeof(val));
2010-10-14 09:22:53 +00:00
}

2017-09-14 10:54:16 +00:00
static int apf_get_user(struct kvm_vcpu *vcpu, u32 *val)
{
	return kvm_read_guest_cached(vcpu->kvm, &vcpu->arch.apf.data, val,
				     sizeof(u32));
}

2010-10-14 09:22:46 +00:00
void kvm_arch_async_page_not_present(struct kvm_vcpu *vcpu,
				     struct kvm_async_pf *work)
{
2010-11-29 14:12:30 +00:00
	struct x86_exception fault;

2010-10-14 09:22:53 +00:00
	trace_kvm_async_pf_not_present(work->arch.token, work->gva);
2010-10-14 09:22:46 +00:00
	kvm_add_async_pf_gfn(vcpu, work->arch.gfn);
2010-10-14 09:22:53 +00:00

	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED) ||
2010-10-14 09:22:56 +00:00
	    (vcpu->arch.apf.send_user_only &&
	     kvm_x86_ops->get_cpl(vcpu) == 0))
2010-10-14 09:22:53 +00:00
		kvm_make_request(KVM_REQ_APF_HALT, vcpu);
	else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_NOT_PRESENT)) {
2010-11-29 14:12:30 +00:00
		fault.vector = PF_VECTOR;
		fault.error_code_valid = true;
		fault.error_code = 0;
		fault.nested_page_fault = false;
		fault.address = work->arch.token;
2017-07-14 01:30:41 +00:00
		fault.async_page_fault = true;
2010-11-29 14:12:30 +00:00
		kvm_inject_page_fault(vcpu, &fault);
2010-10-14 09:22:53 +00:00
	}
2010-10-14 09:22:46 +00:00
}

void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
				 struct kvm_async_pf *work)
{
2010-11-29 14:12:30 +00:00
	struct x86_exception fault;
2017-09-14 10:54:16 +00:00
	u32 val;
2010-11-29 14:12:30 +00:00

2013-10-14 14:22:33 +00:00
	if (work->wakeup_all)
2010-10-14 09:22:53 +00:00
		work->arch.token = ~0; /* broadcast wakeup */
	else
		kvm_del_async_pf_gfn(vcpu, work->arch.gfn);
2017-03-21 04:18:55 +00:00
	trace_kvm_async_pf_ready(work->arch.token, work->gva);
2010-10-14 09:22:53 +00:00

2017-09-14 10:54:16 +00:00
	if (vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED &&
	    !apf_get_user(vcpu, &val)) {
		if (val == KVM_PV_REASON_PAGE_NOT_PRESENT &&
		    vcpu->arch.exception.pending &&
		    vcpu->arch.exception.nr == PF_VECTOR &&
		    !apf_put_user(vcpu, 0)) {
			vcpu->arch.exception.injected = false;
			vcpu->arch.exception.pending = false;
			vcpu->arch.exception.nr = 0;
			vcpu->arch.exception.has_error_code = false;
			vcpu->arch.exception.error_code = 0;
		} else if (!apf_put_user(vcpu, KVM_PV_REASON_PAGE_READY)) {
			fault.vector = PF_VECTOR;
			fault.error_code_valid = true;
			fault.error_code = 0;
			fault.nested_page_fault = false;
			fault.address = work->arch.token;
			fault.async_page_fault = true;
			kvm_inject_page_fault(vcpu, &fault);
		}
2010-10-14 09:22:53 +00:00
	}
2010-11-01 09:01:28 +00:00
	vcpu->arch.apf.halted = false;
2012-05-03 08:36:39 +00:00
	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
2010-10-14 09:22:53 +00:00
}

bool kvm_arch_can_inject_async_page_present(struct kvm_vcpu *vcpu)
{
	if (!(vcpu->arch.apf.msr_val & KVM_ASYNC_PF_ENABLED))
		return true;
	else
2017-06-09 03:13:40 +00:00
		return kvm_can_do_async_pf(vcpu);
2010-10-14 09:22:46 +00:00
}

2015-07-07 13:41:58 +00:00
void kvm_arch_start_assignment(struct kvm *kvm)
{
	atomic_inc(&kvm->arch.assigned_device_count);
}
EXPORT_SYMBOL_GPL(kvm_arch_start_assignment);

void kvm_arch_end_assignment(struct kvm *kvm)
{
	atomic_dec(&kvm->arch.assigned_device_count);
}
EXPORT_SYMBOL_GPL(kvm_arch_end_assignment);

bool kvm_arch_has_assigned_device(struct kvm *kvm)
{
	return atomic_read(&kvm->arch.assigned_device_count);
}
EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);

2013-10-30 17:02:30 +00:00
void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
{
	atomic_inc(&kvm->arch.noncoherent_dma_count);
}
EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);

void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
{
	atomic_dec(&kvm->arch.noncoherent_dma_count);
}
EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);

bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
{
	return atomic_read(&kvm->arch.noncoherent_dma_count);
}
EXPORT_SYMBOL_GPL(kvm_arch_has_noncoherent_dma);

2016-05-05 17:58:35 +00:00
bool kvm_arch_has_irq_bypass(void)
{
	return kvm_x86_ops->update_pi_irte != NULL;
}

2015-09-18 14:29:40 +00:00
int kvm_arch_irq_bypass_add_producer(struct irq_bypass_consumer *cons,
				      struct irq_bypass_producer *prod)
{
	struct kvm_kernel_irqfd *irqfd =
		container_of(cons, struct kvm_kernel_irqfd, consumer);

2016-05-05 17:58:35 +00:00
	irqfd->producer = prod;
2015-09-18 14:29:40 +00:00

2016-05-05 17:58:35 +00:00
	return kvm_x86_ops->update_pi_irte(irqfd->kvm,
					   prod->irq, irqfd->gsi, 1);
2015-09-18 14:29:40 +00:00
}

void kvm_arch_irq_bypass_del_producer(struct irq_bypass_consumer *cons,
				      struct irq_bypass_producer *prod)
{
	int ret;
	struct kvm_kernel_irqfd *irqfd =
		container_of(cons, struct kvm_kernel_irqfd, consumer);

	WARN_ON(irqfd->producer != prod);
	irqfd->producer = NULL;

	/*
	 * When the producer of this consumer is unregistered, we switch back
	 * to remapped mode, so we can reuse the current implementation
2016-05-21 12:14:44 +00:00
	 * when the irq is masked/disabled or the consumer side (KVM
2015-09-18 14:29:40 +00:00
	 * in this case) doesn't want to receive the interrupts.
	 */
	ret = kvm_x86_ops->update_pi_irte(irqfd->kvm, prod->irq, irqfd->gsi, 0);
	if (ret)
		printk(KERN_INFO "irq bypass consumer (token %p) unregistration"
				" fails: %d\n", irqfd->consumer.token, ret);
}

int kvm_arch_update_irqfd_routing(struct kvm *kvm, unsigned int host_irq,
				   uint32_t guest_irq, bool set)
{
	if (!kvm_x86_ops->update_pi_irte)
		return -EINVAL;

	return kvm_x86_ops->update_pi_irte(kvm, host_irq, guest_irq, set);
}

2016-01-25 08:53:33 +00:00
bool kvm_vector_hashing_enabled(void)
{
	return vector_hashing;
}
EXPORT_SYMBOL_GPL(kvm_vector_hashing_enabled);

2009-06-17 12:22:14 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
2015-09-15 06:41:58 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
2009-06-17 12:22:14 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
2009-10-09 14:08:27 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
2009-10-09 14:08:28 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
2009-10-09 14:08:29 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
2009-10-09 14:08:30 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intr_vmexit);
2009-10-09 14:08:31 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_invlpga);
2009-10-09 14:08:32 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_skinit);
2010-02-24 17:59:14 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intercepts);
2013-06-12 07:43:44 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
2014-08-21 16:08:09 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window);
2015-01-28 02:54:28 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
2015-09-18 14:29:51 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
2016-05-04 19:09:48 +00:00
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);