This makes it possible to release mmu_lock and reschedule conditionally
in a later patch. Although this may increase the time needed to protect
the whole slot when we start dirty logging, the kernel should not allow
the userspace to trigger something that will hold a spinlock for such a
long time as tens of milliseconds: actually there is no limit since it
is roughly proportional to the number of guest pages.
Another point to note is that this patch removes the only user of
slot_bitmap which will cause some problems when we increase the number
of slots further.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
No longer need to care about the mapping level in this function.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Calling kvm_mmu_slot_remove_write_access() for a deleted slot does
nothing but search for non-existent mmu pages which have mappings to
that deleted memory; this is safe but a waste of time.
Since we want to make the function rmap based in a later patch, in a
manner which makes it unsafe to be called for a deleted slot, we makes
the caller see if the slot is non-zero and being dirty logged.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
We have two issues in current code:
- if target gfn is used as its page table, guest will refault then kvm will use
small page size to map it. We need two #PF to fix its shadow page table
- sometimes, say a exception is triggered during vm-exit caused by #PF
(see handle_exception() in vmx.c), we remove all the shadow pages shadowed
by the target gfn before go into page fault path, it will cause infinite
loop:
delete shadow pages shadowed by the gfn -> try to use large page size to map
the gfn -> retry the access ->...
To fix these, we can adjust page size early if the target gfn is used as page
table
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the write-fault access is from supervisor and CR0.WP is not set on the
vcpu, kvm will fix it by adjusting pte access - it sets the W bit on pte
and clears U bit. This is the chance that kvm can change pte access from
readonly to writable
Unfortunately, the pte access is the access of 'direct' shadow page table,
means direct sp.role.access = pte_access, then we will create a writable
spte entry on the readonly shadow page table. It will cause Dirty bit is
not tracked when two guest ptes point to the same large page. Note, it
does not have other impact except Dirty bit since cr0.wp is encoded into
sp.role
It can be fixed by adjusting pte access before establishing shadow page
table. Also, after that, no mmu specified code exists in the common function
and drop two parameters in set_spte
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Opcodes:
TEST
CMP
ADD
ADC
SUB
SBB
XOR
OR
AND
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
We emulate arithmetic opcodes by executing a "similar" (same operation,
different operands) on the cpu. This ensures accurate emulation, esp. wrt.
eflags. However, the prologue and epilogue around the opcode is fairly long,
consisting of a switch (for the operand size) and code to load and save the
operands. This is repeated for every opcode.
This patch introduces an alternative way to emulate arithmetic opcodes.
Instead of the above, we have four (three on i386) functions consisting
of just the opcode and a ret; one for each operand size. For example:
.align 8
em_notb:
not %al
ret
.align 8
em_notw:
not %ax
ret
.align 8
em_notl:
not %eax
ret
.align 8
em_notq:
not %rax
ret
The prologue and epilogue are shared across all opcodes. Note the functions
use a special calling convention; notably eflags is an input/output parameter
and is not clobbered. Rather than dispatching the four functions through a
jump table, the functions are declared as a constant size (8) so their address
can be calculated.
Acked-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi.kivity@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
CPL is always 0 when in real mode, and always 3 when virtual 8086 mode.
Using values other than those can cause failures on operations that
check CPL.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
MMU code tries to avoid if()s HW is not able to predict reliably by using
bitwise operation to streamline code execution, but in case of a dirty bit
folding this gives us nothing since write_fault is checked right before
the folding code. Lets just piggyback onto the if() to make code more clear.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
With emulate_invalid_guest_state=0 if a vcpu is in real mode VMX can
enter the vcpu with smaller segment limit than guest configured. If the
guest tries to access pass this limit it will get #GP at which point
instruction will be emulated with correct segment limit applied. If
during the emulation IO is detected it is not handled correctly. Vcpu
thread should exit to userspace to serve the IO, but it returns to the
guest instead. Since emulation is not completed till userspace completes
the IO the faulty instruction is re-executed ad infinitum.
The patch fixes that by exiting to userspace if IO happens during
instruction emulation.
Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Segment registers will be fixed according to current emulation policy
during switching to real mode for the first time.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently when emulation of invalid guest state is enable
(emulate_invalid_guest_state=1) segment registers are still fixed for
entry to vm86 mode some times. Segment register fixing is avoided in
enter_rmode(), but vmx_set_segment() still does it unconditionally.
The patch fixes it.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Currently it allows entering vm86 mode if segment limit is greater than
0xffff and db bit is set. Both of those can cause incorrect execution of
instruction by cpu since in vm86 mode limit will be set to 0xffff and db
will be forced to 0.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
According to Intel SDM Vol3 Section 5.5 "Privilege Levels" and 5.6
"Privilege Level Checking When Accessing Data Segments" RPL checking is
done during loading of a segment selector, not during data access. We
already do checking during segment selector loading, so drop the check
during data access. Checking RPL during data access triggers #GP if
after transition from real mode to protected mode RPL bits in a segment
selector are set.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Segment descriptor's base is fixed by call to fix_rmode_seg(). Not need
to do it twice.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The code for SS and CS does the same thing fix_rmode_seg() is doing.
Use it instead of hand crafted code.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
VMX without unrestricted mode cannot virtualize real mode, so if
emulate_invalid_guest_state=0 kvm uses vm86 mode to approximate
it. Sometimes, when guest moves from protected mode to real mode, it
leaves segment descriptors in a state not suitable for use by vm86 mode
virtualization, so we keep shadow copy of segment descriptors for internal
use and load fake register to VMCS for guest entry to succeed. Till
now we kept shadow for all segments except SS and CS (for SS and CS we
returned parameters directly from VMCS), but since commit a5625189f6
emulator enforces segment limits in real mode. This causes #GP during move
from protected mode to real mode when emulator fetches first instruction
after moving to real mode since it uses incorrect CS base and limit to
linearize the %rip. Fix by keeping shadow for SS and CS too.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
rmode_segment_valid() checks if segment descriptor can be used to enter
vm86 mode. VMX spec mandates that in vm86 mode CS register will be of
type data, not code. Lets allow guest entry with vm86 mode if the only
problem with CS register is incorrect type. Otherwise entire real mode
will be emulated.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Set segment fields explicitly instead of using binary operations.
No behaviour changes.
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
The kvm i8254 emulation for counter 0 (but not for counters 1 and 2)
has at least two bugs in mode 0:
1. The OUT bit, computed by pit_get_out(), is never set high.
2. The counter value, computed by pit_get_count(), wraps back around to
the initial counter value, rather than wrapping back to 0xFFFF
(which is the behavior described in the comment in __kpit_elapsed,
the behavior implemented by qemu, and the behavior observed on AMD
hardware).
The bug stems from __kpit_elapsed computing the elapsed time mod the
initial counter value (stored as nanoseconds in ps->period). This is both
unnecessary (none of the callers of kpit_elapsed expect the value to be
at most the initial counter value) and incorrect (it causes pit_get_count
to appear to wrap around to the initial counter value rather than 0xFFFF).
Removing this mod from __kpit_elapsed fixes both of the above bugs.
Signed-off-by: Nickolai Zeldovich <nickolai@csail.mit.edu>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM. One is
the user accessible slots and the other is user + private. Make this
more obvious.
Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
According to Intel SDM Volume 3 Section 10.8.1 "Interrupt Handling with
the Pentium 4 and Intel Xeon Processors" and Section 10.8.2 "Interrupt
Handling with the P6 Family and Pentium Processors" ExtINT interrupts are
sent directly to the processor core for handling. Currently KVM checks
APIC before it considers ExtINT interrupts for injection which is
backwards from the spec. Make code behave according to the SDM.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: "Zhang, Yang Z" <yang.z.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
MOV immediate instruction (opcodes 0xB8-0xBF) may take 64-bit operand.
The previous emulation implementation assumes the operand is no longer than 32.
Adding OpImm64 for this matter.
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=881579
Signed-off-by: Nadav Amit <nadav.amit@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Pull KVM updates from Marcelo Tosatti:
"Considerable KVM/PPC work, x86 kvmclock vsyscall support,
IA32_TSC_ADJUST MSR emulation, amongst others."
Fix up trivial conflict in kernel/sched/core.c due to cross-cpu
migration notifier added next to rq migration call-back.
* tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (156 commits)
KVM: emulator: fix real mode segment checks in address linearization
VMX: remove unneeded enable_unrestricted_guest check
KVM: VMX: fix DPL during entry to protected mode
x86/kexec: crash_vmclear_local_vmcss needs __rcu
kvm: Fix irqfd resampler list walk
KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
KVM: MMU: optimize for set_spte
KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
KVM: PPC: bookehv: Add guest computation mode for irq delivery
KVM: PPC: Make EPCR a valid field for booke64 and bookehv
KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation
KVM: PPC: e500: Add emulation helper for getting instruction ea
KVM: PPC: bookehv64: Add support for interrupt handling
KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler
KVM: PPC: booke: Fix get_tb() compile error on 64-bit
KVM: PPC: e500: Silence bogus GCC warning in tlb code
...
In real mode CS register is writable, so do not #GP on write.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If enable_unrestricted_guest is true vmx->rmode.vm86_active will
always be false.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
On CPUs without support for unrestricted guests DPL cannot be smaller
than RPL for data segments during guest entry, but this state can occurs
if a data segment selector changes while vcpu is in real mode to a value
with lowest two bits != 00. Fix that by forcing DPL == RPL on transition
to protected mode.
This is a regression introduced by c865c43de6.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The vmclear function will be assigned to the callback function pointer
when loading kvm-intel module. And the bitmap indicates whether we
should do VMCLEAR operation in kdump. The bits in the bitmap are
set/unset according to different conditions.
Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
There are two cases we need to adjust page size in set_spte:
1): the one is other vcpu creates new sp in the window between mapping_level()
and acquiring mmu-lock.
2): the another case is the new sp is created by itself (page-fault path) when
guest uses the target gfn as its page table.
In current code, set_spte drop the spte and emulate the access for these case,
it works not good:
- for the case 1, it may destroy the mapping established by other vcpu, and
do expensive instruction emulation.
- for the case 2, it may emulate the access even if the guest is accessing
the page which not used as page table. There is a example, 0~2M is used as
huge page in guest, in this huge page, only page 3 used as page table, then
guest read/writes on other pages can cause instruction emulation.
Both of these cases can be fixed by allowing guest to retry the access, it
will refault, then we can establish the mapping by using small page
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
VMX behaves now as SVM wrt to FPU initialization. Code has been moved to
generic code path. General-purpose registers are now cleared on reset and
INIT. SVM code properly initializes EDX.
Signed-off-by: Julian Stecklina <jsteckli@os.inf.tu-dresden.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Bit24 in VMX_EPT_VPID_CAP_MASI is not used for address-specific invalidation capability
reporting, so remove it from KVM to avoid conflicts in future.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Bit 6 in EPT vmexit's exit qualification is not defined in SDM, so remove it.
Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported
Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the "use TSC counter offsetting" VM-execution control
is enabled as well as the IA32_TSC_ADJUST control.
However, because hardware will only return the TSC + IA32_TSC_ADJUST +
vmsc tsc_offset for a guest process when it does and rdtsc (with the correct
settings) the value of our virtualized IA32_TSC_ADJUST must be stored in one
of these three locations. The argument against storing it in the actual MSR
is performance. This is likely to be seldom used while the save/restore is
required on every transition. IA32_TSC_ADJUST was created as a way to solve
some issues with writing TSC itself so that is not an option either.
The remaining option, defined above as our solution has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. However, more problematic is that storing the
data in vmcs tsc_offset will have a different semantic effect on the system
than does using the actual MSR. This is illustrated in the following example:
The hypervisor set the IA32_TSC_ADJUST, then the guest sets it and a guest
process performs a rdtsc. In this case the guest process will get
TSC + IA32_TSC_ADJUST_hyperviser + vmsc tsc_offset including
IA32_TSC_ADJUST_guest. While the total system semantics changed the semantics
as seen by the guest do not and hence this will not cause a problem.
Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.
The initial use for this capability is for updating the IA32_TSC_ADJUST msr
while setting the tsc value. It is anticipated that this capability is
useful for other tasks.
Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
vmcs->cpu indicates whether it exists on the target cpu, -1 means the vmcs
does not exist on any vcpu
If vcpu load vmcs with vmcs.cpu = -1, it can be directly added to cpu's percpu
list. The list can be corrupted if the cpu prefetch the vmcs's list before
reading vmcs->cpu. Meanwhile, we should remove vmcs from the list before
making vmcs->vcpu == -1 be visible
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
In loaded_vmcs_clear, loaded_vmcs->cpu is the fist parameter passed to
smp_call_function_single, if the target cpu is downing (doing cpu hot remove),
loaded_vmcs->cpu can become -1 then -1 is passed to smp_call_function_single
It can be triggered when vcpu is being destroyed, loaded_vmcs_clear is called
in the preemptionable context
Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>