No TLB flush is needed when there's no valid rmap in the memory slot.
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Removes some functions that are not used anywhere:
cpu_has_vmx_eptp_writeback() cpu_has_vmx_eptp_uncacheable()
This was partially found by using a static code analysis program called cppcheck.
Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We now have a generic function that does most of the work of
kvm_vm_ioctl_get_dirty_log; use it.
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Mario Smarduch <m.smarduch@samsung.com>
When emulating an instruction that reads the destination memory operand (i.e.,
instructions without the Mov flag in the emulator), the operand is first read.
If a page-fault is detected in this phase, the error code that would be
delivered to the VM does not indicate that the access that caused the exception
was a write. This does not conform to real hardware, and may cause the VM
to enter the page-fault handler twice for no reason (once for read, once for
write).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When software changes the D bit (either from 1 to 0, or 0 to 1), the
corresponding TLB entry in the hardware won't be updated immediately. We
should flush it to guarantee the consistency of the D bit between the TLB and
the MMU page table in memory. This is especially important when clearing
the D bit, since a stale TLB entry may cause false negatives in reporting
dirtiness.
A sanity test was done on my machine with an Intel processor.
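A minimal sketch of the check this implies at SPTE-update time, assuming
mmu.c's shadow_accessed_mask/shadow_dirty_mask; the helper name is
illustrative, and per the note below the A bit is covered as well:

    static bool spte_ad_changed(u64 old_spte, u64 new_spte)
    {
            u64 ad_mask = shadow_accessed_mask | shadow_dirty_mask;

            /* flush the TLB if software toggled the A or D bit */
            return (old_spte & ad_mask) != (new_spte & ad_mask);
    }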
Signed-off-by: Kai Huang <kai.huang@linux.intel.com>
[Check A bit too. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a function, kvm_vcpu_set_pending_timer, instead of calling
kvm_make_request directly in lapic.c.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When an access to a descriptor in the LDT/GDT wraps around outside long mode,
the address of the descriptor should be truncated to 32 bits. Citing Intel SDM
2.1.1.1 "Global and Local Descriptor Tables in IA-32e Mode": "GDTR and LDTR
registers are expanded to 64-bits wide in both IA-32e sub-modes (64-bit mode
and compatibility mode)."
So in all other cases, we need to truncate. Create a new function that returns
a pointer to the descriptor table, to avoid too much code duplication.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
[Wrap 64-bit check with #ifdef CONFIG_X86_64, to avoid a "right shift count
>= width of type" warning and consequent undefined behavior. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a segment is loaded, the segment access bit is set unconditionally. In
fact, it should be set conditionally, based on whether the segment had the
accessed bit set before. Besides correctness, this can improve performance.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to Intel SDM: "If the ESP register is used as a base register for
addressing a destination operand in memory, the POP instruction computes the
effective address of the operand after it increments the ESP register."
The current emulation does not behave this way. The fix requires spending
another of the precious instruction flags and checking the flag in decode_modrm.
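A sketch of what the check in decode_modrm amounts to; the IncSP flag name is
illustrative, not necessarily the one the patch adds:

    /*
     * If the instruction pops into memory addressed via RSP/ESP, the
     * effective address must be computed as if RSP had already been
     * incremented by the operand size.
     */
    if ((ctxt->d & IncSP) && base_reg == VCPU_REGS_RSP)
            modrm_ea += ctxt->op_bytes;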
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, if em_call_far fails it returns success instead of the resulting
error-code. Fix it.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM emulator does not emulate JMP and CALL that target a call gate or a
task gate. This patch does not try to implement these scenarios, as they are
presumably rare; instead it returns the X86EMUL_UNHANDLEABLE error in such
cases rather than generating an exception.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Since the operand size of fnstcw and fnstsw is updated during execution,
the emulation may cause spurious exceptions, as it reads memory beforehand.
Mark these instructions as Mov (since the previous value is ignored) and
DstMem16 to simplify the setting of the operand size.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Although pop sreg updates RSP according to the operand size, only 2 bytes are
read. The current behavior may result in incorrect #GP or #PF exceptions.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Because ASSERT is just a printk, these would oops right away.
The assertion thus hardly adds anything.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The initialization function in mmu.c can always use walk_mmu, which
is known to be vcpu->arch.mmu. Only init_kvm_nested_mmu is used to
initialize vcpu->arch.nested_mmu.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a tracepoint to wait_lapic_expire.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[Remind reader if early or late. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For the hrtimer which emulates the tscdeadline timer in the guest,
add an option to advance expiration, and busy spin on VM-entry waiting
for the actual expiration time to elapse.
This allows achieving low latencies in cyclictest (or any scenario
that requires strict timing regarding timer expiration).
It reduces average cyclictest latency from 12us to 8us
on a Core i5 desktop.
Note: this option requires tuning to find the appropriate value
for a particular hardware/guest combination. One method is to measure the
average delay between apic_timer_fn and VM-entry.
Another method is to start with 1000ns, and increase the value
in say 500ns increments until avg cyclictest numbers stop decreasing.
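A minimal sketch of the busy wait done just before VM-entry; kvm_read_l1_tsc
converts a host TSC reading into guest time, and the bookkeeping around the
advance value is omitted:

    void wait_lapic_expire(struct kvm_vcpu *vcpu)
    {
            u64 tsc_deadline = vcpu->arch.apic->lapic_timer.tscdeadline;

            /* spin until the deadline the guest programmed has passed */
            while (kvm_read_l1_tsc(vcpu, native_read_tsc()) < tsc_deadline)
                    cpu_relax();
    }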
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_x86_ops->test_posted_interrupt() returns true/false depending
on whether 'vector' is set.
Next patch makes use of this interface.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In most cases when calling hwapic_isr_update(), we first check if
kvm_apic_vid_enabled() == 1, but actually:
kvm_apic_vid_enabled()
-> kvm_x86_ops->vm_has_apicv()
-> vmx_vm_has_apicv() (or 0 in the svm case)
-> return enable_apicv && irqchip_in_kernel(kvm)
So it costs a little to recall vmx_vm_has_apicv() inside
hwapic_isr_update(). Instead, just NULL out hwapic_isr_update() in
the !enable_apicv case inside hardware_setup(), then make all
related code follow suit. Note that we don't also check
irqchip_in_kernel() here, since we must make sure that no
caller can work without an in-kernel irqchip.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove FIXME comments about needing fault addresses to be returned. These
are propagated from walk_addr_generic to gva_to_gpa and from there to
ops->read_std and ops->write_std.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When generating a #PF VM-exit, check the equality:
(PFEC & PFEC_MASK) == PFEC_MATCH
If they are equal, bit 14 of the exception bitmap is used to decide whether to
generate the #PF VM-exit; if they are unequal, the inverted bit is used.
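A sketch of the decision, using the vmcs12 field names for the mask and match
values (the function name is illustrative):

    static bool nested_vmx_pf_causes_vmexit(struct vmcs12 *vmcs12, u32 pfec)
    {
            bool match = (pfec & vmcs12->page_fault_error_code_mask) ==
                         vmcs12->page_fault_error_code_match;
            bool bit14 = vmcs12->exception_bitmap & (1u << PF_VECTOR);

            /* bit 14 is used directly on a match, inverted otherwise */
            return match ? bit14 : !bit14;
    }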
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch improves the checks required by the Intel Software Developer Manual:
- SMM MSRs are not allowed.
- microcode MSRs are not allowed.
- check x2apic MSRs only when LAPIC is in x2apic mode.
- MSR switch areas must be aligned to 16 bytes.
- address of first and last byte in MSR switch areas should not set any bits
beyond the processor's physical-address width.
It also adds warning messages for failures during the MSR switch. These
messages are useful for people who debug their VMMs in nVMX.
Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Several hypervisors need the MSR auto load/restore feature.
We read MSRs from the VM-entry MSR-load area, which is specified by L1,
and load them via kvm_set_msr in the nested entry.
When a nested exit occurs, we get MSRs via kvm_get_msr, writing
them to L1's MSR-store area. After this, we read MSRs from the VM-exit
MSR-load area, and load them via kvm_set_msr.
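A rough sketch of the VM-entry MSR-load walk; struct vmx_msr_entry is the
hardware-defined {index, reserved, value} layout, and the validity checks from
the previous patch are omitted:

    static u32 nested_vmx_load_msr(struct kvm_vcpu *vcpu, u64 gpa, u32 count)
    {
            struct vmx_msr_entry e;
            struct msr_data msr;
            u32 i;

            msr.host_initiated = false;
            for (i = 0; i < count; i++) {
                    if (kvm_read_guest(vcpu->kvm, gpa + i * sizeof(e),
                                       &e, sizeof(e)))
                            return i + 1;   /* failing entry, 1-based */
                    msr.index = e.index;
                    msr.data = e.value;
                    if (kvm_set_msr(vcpu, &msr))
                            return i + 1;
            }
            return 0;                       /* success */
    }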
Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SRCU does not need to be compiled in by default in all cases. For tinification
efforts, not compiling SRCU unless necessary is desirable.
The current patch makes compiling SRCU optional by introducing a new Kconfig
option, CONFIG_SRCU, which is selected when any of the components making
use of SRCU are selected.
If we do not select CONFIG_SRCU, srcu.o will not be compiled at all.
text data bss dec hex filename
2007 0 0 2007 7d7 kernel/rcu/srcu.o
Size of arch/powerpc/boot/zImage changes from
text data bss dec hex filename
831552 64180 23944 919676 e087c arch/powerpc/boot/zImage : before
829504 64180 23952 917636 e0084 arch/powerpc/boot/zImage : after
so the savings are about 2000 bytes.
Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
CC: Josh Triplett <josh@joshtriplett.org>
CC: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
[ paulmck: resolve conflict due to removal of arch/ia64/kvm/Kconfig. ]
Since most virtual machines raise this message once, it is a bit annoying.
Make it KERN_DEBUG severity.
Cc: stable@vger.kernel.org
Fixes: 7a2e8aaf0f
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The commit 34a1cd60d1, "x86: vmx: move some vmx setting from
vmx_init() to hardware_setup()", tried to refactor some code
specific to VMX hardware setup into hardware_setup(), but some
MSR writes depend on settings established earlier, like
enable_apicv, enable_ept and so on.
Reported-by: Jamie Heilman <jamie@audible.transient.net>
Tested-by: Jamie Heilman <jamie@audible.transient.net>
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- spring cleaning: removed support for IA64, and for hardware-assisted
virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken because
the (non-KVM) XSAVES patches inadvertently changed the KVM userspace
ABI whenever XSAVES was enabled; hence, this part is going to stable.
Guest support is just a matter of exposing the feature and CPUID leaves
support.
Right now KVM is broken for PPC BookE in your tree (doesn't compile).
I'll reply to the pull request with a patch, please apply it either
before the pull request or in the merge commit, in order to preserve
bisectability somewhat.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJUkpg+AAoJEL/70l94x66DUmoH/jzXYkptSW9NGgm79KqxGJlD
lzLnLBkitVvx++Mz5YBhdJEhKKLUlCtifFT1zPJQ/pthQhIRSaaAwZyNGgUs5w5x
yMGKHiPQFyZRbmQtZhCInW0BftJoYHHciO3nUfHCZnp34My9MP2D55W7/z+fYFfQ
DuqBSE9ThyZJtZ4zh8NRA9fCOeuqwVYRyoBs820Wbsh4cpIBoIK63Dg7k+CLE+ZV
MZa/mRL6bAfsn9W5bnOUAgHJ3SPznnWbO3/g0aV+roL/5pffblprJx9lKNR08xUM
6hDFLop2gDehDJesDkY/o8Ckp1hEouvfsVpSShry4vcgtn0hgh2O5/6Orbmj6vE=
=Zwq1
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM update from Paolo Bonzini:
"3.19 changes for KVM:
- spring cleaning: removed support for IA64, and for hardware-
assisted virtualization on the PPC970
- ARM, PPC, s390 all had only small fixes
For x86:
- small performance improvements (though only on weird guests)
- usual round of hardware-compliancy fixes from Nadav
- APICv fixes
- XSAVES support for hosts and guests. XSAVES hosts were broken
because the (non-KVM) XSAVES patches inadvertently changed the KVM
userspace ABI whenever XSAVES was enabled; hence, this part is
going to stable. Guest support is just a matter of exposing the
feature and CPUID leaves support"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (179 commits)
KVM: move APIC types to arch/x86/
KVM: PPC: Book3S: Enable in-kernel XICS emulation by default
KVM: PPC: Book3S HV: Improve H_CONFER implementation
KVM: PPC: Book3S HV: Fix endianness of instruction obtained from HEIR register
KVM: PPC: Book3S HV: Remove code for PPC970 processors
KVM: PPC: Book3S HV: Tracepoints for KVM HV guest interactions
KVM: PPC: Book3S HV: Simplify locking around stolen time calculations
arch: powerpc: kvm: book3s_paired_singles.c: Remove unused function
arch: powerpc: kvm: book3s_pr.c: Remove unused function
arch: powerpc: kvm: book3s.c: Remove some unused functions
arch: powerpc: kvm: book3s_32_mmu.c: Remove unused function
KVM: PPC: Book3S HV: Check wait conditions before sleeping in kvmppc_vcore_blocked
KVM: PPC: Book3S HV: ptes are big endian
KVM: PPC: Book3S HV: Fix inaccuracies in ICP emulation for H_IPI
KVM: PPC: Book3S HV: Fix KSM memory corruption
KVM: PPC: Book3S HV: Fix an issue where guest is paused on receiving HMI
KVM: PPC: Book3S HV: Fix computation of tlbie operand
KVM: PPC: Book3S HV: Add missing HPTE unlock
KVM: PPC: BookE: Improve irq inject tracepoint
arm/arm64: KVM: Require in-kernel vgic for the arch timers
...
Second round of changes for KVM for arm/arm64 for v3.19; fixes reboot
problems, clarifies VCPU init, and fixes a regression concerning the
VGIC init flow.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUjsVhAAoJEEtpOizt6ddy5rIH/1V/YVwhprC55YqdHelU9Qu2
Muzsx+7F71NxC7xgMGFqPD1YrPR+hxvoPhy+ADOBlvcqlolrkDnV9I+8e3geaYNc
nZ/yEnoGTtbAggiS1smx7usBv34Z88Sd5txNjmj1cmHBy+VOWlyidWMkGBTsfBRe
mVc61BDUfyC47udgRHXhwS80sbHLJHElmADisFOVmQNBYwwiHiTdx0hMBMnHcC3Y
/3T0tKxHdeTISnmA+J+n7TcChtTIM4xqC6kwf3rw3b7XX8gdtTKylDHX2GLAg646
RdebAG2twmGpIc6SxXZbo38f3oY9OFo1Le5xZGa6iUjD56VDw/e4wg4iA2juo0Y=
=J2Ut
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-for-3.19-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
Second round of changes for KVM for arm/arm64 for v3.19; fixes reboot
problems, clarifies VCPU init, and fixes a regression concerning the
VGIC init flow.
Conflicts:
arch/ia64/kvm/kvm-ia64.c [deleted in HEAD and modified in kvmarm]
commit d50eaa1803 ("KVM: x86: Perform limit checks when assigning EIP")
mistakenly used zero as the CPL in em_ret_far. Use the actual one.
Fixes: d50eaa1803
Cc: stable@vger.kernel.org
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If L0 has disabled EPT, don't advertise unrestricted
mode at all since it depends on EPT to run real mode code.
Fixes: 92fbc7b195
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There was a lot of clean ups and minor fixes. One of those clean ups was
to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and uses trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to the
seq_file code as well in another tree.
Some of the other goodies include:
o Added some "!" (NOT) logic to the tracing filter.
o Fixed the frame pointer logic to the x86_64 mcount trampolines
o Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated
and be called directly by functions that only have a single hook
to them.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJUhbLGAAoJEEjnJuOKh9ldRV4H/3NcLbgGB2iu96la1zdYE6pG
Q7cDJMxXK80YIIL70h9G0IItcD4t62LMb72lfBnMGRj3msgFb3AgISW57EuI0Pxk
xk24wuIPoTG2S7v9sc3SboNFwO8qbtIjxD2OBmqIUrGo2sZIiGjyj3gX7mCY3uzL
WB2bUOSFz/22OgaANinR5EELHA3pZZCf54Vz1K9ndmtK0xp0j1a7xJShD6TrMdYv
mZ3zH5ViIhW4A3mdcMceh6fy2JLQAiEKF0uPTvcMMz7NlVul0mxyL/+10P7AE/3R
Ehw4fzmm4NDshPDtBOkKH0LsppgXzuItFuQUTpact3JlqTg++bV6onSsrkt1hlY=
=Z7Cm
-----END PGP SIGNATURE-----
Merge tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"There was a lot of clean ups and minor fixes. One of those clean ups
was to the trace_seq code. It also removed the return values to the
trace_seq_*() functions and use trace_seq_has_overflowed() to see if
the buffer filled up or not. This is similar to work being done to
the seq_file code as well in another tree.
Some of the other goodies include:
- Added some "!" (NOT) logic to the tracing filter.
- Fixed the frame pointer logic to the x86_64 mcount trampolines
- Added the logic for dynamic trampolines on !CONFIG_PREEMPT systems.
That is, the ftrace trampoline can be dynamically allocated and be
called directly by functions that only have a single hook to them"
* tag 'trace-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (55 commits)
tracing: Truncated output is better than nothing
tracing: Add additional marks to signal very large time deltas
Documentation: describe trace_buf_size parameter more accurately
tracing: Allow NOT to filter AND and OR clauses
tracing: Add NOT to filtering logic
ftrace/fgraph/x86: Have prepare_ftrace_return() take ip as first parameter
ftrace/x86: Get rid of ftrace_caller_setup
ftrace/x86: Have save_mcount_regs macro also save stack frames if needed
ftrace/x86: Add macro MCOUNT_REG_SIZE for amount of stack used to save mcount regs
ftrace/x86: Simplify save_mcount_regs on getting RIP
ftrace/x86: Have save_mcount_regs store RIP in %rdi for first parameter
ftrace/x86: Rename MCOUNT_SAVE_FRAME and add more detailed comments
ftrace/x86: Move MCOUNT_SAVE_FRAME out of header file
ftrace/x86: Have static tracing also use ftrace_caller_setup
ftrace/x86: Have static function tracing always test for function graph
kprobes: Add IPMODIFY flag to kprobe_ftrace_ops
ftrace, kprobes: Support IPMODIFY flag to find IP modify conflict
kprobes/ftrace: Recover original IP if pre_handler doesn't change it
tracing/trivial: Fix typos and make an int into a bool
tracing: Deletion of an unnecessary check before iput()
...
Intel SDM table 6-2 ("Priority Among Simultaneous Exceptions and Interrupts")
shows that faults from decoding the next instruction have higher priority than
general protection. Move the protected-mode check before the CPL check to
avoid a wrong exception in vm86 mode.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The pushf instruction does not push eflags.VM, so the emulation should not do
so either. Although eflags.RF should not be pushed either, it is already
cleared by the time pushf is executed.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The macro GP already sets the Prefix flag. Remove the redundant flag for
the 0f_38_f0 and 0f_38_f1 opcodes.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We reused the host's EBX and ECX, but KVM might not support all features, so
the emulated XSAVE size should be smaller.
EBX depends on the unknown XCR0, so we default EBX to ECX.
SDM CPUID (EAX = 0DH, ECX = 0):
EBX Bits 31-00: Maximum size (bytes, from the beginning of the
XSAVE/XRSTOR save area) required by enabled features in XCR0. May
be different than ECX if some features at the end of the XSAVE save
area are not enabled.
ECX Bit 31-00: Maximum size (bytes, from the beginning of the
XSAVE/XRSTOR save area) of the XSAVE/XRSTOR save area required by
all supported features in the processor, i.e., all the valid bit
fields in XCR0.
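A sketch of how the guest's EBX can be recomputed from the guest's current
XCR0 by walking CPUID leaf 0DH sub-leaves (sub-leaf n reports the state size
in EAX and its offset in EBX):

    static u32 xstate_required_size(u64 xstate_bv)
    {
            u32 ret = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET; /* legacy 512 + header 64 */
            int feature_bit = 2;    /* x87 and SSE live in the legacy area */
            u32 eax, ebx, ecx, edx;

            xstate_bv >>= 2;
            while (xstate_bv) {
                    if (xstate_bv & 1) {
                            cpuid_count(0xD, feature_bit,
                                        &eax, &ebx, &ecx, &edx);
                            ret = max(ret, ebx + eax);  /* offset + size */
                    }
                    xstate_bv >>= 1;
                    feature_bit++;
            }
            return ret;
    }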
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add nested virtualization support for xsaves.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add logic to get/set the XSS model-specific register.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Initialize the XSS exit bitmap. It is zero so there should be no XSAVES
or XRSTORS exits.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- EAX=0Dh, ECX=1: output registers EBX/ECX/EDX are reserved.
- EAX=0Dh, ECX>1: output register ECX bit 0 is clear for all the CPUID
leaves we support, because variable "supported" comes from XCR0 and not
XSS. Bits above 0 are reserved, so ECX is overall zero. Output register
EDX is reserved.
Source: Intel Architecture Instruction Set Extensions Programming
Reference, ref. number 319433-022
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is the size of the XSAVES area. This starts providing guest support
for XSAVES (with no support yet for supervisor states, i.e. XSS == 0
always in guests for now).
Wanpeng Li suggested testing XSAVEC as well as XSAVES, since in practice
no real processor exists that has only one of them, and there is no
other way for userspace programs to compute the size of the XSAVEC
save area. CPUID(EAX=0xd,ECX=1).EBX provides an upper bound.
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose the XSAVES feature to the guest if the kvm_x86_ops say it is
available.
Signed-off-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Userspace expects the non-compacted format for KVM_GET_XSAVE, but
struct xsave_struct might be using the compacted format. Convert
in order to preserve the userspace ABI.
Likewise, userspace passes the non-compacted format for KVM_SET_XSAVE,
but the kernel will pass it to XRSTORS, so we need to convert back.
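A simplified sketch of one direction of the conversion (compacted to
standard), reusing the CPUID-provided standard offsets; the real code must
also honor per-state alignment, and dst is assumed to be zeroed by the caller:

    static void xsave_to_standard(const u8 *src, u8 *dst, u64 xstate_bv)
    {
            u32 off = XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET; /* first compacted slot */
            int bit = 2;                                 /* skip x87/SSE */
            u32 eax, ebx, ecx, edx;
            u64 bv = xstate_bv >> 2;

            /* legacy region and XSAVE header are at the same place */
            memcpy(dst, src, XSAVE_HDR_SIZE + XSAVE_HDR_OFFSET);
            while (bv) {
                    if (bv & 1) {
                            cpuid_count(0xD, bit, &eax, &ebx, &ecx, &edx);
                            /* ebx = standard offset, eax = state size */
                            memcpy(dst + ebx, src + off, eax);
                            off += eax;     /* compacted slots are packed */
                    }
                    bv >>= 1;
                    bit++;
            }
    }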
Fixes: f31a9f7c71
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: stable@vger.kernel.org
Cc: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
While fixing an x2apic bug,
17d68b7 KVM: x86: fix guest-initiated crash with x2apic (CVE-2013-6376)
we've made only one cluster available. This means that the number of
logically addressable x2APICs was reduced to 16, and VCPUs kept
overwriting themselves in that region, so even the first cluster wasn't
set up correctly.
This patch extends x2APIC support back to the logical_map's limit, and
keeps the CVE fixed as messages for non-present APICs are dropped.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
They can't be violated now, but play it safe for the future.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2apic allows destinations > 0xff and we don't want them delivered to
lower APICs. They are correctly handled by doing nothing.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Physical mode can't address more than one APIC, but lowest-prio is
allowed, so we just reuse our paths.
SDM 10.6.2.1 Physical Destination:
Also, for any non-broadcast IPI or I/O subsystem initiated interrupt
with lowest priority delivery mode, software must ensure that APICs
defined in the interrupt address are present and enabled to receive
interrupts.
We could warn on top of that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A false return from kvm_irq_delivery_to_apic_fast() means that we don't handle
the interrupt in the fast path, but we currently also return false in cases
that were perfectly handled. Fix that.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR 0x830 corresponds to xAPIC MMIO offset 0x300, the ICR; hence the name MSR_ICR.
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x2APIC has no registers for DFR and ICR2 (see Intel SDM 10.12.1.2 "x2APIC
Register Address Space"). KVM needs to cause #GP on such accesses.
Fix it (DFR and ICR2 on read, ICR2 on write, DFR already handled on writes).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Certain x86 instructions that use ModRM operands only allow a memory operand
(i.e., mod012) and cause a #UD exception otherwise. KVM ignores this fact.
Currently, the instructions of this kind that KVM emulates are MOVBE,
MOVNTPS, MOVNTPD and MOVNTI. MOVBE is the most blunt example, since it may be
emulated by the host regardless of MMIO.
The fix introduces a new group for handling such instructions, marking mod3 as
an illegal encoding.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 85c8555ff0 ("KVM: check for !is_zero_pfn() in
kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn.
The problem being addressed by the patch above was that some ARM code
based the memory mapping attributes of a pfn on the return value of
kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should
be mapped as device memory.
However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin,
and the existing non-ARM users were already using it in a way which
suggests that its name should probably have been 'kvm_is_reserved_pfn'
from the beginning, e.g., whether or not to call get_page/put_page on
it etc. This means that returning false for the zero page is a mistake
and the patch above should be reverted.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit 85c8555ff0 ("KVM: check for !is_zero_pfn() in
kvm_is_mmio_pfn()") and renames the function to kvm_is_reserved_pfn.
The problem being addressed by the patch above was that some ARM code
based the memory mapping attributes of a pfn on the return value of
kvm_is_mmio_pfn(), whose name indeed suggests that such pfns should
be mapped as device memory.
However, kvm_is_mmio_pfn() doesn't do quite what it says on the tin,
and the existing non-ARM users were already using it in a way which
suggests that its name should probably have been 'kvm_is_reserved_pfn'
from the beginning, e.g., whether or not to call get_page/put_page on
it etc. This means that returning false for the zero page is a mistake
and the patch above should be reverted.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
cs.base is declared as a __u64 variable and vector is a u32, so this
causes a static checker warning. The user can indeed set "sipi_vector"
to any u32 value in kvm_vcpu_ioctl_x86_set_vcpu_events(), but the
value should really have 8-bit precision only.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Create a new header, and hide the device assignment functions there.
Move struct kvm_assigned_dev_kernel to assigned-dev.c by modifying
arch/x86/kvm/iommu.c to take a PCI device struct.
Based on a patch by Radim Krcmar <rkrcmark@redhat.com>.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This feature is not supported inside KVM guests yet, because we do not emulate
MSR_IA32_XSS. Mask it out.
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that ia64 is gone, we can hide deprecated device assignment in x86.
Notable changes:
- kvm_vm_ioctl_assigned_device() was moved to x86/kvm_arch_vm_ioctl()
The easy parts were removed from generic kvm code; what remains there:
- kvm_iommu_(un)map_pages() would require new code to be moved
- struct kvm_assigned_dev_kernel depends on struct kvm_irq_ack_notifier
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ia64 does not need them anymore. Ack notifiers become x86-specific
too.
Suggested-by: Gleb Natapov <gleb@kernel.org>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To allow for the restructuring of the trace_seq code, we need its users
to use the helper functions instead of accessing the internals
of the trace_seq structure itself.
Link: http://lkml.kernel.org/r/20141104160221.585025609@goodmis.org
Tested-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Jiri Kosina <jkosina@suse.cz>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Acked-by: Mark Rustad <mark.d.rustad@intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.cz>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Remove FIXME comments about needing fault addresses to be returned. These
are propagated from walk_addr_generic to gva_to_gpa and from there to
ops->read_std and ops->write_std.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The check on the higher limit of the segment, and the check on the
maximum accessible size, are the same for both expand-up and
expand-down segments. Only the computation of "lim" varies.
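Roughly, the unified logic is this sketch, where ea is the effective address
and size the access size:

    lim = desc_limit_scaled(&desc);
    if (!(desc.type & 8) && (desc.type & 4)) {
            /* expand-down data segment: valid offsets lie *above* the limit */
            if (ea <= lim)
                    goto bad;
            lim = desc.d ? 0xffffffffu : 0xffffu;   /* new upper bound */
    }
    /* common tail for both kinds of segment */
    if (ea > lim || (u64)lim + 1 - ea < size)
            goto bad;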
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
register_address has been a duplicate of address_mask ever since the
ancestor of __linearize was born in 90de84f50b (KVM: x86 emulator:
preserve an operand's segment identity, 2010-11-17).
However, we can put it to a better use by including the call to reg_read
in register_address. Similarly, the call to reg_rmw can be moved to
register_address_increment.
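A sketch of the resulting helpers; masked_increment updates only the bits
covered by the address-size mask:

    static inline unsigned long
    register_address(struct x86_emulate_ctxt *ctxt, int reg)
    {
            return address_mask(ctxt, reg_read(ctxt, reg));
    }

    static inline void
    register_address_increment(struct x86_emulate_ctxt *ctxt, int reg, int inc)
    {
            masked_increment(reg_rmw(ctxt, reg), ad_mask(ctxt), inc);
    }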
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In __linearize there is a check of the condition that decides whether masking
of the linear address is needed. It occurs immediately after a switch that
evaluates the same condition. Merge them.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When SS is used with a non-canonical address, an #SS exception is generated on
real hardware. The KVM emulator causes a #GP instead. Fix it to behave as a
real x86 CPU would.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If a branch (e.g., jmp, ret) causes a limit violation, because the target IP >
limit, the #GP exception occurs before the branch. In other words, the RIP
pushed on the stack should be that of the branch and not that of the target.
To do so, we can call __linearize with the new EIP, which also saves us the
code that performs the canonical address checks. In the case of assigning an
EIP >= 2^32 (when switching cs.l), we are also safe, as __linearize will check
that the new EIP does not exceed the limit and will trigger #GP(0) otherwise.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a segment is accessed, real hardware does not perform any privilege level
checks. In contrast, the KVM emulator does. This causes some discrepancies
from real hardware. For instance, reading from a readable code segment may
fail due to incorrect segment checks. In addition, it introduces unnecessary
overhead.
To reference Intel SDM 5.5 ("Privilege Levels"): "Privilege levels are checked
when the segment selector of a segment descriptor is loaded into a segment
register." The SDM never mentions privilege level checks during memory access,
except for loading far pointers in section 5.10 ("Pointer Validation"). Those
are actually segment selector loads and are emulated similarly (i.e.,
regardless of the __linearize checks).
This behavior was also checked using sysexit. A data-segment whose DPL=0 was
loaded, and after sysexit (CPL=3) it is still accessible.
Therefore, all the privilege level checks in __linearize are removed.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When performing a segmented read/write in the emulator for stack operations,
it ignores the stack size and uses ad_bytes as an indication of the pointer
size. As a result, a wrong address may be accessed.
To fix this behavior, we can remove the masking of address in __linearize and
perform it beforehand. It is already done for the operands (so currently it is
inefficiently done twice). It is missing in two cases:
1. When using rip_relative
2. On fetch_bit_operand that changes the address.
This patch masks the address on these two occasions, and removes the masking
from __linearize.
Note that it does not mask EIP during fetch. In protected/legacy mode, a code
fetch with RIP >= 2^32 should result in #GP and not wrap-around. Since we make
limit checks within __linearize, this is the expected behavior.
Partial revert of commit 518547b32a (KVM: x86: Emulator does not
calculate address correctly, 2014-09-30).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 10e38fc7cab6 ("KVM: x86: Emulator flag for instruction that only support
16-bit addresses in real mode") introduced NoBigReal for instructions such as
MONITOR. Apparently, the Intel SDM description that led to this patch is
misleading. Since no instruction is using NoBigReal, it is safe to remove it
until we fully understand what the SDM means.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MMIO_MAX_GEN is the same as MMIO_GEN_MASK. Use only one.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead, just use PFERR_{FETCH, PRESENT, WRITE}_MASK
inside handle_ept_violation() for slightly better code.
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
apic_find_highest_irr assumes irr_pending is set if any vector in APIC_IRR is
set. If this assumption is broken and apicv is disabled, the injection of
interrupts may be deferred until another interrupt is delivered to the guest.
Ultimately, if no other interrupt should be injected to that vCPU, the pending
interrupt may be lost.
commit 56cc2406d6 ("KVM: nVMX: fix "acknowledge interrupt on exit" when APICv
is in use") changed the behavior of apic_clear_irr so irr_pending is cleared
after setting APIC_IRR vector. After this commit, if apic_set_irr and
apic_clear_irr run simultaneously, a race may occur, resulting in APIC_IRR
vector set, and irr_pending cleared. In the following example, assume a single
vector is set in IRR prior to calling apic_clear_irr:
apic_set_irr apic_clear_irr
------------ --------------
apic->irr_pending = true;
apic_clear_vector(...);
vec = apic_search_irr(apic);
// => vec == -1
apic_set_vector(...);
apic->irr_pending = (vec != -1);
// => apic->irr_pending == false
Nonetheless, it appears the race might even occur prior to this commit:
apic_set_irr apic_clear_irr
------------ --------------
apic->irr_pending = true;
apic->irr_pending = false;
apic_clear_vector(...);
if (apic_search_irr(apic) != -1)
apic->irr_pending = true;
// => apic->irr_pending == false
apic_set_vector(...);
Fix this issue by:
1. Restoring the previous behavior of apic_clear_irr: clear irr_pending, call
apic_clear_vector, and then if APIC_IRR is non-zero, set irr_pending.
2. On apic_set_irr: first call apic_set_vector, then set irr_pending.
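A sketch of the restored ordering in apic_clear_irr (the apicv path is
omitted):

    static inline void apic_clear_irr(int vec, struct kvm_lapic *apic)
    {
            apic->irr_pending = false;              /* 1. clear first */
            apic_clear_vector(vec, apic->regs + APIC_IRR);
            if (apic_search_irr(apic) != -1)        /* 2. re-set if bits remain */
                    apic->irr_pending = true;
    }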
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Logical destination mode can be used to send NMI IPIs even when all
APICs are software disabled, so if all APICs are software disabled we
should still look at the DFRs.
So the DFRs should all be the same, even if some or all APICs are
software disabled. However, the SDM does not say this, so tweak
the logic as follows:
- if one APIC is enabled and has LDR != 0, use that one to build the map.
This picks the right DFR in case an OS is only setting it for the
software-enabled APICs, or in case an OS is using logical addressing
on some APICs while leaving the rest in reset state (using LDR was
suggested by Radim).
- if all APICs are disabled, pick a random one to build the map.
We use the last one with LDR != 0 for simplicity.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, the APIC logical map does not consider VCPUs whose local APIC is
software-disabled. However, NMIs, INIT, etc. should still be delivered to such
VCPUs. Therefore, the APIC mode should first be determined, and then the map,
considering all VCPUs, should be constructed.
To address this issue, first find the APIC mode, and only then construct the
logical map.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When running the tsc_adjust kvm-unit-test on an AMD processor with the
IA32_TSC_ADJUST feature enabled, the WARN_ON in svm_adjust_tsc_offset can be
triggered. This WARN_ON checks for a negative adjustment in case __scale_tsc
is called; however it may trigger unnecessary warnings.
This patch moves the WARN_ON to trigger only if __scale_tsc will actually be
called from svm_adjust_tsc_offset. In addition, make adj in kvm_set_msr_common
an s64, since it can have signed values.
Signed-off-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There's nothing to switch if the host and guest values are the same.
I am unable to find evidence that this makes any difference
whatsoever.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[I could see a difference on Nehalem. From 5 runs:
userspace exit, guest!=host 12200 11772 12130 12164 12327
userspace exit, guest=host 11983 11780 11920 11919 12040
lightweight exit, guest!=host 3214 3220 3238 3218 3337
lightweight exit, guest=host 3178 3193 3193 3187 3220
This passes the t-test with 99% confidence for userspace exit,
98.5% confidence for lightweight exit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
At least on Sandy Bridge, letting the CPU switch IA32_EFER is much
faster than switching it manually.
I benchmarked this using the vmexit kvm-unit-test (single run, but
GOAL multiplied by 5 to do more iterations):
Test Before After Change
cpuid 2000 1932 -3.40%
vmcall 1914 1817 -5.07%
mov_from_cr8 13 13 0.00%
mov_to_cr8 19 19 0.00%
inl_from_pmtimer 19164 10619 -44.59%
inl_from_qemu 15662 10302 -34.22%
inl_from_kernel 3916 3802 -2.91%
outl_to_kernel 2230 2194 -1.61%
mov_dr 172 176 2.33%
ipi (skipped) (skipped)
ipi+halt (skipped) (skipped)
ple-round-robin 13 13 0.00%
wr_tsc_adjust_msr 1920 1845 -3.91%
rd_tsc_adjust_msr 1892 1814 -4.12%
mmio-no-eventfd:pci-mem 16394 11165 -31.90%
mmio-wildcard-eventfd:pci-mem 4607 4645 0.82%
mmio-datamatch-eventfd:pci-mem 4601 4610 0.20%
portio-no-eventfd:pci-io 11507 7942 -30.98%
portio-wildcard-eventfd:pci-io 2239 2225 -0.63%
portio-datamatch-eventfd:pci-io 2250 2234 -0.71%
I haven't explicitly computed the significance of these numbers,
but this isn't subtle.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
[The results were reproducible on all of Nehalem, Sandy Bridge and
Ivy Bridge. The slowness of manual switching is because writing
to EFER with WRMSR triggers a TLB flush, even if the only bit you're
touching is SCE (so the page table format is not affected). Doing
the write as part of vmentry/vmexit, instead, does not flush the TLB,
probably because all processors that have EPT also have VPID. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PCIDs are only supported in 64-bit mode. No need to clear bit 63
of CR3 unless the host is 64-bit.
Reported by Fengguang Wu's autobuilder.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The new trace event records:
* the id of vcpu being updated
* the pvclock_vcpu_time_info struct being written to guest memory
This is useful for debugging pvclock bugs, such as the bug fixed by
"[PATCH] kvm: x86: Fix kvm clock versioning.".
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm updates the version number for the guest paravirt clock structure by
incrementing the version of its private copy. It does not read the guest
version, so it will write version = 2 in the first update for every new VM,
including after restoring a saved state. If guest state is saved while
reading the clock, the guest could read and accept struct fields and guest TSC
from two different updates. Change the code to increment the guest
version and write it back.
Signed-off-by: Owen Hofmann <osh@google.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 3b32004a66 ("KVM: x86: movnti minimum op size of 32-bit is not kept")
did not fully fix the minimum operand size of MOVNTI emulation: MOVNTI
may still be mistakenly performed using a 16-bit operand size.
This patch adds a No16 flag to mark instructions that do not support a 16-bit
operand size.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When the guest writes to the TSC, the masterclock TSC copy must be
updated as well along with the TSC_OFFSET update, otherwise a negative
tsc_timestamp is calculated at kvm_guest_time_update.
Once "if (!vcpus_matched && ka->use_master_clock)" is simplified to
"if (ka->use_master_clock)", the corresponding "if (!ka->use_master_clock)"
becomes redundant, so remove the do_request boolean and collapse
everything into a single condition.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that KVM injects #UD on an "unhandlable" error, it makes better sense to
return such an error from sysenter instead of directly injecting #UD into the
guest. This makes it easier to track the unhandlable cases the emulator does
not support.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
APIC base relocation is unsupported by KVM. If anyone uses it, the least we
should do is report a warning in the hypervisor.
Note that KVM-unit-tests uses this feature for some reason, so running the
tests triggers the warning.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 7fe864dc94 (KVM: x86: Mark VEX-prefix instructions emulation as
unimplemented, 2014-06-02) marked VEX instructions as such in protected
mode. VEX-prefix instructions are not relevant in real mode and VM86, but
they should cause #UD instead of being decoded as LES/LDS.
Fix this behavior to be consistent with real hardware.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
[Check for mod == 3, rather than 2 or 3. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Task-switch emulation checks the privilege level prior to performing the
task switch. This check is incorrect in the case of task gates, in which the
tss.dpl is ignored, and can cause superfluous exceptions. Moreover, this check
is unnecessary, since the CPU checks the privilege levels prior to exiting.
Intel SDM 25.4.2 says "If CALL or JMP accesses a TSS descriptor directly
outside IA-32e mode, privilege levels are checked on the TSS descriptor" prior
to exiting. AMD 15.14.1 says "The intercept is checked before the task switch
takes place but after the incoming TSS and task gate (if one was involved) have
been checked for correctness."
This patch removes the CPL checks for CALL and JMP.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When emulating LTR/LDTR/LGDT/LIDT, #GP should be injected if the base is
non-canonical. Otherwise, VM-entry will fail.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
LGDT and LIDT emulation logic is almost identical. Merge the logic into a
single point to avoid redundancy. This will be used by the next patch that
will ensure the bases of the loaded GDTR and IDTR are canonical.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the emulation ends in a fault, eflags should not be updated. However,
several instruction emulations (actually all the fastops) currently update
eflags even if the fault was detected afterwards (e.g., a #PF during writeback).
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Although the Intel SDM mentions that bit 63 is reserved, MOV to CR3 can have
bit 63 set. As the Intel SDM states in section 4.10.4 "Invalidation of TLBs
and Paging-Structure Caches": "MOV to CR3. ... If CR4.PCIDE = 1 and bit 63 of
the instruction’s source operand is 0 ..."
In other words, bit 63 is not reserved. The KVM emulator currently considers
bit 63 reserved. Fix it.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
According to the Intel SDM, a push of a segment selector is done in the
following manner: "if the operand size is 32-bits, either a zero-extended
value is pushed on the stack or the segment selector is written on the stack
using a 16-bit move. For the last case, all recent Core and Atom processors
perform a 16-bit move, leaving the upper portion of the stack location
unmodified."
This patch modifies the behavior to match the Core behavior.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CMPS and SCAS instructions are evaluated in the wrong order. For reference (of
CMPS), see http://www.fermimn.gov.it/linux/quarta/x86/cmps.htm : "Note that the
direction of subtraction for CMPS is [SI] - [DI] or [ESI] - [EDI]. The left
operand (SI or ESI) is the source and the right operand (DI or EDI) is the
destination. This is the reverse of the usual Intel convention in which the
left operand is the destination and the right operand is the source."
Introduce em_cmp_r for this purpose; it performs the comparison in reverse
order, using the fastop infrastructure to avoid a wrapper function.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SYSCALL emulation currently clears eflags in 64-bit mode according to
MSR_SYSCALL_MASK. However, on bare metal, eflags[1], which is fixed to one,
cannot be cleared, even if MSR_SYSCALL_MASK masks the bit. This wrong behavior
may result in a failed VM-entry, as VT disallows entry with eflags[1] cleared.
This patch sets the bit after masking eflags on syscall.
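The fix in em_syscall amounts to something like the following; X86_EFLAGS_FIXED
stands for the always-one bit 1 (emulate.c may spell the constant differently):

    ctxt->eflags &= ~msr_data;              /* apply MSR_SYSCALL_MASK */
    ctxt->eflags |= X86_EFLAGS_FIXED;       /* bit 1 can never be cleared */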
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In x86, you can only MOV-sreg to memory with either a 16-bit or 64-bit size.
In contrast, KVM may write to 32 bits of memory on MOV-sreg. This patch fixes
KVM's behavior, setting the destination operand size to two bytes if the
destination is memory.
When the destination is a register and the operand size is 32 bits, the high
16 bits are filled with zero on modern CPUs. This case is already handled
correctly.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
x86 debug registers hold a linear address. Therefore, breakpoint detection
should consider CS.base and check whether the instruction's linear address
equals (CS.base + RIP). This patch introduces a function to evaluate the RIP
linear address and uses it for breakpoint detection.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DR6[0:3] (previous breakpoint indications) are cleared when #DB is injected
during handle_exception, just as real hardware does. Similarly, handle_dr
should clear DR6[0:3].
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>