Commit Graph

4100 Commits

Author SHA1 Message Date
Avi Kivity
d196e34336 KVM: MMU: Decouple mmio from shadow page tables
Currently an mmio guest pte is encoded in the shadow pagetable as a
not-present trapping pte, with the SHADOW_IO_MARK bit set.  However
nothing is ever done with this information, so maintaining it is a
useless complication.

This patch moves the check for mmio to before shadow ptes are instantiated,
so the shadow code is never invoked for ptes that reference mmio.  The code
is simpler, and with future work, can be made to handle mmio concurrently.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:17 +03:00
Avi Kivity
1d6ad2073e KVM: x86 emulator: group decoding for group 1 instructions
Opcodes 0x80-0x83

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:16 +03:00
Avi Kivity
d95058a1a7 KVM: x86 emulator: add group 7 decoding
This adds group decoding for opcode 0x0f 0x01 (group 7).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:15 +03:00
Avi Kivity
fd60754e4f KVM: x86 emulator: Group decoding for groups 4 and 5
Add group decoding support for opcode 0xfe (group 4) and 0xff (group 5).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:15 +03:00
Avi Kivity
7d858a19ef KVM: x86 emulator: Group decoding for group 3
This adds group decoding support for opcodes 0xf6, 0xf7 (group 3).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:14 +03:00
Avi Kivity
43bb19cd33 KVM: x86 emulator: group decoding for group 1A
This adds group decode support for opcode 0x8f.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:14 +03:00
Avi Kivity
e09d082c03 KVM: x86 emulator: add support for group decoding
Certain x86 instructions use bits 3:5 of the byte following the opcode as an
opcode extension, with the decode sometimes depending on bits 6:7 as well.
Add support for this in the main decoding table rather than an ad-hock
adaptation per opcode.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:14 +03:00
Dong, Eddie
1ae0a13def KVM: MMU: Simplify hash table indexing
Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:14 +03:00
Dong, Eddie
489f1d6526 KVM: MMU: Update shadow ptes on partial guest pte writes
A guest partial guest pte write will leave shadow_trap_nonpresent_pte
in spte, which generates a vmexit at the next guest access through that pte.

This patch improves this by reading the full guest pte in advance and thus
being able to update the spte and eliminate the vmexit.

This helps pae guests which use two 32-bit writes to set a single 64-bit pte.

[truncation fix by Eric]

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-04-27 11:53:13 +03:00
Avi Kivity
e48bb497b9 KVM: MMU: Fix memory leak on guest demand faults
While backporting 72dc67a696, a gfn_to_page()
call was duplicated instead of moved (due to an unrelated patch not being
present in mainline).  This caused a page reference leak, resulting in a
fairly massive memory leak.

Fix by removing the extraneous gfn_to_page() call.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Marcelo Tosatti
707a18a51d KVM: VMX: convert init_rmode_tss() to slots_lock
init_rmode_tss was forgotten during the conversion from mmap_sem to
slots_lock.

INFO: task qemu-system-x86:3748 blocked for more than 120 seconds.
Call Trace:
 [<ffffffff8053d100>] __down_read+0x86/0x9e
 [<ffffffff8053fb43>] do_page_fault+0x346/0x78e
 [<ffffffff8053d235>] trace_hardirqs_on_thunk+0x35/0x3a
 [<ffffffff8053dcad>] error_exit+0x0/0xa9
 [<ffffffff8035a7a7>] copy_user_generic_string+0x17/0x40
 [<ffffffff88099a8a>] :kvm:kvm_write_guest_page+0x3e/0x5f
 [<ffffffff880b661a>] :kvm_intel:init_rmode_tss+0xa7/0xf9
 [<ffffffff880b7d7e>] :kvm_intel:vmx_vcpu_reset+0x10/0x38a
 [<ffffffff8809b9a5>] :kvm:kvm_arch_vcpu_setup+0x20/0x53
 [<ffffffff8809a1e4>] :kvm:kvm_vm_ioctl+0xad/0x1cf
 [<ffffffff80249dea>] __lock_acquire+0x4f7/0xc28
 [<ffffffff8028fad9>] vfs_ioctl+0x21/0x6b
 [<ffffffff8028fd75>] do_vfs_ioctl+0x252/0x26b
 [<ffffffff8028fdca>] sys_ioctl+0x3c/0x5e
 [<ffffffff8020b01b>] system_call_after_swapgs+0x7b/0x80

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Marcelo Tosatti
15aaa819e2 KVM: MMU: handle page removal with shadow mapping
Do not assume that a shadow mapping will always point to the same host
frame number.  Fixes crash with madvise(MADV_DONTNEED).

[avi: move after first printk(), add another printk()]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:17 +02:00
Avi Kivity
4b1a80fa65 KVM: MMU: Fix is_rmap_pte() with io ptes
is_rmap_pte() doesn't take into account io ptes, which have the avail bit set.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:16 +02:00
Avi Kivity
5dc8326282 KVM: VMX: Restore tss even on x86_64
The vmx hardware state restore restores the tss selector and base address, but
not its length.  Usually, this does not matter since most of the tss contents
is within the default length of 0x67.  However, if a process is using ioperm()
to grant itself I/O port permissions, an additional bitmap within the tss,
but outside the default length is consulted.  The effect is that the process
will receive a SIGSEGV instead of transparently accessing the port.

Fix by restoring the tss length.  Note that i386 had this working already.

Closes bugzilla 10246.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-25 10:22:16 +02:00
Avi Kivity
33f9c505ed KVM: VMX: Avoid rearranging switched guest msrs while they are loaded
KVM tries to run as much as possible with the guest msrs loaded instead of
host msrs, since switching msrs is very expensive.  It also tries to minimize
the number of msrs switched according to the guest mode; for example,
MSR_LSTAR is needed only by long mode guests.  This optimization is done by
setup_msrs().

However, we must not change which msrs are switched while we are running with
guest msr state:

 - switch to guest msr state
 - call setup_msrs(), removing some msrs from the list
 - switch to host msr state, leaving a few guest msrs loaded

An easy way to trigger this is to kexec an x86_64 linux guest.  Early during
setup, the guest will switch EFER to not include SCE.  KVM will stop saving
MSR_LSTAR, and on the next msr switch it will leave the guest LSTAR loaded.
The next host syscall will end up in a random location in the kernel.

Fix by reloading the host msrs before changing the msr list.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:50 +02:00
Avi Kivity
f7d9c7b7b9 KVM: MMU: Fix race when instantiating a shadow pte
For improved concurrency, the guest walk is performed concurrently with other
vcpus.  This means that we need to revalidate the guest ptes once we have
write-protected the guest page tables, at which point they can no longer be
modified.

The current code attempts to avoid this check if the shadow page table is not
new, on the assumption that if it has existed before, the guest could not have
modified the pte without the shadow lock.  However the assumption is incorrect,
as the racing vcpu could have modified the pte, then instantiated the shadow
page, before our vcpu regains control:

  vcpu0        vcpu1

  fault
  walk pte

               modify pte
               fault in same pagetable
               instantiate shadow page

  lookup shadow page
  conclude it is old
  instantiate spte based on stale guest pte

We could do something clever with generation counters, but a test run by
Marcelo suggests this is unnecessary and we can just do the revalidation
unconditionally.  The pte will be in the processor cache and the check can
be quite fast.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:49 +02:00
Avi Kivity
0b975a3c2d KVM: Avoid infinite-frequency local apic timer
If the local apic initial count is zero, don't start a an hrtimer with infinite
frequency, locking up the host.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:48 +02:00
Marcelo Tosatti
24993d5349 KVM: make MMU_DEBUG compile again
the cr3 variable is now inside the vcpu->arch structure.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:47 +02:00
Marcelo Tosatti
5e4a0b3c1b KVM: move alloc_apic_access_page() outside of non-preemptable region
alloc_apic_access_page() can sleep, while vmx_vcpu_setup is called
inside a non preemptable region. Move it after put_cpu().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:46 +02:00
Joerg Roedel
a2938c8070 KVM: SVM: fix Windows XP 64 bit installation crash
While installing Windows XP 64 bit wants to access the DEBUGCTL and the last
branch record (LBR) MSRs. Don't allowing this in KVM causes the installation to
crash. This patch allow the access to these MSRs and fixes the issue.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:45 +02:00
Izik Eidus
72dc67a696 KVM: remove the usage of the mmap_sem for the protection of the memory slots.
This patch replaces the mmap_sem lock for the memory slots with a new
kvm private lock, it is needed beacuse untill now there were cases where
kvm accesses user memory while holding the mmap semaphore.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-04 15:19:40 +02:00
Joerg Roedel
c7ac679c16 KVM: emulate access to MSR_IA32_MCG_CTL
Injecting an GP when accessing this MSR lets Windows crash when running some
stress test tools in KVM.  So this patch emulates access to this MSR.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-03 11:22:37 +02:00
Avi Kivity
674eea0fc4 KVM: Make the supported cpuid list a host property rather than a vm property
One of the use cases for the supported cpuid list is to create a "greatest
common denominator" of cpu capabilities in a server farm.  As such, it is
useful to be able to get the list without creating a virtual machine first.

Since the code does not depend on the vm in any way, all that is needed is
to move it to the device ioctl handler.  The capability identifier is also
changed so that binaries made against -rc1 will fail gracefully.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-03 11:22:25 +02:00
Paul Knowles
d730616384 KVM: Fix kvm_arch_vcpu_ioctl_set_sregs so that set_cr0 works properly
Whilst working on getting a VM to initialize in to IA32e mode I found
this issue. set_cr0 relies on comparing the old cr0 to the new one to
work correctly.  Move the assignment below so the compare can work.

Signed-off-by: Paul Knowles <paul@transitive.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-03 11:22:14 +02:00
Joerg Roedel
6b390b6392 KVM: SVM: set NM intercept when enabling CR0.TS in the guest
Explicitly enable the NM intercept in svm_set_cr0 if we enable TS in the guest
copy of CR0 for lazy FPU switching. This fixes guest SMP with Linux under SVM.
Without that patch Linux deadlocks or panics right after trying to boot the
other CPUs.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-03 11:20:21 +02:00
Joerg Roedel
334df50a86 KVM: SVM: Fix lazy FPU switching
If the guest writes to cr0 and leaves the TS flag at 0 while vcpu->fpu_active
is also 0, the TS flag in the guest's cr0 gets lost. This leads to corrupt FPU
state an causes Windows Vista 64bit to crash very soon after boot.  This patch
fixes this bug.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Markus Rechberger <markus.rechberger@amd.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-03-03 11:18:18 +02:00
Andrew Morton
c0b49b0d16 kvm: i386 fix
arch/x86/kvm/x86.c: In function 'emulator_cmpxchg_emulated':
arch/x86/kvm/x86.c:1746: warning: passing argument 2 of 'vcpu->arch.mmu.gva_to_gpa' makes integer from pointer without a cast
arch/x86/kvm/x86.c:1746: warning: 'addr' is used uninitialized in this function

Is true.  Local variable `addr' shadows incoming arg `addr'.  Avi is on
vacation for a while, so...

Cc: Avi Kivity <avi@qumranet.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05 09:44:06 -08:00
Anthony Liguori
0ad07ec1fd virtio: Put the virtio under the virtualization menu
This patch moves virtio under the virtualization menu and changes virtio
devices to not claim to only be for lguest.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
2008-02-04 23:50:05 +11:00
Avi Kivity
2f52d58c92 KVM: Move apic timer migration away from critical section
Migrating the apic timer in the critical section is not very nice, and is
absolutely horrible with the real-time port.  Move migration to the regular
vcpu execution path, triggered by a new bitflag.

Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:22 +02:00
Avi Kivity
6c14280125 KVM: Fix unbounded preemption latency
When preparing to enter the guest, if an interrupt comes in while
preemption is disabled but interrupts are still enabled, we miss a
preemption point.  Fix by explicitly checking whether we need to
reschedule.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:22 +02:00
Avi Kivity
97db56ce6c KVM: Initialize the mmu caches only after verifying cpu support
Otherwise we re-initialize the mmu caches, which will fail since the
caches are already registered, which will cause us to deinitialize said caches.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:22 +02:00
Izik Eidus
75e68e6078 KVM: MMU: Fix dirty page setting for pages removed from rmap
Right now rmap_remove won't set the page as dirty if the shadow pte
pointed to this page had write access and then it became readonly.
This patches fixes that, by setting the page as dirty for spte changes from
write to readonly access.

Signed-off-by: Izik Eidus <izike@qumranet.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:22 +02:00
Sheng Yang
571008dacc KVM: x86 emulator: Only allow VMCALL/VMMCALL trapped by #UD
When executing a test program called "crashme", we found the KVM guest cannot
survive more than ten seconds, then encounterd kernel panic. The basic concept
of "crashme" is generating random assembly code and trying to execute it.

After some fixes on emulator insn validity judgment, we found it's hard to
get the current emulator handle the invalid instructions correctly, for the
#UD trap for hypercall patching caused troubles. The problem is, if the opcode
itself was OK, but combination of opcode and modrm_reg was invalid, and one
operand of the opcode was memory (SrcMem or DstMem), the emulator will fetch
the memory operand first rather than checking the validity, and may encounter
an error there. For example, ".byte 0xfe, 0x34, 0xcd" has this problem.

In the patch, we simply check that if the invalid opcode wasn't vmcall/vmmcall,
then return from emulate_instruction() and inject a #UD to guest. With the
patch, the guest had been running for more than 12 hours.

Signed-off-by: Feng (Eric) Liu <eric.e.liu@intel.com>
Signed-off-by: Sheng Yang <sheng.yang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:21 +02:00
Dong, Eddie
5882842f9b KVM: MMU: Merge shadow level check in FNAME(fetch)
Remove the redundant level check when fetching
shadow pte for present & non-present spte.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:21 +02:00
Avi Kivity
eb787d10af KVM: MMU: Move kvm_free_some_pages() into critical section
If some other cpu steals mmu pages between our check and an attempt to
allocate, we can run out of mmu pages.  Fix by moving the check into the
same critical section as the allocation.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:21 +02:00
Marcelo Tosatti
aaee2c94f7 KVM: MMU: Switch to mmu spinlock
Convert the synchronization of the shadow handling to a separate mmu_lock
spinlock.

Also guard fetch() by mmap_sem in read-mode to protect against alias
and memslot changes.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:21 +02:00
Avi Kivity
d7824fff89 KVM: MMU: Avoid calling gfn_to_page() in mmu_set_spte()
Since gfn_to_page() is a sleeping function, and we want to make the core mmu
spinlocked, we need to pass the page from the walker context (which can sleep)
to the shadow context (which cannot).

[marcelo: avoid recursive locking of mmap_sem]

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:21 +02:00
Marcelo Tosatti
7ec5458821 KVM: Add kvm_read_guest_atomic()
In preparation for a mmu spinlock, add kvm_read_guest_atomic()
and use it in fetch() and prefetch_page().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:20 +02:00
Marcelo Tosatti
10589a4699 KVM: MMU: Concurrent guest walkers
Do not hold kvm->lock mutex across the entire pagefault code,
only acquire it in places where it is necessary, such as mmu
hash list, active list, rmap and parent pte handling.

Allow concurrent guest walkers by switching walk_addr() to use
mmap_sem in read-mode.

And get rid of the lockless __gfn_to_page.

[avi: move kvm_mmu_pte_write() locking inside the function]
[avi: add locking for real mode]
[avi: fix cmpxchg locking]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:20 +02:00
Avi Kivity
774ead3ad9 KVM: Disable vapic support on Intel machines with FlexPriority
FlexPriority accelerates the tpr without any patching.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:20 +02:00
Avi Kivity
b93463aa59 KVM: Accelerated apic support
This adds a mechanism for exposing the virtual apic tpr to the guest, and a
protocol for letting the guest update the tpr without causing a vmexit if
conditions allow (e.g. there is no interrupt pending with a higher priority
than the new tpr).

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:20 +02:00
Avi Kivity
b209749f52 KVM: local APIC TPR access reporting facility
Add a facility to report on accesses to the local apic tpr even if the
local apic is emulated in the kernel.  This is basically a hack that
allows userspace to patch Windows which tends to bang on the tpr a lot.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:20 +02:00
Avi Kivity
565f1fbd9d KVM: Print data for unimplemented wrmsr
This can help diagnosing what the guest is trying to do.  In many cases
we can get away with partial emulation of msrs.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:20 +02:00
Avi Kivity
dfc5aa00cb KVM: MMU: Add cache miss statistic
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:19 +02:00
Eddie Dong
caa5b8a5ed KVM: MMU: Coalesce remote tlb flushes
Host side TLB flush can be merged together if multiple
spte need to be write-protected.

Signed-off-by: Yaozu (Eddie) Dong <eddie.dong@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:19 +02:00
Zhang Xiantao
5736199afb KVM: Move kvm_vcpu_kick() to x86.c
Moving kvm_vcpu_kick() to x86.c. Since it should be
common for all archs, put its declarations in <linux/kvm_host.h>

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:19 +02:00
Zhang Xiantao
0eb8f49848 KVM: Move ioapic code to common directory.
Move ioapic code to common, since IA64 also needs it.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:19 +02:00
Zhang Xiantao
82470196fa KVM: Move irqchip declarations into new ioapic.h and lapic.h
This allows reuse of ioapic in ia64.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:19 +02:00
Avi Kivity
0fce5623ba KVM: Move drivers/kvm/* to virt/kvm/
Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:18 +02:00
Avi Kivity
edf884172e KVM: Move arch dependent files to new directory arch/x86/kvm/
This paves the way for multiple architecture support.  Note that while
ioapic.c could potentially be shared with ia64, it is also moved.

Signed-off-by: Avi Kivity <avi@qumranet.com>
2008-01-30 18:01:18 +02:00