Commit Graph

897 Commits

Author SHA1 Message Date
Mohammed Gamal
94677e61fd KVM: x86 emulator: Add missing decoder flags for 'or' instructions
Add missing decoder flags for or instructions (0xc-0xd).

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:09 +02:00
Avi Kivity
bfd99ff5d4 KVM: Move assigned device code to own file
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:09 +02:00
Avi Kivity
367e1319b2 KVM: Return -ENOTTY on unrecognized ioctls
Not the incorrect -EINVAL.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:08 +02:00
Gleb Natapov
680b3648ba KVM: Drop kvm->irq_lock lock from irq injection path
The only thing it protects now is interrupt injection into lapic and
this can work lockless. Even now with kvm->irq_lock in place access
to lapic is not entirely serialized since vcpu access doesn't take
kvm->irq_lock.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:08 +02:00
Gleb Natapov
eba0226bdf KVM: Move IO APIC to its own lock
The allows removal of irq_lock from the injection path.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:08 +02:00
Gleb Natapov
1a6e4a8c27 KVM: Move irq sharing information to irqchip level
This removes assumptions that max GSIs is smaller than number of pins.
Sharing is tracked on pin level not GSI level.

[avi: no PIC on ia64]

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:06 +02:00
Gleb Natapov
79c727d437 KVM: Call pic_clear_isr() on pic reset to reuse logic there
Also move call of ack notifiers after pic state change.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:06 +02:00
Avi Kivity
851ba6922a KVM: Don't pass kvm_run arguments
They're just copies of vcpu->run, which is readily accessible.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:06 +02:00
Mohammed Gamal
d8769fedd4 KVM: x86 emulator: Introduce No64 decode option
Introduces a new decode option "No64", which is used for instructions that are
invalid in long mode.

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:05 +02:00
Mohammed Gamal
0934ac9d13 KVM: x86 emulator: Add 'push/pop sreg' instructions
[avi: avoid buffer overflow]

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-12-03 09:32:05 +02:00
Ingo Molnar
96200591a3 Merge branch 'tracing/hw-breakpoints' into perf/core
Conflicts:
	arch/x86/kernel/kprobes.c
	kernel/trace/Makefile

Merge reason: hw-breakpoints perf integration is looking
              good in testing and in reviews, plus conflicts
              are mounting up - so merge & resolve.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-21 14:07:23 +01:00
Frederic Weisbecker
59d8eb53ea hw-breakpoints: Wrap in the KVM breakpoint active state check
Wrap in the cpu dr7 check that tells if we have active
breakpoints that need to be restored in the cpu.

This wrapper makes the check more self-explainable and also
reusable for any further other uses.

Reported-by: Jan Kiszka <jan.kiszka@web.de>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: "K. Prasad" <prasad@linux.vnet.ibm.com>
2009-11-10 11:23:43 +01:00
Frederic Weisbecker
24f1e32c60 hw-breakpoints: Rewrite the hw-breakpoints layer on top of perf events
This patch rebase the implementation of the breakpoints API on top of
perf events instances.

Each breakpoints are now perf events that handle the
register scheduling, thread/cpu attachment, etc..

The new layering is now made as follows:

       ptrace       kgdb      ftrace   perf syscall
          \          |          /         /
           \         |         /         /
                                        /
            Core breakpoint API        /
                                      /
                     |               /
                     |              /

              Breakpoints perf events

                     |
                     |

               Breakpoints PMU ---- Debug Register constraints handling
                                    (Part of core breakpoint API)
                     |
                     |

             Hardware debug registers

Reasons of this rewrite:

- Use the centralized/optimized pmu registers scheduling,
  implying an easier arch integration
- More powerful register handling: perf attributes (pinned/flexible
  events, exclusive/non-exclusive, tunable period, etc...)

Impact:

- New perf ABI: the hardware breakpoints counters
- Ptrace breakpoints setting remains tricky and still needs some per
  thread breakpoints references.

Todo (in the order):

- Support breakpoints perf counter events for perf tools (ie: implement
  perf_bpcounter_event())
- Support from perf tools

Changes in v2:

- Follow the perf "event " rename
- The ptrace regression have been fixed (ptrace breakpoint perf events
  weren't released when a task ended)
- Drop the struct hw_breakpoint and store generic fields in
  perf_event_attr.
- Separate core and arch specific headers, drop
  asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
- Use new generic len/type for breakpoint
- Handle off case: when breakpoints api is not supported by an arch

Changes in v3:

- Fix broken CONFIG_KVM, we need to propagate the breakpoint api
  changes to kvm when we exit the guest and restore the bp registers
  to the host.

Changes in v4:

- Drop the hw_breakpoint_restore() stub as it is only used by KVM
- EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a
  module
- Restore the breakpoints unconditionally on kvm guest exit:
  TIF_DEBUG_THREAD doesn't anymore cover every cases of running
  breakpoints and vcpu->arch.switch_db_regs might not always be
  set when the guest used debug registers.
  (Waiting for a reliable optimization)

Changes in v5:

- Split-up the asm-generic/hw-breakpoint.h moving to
  linux/hw_breakpoint.h into a separate patch
- Optimize the breakpoints restoring while switching from kvm guest
  to host. We only want to restore the state if we have active
  breakpoints to the host, otherwise we don't care about messed-up
  address registers.
- Add asm/hw_breakpoint.h to Kbuild
- Fix bad breakpoint type in trace_selftest.c

Changes in v6:

- Fix wrong header inclusion in trace.h (triggered a build
  error with CONFIG_FTRACE_SELFTEST

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Jan Kiszka <jan.kiszka@web.de>
Cc: Jiri Slaby <jirislaby@gmail.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Masami Hiramatsu <mhiramat@redhat.com>
Cc: Paul Mundt <lethal@linux-sh.org>
2009-11-08 15:34:42 +01:00
Gleb Natapov
abb3911965 KVM: get_tss_base_addr() should return a gpa_t
If TSS we are switching to resides in high memory task switch will fail
since address will be truncated. Windows2k3 does this sometimes when
running with more then 4G

Cc: stable@kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-11-04 12:42:36 -02:00
Jan Kiszka
a9e38c3e01 KVM: x86: Catch potential overrun in MCE setup
We only allocate memory for 32 MCE banks (KVM_MAX_MCE_BANKS) but we
allow user space to fill up to 255 on setup (mcg_cap & 0xff), corrupting
kernel memory. Catch these overflows.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-11-04 12:42:35 -02:00
Tejun Heo
0fe1e00954 percpu: make percpu symbols in x86 unique
This patch updates percpu related symbols in x86 such that percpu
symbols are unique and don't clash with local symbols.  This serves
two purposes of decreasing the possibility of global percpu symbol
collision and allowing dropping per_cpu__ prefix from percpu symbols.

* arch/x86/kernel/cpu/common.c: rename local variable to avoid collision

* arch/x86/kvm/svm.c: s/svm_data/sd/ for local variables to avoid collision

* arch/x86/kernel/cpu/cpu_debug.c: s/cpu_arr/cpud_arr/
  				   s/priv_arr/cpud_priv_arr/
				   s/cpu_priv_count/cpud_priv_count/

* arch/x86/kernel/cpu/intel_cacheinfo.c: s/cpuid4_info/ici_cpuid4_info/
  					 s/cache_kobject/ici_cache_kobject/
					 s/index_kobject/ici_index_kobject/

* arch/x86/kernel/ds.c: s/cpu_context/cpu_ds_context/

Partly based on Rusty Russell's "alloc_percpu: rename percpu vars
which cause name clashes" patch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: (kvm) Avi Kivity <avi@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: x86@kernel.org
2009-10-29 22:34:14 +09:00
Frederic Weisbecker
0f8f86c7bd Merge commit 'perf/core' into perf/hw-breakpoint
Conflicts:
	kernel/Makefile
	kernel/trace/Makefile
	kernel/trace/trace.h
	samples/Makefile

Merge reason: We need to be uptodate with the perf events development
branch because we plan to rewrite the breakpoints API on top of
perf events.
2009-10-18 01:12:33 +02:00
Frederik Deweerdt
8a8365c560 KVM: MMU: fix pointer cast
On a 32 bits compile, commit 3da0dd433d
introduced the following warnings:

arch/x86/kvm/mmu.c: In function ‘kvm_set_pte_rmapp’:
arch/x86/kvm/mmu.c:770: warning: cast to pointer from integer of different size
arch/x86/kvm/mmu.c: In function ‘kvm_set_spte_hva’:
arch/x86/kvm/mmu.c:849: warning: cast from pointer to integer of different size

The following patch uses 'unsigned long' instead of u64 to match the
pointer size on both arches.

Signed-off-by: Frederik Deweerdt <frederik.deweerdt@xprog.eu>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-16 12:30:26 -03:00
Marcelo Tosatti
ace1546487 KVM: use proper hrtimer function to retrieve expiration time
hrtimer->base can be temporarily NULL due to racing hrtimer_start.
See switch_hrtimer_base/lock_hrtimer_base.

Use hrtimer_get_remaining which is robust against it.

CC: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-10-16 12:30:25 -03:00
Izik Eidus
3da0dd433d KVM: add support for change_pte mmu notifiers
this is needed for kvm if it want ksm to directly map pages into its
shadow page tables.

[marcelo: cast pfn assignment to u64]

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 17:04:53 +02:00
Izik Eidus
1403283acc KVM: MMU: add SPTE_HOST_WRITEABLE flag to the shadow ptes
this flag notify that the host physical page we are pointing to from
the spte is write protected, and therefore we cant change its access
to be write unless we run get_user_pages(write = 1).

(this is needed for change_pte support in kvm)

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 17:04:50 +02:00
Izik Eidus
acb66dd051 KVM: MMU: dont hold pagecount reference for mapped sptes pages
When using mmu notifiers, we are allowed to remove the page count
reference tooken by get_user_pages to a specific page that is mapped
inside the shadow page tables.

This is needed so we can balance the pagecount against mapcount
checking.

(Right now kvm increase the pagecount and does not increase the
mapcount when mapping page into shadow page table entry,
so when comparing pagecount against mapcount, you have no
reliable result.)

Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 17:04:48 +02:00
Avi Kivity
6a54435560 KVM: Prevent overflow in KVM_GET_SUPPORTED_CPUID
The number of entries is multiplied by the entry size, which can
overflow on 32-bit hosts.  Bound the entry count instead.

Reported-by: David Wagner <daw@cs.berkeley.edu>
Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-10-04 17:04:16 +02:00
Marcelo Tosatti
eb5109e311 KVM: VMX: flush TLB with INVEPT on cpu migration
It is possible that stale EPTP-tagged mappings are used, if a
vcpu migrates to a different pcpu.

Set KVM_REQ_TLB_FLUSH in vmx_vcpu_load, when switching pcpus, which
will invalidate both VPID and EPT mappings on the next vm-entry.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:24 +02:00
Aurelien Jarno
b2d83cfa3f KVM: fix LAPIC timer period overflow
Don't overflow when computing the 64-bit period from 32-bit registers.

Fixes sourceforge bug #2826486.

Signed-off-by: Aurelien Jarno <aurelien@aurel32.net>
Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:23 +02:00
Joerg Roedel
20824f30bb KVM: SVM: Handle tsc in svm_get_msr/svm_set_msr correctly
When running nested we need to touch the l1 guests
tsc_offset. Otherwise changes will be lost or a wrong value
be read.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:23 +02:00
Joerg Roedel
77b1ab1732 KVM: SVM: Fix tsc offset adjustment when running nested
When svm_vcpu_load is called while the vcpu is running in
guest mode the tsc adjustment made there is lost on the next
emulated #vmexit. This causes the tsc running backwards in
the guest. This patch fixes the issue by also adjusting the
tsc_offset in the emulated hsave area so that it will not
get lost.

Cc: stable@kernel.org
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-10-04 13:57:22 +02:00
Ingo Molnar
dca2d6ac09 Merge branch 'linus' into tracing/hw-breakpoints
Conflicts:
	arch/x86/kernel/process_64.c

Semantic conflict fixed in:
	arch/x86/kvm/x86.c

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-15 12:18:15 +02:00
Linus Torvalds
69def9f05d Merge branch 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm
* 'kvm-updates/2.6.32' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (202 commits)
  MAINTAINERS: update KVM entry
  KVM: correct error-handling code
  KVM: fix compile warnings on s390
  KVM: VMX: Check cpl before emulating debug register access
  KVM: fix misreporting of coalesced interrupts by kvm tracer
  KVM: x86: drop duplicate kvm_flush_remote_tlb calls
  KVM: VMX: call vmx_load_host_state() only if msr is cached
  KVM: VMX: Conditionally reload debug register 6
  KVM: Use thread debug register storage instead of kvm specific data
  KVM guest: do not batch pte updates from interrupt context
  KVM: Fix coalesced interrupt reporting in IOAPIC
  KVM guest: fix bogus wallclock physical address calculation
  KVM: VMX: Fix cr8 exiting control clobbering by EPT
  KVM: Optimize kvm_mmu_unprotect_page_virt() for tdp
  KVM: Document KVM_CAP_IRQCHIP
  KVM: Protect update_cr8_intercept() when running without an apic
  KVM: VMX: Fix EPT with WP bit change during paging
  KVM: Use kvm_{read,write}_guest_virt() to read and write segment descriptors
  KVM: x86 emulator: Add adc and sbb missing decoder flags
  KVM: Add missing #include
  ...
2009-09-14 17:43:43 -07:00
Avi Kivity
0a79b00952 KVM: VMX: Check cpl before emulating debug register access
Debug registers may only be accessed from cpl 0.  Unfortunately, vmx will
code to emulate the instruction even though it was issued from guest
userspace, possibly leading to an unexpected trap later.

Cc: stable@kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:10 +03:00
Gleb Natapov
4da748960a KVM: fix misreporting of coalesced interrupts by kvm tracer
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:09 +03:00
Marcelo Tosatti
e3904e6ed0 KVM: x86: drop duplicate kvm_flush_remote_tlb calls
kvm_mmu_slot_remove_write_access already calls it.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 18:11:08 +03:00
Gleb Natapov
542423b0dd KVM: VMX: call vmx_load_host_state() only if msr is cached
No need to call it before each kvm_(set|get)_msr_common()

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:07 +03:00
Avi Kivity
e8a48342e9 KVM: VMX: Conditionally reload debug register 6
Only reload debug register 6 if we're running with the guest's
debug registers.  Saves around 150 cycles from the guest lightweight
exit path.

dr6 contains a couple of bits that are updated on #DB, so intercept
that unconditionally and update those bits then.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 18:11:06 +03:00
Avi Kivity
3d53c27d05 KVM: Use thread debug register storage instead of kvm specific data
Instead of saving the debug registers from the processor to a kvm data
structure, rely in the debug registers stored in the thread structure.
This allows us not to save dr6 and dr7.

Reduces lightweight vmexit cost by 350 cycles, or 11 percent.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 18:11:04 +03:00
Gleb Natapov
5fff7d270b KVM: VMX: Fix cr8 exiting control clobbering by EPT
Don't call adjust_vmx_controls() two times for the same control.
It restores options that were dropped earlier.  This loses us the cr8
exit control, which causes a massive performance regression Windows x64.

Cc: stable@kernel.org
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:57 +03:00
Avi Kivity
60f24784a9 KVM: Optimize kvm_mmu_unprotect_page_virt() for tdp
We know no pages are protected, so we can short-circuit the whole thing
(including fairly nasty guest memory accesses).

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:56 +03:00
Avi Kivity
88c808fd42 KVM: Protect update_cr8_intercept() when running without an apic
update_cr8_intercept() can be triggered from userspace while there
is no apic present.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:54 +03:00
Sheng Yang
95eb84a758 KVM: VMX: Fix EPT with WP bit change during paging
QNX update WP bit when paging enabled, which is not covered yet. This one fix
QNX boot with EPT.

Cc: stable@kernel.org
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:53 +03:00
Mikhail Ershov
d9048d3278 KVM: Use kvm_{read,write}_guest_virt() to read and write segment descriptors
Segment descriptors tables can be placed on two non-contiguous pages.
This patch makes reading segment descriptors by linear address.

Signed-off-by: Mikhail Ershov <Mike.Ershov@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:52 +03:00
Mohammed Gamal
7bdb588827 KVM: x86 emulator: Add adc and sbb missing decoder flags
Add missing decoder flags for adc and sbb instructions
(opcodes 0x14-0x15, 0x1c-0x1d)

Signed-off-by: Mohammed Gamal <m.gamal005@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:51 +03:00
Avi Kivity
56e8231841 KVM: Rename x86_emulate.c to emulate.c
We're in arch/x86, what could we possibly be emulating?

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:45 +03:00
Anthony Liguori
c0c7c04b87 KVM: When switching to a vm8086 task, load segments as 16-bit
According to 16.2.5 in the SDM, eflags.vm in the tss is consulted before loading
and new segments.  If eflags.vm == 1, then the segments are treated as 16-bit
segments.  The LDTR and TR are not normally available in vm86 mode so if they
happen to somehow get loaded, they need to be treated as 32-bit segments.

This fixes an invalid vmentry failure in a custom OS that was happening after
a task switch into vm8086 mode.  Since the segments were being mistakenly
treated as 32-bit, we loaded garbage state.

Signed-off-by: Anthony Liguori <aliguori@us.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:44 +03:00
Avi Kivity
345dcaa8fd KVM: VMX: Adjust rflags if in real mode emulation
We set rflags.vm86 when virtualizing real mode to do through vm8086 mode;
so we need to take it out again when reading rflags.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:43 +03:00
Avi Kivity
52c7847d12 KVM: SVM: Drop tlb flush workaround in npt
It is no longer possible to reproduce the problem any more, so presumably
it has been fixed.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:40 +03:00
Gleb Natapov
cb142eb743 KVM: Update cr8 intercept when APIC TPR is changed by userspace
Since on vcpu entry we do it only if apic is enabled we should do
it when TPR is changed while apic is disabled. This happens when windows
resets HW without setting TPR to zero.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:39 +03:00
Joerg Roedel
4b6e4dca70 KVM: SVM: enable nested svm by default
Nested SVM is (in my experience) stable enough to be enabled by
default. So omit the requirement to pass a module parameter.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:38 +03:00
Joerg Roedel
108768de55 KVM: SVM: check for nested VINTR flag in svm_interrupt_allowed
Not checking for this flag breaks any nested hypervisor that does not
set VINTR. So fix it with this patch.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:37 +03:00
Joerg Roedel
26666957a5 KVM: SVM: move nested_svm_intr main logic out of if-clause
This patch removes one indentation level from nested_svm_intr and
makes the logic more readable.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:36 +03:00
Joerg Roedel
cda0ffdd86 KVM: SVM: remove unnecessary is_nested check from svm_cpu_run
This check is not necessary. We have to sync the vcpu->arch.cr2 always
back to the VMCB. This patch remove the is_nested check.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:34 +03:00
Joerg Roedel
410e4d573d KVM: SVM: move special nested exit handling to separate function
This patch moves the handling for special nested vmexits like #pf to a
separate function. This makes the kvm_override parameter obsolete and
makes the code more readable.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:33 +03:00
Joerg Roedel
1f8da47805 KVM: SVM: handle errors in vmrun emulation path appropriatly
If nested svm fails to load the msrpm the vmrun succeeds with the old
msrpm which is not correct. This patch changes the logic to roll back
to host mode in case the msrpm cannot be loaded.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:32 +03:00
Joerg Roedel
ea8e064fe2 KVM: SVM: remove nested_svm_do and helper functions
This function is not longer required. So remove it.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:31 +03:00
Joerg Roedel
9738b2c97d KVM: SVM: clean up nested vmrun path
This patch removes the usage of nested_svm_do from the vmrun emulation
path.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:29 +03:00
Joerg Roedel
9966bf6872 KVM: SVM: clean up nestec vmload/vmsave paths
This patch removes the usage of nested_svm_do from the vmload and
vmsave emulation code paths.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:46:28 +03:00
Joerg Roedel
3d62d9aa98 KVM: SVM: clean up nested_svm_exit_handled_msr
This patch changes nested svm to call nested_svm_exit_handled_msr
directly and not through nested_svm_do.

[alex: fix oops due to nested kmap_atomics]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 10:45:43 +03:00
Joerg Roedel
34f80cfad5 KVM: SVM: get rid of nested_svm_vmexit_real
This patch is the starting point of removing nested_svm_do from the
nested svm code. The nested_svm_do function basically maps two guest
physical pages to host virtual addresses and calls a passed function
on it. This function pointer code flow is hard to read and not the
best technical solution here.
As a side effect this patch indroduces the nested_svm_[un]map helper
functions.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:26 +03:00
Joerg Roedel
0295ad7de8 KVM: SVM: simplify nested_svm_check_exception
Makes the code of this function more readable by removing on
indentation level for the core logic.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:25 +03:00
Joerg Roedel
9c4e40b994 KVM: SVM: do nested vmexit in nested_svm_exit_handled
If this function returns true a nested vmexit is required. Move that
vmexit into the nested_svm_exit_handled function. This also simplifies
the handling of nested #pf intercepts in this function.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:25 +03:00
Joerg Roedel
4c2161aed5 KVM: SVM: consolidate nested_svm_exit_handled
When caching guest intercepts there is no need anymore for the
nested_svm_exit_handled_real function. So move its code into
nested_svm_exit_handled.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:24 +03:00
Joerg Roedel
aad42c641c KVM: SVM: cache nested intercepts
When the nested intercepts are cached we don't need to call
get_user_pages and/or map the nested vmcb on every nested #vmexit to
check who will handle the intercept.
Further this patch aligns the emulated svm behavior better to real
hardware.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:24 +03:00
Joerg Roedel
e6aa9abd73 KVM: SVM: move nested svm state into seperate struct
This makes it more clear for which purpose these members in the vcpu_svm
exist.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:24 +03:00
Joerg Roedel
a5c3832dfe KVM: SVM: complete interrupts after handling nested exits
The interrupt completion code must run after nested exits are handled
because not injected interrupts or exceptions may be handled by the l1
guest first.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:24 +03:00
Joerg Roedel
0460a979b4 KVM: SVM: copy only necessary parts of the control area on vmrun/vmexit
The vmcb control area contains more then 800 bytes of reserved fields
which are unnecessarily copied. Fix this by introducing a copy
function which only copies the relevant part and saves time.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:23 +03:00
Joerg Roedel
defbba5660 KVM: SVM: optimize nested vmrun
Only copy the necessary parts of the vmcb save area on vmrun and save
precious time.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:23 +03:00
Joerg Roedel
33740e4009 KVM: SVM: optimize nested #vmexit
It is more efficient to copy only the relevant parts of the vmcb back to
the nested vmcb when we emulate an vmexit.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Acked-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:23 +03:00
Joerg Roedel
2af9194d1b KVM: SVM: add helper functions for global interrupt flag
This patch makes the code easier to read when it comes to setting,
clearing and checking the status of the virtualized global
interrupt flag for the VCPU.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:23 +03:00
Gleb Natapov
88ba63c265 KVM: Replace pic_lock()/pic_unlock() with direct call to spinlock functions
They are not doing anything else now.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:22 +03:00
Gleb Natapov
938396a234 KVM: Call ack notifiers from PIC when guest OS acks an IRQ.
Currently they are called when irq vector is been delivered.  Calling ack
notifiers at this point is wrong.  Device assignment ack notifier enables
host interrupts, but guest not yet had a chance to clear interrupt
condition in a device.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:22 +03:00
Gleb Natapov
956f97cf66 KVM: Call kvm_vcpu_kick() inside pic spinlock
d5ecfdd25 moved it out because back than it was impossible to
call it inside spinlock. This restriction no longer exists.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Roel Kluin
3a34a8810b KVM: fix EFER read buffer overflow
Check whether index is within bounds before grabbing the element.

Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Amit Shah
1f3ee616dd KVM: ignore reads to perfctr msrs
We ignore writes to the perfctr msrs. Ignore reads as well.

Kaspersky antivirus crashes Windows guests if it can't read
these MSRs.

Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Avi Kivity
eab4b8aa34 KVM: VMX: Optimize vmx_get_cpl()
Instead of calling vmx_get_segment() (which reads a whole bunch of
vmcs fields), read only the cs selector which contains the cpl.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:21 +03:00
Jan Kiszka
07708c4af1 KVM: x86: Disallow hypercalls for guest callers in rings > 0
So far unprivileged guest callers running in ring 3 can issue, e.g., MMU
hypercalls. Normally, such callers cannot provide any hand-crafted MMU
command structure as it has to be passed by its physical address, but
they can still crash the guest kernel by passing random addresses.

To close the hole, this patch considers hypercalls valid only if issued
from guest ring 0. This may still be relaxed on a per-hypercall base in
the future once required.

Cc: stable@kernel.org
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:20 +03:00
Marcelo Tosatti
b90c062c65 KVM: MMU: fix bogus alloc_mmu_pages assignment
Remove the bogus n_free_mmu_pages assignment from alloc_mmu_pages.

It breaks accounting of mmu pages, since n_free_mmu_pages is modified
but the real number of pages remains the same.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:20 +03:00
Izik Eidus
3b80fffe2b KVM: MMU: make __kvm_mmu_free_some_pages handle empty list
First check if the list is empty before attempting to look at list
entries.

Cc: stable@kernel.org
Signed-off-by: Izik Eidus <ieidus@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:20 +03:00
Bartlomiej Zolnierkiewicz
95fb4eb698 KVM: remove superfluous NULL pointer check in kvm_inject_pit_timer_irqs()
This takes care of the following entries from Dan's list:

arch/x86/kvm/i8254.c +714 kvm_inject_pit_timer_irqs(6) warning: variable derefenced in initializer 'vcpu'
arch/x86/kvm/i8254.c +714 kvm_inject_pit_timer_irqs(6) warning: variable derefenced before check 'vcpu'

Reported-by: Dan Carpenter <error27@gmail.com>
Cc: corbet@lwn.net
Cc: eteo@redhat.com
Cc: Julia Lawall <julia@diku.dk>
Signed-off-by: Bartlomiej Zolnierkiewicz <bzolnier@gmail.com>
Acked-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:19 +03:00
Joerg Roedel
344f414fa0 KVM: report 1GB page support to userspace
If userspace knows that the kernel part supports 1GB pages it can enable
the corresponding cpuid bit so that guests actually use GB pages.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:19 +03:00
Joerg Roedel
7e4e4056f7 KVM: MMU: shadow support for 1gb pages
This patch adds support for shadow paging to the 1gb page table code in KVM.
With this code the guest can use 1gb pages even if the host does not support
them.

[ Marcelo: fix shadow page collision on pmd level if a guest 1gb page is mapped
           with 4kb ptes on host level ]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:19 +03:00
Joerg Roedel
e04da980c3 KVM: MMU: make page walker aware of mapping levels
The page walker may be used with nested paging too when accessing mmio
areas.  Make it support the additional page-level too.

[ Marcelo: fix reserved bit check for 1gb pte ]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:18 +03:00
Joerg Roedel
852e3c19ac KVM: MMU: make direct mapping paths aware of mapping levels
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:18 +03:00
Joerg Roedel
d25797b24c KVM: MMU: rename is_largepage_backed to mapping_level
With the new name and the corresponding backend changes this function
can now support multiple hugepage sizes.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:18 +03:00
Joerg Roedel
44ad9944f1 KVM: MMU: make rmap code aware of mapping levels
This patch removes the largepage parameter from the rmap_add function.
Together with rmap_remove this function now uses the role.level field to
find determine if the page is a huge page.

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:18 +03:00
Marcelo Tosatti
1444885a04 KVM: limit lapic periodic timer frequency
Otherwise its possible to starve the host by programming lapic timer
with a very high frequency.

Cc: stable@kernel.org
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:17 +03:00
Mikhail Ershov
5f0269f5d7 KVM: Align cr8 threshold when userspace changes cr8
Commit f0a3602c20 ("KVM: Move interrupt injection logic to x86.c") does not
update the cr8 intercept if the lapic is disabled, so when userspace updates
cr8, the cr8 threshold control is not updated and we are left with illegal
control fields.

Fix by explicitly resetting the cr8 threshold.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:17 +03:00
Jan Kiszka
7f582ab6d8 KVM: VMX: Avoid to return ENOTSUPP to userland
Choose some allowed error values for the cases VMX returned ENOTSUPP so
far as these values could be returned by the KVM_RUN IOCTL.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:17 +03:00
Gleb Natapov
84fde248fe KVM: PIT: Unregister ack notifier callback when freeing
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:16 +03:00
Sheng Yang
b927a3cec0 KVM: VMX: Introduce KVM_SET_IDENTITY_MAP_ADDR ioctl
Now KVM allow guest to modify guest's physical address of EPT's identity mapping page.

(change from v1, discard unnecessary check, change ioctl to accept parameter
address rather than value)

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:16 +03:00
Akinobu Mita
b792c344df KVM: x86: use kvm_get_gdt() and kvm_read_ldt()
Use kvm_get_gdt() and kvm_read_ldt() to reduce inline assembly code.

Cc: Avi Kivity <avi@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:15 +03:00
Akinobu Mita
46a359e715 KVM: x86: use get_desc_base() and get_desc_limit()
Use get_desc_base() and get_desc_limit() to get the base address and
limit in desc_struct.

Cc: Avi Kivity <avi@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:15 +03:00
Marcelo Tosatti
6a1ac77110 KVM: MMU: fix missing locking in alloc_mmu_pages
n_requested_mmu_pages/n_free_mmu_pages are used by
kvm_mmu_change_mmu_pages to calculate the number of pages to zap.

alloc_mmu_pages, called from the vcpu initialization path, modifies this
variables without proper locking, which can result in a negative value
in kvm_mmu_change_mmu_pages (say, with cpu hotplug).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:14 +03:00
Sheng Yang
3662cb1cd6 KVM: Discard unnecessary kvm_mmu_flush_tlb() in kvm_mmu_load()
set_cr3() should already cover the TLB flushing.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:14 +03:00
Gleb Natapov
4088bb3cae KVM: silence lapic kernel messages that can be triggered by a guest
Some Linux versions (f8) try to read EOI register that is write only.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2009-09-10 08:33:14 +03:00
Gleb Natapov
a1b37100d9 KVM: Reduce runnability interface with arch support code
Remove kvm_cpu_has_interrupt() and kvm_arch_interrupt_allowed() from
interface between general code and arch code. kvm_arch_vcpu_runnable()
checks for interrupts instead.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:13 +03:00
Gleb Natapov
b59bb7bdf0 KVM: Move exception handling to the same place as other events
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:13 +03:00
Joerg Roedel
a205bc19f0 KVM: MMU: Fix MMU_DEBUG compile breakage
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:13 +03:00
Gregory Haskins
d34e6b175e KVM: add ioeventfd support
ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
signal when written to by a guest.  Host userspace can register any
arbitrary IO address with a corresponding eventfd and then pass the eventfd
to a specific end-point of interest for handling.

Normal IO requires a blocking round-trip since the operation may cause
side-effects in the emulated model or may return data to the caller.
Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
"heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
device model synchronously before returning control back to the vcpu.

However, there is a subclass of IO which acts purely as a trigger for
other IO (such as to kick off an out-of-band DMA request, etc).  For these
patterns, the synchronous call is particularly expensive since we really
only want to simply get our notification transmitted asychronously and
return as quickly as possible.  All the sychronous infrastructure to ensure
proper data-dependencies are met in the normal IO case are just unecessary
overhead for signalling.  This adds additional computational load on the
system, as well as latency to the signalling path.

Therefore, we provide a mechanism for registration of an in-kernel trigger
point that allows the VCPU to only require a very brief, lightweight
exit just long enough to signal an eventfd.  This also means that any
clients compatible with the eventfd interface (which includes userspace
and kernelspace equally well) can now register to be notified. The end
result should be a more flexible and higher performance notification API
for the backend KVM hypervisor and perhipheral components.

To test this theory, we built a test-harness called "doorbell".  This
module has a function called "doorbell_ring()" which simply increments a
counter for each time the doorbell is signaled.  It supports signalling
from either an eventfd, or an ioctl().

We then wired up two paths to the doorbell: One via QEMU via a registered
io region and through the doorbell ioctl().  The other is direct via
ioeventfd.

You can download this test harness here:

ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2

The measured results are as follows:

qemu-mmio:       110000 iops, 9.09us rtt
ioeventfd-mmio: 200100 iops, 5.00us rtt
ioeventfd-pio:  367300 iops, 2.72us rtt

I didn't measure qemu-pio, because I have to figure out how to register a
PIO region with qemu's device model, and I got lazy.  However, for now we
can extrapolate based on the data from the NULLIO runs of +2.56us for MMIO,
and -350ns for HC, we get:

qemu-pio:      153139 iops, 6.53us rtt
ioeventfd-hc: 412585 iops, 2.37us rtt

these are just for fun, for now, until I can gather more data.

Here is a graph for your convenience:

http://developer.novell.com/wiki/images/7/76/Iofd-chart.png

The conclusion to draw is that we save about 4us by skipping the userspace
hop.

--------------------

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:12 +03:00
Gregory Haskins
090b7aff27 KVM: make io_bus interface more robust
Today kvm_io_bus_regsiter_dev() returns void and will internally BUG_ON
if it fails.  We want to create dynamic MMIO/PIO entries driven from
userspace later in the series, so we need to enhance the code to be more
robust with the following changes:

   1) Add a return value to the registration function
   2) Fix up all the callsites to check the return code, handle any
      failures, and percolate the error up to the caller.
   3) Add an unregister function that collapses holes in the array

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:12 +03:00
Beth Kon
e9f4275732 KVM: PIT support for HPET legacy mode
When kvm is in hpet_legacy_mode, the hpet is providing the timer
interrupt and the pit should not be. So in legacy mode, the pit timer
is destroyed, but the *state* of the pit is maintained. So if kvm or
the guest tries to modify the state of the pit, this modification is
accepted, *except* that the timer isn't actually started. When we exit
hpet_legacy_mode, the current state of the pit (which is up to date
since we've been accepting modifications) is used to restart the pit
timer.

The saved_mode code in kvm_pit_load_count temporarily changes mode to
0xff in order to destroy the timer, but then restores the actual
value, again maintaining "current" state of the pit for possible later
reenablement.

[avi: add some reserved storage in the ioctl; make SET_PIT2 IOW]
[marcelo: fix memory corruption due to reserved storage]

Signed-off-by: Beth Kon <eak@us.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:12 +03:00
Gleb Natapov
0d1de2d901 KVM: Always report x2apic as supported feature
We emulate x2apic in software, so host support is not required.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:11 +03:00
Gleb Natapov
c7f0f24b1f KVM: No need to kick cpu if not in a guest mode
This will save a couple of IPIs.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:11 +03:00
Gleb Natapov
1000ff8d89 KVM: Add trace points in irqchip code
Add tracepoint in msi/ioapic/pic set_irq() functions,
in IPI sending and in the point where IRQ is placed into
apic's IRR.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:11 +03:00
Andre Przywara
f7c6d14003 KVM: fix MMIO_CONF_BASE MSR access
Some Windows versions check whether the BIOS has setup MMI/O for
config space accesses on AMD Fam10h CPUs, we say "no" by returning 0 on
reads and only allow disabling of MMI/O CfgSpace setup by igoring "0" writes.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:10 +03:00
Avi Kivity
f691fe1da7 KVM: Trace shadow page lifecycle
Create, sync, unsync, zap.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:10 +03:00
Avi Kivity
0742017159 KVM: MMU: Trace guest pagetable walker
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:09 +03:00
Jan Kiszka
dc7e795e3d Revert "KVM: x86: check for cr3 validity in ioctl_set_sregs"
This reverts commit 6c20e1442bb1c62914bb85b7f4a38973d2a423ba.

To my understanding, it became obsolete with the advent of the more
robust check in mmu_alloc_roots (89da4ff17f). Moreover, it prevents
the conceptually safe pattern

 1. set sregs
 2. register mem-slots
 3. run vcpu

by setting a sticky triple fault during step 1.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:09 +03:00
Andre Przywara
6098ca939e KVM: handle AMD microcode MSR
Windows 7 tries to update the CPU's microcode on some processors,
so we ignore the MSR write here. The patchlevel register is already handled
(returning 0), because the MSR number is the same as Intel's.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:08 +03:00
Sheng Yang
756975bbfd KVM: Fix apic_mmio_write return for unaligned write
Some in-famous OS do unaligned writing for APIC MMIO, and the return value
has been missed in recent change, then the OS hangs.

Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:08 +03:00
Gleb Natapov
0105d1a526 KVM: x2apic interface to lapic
This patch implements MSR interface to local apic as defines by x2apic
Intel specification.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:08 +03:00
Gleb Natapov
fc61b800f9 KVM: Add Directed EOI support to APIC emulation
Directed EOI is specified by x2APIC, but is available even when lapic is
in xAPIC mode.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:07 +03:00
Avi Kivity
cb24772140 KVM: Trace apic registers using their symbolic names
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:07 +03:00
Avi Kivity
aec51dc4f1 KVM: Trace mmio
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:07 +03:00
Andre Przywara
c323c0e5f0 KVM: Ignore PCI ECS I/O enablement
Linux guests will try to enable access to the extended PCI config space
via the I/O ports 0xCF8/0xCFC on AMD Fam10h CPU. Since we (currently?)
don't use ECS, simply ignore write and read attempts.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:06 +03:00
Michael S. Tsirkin
bda9020e24 KVM: remove in_range from io devices
This changes bus accesses to use high-level kvm_io_bus_read/kvm_io_bus_write
functions. in_range now becomes unused so it is removed from device ops in
favor of read/write callbacks performing range checks internally.

This allows aliasing (mostly for in-kernel virtio), as well as better error
handling by making it possible to pass errors up to userspace.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:05 +03:00
Michael S. Tsirkin
6c47469453 KVM: convert bus to slots_lock
Use slots_lock to protect device list on the bus.  slots_lock is already
taken for read everywhere, so we only need to take it for write when
registering devices.  This is in preparation to removing in_range and
kvm->lock around it.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:05 +03:00
Michael S. Tsirkin
108b56690f KVM: switch pit creation to slots_lock
switch pit creation to slots_lock. slots_lock is already taken for read
everywhere, so we only need to take it for write when creating pit.
This is in preparation to removing in_range and kvm->lock around it.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:05 +03:00
Marcelo Tosatti
2023a29cbe KVM: remove old KVMTRACE support code
Return EOPNOTSUPP for KVM_TRACE_ENABLE/PAUSE/DISABLE ioctls.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:03 +03:00
Andre Przywara
ed85c06853 KVM: introduce module parameter for ignoring unknown MSRs accesses
KVM will inject a #GP into the guest if that tries to access unhandled
MSRs. This will crash many guests. Although it would be the correct
way to actually handle these MSRs, we introduce a runtime switchable
module param called "ignore_msrs" (defaults to 0). If this is Y, unknown
MSR reads will return 0, while MSR writes are simply dropped. In both cases
we print a message to dmesg to inform the user about that.

You can change the behaviour at any time by saying:

 # echo 1 > /sys/modules/kvm/parameters/ignore_msrs

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:03 +03:00
Andre Przywara
1fdbd48c24 KVM: ignore reads from AMDs C1E enabled MSR
If the Linux kernel detects an C1E capable AMD processor (K8 RevF and
higher), it will access a certain MSR on every attempt to go to halt.
Explicitly handle this read and return 0 to let KVM run a Linux guest
with the native AMD host CPU propagated to the guest.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:03 +03:00
Andre Przywara
8f1589d95e KVM: ignore AMDs HWCR register access to set the FFDIS bit
Linux tries to disable the flush filter on all AMD K8 CPUs. Since KVM
does not handle the needed MSR, the injected #GP will panic the Linux
kernel. Ignore setting of the HWCR.FFDIS bit in this MSR to let Linux
boot with an AMD K8 family guest CPU.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:02 +03:00
Marcelo Tosatti
894a9c5543 KVM: x86: missing locking in PIT/IRQCHIP/SET_BSP_CPU ioctl paths
Correct missing locking in a few places in x86's vm_ioctl handling path.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:02 +03:00
Joerg Roedel
ec04b2604c KVM: Prepare memslot data structures for multiple hugepage sizes
[avi: fix build on non-x86]

Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:02 +03:00
Andre Przywara
4668f05078 KVM: x86 emulator: Add sysexit emulation
Handle #UD intercept of the sysexit instruction in 64bit mode returning to
32bit compat mode on an AMD host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual.

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:01 +03:00
Andre Przywara
8c60435261 KVM: x86 emulator: Add sysenter emulation
Handle #UD intercept of the sysenter instruction in 32bit compat mode on
an AMD host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual.

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:01 +03:00
Andre Przywara
e66bb2ccdc KVM: x86 emulator: add syscall emulation
Handle #UD intercept of the syscall instruction in 32bit compat mode on
an Intel host.
Setup the segment descriptors for CS and SS and the EIP/ESP registers
according to the manual. Save the RIP and EFLAGS to the correct registers.

[avi: fix build on i386 due to missing R11]

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:00 +03:00
Andre Przywara
e99f050712 KVM: x86 emulator: Prepare for emulation of syscall instructions
Add the flags needed for syscall, sysenter and sysexit to the opcode table.
Catch (but for now ignore) the opcodes in the emulation switch/case.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:00 +03:00
Andre Przywara
b1d861431e KVM: x86 emulator: Add missing EFLAGS bit definitions
Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:33:00 +03:00
Andre Przywara
0cb5762ed2 KVM: Allow emulation of syscalls instructions on #UD
Add the opcodes for syscall, sysenter and sysexit to the list of instructions
handled by the undefined opcode handler.

Signed-off-by: Christoph Egger <christoph.egger@amd.com>
Signed-off-by: Amit Shah <amit.shah@redhat.com>
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:59 +03:00
Marcelo Tosatti
229456fc34 KVM: convert custom marker based tracing to event traces
This allows use of the powerful ftrace infrastructure.

See Documentation/trace/ for usage information.

[avi, stephen: various build fixes]
[sheng: fix control register breakage]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Sheng Yang <sheng@linux.intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:59 +03:00
Alexander Graf
219b65dcf6 KVM: SVM: Improve nested interrupt injection
While trying to get Hyper-V running, I realized that the interrupt injection
mechanisms that are in place right now are not 100% correct.

This patch makes nested SVM's interrupt injection behave more like on a
real machine.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:59 +03:00
Alexander Graf
ff092385e8 KVM: SVM: Implement INVLPGA
SVM adds another way to do INVLPG by ASID which Hyper-V makes use of,
so let's implement it!

For now we just do the same thing invlpg does, as asid switching
means we flush the mmu anyways. That might change one day though.

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:58 +03:00
Alexander Graf
3c5d0a44b0 KVM: Implement MSRs used by Hyper-V
Hyper-V uses some MSRs, some of which are actually reserved for BIOS usage.

But let's be nice today and have it its way, because otherwise it fails
terribly.

[jaswinder: fix build for linux-next changes]

Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Jaswinder Singh Rajput <jaswinderrajput@gmail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:58 +03:00
Avi Kivity
b3dbf89e67 KVM: SVM: Don't save/restore host cr2
The host never reads cr2 in process context, so are free to clobber it.  The
vmx code does this, so we can safely remove the save/restore code.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:58 +03:00
Avi Kivity
d3edefc003 KVM: VMX: Only reload guest cr2 if different from host cr2
cr2 changes only rarely, and writing it is expensive.  Avoid the costly cr2
writes by checking if it does not already hold the desired value.

Shaves 70 cycles off the vmexit latency.

Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Jan Kiszka
681405bfc7 KVM: Drop useless atomic test from timer function
The current code tries to optimize the setting of
KVM_REQ_PENDING_TIMER but used atomic_inc_and_test - which always
returns true unless pending had the invalid value of -1 on entry. This
patch drops the test part preserving the original semantic but
expressing it less confusingly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Jan Kiszka
f7104db26a KVM: Fix racy event propagation in timer
Minor issue that likely had no practical relevance: the kvm timer
function so far incremented the pending counter and then may reset it
again to 1 in case reinjection was disabled. This opened a small racy
window with the corresponding VCPU loop that may have happened to run
on another (real) CPU and already consumed the value.

Fix it by skipping the incrementation in case pending is already > 0.
This opens a different race windows, but may only rarely cause lost
events in case we do not care about them anyway (!reinject).

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Gleb Natapov
33e4c68656 KVM: Optimize searching for highest IRR
Most of the time IRR is empty, so instead of scanning the whole IRR on
each VM entry keep a variable that tells us if IRR is not empty. IRR
will have to be scanned twice on each IRQ delivery, but this is much
more rare than VM entry.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:57 +03:00
Gleb Natapov
6edf14d8d0 KVM: Replace pending exception by PF if it happens serially
Replace previous exception with a new one in a hope that instruction
re-execution will regenerate lost exception.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
54dee9933e KVM: VMX: conditionally disable 2M pages
Disable usage of 2M pages if VMX_EPT_2MB_PAGE_BIT (bit 16) is clear
in MSR_IA32_VMX_EPT_VPID_CAP and EPT is enabled.

[avi: s/largepages_disabled/largepages_enabled/ to avoid negative logic]

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
68f89400bc KVM: VMX: EPT misconfiguration handler
Handler for EPT misconfiguration which checks for valid state
in the shadow pagetables, printing the spte on each level.

The separate WARN_ONs are useful for kerneloops.org.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
94d8b056a2 KVM: MMU: add kvm_mmu_get_spte_hierarchy helper
Required by EPT misconfiguration handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:56 +03:00
Marcelo Tosatti
4d88954d62 KVM: MMU: make for_each_shadow_entry aware of largepages
This way there is no need to add explicit checks in every
for_each_shadow_entry user.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:55 +03:00
Marcelo Tosatti
e799794e02 KVM: VMX: more MSR_IA32_VMX_EPT_VPID_CAP capability bits
Required for EPT misconfiguration handler.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:55 +03:00
Andre Przywara
71db602322 KVM: Move performance counter MSR access interception to generic x86 path
The performance counter MSRs are different for AMD and Intel CPUs and they
are chosen mainly by the CPUID vendor string. This patch catches writes to
all addresses (regardless of VMX/SVM path) and handles them in the generic
MSR handler routine. Writing a 0 into the event select register is something
we perfectly emulate ;-), so don't print out a warning to dmesg in this
case.
This fixes booting a 64bit Windows guest with an AMD CPUID on an Intel host.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Marcelo Tosatti
2920d72857 KVM: MMU audit: largepage handling
Make the audit code aware of largepages.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Marcelo Tosatti
2aaf65e8c4 KVM: MMU audit: audit_mappings tweaks
- Fail early in case gfn_to_pfn returns is_error_pfn.
- For the pre pte write case, avoid spurious "gva is valid but spte is notrap"
  messages (the emulation code does the guest write first, so this particular
  case is OK).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Marcelo Tosatti
48fc03174b KVM: MMU audit: nontrapping ptes in nonleaf level
It is valid to set non leaf sptes as notrap.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:54 +03:00
Marcelo Tosatti
e58b0f9e0e KVM: MMU audit: update audit_write_protection
- Unsync pages contain writable sptes in the rmap.
- rmaps do not exclusively contain writable sptes anymore.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:53 +03:00
Marcelo Tosatti
08a3732bf2 KVM: MMU audit: update count_writable_mappings / count_rmaps
Under testing, count_writable_mappings returns a value that is 2 integers
larger than what count_rmaps returns.

Suspicion is that either of the two functions is counting a duplicate (either
positively or negatively).

Modifying check_writable_mappings_rmap to check for rmap existance on
all present MMU pages fails to trigger an error, which should keep Avi
happy.

Also introduce mmu_spte_walk to invoke a callback on all present sptes visible
to the current vcpu, might be useful in the future.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:53 +03:00
Marcelo Tosatti
776e663336 KVM: MMU: introduce is_last_spte helper
Hiding some of the last largepage / level interaction (which is useful
for gbpages and for zero based levels).

Also merge the PT_PAGE_TABLE_LEVEL clearing loop in unlink_children.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2009-09-10 08:32:53 +03:00