Commit Graph

570 Commits

Author SHA1 Message Date
Jan Kiszka
3ab66e8a45 KVM: VMX: Pass vcpu to __vmx_complete_interrupts
Cleanup: __vmx_complete_interrupts has no use for the vmx structure.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-28 10:29:03 +02:00
Jan Kiszka
44ceb9d665 KVM: nVMX: Avoid one redundant vmcs_read in prepare_vmcs12
IDT_VECTORING_INFO_FIELD was already read right after vmexit.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-28 10:20:06 +02:00
Jan Kiszka
957c897e8c KVM: nVMX: Use cached exit reason
No need to re-read what vmx_vcpu_run already picked up for us.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:46:07 +02:00
Jan Kiszka
36c3cc422b KVM: nVMX: Clear segment cache after switching between L1 and L2
Switching the VMCS obviously invalidates what may have been cached about
the guest segments.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:41:09 +02:00
Jan Kiszka
d6851fbeee KVM: nVMX: Advertise PAUSE and WBINVD exiting support
These exits have no preconditions, and we already process the
corresponding reasons in nested_vmx_exit_handled correctly.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:33:51 +02:00
Jan Kiszka
733568f9ce KVM: VMX: Make prepare_vmcs12 and load_vmcs12_host_state static
Both are only used locally.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-27 15:31:15 +02:00
Jan Kiszka
bd31a7f557 KVM: nVMX: Trap unconditionally if msr bitmap access fails
This avoids basing decisions on uninitialized variables, potentially
leaking kernel data to the L1 guest.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-22 00:50:45 -03:00
Jan Kiszka
908a7bdd6a KVM: nVMX: Improve I/O exit handling
This prevents trapping L2 I/O exits if L1 has neither unconditional nor
bitmap-based exiting enabled. Furthermore, it implements I/O bitmap
handling.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-22 00:50:42 -03:00
Jan Kiszka
cbd29cb6e3 KVM: nVMX: Remove redundant get_vmcs12 from nested_vmx_exit_handled_msr
We already pass vmcs12 as argument.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-14 10:35:16 +02:00
Yang Zhang
257090f702 KVM: VMX: disable apicv by default
Without Posted Interrupt, current code is broken. Just disable by
default until Posted Interrupt is ready.

Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-02-11 10:51:13 +02:00
Gleb Natapov
5037878e22 KVM: VMX: cleanup vmx_set_cr0().
When calculating hw_cr0 teh current code masks bits that should be always
on and re-adds them back immediately after. Cleanup the code by masking
only those bits that should be dropped from hw_cr0. This allow us to
get rid of some defines.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-06 22:00:02 -02:00
Dongxiao Xu
c08800a56c KVM: VMX: disable SMEP feature when guest is in non-paging mode
SMEP is disabled if CPU is in non-paging mode in hardware.
However KVM always uses paging mode to emulate guest non-paging
mode with TDP. To emulate this behavior, SMEP needs to be manually
disabled when guest switches to non-paging mode.

We met an issue that, SMP Linux guest with recent kernel (enable
SMEP support, for example, 3.5.3) would crash with triple fault if
setting unrestricted_guest=0. This is because KVM uses an identity
mapping page table to emulate the non-paging mode, where the page
table is set with USER flag. If SMEP is still enabled in this case,
guest will meet unhandlable page fault and then crash.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>
Signed-off-by: Xiantao Zhang <xiantao.zhang@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-02-05 23:28:07 -02:00
Yang Zhang
c7c9c56ca2 x86, apicv: add virtual interrupt delivery support
Virtual interrupt delivery avoids KVM to inject vAPIC interrupts
manually, which is fully taken care of by the hardware. This needs
some special awareness into existing interrupr injection path:

- for pending interrupt, instead of direct injection, we may need
  update architecture specific indicators before resuming to guest.

- A pending interrupt, which is masked by ISR, should be also
  considered in above update action, since hardware will decide
  when to inject it at right time. Current has_interrupt and
  get_interrupt only returns a valid vector from injection p.o.v.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:19 +02:00
Yang Zhang
8d14695f95 x86, apicv: add virtual x2apic support
basically to benefit from apicv, we need to enable virtualized x2apic mode.
Currently, we only enable it when guest is really using x2apic.

Also, clear MSR bitmap for corresponding x2apic MSRs when guest enabled x2apic:
0x800 - 0x8ff: no read intercept for apicv register virtualization,
               except APIC ID and TMCCT which need software's assistance to
               get right value.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@Intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:48:06 +02:00
Yang Zhang
83d4c28693 x86, apicv: add APICv register virtualization support
- APIC read doesn't cause VM-Exit
- APIC write becomes trap-like

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2013-01-29 10:47:54 +02:00
Gleb Natapov
141687869f KVM: VMX: set vmx->emulation_required only when needed.
If emulate_invalid_guest_state=false vmx->emulation_required is never
actually used, but it ends up to be always set to true since
handle_invalid_guest_state(), the only place it is reset back to
false, is never called. This, besides been not very clean, makes vmexit
and vmentry path to check emulate_invalid_guest_state needlessly.

The patch fixes that by keeping emulation_required coherent with
emulate_invalid_guest_state setting.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:31 -02:00
Gleb Natapov
91b0aa2ca6 KVM: VMX: rename fix_pmode_dataseg to fix_pmode_seg.
The function deals with code segment too.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:30 -02:00
Gleb Natapov
25391454e7 KVM: VMX: don't clobber segment AR of unusable segments.
Usability is returned in unusable field, so not need to clobber entire
AR. Callers have to know how to deal with unusable segments already
since if emulate_invalid_guest_state=true AR is not zeroed.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:28 -02:00
Gleb Natapov
218e763f45 KVM: VMX: skip vmx->rmode.vm86_active check on cr0 write if unrestricted guest is enabled
vmx->rmode.vm86_active is never true is unrestricted guest is enabled.
Make it more explicit that neither enter_pmode() nor enter_rmode() is
called in this case.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:28 -02:00
Gleb Natapov
286da4156d KVM: VMX: remove hack that disables emulation on vcpu reset/init
There is no reason for it. If state is suitable for vmentry it
will be detected during guest entry and no emulation will happen.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:27 -02:00
Gleb Natapov
c5e97c80b5 KVM: VMX: if unrestricted guest is enabled vcpu state is always valid.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:26 -02:00
Gleb Natapov
2f143240cb KVM: VMX: reset CPL only on CS register write.
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:26 -02:00
Gleb Natapov
1f3141e80b KVM: VMX: remove special CPL cache access during transition to real mode.
Since vmx_get_cpl() always returns 0 when VCPU is in real mode it is no
longer needed. Also reset CPL cache to zero during transaction to
protected mode since transaction may happen while CS.selectors & 3 != 0,
but in reality CPL is 0.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-24 00:40:25 -02:00
Marcelo Tosatti
b09408d00f KVM: VMX: fix incorrect cached cpl value with real/v8086 modes
CPL is always 0 when in real mode, and always 3 when virtual 8086 mode.

Using values other than those can cause failures on operations that
check CPL.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-08 17:25:35 -02:00
Gleb Natapov
0ca1b4f4ba KVM: VMX: handle IO when emulation is due to #GP in real mode.
With emulate_invalid_guest_state=0 if a vcpu is in real mode VMX can
enter the vcpu with smaller segment limit than guest configured.  If the
guest tries to access pass this limit it will get #GP at which point
instruction will be emulated with correct segment limit applied. If
during the emulation IO is detected it is not handled correctly. Vcpu
thread should exit to userspace to serve the IO, but it returns to the
guest instead.  Since emulation is not completed till userspace completes
the IO the faulty instruction is re-executed ad infinitum.

The patch fixes that by exiting to userspace if IO happens during
instruction emulation.

Reported-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:31 -02:00
Gleb Natapov
d54d07b2ca KVM: VMX: Do not fix segment register during vcpu initialization.
Segment registers will be fixed according to current emulation policy
during switching to real mode for the first time.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:30 -02:00
Gleb Natapov
d99e415275 KVM: VMX: fix emulation of invalid guest state.
Currently when emulation of invalid guest state is enable
(emulate_invalid_guest_state=1) segment registers are still fixed for
entry to vm86 mode some times. Segment register fixing is avoided in
enter_rmode(), but vmx_set_segment() still does it unconditionally.
The patch fixes it.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:29 -02:00
Gleb Natapov
89efbed02c KVM: VMX: make rmode_segment_valid() more strict.
Currently it allows entering vm86 mode if segment limit is greater than
0xffff and db bit is set. Both of those can cause incorrect execution of
instruction by cpu since in vm86 mode limit will be set to 0xffff and db
will be forced to 0.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2013-01-02 19:36:28 -02:00
Gleb Natapov
f924d66d27 KVM: VMX: remove unneeded temporary variable from vmx_set_segment()
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:02:00 +02:00
Gleb Natapov
1ecd50a947 KVM: VMX: clean-up vmx_set_segment()
Move all vm86_active logic into one place.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:49 +02:00
Gleb Natapov
39dcfb95de KVM: VMX: remove redundant code from vmx_set_segment()
Segment descriptor's base is fixed by call to fix_rmode_seg(). Not need
to do it twice.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:37 +02:00
Gleb Natapov
beb853ffec KVM: VMX: use fix_rmode_seg() to fix all code/data segments
The code for SS and CS does the same thing fix_rmode_seg() is doing.
Use it instead of hand crafted code.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:18 +02:00
Gleb Natapov
c6ad115348 KVM: VMX: return correct segment limit and flags for CS/SS registers in real mode
VMX without unrestricted mode cannot virtualize real mode, so if
emulate_invalid_guest_state=0 kvm uses vm86 mode to approximate
it. Sometimes, when guest moves from protected mode to real mode, it
leaves segment descriptors in a state not suitable for use by vm86 mode
virtualization, so we keep shadow copy of segment descriptors for internal
use and load fake register to VMCS for guest entry to succeed. Till
now we kept shadow for all segments except SS and CS (for SS and CS we
returned parameters directly from VMCS), but since commit a5625189f6
emulator enforces segment limits in real mode. This causes #GP during move
from protected mode to real mode when emulator fetches first instruction
after moving to real mode since it uses incorrect CS base and limit to
linearize the %rip. Fix by keeping shadow for SS and CS too.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:01:03 +02:00
Gleb Natapov
0647f4aa8c KVM: VMX: relax check for CS register in rmode_segment_valid()
rmode_segment_valid() checks if segment descriptor can be used to enter
vm86 mode. VMX spec mandates that in vm86 mode CS register will be of
type data, not code. Lets allow guest entry with vm86 mode if the only
problem with CS register is incorrect type. Otherwise entire real mode
will be emulated.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:00:47 +02:00
Gleb Natapov
07f42f5f25 KVM: VMX: cleanup rmode_segment_valid()
Set segment fields explicitly instead of using  binary operations.

No behaviour changes.

Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-23 14:00:36 +02:00
Alex Williamson
f82a8cfe93 KVM: struct kvm_memory_slot.user_alloc -> bool
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.

Reviewed-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-13 23:24:38 -02:00
Linus Torvalds
66cdd0ceaf Merge tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Marcelo Tosatti:
 "Considerable KVM/PPC work, x86 kvmclock vsyscall support,
  IA32_TSC_ADJUST MSR emulation, amongst others."

Fix up trivial conflict in kernel/sched/core.c due to cross-cpu
migration notifier added next to rq migration call-back.

* tag 'kvm-3.8-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (156 commits)
  KVM: emulator: fix real mode segment checks in address linearization
  VMX: remove unneeded enable_unrestricted_guest check
  KVM: VMX: fix DPL during entry to protected mode
  x86/kexec: crash_vmclear_local_vmcss needs __rcu
  kvm: Fix irqfd resampler list walk
  KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
  x86/kexec: VMCLEAR VMCSs loaded on all cpus if necessary
  KVM: MMU: optimize for set_spte
  KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
  KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
  KVM: PPC: bookehv: Add guest computation mode for irq delivery
  KVM: PPC: Make EPCR a valid field for booke64 and bookehv
  KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
  KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
  KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation
  KVM: PPC: e500: Add emulation helper for getting instruction ea
  KVM: PPC: bookehv64: Add support for interrupt handling
  KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler
  KVM: PPC: booke: Fix get_tb() compile error on 64-bit
  KVM: PPC: e500: Silence bogus GCC warning in tlb code
  ...
2012-12-13 15:31:08 -08:00
Gleb Natapov
0b26b588d9 VMX: remove unneeded enable_unrestricted_guest check
If enable_unrestricted_guest is true vmx->rmode.vm86_active will
always be false.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-11 21:00:28 -02:00
Gleb Natapov
a4d3326c2d KVM: VMX: fix DPL during entry to protected mode
On CPUs without support for unrestricted guests DPL cannot be smaller
than RPL for data segments during guest entry, but this state can occurs
if a data segment selector changes while vcpu is in real mode to a value
with lowest two bits != 00. Fix that by forcing DPL == RPL on transition
to protected mode.

This is a regression introduced by c865c43de6.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-12-11 21:00:27 -02:00
Zhang Yanfei
8f536b7697 KVM: VMX: provide the vmclear function and a bitmap to support VMCLEAR in kdump
The vmclear function will be assigned to the callback function pointer
when loading kvm-intel module. And the bitmap indicates whether we
should do VMCLEAR operation in kdump. The bits in the bitmap are
set/unset according to different conditions.

Signed-off-by: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Acked-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-06 18:26:57 +02:00
Julian Stecklina
66f7b72e11 KVM: x86: Make register state after reset conform to specification
VMX behaves now as SVM wrt to FPU initialization. Code has been moved to
generic code path. General-purpose registers are now cleared on reset and
INIT.  SVM code properly initializes EDX.

Signed-off-by: Julian Stecklina <jsteckli@os.inf.tu-dresden.de>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 18:00:07 +02:00
Zhang Xiantao
2b3c5cbc0d kvm: don't use bit24 for detecting address-specific invalidation capability
Bit24 in VMX_EPT_VPID_CAP_MASI is  not used for address-specific invalidation capability
reporting, so remove it from KVM to avoid conflicts in future.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 16:35:48 +02:00
Zhang Xiantao
0307b7b8c2 kvm: remove unnecessary bit checking for ept violation
Bit 6 in EPT vmexit's exit qualification is not defined in SDM, so remove it.

Signed-off-by: Zhang Xiantao <xiantao.zhang@intel.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
2012-12-05 16:35:21 +02:00
Will Auld
ba904635d4 KVM: x86: Emulate IA32_TSC_ADJUST MSR
CPUID.7.0.EBX[1]=1 indicates IA32_TSC_ADJUST MSR 0x3b is supported

Basic design is to emulate the MSR by allowing reads and writes to a guest
vcpu specific location to store the value of the emulated MSR while adding
the value to the vmcs tsc_offset. In this way the IA32_TSC_ADJUST value will
be included in all reads to the TSC MSR whether through rdmsr or rdtsc. This
is of course as long as the "use TSC counter offsetting" VM-execution control
is enabled as well as the IA32_TSC_ADJUST control.

However, because hardware will only return the TSC + IA32_TSC_ADJUST +
vmsc tsc_offset for a guest process when it does and rdtsc (with the correct
settings) the value of our virtualized IA32_TSC_ADJUST must be stored in one
of these three locations. The argument against storing it in the actual MSR
is performance. This is likely to be seldom used while the save/restore is
required on every transition. IA32_TSC_ADJUST was created as a way to solve
some issues with writing TSC itself so that is not an option either.

The remaining option, defined above as our solution has the problem of
returning incorrect vmcs tsc_offset values (unless we intercept and fix, not
done here) as mentioned above. However, more problematic is that storing the
data in vmcs tsc_offset will have a different semantic effect on the system
than does using the actual MSR. This is illustrated in the following example:

The hypervisor set the IA32_TSC_ADJUST, then the guest sets it and a guest
process performs a rdtsc. In this case the guest process will get
TSC + IA32_TSC_ADJUST_hyperviser + vmsc tsc_offset including
IA32_TSC_ADJUST_guest. While the total system semantics changed the semantics
as seen by the guest do not and hence this will not cause a problem.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-30 18:29:30 -02:00
Will Auld
8fe8ab46be KVM: x86: Add code to track call origin for msr assignment
In order to track who initiated the call (host or guest) to modify an msr
value I have changed function call parameters along the call path. The
specific change is to add a struct pointer parameter that points to (index,
data, caller) information rather than having this information passed as
individual parameters.

The initial use for this capability is for updating the IA32_TSC_ADJUST msr
while setting the tsc value. It is anticipated that this capability is
useful for other tasks.

Signed-off-by: Will Auld <will.auld@intel.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-30 18:26:12 -02:00
Xiao Guangrong
5a560f8b5e KVM: VMX: fix memory order between loading vmcs and clearing vmcs
vmcs->cpu indicates whether it exists on the target cpu, -1 means the vmcs
does not exist on any vcpu

If vcpu load vmcs with vmcs.cpu = -1, it can be directly added to cpu's percpu
list. The list can be corrupted if the cpu prefetch the vmcs's list before
reading vmcs->cpu. Meanwhile, we should remove vmcs from the list before
making vmcs->vcpu == -1 be visible

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-29 21:14:46 -02:00
Xiao Guangrong
e6c7d32172 KVM: VMX: fix invalid cpu passed to smp_call_function_single
In loaded_vmcs_clear, loaded_vmcs->cpu is the fist parameter passed to
smp_call_function_single, if the target cpu is downing (doing cpu hot remove),
loaded_vmcs->cpu can become -1 then -1 is passed to smp_call_function_single

It can be triggered when vcpu is being destroyed, loaded_vmcs_clear is called
in the preemptionable context

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-28 22:04:58 -02:00
Marcelo Tosatti
42897d866b KVM: x86: add kvm_arch_vcpu_postcreate callback, move TSC initialization
TSC initialization will soon make use of online_vcpus.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:14 -02:00
Marcelo Tosatti
886b470cb1 KVM: x86: pass host_tsc to read_l1_tsc
Allow the caller to pass host tsc value to kvm_x86_ops->read_l1_tsc().

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-27 23:29:11 -02:00
Takashi Iwai
29282fde80 KVM: x86: Fix invalid secondary exec controls in vmx_cpuid_update()
The commit [ad756a16: KVM: VMX: Implement PCID/INVPCID for guests with
EPT] introduced the unconditional access to SECONDARY_VM_EXEC_CONTROL,
and this triggers kernel warnings like below on old CPUs:

    vmwrite error: reg 401e value a0568000 (err 12)
    Pid: 13649, comm: qemu-kvm Not tainted 3.7.0-rc4-test2+ #154
    Call Trace:
     [<ffffffffa0558d86>] vmwrite_error+0x27/0x29 [kvm_intel]
     [<ffffffffa054e8cb>] vmcs_writel+0x1b/0x20 [kvm_intel]
     [<ffffffffa054f114>] vmx_cpuid_update+0x74/0x170 [kvm_intel]
     [<ffffffffa03629b6>] kvm_vcpu_ioctl_set_cpuid2+0x76/0x90 [kvm]
     [<ffffffffa0341c67>] kvm_arch_vcpu_ioctl+0xc37/0xed0 [kvm]
     [<ffffffff81143f7c>] ? __vunmap+0x9c/0x110
     [<ffffffffa0551489>] ? vmx_vcpu_load+0x39/0x1a0 [kvm_intel]
     [<ffffffffa0340ee2>] ? kvm_arch_vcpu_load+0x52/0x1a0 [kvm]
     [<ffffffffa032dcd4>] ? vcpu_load+0x74/0xd0 [kvm]
     [<ffffffffa032deb0>] kvm_vcpu_ioctl+0x110/0x5e0 [kvm]
     [<ffffffffa032e93d>] ? kvm_dev_ioctl+0x4d/0x4a0 [kvm]
     [<ffffffff8117dc6f>] do_vfs_ioctl+0x8f/0x530
     [<ffffffff81139d76>] ? remove_vma+0x56/0x60
     [<ffffffff8113b708>] ? do_munmap+0x328/0x400
     [<ffffffff81187c8c>] ? fget_light+0x4c/0x100
     [<ffffffff8117e1a1>] sys_ioctl+0x91/0xb0
     [<ffffffff815a942d>] system_call_fastpath+0x1a/0x1f

This patch adds a check for the availability of secondary exec
control to avoid these warnings.

Cc: <stable@vger.kernel.org> [v3.6+]
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-11-16 20:25:18 -02:00
Xiao Guangrong
bf4ca23ef5 KVM: VMX: report internal error for MMIO #PF due to delivery event
The #PF with PFEC.RSV = 1 indicates that the guest is accessing MMIO, we
can not fix it if it is caused by delivery event. Reporting internal error
for this case

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-18 16:30:32 +02:00
Xiao Guangrong
b9bf6882c1 KVM: VMX: report internal error for the unhandleable event
VM exits during Event Delivery is really unexpected if it is not caused
by Exceptions/EPT-VIOLATION/TASK_SWITCH, we'd better to report an internal
and freeze the guest, the VMM has the chance to check the guest

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-10-18 16:30:29 +02:00
Linus Torvalds
ecefbd94b8 KVM updates for the 3.7 merge window
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1.4.12 (GNU/Linux)
 
 iQIcBAABAgAGBQJQbY/2AAoJEI7yEDeUysxlymQQAIv5svpAI/FUe3FhvBi3IW2h
 WWMIpbdhHyocaINT18qNp8prO0iwoaBfgsnU8zuB34MrbdUgiwSHgM6T4Ff4NGa+
 R4u+gpyKYwxNQYKeJyj04luXra/krxwHL1u9OwN7o44JuQXAmzrw2tZ9ad1ArvL3
 eoZ6kGsPcdHPZMZWw2jN5xzBsRtqybm0GPPQh1qPXdn8UlPPd1X7owvbaud2y4+e
 StVIpGY6wrsO36f7UcA4Gm1EP/1E6Lm5KMXJyHgM9WBRkEfp92jTY5+XKv91vK8Z
 VKUd58QMdZE5NCNBkAR9U5N9aH0oSXnFU/g8hgiwGvrhS3IsSkKUePE6sVyMVTIO
 VptKRYe0AdmD/g25p6ApJsguV7ITlgoCPaE4rMmRcW9/bw8+iY098r7tO7w11H8M
 TyFOXihc3B+rlH8WdzOblwxHMC4yRuiPIktaA3WwbX7eA7Xv/ZRtdidifXKtgsVE
 rtubVqwGyYcHoX1Y+JiByIW1NN0pYncJhPEdc8KbRe2wKs3amA9rio1mUpBYYBPO
 B0ygcITftyXbhcTtssgcwBDGXB0AAGqI7wqdtJhFeIrKwHXD7fNeAGRwO8oKxmlj
 0aPwo9fDtpI+e6BFTohEgjZBocRvXXNWLnDSFB0E7xDR31bACck2FG5FAp1DxdS7
 lb/nbAsXf9UJLgGir4I1
 =kN6V
 -----END PGP SIGNATURE-----

Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Avi Kivity:
 "Highlights of the changes for this release include support for vfio
  level triggered interrupts, improved big real mode support on older
  Intels, a streamlines guest page table walker, guest APIC speedups,
  PIO optimizations, better overcommit handling, and read-only memory."

* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
  KVM: s390: Fix vcpu_load handling in interrupt code
  KVM: x86: Fix guest debug across vcpu INIT reset
  KVM: Add resampling irqfds for level triggered interrupts
  KVM: optimize apic interrupt delivery
  KVM: MMU: Eliminate pointless temporary 'ac'
  KVM: MMU: Avoid access/dirty update loop if all is well
  KVM: MMU: Eliminate eperm temporary
  KVM: MMU: Optimize is_last_gpte()
  KVM: MMU: Simplify walk_addr_generic() loop
  KVM: MMU: Optimize pte permission checks
  KVM: MMU: Update accessed and dirty bits after guest pagetable walk
  KVM: MMU: Move gpte_access() out of paging_tmpl.h
  KVM: MMU: Optimize gpte_access() slightly
  KVM: MMU: Push clean gpte write protection out of gpte_access()
  KVM: clarify kvmclock documentation
  KVM: make processes waiting on vcpu mutex killable
  KVM: SVM: Make use of asm.h
  KVM: VMX: Make use of asm.h
  KVM: VMX: Make lto-friendly
  KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
  ...

Conflicts:
	arch/s390/include/asm/processor.h
	arch/x86/kvm/i8259.c
2012-10-04 09:30:33 -07:00
Linus Torvalds
ac07f5c3cb Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86/fpu update from Ingo Molnar:
 "The biggest change is the addition of the non-lazy (eager) FPU saving
  support model and enabling it on CPUs with optimized xsaveopt/xrstor
  FPU state saving instructions.

  There are also various Sparse fixes"

Fix up trivial add-add conflict in arch/x86/kernel/traps.c

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
  x86, fpu: remove cpu_has_xmm check in the fx_finit()
  x86, fpu: make eagerfpu= boot param tri-state
  x86, fpu: enable eagerfpu by default for xsaveopt
  x86, fpu: decouple non-lazy/eager fpu restore from xsave
  x86, fpu: use non-lazy fpu restore for processors supporting xsave
  lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
  x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
  x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
  x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
  x86, fpu: drop_fpu() before restoring new state from sigframe
  x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
  x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
  x86, signal: Cleanup ifdefs and is_ia32, is_x32
2012-10-01 11:10:52 -07:00
Jan Kiszka
c863901075 KVM: x86: Fix guest debug across vcpu INIT reset
If we reset a vcpu on INIT, we so far overwrote dr7 as provided by
KVM_SET_GUEST_DEBUG, and we also cleared switch_db_regs unconditionally.

Fix this by saving the dr7 used for guest debugging and calculating the
effective register value as well as switch_db_regs on any potential
change. This will change to focus of the set_guest_debug vendor op to
update_dp_bp_intercept.

Found while trying to stop on start_secondary.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-23 15:00:07 +02:00
Suresh Siddha
b1a74bf821 x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
Preemption is disabled between kernel_fpu_begin/end() and as such
it is not a good idea to use these routines in kvm_load/put_guest_fpu()
which can be very far apart.

kvm_load/put_guest_fpu() routines are already called with
preemption disabled and KVM already uses the preempt notifier to save
the guest fpu state using kvm_put_guest_fpu().

So introduce __kernel_fpu_begin/end() routines which don't touch
preemption and use them instead of kernel_fpu_begin/end()
for KVM's use model of saving/restoring guest FPU state.

Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit
state in the case of VMX. For eagerFPU case, host cr0.TS is always clear.
So no need to worry about it. For the traditional lazyFPU restore case,
change the cr0.TS bit for the host state during vm-exit to be always clear
and cr0.TS bit is set in the __vmx_load_host_state() when the FPU
(guest FPU or the host task's FPU) state is not active. This ensures
that the host/guest FPU state is properly saved, restored
during context-switch and with interrupts (using irq_fpu_usable()) not
stomping on the active FPU state.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
2012-09-21 16:59:04 -07:00
Peter Senna Tschudin
4b8073e467 arch/x86: Remove unecessary semicolons
Found by http://coccinelle.lip6.fr/

Signed-off-by: Peter Senna Tschudin <peter.senna@gmail.com>
Cc: avi@redhat.com
Cc: mtosatti@redhat.com
Cc: a.p.zijlstra@chello.nl
Cc: rusty@rustcorp.com.au
Cc: masami.hiramatsu.pt@hitachi.com
Cc: suresh.b.siddha@intel.com
Cc: joerg.roedel@amd.com
Cc: agordeev@redhat.com
Cc: yinghai@kernel.org
Cc: bhelgaas@google.com
Cc: liuj97@gmail.com
Link: http://lkml.kernel.org/r/1347986174-30287-7-git-send-email-peter.senna@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-09-19 17:32:48 +02:00
Avi Kivity
b188c81f2e KVM: VMX: Make use of asm.h
Use macros for bitness-insensitive register names, instead of
rolling our own.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 10:38:04 -03:00
Avi Kivity
83287ea420 KVM: VMX: Make lto-friendly
LTO (link-time optimization) doesn't like local labels to be referred to
from a different function, since the two functions may be built in separate
compilation units.  Use an external variable instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-09-17 10:38:03 -03:00
Xiao Guangrong
4484141a94 KVM: fix error paths for failed gfn_to_page() calls
This bug was triggered:
[ 4220.198458] BUG: unable to handle kernel paging request at fffffffffffffffe
[ 4220.203907] IP: [<ffffffff81104d85>] put_page+0xf/0x34
......
[ 4220.237326] Call Trace:
[ 4220.237361]  [<ffffffffa03830d0>] kvm_arch_destroy_vm+0xf9/0x101 [kvm]
[ 4220.237382]  [<ffffffffa036fe53>] kvm_put_kvm+0xcc/0x127 [kvm]
[ 4220.237401]  [<ffffffffa03702bc>] kvm_vcpu_release+0x18/0x1c [kvm]
[ 4220.237407]  [<ffffffff81145425>] __fput+0x111/0x1ed
[ 4220.237411]  [<ffffffff8114550f>] ____fput+0xe/0x10
[ 4220.237418]  [<ffffffff81063511>] task_work_run+0x5d/0x88
[ 4220.237424]  [<ffffffff8104c3f7>] do_exit+0x2bf/0x7ca

The test case:

	printf(fmt, ##args);		\
	exit(-1);} while (0)

static int create_vm(void)
{
	int sys_fd, vm_fd;

	sys_fd = open("/dev/kvm", O_RDWR);
	if (sys_fd < 0)
		die("open /dev/kvm fail.\n");

	vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0);
	if (vm_fd < 0)
		die("KVM_CREATE_VM fail.\n");

	return vm_fd;
}

static int create_vcpu(int vm_fd)
{
	int vcpu_fd;

	vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
	if (vcpu_fd < 0)
		die("KVM_CREATE_VCPU ioctl.\n");
	printf("Create vcpu.\n");
	return vcpu_fd;
}

static void *vcpu_thread(void *arg)
{
	int vm_fd = (int)(long)arg;

	create_vcpu(vm_fd);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t thread;
	int vm_fd;

	(void)argc;
	(void)argv;

	vm_fd = create_vm();
	pthread_create(&thread, NULL, vcpu_thread, (void *)(long)vm_fd);
	printf("Exit.\n");
	return 0;
}

It caused by release kvm->arch.ept_identity_map_addr which is the
error page.

The parent thread can send KILL signal to the vcpu thread when it was
exiting which stops faulting pages and potentially allocating memory.
So gfn_to_pfn/gfn_to_page may fail at this time

Fixed by checking the page before it is used

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-10 11:34:11 +03:00
Ren, Yongjie
4f97704555 KVM: x86: Check INVPCID feature bit in EBX of leaf 7
Checks and operations on the INVPCID feature bit should use EBX
of CPUID leaf 7 instead of ECX.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Yongjie Ren <yongjien.ren@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-09 17:34:01 +03:00
Mathias Krause
772e031899 KVM: VMX: constify lookup tables
We use vmcs_field_to_offset_table[], kvm_vmx_segment_fields[] and
kvm_vmx_exit_handlers[] as lookup tables only -- make them r/o.

Signed-off-by: Mathias Krause <minipli@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-09-05 12:42:09 +03:00
Avi Kivity
a81aba14dc KVM: VMX: Ignore segment G and D bits when considering whether we can virtualize
We will enter the guest with G and D cleared; as real hardware ignores D in
real mode, and G is taken care of by the limit test, we allow more code to
run in vm86 mode.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity
ce56680347 KVM: VMX: Save all segment data in real mode
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity
1390a28b27 KVM: VMX: Preserve segment limit and access rights in real mode
While this is undocumented, real processors do not reload the segment
limit and access rights when loading a segment register in real mode.
Real programs rely on it so we need to comply with this behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity
7263642028 KVM: VMX: Return real real-mode segment data even if emulate_invalid_guest_state=1
emulate_invalid_guest_state=1 doesn't mean we don't munge the segments in the
vmcs; we do.  So we need to return the real ones (maintained by vmx_set_segment).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:21 -03:00
Avi Kivity
e2a610d7fc KVM: VMX: Allow vm86 virtualization of big real mode
Usually, big real mode uses large (4GB) segments.  Currently we don't
virtualize this; if any segment has a limit other than 0xffff, we emulate.
But if we set the vmx-visible limit to 0xffff, we can use vm86 to virtualize
real mode; if an access overruns the segment limit, the guest will #GP, which
we will trap and forward to the emulator.  This results in significantly
faster execution, and less risk of hitting an unemulated instruction.

If the limit is less than 0xffff, we retain the existing behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity
495e116684 KVM: VMX: Allow real mode emulation using vm86 with dpl=0
Real mode is always entered from protected mode with dpl=0.  Since
the dpl doesn't affect execution, and we already override it to 3
in the vmcs (as vmx requires), we can allow execution in that state.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity
c865c43de6 KVM: VMX: Retain limit and attributes when entering protected mode
Real processors don't change segment limits and attributes while in
real mode.  Mimic that behaviour.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:20 -03:00
Avi Kivity
f5f7b2fe3b KVM: VMX: Use kvm_segment to save protected-mode segments when entering realmode
Instead of using struct kvm_save_segment, use struct kvm_segment, which is what
the other APIs use.  This leads to some simplification.

We replace save_rmode_seg() with a call to vmx_save_segment().  Since this depends
on rmode.vm86_active, we move the call to before setting the flag.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity
72fbefec26 KVM: VMX: Fix incorrect lookup of segment S flag in fix_pmode_dataseg()
fix_pmode_dataseg() looks up S in ->base instead of ->ar_bytes.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity
baa7e81e32 KVM: VMX: Separate saving pre-realmode state from setting segments
Commit b246dd5df1 ("KVM: VMX: Fix KVM_SET_SREGS with big real mode
segments") moved fix_rmode_seg() to vmx_set_segment(), so that it is
applied not just on transitions to real mode, but also on KVM_SET_SREGS
(migration).  However fix_rmode_seg() not only munges the vmcs segments,
it also sets up the save area for us to restore when returning to
protected mode or to return in vmx_get_segment().

Move saving the segment into a new function, save_rmode_seg(), and
call it just during the transition.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-27 20:02:19 -03:00
Avi Kivity
dbcb4e7980 KVM: VMX: Advertize RDTSC exiting to nested guests
All processors that support VMX have that feature, and guests (Xen) depend on
it.  As we already implement it, advertize it to the guest.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 19:08:28 -03:00
Gleb Natapov
2a7921b7a0 KVM: VMX: restore MSR_IA32_DEBUGCTLMSR after VMEXIT
MSR_IA32_DEBUGCTLMSR is zeroed on VMEXIT. Restore it to the correct
value.

Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-13 19:07:58 -03:00
Xiao Guangrong
32cad84f44 KVM: do not release the error page
After commit a2766325cf, the error page is replaced by the
error code, it need not be released anymore

[ The patch has been compiling tested for powerpc ]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-06 16:04:58 +03:00
Avi Kivity
fe56097b23 Merge remote-tracking branch 'upstream' into next
- bring back critical fixes (esp. aa67f6096c)
 - provide an updated base for development

* upstream: (4334 commits)
  missed mnt_drop_write() in do_dentry_open()
  UBIFS: nuke pdflush from comments
  gfs2: nuke pdflush from comments
  drbd: nuke pdflush from comments
  nilfs2: nuke write_super from comments
  hfs: nuke write_super from comments
  vfs: nuke pdflush from comments
  jbd/jbd2: nuke write_super from comments
  btrfs: nuke pdflush from comments
  btrfs: nuke write_super from comments
  ext4: nuke pdflush from comments
  ext4: nuke write_super from comments
  ext3: nuke write_super from comments
  Documentation: fix the VM knobs descritpion WRT pdflush
  Documentation: get rid of write_super
  vfs: kill write_super and sync_supers
  ACPI processor: Fix tick_broadcast_mask online/offline regression
  ACPI: Only count valid srat memory structures
  ACPI: Untangle a return statement for better readability
  Linux 3.6-rc1
  ...

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-08-05 13:25:10 +03:00
Avi Kivity
aa67f6096c KVM: VMX: Fix ds/es corruption on i386 with preemption
Commit b2da15ac26 ("KVM: VMX: Optimize %ds, %es reload") broke i386
in the following scenario:

  vcpu_load
  ...
  vmx_save_host_state
  vmx_vcpu_run
  (ds.rpl, es.rpl cleared by hardware)

  interrupt
    push ds, es  # pushes bad ds, es
    schedule
      vmx_vcpu_put
        vmx_load_host_state
          reload ds, es (with __USER_DS)
    pop ds, es  # of other thread's stack
    iret
  # other thread runs
  interrupt
    push ds, es
    schedule  # back in vcpu thread
    pop ds, es  # now with rpl=0
    iret
  ...
  vcpu_put
  resume_userspace
  iret  # clears ds, es due to mismatched rpl

(instead of resume_userspace, we might return with SYSEXIT and then
take an exception; when the exception IRETs we end up with cleared
ds, es)

Fix by avoiding the optimization on i386 and reloading ds, es on the
lightweight exit path.

Reported-by: Chris Clayron <chris2553@googlemail.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-08-01 20:23:57 -03:00
Guo Chao
0fa0607147 KVM: VMX: Fix typos
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-20 15:19:04 -03:00
Mao, Junjie
ad756a1603 KVM: VMX: Implement PCID/INVPCID for guests with EPT
This patch handles PCID/INVPCID for guests.

Process-context identifiers (PCIDs) are a facility by which a logical processor
may cache information for multiple linear-address spaces so that the processor
may retain cached information when software switches to a different linear
address space. Refer to section 4.10.1 in IA32 Intel Software Developer's Manual
Volume 3A for details.

For guests with EPT, the PCID feature is enabled and INVPCID behaves as running
natively.
For guests without EPT, the PCID feature is disabled and INVPCID triggers #UD.

Signed-off-by: Junjie Mao <junjie.mao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-12 13:07:34 +03:00
Xiao Guangrong
4f5982a56a KVM: VMX: export PFEC.P bit on ept
Export the present bit of page fault error code, the later patch
will use it

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-11 16:51:17 +03:00
Avi Kivity
a27685c33a KVM: VMX: Emulate invalid guest state by default
Our emulation should be complete enough that we can emulate guests
while they are in big real mode, or in a mode transition that is not
virtualizable without unrestricted guest support.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:05 +03:00
Avi Kivity
de5f70e0c6 KVM: VMX: Improve error reporting during invalid guest state emulation
If instruction emulation fails, report it properly to userspace.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
de87dcddc7 KVM: VMX: Stop invalid guest state emulation on pending event
Process the event, possibly injecting an interrupt, before continuing.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:04 +03:00
Avi Kivity
7c068e4558 KVM: VMX: Continue emulating after batch exhausted
If we return early from an invalid guest state emulation loop, make
sure we return to it later if the guest state is still invalid.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:03 +03:00
Avi Kivity
bdea48e305 KVM: VMX: Fix interrupt exit condition during emulation
Checking EFLAGS.IF is incorrect as we might be in interrupt shadow.  If
that is the case, the main loop will notice that and not inject the interrupt,
causing an endless loop.

Fix by using vmx_interrupt_allowed() to check if we can inject an interrupt
instead.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:02 +03:00
Avi Kivity
b8405c184b KVM: VMX: Limit iterations with emulator_invalid_guest_state
Otherwise, if the guest ends up looping, we never exit the srcu critical
section, which causes synchronize_srcu() to hang.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
f0495f9b99 KVM: VMX: Relax check on unusable segment
Some userspace (e.g. QEMU 1.1) munge the d and g bits of segment
descriptors, causing us not to recognize them as unusable segments
with emulate_invalid_guest_state=1.  Relax the check by testing for
segment not present (a non-present segment cannot be usable).

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:01 +03:00
Avi Kivity
d881e6f6cf KVM: VMX: Return correct CPL during transition to protected mode
In protected mode, the CPL is defined as the lower two bits of CS, as set by
the last far jump.  But during the transition to protected mode, there is no
last far jump, so we need to return zero (the inherited real mode CPL).

Fix by reading CPL from the cache during the transition.  This isn't 100%
correct since we don't set the CPL cache on a far jump, but since protected
mode transition will always jump to a segment with RPL=0, it will always
work.

Signed-off-by: Avi Kivity <avi@redhat.com>
2012-07-09 14:19:00 +03:00
Guo Chao
2106a54812 KVM: VMX: code clean for vmx_init()
Signed-off-by: Guo Chao <yan@linux.vnet.ibm.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-07-03 14:55:30 -03:00
Christoffer Dall
a737f256bf KVM: Cleanup the kvm_print functions and introduce pr_XX wrappers
Introduces a couple of print functions, which are essentially wrappers
around standard printk functions, with a KVM: prefix.

Functions introduced or modified are:
 - kvm_err(fmt, ...)
 - kvm_info(fmt, ...)
 - kvm_debug(fmt, ...)
 - kvm_pr_unimpl(fmt, ...)
 - pr_unimpl(vcpu, fmt, ...) -> vcpu_unimpl(vcpu, fmt, ...)

Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-06 15:24:00 +03:00
Orit Wasserman
b246dd5df1 KVM: VMX: Fix KVM_SET_SREGS with big real mode segments
For example migration between Westmere and Nehelem hosts, caught in big real mode.

The code that fixes the segments for real mode guest was moved from enter_rmode
to vmx_set_segments. enter_rmode calls vmx_set_segments for each segment.

Signed-off-by: Orit Wasserman <owasserm@rehdat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 17:51:46 +03:00
Xudong Hao
3f6d8c8a47 KVM: VMX: Use EPT Access bit in response to memory notifiers
Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:05 +03:00
Xudong Hao
b38f993478 KVM: VMX: Enable EPT A/D bits if supported by turning on relevant bit in EPTP
In EPT page structure entry, Enable EPT A/D bits if processor supported.

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:04 +03:00
Xudong Hao
83c3a33122 KVM: VMX: Add parameter to control A/D bits support, default is on
Add kernel parameter to control A/D bits support, it's on by default.

Signed-off-by: Haitao Shan <haitao.shan@intel.com>
Signed-off-by: Xudong Hao <xudong.hao@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-05 16:31:03 +03:00
Avi Kivity
b2da15ac26 KVM: VMX: Optimize %ds, %es reload
On x86_64, we can defer %ds and %es reload to the heavyweight context switch,
since nothing in the lightweight paths uses the host %ds or %es (they are
ignored by the processor).  Furthermore we can avoid the load if the segments
are null, by letting the hardware load the null segments for us.  This is the
expected case.

On i386, we could avoid the reload entirely, since the entry.S paths take care
of reload, except for the SYSEXIT path which leaves %ds and %es set to __USER_DS.
So we set them to the same values as well.

Saves about 70 cycles out of 1600 (around 4%; noisy measurements).

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:03:19 -03:00
Avi Kivity
512d5649e8 KVM: VMX: Fix %ds/%es clobber
The vmx exit code unconditionally restores %ds and %es to __USER_DS.  This
can override the user's values, since %ds and %es are not saved and restored
in x86_64 syscalls.  In practice, this isn't dangerous since nobody uses
segment registers in long mode, least of all programs that use KVM.

Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-05-16 16:03:19 -03:00
Xiao Guangrong
5f3fbc342f KVM: VMX: unlike vmcs on fail path
fix:

[ 1529.577273] Call Trace:
[ 1529.577289]  [<ffffffffa060d58f>] kvm_arch_hardware_disable+0x13/0x30 [kvm]
[ 1529.577302]  [<ffffffffa05fa2d4>] hardware_disable_nolock+0x35/0x39 [kvm]
[ 1529.577311]  [<ffffffffa05fa29f>] ? cpumask_clear_cpu.constprop.31+0x13/0x13 [kvm]
[ 1529.577315]  [<ffffffff81096ba8>] on_each_cpu+0x44/0x84
[ 1529.577326]  [<ffffffffa05f98b5>] hardware_disable_all_nolock+0x34/0x36 [kvm]
[ 1529.577335]  [<ffffffffa05f98e2>] hardware_disable_all+0x2b/0x39 [kvm]
[ 1529.577349]  [<ffffffffa05fafe5>] kvm_put_kvm+0xed/0x10f [kvm]
[ 1529.577358]  [<ffffffffa05fb3d7>] kvm_vm_release+0x22/0x28 [kvm]

Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-05-14 11:28:02 +03:00
Marcelo Tosatti
eac0556750 Merge branch 'linus' into queue
Merge reason: development work has dependency on kvm patches merged
upstream.

Conflicts:
	Documentation/feature-removal-schedule.txt

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-19 17:06:26 -03:00
Avi Kivity
2225fd5604 KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context
kvm_set_shared_msr() may not be called in preemptible context,
but vmx_set_msr() does so:

  BUG: using smp_processor_id() in preemptible [00000000] code: qemu-kvm/22713
  caller is kvm_set_shared_msr+0x32/0xa0 [kvm]
  Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39
  Call Trace:
   [<ffffffff8131fa82>] debug_smp_processor_id+0xe2/0x100
   [<ffffffffa0328ae2>] kvm_set_shared_msr+0x32/0xa0 [kvm]
   [<ffffffffa03a103b>] vmx_set_msr+0x28b/0x2d0 [kvm_intel]
   ...

Making kvm_set_shared_msr() work in preemptible is cleaner, but
it's used in the fast path.  Making two variants is overkill, so
this patch just disables preemption around the call.

Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-04-18 23:42:27 -03:00
Josh Triplett
e9bda3b3d0 KVM: VMX: Auto-load on CPUs with VMX
Enable x86 feature-based autoloading for the kvm-intel module on CPUs
with X86_FEATURE_VMX.

Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-By: Kay Sievers <kay@vrfy.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
2012-04-08 14:03:12 +03:00