Commit Graph

4434 Commits

Author SHA1 Message Date
Jim Mattson
70f3aac964 kvm: nVMX: Remove superfluous VMX instruction fault checks
According to the Intel SDM, "Certain exceptions have priority over VM
exits. These include invalid-opcode exceptions, faults based on
privilege level*, and general-protection exceptions that are based on
checking I/O permission bits in the task-state segment (TSS)."

There is no need to check for faulting conditions that the hardware
has already checked.

* These include faults generated by attempts to execute, in
  virtual-8086 mode, privileged instructions that are not recognized
  in that mode.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 17:05:43 +02:00
Ladi Prosek
6ed071f051 KVM: x86: fix emulation of RSM and IRET instructions
On AMD, the effect of set_nmi_mask called by emulate_iret_real and em_rsm
on hflags is reverted later on in x86_emulate_instruction where hflags are
overwritten with ctxt->emul_flags (the kvm_set_hflags call). This manifests
as a hang when rebooting Windows VMs with QEMU, OVMF, and >1 vcpu.

Instead of trying to merge ctxt->emul_flags into vcpu->arch.hflags after
an instruction is emulated, this commit deletes emul_flags altogether and
makes the emulator access vcpu->arch.hflags using two new accessors. This
way all changes, on the emulator side as well as in functions called from
the emulator and accessing vcpu state with emul_to_vcpu, are preserved.

More details on the bug and its manifestation with Windows and OVMF:

  It's a KVM bug in the interaction between SMI/SMM and NMI, specific to AMD.
  I believe that the SMM part explains why we started seeing this only with
  OVMF.

  KVM masks and unmasks NMI when entering and leaving SMM. When KVM emulates
  the RSM instruction in em_rsm, the set_nmi_mask call doesn't stick because
  later on in x86_emulate_instruction we overwrite arch.hflags with
  ctxt->emul_flags, effectively reverting the effect of the set_nmi_mask call.
  The AMD-specific hflag of interest here is HF_NMI_MASK.

  When rebooting the system, Windows sends an NMI IPI to all but the current
  cpu to shut them down. Only after all of them are parked in HLT will the
  initiating cpu finish the restart. If NMI is masked, other cpus never get
  the memo and the initiating cpu spins forever, waiting for
  hal!HalpInterruptProcessorsStarted to drop. That's the symptom we observe.

Fixes: a584539b24 ("KVM: x86: pass the whole hflags field to emulator and back")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 16:54:09 +02:00
Andrew Jones
cde9af6e79 KVM: add explicit barrier to kvm_vcpu_kick
kvm_vcpu_kick() must issue a general memory barrier prior to reading
vcpu->mode in order to ensure correctness of the mutual-exclusion
memory barrier pattern used with vcpu->requests.  While the cmpxchg
called from kvm_vcpu_kick():

 kvm_vcpu_kick
   kvm_arch_vcpu_should_kick
     kvm_vcpu_exiting_guest_mode
       cmpxchg

implies general memory barriers before and after the operation, that
implication is only valid when cmpxchg succeeds.  We need an explicit
barrier for when it fails, otherwise a VCPU thread on its entry path
that reads zero for vcpu->requests does not exclude the possibility
the requesting thread sees !IN_GUEST_MODE when it reads vcpu->mode.

kvm_make_all_cpus_request already had a barrier, so we remove it, as
now it would be redundant.

Signed-off-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 14:16:17 +02:00
Radim Krčmář
1bd2009e73 KVM: x86: always use kvm_make_request instead of set_bit
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 14:12:53 +02:00
Radim Krčmář
72875d8a4d KVM: add kvm_{test,clear}_request to replace {test,clear}_bit
Users were expected to use kvm_check_request() for testing and clearing,
but request have expanded their use since then and some users want to
only test or do a faster clear.

Make sure that requests are not directly accessed with bit operations.

Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Cornelia Huck <cornelia.huck@de.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-27 14:12:22 +02:00
Marcelo Tosatti
e891a32e7a KVM: x86: remove irq disablement around KVM_SET_CLOCK/KVM_GET_CLOCK
The disablement of interrupts at KVM_SET_CLOCK/KVM_GET_CLOCK
attempts to disable software suspend from causing "non atomic behaviour" of
the operation:

    Add a helper function to compute the kernel time and convert nanoseconds
    back to CPU specific cycles.  Note that these must not be called in preemptible
    context, as that would mean the kernel could enter software suspend state,
    which would cause non-atomic operation.

However, assume the kernel can enter software suspend at the following 2 points:

        ktime_get_ts(&ts);
1.
						hypothetical_ktime_get_ts(&ts)
        monotonic_to_bootbased(&ts);
2.

monotonic_to_bootbased() should be correct relative to a ktime_get_ts(&ts)
performed after point 1 (that is after resuming from software suspend),
hypothetical_ktime_get_ts()

Therefore it is also correct for the ktime_get_ts(&ts) before point 1,
which is

	ktime_get_ts(&ts) = hypothetical_ktime_get_ts(&ts) + time-to-execute-suspend-code

Note CLOCK_MONOTONIC does not count during suspension.

So remove the irq disablement, which causes the following warning on
-RT kernels:

 With this reasoning, and the -RT bug that the irq disablement causes
 (because spin_lock is now a sleeping lock), remove the IRQ protection as it
 causes:

 [ 1064.668109] in_atomic(): 0, irqs_disabled(): 1, pid: 15296, name:m
 [ 1064.668110] INFO: lockdep is turned off.
 [ 1064.668110] irq event stamp: 0
 [ 1064.668112] hardirqs last  enabled at (0): [<          (null)>]  )
 [ 1064.668116] hardirqs last disabled at (0): [] c0
 [ 1064.668118] softirqs last  enabled at (0): [] c0
 [ 1064.668118] softirqs last disabled at (0): [<          (null)>]  )
 [ 1064.668121] CPU: 13 PID: 15296 Comm: qemu-kvm Not tainted 3.10.0-1
 [ 1064.668121] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 5
 [ 1064.668123]  ffff8c1796b88000 00000000afe7344c ffff8c179abf3c68 f3
 [ 1064.668125]  ffff8c179abf3c90 ffffffff930ccb3d ffff8c1b992b3610 f0
 [ 1064.668126]  00007ffc1a26fbc0 ffff8c179abf3cb0 ffffffff9375f694 f0
 [ 1064.668126] Call Trace:
 [ 1064.668132]  [] dump_stack+0x19/0x1b
 [ 1064.668135]  [] __might_sleep+0x12d/0x1f0
 [ 1064.668138]  [] rt_spin_lock+0x24/0x60
 [ 1064.668155]  [] __get_kvmclock_ns+0x36/0x110 [k]
 [ 1064.668159]  [] ? futex_wait_queue_me+0x103/0x10
 [ 1064.668171]  [] kvm_arch_vm_ioctl+0xa2/0xd70 [k]
 [ 1064.668173]  [] ? futex_wait+0x1ac/0x2a0

v2: notice get_kvmclock_ns with the same problem (Pankaj).
v3: remove useless helper function (Pankaj).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-21 12:50:28 +02:00
Michael S. Tsirkin
668fffa3f8 kvm: better MWAIT emulation for guests
Guests that are heavy on futexes end up IPI'ing each other a lot. That
can lead to significant slowdowns and latency increase for those guests
when running within KVM.

If only a single guest is needed on a host, we have a lot of spare host
CPU time we can throw at the problem. Modern CPUs implement a feature
called "MWAIT" which allows guests to wake up sleeping remote CPUs without
an IPI - thus without an exit - at the expense of never going out of guest
context.

The decision whether this is something sensible to use should be up to the
VM admin, so to user space. We can however allow MWAIT execution on systems
that support it properly hardware wise.

This patch adds a CAP to user space and a KVM cpuid leaf to indicate
availability of native MWAIT execution. With that enabled, the worst a
guest can do is waste as many cycles as a "jmp ." would do, so it's not
a privilege problem.

We consciously do *not* expose the feature in our CPUID bitmap, as most
people will want to benefit from sleeping vCPUs to allow for over commit.

Reported-by: "Gabriel L. Somlo" <gsomlo@gmail.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
[agraf: fix amd, change commit message]
Signed-off-by: Alexander Graf <agraf@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-21 12:50:28 +02:00
Kyle Huey
db2336a804 KVM: x86: virtualize cpuid faulting
Hardware support for faulting on the cpuid instruction is not required to
emulate it, because cpuid triggers a VM exit anyways. KVM handles the relevant
MSRs (MSR_PLATFORM_INFO and MSR_MISC_FEATURES_ENABLE) and upon a
cpuid-induced VM exit checks the cpuid faulting state and the CPL.
kvm_require_cpl is even kind enough to inject the GP fault for us.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Reviewed-by: David Matlack <dmatlack@google.com>
[Return "1" from kvm_emulate_cpuid, it's not void. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-21 12:50:06 +02:00
David Hildenbrand
fe0e80befd KVM: VMX: drop vmm_exclusive module parameter
vmm_exclusive=0 leads to KVM setting X86_CR4_VMXE always and calling
VMXON only when the vcpu is loaded. X86_CR4_VMXE is used as an
indication in cpu_emergency_vmxoff() (called on kdump) if VMXOFF has to be
called. This is obviously not the case if both are used independtly.
Calling VMXOFF without a previous VMXON will result in an exception.

In addition, X86_CR4_VMXE is used as a mean to test if VMX is already in
use by another VMM in hardware_enable(). So there can't really be
co-existance. If the other VMM is prepared for co-existance and does a
similar check, only one VMM can exist. If the other VMM is not prepared
and blindly sets/clears X86_CR4_VMXE, we will get inconsistencies with
X86_CR4_VMXE.

As we also had bug reports related to clearing of vmcs with vmm_exclusive=0
this seems to be pretty much untested. So let's better drop it.

While at it, directly move setting/clearing X86_CR4_VMXE into
kvm_cpu_vmxon/off.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-21 11:42:49 +02:00
Radim Krčmář
3325187061 KVM: nVMX: fix AD condition when handling EPT violation
I have introduced this bug when applying and simplifying Paolo's patch
as we agreed on the list.  The original was "x &= ~y; if (z) x |= y;".

Here is the story of a bad workflow:

  A maintainer was already testing with the intended change, but it was
  applied only to a testing repo on a different machine.  When the time
  to push tested patches to kvm/next came, he realized that this change
  was missing and quickly added it to the maintenance repo, didn't test
  again (because the change is trivial, right), and pushed the world to
  fire.

Fixes: ae1e2d1082 ("kvm: nVMX: support EPT accessed/dirty bits")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-13 19:36:55 +02:00
Ladi Prosek
405a353a0e KVM: x86: Add MSR_AMD64_DC_CFG to the list of ignored MSRs
Hyper-V writes 0x800000000000 to MSR_AMD64_DC_CFG when running on AMD CPUs
as recommended in erratum 383, analogous to our svm_init_erratum_383.

By ignoring the MSR, this patch enables running Hyper-V in L1 on AMD.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 21:09:24 +02:00
Denis Plotnikov
bd8fab39cd KVM: x86: fix maintaining of kvm_clock stability on guest CPU hotplug
VCPU TSC synchronization is perfromed in kvm_write_tsc() when the TSC
value being set is within 1 second from the expected, as obtained by
extrapolating of the TSC in already synchronized VCPUs.

This is naturally achieved on all VCPUs at VM start and resume;
however on VCPU hotplug it is not: the newly added VCPU is created
with TSC == 0 while others are well ahead.

To compensate for that, consider host-initiated kvm_write_tsc() with
TSC == 0 a special case requiring synchronization regardless of the
current TSC on other VCPUs.

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:15 +02:00
Denis Plotnikov
c5e8ec8e9b KVM: x86: remaster kvm_write_tsc code
Reuse existing code instead of using inline asm.
Make the code more concise and clear in the TSC
synchronization part.

Signed-off-by: Denis Plotnikov <dplotnikov@virtuozzo.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:15 +02:00
David Hildenbrand
900ab14ca9 KVM: x86: use irqchip_kernel() to check for pic+ioapic
Although the current check is not wrong, this check explicitly includes
the pic.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:15 +02:00
David Hildenbrand
b5e7cf52e1 KVM: x86: simplify pic_ioport_read()
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:15 +02:00
David Hildenbrand
84a5c79e70 KVM: x86: set data directly in picdev_read()
Now it looks almost as picdev_write().

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:15 +02:00
David Hildenbrand
9fecaa9e32 KVM: x86: drop picdev_in_range()
We already have the exact same checks a couple of lines below.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
dc24d1d2cb KVM: x86: make kvm_pic_reset() static
Not used outside of i8259.c, so let's make it static.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
e21d1758b0 KVM: x86: simplify pic_unlock()
We can easily compact this code and get rid of one local variable.

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
43ae312ca1 KVM: x86: drop goto label in kvm_set_routing_entry()
No need for the goto label + local variable "r".

Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
993225adf4 KVM: x86: rename kvm_vcpu_request_scan_ioapic()
Let's rename it into a proper arch specific callback.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
ca8ab3f895 KVM: x86: directly call kvm_make_scan_ioapic_request() in ioapic.c
We know there is an ioapic, so let's call it directly.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
d62f270b2d KVM: x86: remove all-vcpu request from kvm_ioapic_init()
kvm_ioapic_init() is guaranteed to be called without any created VCPUs,
so doing an all-vcpu request results in a NOP.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
445ee82d7a KVM: x86: KVM_IRQCHIP_PIC_MASTER only has 8 pins
Currently, one could set pin 8-15, implicitly referring to
KVM_IRQCHIP_PIC_SLAVE.

Get rid of the two local variables max_pin and delta on the way.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
49f520b99a KVM: x86: push usage of slots_lock down
Let's just move it to the place where it is actually needed.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
ba7454e17f KVM: x86: don't take kvm->irq_lock when creating IRQCHIP
I don't see any reason any more for this lock, seemed to be used to protect
removal of kvm->arch.vpic / kvm->arch.vioapic when already partially
inititalized, now access is properly protected using kvm->arch.irqchip_mode
and this shouldn't be necessary anymore.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
33392b4911 KVM: x86: convert kvm_(set|get)_ioapic() into void
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
4c0b06d886 KVM: x86: remove duplicate checks for ioapic
When handling KVM_GET_IRQCHIP, we already check irqchip_kernel(), which
implies a fully inititalized ioapic.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:14 +02:00
David Hildenbrand
0bceb15ad1 KVM: x86: use ioapic_in_kernel() to check for ioapic existence
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
David Hildenbrand
0191e92d84 KVM: x86: get rid of ioapic_irqchip()
Let's just use kvm->arch.vioapic directly.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
David Hildenbrand
90bca0529e KVM: x86: get rid of pic_irqchip()
It seemed like a nice idea to encapsulate access to kvm->arch.vpic. But
as the usage is already mixed, internal locks are taken outside of i8259.c
and grepping for "vpic" only is much easier, let's just get rid of
pic_irqchip().

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
David Hildenbrand
f567080bdd KVM: x86: check against irqchip_mode in ioapic_in_kernel()
KVM_IRQCHIP_KERNEL implies a fully inititalized ioapic, while
kvm->arch.vioapic might temporarily be set but invalidated again if e.g.
setting of default routing fails when setting KVM_CREATE_IRQCHIP.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
David Hildenbrand
19d25a0e47 KVM: x86: check against irqchip_mode in pic_in_kernel()
Let's avoid checking against kvm->arch.vpic. We have kvm->arch.irqchip_mode
for that now.

KVM_IRQCHIP_KERNEL implies a fully inititalized pic, while kvm->arch.vpic
might temporarily be set but invalidated again if e.g. kvm_ioapic_init()
fails when setting KVM_CREATE_IRQCHIP. Although current users seem to be
fine, this avoids future bugs.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
David Hildenbrand
8bf463f3ba KVM: x86: check against irqchip_mode in kvm_set_routing_entry()
Let's replace the checks for pic_in_kernel() and ioapic_in_kernel() by
checks against irqchip_mode.

Also make sure that creation of any route is only possible if we have
an lapic in kernel (irqchip_in_kernel()) or if we are currently
inititalizing the irqchip.

This is necessary to switch pic_in_kernel() and ioapic_in_kernel() to
irqchip_mode, too.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
David Hildenbrand
637e3f86fa KVM: x86: new irqchip mode KVM_IRQCHIP_INIT_IN_PROGRESS
Let's add a new mode and set it while we create the irqchip via
KVM_CREATE_IRQCHIP and KVM_CAP_SPLIT_IRQCHIP.

This mode will be used later to test if adding routes
(in kvm_set_routing_entry()) is already allowed.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-12 20:17:13 +02:00
Jim Mattson
28d0635388 kvm: nVMX: Disallow userspace-injected exceptions in guest mode
The userspace exception injection API and code path are entirely
unprepared for exceptions that might cause a VM-exit from L2 to L1, so
the best course of action may be to simply disallow this for now.

1. The API provides no mechanism for userspace to specify the new DR6
bits for a #DB exception or the new CR2 value for a #PF
exception. Presumably, userspace is expected to modify these registers
directly with KVM_SET_SREGS before the next KVM_RUN ioctl. However, in
the event that L1 intercepts the exception, these registers should not
be changed. Instead, the new values should be provided in the
exit_qualification field of vmcs12 (Intel SDM vol 3, section 27.1).

2. In the case of a userspace-injected #DB, inject_pending_event()
clears DR7.GD before calling vmx_queue_exception(). However, in the
event that L1 intercepts the exception, this is too early, because
DR7.GD should not be modified by a #DB that causes a VM-exit directly
(Intel SDM vol 3, section 27.1).

3. If the injected exception is a #PF, nested_vmx_check_exception()
doesn't properly check whether or not L1 is interested in the
associated error code (using the #PF error code mask and match fields
from vmcs12). It may either return 0 when it should call
nested_vmx_vmexit() or vice versa.

4. nested_vmx_check_exception() assumes that it is dealing with a
hardware-generated exception intercept from L2, with some of the
relevant details (the VM-exit interruption-information and the exit
qualification) live in vmcs02. For userspace-injected exceptions, this
is not the case.

5. prepare_vmcs12() assumes that when its exit_intr_info argument
specifies valid information with a valid error code that it can VMREAD
the VM-exit interruption error code from vmcs02. For
userspace-injected exceptions, this is not the case.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:01 +02:00
David Hildenbrand
28bf288879 KVM: x86: fix user triggerable warning in kvm_apic_accept_events()
If we already entered/are about to enter SMM, don't allow switching to
INIT/SIPI_RECEIVED, otherwise the next call to kvm_apic_accept_events()
will report a warning.

Same applies if we are already in MP state INIT_RECEIVED and SMM is
requested to be turned on. Refuse to set the VCPU events in this case.

Fixes: cd7764fe9f ("KVM: x86: latch INITs while in system management mode")
Cc: stable@vger.kernel.org # 4.2+
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:01 +02:00
Paolo Bonzini
3042255899 kvm: make KVM_CAP_COALESCED_MMIO architecture agnostic
Remove code from architecture files that can be moved to virt/kvm, since there
is already common code for coalesced MMIO.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
[Removed a pointless 'break' after 'return'.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
a5f4645704 KVM: nVMX: support RDRAND and RDSEED exiting
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
ae1e2d1082 kvm: nVMX: support EPT accessed/dirty bits
Now use bit 6 of EPTP to optionally enable A/D bits for EPTP.  Another
thing to change is that, when EPT accessed and dirty bits are not in use,
VMX treats accesses to guest paging structures as data reads.  When they
are in use (bit 6 of EPTP is set), they are treated as writes and the
corresponding EPT dirty bit is set.  The MMU didn't know this detail,
so this patch adds it.

We also have to fix up the exit qualification.  It may be wrong because
KVM sets bit 6 but the guest might not.

L1 emulates EPT A/D bits using write permissions, so in principle it may
be possible for EPT A/D bits to be used by L1 even though not available
in hardware.  The problem is that guest page-table walks will be treated
as reads rather than writes, so they would not cause an EPT violation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
[Fixed typo in walk_addr_generic() comment and changed bit clear +
 conditional-set pattern in handle_ept_violation() to conditional-clear]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
86407bcb5c kvm: x86: MMU support for EPT accessed/dirty bits
This prepares the MMU paging code for EPT accessed and dirty bits,
which can be enabled optionally at runtime.  Code that updates the
accessed and dirty bits will need a pointer to the struct kvm_mmu.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
0047723130 KVM: VMX: remove bogus check for invalid EPT violation
handle_ept_violation is checking for "guest-linear-address invalid" +
"not a paging-structure walk".  However, _all_ EPT violations without
a valid guest linear address are paging structure walks, because those
EPT violations happen when loading the guest PDPTEs.

Therefore, the check can never be true, and even if it were, KVM doesn't
care about the guest linear address; it only uses the guest *physical*
address VMCS field.  So, remove the check altogether.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
7db742654d KVM: nVMX: we support 1GB EPT pages
Large pages at the PDPE level can be emulated by the MMU, so the bit
can be set unconditionally in the EPT capabilities MSR.  The same is
true of 2MB EPT pages, though all Intel processors with EPT in practice
support those.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
ad6260da1e KVM: x86: drop legacy device assignment
Legacy device assignment has been deprecated since 4.2 (released
1.5 years ago).  VFIO is better and everyone should have switched to it.
If they haven't, this should convince them. :)

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-07 16:49:00 +02:00
Paolo Bonzini
2c82878b0c KVM: VMX: require virtual NMI support
Virtual NMIs are only missing in Prescott and Yonah chips.  Both are obsolete
for virtualization usage---Yonah is 32-bit only even---so drop vNMI emulation.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-07 16:49:00 +02:00
Borislav Petkov
74f169090b kvm/svm: Setup MCG_CAP on AMD properly
MCG_CAP[63:9] bits are reserved on AMD. However, on an AMD guest, this
MSR returns 0x100010a. More specifically, bit 24 is set, which is simply
wrong. That bit is MCG_SER_P and is present only on Intel. Thus, clean
up the reserved bits in order not to confuse guests.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-04-07 16:49:00 +02:00
David Hildenbrand
1279a6b124 KVM: nVMX: single function for switching between vmcs
Let's combine it in a single function vmx_switch_vmcs().

Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Jim Mattson
f0b98c02c1 kvm: vmx: Don't use INVVPID when EPT is enabled
According to the Intel SDM, volume 3, section 28.3.2: Creating and
Using Cached Translation Information, "No linear mappings are used
while EPT is in use." INVEPT will invalidate both the guest-physical
mappings and the combined mappings in the TLBs and paging-structure
caches, so an INVVPID is superfluous.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-07 16:49:00 +02:00
Ladi Prosek
1fb883bb82 KVM: nVMX: initialize PML fields in vmcs02
L2 was running with uninitialized PML fields which led to incomplete
dirty bitmap logging. This manifested as all kinds of subtle erratic
behavior of the nested guest.

Fixes: 843e433057 ("KVM: VMX: Add PML support in VMX")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-04 16:24:43 +02:00
Ladi Prosek
ab007cc94f KVM: nVMX: do not leak PML full vmexit to L1
The PML feature is not exposed to guests so we should not be forwarding
the vmexit either.

This commit fixes BSOD 0x20001 (HYPERVISOR_ERROR) when running Hyper-V
enabled Windows Server 2016 in L1 on hardware that supports PML.

Fixes: 843e433057 ("KVM: VMX: Add PML support in VMX")
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-04-04 16:11:06 +02:00
Ingo Molnar
7f75540ff2 Linux 4.11-rc5
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJY4ZYkAAoJEHm+PkMAQRiGsq4H/R4PMXDoe2XhSSk7IoT97pXV
 /A8np/scAPjzEgYUidbb54OSqWwsPRuPGWONTFeSrE2u0L4wln/REI91jg7QetLq
 IisncExlYeJ/XQ+iO0ZZh9fLbqwIlEJFdSXmyIFr3m/TBxe8a61C8j93oNgM1tHT
 yuwzlq7c3sLq2hsmUG2HyL2kJsEfRasv4Rk0yhFuti12zVsBoTW4qmZuMauq+gdf
 f7cSYgiHhPTdb2o+azg5O7uYNHaQQBxdUMlIuhhYtVOUq+pFDO23SLHSFIW2NwOm
 Zn5R6CFSrLsCw0Bx0v8Xlc151QUbaRK4h9lhUhkBr6d3uNShU1NQ9JojpSvYwBo=
 =vP6E
 -----END PGP SIGNATURE-----

Merge tag 'v4.11-rc5' into x86/mm, to refresh the branch

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-04-03 16:36:32 +02:00
Paolo Bonzini
2beb6dad2e KVM: x86: cleanup the page tracking SRCU instance
SRCU uses a delayed work item.  Skip cleaning it up, and
the result is use-after-free in the work item callbacks.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 0eb05bf290
Reviewed-by: Xiao Guangrong <xiaoguangrong.eric@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-03-28 14:08:02 +02:00
Ladi Prosek
7ad658b693 KVM: nVMX: fix nested EPT detection
The nested_ept_enabled flag introduced in commit 7ca29de213 was not
computed correctly. We are interested only in L1's EPT state, not the
the combined L0+L1 value.

In particular, if L0 uses EPT but L1 does not, nested_ept_enabled must
be false to make sure that PDPSTRs are loaded based on CR3 as usual,
because the special case described in 26.3.2.4 Loading Page-Directory-
Pointer-Table Entries does not apply.

Fixes: 7ca29de213 ("KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT")
Cc: qemu-stable@nongnu.org
Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-03-28 10:10:15 +02:00
Wanpeng Li
08d839c4b1 KVM: VMX: Fix enable VPID conditions
This can be reproduced by running L2 on L1, and disable VPID on L0
if w/o commit "KVM: nVMX: Fix nested VPID vmx exec control", the L2
crash as below:

KVM: entry failed, hardware error 0x7
EAX=00000000 EBX=00000000 ECX=00000000 EDX=000306c3
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
SS =0000 00000000 0000ffff 00009300
DS =0000 00000000 0000ffff 00009300
FS =0000 00000000 0000ffff 00009300
GS =0000 00000000 0000ffff 00009300
LDT=0000 00000000 0000ffff 00008200
TR =0000 00000000 0000ffff 00008b00
GDT=     00000000 0000ffff
IDT=     00000000 0000ffff
CR0=60000010 CR2=00000000 CR3=00000000 CR4=00000000
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000ffff0ff0 DR7=0000000000000400
EFER=0000000000000000

Reference SDM 30.3 INVVPID:

Protected Mode Exceptions
- #UD
  - If not in VMX operation.
  - If the logical processor does not support VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=0).
  - If the logical processor supports VPIDs (IA32_VMX_PROCBASED_CTLS2[37]=1) but does
    not support the INVVPID instruction (IA32_VMX_EPT_VPID_CAP[32]=0).

So we should check both VPID enable bit in vmx exec control and INVVPID support bit
in vmx capability MSRs to enable VPID. This patch adds the guarantee to not enable
VPID if either INVVPID or single-context/all-context invalidation is not exposed in
vmx capability MSRs.

Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-03-23 19:02:22 +01:00
Wanpeng Li
63cb6d5f00 KVM: nVMX: Fix nested VPID vmx exec control
This can be reproduced by running kvm-unit-tests/vmx.flat on L0 w/ vpid disabled.

Test suite: VPID
Unhandled exception 6 #UD at ip 00000000004051a6
error_code=0000      rflags=00010047      cs=00000008
rax=0000000000000000 rcx=0000000000000001 rdx=0000000000000047 rbx=0000000000402f79
rbp=0000000000456240 rsi=0000000000000001 rdi=0000000000000000
r8=000000000000000a  r9=00000000000003f8 r10=0000000080010011 r11=0000000000000000
r12=0000000000000003 r13=0000000000000708 r14=0000000000000000 r15=0000000000000000
cr0=0000000080010031 cr2=0000000000000000 cr3=0000000007fff000 cr4=0000000000002020
cr8=0000000000000000
STACK: @4051a6 40523e 400f7f 402059 40028f

We should hide and forbid VPID in L1 if it is disabled on L0. However, nested VPID
enable bit is set unconditionally during setup nested vmx exec controls though VPID
is not exposed through nested VMX capablity. This patch fixes it by don't set nested
VPID enable bit if it is disabled on L0.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 5c614b3583 (KVM: nVMX: nested VPID emulation)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-03-23 19:02:14 +01:00
Wanpeng Li
24dccf83a1 KVM: x86: correct async page present tracepoint
After async pf setup successfully, there is a broadcast wakeup w/ special
token 0xffffffff which tells vCPU that it should wake up all processes
waiting for APFs though there is no real process waiting at the moment.

The async page present tracepoint print prematurely and fails to catch the
special token setup. This patch fixes it by moving the async page present
tracepoint after the special token setup.

Before patch:

qemu-system-x86-8499  [006] ...1  5973.473292: kvm_async_pf_ready: token 0x0 gva 0x0

After patch:

qemu-system-x86-8499  [006] ...1  5973.473292: kvm_async_pf_ready: token 0xffffffff gva 0x0

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-03-23 19:02:07 +01:00
Jim Mattson
fb6c819843 kvm: vmx: Flush TLB when the APIC-access address changes
Quoting from the Intel SDM, volume 3, section 28.3.3.4: Guidelines for
Use of the INVEPT Instruction:

If EPT was in use on a logical processor at one time with EPTP X, it
is recommended that software use the INVEPT instruction with the
"single-context" INVEPT type and with EPTP X in the INVEPT descriptor
before a VM entry on the same logical processor that enables EPT with
EPTP X and either (a) the "virtualize APIC accesses" VM-execution
control was changed from 0 to 1; or (b) the value of the APIC-access
address was changed.

In the nested case, the burden falls on L1, unless L0 enables EPT in
vmcs02 when L1 doesn't enable EPT in vmcs12.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-23 19:02:06 +01:00
Peter Xu
c761159cf8 KVM: x86: use pic/ioapic destructor when destroy vm
We have specific destructors for pic/ioapic, we'd better use them when
destroying the VM as well.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-23 19:02:06 +01:00
Peter Xu
950712eb8e KVM: x86: check existance before destroy
Mostly used for split irqchip mode. In that case, these two things are
not inited at all, so no need to release.

Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-23 19:02:03 +01:00
Andy Lutomirski
59c58ceb29 x86/gdt: Get rid of the get_*_gdt_*_vaddr() helpers
There's a single caller that is only there because it's passing a
pointer into a function (vmcs_writel()) that takes an unsigned long.
Let's just cast it in place rather than having a bunch of trivial
helpers.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/46108fb35e1699252b1b6a85039303ff562c9836.1490218061.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-23 08:25:08 +01:00
Wanpeng Li
6d1b3ad2cd KVM: nVMX: don't reset kvm mmu twice
kvm mmu is reset once successfully loading CR3 as part of emulating vmentry
in nested_vmx_load_cr3(). We should not reset kvm mmu twice.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-20 16:25:06 +01:00
Dmitry Vyukov
3863dff0c3 kvm: fix usage of uninit spinlock in avic_vm_destroy()
If avic is not enabled, avic_vm_init() does nothing and returns early.
However, avic_vm_destroy() still tries to destroy what hasn't been created.
The only bad consequence of this now is that avic_vm_destroy() uses
svm_vm_data_hash_lock that hasn't been initialized (and is not meant
to be used at all if avic is not enabled).

Return early from avic_vm_destroy() if avic is not enabled.
It has nothing to destroy.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: kvm@vger.kernel.org
Cc: syzkaller@googlegroups.com
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-20 16:25:05 +01:00
Radim Krčmář
6c6c5e0311 KVM: VMX: downgrade warning on unexpected exit code
We never needed the call trace and we better rate-limit if it can be
triggered by a guest.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-20 16:25:05 +01:00
Thomas Garnier
45fc8757d1 x86: Make the GDT remapping read-only on 64-bit
This patch makes the GDT remapped pages read-only, to prevent accidental
(or intentional) corruption of this key data structure.

This change is done only on 64-bit, because 32-bit needs it to be writable
for TSS switches.

The native_load_tr_desc function was adapted to correctly handle a
read-only GDT. The LTR instruction always writes to the GDT TSS entry.
This generates a page fault if the GDT is read-only. This change checks
if the current GDT is a remap and swap GDTs as needed. This function was
tested by booting multiple machines and checking hibernation works
properly.

KVM SVM and VMX were adapted to use the writeable GDT. On VMX, the
per-cpu variable was removed for functions to fetch the original GDT.
Instead of reloading the previous GDT, VMX will reload the fixmap GDT as
expected. For testing, VMs were started and restored on multiple
configurations.

Signed-off-by: Thomas Garnier <thgarnie@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Luis R . Rodriguez <mcgrof@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Rafael J . Wysocki <rjw@rjwysocki.net>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: kasan-dev@googlegroups.com
Cc: kernel-hardening@lists.openwall.com
Cc: kvm@vger.kernel.org
Cc: lguest@lists.ozlabs.org
Cc: linux-doc@vger.kernel.org
Cc: linux-efi@vger.kernel.org
Cc: linux-mm@kvack.org
Cc: linux-pm@vger.kernel.org
Cc: xen-devel@lists.xenproject.org
Cc: zijun_hu <zijun_hu@htc.com>
Link: http://lkml.kernel.org/r/20170314170508.100882-3-thgarnie@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-16 09:06:35 +01:00
Radim Krčmář
05d8d34611 KVM: nVMX: do not warn when MSR bitmap address is not backed
Before trying to do nested_get_page() in nested_vmx_merge_msr_bitmap(),
we have already checked that the MSR bitmap address is valid (4k aligned
and within physical limits).  SDM doesn't specify what happens if the
there is no memory mapped at the valid address, but Intel CPUs treat the
situation as if the bitmap was configured to trap all MSRs.

KVM already does that by returning false and a correct handling doesn't
need the guest-trigerrable warning that was reported by syzkaller:
(The warning was originally there to catch some possible bugs in nVMX.)

  ------------[ cut here ]------------
  WARNING: CPU: 0 PID: 7832 at arch/x86/kvm/vmx.c:9709
  nested_vmx_merge_msr_bitmap arch/x86/kvm/vmx.c:9709 [inline]
  WARNING: CPU: 0 PID: 7832 at arch/x86/kvm/vmx.c:9709
  nested_get_vmcs12_pages+0xfb6/0x15c0 arch/x86/kvm/vmx.c:9640
  Kernel panic - not syncing: panic_on_warn set ...
  CPU: 0 PID: 7832 Comm: syz-executor1 Not tainted 4.10.0+ #229
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Call Trace:
   __dump_stack lib/dump_stack.c:15 [inline]
   dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
   panic+0x1fb/0x412 kernel/panic.c:179
   __warn+0x1c4/0x1e0 kernel/panic.c:540
   warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
   nested_vmx_merge_msr_bitmap arch/x86/kvm/vmx.c:9709 [inline]
   nested_get_vmcs12_pages+0xfb6/0x15c0 arch/x86/kvm/vmx.c:9640
   enter_vmx_non_root_mode arch/x86/kvm/vmx.c:10471 [inline]
   nested_vmx_run+0x6186/0xaab0 arch/x86/kvm/vmx.c:10561
   handle_vmlaunch+0x1a/0x20 arch/x86/kvm/vmx.c:7312
   vmx_handle_exit+0xfc0/0x3f00 arch/x86/kvm/vmx.c:8526
   vcpu_enter_guest arch/x86/kvm/x86.c:6982 [inline]
   vcpu_run arch/x86/kvm/x86.c:7044 [inline]
   kvm_arch_vcpu_ioctl_run+0x1418/0x4840 arch/x86/kvm/x86.c:7205
   kvm_vcpu_ioctl+0x673/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2570

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
[Jim Mattson explained the bare metal behavior: "I believe this behavior
 would be documented in the chipset data sheet rather than the SDM,
 since the chipset returns all 1s for an unclaimed read."]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-09 15:34:51 +01:00
Wanpeng Li
2f707d9798 KVM: nVMX: reset nested_run_pending if the vCPU is going to be reset
Reported by syzkaller:

    WARNING: CPU: 1 PID: 27742 at arch/x86/kvm/vmx.c:11029
    nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029
    CPU: 1 PID: 27742 Comm: a.out Not tainted 4.10.0+ #229
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:15 [inline]
     dump_stack+0x2ee/0x3ef lib/dump_stack.c:51
     panic+0x1fb/0x412 kernel/panic.c:179
     __warn+0x1c4/0x1e0 kernel/panic.c:540
     warn_slowpath_null+0x2c/0x40 kernel/panic.c:583
     nested_vmx_vmexit+0x5c35/0x74d0 arch/x86/kvm/vmx.c:11029
     vmx_leave_nested arch/x86/kvm/vmx.c:11136 [inline]
     vmx_set_msr+0x1565/0x1910 arch/x86/kvm/vmx.c:3324
     kvm_set_msr+0xd4/0x170 arch/x86/kvm/x86.c:1099
     do_set_msr+0x11e/0x190 arch/x86/kvm/x86.c:1128
     __msr_io arch/x86/kvm/x86.c:2577 [inline]
     msr_io+0x24b/0x450 arch/x86/kvm/x86.c:2614
     kvm_arch_vcpu_ioctl+0x35b/0x46a0 arch/x86/kvm/x86.c:3497
     kvm_vcpu_ioctl+0x232/0x1120 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2721
     vfs_ioctl fs/ioctl.c:43 [inline]
     do_vfs_ioctl+0x1bf/0x1790 fs/ioctl.c:683
     SYSC_ioctl fs/ioctl.c:698 [inline]
     SyS_ioctl+0x8f/0xc0 fs/ioctl.c:689
     entry_SYSCALL_64_fastpath+0x1f/0xc2

The syzkaller folks reported a nested_run_pending warning during userspace
clear VMX capability which is exposed to L1 before.

The warning gets thrown while doing

(*(uint32_t*)0x20aecfe8 = (uint32_t)0x1);
(*(uint32_t*)0x20aecfec = (uint32_t)0x0);
(*(uint32_t*)0x20aecff0 = (uint32_t)0x3a);
(*(uint32_t*)0x20aecff4 = (uint32_t)0x0);
(*(uint64_t*)0x20aecff8 = (uint64_t)0x0);
r[29] = syscall(__NR_ioctl, r[4], 0x4008ae89ul,
		0x20aecfe8ul, 0, 0, 0, 0, 0, 0);

i.e. KVM_SET_MSR ioctl with

struct kvm_msrs {
	.nmsrs = 1,
		.pad = 0,
		.entries = {
			{.index = MSR_IA32_FEATURE_CONTROL,
			 .reserved = 0,
			 .data = 0}
		}
}

The VMLANCH/VMRESUME emulation should be stopped since the CPU is going to
reset here. This patch resets the nested_run_pending since the CPU is going
to be reset hence there should be nothing pending.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-07 15:41:12 +01:00
Jim Mattson
587d7e72ae kvm: nVMX: VMCLEAR should not cause the vCPU to shut down
VMCLEAR should silently ignore a failure to clear the launch state of
the VMCS referenced by the operand.

Signed-off-by: Jim Mattson <jmattson@google.com>
[Changed "kvm_write_guest(vcpu->kvm" to "kvm_vcpu_write_guest(vcpu".]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-06 17:31:29 +01:00
Linus Torvalds
2d62e0768d Second batch of KVM changes for 4.11 merge window
PPC:
  * correct assumption about ASDR on POWER9
  * fix MMIO emulation on POWER9
 
 x86:
  * add a simple test for ioperm
  * cleanup TSS
    (going through KVM tree as the whole undertaking was caused by VMX's
     use of TSS)
  * fix nVMX interrupt delivery
  * fix some performance counters in the guest
 
 And two cleanup patches.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJYuu5qAAoJEED/6hsPKofoRAUH/jkx/KFDcw3FggixysWVgRai
 iLSbbAZemnSLFSOkOU/t7Bz0fXCUgB0tAcMJd9ow01Dg1zObiTpuUIo6qEPaYHdX
 gqtUzlHuyECZEcgK0RXS9kDYLrvw7EFocxnDWQfV91qCZSS6nBSSLF3ST1rNV69W
 mUvcZG+MciDcZUe1lTexoswVTh1m7avvozEnQ5OHnZR9yicoXiadBQjzL6yqWoqf
 Ml/29zRk5+MvloTudxjkAKm3mh7psW88jNMh37TXbAA7i+Xwl9cU6GLR9mFWstoP
 7Ot7ecq9mNAUO3lTIQh7lqvB60LMFznS4IlYK7MbplC3kvJLkfzhTWaN1aGvh90=
 =cqHo
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.11-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull more KVM updates from Radim Krčmář:
 "Second batch of KVM changes for the 4.11 merge window:

  PPC:
   - correct assumption about ASDR on POWER9
   - fix MMIO emulation on POWER9

  x86:
   - add a simple test for ioperm
   - cleanup TSS (going through KVM tree as the whole undertaking was
     caused by VMX's use of TSS)
   - fix nVMX interrupt delivery
   - fix some performance counters in the guest

  ... and two cleanup patches"

* tag 'kvm-4.11-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: nVMX: Fix pending events injection
  x86/kvm/vmx: remove unused variable in segment_base()
  selftests/x86: Add a basic selftest for ioperm
  x86/asm: Tidy up TSS limit code
  kvm: convert kvm.users_count from atomic_t to refcount_t
  KVM: x86: never specify a sample period for virtualized in_tx_cp counters
  KVM: PPC: Book3S HV: Don't use ASDR for real-mode HPT faults on POWER9
  KVM: PPC: Book3S HV: Fix software walk of guest process page tables
2017-03-04 11:36:19 -08:00
Ingo Molnar
3905f9ad45 sched/headers: Prepare to move sched_info_on() and force_schedstat_enabled() from <linux/sched.h> to <linux/sched/stat.h>
But first update usage sites with the new header dependency.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:39 +01:00
Ingo Molnar
32ef5517c2 sched/headers: Prepare to move cputime functionality from <linux/sched.h> into <linux/sched/cputime.h>
Introduce a trivial, mostly empty <linux/sched/cputime.h> header
to prepare for the moving of cputime functionality out of sched.h.

Update all code that relies on these facilities.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:39 +01:00
Ingo Molnar
b2d0910310 sched/headers: Prepare to use <linux/rcuupdate.h> instead of <linux/rculist.h> in <linux/sched.h>
We don't actually need the full rculist.h header in sched.h anymore,
we will be able to include the smaller rcupdate.h header instead.

But first update code that relied on the implicit header inclusion.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:38 +01:00
Ingo Molnar
3f07c01441 sched/headers: Prepare for new header dependencies before moving code to <linux/sched/signal.h>
We are going to split <linux/sched/signal.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/signal.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.

Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-03-02 08:42:29 +01:00
Wanpeng Li
acc9ab6013 KVM: nVMX: Fix pending events injection
L2 fails to boot on a non-APICv box dues to 'commit 0ad3bed6c5
("kvm: nVMX: move nested events check to kvm_vcpu_running")'

KVM internal error. Suberror: 3
extra data[0]: 800000ef
extra data[1]: 1
RAX=0000000000000000 RBX=ffffffff81f36140 RCX=0000000000000000 RDX=0000000000000000
RSI=0000000000000000 RDI=0000000000000000 RBP=ffff88007c92fe90 RSP=ffff88007c92fe90
R8 =ffff88007fccdca0 R9 =0000000000000000 R10=00000000fffedb3d R11=0000000000000000
R12=0000000000000003 R13=0000000000000000 R14=0000000000000000 R15=ffff88007c92c000
RIP=ffffffff810645e6 RFL=00000246 [---Z-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 0000000000000000 ffffffff 00c00000
CS =0010 0000000000000000 ffffffff 00a09b00 DPL=0 CS64 [-RA]
SS =0000 0000000000000000 ffffffff 00c00000
DS =0000 0000000000000000 ffffffff 00c00000
FS =0000 0000000000000000 ffffffff 00c00000
GS =0000 ffff88007fcc0000 ffffffff 00c00000
LDT=0000 0000000000000000 ffffffff 00c00000
TR =0040 ffff88007fcd4200 00002087 00008b00 DPL=0 TSS64-busy
GDT=     ffff88007fcc9000 0000007f
IDT=     ffffffffff578000 00000fff
CR0=80050033 CR2=00000000ffffffff CR3=0000000001e0a000 CR4=003406e0
DR0=0000000000000000 DR1=0000000000000000 DR2=0000000000000000 DR3=0000000000000000
DR6=00000000fffe0ff0 DR7=0000000000000400
EFER=0000000000000d01

We should try to reinject previous events if any before trying to inject
new event if pending. If vmexit is triggered by L2 guest and L0 interested
in, we should reinject IDT-vectoring info to L2 through vmcs02 if any,
otherwise, we can consider new IRQs/NMIs which can be injected and call
nested events callback to switch from L2 to L1 if needed and inject the
proper vmexit events. However, 'commit 0ad3bed6c5 ("kvm: nVMX: move
nested events check to kvm_vcpu_running")' results in the handle events
order reversely on non-APICv box. This patch fixes it by bailing out for
pending events and not consider new events in this scenario.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Fixes: 0ad3bed6c5 ("kvm: nVMX: move nested events check to kvm_vcpu_running")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-01 17:03:24 +01:00
Jérémy Lefaure
0fce546f9f x86/kvm/vmx: remove unused variable in segment_base()
The pointer 'struct desc_struct *d' is unused since commit 8c2e41f7ae
("x86/kvm/vmx: Simplify segment_base()") so let's remove it.

Signed-off-by: Jérémy Lefaure <jeremy.lefaure@lse.epita.fr>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-01 17:03:24 +01:00
Robert O'Callahan
bba82fd756 KVM: x86: never specify a sample period for virtualized in_tx_cp counters
pmc_reprogram_counter() always sets a sample period based on the value of
pmc->counter. However, hsw_hw_config() rejects sample periods less than
2^31 - 1. So for example, if a KVM guest does

    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x2005101c4; // conditional branches retired IN_TXCP
    attr.sample_period = 0;
    int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

the guest kernel counts some conditional branch events, then updates the
virtual PMU register with a nonzero count. The host reaches
pmc_reprogram_counter() with nonzero pmc->counter, triggers EOPNOTSUPP
in hsw_hw_config(), prints "kvm_pmu: event creation failed" in
pmc_reprogram_counter(), and silently (from the guest's point of view) stops
counting events.

We fix event counting by forcing attr.sample_period to always be zero for
in_tx_cp counters. Sampling doesn't work, but it already didn't work and
can't be fixed without major changes to the approach in hsw_hw_config().

Signed-off-by: Robert O'Callahan <robert@ocallahan.org>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-03-01 14:19:46 +01:00
Masahiro Yamada
9332ef9dbd scripts/spelling.txt: add "an user" pattern and fix typo instances
Fix typos and add the following to the scripts/spelling.txt:

  an user||a user
  an userspace||a userspace

I also added "userspace" to the list since it is a common word in Linux.
I found some instances for "an userfaultfd", but I did not add it to the
list.  I felt it is endless to find words that start with "user" such as
"userland" etc., so must draw a line somewhere.

Link: http://lkml.kernel.org/r/1481573103-11329-4-git-send-email-yamada.masahiro@socionext.com
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-27 18:43:46 -08:00
Linus Torvalds
fd7e9a8834 4.11 is going to be a relatively large release for KVM, with a little over
200 commits and noteworthy changes for most architectures.
 
 * ARM:
 - GICv3 save/restore
 - cache flushing fixes
 - working MSI injection for GICv3 ITS
 - physical timer emulation
 
 * MIPS:
 - various improvements under the hood
 - support for SMP guests
 - a large rewrite of MMU emulation.  KVM MIPS can now use MMU notifiers
 to support copy-on-write, KSM, idle page tracking, swapping, ballooning
 and everything else.  KVM_CAP_READONLY_MEM is also supported, so that
 writes to some memory regions can be treated as MMIO.  The new MMU also
 paves the way for hardware virtualization support.
 
 * PPC:
 - support for POWER9 using the radix-tree MMU for host and guest
 - resizable hashed page table
 - bugfixes.
 
 * s390: expose more features to the guest
 - more SIMD extensions
 - instruction execution protection
 - ESOP2
 
 * x86:
 - improved hashing in the MMU
 - faster PageLRU tracking for Intel CPUs without EPT A/D bits
 - some refactoring of nested VMX entry/exit code, preparing for live
 migration support of nested hypervisors
 - expose yet another AVX512 CPUID bit
 - host-to-guest PTP support
 - refactoring of interrupt injection, with some optimizations thrown in
 and some duct tape removed.
 - remove lazy FPU handling
 - optimizations of user-mode exits
 - optimizations of vcpu_is_preempted() for KVM guests
 
 * generic:
 - alternative signaling mechanism that doesn't pound on tsk->sighand->siglock
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJYral1AAoJEL/70l94x66DbNgH/Rx8YXuidFq2fe3RWOvld3RK
 85OM/D5g38cTLpBE0/sJpcvX34iYN8U/l5foCZwpxB+83GHEk2Cr57JyfTogdaAJ
 x8dBhHKQCA/HxSQUQLN6nFqRV+yT8WUR92Fhqx82+80BSen5Yzcfee/TDoW6T1IW
 g8CYgX9FrRaGOX066ImAuUfdAdUVjyssfs9VttDTX+HiusPeuBPx/wsRe1ZEEPlH
 vnltIJQb1ETV2GOZLUojKjzH6aZkjIl29XxjkYii9JTUornClG0DfW+5QT3uLrB5
 gJ+G+Zmpsq8ZBx9jNDtAi7sFsoPY1Mzf+JPNCGXBra2sP2GrBAuXcxmgznRYltQ=
 =8IIp
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "4.11 is going to be a relatively large release for KVM, with a little
  over 200 commits and noteworthy changes for most architectures.

  ARM:
   - GICv3 save/restore
   - cache flushing fixes
   - working MSI injection for GICv3 ITS
   - physical timer emulation

  MIPS:
   - various improvements under the hood
   - support for SMP guests
   - a large rewrite of MMU emulation. KVM MIPS can now use MMU
     notifiers to support copy-on-write, KSM, idle page tracking,
     swapping, ballooning and everything else. KVM_CAP_READONLY_MEM is
     also supported, so that writes to some memory regions can be
     treated as MMIO. The new MMU also paves the way for hardware
     virtualization support.

  PPC:
   - support for POWER9 using the radix-tree MMU for host and guest
   - resizable hashed page table
   - bugfixes.

  s390:
   - expose more features to the guest
   - more SIMD extensions
   - instruction execution protection
   - ESOP2

  x86:
   - improved hashing in the MMU
   - faster PageLRU tracking for Intel CPUs without EPT A/D bits
   - some refactoring of nested VMX entry/exit code, preparing for live
     migration support of nested hypervisors
   - expose yet another AVX512 CPUID bit
   - host-to-guest PTP support
   - refactoring of interrupt injection, with some optimizations thrown
     in and some duct tape removed.
   - remove lazy FPU handling
   - optimizations of user-mode exits
   - optimizations of vcpu_is_preempted() for KVM guests

  generic:
   - alternative signaling mechanism that doesn't pound on
     tsk->sighand->siglock"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (195 commits)
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
  x86/paravirt: Change vcp_is_preempted() arg type to long
  KVM: VMX: use correct vmcs_read/write for guest segment selector/base
  x86/kvm/vmx: Defer TR reload after VM exit
  x86/asm/64: Drop __cacheline_aligned from struct x86_hw_tss
  x86/kvm/vmx: Simplify segment_base()
  x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels
  x86/kvm/vmx: Don't fetch the TSS base from the GDT
  x86/asm: Define the kernel TSS limit in a macro
  kvm: fix page struct leak in handle_vmon
  KVM: PPC: Book3S HV: Disable HPT resizing on POWER9 for now
  KVM: Return an error code only as a constant in kvm_get_dirty_log()
  KVM: Return an error code only as a constant in kvm_get_dirty_log_protect()
  KVM: Return directly after a failed copy_from_user() in kvm_vm_compat_ioctl()
  KVM: x86: remove code for lazy FPU handling
  KVM: race-free exit from KVM_RUN without POSIX signals
  KVM: PPC: Book3S HV: Turn "KVM guest htab" message into a debug message
  KVM: PPC: Book3S PR: Ratelimit copy data failure error messages
  KVM: Support vCPU-based gfn->hva cache
  KVM: use separate generations for each address space
  ...
2017-02-22 18:22:53 -08:00
Chao Peng
96794e4ed4 KVM: VMX: use correct vmcs_read/write for guest segment selector/base
Guest segment selector is 16 bit field and guest segment base is natural
width field. Fix two incorrect invocations accordingly.

Without this patch, build fails when aggressive inlining is used with ICC.

Cc: stable@vger.kernel.org
Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 12:45:49 +01:00
Andy Lutomirski
b7ffc44d5b x86/kvm/vmx: Defer TR reload after VM exit
Intel's VMX is daft and resets the hidden TSS limit register to 0x67
on VMX reload, and the 0x67 is not configurable.  KVM currently
reloads TR using the LTR instruction on every exit, but this is quite
slow because LTR is serializing.

The 0x67 limit is entirely harmless unless ioperm() is in use, so
defer the reload until a task using ioperm() is actually running.

Here's some poorly done benchmarking using kvm-unit-tests:

Before:

cpuid 1313
vmcall 1195
mov_from_cr8 11
mov_to_cr8 17
inl_from_pmtimer 6770
inl_from_qemu 6856
inl_from_kernel 2435
outl_to_kernel 1402

After:

cpuid 1291
vmcall 1181
mov_from_cr8 11
mov_to_cr8 16
inl_from_pmtimer 6457
inl_from_qemu 6209
inl_from_kernel 2339
outl_to_kernel 1391

Signed-off-by: Andy Lutomirski <luto@kernel.org>
[Force-reload TR in invalidate_tss_limit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 12:45:08 +01:00
Andy Lutomirski
8c2e41f7ae x86/kvm/vmx: Simplify segment_base()
Use actual pointer types for pointers (instead of unsigned long) and
replace hardcoded constants with the appropriate self-documenting
macros.

The function is still a bit messy, but this seems a lot better than
before to me.

This is mostly borrowed from a patch by Thomas Garnier.

Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 11:48:56 +01:00
Andy Lutomirski
e28baeadcf x86/kvm/vmx: Get rid of segment_base() on 64-bit kernels
It was a bit buggy (it didn't list all segment types that needed
64-bit fixups), but the bug was irrelevant because it wasn't called
in any interesting context on 64-bit kernels and was only used for
data segents on 32-bit kernels.

To avoid confusion, make it explicitly 32-bit only.

Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 11:48:46 +01:00
Andy Lutomirski
e0c230634a x86/kvm/vmx: Don't fetch the TSS base from the GDT
The current CPU's TSS base is a foregone conclusion, so there's no need
to parse it out of the segment tables.  This should save a couple cycles
(as STR is surely microcoded and poorly optimized) but, more importantly,
it's a cleanup and it means that segment_base() will never be called on
64-bit kernels.

Cc: Thomas Garnier <thgarnie@google.com>
Cc: Jim Mattson <jmattson@google.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-21 11:48:40 +01:00
Linus Torvalds
828cad8ea0 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "The main changes in this (fairly busy) cycle were:

   - There was a class of scheduler bugs related to forgetting to update
     the rq-clock timestamp which can cause weird and hard to debug
     problems, so there's a new debug facility for this: which uncovered
     a whole lot of bugs which convinced us that we want to keep the
     debug facility.

     (Peter Zijlstra, Matt Fleming)

   - Various cputime related updates: eliminate cputime and use u64
     nanoseconds directly, simplify and improve the arch interfaces,
     implement delayed accounting more widely, etc. - (Frederic
     Weisbecker)

   - Move code around for better structure plus cleanups (Ingo Molnar)

   - Move IO schedule accounting deeper into the scheduler plus related
     changes to improve the situation (Tejun Heo)

   - ... plus a round of sched/rt and sched/deadline fixes, plus other
     fixes, updats and cleanups"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (85 commits)
  sched/core: Remove unlikely() annotation from sched_move_task()
  sched/autogroup: Rename auto_group.[ch] to autogroup.[ch]
  sched/topology: Split out scheduler topology code from core.c into topology.c
  sched/core: Remove unnecessary #include headers
  sched/rq_clock: Consolidate the ordering of the rq_clock methods
  delayacct: Include <uapi/linux/taskstats.h>
  sched/core: Clean up comments
  sched/rt: Show the 'sched_rr_timeslice' SCHED_RR timeslice tuning knob in milliseconds
  sched/clock: Add dummy clear_sched_clock_stable() stub function
  sched/cputime: Remove generic asm headers
  sched/cputime: Remove unused nsec_to_cputime()
  s390, sched/cputime: Remove unused cputime definitions
  powerpc, sched/cputime: Remove unused cputime definitions
  s390, sched/cputime: Make arch_cpu_idle_time() to return nsecs
  ia64, sched/cputime: Remove unused cputime definitions
  ia64: Convert vtime to use nsec units directly
  ia64, sched/cputime: Move the nsecs based cputime headers to the last arch using it
  sched/cputime: Remove jiffies based cputime
  sched/cputime, vtime: Return nsecs instead of cputime_t to account
  sched/cputime: Complete nsec conversion of tick based accounting
  ...
2017-02-20 12:52:55 -08:00
Paolo Bonzini
06ce521af9 kvm: fix page struct leak in handle_vmon
handle_vmon gets a reference on VMXON region page,
but does not release it. Release the reference.

Found by syzkaller; based on a patch by Dmitry.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-20 17:54:43 +01:00
Paolo Bonzini
bd7e5b0899 KVM: x86: remove code for lazy FPU handling
The FPU is always active now when running KVM.

Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-17 12:28:01 +01:00
Paolo Bonzini
460df4c1fc KVM: race-free exit from KVM_RUN without POSIX signals
The purpose of the KVM_SET_SIGNAL_MASK API is to let userspace "kick"
a VCPU out of KVM_RUN through a POSIX signal.  A signal is attached
to a dummy signal handler; by blocking the signal outside KVM_RUN and
unblocking it inside, this possible race is closed:

          VCPU thread                     service thread
   --------------------------------------------------------------
        check flag
                                          set flag
                                          raise signal
        (signal handler does nothing)
        KVM_RUN

However, one issue with KVM_SET_SIGNAL_MASK is that it has to take
tsk->sighand->siglock on every KVM_RUN.  This lock is often on a
remote NUMA node, because it is on the node of a thread's creator.
Taking this lock can be very expensive if there are many userspace
exits (as is the case for SMP Windows VMs without Hyper-V reference
time counter).

As an alternative, we can put the flag directly in kvm_run so that
KVM can see it:

          VCPU thread                     service thread
   --------------------------------------------------------------
                                          raise signal
        signal handler
          set run->immediate_exit
        KVM_RUN
          check run->immediate_exit

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-17 12:27:37 +01:00
Cao, Lei
bbd6411513 KVM: Support vCPU-based gfn->hva cache
Provide versions of struct gfn_to_hva_cache functions that
take vcpu as a parameter instead of struct kvm.  The existing functions
are not needed anymore, so delete them.  This allows dirty pages to
be logged in the vcpu dirty ring, instead of the global dirty ring,
for ring-based dirty memory tracking.

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Message-Id: <CY1PR08MB19929BD2AC47A291FD680E83F04F0@CY1PR08MB1992.namprd08.prod.outlook.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-16 18:42:46 +01:00
Paolo Bonzini
47c0152e0f KVM: VMX: use vmcs_set/clear_bits for CPU-based execution controls
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-16 18:42:26 +01:00
David Hildenbrand
681bcea802 KVM: svm: inititalize hash table structures directly
The hashtable and guarding spinlock are global data structures,
we can inititalize them statically.

Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20170124212116.4568-1-david@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 15:26:06 +01:00
Jim Mattson
858e25c06f kvm: nVMX: Refactor nested_vmx_run()
Nested_vmx_run is split into two parts: the part that handles the
VMLAUNCH/VMRESUME instruction, and the part that modifies the vcpu state
to transition from VMX root mode to VMX non-root mode. The latter will
be used when restoring the checkpointed state of a vCPU that was in VMX
operation when a snapshot was taken.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:56:36 +01:00
Jim Mattson
ca0bde28f2 kvm: nVMX: Split VMCS checks from nested_vmx_run()
The checks performed on the contents of the vmcs12 are extracted from
nested_vmx_run so that they can be used to validate a vmcs12 that has
been restored from a checkpoint.

Signed-off-by: Jim Mattson <jmattson@google.com>
[Change prepare_vmcs02 and nested_vmx_load_cr3's last argument to u32,
 to match check_vmentry_postreqs.  Update comments for singlestep
 handling. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:56:35 +01:00
Jim Mattson
6beb7bd52e kvm: nVMX: Refactor nested_get_vmcs12_pages()
Perform the checks on vmcs12 state early, but defer the gpa->hpa lookups
until after prepare_vmcs02. Later, when we restore the checkpointed
state of a vCPU in guest mode, we will not be able to do the gpa->hpa
lookups when the restore is done.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:56:11 +01:00
Jim Mattson
a8bc284eb7 kvm: nVMX: Refactor handle_vmptrld()
Handle_vmptrld is split into two parts: the part that handles the
VMPTRLD instruction, and the part that establishes the current VMCS
pointer.  The latter will be used when restoring the checkpointed state
of a vCPU that had a valid VMCS pointer when a snapshot was taken.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:37 +01:00
Jim Mattson
e29acc55bf kvm: nVMX: Refactor handle_vmon()
Handle_vmon is split into two parts: the part that handles the VMXON
instruction, and the part that modifies the vcpu state to transition
from legacy mode to VMX operation. The latter will be used when
restoring the checkpointed state of a vCPU that was in VMX operation
when a snapshot was taken.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:37 +01:00
Jim Mattson
cf8b84f48a kvm: nVMX: Prepare for checkpointing L2 state
Split prepare_vmcs12 into two parts: the part that stores the current L2
guest state and the part that sets up the exit information fields. The
former will be used when checkpointing the vCPU's VMX state.

Modify prepare_vmcs02 so that it can construct a vmcs02 midway through
L2 execution, using the checkpointed L2 guest state saved into the
cached vmcs12 above.

Signed-off-by: Jim Mattson <jmattson@google.com>
[Rebasing: add from_vmentry argument to prepare_vmcs02 instead of using
 vmx->nested.nested_run_pending, because it is no longer 1 at the
 point prepare_vmcs02 is called. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:36 +01:00
Paolo Bonzini
b95234c840 kvm: x86: do not use KVM_REQ_EVENT for APICv interrupt injection
Since bf9f6ac8d7 ("KVM: Update Posted-Interrupts Descriptor when vCPU
is blocked", 2015-09-18) the posted interrupt descriptor is checked
unconditionally for PIR.ON.  Therefore we don't need KVM_REQ_EVENT to
trigger the scan and, if NMIs or SMIs are not involved, we can avoid
the complicated event injection path.

Calling kvm_vcpu_kick if PIR.ON=1 is also useless, though it has been
there since APICv was introduced.

However, without the KVM_REQ_EVENT safety net KVM needs to be much
more careful about races between vmx_deliver_posted_interrupt and
vcpu_enter_guest.  First, the IPI for posted interrupts may be issued
between setting vcpu->mode = IN_GUEST_MODE and disabling interrupts.
If that happens, kvm_trigger_posted_interrupt returns true, but
smp_kvm_posted_intr_ipi doesn't do anything about it.  The guest is
entered with PIR.ON, but the posted interrupt IPI has not been sent
and the interrupt is only delivered to the guest on the next vmentry
(if any).  To fix this, disable interrupts before setting vcpu->mode.
This ensures that the IPI is delayed until the guest enters non-root mode;
it is then trapped by the processor causing the interrupt to be injected.

Second, the IPI may be issued between kvm_x86_ops->sync_pir_to_irr(vcpu)
and vcpu->mode = IN_GUEST_MODE.  In this case, kvm_vcpu_kick is called
but it (correctly) doesn't do anything because it sees vcpu->mode ==
OUTSIDE_GUEST_MODE.  Again, the guest is entered with PIR.ON but no
posted interrupt IPI is pending; this time, the fix for this is to move
the RVI update after IN_GUEST_MODE.

Both issues were mostly masked by the liberal usage of KVM_REQ_EVENT,
though the second could actually happen with VT-d posted interrupts.
In both race scenarios KVM_REQ_EVENT would cancel guest entry, resulting
in another vmentry which would inject the interrupt.

This saves about 300 cycles on the self_ipi_* tests of vmexit.flat.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:36 +01:00
Paolo Bonzini
76dfafd536 KVM: x86: do not scan IRR twice on APICv vmentry
Calls to apic_find_highest_irr are scanning IRR twice, once
in vmx_sync_pir_from_irr and once in apic_search_irr.  Change
sync_pir_from_irr to get the new maximum IRR from kvm_apic_update_irr;
now that it does the computation, it can also do the RVI write.

In order to avoid complications in svm.c, make the callback optional.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:35 +01:00
Paolo Bonzini
3d92789f69 KVM: vmx: move sync_pir_to_irr from apic_find_highest_irr to callers
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:34 +01:00
Paolo Bonzini
810e6defcc KVM: x86: preparatory changes for APICv cleanups
Add return value to __kvm_apic_update_irr/kvm_apic_update_irr.
Move vmx_sync_pir_to_irr around.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:34 +01:00
Paolo Bonzini
0ad3bed6c5 kvm: nVMX: move nested events check to kvm_vcpu_running
vcpu_run calls kvm_vcpu_running, not kvm_arch_vcpu_runnable,
and the former does not call check_nested_events.

Once KVM_REQ_EVENT is removed from the APICv interrupt injection
path, however, this would leave no place to trigger a vmexit
from L2 to L1, causing a missed interrupt delivery while in guest
mode.  This is caught by the "ack interrupt on exit" test in
vmx.flat.

[This does not change the calls to check_nested_events in
 inject_pending_event.  That is material for a separate cleanup.]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:33 +01:00
Paolo Bonzini
967235d320 KVM: vmx: clear pending interrupts on KVM_SET_LAPIC
Pending interrupts might be in the PI descriptor when the
LAPIC is restored from an external state; we do not want
them to be injected.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:33 +01:00
Paolo Bonzini
db1c056cee kvm: vmx: Use the hardware provided GPA instead of page walk
As in the SVM patch, the guest physical address is passed by
VMX to x86_emulate_instruction already, so mark the GPA as available
in vcpu->arch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-15 14:54:32 +01:00
Arnd Bergmann
8ef81a9a45 KVM: x86: hide KVM_HC_CLOCK_PAIRING on 32 bit
The newly added hypercall doesn't work on x86-32:

arch/x86/kvm/x86.c: In function 'kvm_pv_clock_pairing':
arch/x86/kvm/x86.c:6163:6: error: implicit declaration of function 'kvm_get_walltime_and_clockread';did you mean 'kvm_get_time_scale'? [-Werror=implicit-function-declaration]

This adds an #ifdef around it, matching the one around the related
functions that are also only implemented on 64-bit systems.

Fixes: 55dd00a73a ("KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-09 16:23:39 +01:00
Paolo Bonzini
2e751dfb5f kvmarm updates for 4.11
- GICv3 save restore
 - Cache flushing fixes
 - MSI injection fix for GICv3 ITS
 - Physical timer emulation support
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJYnICmAAoJECPQ0LrRPXpDC04P/A73ZEL6m0vUzGpuvclxwWc6
 OCJ2C9kYloK+twyGLFbPprI4eN/70dpThgFE1Zr+ol/vAOhQlGQJoarc4n4+eYyb
 8e8IxM5Cmi44HUB64xOInidLacqeyRy5+TvKXIH0aHLgpdynSEQJu88RVXUvVgvs
 IZizhTpYueDYdexNNEkL5r2yJhVZCaczyjB1vU8k5MdODLDM63ABnPOSNJNXio2x
 itoO0EU1Lb9GhuzQj0hiMvKJPyviuPHwau7AhokUSjDPaHzaQT7TgSVioKov/rl6
 bRzhPmXqesex97ZWA5Fxr8jgSNR7JyRz+bzCLEry7XFaI3chbe0YvXeRv32PNH7I
 meuycQw64gsKmfJGRNlq30qhQQfv4fTbzpZP/j1UbvKNwhK5J6e7037c1CUH4i9C
 p9UO9HF/zAMqzD3iMcDZSpaFcbhJYrfQufbhTnbHfGC5AMVJEOWheHSEmzlDWnwr
 K5fPBxnsPv58hDmp/UZUTqCEPusY+HyuOq4ZumFSsnBwjdW+z9mLuaaTJbxaqR/G
 B6dfSQNwSnw6b2lbiXPUCm6c+Z9b190pUEWdwJ4kOTxwiPUWBppVU7gE2TrjnQ8m
 aIvEBPGIf58okjEewA5Dni6qjv7CjDN5z1V0vZUTTdVw8xuhX9eJ1Cx853SM7n0U
 sJgW5nSvSLDUpizSKdRI
 =H4vX
 -----END PGP SIGNATURE-----

Merge tag 'kvmarm-for-4.11' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

kvmarm updates for 4.11

- GICv3 save restore
- Cache flushing fixes
- MSI injection fix for GICv3 ITS
- Physical timer emulation support
2017-02-09 16:01:23 +01:00
Paolo Bonzini
80fbd89cbd KVM: x86: fix compilation
Fix rebase breakage from commit 55dd00a73a ("KVM: x86: add
KVM_HC_CLOCK_PAIRING hypercall", 2017-01-24), courtesy of the
"I could have sworn I had pushed the right branch" department.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-08 10:57:24 +01:00
Marcelo Tosatti
55dd00a73a KVM: x86: add KVM_HC_CLOCK_PAIRING hypercall
Add a hypercall to retrieve the host realtime clock and the TSC value
used to calculate that clock read.

Used to implement clock synchronization between host and guest.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-07 18:16:45 +01:00
David Hildenbrand
6342c50ad1 KVM: nVMX: vmx_complete_nested_posted_interrupt() can't fail
vmx_complete_nested_posted_interrupt() can't fail, let's turn it into
a void function.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-07 18:16:44 +01:00
David Hildenbrand
42cf014d38 KVM: nVMX: kmap() can't fail
kmap() can't fail, therefore it will always return a valid pointer. Let's
just get rid of the unnecessary checks.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-02-07 18:16:44 +01:00
Radim Krčmář
00c87e9a70 KVM: x86: do not save guest-unsupported XSAVE state
Saving unsupported state prevents migration when the new host does not
support a XSAVE feature of the original host, even if the feature is not
exposed to the guest.

We've masked host features with guest-visible features before, with
4344ee981e ("KVM: x86: only copy XSAVE state for the supported
features") and dropped it when implementing XSAVES.  Do it again.

Fixes: df1daba7d1 ("KVM: x86: support XSAVES usage in the host")
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-02-03 18:43:08 +01:00
Frederic Weisbecker
5613fda9a5 sched/cputime: Convert task/group cputime to nsecs
Now that most cputime readers use the transition API which return the
task cputime in old style cputime_t, we can safely store the cputime in
nsecs. This will eventually make cputime statistics less opaque and more
granular. Back and forth convertions between cputime_t and nsecs in order
to deal with cputime_t random granularity won't be needed anymore.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Link: http://lkml.kernel.org/r/1485832191-26889-8-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-02-01 09:13:49 +01:00
Junaid Shahid
d3e328f2cb kvm: x86: mmu: Verify that restored PTE has needed perms in fast page fault
Before fast page fault restores an access track PTE back to a regular PTE,
it now also verifies that the restored PTE would grant the necessary
permissions for the faulting access to succeed. If not, it falls back
to the slow page fault path.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:40 +01:00
Junaid Shahid
d162f30a7c kvm: x86: mmu: Move pgtbl walk inside retry loop in fast_page_fault
Redo the page table walk in fast_page_fault when retrying so that we are
working on the latest PTE even if the hierarchy changes.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:40 +01:00
Junaid Shahid
20d65236d0 kvm: x86: mmu: Update comment in mark_spte_for_access_track
Reword the comment to hopefully make it more clear.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:39 +01:00
Junaid Shahid
312b616b30 kvm: x86: mmu: Set SPTE_SPECIAL_MASK within mmu.c
Instead of the caller including the SPTE_SPECIAL_MASK in the masks being
supplied to kvm_mmu_set_mmio_spte_mask() and kvm_mmu_set_mask_ptes(),
those functions now themselves include the SPTE_SPECIAL_MASK.

Note that bit 63 is now reset in the default MMIO mask.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:39 +01:00
Junaid Shahid
ab22a4733f kvm: x86: mmu: Rename EPT_VIOLATION_READ/WRITE/INSTR constants
Rename the EPT_VIOLATION_READ/WRITE/INSTR constants to
EPT_VIOLATION_ACC_READ/WRITE/INSTR to more clearly indicate that these
signify the type of the memory access as opposed to the permissions
granted by the PTE.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-27 15:46:38 +01:00
Jim Mattson
0b4c208d44 Revert "KVM: nested VMX: disable perf cpuid reporting"
This reverts commit bc6134942d.

A CPUID instruction executed in VMX non-root mode always causes a
VM-exit, regardless of the leaf being queried.

Fixes: bc6134942d ("KVM: nested VMX: disable perf cpuid reporting")
Signed-off-by: Jim Mattson <jmattson@google.com>
[The issue solved by bc6134942d has been resolved with ff651cb613
 ("KVM: nVMX: Add nested msr load/restore algorithm").]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-20 22:18:55 +01:00
Piotr Luc
a17f32270a kvm: x86: Expose Intel VPOPCNTDQ feature to guest
Vector population count instructions for dwords and qwords are to be
used in future Intel Xeon & Xeon Phi processors. The bit 14 of
CPUID[level:0x07, ECX] indicates that the new instructions are
supported by a processor.

The spec can be found in the Intel Software Developer Manual (SDM)
or in the Instruction Set Extensions Programming Reference (ISE).

Signed-off-by: Piotr Luc <piotr.luc@intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-17 17:55:18 +01:00
Radim Krčmář
a9ff720e0f Merge branch 'x86/cpufeature' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into next
For AVX512_VPOPCNTDQ.
2017-01-17 17:53:01 +01:00
Dmitry Vyukov
ce2e852ecc KVM: x86: fix fixing of hypercalls
emulator_fix_hypercall() replaces hypercall with vmcall instruction,
but it does not handle GP exception properly when writes the new instruction.
It can return X86EMUL_PROPAGATE_FAULT without setting exception information.
This leads to incorrect emulation and triggers
WARN_ON(ctxt->exception.vector > 0x1f) in x86_emulate_insn()
as discovered by syzkaller fuzzer:

WARNING: CPU: 2 PID: 18646 at arch/x86/kvm/emulate.c:5558
Call Trace:
 warn_slowpath_null+0x2c/0x40 kernel/panic.c:582
 x86_emulate_insn+0x16a5/0x4090 arch/x86/kvm/emulate.c:5572
 x86_emulate_instruction+0x403/0x1cc0 arch/x86/kvm/x86.c:5618
 emulate_instruction arch/x86/include/asm/kvm_host.h:1127 [inline]
 handle_exception+0x594/0xfd0 arch/x86/kvm/vmx.c:5762
 vmx_handle_exit+0x2b7/0x38b0 arch/x86/kvm/vmx.c:8625
 vcpu_enter_guest arch/x86/kvm/x86.c:6888 [inline]
 vcpu_run arch/x86/kvm/x86.c:6947 [inline]

Set exception information when write in emulator_fix_hypercall() fails.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Cc: kvm@vger.kernel.org
Cc: syzkaller@googlegroups.com
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-17 15:06:05 +01:00
Paolo Bonzini
33ab91103b KVM: x86: fix emulation of "MOV SS, null selector"
This is CVE-2017-2583.  On Intel this causes a failed vmentry because
SS's type is neither 3 nor 7 (even though the manual says this check is
only done for usable SS, and the dmesg splat says that SS is unusable!).
On AMD it's worse: svm.c is confused and sets CPL to 0 in the vmcb.

The fix fabricates a data segment descriptor when SS is set to a null
selector, so that CPL and SS.DPL are set correctly in the VMCS/vmcb.
Furthermore, only allow setting SS to a NULL selector if SS.RPL < 3;
this in turn ensures CPL < 3 because RPL must be equal to CPL.

Thanks to Andy Lutomirski and Willy Tarreau for help in analyzing
the bug and deciphering the manuals.

Reported-by: Xiaohan Zhang <zhangxiaohan1@huawei.com>
Fixes: 79d5b4c3cd
Cc: stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 15:17:13 +01:00
Wanpeng Li
546d87e5c9 KVM: x86: fix NULL deref in vcpu_scan_ioapic
Reported by syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 00000000000001b0
    IP: _raw_spin_lock+0xc/0x30
    PGD 3e28eb067
    PUD 3f0ac6067
    PMD 0
    Oops: 0002 [#1] SMP
    CPU: 0 PID: 2431 Comm: test Tainted: G           OE   4.10.0-rc1+ #3
    Call Trace:
     ? kvm_ioapic_scan_entry+0x3e/0x110 [kvm]
     kvm_arch_vcpu_ioctl_run+0x10a8/0x15f0 [kvm]
     ? pick_next_task_fair+0xe1/0x4e0
     ? kvm_arch_vcpu_load+0xea/0x260 [kvm]
     kvm_vcpu_ioctl+0x33a/0x600 [kvm]
     ? hrtimer_try_to_cancel+0x29/0x130
     ? do_nanosleep+0x97/0xf0
     do_vfs_ioctl+0xa1/0x5d0
     ? __hrtimer_init+0x90/0x90
     ? do_nanosleep+0x5b/0xf0
     SyS_ioctl+0x79/0x90
     do_syscall_64+0x6e/0x180
     entry_SYSCALL64_slow_path+0x25/0x25
    RIP: _raw_spin_lock+0xc/0x30 RSP: ffffa43688973cc0

The syzkaller folks reported a NULL pointer dereference due to
ENABLE_CAP succeeding even without an irqchip.  The Hyper-V
synthetic interrupt controller is activated, resulting in a
wrong request to rescan the ioapic and a NULL pointer dereference.

    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <linux/kvm.h>
    #include <pthread.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #ifndef KVM_CAP_HYPERV_SYNIC
    #define KVM_CAP_HYPERV_SYNIC 123
    #endif

    void* thr(void* arg)
    {
	struct kvm_enable_cap cap;
	cap.flags = 0;
	cap.cap = KVM_CAP_HYPERV_SYNIC;
	ioctl((long)arg, KVM_ENABLE_CAP, &cap);
	return 0;
    }

    int main()
    {
	void *host_mem = mmap(0, 0x1000, PROT_READ|PROT_WRITE,
			MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	int kvmfd = open("/dev/kvm", 0);
	int vmfd = ioctl(kvmfd, KVM_CREATE_VM, 0);
	struct kvm_userspace_memory_region memreg;
	memreg.slot = 0;
	memreg.flags = 0;
	memreg.guest_phys_addr = 0;
	memreg.memory_size = 0x1000;
	memreg.userspace_addr = (unsigned long)host_mem;
	host_mem[0] = 0xf4;
	ioctl(vmfd, KVM_SET_USER_MEMORY_REGION, &memreg);
	int cpufd = ioctl(vmfd, KVM_CREATE_VCPU, 0);
	struct kvm_sregs sregs;
	ioctl(cpufd, KVM_GET_SREGS, &sregs);
	sregs.cr0 = 0;
	sregs.cr4 = 0;
	sregs.efer = 0;
	sregs.cs.selector = 0;
	sregs.cs.base = 0;
	ioctl(cpufd, KVM_SET_SREGS, &sregs);
	struct kvm_regs regs = { .rflags = 2 };
	ioctl(cpufd, KVM_SET_REGS, &regs);
	ioctl(vmfd, KVM_CREATE_IRQCHIP, 0);
	pthread_t th;
	pthread_create(&th, 0, thr, (void*)(long)cpufd);
	usleep(rand() % 10000);
	ioctl(cpufd, KVM_RUN, 0);
	pthread_join(th, 0);
	return 0;
    }

This patch fixes it by failing ENABLE_CAP if without an irqchip.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 5c919412fe (kvm/x86: Hyper-V synthetic interrupt controller)
Cc: stable@vger.kernel.org # 4.5+
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:52:52 +01:00
Steve Rutherford
129a72a0d3 KVM: x86: Introduce segmented_write_std
Introduces segemented_write_std.

Switches from emulated reads/writes to standard read/writes in fxsave,
fxrstor, sgdt, and sidt.  This fixes CVE-2017-2584, a longstanding
kernel memory leak.

Since commit 283c95d0e3 ("KVM: x86: emulate FXSAVE and FXRSTOR",
2016-11-09), which is luckily not yet in any final release, this would
also be an exploitable kernel memory *write*!

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 96051572c8
Fixes: 283c95d0e3
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Steve Rutherford <srutherford@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:34:58 +01:00
David Matlack
cef84c302f KVM: x86: flush pending lapic jump label updates on module unload
KVM's lapic emulation uses static_key_deferred (apic_{hw,sw}_disabled).
These are implemented with delayed_work structs which can still be
pending when the KVM module is unloaded. We've seen this cause kernel
panics when the kvm_intel module is quickly reloaded.

Use the new static_key_deferred_flush() API to flush pending updates on
module unload.

Signed-off-by: David Matlack <dmatlack@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-12 14:33:17 +01:00
Jim Mattson
21e7fbe7db kvm: nVMX: Reorder error checks for emulated VMXON
Checks on the operand to VMXON are performed after the check for
legacy mode operation and the #GP checks, according to the pseudo-code
in Intel's SDM.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-09 14:48:04 +01:00
Paolo Bonzini
4d82d12b39 KVM: lapic: do not scan IRR when delivering an interrupt
On interrupt delivery the PPR can only grow (except for auto-EOI),
so it is impossible that non-auto-EOI interrupt delivery results
in KVM_REQ_EVENT.  We can therefore use __apic_update_ppr.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:48:03 +01:00
Paolo Bonzini
26fbbee581 KVM: lapic: do not set KVM_REQ_EVENT unnecessarily on PPR update
On PPR update, we set KVM_REQ_EVENT unconditionally anytime PPR is lowered.
But we can take into account IRR here already.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:48:02 +01:00
Paolo Bonzini
b3c045d332 KVM: lapic: remove unnecessary KVM_REQ_EVENT on PPR update
PPR needs to be updated whenever on every IRR read because we
may have missed TPR writes that _increased_ PPR.  However, these
writes need not generate KVM_REQ_EVENT, because either KVM_REQ_EVENT
has been set already in __apic_accept_irq, or we are going to
process the interrupt right away.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:48:01 +01:00
Paolo Bonzini
eb90f3417a KVM: vmx: speed up TPR below threshold vmexits
Since we're already in VCPU context, all we have to do here is recompute
the PPR value.  That will in turn generate a KVM_REQ_EVENT if necessary.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:48:00 +01:00
Paolo Bonzini
0f1e261ead KVM: x86: add VCPU stat for KVM_REQ_EVENT processing
This statistic can be useful to estimate the cost of an IRQ injection
scenario, by comparing it with irq_injections.  For example the stat
shows that sti;hlt triggers more KVM_REQ_EVENT than sti;nop.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:59 +01:00
Tom Lendacky
0f89b207b0 kvm: svm: Use the hardware provided GPA instead of page walk
When a guest causes a NPF which requires emulation, KVM sometimes walks
the guest page tables to translate the GVA to a GPA. This is unnecessary
most of the time on AMD hardware since the hardware provides the GPA in
EXITINFO2.

The only exception cases involve string operations involving rep or
operations that use two memory locations. With rep, the GPA will only be
the value of the initial NPF and with dual memory locations we won't know
which memory address was translated into EXITINFO2.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:58 +01:00
Radim Krčmář
5bd5db385b KVM: x86: allow hotplug of VCPU with APIC ID over 0xff
LAPIC after reset is in xAPIC mode, which poses a problem for hotplug of
VCPUs with high APIC ID, because reset VCPU is waiting for INIT/SIPI,
but there is no way to uniquely address it using xAPIC.

From many possible options, we chose the one that also works on real
hardware: accepting interrupts addressed to LAPIC's x2APIC ID even in
xAPIC mode.

KVM intentionally differs from real hardware, because real hardware
(Knights Landing) does just "x2apic_id & 0xff" to decide whether to
accept the interrupt in xAPIC mode and it can deliver one interrupt to
more than one physical destination, e.g. 0x123 to 0x123 and 0x23.

Fixes: 682f732ecf ("KVM: x86: bump MAX_VCPUS to 288")
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:48 +01:00
Radim Krčmář
b4535b58ae KVM: x86: make interrupt delivery fast and slow path behave the same
Slow path tried to prevent IPIs from x2APIC VCPUs from being delivered
to xAPIC VCPUs and vice-versa.  Make slow path behave like fast path,
which never distinguished that.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:40 +01:00
Radim Krčmář
6e50043912 KVM: x86: replace kvm_apic_id with kvm_{x,x2}apic_id
There were three calls sites:
 - recalculate_apic_map and kvm_apic_match_physical_addr, where it would
   only complicate implementation of x2APIC hotplug;
 - in apic_debug, where it was still somewhat preserved, but keeping the
   old function just for apic_debug was not worth it

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:47:29 +01:00
Radim Krčmář
f98a3efb28 KVM: x86: use delivery to self in hyperv synic
Interrupt to self can be sent without knowing the APIC ID.

Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:13 +01:00
Junaid Shahid
f160c7b7bb kvm: x86: mmu: Lockless access tracking for Intel CPUs without EPT A bits.
This change implements lockless access tracking for Intel CPUs without EPT
A bits. This is achieved by marking the PTEs as not-present (but not
completely clearing them) when clear_flush_young() is called after marking
the pages as accessed. When an EPT Violation is generated as a result of
the VM accessing those pages, the PTEs are restored to their original values.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:11 +01:00
Junaid Shahid
37f0e8fe6b kvm: x86: mmu: Do not use bit 63 for tracking special SPTEs
MMIO SPTEs currently set both bits 62 and 63 to distinguish them as special
PTEs. However, bit 63 is used as the SVE bit in Intel EPT PTEs. The SVE bit
is ignored for misconfigured PTEs but not necessarily for not-Present PTEs.
Since MMIO SPTEs use an EPT misconfiguration, so using bit 63 for them is
acceptable. However, the upcoming fast access tracking feature adds another
type of special tracking PTE, which uses not-Present PTEs and hence should
not set bit 63.

In order to use common bits to distinguish both type of special PTEs, we
now use only bit 62 as the special bit.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:10 +01:00
Junaid Shahid
f39a058d0e kvm: x86: mmu: Introduce a no-tracking version of mmu_spte_update
mmu_spte_update() tracks changes in the accessed/dirty state of
the SPTE being updated and calls kvm_set_pfn_accessed/dirty
appropriately. However, in some cases (e.g. when aging the SPTE),
this shouldn't be done. mmu_spte_update_no_track() is introduced
for use in such cases.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:09 +01:00
Junaid Shahid
83ef6c8155 kvm: x86: mmu: Refactor accessed/dirty checks in mmu_spte_update/clear
This simplifies mmu_spte_update() a little bit.
The checks for clearing of accessed and dirty bits are refactored into
separate functions, which are used inside both mmu_spte_update() and
mmu_spte_clear_track_bits(), as well as kvm_test_age_rmapp(). The new
helper functions handle both the case when A/D bits are supported in
hardware and the case when they are not.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:08 +01:00
Junaid Shahid
97dceba29a kvm: x86: mmu: Fast Page Fault path retries
This change adds retries into the Fast Page Fault path. Without the
retries, the code still works, but if a retry does end up being needed,
then it will result in a second page fault for the same memory access,
which will cause much more overhead compared to just retrying within the
original fault.

This would be especially useful with the upcoming fast access tracking
change, as that would make it more likely for retries to be needed
(e.g. due to read and write faults happening on different CPUs at
the same time).

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:07 +01:00
Junaid Shahid
ea4114bcd3 kvm: x86: mmu: Rename spte_is_locklessly_modifiable()
This change renames spte_is_locklessly_modifiable() to
spte_can_locklessly_be_made_writable() to distinguish it from other
forms of lockless modifications. The full set of lockless modifications
is covered by spte_has_volatile_bits().

Signed-off-by: Junaid Shahid <junaids@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:06 +01:00
Junaid Shahid
27959a4415 kvm: x86: mmu: Use symbolic constants for EPT Violation Exit Qualifications
This change adds some symbolic constants for VM Exit Qualifications
related to EPT Violations and updates handle_ept_violation() to use
these constants instead of hard-coded numbers.

Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:05 +01:00
David Matlack
114df303a7 kvm: x86: reduce collisions in mmu_page_hash
When using two-dimensional paging, the mmu_page_hash (which provides
lookups for existing kvm_mmu_page structs), becomes imbalanced; with
too many collisions in buckets 0 and 512. This has been seen to cause
mmu_lock to be held for multiple milliseconds in kvm_mmu_get_page on
VMs with a large amount of RAM mapped with 4K pages.

The current hash function uses the lower 10 bits of gfn to index into
mmu_page_hash. When doing shadow paging, gfn is the address of the
guest page table being shadow. These tables are 4K-aligned, which
makes the low bits of gfn a good hash. However, with two-dimensional
paging, no guest page tables are being shadowed, so gfn is the base
address that is mapped by the table. Thus page tables (level=1) have
a 2MB aligned gfn, page directories (level=2) have a 1GB aligned gfn,
etc. This means hashes will only differ in their 10th bit.

hash_64() provides a better hash. For example, on a VM with ~200G
(99458 direct=1 kvm_mmu_page structs):

hash            max_mmu_page_hash_collisions
--------------------------------------------
low 10 bits     49847
hash_64         105
perfect         97

While we're changing the hash, increase the table size by 4x to better
support large VMs (further reduces number of collisions in 200G VM to
29).

Note that hash_64() does not provide a good distribution prior to commit
ef703f49a6 ("Eliminate bad hash multipliers from hash_32() and
hash_64()").

Signed-off-by: David Matlack <dmatlack@google.com>
Change-Id: I5aa6b13c834722813c6cca46b8b1ed6f53368ade
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:04 +01:00
David Matlack
f3414bc774 kvm: x86: export maximum number of mmu_page_hash collisions
Report the maximum number of mmu_page_hash collisions as a per-VM stat.
This will make it easy to identify problems with the mmu_page_hash in
the future.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:46:03 +01:00
Radim Krčmář
826da32140 KVM: x86: simplify conditions with split/kernel irqchip
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:45:57 +01:00
Radim Krčmář
8231f50d98 KVM: x86: prevent setup of invalid routes
The check in kvm_set_pic_irq() and kvm_set_ioapic_irq() was just a
temporary measure until the code improved enough for us to do this.

This changes APIC in a case when KVM_SET_GSI_ROUTING is called to set up pic
and ioapic routes before KVM_CREATE_IRQCHIP.  Those rules would get overwritten
by KVM_CREATE_IRQCHIP at best, so it is pointless to allow it.  Userspaces
hopefully noticed that things don't work if they do that and don't do that.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:45:50 +01:00
Radim Krčmář
e5dc48777d KVM: x86: refactor pic setup in kvm_set_routing_entry
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:45:36 +01:00
Radim Krčmář
099413664c KVM: x86: make pic setup code look like ioapic setup
We don't treat kvm->arch.vpic specially anymore, so the setup can look
like ioapic.  This gets a bit more information out of return values.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:45:28 +01:00
Radim Krčmář
49776faf93 KVM: x86: decouple irqchip_in_kernel() and pic_irqchip()
irqchip_in_kernel() tried to save a bit by reusing pic_irqchip(), but it
just complicated the code.
Add a separate state for the irqchip mode.

Reviewed-by: David Hildenbrand <david@redhat.com>
[Used Paolo's version of condition in irqchip_in_kernel().]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-09 14:42:47 +01:00
Radim Krčmář
35e6eaa3df KVM: x86: don't allow kernel irqchip with split irqchip
Split irqchip cannot be created after creating the kernel irqchip, but
we forgot to restrict the other way.  This is an API change.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2017-01-09 14:38:19 +01:00
Linus Torvalds
08289086b0 KVM fixes for v4.10-rc3
MIPS: (both for stable)
  - fix host kernel crashes when receiving a signal with 64-bit userspace
  - flush instruction cache on all vcpus after generating entry code
 
 x86:
  - fix NULL dereference in MMU caused by SMM transitions (for stable)
  - correct guest instruction pointer after emulating some VMX errors
  - minor cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJYb/N7AAoJEED/6hsPKofoa4QH/0/jwHr64lFeiOzMxqZfTF0y
 wufcTqw3zGq5iPaNlEwn+6AkKnTq2IPws92FludfPHPb7BrLUPqrXxRlSRN+XPVw
 pHVcV9u0q4yghMi7/6Flu3JASnpD6PrPZ7ezugZwgXFrR7pewd/+sTq6xBUnI9rZ
 nNEYsfh8dYiBicxSGXlmZcHLuJJHKshjsv9F6ngyBGXAAf/F+nLiJReUzPO0m2+P
 gmXi5zhVu6z05zlaCW1KAmJ1QV1UJla1vZnzrnK3twRK/05l7YX+xCbHIo1wB03R
 2YhKDnSrnG3Zt+KpXfRhADXazNgM5ASvORdvI6RvjLNVxlnOveQtAcfRyvZezT4=
 =LXLf
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM fixes from Radim Krčmář:
 "MIPS:
   - fix host kernel crashes when receiving a signal with 64-bit
     userspace

   - flush instruction cache on all vcpus after generating entry code

     (both for stable)

  x86:
   - fix NULL dereference in MMU caused by SMM transitions (for stable)

   - correct guest instruction pointer after emulating some VMX errors

   - minor cleanup"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: VMX: remove duplicated declaration
  KVM: MIPS: Flush KVM entry code from icache globally
  KVM: MIPS: Don't clobber CP0_Status.UX
  KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS
  KVM: nVMX: fix instruction skipping during emulated vm-entry
2017-01-06 15:27:17 -08:00
Jan Dakinevich
69130ea1e6 KVM: VMX: remove duplicated declaration
Declaration of VMX_VPID_EXTENT_SUPPORTED_MASK occures twice in the code.
Probably, it was happened after unsuccessful merge.

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2017-01-05 15:08:48 +01:00
Linus Torvalds
3ddc76dfc7 Merge branch 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer type cleanups from Thomas Gleixner:
 "This series does a tree wide cleanup of types related to
  timers/timekeeping.

   - Get rid of cycles_t and use a plain u64. The type is not really
     helpful and caused more confusion than clarity

   - Get rid of the ktime union. The union has become useless as we use
     the scalar nanoseconds storage unconditionally now. The 32bit
     timespec alike storage got removed due to the Y2038 limitations
     some time ago.

     That leaves the odd union access around for no reason. Clean it up.

  Both changes have been done with coccinelle and a small amount of
  manual mopping up"

* 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  ktime: Get rid of ktime_equal()
  ktime: Cleanup ktime_set() usage
  ktime: Get rid of the union
  clocksource: Use a plain u64 instead of cycle_t
2016-12-25 14:30:04 -08:00
Thomas Gleixner
8b0e195314 ktime: Cleanup ktime_set() usage
ktime_set(S,N) was required for the timespec storage type and is still
useful for situations where a Seconds and Nanoseconds part of a time value
needs to be converted. For anything where the Seconds argument is 0, this
is pointless and can be replaced with a simple assignment.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
2016-12-25 17:21:22 +01:00
Thomas Gleixner
a5a1d1c291 clocksource: Use a plain u64 instead of cycle_t
There is no point in having an extra type for extra confusion. u64 is
unambiguous.

Conversion was done with the following coccinelle script:

@rem@
@@
-typedef u64 cycle_t;

@fix@
typedef cycle_t;
@@
-cycle_t
+u64

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: John Stultz <john.stultz@linaro.org>
2016-12-25 11:04:12 +01:00
Thomas Gleixner
73c1b41e63 cpu/hotplug: Cleanup state names
When the state names got added a script was used to add the extra argument
to the calls. The script basically converted the state constant to a
string, but the cleanup to convert these strings into meaningful ones did
not happen.

Replace all the useless strings with 'subsys/xxx/yyy:state' strings which
are used in all the other places already.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Siewior <bigeasy@linutronix.de>
Link: http://lkml.kernel.org/r/20161221192112.085444152@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2016-12-25 10:47:44 +01:00
Xiao Guangrong
6ef4e07ecd KVM: x86: reset MMU on KVM_SET_VCPU_EVENTS
Otherwise, mismatch between the smm bit in hflags and the MMU role
can cause a NULL pointer dereference.

Cc: stable@vger.kernel.org
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-24 10:16:04 +01:00
David Matlack
b428018a06 KVM: nVMX: fix instruction skipping during emulated vm-entry
kvm_skip_emulated_instruction() should not be called after emulating
a VM-entry failure during or after loading guest state
(nested_vmx_entry_failure()). Otherwise the L1 hypervisor is resumed
some number of bytes past vmcs->host_rip.

Fixes: eb27756217
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-21 18:55:09 +01:00
Jim Mattson
ef85b67385 kvm: nVMX: Allow L1 to intercept software exceptions (#BP and #OF)
When L2 exits to L0 due to "exception or NMI", software exceptions
(#BP and #OF) for which L1 has requested an intercept should be
handled by L1 rather than L0. Previously, only hardware exceptions
were forwarded to L1.

Signed-off-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 16:05:31 +01:00
Andrea Arcangeli
cc0d907c09 kvm: take srcu lock around kvm_steal_time_set_preempted()
kvm_memslots() will be called by kvm_write_guest_offset_cached() so
take the srcu lock.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 15:45:15 +01:00
Andrea Arcangeli
931f261b42 kvm: fix schedule in atomic in kvm_steal_time_set_preempted()
kvm_steal_time_set_preempted() isn't disabling the pagefaults before
calling __copy_to_user and the kernel debug notices.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-19 15:45:14 +01:00
Paolo Bonzini
3f5ad8be37 KVM: hyperv: fix locking of struct kvm_hv fields
Introduce a new mutex to avoid an AB-BA deadlock between kvm->lock and
vcpu->mutex.  Protect accesses in kvm_hv_setup_tsc_page too, as suggested
by Roman.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-16 17:53:38 +01:00
Yi Sun
83781d180b KVM: x86: Expose Intel AVX512IFMA/AVX512VBMI/SHA features to guest.
Expose AVX512IFMA/AVX512VBMI/SHA features to guest.

AVX512 spec can be found at:
https://software.intel.com/sites/default/files/managed/26/40/319433-026.pdf

SHA spec can be found at:
https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf

This patch depends on below patch.
http://marc.info/?l=linux-kernel&m=147932800828178&w=2

Signed-off-by: Yi Sun <yi.y.sun@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-15 15:02:44 +01:00
GanShun
37b9a671f3 kvm: nVMX: Correct a VMX instruction error code for VMPTRLD
When the operand passed to VMPTRLD matches the address of the VMXON
region, the VMX instruction error code should be
VMXERR_VMPTRLD_VMXON_POINTER rather than VMXERR_VMCLEAR_VMXON_POINTER.

Signed-off-by: GanShun <ganshun@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-15 15:02:44 +01:00
Linus Torvalds
93173b5bf2 Small release, the most interesting stuff is x86 nested virt improvements.
x86: userspace can now hide nested VMX features from guests; nested
 VMX can now run Hyper-V in a guest; support for AVX512_4VNNIW and
 AVX512_FMAPS in KVM; infrastructure support for virtual Intel GPUs.
 
 PPC: support for KVM guests on POWER9; improved support for interrupt
 polling; optimizations and cleanups.
 
 s390: two small optimizations, more stuff is in flight and will be
 in 4.11.
 
 ARM: support for the GICv3 ITS on 32bit platforms.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQExBAABCAAbBQJYTkP0FBxwYm9uemluaUByZWRoYXQuY29tAAoJEL/70l94x66D
 lZIH/iT1n9OQXcuTpYYnQhuCenzI3GZZOIMTbCvK2i5bo0FIJKxVn0EiAAqZSXvO
 nO185FqjOgLuJ1AD1kJuxzye5suuQp4HIPWWgNHcexLuy43WXWKZe0IQlJ4zM2Xf
 u31HakpFmVDD+Cd1qN3yDXtDrRQ79/xQn2kw7CWb8olp+pVqwbceN3IVie9QYU+3
 gCz0qU6As0aQIwq2PyalOe03sO10PZlm4XhsoXgWPG7P18BMRhNLTDqhLhu7A/ry
 qElVMANT7LSNLzlwNdpzdK8rVuKxETwjlc1UP8vSuhrwad4zM2JJ1Exk26nC2NaG
 D0j4tRSyGFIdx6lukZm7HmiSHZ0=
 =mkoB
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:
 "Small release, the most interesting stuff is x86 nested virt
  improvements.

  x86:
   - userspace can now hide nested VMX features from guests
   - nested VMX can now run Hyper-V in a guest
   - support for AVX512_4VNNIW and AVX512_FMAPS in KVM
   - infrastructure support for virtual Intel GPUs.

  PPC:
   - support for KVM guests on POWER9
   - improved support for interrupt polling
   - optimizations and cleanups.

  s390:
   - two small optimizations, more stuff is in flight and will be in
     4.11.

  ARM:
   - support for the GICv3 ITS on 32bit platforms"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
  arm64: KVM: pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest
  KVM: arm/arm64: timer: Check for properly initialized timer on init
  KVM: arm/arm64: vgic-v2: Limit ITARGETSR bits to number of VCPUs
  KVM: x86: Handle the kthread worker using the new API
  KVM: nVMX: invvpid handling improvements
  KVM: nVMX: check host CR3 on vmentry and vmexit
  KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
  KVM: nVMX: propagate errors from prepare_vmcs02
  KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
  KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
  KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
  KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
  KVM: nVMX: support restore of VMX capability MSRs
  KVM: nVMX: generate non-true VMX MSRs based on true versions
  KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
  KVM: x86: Add kvm_skip_emulated_instruction and use it.
  KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
  KVM: VMX: Reorder some skip_emulated_instruction calls
  KVM: x86: Add a return value to kvm_emulate_cpuid
  KVM: PPC: Book3S: Move prototypes for KVM functions into kvm_ppc.h
  ...
2016-12-13 15:47:02 -08:00
Linus Torvalds
9439b3710d Main pull request for drm for 4.10 kernel
-----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJYT3qqAAoJEAx081l5xIa+dLMP/2dqBybSAeWlPmAwVenIHRtS
 KFNktISezFSY/LBcIP2mHkFJmjTKBMZFxWnyEJL9NmFUD1cS2WMyNnC1282h/+rD
 +P8Bsmzmt/daV4UTFxVDpzlmVlavAyakNi6FnSQfAfmf+3PB1yzU3gn8ld9pU/if
 h7KEp9fDn9eYZreTRfCUloI2yoVpD9d0DG3uaGDN/N0kGUnCC6TZT5ig5j2JO016
 fYf/DqoYAk3ItWF9WK/uG7qJIGi37afCpQq+kbSSJk+p3HjJqu8JUe9jzqYdl7j9
 26TGSY5o9WLhZkxDgbcCIJzcFJhMmXgMdhjil9lqaHmnNG5FPFU7g8DK1CZqbel9
 m8+aRPn1EgxIahMgdl8NblW1pfO2Kco0tZmoP5vXx1uqhivd67h0hiQqp66WxOJd
 i2yMLncaCEv8M161CVEgtzuI5a7nCfaZv7J9ArzbkD/huBwu51IZgTs7Dz4njgvz
 VPB5FBTB/ZYteErUNoh6gjF0hLngWvvJSPvuzT+EFO7yypek0IJ28GTdbxYSP+jR
 13697s5Itigf/D3KUdRRGsWRzyVVN9n+djkl//sy5ddL9eOlKSKEga4ujOUjTWaW
 hTvAxpK9GmJS/Iun5jIP6f75zDbi+e8FWUeB/OI2lPtnApaSKdXBTPXsco2RnTEV
 +G6XrH8IMEIsTxOk7hWU
 =7s/c
 -----END PGP SIGNATURE-----

Merge tag 'drm-for-v4.10' of git://people.freedesktop.org/~airlied/linux

Pull drm updates from Dave Airlie:
 "This is the main pull request for drm for 4.10 kernel.

  New drivers:
   - ZTE VOU display driver (zxdrm)
   - Amlogic Meson Graphic Controller GXBB/GXL/GXM SoCs (meson)
   - MXSFB support (mxsfb)

  Core:
   - Format handling has been reworked
   - Better atomic state debugging
   - drm_mm leak debugging
   - Atomic explicit fencing support
   - fbdev helper ops
   - Documentation updates
   - MST fbcon fixes

  Bridge:
   - Silicon Image SiI8620 driver

  Panel:
   - Add support for new simple panels

  i915:
   - GVT Device model
   - Better HDMI2.0 support on skylake
   - More watermark fixes
   - GPU idling rework for suspend/resume
   - DP Audio workarounds
   - Scheduler prep-work
   - Opregion CADL handling
   - GPU scheduler and priority boosting

  amdgfx/radeon:
   - Support for virtual devices
   - New VM manager for non-contig VRAM buffers
   - UVD powergating
   - SI register header cleanup
   - Cursor fixes
   - Powermanagement fixes

  nouveau:
   - Powermangement reworks for better voltage/clock changes
   - Atomic modesetting support
   - Displayport Multistream (MST) support.
   - GP102/104 hang and cursor fixes
   - GP106 support

  hisilicon:
   - hibmc support (BMC chip for aarch64 servers)

  armada:
   - add tracing support for overlay change
   - refactor plane support
   - de-midlayer the driver

  omapdrm:
   - Timing code cleanups

  rcar-du:
   - R8A7792/R8A7796 support
   - Misc fixes.

  sunxi:
   - A31 SoC display engine support

  imx-drm:
   - YUV format support
   - Cleanup plane atomic update

  mali-dp:
   - Misc fixes

  dw-hdmi:
   - Add support for HDMI i2c master controller

  tegra:
   - IOMMU support fixes
   - Error handling fixes

  tda998x:
   - Fix connector registration
   - Improved robustness
   - Fix infoframe/audio compliance

  virtio:
   - fix busid issues
   - allocate more vbufs

  qxl:
   - misc fixes and cleanups.

  vc4:
   - Fragment shader threading
   - ETC1 support
   - VEC (tv-out) support

  msm:
   - A5XX GPU support
   - Lots of atomic changes

  tilcdc:
   - Misc fixes and cleanups.

  etnaviv:
   - Fix dma-buf export path
   - DRAW_INSTANCED support
   - fix driver on i.MX6SX

  exynos:
   - HDMI refactoring

  fsl-dcu:
   - fbdev changes"

* tag 'drm-for-v4.10' of git://people.freedesktop.org/~airlied/linux: (1343 commits)
  drm/nouveau/kms/nv50: fix atomic regression on original G80
  drm/nouveau/bl: Do not register interface if Apple GMUX detected
  drm/nouveau/bl: Assign different names to interfaces
  drm/nouveau/bios/dp: fix handling of LevelEntryTableIndex on DP table 4.2
  drm/nouveau/ltc: protect clearing of comptags with mutex
  drm/nouveau/gr/gf100-: handle GPC/TPC/MPC trap
  drm/nouveau/core: recognise GP106 chipset
  drm/nouveau/ttm: wait for bo fence to signal before unmapping vmas
  drm/nouveau/gr/gf100-: FECS intr handling is not relevant on proprietary ucode
  drm/nouveau/gr/gf100-: properly ack all FECS error interrupts
  drm/nouveau/fifo/gf100-: recover from host mmu faults
  drm: Add fake controlD* symlinks for backwards compat
  drm/vc4: Don't use drm_put_dev
  drm/vc4: Document VEC DT binding
  drm/vc4: Add support for the VEC (Video Encoder) IP
  drm: Add TV connector states to drm_connector_state
  drm: Turn DRM_MODE_SUBCONNECTOR_xx definitions into an enum
  drm/vc4: Fix ->clock_select setting for the VEC encoder
  drm/amdgpu/dce6: Set MASTER_UPDATE_MODE to 0 in resume_mc_access as well
  drm/amdgpu: use pin rather than pin_restricted in a few cases
  ...
2016-12-13 09:35:09 -08:00
Linus Torvalds
518bacf5a5 Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 FPU updates from Ingo Molnar:
 "The main changes in this cycle were:

   - do a large round of simplifications after all CPUs do 'eager' FPU
     context switching in v4.9: remove CR0 twiddling, remove leftover
     eager/lazy bts, etc (Andy Lutomirski)

   - more FPU code simplifications: remove struct fpu::counter, clarify
     nomenclature, remove unnecessary arguments/functions and better
     structure the code (Rik van Riel)"

* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/fpu: Remove clts()
  x86/fpu: Remove stts()
  x86/fpu: Handle #NM without FPU emulation as an error
  x86/fpu, lguest: Remove CR0.TS support
  x86/fpu, kvm: Remove host CR0.TS manipulation
  x86/fpu: Remove irq_ts_save() and irq_ts_restore()
  x86/fpu: Stop saving and restoring CR0.TS in fpu__init_check_bugs()
  x86/fpu: Get rid of two redundant clts() calls
  x86/fpu: Finish excising 'eagerfpu'
  x86/fpu: Split old_fpu & new_fpu handling into separate functions
  x86/fpu: Remove 'cpu' argument from __cpu_invalidate_fpregs_state()
  x86/fpu: Split old & new FPU code paths
  x86/fpu: Remove __fpregs_(de)activate()
  x86/fpu: Rename lazy restore functions to "register state valid"
  x86/fpu, kvm: Remove KVM vcpu->fpu_counter
  x86/fpu: Remove struct fpu::counter
  x86/fpu: Remove use_eager_fpu()
  x86/fpu: Remove the XFEATURE_MASK_EAGER/LAZY distinction
  x86/fpu: Hard-disable lazy FPU mode
  x86/crypto, x86/fpu: Remove X86_FEATURE_EAGER_FPU #ifdef from the crc32c code
2016-12-12 14:27:49 -08:00
Ingo Molnar
6f38751510 Merge branch 'linus' into locking/core, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-12-11 13:07:13 +01:00
Petr Mladek
36da91bdf5 KVM: x86: Handle the kthread worker using the new API
Use the new API to create and destroy the "kvm-pit" kthread
worker. The API hides some implementation details.

In particular, kthread_create_worker() allocates and initializes
struct kthread_worker. It runs the kthread the right way
and stores task_struct into the worker structure.

kthread_destroy_worker() flushes all pending works, stops
the kthread and frees the structure.

This patch does not change the existing behavior except for
dynamically allocating struct kthread_worker and storing
only the pointer of this structure.

It is compile tested only because I did not find an easy
way how to run the code. Well, it should be pretty safe
given the nature of the change.

Signed-off-by: Petr Mladek <pmladek@suse.com>
Message-Id: <1476877847-11217-1-git-send-email-pmladek@suse.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-08 15:31:11 +01:00
Jan Dakinevich
16c2aec6a2 KVM: nVMX: invvpid handling improvements
- Expose all invalidation types to the L1

 - Reject invvpid instruction, if L1 passed zero vpid value to single
   context invalidations

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-12-08 15:31:11 +01:00
Ladi Prosek
1dc35dacc1 KVM: nVMX: check host CR3 on vmentry and vmexit
This commit adds missing host CR3 checks. Before entering guest mode, the value
of CR3 is checked for reserved bits. After returning, nested_vmx_load_cr3 is
called to set the new CR3 value and check and load PDPTRs.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:10 +01:00
Ladi Prosek
9ed38ffad4 KVM: nVMX: introduce nested_vmx_load_cr3 and call it on vmentry
Loading CR3 as part of emulating vmentry is different from regular CR3 loads,
as implemented in kvm_set_cr3, in several ways.

* different rules are followed to check CR3 and it is desirable for the caller
to distinguish between the possible failures
* PDPTRs are not loaded if PAE paging and nested EPT are both enabled
* many MMU operations are not necessary

This patch introduces nested_vmx_load_cr3 suitable for CR3 loads as part of
nested vmentry and vmexit, and makes use of it on the nested vmentry path.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:10 +01:00
Ladi Prosek
ee146c1c10 KVM: nVMX: propagate errors from prepare_vmcs02
It is possible that prepare_vmcs02 fails to load the guest state. This
patch adds the proper error handling for such a case. L1 will receive
an INVALID_STATE vmexit with the appropriate exit qualification if it
happens.

A failure to set guest CR3 is the only error propagated from prepare_vmcs02
at the moment.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:09 +01:00
Ladi Prosek
7ca29de213 KVM: nVMX: fix CR3 load if L2 uses PAE paging and EPT
KVM does not correctly handle L1 hypervisors that emulate L2 real mode with
PAE and EPT, such as Hyper-V. In this mode, the L1 hypervisor populates guest
PDPTE VMCS fields and leaves guest CR3 uninitialized because it is not used
(see 26.3.2.4 Loading Page-Directory-Pointer-Table Entries). KVM always
dereferences CR3 and tries to load PDPTEs if PAE is on. This leads to two
related issues:

1) On the first nested vmentry, the guest PDPTEs, as populated by L1, are
overwritten in ept_load_pdptrs because the registers are believed to have
been loaded in load_pdptrs as part of kvm_set_cr3. This is incorrect. L2 is
running with PAE enabled but PDPTRs have been set up by L1.

2) When L2 is about to enable paging and loads its CR3, we, again, attempt
to load PDPTEs in load_pdptrs called from kvm_set_cr3. There are no guarantees
that this will succeed (it's just a CR3 load, paging is not enabled yet) and
if it doesn't, kvm_set_cr3 returns early without persisting the CR3 which is
then lost and L2 crashes right after it enables paging.

This patch replaces the kvm_set_cr3 call with a simple register write if PAE
and EPT are both on. CR3 is not to be interpreted in this case.

Signed-off-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:09 +01:00
David Matlack
5a6a9748b4 KVM: nVMX: load GUEST_EFER after GUEST_CR0 during emulated VM-entry
vmx_set_cr0() modifies GUEST_EFER and "IA-32e mode guest" in the current
VMCS. Call vmx_set_efer() after vmx_set_cr0() so that emulated VM-entry
is more faithful to VMCS12.

This patch correctly causes VM-entry to fail when "IA-32e mode guest" is
1 and GUEST_CR0.PG is 0. Previously this configuration would succeed and
"IA-32e mode guest" would silently be disabled by KVM.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:08 +01:00
David Matlack
8322ebbb24 KVM: nVMX: generate MSR_IA32_CR{0,4}_FIXED1 from guest CPUID
MSR_IA32_CR{0,4}_FIXED1 define which bits in CR0 and CR4 are allowed to
be 1 during VMX operation. Since the set of allowed-1 bits is the same
in and out of VMX operation, we can generate these MSRs entirely from
the guest's CPUID. This lets userspace avoiding having to save/restore
these MSRs.

This patch also initializes MSR_IA32_CR{0,4}_FIXED1 from the CPU's MSRs
by default. This is a saner than the current default of -1ull, which
includes bits that the host CPU does not support.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:08 +01:00
David Matlack
3899152ccb KVM: nVMX: fix checks on CR{0,4} during virtual VMX operation
KVM emulates MSR_IA32_VMX_CR{0,4}_FIXED1 with the value -1ULL, meaning
all CR0 and CR4 bits are allowed to be 1 during VMX operation.

This does not match real hardware, which disallows the high 32 bits of
CR0 to be 1, and disallows reserved bits of CR4 to be 1 (including bits
which are defined in the SDM but missing according to CPUID). A guest
can induce a VM-entry failure by setting these bits in GUEST_CR0 and
GUEST_CR4, despite MSR_IA32_VMX_CR{0,4}_FIXED1 indicating they are
valid.

Since KVM has allowed all bits to be 1 in CR0 and CR4, the existing
checks on these registers do not verify must-be-0 bits. Fix these checks
to identify must-be-0 bits according to MSR_IA32_VMX_CR{0,4}_FIXED1.

This patch should introduce no change in behavior in KVM, since these
MSRs are still -1ULL.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:07 +01:00
David Matlack
62cc6b9dc6 KVM: nVMX: support restore of VMX capability MSRs
The VMX capability MSRs advertise the set of features the KVM virtual
CPU can support. This set of features varies across different host CPUs
and KVM versions. This patch aims to addresses both sources of
differences, allowing VMs to be migrated across CPUs and KVM versions
without guest-visible changes to these MSRs. Note that cross-KVM-
version migration is only supported from this point forward.

When the VMX capability MSRs are restored, they are audited to check
that the set of features advertised are a subset of what KVM and the
CPU support.

Since the VMX capability MSRs are read-only, they do not need to be on
the default MSR save/restore lists. The userspace hypervisor can set
the values of these MSRs or read them from KVM at VCPU creation time,
and restore the same value after every save/restore.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:07 +01:00
David Matlack
0115f9cbac KVM: nVMX: generate non-true VMX MSRs based on true versions
The "non-true" VMX capability MSRs can be generated from their "true"
counterparts, by OR-ing the default1 bits. The default1 bits are fixed
and defined in the SDM.

Since we can generate the non-true VMX MSRs from the true versions,
there's no need to store both in struct nested_vmx. This also lets
userspace avoid having to restore the non-true MSRs.

Note this does not preclude emulating MSR_IA32_VMX_BASIC[55]=0. To do so,
we simply need to set all the default1 bits in the true MSRs (such that
the true MSRs and the generated non-true MSRs are equal).

Signed-off-by: David Matlack <dmatlack@google.com>
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:06 +01:00
Kyle Huey
ea07e42dec KVM: x86: Do not clear RFLAGS.TF when a singlestep trap occurs.
The trap flag stays set until software clears it.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:06 +01:00
Kyle Huey
6affcbedca KVM: x86: Add kvm_skip_emulated_instruction and use it.
kvm_skip_emulated_instruction calls both
kvm_x86_ops->skip_emulated_instruction and kvm_vcpu_check_singlestep,
skipping the emulated instruction and generating a trap if necessary.

Replacing skip_emulated_instruction calls with
kvm_skip_emulated_instruction is straightforward, except for:

- ICEBP, which is already inside a trap, so avoid triggering another trap.
- Instructions that can trigger exits to userspace, such as the IO insns,
  MOVs to CR8, and HALT. If kvm_skip_emulated_instruction does trigger a
  KVM_GUESTDBG_SINGLESTEP exit, and the handling code for
  IN/OUT/MOV CR8/HALT also triggers an exit to userspace, the latter will
  take precedence. The singlestep will be triggered again on the next
  instruction, which is the current behavior.
- Task switch instructions which would require additional handling (e.g.
  the task switch bit) and are instead left alone.
- Cases where VMLAUNCH/VMRESUME do not proceed to the next instruction,
  which do not trigger singlestep traps as mentioned previously.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:05 +01:00
Kyle Huey
eb27756217 KVM: VMX: Move skip_emulated_instruction out of nested_vmx_check_vmcs12
We can't return both the pass/fail boolean for the vmcs and the upcoming
continue/exit-to-userspace boolean for skip_emulated_instruction out of
nested_vmx_check_vmcs, so move skip_emulated_instruction out of it instead.

Additionally, VMENTER/VMRESUME only trigger singlestep exceptions when
they advance the IP to the following instruction, not when they a) succeed,
b) fail MSR validation or c) throw an exception. Add a separate call to
skip_emulated_instruction that will later not be converted to the variant
that checks the singlestep flag.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:04 +01:00
Kyle Huey
09ca3f2049 KVM: VMX: Reorder some skip_emulated_instruction calls
The functions being moved ahead of skip_emulated_instruction here don't
need updated IPs, and skipping the emulated instruction at the end will
make it easier to return its value.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:04 +01:00
Kyle Huey
6a908b628c KVM: x86: Add a return value to kvm_emulate_cpuid
Once skipping the emulated instruction can potentially trigger an exit to
userspace (via KVM_GUESTDBG_SINGLESTEP) kvm_emulate_cpuid will need to
propagate a return value.

Signed-off-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-12-08 15:31:03 +01:00
Dave Airlie
f03ee46be9 Linux 4.9-rc8
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYRIGyAAoJEHm+PkMAQRiG2ksH/jwMUT9j6glbwESxbn1YTqTM
 QcBT5AMc7D0wNuidQe0hWZMtG4RbC+4ZhxzZl2wPgA2gueJ+rBnyX7bgtA7ka8ka
 Fdc3u/Q1v38HPzf8iBnxcdCs40VgsoMLjFYCXrpOxuGDNKYzRd+Q8aI2TeGvzbyi
 X8+6oAWifBwo2oA06jfcuUncEWbyDDyK9aQksmfKOpjHdb26yELPEhsPOlds1g7E
 jYLnvUVnU2CoFaumta+rZQ0kzLdc4Ntu0wEao6WzJuQKsgoID+tS/6iudi8cUhDp
 YowGAVoOfr6rAJB0mwrDVfugpamaT3386XKyocdNsK0/jR60UIJ8x+WzvvSU+lY=
 =JTBj
 -----END PGP SIGNATURE-----

Backmerge tag 'v4.9-rc8' into drm-next

Linux 4.9-rc8

Daniel requested this so we could apply some follow on fixes cleanly to -next.
2016-12-05 17:11:48 +10:00
Radim Krčmář
df492896e6 KVM: x86: check for pic and ioapic presence before use
Split irqchip allows pic and ioapic routes to be used without them being
created, which results in NULL access.  Check for NULL and avoid it.
(The setup is too racy for a nicer solutions.)

Found by syzkaller:

  general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
  Dumping ftrace buffer:
     (ftrace buffer empty)
  Modules linked in:
  CPU: 3 PID: 11923 Comm: kworker/3:2 Not tainted 4.9.0-rc5+ #27
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  Workqueue: events irqfd_inject
  task: ffff88006a06c7c0 task.stack: ffff880068638000
  RIP: 0010:[...]  [...] __lock_acquire+0xb35/0x3380 kernel/locking/lockdep.c:3221
  RSP: 0000:ffff88006863ea20  EFLAGS: 00010006
  RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: 0000000000000000
  RDX: 0000000000000039 RSI: 0000000000000000 RDI: 1ffff1000d0c7d9e
  RBP: ffff88006863ef58 R08: 0000000000000001 R09: 0000000000000000
  R10: 00000000000001c8 R11: 0000000000000000 R12: ffff88006a06c7c0
  R13: 0000000000000001 R14: ffffffff8baab1a0 R15: 0000000000000001
  FS:  0000000000000000(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00000000004abdd0 CR3: 000000003e2f2000 CR4: 00000000000026e0
  Stack:
   ffffffff894d0098 1ffff1000d0c7d56 ffff88006863ecd0 dffffc0000000000
   ffff88006a06c7c0 0000000000000000 ffff88006863ecf8 0000000000000082
   0000000000000000 ffffffff815dd7c1 ffffffff00000000 ffffffff00000000
  Call Trace:
   [...] lock_acquire+0x2a2/0x790 kernel/locking/lockdep.c:3746
   [...] __raw_spin_lock include/linux/spinlock_api_smp.h:144
   [...] _raw_spin_lock+0x38/0x50 kernel/locking/spinlock.c:151
   [...] spin_lock include/linux/spinlock.h:302
   [...] kvm_ioapic_set_irq+0x4c/0x100 arch/x86/kvm/ioapic.c:379
   [...] kvm_set_ioapic_irq+0x8f/0xc0 arch/x86/kvm/irq_comm.c:52
   [...] kvm_set_irq+0x239/0x640 arch/x86/kvm/../../../virt/kvm/irqchip.c:101
   [...] irqfd_inject+0xb4/0x150 arch/x86/kvm/../../../virt/kvm/eventfd.c:60
   [...] process_one_work+0xb40/0x1ba0 kernel/workqueue.c:2096
   [...] worker_thread+0x214/0x18a0 kernel/workqueue.c:2230
   [...] kthread+0x328/0x3e0 kernel/kthread.c:209
   [...] ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:433

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: 49df6397ed ("KVM: x86: Split the APIC from the rest of IRQCHIP.")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:39:28 +01:00
Radim Krčmář
81cdb259fb KVM: x86: fix out-of-bounds accesses of rtc_eoi map
KVM was using arrays of size KVM_MAX_VCPUS with vcpu_id, but ID can be
bigger that the maximal number of VCPUs, resulting in out-of-bounds
access.

Found by syzkaller:

  BUG: KASAN: slab-out-of-bounds in __apic_accept_irq+0xb33/0xb50 at addr [...]
  Write of size 1 by task a.out/27101
  CPU: 1 PID: 27101 Comm: a.out Not tainted 4.9.0-rc5+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __apic_accept_irq+0xb33/0xb50 arch/x86/kvm/lapic.c:905
   [...] kvm_apic_set_irq+0x10e/0x180 arch/x86/kvm/lapic.c:495
   [...] kvm_irq_delivery_to_apic+0x732/0xc10 arch/x86/kvm/irq_comm.c:86
   [...] ioapic_service+0x41d/0x760 arch/x86/kvm/ioapic.c:360
   [...] ioapic_set_irq+0x275/0x6c0 arch/x86/kvm/ioapic.c:222
   [...] kvm_ioapic_inject_all arch/x86/kvm/ioapic.c:235
   [...] kvm_set_ioapic+0x223/0x310 arch/x86/kvm/ioapic.c:670
   [...] kvm_vm_ioctl_set_irqchip arch/x86/kvm/x86.c:3668
   [...] kvm_arch_vm_ioctl+0x1a08/0x23c0 arch/x86/kvm/x86.c:3999
   [...] kvm_vm_ioctl+0x1fa/0x1a70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3099

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: af1bae5497 ("KVM: x86: bump KVM_MAX_VCPU_ID to 1023")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:37:19 +01:00
Radim Krčmář
2117d5398c KVM: x86: drop error recovery in em_jmp_far and em_ret_far
em_jmp_far and em_ret_far assumed that setting IP can only fail in 64
bit mode, but syzkaller proved otherwise (and SDM agrees).
Code segment was restored upon failure, but it was left uninitialized
outside of long mode, which could lead to a leak of host kernel stack.
We could have fixed that by always saving and restoring the CS, but we
take a simpler approach and just break any guest that manages to fail
as the error recovery is error-prone and modern CPUs don't need emulator
for this.

Found by syzkaller:

  WARNING: CPU: 2 PID: 3668 at arch/x86/kvm/emulate.c:2217 em_ret_far+0x428/0x480
  Kernel panic - not syncing: panic_on_warn set ...

  CPU: 2 PID: 3668 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __dump_stack lib/dump_stack.c:15
   [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
   [...] panic+0x1b7/0x3a3 kernel/panic.c:179
   [...] __warn+0x1c4/0x1e0 kernel/panic.c:542
   [...] warn_slowpath_null+0x2c/0x40 kernel/panic.c:585
   [...] em_ret_far+0x428/0x480 arch/x86/kvm/emulate.c:2217
   [...] em_ret_far_imm+0x17/0x70 arch/x86/kvm/emulate.c:2227
   [...] x86_emulate_insn+0x87a/0x3730 arch/x86/kvm/emulate.c:5294
   [...] x86_emulate_instruction+0x520/0x1ba0 arch/x86/kvm/x86.c:5545
   [...] emulate_instruction arch/x86/include/asm/kvm_host.h:1116
   [...] complete_emulated_io arch/x86/kvm/x86.c:6870
   [...] complete_emulated_mmio+0x4e9/0x710 arch/x86/kvm/x86.c:6934
   [...] kvm_arch_vcpu_ioctl_run+0x3b7a/0x5a90 arch/x86/kvm/x86.c:6978
   [...] kvm_vcpu_ioctl+0x61e/0xdd0 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2557
   [...] vfs_ioctl fs/ioctl.c:43
   [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
   [...] SYSC_ioctl fs/ioctl.c:694
   [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
   [...] entry_SYSCALL_64_fastpath+0x1f/0xc2

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: d1442d85cc ("KVM: x86: Handle errors when RIP is set during far jumps")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:36:54 +01:00
Radim Krčmář
444fdad88f KVM: x86: fix out-of-bounds access in lapic
Cluster xAPIC delivery incorrectly assumed that dest_id <= 0xff.
With enabled KVM_X2APIC_API_USE_32BIT_IDS in KVM_CAP_X2APIC_API, a
userspace can send an interrupt with dest_id that results in
out-of-bounds access.

Found by syzkaller:

  BUG: KASAN: slab-out-of-bounds in kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 at addr ffff88003d9ca750
  Read of size 8 by task syz-executor/22923
  CPU: 0 PID: 22923 Comm: syz-executor Not tainted 4.9.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
   [...]
  Call Trace:
   [...] __dump_stack lib/dump_stack.c:15
   [...] dump_stack+0xb3/0x118 lib/dump_stack.c:51
   [...] kasan_object_err+0x1c/0x70 mm/kasan/report.c:156
   [...] print_address_description mm/kasan/report.c:194
   [...] kasan_report_error mm/kasan/report.c:283
   [...] kasan_report+0x231/0x500 mm/kasan/report.c:303
   [...] __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:329
   [...] kvm_irq_delivery_to_apic_fast+0x11fa/0x1210 arch/x86/kvm/lapic.c:824
   [...] kvm_irq_delivery_to_apic+0x132/0x9a0 arch/x86/kvm/irq_comm.c:72
   [...] kvm_set_msi+0x111/0x160 arch/x86/kvm/irq_comm.c:157
   [...] kvm_send_userspace_msi+0x201/0x280 arch/x86/kvm/../../../virt/kvm/irqchip.c:74
   [...] kvm_vm_ioctl+0xba5/0x1670 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3015
   [...] vfs_ioctl fs/ioctl.c:43
   [...] do_vfs_ioctl+0x18c/0x1040 fs/ioctl.c:679
   [...] SYSC_ioctl fs/ioctl.c:694
   [...] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685
   [...] entry_SYSCALL_64_fastpath+0x1f/0xc2

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Cc: stable@vger.kernel.org
Fixes: e45115b62f ("KVM: x86: use physical LAPIC array for logical x2APIC")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:35:53 +01:00
Tom Lendacky
8370c3d08b kvm: svm: Add kvm_fast_pio_in support
Update the I/O interception support to add the kvm_fast_pio_in function
to speed up the in instruction similar to the out instruction.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:45 +01:00
Tom Lendacky
147277540b kvm: svm: Add support for additional SVM NPF error codes
AMD hardware adds two additional bits to aid in nested page fault handling.

Bit 32 - NPF occurred while translating the guest's final physical address
Bit 33 - NPF occurred while translating the guest page tables

The guest page tables fault indicator can be used as an aid for nested
virtualization. Using V0 for the host, V1 for the first level guest and
V2 for the second level guest, when both V1 and V2 are using nested paging
there are currently a number of unnecessary instruction emulations. When
V2 is launched shadow paging is used in V1 for the nested tables of V2. As
a result, KVM marks these pages as RO in the host nested page tables. When
V2 exits and we resume V1, these pages are still marked RO.

Every nested walk for a guest page table is treated as a user-level write
access and this causes a lot of NPFs because the V1 page tables are marked
RO in the V0 nested tables. While executing V1, when these NPFs occur KVM
sees a write to a read-only page, emulates the V1 instruction and unprotects
the page (marking it RW). This patch looks for cases where we get a NPF due
to a guest page table walk where the page was marked RO. It immediately
unprotects the page and resumes the guest, leading to far fewer instruction
emulations when nested virtualization is used.

Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-24 18:32:26 +01:00
Ingo Molnar
064e6a8ba6 Merge branch 'linus' into x86/fpu, to resolve conflicts
Conflicts:
	arch/x86/kernel/fpu/core.c

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-23 07:18:09 +01:00
Bandan Das
ae0f549951 kvm: x86: don't print warning messages for unimplemented msrs
Change unimplemented msrs messages to use pr_debug.
If CONFIG_DYNAMIC_DEBUG is set, then these messages can be
enabled at run time or else -DDEBUG can be used at compile
time to enable them. These messages will still be printed if
ignore_msrs=1.

Signed-off-by: Bandan Das <bsd@redhat.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-22 18:29:10 +01:00
Jan Dakinevich
bcdde302b8 KVM: nVMX: invvpid handling improvements
- Expose all invalidation types to the L1

 - Reject invvpid instruction, if L1 passed zero vpid value to single
   context invalidations

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Tested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-22 17:26:42 +01:00
Jim Mattson
c7dd15b337 kvm: x86: CPUID.01H:EDX.APIC[bit 9] should mirror IA32_APIC_BASE[11]
From the Intel SDM, volume 3, section 10.4.3, "Enabling or Disabling the
Local APIC,"

  When IA32_APIC_BASE[11] is 0, the processor is functionally equivalent
  to an IA-32 processor without an on-chip APIC. The CPUID feature flag
  for the APIC (see Section 10.4.2, "Presence of the Local APIC") is
  also set to 0.

Signed-off-by: Jim Mattson <jmattson@google.com>
[Changed subject tag from nVMX to x86.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-22 14:51:55 +01:00
Pan Xinhui
0b9f6c4615 x86/kvm: Support the vCPU preemption check
Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vCPUs
than physical CPUs in the system) as doing busy waits for preempted
vCPUs will hurt system performance far worse than early yielding.

Use struct kvm_steal_time::preempted to indicate that if a vCPU
is running or not.

Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: David.Laight@ACULAB.COM
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: benh@kernel.crashing.org
Cc: boqun.feng@gmail.com
Cc: borntraeger@de.ibm.com
Cc: bsingharora@gmail.com
Cc: dave@stgolabs.net
Cc: jgross@suse.com
Cc: kernellwp@gmail.com
Cc: konrad.wilk@oracle.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: mpe@ellerman.id.au
Cc: paulmck@linux.vnet.ibm.com
Cc: paulus@samba.org
Cc: rkrcmar@redhat.com
Cc: virtualization@lists.linux-foundation.org
Cc: will.deacon@arm.com
Cc: xen-devel-request@lists.xenproject.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1478077718-37424-9-git-send-email-xinhui.pan@linux.vnet.ibm.com
[ Typo fixes. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-22 12:48:08 +01:00
Paolo Bonzini
a2b07739ff kvm: x86: merge kvm_arch_set_irq and kvm_arch_set_irq_inatomic
kvm_arch_set_irq is unused since commit b97e6de9c9.  Merge
its functionality with kvm_arch_set_irq_inatomic.

Reported-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:19 +01:00
Paolo Bonzini
7301d6abae KVM: x86: fix missed SRCU usage in kvm_lapic_set_vapic_addr
Reported by syzkaller:

    [ INFO: suspicious RCU usage. ]
    4.9.0-rc4+ #47 Not tainted
    -------------------------------
    ./include/linux/kvm_host.h:536 suspicious rcu_dereference_check() usage!

    stack backtrace:
    CPU: 1 PID: 6679 Comm: syz-executor Not tainted 4.9.0-rc4+ #47
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
     ffff880039e2f6d0 ffffffff81c2e46b ffff88003e3a5b40 0000000000000000
     0000000000000001 ffffffff83215600 ffff880039e2f700 ffffffff81334ea9
     ffffc9000730b000 0000000000000004 ffff88003c4f8420 ffff88003d3f8000
    Call Trace:
     [<     inline     >] __dump_stack lib/dump_stack.c:15
     [<ffffffff81c2e46b>] dump_stack+0xb3/0x118 lib/dump_stack.c:51
     [<ffffffff81334ea9>] lockdep_rcu_suspicious+0x139/0x180 kernel/locking/lockdep.c:4445
     [<     inline     >] __kvm_memslots include/linux/kvm_host.h:534
     [<     inline     >] kvm_memslots include/linux/kvm_host.h:541
     [<ffffffff8105d6ae>] kvm_gfn_to_hva_cache_init+0xa1e/0xce0 virt/kvm/kvm_main.c:1941
     [<ffffffff8112685d>] kvm_lapic_set_vapic_addr+0xed/0x140 arch/x86/kvm/lapic.c:2217

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: fda4e2e855
Cc: Andrew Honig <ahonig@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:18 +01:00
Paolo Bonzini
e3fd9a93a1 kvm: kvmclock: let KVM_GET_CLOCK return whether the master clock is in use
Userspace can read the exact value of kvmclock by reading the TSC
and fetching the timekeeping parameters out of guest memory.  This
however is brittle and not necessary anymore with KVM 4.11.  Provide
a mechanism that lets userspace know if the new KVM_GET_CLOCK
semantics are in effect, and---since we are at it---if the clock
is stable across all VCPUs.

Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:16 +01:00
Ignacio Alvarado
1650b4ebc9 KVM: Disable irq while unregistering user notifier
Function user_notifier_unregister should be called only once for each
registered user notifier.

Function kvm_arch_hardware_disable can be executed from an IPI context
which could cause a race condition with a VCPU returning to user mode
and attempting to unregister the notifier.

Signed-off-by: Ignacio Alvarado <ikalvarado@google.com>
Cc: stable@vger.kernel.org
Fixes: 18863bdd60 ("KVM: x86 shared msr infrastructure")
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 19:04:04 +01:00
Paolo Bonzini
8b95344064 KVM: x86: do not go through vcpu in __get_kvmclock_ns
Going through the first VCPU is wrong if you follow a KVM_SET_CLOCK with
a KVM_GET_CLOCK immediately after, without letting the VCPU run and
call kvm_guest_time_update.

To fix this, compute the kvmclock value ourselves, using the master
clock (tsc, nsec) pair as the base and the host CPU frequency as
the scale.

Reported-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-19 18:03:03 +01:00
Luwei Kang
4504b5c941 kvm: x86: Add AVX512_4VNNIW and AVX512_4FMAPS support
Add two new AVX512 subfeatures support for KVM guest.

AVX512_4VNNIW:
Vector instructions for deep learning enhanced word variable precision.

AVX512_4FMAPS:
Vector instructions for deep learning floating-point single precision.

Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: He Chen <he.chen@linux.intel.com>
Signed-off-by: Luwei Kang <luwei.kang@intel.com>
[Changed subject tags.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-16 22:20:02 +01:00
Radim Krčmář
283c95d0e3 KVM: x86: emulate FXSAVE and FXRSTOR
Internal errors were reported on 16 bit fxsave and fxrstor with ipxe.
Old Intels don't have unrestricted_guest, so we have to emulate them.

The patch takes advantage of the hardware implementation.

AMD and Intel differ in saving and restoring other fields in first 32
bytes.  A test wrote 0xff to the fxsave area, 0 to upper bits of MCSXR
in the fxsave area, executed fxrstor, rewrote the fxsave area to 0xee,
and executed fxsave:

  Intel (Nehalem):
    7f 1f 7f 7f ff 00 ff 07 ff ff ff ff ff ff 00 00
    ff ff ff ff ff ff 00 00 ff ff 00 00 ff ff 00 00
  Intel (Haswell -- deprecated FPU CS and FPU DS):
    7f 1f 7f 7f ff 00 ff 07 ff ff ff ff 00 00 00 00
    ff ff ff ff 00 00 00 00 ff ff 00 00 ff ff 00 00
  AMD (Opteron 2300-series):
    7f 1f 7f 7f ff 00 ee ee ee ee ee ee ee ee ee ee
    ee ee ee ee ee ee ee ee ff ff 00 00 ff ff 02 00

fxsave/fxrstor will only be emulated on early Intels, so KVM can't do
much to improve the situation.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-16 22:09:46 +01:00
Radim Krčmář
aabba3c6ab KVM: x86: add asm_safe wrapper
Move the existing exception handling for inline assembly into a macro
and switch its return values to X86EMUL type.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:45 +01:00
Radim Krčmář
4852018789 KVM: x86: save one bit in ctxt->d
Alignments are exclusive, so 5 modes can be expressed in 3 bits.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:45 +01:00
Radim Krčmář
d3fe959f81 KVM: x86: add Align16 instruction flag
Needed for FXSAVE and FXRSTOR.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:45 +01:00
Jiang Biao
695151964d kvm: x86: remove unused but set variable
The local variable *gpa_offset* is set but not used afterwards,
which make the compiler issue a warning with option
-Wunused-but-set-variable. Remove it to avoid the warning.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:44 +01:00
Jiang Biao
ecd8a8c2b4 kvm: x86: hyperv: make function static to avoid compiling warning
synic_set_irq is only used in hyperv.c, and should be static to
avoid compiling warning when with -Wmissing-prototypes option.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:44 +01:00
Jiang Biao
1e13175bd2 kvm: x86: cpuid: remove the unnecessary variable
The use of local variable *function* is not necessary here. Remove
it to avoid compiling warning with -Wunused-but-set-variable option.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:44 +01:00
Jiang Biao
ae6a237560 kvm: x86: make a function in x86.c static to avoid compiling warning
kvm_emulate_wbinvd_noskip is only used in x86.c, and should be
static to avoid compiling warning when with -Wmissing-prototypes
option.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:44 +01:00
Jiang Biao
33365e7a45 kvm: x86: make function static to avoid compiling warning
vmx_arm_hv_timer is only used in vmx.c, and should be static to
avoid compiling warning when with -Wmissing-prototypes option.

Signed-off-by: Jiang Biao <jiang.biao2@zte.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-16 22:09:43 +01:00
Paolo Bonzini
6314a17fec The three KVM patches that KVMGT needs.
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJYIy78AAoJEL/70l94x66DzbQH/34Mgm52OdLzTe+Yf8Pcae5X
 A/z2Aicto+6M+dtYoWPn2DfoevPaSpVmymryRrRzAqk5QDPHiVHZ5iCW0RaIEU7M
 iiPudqSGVGa0oFYBkxsJxCBysAQsHX5sEXEszs4egbO1TtQ8LCAxYUVuLBGlx+Fa
 qj26Fi/p2ByHVp/RX55A2kF5T8J671KT4LWUvFjzTgGFWo8Kr1bk0q4hmRYB9OBc
 /pqczV3Mc6KcmzfIg3Rd6xt8UDlEGJ4YQhpNgY6nxrQ1py3AP7vNqBPAH4RXbHJB
 /OqdAjpqa8+rrwSpQ1f58U+7v/ZO1ZTg0IW9bf60qKjG/aV4fGN6y0/iXKGgKyQ=
 =cpog
 -----END PGP SIGNATURE-----

Merge tag 'tags/for-kvmgt' into HEAD

The three KVM patches that KVMGT needs.

Conflicts:
	arch/x86/include/asm/kvm_page_track.h
	arch/x86/kvm/mmu.c
2016-11-09 15:20:31 +01:00
Jike Song
871b7ef2a1 kvm/page_track: export symbols for external usage
Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-04 12:13:20 +01:00
Jike Song
d126363d8f kvm/page_track: call notifiers with kvm_page_track_notifier_node
The user of page_track might needs extra information, so pass
the kvm_page_track_notifier_node to callbacks.

Signed-off-by: Jike Song <jike.song@intel.com>
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-04 12:13:20 +01:00
Xiaoguang Chen
ae7cd87372 KVM: x86: add track_flush_slot page track notifier
When a memory slot is being moved or removed users of page track
can be notified. So users can drop write-protection for the pages
in that memory slot.

This notifier type is needed by KVMGT to sync up its shadow page
table when memory slot is being moved or removed.

Register the notifier type track_flush_slot to receive memslot move
and remove event.

Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com>
Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com>
[Squashed commits to avoid bisection breakage and reworded the subject.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-04 12:13:19 +01:00
Paolo Bonzini
ad3610919e kvm: x86: avoid atomic operations on APICv vmentry
On some benchmarks (e.g. netperf with ioeventfd disabled), APICv
posted interrupts turn out to be slower than interrupt injection via
KVM_REQ_EVENT.

This patch optimizes a bit the IRR update, avoiding expensive atomic
operations in the common case where PI.ON=0 at vmentry or the PIR vector
is mostly zero.  This saves at least 20 cycles (1%) per vmexit, as
measured by kvm-unit-tests' inl_from_qemu test (20 runs):

              | enable_apicv=1  |  enable_apicv=0
              | mean     stdev  |  mean     stdev
    ----------|-----------------|------------------
    before    | 5826     32.65  |  5765     47.09
    after     | 5809     43.42  |  5777     77.02

Of course, any change in the right column is just placebo effect. :)
The savings are bigger if interrupts are frequent.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-03 12:27:51 +01:00
Paolo Bonzini
1b07304c58 KVM: nVMX: support descriptor table exits
These are never used by the host, but they can still be reflected to
the guest.

Tested-by: Ladi Prosek <lprosek@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 21:32:17 +01:00
Paolo Bonzini
5587859fb1 KVM: x86: use ktime_get instead of seeking the hrtimer_clock_base
The base clock for the LAPIC timer is always CLOCK_MONOTONIC.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li
8003c9ae20 KVM: LAPIC: add APIC Timer periodic/oneshot mode VMX preemption timer support
Most windows guests still utilize APIC Timer periodic/oneshot mode
instead of tsc-deadline mode, and the APIC Timer periodic/oneshot
mode are still emulated by high overhead hrtimer on host. This patch
converts the expected expire time of the periodic/oneshot mode to
guest deadline tsc in order to leverage VMX preemption timer logic
for APIC Timer tsc-deadline mode. After each preemption timer vmexit
preemption timer is restarted to emulate LVTT current-count register
is automatically reloaded from the initial-count register when the
count reaches 0. This patch reduces ~5600 cycles for each APIC Timer
periodic mode operation virtualization.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
[Squashed with my fixes that were reviewed-by Paolo.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li
7e810a38e6 KVM: LAPIC: rename start/cancel_hv_tscdeadline to start/cancel_hv_timer
Rename start/cancel_hv_tscdeadline to start/cancel_hv_timer since
they will handle both APIC Timer periodic/oneshot mode and tsc-deadline
mode.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li
498f816219 KVM: LAPIC: introduce kvm_get_lapic_target_expiration_tsc()
Introdce kvm_get_lapic_target_expiration_tsc() to get APIC Timer target
deadline tsc.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li
a10388e11f KVM: LAPIC: guarantee the timer is in tsc-deadline mode
Check apic_lvtt_tscdeadline() mode directly instead of apic_lvtt_oneshot()
and apic_lvtt_period() to guarantee the timer is in tsc-deadline mode when
rdmsr MSR_IA32_TSCDEADLINE.

Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li
7d7f7da2f1 KVM: LAPIC: extract start_sw_period() to handle periodic/oneshot mode
Extract start_sw_period() to handle periodic/oneshot mode, it will be
used by later patch.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Longpeng(Mike)
868a32f327 kvm: x86: remove the misleading comment in vmx_handle_external_intr
Since Paolo has removed irq-enable-operation in vmx_handle_external_intr
(KVM: x86: use guest_exit_irqoff), the original comment about the IF bit
in rflags is incorrect and stale now, so remove it.

Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Xiaoguang Chen
b5f5fdca65 KVM: x86: add track_flush_slot page track notifier
When a memory slot is being moved or removed users of page track
can be notified. So users can drop write-protection for the pages
in that memory slot.

This notifier type is needed by KVMGT to sync up its shadow page
table when memory slot is being moved or removed.

Register the notifier type track_flush_slot to receive memslot move
and remove event.

Reviewed-by: Xiao Guangrong <guangrong.xiao@intel.com>
Signed-off-by: Chen Xiaoguang <xiaoguang.chen@intel.com>
[Squashed commits to avoid bisection breakage and reworded the subject.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Radim Krčmář
2361133293 KVM: VMX: refactor setup of global page-sized bitmaps
We've had 10 page-sized bitmaps that were being allocated and freed one
by one when we could just use a cycle.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Radim Krčmář
2e69f86561 KVM: VMX: join functions that disable x2apic msr intercepts
vmx_disable_intercept_msr_read_x2apic() and
vmx_disable_intercept_msr_write_x2apic() differed only in the type.
Pass the type to a new function.

[Ordered and commented TPR intercept according to Paolo's suggestion.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Radim Krčmář
40d8338d09 KVM: VMX: remove functions that enable msr intercepts
All intercepts are enabled at the beginning, so they can only be used if
we disabled an intercept that we wanted to have enabled.
This was done for TMCCT to simplify a loop that disables all x2APIC MSR
intercepts, but just keeping TMCCT enabled yields better results.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Jim Mattson
83bafef1a1 kvm: nVMX: Update MSR load counts on a VMCS switch
When L0 establishes (or removes) an MSR entry in the VM-entry or VM-exit
MSR load lists, the change should affect the dormant VMCS as well as the
current VMCS. Moreover, the vmcs02 MSR-load addresses should be
initialized.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Jim Mattson
cf3215d939 kvm: nVMX: Fetch VM_INSTRUCTION_ERROR from vmcs02 on vmx->fail
When forwarding a hardware VM-entry failure to L1, fetch the
VM_INSTRUCTION_ERROR field from vmcs02 before loading vmcs01.

(Note that there is an implicit assumption that the VM-entry failure was
on the first VM-entry to vmcs02 after nested_vmx_run; otherwise, L1 is
going to be very confused.)

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Peter Feiner
66d73e12f2 KVM: X86: MMU: no mmu_notifier_seq++ in kvm_age_hva
The MMU notifier sequence number keeps GPA->HPA mappings in sync when
GPA->HPA lookups are done outside of the MMU lock (e.g., in
tdp_page_fault). Since kvm_age_hva doesn't change GPA->HPA, it's
unnecessary to increment the sequence number.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Wanpeng Li
c63e45635b KVM: VMX: Better name x2apic msr bitmaps
Renames x2apic_apicv_inactive msr_bitmaps to x2apic and original
x2apic bitmaps to x2apic_apicv.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-11-02 21:32:17 +01:00
Owen Hofmann
d9092f52d7 kvm: x86: Check memopp before dereference (CVE-2016-8630)
Commit 41061cdb98 ("KVM: emulate: do not initialize memopp") removes a
check for non-NULL under incorrect assumptions. An undefined instruction
with a ModR/M byte with Mod=0 and R/M-5 (e.g. 0xc7 0x15) will attempt
to dereference a null pointer here.

Fixes: 41061cdb98
Message-Id: <1477592752-126650-2-git-send-email-osh@google.com>
Signed-off-by: Owen Hofmann <osh@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 21:31:53 +01:00
Jim Mattson
355f4fb140 kvm: nVMX: VMCLEAR an active shadow VMCS after last use
After a successful VM-entry with the "VMCS shadowing" VM-execution
control set, the shadow VMCS referenced by the VMCS link pointer field
in the current VMCS becomes active on the logical processor.

A VMCS that is made active on more than one logical processor may become
corrupted. Therefore, before an active VMCS can be migrated to another
logical processor, the first logical processor must execute a VMCLEAR
for the active VMCS. VMCLEAR both ensures that all VMCS data are written
to memory and makes the VMCS inactive.

Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-By: David Matlack <dmatlack@google.com>
Message-Id: <1477668579-22555-1-git-send-email-jmattson@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 20:03:17 +01:00
Paolo Bonzini
ea26e4ec08 KVM: x86: drop TSC offsetting kvm_x86_ops to fix KVM_GET/SET_CLOCK
Since commit a545ab6a00 ("kvm: x86: add tsc_offset field to struct
kvm_vcpu_arch", 2016-09-07) the offset between host and L1 TSC is
cached and need not be fished out of the VMCS or VMCB.  This means
that we can implement adjust_tsc_offset_guest and read_l1_tsc
entirely in generic code.  The simplification is particularly
significant for VMX code, where vmx->nested.vmcs01_tsc_offset
was duplicating what is now in vcpu->arch.tsc_offset.  Therefore
the vmcs01_tsc_offset can be dropped completely.

More importantly, this fixes KVM_GET_CLOCK/KVM_SET_CLOCK
which, after commit 108b249c45 ("KVM: x86: introduce get_kvmclock_ns",
2016-09-01) called read_l1_tsc while the VMCS was not loaded.
It thus returned bogus values on Intel CPUs.

Fixes: 108b249c45
Reported-by: Roman Kagan <rkagan@virtuozzo.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-11-02 20:03:07 +01:00
Andy Lutomirski
04ac88abaf x86/fpu, kvm: Remove host CR0.TS manipulation
Now that x86 always uses eager FPU switching on the host, there's no
need for KVM to manipulate the host's CR0.TS.

This should be both simpler and faster.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm list <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/b212064922537c05d0c81d931fc4dbe769127ce7.1477951965.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:47:54 +01:00
Ingo Molnar
c29c716662 Merge branch 'core/urgent' into x86/fpu, to merge fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-11-01 07:47:40 +01:00
Ido Yariv
bd768e1466 KVM: x86: fix wbinvd_dirty_mask use-after-free
vcpu->arch.wbinvd_dirty_mask may still be used after freeing it,
corrupting memory. For example, the following call trace may set a bit
in an already freed cpu mask:
    kvm_arch_vcpu_load
    vcpu_load
    vmx_free_vcpu_nested
    vmx_free_vcpu
    kvm_arch_vcpu_free

Fix this by deferring freeing of wbinvd_dirty_mask.

Cc: stable@vger.kernel.org
Signed-off-by: Ido Yariv <ido@wizery.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-10-28 11:35:21 +02:00
Borislav Petkov
796f4687bd kvm/x86: Show WRMSR data is in hex
Add the "0x" prefix to the error messages format to make it unambiguous
about what kind of value we're talking about.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Radim Krčmář" <rkrcmar@redhat.com>
Message-Id: <20161027181445.25319-1-bp@alien8.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-27 20:31:25 +02:00
Jim Mattson
85c856b39b kvm: nVMX: Fix kernel panics induced by illegal INVEPT/INVVPID types
Bitwise shifts by amounts greater than or equal to the width of the left
operand are undefined. A malicious guest can exploit this to crash a
32-bit host, due to the BUG_ON(1)'s in handle_{invept,invvpid}.

Signed-off-by: Jim Mattson <jmattson@google.com>
Message-Id: <1477496318-17681-1-git-send-email-jmattson@google.com>
[Change 1UL to 1, to match the range check on the shift count. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-10-27 12:15:27 +02:00
Jiri Slaby
8678654e3c kvm: x86: memset whole irq_eoi
gcc 7 warns:
arch/x86/kvm/ioapic.c: In function 'kvm_ioapic_reset':
arch/x86/kvm/ioapic.c:597:2: warning: 'memset' used with length equal to number of elements without multiplication by element size [-Wmemset-elt-size]

And it is right. Memset whole array using sizeof operator.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: stable@vger.kernel.org
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
[Added x86 subject tag]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-10-20 14:54:11 +02:00
Borislav Petkov
758f588d6f kvm/x86: Fix unused variable warning in kvm_timer_init()
When CONFIG_CPU_FREQ is not set, int cpu is unused and gcc rightfully
warns about it:

  arch/x86/kvm/x86.c: In function ‘kvm_timer_init’:
  arch/x86/kvm/x86.c:5697:6: warning: unused variable ‘cpu’ [-Wunused-variable]
    int cpu;
        ^~~

But since it is used only in the CONFIG_CPU_FREQ block, simply move it
there, thus squashing the warning too.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-10-20 14:49:52 +02:00
Ingo Molnar
4d69f155d5 Linux 4.9-rc1
-----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJYAoDuAAoJEHm+PkMAQRiGUeEH/03/cUjHeY5aJkcJ0JeHkoU5
 GR5nRGcjfFF6cGujw2cSXBf5NzZTcrvBBFSgGNJ/rqm4EeDBsmf6T8qSfEKky/SY
 3CNWSzayFU8Na3C8Z/a/xPTPicneX9zVnAi8XMAKXwWPmu21JCLR/hkKaxQ29qGr
 Nqe4kEdLEF80d5lFRfNjK3CX4bD6w6P7aTBaM6wuRe4u5AXKJlSF+j838o5+/tSQ
 Q1V7fyXlX+kwNmH4gViim8im0PLm7/7Li8e24pL3cAR2G6DHrUzcsYYoRMHpk5bv
 HdBeCgZL6TnIaJc0ui2FRqQsifaVfM5J+pK81wr/JhBP2hmuWIN7NMupfCYtCcM=
 =Mown
 -----END PGP SIGNATURE-----

Merge tag 'v4.9-rc1' into x86/fpu, to resolve conflict

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-16 13:04:34 +02:00
Petr Mladek
3989144f86 kthread: kthread worker API cleanup
A good practice is to prefix the names of functions by the name
of the subsystem.

The kthread worker API is a mix of classic kthreads and workqueues.  Each
worker has a dedicated kthread.  It runs a generic function that process
queued works.  It is implemented as part of the kthread subsystem.

This patch renames the existing kthread worker API to use
the corresponding name from the workqueues API prefixed by
kthread_:

__init_kthread_worker()		-> __kthread_init_worker()
init_kthread_worker()		-> kthread_init_worker()
init_kthread_work()		-> kthread_init_work()
insert_kthread_work()		-> kthread_insert_work()
queue_kthread_work()		-> kthread_queue_work()
flush_kthread_work()		-> kthread_flush_work()
flush_kthread_worker()		-> kthread_flush_worker()

Note that the names of DEFINE_KTHREAD_WORK*() macros stay
as they are. It is common that the "DEFINE_" prefix has
precedence over the subsystem names.

Note that INIT() macros and init() functions use different
naming scheme. There is no good solution. There are several
reasons for this solution:

  + "init" in the function names stands for the verb "initialize"
    aka "initialize worker". While "INIT" in the macro names
    stands for the noun "INITIALIZER" aka "worker initializer".

  + INIT() macros are used only in DEFINE() macros

  + init() functions are used close to the other kthread()
    functions. It looks much better if all the functions
    use the same scheme.

  + There will be also kthread_destroy_worker() that will
    be used close to kthread_cancel_work(). It is related
    to the init() function. Again it looks better if all
    functions use the same naming scheme.

  + there are several precedents for such init() function
    names, e.g. amd_iommu_init_device(), free_area_init_node(),
    jump_label_init_type(),  regmap_init_mmio_clk(),

  + It is not an argument but it was inconsistent even before.

[arnd@arndb.de: fix linux-next merge conflict]
 Link: http://lkml.kernel.org/r/20160908135724.1311726-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/1470754545-17632-3-git-send-email-pmladek@suse.com
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Borislav Petkov <bp@suse.de>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-10-11 15:06:33 -07:00
Rik van Riel
3d42de25d2 x86/fpu, kvm: Remove KVM vcpu->fpu_counter
With the removal of the lazy FPU code, this field is no longer used.
Get rid of it.

Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pbonzini@redhat.com
Link: http://lkml.kernel.org/r/1475627678-20788-7-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-07 11:14:41 +02:00
Andy Lutomirski
c592b57347 x86/fpu: Remove use_eager_fpu()
This removes all the obvious code paths that depend on lazy FPU mode.
It shouldn't change the generated code at all.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: pbonzini@redhat.com
Link: http://lkml.kernel.org/r/1475627678-20788-5-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-10-07 11:14:36 +02:00
Linus Torvalds
6218590bcb KVM updates for v4.9-rc1
All architectures:
   Move `make kvmconfig` stubs from x86;  use 64 bits for debugfs stats.
 
 ARM:
   Important fixes for not using an in-kernel irqchip; handle SError
   exceptions and present them to guests if appropriate; proxying of GICV
   access at EL2 if guest mappings are unsafe; GICv3 on AArch32 on ARMv8;
   preparations for GICv3 save/restore, including ABI docs; cleanups and
   a bit of optimizations.
 
 MIPS:
   A couple of fixes in preparation for supporting MIPS EVA host kernels;
   MIPS SMP host & TLB invalidation fixes.
 
 PPC:
   Fix the bug which caused guests to falsely report lockups; other minor
   fixes; a small optimization.
 
 s390:
   Lazy enablement of runtime instrumentation; up to 255 CPUs for nested
   guests; rework of machine check deliver; cleanups and fixes.
 
 x86:
   IOMMU part of AMD's AVIC for vmexit-less interrupt delivery; Hyper-V
   TSC page; per-vcpu tsc_offset in debugfs; accelerated INS/OUTS in
   nVMX; cleanups and fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABCAAGBQJX9iDrAAoJEED/6hsPKofoOPoIAIUlgojkb9l2l1XVDgsXdgQL
 sRVhYSVv7/c8sk9vFImrD5ElOPZd+CEAIqFOu45+NM3cNi7gxip9yftUVs7wI5aC
 eDZRWm1E4trDZLe54ZM9ThcqZzZZiELVGMfR1+ZndUycybwyWzafpXYsYyaXp3BW
 hyHM3qVkoWO3dxBWFwHIoO/AUJrWYkRHEByKyvlC6KPxSdBPSa5c1AQwMCoE0Mo4
 K/xUj4gBn9eMelNhg4Oqu/uh49/q+dtdoP2C+sVM8bSdquD+PmIeOhPFIcuGbGFI
 B+oRpUhIuntN39gz8wInJ4/GRSeTuR2faNPxMn4E1i1u4LiuJvipcsOjPfe0a18=
 =fZRB
 -----END PGP SIGNATURE-----

Merge tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Radim Krčmář:
 "All architectures:
   - move `make kvmconfig` stubs from x86
   - use 64 bits for debugfs stats

  ARM:
   - Important fixes for not using an in-kernel irqchip
   - handle SError exceptions and present them to guests if appropriate
   - proxying of GICV access at EL2 if guest mappings are unsafe
   - GICv3 on AArch32 on ARMv8
   - preparations for GICv3 save/restore, including ABI docs
   - cleanups and a bit of optimizations

  MIPS:
   - A couple of fixes in preparation for supporting MIPS EVA host
     kernels
   - MIPS SMP host & TLB invalidation fixes

  PPC:
   - Fix the bug which caused guests to falsely report lockups
   - other minor fixes
   - a small optimization

  s390:
   - Lazy enablement of runtime instrumentation
   - up to 255 CPUs for nested guests
   - rework of machine check deliver
   - cleanups and fixes

  x86:
   - IOMMU part of AMD's AVIC for vmexit-less interrupt delivery
   - Hyper-V TSC page
   - per-vcpu tsc_offset in debugfs
   - accelerated INS/OUTS in nVMX
   - cleanups and fixes"

* tag 'kvm-4.9-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (140 commits)
  KVM: MIPS: Drop dubious EntryHi optimisation
  KVM: MIPS: Invalidate TLB by regenerating ASIDs
  KVM: MIPS: Split kernel/user ASID regeneration
  KVM: MIPS: Drop other CPU ASIDs on guest MMU changes
  KVM: arm/arm64: vgic: Don't flush/sync without a working vgic
  KVM: arm64: Require in-kernel irqchip for PMU support
  KVM: PPC: Book3s PR: Allow access to unprivileged MMCR2 register
  KVM: PPC: Book3S PR: Support 64kB page size on POWER8E and POWER8NVL
  KVM: PPC: Book3S: Remove duplicate setting of the B field in tlbie
  KVM: PPC: BookE: Fix a sanity check
  KVM: PPC: Book3S HV: Take out virtual core piggybacking code
  KVM: PPC: Book3S: Treat VTB as a per-subcore register, not per-thread
  ARM: gic-v3: Work around definition of gic_write_bpr1
  KVM: nVMX: Fix the NMI IDT-vectoring handling
  KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
  KVM: nVMX: Fix reload apic access page warning
  kvmconfig: add virtio-gpu to config fragment
  config: move x86 kvm_guest.config to a common location
  arm64: KVM: Remove duplicating init code for setting VMID
  ARM: KVM: Support vgic-v3
  ...
2016-10-06 10:49:01 -07:00
Wanpeng Li
c5a6d5f7fa KVM: nVMX: Fix the NMI IDT-vectoring handling
Run kvm-unit-tests/eventinj.flat in L1:

Sending NMI to self
After NMI to self
FAIL: NMI

This test scenario is to test whether VMM can handle NMI IDT-vectoring info correctly.

At the beginning, L2 writes LAPIC to send a self NMI, the EPT page tables on both L1
and L0 are empty so:

- The L2 accesses memory can generate EPT violation which can be intercepted by L0.

  The EPT violation vmexit occurred during delivery of this NMI, and the NMI info is
  recorded in vmcs02's IDT-vectoring info.

- L0 walks L1's EPT12 and L0 sees the mapping is invalid, it injects the EPT violation into L1.

  The vmcs02's IDT-vectoring info is reflected to vmcs12's IDT-vectoring info since
  it is a nested vmexit.

- L1 receives the EPT violation, then fixes its EPT12.
- L1 executes VMRESUME to resume L2 which generates vmexit and causes L1 exits to L0.
- L0 emulates VMRESUME which is called from L1, then return to L2.

  L0 merges the requirement of vmcs12's IDT-vectoring info and injects it to L2 through
  vmcs02.

- The L2 re-executes the fault instruction and cause EPT violation again.
- Since the L1's EPT12 is valid, L0 can fix its EPT02
- L0 resume L2

  The EPT violation vmexit occurred during delivery of this NMI again, and the NMI info
  is recorded in vmcs02's IDT-vectoring info. L0 should inject the NMI through vmentry
  event injection since it is caused by EPT02's EPT violation.

However, vmx_inject_nmi() refuses to inject NMI from IDT-vectoring info if vCPU is in
guest mode, this patch fix it by permitting to inject NMI from IDT-vectoring if it is
the L0's responsibility to inject NMI from IDT-vectoring info to L2.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Bandan Das <bsd@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 01:08:15 +02:00
Wanpeng Li
f6e90f9e0e KVM: VMX: Enable MSR-BASED TPR shadow even if APICv is inactive
I observed that kvmvapic(to optimize flexpriority=N or AMD) is used
to boost TPR access when testing kvm-unit-test/eventinj.flat tpr case
on my haswell desktop (w/ flexpriority, w/o APICv). Commit (8d14695f95
x86, apicv: add virtual x2apic support) disable virtual x2apic mode
completely if w/o APICv, and the author also told me that windows guest
can't enter into x2apic mode when he developed the APICv feature several
years ago. However, it is not truth currently, Interrupt Remapping and
vIOMMU is added to qemu and the developers from Intel test windows 8 can
work in x2apic mode w/ Interrupt Remapping enabled recently.

This patch enables TPR shadow for virtual x2apic mode to boost
windows guest in x2apic mode even if w/o APICv.

Can pass the kvm-unit-test.

Suggested-by: Radim Krčmář <rkrcmar@redhat.com>
Suggested-by: Wincy Van <fanwenyi0529@gmail.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Wincy Van <fanwenyi0529@gmail.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 01:08:14 +02:00
Wanpeng Li
c83b6d1594 KVM: nVMX: Fix reload apic access page warning
WARNING: CPU: 1 PID: 4230 at kernel/sched/core.c:7564 __might_sleep+0x7e/0x80
do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8d0de7f9>] prepare_to_swait+0x39/0xa0
CPU: 1 PID: 4230 Comm: qemu-system-x86 Not tainted 4.8.0-rc5+ #47
Call Trace:
 dump_stack+0x99/0xd0
 __warn+0xd1/0xf0
 warn_slowpath_fmt+0x4f/0x60
 ? prepare_to_swait+0x39/0xa0
 ? prepare_to_swait+0x39/0xa0
 __might_sleep+0x7e/0x80
 __gfn_to_pfn_memslot+0x156/0x480 [kvm]
 gfn_to_pfn+0x2a/0x30 [kvm]
 gfn_to_page+0xe/0x20 [kvm]
 kvm_vcpu_reload_apic_access_page+0x32/0xa0 [kvm]
 nested_vmx_vmexit+0x765/0xca0 [kvm_intel]
 ? _raw_spin_unlock_irqrestore+0x36/0x80
 vmx_check_nested_events+0x49/0x1f0 [kvm_intel]
 kvm_arch_vcpu_runnable+0x2d/0xe0 [kvm]
 kvm_vcpu_check_block+0x12/0x60 [kvm]
 kvm_vcpu_block+0x94/0x4c0 [kvm]
 kvm_arch_vcpu_ioctl_run+0x619/0x1aa0 [kvm]
 ? kvm_arch_vcpu_ioctl_run+0xdf1/0x1aa0 [kvm]
 kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]

===============================
[ INFO: suspicious RCU usage. ]
4.8.0-rc5+ #47 Not tainted
-------------------------------
./include/linux/kvm_host.h:535 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 1, debug_locks = 0
1 lock held by qemu-system-x86/4230:
 #0:  (&vcpu->mutex){+.+.+.}, at: [<ffffffffc062975c>] vcpu_load+0x1c/0x60 [kvm]

stack backtrace:
CPU: 1 PID: 4230 Comm: qemu-system-x86 Not tainted 4.8.0-rc5+ #47
Call Trace:
 dump_stack+0x99/0xd0
 lockdep_rcu_suspicious+0xe7/0x120
 gfn_to_memslot+0x12a/0x140 [kvm]
 gfn_to_pfn+0x12/0x30 [kvm]
 gfn_to_page+0xe/0x20 [kvm]
 kvm_vcpu_reload_apic_access_page+0x32/0xa0 [kvm]
 nested_vmx_vmexit+0x765/0xca0 [kvm_intel]
 ? _raw_spin_unlock_irqrestore+0x36/0x80
 vmx_check_nested_events+0x49/0x1f0 [kvm_intel]
 kvm_arch_vcpu_runnable+0x2d/0xe0 [kvm]
 kvm_vcpu_check_block+0x12/0x60 [kvm]
 kvm_vcpu_block+0x94/0x4c0 [kvm]
 kvm_arch_vcpu_ioctl_run+0x619/0x1aa0 [kvm]
 ? kvm_arch_vcpu_ioctl_run+0xdf1/0x1aa0 [kvm]
 kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
 ? __fget+0xfd/0x210
 ? __lock_is_held+0x54/0x70
 do_vfs_ioctl+0x96/0x6a0
 ? __fget+0x11c/0x210
 ? __fget+0x5/0x210
 SyS_ioctl+0x79/0x90
 do_syscall_64+0x81/0x220
 entry_SYSCALL64_slow_path+0x25/0x25

These can be triggered by running kvm-unit-test: ./x86-run x86/vmx.flat

The nested preemption timer is based on hrtimer which is started on L2
entry, stopped on L2 exit and evaluated via the new check_nested_events
hook. The current logic adds vCPU to a simple waitqueue (TASK_INTERRUPTIBLE)
if need to yield pCPU and w/o holding srcu read lock when accesses memslots,
both can be in nested preemption timer evaluation path which results in
the warning above.

This patch fix it by leveraging request bit to async reload APIC access
page before vmentry in order to avoid to reload directly during the nested
preemption timer evaluation, it is safe since the vmcs01 is loaded and
current is nested vmexit.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-09-23 01:08:13 +02:00
Colin Ian King
adad0d02a7 kvm: svm: fix unsigned compare less than zero comparison
vm_data->avic_vm_id is a u32, so the check for a error
return (less than zero) such as -EAGAIN from
avic_get_next_vm_id currently has no effect whatsoever.
Fix this by using a temporary int for the comparison
and assign vm_data->avic_vm_id to this. I used an explicit
u32 cast in the assignment to show why vm_data->avic_vm_id
cannot be used in the assign/compare steps.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:30 +02:00
Paolo Bonzini
095cf55df7 KVM: x86: Hyper-V tsc page setup
Lately tsc page was implemented but filled with empty
values. This patch setup tsc page scale and offset based
on vcpu tsc, tsc_khz and  HV_X64_MSR_TIME_REF_COUNT value.

The valid tsc page drops HV_X64_MSR_TIME_REF_COUNT msr
reads count to zero which potentially improves performance.

Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com>
Reviewed-by: Peter Hornyack <peterhornyack@google.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
CC: Roman Kagan <rkagan@virtuozzo.com>
CC: Denis V. Lunev <den@openvz.org>
[Computation of TSC page parameters rewritten to use the Linux timekeeper
 parameters. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:20 +02:00
Paolo Bonzini
108b249c45 KVM: x86: introduce get_kvmclock_ns
Introduce a function that reads the exact nanoseconds value that is
provided to the guest in kvmclock.  This crystallizes the notion of
kvmclock as a thin veneer over a stable TSC, that the guest will
(hopefully) convert with NTP.  In other words, kvmclock is *not* a
paravirtualized host-to-guest NTP.

Drop the get_kernel_ns() function, that was used both to get the base
value of the master clock and to get the current value of kvmclock.
The former use is replaced by ktime_get_boot_ns(), the latter is
the purpose of get_kernel_ns().

This also allows KVM to provide a Hyper-V time reference counter that
is synchronized with the time that is computed from the TSC page.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:15 +02:00
Paolo Bonzini
67198ac3f3 KVM: x86: initialize kvmclock_offset
Make the guest's kvmclock count up from zero, not from the host boot
time.  The guest cannot rely on that anyway because it changes on
migration, the numbers are easier on the eye and finally it matches the
desired semantics of the Hyper-V time reference counter.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:26:13 +02:00
Paolo Bonzini
0d6dd2ff82 KVM: x86: always fill in vcpu->arch.hv_clock
We will use it in the next patches for KVM_GET_CLOCK and as a basis for the
contents of the Hyper-V TSC page.  Get the values from the Linux
timekeeper even if kvmclock is not enabled.

Reviewed-by: Roman Kagan <rkagan@virtuozzo.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-20 09:25:53 +02:00
Ingo Molnar
b2c16e1efd Merge branch 'linus' into x86/asm, to pick up fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-20 08:29:21 +02:00
Linus Torvalds
3286be9480 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "A couple of small fixes to x86 perf drivers:

   - Measure L2 for HW_CACHE* events on AMD

   - Fix the address filter handling in the intel/pt driver

   - Handle the BTS disabling at the proper place"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2
  perf/x86/intel/pt: Do validate the size of a kernel address filter
  perf/x86/intel/pt: Fix kernel address filter's offset validation
  perf/x86/intel/pt: Fix an off-by-one in address filter configuration
  perf/x86/intel: Don't disable "intel_bts" around "intel" event batching
2016-09-18 11:50:48 -07:00
Luiz Capitulino
4f5758fc67 kvm: x86: export TSC information to user-space
This commit exports the following information to
user-space via the newly created per-vcpu debugfs
directory:

 - TSC offset (as a signed number)
 - TSC scaling ratio
 - TSC scaling ratio fractinal bits

The original intention of this commit was to
export only the TSC offset, but the TSC scaling
information is exported for completeness.

We need to retrieve the TSC offset from user-space
in order to support the merging of host and guest
traces in trace-cmd. Today, we use the kvm_write_tsc_offset
tracepoint, but it has a number of problems (mainly,
it requires a running VM to be rebooted, ftrace setup,
and also tracepoints are not supposed to be ABIs).

The merging of host and guest traces is explained
in more detail in this thread:

 [Qemu-devel] [RFC] host and guest kernel trace merging
 https://lists.nongnu.org/archive/html/qemu-devel/2016-03/msg00887.html

This commit creates the following files in debugfs:

/sys/kernel/debug/kvm/66828-10/vcpu0/tsc-offset
/sys/kernel/debug/kvm/66828-10/vcpu0/tsc-scaling-ratio
/sys/kernel/debug/kvm/66828-10/vcpu0/tsc-scaling-ratio-frac-bits

The last two are only created if TSC scaling is supported.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:48 +02:00
Luiz Capitulino
235539b48a kvm: add stubs for arch specific debugfs support
Two stubs are added:

 o kvm_arch_has_vcpu_debugfs(): must return true if the arch
   supports creating debugfs entries in the vcpu debugfs dir
   (which will be implemented by the next commit)

 o kvm_arch_create_vcpu_debugfs(): code that creates debugfs
   entries in the vcpu debugfs dir

For x86, this commit introduces a new file to avoid growing
arch/x86/kvm/x86.c even more.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:47 +02:00
Luiz Capitulino
3e3f50262e kvm: x86: drop read_tsc_offset()
The TSC offset can now be read directly from struct kvm_arch_vcpu.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:45 +02:00
Luiz Capitulino
a545ab6a00 kvm: x86: add tsc_offset field to struct kvm_vcpu_arch
A future commit will want to easily read a vCPU's TSC offset,
so we store it in struct kvm_arch_vcpu_arch for easy access.

Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-16 16:57:45 +02:00
Matt Fleming
080fe0b790 perf/x86/amd: Make HW_CACHE_REFERENCES and HW_CACHE_MISSES measure L2
While the Intel PMU monitors the LLC when perf enables the
HW_CACHE_REFERENCES and HW_CACHE_MISSES events, these events monitor
L1 instruction cache fetches (0x0080) and instruction cache misses
(0x0081) on the AMD PMU.

This is extremely confusing when monitoring the same workload across
Intel and AMD machines, since parameters like,

  $ perf stat -e cache-references,cache-misses

measure completely different things.

Instead, make the AMD PMU measure instruction/data cache and TLB fill
requests to the L2 and instruction/data cache and TLB misses in the L2
when HW_CACHE_REFERENCES and HW_CACHE_MISSES are enabled,
respectively. That way the events measure unified caches on both
platforms.

Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1472044328-21302-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-16 16:19:49 +02:00
Paolo Bonzini
b0eaf4506f kvm: x86: correctly reset dest_map->vector when restoring LAPIC state
When userspace sends KVM_SET_LAPIC, KVM schedules a check between
the vCPU's IRR and ISR and the IOAPIC redirection table, in order
to re-establish the IOAPIC's dest_map (the list of CPUs servicing
the real-time clock interrupt with the corresponding vectors).

However, __rtc_irq_eoi_tracking_restore_one was forgetting to
set dest_map->vectors.  Because of this, the IOAPIC did not process
the real-time clock interrupt EOI, ioapic->rtc_status.pending_eoi
got stuck at a non-zero value, and further RTC interrupts were
reported to userspace as coalesced.

Fixes: 9e4aabe2bb
Fixes: 4d99ba898d
Cc: stable@vger.kernel.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: David Gilbert <dgilbert@redhat.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-15 18:00:32 +02:00
Ingo Molnar
d4b80afbba Merge branch 'linus' into x86/asm, to pick up recent fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-09-15 08:24:53 +02:00
Suravee Suthikulpanit
411b44ba80 svm: Implements update_pi_irte hook to setup posted interrupt
This patch implements update_pi_irte function hook to allow SVM
communicate to IOMMU driver regarding how to set up IRTE for handling
posted interrupt.

In case AVIC is enabled, during vcpu_load/unload, SVM needs to update
IOMMU IRTE with appropriate host physical APIC ID. Also, when
vcpu_blocking/unblocking, SVM needs to update the is-running bit in
the IOMMU IRTE. Both are achieved via calling amd_iommu_update_ga().

However, if GA mode is not enabled for the pass-through device,
IOMMU driver will simply just return when calling amd_iommu_update_ga.

Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 13:18:58 +02:00
Suravee Suthikulpanit
5881f73757 svm: Introduce AMD IOMMU avic_ga_log_notifier
This patch introduces avic_ga_log_notifier, which will be called
by IOMMU driver whenever it handles the Guest vAPIC (GA) log entry.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 13:11:56 +02:00
Suravee Suthikulpanit
5ea11f2b31 svm: Introduces AVIC per-VM ID
Introduces per-VM AVIC ID and helper functions to manage the IDs.
Currently, the ID will be used to implement 32-bit AVIC IOMMU GA tag.

The ID is 24-bit one-based indexing value, and is managed via helper
functions to get the next ID, or to free an ID once a VM is destroyed.
There should be no ID conflict for any active VMs.

Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-08 12:57:20 +02:00
Jan Dakinevich
9ac7e3e815 KVM: nVMX: expose INS/OUTS information support
Expose the feature to L1 hypervisor if host CPU supports it, since
certain hypervisors requires it for own purposes.

According to Intel SDM A.1, if CPU supports the feature,
VMX_INSTRUCTION_INFO field of VMCS will contain detailed information
about INS/OUTS instructions handling. This field is already copied to
VMCS12 for L1 hypervisor (see prepare_vmcs12 routine) independently
feature presence.

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:30 +02:00
Paolo Bonzini
16cb025565 KVM: VMX: not use vmcs_config in setup_vmcs_config
setup_vmcs_config takes a pointer to the vmcs_config global.  The
indirection is somewhat pointless, but just keep things consistent
for now.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:30 +02:00
Paolo Bonzini
1a6982353d KVM: x86: remove stale comments
handle_external_intr does not enable interrupts anymore, vcpu_enter_guest
does it after calling guest_exit_irqoff.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:29 +02:00
Paolo Bonzini
bbe41b9508 KVM: x86: ratelimit and decrease severity for guest-triggered printk
These are mostly related to nested VMX.  They needn't have
a loglevel as high as KERN_WARN, and mustn't be allowed to
pollute the host logs.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:29 +02:00
Jan Dakinevich
119a9c01a5 KVM: nVMX: pass valid guest linear-address to the L1
If EPT support is exposed to L1 hypervisor, guest linear-address field
of VMCS should contain GVA of L2, the access to which caused EPT violation.

Signed-off-by: Jan Dakinevich <jan.dakinevich@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:28 +02:00
Wanpeng Li
f15a75eedc KVM: nVMX: make emulated nested preemption timer pinned
Commit 61abdbe0bc ("kvm: x86: make lapic hrtimer pinned") pins the emulated
lapic timer. This patch does the same for the emulated nested preemption
timer to avoid vmexit an unrelated vCPU and the latency of kicking IPI to
another vCPU.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:27 +02:00
Liang Li
72e0ae58a0 vmx: refine validity check for guest linear address
The validity check for the guest line address is inefficient,
check the invalid value instead of enumerating the valid ones.

Signed-off-by: Liang Li <liang.z.li@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-07 19:34:27 +02:00
Wanpeng Li
e12c8f36f3 KVM: lapic: adjust preemption timer correctly when goes TSC backward
TSC_OFFSET will be adjusted if discovers TSC backward during vCPU load.
The preemption timer, which relies on the guest tsc to reprogram its
preemption timer value, is also reprogrammed if vCPU is scheded in to
a different pCPU. However, the current implementation reprogram preemption
timer before TSC_OFFSET is adjusted to the right value, resulting in the
preemption timer firing prematurely.

This patch fix it by adjusting TSC_OFFSET before reprogramming preemption
timer if TSC backward.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krċmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-09-05 16:14:39 +02:00
Luwei Kang
8e3562f6f8 KVM: x86: Expose more Intel AVX512 feature to guest
Expose AVX512DQ, AVX512BW, AVX512VL feature to guest.
Its spec can be found at:
https://software.intel.com/sites/default/files/managed/b4/3a/319433-024.pdf

Signed-off-by: Luwei Kang <luwei.kang@intel.com>
[Resolved a trivial conflict with removed F(PCOMMIT).]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-19 17:51:40 +02:00
Bandan Das
c4f138b451 mmu: don't pass *kvm to spte_write_protect and spte_*_dirty
That parameter isn't used in these functions,
it's probably a historical artifact.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-19 17:51:40 +02:00
Wanpeng Li
187ca84b4b KVM: lapic: don't recalculate apic map table twice when enabling LAPIC
APIC map table is recalculated during reset APIC ID to the initial value
when enabling LAPIC. This patch move the recalculate_apic_map() to the
next branch since we don't need to recalculate apic map twice in current
codes.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-19 17:51:40 +02:00
Peter Feiner
c95ba92afb kvm: nVMX: fix nested tsc scaling
When the host supported TSC scaling, L2 would use a TSC multiplier of
0, which causes a VM entry failure. Now L2's TSC uses the same
multiplier as L1.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-18 12:19:09 +02:00
Radim Krčmář
dccbfcf52c KVM: nVMX: postpone VMCS changes on MSR_IA32_APICBASE write
If vmcs12 does not intercept APIC_BASE writes, then KVM will handle the
write with vmcs02 as the current VMCS.
This will incorrectly apply modifications intended for vmcs01 to vmcs02
and L2 can use it to gain access to L0's x2APIC registers by disabling
virtualized x2APIC while using msr bitmap that assumes enabled.

Postpone execution of vmx_set_virtual_x2apic_mode until vmcs01 is the
current VMCS.  An alternative solution would temporarily make vmcs01 the
current VMCS, but it requires more care.

Fixes: 8d14695f95 ("x86, apicv: add virtual x2apic support")
Reported-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-18 12:19:08 +02:00
Radim Krčmář
d048c09821 KVM: nVMX: fix msr bitmaps to prevent L2 from accessing L0 x2APIC
msr bitmap can be used to avoid a VM exit (interception) on guest MSR
accesses.  In some configurations of VMX controls, the guest can even
directly access host's x2APIC MSRs.  See SDM 29.5 VIRTUALIZING MSR-BASED
APIC ACCESSES.

L2 could read all L0's x2APIC MSRs and write TPR, EOI, and SELF_IPI.
To do so, L1 would first trick KVM to disable all possible interceptions
by enabling APICv features and then would turn those features off;
nested_vmx_merge_msr_bitmap() only disabled interceptions, so VMX would
not intercept previously enabled MSRs even though they were not safe
with the new configuration.

Correctly re-enabling interceptions is not enough as a second bug would
still allow L1+L2 to access host's MSRs: msr bitmap was shared for all
VMCSs, so L1 could trigger a race to get the desired combination of msr
bitmap and VMX controls.

This fix allocates a msr bitmap for every L1 VCPU, allows only safe
x2APIC MSRs from L1's msr bitmap, and disables msr bitmaps if they would
have to intercept everything anyway.

Fixes: 3af18d9c5f ("KVM: nVMX: Prepare for using hardware MSR bitmap")
Reported-by: Jim Mattson <jmattson@google.com>
Suggested-by: Wincy Van <fanwenyi0529@gmail.com>
Reviewed-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-18 12:19:07 +02:00
Kees Cook
404f6aac9b x86: Apply more __ro_after_init and const
Guided by grsecurity's analogous __read_only markings in arch/x86,
this applies several uses of __ro_after_init to structures that are
only updated during __init, and const for some structures that are
never updated.  Additionally extends __init markings to some functions
that are only used during __init, and cleans up some missing C99 style
static initializers.

Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Brown <david.brown@linaro.org>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Emese Revfy <re.emese@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mathias Krause <minipli@googlemail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: PaX Team <pageexec@freemail.hu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-hardening@lists.openwall.com
Link: http://lkml.kernel.org/r/20160808232906.GA29731@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-08-10 14:55:05 +02:00
Bandan Das
45e11817d5 nvmx: mark ept single context invalidation as supported
Commit 4b85507860 ("KVM: nVMX: Don't advertise single
context invalidation for invept") removed advertising
single context invalidation since the spec does not mandate it.
However, some hypervisors (such as ESX) require it to be present
before willing to use ept in a nested environment. Advertise it
and fallback to the global case.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-04 14:21:52 +02:00
Bandan Das
03331b4b8b nvmx: remove comment about missing nested vpid support
Nested vpid is already supported and both single/global
modes are advertised to the guest

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-04 14:21:51 +02:00
Wanpeng Li
910053002e KVM: lapic: fix access preemption timer stuff even if kernel_irqchip=off
BUG: unable to handle kernel NULL pointer dereference at 000000000000008c
IP: [<ffffffffc04e0180>] kvm_lapic_hv_timer_in_use+0x10/0x20 [kvm]
PGD 0
Oops: 0000 [#1] SMP
Call Trace:
 kvm_arch_vcpu_load+0x86/0x260 [kvm]
 vcpu_load+0x46/0x60 [kvm]
 kvm_vcpu_ioctl+0x79/0x7c0 [kvm]
 ? __lock_is_held+0x54/0x70
 do_vfs_ioctl+0x96/0x6a0
 ? __fget_light+0x2a/0x90
 SyS_ioctl+0x79/0x90
 do_syscall_64+0x7c/0x1e0
 entry_SYSCALL64_slow_path+0x25/0x25
RIP  [<ffffffffc04e0180>] kvm_lapic_hv_timer_in_use+0x10/0x20 [kvm]
 RSP <ffff8800db1f3d70>
CR2: 000000000000008c
---[ end trace a55fb79d2b3b4ee8 ]---

This can be reproduced steadily by kernel_irqchip=off.

We should not access preemption timer stuff if lapic is emulated in userspace.
This patch fix it by avoiding access preemption timer stuff when kernel_irqchip=off.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-04 14:21:43 +02:00
Paolo Bonzini
6f49b2f341 KVM/ARM Changes for v4.8 - Take 2
Includes GSI routing support to go along with the new VGIC and a small fix that
 has been cooking in -next for a while.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJXoydqAAoJEEtpOizt6ddyM3oH/1A4VeG/J9q4fBPXqY2tVWXs
 c3P7UgNcrEgUNs/F9ykQY/lb31deecUzaBt1OyTf+RlsNbihq3dQdYcBhxtUODw/
 Faok582ya3UFgLW+IRHcID0EbkVOpIzMhOStYsnU/Dz7HG1JL9HdPzwkid7iu9LT
 fI6yrrBnJFjdWAAQ4BkcEKBENRsY8NTs7jX5vnFA92MkUBby7BmariPDD3FtrB+f
 Ob9B7CxM30pNqsN7OA/QvFOHMJHxf3s1TBKwmPHe5TLIfSzV1YxcEGiMc0lWqF4v
 BT8ZeMGCtjDw94tND1DskfQQRPaMqPmGuRTrAW/IuE2n92bFtbqIqs7Cbw0fzLE=
 =Vm6Q
 -----END PGP SIGNATURE-----

Merge tag 'kvm-arm-for-4.8-take2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD

KVM/ARM Changes for v4.8 - Take 2

Includes GSI routing support to go along with the new VGIC and a small fix that
has been cooking in -next for a while.
2016-08-04 13:59:56 +02:00
Linus Torvalds
221bb8a46e - ARM: GICv3 ITS emulation and various fixes. Removal of the old
VGIC implementation.
 
 - s390: support for trapping software breakpoints, nested virtualization
 (vSIE), the STHYI opcode, initial extensions for CPU model support.
 
 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots of cleanups,
 preliminary to this and the upcoming support for hardware virtualization
 extensions.
 
 - x86: support for execute-only mappings in nested EPT; reduced vmexit
 latency for TSC deadline timer (by about 30%) on Intel hosts; support for
 more than 255 vCPUs.
 
 - PPC: bugfixes.
 
 The ugly bit is the conflicts.  A couple of them are simple conflicts due
 to 4.7 fixes, but most of them are with other trees. There was definitely
 too much reliance on Acked-by here.  Some conflicts are for KVM patches
 where _I_ gave my Acked-by, but the worst are for this pull request's
 patches that touch files outside arch/*/kvm.  KVM submaintainers should
 probably learn to synchronize better with arch maintainers, with the
 latter providing topic branches whenever possible instead of Acked-by.
 This is what we do with arch/x86.  And I should learn to refuse pull
 requests when linux-next sends scary signals, even if that means that
 submaintainers have to rebase their branches.
 
 Anyhow, here's the list:
 
 - arch/x86/kvm/vmx.c: handle_pcommit and EXIT_REASON_PCOMMIT was removed
 by the nvdimm tree.  This tree adds handle_preemption_timer and
 EXIT_REASON_PREEMPTION_TIMER at the same place.  In general all mentions
 of pcommit have to go.
 
 There is also a conflict between a stable fix and this patch, where the
 stable fix removed the vmx_create_pml_buffer function and its call.
 
 - virt/kvm/kvm_main.c: kvm_cpu_notifier was removed by the hotplug tree.
 This tree adds kvm_io_bus_get_dev at the same place.
 
 - virt/kvm/arm/vgic.c: a few final bugfixes went into 4.7 before the
 file was completely removed for 4.8.
 
 - include/linux/irqchip/arm-gic-v3.h: this one is entirely our fault;
 this is a change that should have gone in through the irqchip tree and
 pulled by kvm-arm.  I think I would have rejected this kvm-arm pull
 request.  The KVM version is the right one, except that it lacks
 GITS_BASER_PAGES_SHIFT.
 
 - arch/powerpc: what a mess.  For the idle_book3s.S conflict, the KVM
 tree is the right one; everything else is trivial.  In this case I am
 not quite sure what went wrong.  The commit that is causing the mess
 (fd7bacbca4, "KVM: PPC: Book3S HV: Fix TB corruption in guest exit
 path on HMI interrupt", 2016-05-15) touches both arch/powerpc/kernel/
 and arch/powerpc/kvm/.  It's large, but at 396 insertions/5 deletions
 I guessed that it wasn't really possible to split it and that the 5
 deletions wouldn't conflict.  That wasn't the case.
 
 - arch/s390: also messy.  First is hypfs_diag.c where the KVM tree
 moved some code and the s390 tree patched it.  You have to reapply the
 relevant part of commits 6c22c98637, plus all of e030c1125e, to
 arch/s390/kernel/diag.c.  Or pick the linux-next conflict
 resolution from http://marc.info/?l=kvm&m=146717549531603&w=2.
 Second, there is a conflict in gmap.c between a stable fix and 4.8.
 The KVM version here is the correct one.
 
 I have pushed my resolution at refs/heads/merge-20160802 (commit
 3d1f53419842) at git://git.kernel.org/pub/scm/virt/kvm/kvm.git.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQEcBAABAgAGBQJXoGm7AAoJEL/70l94x66DugQIAIj703ePAFepB/fCrKHkZZia
 SGrsBdvAtNsOhr7FQ5qvvjLxiv/cv7CymeuJivX8H+4kuUHUllDzey+RPHYHD9X7
 U6n1PdCH9F15a3IXc8tDjlDdOMNIKJixYuq1UyNZMU6NFwl00+TZf9JF8A2US65b
 x/41W98ilL6nNBAsoDVmCLtPNWAqQ3lajaZELGfcqRQ9ZGKcAYOaLFXHv2YHf2XC
 qIDMf+slBGSQ66UoATnYV2gAopNlWbZ7n0vO6tE2KyvhHZ1m399aBX1+k8la/0JI
 69r+Tz7ZHUSFtmlmyByi5IAB87myy2WQHyAPwj+4vwJkDGPcl0TrupzbG7+T05Y=
 =42ti
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM updates from Paolo Bonzini:

 - ARM: GICv3 ITS emulation and various fixes.  Removal of the
   old VGIC implementation.

 - s390: support for trapping software breakpoints, nested
   virtualization (vSIE), the STHYI opcode, initial extensions
   for CPU model support.

 - MIPS: support for MIPS64 hosts (32-bit guests only) and lots
   of cleanups, preliminary to this and the upcoming support for
   hardware virtualization extensions.

 - x86: support for execute-only mappings in nested EPT; reduced
   vmexit latency for TSC deadline timer (by about 30%) on Intel
   hosts; support for more than 255 vCPUs.

 - PPC: bugfixes.

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (302 commits)
  KVM: PPC: Introduce KVM_CAP_PPC_HTM
  MIPS: Select HAVE_KVM for MIPS64_R{2,6}
  MIPS: KVM: Reset CP0_PageMask during host TLB flush
  MIPS: KVM: Fix ptr->int cast via KVM_GUEST_KSEGX()
  MIPS: KVM: Sign extend MFC0/RDHWR results
  MIPS: KVM: Fix 64-bit big endian dynamic translation
  MIPS: KVM: Fail if ebase doesn't fit in CP0_EBase
  MIPS: KVM: Use 64-bit CP0_EBase when appropriate
  MIPS: KVM: Set CP0_Status.KX on MIPS64
  MIPS: KVM: Make entry code MIPS64 friendly
  MIPS: KVM: Use kmap instead of CKSEG0ADDR()
  MIPS: KVM: Use virt_to_phys() to get commpage PFN
  MIPS: Fix definition of KSEGX() for 64-bit
  KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
  kvm: x86: nVMX: maintain internal copy of current VMCS
  KVM: PPC: Book3S HV: Save/restore TM state in H_CEDE
  KVM: PPC: Book3S HV: Pull out TM state save/restore into separate procedures
  KVM: arm64: vgic-its: Simplify MAPI error handling
  KVM: arm64: vgic-its: Make vgic_its_cmd_handle_mapi similar to other handlers
  KVM: arm64: vgic-its: Turn device_id validation into generic ID validation
  ...
2016-08-02 16:11:27 -04:00
Linus Torvalds
aeb35d6b74 Merge branch 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 header cleanups from Ingo Molnar:
 "This tree is a cleanup of the x86 tree reducing spurious uses of
  module.h - which should improve build performance a bit"

* 'x86-headers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86, crypto: Restore MODULE_LICENSE() to glue_helper.c so it loads
  x86/apic: Remove duplicated include from probe_64.c
  x86/ce4100: Remove duplicated include from ce4100.c
  x86/headers: Include spinlock_types.h in x8664_ksyms_64.c for missing spinlock_t
  x86/platform: Delete extraneous MODULE_* tags fromm ts5500
  x86: Audit and remove any remaining unnecessary uses of module.h
  x86/kvm: Audit and remove any unnecessary uses of module.h
  x86/xen: Audit and remove any unnecessary uses of module.h
  x86/platform: Audit and remove any unnecessary uses of module.h
  x86/lib: Audit and remove any unnecessary uses of module.h
  x86/kernel: Audit and remove any unnecessary uses of module.h
  x86/mm: Audit and remove any unnecessary uses of module.h
  x86: Don't use module.h just for AUTHOR / LICENSE tags
2016-08-01 14:23:42 -04:00
Jim Mattson
b80c76ec98 KVM: VMX: Add VMCS to CPU's loaded VMCSs before VMPTRLD
Kexec needs to know the addresses of all VMCSs that are active on
each CPU, so that it can flush them from the VMCS caches. It is
safe to record superfluous addresses that are not associated with
an active VMCS, but it is not safe to omit an address associated
with an active VMCS.

After a call to vmcs_load, the VMCS that was loaded is active on
the CPU. The VMCS should be added to the CPU's list of active
VMCSs before it is loaded.

Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-08-01 15:14:24 +02:00
David Matlack
4f2777bc97 kvm: x86: nVMX: maintain internal copy of current VMCS
KVM maintains L1's current VMCS in guest memory, at the guest physical
page identified by the argument to VMPTRLD. This makes hairy
time-of-check to time-of-use bugs possible,as VCPUs can be writing
the the VMCS page in memory while KVM is emulating VMLAUNCH and
VMRESUME.

The spec documents that writing to the VMCS page while it is loaded is
"undefined". Therefore it is reasonable to load the entire VMCS into
an internal cache during VMPTRLD and ignore writes to the VMCS page
-- the guest should be using VMREAD and VMWRITE to access the current
VMCS.

To adhere to the spec, KVM should flush the current VMCS during VMPTRLD,
and the target VMCS during VMCLEAR (as given by the operand to VMCLEAR).
Since this implementation of VMCS caching only maintains the the current
VMCS, VMCLEAR will only do a flush if the operand to VMCLEAR is the
current VMCS pointer.

KVM will also flush during VMXOFF, which is not mandated by the spec,
but also not in conflict with the spec.

Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-08-01 14:49:05 +02:00
Linus Torvalds
a6408f6cb6 Merge branch 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull smp hotplug updates from Thomas Gleixner:
 "This is the next part of the hotplug rework.

   - Convert all notifiers with a priority assigned

   - Convert all CPU_STARTING/DYING notifiers

     The final removal of the STARTING/DYING infrastructure will happen
     when the merge window closes.

  Another 700 hundred line of unpenetrable maze gone :)"

* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (70 commits)
  timers/core: Correct callback order during CPU hot plug
  leds/trigger/cpu: Move from CPU_STARTING to ONLINE level
  powerpc/numa: Convert to hotplug state machine
  arm/perf: Fix hotplug state machine conversion
  irqchip/armada: Avoid unused function warnings
  ARC/time: Convert to hotplug state machine
  clocksource/atlas7: Convert to hotplug state machine
  clocksource/armada-370-xp: Convert to hotplug state machine
  clocksource/exynos_mct: Convert to hotplug state machine
  clocksource/arm_global_timer: Convert to hotplug state machine
  rcu: Convert rcutree to hotplug state machine
  KVM/arm/arm64/vgic-new: Convert to hotplug state machine
  smp/cfd: Convert core to hotplug state machine
  x86/x2apic: Convert to CPU hotplug state machine
  profile: Convert to hotplug state machine
  timers/core: Convert to hotplug state machine
  hrtimer: Convert to hotplug state machine
  x86/tboot: Convert to hotplug state machine
  arm64/armv8 deprecated: Convert to hotplug state machine
  hwtracing/coresight-etm4x: Convert to hotplug state machine
  ...
2016-07-29 13:55:30 -07:00
Linus Torvalds
f0c98ebc57 libnvdimm for 4.8
1/ Replace pcommit with ADR / directed-flushing:
    The pcommit instruction, which has not shipped on any product, is
    deprecated. Instead, the requirement is that platforms implement either
    ADR, or provide one or more flush addresses per nvdimm. ADR
    (Asynchronous DRAM Refresh) flushes data in posted write buffers to the
    memory controller on a power-fail event. Flush addresses are defined in
    ACPI 6.x as an NVDIMM Firmware Interface Table (NFIT) sub-structure:
    "Flush Hint Address Structure". A flush hint is an mmio address that
    when written and fenced assures that all previous posted writes
    targeting a given dimm have been flushed to media.
 
 2/ On-demand ARS (address range scrub):
    Linux uses the results of the ACPI ARS commands to track bad blocks
    in pmem devices.  When latent errors are detected we re-scrub the media
    to refresh the bad block list, userspace can also request a re-scrub at
    any time.
 
 3/ Support for the Microsoft DSM (device specific method) command format.
 
 4/ Support for EDK2/OVMF virtual disk device memory ranges.
 
 5/ Various fixes and cleanups across the subsystem.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJXmXBsAAoJEB7SkWpmfYgCEwwP/1IOt9ocP+iHLMDH9KE7VaTZ
 NmUDR+Zy6g5cRQM7SgcuU5BXUcx+OsSrSrUTVF1cW994o9Gbz1mFotkv0ZAsPcYY
 ZVRQxo2oqHrssyOcg+PsgKWiXn68rJOCgmpEyzaJywl5qTMst7pzsT1s1f7rSh6h
 trCf4VaJJwxZR8fARGtlHUnnhPe2Orp99EZRKEWprAsIv2kPuWpPHSjRjuEgN1JG
 KW8AYwWqFTtiLRUk86I4KBB0wcDrfctsjgN9Ogd6+aHyQBRnVSr2U+vDCFkC8KLu
 qiDCpYp+yyxBjclnljz7tRRT3GtzfCUWd4v2KVWqgg2IaobUc0Lbukp/rmikUXQP
 WLikT2OCQ994eFK5OX3Q3cIU/4j459TQnof8q14yVSpjAKrNUXVSR5puN7Hxa+V7
 41wKrAsnsyY1oq+Yd/rMR8VfH7PHx3bFkrmRCGZCufLX1UQm4aYj+sWagDKiV3yA
 DiudghbOnhfurfGsnXUVw7y7GKs+gNWNBmB6ndAD6ZEHmKoGUhAEbJDLCc3DnANl
 b/2mv1MIdIcC1DlCmnbbcn6fv6bICe/r8poK3VrCK3UgOq/EOvKIWl7giP+k1JuC
 6DdVYhlNYIVFXUNSLFAwz8OkLu8byx7WDm36iEqrKHtPw+8qa/2bWVgOU6OBgpjV
 cN3edFVIdxvZeMgM5Ubq
 =xCBG
 -----END PGP SIGNATURE-----

Merge tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:

 - Replace pcommit with ADR / directed-flushing.

   The pcommit instruction, which has not shipped on any product, is
   deprecated.  Instead, the requirement is that platforms implement
   either ADR, or provide one or more flush addresses per nvdimm.

   ADR (Asynchronous DRAM Refresh) flushes data in posted write buffers
   to the memory controller on a power-fail event.

   Flush addresses are defined in ACPI 6.x as an NVDIMM Firmware
   Interface Table (NFIT) sub-structure: "Flush Hint Address Structure".
   A flush hint is an mmio address that when written and fenced assures
   that all previous posted writes targeting a given dimm have been
   flushed to media.

 - On-demand ARS (address range scrub).

   Linux uses the results of the ACPI ARS commands to track bad blocks
   in pmem devices.  When latent errors are detected we re-scrub the
   media to refresh the bad block list, userspace can also request a
   re-scrub at any time.

 - Support for the Microsoft DSM (device specific method) command
   format.

 - Support for EDK2/OVMF virtual disk device memory ranges.

 - Various fixes and cleanups across the subsystem.

* tag 'libnvdimm-for-4.8' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (41 commits)
  libnvdimm-btt: Delete an unnecessary check before the function call "__nd_device_register"
  nfit: do an ARS scrub on hitting a latent media error
  nfit: move to nfit/ sub-directory
  nfit, libnvdimm: allow an ARS scrub to be triggered on demand
  libnvdimm: register nvdimm_bus devices with an nd_bus driver
  pmem: clarify a debug print in pmem_clear_poison
  x86/insn: remove pcommit
  Revert "KVM: x86: add pcommit support"
  nfit, tools/testing/nvdimm/: unify shutdown paths
  libnvdimm: move ->module to struct nvdimm_bus_descriptor
  nfit: cleanup acpi_nfit_init calling convention
  nfit: fix _FIT evaluation memory leak + use after free
  tools/testing/nvdimm: add manufacturing_{date|location} dimm properties
  tools/testing/nvdimm: add virtual ramdisk range
  acpi, nfit: treat virtual ramdisk SPA as pmem region
  pmem: kill __pmem address space
  pmem: kill wmb_pmem()
  libnvdimm, pmem: use nvdimm_flush() for namespace I/O writes
  fs/dax: remove wmb_pmem()
  libnvdimm, pmem: flush posted-write queues on shutdown
  ...
2016-07-28 17:38:16 -07:00
Linus Torvalds
85802a49a8 This merge introduces three patches that are later reverted,
- Switching of MSR_TSC_AUX in SVM was thought to cause a host
    misbehavior, but it was later cleared of those doubts and the patch
    moved code to a hot path, so we reverted it.  That patch also needed
    a fix for 32 bit builds and both were reverted in one go.
 
  - Al Viro noticed that a fix for a leak in an error path was not valid
    with the given API and provided a better fix, so the original patch
    was reverted.
 
 Then there are two VMX fixes that move code around because VMCS was not
 accessed between vcpu_load() and vcpu_put(), a simple ARM VHE fix, and
 two one-liners for PML and MTRR.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABCAAGBQJXljRwAAoJEED/6hsPKofoIZMIAIMm2h5HKplmpT007dVCt1zw
 dG8gO9hxOxstXfGVkNEZvdxyUb0ilFMO5AYySS1ctENpswVAZlKWlyc+aNGsHXOS
 KFylAUNHlibua1xgB64Sitgub8M9Ct5mDfvvqWL79aCgHTLcDxnb/0NprTqB2P3O
 TfsLaiKMDOeZs4nTcs62vNqpPJzoFc6DK2x1RltFGF9RpR7bOD7gnp7KypDWJx7S
 1LleWPHboxHQ40qf8dxAb7HwEARfXndlP6ZoCkf2stoWTwuexHJfesUnsNgEuXnX
 6YJ9mO7np/bHfSDpGMJbb9pPI5g7UDwOzmgvYQvzhak3LRmvjsZePpWchlb0yCs=
 =3VD4
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull KVM leftovers from Radim Krčmář:
 "This is a combination of two pull requests for 4.7-rc8 that were not
  merged due to looking hairy.  I have changed the tag message to focus
  on circumstances of contained reverts as they were likely the reason
  behind rejection.

  This merge introduces three patches that are later reverted,

   - Switching of MSR_TSC_AUX in SVM was thought to cause a host
     misbehavior, but it was later cleared of those doubts and the patch
     moved code to a hot path, so we reverted it.  That patch also
     needed a fix for 32 bit builds and both were reverted in one go.

   - Al Viro noticed that a fix for a leak in an error path was not
     valid with the given API and provided a better fix, so the original
     patch was reverted.

  Then there are two VMX fixes that move code around because VMCS was
  not accessed between vcpu_load() and vcpu_put(), a simple ARM VHE fix,
  and two one-liners for PML and MTRR"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  arm64: KVM: VHE: Context switch MDSCR_EL1
  KVM: VMX: handle PML full VMEXIT that occurs during event delivery
  Revert "KVM: SVM: fix trashing of MSR_TSC_AUX"
  KVM: SVM: do not set MSR_TSC_AUX on 32-bit builds
  KVM: don't use anon_inode_getfd() before possible failures
  Revert "KVM: release anon file in failure path of vm creation"
  KVM: release anon file in failure path of vm creation
  KVM: nVMX: Fix memory corruption when using VMCS shadowing
  kvm: vmx: ensure VMCS is current while enabling PML
  KVM: SVM: fix trashing of MSR_TSC_AUX
  KVM: MTRR: fix kvm_mtrr_check_gfn_range_consistency page fault
2016-07-26 11:50:42 -07:00
Dan Williams
0606263f24 Merge branch 'for-4.8/libnvdimm' into libnvdimm-for-next 2016-07-24 08:05:44 -07:00
Dan Williams
dfa169bbee Revert "KVM: x86: add pcommit support"
This reverts commit 8b3e34e46a.

Given the deprecation of the pcommit instruction, the relevant VMX
features and CPUID bits are not going to be rolled into the SDM.  Remove
their usage from KVM.

Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2016-07-23 11:04:23 -07:00
Eric Auger
d9565a7399 KVM: Move kvm_setup_default/empty_irq_routing declaration in arch specific header
kvm_setup_default_irq_routing and kvm_setup_empty_irq_routing are
not used by generic code. So let's move the declarations in x86 irq.h
header instead of kvm_host.h.

Signed-off-by: Eric Auger <eric.auger@redhat.com>
Suggested-by: Andre Przywara <andre.przywara@arm.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
2016-07-22 18:52:00 +01:00
Cao, Lei
b244c9fc25 KVM: VMX: handle PML full VMEXIT that occurs during event delivery
With PML enabled, guest will shut down if a PML full VMEXIT occurs during
event delivery. According to Intel SDM 27.2.3, PML full VMEXIT can occur when
event is being delivered through IDT, so KVM should not exit to user space
with error. Instead, it should let EXIT_REASON_PML_FULL go through and the
event will be re-injected on the next VMENTRY.

Signed-off-by: Lei Cao <lei.cao@stratus.com>
Cc: stable@vger.kernel.org
Fixes: 843e433057 ("KVM: VMX: Add PML support in VMX")
[Shortened the summary and Cc'd stable.]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
2016-07-16 15:27:40 +02:00
Radim Krčmář
6a907cd0ad Revert "KVM: SVM: fix trashing of MSR_TSC_AUX"
This reverts commit 9770404a00.

The reverted patch is not needed as only userspace uses RDTSCP and
MSR_TSC_AUX is in host_save_user_msrs[] and therefore properly saved in
svm_vcpu_load() and restored in svm_vcpu_put() before every switch to
userspace.

The reverted patch did not allow the kernel to use RDTSCP in the future,
because of missed trashing in svm_set_msr() and 64-bit ifdef.

This reverts commit 2b23c3a6e3.

2b23c3a6e3 ("KVM: SVM: do not set MSR_TSC_AUX on 32-bit builds") is a
build fix for 9770404a00 and reverting them separately would only
break more bisections.

Cc: stable@vger.kernel.org
2016-07-16 15:26:54 +02:00
Sebastian Andrzej Siewior
251a5fd64b x86/kvm/kvmclock: Convert to hotplug state machine
Install the callbacks via the state machine and let the core invoke
the callbacks on the already online CPUs.

We assumed that the priority ordering was ment to invoke the online
callback as the last step. In the original code this also invoked the
down prepare callback as the last step. With the symmetric state
machine the down prepare callback is now the first step.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Gleb Natapov <gleb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153335.542880859@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:40:21 +02:00
Anna-Maria Gleixner
162e52a117 KVM/x86: Remove superfluous SMP function call
Since the following commit:

  1cf4f629d9 ("cpu/hotplug: Move online calls to hotplugged cpu")

... the CPU_ONLINE and CPU_DOWN_PREPARE notifiers are always run on the hot
plugged CPU, and as of commit:

  3b9d6da67e ("cpu/hotplug: Fix rollback during error-out in __cpu_disable()")

the CPU_DOWN_FAILED notifier also runs on the hot plugged CPU.  This patch
converts the SMP functional calls into direct calls.

smp_function_call_single() executes the function with interrupts
disabled. This calling convention is not preserved because there
is no reason to do so.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Cc: rt@linutronix.de
Link: http://lkml.kernel.org/r/20160713153335.452527104@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-15 10:40:21 +02:00
Paolo Bonzini
2b23c3a6e3 KVM: SVM: do not set MSR_TSC_AUX on 32-bit builds
This is unnecessary---and besides, __getcpu() is not even
available on 32-bit builds.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 19:11:29 +02:00
Jim Mattson
2f1fe81123 KVM: nVMX: Fix memory corruption when using VMCS shadowing
When freeing the nested resources of a vcpu, there is an assumption that
the vcpu's vmcs01 is the current VMCS on the CPU that executes
nested_release_vmcs12(). If this assumption is violated, the vcpu's
vmcs01 may be made active on multiple CPUs at the same time, in
violation of Intel's specification. Moreover, since the vcpu's vmcs01 is
not VMCLEARed on every CPU on which it is active, it can linger in a
CPU's VMCS cache after it has been freed and potentially
repurposed. Subsequent eviction from the CPU's VMCS cache on a capacity
miss can result in memory corruption.

It is not sufficient for vmx_free_vcpu() to call vmx_load_vmcs01(). If
the vcpu in question was last loaded on a different CPU, it must be
migrated to the current CPU before calling vmx_load_vmcs01().

Signed-off-by: Jim Mattson <jmattson@google.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 19:11:20 +02:00
Peter Feiner
4e59516a12 kvm: vmx: ensure VMCS is current while enabling PML
Between loading the new VMCS and enabling PML, the CPU was unpinned.
If the vCPU thread were migrated to another CPU in the interim (e.g.,
due to preemption or sleeping alloc_page), then the VMWRITEs to enable
PML would target the wrong VMCS -- or no VMCS at all:

  [ 2087.266950] vmwrite error: reg 200e value 3fe1d52000 (err -506126336)
  [ 2087.267062] vmwrite error: reg 812 value 1ff (err 511)
  [ 2087.267125] vmwrite error: reg 401e value 12229c00 (err 304258048)

This patch ensures that the VMCS remains current while enabling PML by
doing the VMWRITEs while the CPU is pinned. Allocation of the PML buffer
is hoisted out of the critical section.

Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 19:11:19 +02:00
Paolo Bonzini
9770404a00 KVM: SVM: fix trashing of MSR_TSC_AUX
I don't know what I was thinking when I wrote commit 46896c73c1 ("KVM:
svm: add support for RDTSCP", 2015-11-12); I missed write_rdtscp_aux which
obviously uses MSR_TSC_AUX.

Therefore we do need to save/restore MSR_TSC_AUX in svm_vcpu_run.

Cc: stable@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Fixes: 46896c73c1 ("KVM: svm: add support for RDTSCP")
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 19:11:10 +02:00
Paul Gortmaker
1767e931e3 x86/kvm: Audit and remove any unnecessary uses of module.h
Historically a lot of these existed because we did not have
a distinction between what was modular code and what was providing
support to modules via EXPORT_SYMBOL and friends.  That changed
when we forked out support for the latter into the export.h file.

This means we should be able to reduce the usage of module.h
in code that is obj-y Makefile or bool Kconfig.  In the case of
kvm where it is modular, we can extend that to also include files
that are building basic support functionality but not related
to loading or registering the final module; such files also have
no need whatsoever for module.h

The advantage in removing such instances is that module.h itself
sources about 15 other headers; adding significantly to what we feed
cpp, and it can obscure what headers we are effectively using.

Since module.h was the source for init.h (for __init) and for
export.h (for EXPORT_SYMBOL) we consider each instance for the
presence of either and replace as needed.

Several instances got replaced with moduleparam.h since that was
really all that was required for those particular files.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm@vger.kernel.org
Link: http://lkml.kernel.org/r/20160714001901.31603-8-paul.gortmaker@windriver.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-14 15:07:00 +02:00
Radim Krčmář
af1bae5497 KVM: x86: bump KVM_MAX_VCPU_ID to 1023
kzalloc was replaced with kvm_kvzalloc to allow non-contiguous areas and
rcu had to be modified to cope with it.

The practical limit for KVM_MAX_VCPU_ID right now is INT_MAX, but lower
value was chosen in case there were bugs.  1023 is sufficient maximum
APIC ID for 288 VCPUs.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:29:35 +02:00
Radim Krčmář
c519265f2a KVM: x86: add a flag to disable KVM x2apic broadcast quirk
Add KVM_X2APIC_API_DISABLE_BROADCAST_QUIRK as a feature flag to
KVM_CAP_X2APIC_API.

The quirk made KVM interpret 0xff as a broadcast even in x2APIC mode.
The enableable capability is needed in order to support standard x2APIC and
remain backward compatible.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Expand kvm_apic_mda comment. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:29:34 +02:00
Radim Krčmář
3713131345 KVM: x86: add KVM_CAP_X2APIC_API
KVM_CAP_X2APIC_API is a capability for features related to x2APIC
enablement.  KVM_X2APIC_API_32BIT_FORMAT feature can be enabled to
extend APIC ID in get/set ioctl and MSI addresses to 32 bits.
Both are needed to support x2APIC.

The feature has to be enableable and disabled by default, because
get/set ioctl shifted and truncated APIC ID to 8 bits by using a
non-standard protocol inspired by xAPIC and the change is not
backward-compatible.

Changes to MSI addresses follow the format used by interrupt remapping
unit.  The upper address word, that used to be 0, contains upper 24 bits
of the LAPIC address in its upper 24 bits.  Lower 8 bits are reserved as
0.  Using the upper address word is not backward-compatible either as we
didn't check that userspace zeroed the word.  Reserved bits are still
not explicitly checked, but non-zero data will affect LAPIC addresses,
which will cause a bug.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:57 +02:00
Radim Krčmář
c63cf538eb KVM: pass struct kvm to kvm_set_routing_entry
Arch-specific code will use it.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:56 +02:00
Radim Krčmář
4d8e772bf8 KVM: x86: reset lapic base in kvm_lapic_reset
LAPIC is reset in xAPIC mode and the surrounding code expects that.
KVM never resets after initialization.  This patch is just for sanity.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:56 +02:00
Radim Krčmář
c93de59dcd KVM: VMX: optimize APIC ID read with APICv
The register is in hardware-compatible format now, so there is not need
to intercept.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:55 +02:00
Radim Krčmář
49bd29ba1d KVM: x86: reset APIC ID when enabling LAPIC
APIC ID should be set to the initial APIC ID when enabling LAPIC.
This only matters if the guest changes APIC ID.  No sane OS does that.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:54 +02:00
Radim Krčmář
a92e2543d6 KVM: x86: use hardware-compatible format for APIC ID register
We currently always shift APIC ID as if APIC was in xAPIC mode.
x2APIC mode wants to use more bits and storing a hardware-compabible
value is the the sanest option.

KVM API to set the lapic expects that bottom 8 bits of APIC ID are in
top 8 bits of APIC_ID register, so the register needs to be shifted in
x2APIC mode.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:54 +02:00
Radim Krčmář
3159d36ad7 KVM: x86: use generic function for MSI parsing
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:53 +02:00
Radim Krčmář
0ca52e7b81 KVM: x86: dynamic kvm_apic_map
x2APIC supports up to 2^32-1 LAPICs, but most guest in coming years will
probably has fewer VCPUs.  Dynamic size saves memory at the cost of
turning one constant into a variable.

apic_map mutex had to be moved before allocation to avoid races with cpu
hotplug.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:53 +02:00
Radim Krčmář
e45115b62f KVM: x86: use physical LAPIC array for logical x2APIC
Logical x2APIC IDs map injectively to physical x2APIC IDs, so we can
reuse the physical array for them.  This allows us to save space by
sizing the logical maps according to the needs of xAPIC.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:52 +02:00
Radim Krčmář
64aa47bfc4 KVM: x86: add kvm_apic_map_get_dest_lapic
kvm_irq_delivery_to_apic_fast and kvm_intr_is_single_vcpu_fast both
compute the interrupt destination.  Factor the code.

'struct kvm_lapic **dst = NULL' had to be added to silence GCC.
GCC might complain about potential NULL access in the future, because it
missed conditions that avoided uninitialized uses of dst.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:51 +02:00
Bandan Das
02120c45b0 kvm: vmx: advertise support for ept execute only
MMU now knows about execute only mappings, so
advertise the feature to L1 hypervisors

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:50 +02:00
Bandan Das
d95c55687e kvm: mmu: track read permission explicitly for shadow EPT page tables
To support execute only mappings on behalf of L1 hypervisors,
reuse ACC_USER_MASK to signify if the L1 hypervisor has the R bit
set.

For the nested EPT case, we assumed that the U bit was always set
since there was no equivalent in EPT page tables.  Strictly
speaking, this was not necessary because handle_ept_violation
never set PFERR_USER_MASK in the error code (uf=0 in the
parlance of update_permission_bitmask).  We now have to set
both U and UF correctly, respectively in FNAME(gpte_access)
and in handle_ept_violation.

Also in handle_ept_violation bit 3 of the exit qualification is
not enough to detect a present PTE; all three bits 3-5 have to
be checked.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:50 +02:00
Bandan Das
ffb128c89b kvm: mmu: don't set the present bit unconditionally
To support execute only mappings on behalf of L1
hypervisors, we need to teach set_spte() to honor all three of
L1's XWR bits.  As a start, add a new variable "shadow_present_mask"
that will be set for non-EPT shadow paging and clear for EPT.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:03:14 +02:00
Bandan Das
812f30b234 kvm: mmu: remove is_present_gpte()
We have two versions of the above function.
To prevent confusion and bugs in the future, remove
the non-FNAME version entirely and replace all calls
with the actual check.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:02:47 +02:00
Bandan Das
8d5cf1610d kvm: mmu: extend the is_present check to 32 bits
This is safe because this function is called
on host controlled page table and non-present/non-MMIO
sptes never use bits 1..31. For the EPT case, this
ensures that cases where only the execute bit is set
is marked valid.

Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-14 09:02:46 +02:00
Paolo Bonzini
8391ce447f KVM: VMX: introduce vm_{entry,exit}_control_reset_shadow
There is no reason to read the entry/exit control fields of the
VMCS and immediately write back the same value.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11 10:07:49 +02:00
Paolo Bonzini
9314006db8 KVM: nVMX: keep preemption timer enabled during L2 execution
Because the vmcs12 preemption timer is emulated through a separate hrtimer,
we can keep on using the preemption timer in the vmcs02 to emulare L1's
TSC deadline timer.

However, the corresponding bit in the pin-based execution control field
must be kept consistent between vmcs01 and vmcs02.  On vmentry we copy
it into the vmcs02; on vmexit the preemption timer must be disabled in
the vmcs01 if a preemption timer vmexit happened while in guest mode.

The preemption timer value in the vmcs02 is set by vmx_vcpu_run, so it
need not be considered in prepare_vmcs02.

Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Tested-by: Wanpeng Li <kernellwp@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11 09:49:29 +02:00
Wanpeng Li
55123e3c86 KVM: nVMX: avoid incorrect preemption timer vmexit in nested guest
The preemption timer for nested VMX is emulated by hrtimer which is started on L2
entry, stopped on L2 exit and evaluated via the check_nested_events hook. However,
nested_vmx_exit_handled is always returning true for preemption timer vmexit.  Then,
the L1 preemption timer vmexit is captured and be treated as a L2 preemption
timer vmexit, causing NULL pointer dereferences or worse in the L1 guest's
vmexit handler:

    BUG: unable to handle kernel NULL pointer dereference at           (null)
    IP: [<          (null)>]           (null)
    PGD 0
    Oops: 0010 [#1] SMP
    Call Trace:
     ? kvm_lapic_expired_hv_timer+0x47/0x90 [kvm]
     handle_preemption_timer+0xe/0x20 [kvm_intel]
     vmx_handle_exit+0x169/0x15a0 [kvm_intel]
     ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
     kvm_arch_vcpu_ioctl_run+0xdee/0x19d0 [kvm]
     ? kvm_arch_vcpu_ioctl_run+0xd5d/0x19d0 [kvm]
     ? vcpu_load+0x1c/0x60 [kvm]
     ? kvm_arch_vcpu_load+0x57/0x260 [kvm]
     kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
     do_vfs_ioctl+0x96/0x6a0
     ? __fget_light+0x2a/0x90
     SyS_ioctl+0x79/0x90
     do_syscall_64+0x68/0x180
     entry_SYSCALL64_slow_path+0x25/0x25
    Code:  Bad RIP value.
    RIP  [<          (null)>]           (null)
     RSP <ffff8800b5263c48>
    CR2: 0000000000000000
    ---[ end trace 9c70c48b1a2bc66e ]---

This can be reproduced readily by preemption timer enabled on L0 and disabled
on L1.

Return false since preemption timer vmexits must never be reflected to L2.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Cc: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11 09:49:26 +02:00
Paolo Bonzini
1c17c3e6bf KVM: VMX: reflect broken preemption timer in vmcs_config
Simplify cpu_has_vmx_preemption_timer.  This is consistent with the
rest of setup_vmcs_config and preparatory for the next patch.

Tested-by: Wanpeng Li <kernellwp@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-11 09:48:49 +02:00
Ingo Molnar
08fb98f5bf Merge branch 'linus' into x86/fpu, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2016-07-10 17:11:17 +02:00
Alexis Dambricourt
30b072ce03 KVM: MTRR: fix kvm_mtrr_check_gfn_range_consistency page fault
The following #PF may occurs:
[ 1403.317041] BUG: unable to handle kernel paging request at 0000000200000068
[ 1403.317045] IP: [<ffffffffc04c20b0>] __mtrr_lookup_var_next+0x10/0xa0 [kvm]

[ 1403.317123] Call Trace:
[ 1403.317134]  [<ffffffffc04c2a65>] ? kvm_mtrr_check_gfn_range_consistency+0xc5/0x120 [kvm]
[ 1403.317143]  [<ffffffffc04ac11f>] ? tdp_page_fault+0x9f/0x2c0 [kvm]
[ 1403.317152]  [<ffffffffc0498128>] ? kvm_set_msr_common+0x858/0xc00 [kvm]
[ 1403.317161]  [<ffffffffc04b8883>] ? x86_emulate_insn+0x273/0xd30 [kvm]
[ 1403.317171]  [<ffffffffc04c04e4>] ? kvm_cpuid+0x34/0x190 [kvm]
[ 1403.317180]  [<ffffffffc04a5bb9>] ? kvm_mmu_page_fault+0x59/0xe0 [kvm]
[ 1403.317183]  [<ffffffffc0d729e1>] ? vmx_handle_exit+0x1d1/0x14a0 [kvm_intel]
[ 1403.317185]  [<ffffffffc0d75f3f>] ? atomic_switch_perf_msrs+0x6f/0xa0 [kvm_intel]
[ 1403.317187]  [<ffffffffc0d7621d>] ? vmx_vcpu_run+0x2ad/0x420 [kvm_intel]
[ 1403.317196]  [<ffffffffc04a0962>] ? kvm_arch_vcpu_ioctl_run+0x622/0x1550 [kvm]
[ 1403.317204]  [<ffffffffc049abb9>] ? kvm_arch_vcpu_load+0x59/0x210 [kvm]
[ 1403.317206]  [<ffffffff81036245>] ? __kernel_fpu_end+0x35/0x100
[ 1403.317213]  [<ffffffffc0487eb6>] ? kvm_vcpu_ioctl+0x316/0x5d0 [kvm]
[ 1403.317215]  [<ffffffff81088225>] ? do_sigtimedwait+0xd5/0x220
[ 1403.317217]  [<ffffffff811f84dd>] ? do_vfs_ioctl+0x9d/0x5c0
[ 1403.317224]  [<ffffffffc04928ae>] ? kvm_on_user_return+0x3e/0x70 [kvm]
[ 1403.317225]  [<ffffffff811f8a74>] ? SyS_ioctl+0x74/0x80
[ 1403.317227]  [<ffffffff815bf0b6>] ? entry_SYSCALL_64_fastpath+0x1e/0xa8
[ 1403.317242] RIP  [<ffffffffc04c20b0>] __mtrr_lookup_var_next+0x10/0xa0 [kvm]

At mtrr_lookup_fixed_next(), when the condition
'if (iter->index >= ARRAY_SIZE(iter->mtrr_state->fixed_ranges))' becomes true,
mtrr_lookup_var_start() is called with iter->range with gargabe values from the
fixed MTRR union field. Then, list_prepare_entry() do not call list_entry()
initialization, keeping a garbage pointer in iter->range which is accessed in
the following __mtrr_lookup_var_next() call.

Fixes: f571c0973e
Signed-off-by: Alexis Dambricourt <alexis@blade-group.com>
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-05 16:14:43 +02:00
Wei Yongjun
03f6a22a39 KVM: x86: Use ARRAY_SIZE instead of dividing sizeof array with sizeof an element
Use ARRAY_SIZE instead of dividing sizeof array with sizeof an element

Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-05 14:41:45 +02:00
Wanpeng Li
196f20ca52 KVM: vmx: fix missed cancellation of TSC deadline timer
INFO: rcu_sched detected stalls on CPUs/tasks:
 1-...: (11800 GPs behind) idle=45d/140000000000000/0 softirq=0/0 fqs=21663
 (detected by 0, t=65016 jiffies, g=11500, c=11499, q=719)
Task dump for CPU 1:
qemu-system-x86 R  running task        0  3529   3525 0x00080808
 ffff8802021791a0 ffff880212895040 0000000000000001 00007f1c2c00db40
 ffff8801dd20fcd3 ffffc90002b98000 ffff8801dd20fc88 ffff8801dd20fcf8
 0000000000000286 ffff8801dd2ac538 ffff8801dd20fcc0 ffffffffc06949c9
Call Trace:
? kvm_write_guest_cached+0xb9/0x160 [kvm]
? __delay+0xf/0x20
? wait_lapic_expire+0x14a/0x200 [kvm]
? kvm_arch_vcpu_ioctl_run+0xcbe/0x1b00 [kvm]
? kvm_arch_vcpu_ioctl_run+0xe34/0x1b00 [kvm]
? kvm_vcpu_ioctl+0x2d3/0x7c0 [kvm]
? __fget+0x5/0x210
? do_vfs_ioctl+0x96/0x6a0
? __fget_light+0x2a/0x90
? SyS_ioctl+0x79/0x90
? do_syscall_64+0x7c/0x1e0
? entry_SYSCALL64_slow_path+0x25/0x25

This can be reproduced readily by running a full dynticks guest(since hrtimer
in guest is heavily used) w/ lapic_timer_advance disabled.

If fail to program hardware preemption timer, we will fallback to hrtimer based
method, however, a previous programmed preemption timer miss to cancel in this
scenario which results in one hardware preemption timer and one hrtimer emulated
tsc deadline timer run simultaneously. So sometimes the target guest deadline
tsc is earlier than guest tsc, which leads to the computation in vmx_set_hv_timer
can underflow and cause delta_tsc to be set a huge value, then host soft lockup
as above.

This patch fix it by cancelling the previous programmed preemption timer if there
is once we failed to program the new preemption timer and fallback to hrtimer
based method.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:42 +02:00
Wanpeng Li
bd97ad0e7e KVM: x86: introduce cancel_hv_tscdeadline
Introduce cancel_hv_tscdeadline() to encapsulate preemption
timer cancel stuff.

Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:41 +02:00
Paolo Bonzini
9175d2e97b KVM: vmx: fix underflow in TSC deadline calculation
If the TSC deadline timer is programmed really close to the deadline or
even in the past, the computation in vmx_set_hv_timer can underflow and
cause delta_tsc to be set to a huge value.  This generally results
in vmx_set_hv_timer returning -ERANGE, but we can fix it by limiting
delta_tsc to be positive or zero.

Reported-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:39 +02:00
Paolo Bonzini
f2485b3e0c KVM: x86: use guest_exit_irqoff
This gains a few clock cycles per vmexit.  On Intel there is no need
anymore to enable the interrupts in vmx_handle_external_intr, since
we are using the "acknowledge interrupt on exit" feature.  AMD
needs to do that, and must be careful to avoid the interrupt shadow.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:38 +02:00
Paolo Bonzini
91fa0f8e9e KVM: x86: always use "acknowledge interrupt on exit"
This is necessary to simplify handle_external_intr in the next patch.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:36 +02:00
Paolo Bonzini
6edaa5307f KVM: remove kvm_guest_enter/exit wrappers
Use the functions from context_tracking.h directly.

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-07-01 11:03:21 +02:00
Quentin Casasnovas
ff30ef40de KVM: nVMX: VMX instructions: fix segment checks when L1 is in long mode.
I couldn't get Xen to boot a L2 HVM when it was nested under KVM - it was
getting a GP(0) on a rather unspecial vmread from Xen:

     (XEN) ----[ Xen-4.7.0-rc  x86_64  debug=n  Not tainted ]----
     (XEN) CPU:    1
     (XEN) RIP:    e008:[<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
     (XEN) RFLAGS: 0000000000010202   CONTEXT: hypervisor (d1v0)
     (XEN) rax: ffff82d0801e6288   rbx: ffff83003ffbfb7c   rcx: fffffffffffab928
     (XEN) rdx: 0000000000000000   rsi: 0000000000000000   rdi: ffff83000bdd0000
     (XEN) rbp: ffff83000bdd0000   rsp: ffff83003ffbfab0   r8:  ffff830038813910
     (XEN) r9:  ffff83003faf3958   r10: 0000000a3b9f7640   r11: ffff83003f82d418
     (XEN) r12: 0000000000000000   r13: ffff83003ffbffff   r14: 0000000000004802
     (XEN) r15: 0000000000000008   cr0: 0000000080050033   cr4: 00000000001526e0
     (XEN) cr3: 000000003fc79000   cr2: 0000000000000000
     (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: 0000   cs: e008
     (XEN) Xen code around <ffff82d0801e629e> (vmx_get_segment_register+0x14e/0x450):
     (XEN)  00 00 41 be 02 48 00 00 <44> 0f 78 74 24 08 0f 86 38 56 00 00 b8 08 68 00
     (XEN) Xen stack trace from rsp=ffff83003ffbfab0:

     ...

     (XEN) Xen call trace:
     (XEN)    [<ffff82d0801e629e>] vmx_get_segment_register+0x14e/0x450
     (XEN)    [<ffff82d0801f3695>] get_page_from_gfn_p2m+0x165/0x300
     (XEN)    [<ffff82d0801bfe32>] hvmemul_get_seg_reg+0x52/0x60
     (XEN)    [<ffff82d0801bfe93>] hvm_emulate_prepare+0x53/0x70
     (XEN)    [<ffff82d0801ccacb>] handle_mmio+0x2b/0xd0
     (XEN)    [<ffff82d0801be591>] emulate.c#_hvm_emulate_one+0x111/0x2c0
     (XEN)    [<ffff82d0801cd6a4>] handle_hvm_io_completion+0x274/0x2a0
     (XEN)    [<ffff82d0801f334a>] __get_gfn_type_access+0xfa/0x270
     (XEN)    [<ffff82d08012f3bb>] timer.c#add_entry+0x4b/0xb0
     (XEN)    [<ffff82d08012f80c>] timer.c#remove_entry+0x7c/0x90
     (XEN)    [<ffff82d0801c8433>] hvm_do_resume+0x23/0x140
     (XEN)    [<ffff82d0801e4fe7>] vmx_do_resume+0xa7/0x140
     (XEN)    [<ffff82d080164aeb>] context_switch+0x13b/0xe40
     (XEN)    [<ffff82d080128e6e>] schedule.c#schedule+0x22e/0x570
     (XEN)    [<ffff82d08012c0cc>] softirq.c#__do_softirq+0x5c/0x90
     (XEN)    [<ffff82d0801602c5>] domain.c#idle_loop+0x25/0x50
     (XEN)
     (XEN)
     (XEN) ****************************************
     (XEN) Panic on CPU 1:
     (XEN) GENERAL PROTECTION FAULT
     (XEN) [error_code=0000]
     (XEN) ****************************************

Tracing my host KVM showed it was the one injecting the GP(0) when
emulating the VMREAD and checking the destination segment permissions in
get_vmx_mem_address():

     3)               |    vmx_handle_exit() {
     3)               |      handle_vmread() {
     3)               |        nested_vmx_check_permission() {
     3)               |          vmx_get_segment() {
     3)   0.074 us    |            vmx_read_guest_seg_base();
     3)   0.065 us    |            vmx_read_guest_seg_selector();
     3)   0.066 us    |            vmx_read_guest_seg_ar();
     3)   1.636 us    |          }
     3)   0.058 us    |          vmx_get_rflags();
     3)   0.062 us    |          vmx_read_guest_seg_ar();
     3)   3.469 us    |        }
     3)               |        vmx_get_cs_db_l_bits() {
     3)   0.058 us    |          vmx_read_guest_seg_ar();
     3)   0.662 us    |        }
     3)               |        get_vmx_mem_address() {
     3)   0.068 us    |          vmx_cache_reg();
     3)               |          vmx_get_segment() {
     3)   0.074 us    |            vmx_read_guest_seg_base();
     3)   0.068 us    |            vmx_read_guest_seg_selector();
     3)   0.071 us    |            vmx_read_guest_seg_ar();
     3)   1.756 us    |          }
     3)               |          kvm_queue_exception_e() {
     3)   0.066 us    |            kvm_multiple_exception();
     3)   0.684 us    |          }
     3)   4.085 us    |        }
     3)   9.833 us    |      }
     3) + 10.366 us   |    }

Cross-checking the KVM/VMX VMREAD emulation code with the Intel Software
Developper Manual Volume 3C - "VMREAD - Read Field from Virtual-Machine
Control Structure", I found that we're enforcing that the destination
operand is NOT located in a read-only data segment or any code segment when
the L1 is in long mode - BUT that check should only happen when it is in
protected mode.

Shuffling the code a bit to make our emulation follow the specification
allows me to boot a Xen dom0 in a nested KVM and start HVM L2 guests
without problems.

Fixes: f9eb4af67c ("KVM: nVMX: VMX instructions: add checks for #GP/#SS exceptions")
Signed-off-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Eugene Korenevsky <ekorenevsky@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: linux-stable <stable@vger.kernel.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-27 15:30:44 +02:00
Marcelo Tosatti
b606f189c7 KVM: LAPIC: cap __delay at lapic_timer_advance_ns
The host timer which emulates the guest LAPIC TSC deadline
timer has its expiration diminished by lapic_timer_advance_ns
nanoseconds. Therefore if, at wait_lapic_expire, a difference
larger than lapic_timer_advance_ns is encountered, delay at most
lapic_timer_advance_ns.

This fixes a problem where the guest can cause the host
to delay for large amounts of time.

Reported-by: Alan Jenkins <alan.christopher.jenkins@gmail.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-27 15:30:41 +02:00
Marcelo Tosatti
8d93c874ac KVM: x86: move nsec_to_cycles from x86.c to x86.h
Move the inline function nsec_to_cycles from x86.c to x86.h, as
the next patch uses it from lapic.c.

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-27 15:30:38 +02:00
Arnd Bergmann
87aeb54f1b kvm: x86: use getboottime64
KVM reads the current boottime value as a struct timespec in order to
calculate the guest wallclock time, resulting in an overflow in 2038
on 32-bit systems.

The data then gets passed as an unsigned 32-bit number to the guest,
and that in turn overflows in 2106.

We cannot do much about the second overflow, which affects both 32-bit
and 64-bit hosts, but we can ensure that they both behave the same
way and don't overflow until 2106, by using getboottime64() to read
a timespec64 value.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:30 +02:00
Ashok Raj
c45dcc71b7 KVM: VMX: enable guest access to LMCE related MSRs
On Intel platforms, this patch adds LMCE to KVM MCE supported
capabilities and handles guest access to LMCE related MSRs.

Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[Haozhong: macro KVM_MCE_CAP_SUPPORTED => variable kvm_mce_cap_supported
           Only enable LMCE on Intel platform
           Check MSR_IA32_FEATURE_CONTROL when handling guest
             access to MSR_IA32_MCG_EXT_CTL]
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:29 +02:00
Haozhong Zhang
37e4c997da KVM: VMX: validate individual bits of guest MSR_IA32_FEATURE_CONTROL
KVM currently does not check the value written to guest
MSR_IA32_FEATURE_CONTROL, though bits corresponding to disabled features
may be set. This patch makes KVM to validate individual bits written to
guest MSR_IA32_FEATURE_CONTROL according to enabled features.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:29 +02:00
Haozhong Zhang
3b84080b95 KVM: VMX: move msr_ia32_feature_control to vcpu_vmx
msr_ia32_feature_control will be used for LMCE and not depend only on
nested anymore, so move it from struct nested_vmx to struct vcpu_vmx.

Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-23 19:17:28 +02:00
Yunhong Jiang
64672c95ea kvm: vmx: hook preemption timer support
Hook the VMX preemption timer to the "hv timer" functionality added
by the previous patch.  This includes: checking if the feature is
supported, if the feature is broken on the CPU, the hooks to
setup/clean the VMX preemption timer, arming the timer on vmentry
and handling the vmexit.

A module parameter states if the VMX preemption timer should be
utilized.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
[Move hv_deadline_tsc to struct vcpu_vmx, use -1 as the "unset" value.
 Put all VMX bits here.  Enable it by default #yolo. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:50 +02:00
Yunhong Jiang
bc22512bb2 kvm: vmx: rename vmx_pre/post_block to pi_pre/post_block
Prepare to switch from preemption timer to hrtimer in the
vmx_pre/post_block. Current functions are only for posted interrupt,
rename them accordingly.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:49 +02:00
Yunhong Jiang
ce7a058a21 KVM: x86: support using the vmx preemption timer for tsc deadline timer
The VMX preemption timer can be used to virtualize the TSC deadline timer.
The VMX preemption timer is armed when the vCPU is running, and a VMExit
will happen if the virtual TSC deadline timer expires.

When the vCPU thread is blocked because of HLT, KVM will switch to use
an hrtimer, and then go back to the VMX preemption timer when the vCPU
thread is unblocked.

This solution avoids the complex OS's hrtimer system, and the host
timer interrupt handling cost, replacing them with a little math
(for guest->host TSC and host TSC->preemption timer conversion)
and a cheaper VMexit.  This benefits latency for isolated pCPUs.

[A word about performance... Yunhong reported a 30% reduction in average
 latency from cyclictest.  I made a similar test with tscdeadline_latency
 from kvm-unit-tests, and measured

 - ~20 clock cycles loss (out of ~3200, so less than 1% but still
   statistically significant) in the worst case where the test halts
   just after programming the TSC deadline timer

 - ~800 clock cycles gain (25% reduction in latency) in the best case
   where the test busy waits.

 I removed the VMX bits from Yunhong's patch, to concentrate them in the
 next patch - Paolo]

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:48 +02:00
Yunhong Jiang
53f9eedff7 kvm: lapic: separate start_sw_tscdeadline from start_apic_timer
The function to start the tsc deadline timer virtualization will be used
also by the pre_block hook when we use the preemption timer; change it
to a separate function. No logic changes.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 10:07:46 +02:00
Yang Zhang
a005219162 kvm: vmx: check apicv is active before using VT-d posted interrupt
VT-d posted interrupt is relying on the CPU side's posted interrupt.
Need to check whether VCPU's APICv is active before enabing VT-d
posted interrupt.

Fixes: d62caabb41
Cc: stable@vger.kernel.org
Signed-off-by: Yang Zhang <yang.zhang.wz@gmail.com>
Signed-off-by: Shengge Ding <shengge.dsg@alibaba-inc.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 09:38:24 +02:00
Suravee Suthikulpanit
5b8abf1f33 kvm: svm: Do not support AVIC if not CONFIG_X86_LOCAL_APIC
Add logic to disable AVIC #ifndef CONFIG_X86_LOCAL_APIC.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 00:28:30 +02:00
Suravee Suthikulpanit
7d669f5084 kvm: svm: Fix implicit declaration for __default_cpu_present_to_apicid()
The commit 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
introduces a build error due to implicit function declaration
when #ifdef CONFIG_X86_32 and #ifndef CONFIG_X86_LOCAL_APIC
(as reported by Kbuild test robot i386-randconfig-x0-06121009).

So, this patch introduces kvm_cpu_get_apicid() wrapper
around __default_cpu_present_to_apicid() with additional
handling if CONFIG_X86_LOCAL_APIC is not defined.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Fixes: commit 8221c13700 ("svm: Manage vcpu load/unload when enable AVIC")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 00:28:24 +02:00
Paolo Bonzini
557abc40d1 KVM: remove kvm_vcpu_compatible
The new created_vcpus field makes it possible to avoid the race between
irqchip and VCPU creation in a much nicer way; just check under kvm->lock
whether a VCPU has already been created.

We can then remove KVM_APIC_ARCHITECTURE too, because at this point the
symbol is only governing the default definition of kvm_vcpu_compatible.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-16 00:05:00 +02:00
Andrea Gelmini
bb3541f175 KVM: x86: Fix typos
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2016-06-14 11:16:28 +02:00