linux/arch/x86
Sean Christopherson bd18bffca3 KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail
A VMEnter that VMFails (as opposed to VMExits) does not touch host
state beyond registers that are explicitly noted in the VMFail path,
e.g. EFLAGS.  Host state does not need to be loaded because VMFail
is only signaled for consistency checks that occur before the CPU
starts to load guest state, i.e. there is no need to restore any
state as nothing has been modified.  But in the case where a VMFail
is detected by hardware and not by KVM (due to deferring consistency
checks to hardware), KVM has already loaded some amount of guest
state.  Luckily, "loaded" only means loaded to KVM's software model,
i.e. vmcs01 has not been modified.  So, unwind our software model to
the pre-VMEntry host state.

Not restoring host state in this VMFail path leads to a variety of
failures because we end up with stale data in vcpu->arch, e.g. CR0,
CR4, EFER, etc... will all be out of sync relative to vmcs01.  Any
significant delta in the stale data is all but guaranteed to crash
L1, e.g. emulation of SMEP, SMAP, UMIP, WP, etc... will be wrong.

An alternative to this "soft" reload would be to load host state from
vmcs12 as if we triggered a VMExit (as opposed to VMFail), but that is
wildly inconsistent with respect to the VMX architecture, e.g. an L1
VMM with separate VMExit and VMFail paths would explode.

Note that this approach does not mean KVM is 100% accurate with
respect to VMX hardware behavior, even at an architectural level
(the exact order of consistency checks is microarchitecture specific).
But 100% emulation accuracy isn't the goal (with this patch), rather
the goal is to be consistent in the information delivered to L1, e.g.
a VMExit should not fall-through VMENTER, and a VMFail should not jump
to HOST_RIP.

This technically reverts commit "5af4157388ad (KVM: nVMX: Fix mmu
context after VMLAUNCH/VMRESUME failure)", but retains the core
aspects of that patch, just in an open coded form due to the need to
pull state from vmcs01 instead of vmcs12.  Restoring host state
resolves a variety of issues introduced by commit "4f350c6dbcb9
(kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)",
which remedied the incorrect behavior of treating VMFail like VMExit
but in doing so neglected to restore arch state that had been modified
prior to attempting nested VMEnter.

A sample failure that occurs due to stale vcpu.arch state is a fault
of some form while emulating an LGDT (due to emulated UMIP) from L1
after a failed VMEntry to L3, in this case when running the KVM unit
test test_tpr_threshold_values in L1.  L0 also hits a WARN in this
case due to a stale arch.cr4.UMIP.

L1:
  BUG: unable to handle kernel paging request at ffffc90000663b9e
  PGD 276512067 P4D 276512067 PUD 276513067 PMD 274efa067 PTE 8000000271de2163
  Oops: 0009 [#1] SMP
  CPU: 5 PID: 12495 Comm: qemu-system-x86 Tainted: G        W         4.18.0-rc2+ #2
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:native_load_gdt+0x0/0x10

  ...

  Call Trace:
   load_fixmap_gdt+0x22/0x30
   __vmx_load_host_state+0x10e/0x1c0 [kvm_intel]
   vmx_switch_vmcs+0x2d/0x50 [kvm_intel]
   nested_vmx_vmexit+0x222/0x9c0 [kvm_intel]
   vmx_handle_exit+0x246/0x15a0 [kvm_intel]
   kvm_arch_vcpu_ioctl_run+0x850/0x1830 [kvm]
   kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
   do_vfs_ioctl+0x9f/0x600
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x4f/0x100
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

L0:
  WARNING: CPU: 2 PID: 3529 at arch/x86/kvm/vmx.c:6618 handle_desc+0x28/0x30 [kvm_intel]
  ...
  CPU: 2 PID: 3529 Comm: qemu-system-x86 Not tainted 4.17.2-coffee+ #76
  Hardware name: Intel Corporation Kabylake Client platform/KBL S
  RIP: 0010:handle_desc+0x28/0x30 [kvm_intel]

  ...

  Call Trace:
   kvm_arch_vcpu_ioctl_run+0x863/0x1840 [kvm]
   kvm_vcpu_ioctl+0x3a1/0x5c0 [kvm]
   do_vfs_ioctl+0x9f/0x5e0
   ksys_ioctl+0x66/0x70
   __x64_sys_ioctl+0x16/0x20
   do_syscall_64+0x49/0xf0
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 5af4157388 (KVM: nVMX: Fix mmu context after VMLAUNCH/VMRESUME failure)
Fixes: 4f350c6dbc (kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly)
Cc: Jim Mattson <jmattson@google.com>
Cc: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim KrÄmář <rkrcmar@redhat.com>
Cc: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2018-10-17 00:29:39 +02:00
..
boot Kbuild updates for v4.19 (2nd) 2018-08-25 13:40:38 -07:00
configs
crypto crypto: x86/aegis,morus - Do not require OSXSAVE for SSE2 2018-09-14 14:08:27 +08:00
entry x86/vdso: Fix vDSO build if a retpoline is emitted 2018-08-20 18:04:41 +02:00
events perf/x86/intel: Add support/quirk for the MISPREDICT bit on Knights Landing CPUs 2018-09-10 10:03:01 +02:00
hyperv x86/hyper-v: rename ipi_arg_{ex,non_ex} structures 2018-09-20 00:51:42 +02:00
ia32
include KVM: nVMX: Clear reserved bits of #DB exit qualification 2018-10-17 00:29:39 +02:00
kernel x86/APM: Fix build warning when PROC_FS is not enabled 2018-09-15 10:16:25 +02:00
kvm KVM: nVMX: restore host state in nested_vmx_vmexit for VMFail 2018-10-17 00:29:39 +02:00
lib x86/nmi: Fix NMI uaccess race against CR3 switching 2018-08-31 17:08:22 +02:00
math-emu
mm x86/mm: Use WRITE_ONCE() when setting PTEs 2018-09-08 12:30:36 +02:00
net bpf, x32: Fix regression caused by commit 24dea04767 2018-07-26 02:51:12 +02:00
oprofile
pci PCI: Make early dump functionality generic 2018-06-29 20:06:07 -05:00
platform x86/efi: Load fixmap GDT in efi_call_phys_epilog() before setting %cr3 2018-09-12 21:53:34 +02:00
power Power management updates for 4.19-rc1 2018-08-14 13:12:24 -07:00
purgatory kbuild: move bin2c back to scripts/ from scripts/basic/ 2018-07-18 01:18:05 +09:00
ras
realmode
tools x86/relocs: Add __end_rodata_aligned to S_REL 2018-08-09 20:42:07 +02:00
um Consolidation of Kconfig files by Christoph Hellwig. 2018-08-15 13:05:12 -07:00
video
xen xen: fixes for 4.19-rc2 2018-08-31 08:45:16 -07:00
.gitignore
Kbuild
Kconfig x86/Kconfig: Fix trivial typo 2018-08-27 10:29:14 +02:00
Kconfig.cpu
Kconfig.debug Kconfig: consolidate the "Kernel hacking" menu 2018-08-02 08:06:48 +09:00
Makefile x86: Allow generating user-space headers without a compiler 2018-08-31 17:08:22 +02:00
Makefile_32.cpu
Makefile.um kbuild: rename LDFLAGS to KBUILD_LDFLAGS 2018-08-24 08:22:08 +09:00