The first batch of KVM patches, mostly covering x86, which I am sending
out early because I am travelling next week.

There is a lone mm patch for which Andrew gave an informal ack at
https://lore.kernel.org/linux-mm/20220817102500.440c6d0a3fce296fdf91bea6@linux-foundation.org.

I will send the bulk of the ARM work, as well as other architectures, at
the end of next week.

ARM:

* Account stage2 page table allocations in memory stats.

x86:

* Account EPT/NPT arm64 page table allocations in memory stats.

* Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR accesses.

* Drop eVMCS controls filtering for KVM on Hyper-V, all known versions of
  Hyper-V now support eVMCS fields associated with features that are
  enumerated to the guest.

* Use KVM's sanitized VMCS config as the basis for the values of nested VMX
  capabilities MSRs.

* A myriad of event/exception fixes and cleanups.  Most notably, pending
  exceptions morph into VM-Exits earlier, as soon as the exception is
  queued, instead of waiting until the next vmentry.  This fixed a
  longstanding issue where the exceptions would incorrectly become
  double-faults instead of triggering a vmexit; the common case of
  page-fault vmexits had a special workaround, but now it's fixed for good.

* A handful of fixes for memory leaks in error paths.

* Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow.

* Never write to memory from non-sleepable kvm_vcpu_check_block().

* Selftests refinements and cleanups.

* Misc typo cleanups.

Generic:

* remove KVM_REQ_UNHALT

-----BEGIN PGP SIGNATURE-----

iQFIBAABCAAyFiEE8TM4V0tmI4mGbHaCv/vSX3jHroMFAmM2zwcUHHBib256aW5p
QHJlZGhhdC5jb20ACgkQv/vSX3jHroNpbwf+MlVeOlzE5SBdrJ0TEnLmKUel1lSz
QnZzP5+D65oD0zhCilUZHcg6G4mzZ5SdVVOvrGJvA0eXh25ruLNMF6jbaABkMLk/
FfI1ybN7A82hwJn/aXMI/sUurWv4Jteaad20JC2DytBCnsW8jUqc49gtXHS2QWy4
3uMsFdpdTAg4zdJKgEUfXBmQviweVpjjl3ziRyZZ7yaeo1oP7XZ8LaE1nR2l5m0J
mfjzneNm5QAnueypOh5KhSwIvqf6WHIVm/rIHDJ1HIFbgfOU0dT27nhb1tmPwAcE
+cJnnMUHjZqtCXteHkAxMClyRq0zsEoKk0OGvSOOMoq3Q0DavSXUNANOig==
=/hqX
-----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm

Pull kvm updates from Paolo Bonzini:
 "The first batch of KVM patches, mostly covering x86.

  ARM:

   - Account stage2 page table allocations in memory stats

  x86:

   - Account EPT/NPT arm64 page table allocations in memory stats

   - Tracepoint cleanups/fixes for nested VM-Enter and emulated MSR
     accesses

   - Drop eVMCS controls filtering for KVM on Hyper-V, all known
     versions of Hyper-V now support eVMCS fields associated with
     features that are enumerated to the guest

   - Use KVM's sanitized VMCS config as the basis for the values of
     nested VMX capabilities MSRs

   - A myriad of event/exception fixes and cleanups.  Most notably,
     pending exceptions morph into VM-Exits earlier, as soon as the
     exception is queued, instead of waiting until the next vmentry.
     This fixed a longstanding issue where the exceptions would
     incorrectly become double-faults instead of triggering a vmexit;
     the common case of page-fault vmexits had a special workaround,
     but now it's fixed for good

   - A handful of fixes for memory leaks in error paths

   - Cleanups for VMREAD trampoline and VMX's VM-Exit assembly flow

   - Never write to memory from non-sleepable kvm_vcpu_check_block()

   - Selftests refinements and cleanups

   - Misc typo cleanups

  Generic:

   - remove KVM_REQ_UNHALT"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (94 commits)
  KVM: remove KVM_REQ_UNHALT
  KVM: mips, x86: do not rely on KVM_REQ_UNHALT
  KVM: x86: never write to memory from kvm_vcpu_check_block()
  KVM: x86: Don't snapshot pending INIT/SIPI prior to checking nested events
  KVM: nVMX: Make event request on VMXOFF iff INIT/SIPI is pending
  KVM: nVMX: Make an event request if INIT or SIPI is pending on VM-Enter
  KVM: SVM: Make an event request if INIT or SIPI is pending when GIF is set
  KVM: x86: lapic does not have to process INIT if it is blocked
  KVM: x86: Rename kvm_apic_has_events() to make it INIT/SIPI specific
  KVM: x86: Rename and expose helper to detect if INIT/SIPI are allowed
  KVM: nVMX: Make an event request when pending an MTF nested VM-Exit
  KVM: x86: make vendor code check for all nested events
  mailmap: Update Oliver's email address
  KVM: x86: Allow force_emulation_prefix to be written without a reload
  KVM: selftests: Add an x86-only test to verify nested exception queueing
  KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
  KVM: x86: Rename inject_pending_events() to kvm_check_and_inject_events()
  KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior
  KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
  KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  ...
commit ef688f8b8c
.mailmap
@@ -336,6 +336,7 @@ Oleksij Rempel <linux@rempel-privat.de> <external.Oleksij.Rempel@de.bosch.com>
 Oleksij Rempel <linux@rempel-privat.de> <fixed-term.Oleksij.Rempel@de.bosch.com>
 Oleksij Rempel <linux@rempel-privat.de> <o.rempel@pengutronix.de>
 Oleksij Rempel <linux@rempel-privat.de> <ore@pengutronix.de>
+Oliver Upton <oliver.upton@linux.dev> <oupton@google.com>
 Pali Rohár <pali@kernel.org> <pali.rohar@gmail.com>
 Paolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it>
 Patrick Mochel <mochel@digitalimplant.org>
@@ -1355,6 +1355,11 @@ PAGE_SIZE multiple when read back.
 	  pagetables
 		Amount of memory allocated for page tables.
 
+	  sec_pagetables
+		Amount of memory allocated for secondary page tables,
+		this currently includes KVM mmu allocations on x86
+		and arm64.
+
 	  percpu (npn)
 		Amount of memory used for storing per-cpu kernel
 		data structures.
@@ -982,6 +982,7 @@ Example output. You may not have all of these fields.
     SUnreclaim:       142336 kB
     KernelStack:       11168 kB
     PageTables:        20540 kB
+    SecPageTables:         0 kB
     NFS_Unstable:          0 kB
     Bounce:                0 kB
     WritebackTmp:          0 kB
@@ -1090,6 +1091,9 @@ KernelStack
               Memory consumed by the kernel stacks of all tasks
 PageTables
               Memory consumed by userspace page tables
+SecPageTables
+              Memory consumed by secondary page tables, this currently
+              includes KVM mmu allocations on x86 and arm64.
 NFS_Unstable
               Always zero. Previous counted pages which had been written to
               the server, but has not been committed to stable storage.
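The new counter is a regular /proc/meminfo field (and shows up as
sec_pagetables in cgroup v2 memory.stat, see the hunk above).  As a minimal
illustration, not part of this series, a userspace program can read it like
any other meminfo entry:

/* Print the SecPageTables line from /proc/meminfo, if present. */
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		/* The field only exists on kernels with this series applied. */
		if (!strncmp(line, "SecPageTables:", 14))
			fputs(line, stdout);
	}

	fclose(f);
	return 0;
}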
@@ -4074,7 +4074,7 @@ Queues an SMI on the thread's vcpu.
 4.97 KVM_X86_SET_MSR_FILTER
 ----------------------------
 
-:Capability: KVM_X86_SET_MSR_FILTER
+:Capability: KVM_CAP_X86_MSR_FILTER
 :Architectures: x86
 :Type: vm ioctl
 :Parameters: struct kvm_msr_filter
@@ -4173,8 +4173,10 @@ If an MSR access is not permitted through the filtering, it generates a
 allows user space to deflect and potentially handle various MSR accesses
 into user space.
 
-If a vCPU is in running state while this ioctl is invoked, the vCPU may
-experience inconsistent filtering behavior on MSR accesses.
+Note, invoking this ioctl while a vCPU is running is inherently racy. However,
+KVM does guarantee that vCPUs will see either the previous filter or the new
+filter, e.g. MSRs with identical settings in both the old and new filter will
+have deterministic behavior.
 
 4.98 KVM_CREATE_SPAPR_TCE_64
 ----------------------------
@@ -5287,110 +5289,7 @@ KVM_PV_DUMP
 authentication tag all of which are needed to decrypt the dump at a
 later time.
 
-4.126 KVM_X86_SET_MSR_FILTER
-----------------------------
-
-:Capability: KVM_CAP_X86_MSR_FILTER
-:Architectures: x86
-:Type: vm ioctl
-:Parameters: struct kvm_msr_filter
-:Returns: 0 on success, < 0 on error
-
-::
-
-  struct kvm_msr_filter_range {
-  #define KVM_MSR_FILTER_READ  (1 << 0)
-  #define KVM_MSR_FILTER_WRITE (1 << 1)
-	__u32 flags;
-	__u32 nmsrs; /* number of msrs in bitmap */
-	__u32 base;  /* MSR index the bitmap starts at */
-	__u8 *bitmap; /* a 1 bit allows the operations in flags, 0 denies */
-  };
-
-  #define KVM_MSR_FILTER_MAX_RANGES 16
-  struct kvm_msr_filter {
-  #define KVM_MSR_FILTER_DEFAULT_ALLOW (0 << 0)
-  #define KVM_MSR_FILTER_DEFAULT_DENY (1 << 0)
-	__u32 flags;
-	struct kvm_msr_filter_range ranges[KVM_MSR_FILTER_MAX_RANGES];
-  };
-
-flags values for ``struct kvm_msr_filter_range``:
-
-``KVM_MSR_FILTER_READ``
-
-  Filter read accesses to MSRs using the given bitmap. A 0 in the bitmap
-  indicates that a read should immediately fail, while a 1 indicates that
-  a read for a particular MSR should be handled regardless of the default
-  filter action.
-
-``KVM_MSR_FILTER_WRITE``
-
-  Filter write accesses to MSRs using the given bitmap. A 0 in the bitmap
-  indicates that a write should immediately fail, while a 1 indicates that
-  a write for a particular MSR should be handled regardless of the default
-  filter action.
-
-``KVM_MSR_FILTER_READ | KVM_MSR_FILTER_WRITE``
-
-  Filter both read and write accesses to MSRs using the given bitmap. A 0
-  in the bitmap indicates that both reads and writes should immediately fail,
-  while a 1 indicates that reads and writes for a particular MSR are not
-  filtered by this range.
-
-flags values for ``struct kvm_msr_filter``:
-
-``KVM_MSR_FILTER_DEFAULT_ALLOW``
-
-  If no filter range matches an MSR index that is getting accessed, KVM will
-  fall back to allowing access to the MSR.
-
-``KVM_MSR_FILTER_DEFAULT_DENY``
-
-  If no filter range matches an MSR index that is getting accessed, KVM will
-  fall back to rejecting access to the MSR. In this mode, all MSRs that should
-  be processed by KVM need to explicitly be marked as allowed in the bitmaps.
-
-This ioctl allows user space to define up to 16 bitmaps of MSR ranges to
-specify whether a certain MSR access should be explicitly filtered for or not.
-
-If this ioctl has never been invoked, MSR accesses are not guarded and the
-default KVM in-kernel emulation behavior is fully preserved.
-
-Calling this ioctl with an empty set of ranges (all nmsrs == 0) disables MSR
-filtering. In that mode, ``KVM_MSR_FILTER_DEFAULT_DENY`` is invalid and causes
-an error.
-
-As soon as the filtering is in place, every MSR access is processed through
-the filtering except for accesses to the x2APIC MSRs (from 0x800 to 0x8ff);
-x2APIC MSRs are always allowed, independent of the ``default_allow`` setting,
-and their behavior depends on the ``X2APIC_ENABLE`` bit of the APIC base
-register.
-
-If a bit is within one of the defined ranges, read and write accesses are
-guarded by the bitmap's value for the MSR index if the kind of access
-is included in the ``struct kvm_msr_filter_range`` flags. If no range
-cover this particular access, the behavior is determined by the flags
-field in the kvm_msr_filter struct: ``KVM_MSR_FILTER_DEFAULT_ALLOW``
-and ``KVM_MSR_FILTER_DEFAULT_DENY``.
-
-Each bitmap range specifies a range of MSRs to potentially allow access on.
-The range goes from MSR index [base .. base+nmsrs]. The flags field
-indicates whether reads, writes or both reads and writes are filtered
-by setting a 1 bit in the bitmap for the corresponding MSR index.
-
-If an MSR access is not permitted through the filtering, it generates a
-#GP inside the guest. When combined with KVM_CAP_X86_USER_SPACE_MSR, that
-allows user space to deflect and potentially handle various MSR accesses
-into user space.
-
-Note, invoking this ioctl with a vCPU is running is inherently racy. However,
-KVM does guarantee that vCPUs will see either the previous filter or the new
-filter, e.g. MSRs with identical settings in both the old and new filter will
-have deterministic behavior.
-
-4.127 KVM_XEN_HVM_SET_ATTR
+4.126 KVM_XEN_HVM_SET_ATTR
 --------------------------
 
 :Capability: KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO
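To complement the (now single) KVM_X86_SET_MSR_FILTER documentation kept in
section 4.97, here is a minimal userspace sketch of installing a filter.  It
is an illustration only, not taken from this series: vm_fd is assumed to be a
VM file descriptor returned by KVM_CREATE_VM, and the MSR index used for the
range is an arbitrary example.

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

int install_msr_filter(int vm_fd)
{
	/* One bit per MSR in the range: 0 = access is filtered, 1 = allowed. */
	__u8 bitmap[1] = { 0 };
	struct kvm_msr_filter filter;

	memset(&filter, 0, sizeof(filter));
	filter.flags = KVM_MSR_FILTER_DEFAULT_ALLOW;

	/* Filter writes to a single example MSR index; reads stay unfiltered. */
	filter.ranges[0].flags = KVM_MSR_FILTER_WRITE;
	filter.ranges[0].base = 0xc0000104;	/* example index only */
	filter.ranges[0].nmsrs = 1;
	filter.ranges[0].bitmap = bitmap;

	/* KVM copies the bitmap during the ioctl, so a stack buffer is fine. */
	return ioctl(vm_fd, KVM_X86_SET_MSR_FILTER, &filter);
}

With KVM_MSR_FILTER_DEFAULT_ALLOW everything outside the range keeps KVM's
normal in-kernel handling, and per the note above, replacing the filter while
vCPUs run is racy but each access still sees either the old or the new filter.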
@@ -97,7 +97,7 @@ VCPU requests are simply bit indices of the ``vcpu->requests`` bitmap.
 This means general bitops, like those documented in [atomic-ops]_ could
 also be used, e.g. ::
 
-  clear_bit(KVM_REQ_UNHALT & KVM_REQUEST_MASK, &vcpu->requests);
+  clear_bit(KVM_REQ_UNBLOCK & KVM_REQUEST_MASK, &vcpu->requests);
 
 However, VCPU request users should refrain from doing so, as it would
 break the abstraction.  The first 8 bits are reserved for architecture
@@ -126,17 +126,6 @@ KVM_REQ_UNBLOCK
   or in order to update the interrupt routing and ensure that assigned
   devices will wake up the vCPU.
 
-KVM_REQ_UNHALT
-
-  This request may be made from the KVM common function kvm_vcpu_block(),
-  which is used to emulate an instruction that causes a CPU to halt until
-  one of an architectural specific set of events and/or interrupts is
-  received (determined by checking kvm_arch_vcpu_runnable()).  When that
-  event or interrupt arrives kvm_vcpu_block() makes the request.  This is
-  in contrast to when kvm_vcpu_block() returns due to any other reason,
-  such as a pending signal, which does not indicate the VCPU's halt
-  emulation should stop, and therefore does not make the request.
-
 KVM_REQ_OUTSIDE_GUEST_MODE
 
   This "request" ensures the target vCPU has exited guest mode prior to the
@@ -297,21 +286,6 @@ architecture dependent.  kvm_vcpu_block() calls kvm_arch_vcpu_runnable()
 to check if it should awaken.  One reason to do so is to provide
 architectures a function where requests may be checked if necessary.
 
-Clearing Requests
------------------
-
-Generally it only makes sense for the receiving VCPU thread to clear a
-request.  However, in some circumstances, such as when the requesting
-thread and the receiving VCPU thread are executed serially, such as when
-they are the same thread, or when they are using some form of concurrency
-control to temporarily execute synchronously, then it's possible to know
-that the request may be cleared immediately, rather than waiting for the
-receiving VCPU thread to handle the request in VCPU RUN.  The only current
-examples of this are kvm_vcpu_block() calls made by VCPUs to block
-themselves.  A possible side-effect of that call is to make the
-KVM_REQ_UNHALT request, which may then be cleared immediately when the
-VCPU returns from the call.
-
 References
 ==========
 
@@ -666,7 +666,6 @@ void kvm_vcpu_wfi(struct kvm_vcpu *vcpu)
 
 	kvm_vcpu_halt(vcpu);
 	vcpu_clear_flag(vcpu, IN_WFIT);
-	kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 
 	preempt_disable();
 	vgic_v4_load(vcpu);
@@ -92,9 +92,13 @@ static bool kvm_is_device_pfn(unsigned long pfn)
 static void *stage2_memcache_zalloc_page(void *arg)
 {
 	struct kvm_mmu_memory_cache *mc = arg;
+	void *virt;
 
 	/* Allocated with __GFP_ZERO, so no need to zero */
-	return kvm_mmu_memory_cache_alloc(mc);
+	virt = kvm_mmu_memory_cache_alloc(mc);
+	if (virt)
+		kvm_account_pgtable_pages(virt, 1);
+	return virt;
 }
 
 static void *kvm_host_zalloc_pages_exact(size_t size)
@@ -102,6 +106,21 @@ static void *kvm_host_zalloc_pages_exact(size_t size)
 	return alloc_pages_exact(size, GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 }
 
+static void *kvm_s2_zalloc_pages_exact(size_t size)
+{
+	void *virt = kvm_host_zalloc_pages_exact(size);
+
+	if (virt)
+		kvm_account_pgtable_pages(virt, (size >> PAGE_SHIFT));
+	return virt;
+}
+
+static void kvm_s2_free_pages_exact(void *virt, size_t size)
+{
+	kvm_account_pgtable_pages(virt, -(size >> PAGE_SHIFT));
+	free_pages_exact(virt, size);
+}
+
 static void kvm_host_get_page(void *addr)
 {
 	get_page(virt_to_page(addr));
@@ -112,6 +131,15 @@ static void kvm_host_put_page(void *addr)
 	put_page(virt_to_page(addr));
 }
 
+static void kvm_s2_put_page(void *addr)
+{
+	struct page *p = virt_to_page(addr);
+	/* Dropping last refcount, the page will be freed */
+	if (page_count(p) == 1)
+		kvm_account_pgtable_pages(addr, -1);
+	put_page(p);
+}
+
 static int kvm_host_page_count(void *addr)
 {
 	return page_count(virt_to_page(addr));
@@ -625,10 +653,10 @@ static int get_user_mapping_size(struct kvm *kvm, u64 addr)
 
 static struct kvm_pgtable_mm_ops kvm_s2_mm_ops = {
 	.zalloc_page		= stage2_memcache_zalloc_page,
-	.zalloc_pages_exact	= kvm_host_zalloc_pages_exact,
-	.free_pages_exact	= free_pages_exact,
+	.zalloc_pages_exact	= kvm_s2_zalloc_pages_exact,
+	.free_pages_exact	= kvm_s2_free_pages_exact,
 	.get_page		= kvm_host_get_page,
-	.put_page		= kvm_host_put_page,
+	.put_page		= kvm_s2_put_page,
 	.page_count		= kvm_host_page_count,
 	.phys_to_virt		= kvm_host_va,
 	.virt_to_phys		= kvm_host_pa,
@@ -955,13 +955,11 @@ enum emulation_result kvm_mips_emul_wait(struct kvm_vcpu *vcpu)
 		kvm_vcpu_halt(vcpu);
 
 		/*
-		 * We we are runnable, then definitely go off to user space to
+		 * We are runnable, then definitely go off to user space to
 		 * check if any I/O interrupts are pending.
 		 */
-		if (kvm_check_request(KVM_REQ_UNHALT, vcpu)) {
-			kvm_clear_request(KVM_REQ_UNHALT, vcpu);
+		if (kvm_arch_vcpu_runnable(vcpu))
 			vcpu->run->exit_reason = KVM_EXIT_IRQ_WINDOW_OPEN;
-		}
 	}
 
 	return EMULATE_DONE;
@@ -499,7 +499,6 @@ static void kvmppc_set_msr_pr(struct kvm_vcpu *vcpu, u64 msr)
 	if (msr & MSR_POW) {
 		if (!vcpu->arch.pending_exceptions) {
 			kvm_vcpu_halt(vcpu);
-			kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 			vcpu->stat.generic.halt_wakeup++;
 
 			/* Unset POW bit after we woke up */
@@ -393,7 +393,6 @@ int kvmppc_h_pr(struct kvm_vcpu *vcpu, unsigned long cmd)
 	case H_CEDE:
 		kvmppc_set_msr_fast(vcpu, kvmppc_get_msr(vcpu) | MSR_EE);
 		kvm_vcpu_halt(vcpu);
-		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 		vcpu->stat.generic.halt_wakeup++;
 		return EMULATE_DONE;
 	case H_LOGICAL_CI_LOAD:
@@ -719,7 +719,6 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.shared->msr & MSR_WE) {
 		local_irq_enable();
 		kvm_vcpu_halt(vcpu);
-		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 		hard_irq_disable();
 
 		kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
@@ -239,7 +239,6 @@ int kvmppc_kvm_pv(struct kvm_vcpu *vcpu)
 	case EV_HCALL_TOKEN(EV_IDLE):
 		r = EV_SUCCESS;
 		kvm_vcpu_halt(vcpu);
-		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 		break;
 	default:
 		r = EV_UNIMPLEMENTED;
@@ -191,7 +191,6 @@ void kvm_riscv_vcpu_wfi(struct kvm_vcpu *vcpu)
 		kvm_vcpu_srcu_read_unlock(vcpu);
 		kvm_vcpu_halt(vcpu);
 		kvm_vcpu_srcu_read_lock(vcpu);
-		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 	}
 }
 
@@ -4343,8 +4343,6 @@ retry:
 		goto retry;
 	}
 
-	/* nothing to do, just clear the request */
-	kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 	/* we left the vsie handler, nothing to do, just clear the request */
 	kvm_clear_request(KVM_REQ_VSIE_RESTART, vcpu);
 
@ -138,6 +138,9 @@
|
|||||||
#define HV_X64_NESTED_GUEST_MAPPING_FLUSH BIT(18)
|
#define HV_X64_NESTED_GUEST_MAPPING_FLUSH BIT(18)
|
||||||
#define HV_X64_NESTED_MSR_BITMAP BIT(19)
|
#define HV_X64_NESTED_MSR_BITMAP BIT(19)
|
||||||
|
|
||||||
|
/* Nested features #2. These are HYPERV_CPUID_NESTED_FEATURES.EBX bits. */
|
||||||
|
#define HV_X64_NESTED_EVMCS1_PERF_GLOBAL_CTRL BIT(0)
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* This is specific to AMD and specifies that enlightened TLB flush is
|
* This is specific to AMD and specifies that enlightened TLB flush is
|
||||||
* supported. If guest opts in to this feature, ASID invalidations only
|
* supported. If guest opts in to this feature, ASID invalidations only
|
||||||
@ -546,7 +549,7 @@ struct hv_enlightened_vmcs {
|
|||||||
u64 guest_rip;
|
u64 guest_rip;
|
||||||
|
|
||||||
u32 hv_clean_fields;
|
u32 hv_clean_fields;
|
||||||
u32 hv_padding_32;
|
u32 padding32_1;
|
||||||
u32 hv_synthetic_controls;
|
u32 hv_synthetic_controls;
|
||||||
struct {
|
struct {
|
||||||
u32 nested_flush_hypercall:1;
|
u32 nested_flush_hypercall:1;
|
||||||
@ -554,14 +557,25 @@ struct hv_enlightened_vmcs {
|
|||||||
u32 reserved:30;
|
u32 reserved:30;
|
||||||
} __packed hv_enlightenments_control;
|
} __packed hv_enlightenments_control;
|
||||||
u32 hv_vp_id;
|
u32 hv_vp_id;
|
||||||
|
u32 padding32_2;
|
||||||
u64 hv_vm_id;
|
u64 hv_vm_id;
|
||||||
u64 partition_assist_page;
|
u64 partition_assist_page;
|
||||||
u64 padding64_4[4];
|
u64 padding64_4[4];
|
||||||
u64 guest_bndcfgs;
|
u64 guest_bndcfgs;
|
||||||
u64 padding64_5[7];
|
u64 guest_ia32_perf_global_ctrl;
|
||||||
|
u64 guest_ia32_s_cet;
|
||||||
|
u64 guest_ssp;
|
||||||
|
u64 guest_ia32_int_ssp_table_addr;
|
||||||
|
u64 guest_ia32_lbr_ctl;
|
||||||
|
u64 padding64_5[2];
|
||||||
u64 xss_exit_bitmap;
|
u64 xss_exit_bitmap;
|
||||||
u64 padding64_6[7];
|
u64 encls_exiting_bitmap;
|
||||||
|
u64 host_ia32_perf_global_ctrl;
|
||||||
|
u64 tsc_multiplier;
|
||||||
|
u64 host_ia32_s_cet;
|
||||||
|
u64 host_ssp;
|
||||||
|
u64 host_ia32_int_ssp_table_addr;
|
||||||
|
u64 padding64_6;
|
||||||
} __packed;
|
} __packed;
|
||||||
|
|
||||||
#define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE 0
|
#define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE 0
|
||||||
|
@ -67,7 +67,7 @@ KVM_X86_OP(get_interrupt_shadow)
|
|||||||
KVM_X86_OP(patch_hypercall)
|
KVM_X86_OP(patch_hypercall)
|
||||||
KVM_X86_OP(inject_irq)
|
KVM_X86_OP(inject_irq)
|
||||||
KVM_X86_OP(inject_nmi)
|
KVM_X86_OP(inject_nmi)
|
||||||
KVM_X86_OP(queue_exception)
|
KVM_X86_OP(inject_exception)
|
||||||
KVM_X86_OP(cancel_injection)
|
KVM_X86_OP(cancel_injection)
|
||||||
KVM_X86_OP(interrupt_allowed)
|
KVM_X86_OP(interrupt_allowed)
|
||||||
KVM_X86_OP(nmi_allowed)
|
KVM_X86_OP(nmi_allowed)
|
||||||
|
@ -615,6 +615,8 @@ struct kvm_vcpu_hv {
|
|||||||
u32 enlightenments_eax; /* HYPERV_CPUID_ENLIGHTMENT_INFO.EAX */
|
u32 enlightenments_eax; /* HYPERV_CPUID_ENLIGHTMENT_INFO.EAX */
|
||||||
u32 enlightenments_ebx; /* HYPERV_CPUID_ENLIGHTMENT_INFO.EBX */
|
u32 enlightenments_ebx; /* HYPERV_CPUID_ENLIGHTMENT_INFO.EBX */
|
||||||
u32 syndbg_cap_eax; /* HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES.EAX */
|
u32 syndbg_cap_eax; /* HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES.EAX */
|
||||||
|
u32 nested_eax; /* HYPERV_CPUID_NESTED_FEATURES.EAX */
|
||||||
|
u32 nested_ebx; /* HYPERV_CPUID_NESTED_FEATURES.EBX */
|
||||||
} cpuid_cache;
|
} cpuid_cache;
|
||||||
};
|
};
|
||||||
|
|
||||||
@ -639,6 +641,16 @@ struct kvm_vcpu_xen {
|
|||||||
struct timer_list poll_timer;
|
struct timer_list poll_timer;
|
||||||
};
|
};
|
||||||
|
|
||||||
|
struct kvm_queued_exception {
|
||||||
|
bool pending;
|
||||||
|
bool injected;
|
||||||
|
bool has_error_code;
|
||||||
|
u8 vector;
|
||||||
|
u32 error_code;
|
||||||
|
unsigned long payload;
|
||||||
|
bool has_payload;
|
||||||
|
};
|
||||||
|
|
||||||
struct kvm_vcpu_arch {
|
struct kvm_vcpu_arch {
|
||||||
/*
|
/*
|
||||||
* rip and regs accesses must go through
|
* rip and regs accesses must go through
|
||||||
@ -738,16 +750,12 @@ struct kvm_vcpu_arch {
|
|||||||
|
|
||||||
u8 event_exit_inst_len;
|
u8 event_exit_inst_len;
|
||||||
|
|
||||||
struct kvm_queued_exception {
|
bool exception_from_userspace;
|
||||||
bool pending;
|
|
||||||
bool injected;
|
/* Exceptions to be injected to the guest. */
|
||||||
bool has_error_code;
|
struct kvm_queued_exception exception;
|
||||||
u8 nr;
|
/* Exception VM-Exits to be synthesized to L1. */
|
||||||
u32 error_code;
|
struct kvm_queued_exception exception_vmexit;
|
||||||
unsigned long payload;
|
|
||||||
bool has_payload;
|
|
||||||
u8 nested_apf;
|
|
||||||
} exception;
|
|
||||||
|
|
||||||
struct kvm_queued_interrupt {
|
struct kvm_queued_interrupt {
|
||||||
bool injected;
|
bool injected;
|
||||||
@ -858,7 +866,6 @@ struct kvm_vcpu_arch {
|
|||||||
u32 id;
|
u32 id;
|
||||||
bool send_user_only;
|
bool send_user_only;
|
||||||
u32 host_apf_flags;
|
u32 host_apf_flags;
|
||||||
unsigned long nested_apf_token;
|
|
||||||
bool delivery_as_pf_vmexit;
|
bool delivery_as_pf_vmexit;
|
||||||
bool pageready_pending;
|
bool pageready_pending;
|
||||||
} apf;
|
} apf;
|
||||||
@ -1524,7 +1531,7 @@ struct kvm_x86_ops {
|
|||||||
unsigned char *hypercall_addr);
|
unsigned char *hypercall_addr);
|
||||||
void (*inject_irq)(struct kvm_vcpu *vcpu, bool reinjected);
|
void (*inject_irq)(struct kvm_vcpu *vcpu, bool reinjected);
|
||||||
void (*inject_nmi)(struct kvm_vcpu *vcpu);
|
void (*inject_nmi)(struct kvm_vcpu *vcpu);
|
||||||
void (*queue_exception)(struct kvm_vcpu *vcpu);
|
void (*inject_exception)(struct kvm_vcpu *vcpu);
|
||||||
void (*cancel_injection)(struct kvm_vcpu *vcpu);
|
void (*cancel_injection)(struct kvm_vcpu *vcpu);
|
||||||
int (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
|
int (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
|
||||||
int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
|
int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
|
||||||
@ -1634,10 +1641,10 @@ struct kvm_x86_ops {
|
|||||||
|
|
||||||
struct kvm_x86_nested_ops {
|
struct kvm_x86_nested_ops {
|
||||||
void (*leave_nested)(struct kvm_vcpu *vcpu);
|
void (*leave_nested)(struct kvm_vcpu *vcpu);
|
||||||
|
bool (*is_exception_vmexit)(struct kvm_vcpu *vcpu, u8 vector,
|
||||||
|
u32 error_code);
|
||||||
int (*check_events)(struct kvm_vcpu *vcpu);
|
int (*check_events)(struct kvm_vcpu *vcpu);
|
||||||
bool (*handle_page_fault_workaround)(struct kvm_vcpu *vcpu,
|
bool (*has_events)(struct kvm_vcpu *vcpu);
|
||||||
struct x86_exception *fault);
|
|
||||||
bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
|
|
||||||
void (*triple_fault)(struct kvm_vcpu *vcpu);
|
void (*triple_fault)(struct kvm_vcpu *vcpu);
|
||||||
int (*get_state)(struct kvm_vcpu *vcpu,
|
int (*get_state)(struct kvm_vcpu *vcpu,
|
||||||
struct kvm_nested_state __user *user_kvm_nested_state,
|
struct kvm_nested_state __user *user_kvm_nested_state,
|
||||||
@ -1863,7 +1870,7 @@ void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long pay
|
|||||||
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
|
void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
|
||||||
void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
|
void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
|
||||||
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
|
void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
|
||||||
bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
|
void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
|
||||||
struct x86_exception *fault);
|
struct x86_exception *fault);
|
||||||
bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
|
bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
|
||||||
bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr);
|
bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr);
|
||||||
|
@ -311,6 +311,15 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
|
|||||||
}
|
}
|
||||||
EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
|
EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
|
||||||
|
|
||||||
|
static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
|
||||||
|
{
|
||||||
|
struct kvm_cpuid_entry2 *entry;
|
||||||
|
|
||||||
|
entry = cpuid_entry2_find(entries, nent, HYPERV_CPUID_INTERFACE,
|
||||||
|
KVM_CPUID_INDEX_NOT_SIGNIFICANT);
|
||||||
|
return entry && entry->eax == HYPERV_CPUID_SIGNATURE_EAX;
|
||||||
|
}
|
||||||
|
|
||||||
static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
|
static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
|
||||||
{
|
{
|
||||||
struct kvm_lapic *apic = vcpu->arch.apic;
|
struct kvm_lapic *apic = vcpu->arch.apic;
|
||||||
@ -346,7 +355,8 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
|
|||||||
vcpu->arch.cr4_guest_rsvd_bits =
|
vcpu->arch.cr4_guest_rsvd_bits =
|
||||||
__cr4_reserved_bits(guest_cpuid_has, vcpu);
|
__cr4_reserved_bits(guest_cpuid_has, vcpu);
|
||||||
|
|
||||||
kvm_hv_set_cpuid(vcpu);
|
kvm_hv_set_cpuid(vcpu, kvm_cpuid_has_hyperv(vcpu->arch.cpuid_entries,
|
||||||
|
vcpu->arch.cpuid_nent));
|
||||||
|
|
||||||
/* Invoke the vendor callback only after the above state is updated. */
|
/* Invoke the vendor callback only after the above state is updated. */
|
||||||
static_call(kvm_x86_vcpu_after_set_cpuid)(vcpu);
|
static_call(kvm_x86_vcpu_after_set_cpuid)(vcpu);
|
||||||
@ -409,6 +419,12 @@ static int kvm_set_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *e2,
|
|||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
if (kvm_cpuid_has_hyperv(e2, nent)) {
|
||||||
|
r = kvm_hv_vcpu_init(vcpu);
|
||||||
|
if (r)
|
||||||
|
return r;
|
||||||
|
}
|
||||||
|
|
||||||
r = kvm_check_cpuid(vcpu, e2, nent);
|
r = kvm_check_cpuid(vcpu, e2, nent);
|
||||||
if (r)
|
if (r)
|
||||||
return r;
|
return r;
|
||||||
|
@ -1137,9 +1137,11 @@ static int em_fnstsw(struct x86_emulate_ctxt *ctxt)
|
|||||||
static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
|
static void decode_register_operand(struct x86_emulate_ctxt *ctxt,
|
||||||
struct operand *op)
|
struct operand *op)
|
||||||
{
|
{
|
||||||
unsigned reg = ctxt->modrm_reg;
|
unsigned int reg;
|
||||||
|
|
||||||
if (!(ctxt->d & ModRM))
|
if (ctxt->d & ModRM)
|
||||||
|
reg = ctxt->modrm_reg;
|
||||||
|
else
|
||||||
reg = (ctxt->b & 7) | ((ctxt->rex_prefix & 1) << 3);
|
reg = (ctxt->b & 7) | ((ctxt->rex_prefix & 1) << 3);
|
||||||
|
|
||||||
if (ctxt->d & Sse) {
|
if (ctxt->d & Sse) {
|
||||||
@ -1953,7 +1955,7 @@ static int em_pop_sreg(struct x86_emulate_ctxt *ctxt)
|
|||||||
if (rc != X86EMUL_CONTINUE)
|
if (rc != X86EMUL_CONTINUE)
|
||||||
return rc;
|
return rc;
|
||||||
|
|
||||||
if (ctxt->modrm_reg == VCPU_SREG_SS)
|
if (seg == VCPU_SREG_SS)
|
||||||
ctxt->interruptibility = KVM_X86_SHADOW_INT_MOV_SS;
|
ctxt->interruptibility = KVM_X86_SHADOW_INT_MOV_SS;
|
||||||
if (ctxt->op_bytes > 2)
|
if (ctxt->op_bytes > 2)
|
||||||
rsp_increment(ctxt, ctxt->op_bytes - 2);
|
rsp_increment(ctxt, ctxt->op_bytes - 2);
|
||||||
@ -3645,13 +3647,10 @@ static int em_wrmsr(struct x86_emulate_ctxt *ctxt)
|
|||||||
| ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
|
| ((u64)reg_read(ctxt, VCPU_REGS_RDX) << 32);
|
||||||
r = ctxt->ops->set_msr_with_filter(ctxt, msr_index, msr_data);
|
r = ctxt->ops->set_msr_with_filter(ctxt, msr_index, msr_data);
|
||||||
|
|
||||||
if (r == X86EMUL_IO_NEEDED)
|
if (r == X86EMUL_PROPAGATE_FAULT)
|
||||||
return r;
|
|
||||||
|
|
||||||
if (r > 0)
|
|
||||||
return emulate_gp(ctxt, 0);
|
return emulate_gp(ctxt, 0);
|
||||||
|
|
||||||
return r < 0 ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
|
return r;
|
||||||
}
|
}
|
||||||
|
|
||||||
static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
|
static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
|
||||||
@ -3662,15 +3661,14 @@ static int em_rdmsr(struct x86_emulate_ctxt *ctxt)
|
|||||||
|
|
||||||
r = ctxt->ops->get_msr_with_filter(ctxt, msr_index, &msr_data);
|
r = ctxt->ops->get_msr_with_filter(ctxt, msr_index, &msr_data);
|
||||||
|
|
||||||
if (r == X86EMUL_IO_NEEDED)
|
if (r == X86EMUL_PROPAGATE_FAULT)
|
||||||
return r;
|
|
||||||
|
|
||||||
if (r)
|
|
||||||
return emulate_gp(ctxt, 0);
|
return emulate_gp(ctxt, 0);
|
||||||
|
|
||||||
*reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
|
if (r == X86EMUL_CONTINUE) {
|
||||||
*reg_write(ctxt, VCPU_REGS_RDX) = msr_data >> 32;
|
*reg_write(ctxt, VCPU_REGS_RAX) = (u32)msr_data;
|
||||||
return X86EMUL_CONTINUE;
|
*reg_write(ctxt, VCPU_REGS_RDX) = msr_data >> 32;
|
||||||
|
}
|
||||||
|
return r;
|
||||||
}
|
}
|
||||||
|
|
||||||
static int em_store_sreg(struct x86_emulate_ctxt *ctxt, int segment)
|
static int em_store_sreg(struct x86_emulate_ctxt *ctxt, int segment)
|
||||||
@ -4171,8 +4169,7 @@ static int check_dr7_gd(struct x86_emulate_ctxt *ctxt)
|
|||||||
|
|
||||||
ctxt->ops->get_dr(ctxt, 7, &dr7);
|
ctxt->ops->get_dr(ctxt, 7, &dr7);
|
||||||
|
|
||||||
/* Check if DR7.Global_Enable is set */
|
return dr7 & DR7_GD;
|
||||||
return dr7 & (1 << 13);
|
|
||||||
}
|
}
|
||||||
|
|
||||||
static int check_dr_read(struct x86_emulate_ctxt *ctxt)
|
static int check_dr_read(struct x86_emulate_ctxt *ctxt)
|
||||||
|
@ -38,9 +38,6 @@
|
|||||||
#include "irq.h"
|
#include "irq.h"
|
||||||
#include "fpu.h"
|
#include "fpu.h"
|
||||||
|
|
||||||
/* "Hv#1" signature */
|
|
||||||
#define HYPERV_CPUID_SIGNATURE_EAX 0x31237648
|
|
||||||
|
|
||||||
#define KVM_HV_MAX_SPARSE_VCPU_SET_BITS DIV_ROUND_UP(KVM_MAX_VCPUS, 64)
|
#define KVM_HV_MAX_SPARSE_VCPU_SET_BITS DIV_ROUND_UP(KVM_MAX_VCPUS, 64)
|
||||||
|
|
||||||
static void stimer_mark_pending(struct kvm_vcpu_hv_stimer *stimer,
|
static void stimer_mark_pending(struct kvm_vcpu_hv_stimer *stimer,
|
||||||
@ -934,11 +931,14 @@ static void stimer_init(struct kvm_vcpu_hv_stimer *stimer, int timer_index)
|
|||||||
stimer_prepare_msg(stimer);
|
stimer_prepare_msg(stimer);
|
||||||
}
|
}
|
||||||
|
|
||||||
static int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
|
int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu)
|
||||||
{
|
{
|
||||||
struct kvm_vcpu_hv *hv_vcpu;
|
struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
|
||||||
int i;
|
int i;
|
||||||
|
|
||||||
|
if (hv_vcpu)
|
||||||
|
return 0;
|
||||||
|
|
||||||
hv_vcpu = kzalloc(sizeof(struct kvm_vcpu_hv), GFP_KERNEL_ACCOUNT);
|
hv_vcpu = kzalloc(sizeof(struct kvm_vcpu_hv), GFP_KERNEL_ACCOUNT);
|
||||||
if (!hv_vcpu)
|
if (!hv_vcpu)
|
||||||
return -ENOMEM;
|
return -ENOMEM;
|
||||||
@ -962,11 +962,9 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
|
|||||||
struct kvm_vcpu_hv_synic *synic;
|
struct kvm_vcpu_hv_synic *synic;
|
||||||
int r;
|
int r;
|
||||||
|
|
||||||
if (!to_hv_vcpu(vcpu)) {
|
r = kvm_hv_vcpu_init(vcpu);
|
||||||
r = kvm_hv_vcpu_init(vcpu);
|
if (r)
|
||||||
if (r)
|
return r;
|
||||||
return r;
|
|
||||||
}
|
|
||||||
|
|
||||||
synic = to_hv_synic(vcpu);
|
synic = to_hv_synic(vcpu);
|
||||||
|
|
||||||
@ -1660,10 +1658,8 @@ int kvm_hv_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data, bool host)
|
|||||||
if (!host && !vcpu->arch.hyperv_enabled)
|
if (!host && !vcpu->arch.hyperv_enabled)
|
||||||
return 1;
|
return 1;
|
||||||
|
|
||||||
if (!to_hv_vcpu(vcpu)) {
|
if (kvm_hv_vcpu_init(vcpu))
|
||||||
if (kvm_hv_vcpu_init(vcpu))
|
return 1;
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (kvm_hv_msr_partition_wide(msr)) {
|
if (kvm_hv_msr_partition_wide(msr)) {
|
||||||
int r;
|
int r;
|
||||||
@ -1683,10 +1679,8 @@ int kvm_hv_get_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata, bool host)
|
|||||||
if (!host && !vcpu->arch.hyperv_enabled)
|
if (!host && !vcpu->arch.hyperv_enabled)
|
||||||
return 1;
|
return 1;
|
||||||
|
|
||||||
if (!to_hv_vcpu(vcpu)) {
|
if (kvm_hv_vcpu_init(vcpu))
|
||||||
if (kvm_hv_vcpu_init(vcpu))
|
return 1;
|
||||||
return 1;
|
|
||||||
}
|
|
||||||
|
|
||||||
if (kvm_hv_msr_partition_wide(msr)) {
|
if (kvm_hv_msr_partition_wide(msr)) {
|
||||||
int r;
|
int r;
|
||||||
@ -1987,49 +1981,49 @@ ret_success:
|
|||||||
return HV_STATUS_SUCCESS;
|
return HV_STATUS_SUCCESS;
|
||||||
}
|
}
|
||||||
|
|
||||||
void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu)
|
void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu, bool hyperv_enabled)
|
||||||
{
|
{
|
||||||
|
struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);
|
||||||
struct kvm_cpuid_entry2 *entry;
|
struct kvm_cpuid_entry2 *entry;
|
||||||
struct kvm_vcpu_hv *hv_vcpu;
|
|
||||||
|
|
||||||
entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_INTERFACE);
|
vcpu->arch.hyperv_enabled = hyperv_enabled;
|
||||||
if (entry && entry->eax == HYPERV_CPUID_SIGNATURE_EAX) {
|
|
||||||
vcpu->arch.hyperv_enabled = true;
|
if (!hv_vcpu) {
|
||||||
} else {
|
/*
|
||||||
vcpu->arch.hyperv_enabled = false;
|
* KVM should have already allocated kvm_vcpu_hv if Hyper-V is
|
||||||
|
* enabled in CPUID.
|
||||||
|
*/
|
||||||
|
WARN_ON_ONCE(vcpu->arch.hyperv_enabled);
|
||||||
return;
|
return;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (!to_hv_vcpu(vcpu) && kvm_hv_vcpu_init(vcpu))
|
memset(&hv_vcpu->cpuid_cache, 0, sizeof(hv_vcpu->cpuid_cache));
|
||||||
return;
|
|
||||||
|
|
||||||
hv_vcpu = to_hv_vcpu(vcpu);
|
if (!vcpu->arch.hyperv_enabled)
|
||||||
|
return;
|
||||||
|
|
||||||
entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_FEATURES);
|
entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_FEATURES);
|
||||||
if (entry) {
|
if (entry) {
|
||||||
hv_vcpu->cpuid_cache.features_eax = entry->eax;
|
hv_vcpu->cpuid_cache.features_eax = entry->eax;
|
||||||
hv_vcpu->cpuid_cache.features_ebx = entry->ebx;
|
hv_vcpu->cpuid_cache.features_ebx = entry->ebx;
|
||||||
hv_vcpu->cpuid_cache.features_edx = entry->edx;
|
hv_vcpu->cpuid_cache.features_edx = entry->edx;
|
||||||
} else {
|
|
||||||
hv_vcpu->cpuid_cache.features_eax = 0;
|
|
||||||
hv_vcpu->cpuid_cache.features_ebx = 0;
|
|
||||||
hv_vcpu->cpuid_cache.features_edx = 0;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_ENLIGHTMENT_INFO);
|
entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_ENLIGHTMENT_INFO);
|
||||||
if (entry) {
|
if (entry) {
|
||||||
hv_vcpu->cpuid_cache.enlightenments_eax = entry->eax;
|
hv_vcpu->cpuid_cache.enlightenments_eax = entry->eax;
|
||||||
hv_vcpu->cpuid_cache.enlightenments_ebx = entry->ebx;
|
hv_vcpu->cpuid_cache.enlightenments_ebx = entry->ebx;
|
||||||
} else {
|
|
||||||
hv_vcpu->cpuid_cache.enlightenments_eax = 0;
|
|
||||||
hv_vcpu->cpuid_cache.enlightenments_ebx = 0;
|
|
||||||
}
|
}
|
||||||
|
|
||||||
entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES);
|
entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_SYNDBG_PLATFORM_CAPABILITIES);
|
||||||
if (entry)
|
if (entry)
|
||||||
hv_vcpu->cpuid_cache.syndbg_cap_eax = entry->eax;
|
hv_vcpu->cpuid_cache.syndbg_cap_eax = entry->eax;
|
||||||
else
|
|
||||||
hv_vcpu->cpuid_cache.syndbg_cap_eax = 0;
|
entry = kvm_find_cpuid_entry(vcpu, HYPERV_CPUID_NESTED_FEATURES);
|
||||||
|
if (entry) {
|
||||||
|
hv_vcpu->cpuid_cache.nested_eax = entry->eax;
|
||||||
|
hv_vcpu->cpuid_cache.nested_ebx = entry->ebx;
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
int kvm_hv_set_enforce_cpuid(struct kvm_vcpu *vcpu, bool enforce)
|
int kvm_hv_set_enforce_cpuid(struct kvm_vcpu *vcpu, bool enforce)
|
||||||
@ -2552,7 +2546,7 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
|
|||||||
case HYPERV_CPUID_NESTED_FEATURES:
|
case HYPERV_CPUID_NESTED_FEATURES:
|
||||||
ent->eax = evmcs_ver;
|
ent->eax = evmcs_ver;
|
||||||
ent->eax |= HV_X64_NESTED_MSR_BITMAP;
|
ent->eax |= HV_X64_NESTED_MSR_BITMAP;
|
||||||
|
ent->ebx |= HV_X64_NESTED_EVMCS1_PERF_GLOBAL_CTRL;
|
||||||
break;
|
break;
|
||||||
|
|
||||||
case HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS:
|
case HYPERV_CPUID_SYNDBG_VENDOR_AND_MAX_FUNCTIONS:
|
||||||
|
@ -23,6 +23,9 @@
|
|||||||
|
|
||||||
#include <linux/kvm_host.h>
|
#include <linux/kvm_host.h>
|
||||||
|
|
||||||
|
/* "Hv#1" signature */
|
||||||
|
#define HYPERV_CPUID_SIGNATURE_EAX 0x31237648
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* The #defines related to the synthetic debugger are required by KDNet, but
|
* The #defines related to the synthetic debugger are required by KDNet, but
|
||||||
* they are not documented in the Hyper-V TLFS because the synthetic debugger
|
* they are not documented in the Hyper-V TLFS because the synthetic debugger
|
||||||
@ -141,7 +144,8 @@ void kvm_hv_request_tsc_page_update(struct kvm *kvm);
|
|||||||
|
|
||||||
void kvm_hv_init_vm(struct kvm *kvm);
|
void kvm_hv_init_vm(struct kvm *kvm);
|
||||||
void kvm_hv_destroy_vm(struct kvm *kvm);
|
void kvm_hv_destroy_vm(struct kvm *kvm);
|
||||||
void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu);
|
int kvm_hv_vcpu_init(struct kvm_vcpu *vcpu);
|
||||||
|
void kvm_hv_set_cpuid(struct kvm_vcpu *vcpu, bool hyperv_enabled);
|
||||||
int kvm_hv_set_enforce_cpuid(struct kvm_vcpu *vcpu, bool enforce);
|
int kvm_hv_set_enforce_cpuid(struct kvm_vcpu *vcpu, bool enforce);
|
||||||
int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct kvm_hyperv_eventfd *args);
|
int kvm_vm_ioctl_hv_eventfd(struct kvm *kvm, struct kvm_hyperv_eventfd *args);
|
||||||
int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
|
int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
|
||||||
|
@ -3025,17 +3025,8 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
|
|||||||
struct kvm_lapic *apic = vcpu->arch.apic;
|
struct kvm_lapic *apic = vcpu->arch.apic;
|
||||||
u8 sipi_vector;
|
u8 sipi_vector;
|
||||||
int r;
|
int r;
|
||||||
unsigned long pe;
|
|
||||||
|
|
||||||
if (!lapic_in_kernel(vcpu))
|
if (!kvm_apic_has_pending_init_or_sipi(vcpu))
|
||||||
return 0;
|
|
||||||
|
|
||||||
/*
|
|
||||||
* Read pending events before calling the check_events
|
|
||||||
* callback.
|
|
||||||
*/
|
|
||||||
pe = smp_load_acquire(&apic->pending_events);
|
|
||||||
if (!pe)
|
|
||||||
return 0;
|
return 0;
|
||||||
|
|
||||||
if (is_guest_mode(vcpu)) {
|
if (is_guest_mode(vcpu)) {
|
||||||
@ -3043,38 +3034,31 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
|
|||||||
if (r < 0)
|
if (r < 0)
|
||||||
return r == -EBUSY ? 0 : r;
|
return r == -EBUSY ? 0 : r;
|
||||||
/*
|
/*
|
||||||
* If an event has happened and caused a vmexit,
|
* Continue processing INIT/SIPI even if a nested VM-Exit
|
||||||
* we know INITs are latched and therefore
|
* occurred, e.g. pending SIPIs should be dropped if INIT+SIPI
|
||||||
* we will not incorrectly deliver an APIC
|
* are blocked as a result of transitioning to VMX root mode.
|
||||||
* event instead of a vmexit.
|
|
||||||
*/
|
*/
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* INITs are latched while CPU is in specific states
|
* INITs are blocked while CPU is in specific states (SMM, VMX root
|
||||||
* (SMM, VMX root mode, SVM with GIF=0).
|
* mode, SVM with GIF=0), while SIPIs are dropped if the CPU isn't in
|
||||||
* Because a CPU cannot be in these states immediately
|
* wait-for-SIPI (WFS).
|
||||||
* after it has processed an INIT signal (and thus in
|
|
||||||
* KVM_MP_STATE_INIT_RECEIVED state), just eat SIPIs
|
|
||||||
* and leave the INIT pending.
|
|
||||||
*/
|
*/
|
||||||
if (kvm_vcpu_latch_init(vcpu)) {
|
if (!kvm_apic_init_sipi_allowed(vcpu)) {
|
||||||
WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
|
WARN_ON_ONCE(vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED);
|
||||||
if (test_bit(KVM_APIC_SIPI, &pe))
|
clear_bit(KVM_APIC_SIPI, &apic->pending_events);
|
||||||
clear_bit(KVM_APIC_SIPI, &apic->pending_events);
|
|
||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (test_bit(KVM_APIC_INIT, &pe)) {
|
if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
|
||||||
clear_bit(KVM_APIC_INIT, &apic->pending_events);
|
|
||||||
kvm_vcpu_reset(vcpu, true);
|
kvm_vcpu_reset(vcpu, true);
|
||||||
if (kvm_vcpu_is_bsp(apic->vcpu))
|
if (kvm_vcpu_is_bsp(apic->vcpu))
|
||||||
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
|
vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
|
||||||
else
|
else
|
||||||
vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
|
vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
|
||||||
}
|
}
|
||||||
if (test_bit(KVM_APIC_SIPI, &pe)) {
|
if (test_and_clear_bit(KVM_APIC_SIPI, &apic->pending_events)) {
|
||||||
clear_bit(KVM_APIC_SIPI, &apic->pending_events);
|
|
||||||
if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
|
if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
|
||||||
/* evaluate pending_events before reading the vector */
|
/* evaluate pending_events before reading the vector */
|
||||||
smp_rmb();
|
smp_rmb();
|
||||||
|
@ -7,6 +7,7 @@
|
|||||||
#include <linux/kvm_host.h>
|
#include <linux/kvm_host.h>
|
||||||
|
|
||||||
#include "hyperv.h"
|
#include "hyperv.h"
|
||||||
|
#include "kvm_cache_regs.h"
|
||||||
|
|
||||||
#define KVM_APIC_INIT 0
|
#define KVM_APIC_INIT 0
|
||||||
#define KVM_APIC_SIPI 1
|
#define KVM_APIC_SIPI 1
|
||||||
@ -223,11 +224,17 @@ static inline bool kvm_vcpu_apicv_active(struct kvm_vcpu *vcpu)
|
|||||||
return lapic_in_kernel(vcpu) && vcpu->arch.apic->apicv_active;
|
return lapic_in_kernel(vcpu) && vcpu->arch.apic->apicv_active;
|
||||||
}
|
}
|
||||||
|
|
||||||
static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu)
|
static inline bool kvm_apic_has_pending_init_or_sipi(struct kvm_vcpu *vcpu)
|
||||||
{
|
{
|
||||||
return lapic_in_kernel(vcpu) && vcpu->arch.apic->pending_events;
|
return lapic_in_kernel(vcpu) && vcpu->arch.apic->pending_events;
|
||||||
}
|
}
|
||||||
|
|
||||||
|
static inline bool kvm_apic_init_sipi_allowed(struct kvm_vcpu *vcpu)
|
||||||
|
{
|
||||||
|
return !is_smm(vcpu) &&
|
||||||
|
!static_call(kvm_x86_apic_init_signal_blocked)(vcpu);
|
||||||
|
}
|
||||||
|
|
||||||
static inline bool kvm_lowest_prio_delivery(struct kvm_lapic_irq *irq)
|
static inline bool kvm_lowest_prio_delivery(struct kvm_lapic_irq *irq)
|
||||||
{
|
{
|
||||||
return (irq->delivery_mode == APIC_DM_LOWEST ||
|
return (irq->delivery_mode == APIC_DM_LOWEST ||
|
||||||
|
@ -1667,6 +1667,18 @@ static inline void kvm_mod_used_mmu_pages(struct kvm *kvm, long nr)
|
|||||||
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
|
percpu_counter_add(&kvm_total_used_mmu_pages, nr);
|
||||||
}
|
}
|
||||||
|
|
||||||
|
static void kvm_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
|
||||||
|
{
|
||||||
|
kvm_mod_used_mmu_pages(kvm, +1);
|
||||||
|
kvm_account_pgtable_pages((void *)sp->spt, +1);
|
||||||
|
}
|
||||||
|
|
||||||
|
static void kvm_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
|
||||||
|
{
|
||||||
|
kvm_mod_used_mmu_pages(kvm, -1);
|
||||||
|
kvm_account_pgtable_pages((void *)sp->spt, -1);
|
||||||
|
}
|
||||||
|
|
||||||
static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
|
static void kvm_mmu_free_shadow_page(struct kvm_mmu_page *sp)
|
||||||
{
|
{
|
||||||
MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
|
MMU_WARN_ON(!is_empty_shadow_page(sp->spt));
|
||||||
@ -2124,7 +2136,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
|
|||||||
*/
|
*/
|
||||||
sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
|
sp->mmu_valid_gen = kvm->arch.mmu_valid_gen;
|
||||||
list_add(&sp->link, &kvm->arch.active_mmu_pages);
|
list_add(&sp->link, &kvm->arch.active_mmu_pages);
|
||||||
kvm_mod_used_mmu_pages(kvm, +1);
|
kvm_account_mmu_page(kvm, sp);
|
||||||
|
|
||||||
sp->gfn = gfn;
|
sp->gfn = gfn;
|
||||||
sp->role = role;
|
sp->role = role;
|
||||||
@ -2458,7 +2470,7 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
|
|||||||
list_add(&sp->link, invalid_list);
|
list_add(&sp->link, invalid_list);
|
||||||
else
|
else
|
||||||
list_move(&sp->link, invalid_list);
|
list_move(&sp->link, invalid_list);
|
||||||
kvm_mod_used_mmu_pages(kvm, -1);
|
kvm_unaccount_mmu_page(kvm, sp);
|
||||||
} else {
|
} else {
|
||||||
/*
|
/*
|
||||||
* Remove the active root from the active page list, the root
|
* Remove the active root from the active page list, the root
|
||||||
@ -4292,7 +4304,7 @@ int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
|
|||||||
|
|
||||||
vcpu->arch.l1tf_flush_l1d = true;
|
vcpu->arch.l1tf_flush_l1d = true;
|
||||||
if (!flags) {
|
if (!flags) {
|
||||||
trace_kvm_page_fault(fault_address, error_code);
|
trace_kvm_page_fault(vcpu, fault_address, error_code);
|
||||||
|
|
||||||
if (kvm_event_needs_reinjection(vcpu))
|
if (kvm_event_needs_reinjection(vcpu))
|
||||||
kvm_mmu_unprotect_page_virt(vcpu, fault_address);
|
kvm_mmu_unprotect_page_virt(vcpu, fault_address);
|
||||||
@ -6704,10 +6716,12 @@ int kvm_mmu_vendor_module_init(void)
|
|||||||
|
|
||||||
ret = register_shrinker(&mmu_shrinker, "x86-mmu");
|
ret = register_shrinker(&mmu_shrinker, "x86-mmu");
|
||||||
if (ret)
|
if (ret)
|
||||||
goto out;
|
goto out_shrinker;
|
||||||
|
|
||||||
return 0;
|
return 0;
|
||||||
|
|
||||||
|
out_shrinker:
|
||||||
|
percpu_counter_destroy(&kvm_total_used_mmu_pages);
|
||||||
out:
|
out:
|
||||||
mmu_destroy_caches();
|
mmu_destroy_caches();
|
||||||
return ret;
|
return ret;
|
||||||
|
@ -472,7 +472,7 @@ error:
|
|||||||
|
|
||||||
#if PTTYPE == PTTYPE_EPT
|
#if PTTYPE == PTTYPE_EPT
|
||||||
/*
|
/*
|
||||||
* Use PFERR_RSVD_MASK in error_code to to tell if EPT
|
* Use PFERR_RSVD_MASK in error_code to tell if EPT
|
||||||
* misconfiguration requires to be injected. The detection is
|
* misconfiguration requires to be injected. The detection is
|
||||||
* done by is_rsvd_bits_set() above.
|
* done by is_rsvd_bits_set() above.
|
||||||
*
|
*
|
||||||
|
@ -372,6 +372,16 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
static void tdp_account_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
|
||||||
|
{
|
||||||
|
kvm_account_pgtable_pages((void *)sp->spt, +1);
|
||||||
|
}
|
||||||
|
|
||||||
|
static void tdp_unaccount_mmu_page(struct kvm *kvm, struct kvm_mmu_page *sp)
|
||||||
|
{
|
||||||
|
kvm_account_pgtable_pages((void *)sp->spt, -1);
|
||||||
|
}
|
||||||
|
|
||||||
/**
|
/**
|
||||||
* tdp_mmu_unlink_sp() - Remove a shadow page from the list of used pages
|
* tdp_mmu_unlink_sp() - Remove a shadow page from the list of used pages
|
||||||
*
|
*
|
||||||
@ -384,6 +394,7 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
|
|||||||
static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
|
static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
|
||||||
bool shared)
|
bool shared)
|
||||||
{
|
{
|
||||||
|
tdp_unaccount_mmu_page(kvm, sp);
|
||||||
if (shared)
|
if (shared)
|
||||||
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
|
spin_lock(&kvm->arch.tdp_mmu_pages_lock);
|
||||||
else
|
else
|
||||||
@ -1132,6 +1143,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
|
|||||||
if (account_nx)
|
if (account_nx)
|
||||||
account_huge_nx_page(kvm, sp);
|
account_huge_nx_page(kvm, sp);
|
||||||
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
|
spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
|
||||||
|
tdp_account_mmu_page(kvm, sp);
|
||||||
|
|
||||||
return 0;
|
return 0;
|
||||||
}
|
}
|
||||||
|
@@ -55,28 +55,6 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
 nested_svm_vmexit(svm);
 }

-static bool nested_svm_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
-struct x86_exception *fault)
-{
-struct vcpu_svm *svm = to_svm(vcpu);
-struct vmcb *vmcb = svm->vmcb;

-WARN_ON(!is_guest_mode(vcpu));

-if (vmcb12_is_intercept(&svm->nested.ctl,
-INTERCEPT_EXCEPTION_OFFSET + PF_VECTOR) &&
-!WARN_ON_ONCE(svm->nested.nested_run_pending)) {
-vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
-vmcb->control.exit_code_hi = 0;
-vmcb->control.exit_info_1 = fault->error_code;
-vmcb->control.exit_info_2 = fault->address;
-nested_svm_vmexit(svm);
-return true;
-}

-return false;
-}

 static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
 {
 struct vcpu_svm *svm = to_svm(vcpu);
@@ -468,7 +446,7 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
 unsigned int nr;

 if (vcpu->arch.exception.injected) {
-nr = vcpu->arch.exception.nr;
+nr = vcpu->arch.exception.vector;
 exit_int_info = nr | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;

 if (vcpu->arch.exception.has_error_code) {
@@ -781,11 +759,15 @@ int enter_svm_guest_mode(struct kvm_vcpu *vcpu, u64 vmcb12_gpa,
 struct vcpu_svm *svm = to_svm(vcpu);
 int ret;

-trace_kvm_nested_vmrun(svm->vmcb->save.rip, vmcb12_gpa,
-vmcb12->save.rip,
-vmcb12->control.int_ctl,
-vmcb12->control.event_inj,
-vmcb12->control.nested_ctl);
+trace_kvm_nested_vmenter(svm->vmcb->save.rip,
+vmcb12_gpa,
+vmcb12->save.rip,
+vmcb12->control.int_ctl,
+vmcb12->control.event_inj,
+vmcb12->control.nested_ctl,
+vmcb12->control.nested_cr3,
+vmcb12->save.cr3,
+KVM_ISA_SVM);

 trace_kvm_nested_intercepts(vmcb12->control.intercepts[INTERCEPT_CR] & 0xffff,
 vmcb12->control.intercepts[INTERCEPT_CR] >> 16,
@@ -1304,44 +1286,46 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
 return 0;
 }

-static bool nested_exit_on_exception(struct vcpu_svm *svm)
+static bool nested_svm_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
+u32 error_code)
 {
-unsigned int nr = svm->vcpu.arch.exception.nr;
+struct vcpu_svm *svm = to_svm(vcpu);

-return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(nr));
+return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(vector));
 }

-static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
+static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
 {
-unsigned int nr = svm->vcpu.arch.exception.nr;
+struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
+struct vcpu_svm *svm = to_svm(vcpu);
 struct vmcb *vmcb = svm->vmcb;

-vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
+vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + ex->vector;
 vmcb->control.exit_code_hi = 0;

-if (svm->vcpu.arch.exception.has_error_code)
-vmcb->control.exit_info_1 = svm->vcpu.arch.exception.error_code;
+if (ex->has_error_code)
+vmcb->control.exit_info_1 = ex->error_code;

 /*
 * EXITINFO2 is undefined for all exception intercepts other
 * than #PF.
 */
-if (nr == PF_VECTOR) {
-if (svm->vcpu.arch.exception.nested_apf)
-vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token;
-else if (svm->vcpu.arch.exception.has_payload)
-vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
+if (ex->vector == PF_VECTOR) {
+if (ex->has_payload)
+vmcb->control.exit_info_2 = ex->payload;
 else
-vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
-} else if (nr == DB_VECTOR) {
-/* See inject_pending_event. */
-kvm_deliver_exception_payload(&svm->vcpu);
-if (svm->vcpu.arch.dr7 & DR7_GD) {
-svm->vcpu.arch.dr7 &= ~DR7_GD;
-kvm_update_dr7(&svm->vcpu);
+vmcb->control.exit_info_2 = vcpu->arch.cr2;
+} else if (ex->vector == DB_VECTOR) {
+/* See kvm_check_and_inject_events(). */
+kvm_deliver_exception_payload(vcpu, ex);
+if (vcpu->arch.dr7 & DR7_GD) {
+vcpu->arch.dr7 &= ~DR7_GD;
+kvm_update_dr7(vcpu);
 }
-} else
-WARN_ON(svm->vcpu.arch.exception.has_payload);
+} else {
+WARN_ON(ex->has_payload);
+}

 nested_svm_vmexit(svm);
 }
@@ -1353,10 +1337,22 @@ static inline bool nested_exit_on_init(struct vcpu_svm *svm)

 static int svm_check_nested_events(struct kvm_vcpu *vcpu)
 {
-struct vcpu_svm *svm = to_svm(vcpu);
-bool block_nested_events =
-kvm_event_needs_reinjection(vcpu) || svm->nested.nested_run_pending;
 struct kvm_lapic *apic = vcpu->arch.apic;
+struct vcpu_svm *svm = to_svm(vcpu);
+/*
+* Only a pending nested run blocks a pending exception. If there is a
+* previously injected event, the pending exception occurred while said
+* event was being delivered and thus needs to be handled.
+*/
+bool block_nested_exceptions = svm->nested.nested_run_pending;
+/*
+* New events (not exceptions) are only recognized at instruction
+* boundaries. If an event needs reinjection, then KVM is handling a
+* VM-Exit that occurred _during_ instruction execution; new events are
+* blocked until the instruction completes.
+*/
+bool block_nested_events = block_nested_exceptions ||
+kvm_event_needs_reinjection(vcpu);

 if (lapic_in_kernel(vcpu) &&
 test_bit(KVM_APIC_INIT, &apic->pending_events)) {
@@ -1368,18 +1364,16 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
 return 0;
 }

-if (vcpu->arch.exception.pending) {
-/*
-* Only a pending nested run can block a pending exception.
-* Otherwise an injected NMI/interrupt should either be
-* lost or delivered to the nested hypervisor in the EXITINTINFO
-* vmcb field, while delivering the pending exception.
-*/
-if (svm->nested.nested_run_pending)
+if (vcpu->arch.exception_vmexit.pending) {
+if (block_nested_exceptions)
 return -EBUSY;
-if (!nested_exit_on_exception(svm))
+nested_svm_inject_exception_vmexit(vcpu);
 return 0;
-nested_svm_inject_exception_vmexit(svm);
+}

+if (vcpu->arch.exception.pending) {
+if (block_nested_exceptions)
+return -EBUSY;
 return 0;
 }

@@ -1720,8 +1714,8 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)

 struct kvm_x86_nested_ops svm_nested_ops = {
 .leave_nested = svm_leave_nested,
+.is_exception_vmexit = nested_svm_is_exception_vmexit,
 .check_events = svm_check_nested_events,
-.handle_page_fault_workaround = nested_svm_handle_page_fault_workaround,
 .triple_fault = nested_svm_triple_fault,
 .get_nested_state_pages = svm_get_nested_state_pages,
 .get_state = svm_get_nested_state,
@@ -461,24 +461,22 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
 return 0;
 }

-static void svm_queue_exception(struct kvm_vcpu *vcpu)
+static void svm_inject_exception(struct kvm_vcpu *vcpu)
 {
+struct kvm_queued_exception *ex = &vcpu->arch.exception;
 struct vcpu_svm *svm = to_svm(vcpu);
-unsigned nr = vcpu->arch.exception.nr;
-bool has_error_code = vcpu->arch.exception.has_error_code;
-u32 error_code = vcpu->arch.exception.error_code;

-kvm_deliver_exception_payload(vcpu);
+kvm_deliver_exception_payload(vcpu, ex);

-if (kvm_exception_is_soft(nr) &&
+if (kvm_exception_is_soft(ex->vector) &&
 svm_update_soft_interrupt_rip(vcpu))
 return;

-svm->vmcb->control.event_inj = nr
+svm->vmcb->control.event_inj = ex->vector
 | SVM_EVTINJ_VALID
-| (has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
+| (ex->has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
 | SVM_EVTINJ_TYPE_EXEPT;
-svm->vmcb->control.event_inj_err = error_code;
+svm->vmcb->control.event_inj_err = ex->error_code;
 }

 static void svm_init_erratum_383(void)
@@ -1975,7 +1973,7 @@ static int npf_interception(struct kvm_vcpu *vcpu)
 u64 fault_address = svm->vmcb->control.exit_info_2;
 u64 error_code = svm->vmcb->control.exit_info_1;

-trace_kvm_page_fault(fault_address, error_code);
+trace_kvm_page_fault(vcpu, fault_address, error_code);
 return kvm_mmu_page_fault(vcpu, fault_address, error_code,
 static_cpu_has(X86_FEATURE_DECODEASSISTS) ?
 svm->vmcb->control.insn_bytes : NULL,
@@ -2341,7 +2339,8 @@ void svm_set_gif(struct vcpu_svm *svm, bool value)
 enable_gif(svm);
 if (svm->vcpu.arch.smi_pending ||
 svm->vcpu.arch.nmi_pending ||
-kvm_cpu_has_injectable_intr(&svm->vcpu))
+kvm_cpu_has_injectable_intr(&svm->vcpu) ||
+kvm_apic_has_pending_init_or_sipi(&svm->vcpu))
 kvm_make_request(KVM_REQ_EVENT, &svm->vcpu);
 } else {
 disable_gif(svm);
@@ -3522,7 +3521,7 @@ void svm_complete_interrupt_delivery(struct kvm_vcpu *vcpu, int delivery_mode,

 /* Note, this is called iff the local APIC is in-kernel. */
 if (!READ_ONCE(vcpu->arch.apic->apicv_active)) {
-/* Process the interrupt via inject_pending_event */
+/* Process the interrupt via kvm_check_and_inject_events(). */
 kvm_make_request(KVM_REQ_EVENT, vcpu);
 kvm_vcpu_kick(vcpu);
 return;
@@ -4697,15 +4696,7 @@ static bool svm_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
 {
 struct vcpu_svm *svm = to_svm(vcpu);

-/*
-* TODO: Last condition latch INIT signals on vCPU when
-* vCPU is in guest-mode and vmcb12 defines intercept on INIT.
-* To properly emulate the INIT intercept,
-* svm_check_nested_events() should call nested_svm_vmexit()
-* if an INIT signal is pending.
-*/
-return !gif_set(svm) ||
-(vmcb_is_intercept(&svm->vmcb->control, INTERCEPT_INIT));
+return !gif_set(svm);
 }

 static void svm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
@@ -4798,7 +4789,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 .patch_hypercall = svm_patch_hypercall,
 .inject_irq = svm_inject_irq,
 .inject_nmi = svm_inject_nmi,
-.queue_exception = svm_queue_exception,
+.inject_exception = svm_inject_exception,
 .cancel_injection = svm_cancel_injection,
 .interrupt_allowed = svm_interrupt_allowed,
 .nmi_allowed = svm_nmi_allowed,
@@ -394,20 +394,25 @@ TRACE_EVENT(kvm_inj_exception,
 * Tracepoint for page fault.
 */
 TRACE_EVENT(kvm_page_fault,
-TP_PROTO(unsigned long fault_address, unsigned int error_code),
-TP_ARGS(fault_address, error_code),
+TP_PROTO(struct kvm_vcpu *vcpu, u64 fault_address, u64 error_code),
+TP_ARGS(vcpu, fault_address, error_code),

 TP_STRUCT__entry(
-__field( unsigned long, fault_address )
-__field( unsigned int, error_code )
+__field( unsigned int, vcpu_id )
+__field( unsigned long, guest_rip )
+__field( u64, fault_address )
+__field( u64, error_code )
 ),

 TP_fast_assign(
+__entry->vcpu_id = vcpu->vcpu_id;
+__entry->guest_rip = kvm_rip_read(vcpu);
 __entry->fault_address = fault_address;
 __entry->error_code = error_code;
 ),

-TP_printk("address %lx error_code %x",
+TP_printk("vcpu %u rip 0x%lx address 0x%016llx error_code 0x%llx",
+__entry->vcpu_id, __entry->guest_rip,
 __entry->fault_address, __entry->error_code)
 );

@@ -589,10 +594,12 @@ TRACE_EVENT(kvm_pv_eoi,
 /*
 * Tracepoint for nested VMRUN
 */
-TRACE_EVENT(kvm_nested_vmrun,
+TRACE_EVENT(kvm_nested_vmenter,
 TP_PROTO(__u64 rip, __u64 vmcb, __u64 nested_rip, __u32 int_ctl,
-__u32 event_inj, bool npt),
-TP_ARGS(rip, vmcb, nested_rip, int_ctl, event_inj, npt),
+__u32 event_inj, bool tdp_enabled, __u64 guest_tdp_pgd,
+__u64 guest_cr3, __u32 isa),
+TP_ARGS(rip, vmcb, nested_rip, int_ctl, event_inj, tdp_enabled,
+guest_tdp_pgd, guest_cr3, isa),

 TP_STRUCT__entry(
 __field( __u64, rip )
@@ -600,7 +607,9 @@ TRACE_EVENT(kvm_nested_vmrun,
 __field( __u64, nested_rip )
 __field( __u32, int_ctl )
 __field( __u32, event_inj )
-__field( bool, npt )
+__field( bool, tdp_enabled )
+__field( __u64, guest_pgd )
+__field( __u32, isa )
 ),

 TP_fast_assign(
@@ -609,14 +618,24 @@ TRACE_EVENT(kvm_nested_vmrun,
 __entry->nested_rip = nested_rip;
 __entry->int_ctl = int_ctl;
 __entry->event_inj = event_inj;
-__entry->npt = npt;
+__entry->tdp_enabled = tdp_enabled;
+__entry->guest_pgd = tdp_enabled ? guest_tdp_pgd : guest_cr3;
+__entry->isa = isa;
 ),

-TP_printk("rip: 0x%016llx vmcb: 0x%016llx nrip: 0x%016llx int_ctl: 0x%08x "
-"event_inj: 0x%08x npt: %s",
-__entry->rip, __entry->vmcb, __entry->nested_rip,
-__entry->int_ctl, __entry->event_inj,
-__entry->npt ? "on" : "off")
+TP_printk("rip: 0x%016llx %s: 0x%016llx nested_rip: 0x%016llx "
+"int_ctl: 0x%08x event_inj: 0x%08x nested_%s=%s %s: 0x%016llx",
+__entry->rip,
+__entry->isa == KVM_ISA_VMX ? "vmcs" : "vmcb",
+__entry->vmcb,
+__entry->nested_rip,
+__entry->int_ctl,
+__entry->event_inj,
+__entry->isa == KVM_ISA_VMX ? "ept" : "npt",
+__entry->tdp_enabled ? "y" : "n",
+!__entry->tdp_enabled ? "guest_cr3" :
+__entry->isa == KVM_ISA_VMX ? "nested_eptp" : "nested_cr3",
+__entry->guest_pgd)
 );

 TRACE_EVENT(kvm_nested_intercepts,
@@ -65,6 +65,7 @@ struct vmcs_config {
 u64 cpu_based_3rd_exec_ctrl;
 u32 vmexit_ctrl;
 u32 vmentry_ctrl;
+u64 misc;
 struct nested_vmx_msrs nested;
 };
 extern struct vmcs_config vmcs_config;
@@ -82,7 +83,8 @@ static inline bool cpu_has_vmx_basic_inout(void)

 static inline bool cpu_has_virtual_nmis(void)
 {
-return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS;
+return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
+vmcs_config.cpu_based_exec_ctrl & CPU_BASED_NMI_WINDOW_EXITING;
 }

 static inline bool cpu_has_vmx_preemption_timer(void)
@@ -224,11 +226,8 @@ static inline bool cpu_has_vmx_vmfunc(void)

 static inline bool cpu_has_vmx_shadow_vmcs(void)
 {
-u64 vmx_msr;

 /* check if the cpu supports writing r/o exit information fields */
-rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
-if (!(vmx_msr & MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS))
+if (!(vmcs_config.misc & MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS))
 return false;

 return vmcs_config.cpu_based_2nd_exec_ctrl &
@@ -370,10 +369,7 @@ static inline bool cpu_has_vmx_invvpid_global(void)

 static inline bool cpu_has_vmx_intel_pt(void)
 {
-u64 vmx_msr;
+return (vmcs_config.misc & MSR_IA32_VMX_MISC_INTEL_PT) &&

-rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
-return (vmx_msr & MSR_IA32_VMX_MISC_INTEL_PT) &&
 (vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_PT_USE_GPA) &&
 (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_RTIT_CTL);
 }
@@ -10,6 +10,8 @@
 #include "vmx.h"
 #include "trace.h"

+#define CC KVM_NESTED_VMENTER_CONSISTENCY_CHECK

 DEFINE_STATIC_KEY_FALSE(enable_evmcs);

 #define EVMCS1_OFFSET(x) offsetof(struct hv_enlightened_vmcs, x)
@@ -28,6 +30,8 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
 HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
 EVMCS1_FIELD(HOST_IA32_EFER, host_ia32_efer,
 HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+EVMCS1_FIELD(HOST_IA32_PERF_GLOBAL_CTRL, host_ia32_perf_global_ctrl,
+HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
 EVMCS1_FIELD(HOST_CR0, host_cr0,
 HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
 EVMCS1_FIELD(HOST_CR3, host_cr3,
@@ -78,6 +82,8 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
 HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
 EVMCS1_FIELD(GUEST_IA32_EFER, guest_ia32_efer,
 HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+EVMCS1_FIELD(GUEST_IA32_PERF_GLOBAL_CTRL, guest_ia32_perf_global_ctrl,
+HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
 EVMCS1_FIELD(GUEST_PDPTR0, guest_pdptr0,
 HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
 EVMCS1_FIELD(GUEST_PDPTR1, guest_pdptr1,
@@ -126,6 +132,28 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
 HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
 EVMCS1_FIELD(XSS_EXIT_BITMAP, xss_exit_bitmap,
 HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+EVMCS1_FIELD(ENCLS_EXITING_BITMAP, encls_exiting_bitmap,
+HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+EVMCS1_FIELD(TSC_MULTIPLIER, tsc_multiplier,
+HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+/*
+* Not used by KVM:
+*
+* EVMCS1_FIELD(0x00006828, guest_ia32_s_cet,
+* HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+* EVMCS1_FIELD(0x0000682A, guest_ssp,
+* HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_BASIC),
+* EVMCS1_FIELD(0x0000682C, guest_ia32_int_ssp_table_addr,
+* HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+* EVMCS1_FIELD(0x00002816, guest_ia32_lbr_ctl,
+* HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1),
+* EVMCS1_FIELD(0x00006C18, host_ia32_s_cet,
+* HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+* EVMCS1_FIELD(0x00006C1A, host_ssp,
+* HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+* EVMCS1_FIELD(0x00006C1C, host_ia32_int_ssp_table_addr,
+* HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
+*/

 /* 64 bit read only */
 EVMCS1_FIELD(GUEST_PHYSICAL_ADDRESS, guest_physical_address,
@@ -294,19 +322,6 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
 };
 const unsigned int nr_evmcs_1_fields = ARRAY_SIZE(vmcs_field_to_evmcs_1);

-#if IS_ENABLED(CONFIG_HYPERV)
-__init void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf)
-{
-vmcs_conf->cpu_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_EXEC_CTRL;
-vmcs_conf->pin_based_exec_ctrl &= ~EVMCS1_UNSUPPORTED_PINCTRL;
-vmcs_conf->cpu_based_2nd_exec_ctrl &= ~EVMCS1_UNSUPPORTED_2NDEXEC;
-vmcs_conf->cpu_based_3rd_exec_ctrl = 0;

-vmcs_conf->vmexit_ctrl &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
-vmcs_conf->vmentry_ctrl &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
-}
-#endif

 bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmcs_gpa)
 {
 struct hv_vp_assist_page assist_page;
@@ -334,6 +349,9 @@ uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu)
 * versions: lower 8 bits is the minimal version, higher 8 bits is the
 * maximum supported version. KVM supports versions from 1 to
 * KVM_EVMCS_VERSION.
+*
+* Note, do not check the Hyper-V is fully enabled in guest CPUID, this
+* helper is used to _get_ the vCPU's supported CPUID.
 */
 if (kvm_cpu_cap_get(X86_FEATURE_VMX) &&
 (!vcpu || to_vmx(vcpu)->nested.enlightened_vmcs_enabled))
@@ -342,10 +360,67 @@ uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu)
 return 0;
 }

-void nested_evmcs_filter_control_msr(u32 msr_index, u64 *pdata)
+enum evmcs_revision {
+EVMCSv1_LEGACY,
+NR_EVMCS_REVISIONS,
+};

+enum evmcs_ctrl_type {
+EVMCS_EXIT_CTRLS,
+EVMCS_ENTRY_CTRLS,
+EVMCS_2NDEXEC,
+EVMCS_PINCTRL,
+EVMCS_VMFUNC,
+NR_EVMCS_CTRLS,
+};

+static const u32 evmcs_unsupported_ctrls[NR_EVMCS_CTRLS][NR_EVMCS_REVISIONS] = {
+[EVMCS_EXIT_CTRLS] = {
+[EVMCSv1_LEGACY] = EVMCS1_UNSUPPORTED_VMEXIT_CTRL,
+},
+[EVMCS_ENTRY_CTRLS] = {
+[EVMCSv1_LEGACY] = EVMCS1_UNSUPPORTED_VMENTRY_CTRL,
+},
+[EVMCS_2NDEXEC] = {
+[EVMCSv1_LEGACY] = EVMCS1_UNSUPPORTED_2NDEXEC,
+},
+[EVMCS_PINCTRL] = {
+[EVMCSv1_LEGACY] = EVMCS1_UNSUPPORTED_PINCTRL,
+},
+[EVMCS_VMFUNC] = {
+[EVMCSv1_LEGACY] = EVMCS1_UNSUPPORTED_VMFUNC,
+},
+};

+static u32 evmcs_get_unsupported_ctls(enum evmcs_ctrl_type ctrl_type)
+{
+enum evmcs_revision evmcs_rev = EVMCSv1_LEGACY;

+return evmcs_unsupported_ctrls[ctrl_type][evmcs_rev];
+}

+static bool evmcs_has_perf_global_ctrl(struct kvm_vcpu *vcpu)
+{
+struct kvm_vcpu_hv *hv_vcpu = to_hv_vcpu(vcpu);

+/*
+* PERF_GLOBAL_CTRL has a quirk where some Windows guests may fail to
+* boot if a PV CPUID feature flag is not also set. Treat the fields
+* as unsupported if the flag is not set in guest CPUID. This should
+* be called only for guest accesses, and all guest accesses should be
+* gated on Hyper-V being enabled and initialized.
+*/
+if (WARN_ON_ONCE(!hv_vcpu))
+return false;

+return hv_vcpu->cpuid_cache.nested_ebx & HV_X64_NESTED_EVMCS1_PERF_GLOBAL_CTRL;
+}

+void nested_evmcs_filter_control_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata)
 {
 u32 ctl_low = (u32)*pdata;
 u32 ctl_high = (u32)(*pdata >> 32);
+u32 unsupported_ctrls;

 /*
 * Hyper-V 2016 and 2019 try using these features even when eVMCS
@@ -354,77 +429,70 @@ void nested_evmcs_filter_control_msr(u32 msr_index, u64 *pdata)
 switch (msr_index) {
 case MSR_IA32_VMX_EXIT_CTLS:
 case MSR_IA32_VMX_TRUE_EXIT_CTLS:
-ctl_high &= ~EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
+unsupported_ctrls = evmcs_get_unsupported_ctls(EVMCS_EXIT_CTRLS);
+if (!evmcs_has_perf_global_ctrl(vcpu))
+unsupported_ctrls |= VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+ctl_high &= ~unsupported_ctrls;
 break;
 case MSR_IA32_VMX_ENTRY_CTLS:
 case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
-ctl_high &= ~EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
+unsupported_ctrls = evmcs_get_unsupported_ctls(EVMCS_ENTRY_CTRLS);
+if (!evmcs_has_perf_global_ctrl(vcpu))
+unsupported_ctrls |= VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+ctl_high &= ~unsupported_ctrls;
 break;
 case MSR_IA32_VMX_PROCBASED_CTLS2:
-ctl_high &= ~EVMCS1_UNSUPPORTED_2NDEXEC;
+ctl_high &= ~evmcs_get_unsupported_ctls(EVMCS_2NDEXEC);
 break;
 case MSR_IA32_VMX_TRUE_PINBASED_CTLS:
 case MSR_IA32_VMX_PINBASED_CTLS:
-ctl_high &= ~EVMCS1_UNSUPPORTED_PINCTRL;
+ctl_high &= ~evmcs_get_unsupported_ctls(EVMCS_PINCTRL);
 break;
 case MSR_IA32_VMX_VMFUNC:
-ctl_low &= ~EVMCS1_UNSUPPORTED_VMFUNC;
+ctl_low &= ~evmcs_get_unsupported_ctls(EVMCS_VMFUNC);
 break;
 }

 *pdata = ctl_low | ((u64)ctl_high << 32);
 }

+static bool nested_evmcs_is_valid_controls(enum evmcs_ctrl_type ctrl_type,
+u32 val)
+{
+return !(val & evmcs_get_unsupported_ctls(ctrl_type));
+}

 int nested_evmcs_check_controls(struct vmcs12 *vmcs12)
 {
-int ret = 0;
-u32 unsupp_ctl;
+if (CC(!nested_evmcs_is_valid_controls(EVMCS_PINCTRL,
+vmcs12->pin_based_vm_exec_control)))
+return -EINVAL;

-unsupp_ctl = vmcs12->pin_based_vm_exec_control &
-EVMCS1_UNSUPPORTED_PINCTRL;
-if (unsupp_ctl) {
-trace_kvm_nested_vmenter_failed(
-"eVMCS: unsupported pin-based VM-execution controls",
-unsupp_ctl);
-ret = -EINVAL;
-}
+if (CC(!nested_evmcs_is_valid_controls(EVMCS_2NDEXEC,
+vmcs12->secondary_vm_exec_control)))
+return -EINVAL;

-unsupp_ctl = vmcs12->secondary_vm_exec_control &
-EVMCS1_UNSUPPORTED_2NDEXEC;
-if (unsupp_ctl) {
-trace_kvm_nested_vmenter_failed(
-"eVMCS: unsupported secondary VM-execution controls",
-unsupp_ctl);
-ret = -EINVAL;
-}
+if (CC(!nested_evmcs_is_valid_controls(EVMCS_EXIT_CTRLS,
+vmcs12->vm_exit_controls)))
+return -EINVAL;

-unsupp_ctl = vmcs12->vm_exit_controls &
-EVMCS1_UNSUPPORTED_VMEXIT_CTRL;
-if (unsupp_ctl) {
-trace_kvm_nested_vmenter_failed(
-"eVMCS: unsupported VM-exit controls",
-unsupp_ctl);
-ret = -EINVAL;
-}
+if (CC(!nested_evmcs_is_valid_controls(EVMCS_ENTRY_CTRLS,
+vmcs12->vm_entry_controls)))
+return -EINVAL;

-unsupp_ctl = vmcs12->vm_entry_controls &
-EVMCS1_UNSUPPORTED_VMENTRY_CTRL;
-if (unsupp_ctl) {
-trace_kvm_nested_vmenter_failed(
-"eVMCS: unsupported VM-entry controls",
-unsupp_ctl);
-ret = -EINVAL;
-}
+/*
+* VM-Func controls are 64-bit, but KVM currently doesn't support any
+* controls in bits 63:32, i.e. dropping those bits on the consistency
+* check is intentional.
+*/
+if (WARN_ON_ONCE(vmcs12->vm_function_control >> 32))
+return -EINVAL;

-unsupp_ctl = vmcs12->vm_function_control & EVMCS1_UNSUPPORTED_VMFUNC;
-if (unsupp_ctl) {
-trace_kvm_nested_vmenter_failed(
-"eVMCS: unsupported VM-function controls",
-unsupp_ctl);
-ret = -EINVAL;
-}
+if (CC(!nested_evmcs_is_valid_controls(EVMCS_VMFUNC,
+vmcs12->vm_function_control)))
+return -EINVAL;

-return ret;
+return 0;
 }

 int nested_enable_evmcs(struct kvm_vcpu *vcpu,
@@ -42,8 +42,6 @@ DECLARE_STATIC_KEY_FALSE(enable_evmcs);
 * PLE_GAP = 0x00004020,
 * PLE_WINDOW = 0x00004022,
 * VMX_PREEMPTION_TIMER_VALUE = 0x0000482E,
-* GUEST_IA32_PERF_GLOBAL_CTRL = 0x00002808,
-* HOST_IA32_PERF_GLOBAL_CTRL = 0x00002c04,
 *
 * Currently unsupported in KVM:
 * GUEST_IA32_RTIT_CTL = 0x00002814,
@@ -61,9 +59,8 @@ DECLARE_STATIC_KEY_FALSE(enable_evmcs);
 SECONDARY_EXEC_TSC_SCALING | \
 SECONDARY_EXEC_PAUSE_LOOP_EXITING)
 #define EVMCS1_UNSUPPORTED_VMEXIT_CTRL \
-(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | \
-VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)
-#define EVMCS1_UNSUPPORTED_VMENTRY_CTRL (VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL)
+(VM_EXIT_SAVE_VMX_PREEMPTION_TIMER)
+#define EVMCS1_UNSUPPORTED_VMENTRY_CTRL (0)
 #define EVMCS1_UNSUPPORTED_VMFUNC (VMX_VMFUNC_EPTP_SWITCHING)

 struct evmcs_field {
@@ -212,7 +209,6 @@ static inline void evmcs_load(u64 phys_addr)
 vp_ap->enlighten_vmentry = 1;
 }

-__init void evmcs_sanitize_exec_ctrls(struct vmcs_config *vmcs_conf);
 #else /* !IS_ENABLED(CONFIG_HYPERV) */
 static __always_inline void evmcs_write64(unsigned long field, u64 value) {}
 static inline void evmcs_write32(unsigned long field, u32 value) {}
@@ -243,7 +239,7 @@ bool nested_enlightened_vmentry(struct kvm_vcpu *vcpu, u64 *evmcs_gpa);
 uint16_t nested_get_evmcs_version(struct kvm_vcpu *vcpu);
 int nested_enable_evmcs(struct kvm_vcpu *vcpu,
 uint16_t *vmcs_version);
-void nested_evmcs_filter_control_msr(u32 msr_index, u64 *pdata);
+void nested_evmcs_filter_control_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *pdata);
 int nested_evmcs_check_controls(struct vmcs12 *vmcs12);

 #endif /* __KVM_X86_VMX_EVMCS_H */
@@ -439,61 +439,22 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
 return inequality ^ bit;
 }

-/*
-* KVM wants to inject page-faults which it got to the guest. This function
-* checks whether in a nested guest, we need to inject them to L1 or L2.
-*/
-static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
-{
-struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-unsigned int nr = vcpu->arch.exception.nr;
-bool has_payload = vcpu->arch.exception.has_payload;
-unsigned long payload = vcpu->arch.exception.payload;
+static bool nested_vmx_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
+u32 error_code)

-if (nr == PF_VECTOR) {
-if (vcpu->arch.exception.nested_apf) {
-*exit_qual = vcpu->arch.apf.nested_apf_token;
-return 1;
-}
-if (nested_vmx_is_page_fault_vmexit(vmcs12,
-vcpu->arch.exception.error_code)) {
-*exit_qual = has_payload ? payload : vcpu->arch.cr2;
-return 1;
-}
-} else if (vmcs12->exception_bitmap & (1u << nr)) {
-if (nr == DB_VECTOR) {
-if (!has_payload) {
-payload = vcpu->arch.dr6;
-payload &= ~DR6_BT;
-payload ^= DR6_ACTIVE_LOW;
-}
-*exit_qual = payload;
-} else
-*exit_qual = 0;
-return 1;
-}

-return 0;
-}

-static bool nested_vmx_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
-struct x86_exception *fault)
 {
 struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

-WARN_ON(!is_guest_mode(vcpu));
+/*
+* Drop bits 31:16 of the error code when performing the #PF mask+match
+* check. All VMCS fields involved are 32 bits, but Intel CPUs never
+* set bits 31:16 and VMX disallows setting bits 31:16 in the injected
+* error code. Including the to-be-dropped bits in the check might
+* result in an "impossible" or missed exit from L1's perspective.
+*/
+if (vector == PF_VECTOR)
+return nested_vmx_is_page_fault_vmexit(vmcs12, (u16)error_code);

-if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
-!WARN_ON_ONCE(to_vmx(vcpu)->nested.nested_run_pending)) {
-vmcs12->vm_exit_intr_error_code = fault->error_code;
-nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
-PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
-INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
-fault->address);
-return true;
-}
-return false;
+return (vmcs12->exception_bitmap & (1u << vector));
 }

 static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
@@ -1607,6 +1568,10 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 vmcs12->guest_rflags = evmcs->guest_rflags;
 vmcs12->guest_interruptibility_info =
 evmcs->guest_interruptibility_info;
+/*
+* Not present in struct vmcs12:
+* vmcs12->guest_ssp = evmcs->guest_ssp;
+*/
 }

 if (unlikely(!(hv_clean_fields &
@@ -1653,6 +1618,13 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 vmcs12->host_fs_selector = evmcs->host_fs_selector;
 vmcs12->host_gs_selector = evmcs->host_gs_selector;
 vmcs12->host_tr_selector = evmcs->host_tr_selector;
+vmcs12->host_ia32_perf_global_ctrl = evmcs->host_ia32_perf_global_ctrl;
+/*
+* Not present in struct vmcs12:
+* vmcs12->host_ia32_s_cet = evmcs->host_ia32_s_cet;
+* vmcs12->host_ssp = evmcs->host_ssp;
+* vmcs12->host_ia32_int_ssp_table_addr = evmcs->host_ia32_int_ssp_table_addr;
+*/
 }

 if (unlikely(!(hv_clean_fields &
@@ -1720,6 +1692,8 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 vmcs12->tsc_offset = evmcs->tsc_offset;
 vmcs12->virtual_apic_page_addr = evmcs->virtual_apic_page_addr;
 vmcs12->xss_exit_bitmap = evmcs->xss_exit_bitmap;
+vmcs12->encls_exiting_bitmap = evmcs->encls_exiting_bitmap;
+vmcs12->tsc_multiplier = evmcs->tsc_multiplier;
 }

 if (unlikely(!(hv_clean_fields &
@@ -1767,6 +1741,13 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 vmcs12->guest_bndcfgs = evmcs->guest_bndcfgs;
 vmcs12->guest_activity_state = evmcs->guest_activity_state;
 vmcs12->guest_sysenter_cs = evmcs->guest_sysenter_cs;
+vmcs12->guest_ia32_perf_global_ctrl = evmcs->guest_ia32_perf_global_ctrl;
+/*
+* Not present in struct vmcs12:
+* vmcs12->guest_ia32_s_cet = evmcs->guest_ia32_s_cet;
+* vmcs12->guest_ia32_lbr_ctl = evmcs->guest_ia32_lbr_ctl;
+* vmcs12->guest_ia32_int_ssp_table_addr = evmcs->guest_ia32_int_ssp_table_addr;
+*/
 }

 /*
@@ -1869,12 +1850,23 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
 * evmcs->vm_exit_msr_store_count = vmcs12->vm_exit_msr_store_count;
 * evmcs->vm_exit_msr_load_count = vmcs12->vm_exit_msr_load_count;
 * evmcs->vm_entry_msr_load_count = vmcs12->vm_entry_msr_load_count;
+* evmcs->guest_ia32_perf_global_ctrl = vmcs12->guest_ia32_perf_global_ctrl;
+* evmcs->host_ia32_perf_global_ctrl = vmcs12->host_ia32_perf_global_ctrl;
+* evmcs->encls_exiting_bitmap = vmcs12->encls_exiting_bitmap;
+* evmcs->tsc_multiplier = vmcs12->tsc_multiplier;
 *
 * Not present in struct vmcs12:
 * evmcs->exit_io_instruction_ecx = vmcs12->exit_io_instruction_ecx;
 * evmcs->exit_io_instruction_esi = vmcs12->exit_io_instruction_esi;
 * evmcs->exit_io_instruction_edi = vmcs12->exit_io_instruction_edi;
 * evmcs->exit_io_instruction_eip = vmcs12->exit_io_instruction_eip;
+* evmcs->host_ia32_s_cet = vmcs12->host_ia32_s_cet;
+* evmcs->host_ssp = vmcs12->host_ssp;
+* evmcs->host_ia32_int_ssp_table_addr = vmcs12->host_ia32_int_ssp_table_addr;
+* evmcs->guest_ia32_s_cet = vmcs12->guest_ia32_s_cet;
+* evmcs->guest_ia32_lbr_ctl = vmcs12->guest_ia32_lbr_ctl;
+* evmcs->guest_ia32_int_ssp_table_addr = vmcs12->guest_ia32_int_ssp_table_addr;
+* evmcs->guest_ssp = vmcs12->guest_ssp;
 */

 evmcs->guest_es_selector = vmcs12->guest_es_selector;
@@ -1982,7 +1974,7 @@ static enum nested_evmptrld_status nested_vmx_handle_enlightened_vmptrld(
 bool evmcs_gpa_changed = false;
 u64 evmcs_gpa;

-if (likely(!vmx->nested.enlightened_vmcs_enabled))
+if (likely(!guest_cpuid_has_evmcs(vcpu)))
 return EVMPTRLD_DISABLED;

 if (!nested_enlightened_vmentry(vcpu, &evmcs_gpa)) {
@@ -2328,9 +2320,14 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 * are emulated by vmx_set_efer() in prepare_vmcs02(), but speculate
 * on the related bits (if supported by the CPU) in the hope that
 * we can avoid VMWrites during vmx_set_efer().
+*
+* Similarly, take vmcs01's PERF_GLOBAL_CTRL in the hope that if KVM is
+* loading PERF_GLOBAL_CTRL via the VMCS for L1, then KVM will want to
+* do the same for L2.
 */
 exec_control = __vm_entry_controls_get(vmcs01);
-exec_control |= vmcs12->vm_entry_controls;
+exec_control |= (vmcs12->vm_entry_controls &
+~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
 exec_control &= ~(VM_ENTRY_IA32E_MODE | VM_ENTRY_LOAD_IA32_EFER);
 if (cpu_has_load_ia32_efer()) {
 if (guest_efer & EFER_LMA)
@@ -2863,7 +2860,7 @@ static int nested_vmx_check_controls(struct kvm_vcpu *vcpu,
 nested_check_vm_entry_controls(vcpu, vmcs12))
 return -EINVAL;

-if (to_vmx(vcpu)->nested.enlightened_vmcs_enabled)
+if (guest_cpuid_has_evmcs(vcpu))
 return nested_evmcs_check_controls(vmcs12);

 return 0;
@@ -3145,7 +3142,7 @@ static bool nested_get_evmcs_page(struct kvm_vcpu *vcpu)
 * L2 was running), map it here to make sure vmcs12 changes are
 * properly reflected.
 */
-if (vmx->nested.enlightened_vmcs_enabled &&
+if (guest_cpuid_has_evmcs(vcpu) &&
 vmx->nested.hv_evmcs_vmptr == EVMPTR_MAP_PENDING) {
 enum nested_evmptrld_status evmptrld_status =
 nested_vmx_handle_enlightened_vmptrld(vcpu, false);
@ -3364,12 +3361,24 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
|
|||||||
};
|
};
|
||||||
u32 failed_index;
|
u32 failed_index;
|
||||||
|
|
||||||
|
trace_kvm_nested_vmenter(kvm_rip_read(vcpu),
|
||||||
|
vmx->nested.current_vmptr,
|
||||||
|
vmcs12->guest_rip,
|
||||||
|
vmcs12->guest_intr_status,
|
||||||
|
vmcs12->vm_entry_intr_info_field,
|
||||||
|
vmcs12->secondary_vm_exec_control & SECONDARY_EXEC_ENABLE_EPT,
|
||||||
|
vmcs12->ept_pointer,
|
||||||
|
vmcs12->guest_cr3,
|
||||||
|
KVM_ISA_VMX);
|
||||||
|
|
||||||
kvm_service_local_tlb_flush_requests(vcpu);
|
kvm_service_local_tlb_flush_requests(vcpu);
|
||||||
|
|
||||||
evaluate_pending_interrupts = exec_controls_get(vmx) &
|
evaluate_pending_interrupts = exec_controls_get(vmx) &
|
||||||
(CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING);
|
(CPU_BASED_INTR_WINDOW_EXITING | CPU_BASED_NMI_WINDOW_EXITING);
|
||||||
if (likely(!evaluate_pending_interrupts) && kvm_vcpu_apicv_active(vcpu))
|
if (likely(!evaluate_pending_interrupts) && kvm_vcpu_apicv_active(vcpu))
|
||||||
evaluate_pending_interrupts |= vmx_has_apicv_interrupt(vcpu);
|
evaluate_pending_interrupts |= vmx_has_apicv_interrupt(vcpu);
|
||||||
|
if (!evaluate_pending_interrupts)
|
||||||
|
evaluate_pending_interrupts |= kvm_apic_has_pending_init_or_sipi(vcpu);
|
||||||
|
|
||||||
if (!vmx->nested.nested_run_pending ||
|
if (!vmx->nested.nested_run_pending ||
|
||||||
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
|
!(vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS))
|
||||||
@ -3450,18 +3459,10 @@ enum nvmx_vmentry_status nested_vmx_enter_non_root_mode(struct kvm_vcpu *vcpu,
|
|||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* If L1 had a pending IRQ/NMI until it executed
|
* Re-evaluate pending events if L1 had a pending IRQ/NMI/INIT/SIPI
|
||||||
* VMLAUNCH/VMRESUME which wasn't delivered because it was
|
* when it executed VMLAUNCH/VMRESUME, as entering non-root mode can
|
||||||
* disallowed (e.g. interrupts disabled), L0 needs to
|
* effectively unblock various events, e.g. INIT/SIPI cause VM-Exit
|
||||||
* evaluate if this pending event should cause an exit from L2
|
* unconditionally.
|
||||||
* to L1 or delivered directly to L2 (e.g. In case L1 don't
|
|
||||||
* intercept EXTERNAL_INTERRUPT).
|
|
||||||
*
|
|
||||||
* Usually this would be handled by the processor noticing an
|
|
||||||
* IRQ/NMI window request, or checking RVI during evaluation of
|
|
||||||
* pending virtual interrupts. However, this setting was done
|
|
||||||
* on VMCS01 and now VMCS02 is active instead. Thus, we force L0
|
|
||||||
* to perform pending event evaluation by requesting a KVM_REQ_EVENT.
|
|
||||||
*/
|
*/
|
||||||
if (unlikely(evaluate_pending_interrupts))
|
if (unlikely(evaluate_pending_interrupts))
|
||||||
kvm_make_request(KVM_REQ_EVENT, vcpu);
|
kvm_make_request(KVM_REQ_EVENT, vcpu);
|
||||||
@ -3718,7 +3719,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
|
|||||||
is_double_fault(exit_intr_info))) {
|
is_double_fault(exit_intr_info))) {
|
||||||
vmcs12->idt_vectoring_info_field = 0;
|
vmcs12->idt_vectoring_info_field = 0;
|
||||||
} else if (vcpu->arch.exception.injected) {
|
} else if (vcpu->arch.exception.injected) {
|
||||||
nr = vcpu->arch.exception.nr;
|
nr = vcpu->arch.exception.vector;
|
||||||
idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
|
idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
|
||||||
|
|
||||||
if (kvm_exception_is_soft(nr)) {
|
if (kvm_exception_is_soft(nr)) {
|
||||||
@ -3819,19 +3820,40 @@ mmio_needed:
|
|||||||
return -ENXIO;
|
return -ENXIO;
|
||||||
}
|
}
|
||||||
|
|
||||||
static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
|
static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu)
|
||||||
unsigned long exit_qual)
|
|
||||||
{
|
{
|
||||||
|
struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
|
||||||
|
u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
|
||||||
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
|
struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
|
||||||
unsigned int nr = vcpu->arch.exception.nr;
|
unsigned long exit_qual;
|
||||||
u32 intr_info = nr | INTR_INFO_VALID_MASK;
|
|
||||||
|
|
||||||
if (vcpu->arch.exception.has_error_code) {
|
if (ex->has_payload) {
|
||||||
vmcs12->vm_exit_intr_error_code = vcpu->arch.exception.error_code;
|
exit_qual = ex->payload;
|
||||||
|
} else if (ex->vector == PF_VECTOR) {
|
||||||
|
exit_qual = vcpu->arch.cr2;
|
||||||
|
} else if (ex->vector == DB_VECTOR) {
|
||||||
|
exit_qual = vcpu->arch.dr6;
|
||||||
|
exit_qual &= ~DR6_BT;
|
||||||
|
exit_qual ^= DR6_ACTIVE_LOW;
|
||||||
|
} else {
|
||||||
|
exit_qual = 0;
|
||||||
|
}
|
||||||
|
|
||||||
|
if (ex->has_error_code) {
|
||||||
|
/*
|
||||||
|
* Intel CPUs do not generate error codes with bits 31:16 set,
|
||||||
|
* and more importantly VMX disallows setting bits 31:16 in the
|
||||||
|
* injected error code for VM-Entry. Drop the bits to mimic
|
||||||
|
* hardware and avoid inducing failure on nested VM-Entry if L1
|
||||||
|
* chooses to inject the exception back to L2. AMD CPUs _do_
|
||||||
|
* generate "full" 32-bit error codes, so KVM allows userspace
|
||||||
|
* to inject exception error codes with bits 31:16 set.
|
||||||
|
*/
|
||||||
|
vmcs12->vm_exit_intr_error_code = (u16)ex->error_code;
|
||||||
intr_info |= INTR_INFO_DELIVER_CODE_MASK;
|
intr_info |= INTR_INFO_DELIVER_CODE_MASK;
|
||||||
}
|
}
|
||||||
|
|
||||||
if (kvm_exception_is_soft(nr))
|
if (kvm_exception_is_soft(ex->vector))
|
||||||
intr_info |= INTR_TYPE_SOFT_EXCEPTION;
|
intr_info |= INTR_TYPE_SOFT_EXCEPTION;
|
||||||
else
|
else
|
||||||
intr_info |= INTR_TYPE_HARD_EXCEPTION;
|
intr_info |= INTR_TYPE_HARD_EXCEPTION;
|
||||||
@ -3844,16 +3866,39 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
|
|||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
* Returns true if a debug trap is pending delivery.
|
* Returns true if a debug trap is (likely) pending delivery. Infer the class
|
||||||
|
* of a #DB (trap-like vs. fault-like) from the exception payload (to-be-DR6).
|
||||||
|
* Using the payload is flawed because code breakpoints (fault-like) and data
|
||||||
|
* breakpoints (trap-like) set the same bits in DR6 (breakpoint detected), i.e.
|
||||||
|
* this will return false positives if a to-be-injected code breakpoint #DB is
|
||||||
|
* pending (from KVM's perspective, but not "pending" across an instruction
|
||||||
|
* boundary). ICEBP, a.k.a. INT1, is also not reflected here even though it
|
||||||
|
* too is trap-like.
|
||||||
*
|
*
|
||||||
* In KVM, debug traps bear an exception payload. As such, the class of a #DB
|
* KVM "works" despite these flaws as ICEBP isn't currently supported by the
|
||||||
* exception may be inferred from the presence of an exception payload.
|
* emulator, Monitor Trap Flag is not marked pending on intercepted #DBs (the
|
||||||
|
* #DB has already happened), and MTF isn't marked pending on code breakpoints
|
||||||
|
* from the emulator (because such #DBs are fault-like and thus don't trigger
|
||||||
|
* actions that fire on instruction retire).
|
||||||
*/
|
*/
|
||||||
static inline bool vmx_pending_dbg_trap(struct kvm_vcpu *vcpu)
|
static unsigned long vmx_get_pending_dbg_trap(struct kvm_queued_exception *ex)
|
||||||
{
|
{
|
||||||
return vcpu->arch.exception.pending &&
|
if (!ex->pending || ex->vector != DB_VECTOR)
|
||||||
vcpu->arch.exception.nr == DB_VECTOR &&
|
return 0;
|
||||||
vcpu->arch.exception.payload;
|
|
||||||
|
/* General Detect #DBs are always fault-like. */
|
||||||
|
return ex->payload & ~DR6_BD;
|
||||||
|
}
|
||||||
|
|
||||||
|
/*
|
||||||
|
* Returns true if there's a pending #DB exception that is lower priority than
|
||||||
|
* a pending Monitor Trap Flag VM-Exit. TSS T-flag #DBs are not emulated by
|
||||||
|
* KVM, but could theoretically be injected by userspace. Note, this code is
|
||||||
|
* imperfect, see above.
|
||||||
|
*/
|
||||||
|
static bool vmx_is_low_priority_db_trap(struct kvm_queued_exception *ex)
|
||||||
|
{
|
||||||
|
return vmx_get_pending_dbg_trap(ex) & ~DR6_BT;
|
||||||
}
|
}
|
||||||
|
|
||||||
/*
|
/*
|
||||||
@ -3865,9 +3910,11 @@ static inline bool vmx_pending_dbg_trap(struct kvm_vcpu *vcpu)
|
|||||||
*/
|
*/
|
||||||
static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu)
|
static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu)
|
||||||
{
|
{
|
||||||
if (vmx_pending_dbg_trap(vcpu))
|
unsigned long pending_dbg;
|
||||||
vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
|
|
||||||
vcpu->arch.exception.payload);
|
pending_dbg = vmx_get_pending_dbg_trap(&vcpu->arch.exception);
|
||||||
|
if (pending_dbg)
|
||||||
|
vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, pending_dbg);
|
||||||
}
|
}
|
||||||
|
|
||||||
static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
|
static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
|
||||||
@@ -3876,21 +3923,113 @@ static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
            to_vmx(vcpu)->nested.preemption_timer_expired;
 }

+static bool vmx_has_nested_events(struct kvm_vcpu *vcpu)
+{
+    return nested_vmx_preemption_timer_pending(vcpu) ||
+           to_vmx(vcpu)->nested.mtf_pending;
+}
+
+/*
+ * Per the Intel SDM's table "Priority Among Concurrent Events", with minor
+ * edits to fill in missing examples, e.g. #DB due to split-lock accesses,
+ * and less minor edits to splice in the priority of VMX Non-Root specific
+ * events, e.g. MTF and NMI/INTR-window exiting.
+ *
+ * 1 Hardware Reset and Machine Checks
+ *    - RESET
+ *    - Machine Check
+ *
+ * 2 Trap on Task Switch
+ *    - T flag in TSS is set (on task switch)
+ *
+ * 3 External Hardware Interventions
+ *    - FLUSH
+ *    - STOPCLK
+ *    - SMI
+ *    - INIT
+ *
+ * 3.5 Monitor Trap Flag (MTF) VM-exit[1]
+ *
+ * 4 Traps on Previous Instruction
+ *    - Breakpoints
+ *    - Trap-class Debug Exceptions (#DB due to TF flag set, data/I-O
+ *      breakpoint, or #DB due to a split-lock access)
+ *
+ * 4.3 VMX-preemption timer expired VM-exit
+ *
+ * 4.6 NMI-window exiting VM-exit[2]
+ *
+ * 5 Nonmaskable Interrupts (NMI)
+ *
+ * 5.5 Interrupt-window exiting VM-exit and Virtual-interrupt delivery
+ *
+ * 6 Maskable Hardware Interrupts
+ *
+ * 7 Code Breakpoint Fault
+ *
+ * 8 Faults from Fetching Next Instruction
+ *    - Code-Segment Limit Violation
+ *    - Code Page Fault
+ *    - Control protection exception (missing ENDBRANCH at target of indirect
+ *      call or jump)
+ *
+ * 9 Faults from Decoding Next Instruction
+ *    - Instruction length > 15 bytes
+ *    - Invalid Opcode
+ *    - Coprocessor Not Available
+ *
+ *10 Faults on Executing Instruction
+ *    - Overflow
+ *    - Bound error
+ *    - Invalid TSS
+ *    - Segment Not Present
+ *    - Stack fault
+ *    - General Protection
+ *    - Data Page Fault
+ *    - Alignment Check
+ *    - x86 FPU Floating-point exception
+ *    - SIMD floating-point exception
+ *    - Virtualization exception
+ *    - Control protection exception
+ *
+ * [1] Per the "Monitor Trap Flag" section: System-management interrupts (SMIs),
+ *     INIT signals, and higher priority events take priority over MTF VM exits.
+ *     MTF VM exits take priority over debug-trap exceptions and lower priority
+ *     events.
+ *
+ * [2] Debug-trap exceptions and higher priority events take priority over VM exits
+ *     caused by the VMX-preemption timer.  VM exits caused by the VMX-preemption
+ *     timer take priority over VM exits caused by the "NMI-window exiting"
+ *     VM-execution control and lower priority events.
+ *
+ * [3] Debug-trap exceptions and higher priority events take priority over VM exits
+ *     caused by "NMI-window exiting".  VM exits caused by this control take
+ *     priority over non-maskable interrupts (NMIs) and lower priority events.
+ *
+ * [4] Virtual-interrupt delivery has the same priority as that of VM exits due to
+ *     the 1-setting of the "interrupt-window exiting" VM-execution control.  Thus,
+ *     non-maskable interrupts (NMIs) and higher priority events take priority over
+ *     delivery of a virtual interrupt; delivery of a virtual interrupt takes
+ *     priority over external interrupts and lower priority events.
+ */
 static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 {
-    struct vcpu_vmx *vmx = to_vmx(vcpu);
-    unsigned long exit_qual;
-    bool block_nested_events =
-        vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
-    bool mtf_pending = vmx->nested.mtf_pending;
     struct kvm_lapic *apic = vcpu->arch.apic;
+    struct vcpu_vmx *vmx = to_vmx(vcpu);
     /*
-     * Clear the MTF state. If a higher priority VM-exit is delivered first,
-     * this state is discarded.
+     * Only a pending nested run blocks a pending exception.  If there is a
+     * previously injected event, the pending exception occurred while said
+     * event was being delivered and thus needs to be handled.
      */
-    if (!block_nested_events)
-        vmx->nested.mtf_pending = false;
+    bool block_nested_exceptions = vmx->nested.nested_run_pending;
+    /*
+     * New events (not exceptions) are only recognized at instruction
+     * boundaries.  If an event needs reinjection, then KVM is handling a
+     * VM-Exit that occurred _during_ instruction execution; new events are
+     * blocked until the instruction completes.
+     */
+    bool block_nested_events = block_nested_exceptions ||
+                               kvm_event_needs_reinjection(vcpu);

     if (lapic_in_kernel(vcpu) &&
         test_bit(KVM_APIC_INIT, &apic->pending_events)) {
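The long priority comment above boils down to one pattern in vmx_check_nested_events(): walk the candidates from highest to lowest priority, and return -EBUSY when the highest-priority pending event is currently blocked so it is retried at the next instruction boundary. A hypothetical, stand-alone sketch of that shape follows; none of these types or callbacks exist in KVM, they are placeholders.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct vcpu;    /* placeholder */

struct candidate {
    bool (*pending)(struct vcpu *v);
    bool (*blocked)(struct vcpu *v);
    int  (*deliver)(struct vcpu *v);    /* e.g. synthesize a nested VM-Exit */
};

/* Candidates are ordered highest priority first, mirroring the list above. */
static int check_events(struct vcpu *v, const struct candidate *c, size_t nr)
{
    for (size_t i = 0; i < nr; i++) {
        if (!c[i].pending(v))
            continue;
        if (c[i].blocked(v))
            return -EBUSY;    /* retry at the next instruction boundary */
        return c[i].deliver(v);
    }
    return 0;
}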
@@ -3900,6 +4039,9 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
         clear_bit(KVM_APIC_INIT, &apic->pending_events);
         if (vcpu->arch.mp_state != KVM_MP_STATE_INIT_RECEIVED)
             nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
+
+        /* MTF is discarded if the vCPU is in WFS. */
+        vmx->nested.mtf_pending = false;
         return 0;
     }

@@ -3909,31 +4051,41 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
             return -EBUSY;

         clear_bit(KVM_APIC_SIPI, &apic->pending_events);
-        if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED)
+        if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
             nested_vmx_vmexit(vcpu, EXIT_REASON_SIPI_SIGNAL, 0,
                               apic->sipi_vector & 0xFFUL);
             return 0;
+        }
+        /* Fallthrough, the SIPI is completely ignored. */
     }

     /*
-     * Process any exceptions that are not debug traps before MTF.
+     * Process exceptions that are higher priority than Monitor Trap Flag:
+     * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
+     * could theoretically come in from userspace), and ICEBP (INT1).
      *
-     * Note that only a pending nested run can block a pending exception.
-     * Otherwise an injected NMI/interrupt should either be
-     * lost or delivered to the nested hypervisor in the IDT_VECTORING_INFO,
-     * while delivering the pending exception.
+     * TODO: SMIs have higher priority than MTF and trap-like #DBs (except
+     * for TSS T flag #DBs).  KVM also doesn't save/restore pending MTF
+     * across SMI/RSM as it should; that needs to be addressed in order to
+     * prioritize SMI over MTF and trap-like #DBs.
      */
-    if (vcpu->arch.exception.pending && !vmx_pending_dbg_trap(vcpu)) {
-        if (vmx->nested.nested_run_pending)
+    if (vcpu->arch.exception_vmexit.pending &&
+        !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
+        if (block_nested_exceptions)
             return -EBUSY;
-        if (!nested_vmx_check_exception(vcpu, &exit_qual))
-            goto no_vmexit;
-        nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
+
+        nested_vmx_inject_exception_vmexit(vcpu);
         return 0;
     }

-    if (mtf_pending) {
+    if (vcpu->arch.exception.pending &&
+        !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
+        if (block_nested_exceptions)
+            return -EBUSY;
+        goto no_vmexit;
+    }
+
+    if (vmx->nested.mtf_pending) {
         if (block_nested_events)
             return -EBUSY;
         nested_vmx_update_pending_dbg(vcpu);
@@ -3941,15 +4093,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
         return 0;
     }

-    if (vcpu->arch.exception.pending) {
-        if (vmx->nested.nested_run_pending)
+    if (vcpu->arch.exception_vmexit.pending) {
+        if (block_nested_exceptions)
             return -EBUSY;
-        if (!nested_vmx_check_exception(vcpu, &exit_qual))
-            goto no_vmexit;
-        nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
+
+        nested_vmx_inject_exception_vmexit(vcpu);
         return 0;
     }

+    if (vcpu->arch.exception.pending) {
+        if (block_nested_exceptions)
+            return -EBUSY;
+        goto no_vmexit;
+    }
+
     if (nested_vmx_preemption_timer_pending(vcpu)) {
         if (block_nested_events)
             return -EBUSY;
@@ -4255,14 +4412,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
             nested_vmx_abort(vcpu,
                              VMX_ABORT_SAVE_GUEST_MSR_FAIL);
     }
-
-    /*
-     * Drop what we picked up for L2 via vmx_complete_interrupts. It is
-     * preserved above and would only end up incorrectly in L1.
-     */
-    vcpu->arch.nmi_injected = false;
-    kvm_clear_exception_queue(vcpu);
-    kvm_clear_interrupt_queue(vcpu);
 }

@@ -4538,6 +4687,9 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
     struct vcpu_vmx *vmx = to_vmx(vcpu);
     struct vmcs12 *vmcs12 = get_vmcs12(vcpu);

+    /* Pending MTF traps are discarded on VM-Exit. */
+    vmx->nested.mtf_pending = false;
+
     /* trying to cancel vmlaunch/vmresume is a bug */
     WARN_ON_ONCE(vmx->nested.nested_run_pending);

@@ -4602,6 +4754,17 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
         WARN_ON_ONCE(nested_early_check);
     }

+    /*
+     * Drop events/exceptions that were queued for re-injection to L2
+     * (picked up via vmx_complete_interrupts()), as well as exceptions
+     * that were pending for L2.  Note, this must NOT be hoisted above
+     * prepare_vmcs12(), events/exceptions queued for re-injection need to
+     * be captured in vmcs12 (see vmcs12_save_pending_event()).
+     */
+    vcpu->arch.nmi_injected = false;
+    kvm_clear_exception_queue(vcpu);
+    kvm_clear_interrupt_queue(vcpu);
+
     vmx_switch_vmcs(vcpu, &vmx->vmcs01);

     /* Update any VMCS fields that might have changed while L2 ran */
@@ -5030,8 +5193,8 @@ static int handle_vmxoff(struct kvm_vcpu *vcpu)
     free_nested(vcpu);

-    /* Process a latched INIT during time CPU was in VMX operation */
-    kvm_make_request(KVM_REQ_EVENT, vcpu);
+    if (kvm_apic_has_pending_init_or_sipi(vcpu))
+        kvm_make_request(KVM_REQ_EVENT, vcpu);

     return nested_vmx_succeed(vcpu);
 }

@@ -5067,7 +5230,7 @@ static int handle_vmclear(struct kvm_vcpu *vcpu)
      * state. It is possible that the area will stay mapped as
      * vmx->nested.hv_evmcs but this shouldn't be a problem.
      */
-    if (likely(!vmx->nested.enlightened_vmcs_enabled ||
+    if (likely(!guest_cpuid_has_evmcs(vcpu) ||
                !nested_enlightened_vmentry(vcpu, &evmcs_gpa))) {
         if (vmptr == vmx->nested.current_vmptr)
             nested_release_vmcs12(vcpu);

@@ -6463,6 +6626,9 @@ static int vmx_set_nested_state(struct kvm_vcpu *vcpu,
     if (ret)
         goto error_guest_mode;

+    if (vmx->nested.mtf_pending)
+        kvm_make_request(KVM_REQ_EVENT, vcpu);
+
     return 0;

 error_guest_mode:

@@ -6522,8 +6688,10 @@ static u64 nested_vmx_calc_vmcs_enum_msr(void)
  * bit in the high half is on if the corresponding bit in the control field
  * may be on. See also vmx_control_verify().
  */
-void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
+void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps)
 {
+    struct nested_vmx_msrs *msrs = &vmcs_conf->nested;
+
     /*
      * Note that as a general rule, the high half of the MSRs (bits in
      * the control fields which may be 1) should be initialized by the
@@ -6540,11 +6708,10 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
     /* pin-based controls */
-    rdmsr(MSR_IA32_VMX_PINBASED_CTLS,
-        msrs->pinbased_ctls_low,
-        msrs->pinbased_ctls_high);
-    msrs->pinbased_ctls_low |=
+    msrs->pinbased_ctls_low =
         PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR;
+
+    msrs->pinbased_ctls_high = vmcs_conf->pin_based_exec_ctrl;
     msrs->pinbased_ctls_high &=
         PIN_BASED_EXT_INTR_MASK |
         PIN_BASED_NMI_EXITING |
@@ -6555,50 +6722,47 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
         PIN_BASED_VMX_PREEMPTION_TIMER;

     /* exit controls */
-    rdmsr(MSR_IA32_VMX_EXIT_CTLS,
-        msrs->exit_ctls_low,
-        msrs->exit_ctls_high);
     msrs->exit_ctls_low =
         VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;

+    msrs->exit_ctls_high = vmcs_conf->vmexit_ctrl;
     msrs->exit_ctls_high &=
 #ifdef CONFIG_X86_64
         VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
         VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
-        VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+        VM_EXIT_CLEAR_BNDCFGS;
     msrs->exit_ctls_high |=
         VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
         VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
-        VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT;
+        VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT |
+        VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;

     /* We support free control of debug control saving. */
     msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;

     /* entry controls */
-    rdmsr(MSR_IA32_VMX_ENTRY_CTLS,
-        msrs->entry_ctls_low,
-        msrs->entry_ctls_high);
     msrs->entry_ctls_low =
         VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR;

+    msrs->entry_ctls_high = vmcs_conf->vmentry_ctrl;
     msrs->entry_ctls_high &=
 #ifdef CONFIG_X86_64
         VM_ENTRY_IA32E_MODE |
 #endif
-        VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
-        VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+        VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
     msrs->entry_ctls_high |=
-        (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER);
+        (VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
+         VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);

     /* We support free control of debug control loading. */
     msrs->entry_ctls_low &= ~VM_ENTRY_LOAD_DEBUG_CONTROLS;

     /* cpu-based controls */
-    rdmsr(MSR_IA32_VMX_PROCBASED_CTLS,
-        msrs->procbased_ctls_low,
-        msrs->procbased_ctls_high);
     msrs->procbased_ctls_low =
         CPU_BASED_ALWAYSON_WITHOUT_TRUE_MSR;

+    msrs->procbased_ctls_high = vmcs_conf->cpu_based_exec_ctrl;
     msrs->procbased_ctls_high &=
         CPU_BASED_INTR_WINDOW_EXITING |
         CPU_BASED_NMI_WINDOW_EXITING | CPU_BASED_USE_TSC_OFFSETTING |
@@ -6632,12 +6796,9 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
      * depend on CPUID bits, they are added later by
      * vmx_vcpu_after_set_cpuid.
      */
-    if (msrs->procbased_ctls_high & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS)
-        rdmsr(MSR_IA32_VMX_PROCBASED_CTLS2,
-            msrs->secondary_ctls_low,
-            msrs->secondary_ctls_high);
-
     msrs->secondary_ctls_low = 0;
+
+    msrs->secondary_ctls_high = vmcs_conf->cpu_based_2nd_exec_ctrl;
     msrs->secondary_ctls_high &=
         SECONDARY_EXEC_DESC |
         SECONDARY_EXEC_ENABLE_RDTSCP |
@@ -6717,10 +6878,7 @@ void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps)
         msrs->secondary_ctls_high |= SECONDARY_EXEC_ENCLS_EXITING;

     /* miscellaneous data */
-    rdmsr(MSR_IA32_VMX_MISC,
-        msrs->misc_low,
-        msrs->misc_high);
-    msrs->misc_low &= VMX_MISC_SAVE_EFER_LMA;
+    msrs->misc_low = (u32)vmcs_conf->misc & VMX_MISC_SAVE_EFER_LMA;
     msrs->misc_low |=
         MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS |
         VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE |
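For context on the msrs->*_ctls_low/high split being filled in above: a VMX capability MSR packs the allowed 0-settings (bits the guest must keep set) in its low 32 bits and the allowed 1-settings (bits the guest may set) in its high 32 bits, and the change above seeds the high half from the sanitized vmcs_config instead of re-reading the raw MSR. A small illustrative sketch with made-up bit values, not real VMX control bits:

#include <stdint.h>
#include <stdio.h>

/* Assemble a capability MSR image from a required mask and a supported mask. */
static uint64_t build_ctls_msr(uint32_t required, uint32_t supported)
{
    uint32_t low  = required;                /* allowed 0-settings */
    uint32_t high = required | supported;    /* allowed 1-settings */

    return (uint64_t)high << 32 | low;
}

int main(void)
{
    uint32_t required  = 0x00000016;    /* hypothetical */
    uint32_t supported = 0x0000409e;    /* hypothetical sanitized controls */

    printf("capability MSR: %#018llx\n",
           (unsigned long long)build_ctls_msr(required, supported));
    return 0;
}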
@@ -6814,9 +6972,9 @@ __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *))
 struct kvm_x86_nested_ops vmx_nested_ops = {
     .leave_nested = vmx_leave_nested,
+    .is_exception_vmexit = nested_vmx_is_exception_vmexit,
     .check_events = vmx_check_nested_events,
-    .handle_page_fault_workaround = nested_vmx_handle_page_fault_workaround,
-    .hv_timer_pending = nested_vmx_preemption_timer_pending,
+    .has_events = vmx_has_nested_events,
     .triple_fault = nested_vmx_triple_fault,
     .get_state = vmx_get_nested_state,
     .set_state = vmx_set_nested_state,

@@ -17,7 +17,7 @@ enum nvmx_vmentry_status {
 };

 void vmx_leave_nested(struct kvm_vcpu *vcpu);
-void nested_vmx_setup_ctls_msrs(struct nested_vmx_msrs *msrs, u32 ept_caps);
+void nested_vmx_setup_ctls_msrs(struct vmcs_config *vmcs_conf, u32 ept_caps);
 void nested_vmx_hardware_unsetup(void);
 __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *));
 void nested_vmx_set_vmcs_shadowing_bitmap(void);

@@ -129,7 +129,7 @@ static int sgx_inject_fault(struct kvm_vcpu *vcpu, gva_t gva, int trapnr)
         ex.address = gva;
         ex.error_code_valid = true;
         ex.nested_page_fault = false;
-        kvm_inject_page_fault(vcpu, &ex);
+        kvm_inject_emulated_page_fault(vcpu, &ex);
     } else {
         kvm_inject_gp(vcpu, 0);
     }

@@ -189,13 +189,16 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
     xor %ebx, %ebx

 .Lclear_regs:
+    /* Discard @regs.  The register is irrelevant, it just can't be RBX. */
+    pop %_ASM_AX
+
     /*
      * Clear all general purpose registers except RSP and RBX to prevent
      * speculative use of the guest's values, even those that are reloaded
      * via the stack.  In theory, an L1 cache miss when restoring registers
      * could lead to speculative execution with the guest's values.
      * Zeroing XORs are dirt cheap, i.e. the extra paranoia is essentially
-     * free.  RSP and RAX are exempt as RSP is restored by hardware during
+     * free.  RSP and RBX are exempt as RSP is restored by hardware during
      * VM-Exit and RBX is explicitly loaded with 0 or 1 to hold the return
      * value.
      */
@@ -216,9 +219,6 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
     xor %r15d, %r15d
 #endif

-    /* "POP" @regs. */
-    add $WORD_SIZE, %_ASM_SP
-
     /*
      * IMPORTANT: RSB filling and SPEC_CTRL handling must be done before
      * the first unbalanced RET after vmexit!
@@ -234,7 +234,6 @@ SYM_INNER_LABEL(vmx_vmexit, SYM_L_GLOBAL)
     FILL_RETURN_BUFFER %_ASM_CX, RSB_CLEAR_LOOPS, X86_FEATURE_RSB_VMEXIT,\
                        X86_FEATURE_RSB_VMEXIT_LITE
-
     pop %_ASM_ARG2    /* @flags */
     pop %_ASM_ARG1    /* @vmx */

@@ -293,22 +292,13 @@ SYM_FUNC_START(vmread_error_trampoline)
     push %r10
     push %r11
 #endif
-#ifdef CONFIG_X86_64
+
     /* Load @field and @fault to arg1 and arg2 respectively. */
-    mov 3*WORD_SIZE(%rbp), %_ASM_ARG2
-    mov 2*WORD_SIZE(%rbp), %_ASM_ARG1
-#else
-    /* Parameters are passed on the stack for 32-bit (see asmlinkage). */
-    push 3*WORD_SIZE(%ebp)
-    push 2*WORD_SIZE(%ebp)
-#endif
+    mov 3*WORD_SIZE(%_ASM_BP), %_ASM_ARG2
+    mov 2*WORD_SIZE(%_ASM_BP), %_ASM_ARG1

     call vmread_error

-#ifndef CONFIG_X86_64
-    add $8, %esp
-#endif
-
     /* Zero out @fault, which will be popped into the result register. */
     _ASM_MOV $0, 3*WORD_SIZE(%_ASM_BP)

@@ -439,7 +439,7 @@ do { \
     pr_warn_ratelimited(fmt); \
 } while (0)

-asmlinkage void vmread_error(unsigned long field, bool fault)
+void vmread_error(unsigned long field, bool fault)
 {
     if (fault)
         kvm_spurious_fault();
@@ -864,7 +864,7 @@ unsigned int __vmx_vcpu_run_flags(struct vcpu_vmx *vmx)
     return flags;
 }

-static void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
+static __always_inline void clear_atomic_switch_msr_special(struct vcpu_vmx *vmx,
         unsigned long entry, unsigned long exit)
 {
     vm_entry_controls_clearbit(vmx, entry);
@@ -922,7 +922,7 @@ skip_guest:
     vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, m->host.nr);
 }

-static void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
+static __always_inline void add_atomic_switch_msr_special(struct vcpu_vmx *vmx,
         unsigned long entry, unsigned long exit,
         unsigned long guest_val_vmcs, unsigned long host_val_vmcs,
         u64 guest_val, u64 host_val)
@@ -1652,17 +1652,25 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
     /*
      * Per the SDM, MTF takes priority over debug-trap exceptions besides
-     * T-bit traps. As instruction emulation is completed (i.e. at the
-     * instruction boundary), any #DB exception pending delivery must be a
-     * debug-trap. Record the pending MTF state to be delivered in
+     * TSS T-bit traps and ICEBP (INT1).  KVM doesn't emulate T-bit traps
+     * or ICEBP (in the emulator proper), and skipping of ICEBP after an
+     * intercepted #DB deliberately avoids single-step #DB and MTF updates
+     * as ICEBP is higher priority than both.  As instruction emulation is
+     * completed at this point (i.e. KVM is at the instruction boundary),
+     * any #DB exception pending delivery must be a debug-trap of lower
+     * priority than MTF.  Record the pending MTF state to be delivered in
      * vmx_check_nested_events().
      */
     if (nested_cpu_has_mtf(vmcs12) &&
         (!vcpu->arch.exception.pending ||
-         vcpu->arch.exception.nr == DB_VECTOR))
+         vcpu->arch.exception.vector == DB_VECTOR) &&
+        (!vcpu->arch.exception_vmexit.pending ||
+         vcpu->arch.exception_vmexit.vector == DB_VECTOR)) {
         vmx->nested.mtf_pending = true;
-    else
+        kvm_make_request(KVM_REQ_EVENT, vcpu);
+    } else {
         vmx->nested.mtf_pending = false;
+    }
 }

@@ -1684,32 +1692,40 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
     vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }

-static void vmx_queue_exception(struct kvm_vcpu *vcpu)
+static void vmx_inject_exception(struct kvm_vcpu *vcpu)
 {
+    struct kvm_queued_exception *ex = &vcpu->arch.exception;
+    u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
     struct vcpu_vmx *vmx = to_vmx(vcpu);
-    unsigned nr = vcpu->arch.exception.nr;
-    bool has_error_code = vcpu->arch.exception.has_error_code;
-    u32 error_code = vcpu->arch.exception.error_code;
-    u32 intr_info = nr | INTR_INFO_VALID_MASK;

-    kvm_deliver_exception_payload(vcpu);
+    kvm_deliver_exception_payload(vcpu, ex);

-    if (has_error_code) {
-        vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
+    if (ex->has_error_code) {
+        /*
+         * Despite the error code being architecturally defined as 32
+         * bits, and the VMCS field being 32 bits, Intel CPUs and thus
+         * VMX don't actually supporting setting bits 31:16.  Hardware
+         * will (should) never provide a bogus error code, but AMD CPUs
+         * do generate error codes with bits 31:16 set, and so KVM's
+         * ABI lets userspace shove in arbitrary 32-bit values.  Drop
+         * the upper bits to avoid VM-Fail, losing information that
+         * does't really exist is preferable to killing the VM.
+         */
+        vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)ex->error_code);
         intr_info |= INTR_INFO_DELIVER_CODE_MASK;
     }

     if (vmx->rmode.vm86_active) {
         int inc_eip = 0;
-        if (kvm_exception_is_soft(nr))
+        if (kvm_exception_is_soft(ex->vector))
             inc_eip = vcpu->arch.event_exit_inst_len;
-        kvm_inject_realmode_interrupt(vcpu, nr, inc_eip);
+        kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
         return;
     }

     WARN_ON_ONCE(vmx->emulation_required);

-    if (kvm_exception_is_soft(nr)) {
+    if (kvm_exception_is_soft(ex->vector)) {
         vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
                      vmx->vcpu.arch.event_exit_inst_len);
         intr_info |= INTR_TYPE_SOFT_EXCEPTION;
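The (u16) cast in vmx_inject_exception() above is doing real work; the tiny sketch below shows the truncation it performs when userspace hands in an error code with bits 31:16 set (the constant is made up for illustration).

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t error_code = 0xbeef0002;    /* hypothetical, bits 31:16 set */
    uint32_t written    = (uint16_t)error_code;

    /* Only the low 16 bits reach VM_ENTRY_EXCEPTION_ERROR_CODE. */
    printf("requested %#010x, written %#010x\n", error_code, written);
    return 0;
}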
@@ -1930,9 +1946,8 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
          * sanity checking and refuse to boot.  Filter all unsupported
          * features out.
          */
-        if (!msr_info->host_initiated &&
-            vmx->nested.enlightened_vmcs_enabled)
-            nested_evmcs_filter_control_msr(msr_info->index,
+        if (!msr_info->host_initiated && guest_cpuid_has_evmcs(vcpu))
+            nested_evmcs_filter_control_msr(vcpu, msr_info->index,
                                             &msr_info->data);
         break;
     case MSR_IA32_RTIT_CTL:

@@ -2494,6 +2509,30 @@ static bool cpu_has_sgx(void)
     return cpuid_eax(0) >= 0x12 && (cpuid_eax(0x12) & BIT(0));
 }

+/*
+ * Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they
+ * can't be used due to errata where VM Exit may incorrectly clear
+ * IA32_PERF_GLOBAL_CTRL[34:32].  Work around the errata by using the
+ * MSR load mechanism to switch IA32_PERF_GLOBAL_CTRL.
+ */
+static bool cpu_has_perf_global_ctrl_bug(void)
+{
+    if (boot_cpu_data.x86 == 0x6) {
+        switch (boot_cpu_data.x86_model) {
+        case INTEL_FAM6_NEHALEM_EP:    /* AAK155 */
+        case INTEL_FAM6_NEHALEM:       /* AAP115 */
+        case INTEL_FAM6_WESTMERE:      /* AAT100 */
+        case INTEL_FAM6_WESTMERE_EP:   /* BC86,AAY89,BD102 */
+        case INTEL_FAM6_NEHALEM_EX:    /* BA97 */
+            return true;
+        default:
+            break;
+        }
+    }
+
+    return false;
+}
+
 static __init int adjust_vmx_controls(u32 ctl_min, u32 ctl_opt,
                                       u32 msr, u32 *result)
 {
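The setup_vmcs_config() rework that follows feeds adjust_vmx_controls() a KVM_REQUIRED_* mask and a KVM_OPTIONAL_* mask instead of the ad-hoc min/opt locals it removes. A stand-alone sketch of that adjustment policy, under the assumption that the allowed-0/allowed-1 halves of the capability MSR are passed in as parameters (the real helper reads the MSR itself):

#include <stdint.h>

/*
 * Fail if a required control cannot be set, otherwise keep whichever
 * optional controls the CPU supports.
 */
static int adjust_controls(uint32_t required, uint32_t optional,
                           uint32_t allowed0, uint32_t allowed1,
                           uint32_t *result)
{
    uint32_t ctl = required | optional;

    ctl &= allowed1;    /* drop bits the CPU can't set to 1 */
    ctl |= allowed0;    /* keep bits the CPU forces to 1 */

    if (required & ~ctl)
        return -1;      /* a required control is unsupported */

    *result = ctl;
    return 0;
}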
@@ -2526,13 +2565,13 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
                                     struct vmx_capability *vmx_cap)
 {
     u32 vmx_msr_low, vmx_msr_high;
-    u32 min, opt, min2, opt2;
     u32 _pin_based_exec_control = 0;
     u32 _cpu_based_exec_control = 0;
     u32 _cpu_based_2nd_exec_control = 0;
     u64 _cpu_based_3rd_exec_control = 0;
     u32 _vmexit_control = 0;
     u32 _vmentry_control = 0;
+    u64 misc_msr;
     int i;

@@ -2552,64 +2591,17 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
     memset(vmcs_conf, 0, sizeof(*vmcs_conf));
-    min = CPU_BASED_HLT_EXITING |
-#ifdef CONFIG_X86_64
-          CPU_BASED_CR8_LOAD_EXITING |
-          CPU_BASED_CR8_STORE_EXITING |
-#endif
-          CPU_BASED_CR3_LOAD_EXITING |
-          CPU_BASED_CR3_STORE_EXITING |
-          CPU_BASED_UNCOND_IO_EXITING |
-          CPU_BASED_MOV_DR_EXITING |
-          CPU_BASED_USE_TSC_OFFSETTING |
-          CPU_BASED_MWAIT_EXITING |
-          CPU_BASED_MONITOR_EXITING |
-          CPU_BASED_INVLPG_EXITING |
-          CPU_BASED_RDPMC_EXITING;
-
-    opt = CPU_BASED_TPR_SHADOW |
-          CPU_BASED_USE_MSR_BITMAPS |
-          CPU_BASED_ACTIVATE_SECONDARY_CONTROLS |
-          CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
-    if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PROCBASED_CTLS,
-                            &_cpu_based_exec_control) < 0)
+    if (adjust_vmx_controls(KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL,
+                            KVM_OPTIONAL_VMX_CPU_BASED_VM_EXEC_CONTROL,
+                            MSR_IA32_VMX_PROCBASED_CTLS,
+                            &_cpu_based_exec_control))
         return -EIO;
-#ifdef CONFIG_X86_64
-    if (_cpu_based_exec_control & CPU_BASED_TPR_SHADOW)
-        _cpu_based_exec_control &= ~CPU_BASED_CR8_LOAD_EXITING &
-                                   ~CPU_BASED_CR8_STORE_EXITING;
-#endif
     if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_SECONDARY_CONTROLS) {
-        min2 = 0;
-        opt2 = SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES |
-            SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
-            SECONDARY_EXEC_WBINVD_EXITING |
-            SECONDARY_EXEC_ENABLE_VPID |
-            SECONDARY_EXEC_ENABLE_EPT |
-            SECONDARY_EXEC_UNRESTRICTED_GUEST |
-            SECONDARY_EXEC_PAUSE_LOOP_EXITING |
-            SECONDARY_EXEC_DESC |
-            SECONDARY_EXEC_ENABLE_RDTSCP |
-            SECONDARY_EXEC_ENABLE_INVPCID |
-            SECONDARY_EXEC_APIC_REGISTER_VIRT |
-            SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY |
-            SECONDARY_EXEC_SHADOW_VMCS |
-            SECONDARY_EXEC_XSAVES |
-            SECONDARY_EXEC_RDSEED_EXITING |
-            SECONDARY_EXEC_RDRAND_EXITING |
-            SECONDARY_EXEC_ENABLE_PML |
-            SECONDARY_EXEC_TSC_SCALING |
-            SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE |
-            SECONDARY_EXEC_PT_USE_GPA |
-            SECONDARY_EXEC_PT_CONCEAL_VMX |
-            SECONDARY_EXEC_ENABLE_VMFUNC |
-            SECONDARY_EXEC_BUS_LOCK_DETECTION |
-            SECONDARY_EXEC_NOTIFY_VM_EXITING;
-        if (cpu_has_sgx())
-            opt2 |= SECONDARY_EXEC_ENCLS_EXITING;
-        if (adjust_vmx_controls(min2, opt2,
-                                MSR_IA32_VMX_PROCBASED_CTLS2,
-                                &_cpu_based_2nd_exec_control) < 0)
+        if (adjust_vmx_controls(KVM_REQUIRED_VMX_SECONDARY_VM_EXEC_CONTROL,
+                                KVM_OPTIONAL_VMX_SECONDARY_VM_EXEC_CONTROL,
+                                MSR_IA32_VMX_PROCBASED_CTLS2,
+                                &_cpu_based_2nd_exec_control))
             return -EIO;
     }
 #ifndef CONFIG_X86_64
@@ -2627,13 +2619,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
     rdmsr_safe(MSR_IA32_VMX_EPT_VPID_CAP,
               &vmx_cap->ept, &vmx_cap->vpid);

-    if (_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) {
-        /* CR3 accesses and invlpg don't need to cause VM Exits when EPT
-           enabled */
-        _cpu_based_exec_control &= ~(CPU_BASED_CR3_LOAD_EXITING |
-                                     CPU_BASED_CR3_STORE_EXITING |
-                                     CPU_BASED_INVLPG_EXITING);
-    } else if (vmx_cap->ept) {
+    if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_EPT) &&
+        vmx_cap->ept) {
         pr_warn_once("EPT CAP should not exist if not support "
                      "1-setting enable EPT VM-execution control\n");
@@ -2653,32 +2640,24 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
         vmx_cap->vpid = 0;
     }

-    if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS) {
-        u64 opt3 = TERTIARY_EXEC_IPI_VIRT;
-
-        _cpu_based_3rd_exec_control = adjust_vmx_controls64(opt3,
-                                          MSR_IA32_VMX_PROCBASED_CTLS3);
-    }
-
-    min = VM_EXIT_SAVE_DEBUG_CONTROLS | VM_EXIT_ACK_INTR_ON_EXIT;
-#ifdef CONFIG_X86_64
-    min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
-#endif
-    opt = VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
-          VM_EXIT_LOAD_IA32_PAT |
-          VM_EXIT_LOAD_IA32_EFER |
-          VM_EXIT_CLEAR_BNDCFGS |
-          VM_EXIT_PT_CONCEAL_PIP |
-          VM_EXIT_CLEAR_IA32_RTIT_CTL;
-    if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_EXIT_CTLS,
-                            &_vmexit_control) < 0)
+    if (!cpu_has_sgx())
+        _cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_ENCLS_EXITING;
+
+    if (_cpu_based_exec_control & CPU_BASED_ACTIVATE_TERTIARY_CONTROLS)
+        _cpu_based_3rd_exec_control =
+            adjust_vmx_controls64(KVM_OPTIONAL_VMX_TERTIARY_VM_EXEC_CONTROL,
+                                  MSR_IA32_VMX_PROCBASED_CTLS3);
+
+    if (adjust_vmx_controls(KVM_REQUIRED_VMX_VM_EXIT_CONTROLS,
+                            KVM_OPTIONAL_VMX_VM_EXIT_CONTROLS,
+                            MSR_IA32_VMX_EXIT_CTLS,
+                            &_vmexit_control))
         return -EIO;

-    min = PIN_BASED_EXT_INTR_MASK | PIN_BASED_NMI_EXITING;
-    opt = PIN_BASED_VIRTUAL_NMIS | PIN_BASED_POSTED_INTR |
-          PIN_BASED_VMX_PREEMPTION_TIMER;
-    if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_PINBASED_CTLS,
-                            &_pin_based_exec_control) < 0)
+    if (adjust_vmx_controls(KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL,
+                            KVM_OPTIONAL_VMX_PIN_BASED_VM_EXEC_CONTROL,
+                            MSR_IA32_VMX_PINBASED_CTLS,
+                            &_pin_based_exec_control))
         return -EIO;

     if (cpu_has_broken_vmx_preemption_timer())
@@ -2687,15 +2666,10 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
                SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY))
         _pin_based_exec_control &= ~PIN_BASED_POSTED_INTR;

-    min = VM_ENTRY_LOAD_DEBUG_CONTROLS;
-    opt = VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
-          VM_ENTRY_LOAD_IA32_PAT |
-          VM_ENTRY_LOAD_IA32_EFER |
-          VM_ENTRY_LOAD_BNDCFGS |
-          VM_ENTRY_PT_CONCEAL_PIP |
-          VM_ENTRY_LOAD_IA32_RTIT_CTL;
-    if (adjust_vmx_controls(min, opt, MSR_IA32_VMX_ENTRY_CTLS,
-                            &_vmentry_control) < 0)
+    if (adjust_vmx_controls(KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS,
+                            KVM_OPTIONAL_VMX_VM_ENTRY_CONTROLS,
+                            MSR_IA32_VMX_ENTRY_CTLS,
+                            &_vmentry_control))
         return -EIO;

     for (i = 0; i < ARRAY_SIZE(vmcs_entry_exit_pairs); i++) {
@@ -2715,30 +2689,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
         _vmexit_control &= ~x_ctrl;
     }

-    /*
-     * Some cpus support VM_{ENTRY,EXIT}_IA32_PERF_GLOBAL_CTRL but they
-     * can't be used due to an errata where VM Exit may incorrectly clear
-     * IA32_PERF_GLOBAL_CTRL[34:32]. Workaround the errata by using the
-     * MSR load mechanism to switch IA32_PERF_GLOBAL_CTRL.
-     */
-    if (boot_cpu_data.x86 == 0x6) {
-        switch (boot_cpu_data.x86_model) {
-        case 26: /* AAK155 */
-        case 30: /* AAP115 */
-        case 37: /* AAT100 */
-        case 44: /* BC86,AAY89,BD102 */
-        case 46: /* BA97 */
-            _vmentry_control &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
-            _vmexit_control &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
-            pr_warn_once("kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL "
-                    "does not work properly. Using workaround\n");
-            break;
-        default:
-            break;
-        }
-    }
-
     rdmsr(MSR_IA32_VMX_BASIC, vmx_msr_low, vmx_msr_high);

     /* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
@@ -2755,6 +2705,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
     if (((vmx_msr_high >> 18) & 15) != 6)
         return -EIO;

+    rdmsrl(MSR_IA32_VMX_MISC, misc_msr);
+
     vmcs_conf->size = vmx_msr_high & 0x1fff;
     vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
@@ -2766,11 +2718,7 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
     vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
     vmcs_conf->vmexit_ctrl         = _vmexit_control;
     vmcs_conf->vmentry_ctrl        = _vmentry_control;
-
-#if IS_ENABLED(CONFIG_HYPERV)
-    if (enlightened_vmcs)
-        evmcs_sanitize_exec_ctrls(vmcs_conf);
-#endif
+    vmcs_conf->misc = misc_msr;

     return 0;
 }
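adjust_vmx_controls64() is called above for the tertiary controls but its body is not part of this diff; since tertiary controls have no required bits and no allowed-0 half, the obvious shape is the one sketched below. This is an assumption about the helper, not code quoted from the patch.

#include <stdint.h>

/* Keep only the optional controls the CPU advertises; nothing is required. */
static uint64_t adjust_controls64(uint64_t optional, uint64_t allowed1)
{
    return optional & allowed1;
}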
@@ -3037,10 +2985,15 @@ int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer)
     vcpu->arch.efer = efer;
+#ifdef CONFIG_X86_64
     if (efer & EFER_LMA)
         vm_entry_controls_setbit(vmx, VM_ENTRY_IA32E_MODE);
     else
         vm_entry_controls_clearbit(vmx, VM_ENTRY_IA32E_MODE);
+#else
+    if (KVM_BUG_ON(efer & EFER_LMA, vcpu->kvm))
+        return 1;
+#endif

     vmx_setup_uret_msrs(vmx);
     return 0;
@@ -4327,18 +4280,37 @@ static u32 vmx_vmentry_ctrl(void)
     if (vmx_pt_mode_is_system())
         vmentry_ctrl &= ~(VM_ENTRY_PT_CONCEAL_PIP |
                           VM_ENTRY_LOAD_IA32_RTIT_CTL);
-    /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
-    return vmentry_ctrl &
-        ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | VM_ENTRY_LOAD_IA32_EFER);
+    /*
+     * IA32e mode, and loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically.
+     */
+    vmentry_ctrl &= ~(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL |
+                      VM_ENTRY_LOAD_IA32_EFER |
+                      VM_ENTRY_IA32E_MODE);
+
+    if (cpu_has_perf_global_ctrl_bug())
+        vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
+
+    return vmentry_ctrl;
 }

 static u32 vmx_vmexit_ctrl(void)
 {
     u32 vmexit_ctrl = vmcs_config.vmexit_ctrl;

+    /*
+     * Not used by KVM and never set in vmcs01 or vmcs02, but emulated for
+     * nested virtualization and thus allowed to be set in vmcs12.
+     */
+    vmexit_ctrl &= ~(VM_EXIT_SAVE_IA32_PAT | VM_EXIT_SAVE_IA32_EFER |
+                     VM_EXIT_SAVE_VMX_PREEMPTION_TIMER);
+
     if (vmx_pt_mode_is_system())
         vmexit_ctrl &= ~(VM_EXIT_PT_CONCEAL_PIP |
                          VM_EXIT_CLEAR_IA32_RTIT_CTL);
+
+    if (cpu_has_perf_global_ctrl_bug())
+        vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
+
     /* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
     return vmexit_ctrl &
         ~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
@@ -4376,20 +4348,38 @@ static u32 vmx_exec_control(struct vcpu_vmx *vmx)
 {
     u32 exec_control = vmcs_config.cpu_based_exec_ctrl;

+    /*
+     * Not used by KVM, but fully supported for nesting, i.e. are allowed in
+     * vmcs12 and propagated to vmcs02 when set in vmcs12.
+     */
+    exec_control &= ~(CPU_BASED_RDTSC_EXITING |
+                      CPU_BASED_USE_IO_BITMAPS |
+                      CPU_BASED_MONITOR_TRAP_FLAG |
+                      CPU_BASED_PAUSE_EXITING);
+
+    /* INTR_WINDOW_EXITING and NMI_WINDOW_EXITING are toggled dynamically */
+    exec_control &= ~(CPU_BASED_INTR_WINDOW_EXITING |
+                      CPU_BASED_NMI_WINDOW_EXITING);
+
     if (vmx->vcpu.arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)
         exec_control &= ~CPU_BASED_MOV_DR_EXITING;

-    if (!cpu_need_tpr_shadow(&vmx->vcpu)) {
+    if (!cpu_need_tpr_shadow(&vmx->vcpu))
         exec_control &= ~CPU_BASED_TPR_SHADOW;

 #ifdef CONFIG_X86_64
+    if (exec_control & CPU_BASED_TPR_SHADOW)
+        exec_control &= ~(CPU_BASED_CR8_LOAD_EXITING |
+                          CPU_BASED_CR8_STORE_EXITING);
+    else
         exec_control |= CPU_BASED_CR8_STORE_EXITING |
                         CPU_BASED_CR8_LOAD_EXITING;
 #endif
-    }
-    if (!enable_ept)
-        exec_control |= CPU_BASED_CR3_STORE_EXITING |
-                        CPU_BASED_CR3_LOAD_EXITING |
-                        CPU_BASED_INVLPG_EXITING;
+
+    /* No need to intercept CR3 access or INVPLG when using EPT. */
+    if (enable_ept)
+        exec_control &= ~(CPU_BASED_CR3_LOAD_EXITING |
+                          CPU_BASED_CR3_STORE_EXITING |
+                          CPU_BASED_INVLPG_EXITING);
     if (kvm_mwait_in_guest(vmx->vcpu.kvm))
         exec_control &= ~(CPU_BASED_MWAIT_EXITING |
                           CPU_BASED_MONITOR_EXITING);
@@ -5155,8 +5145,10 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
              * instruction.  ICEBP generates a trap-like #DB, but
              * despite its interception control being tied to #DB,
              * is an instruction intercept, i.e. the VM-Exit occurs
-             * on the ICEBP itself.  Note, skipping ICEBP also
-             * clears STI and MOVSS blocking.
+             * on the ICEBP itself.  Use the inner "skip" helper to
+             * avoid single-step #DB and MTF updates, as ICEBP is
+             * higher priority.  Note, skipping ICEBP still clears
+             * STI and MOVSS blocking.
              *
              * For all other #DBs, set vmcs.PENDING_DBG_EXCEPTIONS.BS
              * if single-step is enabled in RFLAGS and STI or MOVSS
@@ -5638,7 +5630,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
     gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-    trace_kvm_page_fault(gpa, exit_qualification);
+    trace_kvm_page_fault(vcpu, gpa, exit_qualification);

     /* Is it a read fault? */
     error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
@@ -5710,7 +5702,7 @@ static bool vmx_emulation_required_with_pending_exception(struct kvm_vcpu *vcpu)
     return vmx->emulation_required && !vmx->rmode.vm86_active &&
-        (vcpu->arch.exception.pending || vcpu->arch.exception.injected);
+        (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected);
 }

 static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
@@ -7430,7 +7422,7 @@ static int __init vmx_check_processor_compat(void)
     if (setup_vmcs_config(&vmcs_conf, &vmx_cap) < 0)
         return -EIO;
     if (nested)
-        nested_vmx_setup_ctls_msrs(&vmcs_conf.nested, vmx_cap.ept);
+        nested_vmx_setup_ctls_msrs(&vmcs_conf, vmx_cap.ept);
     if (memcmp(&vmcs_config, &vmcs_conf, sizeof(struct vmcs_config)) != 0) {
         printk(KERN_ERR "kvm: CPU %d feature inconsistency!\n",
                smp_processor_id());
@@ -8070,7 +8062,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
     .patch_hypercall = vmx_patch_hypercall,
     .inject_irq = vmx_inject_irq,
     .inject_nmi = vmx_inject_nmi,
-    .queue_exception = vmx_queue_exception,
+    .inject_exception = vmx_inject_exception,
     .cancel_injection = vmx_cancel_injection,
     .interrupt_allowed = vmx_interrupt_allowed,
     .nmi_allowed = vmx_nmi_allowed,
@@ -8227,6 +8219,10 @@ static __init int hardware_setup(void)
     if (setup_vmcs_config(&vmcs_config, &vmx_capability) < 0)
         return -EIO;

+    if (cpu_has_perf_global_ctrl_bug())
+        pr_warn_once("kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL "
+                     "does not work properly. Using workaround\n");
+
     if (boot_cpu_has(X86_FEATURE_NX))
         kvm_enable_efer_bits(EFER_NX);
@@ -8341,11 +8337,9 @@ static __init int hardware_setup(void)
     if (enable_preemption_timer) {
         u64 use_timer_freq = 5000ULL * 1000 * 1000;
-        u64 vmx_msr;

-        rdmsrl(MSR_IA32_VMX_MISC, vmx_msr);
         cpu_preemption_timer_multi =
-            vmx_msr & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
+            vmcs_config.misc & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;

         if (tsc_khz)
             use_timer_freq = (u64)tsc_khz * 1000;
@@ -8381,8 +8375,7 @@ static __init int hardware_setup(void)
     setup_default_sgx_lepubkeyhash();

     if (nested) {
-        nested_vmx_setup_ctls_msrs(&vmcs_config.nested,
-                                   vmx_capability.ept);
+        nested_vmx_setup_ctls_msrs(&vmcs_config, vmx_capability.ept);

         r = nested_vmx_hardware_setup(kvm_vmx_exit_handlers);
         if (r)
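Pulling the PERF_GLOBAL_CTRL errata pieces of this series together: detect the affected model once, warn once during hardware_setup(), and mask the load controls out of the VM-Entry/VM-Exit control values so the MSR load lists are used instead. A condensed illustrative sketch, with placeholder bit values and shortened names rather than the kernel's exact code:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Placeholder bit values; the real masks live in the VMX headers. */
#define ENTRY_LOAD_PERF_GLOBAL_CTRL    (1u << 13)
#define EXIT_LOAD_PERF_GLOBAL_CTRL     (1u << 12)

static bool has_perf_global_ctrl_bug(unsigned int family, unsigned int model)
{
    /* Nehalem/Westmere models listed in the errata comment above. */
    return family == 0x6 &&
           (model == 26 || model == 30 || model == 37 ||
            model == 44 || model == 46);
}

int main(void)
{
    uint32_t vmentry_ctrl = ENTRY_LOAD_PERF_GLOBAL_CTRL | 0x1;
    uint32_t vmexit_ctrl  = EXIT_LOAD_PERF_GLOBAL_CTRL  | 0x2;

    if (has_perf_global_ctrl_bug(0x6, 44)) {
        fprintf(stderr, "PERF_GLOBAL_CTRL load controls broken, using MSR load list\n");
        vmentry_ctrl &= ~ENTRY_LOAD_PERF_GLOBAL_CTRL;
        vmexit_ctrl  &= ~EXIT_LOAD_PERF_GLOBAL_CTRL;
    }

    printf("vmentry=%#x vmexit=%#x\n", vmentry_ctrl, vmexit_ctrl);
    return 0;
}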
@@ -477,29 +477,145 @@ static inline u8 vmx_get_rvi(void)
 	return vmcs_read16(GUEST_INTR_STATUS) & 0xff;
 }
 
-#define BUILD_CONTROLS_SHADOW(lname, uname, bits) \
-static inline void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val) \
-{ \
-	if (vmx->loaded_vmcs->controls_shadow.lname != val) { \
-		vmcs_write##bits(uname, val); \
-		vmx->loaded_vmcs->controls_shadow.lname = val; \
-	} \
-} \
-static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs) \
-{ \
-	return vmcs->controls_shadow.lname; \
-} \
-static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx) \
-{ \
-	return __##lname##_controls_get(vmx->loaded_vmcs); \
-} \
-static inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val) \
-{ \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) | val); \
-} \
-static inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val) \
-{ \
-	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val); \
+#define __KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS \
+	(VM_ENTRY_LOAD_DEBUG_CONTROLS)
+#ifdef CONFIG_X86_64
+	#define KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS \
+		(__KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS | \
+		 VM_ENTRY_IA32E_MODE)
+#else
+	#define KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS \
+		__KVM_REQUIRED_VMX_VM_ENTRY_CONTROLS
+#endif
+#define KVM_OPTIONAL_VMX_VM_ENTRY_CONTROLS \
+	(VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL | \
+	 VM_ENTRY_LOAD_IA32_PAT | \
+	 VM_ENTRY_LOAD_IA32_EFER | \
+	 VM_ENTRY_LOAD_BNDCFGS | \
+	 VM_ENTRY_PT_CONCEAL_PIP | \
+	 VM_ENTRY_LOAD_IA32_RTIT_CTL)
+
+#define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
+	(VM_EXIT_SAVE_DEBUG_CONTROLS | \
+	 VM_EXIT_ACK_INTR_ON_EXIT)
+#ifdef CONFIG_X86_64
+	#define KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
+		(__KVM_REQUIRED_VMX_VM_EXIT_CONTROLS | \
+		 VM_EXIT_HOST_ADDR_SPACE_SIZE)
+#else
+	#define KVM_REQUIRED_VMX_VM_EXIT_CONTROLS \
+		__KVM_REQUIRED_VMX_VM_EXIT_CONTROLS
+#endif
+#define KVM_OPTIONAL_VMX_VM_EXIT_CONTROLS \
+	(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | \
+	 VM_EXIT_SAVE_IA32_PAT | \
+	 VM_EXIT_LOAD_IA32_PAT | \
+	 VM_EXIT_SAVE_IA32_EFER | \
+	 VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | \
+	 VM_EXIT_LOAD_IA32_EFER | \
+	 VM_EXIT_CLEAR_BNDCFGS | \
+	 VM_EXIT_PT_CONCEAL_PIP | \
+	 VM_EXIT_CLEAR_IA32_RTIT_CTL)
+
+#define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL \
+	(PIN_BASED_EXT_INTR_MASK | \
+	 PIN_BASED_NMI_EXITING)
+#define KVM_OPTIONAL_VMX_PIN_BASED_VM_EXEC_CONTROL \
+	(PIN_BASED_VIRTUAL_NMIS | \
+	 PIN_BASED_POSTED_INTR | \
+	 PIN_BASED_VMX_PREEMPTION_TIMER)
+
+#define __KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL \
+	(CPU_BASED_HLT_EXITING | \
+	 CPU_BASED_CR3_LOAD_EXITING | \
+	 CPU_BASED_CR3_STORE_EXITING | \
+	 CPU_BASED_UNCOND_IO_EXITING | \
+	 CPU_BASED_MOV_DR_EXITING | \
+	 CPU_BASED_USE_TSC_OFFSETTING | \
+	 CPU_BASED_MWAIT_EXITING | \
+	 CPU_BASED_MONITOR_EXITING | \
+	 CPU_BASED_INVLPG_EXITING | \
+	 CPU_BASED_RDPMC_EXITING | \
+	 CPU_BASED_INTR_WINDOW_EXITING)
+
+#ifdef CONFIG_X86_64
+	#define KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL \
+		(__KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL | \
+		 CPU_BASED_CR8_LOAD_EXITING | \
+		 CPU_BASED_CR8_STORE_EXITING)
+#else
+	#define KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL \
+		__KVM_REQUIRED_VMX_CPU_BASED_VM_EXEC_CONTROL
+#endif
+
+#define KVM_OPTIONAL_VMX_CPU_BASED_VM_EXEC_CONTROL \
+	(CPU_BASED_RDTSC_EXITING | \
+	 CPU_BASED_TPR_SHADOW | \
+	 CPU_BASED_USE_IO_BITMAPS | \
+	 CPU_BASED_MONITOR_TRAP_FLAG | \
+	 CPU_BASED_USE_MSR_BITMAPS | \
+	 CPU_BASED_NMI_WINDOW_EXITING | \
+	 CPU_BASED_PAUSE_EXITING | \
+	 CPU_BASED_ACTIVATE_SECONDARY_CONTROLS | \
+	 CPU_BASED_ACTIVATE_TERTIARY_CONTROLS)
+
+#define KVM_REQUIRED_VMX_SECONDARY_VM_EXEC_CONTROL 0
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXEC_CONTROL \
+	(SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES | \
+	 SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE | \
+	 SECONDARY_EXEC_WBINVD_EXITING | \
+	 SECONDARY_EXEC_ENABLE_VPID | \
+	 SECONDARY_EXEC_ENABLE_EPT | \
+	 SECONDARY_EXEC_UNRESTRICTED_GUEST | \
+	 SECONDARY_EXEC_PAUSE_LOOP_EXITING | \
+	 SECONDARY_EXEC_DESC | \
+	 SECONDARY_EXEC_ENABLE_RDTSCP | \
+	 SECONDARY_EXEC_ENABLE_INVPCID | \
+	 SECONDARY_EXEC_APIC_REGISTER_VIRT | \
+	 SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY | \
+	 SECONDARY_EXEC_SHADOW_VMCS | \
+	 SECONDARY_EXEC_XSAVES | \
+	 SECONDARY_EXEC_RDSEED_EXITING | \
+	 SECONDARY_EXEC_RDRAND_EXITING | \
+	 SECONDARY_EXEC_ENABLE_PML | \
+	 SECONDARY_EXEC_TSC_SCALING | \
+	 SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE | \
+	 SECONDARY_EXEC_PT_USE_GPA | \
+	 SECONDARY_EXEC_PT_CONCEAL_VMX | \
+	 SECONDARY_EXEC_ENABLE_VMFUNC | \
+	 SECONDARY_EXEC_BUS_LOCK_DETECTION | \
+	 SECONDARY_EXEC_NOTIFY_VM_EXITING | \
+	 SECONDARY_EXEC_ENCLS_EXITING)
+
+#define KVM_REQUIRED_VMX_TERTIARY_VM_EXEC_CONTROL 0
+#define KVM_OPTIONAL_VMX_TERTIARY_VM_EXEC_CONTROL \
+	(TERTIARY_EXEC_IPI_VIRT)
+
+#define BUILD_CONTROLS_SHADOW(lname, uname, bits) \
+static inline void lname##_controls_set(struct vcpu_vmx *vmx, u##bits val) \
+{ \
+	if (vmx->loaded_vmcs->controls_shadow.lname != val) { \
+		vmcs_write##bits(uname, val); \
+		vmx->loaded_vmcs->controls_shadow.lname = val; \
+	} \
+} \
+static inline u##bits __##lname##_controls_get(struct loaded_vmcs *vmcs) \
+{ \
+	return vmcs->controls_shadow.lname; \
+} \
+static inline u##bits lname##_controls_get(struct vcpu_vmx *vmx) \
+{ \
+	return __##lname##_controls_get(vmx->loaded_vmcs); \
+} \
+static __always_inline void lname##_controls_setbit(struct vcpu_vmx *vmx, u##bits val) \
+{ \
+	BUILD_BUG_ON(!(val & (KVM_REQUIRED_VMX_##uname | KVM_OPTIONAL_VMX_##uname))); \
+	lname##_controls_set(vmx, lname##_controls_get(vmx) | val); \
+} \
+static __always_inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##bits val) \
+{ \
+	BUILD_BUG_ON(!(val & (KVM_REQUIRED_VMX_##uname | KVM_OPTIONAL_VMX_##uname))); \
+	lname##_controls_set(vmx, lname##_controls_get(vmx) & ~val); \
 }
 BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
 BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
@@ -626,4 +742,14 @@ static inline bool vmx_can_use_ipiv(struct kvm_vcpu *vcpu)
 	return lapic_in_kernel(vcpu) && enable_ipiv;
 }
 
+static inline bool guest_cpuid_has_evmcs(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * eVMCS is exposed to the guest if Hyper-V is enabled in CPUID and
+	 * eVMCS has been explicitly enabled by userspace.
+	 */
+	return vcpu->arch.hyperv_enabled &&
+	       to_vmx(vcpu)->nested.enlightened_vmcs_enabled;
+}
+
 #endif /* __KVM_X86_VMX_H */
@@ -10,7 +10,7 @@
 #include "vmcs.h"
 #include "../x86.h"
 
-asmlinkage void vmread_error(unsigned long field, bool fault);
+void vmread_error(unsigned long field, bool fault);
 __attribute__((regparm(0))) void vmread_error_trampoline(unsigned long field,
 							 bool fault);
 void vmwrite_error(unsigned long field, unsigned long value);
@@ -173,8 +173,13 @@ bool __read_mostly enable_vmware_backdoor = false;
 module_param(enable_vmware_backdoor, bool, S_IRUGO);
 EXPORT_SYMBOL_GPL(enable_vmware_backdoor);
 
-static bool __read_mostly force_emulation_prefix = false;
-module_param(force_emulation_prefix, bool, S_IRUGO);
+/*
+ * Flags to manipulate forced emulation behavior (any non-zero value will
+ * enable forced emulation).
+ */
+#define KVM_FEP_CLEAR_RFLAGS_RF	BIT(1)
+static int __read_mostly force_emulation_prefix;
+module_param(force_emulation_prefix, int, 0644);
 
 int __read_mostly pi_inject_timer = -1;
 module_param(pi_inject_timer, bint, S_IRUGO | S_IWUSR);
@@ -528,6 +533,7 @@ static int exception_class(int vector)
 #define EXCPT_TRAP		1
 #define EXCPT_ABORT		2
 #define EXCPT_INTERRUPT	3
+#define EXCPT_DB		4
 
 static int exception_type(int vector)
 {
@@ -538,8 +544,14 @@ static int exception_type(int vector)
 
 	mask = 1 << vector;
 
-	/* #DB is trap, as instruction watchpoints are handled elsewhere */
-	if (mask & ((1 << DB_VECTOR) | (1 << BP_VECTOR) | (1 << OF_VECTOR)))
+	/*
+	 * #DBs can be trap-like or fault-like, the caller must check other CPU
+	 * state, e.g. DR6, to determine whether a #DB is a trap or fault.
+	 */
+	if (mask & (1 << DB_VECTOR))
+		return EXCPT_DB;
+
+	if (mask & ((1 << BP_VECTOR) | (1 << OF_VECTOR)))
 		return EXCPT_TRAP;
 
 	if (mask & ((1 << DF_VECTOR) | (1 << MC_VECTOR)))
@@ -549,16 +561,13 @@
 		return EXCPT_FAULT;
 }
 
-void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
+void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
+				   struct kvm_queued_exception *ex)
 {
-	unsigned nr = vcpu->arch.exception.nr;
-	bool has_payload = vcpu->arch.exception.has_payload;
-	unsigned long payload = vcpu->arch.exception.payload;
-
-	if (!has_payload)
+	if (!ex->has_payload)
 		return;
 
-	switch (nr) {
+	switch (ex->vector) {
 	case DB_VECTOR:
 		/*
 		 * "Certain debug exceptions may clear bit 0-3.  The
@@ -583,8 +592,8 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
 		 * So they need to be flipped for DR6.
 		 */
 		vcpu->arch.dr6 |= DR6_ACTIVE_LOW;
-		vcpu->arch.dr6 |= payload;
-		vcpu->arch.dr6 ^= payload & DR6_ACTIVE_LOW;
+		vcpu->arch.dr6 |= ex->payload;
+		vcpu->arch.dr6 ^= ex->payload & DR6_ACTIVE_LOW;
 
 		/*
 		 * The #DB payload is defined as compatible with the 'pending
@@ -595,15 +604,30 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
 		vcpu->arch.dr6 &= ~BIT(12);
 		break;
 	case PF_VECTOR:
-		vcpu->arch.cr2 = payload;
+		vcpu->arch.cr2 = ex->payload;
 		break;
 	}
 
-	vcpu->arch.exception.has_payload = false;
-	vcpu->arch.exception.payload = 0;
+	ex->has_payload = false;
+	ex->payload = 0;
 }
 EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);
 
+static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vector,
+				       bool has_error_code, u32 error_code,
+				       bool has_payload, unsigned long payload)
+{
+	struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
+
+	ex->vector = vector;
+	ex->injected = false;
+	ex->pending = true;
+	ex->has_error_code = has_error_code;
+	ex->error_code = error_code;
+	ex->has_payload = has_payload;
+	ex->payload = payload;
+}
+
 static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 		unsigned nr, bool has_error, u32 error_code,
 		bool has_payload, unsigned long payload, bool reinject)
@@ -613,18 +637,31 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
+	/*
+	 * If the exception is destined for L2 and isn't being reinjected,
+	 * morph it to a VM-Exit if L1 wants to intercept the exception.  A
+	 * previously injected exception is not checked because it was checked
+	 * when it was original queued, and re-checking is incorrect if _L1_
+	 * injected the exception, in which case it's exempt from interception.
+	 */
+	if (!reinject && is_guest_mode(vcpu) &&
+	    kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
+		kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
+					   has_payload, payload);
+		return;
+	}
+
 	if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
 	queue:
 		if (reinject) {
 			/*
-			 * On vmentry, vcpu->arch.exception.pending is only
-			 * true if an event injection was blocked by
-			 * nested_run_pending.  In that case, however,
-			 * vcpu_enter_guest requests an immediate exit,
-			 * and the guest shouldn't proceed far enough to
-			 * need reinjection.
+			 * On VM-Entry, an exception can be pending if and only
+			 * if event injection was blocked by nested_run_pending.
+			 * In that case, however, vcpu_enter_guest() requests an
+			 * immediate exit, and the guest shouldn't proceed far
+			 * enough to need reinjection.
 			 */
-			WARN_ON_ONCE(vcpu->arch.exception.pending);
+			WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
 			vcpu->arch.exception.injected = true;
 			if (WARN_ON_ONCE(has_payload)) {
 				/*
@@ -639,17 +676,18 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 			vcpu->arch.exception.injected = false;
 		}
 		vcpu->arch.exception.has_error_code = has_error;
-		vcpu->arch.exception.nr = nr;
+		vcpu->arch.exception.vector = nr;
 		vcpu->arch.exception.error_code = error_code;
 		vcpu->arch.exception.has_payload = has_payload;
 		vcpu->arch.exception.payload = payload;
 		if (!is_guest_mode(vcpu))
-			kvm_deliver_exception_payload(vcpu);
+			kvm_deliver_exception_payload(vcpu,
+						      &vcpu->arch.exception);
 		return;
 	}
 
 	/* to check exception */
-	prev_nr = vcpu->arch.exception.nr;
+	prev_nr = vcpu->arch.exception.vector;
 	if (prev_nr == DF_VECTOR) {
 		/* triple fault -> shutdown */
 		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
@@ -657,25 +695,22 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 	}
 	class1 = exception_class(prev_nr);
 	class2 = exception_class(nr);
-	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
-		|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
+	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY) ||
+	    (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
 		/*
-		 * Generate double fault per SDM Table 5-5.  Set
-		 * exception.pending = true so that the double fault
-		 * can trigger a nested vmexit.
+		 * Synthesize #DF.  Clear the previously injected or pending
+		 * exception so as not to incorrectly trigger shutdown.
 		 */
-		vcpu->arch.exception.pending = true;
 		vcpu->arch.exception.injected = false;
-		vcpu->arch.exception.has_error_code = true;
-		vcpu->arch.exception.nr = DF_VECTOR;
-		vcpu->arch.exception.error_code = 0;
-		vcpu->arch.exception.has_payload = false;
-		vcpu->arch.exception.payload = 0;
-	} else
+		vcpu->arch.exception.pending = false;
+
+		kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
+	} else {
 		/* replace previous exception with a new one in a hope
 		   that instruction re-execution will regenerate lost
 		   exception */
 		goto queue;
+	}
 }
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
@@ -729,20 +764,22 @@ static int complete_emulated_insn_gp(struct kvm_vcpu *vcpu, int err)
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
 {
 	++vcpu->stat.pf_guest;
-	vcpu->arch.exception.nested_apf =
-		is_guest_mode(vcpu) && fault->async_page_fault;
-	if (vcpu->arch.exception.nested_apf) {
-		vcpu->arch.apf.nested_apf_token = fault->address;
-		kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
-	} else {
+
+	/*
+	 * Async #PF in L2 is always forwarded to L1 as a VM-Exit regardless of
+	 * whether or not L1 wants to intercept "regular" #PF.
+	 */
+	if (is_guest_mode(vcpu) && fault->async_page_fault)
+		kvm_queue_exception_vmexit(vcpu, PF_VECTOR,
+					   true, fault->error_code,
+					   true, fault->address);
+	else
 		kvm_queue_exception_e_p(vcpu, PF_VECTOR, fault->error_code,
 					fault->address);
-	}
 }
 EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
 
-/* Returns true if the page fault was immediately morphed into a VM-Exit. */
-bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
+void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 				    struct x86_exception *fault)
 {
 	struct kvm_mmu *fault_mmu;
@@ -760,26 +797,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 	kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
 			       fault_mmu->root.hpa);
 
-	/*
-	 * A workaround for KVM's bad exception handling.  If KVM injected an
-	 * exception into L2, and L2 encountered a #PF while vectoring the
-	 * injected exception, manually check to see if L1 wants to intercept
-	 * #PF, otherwise queuing the #PF will lead to #DF or a lost exception.
-	 * In all other cases, defer the check to nested_ops->check_events(),
-	 * which will correctly handle priority (this does not).  Note, other
-	 * exceptions, e.g. #GP, are theoretically affected, #PF is simply the
-	 * most problematic, e.g. when L0 and L1 are both intercepting #PF for
-	 * shadow paging.
-	 *
-	 * TODO: Rewrite exception handling to track injected and pending
-	 * (VM-Exit) exceptions separately.
-	 */
-	if (unlikely(vcpu->arch.exception.injected && is_guest_mode(vcpu)) &&
-	    kvm_x86_ops.nested_ops->handle_page_fault_workaround(vcpu, fault))
-		return true;
-
 	fault_mmu->inject_page_fault(vcpu, fault);
-	return false;
 }
 EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);
 
@@ -4841,7 +4859,7 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
 	return (kvm_arch_interrupt_allowed(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu) &&
 		!kvm_event_needs_reinjection(vcpu) &&
-		!vcpu->arch.exception.pending);
+		!kvm_is_exception_pending(vcpu));
 }
 
 static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
@@ -5016,25 +5034,38 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 					       struct kvm_vcpu_events *events)
 {
+	struct kvm_queued_exception *ex;
+
 	process_nmi(vcpu);
 
 	if (kvm_check_request(KVM_REQ_SMI, vcpu))
 		process_smi(vcpu);
 
 	/*
-	 * In guest mode, payload delivery should be deferred,
-	 * so that the L1 hypervisor can intercept #PF before
-	 * CR2 is modified (or intercept #DB before DR6 is
-	 * modified under nVMX). Unless the per-VM capability,
-	 * KVM_CAP_EXCEPTION_PAYLOAD, is set, we may not defer the delivery of
-	 * an exception payload and handle after a KVM_GET_VCPU_EVENTS. Since we
-	 * opportunistically defer the exception payload, deliver it if the
-	 * capability hasn't been requested before processing a
-	 * KVM_GET_VCPU_EVENTS.
+	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
+	 * the only time there can be two queued exceptions is if there's a
+	 * non-exiting _injected_ exception, and a pending exiting exception.
+	 * In that case, ignore the VM-Exiting exception as it's an extension
+	 * of the injected exception.
+	 */
+	if (vcpu->arch.exception_vmexit.pending &&
+	    !vcpu->arch.exception.pending &&
+	    !vcpu->arch.exception.injected)
+		ex = &vcpu->arch.exception_vmexit;
+	else
+		ex = &vcpu->arch.exception;
+
+	/*
+	 * In guest mode, payload delivery should be deferred if the exception
+	 * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
+	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
+	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
+	 * propagate the payload and so it cannot be safely deferred.  Deliver
+	 * the payload if the capability hasn't been requested.
 	 */
 	if (!vcpu->kvm->arch.exception_payload_enabled &&
-	    vcpu->arch.exception.pending && vcpu->arch.exception.has_payload)
-		kvm_deliver_exception_payload(vcpu);
+	    ex->pending && ex->has_payload)
+		kvm_deliver_exception_payload(vcpu, ex);
 
 	/*
 	 * The API doesn't provide the instruction length for software
@@ -5042,26 +5073,25 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	 * isn't advanced, we should expect to encounter the exception
 	 * again.
 	 */
-	if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
+	if (kvm_exception_is_soft(ex->vector)) {
 		events->exception.injected = 0;
 		events->exception.pending = 0;
 	} else {
-		events->exception.injected = vcpu->arch.exception.injected;
-		events->exception.pending = vcpu->arch.exception.pending;
+		events->exception.injected = ex->injected;
+		events->exception.pending = ex->pending;
 		/*
 		 * For ABI compatibility, deliberately conflate
 		 * pending and injected exceptions when
 		 * KVM_CAP_EXCEPTION_PAYLOAD isn't enabled.
 		 */
 		if (!vcpu->kvm->arch.exception_payload_enabled)
-			events->exception.injected |=
-				vcpu->arch.exception.pending;
+			events->exception.injected |= ex->pending;
 	}
-	events->exception.nr = vcpu->arch.exception.nr;
-	events->exception.has_error_code = vcpu->arch.exception.has_error_code;
-	events->exception.error_code = vcpu->arch.exception.error_code;
-	events->exception_has_payload = vcpu->arch.exception.has_payload;
-	events->exception_payload = vcpu->arch.exception.payload;
+	events->exception.nr = ex->vector;
+	events->exception.has_error_code = ex->has_error_code;
+	events->exception.error_code = ex->error_code;
+	events->exception_has_payload = ex->has_payload;
+	events->exception_payload = ex->payload;
 
 	events->interrupt.injected =
 		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
@@ -5131,9 +5161,22 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	process_nmi(vcpu);
+
+	/*
+	 * Flag that userspace is stuffing an exception, the next KVM_RUN will
+	 * morph the exception to a VM-Exit if appropriate.  Do this only for
+	 * pending exceptions, already-injected exceptions are not subject to
+	 * intercpetion.  Note, userspace that conflates pending and injected
+	 * is hosed, and will incorrectly convert an injected exception into a
+	 * pending exception, which in turn may cause a spurious VM-Exit.
+	 */
+	vcpu->arch.exception_from_userspace = events->exception.pending;
+
+	vcpu->arch.exception_vmexit.pending = false;
+
 	vcpu->arch.exception.injected = events->exception.injected;
 	vcpu->arch.exception.pending = events->exception.pending;
-	vcpu->arch.exception.nr = events->exception.nr;
+	vcpu->arch.exception.vector = events->exception.nr;
 	vcpu->arch.exception.has_error_code = events->exception.has_error_code;
 	vcpu->arch.exception.error_code = events->exception.error_code;
 	vcpu->arch.exception.has_payload = events->exception_has_payload;
@@ -7257,6 +7300,7 @@ static int kvm_can_emulate_insn(struct kvm_vcpu *vcpu, int emul_type,
 int handle_ud(struct kvm_vcpu *vcpu)
 {
 	static const char kvm_emulate_prefix[] = { __KVM_EMULATE_PREFIX };
+	int fep_flags = READ_ONCE(force_emulation_prefix);
 	int emul_type = EMULTYPE_TRAP_UD;
 	char sig[5]; /* ud2; .ascii "kvm" */
 	struct x86_exception e;
@@ -7264,10 +7308,12 @@ int handle_ud(struct kvm_vcpu *vcpu)
 	if (unlikely(!kvm_can_emulate_insn(vcpu, emul_type, NULL, 0)))
 		return 1;
 
-	if (force_emulation_prefix &&
+	if (fep_flags &&
 	    kvm_read_guest_virt(vcpu, kvm_get_linear_rip(vcpu),
 				sig, sizeof(sig), &e) == 0 &&
 	    memcmp(sig, kvm_emulate_prefix, sizeof(sig)) == 0) {
+		if (fep_flags & KVM_FEP_CLEAR_RFLAGS_RF)
+			kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) & ~X86_EFLAGS_RF);
 		kvm_rip_write(vcpu, kvm_rip_read(vcpu) + sizeof(sig));
 		emul_type = EMULTYPE_TRAP_UD_FORCED;
 	}
@@ -7933,14 +7979,20 @@ static int emulator_get_msr_with_filter(struct x86_emulate_ctxt *ctxt,
 	int r;
 
 	r = kvm_get_msr_with_filter(vcpu, msr_index, pdata);
+	if (r < 0)
+		return X86EMUL_UNHANDLEABLE;
 
-	if (r && kvm_msr_user_space(vcpu, msr_index, KVM_EXIT_X86_RDMSR, 0,
-				    complete_emulated_rdmsr, r)) {
-		/* Bounce to user space */
+	if (r) {
+		if (kvm_msr_user_space(vcpu, msr_index, KVM_EXIT_X86_RDMSR, 0,
+				       complete_emulated_rdmsr, r))
 			return X86EMUL_IO_NEEDED;
+
+		trace_kvm_msr_read_ex(msr_index);
+		return X86EMUL_PROPAGATE_FAULT;
 	}
 
-	return r;
+	trace_kvm_msr_read(msr_index, *pdata);
+	return X86EMUL_CONTINUE;
 }
 
 static int emulator_set_msr_with_filter(struct x86_emulate_ctxt *ctxt,
@@ -7950,14 +8002,20 @@ static int emulator_set_msr_with_filter(struct x86_emulate_ctxt *ctxt,
 	int r;
 
 	r = kvm_set_msr_with_filter(vcpu, msr_index, data);
+	if (r < 0)
+		return X86EMUL_UNHANDLEABLE;
 
-	if (r && kvm_msr_user_space(vcpu, msr_index, KVM_EXIT_X86_WRMSR, data,
-				    complete_emulated_msr_access, r)) {
-		/* Bounce to user space */
+	if (r) {
+		if (kvm_msr_user_space(vcpu, msr_index, KVM_EXIT_X86_WRMSR, data,
+				       complete_emulated_msr_access, r))
 			return X86EMUL_IO_NEEDED;
+
+		trace_kvm_msr_write_ex(msr_index, data);
+		return X86EMUL_PROPAGATE_FAULT;
 	}
 
-	return r;
+	trace_kvm_msr_write(msr_index, data);
+	return X86EMUL_CONTINUE;
 }
 
 static int emulator_get_msr(struct x86_emulate_ctxt *ctxt,
@@ -8161,18 +8219,17 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
 	}
 }
 
-static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
+static void inject_emulated_exception(struct kvm_vcpu *vcpu)
 {
 	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
-	if (ctxt->exception.vector == PF_VECTOR)
-		return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
 
-	if (ctxt->exception.error_code_valid)
+	if (ctxt->exception.vector == PF_VECTOR)
+		kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
+	else if (ctxt->exception.error_code_valid)
 		kvm_queue_exception_e(vcpu, ctxt->exception.vector,
 				      ctxt->exception.error_code);
 	else
 		kvm_queue_exception(vcpu, ctxt->exception.vector);
-	return false;
 }
 
 static struct x86_emulate_ctxt *alloc_emulate_ctxt(struct kvm_vcpu *vcpu)
@@ -8548,8 +8605,46 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_skip_emulated_instruction);
 
-static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu, int *r)
+static bool kvm_is_code_breakpoint_inhibited(struct kvm_vcpu *vcpu)
 {
+	u32 shadow;
+
+	if (kvm_get_rflags(vcpu) & X86_EFLAGS_RF)
+		return true;
+
+	/*
+	 * Intel CPUs inhibit code #DBs when MOV/POP SS blocking is active,
+	 * but AMD CPUs do not.  MOV/POP SS blocking is rare, check that first
+	 * to avoid the relatively expensive CPUID lookup.
+	 */
+	shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
+	return (shadow & KVM_X86_SHADOW_INT_MOV_SS) &&
+	       guest_cpuid_is_intel(vcpu);
+}
+
+static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu,
+					   int emulation_type, int *r)
+{
+	WARN_ON_ONCE(emulation_type & EMULTYPE_NO_DECODE);
+
+	/*
+	 * Do not check for code breakpoints if hardware has already done the
+	 * checks, as inferred from the emulation type.  On NO_DECODE and SKIP,
+	 * the instruction has passed all exception checks, and all intercepted
+	 * exceptions that trigger emulation have lower priority than code
+	 * breakpoints, i.e. the fact that the intercepted exception occurred
+	 * means any code breakpoints have already been serviced.
+	 *
+	 * Note, KVM needs to check for code #DBs on EMULTYPE_TRAP_UD_FORCED as
+	 * hardware has checked the RIP of the magic prefix, but not the RIP of
+	 * the instruction being emulated.  The intent of forced emulation is
+	 * to behave as if KVM intercepted the instruction without an exception
+	 * and without a prefix.
+	 */
+	if (emulation_type & (EMULTYPE_NO_DECODE | EMULTYPE_SKIP |
+			      EMULTYPE_TRAP_UD | EMULTYPE_VMWARE_GP | EMULTYPE_PF))
+		return false;
+
 	if (unlikely(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) &&
 	    (vcpu->arch.guest_debug_dr7 & DR7_BP_EN_MASK)) {
 		struct kvm_run *kvm_run = vcpu->run;
@@ -8569,7 +8664,7 @@ static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu, int *r)
 	}
 
 	if (unlikely(vcpu->arch.dr7 & DR7_BP_EN_MASK) &&
-	    !(kvm_get_rflags(vcpu) & X86_EFLAGS_RF)) {
+	    !kvm_is_code_breakpoint_inhibited(vcpu)) {
 		unsigned long eip = kvm_get_linear_rip(vcpu);
 		u32 dr6 = kvm_vcpu_check_hw_bp(eip, 0,
 					   vcpu->arch.dr7,
@@ -8671,8 +8766,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	 * are fault-like and are higher priority than any faults on
 	 * the code fetch itself.
 	 */
-	if (!(emulation_type & EMULTYPE_SKIP) &&
-	    kvm_vcpu_check_code_breakpoint(vcpu, &r))
+	if (kvm_vcpu_check_code_breakpoint(vcpu, emulation_type, &r))
 		return r;
 
 	r = x86_decode_emulated_instruction(vcpu, emulation_type,
@@ -8770,8 +8864,7 @@ restart:
 
 	if (ctxt->have_exception) {
 		r = 1;
-		if (inject_emulated_exception(vcpu))
-			return r;
+		inject_emulated_exception(vcpu);
 	} else if (vcpu->arch.pio.count) {
 		if (!vcpu->arch.pio.in) {
 			/* FIXME: return into emulator if single-stepping.  */
@@ -8801,6 +8894,12 @@ writeback:
 		unsigned long rflags = static_call(kvm_x86_get_rflags)(vcpu);
 		toggle_interruptibility(vcpu, ctxt->interruptibility);
 		vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
+
+		/*
+		 * Note, EXCPT_DB is assumed to be fault-like as the emulator
+		 * only supports code breakpoints and general detect #DB, both
+		 * of which are fault-like.
+		 */
 		if (!ctxt->have_exception ||
 		    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
 			kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
@@ -9662,74 +9761,155 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
 
 static void kvm_inject_exception(struct kvm_vcpu *vcpu)
 {
-	trace_kvm_inj_exception(vcpu->arch.exception.nr,
+	trace_kvm_inj_exception(vcpu->arch.exception.vector,
 				vcpu->arch.exception.has_error_code,
 				vcpu->arch.exception.error_code,
 				vcpu->arch.exception.injected);
 
 	if (vcpu->arch.exception.error_code && !is_protmode(vcpu))
 		vcpu->arch.exception.error_code = false;
-	static_call(kvm_x86_queue_exception)(vcpu);
+	static_call(kvm_x86_inject_exception)(vcpu);
 }
 
-static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
+/*
+ * Check for any event (interrupt or exception) that is ready to be injected,
+ * and if there is at least one event, inject the event with the highest
+ * priority.  This handles both "pending" events, i.e. events that have never
+ * been injected into the guest, and "injected" events, i.e. events that were
+ * injected as part of a previous VM-Enter, but weren't successfully delivered
+ * and need to be re-injected.
+ *
+ * Note, this is not guaranteed to be invoked on a guest instruction boundary,
+ * i.e. doesn't guarantee that there's an event window in the guest.  KVM must
+ * be able to inject exceptions in the "middle" of an instruction, and so must
+ * also be able to re-inject NMIs and IRQs in the middle of an instruction.
+ * I.e. for exceptions and re-injected events, NOT invoking this on instruction
+ * boundaries is necessary and correct.
+ *
+ * For simplicity, KVM uses a single path to inject all events (except events
+ * that are injected directly from L1 to L2) and doesn't explicitly track
+ * instruction boundaries for asynchronous events.  However, because VM-Exits
+ * that can occur during instruction execution typically result in KVM skipping
+ * the instruction or injecting an exception, e.g. instruction and exception
+ * intercepts, and because pending exceptions have higher priority than pending
+ * interrupts, KVM still honors instruction boundaries in most scenarios.
+ *
+ * But, if a VM-Exit occurs during instruction execution, and KVM does NOT skip
+ * the instruction or inject an exception, then KVM can incorrecty inject a new
+ * asynchrounous event if the event became pending after the CPU fetched the
+ * instruction (in the guest).  E.g. if a page fault (#PF, #NPF, EPT violation)
+ * occurs and is resolved by KVM, a coincident NMI, SMI, IRQ, etc... can be
+ * injected on the restarted instruction instead of being deferred until the
+ * instruction completes.
+ *
+ * In practice, this virtualization hole is unlikely to be observed by the
+ * guest, and even less likely to cause functional problems.  To detect the
+ * hole, the guest would have to trigger an event on a side effect of an early
+ * phase of instruction execution, e.g. on the instruction fetch from memory.
+ * And for it to be a functional problem, the guest would need to depend on the
+ * ordering between that side effect, the instruction completing, _and_ the
+ * delivery of the asynchronous event.
+ */
+static int kvm_check_and_inject_events(struct kvm_vcpu *vcpu,
+				       bool *req_immediate_exit)
 {
+	bool can_inject;
 	int r;
-	bool can_inject = true;
 
-	/* try to reinject previous events if any */
-
-	if (vcpu->arch.exception.injected) {
-		kvm_inject_exception(vcpu);
-		can_inject = false;
-	}
 	/*
-	 * Do not inject an NMI or interrupt if there is a pending
-	 * exception.  Exceptions and interrupts are recognized at
-	 * instruction boundaries, i.e. the start of an instruction.
-	 * Trap-like exceptions, e.g. #DB, have higher priority than
-	 * NMIs and interrupts, i.e. traps are recognized before an
-	 * NMI/interrupt that's pending on the same instruction.
-	 * Fault-like exceptions, e.g. #GP and #PF, are the lowest
-	 * priority, but are only generated (pended) during instruction
-	 * execution, i.e. a pending fault-like exception means the
-	 * fault occurred on the *previous* instruction and must be
-	 * serviced prior to recognizing any new events in order to
-	 * fully complete the previous instruction.
+	 * Process nested events first, as nested VM-Exit supercedes event
+	 * re-injection.  If there's an event queued for re-injection, it will
+	 * be saved into the appropriate vmc{b,s}12 fields on nested VM-Exit.
 	 */
-	else if (!vcpu->arch.exception.pending) {
-		if (vcpu->arch.nmi_injected) {
-			static_call(kvm_x86_inject_nmi)(vcpu);
-			can_inject = false;
-		} else if (vcpu->arch.interrupt.injected) {
-			static_call(kvm_x86_inject_irq)(vcpu, true);
-			can_inject = false;
-		}
-	}
+	if (is_guest_mode(vcpu))
+		r = kvm_check_nested_events(vcpu);
+	else
+		r = 0;
 
+	/*
+	 * Re-inject exceptions and events *especially* if immediate entry+exit
+	 * to/from L2 is needed, as any event that has already been injected
+	 * into L2 needs to complete its lifecycle before injecting a new event.
+	 *
+	 * Don't re-inject an NMI or interrupt if there is a pending exception.
+	 * This collision arises if an exception occurred while vectoring the
+	 * injected event, KVM intercepted said exception, and KVM ultimately
+	 * determined the fault belongs to the guest and queues the exception
+	 * for injection back into the guest.
+	 *
+	 * "Injected" interrupts can also collide with pending exceptions if
+	 * userspace ignores the "ready for injection" flag and blindly queues
+	 * an interrupt.  In that case, prioritizing the exception is correct,
+	 * as the exception "occurred" before the exit to userspace.  Trap-like
+	 * exceptions, e.g. most #DBs, have higher priority than interrupts.
+	 * And while fault-like exceptions, e.g. #GP and #PF, are the lowest
+	 * priority, they're only generated (pended) during instruction
	 * execution, and interrupts are recognized at instruction boundaries.
+	 * Thus a pending fault-like exception means the fault occurred on the
+	 * *previous* instruction and must be serviced prior to recognizing any
+	 * new events in order to fully complete the previous instruction.
+	 */
+	if (vcpu->arch.exception.injected)
+		kvm_inject_exception(vcpu);
+	else if (kvm_is_exception_pending(vcpu))
+		; /* see above */
+	else if (vcpu->arch.nmi_injected)
+		static_call(kvm_x86_inject_nmi)(vcpu);
+	else if (vcpu->arch.interrupt.injected)
+		static_call(kvm_x86_inject_irq)(vcpu, true);
+
+	/*
+	 * Exceptions that morph to VM-Exits are handled above, and pending
+	 * exceptions on top of injected exceptions that do not VM-Exit should
+	 * either morph to #DF or, sadly, override the injected exception.
+	 */
 	WARN_ON_ONCE(vcpu->arch.exception.injected &&
 		     vcpu->arch.exception.pending);
 
 	/*
-	 * Call check_nested_events() even if we reinjected a previous event
-	 * in order for caller to determine if it should require immediate-exit
-	 * from L2 to L1 due to pending L1 events which require exit
-	 * from L2 to L1.
+	 * Bail if immediate entry+exit to/from the guest is needed to complete
+	 * nested VM-Enter or event re-injection so that a different pending
+	 * event can be serviced (or if KVM needs to exit to userspace).
+	 *
+	 * Otherwise, continue processing events even if VM-Exit occurred.  The
+	 * VM-Exit will have cleared exceptions that were meant for L2, but
+	 * there may now be events that can be injected into L1.
 	 */
-	if (is_guest_mode(vcpu)) {
-		r = kvm_check_nested_events(vcpu);
-		if (r < 0)
-			goto out;
-	}
+	if (r < 0)
+		goto out;
 
-	/* try to inject new event if pending */
+	/*
+	 * A pending exception VM-Exit should either result in nested VM-Exit
+	 * or force an immediate re-entry and exit to/from L2, and exception
+	 * VM-Exits cannot be injected (flag should _never_ be set).
+	 */
+	WARN_ON_ONCE(vcpu->arch.exception_vmexit.injected ||
+		     vcpu->arch.exception_vmexit.pending);
+
+	/*
+	 * New events, other than exceptions, cannot be injected if KVM needs
+	 * to re-inject a previous event.  See above comments on re-injecting
+	 * for why pending exceptions get priority.
+	 */
+	can_inject = !kvm_event_needs_reinjection(vcpu);
+
 	if (vcpu->arch.exception.pending) {
-		if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
+		/*
+		 * Fault-class exceptions, except #DBs, set RF=1 in the RFLAGS
+		 * value pushed on the stack.  Trap-like exception and all #DBs
+		 * leave RF as-is (KVM follows Intel's behavior in this regard;
+		 * AMD states that code breakpoint #DBs excplitly clear RF=0).
+		 *
+		 * Note, most versions of Intel's SDM and AMD's APM incorrectly
+		 * describe the behavior of General Detect #DBs, which are
+		 * fault-like.  They do _not_ set RF, a la code breakpoints.
+		 */
+		if (exception_type(vcpu->arch.exception.vector) == EXCPT_FAULT)
 			__kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
 					     X86_EFLAGS_RF);
 
-		if (vcpu->arch.exception.nr == DB_VECTOR) {
-			kvm_deliver_exception_payload(vcpu);
+		if (vcpu->arch.exception.vector == DB_VECTOR) {
+			kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception);
 			if (vcpu->arch.dr7 & DR7_GD) {
 				vcpu->arch.dr7 &= ~DR7_GD;
 				kvm_update_dr7(vcpu);
@@ -9801,11 +9981,11 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 	}
 
 	if (is_guest_mode(vcpu) &&
-	    kvm_x86_ops.nested_ops->hv_timer_pending &&
-	    kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
+	    kvm_x86_ops.nested_ops->has_events &&
+	    kvm_x86_ops.nested_ops->has_events(vcpu))
 		*req_immediate_exit = true;
 
-	WARN_ON(vcpu->arch.exception.pending);
+	WARN_ON(kvm_is_exception_pending(vcpu));
 	return 0;
 
 out:
@@ -10110,7 +10290,7 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 	 * When APICv gets disabled, we may still have injected interrupts
 	 * pending. At the same time, KVM_REQ_EVENT may not be set as APICv was
 	 * still active when the interrupt got accepted. Make sure
-	 * inject_pending_event() is called to check for that.
+	 * kvm_check_and_inject_events() is called to check for that.
 	 */
 	if (!apic->apicv_active)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
@@ -10407,7 +10587,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			goto out;
 		}
 
-		r = inject_pending_event(vcpu, &req_immediate_exit);
+		r = kvm_check_and_inject_events(vcpu, &req_immediate_exit);
 		if (r < 0) {
 			r = 0;
 			goto out;
@@ -10646,10 +10826,26 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 		if (hv_timer)
 			kvm_lapic_switch_to_hv_timer(vcpu);
 
-		if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
+		/*
+		 * If the vCPU is not runnable, a signal or another host event
+		 * of some kind is pending; service it without changing the
+		 * vCPU's activity state.
+		 */
+		if (!kvm_arch_vcpu_runnable(vcpu))
 			return 1;
 	}
 
+	/*
+	 * Evaluate nested events before exiting the halted state. This allows
+	 * the halt state to be recorded properly in the VMCS12's activity
+	 * state field (AMD does not have a similar field and a VM-Exit always
+	 * causes a spurious wakeup from HLT).
+	 */
+	if (is_guest_mode(vcpu)) {
+		if (kvm_check_nested_events(vcpu) < 0)
+			return 0;
+	}
+
 	if (kvm_apic_accept_events(vcpu) < 0)
 		return 0;
 	switch(vcpu->arch.mp_state) {
@@ -10673,9 +10869,6 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 
 static inline bool kvm_vcpu_running(struct kvm_vcpu *vcpu)
 {
-	if (is_guest_mode(vcpu))
-		kvm_check_nested_events(vcpu);
-
 	return (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
 		!vcpu->arch.apf.halted);
 }
@@ -10824,6 +11017,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 {
+	struct kvm_queued_exception *ex = &vcpu->arch.exception;
 	struct kvm_run *kvm_run = vcpu->run;
 	int r;
 
@@ -10852,7 +11046,6 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 			r = 0;
 			goto out;
 		}
-		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 		r = -EAGAIN;
 		if (signal_pending(current)) {
 			r = -EINTR;
@@ -10882,6 +11075,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	/*
+	 * If userspace set a pending exception and L2 is active, convert it to
+	 * a pending VM-Exit if L1 wants to intercept the exception.
+	 */
+	if (vcpu->arch.exception_from_userspace && is_guest_mode(vcpu) &&
+	    kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, ex->vector,
+							 ex->error_code)) {
+		kvm_queue_exception_vmexit(vcpu, ex->vector,
+					   ex->has_error_code, ex->error_code,
+					   ex->has_payload, ex->payload);
+		ex->injected = false;
+		ex->pending = false;
+	}
+	vcpu->arch.exception_from_userspace = false;
+
 	if (unlikely(vcpu->arch.complete_userspace_io)) {
 		int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
 		vcpu->arch.complete_userspace_io = NULL;
@@ -10988,6 +11196,7 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);
 
 	vcpu->arch.exception.pending = false;
+	vcpu->arch.exception_vmexit.pending = false;
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
@@ -11125,11 +11334,12 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu,
 	}
 
 	/*
-	 * KVM_MP_STATE_INIT_RECEIVED means the processor is in
-	 * INIT state; latched init should be reported using
-	 * KVM_SET_VCPU_EVENTS, so reject it here.
+	 * Pending INITs are reported using KVM_SET_VCPU_EVENTS, disallow
+	 * forcing the guest into INIT/SIPI if those events are supposed to be
+	 * blocked.  KVM prioritizes SMI over INIT, so reject INIT/SIPI state
+	 * if an SMI is pending as well.
 	 */
-	if ((kvm_vcpu_latch_init(vcpu) || vcpu->arch.smi_pending) &&
+	if ((!kvm_apic_init_sipi_allowed(vcpu) || vcpu->arch.smi_pending) &&
 	    (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED ||
 	     mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
 		goto out;
@@ -11368,7 +11578,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 
 	if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
 		r = -EBUSY;
-		if (vcpu->arch.exception.pending)
+		if (kvm_is_exception_pending(vcpu))
 			goto out;
 		if (dbg->control & KVM_GUESTDBG_INJECT_DB)
 			kvm_queue_exception(vcpu, DB_VECTOR);
@@ -11750,8 +11960,8 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		struct fpstate *fpstate = vcpu->arch.guest_fpu.fpstate;
 
 		/*
-		 * To avoid have the INIT path from kvm_apic_has_events() that be
-		 * called with loaded FPU and does not let userspace fix the state.
+		 * All paths that lead to INIT are required to load the guest's
+		 * FPU state (because most paths are buried in KVM_RUN).
 		 */
 		if (init_event)
 			kvm_put_guest_fpu(vcpu);
@@ -12080,6 +12290,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	if (ret)
 		goto out_page_track;
 
+	ret = static_call(kvm_x86_vm_init)(kvm);
+	if (ret)
+		goto out_uninit_mmu;
+
 	INIT_HLIST_HEAD(&kvm->arch.mask_notifier_list);
 	INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);
 	atomic_set(&kvm->arch.noncoherent_dma_count, 0);
@@ -12115,8 +12329,10 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm_hv_init_vm(kvm);
 	kvm_xen_init_vm(kvm);
 
-	return static_call(kvm_x86_vm_init)(kvm);
+	return 0;
+
+out_uninit_mmu:
+	kvm_mmu_uninit_vm(kvm);
 out_page_track:
 	kvm_page_track_cleanup(kvm);
 out:
@@ -12589,13 +12805,14 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 	if (!list_empty_careful(&vcpu->async_pf.done))
 		return true;
 
-	if (kvm_apic_has_events(vcpu))
+	if (kvm_apic_has_pending_init_or_sipi(vcpu) &&
+	    kvm_apic_init_sipi_allowed(vcpu))
 		return true;
 
 	if (vcpu->arch.pv.pv_unhalted)
 		return true;
 
-	if (vcpu->arch.exception.pending)
+	if (kvm_is_exception_pending(vcpu))
 		return true;
 
 	if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
@@ -12617,16 +12834,13 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 		return true;
 
 	if (is_guest_mode(vcpu) &&
-	    kvm_x86_ops.nested_ops->hv_timer_pending &&
-	    kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
+	    kvm_x86_ops.nested_ops->has_events &&
+	    kvm_x86_ops.nested_ops->has_events(vcpu))
 		return true;
 
 	if (kvm_xen_has_pending_events(vcpu))
 		return true;
 
-	if (kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu))
-		return true;
-
 	return false;
 }
 
@@ -12850,7 +13064,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
 {
 	if (unlikely(!lapic_in_kernel(vcpu) ||
 		     kvm_event_needs_reinjection(vcpu) ||
-		     vcpu->arch.exception.pending))
+		     kvm_is_exception_pending(vcpu)))
 		return false;
 
 	if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
@ -13401,7 +13615,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
|
|||||||
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
|
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
|
||||||
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
|
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_msr);
|
||||||
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
|
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_cr);
|
||||||
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmrun);
|
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmenter);
|
||||||
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
|
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit);
|
||||||
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
|
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_vmexit_inject);
|
||||||
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intr_vmexit);
|
EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_nested_intr_vmexit);
|
||||||
|
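The reworked KVM_SET_MP_STATE check above has a userspace-visible effect: forcing INIT_RECEIVED/SIPI_RECEIVED is now rejected whenever INIT/SIPI is blocked or an SMI is pending. The sketch below is illustrative only and is not part of this series; try_set_init_received() is a hypothetical helper, __vcpu_ioctl()/TEST_ASSERT() are the existing selftest wrappers, and it assumes the vCPU has already been steered into a state where INIT is blocked (e.g. SMM) or has an SMI pending.

static void try_set_init_received(struct kvm_vcpu *vcpu)
{
    struct kvm_mp_state mp_state = {
        .mp_state = KVM_MP_STATE_INIT_RECEIVED,
    };
    int r;

    /* Expected to fail while INIT/SIPI is blocked or an SMI is pending. */
    r = __vcpu_ioctl(vcpu, KVM_SET_MP_STATE, &mp_state);
    TEST_ASSERT(r && errno == EINVAL,
                "KVM_SET_MP_STATE should be rejected while INIT is blocked");
}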
@@ -82,10 +82,18 @@ static inline unsigned int __shrink_ple_window(unsigned int val,
 void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu);
 int kvm_check_nested_events(struct kvm_vcpu *vcpu);
 
+static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.exception.pending ||
+	       vcpu->arch.exception_vmexit.pending ||
+	       kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+}
+
 static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.exception.pending = false;
 	vcpu->arch.exception.injected = false;
+	vcpu->arch.exception_vmexit.pending = false;
 }
 
 static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,
@@ -267,11 +275,6 @@ static inline bool kvm_check_has_quirk(struct kvm *kvm, u64 quirk)
 	return !(kvm->arch.disabled_quirks & quirk);
 }
 
-static inline bool kvm_vcpu_latch_init(struct kvm_vcpu *vcpu)
-{
-	return is_smm(vcpu) || static_call(kvm_x86_apic_init_signal_blocked)(vcpu);
-}
-
 void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
@@ -286,7 +289,8 @@ int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu,
 
 int handle_ud(struct kvm_vcpu *vcpu);
 
-void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu);
+void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
+				   struct kvm_queued_exception *ex);
 
 void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
 u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
@@ -1065,7 +1065,6 @@ static bool kvm_xen_schedop_poll(struct kvm_vcpu *vcpu, bool longmode,
 		del_timer(&vcpu->arch.xen.poll_timer);
 
 		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-		kvm_clear_request(KVM_REQ_UNHALT, vcpu);
 	}
 
 	vcpu->arch.xen.poll_evtchn = 0;
@@ -433,6 +433,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     "Node %d ShadowCallStack:%8lu kB\n"
 #endif
 			     "Node %d PageTables: %8lu kB\n"
+			     "Node %d SecPageTables: %8lu kB\n"
 			     "Node %d NFS_Unstable: %8lu kB\n"
 			     "Node %d Bounce: %8lu kB\n"
 			     "Node %d WritebackTmp: %8lu kB\n"
@@ -459,6 +460,7 @@ static ssize_t node_read_meminfo(struct device *dev,
 			     nid, node_page_state(pgdat, NR_KERNEL_SCS_KB),
 #endif
 			     nid, K(node_page_state(pgdat, NR_PAGETABLE)),
+			     nid, K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
 			     nid, 0UL,
 			     nid, K(sum_zone_node_page_state(nid, NR_BOUNCE)),
 			     nid, K(node_page_state(pgdat, NR_WRITEBACK_TEMP)),
@@ -115,6 +115,8 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
 #endif
 	show_val_kb(m, "PageTables: ",
 		    global_node_page_state(NR_PAGETABLE));
+	show_val_kb(m, "SecPageTables: ",
+		    global_node_page_state(NR_SECONDARY_PAGETABLE));
 
 	show_val_kb(m, "NFS_Unstable: ", 0);
 	show_val_kb(m, "Bounce: ",
@@ -151,12 +151,11 @@ static inline bool is_error_page(struct page *page)
 #define KVM_REQUEST_NO_ACTION BIT(10)
 /*
  * Architecture-independent vcpu->requests bit members
- * Bits 4-7 are reserved for more arch-independent bits.
+ * Bits 3-7 are reserved for more arch-independent bits.
  */
 #define KVM_REQ_TLB_FLUSH (0 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_VM_DEAD (1 | KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_UNBLOCK 2
-#define KVM_REQ_UNHALT 3
 #define KVM_REQUEST_ARCH_BASE 8
 
 /*
@@ -2247,6 +2246,19 @@ static inline void kvm_handle_signal_exit(struct kvm_vcpu *vcpu)
 }
 #endif /* CONFIG_KVM_XFER_TO_GUEST_WORK */
 
+/*
+ * If more than one page is being (un)accounted, @virt must be the address of
+ * the first page of a block of pages what were allocated together (i.e
+ * accounted together).
+ *
+ * kvm_account_pgtable_pages() is thread-safe because mod_lruvec_page_state()
+ * is thread-safe.
+ */
+static inline void kvm_account_pgtable_pages(void *virt, int nr)
+{
+	mod_lruvec_page_state(virt_to_page(virt), NR_SECONDARY_PAGETABLE, nr);
+}
+
 /*
  * This defines how many reserved entries we want to keep before we
  * kick the vcpu to the userspace to avoid dirty ring full. This
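kvm_account_pgtable_pages() above is the hook the per-arch MMUs use to feed the new NR_SECONDARY_PAGETABLE counter. A minimal sketch of the intended usage is below; it is illustrative only, not a call site from this series (the real x86 and arm64 call sites are not shown in this excerpt), and arch_alloc_pgtable_page()/arch_free_pgtable_page() are hypothetical names. The point is simply that allocation and free must be paired with +1/-1 accounting on the same page.

static void *arch_alloc_pgtable_page(void)
{
    /* A page-table page charged to the VM process's memcg. */
    void *page = (void *)__get_free_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);

    if (page)
        kvm_account_pgtable_pages(page, +1);
    return page;
}

static void arch_free_pgtable_page(void *page)
{
    kvm_account_pgtable_pages(page, -1);
    free_page((unsigned long)page);
}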
@@ -216,6 +216,7 @@ enum node_stat_item {
 	NR_KERNEL_SCS_KB, /* measured in KiB */
 #endif
 	NR_PAGETABLE, /* used for pagetables */
+	NR_SECONDARY_PAGETABLE, /* secondary pagetables, e.g. KVM pagetables */
 #ifdef CONFIG_SWAP
 	NR_SWAPCACHE,
 #endif
@@ -1401,6 +1401,7 @@ static const struct memory_stat memory_stats[] = {
 	{ "kernel", MEMCG_KMEM },
 	{ "kernel_stack", NR_KERNEL_STACK_KB },
 	{ "pagetables", NR_PAGETABLE },
+	{ "sec_pagetables", NR_SECONDARY_PAGETABLE },
 	{ "percpu", MEMCG_PERCPU_B },
 	{ "sock", MEMCG_SOCK },
 	{ "vmalloc", MEMCG_VMALLOC },
@@ -6085,7 +6085,8 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		" active_file:%lu inactive_file:%lu isolated_file:%lu\n"
 		" unevictable:%lu dirty:%lu writeback:%lu\n"
 		" slab_reclaimable:%lu slab_unreclaimable:%lu\n"
-		" mapped:%lu shmem:%lu pagetables:%lu bounce:%lu\n"
+		" mapped:%lu shmem:%lu pagetables:%lu\n"
+		" sec_pagetables:%lu bounce:%lu\n"
 		" kernel_misc_reclaimable:%lu\n"
 		" free:%lu free_pcp:%lu free_cma:%lu\n",
 		global_node_page_state(NR_ACTIVE_ANON),
@@ -6102,6 +6103,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 		global_node_page_state(NR_FILE_MAPPED),
 		global_node_page_state(NR_SHMEM),
 		global_node_page_state(NR_PAGETABLE),
+		global_node_page_state(NR_SECONDARY_PAGETABLE),
 		global_zone_page_state(NR_BOUNCE),
 		global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE),
 		global_zone_page_state(NR_FREE_PAGES),
@@ -6135,6 +6137,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			" shadow_call_stack:%lukB"
 #endif
 			" pagetables:%lukB"
+			" sec_pagetables:%lukB"
 			" all_unreclaimable? %s"
 			"\n",
 			pgdat->node_id,
@@ -6160,6 +6163,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
 			node_page_state(pgdat, NR_KERNEL_SCS_KB),
 #endif
 			K(node_page_state(pgdat, NR_PAGETABLE)),
+			K(node_page_state(pgdat, NR_SECONDARY_PAGETABLE)),
 			pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ?
 				"yes" : "no");
 	}
@@ -1247,6 +1247,7 @@ const char * const vmstat_text[] = {
 	"nr_shadow_call_stack",
 #endif
 	"nr_page_table_pages",
+	"nr_sec_page_table_pages",
 #ifdef CONFIG_SWAP
 	"nr_swapcached",
 #endif
tools/testing/selftests/kvm/.gitignore (vendored):
@@ -28,6 +28,7 @@
 /x86_64/max_vcpuid_cap_test
 /x86_64/mmio_warning_test
 /x86_64/monitor_mwait_test
+/x86_64/nested_exceptions_test
 /x86_64/nx_huge_pages_test
 /x86_64/platform_info_test
 /x86_64/pmu_event_filter_test
@@ -91,6 +91,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
 TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
 TEST_GEN_PROGS_x86_64 += x86_64/monitor_mwait_test
+TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
@@ -203,14 +203,25 @@ struct hv_enlightened_vmcs {
 		u32 reserved:30;
 	} hv_enlightenments_control;
 	u32 hv_vp_id;
+	u32 padding32_2;
 	u64 hv_vm_id;
 	u64 partition_assist_page;
 	u64 padding64_4[4];
 	u64 guest_bndcfgs;
-	u64 padding64_5[7];
+	u64 guest_ia32_perf_global_ctrl;
+	u64 guest_ia32_s_cet;
+	u64 guest_ssp;
+	u64 guest_ia32_int_ssp_table_addr;
+	u64 guest_ia32_lbr_ctl;
+	u64 padding64_5[2];
 	u64 xss_exit_bitmap;
-	u64 padding64_6[7];
+	u64 encls_exiting_bitmap;
+	u64 host_ia32_perf_global_ctrl;
+	u64 tsc_multiplier;
+	u64 host_ia32_s_cet;
+	u64 host_ssp;
+	u64 host_ia32_int_ssp_table_addr;
+	u64 padding64_6;
 };
 
 #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE 0
@@ -656,6 +667,18 @@ static inline int evmcs_vmread(uint64_t encoding, uint64_t *value)
 	case VIRTUAL_PROCESSOR_ID:
 		*value = current_evmcs->virtual_processor_id;
 		break;
+	case HOST_IA32_PERF_GLOBAL_CTRL:
+		*value = current_evmcs->host_ia32_perf_global_ctrl;
+		break;
+	case GUEST_IA32_PERF_GLOBAL_CTRL:
+		*value = current_evmcs->guest_ia32_perf_global_ctrl;
+		break;
+	case ENCLS_EXITING_BITMAP:
+		*value = current_evmcs->encls_exiting_bitmap;
+		break;
+	case TSC_MULTIPLIER:
+		*value = current_evmcs->tsc_multiplier;
+		break;
 	default: return 1;
 	}
 
@@ -1169,6 +1192,22 @@ static inline int evmcs_vmwrite(uint64_t encoding, uint64_t value)
 		current_evmcs->virtual_processor_id = value;
 		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_XLAT;
 		break;
+	case HOST_IA32_PERF_GLOBAL_CTRL:
+		current_evmcs->host_ia32_perf_global_ctrl = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1;
+		break;
+	case GUEST_IA32_PERF_GLOBAL_CTRL:
+		current_evmcs->guest_ia32_perf_global_ctrl = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP1;
+		break;
+	case ENCLS_EXITING_BITMAP:
+		current_evmcs->encls_exiting_bitmap = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2;
+		break;
+	case TSC_MULTIPLIER:
+		current_evmcs->tsc_multiplier = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2;
+		break;
 	default: return 1;
 	}
 
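With the selftest eVMCS struct and the vmread/vmwrite switch statements above extended, a guest running these helpers can exercise the newly plumbed fields through the same accessors it already uses. The sketch below is illustrative only and not part of the series; check_new_evmcs_fields() is a hypothetical helper, and it assumes enlightened VMCS is enabled with current_evmcs pointing at a mapped eVMCS page.

static void check_new_evmcs_fields(void)
{
    uint64_t val;

    /* Write one of the newly supported fields and read it back. */
    GUEST_ASSERT(!evmcs_vmwrite(TSC_MULTIPLIER, 1ull << 48));
    GUEST_ASSERT(!evmcs_vmread(TSC_MULTIPLIER, &val));
    GUEST_ASSERT(val == (1ull << 48));

    GUEST_ASSERT(!evmcs_vmwrite(ENCLS_EXITING_BITMAP, ~0ull));
    GUEST_ASSERT(!evmcs_vmread(ENCLS_EXITING_BITMAP, &val));
    GUEST_ASSERT(val == ~0ull);
}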
@@ -9,15 +9,12 @@
 #ifndef SELFTEST_KVM_SVM_UTILS_H
 #define SELFTEST_KVM_SVM_UTILS_H
 
+#include <asm/svm.h>
+
 #include <stdint.h>
 #include "svm.h"
 #include "processor.h"
 
-#define SVM_EXIT_EXCP_BASE 0x040
-#define SVM_EXIT_HLT 0x078
-#define SVM_EXIT_MSR 0x07c
-#define SVM_EXIT_VMMCALL 0x081
-
 struct svm_test_data {
 	/* VMCB */
 	struct vmcb *vmcb; /* gva */
@@ -8,6 +8,8 @@
 #ifndef SELFTEST_KVM_VMX_H
 #define SELFTEST_KVM_VMX_H
 
+#include <asm/vmx.h>
+
 #include <stdint.h>
 #include "processor.h"
 #include "apic.h"
@@ -100,55 +102,6 @@
 #define VMX_EPT_VPID_CAP_AD_BITS 0x00200000
 
 #define EXIT_REASON_FAILED_VMENTRY 0x80000000
-#define EXIT_REASON_EXCEPTION_NMI 0
-#define EXIT_REASON_EXTERNAL_INTERRUPT 1
-#define EXIT_REASON_TRIPLE_FAULT 2
-#define EXIT_REASON_INTERRUPT_WINDOW 7
-#define EXIT_REASON_NMI_WINDOW 8
-#define EXIT_REASON_TASK_SWITCH 9
-#define EXIT_REASON_CPUID 10
-#define EXIT_REASON_HLT 12
-#define EXIT_REASON_INVD 13
-#define EXIT_REASON_INVLPG 14
-#define EXIT_REASON_RDPMC 15
-#define EXIT_REASON_RDTSC 16
-#define EXIT_REASON_VMCALL 18
-#define EXIT_REASON_VMCLEAR 19
-#define EXIT_REASON_VMLAUNCH 20
-#define EXIT_REASON_VMPTRLD 21
-#define EXIT_REASON_VMPTRST 22
-#define EXIT_REASON_VMREAD 23
-#define EXIT_REASON_VMRESUME 24
-#define EXIT_REASON_VMWRITE 25
-#define EXIT_REASON_VMOFF 26
-#define EXIT_REASON_VMON 27
-#define EXIT_REASON_CR_ACCESS 28
-#define EXIT_REASON_DR_ACCESS 29
-#define EXIT_REASON_IO_INSTRUCTION 30
-#define EXIT_REASON_MSR_READ 31
-#define EXIT_REASON_MSR_WRITE 32
-#define EXIT_REASON_INVALID_STATE 33
-#define EXIT_REASON_MWAIT_INSTRUCTION 36
-#define EXIT_REASON_MONITOR_INSTRUCTION 39
-#define EXIT_REASON_PAUSE_INSTRUCTION 40
-#define EXIT_REASON_MCE_DURING_VMENTRY 41
-#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
-#define EXIT_REASON_APIC_ACCESS 44
-#define EXIT_REASON_EOI_INDUCED 45
-#define EXIT_REASON_EPT_VIOLATION 48
-#define EXIT_REASON_EPT_MISCONFIG 49
-#define EXIT_REASON_INVEPT 50
-#define EXIT_REASON_RDTSCP 51
-#define EXIT_REASON_PREEMPTION_TIMER 52
-#define EXIT_REASON_INVVPID 53
-#define EXIT_REASON_WBINVD 54
-#define EXIT_REASON_XSETBV 55
-#define EXIT_REASON_APIC_WRITE 56
-#define EXIT_REASON_INVPCID 58
-#define EXIT_REASON_PML_FULL 62
-#define EXIT_REASON_XSAVES 63
-#define EXIT_REASON_XRSTORS 64
-#define LAST_EXIT_REASON 64
 
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID = 0x00000000,
@@ -208,6 +161,8 @@ enum vmcs_field {
 	VMWRITE_BITMAP_HIGH = 0x00002029,
 	XSS_EXIT_BITMAP = 0x0000202C,
 	XSS_EXIT_BITMAP_HIGH = 0x0000202D,
+	ENCLS_EXITING_BITMAP = 0x0000202E,
+	ENCLS_EXITING_BITMAP_HIGH = 0x0000202F,
 	TSC_MULTIPLIER = 0x00002032,
 	TSC_MULTIPLIER_HIGH = 0x00002033,
 	GUEST_PHYSICAL_ADDRESS = 0x00002400,
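The deletions above (and the matching SVM_EXIT_* removals in svm_util.h) work because the selftest headers now pull in the kernel's uapi definitions through <asm/vmx.h> and <asm/svm.h>, so tests keep using the same constant names without carrying local copies. A minimal sketch of the idea follows; is_hypercall_exit() is a hypothetical helper, not code from this series.

#include "vmx.h"
#include "svm_util.h"

/* EXIT_REASON_* and SVM_EXIT_* now resolve to the uapi header values. */
static bool is_hypercall_exit(bool is_svm, uint64_t exit_reason)
{
    return is_svm ? exit_reason == SVM_EXIT_VMMCALL
                  : exit_reason == EXIT_REASON_VMCALL;
}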
tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c (new file, 295 lines):
@@ -0,0 +1,295 @@
// SPDX-License-Identifier: GPL-2.0-only
#define _GNU_SOURCE /* for program_invocation_short_name */

#include "test_util.h"
#include "kvm_util.h"
#include "processor.h"
#include "vmx.h"
#include "svm_util.h"

#define L2_GUEST_STACK_SIZE 256

/*
 * Arbitrary, never shoved into KVM/hardware, just need to avoid conflict with
 * the "real" exceptions used, #SS/#GP/#DF (12/13/8).
 */
#define FAKE_TRIPLE_FAULT_VECTOR 0xaa

/* Arbitrary 32-bit error code injected by this test. */
#define SS_ERROR_CODE 0xdeadbeef

/*
 * Bit '0' is set on Intel if the exception occurs while delivering a previous
 * event/exception. AMD's wording is ambiguous, but presumably the bit is set
 * if the exception occurs while delivering an external event, e.g. NMI or INTR,
 * but not for exceptions that occur when delivering other exceptions or
 * software interrupts.
 *
 * Note, Intel's name for it, "External event", is misleading and much more
 * aligned with AMD's behavior, but the SDM is quite clear on its behavior.
 */
#define ERROR_CODE_EXT_FLAG BIT(0)

/*
 * Bit '1' is set if the fault occurred when looking up a descriptor in the
 * IDT, which is the case here as the IDT is empty/NULL.
 */
#define ERROR_CODE_IDT_FLAG BIT(1)

/*
 * The #GP that occurs when vectoring #SS should show the index into the IDT
 * for #SS, plus have the "IDT flag" set.
 */
#define GP_ERROR_CODE_AMD ((SS_VECTOR * 8) | ERROR_CODE_IDT_FLAG)
#define GP_ERROR_CODE_INTEL ((SS_VECTOR * 8) | ERROR_CODE_IDT_FLAG | ERROR_CODE_EXT_FLAG)

/*
 * Intel and AMD both shove '0' into the error code on #DF, regardless of what
 * led to the double fault.
 */
#define DF_ERROR_CODE 0

#define INTERCEPT_SS (BIT_ULL(SS_VECTOR))
#define INTERCEPT_SS_DF (INTERCEPT_SS | BIT_ULL(DF_VECTOR))
#define INTERCEPT_SS_GP_DF (INTERCEPT_SS_DF | BIT_ULL(GP_VECTOR))

static void l2_ss_pending_test(void)
{
    GUEST_SYNC(SS_VECTOR);
}

static void l2_ss_injected_gp_test(void)
{
    GUEST_SYNC(GP_VECTOR);
}

static void l2_ss_injected_df_test(void)
{
    GUEST_SYNC(DF_VECTOR);
}

static void l2_ss_injected_tf_test(void)
{
    GUEST_SYNC(FAKE_TRIPLE_FAULT_VECTOR);
}

static void svm_run_l2(struct svm_test_data *svm, void *l2_code, int vector,
                       uint32_t error_code)
{
    struct vmcb *vmcb = svm->vmcb;
    struct vmcb_control_area *ctrl = &vmcb->control;

    vmcb->save.rip = (u64)l2_code;
    run_guest(vmcb, svm->vmcb_gpa);

    if (vector == FAKE_TRIPLE_FAULT_VECTOR)
        return;

    GUEST_ASSERT_EQ(ctrl->exit_code, (SVM_EXIT_EXCP_BASE + vector));
    GUEST_ASSERT_EQ(ctrl->exit_info_1, error_code);
}

static void l1_svm_code(struct svm_test_data *svm)
{
    struct vmcb_control_area *ctrl = &svm->vmcb->control;
    unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];

    generic_svm_setup(svm, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
    svm->vmcb->save.idtr.limit = 0;
    ctrl->intercept |= BIT_ULL(INTERCEPT_SHUTDOWN);

    ctrl->intercept_exceptions = INTERCEPT_SS_GP_DF;
    svm_run_l2(svm, l2_ss_pending_test, SS_VECTOR, SS_ERROR_CODE);
    svm_run_l2(svm, l2_ss_injected_gp_test, GP_VECTOR, GP_ERROR_CODE_AMD);

    ctrl->intercept_exceptions = INTERCEPT_SS_DF;
    svm_run_l2(svm, l2_ss_injected_df_test, DF_VECTOR, DF_ERROR_CODE);

    ctrl->intercept_exceptions = INTERCEPT_SS;
    svm_run_l2(svm, l2_ss_injected_tf_test, FAKE_TRIPLE_FAULT_VECTOR, 0);
    GUEST_ASSERT_EQ(ctrl->exit_code, SVM_EXIT_SHUTDOWN);

    GUEST_DONE();
}

static void vmx_run_l2(void *l2_code, int vector, uint32_t error_code)
{
    GUEST_ASSERT(!vmwrite(GUEST_RIP, (u64)l2_code));

    GUEST_ASSERT_EQ(vector == SS_VECTOR ? vmlaunch() : vmresume(), 0);

    if (vector == FAKE_TRIPLE_FAULT_VECTOR)
        return;

    GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_EXCEPTION_NMI);
    GUEST_ASSERT_EQ((vmreadz(VM_EXIT_INTR_INFO) & 0xff), vector);
    GUEST_ASSERT_EQ(vmreadz(VM_EXIT_INTR_ERROR_CODE), error_code);
}

static void l1_vmx_code(struct vmx_pages *vmx)
{
    unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];

    GUEST_ASSERT_EQ(prepare_for_vmx_operation(vmx), true);

    GUEST_ASSERT_EQ(load_vmcs(vmx), true);

    prepare_vmcs(vmx, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
    GUEST_ASSERT_EQ(vmwrite(GUEST_IDTR_LIMIT, 0), 0);

    /*
     * VMX disallows injecting an exception with error_code[31:16] != 0,
     * and hardware will never generate a VM-Exit with bits 31:16 set.
     * KVM should likewise truncate the "bad" userspace value.
     */
    GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS_GP_DF), 0);
    vmx_run_l2(l2_ss_pending_test, SS_VECTOR, (u16)SS_ERROR_CODE);
    vmx_run_l2(l2_ss_injected_gp_test, GP_VECTOR, GP_ERROR_CODE_INTEL);

    GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS_DF), 0);
    vmx_run_l2(l2_ss_injected_df_test, DF_VECTOR, DF_ERROR_CODE);

    GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS), 0);
    vmx_run_l2(l2_ss_injected_tf_test, FAKE_TRIPLE_FAULT_VECTOR, 0);
    GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_TRIPLE_FAULT);

    GUEST_DONE();
}

static void __attribute__((__flatten__)) l1_guest_code(void *test_data)
{
    if (this_cpu_has(X86_FEATURE_SVM))
        l1_svm_code(test_data);
    else
        l1_vmx_code(test_data);
}

static void assert_ucall_vector(struct kvm_vcpu *vcpu, int vector)
{
    struct kvm_run *run = vcpu->run;
    struct ucall uc;

    TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
                "Unexpected exit reason: %u (%s),\n",
                run->exit_reason, exit_reason_str(run->exit_reason));

    switch (get_ucall(vcpu, &uc)) {
    case UCALL_SYNC:
        TEST_ASSERT(vector == uc.args[1],
                    "Expected L2 to ask for %d, got %ld", vector, uc.args[1]);
        break;
    case UCALL_DONE:
        TEST_ASSERT(vector == -1,
                    "Expected L2 to ask for %d, L2 says it's done", vector);
        break;
    case UCALL_ABORT:
        TEST_FAIL("%s at %s:%ld (0x%lx != 0x%lx)",
                  (const char *)uc.args[0], __FILE__, uc.args[1],
                  uc.args[2], uc.args[3]);
        break;
    default:
        TEST_FAIL("Expected L2 to ask for %d, got unexpected ucall %lu", vector, uc.cmd);
    }
}

static void queue_ss_exception(struct kvm_vcpu *vcpu, bool inject)
{
    struct kvm_vcpu_events events;

    vcpu_events_get(vcpu, &events);

    TEST_ASSERT(!events.exception.pending,
                "Vector %d unexpectedlt pending", events.exception.nr);
    TEST_ASSERT(!events.exception.injected,
                "Vector %d unexpectedly injected", events.exception.nr);

    events.flags = KVM_VCPUEVENT_VALID_PAYLOAD;
    events.exception.pending = !inject;
    events.exception.injected = inject;
    events.exception.nr = SS_VECTOR;
    events.exception.has_error_code = true;
    events.exception.error_code = SS_ERROR_CODE;
    vcpu_events_set(vcpu, &events);
}

/*
 * Verify KVM_{G,S}ET_EVENTS play nice with pending vs. injected exceptions
 * when an exception is being queued for L2.  Specifically, verify that KVM
 * honors L1 exception intercept controls when a #SS is pending/injected,
 * triggers a #GP on vectoring the #SS, morphs to #DF if #GP isn't intercepted
 * by L1, and finally causes (nested) SHUTDOWN if #DF isn't intercepted by L1.
 */
int main(int argc, char *argv[])
{
    vm_vaddr_t nested_test_data_gva;
    struct kvm_vcpu_events events;
    struct kvm_vcpu *vcpu;
    struct kvm_vm *vm;

    TEST_REQUIRE(kvm_has_cap(KVM_CAP_EXCEPTION_PAYLOAD));
    TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM) || kvm_cpu_has(X86_FEATURE_VMX));

    vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
    vm_enable_cap(vm, KVM_CAP_EXCEPTION_PAYLOAD, -2ul);

    if (kvm_cpu_has(X86_FEATURE_SVM))
        vcpu_alloc_svm(vm, &nested_test_data_gva);
    else
        vcpu_alloc_vmx(vm, &nested_test_data_gva);

    vcpu_args_set(vcpu, 1, nested_test_data_gva);

    /* Run L1 => L2.  L2 should sync and request #SS. */
    vcpu_run(vcpu);
    assert_ucall_vector(vcpu, SS_VECTOR);

    /* Pend #SS and request immediate exit.  #SS should still be pending. */
    queue_ss_exception(vcpu, false);
    vcpu->run->immediate_exit = true;
    vcpu_run_complete_io(vcpu);

    /* Verify the pending events comes back out the same as it went in. */
    vcpu_events_get(vcpu, &events);
    ASSERT_EQ(events.flags & KVM_VCPUEVENT_VALID_PAYLOAD,
              KVM_VCPUEVENT_VALID_PAYLOAD);
    ASSERT_EQ(events.exception.pending, true);
    ASSERT_EQ(events.exception.nr, SS_VECTOR);
    ASSERT_EQ(events.exception.has_error_code, true);
    ASSERT_EQ(events.exception.error_code, SS_ERROR_CODE);

    /*
     * Run for real with the pending #SS, L1 should get a VM-Exit due to
     * #SS interception and re-enter L2 to request #GP (via injected #SS).
     */
    vcpu->run->immediate_exit = false;
    vcpu_run(vcpu);
    assert_ucall_vector(vcpu, GP_VECTOR);

    /*
     * Inject #SS, the #SS should bypass interception and cause #GP, which
     * L1 should intercept before KVM morphs it to #DF.  L1 should then
     * disable #GP interception and run L2 to request #DF (via #SS => #GP).
     */
    queue_ss_exception(vcpu, true);
    vcpu_run(vcpu);
    assert_ucall_vector(vcpu, DF_VECTOR);

    /*
     * Inject #SS, the #SS should bypass interception and cause #GP, which
     * L1 is no longer interception, and so should see a #DF VM-Exit.  L1
     * should then signal that is done.
     */
    queue_ss_exception(vcpu, true);
    vcpu_run(vcpu);
    assert_ucall_vector(vcpu, FAKE_TRIPLE_FAULT_VECTOR);

    /*
     * Inject #SS yet again.  L1 is not intercepting #GP or #DF, and so
     * should see nested TRIPLE_FAULT / SHUTDOWN.
     */
    queue_ss_exception(vcpu, true);
    vcpu_run(vcpu);
    assert_ucall_vector(vcpu, -1);

    kvm_vm_free(vm);
}
@@ -118,13 +118,6 @@ void run_test(int reclaim_period_ms, bool disable_nx_huge_pages,
 	vm = vm_create(1);
 
 	if (disable_nx_huge_pages) {
-		/*
-		 * Cannot run the test without NX huge pages if the kernel
-		 * does not support it.
-		 */
-		if (!kvm_check_cap(KVM_CAP_VM_DISABLE_NX_HUGE_PAGES))
-			return;
-
 		r = __vm_disable_nx_huge_pages(vm);
 		if (reboot_permissions) {
 			TEST_ASSERT(!r, "Disabling NX huge pages should succeed if process has reboot permissions");
@@ -248,18 +241,13 @@ int main(int argc, char **argv)
 		}
 	}
 
-	if (token != MAGIC_TOKEN) {
-		print_skip("This test must be run with the magic token %d.\n"
-			   "This is done by nx_huge_pages_test.sh, which\n"
-			   "also handles environment setup for the test.",
-			   MAGIC_TOKEN);
-		exit(KSFT_SKIP);
-	}
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_VM_DISABLE_NX_HUGE_PAGES));
+	TEST_REQUIRE(reclaim_period_ms > 0);
 
-	if (!reclaim_period_ms) {
-		print_skip("The NX reclaim period must be specified and non-zero");
-		exit(KSFT_SKIP);
-	}
+	__TEST_REQUIRE(token == MAGIC_TOKEN,
+		       "This test must be run with the magic token %d.\n"
+		       "This is done by nx_huge_pages_test.sh, which\n"
+		       "also handles environment setup for the test.");
 
 	run_test(reclaim_period_ms, false, reboot_permissions);
 	run_test(reclaim_period_ms, true, reboot_permissions);
@@ -3409,10 +3409,8 @@ static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
 	int ret = -EINTR;
 	int idx = srcu_read_lock(&vcpu->kvm->srcu);
 
-	if (kvm_arch_vcpu_runnable(vcpu)) {
-		kvm_make_request(KVM_REQ_UNHALT, vcpu);
+	if (kvm_arch_vcpu_runnable(vcpu))
 		goto out;
-	}
 	if (kvm_cpu_has_pending_timer(vcpu))
 		goto out;
 	if (signal_pending(current))
@@ -5881,7 +5879,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 	r = kvm_async_pf_init();
 	if (r)
-		goto out_free_5;
+		goto out_free_4;
 
 	kvm_chardev_ops.owner = module;
 
@@ -5905,10 +5903,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
 
 out_unreg:
 	kvm_async_pf_deinit();
-out_free_5:
+out_free_4:
 	for_each_possible_cpu(cpu)
 		free_cpumask_var(per_cpu(cpu_kick_mask, cpu));
-out_free_4:
 	kmem_cache_destroy(kvm_vcpu_cache);
 out_free_3:
 	unregister_reboot_notifier(&kvm_reboot_notifier);