linux/arch/x86/kvm
Sean Christopherson 3a0f64de47 KVM: x86/mmu: Don't advance iterator after restart due to yielding
After dropping mmu_lock in the TDP MMU, restart the iterator during
tdp_iter_next() and do not advance the iterator.  Advancing the iterator
results in skipping the top-level SPTE and all its children, which is
fatal if any of the skipped SPTEs were not visited before yielding.

When zapping all SPTEs, i.e. when min_level == root_level, restarting the
iter and then invoking tdp_iter_next() is always fatal if the current gfn
has as a valid SPTE, as advancing the iterator results in try_step_side()
skipping the current gfn, which wasn't visited before yielding.

Sprinkle WARNs on iter->yielded being true in various helpers that are
often used in conjunction with yielding, and tag the helper with
__must_check to reduce the probabily of improper usage.

Failing to zap a top-level SPTE manifests in one of two ways.  If a valid
SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(),
the shadow page will be leaked and KVM will WARN accordingly.

  WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm]
  RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
   kvm_destroy_vm+0x162/0x2a0 [kvm]
   kvm_vcpu_release+0x34/0x60 [kvm]
   __fput+0x82/0x240
   task_work_run+0x5c/0x90
   do_exit+0x364/0xa10
   ? futex_unqueue+0x38/0x60
   do_group_exit+0x33/0xa0
   get_signal+0x155/0x850
   arch_do_signal_or_restart+0xed/0x750
   exit_to_user_mode_prepare+0xc5/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x48/0xc0
   entry_SYSCALL_64_after_hwframe+0x44/0xae

If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by
kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of
marking a struct page as dirty/accessed after it has been put back on the
free list.  This directly triggers a WARN due to encountering a page with
page_count() == 0, but it can also lead to data corruption and additional
errors in the kernel.

  WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171
  RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm]
  Call Trace:
   <TASK>
   kvm_set_pfn_dirty+0x120/0x1d0 [kvm]
   __handle_changed_spte+0x92e/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   __handle_changed_spte+0x63c/0xca0 [kvm]
   zap_gfn_range+0x549/0x620 [kvm]
   kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm]
   mmu_free_root_page+0x219/0x2c0 [kvm]
   kvm_mmu_free_roots+0x1b4/0x4e0 [kvm]
   kvm_mmu_unload+0x1c/0xa0 [kvm]
   kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm]
   kvm_put_kvm+0x3b1/0x8b0 [kvm]
   kvm_vcpu_release+0x4e/0x70 [kvm]
   __fput+0x1f7/0x8c0
   task_work_run+0xf8/0x1a0
   do_exit+0x97b/0x2230
   do_group_exit+0xda/0x2a0
   get_signal+0x3be/0x1e50
   arch_do_signal_or_restart+0x244/0x17f0
   exit_to_user_mode_prepare+0xcb/0x120
   syscall_exit_to_user_mode+0x1d/0x40
   do_syscall_64+0x4d/0x90
   entry_SYSCALL_64_after_hwframe+0x44/0xae

Note, the underlying bug existed even before commit 1af4a96025 ("KVM:
x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to
tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still
incorrectly advance past a top-level entry when yielding on a lower-level
entry.  But with respect to leaking shadow pages, the bug was introduced
by yielding before processing the current gfn.

Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or
callers could jump to their "retry" label.  The downside of that approach
is that tdp_mmu_iter_cond_resched() _must_ be called before anything else
in the loop, and there's no easy way to enfornce that requirement.

Ideally, KVM would handling the cond_resched() fully within the iterator
macro (the code is actually quite clean) and avoid this entire class of
bugs, but that is extremely difficult do while also supporting yielding
after tdp_mmu_set_spte_atomic() fails.  Yielding after failing to set a
SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly
bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE
may block operations on the SPTE for a significant amount of time.

Fixes: faaf05b00a ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 1af4a96025 ("KVM: x86/mmu: Yield in TDU MMU iter even if no SPTES changed")
Reported-by: Ignat Korchagin <ignat@cloudflare.com>
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211214033528.123268-1-seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2021-12-20 08:06:53 -05:00
..
mmu KVM: x86/mmu: Don't advance iterator after restart due to yielding 2021-12-20 08:06:53 -05:00
svm KVM: SVM: Do not terminate SEV-ES guests on GHCB validation failure 2021-12-05 03:02:04 -05:00
vmx KVM: nVMX: Don't use Enlightened MSR Bitmap for L3 2021-12-08 04:53:13 -05:00
cpuid.c cpuid: kvm_find_kvm_cpuid_features() should be declared 'static' 2021-11-18 02:03:14 -05:00
cpuid.h KVM: x86: Move reverse CPUID helpers to separate header file 2021-04-26 05:27:13 -04:00
debugfs.c KVM: X86: Introduce mmu_rmaps_stat per-vm debugfs file 2021-08-20 16:06:11 -04:00
emulate.c KVM/arm64 updates for Linux 5.16 2021-10-31 02:28:48 -04:00
fpu.h KVM: x86: Move FPU register accessors into fpu.h 2021-06-17 13:09:24 -04:00
hyperv.c KVM: x86: Ignore sparse banks size for an "all CPUs", non-sparse IPI req 2021-12-10 07:12:42 -05:00
hyperv.h KVM: x86: Query vcpu->vcpu_idx directly and drop its accessor 2021-09-22 10:33:11 -04:00
i8254.c KVM: x86: Migrate the PIT only if vcpu0 is migrated, not any BSP 2021-08-02 11:01:51 -04:00
i8254.h
i8259.c x86: Fix typo s/ECLR/ELCR/ for the PIC register 2021-08-10 23:31:44 +02:00
ioapic.c kvm: rename KVM_MAX_VCPU_ID to KVM_MAX_VCPU_IDS 2021-09-30 04:27:05 -04:00
ioapic.h x86/kvm: remove unused ack_notifier callbacks 2021-11-18 07:05:57 -05:00
irq_comm.c x86: Fix various typos in comments 2021-03-18 15:31:53 +01:00
irq.c KVM: x86/xen: Add event channel interrupt vector upcall 2021-02-04 14:19:39 +00:00
irq.h x86/kvm: remove unused ack_notifier callbacks 2021-11-18 07:05:57 -05:00
Kconfig KVM: x86: add config for non-kvm users of page tracking 2021-10-01 03:44:57 -04:00
kvm_cache_regs.h KVM: x86: Introduce KVM_GET_SREGS2 / KVM_SET_SREGS2 2021-06-17 13:09:47 -04:00
kvm_emulate.h KVM: x86: Drop "pre_" from enter/leave_smm() helpers 2021-06-17 13:09:35 -04:00
kvm_onhyperv.c KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx 2021-06-17 13:09:36 -04:00
kvm_onhyperv.h KVM: x86: hyper-v: Move the remote TLB flush logic out of vmx 2021-06-17 13:09:36 -04:00
lapic.c KVM: x86: check PIR even for vCPUs with disabled APICv 2021-11-30 03:52:39 -05:00
lapic.h KVM: x86: Rename kvm_lapic_enable_pv_eoi() 2021-11-11 10:56:22 -05:00
Makefile KVM: stats: Add fd-based API to read binary stats data 2021-06-24 11:47:57 -04:00
mmu.h KVM: X86: Don't unload MMU in kvm_vcpu_flush_tlb_guest() 2021-10-22 05:44:43 -04:00
mtrr.c KVM: x86: Add helper to consolidate "raw" reserved GPA mask calculations 2021-02-04 09:27:30 -05:00
pmu.c kvm: x86: Convert return type of *is_valid_rdpmc_ecx() to bool 2021-11-11 10:56:19 -05:00
pmu.h kvm: x86: Convert return type of *is_valid_rdpmc_ecx() to bool 2021-11-11 10:56:19 -05:00
reverse_cpuid.h KVM: SEV: Mask CPUID[0x8000001F].eax according to supported features 2021-04-26 05:27:15 -04:00
trace.h KVM: x86: Get exit_reason as part of kvm_x86_ops.get_exit_info 2021-10-25 06:48:24 -04:00
tss.h
x86.c KVM: x86: remove PMU FIXED_CTR3 from msrs_to_save_all 2021-12-20 10:51:19 +01:00
x86.h KVM: nVMX: Abide to KVM_REQ_TLB_FLUSH_GUEST request on nested vmentry/vmexit 2021-11-26 07:07:49 -05:00
xen.c KVM: x86: Assume a 64-bit hypercall for guests with protected state 2021-11-18 02:12:13 -05:00
xen.h KVM: xen: do not use struct gfn_to_hva_cache 2021-08-05 03:31:40 -04:00