linux

Author	SHA1	Message	Date
Sean Christopherson	3b013a2972	KVM: nVMX: Always sync GUEST_BNDCFGS when it comes from vmcs01 If L1 does not set VM_ENTRY_LOAD_BNDCFGS, then L1's BNDCFGS value must be propagated to vmcs02 since KVM always runs with VM_ENTRY_LOAD_BNDCFGS when MPX is supported. Because the value effectively comes from vmcs01, vmcs02 must be updated even if vmcs12 is clean. Fixes: `62cf9bd811` ("KVM: nVMX: Fix emulation of VM_ENTRY_LOAD_BNDCFGS") Cc: stable@vger.kernel.org Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:52 +02:00
Sean Christopherson	d28f4290b5	KVM: VMX: Always signal #GP on WRMSR to MSR_IA32_CR_PAT with bad value The behavior of WRMSR is in no way dependent on whether or not KVM consumes the value. Fixes: `4566654bb9` ("KVM: vmx: Inject #GP on invalid PAT CR") Cc: stable@vger.kernel.org Cc: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:51 +02:00
Paolo Bonzini	b1346ab2af	KVM: nVMX: Rename prepare_vmcs02__full to prepare_vmcs02__rare These function do not prepare the entire state of the vmcs02, only the rarely needed parts. Rename them to make this clearer. Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:51 +02:00
Sean Christopherson	7952d769c2	KVM: nVMX: Sync rarely accessed guest fields only when needed Many guest fields are rarely read (or written) by VMMs, i.e. likely aren't accessed between runs of a nested VMCS. Delay pulling rarely accessed guest fields from vmcs02 until they are VMREAD or until vmcs12 is dirtied. The latter case is necessary because nested VM-Entry will consume all manner of fields when vmcs12 is dirty, e.g. for consistency checks. Note, an alternative to synchronizing all guest fields on VMREAD would be to read only the field being accessed, but switching VMCS pointers is expensive and odds are good if one guest field is being accessed then others will soon follow, or that vmcs12 will be dirtied due to a VMWRITE (see above). And the full synchronization results in slightly cleaner code. Note, although GUEST_PDPTRs are relevant only for a 32-bit PAE guest, they are accessed quite frequently for said guests, and a separate patch is in flight to optimize away GUEST_PDTPR synchronziation for non-PAE guests. Skipping rarely accessed guest fields reduces the latency of a nested VM-Exit by ~200 cycles. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:50 +02:00
Sean Christopherson	e2174295b4	KVM: nVMX: Add helpers to identify shadowed VMCS fields So that future optimizations related to shadowed fields don't need to define their own switch statement. Add a BUILD_BUG_ON() to ensure at least one of the types (RW vs RO) is defined when including vmcs_shadow_fields.h (guess who keeps mistyping SHADOW_FIELD_RO as SHADOW_FIELD_R0). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:47 +02:00
Sean Christopherson	3731905ef2	KVM: nVMX: Use descriptive names for VMCS sync functions and flags Nested virtualization involves copying data between many different types of VMCSes, e.g. vmcs02, vmcs12, shadow VMCS and eVMCS. Rename a variety of functions and flags to document both the source and destination of each sync. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:06 +02:00
Sean Christopherson	f4f8316d2a	KVM: nVMX: Lift sync_vmcs12() out of prepare_vmcs12() ... to make it more obvious that sync_vmcs12() is invoked on all nested VM-Exits, e.g. hiding sync_vmcs12() in prepare_vmcs12() makes it appear that guest state is NOT propagated to vmcs12 for a normal VM-Exit. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:06 +02:00
Sean Christopherson	1c6f0b47fb	KVM: nVMX: Track vmcs12 offsets for shadowed VMCS fields The vmcs12 fields offsets are constant and known at compile time. Store the associated offset for each shadowed field to avoid the costly lookup in vmcs_field_to_offset() when copying between vmcs12 and the shadow VMCS. Avoiding the costly lookup reduces the latency of copying by ~100 cycles in each direction. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:05 +02:00
Sean Christopherson	b643780562	KVM: nVMX: Intercept VMWRITEs to GUEST_{CS,SS}_AR_BYTES VMMs frequently read the guest's CS and SS AR bytes to detect 64-bit mode and CPL respectively, but effectively never write said fields once the VM is initialized. Intercepting VMWRITEs for the two fields saves ~55 cycles in copy_shadow_to_vmcs12(). Because some Intel CPUs, e.g. Haswell, drop the reserved bits of the guest access rights fields on VMWRITE, exposing the fields to L1 for VMREAD but not VMWRITE leads to inconsistent behavior between L1 and L2. On hardware that drops the bits, L1 will see the stripped down value due to reading the value from hardware, while L2 will see the full original value as stored by KVM. To avoid such an inconsistency, emulate the behavior on all CPUS, but only for intercepted VMWRITEs so as to avoid introducing pointless latency into copy_shadow_to_vmcs12(), e.g. if the emulation were added to vmcs12_write_any(). Since the AR_BYTES emulation is done only for intercepted VMWRITE, if a future patch (re)exposed AR_BYTES for both VMWRITE and VMREAD, then KVM would end up with incosistent behavior on pre-Haswell hardware, e.g. KVM would drop the reserved bits on intercepted VMWRITE, but direct VMWRITE to the shadow VMCS would not drop the bits. Add a WARN in the shadow field initialization to detect any attempt to expose an AR_BYTES field without updating vmcs12_write_any(). Note, emulation of the AR_BYTES reserved bit behavior is based on a patch[1] from Jim Mattson that applied the emulation to all writes to vmcs12 so that live migration across different generations of hardware would not introduce divergent behavior. But given that live migration of nested state has already been enabled, that ship has sailed (not to mention that no sane VMM will be affected by this behavior). [1] https://patchwork.kernel.org/patch/10483321/ Cc: Jim Mattson <jmattson@google.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:05 +02:00
Sean Christopherson	fadcead00c	KVM: nVMX: Intercept VMWRITEs to read-only shadow VMCS fields Allowing L1 to VMWRITE read-only fields is only beneficial in a double nesting scenario, e.g. no sane VMM will VMWRITE VM_EXIT_REASON in normal non-nested operation. Intercepting RO fields means KVM doesn't need to sync them from the shadow VMCS to vmcs12 when running L2. The obvious downside is that L1 will VM-Exit more often when running L3, but it's likely safe to assume most folks would happily sacrifice a bit of L3 performance, which may not even be noticeable in the grande scheme, to improve L2 performance across the board. Not intercepting fields tagged read-only also allows for additional optimizations, e.g. marking GUEST_{CS,SS}_AR_BYTES as SHADOW_FIELD_RO since those fields are rarely written by a VMMs, but read frequently. When utilizing a shadow VMCS with asymmetric R/W and R/O bitmaps, fields that cause VM-Exit on VMWRITE but not VMREAD need to be propagated to the shadow VMCS during VMWRITE emulation, otherwise a subsequence VMREAD from L1 will consume a stale value. Note, KVM currently utilizes asymmetric bitmaps when "VMWRITE any field" is not exposed to L1, but only so that it can reject the VMWRITE, i.e. propagating the VMWRITE to the shadow VMCS is a new requirement, not a bug fix. Eliminating the copying of RO fields reduces the latency of nested VM-Entry (copy_shadow_to_vmcs12()) by ~100 cycles (plus 40-50 cycles if/when the AR_BYTES fields are exposed RO). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:04 +02:00
Sean Christopherson	95b5a48c4f	KVM: VMX: Handle NMIs, #MCs and async #PFs in common irqs-disabled fn Per commit `1b6269db3f` ("KVM: VMX: Handle NMIs before enabling interrupts and preemption"), NMIs are handled directly in vmx_vcpu_run() to "make sure we handle NMI on the current cpu, and that we don't service maskable interrupts before non-maskable ones". The other exceptions handled by complete_atomic_exit(), e.g. async #PF and #MC, have similar requirements, and are located there to avoid extra VMREADs since VMX bins hardware exceptions and NMIs into a single exit reason. Clean up the code and eliminate the vaguely named complete_atomic_exit() by moving the interrupts-disabled exception and NMI handling into the existing handle_external_intrs() callback, and rename the callback to a more appropriate name. Rename VMexit handlers throughout so that the atomic and non-atomic counterparts have similar names. In addition to improving code readability, this also ensures the NMI handler is run with the host's debug registers loaded in the unlikely event that the user is debugging NMIs. Accuracy of the last_guest_tsc field is also improved when handling NMIs (and #MCs) as the handler will run after updating said field. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> [Naming cleanups. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:04 +02:00
Sean Christopherson	165072b089	KVM: x86: Move kvm_{before,after}_interrupt() calls to vendor code VMX can conditionally call kvm_{before,after}_interrupt() since KVM always uses "ack interrupt on exit" and therefore explicitly handles interrupts as opposed to blindly enabling irqs. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:03 +02:00
Sean Christopherson	2342080cd6	KVM: VMX: Store the host kernel's IDT base in a global variable Although the kernel may use multiple IDTs, KVM should only ever see the "real" IDT, e.g. the early init IDT is long gone by the time KVM runs and the debug stack IDT is only used for small windows of time in very specific flows. Before commit `a547c6db4d` ("KVM: VMX: Enable acknowledge interupt on vmexit"), the kernel's IDT base was consumed by KVM only when setting constant VMCS state, i.e. to set VMCS.HOST_IDTR_BASE. Because constant host state is done once per vCPU, there was ostensibly no need to cache the kernel's IDT base. When support for "ack interrupt on exit" was introduced, KVM added a second consumer of the IDT base as handling already-acked interrupts requires directly calling the interrupt handler, i.e. KVM uses the IDT base to find the address of the handler. Because interrupts are a fast path, KVM cached the IDT base to avoid having to VMREAD HOST_IDTR_BASE. Presumably, the IDT base was cached on a per-vCPU basis simply because the existing code grabbed the IDT base on a per-vCPU (VMCS) basis. Note, all post-boot IDTs use the same handlers for external interrupts, i.e. the "ack interrupt on exit" use of the IDT base would be unaffected even if the cached IDT somehow did not match the current IDT. And as for the original use case of setting VMCS.HOST_IDTR_BASE, if any of the above analysis is wrong then KVM has had a bug since the beginning of time since KVM has effectively been caching the IDT at vCPU creation since commit a8b732ca01c ("[PATCH] kvm: userspace interface"). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:02 +02:00
Sean Christopherson	49def500e5	KVM: VMX: Read cached VM-Exit reason to detect external interrupt Generic x86 code invokes the kvm_x86_ops external interrupt handler on all VM-Exits regardless of the actual exit type. Use the already-cached EXIT_REASON to determine if the VM-Exit was due to an interrupt, thus avoiding an extra VMREAD (to query VM_EXIT_INTR_INFO) for all other types of VM-Exit. In addition to avoiding the extra VMREAD, checking the EXIT_REASON instead of VM_EXIT_INTR_INFO makes it more obvious that vmx_handle_external_intr() is called for all VM-Exits, e.g. someone unfamiliar with the flow might wonder under what condition(s) VM_EXIT_INTR_INFO does not contain a valid interrupt, which is simply not possible since KVM always runs with "ack interrupt on exit". WARN once if VM_EXIT_INTR_INFO doesn't contain a valid interrupt on an EXTERNAL_INTERRUPT VM-Exit, as such a condition would indicate a hardware bug. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:02 +02:00
Paolo Bonzini	2ea7203980	kvm: nVMX: small cleanup in handle_exception The reason for skipping handling of NMI and #MC in handle_exception is the same, namely they are handled earlier by vmx_complete_atomic_exit. Calling the machine check handler (which just returns 1) is misleading, don't do it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:46:01 +02:00
Sean Christopherson	beb8d93b3e	KVM: VMX: Fix handling of #MC that occurs during VM-Entry A previous fix to prevent KVM from consuming stale VMCS state after a failed VM-Entry inadvertantly blocked KVM's handling of machine checks that occur during VM-Entry. Per Intel's SDM, a #MC during VM-Entry is handled in one of three ways, depending on when the #MC is recognoized. As it pertains to this bug fix, the third case explicitly states EXIT_REASON_MCE_DURING_VMENTRY is handled like any other VM-Exit during VM-Entry, i.e. sets bit 31 to indicate the VM-Entry failed. If a machine-check event occurs during a VM entry, one of the following occurs: - The machine-check event is handled as if it occurred before the VM entry: ... - The machine-check event is handled after VM entry completes: ... - A VM-entry failure occurs as described in Section 26.7. The basic exit reason is 41, for "VM-entry failure due to machine-check event". Explicitly handle EXIT_REASON_MCE_DURING_VMENTRY as a one-off case in vmx_vcpu_run() instead of binning it into vmx_complete_atomic_exit(). Doing so allows vmx_vcpu_run() to handle VMX_EXIT_REASONS_FAILED_VMENTRY in a sane fashion and also simplifies vmx_complete_atomic_exit() since VMCS.VM_EXIT_INTR_INFO is guaranteed to be fresh. Fixes: `b060ca3b2e` ("kvm: vmx: Handle VMLAUNCH/VMRESUME failure properly") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:45:44 +02:00
Paolo Bonzini	73f624f47c	KVM: x86: move MSR_IA32_POWER_CTL handling to common code Make it available to AMD hosts as well, just in case someone is trying to use an Intel processor's CPUID setup. Suggested-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:43:48 +02:00
Eugene Korenevsky	fdb28619a8	kvm: vmx: segment limit check: use access length There is an imperfection in get_vmx_mem_address(): access length is ignored when checking the limit. To fix this, pass access length as a function argument. The access length is usually obvious since it is used by callers after get_vmx_mem_address() call, but for vmread/vmwrite it depends on the state of 64-bit mode. Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:43:45 +02:00
Eugene Korenevsky	c1a9acbc52	kvm: vmx: fix limit checking in get_vmx_mem_address() Intel SDM vol. 3, 5.3: The processor causes a general-protection exception (or, if the segment is SS, a stack-fault exception) any time an attempt is made to access the following addresses in a segment: - A byte at an offset greater than the effective limit - A word at an offset greater than the (effective-limit â€“ 1) - A doubleword at an offset greater than the (effective-limit â€“ 3) - A quadword at an offset greater than the (effective-limit â€“ 7) Therefore, the generic limit checking error condition must be exn = (off > limit + 1 - access_len) = (off + access_len - 1 > limit) but not exn = (off + access_len > limit) as for now. Also avoid integer overflow of `off` at 32-bit KVM by casting it to u64. Note: access length is currently sizeof(u64) which is incorrect. This will be fixed in the subsequent patch. Signed-off-by: Eugene Korenevsky <ekorenevsky@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:43:45 +02:00
Liran Alon	1fc5d19472	KVM: x86: Use DR_TRAP_BITS instead of hard-coded 15 Make all code consistent with kvm_deliver_exception_payload() by using appropriate symbolic constant instead of hard-coded number. Reviewed-by: Nikita Leshenko <nikita.leshchenko@oracle.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-18 11:43:42 +02:00
Uros Bizjak	1ae4de23ed	KVM: VMX: remove unneeded 'asm volatile ("")' from vmcs_write64 __vmcs_writel uses volatile asm, so there is no need to insert another one between the first and the second call to __vmcs_writel in order to prevent unwanted code moves for 32bit targets. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-05 14:14:49 +02:00
Jan Beulich	5a253552a5	x86/kvm/VMX: drop bad asm() clobber from nested_vmx_check_vmentry_hw() While upstream gcc doesn't detect conflicts on cc (yet), it really should, and hence "cc" should not be specified for asm()-s also having "=@cc<cond>" outputs. (It is quite pointless anyway to specify a "cc" clobber in x86 inline assembly, since the compiler assumes it to be always clobbered, and has no means [yet] to suppress this behavior.) Signed-off-by: Jan Beulich <jbeulich@suse.com> Fixes: `bbc0b82392` ("KVM: nVMX: Capture VM-Fail via CC_{SET,OUT} in nested early checks") Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-05 14:14:48 +02:00
Wanpeng Li	b51700632e	KVM: X86: Provide a capability to disable cstate msr read intercepts Allow guest reads CORE cstate when exposing host CPU power management capabilities to the guest. PKG cstate is restricted to avoid a guest to get the whole package information in multi-tenant scenario. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-04 19:27:35 +02:00
Sean Christopherson	f257d6dcda	KVM: Directly return result from kvm_arch_check_processor_compat() Add a wrapper to invoke kvm_arch_check_processor_compat() so that the boilerplate ugliness of checking virtualization support on all CPUs is hidden from the arch specific code. x86's implementation in particular is quite heinous, as it unnecessarily propagates the out-param pattern into kvm_x86_ops. While the x86 specific issue could be resolved solely by changing kvm_x86_ops, make the change for all architectures as returning a value directly is prettier and technically more robust, e.g. s390 doesn't set the out param, which could lead to subtle breakage in the (highly unlikely) scenario where the out-param was not pre-initialized by the caller. Opportunistically annotate svm_check_processor_compat() with __init. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-04 19:27:32 +02:00
Wanpeng Li	b6c4bc659c	KVM: LAPIC: Optimize timer latency further Advance lapic timer tries to hidden the hypervisor overhead between the host emulated timer fires and the guest awares the timer is fired. However, it just hidden the time between apic_timer_fn/handle_preemption_timer -> wait_lapic_expire, instead of the real position of vmentry which is mentioned in the orignial commit `d0659d946b` ("KVM: x86: add option to advance tscdeadline hrtimer expiration"). There is 700+ cpu cycles between the end of wait_lapic_expire and before world switch on my haswell desktop. This patch tries to narrow the last gap(wait_lapic_expire -> world switch), it takes the real overhead time between apic_timer_fn/handle_preemption_timer and before world switch into consideration when adaptively tuning timer advancement. The patch can reduce 40% latency (~1600+ cycles to ~1000+ cycles on a haswell desktop) for kvm-unit-tests/tscdeadline_latency when testing busy waits. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Sean Christopherson <sean.j.christopherson@intel.com> Cc: Liran Alon <liran.alon@oracle.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-06-04 19:27:30 +02:00
Paolo Bonzini	2924b52117	KVM: x86/pmu: do not mask the value that is written to fixed PMUs According to the SDM, for MSR_IA32_PERFCTR0/1 "the lower-order 32 bits of each MSR may be written with any value, and the high-order 8 bits are sign-extended according to the value of bit 31", but the fixed counters in real hardware are limited to the width of the fixed counters ("bits beyond the width of the fixed-function counter are reserved and must be written as zeros"). Fix KVM to do the same. Reported-by: Nadav Amit <nadav.amit@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-24 21:27:14 +02:00
Paolo Bonzini	0e6f467ee2	KVM: x86/pmu: mask the result of rdpmc according to the width of the counters This patch will simplify the changes in the next, by enforcing the masking of the counters to RDPMC and RDMSR. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-24 21:27:13 +02:00
Paolo Bonzini	6f2f84532c	KVM: x86: do not spam dmesg with VMCS/VMCB dumps Userspace can easily set up invalid processor state in such a way that dmesg will be filled with VMCS or VMCB dumps. Disable this by default using a module parameter. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-24 21:27:12 +02:00
Yi Wang	4d25996565	kvm: vmx: Fix -Wmissing-prototypes warnings We get a warning when build kernel W=1: arch/x86/kvm/vmx/vmx.c:6365:6: warning: no previous prototype for ‘vmx_update_host_rsp’ [-Wmissing-prototypes] void vmx_update_host_rsp(struct vcpu_vmx *vmx, unsigned long host_rsp) Add the missing declaration to fix this. Signed-off-by: Yi Wang <wang.yi59@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-24 21:27:08 +02:00
Wanpeng Li	541e886f79	KVM: nVMX: Fix using __this_cpu_read() in preemptible context BUG: using __this_cpu_read() in preemptible [00000000] code: qemu-system-x86/4590 caller is nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel] CPU: 4 PID: 4590 Comm: qemu-system-x86 Tainted: G OE 5.1.0-rc4+ #1 Call Trace: dump_stack+0x67/0x95 __this_cpu_preempt_check+0xd2/0xe0 nested_vmx_enter_non_root_mode+0xebd/0x1790 [kvm_intel] nested_vmx_run+0xda/0x2b0 [kvm_intel] handle_vmlaunch+0x13/0x20 [kvm_intel] vmx_handle_exit+0xbd/0x660 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xa2c/0x1e50 [kvm] kvm_vcpu_ioctl+0x3ad/0x6d0 [kvm] do_vfs_ioctl+0xa5/0x6e0 ksys_ioctl+0x6d/0x80 __x64_sys_ioctl+0x1a/0x20 do_syscall_64+0x6f/0x6c0 entry_SYSCALL_64_after_hwframe+0x49/0xbe Accessing per-cpu variable should disable preemption, this patch extends the preemption disable region for __this_cpu_read(). Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpengli@tencent.com> Fixes: `52017608da` ("KVM: nVMX: add option to perform early consistency checks via H/W") Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-24 21:27:08 +02:00
Sean Christopherson	21be4ca1ea	KVM: nVMX: Clear nested_run_pending if setting nested state fails VMX's nested_run_pending flag is subtly consumed when stuffing state to enter guest mode, i.e. needs to be set according before KVM knows if setting guest state is successful. If setting guest state fails, clear the flag as a nested run is obviously not pending. Reported-by: Aaron Lewis <aaronlewis@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-24 21:27:03 +02:00
Paolo Bonzini	db80927ea1	KVM: nVMX: really fix the size checks on KVM_SET_NESTED_STATE The offset for reading the shadow VMCS is sizeof(kvm_state)+VMCS12_SIZE, so the correct size must be that plus sizeof(vmcs12). This could lead to KVM reading garbage data from userspace and not reporting an error, but is otherwise not sensitive. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-24 21:27:02 +02:00
Linus Torvalds	0ef0fd3515	* ARM: support for SVE and Pointer Authentication in guests, PMU improvements * POWER: support for direct access to the POWER9 XIVE interrupt controller, memory and performance optimizations. * x86: support for accessing memory not backed by struct page, fixes and refactoring * Generic: dirty page tracking improvements -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.22 (GNU/Linux) iQEcBAABAgAGBQJc3qV/AAoJEL/70l94x66Dn3QH/jX1Bn0P/RZAIt4w0SySklSg PqxUKDyBQqB9vN9Qeb9jWXAKPH2CtM3+up/rz7oRnBWp7qA6vXcC/R/QJYAvzdXE nklsR/oYCsflR1KdlVYuDvvPCPP2fLBU5zfN83OsaBQ8fNRkm3gN+N5XQ2SbXbLy Mo9tybS4otY201UAC96e8N0ipwwyCRpDneQpLcl+F5nH3RBt63cVbs04O+70MXn7 eT4I+8K3+Go7LATzT8hglD21D/7uvE31qQb6yr5L33IfhU4GB51RZzBXTNaAdY8n hT1rMrRkAMAFWYZPQDfoMadjWU3i5DIfstKjDxOr9oTfuOEp5Z+GvJwvVnUDg1I= =D0+p -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "ARM: - support for SVE and Pointer Authentication in guests - PMU improvements POWER: - support for direct access to the POWER9 XIVE interrupt controller - memory and performance optimizations x86: - support for accessing memory not backed by struct page - fixes and refactoring Generic: - dirty page tracking improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits) kvm: fix compilation on aarch64 Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" kvm: x86: Fix L1TF mitigation for shadow MMU KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing" KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete tests: kvm: Add tests for KVM_SET_NESTED_STATE KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID tests: kvm: Add tests to .gitignore KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one KVM: Fix the bitmap range to copy during clear dirty KVM: arm64: Fix ptrauth ID register masking logic KVM: x86: use direct accessors for RIP and RSP KVM: VMX: Use accessors for GPRs outside of dedicated caching logic KVM: x86: Omit caching logic for always-available GPRs kvm, x86: Properly check whether a pfn is an MMIO or not ...	2019-05-17 10:33:30 -07:00
Sean Christopherson	f93f7ede08	Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" The RDPMC-exiting control is dependent on the existence of the RDPMC instruction itself, i.e. is not tied to the "Architectural Performance Monitoring" feature. For all intents and purposes, the control exists on all CPUs with VMX support since RDPMC also exists on all VCPUs with VMX supported. Per Intel's SDM: The RDPMC instruction was introduced into the IA-32 Architecture in the Pentium Pro processor and the Pentium processor with MMX technology. The earlier Pentium processors have performance-monitoring counters, but they must be read with the RDMSR instruction. Because RDPMC-exiting always exists, KVM requires the control and refuses to load if it's not available. As a result, hiding the PMU from a guest breaks nested virtualization if the guest attemts to use KVM. While it's not explicitly stated in the RDPMC pseudocode, the VM-Exit check for RDPMC-exiting follows standard fault vs. VM-Exit prioritization for privileged instructions, e.g. occurs after the CPL/CR0.PE/CR4.PCE checks, but before the counter referenced in ECX is checked for validity. In other words, the original KVM behavior of injecting a #GP was correct, and the KVM unit test needs to be adjusted accordingly, e.g. eat the #GP when the unit test guest (L3 in this case) executes RDPMC without RDPMC-exiting set in the unit test host (L2). This reverts commit `e51bfdb687`. Fixes: `e51bfdb687` ("KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU") Reported-by: David Hill <hilld@binarystorm.net> Cc: Saar Amar <saaramar@microsoft.com> Cc: Mihai Carabas <mihai.carabas@oracle.com> Cc: Jim Mattson <jmattson@google.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-15 23:19:19 +02:00
Sean Christopherson	d69129b4e4	KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible If L1 is using an MSR bitmap, unconditionally merge the MSR bitmaps from L0 and L1 for MSR_{KERNEL,}_{FS,GS}_BASE. KVM unconditionally exposes MSRs L1. If KVM is also running in L1 then it's highly likely L1 is also exposing the MSRs to L2, i.e. KVM doesn't need to intercept L2 accesses. Based on code from Jintack Lim. Cc: Jintack Lim <jintack@xxxxxxxxxxxxxxx> Signed-off-by: Sean Christopherson <sean.j.christopherson@xxxxxxxxx> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-15 22:53:44 +02:00
Linus Torvalds	fa4bff1650	Merge branch 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 MDS mitigations from Thomas Gleixner: "Microarchitectural Data Sampling (MDS) is a hardware vulnerability which allows unprivileged speculative access to data which is available in various CPU internal buffers. This new set of misfeatures has the following CVEs assigned: CVE-2018-12126 MSBDS Microarchitectural Store Buffer Data Sampling CVE-2018-12130 MFBDS Microarchitectural Fill Buffer Data Sampling CVE-2018-12127 MLPDS Microarchitectural Load Port Data Sampling CVE-2019-11091 MDSUM Microarchitectural Data Sampling Uncacheable Memory MDS attacks target microarchitectural buffers which speculatively forward data under certain conditions. Disclosure gadgets can expose this data via cache side channels. Contrary to other speculation based vulnerabilities the MDS vulnerability does not allow the attacker to control the memory target address. As a consequence the attacks are purely sampling based, but as demonstrated with the TLBleed attack samples can be postprocessed successfully. The mitigation is to flush the microarchitectural buffers on return to user space and before entering a VM. It's bolted on the VERW instruction and requires a microcode update. As some of the attacks exploit data structures shared between hyperthreads, full protection requires to disable hyperthreading. The kernel does not do that by default to avoid breaking unattended updates. The mitigation set comes with documentation for administrators and a deeper technical view" * 'x86-mds-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits) x86/speculation/mds: Fix documentation typo Documentation: Correct the possible MDS sysfs values x86/mds: Add MDSUM variant to the MDS documentation x86/speculation/mds: Add 'mitigations=' support for MDS x86/speculation/mds: Print SMT vulnerable on MSBDS with mitigations off x86/speculation/mds: Fix comment x86/speculation/mds: Add SMT warning message x86/speculation: Move arch_smt_update() call to after mitigation decisions x86/speculation/mds: Add mds=full,nosmt cmdline option Documentation: Add MDS vulnerability documentation Documentation: Move L1TF to separate directory x86/speculation/mds: Add mitigation mode VMWERV x86/speculation/mds: Add sysfs reporting for MDS x86/speculation/mds: Add mitigation control for MDS x86/speculation/mds: Conditionally clear CPU buffers on idle entry x86/kvm/vmx: Add MDS protection when L1D Flush is not active x86/speculation/mds: Clear CPU buffers on exit to user x86/speculation/mds: Add mds_clear_cpu_buffers() x86/kvm: Expose X86_FEATURE_MD_CLEAR to guests x86/speculation/mds: Add BUG_MSBDS_ONLY ...	2019-05-14 07:57:29 -07:00
Aaron Lewis	9b5db6c762	kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete nested_run_pending=1 implies we have successfully entered guest mode. Move setting from external state in vmx_set_nested_state() until after all other checks are complete. Based on a patch by Aaron Lewis. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-08 14:14:08 +02:00
Aaron Lewis	332d079735	KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state Move call to nested_enable_evmcs until after free_nested() is complete. Signed-off-by: Aaron Lewis <aaronlewis@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Shier <pshier@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-08 14:12:08 +02:00
Linus Torvalds	8ff468c29e	Merge branch 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 FPU state handling updates from Borislav Petkov: "This contains work started by Rik van Riel and brought to fruition by Sebastian Andrzej Siewior with the main goal to optimize when to load FPU registers: only when returning to userspace and not on every context switch (while the task remains in the kernel). In addition, this optimization makes kernel_fpu_begin() cheaper by requiring registers saving only on the first invocation and skipping that in following ones. What is more, this series cleans up and streamlines many aspects of the already complex FPU code, hopefully making it more palatable for future improvements and simplifications. Finally, there's a __user annotations fix from Jann Horn" * 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits) x86/fpu: Fault-in user stack if copy_fpstate_to_sigframe() fails x86/pkeys: Add PKRU value to init_fpstate x86/fpu: Restore regs in copy_fpstate_to_sigframe() in order to use the fastpath x86/fpu: Add a fastpath to copy_fpstate_to_sigframe() x86/fpu: Add a fastpath to __fpu__restore_sig() x86/fpu: Defer FPU state load until return to userspace x86/fpu: Merge the two code paths in __fpu__restore_sig() x86/fpu: Restore from kernel memory on the 64-bit path too x86/fpu: Inline copy_user_to_fpregs_zeroing() x86/fpu: Update xstate's PKRU value on write_pkru() x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD x86/fpu: Always store the registers in copy_fpstate_to_sigframe() x86/entry: Add TIF_NEED_FPU_LOAD x86/fpu: Eager switch PKRU state x86/pkeys: Don't check if PKRU is zero before writing it x86/fpu: Only write PKRU if it is different from current x86/pkeys: Provide *pkru() helpers x86/fpu: Use a feature number instead of mask in two more helpers x86/fpu: Make __raw_xsave_addr() use a feature number instead of mask x86/fpu: Add an __fpregs_load_activate() internal helper ...	2019-05-07 10:24:10 -07:00
Jim Mattson	e8ab8d24b4	KVM: nVMX: Fix size checks in vmx_set_nested_state The size checks in vmx_nested_state are wrong because the calculations are made based on the size of a pointer to a struct kvm_nested_state rather than the size of a struct kvm_nested_state. Reported-by: Felix Wilhelm <fwilhelm@google.com> Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: Drew Schmitt <dasch@google.com> Reviewed-by: Marc Orr <marcorr@google.com> Reviewed-by: Peter Shier <pshier@google.com> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Fixes: `8fcc4b5923` Cc: stable@ver.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-05-01 00:43:44 +02:00
Paolo Bonzini	e9c16c7850	KVM: x86: use direct accessors for RIP and RSP Use specific inline functions for RIP and RSP instead of going through kvm_register_read and kvm_register_write, which are quite a mouthful. kvm_rsp_read and kvm_rsp_write did not exist, so add them. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 22:07:26 +02:00
Sean Christopherson	2b3eaf815c	KVM: VMX: Use accessors for GPRs outside of dedicated caching logic ... now that there is no overhead when using dedicated accessors. Opportunistically remove a bogus "FIXME" in handle_rdmsr() regarding the upper 32 bits of RAX and RDX. Zeroing the upper 32 bits is architecturally correct as 32-bit writes in 64-bit mode unconditionally clear the upper 32 bits. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:56:27 +02:00
Sean Christopherson	de3cd117ed	KVM: x86: Omit caching logic for always-available GPRs Except for RSP and RIP, which are held in VMX's VMCS, GPRs are always treated "available and dirtly" on both VMX and SVM, i.e. are unconditionally loaded/saved immediately before/after VM-Enter/VM-Exit. Eliminating the unnecessary caching code reduces the size of KVM by a non-trivial amount, much of which comes from the most common code paths. E.g. on x86_64, kvm_emulate_cpuid() is reduced from 342 to 182 bytes and kvm_emulate_hypercall() from 1362 to 1143, with the total size of KVM dropping by ~1000 bytes. With CONFIG_RETPOLINE=y, the numbers are even more pronounced, e.g.: 353->182, 1418->1172 and well over 2000 bytes. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:56:12 +02:00
KarimAllah Ahmed	e0bf2665ca	KVM/nVMX: Use page_address_valid in a few more locations Use page_address_valid in a few more locations that is already checking for a page aligned address that does not cross the maximum physical address. Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:49:44 +02:00
KarimAllah Ahmed	dee9c04931	KVM/nVMX: Use kvm_vcpu_map for accessing the enlightened VMCS Use kvm_vcpu_map for accessing the enlightened VMCS since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:49:40 +02:00
KarimAllah Ahmed	8892530598	KVM/nVMX: Use kvm_vcpu_map for accessing the shadow VMCS Use kvm_vcpu_map for accessing the shadow VMCS since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzessutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:49:37 +02:00
KarimAllah Ahmed	3278e04925	KVM/nVMX: Use kvm_vcpu_map when mapping the posted interrupt descriptor table Use kvm_vcpu_map when mapping the posted interrupt descriptor table since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the virtual host mapping lifecycle has changed a bit. It now has the same lifetime of the pinning of the interrupt descriptor table page on the host side. Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:45:21 +02:00
KarimAllah Ahmed	96c66e87de	KVM/nVMX: Use kvm_vcpu_map when mapping the virtual APIC page Use kvm_vcpu_map when mapping the virtual APIC page since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". One additional semantic change is that the virtual host mapping lifecycle has changed a bit. It now has the same lifetime of the pinning of the virtual APIC page on the host side. Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:36:43 +02:00
KarimAllah Ahmed	31f0b6c4ba	KVM/nVMX: Use kvm_vcpu_map when mapping the L1 MSR bitmap Use kvm_vcpu_map when mapping the L1 MSR bitmap since using kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:34:34 +02:00
KarimAllah Ahmed	b146b83928	X86/nVMX: handle_vmptrld: Use kvm_vcpu_map when copying VMCS12 from guest memory Use kvm_vcpu_map to the map the VMCS12 from guest memory because kvm_vcpu_gpa_to_page() and kmap() will only work for guest memory that has a "struct page". Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-04-30 21:32:59 +02:00

1 2 3 4

194 Commits