linux

Author	SHA1	Message	Date
Andy Lutomirski	52ce3c21ae	x86,kvm,vmx: Don't trap writes to CR4.TSD CR4.TSD is guest-owned; don't trap writes to it in VMX guests. This avoids a VM exit on context switches into or out of a PR_TSC_SIGSEGV task. I think that this fixes an unintentional side-effect of: `4c38609ac5` KVM: VMX: Make guest cr4 mask more conservative Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-03 12:07:22 +01:00
Nadav Amit	bf0b682c9b	KVM: x86: Sysexit emulation does not mask RIP/RSP If the operand size is not 64-bit, then the sysexit instruction should assign ECX to RSP and EDX to RIP. The current code assigns the full 64-bits. Fix it by masking. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-03 12:07:21 +01:00
Nadav Amit	58b7075d05	KVM: x86: Distinguish between stack operation and near branches In 64-bit, stack operations default to 64-bits, but can be overriden (to 16-bit) using opsize override prefix. In contrast, near-branches are always 64-bit. This patch distinguish between the different behaviors. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-03 12:07:20 +01:00
Nadav Amit	f7784046ab	KVM: x86: Getting rid of grp45 in emulator Breaking grp45 to the relevant functions to speed up the emulation and simplify the code. In addition, it is necassary the next patch will distinguish between far and near branches according to the flags. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-03 12:07:20 +01:00
Nadav Amit	4be4de7ef9	KVM: x86: Use new is_noncanonical_address in _linearize Replace the current canonical address check with the new function which is identical. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-03 12:07:19 +01:00
Paolo Bonzini	d09155d2f3	KVM: emulator: always inline __linearize The two callers have a lot of constant arguments that can be optimized out. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-03 12:07:18 +01:00
Paolo Bonzini	a73896cb5b	KVM: vmx: defer load of APIC access page address during reset Most call paths to vmx_vcpu_reset do not hold the SRCU lock. Defer loading the APIC access page to the next vmentry. This avoids the following lockdep splat: [ INFO: suspicious RCU usage. ] 3.18.0-rc2-test2+ #70 Not tainted ------------------------------- include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-x86/2371: #0: (&vcpu->mutex){+.+...}, at: [<ffffffffa037d800>] vcpu_load+0x20/0xd0 [kvm] stack backtrace: CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 0000000000000001 ffff880209983ca8 ffffffff816f514f 0000000000000000 ffff8802099b8990 ffff880209983cd8 ffffffff810bd687 00000000000fee00 ffff880208a2c000 ffff880208a10000 ffff88020ef50040 ffff880209983d08 Call Trace: [<ffffffff816f514f>] dump_stack+0x4e/0x71 [<ffffffff810bd687>] lockdep_rcu_suspicious+0xe7/0x120 [<ffffffffa037d055>] gfn_to_memslot+0xd5/0xe0 [kvm] [<ffffffffa03807d3>] __gfn_to_pfn+0x33/0x60 [kvm] [<ffffffffa0380885>] gfn_to_page+0x25/0x90 [kvm] [<ffffffffa038aeec>] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm] [<ffffffffa08f0a9c>] vmx_vcpu_reset+0x20c/0x460 [kvm_intel] [<ffffffffa039ab8e>] kvm_vcpu_reset+0x15e/0x1b0 [kvm] [<ffffffffa039ac0c>] kvm_arch_vcpu_setup+0x2c/0x50 [kvm] [<ffffffffa037f7e0>] kvm_vm_ioctl+0x1d0/0x780 [kvm] [<ffffffff810bc664>] ? __lock_is_held+0x54/0x80 [<ffffffff812231f0>] do_vfs_ioctl+0x300/0x520 [<ffffffff8122ee45>] ? __fget+0x5/0x250 [<ffffffff8122f0fa>] ? __fget_light+0x2a/0xe0 [<ffffffff81223491>] SyS_ioctl+0x81/0xa0 [<ffffffff816fed6d>] system_call_fastpath+0x16/0x1b Reported-by: Takashi Iwai <tiwai@suse.de> Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com> Tested-by: Wanpeng Li <wanpeng.li@linux.intel.com> Fixes: `38b9917350` Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-02 08:37:18 +01:00
Jan Kiszka	282da870f4	KVM: nVMX: Disable preemption while reading from shadow VMCS In order to access the shadow VMCS, we need to load it. At this point, vmx->loaded_vmcs->vmcs and the actually loaded one start to differ. If we now get preempted by Linux, vmx_vcpu_put and, on return, the vmx_vcpu_load will work against the wrong vmcs. That can cause copy_shadow_to_vmcs12 to corrupt the vmcs12 state. Fix the issue by disabling preemption during the copy operation. copy_vmcs12_to_shadow is safe from this issue as it is executed by vmx_vcpu_run when preemption is already disabled before vmentry. This bug is exposed by running Jailhouse within KVM on CPUs with shadow VMCS support. Jailhouse never expects an interrupt pending vmexit, but the bug can cause it if, after copy_shadow_to_vmcs12 is preempted, the active VMCS happens to have the virtual interrupt pending flag set in the CPU-based execution controls. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-02 07:55:46 +01:00
Nadav Amit	7e46dddd6f	KVM: x86: Fix far-jump to non-canonical check Commit `d1442d85cc` ("KVM: x86: Handle errors when RIP is set during far jumps") introduced a bug that caused the fix to be incomplete. Due to incorrect evaluation, far jump to segment with L bit cleared (i.e., 32-bit segment) and RIP with any of the high bits set (i.e, RIP[63:32] != 0) set may not trigger #GP. As we know, this imposes a security problem. In addition, the condition for two warnings was incorrect. Fixes: `d1442d85cc` Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> [Add #ifdef CONFIG_X86_64 to avoid complaints of undefined behavior. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-11-02 07:54:55 +01:00
Paolo Bonzini	fd56e1546a	KVM: emulator: fix execution close to the segment limit Emulation of code that is 14 bytes to the segment limit or closer (e.g. RIP = 0xFFFFFFF2 after reset) is broken because we try to read as many as 15 bytes from the beginning of the instruction, and __linearize fails when the passed (address, size) pair reaches out of the segment. To fix this, let __linearize return the maximum accessible size (clamped to 2^32-1) for usage in __do_insn_fetch_bytes, and avoid the limit check by passing zero for the desired size. For expand-down segments, __linearize is performing a redundant check. (u32)(addr.ea + size - 1) <= lim can only happen if addr.ea is close to 4GB; in this case, addr.ea + size - 1 will also fail the check against the upper bound of the segment (which is provided by the D/B bit). After eliminating the redundant check, it is simple to compute the *max_size for expand-down segments too. Now that the limit check is done in __do_insn_fetch_bytes, we want to inject a general protection fault there if size < op_size (like __linearize would have done), instead of just aborting. This fixes booting Tiano Core from emulated flash with EPT disabled. Cc: stable@vger.kernel.org Fixes: `719d5a9b24` Reported-by: Borislav Petkov <bp@suse.de> Tested-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-29 13:13:48 +01:00
Paolo Bonzini	3606189fa3	KVM: emulator: fix error code for __linearize The error code for #GP and #SS is zero when the segment is used to access an operand or an instruction. It is only non-zero when a segment register is being loaded; for limit checks this means cases such as: * for #GP, when RIP is beyond the limit on a far call (before the first instruction is executed). We do not implement this check, but it would be in em_jmp_far/em_call_far. * for #SS, if the new stack overflows during an inter-privilege-level call to a non-conforming code segment. We do not implement stack switching at all. So use an error code of zero. Reviewed-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-29 12:40:28 +01:00
Nadav Amit	1715d0dcb0	KVM: x86: Wrong assertion on paging_tmpl.h Even after the recent fix, the assertion on paging_tmpl.h is triggered. Apparently, the assertion wants to check that the PAE is always set on long-mode, but does it in incorrect way. Note that the assertion is not enabled unless the code is debugged by defining MMU_DEBUG. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:30:37 +02:00
Nadav Amit	3f6f1480d8	KVM: x86: PREFETCH and HINT_NOP should have SrcMem flag The decode phase of the x86 emulator assumes that every instruction with the ModRM flag, and which can be used with RIP-relative addressing, has either SrcMem or DstMem. This is not the case for several instructions - prefetch, hint-nop and clflush. Adding SrcMem\|NoAccess for prefetch and hint-nop and SrcMem for clflush. This fixes CVE-2014-8480. Fixes: `41061cdb98` Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:30:36 +02:00
Nadav Amit	13e457e0ee	KVM: x86: Emulator does not decode clflush well Currently, all group15 instructions are decoded as clflush (e.g., mfence, xsave). In addition, the clflush instruction requires no prefix (66/f2/f3) would exist. If prefix exists it may encode a different instruction (e.g., clflushopt). Creating a group for clflush, and different group for each prefix. This has been the case forever, but the next patch needs the cflush group in order to fix a bug introduced in 3.17. Fixes: `41061cdb98` Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:30:36 +02:00
Paolo Bonzini	a430c91663	KVM: emulate: avoid accessing NULL ctxt->memopp A failure to decode the instruction can cause a NULL pointer access. This is fixed simply by moving the "done" label as close as possible to the return. This fixes CVE-2014-8481. Reported-by: Andy Lutomirski <luto@amacapital.net> Cc: stable@vger.kernel.org Fixes: `41061cdb98` Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:30:35 +02:00
Nadav Amit	08da44aedb	KVM: x86: Decoding guest instructions which cross page boundary may fail Once an instruction crosses a page boundary, the size read from the second page disregards the common case that part of the operand resides on the first page. As a result, fetch of long insturctions may fail, and thereby cause the decoding to fail as well. Cc: stable@vger.kernel.org Fixes: `5cfc7e0f5e` Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:18 +02:00
Michael S. Tsirkin	2bc19dc375	kvm: x86: don't kill guest on unknown exit reason KVM_EXIT_UNKNOWN is a kvm bug, we don't really know whether it was triggered by a priveledged application. Let's not kill the guest: WARN and inject #UD instead. Cc: stable@vger.kernel.org Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:17 +02:00
Petr Matousek	a642fc3050	kvm: vmx: handle invvpid vm exit gracefully On systems with invvpid instruction support (corresponding bit in IA32_VMX_EPT_VPID_CAP MSR is set) guest invocation of invvpid causes vm exit, which is currently not handled and results in propagation of unknown exit to userspace. Fix this by installing an invvpid vm exit handler. This is CVE-2014-3646. Cc: stable@vger.kernel.org Signed-off-by: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:17 +02:00
Nadav Amit	d1442d85cc	KVM: x86: Handle errors when RIP is set during far jumps Far jmp/call/ret may fault while loading a new RIP. Currently KVM does not handle this case, and may result in failed vm-entry once the assignment is done. The tricky part of doing so is that loading the new CS affects the VMCS/VMCB state, so if we fail during loading the new RIP, we are left in unconsistent state. Therefore, this patch saves on 64-bit the old CS descriptor and restores it if loading RIP failed. This fixes CVE-2014-3647. Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:16 +02:00
Nadav Amit	234f3ce485	KVM: x86: Emulator fixes for eip canonical checks on near branches Before changing rip (during jmp, call, ret, etc.) the target should be asserted to be canonical one, as real CPUs do. During sysret, both target rsp and rip should be canonical. If any of these values is noncanonical, a #GP exception should occur. The exception to this rule are syscall and sysenter instructions in which the assigned rip is checked during the assignment to the relevant MSRs. This patch fixes the emulator to behave as real CPUs do for near branches. Far branches are handled by the next patch. This fixes CVE-2014-3647. Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:16 +02:00
Nadav Amit	05c83ec9b7	KVM: x86: Fix wrong masking on relative jump/call Relative jumps and calls do the masking according to the operand size, and not according to the address size as the KVM emulator does today. This patch fixes KVM behavior. Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:15 +02:00
Andy Honig	2febc83913	KVM: x86: Improve thread safety in pit There's a race condition in the PIT emulation code in KVM. In __kvm_migrate_pit_timer the pit_timer object is accessed without synchronization. If the race condition occurs at the wrong time this can crash the host kernel. This fixes CVE-2014-3611. Cc: stable@vger.kernel.org Signed-off-by: Andrew Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:14 +02:00
Andy Honig	8b3c3104c3	KVM: x86: Prevent host from panicking on shared MSR writes. The previous patch blocked invalid writes directly when the MSR is written. As a precaution, prevent future similar mistakes by gracefulling handle GPs caused by writes to shared MSRs. Cc: stable@vger.kernel.org Signed-off-by: Andrew Honig <ahonig@google.com> [Remove parts obsoleted by Nadav's patch. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:08 +02:00
Nadav Amit	854e8bb1aa	KVM: x86: Check non-canonical addresses upon WRMSR Upon WRMSR, the CPU should inject #GP if a non-canonical value (address) is written to certain MSRs. The behavior is "almost" identical for AMD and Intel (ignoring MSRs that are not implemented in either architecture since they would anyhow #GP). However, IA32_SYSENTER_ESP and IA32_SYSENTER_EIP cause #GP if non-canonical address is written on Intel but not on AMD (which ignores the top 32-bits). Accordingly, this patch injects a #GP on the MSRs which behave identically on Intel and AMD. To eliminate the differences between the architecutres, the value which is written to IA32_SYSENTER_ESP and IA32_SYSENTER_EIP is turned to canonical value before writing instead of injecting a #GP. Some references from Intel and AMD manuals: According to Intel SDM description of WRMSR instruction #GP is expected on WRMSR "If the source register contains a non-canonical address and ECX specifies one of the following MSRs: IA32_DS_AREA, IA32_FS_BASE, IA32_GS_BASE, IA32_KERNEL_GS_BASE, IA32_LSTAR, IA32_SYSENTER_EIP, IA32_SYSENTER_ESP." According to AMD manual instruction manual: LSTAR/CSTAR (SYSCALL): "The WRMSR instruction loads the target RIP into the LSTAR and CSTAR registers. If an RIP written by WRMSR is not in canonical form, a general-protection exception (#GP) occurs." IA32_GS_BASE and IA32_FS_BASE (WRFSBASE/WRGSBASE): "The address written to the base field must be in canonical form or a #GP fault will occur." IA32_KERNEL_GS_BASE (SWAPGS): "The address stored in the KernelGSbase MSR must be in canonical form." This patch fixes CVE-2014-3610. Cc: stable@vger.kernel.org Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-24 13:21:08 +02:00
Andy Lutomirski	d974baa398	x86,kvm,vmx: Preserve CR4 across VM entry CR4 isn't constant; at least the TSD and PCE bits can vary. TBH, treating CR0 and CR3 as constant scares me a bit, too, but it looks like it's correct. This adds a branch and a read from cr4 to each vm entry. Because it is extremely likely that consecutive entries into the same vcpu will have the same host cr4 value, this fixes up the vmcs instead of restoring cr4 after the fact. A subsequent patch will add a kernel-wide cr4 shadow, reducing the overhead in the common case to just two memory reads and a branch. Signed-off-by: Andy Lutomirski <luto@amacapital.net> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: stable@vger.kernel.org Cc: Petr Matousek <pmatouse@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2014-10-18 10:09:03 -07:00
Linus Torvalds	0429fbc0bd	Merge branch 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu consistent-ops changes from Tejun Heo: "Way back, before the current percpu allocator was implemented, static and dynamic percpu memory areas were allocated and handled separately and had their own accessors. The distinction has been gone for many years now; however, the now duplicate two sets of accessors remained with the pointer based ones - this_cpu_() - evolving various other operations over time. During the process, we also accumulated other inconsistent operations. This pull request contains Christoph's patches to clean up the duplicate accessor situation. __get_cpu_var() uses are replaced with with this_cpu_ptr() and __this_cpu_ptr() with raw_cpu_ptr(). Unfortunately, the former sometimes is tricky thanks to C being a bit messy with the distinction between lvalues and pointers, which led to a rather ugly solution for cpumask_var_t involving the introduction of this_cpu_cpumask_var_ptr(). This converts most of the uses but not all. Christoph will follow up with the remaining conversions in this merge window and hopefully remove the obsolete accessors" 'for-3.18-consistent-ops' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (38 commits) irqchip: Properly fetch the per cpu offset percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t -fix ia64: sn_nodepda cannot be assigned to after this_cpu conversion. Use __this_cpu_write. percpu: Resolve ambiguities in __get_cpu_var/cpumask_var_t Revert "powerpc: Replace __get_cpu_var uses" percpu: Remove __this_cpu_ptr clocksource: Replace __this_cpu_ptr with raw_cpu_ptr sparc: Replace __get_cpu_var uses avr32: Replace __get_cpu_var with __this_cpu_write blackfin: Replace __get_cpu_var uses tile: Use this_cpu_ptr() for hardware counters tile: Replace __get_cpu_var uses powerpc: Replace __get_cpu_var uses alpha: Replace __get_cpu_var ia64: Replace __get_cpu_var uses s390: cio driver &__get_cpu_var replacements s390: Replace __get_cpu_var uses mips: Replace __get_cpu_var uses MIPS: Replace __get_cpu_var uses in FPU emulator. arm: Replace __this_cpu_ptr with raw_cpu_ptr ...	2014-10-15 07:48:18 +02:00
Linus Torvalds	c798360cd1	Merge branch 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu Pull percpu updates from Tejun Heo: "A lot of activities on percpu front. Notable changes are... - percpu allocator now can take @gfp. If @gfp doesn't contain GFP_KERNEL, it tries to allocate from what's already available to the allocator and a work item tries to keep the reserve around certain level so that these atomic allocations usually succeed. This will replace the ad-hoc percpu memory pool used by blk-throttle and also be used by the planned blkcg support for writeback IOs. Please note that I noticed a bug in how @gfp is interpreted while preparing this pull request and applied the fix `6ae833c7fe` ("percpu: fix how @gfp is interpreted by the percpu allocator") just now. - percpu_ref now uses longs for percpu and global counters instead of ints. It leads to more sparse packing of the percpu counters on 64bit machines but the overhead should be negligible and this allows using percpu_ref for refcnting pages and in-memory objects directly. - The switching between percpu and single counter modes of a percpu_ref is made independent of putting the base ref and a percpu_ref can now optionally be initialized in single or killed mode. This allows avoiding percpu shutdown latency for cases where the refcounted objects may be synchronously created and destroyed in rapid succession with only a fraction of them reaching fully operational status (SCSI probing does this when combined with blk-mq support). It's also planned to be used to implement forced single mode to detect underflow more timely for debugging. There's a separate branch percpu/for-3.18-consistent-ops which cleans up the duplicate percpu accessors. That branch causes a number of conflicts with s390 and other trees. I'll send a separate pull request w/ resolutions once other branches are merged" * 'for-3.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (33 commits) percpu: fix how @gfp is interpreted by the percpu allocator blk-mq, percpu_ref: start q->mq_usage_counter in atomic mode percpu_ref: make INIT_ATOMIC and switch_to_atomic() sticky percpu_ref: add PERCPU_REF_INIT_* flags percpu_ref: decouple switching to percpu mode and reinit percpu_ref: decouple switching to atomic mode and killing percpu_ref: add PCPU_REF_DEAD percpu_ref: rename things to prepare for decoupling percpu/atomic mode switch percpu_ref: replace pcpu_ prefix with percpu_ percpu_ref: minor code and comment updates percpu_ref: relocate percpu_ref_reinit() Revert "blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe" Revert "percpu: free percpu allocation info for uniprocessor system" percpu-refcount: make percpu_ref based on longs instead of ints percpu-refcount: improve WARN messages percpu: fix locking regression in the failure path of pcpu_alloc() percpu-refcount: add @gfp to percpu_ref_init() proportions: add @gfp to init functions percpu_counter: add @gfp to percpu_counter_init() percpu_counter: make percpu_counters_lock irq-safe ...	2014-10-10 07:26:02 -04:00
Paolo Bonzini	f439ed27f8	kvm: do not handle APIC access page if in-kernel irqchip is not in use This fixes the following OOPS: loaded kvm module (v3.17-rc1-168-gcec26bc) BUG: unable to handle kernel paging request at fffffffffffffffe IP: [<ffffffff81168449>] put_page+0x9/0x30 PGD 1e15067 PUD 1e17067 PMD 0 Oops: 0000 [#1] PREEMPT SMP [<ffffffffa063271d>] ? kvm_vcpu_reload_apic_access_page+0x5d/0x70 [kvm] [<ffffffffa013b6db>] vmx_vcpu_reset+0x21b/0x470 [kvm_intel] [<ffffffffa0658816>] ? kvm_pmu_reset+0x76/0xb0 [kvm] [<ffffffffa064032a>] kvm_vcpu_reset+0x15a/0x1b0 [kvm] [<ffffffffa06403ac>] kvm_arch_vcpu_setup+0x2c/0x50 [kvm] [<ffffffffa062e540>] kvm_vm_ioctl+0x200/0x780 [kvm] [<ffffffff81212170>] do_vfs_ioctl+0x2d0/0x4b0 [<ffffffff8108bd99>] ? __mmdrop+0x69/0xb0 [<ffffffff812123d1>] SyS_ioctl+0x81/0xa0 [<ffffffff8112a6f6>] ? __audit_syscall_exit+0x1f6/0x2a0 [<ffffffff817229e9>] system_call_fastpath+0x16/0x1b Code: c6 78 ce a3 81 4c 89 e7 e8 d9 80 ff ff 0f 0b 4c 89 e7 e8 8f f6 ff ff e9 fa fe ff ff 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 1e 8b 47 1c 85 c0 74 27 f0 RIP [<ffffffff81193045>] put_page+0x5/0x50 when not using the in-kernel irqchip ("-machine kernel_irqchip=off" with QEMU). The fix is to make the same check in kvm_vcpu_reload_apic_access_page that we already have in vmx.c's vm_need_virtualize_apic_accesses(). Reported-by: Jan Kiszka <jan.kiszka@siemens.com> Tested-by: Jan Kiszka <jan.kiszka@siemens.com> Fixes: `4256f43f9f` Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-10-02 19:26:11 +02:00
Tejun Heo	d06efebf0c	Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18 This is to receive `0a30288da1` ("blk-mq, percpu_ref: implement a kludge for SCSI blk-mq stall during probe") which implements __percpu_ref_kill_expedited() to work around SCSI blk-mq stall. The commit reverted and patches to implement proper fix will be added. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kent Overstreet <kmo@daterainc.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de>	2014-09-24 13:00:21 -04:00
Tang Chen	c24ae0dcd3	kvm: x86: Unpin and remove kvm_arch->apic_access_page In order to make the APIC access page migratable, stop pinning it in memory. And because the APIC access page is not pinned in memory, we can remove kvm_arch->apic_access_page. When we need to write its physical address into vmcs, we use gfn_to_page() to get its page struct, which is needed to call page_to_phys(); the page is then immediately unpinned. Suggested-by: Gleb Natapov <gleb@kernel.org> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:08:01 +02:00
Tang Chen	38b9917350	kvm: vmx: Implement set_apic_access_page_addr Currently, the APIC access page is pinned by KVM for the entire life of the guest. We want to make it migratable in order to make memory hot-unplug available for machines that run KVM. This patch prepares to handle this for the case where there is no nested virtualization, or where the nested guest does not have an APIC page of its own. All accesses to kvm->arch.apic_access_page are changed to go through kvm_vcpu_reload_apic_access_page. If the APIC access page is invalidated when the host is running, we update the VMCS in the next guest entry. If it is invalidated when the guest is running, the MMU notifier will force an exit, after which we will handle everything as in the previous case. If it is invalidated when a nested guest is running, the request will update either the VMCS01 or the VMCS02. Updating the VMCS01 is done at the next L2->L1 exit, while updating the VMCS02 is done in prepare_vmcs02. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:08:01 +02:00
Tang Chen	4256f43f9f	kvm: x86: Add request bit to reload APIC access page address Currently, the APIC access page is pinned by KVM for the entire life of the guest. We want to make it migratable in order to make memory hot-unplug available for machines that run KVM. This patch prepares to handle this in generic code, through a new request bit (that will be set by the MMU notifier) and a new hook that is called whenever the request bit is processed. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:08:00 +02:00
Tang Chen	fe71557afb	kvm: Add arch specific mmu notifier for page invalidation This will be used to let the guest run while the APIC access page is not pinned. Because subsequent patches will fill in the function for x86, place the (still empty) x86 implementation in the x86.c file instead of adding an inline function in kvm_host.h. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:59 +02:00
Andres Lagar-Cavilla	5712846808	kvm: Fix page ageing bugs 1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young. 2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched. 3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does. Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:58 +02:00
Andres Lagar-Cavilla	8a9522d2fe	kvm/x86/mmu: Pass gfn and level to rmapp callback. Callbacks don't have to do extra computation to learn what the caller (lvm_handle_hva_range()) knows very well. Useful for debugging/tracing/printk/future. Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:57 +02:00
Chen Yucong	81760dccf8	kvm: x86: use macros to compute bank MSRs Avoid open coded calculations for bank MSRs by using well-defined macros that hide the index of higher bank MSRs. No semantic changes. Signed-off-by: Chen Yucong <slaoub@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:56 +02:00
Nadav Amit	d5262739cb	KVM: x86: Remove debug assertion of non-PAE reserved bits Commit `346874c950` ("KVM: x86: Fix CR3 reserved bits") removed non-PAE reserved bits which were not according to Intel SDM. However, residue was left in a debug assertion (CR3_NONPAE_RESERVED_BITS). Remove it. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:55 +02:00
Tiejun Chen	b461966063	kvm: x86: fix two typos in comment s/drity/dirty and s/vmsc01/vmcs01 Signed-off-by: Tiejun Chen <tiejun.chen@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:53 +02:00
Nadav Amit	4566654bb9	KVM: vmx: Inject #GP on invalid PAT CR Guest which sets the PAT CR to invalid value should get a #GP. Currently, if vmx supports loading PAT CR during entry, then the value is not checked. This patch makes the required check in that case. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:52 +02:00
Nadav Amit	040c8dc8a5	KVM: x86: emulating descriptor load misses long-mode case In 64-bit mode a #GP should be delivered to the guest "if the code segment descriptor pointed to by the selector in the 64-bit gate doesn't have the L-bit set and the D-bit clear." - Intel SDM "Interrupt 13â€”General Protection Exception (#GP)". This patch fixes the behavior of CS loading emulation code. Although the comment says that segment loading is not supported in long mode, this function is executed in long mode, so the fix is necassary. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:52 +02:00
Liang Chen	77c3913b74	KVM: x86: directly use kvm_make_request again A one-line wrapper around kvm_make_request is not particularly useful. Replace kvm_mmu_flush_tlb() with kvm_make_request(). Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:51 +02:00
Radim Krčmář	a70656b63a	KVM: x86: count actual tlb flushes - we count KVM_REQ_TLB_FLUSH requests, not actual flushes (KVM can have multiple requests for one flush) - flushes from kvm_flush_remote_tlbs aren't counted - it's easy to make a direct request by mistake Solve these by postponing the counting to kvm_check_request(). Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Liang Chen <liangchen.linux@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:50 +02:00
Marcelo Tosatti	bc6134942d	KVM: nested VMX: disable perf cpuid reporting Initilization of L2 guest with -cpu host, on L1 guest with -cpu host triggers: (qemu) KVM: entry failed, hardware error 0x7 ... nested_vmx_run: VMCS MSR_{LOAD,STORE} unsupported Nested VMX MSR load/store support is not sufficient to allow perf for L2 guest. Until properly fixed, trap CPUID and disable function 0xA. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:50 +02:00
Nadav Amit	a2b9e6c1a3	KVM: x86: Don't report guest userspace emulation error to userspace Commit `fc3a9157d3` ("KVM: X86: Don't report L2 emulation failures to user-space") disabled the reporting of L2 (nested guest) emulation failures to userspace due to race-condition between a vmexit and the instruction emulator. The same rational applies also to userspace applications that are permitted by the guest OS to access MMIO area or perform PIO. This patch extends the current behavior - of injecting a #UD instead of reporting it to userspace - also for guest userspace code. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:49 +02:00
Paolo Bonzini	1f755a8275	kvm: Make init_rmode_tss() return 0 on success. In init_rmode_tss(), there two variables indicating the return value, r and ret, and it return 0 on error, 1 on success. The function is only called by vmx_set_tss_addr(), and ret is redundant. This patch removes the redundant variable, by making init_rmode_tss() return 0 on success, -errno on failure. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:48 +02:00
Nadav Amit	dd598091de	KVM: x86: Warn if guest virtual address space is not 48-bits The KVM emulator code assumes that the guest virtual address space (in 64-bit) is 48-bits wide. Fail the KVM_SET_CPUID and KVM_SET_CPUID2 ioctl if userspace tries to create a guest that does not obey this restriction. Signed-off-by: Nadav Amit <namit@cs.technion.ac.il> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-24 14:07:48 +02:00
Tang Chen	f51770ed46	kvm: Make init_rmode_identity_map() return 0 on success. In init_rmode_identity_map(), there two variables indicating the return value, r and ret, and it return 0 on error, 1 on success. The function is only called by vmx_create_vcpu(), and ret is redundant. This patch removes the redundant variable, and makes init_rmode_identity_map() return 0 on success, -errno on failure. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-17 13:10:12 +02:00
Tang Chen	a255d4795f	kvm: Remove ept_identity_pagetable from struct kvm_arch. kvm_arch->ept_identity_pagetable holds the ept identity pagetable page. But it is never used to refer to the page at all. In vcpu initialization, it indicates two things: 1. indicates if ept page is allocated 2. indicates if a memory slot for identity page is initialized Actually, kvm_arch->ept_identity_pagetable_done is enough to tell if the ept identity pagetable is initialized. So we can remove ept_identity_pagetable. NOTE: In the original code, ept identity pagetable page is pinned in memroy. As a result, it cannot be migrated/hot-removed. After this patch, since kvm_arch->ept_identity_pagetable is removed, ept identity pagetable page is no longer pinned in memory. And it can be migrated/hot-removed. Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com> Reviewed-by: Gleb Natapov <gleb@kernel.org> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-17 13:10:11 +02:00
Guo Hui Liu	105b21bbf6	KVM: x86: Use kvm_make_request when applicable This patch replace the set_bit method by kvm_make_request to make code more readable and consistent. Signed-off-by: Guo Hui Liu <liuguohui@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-16 14:44:20 +02:00
Paolo Bonzini	a183b638b6	KVM: x86: make apic_accept_irq tracepoint more generic Initially the tracepoint was added only to the APIC_DM_FIXED case, also because it reported coalesced interrupts that only made sense for that case. However, the coalesced argument is not used anymore and tracing other delivery modes is useful, so hoist the call out of the switch statement. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2014-09-11 11:51:02 +02:00

1 2 3 4 5 ...

3054 Commits