linux

Author	SHA1	Message	Date
Paolo Bonzini	df748f864a	KVM: MMU: use kvm_sync_page in kvm_sync_pages If the last argument is true, kvm_unlink_unsync_page is called anyway in __kvm_sync_page (either by kvm_mmu_prepare_zap_page or by __kvm_sync_page itself). Therefore, kvm_sync_pages can just call kvm_sync_page, instead of going through kvm_unlink_unsync_page+__kvm_sync_page. Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-08 12:33:20 +01:00
Paolo Bonzini	35a70510ee	KVM: MMU: move TLB flush out of __kvm_sync_page By doing this, kvm_sync_pages can use __kvm_sync_page instead of reinventing it. Because of kvm_mmu_flush_or_zap, the code does not end up being more complex than before, and more cleanups to kvm_sync_pages will come in the next patches. Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-08 12:33:17 +01:00
Paolo Bonzini	b8c67b7a08	KVM: MMU: introduce kvm_mmu_flush_or_zap This is a generalization of mmu_pte_write_flush_tlb, that also takes care of calling kvm_mmu_commit_zap_page. The next patches will introduce more uses. Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-08 12:33:00 +01:00
Paolo Bonzini	0e4d44151a	KVM: i8254: drop local copy of mul_u64_u32_div A function that does the same as i8254.c's muldiv64 has been added (for KVM's own use, in fact!) in include/linux/math64.h. Use it instead of muldiv64. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 22:39:17 +01:00
Xiao Guangrong	e23d3fef83	KVM: MMU: check kvm_mmu_pages and mmu_page_path indices Give a special invalid index to the root of the walk, so that we can check the consistency of kvm_mmu_pages and mmu_page_path. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> [Extracted from a bigger patch proposed by Guangrong. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 12:35:24 +01:00
Paolo Bonzini	0a47cd8583	KVM: MMU: Fix ubsan warnings kvm_mmu_pages_init is doing some really yucky stuff. It is setting up a sentinel for mmu_pages_clear_parents; however, because of a) the way levels are numbered starting from 1 and b) the way mmu_page_path sizes its arrays with PT64_ROOT_LEVEL-1 elements, the access can be out of bounds. This is harmless because the code overwrites up to the first two elements of parents->idx and these are initialized, and because the sentinel is not needed in this case---mmu_pages_clear_parents exits anyway when it gets to the end of the array. However ubsan complains, and everyone else should too. This fix does three things. First it makes the mmu_page_path arrays PT64_ROOT_LEVEL elements in size, so that we can write to them without checking the level in advance. Second it disintegrates kvm_mmu_pages_init between mmu_unsync_walk (to reset the struct kvm_mmu_pages) and for_each_sp (to place the NULL sentinel at the end of the current path). This is okay because the mmu_page_path is only used in mmu_pages_clear_parents; mmu_pages_clear_parents itself is called within a for_each_sp iterator, and hence always after a call to mmu_pages_next. Third it changes mmu_pages_clear_parents to just use the sentinel to stop iteration, without checking the bounds on level. Reported-by: Sasha Levin <sasha.levin@oracle.com> Reported-by: Mike Krinkin <krinkin.m.u@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 12:35:23 +01:00
Paolo Bonzini	798e88b31f	KVM: MMU: cleanup handle_abnormal_pfn The goto and temporary variable are unnecessary, just use return statements. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 12:35:23 +01:00
Paolo Bonzini	8f22372f85	KVM: VMX: use vmcs_clear/set_bits for debug register exits Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 12:35:20 +01:00
Radim Krčmář	a0aace5ac0	KVM: i8254: turn kvm_kpit_state.reinject into atomic_t Document possible races between readers and concurrent update to the ioctl. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:30:25 +01:00
Radim Krčmář	ab4c14763b	KVM: i8254: move PIT timer function initialization We can do it just once. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:30:22 +01:00
Radim Krčmář	34f3941c42	KVM: i8254: don't assume layout of kvm_kpit_state channels has offset 0 and correct size now, but that can change. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:30:18 +01:00
Radim Krčmář	4a2095df8a	KVM: i8254: remove pointless dereference of PIT PIT is known at that point. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:30:15 +01:00
Radim Krčmář	a3e1311549	KVM: i8254: remove pit and kvm from kvm_kpit_state kvm isn't ever used and pit can be accessed with container_of. If you really need kvm, pit_state_to_pit(ps)->kvm. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:30:12 +01:00
Radim Krčmář	08e5ccf3ae	KVM: i8254: refactor kvm_free_pit Could be easier to read, but git history will become deeper. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:30:07 +01:00
Radim Krčmář	10d2482126	KVM: i8254: refactor kvm_create_pit Locks are gone, so we don't need to duplicate error paths. Use goto everywhere. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:30:04 +01:00
Radim Krčmář	71474e2f0f	KVM: i8254: remove notifiers from PIT discard policy Discard policy doesn't rely on information from notifiers, so we don't need to register notifiers unconditionally. We kept correct counts in case userspace switched between policies during runtime, but that can be avoided by resetting the state. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:30:01 +01:00
Radim Krčmář	b39c90b656	KVM: i8254: remove unnecessary uses of PIT state lock - kvm_create_pit had to lock only because it exposed kvm->arch.vpit very early, but initialization doesn't use kvm->arch.vpit since the last patch, so we can drop locking. - kvm_free_pit is only run after there are no users of KVM and therefore is the sole actor. - Locking in kvm_vm_ioctl_reinject doesn't do anything, because reinject is only protected at that place. - kvm_pit_reset isn't used anywhere and its locking can be dropped if we hide it. Removing useless locking makes it possible to see what is actually being protected by the PIT state lock (values accessible from the guest). Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:29:58 +01:00
Radim Krčmář	09edea72b7	KVM: i8254: pass struct kvm_pit instead of kvm in PIT This patch passes struct kvm_pit into internal PIT functions. Those functions used to get PIT through kvm->arch.vpit, even though most of them never used *kvm for other purposes. Another benefit is that we don't need to set kvm->arch.vpit during initialization. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:29:55 +01:00
Radim Krčmář	b69d920f68	KVM: i8254: tone down WARN_ON pit.state_lock If the guest could hit this, it would hang the host kernel, because of the sheer number of those reports. Internal callers have to be sensible anyway, so we now only check for it in an API function. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:29:51 +01:00
Radim Krčmář	ddf54503e2	KVM: i8254: use atomic_t instead of pit.inject_lock The lock was overkill; the same can be done with atomics. A mb() was added in kvm_pit_ack_irq, to pair with the implicit barrier between pit_timer_fn and pit_do_work. The mb() prevents a race that could happen if pending == 0 and irq_ack == 0: kvm_pit_ack_irq: | pit_timer_fn: p = atomic_read(&ps->pending); | | atomic_inc(&ps->pending); | queue_work(pit_do_work); | pit_do_work: | atomic_xchg(&ps->irq_ack, 0); | return; atomic_set(&ps->irq_ack, 1); | if (p == 0) return; | where the interrupt would not be delivered in this tick of pit_timer_fn. PIT would have eventually delivered the interrupt, but we sacrifice performance to make sure that interrupts are not needlessly delayed. sfence isn't enough: atomic_dec_if_positive does atomic_read first and x86 can reorder loads before stores. lfence isn't enough: store can pass lfence, turning it into a nop. A compiler barrier would be more than enough as CPU needs to stall for unbelievably long to use fences. This patch doesn't do anything in kvm_pit_reset_reinject, because any order of resets can race, but the result differs by at most one interrupt, which is ok, because it's the same result as if the reset happened at a slightly different time. (Original code didn't protect the reset path with a proper lock, so users have to be robust.) Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:29:47 +01:00
Radim Krčmář	fd700a00dc	KVM: i8254: add kvm_pit_reset_reinject pit_state.pending and pit_state.irq_ack are always reset at the same time. Create a function for them. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:29:43 +01:00
Radim Krčmář	f6e0a0c113	KVM: i8254: simplify atomics in kvm_pit_ack_irq We already have a helper that does the same thing. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:29:37 +01:00
Radim Krčmář	7dd0fdff14	KVM: i8254: change PIT discard tick policy Discard policy uses ack_notifiers to prevent injection of PIT interrupts before EOI from the last one. This patch changes the policy to always try to deliver the interrupt, which makes a difference when its vector is in ISR. Old implementation would drop the interrupt, but proposed one injects to IRR, like real hardware would. The old policy breaks legacy NMI watchdogs, where PIT is used through virtual wire (LVT0): PIT never sends an interrupt before receiving EOI, thus a guest deadlock with disabled interrupts will stop NMIs. Note that NMI doesn't do EOI, so PIT also had to send a normal interrupt through IOAPIC. (KVM's PIT is deeply rotten and luckily not used much in modern systems.) Even though there is a chance of regressions, I think we can fix the LVT0 NMI bug without introducing a new tick policy. Cc: <stable@vger.kernel.org> Reported-by: Yuki Shibuya <shibuya.yk@ncos.nec.co.jp> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-04 09:29:31 +01:00
Xiao Guangrong	13d268ca2c	KVM: MMU: apply page track notifier Register the notifier to receive write track events so that we can update our shadow page table. It makes kvm_mmu_pte_write() the callback of the notifier; no functionality is changed. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:24 +01:00
Xiao Guangrong	5c520e90af	KVM: MMU: simplify mmu_need_write_protect Now that all non-leaf shadow pages are page tracked, if gfn is not tracked then no non-leaf shadow page of gfn exists, so we can directly make the shadow page of gfn unsync. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:23 +01:00
Xiao Guangrong	56ca57f9fe	KVM: MMU: use page track for non-leaf shadow pages Non-leaf shadow pages are always write protected, so they can be users of page track. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:23 +01:00
Xiao Guangrong	0eb05bf290	KVM: page track: add notifier support A notifier list is introduced so that any node that wants to receive track events can register on the list. Two APIs are introduced here: - kvm_page_track_register_notifier(): register the notifier to receive track events - kvm_page_track_unregister_notifier(): stop receiving track events by unregistering the notifier. The callback, node->track_write(), is called when a write access to a write-tracked page happens. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:22 +01:00
Xiao Guangrong	e5691a81e8	KVM: MMU: clear write-flooding on the fast path of tracked page If the page fault is caused by a write access to a write-tracked page, the real shadow page walk is skipped, so we lose the chance to clear write flooding for the page structure the current vcpu is using. Fix it by locklessly walking the shadow page table to clear write flooding on the shadow page structure outside of mmu-lock. To allow that, the count is changed to atomic_t. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:22 +01:00
Xiao Guangrong	3d0c27ad6e	KVM: MMU: let page fault handler be aware tracked page A page fault caused by a write access to a write-tracked page cannot be fixed; it always needs to be emulated. page_fault_handle_page_track() is the fast path we introduce here to skip holding mmu-lock and walking the shadow page table. However, if the page table is not present, it is worth making the page table entry present and read-only so that read accesses succeed. mmu_need_write_protect() needs to be adjusted to avoid the page becoming writable when making the page table present or when syncing/prefetching shadow page table entries. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:21 +01:00
Xiao Guangrong	f29d4d7810	KVM: page track: introduce kvm_slot_page_track_{add,remove}_page These two functions are the user APIs: - kvm_slot_page_track_add_page(): add the page to the tracking pool; after that, the specified access to that page will be tracked - kvm_slot_page_track_remove_page(): remove the page from the tracking pool; the specified access to the page is no longer tracked after the last user is gone. Both are called under the protection of mmu-lock and of either kvm->srcu or kvm->slots_lock. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:21 +01:00
Xiao Guangrong	21ebbedadd	KVM: page track: add the framework of guest page tracking The array gfn_track[mode][gfn] is introduced in the memory slot for every guest page; this is the tracking count for the guest page in the different modes. If the page is tracked then the count is increased; the page is no longer tracked once the count reaches zero. We use 'unsigned short' as the tracking count, which should be enough as the shadow page table can only use 2^14 (2^3 for level, 2^1 for cr4_pae, 2^2 for quadrant, 2^3 for access, 2^1 for nxe, 2^1 for cr0_wp, 2^1 for smep_andnot_wp, 2^1 for smap_andnot_wp, and 2^1 for smm) at most, so there is enough room for other trackers. Two callbacks, kvm_page_track_create_memslot() and kvm_page_track_free_memslot(), are implemented in this patch; they are used internally to initialize and reclaim the memory of the array. Currently, only write track mode is supported. Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:20 +01:00
Xiao Guangrong	aeecee2ea6	KVM: MMU: introduce kvm_mmu_slot_gfn_write_protect Split rmap_write_protect() and introduce the function to abstract the write protection based on the slot. This function will be used in a later patch. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:20 +01:00
Xiao Guangrong	547ffaed87	KVM: MMU: introduce kvm_mmu_gfn_{allow,disallow}_lpage Abstract the common operations from account_shadowed() and unaccount_shadowed(), then introduce kvm_mmu_gfn_disallow_lpage() and kvm_mmu_gfn_allow_lpage(). These two functions will be used by page tracking in a later patch. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:19 +01:00
Xiao Guangrong	92f94f1e9e	KVM: MMU: rename has_wrprotected_page to mmu_gfn_lpage_is_disallowed kvm_lpage_info->write_count is used to detect whether a large page mapping for the gfn at the specified level is allowed. Rename it to disallow_lpage to reflect its purpose, and rename has_wrprotected_page() to mmu_gfn_lpage_is_disallowed() to make the code clearer. Later we will extend this mechanism for page tracking: if the gfn is tracked, then a large mapping for that gfn at any level is not allowed. The new name is more straightforward. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:19 +01:00
Joerg Roedel	4d99ba898d	kvm: x86: Check dest_map->vector to match eoi signals for rtc Using the vector stored at interrupt delivery makes the eoi matching safe against irq migration in the ioapic. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:18 +01:00
Joerg Roedel	9daa50076f	kvm: x86: Track irq vectors in ioapic->rtc_status.dest_map This allows backtracking later in case the rtc irq has been moved to another vcpu/vector. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:18 +01:00
Joerg Roedel	9e4aabe2bb	kvm: x86: Convert ioapic->rtc_status.dest_map to a struct Currently this is a bitmap which tracks which CPUs we expect an EOI from. Move this bitmap to a struct so that we can track additional information there. Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-03 14:36:17 +01:00
Owen Hofmann	2680d6da45	kvm: x86: Update tsc multiplier on change. vmx.c writes the TSC_MULTIPLIER field in vmx_vcpu_load, but only when a vcpu has migrated physical cpus. Record the last value written and update in vmx_vcpu_load on any change, otherwise a cpu migration must occur for TSC frequency scaling to take effect. Cc: stable@vger.kernel.org Fixes: `ff2c3a1803` Signed-off-by: Owen Hofmann <osh@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-03-02 10:37:32 +01:00
Ingo Molnar	6aa447bcbb	Merge branch 'sched/urgent' into sched/core, to pick up fixes before applying new changes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-02-29 09:42:07 +01:00
Paolo Bonzini	70e4da7a8f	KVM: x86: fix root cause for missed hardware breakpoints Commit `172b2386ed` ("KVM: x86: fix missed hardware breakpoints", 2016-02-10) worked around a case where the debug registers are not loaded correctly on preemption and on the first entry to KVM_RUN. However, Xiao Guangrong pointed out that the root cause must be that KVM_DEBUGREG_BP_ENABLED is not being set correctly. This can indeed happen due to the lazy debug exit mechanism, which does not call kvm_update_dr7. Fix it by replacing the existing loop (more or less equivalent to kvm_update_dr0123) with calls to all the kvm_update_dr* functions. Cc: stable@vger.kernel.org # 4.1+ Fixes: `172b2386ed` Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-26 13:03:39 +01:00
Marcelo Tosatti	8577370fb0	KVM: Use simple waitqueue for vcpu->wq The problem: On -rt, an emulated LAPIC timer instance has the following path: 1) hard interrupt 2) ksoftirqd is scheduled 3) ksoftirqd wakes up vcpu thread 4) vcpu thread is scheduled This extra context switch introduces unnecessary latency in the LAPIC path for a KVM guest. The solution: Allow waking up vcpu thread from hardirq context, thus avoiding the need for ksoftirqd to be scheduled. Normal waitqueues make use of spinlocks, which on -RT are sleepable locks. Therefore, waking up a waitqueue waiter involves locking a sleeping lock, which is not allowed from hard interrupt context. cyclictest command line: This patch reduces the average latency in my tests from 14us to 11us. Daniel writes: Paolo asked for numbers from kvm-unit-tests/tscdeadline_latency benchmark on mainline. The test was run 1000 times on tip/sched/core 4.4.0-rc8-01134-g0905f04: ./x86-run x86/tscdeadline_latency.flat -cpu host with idle=poll. The test seems not to deliver really stable numbers though most of them are smaller. Paolo writes: "Anything above ~10000 cycles means that the host went to C1 or lower---the number means more or less nothing in that case. The mean shows an improvement indeed."
Before:
               min             max         mean           std
count  1000.000000     1000.000000  1000.000000   1000.000000
mean   5162.596000  2019270.084000  5824.491541  20681.645558
std      75.431231   622607.723969    89.575700   6492.272062
min    4466.000000    23928.000000  5537.926500    585.864966
25%    5163.000000  1613252.750000  5790.132275  16683.745433
50%    5175.000000  2281919.000000  5834.654000  23151.990026
75%    5190.000000  2382865.750000  5861.412950  24148.206168
max    5228.000000  4175158.000000  6254.827300  46481.048691
After:
               min            max         mean           std
count  1000.000000     1000.00000  1000.000000   1000.000000
mean   5143.511000  2076886.10300  5813.312474  21207.357565
std      77.668322   610413.09583    86.541500   6331.915127
min    4427.000000    25103.00000  5529.756600    559.187707
25%    5148.000000  1691272.75000  5784.889825  17473.518244
50%    5160.000000  2308328.50000  5832.025000  23464.837068
75%    5172.000000  2393037.75000  5853.177675  24223.969976
max    5222.000000  3922458.00000  6186.720500  42520.379830
[Patch was originally based on the swait implementation found in the -rt tree. Daniel ported it to mainline's version and gathered the benchmark numbers for tscdeadline_latency test.] Signed-off-by: Daniel Wagner <daniel.wagner@bmw-carit.de> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: linux-rt-users@vger.kernel.org Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Paul Gortmaker <paul.gortmaker@windriver.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Link: http://lkml.kernel.org/r/1455871601-27484-4-git-send-email-wagi@monom.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>	2016-02-25 11:27:16 +01:00
Mike Krinkin	17e4bce0ae	KVM: x86: MMU: fix ubsan index-out-of-range warning Ubsan reports the following warning due to a typo in update_accessed_dirty_bits template, the patch fixes the typo: [ 168.791851] ================================================================================ [ 168.791862] UBSAN: Undefined behaviour in arch/x86/kvm/paging_tmpl.h:252:15 [ 168.791866] index 4 is out of range for type 'u64 [4]' [ 168.791871] CPU: 0 PID: 2950 Comm: qemu-system-x86 Tainted: G O L 4.5.0-rc5-next-20160222 #7 [ 168.791873] Hardware name: LENOVO 23205NG/23205NG, BIOS G2ET95WW (2.55 ) 07/09/2013 [ 168.791876] 0000000000000000 ffff8801cfcaf208 ffffffff81c9f780 0000000041b58ab3 [ 168.791882] ffffffff82eb2cc1 ffffffff81c9f6b4 ffff8801cfcaf230 ffff8801cfcaf1e0 [ 168.791886] 0000000000000004 0000000000000001 0000000000000000 ffffffffa1981600 [ 168.791891] Call Trace: [ 168.791899] [<ffffffff81c9f780>] dump_stack+0xcc/0x12c [ 168.791904] [<ffffffff81c9f6b4>] ? _atomic_dec_and_lock+0xc4/0xc4 [ 168.791910] [<ffffffff81da9e81>] ubsan_epilogue+0xd/0x8a [ 168.791914] [<ffffffff81daafa2>] __ubsan_handle_out_of_bounds+0x15c/0x1a3 [ 168.791918] [<ffffffff81daae46>] ? __ubsan_handle_shift_out_of_bounds+0x2bd/0x2bd [ 168.791922] [<ffffffff811287ef>] ? get_user_pages_fast+0x2bf/0x360 [ 168.791954] [<ffffffffa1794050>] ? kvm_largepages_enabled+0x30/0x30 [kvm] [ 168.791958] [<ffffffff81128530>] ? __get_user_pages_fast+0x360/0x360 [ 168.791987] [<ffffffffa181b818>] paging64_walk_addr_generic+0x1b28/0x2600 [kvm] [ 168.792014] [<ffffffffa1819cf0>] ? init_kvm_mmu+0x1100/0x1100 [kvm] [ 168.792019] [<ffffffff8129e350>] ? debug_check_no_locks_freed+0x350/0x350 [ 168.792044] [<ffffffffa1819cf0>] ? init_kvm_mmu+0x1100/0x1100 [kvm] [ 168.792076] [<ffffffffa181c36d>] paging64_gva_to_gpa+0x7d/0x110 [kvm] [ 168.792121] [<ffffffffa181c2f0>] ? paging64_walk_addr_generic+0x2600/0x2600 [kvm] [ 168.792130] [<ffffffff812e848b>] ? 
debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792178] [<ffffffffa17d9a4a>] emulator_read_write_onepage+0x27a/0x1150 [kvm] [ 168.792208] [<ffffffffa1794d44>] ? __kvm_read_guest_page+0x54/0x70 [kvm] [ 168.792234] [<ffffffffa17d97d0>] ? kvm_task_switch+0x160/0x160 [kvm] [ 168.792238] [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792263] [<ffffffffa17daa07>] emulator_read_write+0xe7/0x6d0 [kvm] [ 168.792290] [<ffffffffa183b620>] ? em_cr_write+0x230/0x230 [kvm] [ 168.792314] [<ffffffffa17db005>] emulator_write_emulated+0x15/0x20 [kvm] [ 168.792340] [<ffffffffa18465f8>] segmented_write+0xf8/0x130 [kvm] [ 168.792367] [<ffffffffa1846500>] ? em_lgdt+0x20/0x20 [kvm] [ 168.792374] [<ffffffffa14db512>] ? vmx_read_guest_seg_ar+0x42/0x1e0 [kvm_intel] [ 168.792400] [<ffffffffa1846d82>] writeback+0x3f2/0x700 [kvm] [ 168.792424] [<ffffffffa1846990>] ? em_sidt+0xa0/0xa0 [kvm] [ 168.792449] [<ffffffffa185554d>] ? x86_decode_insn+0x1b3d/0x4f70 [kvm] [ 168.792474] [<ffffffffa1859032>] x86_emulate_insn+0x572/0x3010 [kvm] [ 168.792499] [<ffffffffa17e71dd>] x86_emulate_instruction+0x3bd/0x2110 [kvm] [ 168.792524] [<ffffffffa17e6e20>] ? reexecute_instruction.part.110+0x2e0/0x2e0 [kvm] [ 168.792532] [<ffffffffa14e9a81>] handle_ept_misconfig+0x61/0x460 [kvm_intel] [ 168.792539] [<ffffffffa14e9a20>] ? handle_pause+0x450/0x450 [kvm_intel] [ 168.792546] [<ffffffffa15130ea>] vmx_handle_exit+0xd6a/0x1ad0 [kvm_intel] [ 168.792572] [<ffffffffa17f6a6c>] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm] [ 168.792597] [<ffffffffa17f6bcd>] kvm_arch_vcpu_ioctl_run+0xd3d/0x6090 [kvm] [ 168.792621] [<ffffffffa17f6a6c>] ? kvm_arch_vcpu_ioctl_run+0xbdc/0x6090 [kvm] [ 168.792627] [<ffffffff8293b530>] ? __ww_mutex_lock_interruptible+0x1630/0x1630 [ 168.792651] [<ffffffffa17f5e90>] ? kvm_arch_vcpu_runnable+0x4f0/0x4f0 [kvm] [ 168.792656] [<ffffffff811eeb30>] ? preempt_notifier_unregister+0x190/0x190 [ 168.792681] [<ffffffffa17e0447>] ? 
kvm_arch_vcpu_load+0x127/0x650 [kvm] [ 168.792704] [<ffffffffa178e9a3>] kvm_vcpu_ioctl+0x553/0xda0 [kvm] [ 168.792727] [<ffffffffa178e450>] ? vcpu_put+0x40/0x40 [kvm] [ 168.792732] [<ffffffff8129e350>] ? debug_check_no_locks_freed+0x350/0x350 [ 168.792735] [<ffffffff82946087>] ? _raw_spin_unlock+0x27/0x40 [ 168.792740] [<ffffffff8163a943>] ? handle_mm_fault+0x1673/0x2e40 [ 168.792744] [<ffffffff8129daa8>] ? trace_hardirqs_on_caller+0x478/0x6c0 [ 168.792747] [<ffffffff8129dcfd>] ? trace_hardirqs_on+0xd/0x10 [ 168.792751] [<ffffffff812e848b>] ? debug_lockdep_rcu_enabled+0x7b/0x90 [ 168.792756] [<ffffffff81725a80>] do_vfs_ioctl+0x1b0/0x12b0 [ 168.792759] [<ffffffff817258d0>] ? ioctl_preallocate+0x210/0x210 [ 168.792763] [<ffffffff8174aef3>] ? __fget+0x273/0x4a0 [ 168.792766] [<ffffffff8174acd0>] ? __fget+0x50/0x4a0 [ 168.792770] [<ffffffff8174b1f6>] ? __fget_light+0x96/0x2b0 [ 168.792773] [<ffffffff81726bf9>] SyS_ioctl+0x79/0x90 [ 168.792777] [<ffffffff82946880>] entry_SYSCALL_64_fastpath+0x23/0xc1 [ 168.792780] ================================================================================ Signed-off-by: Mike Krinkin <krinkin.m.u@gmail.com> Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-25 09:50:35 +01:00
Paolo Bonzini	0c1d77f4ba	KVM: x86: fix conversion of addresses to linear in 32-bit protected mode Commit `e8dd2d2d64` ("Silence compiler warning in arch/x86/kvm/emulate.c", 2015-09-06) broke boot of the Hurd. The bug is that the "default:" case actually could modify "la", but after the patch this change is not reflected in *linear. The bug is visible whenever a non-zero segment base causes the linear address to wrap around the 4GB mark. Fixes: `e8dd2d2d64` Cc: stable@vger.kernel.org Reported-by: Aurelien Jarno <aurelien@aurel32.net> Tested-by: Aurelien Jarno <aurelien@aurel32.net> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-24 14:47:45 +01:00
Paolo Bonzini	172b2386ed	KVM: x86: fix missed hardware breakpoints Sometimes when setting a breakpoint a process doesn't stop on it. This is because the debug registers are not loaded correctly on VCPU load. The following simple reproducer from Oleg Nesterov tries using debug registers in two threads. To see the bug, run a 2-VCPU guest with "taskset -c 0" and run "./bp 0 1" inside the guest. #include <unistd.h> #include <signal.h> #include <stdlib.h> #include <stdio.h> #include <sys/wait.h> #include <sys/ptrace.h> #include <sys/user.h> #include <asm/debugreg.h> #include <assert.h> #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER) unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len) { unsigned long dr7; dr7 = ((len | type) & 0xf) << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE); if (enable) dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE)); return dr7; } int write_dr(int pid, int dr, unsigned long val) { return ptrace(PTRACE_POKEUSER, pid, offsetof (struct user, u_debugreg[dr]), val); } void set_bp(pid_t pid, void *addr) { unsigned long dr7; assert(write_dr(pid, 0, (long)addr) == 0); dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1); assert(write_dr(pid, 7, dr7) == 0); } void *get_rip(int pid) { return (void *)ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), 0); } void test(int nr) { void *bp_addr = &&label + nr, *bp_hit; int pid; printf("test bp %d\n", nr); assert(nr < 16); /* see 16 asm nops below */ pid = fork(); if (!pid) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); kill(getpid(), SIGSTOP); for (;;) { label: asm ( "nop; nop; nop; nop;" "nop; nop; nop; nop;" "nop; nop; nop; nop;" "nop; nop; nop; nop;" ); } } assert(pid == wait(NULL)); set_bp(pid, bp_addr); for (;;) { assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0); assert(pid == wait(NULL)); bp_hit = get_rip(pid); if (bp_hit != bp_addr) fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n", bp_hit - &&label, nr); } } int main(int argc, const char *argv[]) { while (--argc) { int nr = atoi(*++argv); if (!fork()) test(nr); } while (wait(NULL) > 0) ; return 0; } Cc: stable@vger.kernel.org Suggested-by: Nadav Amit <namit@cs.technion.ac.il> Reported-by: Andrey Wagin <avagin@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-24 14:47:39 +01:00
Adam Buchbinder	6a6256f9e0	x86: Fix misspellings in comments Signed-off-by: Adam Buchbinder <adam.buchbinder@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: trivial@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-02-24 08:44:58 +01:00
Chris J Arges	3f62de5f6f	x86/kvm: Add output operand in vmx_handle_external_intr inline asm Stacktool generates the following warning: stacktool: arch/x86/kvm/vmx.o: vmx_handle_external_intr()+0x67: call without frame pointer save/setup By adding the stack pointer as an output operand, this patch ensures that a stack frame is created when CONFIG_FRAME_POINTER is enabled for the inline assembly statement. Signed-off-by: Chris J Arges <chris.j.arges@canonical.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: gleb@kernel.org Cc: kvm@vger.kernel.org Cc: live-patching@vger.kernel.org Cc: pbonzini@redhat.com Link: http://lkml.kernel.org/r/1453499078-9330-3-git-send-email-chris.j.arges@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-02-24 08:35:44 +01:00
Josh Poimboeuf	cb7390fed4	x86/kvm: Make test_cc() always inline With some configs (including allyesconfig), gcc doesn't inline test_cc(). When that happens, test_cc() doesn't create a stack frame before inserting the inline asm call instruction. This breaks frame pointer convention if CONFIG_FRAME_POINTER is enabled and can result in a bad stack trace. Force it to always be inlined so that its containing function's stack frame can be used. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chris J Arges <chris.j.arges@canonical.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Pedro Alves <palves@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm@vger.kernel.org Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/20160122161612.GE20502@treble.redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-02-24 08:35:44 +01:00
Josh Poimboeuf	1482a0825b	x86/kvm: Set ELF function type for fastop functions The callable functions created with the FOP* and FASTOP* macros are missing ELF function annotations, which confuses tools like stacktool. Properly annotate them. This adds some additional labels to the assembly, but the generated binary code is unchanged (with the exception of instructions which have embedded references to __LINE__). Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Acked-by: Paolo Bonzini <pbonzini@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Bernd Petrovitsch <bernd@petrovitsch.priv.at> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chris J Arges <chris.j.arges@canonical.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Gleb Natapov <gleb@kernel.org> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Marek <mmarek@suse.cz> Cc: Namhyung Kim <namhyung@gmail.com> Cc: Pedro Alves <palves@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm@vger.kernel.org Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/e399651c89ace54906c203c0557f66ed6ea3ce8d.1453405861.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2016-02-24 08:35:44 +01:00
Geliang Tang	d74c0e6b54	KVM: x86: use list_last_entry To make the intention clearer, use list_last_entry instead of list_entry. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-23 15:40:54 +01:00
Geliang Tang	652fc08dae	KVM: x86: use list_for_each_entry* Use list_for_each_entry() instead of list_for_each() to simplify the code. Signed-off-by: Geliang Tang <geliangtang@163.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-23 15:40:54 +01:00
Takuya Yoshikawa	e9ee956e31	KVM: x86: MMU: Move handle_mmio_page_fault() call to kvm_mmu_page_fault() Rather than placing a handle_mmio_page_fault() call in each vcpu->arch.mmu.page_fault() handler, moving it up to kvm_mmu_page_fault() makes the code better: - avoids code duplication - for kvm_arch_async_page_ready(), which is the other caller of vcpu->arch.mmu.page_fault(), removes an extra error_code check - avoids returning both RET_MMIO_PF_* values and raw integer values from vcpu->arch.mmu.page_fault() Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-23 14:20:27 +01:00
Takuya Yoshikawa	ded5874946	KVM: x86: MMU: Consolidate quickly_check_mmio_pf() and is_mmio_page_fault() These two have only slight differences: - whether 'addr' is of type u64 or of type gva_t - whether they have 'direct' parameter or not Concerning the former, quickly_check_mmio_pf()'s u64 is better because 'addr' needs to be able to have both a guest physical address and a guest virtual address. The latter is just a stylistic issue as we can always calculate the mode from the 'vcpu' as is_mmio_page_fault() does. This patch keeps the parameter to make the following patch cleaner. In addition, the patch renames the function to mmio_info_in_cache() to make it clear what it actually checks for. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-23 14:20:27 +01:00
Paolo Bonzini	3ae13faac4	KVM: x86: pass kvm_get_time_scale arguments in hertz Prepare for improving the precision in the next patch. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:45 +01:00
Andrey Smetanin	83326e43f2	kvm/x86: Hyper-V VMBus hypercall userspace exit The patch implements KVM_EXIT_HYPERV userspace exit functionality for Hyper-V VMBus hypercalls: HV_X64_HCALL_POST_MESSAGE, HV_X64_HCALL_SIGNAL_EVENT. Changes v3: * use vcpu->arch.complete_userspace_io to setup hypercall result Changes v2: * use KVM_EXIT_HYPERV for hypercalls Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Joerg Roedel <joro@8bytes.org> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Haiyang Zhang <haiyangz@microsoft.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:44 +01:00
Andrey Smetanin	b2fdc2570a	kvm/x86: Reject Hyper-V hypercall continuation Currently we do not support Hyper-V hypercall continuation so reject it. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Joerg Roedel <joro@8bytes.org> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Haiyang Zhang <haiyangz@microsoft.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:42 +01:00
Andrey Smetanin	0d9c055eaa	kvm/x86: Pass return code of kvm_emulate_hypercall Pass the return code from kvm_emulate_hypercall on to the caller, in order to allow it to indicate to the userspace that the hypercall has to be handled there. Also adjust all the existing code paths to return 1 to make sure the hypercall isn't passed to the userspace without setting kvm_run appropriately. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Joerg Roedel <joro@8bytes.org> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Haiyang Zhang <haiyangz@microsoft.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:41 +01:00
Andrey Smetanin	8ed6d76781	kvm/x86: Rename Hyper-V long spin wait hypercall Rename HV_X64_HV_NOTIFY_LONG_SPIN_WAIT to HVCALL_NOTIFY_LONG_SPIN_WAIT, so the name is more consistent with the other hypercalls. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Joerg Roedel <joro@8bytes.org> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Haiyang Zhang <haiyangz@microsoft.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org [Change name, Andrey used HV_X64_HCALL_NOTIFY_LONG_SPIN_WAIT. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:38 +01:00
Paolo Bonzini	4e422bdd2f	KVM: x86: fix missed hardware breakpoints Sometimes when setting a breakpoint a process doesn't stop on it. This is because the debug registers are not loaded correctly on VCPU load. The following simple reproducer from Oleg Nesterov tries using debug registers in both the host and the guest, for example by running "./bp 0 1" on the host and "./bp 14 15" under QEMU. #include <unistd.h> #include <signal.h> #include <stdlib.h> #include <stdio.h> #include <sys/wait.h> #include <sys/ptrace.h> #include <sys/user.h> #include <asm/debugreg.h> #include <assert.h> #define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER) unsigned long encode_dr7(int drnum, int enable, unsigned int type, unsigned int len) { unsigned long dr7; dr7 = ((len | type) & 0xf) << (DR_CONTROL_SHIFT + drnum * DR_CONTROL_SIZE); if (enable) dr7 |= (DR_GLOBAL_ENABLE << (drnum * DR_ENABLE_SIZE)); return dr7; } int write_dr(int pid, int dr, unsigned long val) { return ptrace(PTRACE_POKEUSER, pid, offsetof (struct user, u_debugreg[dr]), val); } void set_bp(pid_t pid, void *addr) { unsigned long dr7; assert(write_dr(pid, 0, (long)addr) == 0); dr7 = encode_dr7(0, 1, DR_RW_EXECUTE, DR_LEN_1); assert(write_dr(pid, 7, dr7) == 0); } void *get_rip(int pid) { return (void *)ptrace(PTRACE_PEEKUSER, pid, offsetof(struct user, regs.rip), 0); } void test(int nr) { void *bp_addr = &&label + nr, *bp_hit; int pid; printf("test bp %d\n", nr); assert(nr < 16); // see 16 asm nops below pid = fork(); if (!pid) { assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0); kill(getpid(), SIGSTOP); for (;;) { label: asm ( "nop; nop; nop; nop;" "nop; nop; nop; nop;" "nop; nop; nop; nop;" "nop; nop; nop; nop;" ); } } assert(pid == wait(NULL)); set_bp(pid, bp_addr); for (;;) { assert(ptrace(PTRACE_CONT, pid, 0, 0) == 0); assert(pid == wait(NULL)); bp_hit = get_rip(pid); if (bp_hit != bp_addr) fprintf(stderr, "ERR!! hit wrong bp %ld != %d\n", bp_hit - &&label, nr); } } int main(int argc, const char *argv[]) { while (--argc) { int nr = atoi(*++argv); if (!fork()) test(nr); } while (wait(NULL) > 0) ; return 0; } Cc: stable@vger.kernel.org Suggested-by: Nadav Amit <namit@cs.technion.ac.il> Reported-by: Andrey Wagin <avagin@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:37 +01:00
Radim Krčmář	4efd805fca	KVM: x86: fix *NULL on invalid low-prio irq Smatch noticed a NULL dereference in kvm_intr_is_single_vcpu_fast that happens if VM already warned about invalid lowest-priority interrupt. Create a function for common code while fixing it. Fixes: `6228a0da80` ("KVM: x86: Add lowest-priority support for vt-d posted-interrupts") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:36 +01:00
Paolo Bonzini	78db6a5037	KVM: x86: rewrite handling of scaled TSC for kvmclock This is the same as before: kvm_scale_tsc(tgt_tsc_khz) = tgt_tsc_khz * ratio = tgt_tsc_khz * user_tsc_khz / tsc_khz (see set_tsc_khz) = user_tsc_khz (see kvm_guest_time_update) = vcpu->arch.virtual_tsc_khz (see kvm_set_tsc_khz) However, computing it through kvm_scale_tsc will make it possible to include the NTP correction in tgt_tsc_khz. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:34 +01:00
Paolo Bonzini	4941b8cb37	KVM: x86: rename argument to kvm_set_tsc_khz This refers to the desired (scaled) frequency, which is called user_tsc_khz in the rest of the file. Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:33 +01:00
Jan Kiszka	6f05485d3a	KVM: VMX: Fix guest debugging while in L2 When we take a #DB or #BP vmexit while in guest mode, we first of all need to check if there is ongoing guest debugging that might be interested in the event. Currently, we unconditionally leave L2 and inject the event into L1 if it is intercepting the exceptions. That breaks things marvelously. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:32 +01:00
Jan Kiszka	5bb16016ce	KVM: VMX: Factor out is_exception_n helper There is quite some common code in all these is_<exception>() helpers. Factor it out before adding even more of them. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-16 18:48:30 +01:00
Paolo Bonzini	bce87cce88	KVM: x86: consolidate different ways to test for in-kernel LAPIC Different pieces of code checked for vcpu->arch.apic being (non-)NULL, or used kvm_vcpu_has_lapic (more optimized) or lapic_in_kernel. Replace everything with lapic_in_kernel's name and kvm_vcpu_has_lapic's implementation. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-09 16:57:45 +01:00
Paolo Bonzini	1e3161b414	KVM: x86: consolidate "has lapic" checks into irq.c Do for kvm_cpu_has_pending_timer and kvm_inject_pending_timer_irqs what the other irq.c routines have been doing. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-09 16:57:39 +01:00
Paolo Bonzini	f8543d6a97	KVM: APIC: remove unnecessary double checks on APIC existence Usually the in-kernel APIC's existence is checked in the caller. Do not bother checking it again in lapic.c. Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-09 16:57:14 +01:00
Feng Wu	b6ce978067	KVM/VMX: Add host irq information in trace event when updating IRTE for posted interrupts Add host irq information in trace event, so we can better understand which irq is in posted mode. Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-09 13:24:43 +01:00
Feng Wu	6228a0da80	KVM: x86: Add lowest-priority support for vt-d posted-interrupts Use vector-hashing to deliver lowest-priority interrupts for VT-d posted-interrupts. This patch extends kvm_intr_is_single_vcpu() to support lowest-priority handling. Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-09 13:24:42 +01:00
Feng Wu	520040146a	KVM: x86: Use vector-hashing to deliver lowest-priority interrupts Use vector-hashing to deliver lowest-priority interrupts. As an example, modern Intel CPUs on server platforms use this method to handle lowest-priority interrupts. Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-09 13:24:40 +01:00
Feng Wu	23a1c2579b	KVM: Recover IRTE to remapped mode if the interrupt is not single-destination When the interrupt is not single destination any more, we need to change back IRTE to remapped mode explicitly. Signed-off-by: Feng Wu <feng.wu@intel.com> Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-09 13:24:39 +01:00
Paolo Bonzini	b51012deb3	KVM: x86: introduce do_shl32_div32 This is similar to the existing div_frac function, but it returns the remainder too. Unlike div_frac, it can be used to implement long division, e.g. (a << 64) / b for 32-bit a and b. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-02-09 13:24:37 +01:00
Dan Williams	ba049e93ae	kvm: rename pfn_t to kvm_pfn_t To date, we have implemented two I/O usage models for persistent memory, PMEM (a persistent "ram disk") and DAX (mmap persistent memory into userspace). This series adds a third, DAX-GUP, that allows DAX mappings to be the target of direct-i/o. It allows userspace to coordinate DMA/RDMA from/to persistent memory. The implementation leverages the ZONE_DEVICE mm-zone that went into 4.3-rc1 (also discussed at kernel summit) to flag pages that are owned and dynamically mapped by a device driver. The pmem driver, after mapping a persistent memory range into the system memmap via devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus page-backed pmem-pfns via flags in the new pfn_t type. The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the resulting pte(s) inserted into the process page tables with a new _PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys off _PAGE_DEVMAP to pin the device hosting the page range active. Finally, get_page() and put_page() are modified to take references against the device driver established page mapping. This need for "struct page" for persistent memory also requires memory capacity to store the memmap array. Because the memmap array for a large pool of persistent memory may exhaust available DRAM, introduce a mechanism to allocate the memmap from persistent memory. The new "struct vmem_altmap *" parameter to devm_memremap_pages() enables arch_add_memory() to use reserved pmem capacity rather than the page allocator. This patch (of 18): The core has developed a need for a "pfn_t" type [1]. Move the existing pfn_t in KVM to kvm_pfn_t [2]. 
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2016-01-15 17:56:32 -08:00
Linus Torvalds	1baa5efbeb	Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM updates from Paolo Bonzini: "PPC changes will come next week. - s390: Support for runtime instrumentation within guests, support of 248 VCPUs. - ARM: rewrite of the arm64 world switch in C, support for 16-bit VM identifiers. Performance counter virtualization missed the boat. 
- x86: Support for more Hyper-V features (synthetic interrupt controller), MMU cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (115 commits) kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL kvm/x86: Hyper-V SynIC timers tracepoints kvm/x86: Hyper-V SynIC tracepoints kvm/x86: Update SynIC timers on guest entry only kvm/x86: Skip SynIC vector check for QEMU side kvm/x86: Hyper-V fix SynIC timer disabling condition kvm/x86: Reorg stimer_expiration() to better control timer restart kvm/x86: Hyper-V unify stimer_start() and stimer_restart() kvm/x86: Drop stimer_stop() function kvm/x86: Hyper-V timers fix incorrect logical operation KVM: move architecture-dependent requests to arch/ KVM: renumber vcpu->request bits KVM: document which architecture uses each request bit KVM: Remove unused KVM_REQ_KICK to save a bit in vcpu->requests kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock KVM: s390: implement the RI support of guest kvm/s390: drop unpaired smp_mb kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table() arm/arm64: KVM: Detect vGIC presence at runtime ...	2016-01-12 13:22:12 -08:00
Huaitong Han	45bdbcfdf2	kvm: x86: Fix vmwrite to SECONDARY_VM_EXEC_CONTROL vmx_cpuid_update tries to update SECONDARY_VM_EXEC_CONTROL in the VMCS, but it will cause a vmwrite error on older CPUs because the code does not check for the presence of CPU_BASED_ACTIVATE_SECONDARY_CONTROLS. This will get rid of the following trace on e.g. Core2 6600: vmwrite error: reg 401e value 10 (err 12) Call Trace: [<ffffffff8116e2b9>] dump_stack+0x40/0x57 [<ffffffffa020b88d>] vmx_cpuid_update+0x5d/0x150 [kvm_intel] [<ffffffffa01d8fdc>] kvm_vcpu_ioctl_set_cpuid2+0x4c/0x70 [kvm] [<ffffffffa01b8363>] kvm_arch_vcpu_ioctl+0x903/0xfa0 [kvm] Fixes: `feda805fe7` Cc: stable@vger.kernel.org Reported-by: Zdenek Kaspar <zkaspar82@gmail.com> Signed-off-by: Huaitong Han <huaitong.han@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-12 11:42:16 +01:00
Linus Torvalds	671d5532aa	Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cpu updates from Ingo Molnar: "The main changes in this cycle were: - Improved CPU ID handling code and related enhancements (Borislav Petkov) - RDRAND fix (Len Brown)" * 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Replace RDRAND forced-reseed with simple sanity check x86/MSR: Chop off lower 32-bit value x86/cpu: Fix MSR value truncation issue x86/cpu/amd, kvm: Satisfy guest kernel reads of IC_CFG MSR kvm: Add accessors for guest CPU's family, model, stepping x86/cpu: Unify CPU family, model, stepping calculation	2016-01-11 16:46:20 -08:00
Andrey Smetanin	ac3e5fcae8	kvm/x86: Hyper-V SynIC timers tracepoints Trace the following Hyper-V SynIC timer events: * periodic timer start * one-shot timer start * timer callback * timer expiration and message delivery result * timer config setup * timer count setup * timer cleanup Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:43 +01:00
Andrey Smetanin	18659a9cb1	kvm/x86: Hyper-V SynIC tracepoints Trace the following Hyper-V SynIC events: * set msr * set sint irq * ack sint * sint irq eoi Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:43 +01:00
Andrey Smetanin	f3b138c5d8	kvm/x86: Update SynIC timers on guest entry only Consolidate updating the Hyper-V SynIC timers in a single place: on guest entry in processing KVM_REQ_HV_STIMER request. This simplifies the overall logic, and makes sure the most current state of msrs and guest clock is used for arming the timers (to achieve that, KVM_REQ_HV_STIMER has to be processed after KVM_REQ_CLOCK_UPDATE). Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:42 +01:00
Andrey Smetanin	7be58a6488	kvm/x86: Skip SynIC vector check for QEMU side QEMU zero-inits Hyper-V SynIC vectors. We should allow that and not reject zero values when they are set by the host. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:42 +01:00
Andrey Smetanin	23a3b201fd	kvm/x86: Hyper-V fix SynIC timer disabling condition The Hypervisor Function Specification (HFS) does not require disabling a SynIC timer on a timer config write when timer->count = 0, so drop this check. This allows the timer MSRs to be loaded during migration restore, because QEMU sets config before count. Also fix the condition according to HFS doc (15.3.1): "It is not permitted to set the SINTx field to zero for an enabled timer. If attempted, the timer will be marked disabled (that is, bit 0 cleared) immediately." Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:41 +01:00
Andrey Smetanin	0cdeabb118	kvm/x86: Reorg stimer_expiration() to better control timer restart Split stimer_expiration() into two parts - timer expiration message sending and timer restart/cleanup based on timer state(config). This also fixes a bug where a one-shot timer message whose delivery failed once would get lost for good. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:41 +01:00
Andrey Smetanin	f808495da5	kvm/x86: Hyper-V unify stimer_start() and stimer_restart() This will be used in the future to start the Hyper-V SynIC timer from several places through a single function. Changes v2: * drop stimer->count == 0 check inside stimer_start() * comment stimer_start() assumptions Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:40 +01:00
Andrey Smetanin	019b9781cc	kvm/x86: Drop stimer_stop() function The function stimer_stop() is called in only one place, so remove the function and replace its call with the function body. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:40 +01:00
Andrey Smetanin	1ac1b65ac1	kvm/x86: Hyper-V timers fix incorrect logical operation Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:39 +01:00
Paolo Bonzini	2860c4b167	KVM: move architecture-dependent requests to arch/ Since the numbers now overlap, it makes sense to enumerate them in asm/kvm_host.h rather than linux/kvm_host.h. Functions that refer to architecture-specific requests are also moved to arch/. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-08 19:04:36 +01:00
Nicholas Krause	1dab1345d8	kvm: x86: Check kvm_write_guest return value in kvm_write_wall_clock This makes sure the wall clock is updated only after an odd version value is successfully written to guest memory. Signed-off-by: Nicholas Krause <xerofoify@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-07 14:51:32 +01:00
Paolo Bonzini	e5e57e7a03	kvm: x86: only channel 0 of the i8254 is linked to the HPET While setting the KVM PIT counters in 'kvm_pit_load_count', if 'hpet_legacy_start' is set, the function disables the timer on channel[0], instead of the respective index 'channel'. This is because channels 1-3 are not linked to the HPET. Fix the caller to only activate the special HPET processing for channel 0. Reported-by: P J P <pjp@fedoraproject.org> Fixes: `0185604c2d` Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-07 13:50:38 +01:00
David Matlack	0af2593b2a	kvm: x86: fix comment about {mmu,nested_mmu}.gva_to_gpa The comment had the meaning of mmu.gva_to_gpa and nested_mmu.gva_to_gpa swapped. Fix that, and also add some details describing how each translation works. Signed-off-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2016-01-07 11:03:47 +01:00
Andrew Honig	0185604c2d	KVM: x86: Reload pit counters for all channels when restoring state Currently if userspace restores the pit counters with a count of 0 on channels 1 or 2 and the guest attempts to read the count on those channels, then KVM will perform a mod of 0 and crash. This will ensure that 0 values are converted to 65536 as per the spec. This is CVE-2015-7513. Signed-off-by: Andy Honig <ahonig@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-22 15:36:26 +01:00
Paolo Bonzini	e24dea2afc	KVM: MTRR: treat memory as writeback if MTRR is disabled in guest CPUID Virtual machines can be run with CPUID such that there are no MTRRs. In that case, the firmware will never enable MTRRs and it is obviously undesirable to run the guest entirely with UC memory. Check the guest CPUID, and use WB memory if MTRRs do not exist. Cc: qemu-stable@nongnu.org Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-22 15:29:00 +01:00
Paolo Bonzini	fa7c4ebd5a	KVM: MTRR: observe maxphyaddr from guest CPUID, not host Conversion of MTRRs to ranges used the maxphyaddr from the boot CPU. This is wrong, because var_mtrr_range's mask variable then is discontiguous (like FF00FFFF000, where the first run of 0s corresponds to the bits between host and guest maxphyaddr). Instead always set up the masks to be full 64-bit values---we know that the reserved bits at the top are zero, and we can restore them when reading the MSR. This way var_mtrr_range gets a mask that just works. Fixes: `a13842dc66` Cc: qemu-stable@nongnu.org Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561 Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-22 15:28:56 +01:00
Alexis Dambricourt	a7f2d78657	KVM: MTRR: fix fixed MTRR segment look up This fixes the slow-down of VM running with pci-passthrough, since some MTRR range changed from MTRR_TYPE_WRBACK to MTRR_TYPE_UNCACHABLE. Memory in the 0K-640K range was incorrectly treated as uncacheable. Fixes: `f7bfb57b3e` Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=107561 Cc: qemu-stable@nongnu.org Signed-off-by: Alexis Dambricourt <alexis.dambricourt@gmail.com> [Use correct BZ for "Fixes" annotation. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-22 15:28:37 +01:00
Takuya Yoshikawa	774926641d	KVM: x86: MMU: Use clear_page() instead of init_shadow_page_table() Not just in order to clean up the code, but to make it faster by using enhanced instructions: the initialization became 20-30% faster on our testing machine. Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-18 19:07:45 +01:00
Andrey Smetanin	481d2bcc84	kvm/x86: Remove Hyper-V SynIC timer stopping It is possible for the guest to send a Hyper-V EOM while a Hyper-V SynIC timer is running, so we start processing the Hyper-V SynIC timers in vcpu context and stop the Hyper-V SynIC timer unconditionally. The sequence is: [guest] start periodic stimer; [host] start periodic timer; [host] timer expires after 15 ms; [host] send expiration message into guest; [host] restart periodic timer; [host] timer expires again after 15 ms; [host] msg slot is still not cleared so setup ->msg_pending; [host] (1) restart periodic timer; [guest] process timer msg and clear slot; [guest] ->msg_pending was set: send EOM into host; [host] received EOM, kvm_make_request(KVM_REQ_HV_STIMER); [host] kvm_hv_process_stimers(): ... stimer_stop() if (time_now >= stimer->exp_time) stimer_expiration(stimer); Because the timer was rearmed at (1), time_now < stimer->exp_time and stimer_expiration is not called. The timer then never fires. The patch fixes this by not stopping the Hyper-V SynIC timer at all: it is safe to restart it in vcpu context without stopping it first, and the timer callback always returns HRTIMER_NORESTART. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-16 18:51:22 +01:00
Paolo Bonzini	8a86aea920	KVM: vmx: detect mismatched size in VMCS read/write Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- I am sending this as RFC because the error messages it produces are very ugly. Because of inlining, the original line is lost. The alternative is to change vmcs_read/write/checkXX into macros, but then you need to have a single huge BUILD_BUG_ON or BUILD_BUG_ON_MSG because multiple BUILD_BUG_ON* with the same __LINE__ are not supported well.	2015-12-16 18:49:47 +01:00
Paolo Bonzini	845c5b4054	KVM: VMX: fix read/write sizes of VMCS fields in dump_vmcs This was not printing the high parts of several 64-bit fields on 32-bit kernels. Separate from the previous one to make the patches easier to review. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-16 18:49:47 +01:00
Paolo Bonzini	f353105463	KVM: VMX: fix read/write sizes of VMCS fields In theory this should have broken EPT on 32-bit kernels (due to reading the high part of natural-width field GUEST_CR3). Not sure if no one noticed or the processor behaves differently from the documentation. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-16 18:49:46 +01:00
Li RongQing	0bcf261cc8	KVM: VMX: fix the writing POSTED_INTR_NV POSTED_INTR_NV is 16bit, should not use 64bit write function [ 5311.676074] vmwrite error: reg 3 value 0 (err 12) [ 5311.680001] CPU: 49 PID: 4240 Comm: qemu-system-i38 Tainted: G I 4.1.13-WR8.0.0.0_standard #1 [ 5311.689343] Hardware name: Intel Corporation S2600WT2/S2600WT2, BIOS SE5C610.86B.01.01.0008.021120151325 02/11/2015 [ 5311.699550] 00000000 00000000 e69a7e1c c1950de1 00000000 e69a7e38 fafcff45 fafebd24 [ 5311.706924] 00000003 00000000 0000000c b6a06dfa e69a7e40 fafcff79 e69a7eb0 fafd5f57 [ 5311.714296] e69a7ec0 c1080600 00000000 00000001 c0e18018 000001be 00000000 00000b43 [ 5311.721651] Call Trace: [ 5311.722942] [<c1950de1>] dump_stack+0x4b/0x75 [ 5311.726467] [<fafcff45>] vmwrite_error+0x35/0x40 [kvm_intel] [ 5311.731444] [<fafcff79>] vmcs_writel+0x29/0x30 [kvm_intel] [ 5311.736228] [<fafd5f57>] vmx_create_vcpu+0x337/0xb90 [kvm_intel] [ 5311.741600] [<c1080600>] ? dequeue_task_fair+0x2e0/0xf60 [ 5311.746197] [<faf3b9ca>] kvm_arch_vcpu_create+0x3a/0x70 [kvm] [ 5311.751278] [<faf29e9d>] kvm_vm_ioctl+0x14d/0x640 [kvm] [ 5311.755771] [<c1129d44>] ? free_pages_prepare+0x1a4/0x2d0 [ 5311.760455] [<c13e2842>] ? debug_smp_processor_id+0x12/0x20 [ 5311.765333] [<c10793be>] ? sched_move_task+0xbe/0x170 [ 5311.769621] [<c11752b3>] ? kmem_cache_free+0x213/0x230 [ 5311.774016] [<faf29d50>] ? kvm_set_memory_region+0x60/0x60 [kvm] [ 5311.779379] [<c1199fa2>] do_vfs_ioctl+0x2e2/0x500 [ 5311.783285] [<c11752b3>] ? kmem_cache_free+0x213/0x230 [ 5311.787677] [<c104dc73>] ? __mmdrop+0x63/0xd0 [ 5311.791196] [<c104dc73>] ? __mmdrop+0x63/0xd0 [ 5311.794712] [<c104dc73>] ? __mmdrop+0x63/0xd0 [ 5311.798234] [<c11a2ed7>] ? __fget+0x57/0x90 [ 5311.801559] [<c11a2f72>] ? __fget_light+0x22/0x50 [ 5311.805464] [<c119a240>] SyS_ioctl+0x80/0x90 [ 5311.808885] [<c1957d30>] sysenter_do_call+0x12/0x12 [ 5312.059280] kvm: zapping shadow pages for mmio generation wraparound [ 5313.678415] kvm [4231]: vcpu0 disabled perfctr wrmsr: 0xc2 data 0xffff [ 5313.726518] kvm [4231]: vcpu0 unhandled rdmsr: 0x570 Signed-off-by: Li RongQing <roy.qing.li@gmail.com> Cc: Yang Zhang <yang.z.zhang@Intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-16 18:49:45 +01:00
Andrey Smetanin	1f4b34f825	kvm/x86: Hyper-V SynIC timers Per Hyper-V specification (and as required by Hyper-V-aware guests), SynIC provides 4 per-vCPU timers. Each timer is programmed via a pair of MSRs, and signals expiration by delivering a special format message to the configured SynIC message slot and triggering the corresponding synthetic interrupt. Note: as implemented by this patch, all periodic timers are "lazy" (i.e. if the vCPU wasn't scheduled for more than the timer period the timer events are lost), regardless of the corresponding configuration MSR. If deemed necessary, the "catch up" mode (the timer period is shortened until the timer catches up) will be implemented later. Changes v2: * Use remainder to calculate periodic timer expiration time Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Haiyang Zhang <haiyangz@microsoft.com> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-16 18:49:45 +01:00
Andrey Smetanin	765eaa0f70	kvm/x86: Hyper-V SynIC message slot pending clearing at SINT ack The SynIC message protocol mandates that the message slot is claimed by atomically setting message type to something other than HVMSG_NONE. If another message is to be delivered while the slot is still busy, message pending flag is asserted to indicate to the guest that the hypervisor wants to be notified when the slot is released. To make sure the protocol works regardless of where the message sources are (kernel or userspace), clear the pending flag on SINT ACK notification, and let the message sources compete for the slot again. Signed-off-by: Andrey Smetanin <asmetanin@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> CC: Gleb Natapov <gleb@kernel.org> CC: Paolo Bonzini <pbonzini@redhat.com> CC: "K. Y. Srinivasan" <kys@microsoft.com> CC: Haiyang Zhang <haiyangz@microsoft.com> CC: Vitaly Kuznetsov <vkuznets@redhat.com> CC: Roman Kagan <rkagan@virtuozzo.com> CC: Denis V. Lunev <den@openvz.org> CC: qemu-devel@nongnu.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2015-12-16 18:49:44 +01:00
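
The message slot protocol described in 765eaa0f70 can be sketched roughly as follows. This is a simplified userspace illustration using C11 atomics, not the KVM code: the struct layout, the function names (slot_init, slot_try_claim, slot_release, slot_sint_ack) and the use of 0 for HVMSG_NONE are assumptions made for the example. Only the behaviour is taken from the commit message: a slot is claimed by atomically changing its type away from HVMSG_NONE, a sender that finds the slot busy asserts the pending flag, and the flag is cleared on SINT ACK so that all sources compete for the slot again.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define HVMSG_NONE 0u /* assumed value: marks a free slot in this sketch */

/* Hypothetical, simplified slot; not the real SynIC message layout. */
struct msg_slot {
    _Atomic uint32_t type; /* HVMSG_NONE while the slot is free */
    atomic_bool pending;   /* a sender wants to know when the slot frees up */
};

static inline void slot_init(struct msg_slot *slot)
{
    atomic_init(&slot->type, HVMSG_NONE);
    atomic_init(&slot->pending, false);
}

/* Sender side: claim the slot atomically, or flag that we are waiting. */
static inline bool slot_try_claim(struct msg_slot *slot, uint32_t type)
{
    uint32_t expected = HVMSG_NONE;

    assert(type != HVMSG_NONE); /* claiming means leaving the "none" state */
    if (atomic_compare_exchange_strong(&slot->type, &expected, type))
        return true;

    /* Slot is busy: ask to be notified when it is released. */
    atomic_store(&slot->pending, true);
    return false;
}

/* Guest side: free the slot; returns true if a notification is wanted. */
static inline bool slot_release(struct msg_slot *slot)
{
    atomic_store(&slot->type, HVMSG_NONE);
    return atomic_load(&slot->pending);
}

/* On SINT ACK: drop the pending flag so every source competes again. */
static inline void slot_sint_ack(struct msg_slot *slot)
{
    atomic_store(&slot->pending, false);
}
```

Clearing the flag centrally at acknowledgement time, rather than in each sender, means no single message source (kernel or userspace) needs to know which other source set it.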


3745 Commits