After the THP refcounting change, obtaining a compound page from
get_user_pages() no longer allows us to assume that the entire compound
page is immediately mappable from a secondary MMU.
A secondary MMU doesn't want to call get_user_pages() more than once per
compound page just to find out whether it can map the whole compound
page. It needs to know, from a single get_user_pages() invocation, when
it can immediately map the entire compound page, to avoid a flood of
unnecessary secondary MMU faults and spurious atomic_inc()/atomic_dec()
calls (pages don't have to be pinned by MMU notifier users).
Ideally, instead of the page->_mapcount < 1 check, get_user_pages()
should return the granularity of the "page" mapping in the "mm" passed
to get_user_pages(). However, it's a non-trivial change to pass the
"pmd" status belonging to the "mm" walked by get_user_pages up the stack
(up to the caller of get_user_pages). So the fix simply checks whether
there is no individual pte mapping on the page returned by
get_user_pages, in which case the caller can assume that the whole
compound page is mapped in the current "mm" (by a pmd_trans_huge()
mapping). In that case the entire compound page is safe to map into the
secondary MMU without additional get_user_pages() calls on the
surrounding tail/head pages. Besides being faster, not having to run
other get_user_pages() calls also reduces the memory footprint of the
secondary MMU fault in case the pmd split happened as a result of memory
pressure.
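As a rough illustration, the decision a secondary MMU user has to make
can be sketched as below; the helper name is hypothetical and only the
"no individual pte mapping" test reflects the check described above:

    /*
     * Hypothetical sketch, not the exact upstream helper: decide whether
     * a page returned by get_user_pages() can be assumed to be mapped at
     * compound (pmd_trans_huge) granularity in the current "mm".
     */
    static bool can_map_whole_compound_page(struct page *page)
    {
        /* Not part of a transparent hugepage: only this page is mappable. */
        if (!PageTransCompound(page))
            return false;

        /*
         * A subpage with its own pte mapping (_mapcount >= 0 after the
         * THP refcounting change) means the pmd may have been split, so
         * only the single 4k page is known to be mapped.
         */
        if (atomic_read(&page->_mapcount) >= 0)
            return false;

        /*
         * No individual pte mapping: the compound page is mapped by a
         * pmd_trans_huge() in this mm, so the secondary MMU can map it
         * whole without further get_user_pages() calls.
         */
        return true;
    }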
Without this fix, after a MADV_DONTNEED (as invoked by QEMU during
postcopy live migration or ballooning) or after generic swapping (with a
failure in split_huge_page() that would only result in pmd splitting and
not a physical page split), KVM would map the whole compound page into
the shadow pagetables, even though regular faults or userfaults (like
UFFDIO_COPY) may map regular pages into the primary MMU as a result of
the pte faults, leading to guest mode and userland mode going out of
sync and not working on the same memory at all times.
Any other secondary MMU notifier manager (KVM is just one of the many
MMU notifier users) will need the same information if it doesn't want to
run a flood of get_user_pages_fast calls and it can support multiple
granularities in its secondary MMU mappings, so I think it is justified
to expose this not just to KVM.
The other option would be to move transparent_hugepage_adjust to
mm/huge_memory.c, but that function currently has all kinds of KVM data
structures in it, so it's definitely not cut-and-paste work, and I
couldn't do a cleaner fix than this one for 4.6.
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: "Li, Liang Z" <liang.z.li@intel.com>
Cc: Amit Shah <amit.shah@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In the absence of a shadow dirty mask, there is no need to set the page
dirty if the page has never been writable. This is a tiny optimization,
but good to have for people who care a lot about dirty page tracking.
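A rough sketch of the idea (the wrapper and its placement are
illustrative assumptions; only the condition reflects the change
described):

    /*
     * Illustrative sketch only: when the spte format has no hardware
     * dirty bit (shadow_dirty_mask == 0), a spte that was never writable
     * cannot have dirtied the page, so kvm_set_pfn_dirty() can be skipped.
     */
    static void mark_pfn_dirty_sketch(u64 old_spte)
    {
        if (shadow_dirty_mask) {
            /* Trust the hardware-tracked dirty bit when one exists. */
            if (old_spte & shadow_dirty_mask)
                kvm_set_pfn_dirty(spte_to_pfn(old_spte));
        } else if (is_writable_pte(old_spte)) {
            /* No dirty bit: only an spte that allowed writes counts. */
            kvm_set_pfn_dirty(spte_to_pfn(old_spte));
        }
    }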
Signed-off-by: Yu Zhao <yuzhao@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is already a barrier inside kvm_flush_remote_tlbs() which ensures
that everyone sees our modifications to the page tables and sees changes
to vcpu->mode here. So remove the smp_mb() in kvm_mmu_commit_zap_page()
and update the comment.
Signed-off-by: Lan Tianyu <tianyu.lan@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PKEYS defines a new status bit in the PFEC: PFEC.PK (bit 5). If certain
conditions are true, the fault is considered a PKU violation.
pkru_mask indicates whether we need to check PKRU.ADi and PKRU.WDi, and
caches some of the conditions for permission_fault().
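For reference, the architectural rule behind pkru_mask can be sketched
as follows; the helper name and parameter list are illustrative, not
KVM's actual interface:

    /*
     * Architectural PKU rule: a data access to a user page tagged with
     * protection key "pkey" faults if PKRU.AD is set for that key, or if
     * PKRU.WD is set and the access is a write subject to write
     * protection (CR0.WP=1, or the access comes from user mode).
     */
    static bool pkey_violation(u32 pkru, unsigned int pkey, bool write,
                               bool user_mode_access, bool cr0_wp)
    {
        bool ad = pkru & (1u << (pkey * 2));     /* PKRU.ADi: access disable */
        bool wd = pkru & (1u << (pkey * 2 + 1)); /* PKRU.WDi: write disable  */

        if (ad)
            return true;
        if (wd && write && (cr0_wp || user_mode_access))
            return true;
        return false;
    }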
[ Huaitong: Xiao helped to modify many sections. ]
Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"One of the largest releases for KVM... Hardly any generic
changes, but lots of architecture-specific updates.
ARM:
- VHE support so that we can run the kernel at EL2 on ARMv8.1 systems
- PMU support for guests
- 32bit world switch rewritten in C
- various optimizations to the vgic save/restore code.
PPC:
- enabled KVM-VFIO integration ("VFIO device")
- optimizations to speed up IPIs between vcpus
- in-kernel handling of IOMMU hypercalls
- support for dynamic DMA windows (DDW).
s390:
- provide the floating point registers via sync regs;
- separated instruction vs. data accesses
- dirty log improvements for huge guests
- bugfixes and documentation improvements.
x86:
- Hyper-V VMBus hypercall userspace exit
- alternative implementation of lowest-priority interrupts using
vector hashing (for better VT-d posted interrupt support)
- fixed guest debugging with nested virtualizations
- improved interrupt tracking in the in-kernel IOAPIC
- generic infrastructure for tracking writes to guest
memory - currently its only use is to speedup the legacy shadow
paging (pre-EPT) case, but in the future it will be used for
virtual GPUs as well
- much cleanup (LAPIC, kvmclock, MMU, PIT), including ubsan fixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (217 commits)
KVM: x86: remove eager_fpu field of struct kvm_vcpu_arch
KVM: x86: disable MPX if host did not enable MPX XSAVE features
arm64: KVM: vgic-v3: Only wipe LRs on vcpu exit
arm64: KVM: vgic-v3: Reset LRs at boot time
arm64: KVM: vgic-v3: Do not save an LR known to be empty
arm64: KVM: vgic-v3: Save maintenance interrupt state only if required
arm64: KVM: vgic-v3: Avoid accessing ICH registers
KVM: arm/arm64: vgic-v2: Make GICD_SGIR quicker to hit
KVM: arm/arm64: vgic-v2: Only wipe LRs on vcpu exit
KVM: arm/arm64: vgic-v2: Reset LRs at boot time
KVM: arm/arm64: vgic-v2: Do not save an LR known to be empty
KVM: arm/arm64: vgic-v2: Move GICH_ELRSR saving to its own function
KVM: arm/arm64: vgic-v2: Save maintenance interrupt state only if required
KVM: arm/arm64: vgic-v2: Avoid accessing GICH registers
KVM: s390: allocate only one DMA page per VM
KVM: s390: enable STFLE interpretation only if enabled for the guest
KVM: s390: wake up when the VCPU cpu timer expires
KVM: s390: step the VCPU timer while in enabled wait
KVM: s390: protect VCPU cpu timer with a seqcount
KVM: s390: step VCPU cpu timer during kvm_run ioctl
...
KVM has special logic to handle pages with pte.u=1 and pte.w=0 when
CR0.WP=1. These pages' SPTEs flip continuously between two states:
U=1/W=0 (user and supervisor reads allowed, supervisor writes not allowed)
and U=0/W=1 (supervisor reads and writes allowed, user access not allowed).
When SMEP is in effect, however, U=0 will enable kernel execution of
this page. To avoid this, KVM also sets NX=1 in the shadow PTE together
with U=0, making the two states U=1/W=0/NX=gpte.NX and U=0/W=1/NX=1.
When guest EFER has the NX bit cleared, the reserved bit check thinks
that the latter state is invalid; teach it that the smep_andnot_wp case
will also use the NX bit of SPTEs.
Cc: stable@vger.kernel.org
Reviewed-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Fixes: c258b62b26
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Branch-free code is fun and everybody knows how much Avi loves it,
but last_pte_bitmap takes it a bit to the extreme. Since the code
is simply doing a range check, like
    level == 1 || ((gpte & PT_PAGE_SIZE_MASK) && level < N)
we can make it branch-free without storing the entire truth table;
it is enough to cache N.
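In readable (branchy) form the check stays exactly the range test quoted
above; the sketch below only gives the cached bound an illustrative
name, it is not the branch-free code from the patch:

    /*
     * Readable equivalent of what the branch-free code computes once N
     * (the last level that can still be a non-leaf) is cached.
     */
    static inline bool is_last_gpte_sketch(unsigned int level, u64 gpte,
                                           unsigned int last_nonleaf_level)
    {
        return level == 1 ||
               ((gpte & PT_PAGE_SIZE_MASK) && level < last_nonleaf_level);
    }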
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_sync_children can only process up to 16 pages at a time. Check
if we need to reschedule, and do not bother zapping the pages until
that happens.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_mmu_get_page is the only caller of kvm_sync_page_transient
and kvm_sync_pages. Moving the handling of the invalid_list there
removes the need for the underdocumented kvm_sync_page_transient
function.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Return true if the page was synced (and the TLB must be flushed)
and false if the page was zapped.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calling kvm_unlink_unsync_page in the middle of __kvm_sync_page makes
things unnecessarily tricky. If kvm_mmu_prepare_zap_page is called,
it will call kvm_unlink_unsync_page too. So kvm_unlink_unsync_page can
be called just as well at the beginning or the end of __kvm_sync_page...
which means that we might do it in kvm_sync_page too and remove the
parameter.
kvm_sync_page ends up being the same code that kvm_sync_pages used
to have before the previous patch.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the last argument is true, kvm_unlink_unsync_page is called anyway in
__kvm_sync_page (either by kvm_mmu_prepare_zap_page or by __kvm_sync_page
itself). Therefore, kvm_sync_pages can just call kvm_sync_page, instead
of going through kvm_unlink_unsync_page+__kvm_sync_page.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
By doing this, kvm_sync_pages can use __kvm_sync_page instead of
reinventing it. Because of kvm_mmu_flush_or_zap, the code does not
end up being more complex than before, and more cleanups to kvm_sync_pages
will come in the next patches.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is a generalization of mmu_pte_write_flush_tlb, which also takes
care of calling kvm_mmu_commit_zap_page. The next patches will
introduce more uses.
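A sketch of what such a helper can look like (an approximation of the
patch, not a verbatim copy):

    /*
     * Commit any pending zaps (kvm_mmu_commit_zap_page already flushes
     * remote TLBs); otherwise perform the requested remote or local flush.
     */
    static void kvm_mmu_flush_or_zap(struct kvm_vcpu *vcpu,
                                     struct list_head *invalid_list,
                                     bool remote_flush, bool local_flush)
    {
        if (!list_empty(invalid_list)) {
            kvm_mmu_commit_zap_page(vcpu->kvm, invalid_list);
            return;
        }

        if (remote_flush)
            kvm_flush_remote_tlbs(vcpu->kvm);
        else if (local_flush)
            kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
    }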
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Give a special invalid index to the root of the walk, so that we
can check the consistency of kvm_mmu_pages and mmu_page_path.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Extracted from a bigger patch proposed by Guangrong. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_mmu_pages_init is doing some really yucky stuff. It is setting
up a sentinel for mmu_pages_clear_parents; however, because of a) the
way levels are numbered starting from 1 and b) the way mmu_page_path
sizes its arrays with PT64_ROOT_LEVEL-1 elements, the access can be
out of bounds. This is harmless because the code overwrites up to the
first two elements of parents->idx and these are initialized, and
because the sentinel is not needed in this case---mmu_pages_clear_parents
exits anyway when it gets to the end of the array. However ubsan
complains, and everyone else should too.
This fix does three things. First it makes the mmu_page_path arrays
PT64_ROOT_LEVEL elements in size, so that we can write to them without
checking the level in advance. Second it disintegrates kvm_mmu_pages_init
between mmu_unsync_walk (to reset the struct kvm_mmu_pages) and
for_each_sp (to place the NULL sentinel at the end of the current path).
This is okay because the mmu_page_path is only used in
mmu_pages_clear_parents; mmu_pages_clear_parents itself is called within
a for_each_sp iterator, and hence always after a call to mmu_pages_next.
Third it changes mmu_pages_clear_parents to just use the sentinel to
stop iteration, without checking the bounds on level.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Reported-by: Mike Krinkin <krinkin.m.u@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Register the notifier to receive write-track events so that we can
update our shadow page table.
This makes kvm_mmu_pte_write() the callback of the notifier; no
functionality is changed.
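Roughly, the registration looks like the sketch below; the notifier node
field and the init-time hook are assumptions based on this series:

    /*
     * Hedged sketch of registering the MMU as a write-track consumer;
     * kvm_mmu_pte_write() runs whenever a tracked gfn is written.
     */
    void kvm_mmu_init_vm(struct kvm *kvm)
    {
        struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;

        node->track_write = kvm_mmu_pte_write;
        kvm_page_track_register_notifier(kvm, node);
    }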
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that all non-leaf shadow pages are page tracked, if a gfn is not
tracked then no non-leaf shadow page for that gfn exists, so we can
directly make the shadow page of the gfn unsync.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Non-leaf shadow pages are always write protected, so they can be users
of page track.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the page fault is caused by a write access to a write-tracked page,
the real shadow page walk is skipped and we lose the chance to clear
write flooding for the page structure the current vcpu is using.
Fix it by locklessly walking the shadow page table to clear write
flooding on the shadow page structure, outside of mmu-lock. To allow
that, the count is changed to atomic_t.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A page fault caused by a write access to a write-tracked page cannot be
fixed up; it always needs to be emulated. page_fault_handle_page_track()
is the fast path we introduce here to skip holding mmu-lock and walking
the shadow page table.
However, if the page table is not present, it is worth making the page
table entry present and read-only to keep read accesses happy.
mmu_need_write_protect() needs to be adjusted so that the page does not
become writable when making the page table present or when syncing or
prefetching shadow page table entries.
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split rmap_write_protect() and introduce a function that abstracts the
write protection based on the slot.
This function will be used in a later patch.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Abstract the common operations from account_shadowed() and
unaccount_shadowed(), then introduce kvm_mmu_gfn_disallow_lpage()
and kvm_mmu_gfn_allow_lpage().
These two functions will be used by page tracking in a later patch.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_lpage_info->write_count is used to detect whether the large page
mapping for the gfn on the specified level is allowed; rename it to
disallow_lpage to reflect its purpose. We also rename
has_wrprotected_page() to mmu_gfn_lpage_is_disallowed() to make the
code clearer.
Later we will extend this mechanism for page tracking: if the gfn is
tracked then a large mapping for that gfn on any level is not allowed.
The new name is more straightforward.
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To make the intention clearer, use list_last_entry instead of
list_entry.
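For example (the list and member names are only meant to illustrate the
kind of call site touched):

    /* Before: the last element fetched by poking at ->prev directly. */
    sp = list_entry(kvm->arch.active_mmu_pages.prev,
                    struct kvm_mmu_page, link);

    /* After: same operation, but the intent is explicit. */
    sp = list_last_entry(&kvm->arch.active_mmu_pages,
                         struct kvm_mmu_page, link);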
Signed-off-by: Geliang Tang <geliangtang@163.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Rather than placing a handle_mmio_page_fault() call in each
vcpu->arch.mmu.page_fault() handler, moving it up to
kvm_mmu_page_fault() makes the code better:
- avoids code duplication
- for kvm_arch_async_page_ready(), which is the other caller of
vcpu->arch.mmu.page_fault(), removes an extra error_code check
- avoids returning both RET_MMIO_PF_* values and raw integer values
from vcpu->arch.mmu.page_fault()
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
These two have only slight differences:
- whether 'addr' is of type u64 or of type gva_t
- whether they have 'direct' parameter or not
Concerning the former, quickly_check_mmio_pf()'s u64 is better because
'addr' needs to be able to hold both a guest physical address and a
guest virtual address.
The latter is just a stylistic issue as we can always calculate the mode
from the 'vcpu' as is_mmio_page_fault() does. This patch keeps the
parameter to make the following patch cleaner.
In addition, the patch renames the function to mmio_info_in_cache() to
make it clear what it actually checks for.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
To date, we have implemented two I/O usage models for persistent memory,
PMEM (a persistent "ram disk") and DAX (mmap persistent memory into
userspace). This series adds a third, DAX-GUP, that allows DAX mappings
to be the target of direct-i/o. It allows userspace to coordinate
DMA/RDMA from/to persistent memory.
The implementation leverages the ZONE_DEVICE mm-zone that went into
4.3-rc1 (also discussed at kernel summit) to flag pages that are owned
and dynamically mapped by a device driver. The pmem driver, after
mapping a persistent memory range into the system memmap via
devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus
page-backed pmem-pfns via flags in the new pfn_t type.
The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the
resulting pte(s) inserted into the process page tables with a new
_PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys
off _PAGE_DEVMAP to pin the device hosting the page range active.
Finally, get_page() and put_page() are modified to take references
against the device driver established page mapping.
This need for "struct page" for persistent memory requires memory
capacity to store the memmap array. Given that the memmap array for a
large pool of persistent memory may exhaust available DRAM, introduce a
mechanism to allocate the memmap from persistent memory itself. The new
"struct vmem_altmap *" parameter to devm_memremap_pages() enables
arch_add_memory() to use reserved pmem capacity rather than the page
allocator.
This patch (of 18):
The core has developed a need for a "pfn_t" type [1]. Move the existing
pfn_t in KVM to kvm_pfn_t [2].
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html
[2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Christoffer Dall <christoffer.dall@linaro.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The comment had the meaning of mmu.gva_to_gpa and nested_mmu.gva_to_gpa
swapped. Fix that, and also add some details describing how each translation
works.
Signed-off-by: David Matlack <dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Not just in order to clean up the code, but to make it faster by using
enhanced instructions: the initialization became 20-30% faster on our
testing machine.
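The change amounts to replacing an open-coded zeroing loop with
clear_page(); a hedged before/after sketch (treat the old helper name
and loop bound as assumptions):

    /* Before (sketch): zero each shadow page table entry by hand. */
    static void init_shadow_page_table(struct kvm_mmu_page *sp)
    {
        int i;

        for (i = 0; i < PT64_ENT_PER_PAGE; ++i)
            sp->spt[i] = 0ull;
    }

    /* After (sketch): let the architecture-optimized clear_page() do it. */
    clear_page(sp->spt);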
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
As kvm_mmu_get_page() was changed so that a parent pointer does not get
into the sp->parent_ptes chain before the entry it points to is set
properly, we can use the for_each_rmap_spte macro instead of
pte_list_walk().
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Every time kvm_mmu_get_page() is called with a non-NULL parent_pte
argument, link_shadow_page() follows that to set the parent entry so
that the new mapping will point to the returned page table.
Moving parent_pte handling there allows us to clean up the code because
parent_pte is passed to kvm_mmu_get_page() just for mark_unsync() and
mmu_page_add_parent_pte().
In addition, the patch avoids calling mark_unsync() for other parents in
the sp->parent_ptes chain than the newly added parent_pte, because they
have been there since before the current page fault handling started.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Make kvm_mmu_alloc_page() do just what its name says, and remove
the extra allocation error check and zero-initialization of parent_ptes:
shadow page headers allocated by kmem_cache_zalloc() are always in the
per-VCPU pools.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
At some call sites of rmap_get_first() and rmap_get_next(), BUG_ON is
placed right after the call to detect unrelated sptes which must not be
found in the reverse-mapping list.
Move this check into rmap_get_first/next() so that all call sites, not
just the users of the for_each_rmap_spte() macro, will be checked the
same way.
One thing to keep in mind is that kvm_mmu_unlink_parents() also uses
rmap_get_first() to handle parent sptes. The change will not break it
because parent sptes are present, at least until drop_parent_pte()
actually unlinks them, and not mmio-sptes.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
is_rmap_spte(), originally named is_rmap_pte(), was introduced when the
simple reverse mapping was implemented by commit cd4a4e5374
("[PATCH] KVM: MMU: Implement simple reverse mapping"). At that point,
its role was clear and only rmap_add() and rmap_remove() were using it
to select sptes that need to be reverse-mapped.
Independently of that, is_shadow_present_pte() was first introduced by
commit c7addb9020 ("KVM: Allow not-present guest page faults to
bypass kvm") to do bypass_guest_pf optimization, which does not exist
any more.
These two seem to have changed their roles somewhat, and is_rmap_spte()
just calls is_shadow_present_pte() now.
Since using both of them without clear distinction just makes the code
confusing, remove is_rmap_spte().
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mmu_set_spte()'s code is based on the assumption that the emulate
parameter has a valid pointer value if set_spte() returns true and
write_fault is not zero. In other cases, emulate may be NULL, so a
NULL-check is needed.
Stop passing emulate pointer and make mmu_set_spte() return the emulate
value instead to clean up this complex interface. Prefetch functions
can just throw away the return value.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Both __mmu_unsync_walk() and mmu_pages_clear_parents() have a
three-line block of code that clears a bit in the unsync child bitmap;
the former places it inside a loop and uses a few goto statements to
jump to it.
A new helper function, clear_unsync_child_bit(), makes the code cleaner.
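A sketch of such a helper (close to what the patch adds, but details are
illustrative):

    /* Drop one unsync child: decrement the counter and clear its bitmap bit. */
    static inline void clear_unsync_child_bit(struct kvm_mmu_page *sp, int idx)
    {
        --sp->unsync_children;
        WARN_ON((int)sp->unsync_children < 0);
        __clear_bit(idx, sp->unsync_child_bitmap);
    }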
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
New struct kvm_rmap_head makes the code type-safe to some extent.
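The new type is essentially a one-field wrapper (sketch):

    /*
     * Wraps the old "unsigned long" rmap value so that rmap heads can no
     * longer be confused with arbitrary pointers or integers.
     */
    struct kvm_rmap_head {
        unsigned long val;
    };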
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 7a1638ce42 ("nEPT: Redefine EPT-specific link_shadow_page()",
2013-08-05) says:
Since nEPT doesn't support A/D bit, we should not set those bit
when building the shadow page table.
but this is not necessary. Even though nEPT doesn't support A/D
bits, and hence the vmcs12 EPT pointer will never enable them,
we always use them for shadow page tables if available (see
construct_eptp in vmx.c). So we can set the A/D bits freely
in the shadow page table.
This patch hence basically reverts commit 7a1638ce42.
Cc: Yang Zhang <yang.z.zhang@Intel.com>
Cc: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
They are exactly the same, except that handle_mmio_page_fault
has an unused argument and a call to WARN_ON. Remove the unused
argument from the callers, and move the warning to (the former)
handle_mmio_page_fault_common.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit fd13690218 ("KVM: x86: MMU: Move mapping_level_dirty_bitmap()
call in mapping_level()") forgot to initialize force_pt_level to false
in FNAME(page_fault)() before calling mapping_level() like
nonpaging_map() does. This can sometimes result in forcing page table
level mapping unnecessarily.
Fix this and move the first *force_pt_level check in mapping_level()
before kvm_vcpu_gfn_to_memslot() call to make it a bit clearer that
the variable must be initialized before mapping_level() gets called.
This change can also avoid calling kvm_vcpu_gfn_to_memslot() when
!check_hugepage_cache_consistency() check in tdp_page_fault() forces
page table level mapping.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Calling kvm_vcpu_gfn_to_memslot() twice in mapping_level() should be
avoided since getting a slot by binary search may not be negligible,
especially for virtual machines with many memory slots.
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Now that it has only one caller, and its name is not so helpful for
readers, remove it. The new memslot_valid_for_gpte() function
makes it possible to share the common code between
gfn_to_memslot_dirty_bitmap() and mapping_level().
Signed-off-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>