linux

Author	SHA1	Message	Date
Michel Lespinasse	89154dd531	mmap locking API: convert mmap_sem call sites missed by coccinelle Convert the last few remaining mmap_sem rwsem calls to use the new mmap locking API. These were missed by coccinelle for some reason (I think coccinelle does not support some of the preprocessor constructs in these files ?) [akpm@linux-foundation.org: convert linux-next leftovers] [akpm@linux-foundation.org: more linux-next leftovers] [akpm@linux-foundation.org: more linux-next leftovers] Signed-off-by: Michel Lespinasse <walken@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Davidlohr Bueso <dbueso@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jerome Glisse <jglisse@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Liam Howlett <Liam.Howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ying Han <yinghan@google.com> Link: http://lkml.kernel.org/r/20200520052908.204642-6-walken@google.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2020-06-09 09:39:14 -07:00
Sean Christopherson	3bae0459bc	KVM: x86/mmu: Drop KVM's hugepage enums in favor of the kernel's enums Replace KVM's PT_PAGE_TABLE_LEVEL, PT_DIRECTORY_LEVEL and PT_PDPE_LEVEL with the kernel's PG_LEVEL_4K, PG_LEVEL_2M and PG_LEVEL_1G. KVM's enums are borderline impossible to remember and result in code that is visually difficult to audit, e.g. if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PT_PDPE_LEVEL; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PT_DIRECTORY_LEVEL; else ept_lpage_level = PT_PAGE_TABLE_LEVEL; versus if (!enable_ept) ept_lpage_level = 0; else if (cpu_has_vmx_ept_1g_page()) ept_lpage_level = PG_LEVEL_1G; else if (cpu_has_vmx_ept_2m_page()) ept_lpage_level = PG_LEVEL_2M; else ept_lpage_level = PG_LEVEL_4K; No functional change intended. Suggested-by: Barret Rhoden <brho@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-4-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:11 -04:00
Sean Christopherson	b2f432f872	KVM: x86/mmu: Tweak PSE hugepage handling to avoid 2M vs 4M conundrum Change the PSE hugepage handling in walk_addr_generic() to fire on any page level greater than PT_PAGE_TABLE_LEVEL, a.k.a. PG_LEVEL_4K. PSE paging only has two levels, so "== 2" and "> 1" are functionally the same, i.e. this is a nop. A future patch will drop KVM's PT_*_LEVEL enums in favor of the kernel's PG_LEVEL_* enums, at which point "walker->level == PG_LEVEL_2M" is semantically incorrect (though still functionally ok). No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Message-Id: <20200428005422.4235-2-sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-05-15 12:26:10 -04:00
Paolo Bonzini	0cd665bd20	KVM: x86: cleanup kvm_inject_emulated_page_fault To reconstruct the kvm_mmu to be used for page fault injection, we can simply use fault->nested_page_fault. This matches how fault->nested_page_fault is assigned in the first place by FNAME(walk_addr_generic). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-04-20 17:26:05 -04:00
Linus Torvalds	8c1b724ddb	ARM: * GICv4.1 support * 32bit host removal PPC: * secure (encrypted) using under the Protected Execution Framework ultravisor s390: * allow disabling GISA (hardware interrupt injection) and protected VMs/ultravisor support. x86: * New dirty bitmap flag that sets all bits in the bitmap when dirty page logging is enabled; this is faster because it doesn't require bulk modification of the page tables. * Initial work on making nested SVM event injection more similar to VMX, and less buggy. * Various cleanups to MMU code (though the big ones and related optimizations were delayed to 5.8). Instead of using cr3 in function names which occasionally means eptp, KVM too has standardized on "pgd". * A large refactoring of CPUID features, which now use an array that parallels the core x86_features. * Some removal of pointer chasing from kvm_x86_ops, which will also be switched to static calls as soon as they are available. * New Tigerlake CPUID features. * More bugfixes, optimizations and cleanups. Generic: * selftests: cleanups, new MMU notifier stress test, steal-time test * CSV output for kvm_stat. KVM/MIPS has been broken since 5.5, it does not compile due to a patch committed by MIPS maintainers. I had already prepared a fix, but the MIPS maintainers prefer to fix it in generic code rather than KVM so they are taking care of it. 
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull kvm updates from Paolo Bonzini: "ARM: - GICv4.1 support - 32bit host removal PPC: - secure (encrypted) using under the Protected Execution Framework ultravisor s390: - allow disabling GISA (hardware interrupt injection) and protected VMs/ultravisor support. x86: - New dirty bitmap flag that sets all bits in the bitmap when dirty page logging is enabled; this is faster because it doesn't require bulk modification of the page tables. - Initial work on making nested SVM event injection more similar to VMX, and less buggy. - Various cleanups to MMU code (though the big ones and related optimizations were delayed to 5.8). Instead of using cr3 in function names which occasionally means eptp, KVM too has standardized on "pgd". - A large refactoring of CPUID features, which now use an array that parallels the core x86_features. - Some removal of pointer chasing from kvm_x86_ops, which will also be switched to static calls as soon as they are available. - New Tigerlake CPUID features. - More bugfixes, optimizations and cleanups. 
Generic: - selftests: cleanups, new MMU notifier stress test, steal-time test - CSV output for kvm_stat" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (277 commits) x86/kvm: fix a missing-prototypes "vmread_error" KVM: x86: Fix BUILD_BUG() in __cpuid_entry_get_reg() w/ CONFIG_UBSAN=y KVM: VMX: Add a trampoline to fix VMREAD error handling KVM: SVM: Annotate svm_x86_ops as __initdata KVM: VMX: Annotate vmx_x86_ops as __initdata KVM: x86: Drop __exit from kvm_x86_ops' hardware_unsetup() KVM: x86: Copy kvm_x86_ops by value to eliminate layer of indirection KVM: x86: Set kvm_x86_ops only after ->hardware_setup() completes KVM: VMX: Configure runtime hooks using vmx_x86_ops KVM: VMX: Move hardware_setup() definition below vmx_x86_ops KVM: x86: Move init-only kvm_x86_ops to separate struct KVM: Pass kvm_init()'s opaque param to additional arch funcs s390/gmap: return proper error code on ksm unsharing KVM: selftests: Fix cosmetic copy-paste error in vm_mem_region_move() KVM: Fix out of range accesses to memslots KVM: X86: Micro-optimize IPI fastpath delay KVM: X86: Delay read msr data iff writes ICR MSR KVM: PPC: Book3S HV: Add a capability for enabling secure guests KVM: arm64: GICv4.1: Expose HW-based SGIs in debugfs KVM: arm64: GICv4.1: Allow non-trapping WFI when using HW SGIs ...	2020-04-02 15:13:15 -07:00
Linus Torvalds	fdf5563a72	Merge branch 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 cleanups from Ingo Molnar: "This topic tree contains more commits than usual: - most of them are uaccess cleanups/reorganization by Al - there's a bunch of prototype declaration (-Wmissing-prototypes) cleanups - misc other cleanups all around the map" * 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits) x86/mm/set_memory: Fix -Wmissing-prototypes warnings x86/efi: Add a prototype for efi_arch_mem_reserve() x86/mm: Mark setup_emu2phys_nid() static x86/jump_label: Move 'inline' keyword placement x86/platform/uv: Add a missing prototype for uv_bau_message_interrupt() kill uaccess_try() x86: unsafe_put-style macro for sigmask x86: x32_setup_rt_frame(): consolidate uaccess areas x86: __setup_rt_frame(): consolidate uaccess areas x86: __setup_frame(): consolidate uaccess areas x86: setup_sigcontext(): list user_access_{begin,end}() into callers x86: get rid of put_user_try in __setup_rt_frame() (both 32bit and 64bit) x86: ia32_setup_rt_frame(): consolidate uaccess areas x86: ia32_setup_frame(): consolidate uaccess areas x86: ia32_setup_sigcontext(): lift user_access_{begin,end}() into the callers x86/alternatives: Mark text_poke_loc_init() static x86/cpu: Fix a -Wmissing-prototypes warning for init_ia32_feat_ctl() x86/mm: Drop pud_mknotpresent() x86: Replace setup_irq() by request_irq() x86/configs: Slightly reduce defconfigs ...	2020-03-31 11:04:05 -07:00
Sean Christopherson	d8dd54e063	KVM: x86/mmu: Rename kvm_mmu->get_cr3() to ->get_guest_pgd() Rename kvm_mmu->get_cr3() to call out that it is retrieving a guest value, as opposed to kvm_mmu->set_cr3(), which sets a host value, and to note that it will return something other than CR3 when nested EPT is in use. Hopefully the new name will also make it more obvious that L1's nested_cr3 is returned in SVM's nested NPT case. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:57:46 +01:00
Sean Christopherson	bb1fcc70d9	KVM: nVMX: Allow L1 to use 5-level page walks for nested EPT Add support for 5-level nested EPT, and advertise said support in the EPT capabilities MSR. KVM's MMU can already handle 5-level legacy page tables, there's no reason to force an L1 VMM to use shadow paging if it wants to employ 5-level page tables. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-03-16 17:57:44 +01:00
Al Viro	a481444399	x86 kvm page table walks: switch to explicit __get_user() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>	2020-02-15 17:26:26 -05:00
Sean Christopherson	f6ab0107a4	KVM: x86/mmu: Fix struct guest_walker arrays for 5-level paging Define PT_MAX_FULL_LEVELS as PT64_ROOT_MAX_LEVEL, i.e. 5, to fix shadow paging for 5-level guest page tables. PT_MAX_FULL_LEVELS is used to size the arrays that track guest page table information, i.e. using a "max levels" of 4 causes KVM to access garbage beyond the end of an array when querying state for level 5 entries. E.g. FNAME(gpte_changed) will read garbage and most likely return %true for a level 5 entry, soft-hanging the guest because FNAME(fetch) will restart the guest instead of creating SPTEs because it thinks the guest PTE has changed. Note, KVM doesn't yet support 5-level nested EPT, so PT_MAX_FULL_LEVELS gets to stay "4" for the PTTYPE_EPT case. Fixes: `855feb6736` ("KVM: MMU: Add 5 level EPT & Shadow page table support.") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-02-12 20:09:44 +01:00
Sean Christopherson	293e306e7f	KVM: x86/mmu: Fold max_mapping_level() into kvm_mmu_hugepage_adjust() Fold max_mapping_level() into kvm_mmu_hugepage_adjust() now that HugeTLB mappings are handled in kvm_mmu_hugepage_adjust(), i.e. there isn't a need to pre-calculate the max mapping level. Co-locating all hugepage checks eliminates a memslot lookup, at the cost of performing the __mmu_gfn_lpage_is_disallowed() checks while holding mmu_lock. The latency of lpage_is_disallowed() is likely negligible relative to the rest of the code run while holding mmu_lock, and can be offset to some extent by eliminating the mmu_gfn_lpage_is_disallowed() check in set_spte() in a future patch. Eliminating the check in set_spte() is made possible by performing the initial lpage_is_disallowed() checks while holding mmu_lock. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-27 20:00:08 +01:00
Sean Christopherson	09c4453ee8	KVM: x86/mmu: Remove obsolete gfn restoration in FNAME(fetch) Remove logic to retrieve the original gfn now that HugeTLB mappings are identified in FNAME(fetch), i.e. FNAME(page_fault) no longer adjusts the level or gfn. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-27 20:00:07 +01:00
Sean Christopherson	83f06fa7a6	KVM: x86/mmu: Rely on host page tables to find HugeTLB mappings Remove KVM's HugeTLB specific logic and instead rely on walking the host page tables (already done for THP) to identify HugeTLB mappings. Eliminating the HugeTLB-only logic avoids taking mmap_sem and calling find_vma() for all hugepage compatible page faults, and simplifies KVM's page fault code by consolidating all hugepage adjustments into a common helper. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-27 20:00:06 +01:00
Sean Christopherson	17eff01904	KVM: x86/mmu: Refactor THP adjust to prep for changing query Refactor transparent_hugepage_adjust() in preparation for walking the host page tables to identify hugepage mappings, initially for THP pages, and eventually for HugeTLB and DAX-backed pages as well. The latter cases support 1gb pages, i.e. the adjustment logic needs access to the max allowed level. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-27 20:00:04 +01:00
Sean Christopherson	b5c3c1b3c6	KVM: x86/mmu: Micro-optimize nEPT's bad memtype/XWR checks Rework the handling of nEPT's bad memtype/XWR checks to micro-optimize the checks as much as possible. Move the check to a separate helper, __is_bad_mt_xwr(), which allows the guest_rsvd_check usage in paging_tmpl.h to omit the check entirely for paging32/64 (bad_mt_xwr is always zero for non-nEPT) while retaining the bitwise-OR of the current code for the shadow_zero_check in walk_shadow_page_get_mmio_spte(). Add a comment for the bitwise-OR usage in the mmio spte walk to avoid future attempts to "fix" the code, which is what prompted this optimization in the first place[*]. Opportunistically remove the superfluous '!= 0' and parentheses, and use BIT_ULL() instead of open coding its equivalent. The net effect is that code generation is largely unchanged for walk_shadow_page_get_mmio_spte(), marginally better for ept_prefetch_invalid_gpte(), and significantly improved for paging32/64_prefetch_invalid_gpte(). Note, walk_shadow_page_get_mmio_spte() can't use a templated version of the memtype/XWR as it works on the host's shadow PTEs, e.g. checks that KVM hasn't borked its EPT tables. Even if it could be templated, the benefits of having a single implementation far outweigh the few uops that would be saved for NPT or non-TDP paging, e.g. most compilers inline it all the way up to kvm_mmu_page_fault(). [*] https://lkml.kernel.org/r/20200108001859.25254-1-sean.j.christopherson@intel.com Cc: Jim Mattson <jmattson@google.com> Cc: David Laight <David.Laight@ACULAB.COM> Cc: Arvind Sankar <nivedita@alum.mit.edu> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-21 14:45:31 +01:00
Sean Christopherson	f8052a053a	KVM: x86/mmu: Reorder the reserved bit check in prefetch_invalid_gpte() Move the !PRESENT and !ACCESSED checks in FNAME(prefetch_invalid_gpte) above the call to is_rsvd_bits_set(). For a well behaved guest, the !PRESENT and !ACCESSED are far more likely to evaluate true than the reserved bit checks, and they do not require additional memory accesses. Before: Dump of assembler code for function paging32_prefetch_invalid_gpte: 0x0000000000044240 <+0>: callq 0x44245 <paging32_prefetch_invalid_gpte+5> 0x0000000000044245 <+5>: mov %rcx,%rax 0x0000000000044248 <+8>: shr $0x7,%rax 0x000000000004424c <+12>: and $0x1,%eax 0x000000000004424f <+15>: lea 0x0(,%rax,4),%r8 0x0000000000044257 <+23>: add %r8,%rax 0x000000000004425a <+26>: mov %rcx,%r8 0x000000000004425d <+29>: and 0x120(%rsi,%rax,8),%r8 0x0000000000044265 <+37>: mov 0x170(%rsi),%rax 0x000000000004426c <+44>: shr %cl,%rax 0x000000000004426f <+47>: and $0x1,%eax 0x0000000000044272 <+50>: or %rax,%r8 0x0000000000044275 <+53>: jne 0x4427c <paging32_prefetch_invalid_gpte+60> 0x0000000000044277 <+55>: test $0x1,%cl 0x000000000004427a <+58>: jne 0x4428a <paging32_prefetch_invalid_gpte+74> 0x000000000004427c <+60>: mov %rdx,%rsi 0x000000000004427f <+63>: callq 0x44080 <drop_spte> 0x0000000000044284 <+68>: mov $0x1,%eax 0x0000000000044289 <+73>: retq 0x000000000004428a <+74>: xor %eax,%eax 0x000000000004428c <+76>: and $0x20,%ecx 0x000000000004428f <+79>: jne 0x44289 <paging32_prefetch_invalid_gpte+73> 0x0000000000044291 <+81>: mov %rdx,%rsi 0x0000000000044294 <+84>: callq 0x44080 <drop_spte> 0x0000000000044299 <+89>: mov $0x1,%eax 0x000000000004429e <+94>: jmp 0x44289 <paging32_prefetch_invalid_gpte+73> End of assembler dump. 
After: Dump of assembler code for function paging32_prefetch_invalid_gpte: 0x0000000000044240 <+0>: callq 0x44245 <paging32_prefetch_invalid_gpte+5> 0x0000000000044245 <+5>: test $0x1,%cl 0x0000000000044248 <+8>: je 0x4424f <paging32_prefetch_invalid_gpte+15> 0x000000000004424a <+10>: test $0x20,%cl 0x000000000004424d <+13>: jne 0x4425d <paging32_prefetch_invalid_gpte+29> 0x000000000004424f <+15>: mov %rdx,%rsi 0x0000000000044252 <+18>: callq 0x44080 <drop_spte> 0x0000000000044257 <+23>: mov $0x1,%eax 0x000000000004425c <+28>: retq 0x000000000004425d <+29>: mov %rcx,%rax 0x0000000000044260 <+32>: mov (%rsi),%rsi 0x0000000000044263 <+35>: shr $0x7,%rax 0x0000000000044267 <+39>: and $0x1,%eax 0x000000000004426a <+42>: lea 0x0(,%rax,4),%r8 0x0000000000044272 <+50>: add %r8,%rax 0x0000000000044275 <+53>: mov %rcx,%r8 0x0000000000044278 <+56>: and 0x120(%rsi,%rax,8),%r8 0x0000000000044280 <+64>: mov 0x170(%rsi),%rax 0x0000000000044287 <+71>: shr %cl,%rax 0x000000000004428a <+74>: and $0x1,%eax 0x000000000004428d <+77>: mov %rax,%rcx 0x0000000000044290 <+80>: xor %eax,%eax 0x0000000000044292 <+82>: or %rcx,%r8 0x0000000000044295 <+85>: je 0x4425c <paging32_prefetch_invalid_gpte+28> 0x0000000000044297 <+87>: mov %rdx,%rsi 0x000000000004429a <+90>: callq 0x44080 <drop_spte> 0x000000000004429f <+95>: mov $0x1,%eax 0x00000000000442a4 <+100>: jmp 0x4425c <paging32_prefetch_invalid_gpte+28> End of assembler dump. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-21 14:45:30 +01:00
Sean Christopherson	0c7a98e34d	KVM: x86/mmu: WARN on an invalid root_hpa WARN on the existing invalid root_hpa checks in __direct_map() and FNAME(fetch). The "legitimate" path that invalidated root_hpa in the middle of a page fault is long since gone, i.e. it should no longer be possible to invalidate in the middle of a page fault[*]. The root_hpa checks were added by two related commits `989c6b34f6` ("KVM: MMU: handle invalid root_hpa at __direct_map") `37f6a4e237` ("KVM: x86: handle invalid root_hpa everywhere") to fix a bug where nested_vmx_vmexit() could be called *in the middle* of a page fault. At the time, vmx_interrupt_allowed(), which was and still is used by kvm_can_do_async_pf() via ->interrupt_allowed(), directly invoked nested_vmx_vmexit() to switch from L2 to L1 to emulate a VM-Exit on a pending interrupt. Emulating the nested VM-Exit resulted in root_hpa being invalidated by kvm_mmu_reset_context() without explicitly terminating the page fault. Now that root_hpa is checked for validity by kvm_mmu_page_fault(), WARN on an invalid root_hpa to detect any flows that reset the MMU while handling a page fault. The broken vmx_interrupt_allowed() behavior has long since been fixed and resetting the MMU during a page fault should not be considered legal behavior. [*] It's actually technically possible in FNAME(page_fault)() because it calls inject_page_fault() when the guest translation is invalid, but in that case the page fault handling is immediately terminated. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-08 18:16:08 +01:00
Sean Christopherson	4cd071d13c	KVM: x86/mmu: Move calls to thp_adjust() down a level Move the calls to thp_adjust() down a level from the page fault handlers to the map/fetch helpers and remove the page count shuffling done in thp_adjust(). Despite holding a reference to the underlying page while processing a page fault, the page fault flows don't actually rely on holding a reference to the page when thp_adjust() is called. At that point, the fault handlers hold mmu_lock, which prevents mmu_notifier from completing any invalidations, and have verified no invalidations from mmu_notifier have occurred since the page reference was acquired (which is done prior to taking mmu_lock). The kvm_release_pfn_clean()/kvm_get_pfn() dance in thp_adjust() is a quirk that is necessitated because thp_adjust() modifies the pfn that is consumed by its caller. Because the page fault handlers call kvm_release_pfn_clean() on said pfn, thp_adjust() needs to transfer the reference to the correct pfn purely for correctness when the pfn is released. Calling thp_adjust() from __direct_map() and FNAME(fetch) means the pfn adjustment doesn't change the pfn as seen by the page fault handlers, i.e. the pfn released by the page fault handlers is the same pfn that was returned by gfn_to_pfn(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-08 18:16:07 +01:00
Sean Christopherson	cbe1e6f035	KVM: x86/mmu: Incorporate guest's page level into max level for shadow MMU Restrict the max level for a shadow page based on the guest's level instead of capping the level after the fact for host-mapped huge pages, e.g. hugetlbfs pages. Explicitly capping the max level using the guest mapping level also eliminates FNAME(page_fault)'s subtle dependency on THP only supporting 2mb pages. No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-08 18:16:05 +01:00
Sean Christopherson	39ca1ecb78	KVM: x86/mmu: Refactor handling of forced 4k pages in page faults Refactor the page fault handlers and mapping_level() to track the max allowed page level instead of only tracking if a 4k page is mandatory due to one restriction or another. This paves the way for cleanly consolidating tdp_page_fault() and nonpaging_page_fault(), and for eliminating a redundant check on mmu_gfn_lpage_is_disallowed(). No functional change intended. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-08 18:16:05 +01:00
Sean Christopherson	736c291c9f	KVM: x86: Use gpa_t for cr2/gpa to fix TDP support on 32-bit KVM Convert a plethora of parameters and variables in the MMU and page fault flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM. Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical addresses. When TDP is enabled, the fault address is a guest physical address and thus can be a 64-bit value, even when both KVM and its guest are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a 64-bit field, not a natural width field. Using a gva_t for the fault address means KVM will incorrectly drop the upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to translate L2 GPAs to L1 GPAs. Opportunistically rename variables and parameters to better reflect the dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain "addr" instead of "vaddr" when the address may be either a GVA or an L2 GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid a confusing "gpa_t gva" declaration; this also sets the stage for a future patch to combine nonpaging_page_fault() and tdp_page_fault() with minimal churn. Sprinkle in a few comments to document flows where an address is known to be a GVA and thus can be safely truncated to a 32-bit value. Add WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help document such cases and detect bugs. Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2020-01-08 18:16:02 +01:00
Paolo Bonzini	c50d8ae3a1	KVM: x86: create mmu/ subdirectory Preparatory work for shattering mmu.c into multiple files. Besides making it easier to follow, this will also make it possible to write unit tests for various parts. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2019-11-21 12:03:50 +01:00

22 Commits