linux

Author	SHA1	Message	Date
Andi Kleen	a5df70c354	perf/x86: Only show format attributes when supported Only show the Intel format attributes in sysfs when the feature is actually supported with the current model numbers. This allows programs to probe what format attributes are available, and give a sensible error message to users if they are not. This handles near all cases for intel attributes since Nehalem, except the (obscure) case when the model number if known, but PEBS is disabled in PERF_CAPABILITIES. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170822185201.9261-2-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-25 11:04:18 +02:00
Andi Kleen	6ae5fa61d2	perf/x86: Fix data source decoding for Skylake Skylake changed the encoding of the PEBS data source field. Some combinations are not available anymore, but some new cases e.g. for L4 cache hit are added. Fix up the conversion table for Skylake, similar as had been done for Nehalem. On Skylake server the encoding for L4 actually means persistent memory. Handle this case too. To properly describe it in the abstracted perf format I had to add some new fields. Since a hit can have only one level add a new field that is an enumeration, not a bit field to describe the level. It can describe any level. Some numbers are also used to describe PMEM and LFB. Also add a new generic remote flag that can be combined with the generic level to signify a remote cache. And there is an extension field for the snoop indication to handle the Forward state. I didn't add a generic flag for hops because it's not needed for Skylake. I changed the existing encodings for older CPUs to also fill in the new level and remote fields. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: http://lkml.kernel.org/r/20170816222156.19953-3-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-25 11:04:17 +02:00
Andi Kleen	9529835514	perf/x86: Move Nehalem PEBS code to flag Minor cleanup: use an explicit x86_pmu flag to handle the missing Lock / TLB information on Nehalem, instead of always checking the model number for each PEBS sample. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: acme@kernel.org Cc: jolsa@kernel.org Link: http://lkml.kernel.org/r/20170816222156.19953-2-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-25 11:04:16 +02:00
Eric Biggers	ccd5b32351	x86/mm: Fix use-after-free of ldt_struct The following commit: `39a0526fb3` ("x86/mm: Factor out LDT init from context init") renamed init_new_context() to init_new_context_ldt() and added a new init_new_context() which calls init_new_context_ldt(). However, the error code of init_new_context_ldt() was ignored. Consequently, if a memory allocation in alloc_ldt_struct() failed during a fork(), the ->context.ldt of the new task remained the same as that of the old task (due to the memcpy() in dup_mm()). ldt_struct's are not intended to be shared, so a use-after-free occurred after one task exited. Fix the bug by making init_new_context() pass through the error code of init_new_context_ldt(). This bug was found by syzkaller, which encountered the following splat: BUG: KASAN: use-after-free in free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116 Read of size 4 at addr ffff88006d2cb7c8 by task kworker/u9:0/3710 CPU: 1 PID: 3710 Comm: kworker/u9:0 Not tainted 4.13.0-rc4-next-20170811 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0x194/0x257 lib/dump_stack.c:52 print_address_description+0x73/0x250 mm/kasan/report.c:252 kasan_report_error mm/kasan/report.c:351 [inline] kasan_report+0x24e/0x340 mm/kasan/report.c:409 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:429 free_ldt_struct.part.2+0x10a/0x150 arch/x86/kernel/ldt.c:116 free_ldt_struct arch/x86/kernel/ldt.c:173 [inline] destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171 destroy_context arch/x86/include/asm/mmu_context.h:157 [inline] __mmdrop+0xe9/0x530 kernel/fork.c:889 mmdrop include/linux/sched/mm.h:42 [inline] exec_mmap fs/exec.c:1061 [inline] flush_old_exec+0x173c/0x1ff0 fs/exec.c:1291 load_elf_binary+0x81f/0x4ba0 fs/binfmt_elf.c:855 search_binary_handler+0x142/0x6b0 fs/exec.c:1652 exec_binprm fs/exec.c:1694 [inline] do_execveat_common.isra.33+0x1746/0x22e0 fs/exec.c:1816 do_execve+0x31/0x40 fs/exec.c:1860 call_usermodehelper_exec_async+0x457/0x8f0 kernel/umh.c:100 ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431 Allocated by task 3700: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:551 kmem_cache_alloc_trace+0x136/0x750 mm/slab.c:3627 kmalloc include/linux/slab.h:493 [inline] alloc_ldt_struct+0x52/0x140 arch/x86/kernel/ldt.c:67 write_ldt+0x7b7/0xab0 arch/x86/kernel/ldt.c:277 sys_modify_ldt+0x1ef/0x240 arch/x86/kernel/ldt.c:307 entry_SYSCALL_64_fastpath+0x1f/0xbe Freed by task 3700: save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59 save_stack+0x43/0xd0 mm/kasan/kasan.c:447 set_track mm/kasan/kasan.c:459 [inline] kasan_slab_free+0x71/0xc0 mm/kasan/kasan.c:524 __cache_free mm/slab.c:3503 [inline] kfree+0xca/0x250 mm/slab.c:3820 free_ldt_struct.part.2+0xdd/0x150 arch/x86/kernel/ldt.c:121 free_ldt_struct arch/x86/kernel/ldt.c:173 [inline] destroy_context_ldt+0x60/0x80 arch/x86/kernel/ldt.c:171 destroy_context arch/x86/include/asm/mmu_context.h:157 [inline] __mmdrop+0xe9/0x530 kernel/fork.c:889 mmdrop include/linux/sched/mm.h:42 [inline] __mmput kernel/fork.c:916 [inline] mmput+0x541/0x6e0 kernel/fork.c:927 copy_process.part.36+0x22e1/0x4af0 kernel/fork.c:1931 copy_process kernel/fork.c:1546 [inline] _do_fork+0x1ef/0xfb0 kernel/fork.c:2025 SYSC_clone kernel/fork.c:2135 [inline] SyS_clone+0x37/0x50 kernel/fork.c:2129 do_syscall_64+0x26c/0x8c0 arch/x86/entry/common.c:287 return_from_SYSCALL_64+0x0/0x7a Here is a C reproducer: #include <asm/ldt.h> #include <pthread.h> #include <signal.h> #include <stdlib.h> #include <sys/syscall.h> #include <sys/wait.h> #include <unistd.h> static void fork_thread(void _arg) { fork(); } int main(void) { struct user_desc desc = { .entry_number = 8191 }; syscall(__NR_modify_ldt, 1, &desc, sizeof(desc)); for (;;) { if (fork() == 0) { pthread_t t; srand(getpid()); pthread_create(&t, NULL, fork_thread, NULL); usleep(rand() % 10000); syscall(__NR_exit_group, 0); } wait(NULL); } } Note: the reproducer takes advantage of the fact that alloc_ldt_struct() may use vmalloc() to allocate a large ->entries array, and after commit: `5d17a73a2e` ("vmalloc: back off when the current task is killed") it is possible for userspace to fail a task's vmalloc() by sending a fatal signal, e.g. via exit_group(). It would be more difficult to reproduce this bug on kernels without that commit. This bug only affected kernels with CONFIG_MODIFY_LDT_SYSCALL=y. Signed-off-by: Eric Biggers <ebiggers@google.com> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Cc: <stable@vger.kernel.org> [v4.6+] Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@amacapital.net> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Fixes: `39a0526fb3` ("x86/mm: Factor out LDT init from context init") Link: http://lkml.kernel.org/r/20170824175029.76040-1-ebiggers3@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-25 09:55:52 +02:00
Paolo Bonzini	38cfd5e3df	KVM, pkeys: do not use PKRU value in vcpu->arch.guest_fpu.state The host pkru is restored right after vcpu exit (commit `1be0e61`), so KVM_GET_XSAVE will return the host PKRU value instead. Fix this by using the guest PKRU explicitly in fill_xsave and load_xsave. This part is based on a patch by Junkang Fu. The host PKRU data may also not match the value in vcpu->arch.guest_fpu.state, because it could have been changed by userspace since the last time it was saved, so skip loading it in kvm_load_guest_fpu. Reported-by: Junkang Fu <junkang.fjk@alibaba-inc.com> Cc: Yang Zhang <zy107165@alibaba-inc.com> Fixes: `1be0e61c1f` Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-25 09:28:37 +02:00
Paolo Bonzini	b9dd21e104	KVM: x86: simplify handling of PKRU Move it to struct kvm_arch_vcpu, replacing guest_pkru_valid with a simple comparison against the host value of the register. The write of PKRU in addition can be skipped if the guest has not enabled the feature. Once we do this, we need not test OSPKE in the host anymore, because guest_CR4.PKE=1 implies host_CR4.PKE=1. The static PKU test is kept to elide the code on older CPUs. Suggested-by: Yang Zhang <zy107165@alibaba-inc.com> Fixes: `1be0e61c1f` Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-25 09:28:28 +02:00
Paolo Bonzini	c469268cd5	KVM: x86: block guest protection keys unless the host has them enabled If the host has protection keys disabled, we cannot read and write the guest PKRU---RDPKRU and WRPKRU fail with #GP(0) if CR4.PKE=0. Block the PKU cpuid bit in that case. This ensures that guest_CR4.PKE=1 implies host_CR4.PKE=1. Fixes: `1be0e61c1f` Cc: stable@vger.kernel.org Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-25 09:28:02 +02:00
Thomas Gleixner	c0bb80cfa3	Merge branch 'x86/asm' into x86/apic Pick up dependent changes to avoid merge conflicts	2017-08-25 08:56:22 +02:00
Florian Fainelli	2fb44600fe	um: Fix check for _xstate for older hosts Commit `0a98764567` ("um: Allow building and running on older hosts") attempted to check for PTRACE_{GET,SET}REGSET under the premise that these ptrace(2) parameters were directly linked with the presence of the _xstate structure. After Richard's commit `61e8d46245` ("um: Correctly check for PTRACE_GETRESET/SETREGSET") which properly included linux/ptrace.h instead of asm/ptrace.h, we could get into the original build failure that I reported: arch/x86/um/user-offsets.c: In function 'foo': arch/x86/um/user-offsets.c:54: error: invalid application of 'sizeof' to incomplete type 'struct _xstate' On this particular host, we do have PTRACE_GETREGSET and PTRACE_SETREGSET defined in linux/ptrace.h, but not the structure _xstate that should be pulled from the following include chain: signal.h -> bits/sigcontext.h. Correctly fix this by checking for FP_XSTATE_MAGIC1 which is the correct way to see if struct _xstate is available or not on the host. Fixes: `61e8d46245` ("um: Correctly check for PTRACE_GETRESET/SETREGSET") Fixes: `0a98764567` ("um: Allow building and running on older hosts") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Richard Weinberger <richard@nod.at>	2017-08-24 21:52:28 +02:00
Wanpeng Li	bfcf83b144	KVM: nVMX: Fix trying to cancel vmlauch/vmresume ------------[ cut here ]------------ WARNING: CPU: 7 PID: 3861 at /home/kernel/ssd/kvm/arch/x86/kvm//vmx.c:11299 nested_vmx_vmexit+0x176e/0x1980 [kvm_intel] CPU: 7 PID: 3861 Comm: qemu-system-x86 Tainted: G W OE 4.13.0-rc4+ #11 RIP: 0010:nested_vmx_vmexit+0x176e/0x1980 [kvm_intel] Call Trace: ? kvm_multiple_exception+0x149/0x170 [kvm] ? handle_emulation_failure+0x79/0x230 [kvm] ? load_vmcs12_host_state+0xa80/0xa80 [kvm_intel] ? check_chain_key+0x137/0x1e0 ? reexecute_instruction.part.168+0x130/0x130 [kvm] nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel] ? nested_vmx_inject_exception_vmexit+0xb7/0x100 [kvm_intel] vmx_queue_exception+0x197/0x300 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x1b0c/0x2c90 [kvm] ? kvm_arch_vcpu_runnable+0x220/0x220 [kvm] ? preempt_count_sub+0x18/0xc0 ? restart_apic_timer+0x17d/0x300 [kvm] ? kvm_lapic_restart_hv_timer+0x37/0x50 [kvm] ? kvm_arch_vcpu_load+0x1d8/0x350 [kvm] kvm_vcpu_ioctl+0x4e4/0x910 [kvm] ? kvm_vcpu_ioctl+0x4e4/0x910 [kvm] ? kvm_dev_ioctl+0xbe0/0xbe0 [kvm] The flag "nested_run_pending", which can override the decision of which should run next, L1 or L2. nested_run_pending=1 means that we must run L2 next, not L1. This is necessary in particular when L1 did a VMLAUNCH of L2 and therefore expects L2 to be run (and perhaps be injected with an event it specified, etc.). Nested_run_pending is especially intended to avoid switching to L1 in the injection decision-point. This can be handled just like the other cases in vmx_check_nested_events, instead of having a special case in vmx_queue_exception. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:22:21 +02:00
Wanpeng Li	664f8e26b0	KVM: X86: Fix loss of exception which has not yet been injected vmx_complete_interrupts() assumes that the exception is always injected, so it can be dropped by kvm_clear_exception_queue(). However, an exception cannot be injected immediately if it is: 1) originally destined to a nested guest; 2) trapped to cause a vmexit; 3) happening right after VMLAUNCH/VMRESUME, i.e. when nested_run_pending is true. This patch applies to exceptions the same algorithm that is used for NMIs, replacing exception.reinject with "exception.injected" (equivalent to nmi_injected). exception.pending now represents an exception that is queued and whose side effects (e.g., update RFLAGS.RF or DR7) have not been applied yet. If exception.pending is true, the exception might result in a nested vmexit instead, too (in which case the side effects must not be applied). exception.injected instead represents an exception that is going to be injected into the guest at the next vmentry. Reported-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:19 +02:00
Wanpeng Li	274bba52a0	KVM: VMX: use kvm_event_needs_reinjection Use kvm_event_needs_reinjection() encapsulation. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:19 +02:00
Paolo Bonzini	09f037aa48	KVM: MMU: speedup update_permission_bitmask update_permission_bitmask currently does a 128-iteration loop to, essentially, compute a constant array. Computing the 8 bits in parallel reduces it to 16 iterations, and is enough to speed it up substantially because many boolean operations in the inner loop become constants or simplify noticeably. Because update_permission_bitmask is actually the top item in the profile for nested vmexits, this speeds up an L2->L1 vmexit by about ten thousand clock cycles, or up to 30%: before after cpuid 35173 25954 vmcall 35122 27079 inl_from_pmtimer 52635 42675 inl_from_qemu 53604 44599 inl_from_kernel 38498 30798 outl_to_kernel 34508 28816 wr_tsc_adjust_msr 34185 26818 rd_tsc_adjust_msr 37409 27049 mmio-no-eventfd:pci-mem 50563 45276 mmio-wildcard-eventfd:pci-mem 34495 30823 mmio-datamatch-eventfd:pci-mem 35612 31071 portio-no-eventfd:pci-io 44925 40661 portio-wildcard-eventfd:pci-io 29708 27269 portio-datamatch-eventfd:pci-io 31135 27164 (I wrote a small C program to compare the tables for all values of CR0.WP, CR4.SMAP and CR4.SMEP, and they match). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:18 +02:00
Yu Zhang	fd8cb43373	KVM: MMU: Expose the LA57 feature to VM. This patch exposes 5 level page table feature to the VM. At the same time, the canonical virtual address checking is extended to support both 48-bits and 57-bits address width. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:17 +02:00
Yu Zhang	855feb6736	KVM: MMU: Add 5 level EPT & Shadow page table support. Extends the shadow paging code, so that 5 level shadow page table can be constructed if VM is running in 5 level paging mode. Also extends the ept code, so that 5 level ept table can be constructed if maxphysaddr of VM exceeds 48 bits. Unlike the shadow logic, KVM should still use 4 level ept table for a VM whose physical address width is less than 48 bits, even when the VM is running in 5 level paging mode. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> [Unconditionally reset the MMU context in kvm_cpuid_update. Changing MAXPHYADDR invalidates the reserved bit bitmasks. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:17 +02:00
Yu Zhang	2a7266a8f9	KVM: MMU: Rename PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL. Now we have 4 level page table and 5 level page table in 64 bits long mode, let's rename the PT64_ROOT_LEVEL to PT64_ROOT_4LEVEL, then we can use PT64_ROOT_5LEVEL for 5 level page table, it's helpful to make the code more clear. Also PT64_ROOT_MAX_LEVEL is defined as 4, so that we can just redefine it to 5 whenever a replacement is needed for 5 level paging. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:16 +02:00
Yu Zhang	d1cd3ce900	KVM: MMU: check guest CR3 reserved bits based on its physical address width. Currently, KVM uses CR3_L_MODE_RESERVED_BITS to check the reserved bits in CR3. Yet the length of reserved bits in guest CR3 should be based on the physical address width exposed to the VM. This patch changes CR3 check logic to calculate the reserved bits at runtime. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:16 +02:00
Yu Zhang	e911eb3b34	KVM: x86: Add return value to kvm_cpuid(). Return false in kvm_cpuid() when it fails to find the cpuid entry. Also, this routine(and its caller) is optimized with a new argument - check_limit, so that the check_cpuid_limit() fall back can be avoided. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:15 +02:00
Paolo Bonzini	3db134805c	kvm: vmx: Raise #UD on unsupported XSAVES/XRSTORS A guest may not be configured to support XSAVES/XRSTORS, even when the host does. If the guest does not support XSAVES/XRSTORS, clear the secondary execution control so that the processor will raise #UD. Also clear the "allowed-1" bit for XSAVES/XRSTORS exiting in the IA32_VMX_PROCBASED_CTLS2 MSR, and pass through VMCS12's control in the VMCS02. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 18:09:13 +02:00
Jim Mattson	75f4fc8da9	kvm: vmx: Raise #UD on unsupported RDSEED A guest may not be configured to support RDSEED, even when the host does. If the guest does not support RDSEED, intercept the instruction and synthesize #UD. Also clear the "allowed-1" bit for RDSEED exiting in the IA32_VMX_PROCBASED_CTLS2 MSR. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 15:35:46 +02:00
Jim Mattson	45ec368c9a	kvm: vmx: Raise #UD on unsupported RDRAND A guest may not be configured to support RDRAND, even when the host does. If the guest does not support RDRAND, intercept the instruction and synthesize #UD. Also clear the "allowed-1" bit for RDRAND exiting in the IA32_VMX_PROCBASED_CTLS2 MSR. Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 15:35:37 +02:00
Paolo Bonzini	80154d77c9	KVM: VMX: cache secondary exec controls Currently, secondary execution controls are divided in three groups: - static, depending mostly on the module arguments or the processor (vmx_secondary_exec_control) - static, depending on CPUID (vmx_cpuid_update) - dynamic, depending on nested VMX or local APIC state Because walking CPUID is expensive, prepare_vmcs02 is using only the first group. This however is unnecessarily complicated. Just cache the static secondary execution controls, and then prepare_vmcs02 does not need to compute them every time. Computation of all static secondary execution controls is now kept in a single function, vmx_compute_secondary_exec_control. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-24 15:35:14 +02:00
Ingo Molnar	93da8b221d	Merge branch 'linus' into perf/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-24 10:12:33 +02:00
Juergen Gross	ecda85e702	x86/lguest: Remove lguest support Lguest seems to be rather unused these days. It has seen only patches ensuring it still builds the last two years and its official state is "Odd Fixes". Remove it in order to be able to clean up the paravirt code. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Rusty Russell <rusty@rustcorp.com.au> Acked-by: Thomas Gleixner <tglx@linutronix.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: boris.ostrovsky@oracle.com Cc: lguest@lists.ozlabs.org Cc: rusty@rustcorp.com.au Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20170816173157.8633-3-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-24 09:57:28 +02:00
Juergen Gross	edcb5cf84f	x86/paravirt/xen: Remove xen_patch() Xen's paravirt patch function xen_patch() does some special casing for irq_ops functions to apply relocations when those functions can be patched inline instead of calls. Unfortunately none of the special case function replacements is small enough to be patched inline, so the special case never applies. As xen_patch() will call paravirt_patch_default() in all cases it can be just dropped. xen-asm.h doesn't seem necessary without xen_patch() as the only thing left in it would be the definition of XEN_EFLAGS_NMI used only once. So move that definition and remove xen-asm.h. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: boris.ostrovsky@oracle.com Cc: lguest@lists.ozlabs.org Cc: rusty@rustcorp.com.au Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/20170816173157.8633-2-jgross@suse.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-24 09:57:24 +02:00
Janakarajan Natarajan	640bd6e575	KVM: SVM: Enable Virtual GIF feature Enable the Virtual GIF feature. This is done by setting bit 25 at position 60h in the vmcb. With this feature enabled, the processor uses bit 9 at position 60h as the virtual GIF when executing STGI/CLGI instructions. Since the execution of STGI by the L1 hypervisor does not cause a return to the outermost (L0) hypervisor, the enable_irq_window and enable_nmi_window are modified. The IRQ window will be opened even if GIF is not set, under the assumption that on resuming the L1 hypervisor the IRQ will be held pending until the processor executes the STGI instruction. For the NMI window, the STGI intercept is set. This will assist in opening the window only when GIF=1. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-23 18:37:37 +02:00
Janakarajan Natarajan	d837312dfd	KVM: SVM: Add Virtual GIF feature definition Add a new cpufeature definition for Virtual GIF. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Reviewed-by: Borislav Petkov <bp@suse.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-23 18:34:42 +02:00
raymond pang	adfaf18334	x86/ioapic: Print the IRTE's index field correctly when enabling INTR When enabling interrupt remap, IOAPIC's RTE contains the interrupt_index field of IRTE. This field is composed of the ->index and the ->index2 members of 'struct IR_IO_APIC_route_entry' - but what we print out currently only uses ->index. Fix it. Signed-off-by: Raymond Pang <raymondpangxd@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: joro@8bytes.org Cc: linux-arch@vger.kernel.org Link: http://lkml.kernel.org/r/CAHG4imNDzpDyOVi7MByVrLQ%3DQFuOVqpzJ5F-Xs5z6OZphubj-Q@mail.gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-23 10:17:17 +02:00
Herbert Xu	e90c48efde	Merge git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Merge the crypto tree to resolve the conflict between the temporary and long-term fixes in algif_skcipher.	2017-08-22 14:53:32 +08:00
David S. Miller	e2a7c34fb2	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-08-21 17:06:42 -07:00
Borislav Petkov	d6c8103b02	x86/CPU: Align CR3 defines Align them vertically for better readability and use BIT_ULL() macro. No functionality change. Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andy Lutomirski <luto@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Link: http://lkml.kernel.org/r/20170821080651.4527-1-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-21 11:35:50 +02:00
Matthias Kaehlcke	9e8730b178	x86/build: Use cc-option to validate stack alignment parameter With the following commit: `8f91869766` ("x86/build: Fix stack alignment for CLang") cc-option is only used to determine the name of the stack alignment option supported by the compiler, but not to verify that the actual parameter <option>=N is valid in combination with the other CFLAGS. This causes problems (as reported by the kbuild robot) with older GCC versions which only support stack alignment on a boundary of 16 bytes or higher. Also use (__)cc_option to add the stack alignment option to CFLAGS to make sure only valid options are added. Reported-by: kbuild test robot <fengguang.wu@intel.com> Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bernhard.Rosenkranzer@linaro.org Cc: Greg Hackmann <ghackmann@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Michael Davidson <md@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hines <srhines@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dianders@chromium.org Fixes: `8f91869766` ("x86/build: Fix stack alignment for CLang") Link: http://lkml.kernel.org/r/20170817182047.176752-1-mka@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-21 09:53:15 +02:00
Linus Torvalds	7f680d7ec3	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "Another pile of small fixes and updates for x86: - Plug a hole in the SMAP implementation which misses to clear AC on NMI entry - Fix the norandmaps/ADDR_NO_RANDOMIZE logic so the command line parameter works correctly again - Use the proper accessor in the startup64 code for next_early_pgt to prevent accessing of invalid addresses and faulting in the early boot code. - Prevent CPU hotplug lock recursion in the MTRR code - Unbreak CPU0 hotplugging - Rename overly long CPUID bits which got introduced in this cycle - Two commits which mark data 'const' and restrict the scope of data and functions to file scope by making them 'static'" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86: Constify attribute_group structures x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt' x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks x86: Fix norandmaps/ADDR_NO_RANDOMIZE x86/mtrr: Prevent CPU hotplug lock recursion x86: Mark various structures and functions as 'static' x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag x86/smpboot: Unbreak CPU0 hotplug x86/asm/64: Clear AC on NMI entries	2017-08-20 09:36:52 -07:00
Linus Torvalds	e46db8d2ef	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "Two fixes for the perf subsystem: - Fix an inconsistency of RDPMC mm struct tagging across exec() which causes RDPMC to fault. - Correct the timestamp mechanics across IOC_DISABLE/ENABLE which causes incorrect timestamps and total time calculations" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/core: Fix time on IOC_ENABLE perf/x86: Fix RDPMC vs. mm_struct tracking	2017-08-20 09:20:57 -07:00
Linus Torvalds	e18a5ebc2d	Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull watchdog fix from Thomas Gleixner: "A fix for the hardlockup watchdog to prevent false positives with extreme Turbo-Modes which make the perf/NMI watchdog fire faster than the hrtimer which is used to verify. Slightly larger than the minimal fix, which just would increase the hrtimer frequency, but comes with extra overhead of more watchdog timer interrupts and thread wakeups for all users. With this change we restrict the overhead to the extreme Turbo-Mode systems" * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kernel/watchdog: Prevent false positives with turbo modes	2017-08-20 08:54:30 -07:00
Kees Cook	c715b72c1b	mm: revert x86_64 and arm64 ELF_ET_DYN_BASE base changes Moving the x86_64 and arm64 PIE base from 0x555555554000 to 0x000100000000 broke AddressSanitizer. This is a partial revert of: `eab09532d4` ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE") `02445990a9` ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB") The AddressSanitizer tool has hard-coded expectations about where executable mappings are loaded. The motivation for changing the PIE base in the above commits was to avoid the Stack-Clash CVEs that allowed executable mappings to get too close to heap and stack. This was mainly a problem on 32-bit, but the 64-bit bases were moved too, in an effort to proactively protect those systems (proofs of concept do exist that show 64-bit collisions, but other recent changes to fix stack accounting and setuid behaviors will minimize the impact). The new 32-bit PIE base is fine for ASan (since it matches the ET_EXEC base), so only the 64-bit PIE base needs to be reverted to let x86 and arm64 ASan binaries run again. Future changes to the 64-bit PIE base on these architectures can be made optional once a more dynamic method for dealing with AddressSanitizer is found. (e.g. always loading PIE into the mmap region for marked binaries.) Link: http://lkml.kernel.org/r/20170807201542.GA21271@beast Fixes: `eab09532d4` ("binfmt_elf: use ELF_ET_DYN_BASE only for PIE") Fixes: `02445990a9` ("arm64: move ELF_ET_DYN_BASE to 4GB / 4MB") Signed-off-by: Kees Cook <keescook@chromium.org> Reported-by: Kostya Serebryany <kcc@google.com> Acked-by: Will Deacon <will.deacon@arm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-18 15:32:02 -07:00
Nicholas Piggin	92e5aae457	kernel/watchdog: fix Kconfig constraints for perf hardlockup watchdog Commit `05a4a95279` ("kernel/watchdog: split up config options") lost the perf-based hardlockup detector's dependency on PERF_EVENTS, which can result in broken builds with some powerpc configurations. Restore the dependency. Add it in for x86 too, despite x86 always selecting PERF_EVENTS it seems reasonable to make the dependency explicit. Link: http://lkml.kernel.org/r/20170810114452.6673-1-npiggin@gmail.com Fixes: `05a4a95279` ("kernel/watchdog: split up config options") Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Acked-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>	2017-08-18 15:32:01 -07:00
David Hildenbrand	42aa53b4e1	KVM: VMX: always require WB memory type for EPT We already always set that type but don't check if it is supported. Also for nVMX, we only support WB for now. Let's just require it. Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-18 17:38:01 +02:00
David Hildenbrand	bb97a01693	KVM: VMX: cleanup EPTP definitions Don't use shifts, tag them correctly as EPTP and use better matching names (PWL vs. GAW). Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-18 17:38:01 +02:00
Denys Vlasenko	3f0d4db757	KVM: SVM: delete avic_vm_id_bitmap (2 megabyte static array) With lightly tweaked defconfig: text data bss dec hex filename 11259661 5109408 2981888 19350957 12745ad vmlinux.before 11259661 5109408 884736 17253805 10745ad vmlinux.after Only compile-tested. Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com> Cc: Joerg Roedel <joro@8bytes.org> Cc: pbonzini@redhat.com Cc: rkrcmar@redhat.com Cc: tglx@linutronix.de Cc: mingo@redhat.com Cc: hpa@zytor.com Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-18 14:37:50 +02:00
Paolo Bonzini	9034e6e895	KVM: x86: fix use of L1 MMIO areas in nested guests There is currently some confusion between nested and L1 GPAs. The assignment to "direct" in kvm_mmu_page_fault tries to fix that, but it is not enough. What this patch does is fence off the MMIO cache completely when using shadow nested page tables, since we have neither a GVA nor an L1 GPA to put in the cache. This also allows some simplifications in kvm_mmu_page_fault and FNAME(page_fault). The EPT misconfig likewise does not have an L1 GPA to pass to kvm_io_bus_write, so that must be skipped for guest mode. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> [Changed comment to say "GPAs" instead of "L1's physical addresses", as per David's review. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-18 14:37:49 +02:00
Brijesh Singh	618232e219	KVM: x86: Avoid guest page table walk when gpa_available is set When a guest causes a page fault which requires emulation, the vcpu->arch.gpa_available flag is set to indicate that cr2 contains a valid GPA. Currently, emulator_read_write_onepage() makes use of gpa_available flag to avoid a guest page walk for a known MMIO regions. Lets not limit the gpa_available optimization to just MMIO region. The patch extends the check to avoid page walk whenever gpa_available flag is set. Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> [Fix EPT=0 according to Wanpeng Li's fix, plus ensure VMX also uses the new code. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> [Moved "ret < 0" to the else brach, as per David's review. - Radim] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-18 14:37:49 +02:00
Paolo Bonzini	e08d26f071	KVM: x86: simplify ept_misconfig Calling handle_mmio_page_fault() has been unnecessary since commit `e9ee956e31` ("KVM: x86: MMU: Move handle_mmio_page_fault() call to kvm_mmu_page_fault()", 2016-02-22). handle_mmio_page_fault() can now be made static. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-18 14:37:48 +02:00
Thomas Gleixner	7edaeb6841	kernel/watchdog: Prevent false positives with turbo modes The hardlockup detector on x86 uses a performance counter based on unhalted CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the performance counter period, so the hrtimer should fire 2-3 times before the performance counter NMI fires. The NMI code checks whether the hrtimer fired since the last invocation. If not, it assumess a hard lockup. The calculation of those periods is based on the nominal CPU frequency. Turbo modes increase the CPU clock frequency and therefore shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x nominal frequency) the perf/NMI period is shorter than the hrtimer period which leads to false positives. A simple fix would be to shorten the hrtimer period, but that comes with the side effect of more frequent hrtimer and softlockup thread wakeups, which is not desired. Implement a low pass filter, which checks the perf/NMI period against kernel time. If the perf/NMI fires before 4/5 of the watchdog period has elapsed then the event is ignored and postponed to the next perf/NMI. That solves the problem and avoids the overhead of shorter hrtimer periods and more frequent softlockup thread wakeups. Fixes: `58687acba5` ("lockup_detector: Combine nmi_watchdog and softlockup detector") Reported-and-tested-by: Kan Liang <Kan.liang@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: dzickus@redhat.com Cc: prarit@redhat.com Cc: ak@linux.intel.com Cc: babu.moger@oracle.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: acme@redhat.com Cc: stable@vger.kernel.org Cc: atomlin@redhat.com Cc: akpm@linux-foundation.org Cc: torvalds@linux-foundation.org Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos	2017-08-18 12:35:02 +02:00
Arvind Yadav	45bd07ad82	x86: Constify attribute_group structures attribute_groups are not supposed to change at runtime and none of the groups is modified. Mark the non-const structs as const. [ tglx: Folded into one big patch ] Signed-off-by: Arvind Yadav <arvind.yadav.cs@gmail.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: tony.luck@intel.com Cc: bp@alien8.de Link: http://lkml.kernel.org/r/1500550238-15655-2-git-send-email-arvind.yadav.cs@gmail.com	2017-08-18 11:30:35 +02:00
Ingo Molnar	0c23647913	Merge branch 'x86/asm' into locking/core We need the ASM_UNREACHABLE() macro for a dependent patch. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-18 10:29:54 +02:00
Linus Torvalds	d33a2a9143	Power management fixes for v4.13-rc6 - Disable interrupts around reading IA32_APERF and IA32_MPERF in aperfmperf_snapshot_khz() (introduced recently) to avoid excessive delays between the reads that may result from interrupt handling (Doug Smythies). - Fix the comutation of the CPU frequency to be reported through the pstate_sample tracepoint in intel_pstate (Doug Smythies). -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQIcBAABCAAGBQJZlfwDAAoJEILEb/54YlRxNz0P/2qaLU/vTk2Ide5A0LNxHPRx kv7kD8HQ37yWMR787FCDihrJqXd9oY5nnrBosolHhaSO0aEn3RwFwWWmZJXVSS9O VB7zSDoxs5p4q+1lDz9nN0I5eu1+6b5Z4kLeEl5qJuJbc36o1wJ4fkg29M9pnoM0 C85M/yrAN+WZMqsqjjTYObJb4NKQw3iIkF1oQW3mM1wM9YZFh4brMjvFGZ97XxjK GJyTgfm580cPQ2aMIYIffXkhLk3LhNRto+fkpWZ4togzutJSbCtA16sKlRVdtrof uGOcP4/dgmR3futM8mG7j6ovz+XvbxKeYcSs5BPh7klvCgwLY/Np+uV582mNrLWT UabL5+Jvwx4zFgS2m/jhZB/6rTs6h4jAmfBpCBlabAX6ppKAr74uH20dAoKePhHm qKa++7xVQBFwmHHsUXesW8QYSaEH37pwj+zUWyw1e+Dt+VvYDWRC5R2nugtOw8zV s6yONCd7HdfqCSpig1eA175E3IUAsFD5s1HXnuGVUAGjnPDiXvwtSZa5fdoDKHVo COZ0hV87z4+VtRF3/87xbJtFsAhz3byapIBrQ3QGAjfYhQ8D6fC1lA9OAqXEVETF 1A14FnHJprqIpTUwXAWEBco6eez8/W2j9KomltNCnsyeZlcV6hy6nO4keRqFKCn0 sRyj93X6N6HlUE+rWQxE =mtB4 -----END PGP SIGNATURE----- Merge tag 'pm-4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management fixes from Rafael Wysocki: "These fix two issues related to exposing the current CPU frequency to user space on x86. Specifics: - Disable interrupts around reading IA32_APERF and IA32_MPERF in aperfmperf_snapshot_khz() (introduced recently) to avoid excessive delays between the reads that may result from interrupt handling (Doug Smythies). - Fix the computation of the CPU frequency to be reported through the pstate_sample tracepoint in intel_pstate (Doug Smythies)" * tag 'pm-4.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: cpufreq: x86: Disable interrupts during MSRs reading cpufreq: intel_pstate: report correct CPU frequencies during trace	2017-08-17 14:21:18 -07:00
Rafael J. Wysocki	8179962b84	Merge branches 'intel_pstate-fix' and 'cpufreq-x86-fix' * intel_pstate-fix: cpufreq: intel_pstate: report correct CPU frequencies during trace * cpufreq-x86-fix: cpufreq: x86: Disable interrupts during MSRs reading	2017-08-17 21:00:30 +02:00
Baoquan He	c05cd79750	x86/boot/KASLR: Prefer mirrored memory regions for the kernel physical address Currently KASLR will parse all e820 entries of RAM type and add all candidate positions into the slots array. After that we choose one slot randomly as the new position which the kernel will be decompressed into and run at. On systems with EFI enabled, e820 memory regions are coming from EFI memory regions by combining adjacent regions. These EFI memory regions have various attributes, and the "mirrored" attribute is one of them. The physical memory region whose descriptors in EFI memory map has EFI_MEMORY_MORE_RELIABLE attribute (bit: 16) are mirrored. The address range mirroring feature of the kernel arranges such mirrored regions into normal zones and other regions into movable zones. With the mirroring feature enabled, the code and data of the kernel can only be located in the more reliable mirrored regions. However, the current KASLR code doesn't check EFI memory entries, and could choose a new kernel position in non-mirrored regions. This will break the intended functionality of the address range mirroring feature. To fix this, if EFI is detected, iterate EFI memory map and pick the mirrored region to process for adding candidate of randomization slot. If EFI is disabled or no mirrored region found, still process the e820 memory map. Signed-off-by: Baoquan He <bhe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: ard.biesheuvel@linaro.org Cc: fanc.fnst@cn.fujitsu.com Cc: izumi.taku@jp.fujitsu.com Cc: keescook@chromium.org Cc: linux-efi@vger.kernel.org Cc: matt@codeblueprint.co.uk Cc: n-horiguchi@ah.jp.nec.com Cc: thgarnie@google.com Link: http://lkml.kernel.org/r/1502722464-20614-3-git-send-email-bhe@redhat.com [ Rewrote most of the text. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-17 10:51:35 +02:00
Baoquan He	02e43c2dcd	efi: Introduce efi_early_memdesc_ptr to get pointer to memmap descriptor The existing map iteration helper for_each_efi_memory_desc_in_map can only be used after the kernel initializes the EFI subsystem to set up struct efi_memory_map. Before that we also need iterate map descriptors which are stored in several intermediate structures, like struct efi_boot_memmap for arch independent usage and struct efi_info for x86 arch only. Introduce efi_early_memdesc_ptr() to get pointer to a map descriptor, and replace several places where that primitive is open coded. Signed-off-by: Baoquan He <bhe@redhat.com> [ Various improvements to the text. ] Acked-by: Matt Fleming <matt@codeblueprint.co.uk> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: ard.biesheuvel@linaro.org Cc: fanc.fnst@cn.fujitsu.com Cc: izumi.taku@jp.fujitsu.com Cc: keescook@chromium.org Cc: linux-efi@vger.kernel.org Cc: n-horiguchi@ah.jp.nec.com Cc: thgarnie@google.com Link: http://lkml.kernel.org/r/20170816134651.GF21273@x1 Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-17 10:50:57 +02:00
Ingo Molnar	2257e268b1	Merge branch 'linus' into x86/boot, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-17 10:50:48 +02:00
Kees Cook	7a46ec0e2f	locking/refcounts, x86/asm: Implement fast refcount overflow protection This implements refcount_t overflow protection on x86 without a noticeable performance impact, though without the fuller checking of REFCOUNT_FULL. This is done by duplicating the existing atomic_t refcount implementation but with normally a single instruction added to detect if the refcount has gone negative (e.g. wrapped past INT_MAX or below zero). When detected, the handler saturates the refcount_t to INT_MIN / 2. With this overflow protection, the erroneous reference release that would follow a wrap back to zero is blocked from happening, avoiding the class of refcount-overflow use-after-free vulnerabilities entirely. Only the overflow case of refcounting can be perfectly protected, since it can be detected and stopped before the reference is freed and left to be abused by an attacker. There isn't a way to block early decrements, and while REFCOUNT_FULL stops increment-from-zero cases (which would be the state _after_ an early decrement and stops potential double-free conditions), this fast implementation does not, since it would require the more expensive cmpxchg loops. Since the overflow case is much more common (e.g. missing a "put" during an error path), this protection provides real-world protection. For example, the two public refcount overflow use-after-free exploits published in 2016 would have been rendered unexploitable: http://perception-point.io/2016/01/14/analysis-and-exploitation-of-a-linux-kernel-vulnerability-cve-2016-0728/ http://cyseclabs.com/page?n=02012016 This implementation does, however, notice an unchecked decrement to zero (i.e. caller used refcount_dec() instead of refcount_dec_and_test() and it resulted in a zero). Decrements under zero are noticed (since they will have resulted in a negative value), though this only indicates that a use-after-free may have already happened. Such notifications are likely avoidable by an attacker that has already exploited a use-after-free vulnerability, but it's better to have them reported than allow such conditions to remain universally silent. On first overflow detection, the refcount value is reset to INT_MIN / 2 (which serves as a saturation value) and a report and stack trace are produced. When operations detect only negative value results (such as changing an already saturated value), saturation still happens but no notification is performed (since the value was already saturated). On the matter of races, since the entire range beyond INT_MAX but before 0 is negative, every operation at INT_MIN / 2 will trap, leaving no overflow-only race condition. As for performance, this implementation adds a single "js" instruction to the regular execution flow of a copy of the standard atomic_t refcount operations. (The non-"and_test" refcount_dec() function, which is uncommon in regular refcount design patterns, has an additional "jz" instruction to detect reaching exactly zero.) Since this is a forward jump, it is by default the non-predicted path, which will be reinforced by dynamic branch prediction. The result is this protection having virtually no measurable change in performance over standard atomic_t operations. The error path, located in .text.unlikely, saves the refcount location and then uses UD0 to fire a refcount exception handler, which resets the refcount, handles reporting, and returns to regular execution. This keeps the changes to .text size minimal, avoiding return jumps and open-coded calls to the error reporting routine. Example assembly comparison: refcount_inc() before: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) refcount_inc() after: .text: ffffffff81546149: f0 ff 45 f4 lock incl -0xc(%rbp) ffffffff8154614d: 0f 88 80 d5 17 00 js ffffffff816c36d3 ... .text.unlikely: ffffffff816c36d3: 48 8d 4d f4 lea -0xc(%rbp),%rcx ffffffff816c36d7: 0f ff (bad) These are the cycle counts comparing a loop of refcount_inc() from 1 to INT_MAX and back down to 0 (via refcount_dec_and_test()), between unprotected refcount_t (atomic_t), fully protected REFCOUNT_FULL (refcount_t-full), and this overflow-protected refcount (refcount_t-fast): 2147483646 refcount_inc()s and 2147483647 refcount_dec_and_test()s: cycles protections atomic_t 82249267387 none refcount_t-fast 82211446892 overflow, untested dec-to-zero refcount_t-full 144814735193 overflow, untested dec-to-zero, inc-from-zero This code is a modified version of the x86 PAX_REFCOUNT atomic_t overflow defense from the last public patch of PaX/grsecurity, based on my understanding of the code. Changes or omissions from the original code are mine and don't reflect the original grsecurity/PaX code. Thanks to PaX Team for various suggestions for improvement for repurposing this code to be a refcount-only protection. Signed-off-by: Kees Cook <keescook@chromium.org> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: David S. Miller <davem@davemloft.net> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Elena Reshetova <elena.reshetova@intel.com> Cc: Eric Biggers <ebiggers3@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Greg KH <gregkh@linuxfoundation.org> Cc: Hans Liljestrand <ishkamiel@gmail.com> Cc: James Bottomley <James.Bottomley@hansenpartnership.com> Cc: Jann Horn <jannh@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rik van Riel <riel@redhat.com> Cc: Serge E. Hallyn <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: arozansk@redhat.com Cc: axboe@kernel.dk Cc: kernel-hardening@lists.openwall.com Cc: linux-arch <linux-arch@vger.kernel.org> Link: http://lkml.kernel.org/r/20170815161924.GA133115@beast Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-17 10:40:26 +02:00
Tony Luck	ce0fa3e56a	x86/mm, mm/hwpoison: Clear PRESENT bit for kernel 1:1 mappings of poison pages Speculative processor accesses may reference any memory that has a valid page table entry. While a speculative access won't generate a machine check, it will log the error in a machine check bank. That could cause escalation of a subsequent error since the overflow bit will be then set in the machine check bank status register. Code has to be double-plus-tricky to avoid mentioning the 1:1 virtual address of the page we want to map out otherwise we may trigger the very problem we are trying to avoid. We use a non-canonical address that passes through the usual Linux table walking code to get to the same "pte". Thanks to Dave Hansen for reviewing several iterations of this. Also see: http://marc.info/?l=linux-mm&m=149860136413338&w=2 Signed-off-by: Tony Luck <tony.luck@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Elliott, Robert (Persistent Memory) <elliott@hpe.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-mm@kvack.org Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/20170816171803.28342-1-tony.luck@intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-17 10:30:49 +02:00
Matthias Kaehlcke	8f91869766	x86/build: Fix stack alignment for CLang Commit: `d77698df39` ("x86/build: Specify stack alignment for clang") intended to use the same stack alignment for clang as with gcc. The two compilers use different options to configure the stack alignment (gcc: -mpreferred-stack-boundary=n, clang: -mstack-alignment=n). The above commit assumes that the clang option uses the same parameter type as gcc, i.e. that the alignment is specified as 2^n. However clang interprets the value of this option literally to use an alignment of n, in consequence the stack remains misaligned. Change the values used with -mstack-alignment to be the actual alignment instead of a power of two. cc-option isn't used here with the typical pattern of KBUILD_CFLAGS += $(call cc-option ...). The reason is that older gcc versions don't support the -mpreferred-stack-boundary option, since cc-option doesn't verify whether the alternative option is valid it would incorrectly select the clang option -mstack-alignment.. Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bernhard.Rosenkranzer@linaro.org Cc: Greg Hackmann <ghackmann@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Michael Davidson <md@google.com> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Hines <srhines@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: dianders@chromium.org Link: http://lkml.kernel.org/r/20170817004740.170588-1-mka@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-17 10:27:19 +02:00
Alexander Potapenko	187e91fe5e	x86/boot/64/clang: Use fixup_pointer() to access 'next_early_pgt' __startup_64() is normally using fixup_pointer() to access globals in a position-independent fashion. However 'next_early_pgt' was accessed directly, which wasn't guaranteed to work. Luckily GCC was generating a R_X86_64_PC32 PC-relative relocation for 'next_early_pgt', but Clang emitted a R_X86_64_32S, which led to accessing invalid memory and rebooting the kernel. Signed-off-by: Alexander Potapenko <glider@google.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Michael Davidson <md@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `c88d71508e` ("x86/boot/64: Rewrite startup_64() in C") Link: http://lkml.kernel.org/r/20170816190808.131748-1-glider@google.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-17 09:53:00 +02:00
Ingo Molnar	927d2c21f2	Merge branch 'linus' into perf/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-17 09:41:41 +02:00
Scott Wood	c455fd9235	x86/nmi: Use raw lock register_nmi_handler() can be called from PREEMPT_RT atomic context (e.g. wakeup_cpu_via_init_nmi() or native_stop_other_cpus()), and thus ordinary spinlocks cannot be used. Signed-off-by: Scott Wood <swood@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Don Zickus <dzickus@redhat.com> Link: http://lkml.kernel.org/r/20170724213242.27598-1-swood@redhat.com	2017-08-16 20:40:09 +02:00
Oleg Nesterov	01578e3616	x86/elf: Remove the unnecessary ADDR_NO_RANDOMIZE checks The ADDR_NO_RANDOMIZE checks in stack_maxrandom_size() and randomize_stack_top() are not required. PF_RANDOMIZE is set by load_elf_binary() only if ADDR_NO_RANDOMIZE is not set, no need to re-check after that. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: stable@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20170815154011.GB1076@redhat.com	2017-08-16 20:32:02 +02:00
Oleg Nesterov	47ac5484fd	x86: Fix norandmaps/ADDR_NO_RANDOMIZE Documentation/admin-guide/kernel-parameters.txt says: norandmaps Don't use address space randomization. Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space but it doesn't work because arch_rnd() which is used to randomize mm->mmap_base returns a random value unconditionally. And as Kirill pointed out, ADDR_NO_RANDOMIZE is broken by the same reason. Just shift the PF_RANDOMIZE check from arch_mmap_rnd() to arch_rnd(). Fixes: `1b028f784e` ("x86/mm: Introduce mmap_compat_base() for 32-bit mmap()") Signed-off-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Cyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: Dmitry Safonov <dsafonov@virtuozzo.com> Cc: stable@vger.kernel.org Cc: Andy Lutomirski <luto@amacapital.net> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/20170815153952.GA1076@redhat.com	2017-08-16 20:32:01 +02:00
Colin Ian King	5707b46a42	x86/intel_rdt: Remove redundant ternary operator on return The use of the ternary operator is redundant as ret can never be non-zero at that point. Instead, just return nbytes. Detected by CoverityScan, CID#1452658 ("Logically dead code") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20170808092859.13021-1-colin.king@canonical.com	2017-08-16 16:20:55 +02:00
Vikas Shivappa	24247aeeab	x86/intel_rdt/cqm: Improve limbo list processing During a mkdir, the entire limbo list is synchronously checked on each package for free RMIDs by sending IPIs. With a large number of RMIDs (SKL has 192) this creates a intolerable amount of work in IPIs. Replace the IPI based checking of the limbo list with asynchronous worker threads on each package which periodically scan the limbo list and move the RMIDs that have: llc_occupancy < threshold_occupancy on all packages to the free list. mkdir now returns -ENOSPC if the free list and the limbo list ere empty or returns -EBUSY if there are RMIDs on the limbo list and the free list is empty. Getting rid of the IPIs also simplifies the data structures and the serialization required for handling the lists. [ tglx: Rewrote changelog ... ] Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Link: http://lkml.kernel.org/r/1502845243-20454-3-git-send-email-vikas.shivappa@linux.intel.com	2017-08-16 12:05:41 +02:00
Vikas Shivappa	bbc4615e0b	x86/intel_rdt/mbm: Fix MBM overflow handler during CPU hotplug When a CPU is dying, the overflow worker is canceled and rescheduled on a different CPU in the same domain. But if the timer is already about to expire this essentially doubles the interval which might result in a non detected overflow. Cancel the overflow worker and reschedule it immediately on a different CPU in same domain. The work could be flushed as well, but that would reschedule it on the same CPU. [ tglx: Rewrote changelog once again ] Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Link: http://lkml.kernel.org/r/1502845243-20454-2-git-send-email-vikas.shivappa@linux.intel.com	2017-08-16 12:05:41 +02:00
David S. Miller	463910e2df	Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net	2017-08-15 20:23:23 -07:00
Thomas Gleixner	84393817db	x86/mtrr: Prevent CPU hotplug lock recursion Larry reported a CPU hotplug lock recursion in the MTRR code. ============================================ WARNING: possible recursive locking detected systemd-udevd/153 is trying to acquire lock: (cpu_hotplug_lock.rw_sem){.+.+.+}, at: [<c030fc26>] stop_machine+0x16/0x30 but task is already holding lock: (cpu_hotplug_lock.rw_sem){.+.+.+}, at: [<c0234353>] mtrr_add_page+0x83/0x470 .... cpus_read_lock+0x48/0x90 stop_machine+0x16/0x30 mtrr_add_page+0x18b/0x470 mtrr_add+0x3e/0x70 mtrr_add_page() holds the hotplug rwsem already and calls stop_machine() which acquires it again. Call stop_machine_cpuslocked() instead. Reported-and-tested-by: Larry Finger <Larry.Finger@lwfinger.net> Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708140920250.1865@nanos Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Borislav Petkov <bp@suse.de>	2017-08-15 13:03:47 +02:00
Andy Lutomirski	fa2016a8e7	x86/xen/64: Fix the reported SS and CS in SYSCALL When I cleaned up the Xen SYSCALL entries, I inadvertently changed the reported segment registers. Before my patch, regs->ss was __USER(32)_DS and regs->cs was __USER(32)_CS. After the patch, they are FLAT_USER_CS/DS(32). This had a couple unfortunate effects. It confused the opportunistic fast return logic. It also significantly increased the risk of triggering a nasty glibc bug: https://sourceware.org/bugzilla/show_bug.cgi?id=21269 Update the Xen entry code to change it back. Reported-by: Brian Gerst <brgerst@gmail.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Andrew Cooper <andrew.cooper3@citrix.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Juergen Gross <jgross@suse.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Fixes: `8a9949bc71` ("x86/xen/64: Rearrange the SYSCALL entries") Link: http://lkml.kernel.org/r/daba8351ea2764bb30272296ab9ce08a81bd8264.1502775273.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-15 10:10:58 +02:00
Dave Airlie	0c697fafc6	Linux 4.13-rc5 -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJZkNpUAAoJEHm+PkMAQRiGr68H/2nr8kxpoUhZ7eA5C71waCjh gnJSevkzJAp+fCb0KfQFAp1qvpmLLle4e6tAxYgTQZg4Z3W5cJJNfxu9TzY5sGuL o9QUr43XzABepW4e4jhRtZv6dj3K6XruNeDQKXDZTDcc/S8zoiS/Pltq7VgPcAuM kX+3qsNdUyknngD6b0z9NtJkb0mHKY6J8MpraWRO34egDwsaN/tuhRj0DRQpCoyQ x/k+hMbc9MB9Dn8cfACo6Omb+r5Rfd7dTBUAju/TnIIgs//9voHba307N7XvLJZg kWc8MqMQQZXfRZHB0atpDMHyZS/XQRlNPXj76j0+Ud/byODKTFkkazmgTpALvj8= =CxeU -----END PGP SIGNATURE----- Backmerge tag 'v4.13-rc5' into drm-next Linux 4.13-rc5 There's a really nasty nouveau collision, hopefully someone can take a look once I pushed this out.	2017-08-15 16:16:58 +10:00
Greg Kroah-Hartman	d985524680	Merge 4.13-rc5 into char-misc-next We want the firmware, and other changes, in here as well. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>	2017-08-14 13:29:31 -07:00
Linus Torvalds	6b9d1c24e0	Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 Pull crypto fixes from Herbert Xu: "Fix an error path bug in ixp4xx as well as a read overrun in sha1-avx2" * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: crypto: x86/sha1 - Fix reads beyond the number of blocks passed crypto: ixp4xx - Fix error handling path in 'aead_perform()'	2017-08-14 11:35:56 -07:00
Vikas Shivappa	a9110b552d	x86/intel_rdt: Modify the intel_pqr_state for better performance Currently we have pqr_state and rdt_default_state which store the cached CLOSID/RMIDs and the user configured cpu default values respectively. We touch both of these during context switch. Put all of them in one structure so that we can spare a cache line. Reported-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: sai.praneeth.prakhya@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Link: http://lkml.kernel.org/r/1502304395-7166-3-git-send-email-vikas.shivappa@linux.intel.com	2017-08-14 11:47:47 +02:00
Vikas Shivappa	eda61c265f	x86/intel_rdt/cqm: Clear the default RMID during hotcpu The user configured per cpu default RMID is not cleared during cpu hotplug. This may lead to incorrect RMID values after a cpu goes offline and again comes back online. Clear the per cpu default RMID during cpu offline and online handling. Reported-by: Prakyha Sai Praneeth <sai.praneeth.prakhya@intel.com> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: ak@linux.intel.com Cc: davidcc@google.com Link: http://lkml.kernel.org/r/1502304395-7166-2-git-send-email-vikas.shivappa@linux.intel.com	2017-08-14 11:47:46 +02:00
Wolfram Sang	0a1c7959ac	gpu: drm: tc35876x: move header file out of I2C realm include/linux/i2c is not for client devices. Move the header file to a more appropriate location. Signed-off-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Patrik Jakobsson <patrik.r.jakobsson@gmail.com>	2017-08-13 16:07:17 +02:00
Linus Torvalds	043cd07c55	xen: Fixes for 4.13-rc5 -----BEGIN PGP SIGNATURE----- Version: GnuPG v2 iQEcBAABAgAGBQJZjs9BAAoJELDendYovxMvXXAH/j3pdoshQbflSUBsDAkybhv5 BVe7+bhtwnoawcjCpXq27SMY3qG/YWnATW28XjxBCoe3t7StNcJr5QGXTWMnTjwN f/YA0aqtCoLp9JhovTi9WTTCf1/I9CKYFBdCaAmLkDeMudyifZkbXiDbDe0UZmAc UJt0Jx8KrdMGkuRVp92049calluv+PDHO7gUpGpzoHDJ0IXc1cH9caHTbL+LhioY o0qqQOz9FnJQIvqSGYRkjXudmGwHYCr61yXvWhwqa4PE3Tzss2ckGtzZPLI8s1QN p5m01FbIMQKjLbwpQZaRWmGxSzY2vYxf/TShK8eIsBfRYxsR4d7cXULC2vIJGFI= =jiAk -----END PGP SIGNATURE----- Merge tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen fixes from Juergen Gross: "Some fixes for Xen: - a fix for a regression introduced in 4.13 for a Xen HVM-guest configured with KASLR - a fix for a possible deadlock in the xenbus driver when booting the system - a fix for lost interrupts in Xen guests" * tag 'for-linus-4.13b-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen/events: Fix interrupt lost during irq_disable and irq_enable xen: avoid deadlock in xenbus xen: fix hvm guest with kaslr enabled xen: split up xen_hvm_init_shared_info() x86: provide an init_mem_mapping hypervisor hook	2017-08-12 09:01:36 -07:00
Jim Mattson	d3802286fa	kvm: x86: Disallow illegal IA32_APIC_BASE MSR values Host-initiated writes to the IA32_APIC_BASE MSR do not have to follow local APIC state transition constraints, but the value written must be valid. Signed-off-by: Jim Mattson <jmattson@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-11 18:59:30 +02:00
Wanpeng Li	26eeb53cf0	KVM: MMU: Bail out immediately if there is no available mmu page Bailing out immediately if there is no available mmu page to alloc. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-11 18:59:29 +02:00
Wanpeng Li	42bcbebf11	KVM: MMU: Fix softlockup due to mmu_lock is held too long watchdog: BUG: soft lockup - CPU#5 stuck for 22s! [warn_test:3089] irq event stamp: 20532 hardirqs last enabled at (20531): [<ffffffff8e9b6908>] restore_regs_and_iret+0x0/0x1d hardirqs last disabled at (20532): [<ffffffff8e9b7ae8>] apic_timer_interrupt+0x98/0xb0 softirqs last enabled at (8266): [<ffffffff8e9badc6>] __do_softirq+0x206/0x4c1 softirqs last disabled at (8253): [<ffffffff8e083918>] irq_exit+0xf8/0x100 CPU: 5 PID: 3089 Comm: warn_test Tainted: G OE 4.13.0-rc3+ #8 RIP: 0010:kvm_mmu_prepare_zap_page+0x72/0x4b0 [kvm] Call Trace: make_mmu_pages_available.isra.120+0x71/0xc0 [kvm] kvm_mmu_load+0x1cf/0x410 [kvm] kvm_arch_vcpu_ioctl_run+0x1316/0x1bf0 [kvm] kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc2 ? __this_cpu_preempt_check+0x13/0x20 This can be reproduced readily by ept=N and running syzkaller tests since many syzkaller testcases don't setup any memory regions. However, if ept=Y rmode identity map will be created, then kvm_mmu_calculate_mmu_pages() will extend the number of VM's mmu pages to at least KVM_MIN_ALLOC_MMU_PAGES which just hide the issue. I saw the scenario kvm->arch.n_max_mmu_pages == 0 && kvm->arch.n_used_mmu_pages == 1, so there is one active mmu page on the list, kvm_mmu_prepare_zap_page() fails to zap any pages, however prepare_zap_oldest_mmu_page() always returns true. It incurs infinite loop in make_mmu_pages_available() which causes mmu->lock softlockup. This patch fixes it by setting the return value of prepare_zap_oldest_mmu_page() according to whether or not there is mmu page zapped. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-11 18:59:28 +02:00
David Hildenbrand	a057e0e22c	KVM: nVMX: validate eptp pointer Let's reuse the function introduced with eptp switching. We don't explicitly have to check against enable_ept_ad_bits, as this is implicitly done when checking against nested_vmx_ept_caps in valid_ept_address(). Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-11 18:53:22 +02:00
Juergen Gross	4ca83dcf4e	xen: fix hvm guest with kaslr enabled A Xen HVM guest running with KASLR enabled will die rather soon today because the shared info page mapping is using va() too early. This was introduced by commit `a5d5f328b0` ("xen: allocate page for shared info page from low memory"). In order to fix this use early_memremap() to get a temporary virtual address for shared info until va() can be used safely. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>	2017-08-11 15:50:26 +02:00
Juergen Gross	10231f69eb	xen: split up xen_hvm_init_shared_info() Instead of calling xen_hvm_init_shared_info() on boot and resume split it up into a boot time function searching for the pfn to use and a mapping function doing the hypervisor mapping call. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>	2017-08-11 15:50:24 +02:00
Juergen Gross	c138d81163	x86: provide an init_mem_mapping hypervisor hook Provide a hook in hypervisor_x86 called after setting up initial memory mapping. This is needed e.g. by Xen HVM guests to map the hypervisor shared info page. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Acked-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Juergen Gross <jgross@suse.com>	2017-08-11 15:50:21 +02:00
Arnd Bergmann	aac64f7de9	x86/cpu/amd: Hide unused legacy_fixup_core_id() function The newly introduced function is only used when CONFIG_SMP is set: arch/x86/kernel/cpu/amd.c:305:13: warning: 'legacy_fixup_core_id' defined but not used This moves the existing #ifdef around the caller so it covers legacy_fixup_core_id() as well. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@suse.de> Cc: Emanuel Czirai <icanrealizeum@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Fixes: `b89b41d0b8` ("x86/cpu/amd: Limit cpu_core_id fixup to families older than F17h") Link: http://lkml.kernel.org/r/20170811111937.2006128-1-arnd@arndb.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-11 14:58:34 +02:00
Colin Ian King	b45e4c45b1	x86: Mark various structures and functions as 'static' Mark a couple of structures and functions as 'static', pointed out by Sparse: warning: symbol 'bts_pmu' was not declared. Should it be static? warning: symbol 'p4_event_aliases' was not declared. Should it be static? warning: symbol 'rapl_attr_groups' was not declared. Should it be static? symbol 'process_uv2_message' was not declared. Should it be static? Signed-off-by: Colin Ian King <colin.king@canonical.com> Acked-by: Andrew Banman <abanman@hpe.com> # for the UV change Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will.deacon@arm.com> Cc: kernel-janitors@vger.kernel.org Link: http://lkml.kernel.org/r/20170810155709.7094-1-colin.king@canonical.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-11 14:49:43 +02:00
Borislav Petkov	5442c26995	x86/cpufeature, kvm/svm: Rename (shorten) the new "virtualized VMSAVE/VMLOAD" CPUID flag "virtual_vmload_vmsave" is what is going to land in /proc/cpuinfo now as per v4.13-rc4, for a single feature bit which is clearly too long. So rename it to what it is called in the processor manual. "v_vmsave_vmload" is a bit shorter, after all. We could go more aggressively here but having it the same as in the processor manual is advantageous. Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Radim Krčmář <rkrcmar@redhat.com> Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Cc: Jörg Rödel <joro@8bytes.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: kvm-ML <kvm@vger.kernel.org> Link: http://lkml.kernel.org/r/20170801185552.GA3743@nazgul.tnic Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-11 13:42:28 +02:00
Doug Smythies	8e2f3bce05	cpufreq: x86: Disable interrupts during MSRs reading According to Intel 64 and IA-32 Architectures SDM, Volume 3, Chapter 14.2, "Software needs to exercise care to avoid delays between the two RDMSRs (for example interrupts)". So, disable interrupts during reading MSRs IA32_APERF and IA32_MPERF. See also: commit `4ab60c3f32` (cpufreq: intel_pstate: Disable interrupts during MSRs reading). Signed-off-by: Doug Smythies <dsmythies@telus.net> Reviewed-by: Len Brown <len.brown@intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-08-11 01:27:41 +02:00
Vitaly Kuznetsov	2ffd9e33ce	x86/hyper-v: Use hypercall for remote TLB flush Hyper-V host can suggest us to use hypercall for doing remote TLB flush, this is supposed to work faster than IPIs. Implementation details: to do HvFlushVirtualAddress{Space,List} hypercalls we need to put the input somewhere in memory and we don't really want to have memory allocation on each call so we pre-allocate per cpu memory areas on boot. pv_ops patching is happening very early so we need to separate hyperv_setup_mmu_ops() and hyper_alloc_mmu(). It is possible and easy to implement local TLB flushing too and there is even a hint for that. However, I don't see a room for optimization on the host side as both hypercall and native tlb flush will result in vmexit. The hint is also not set on modern Hyper-V versions. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20170802160921.21791-8-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 20:16:44 +02:00
Suravee Suthikulpanit	2b83809a5e	x86/cpu/amd: Derive L3 shared_cpu_map from cpu_llc_shared_mask For systems with X86_FEATURE_TOPOEXT, current logic uses the APIC ID to calculate shared_cpu_map. However, APIC IDs are not guaranteed to be contiguous for cores across different L3s (e.g. family17h system w/ downcore configuration). This breaks the logic, and results in an incorrect L3 shared_cpu_map. Instead, always use the previously calculated cpu_llc_shared_mask of each CPU to derive the L3 shared_cpu_map. Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170731085159.9455-3-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 17:37:43 +02:00
Suravee Suthikulpanit	b89b41d0b8	x86/cpu/amd: Limit cpu_core_id fixup to families older than F17h Current cpu_core_id fixup causes downcored F17h configurations to be incorrect: NODE: 0 processor 0 core id : 0 processor 1 core id : 1 processor 2 core id : 2 processor 3 core id : 4 processor 4 core id : 5 processor 5 core id : 0 NODE: 1 processor 6 core id : 2 processor 7 core id : 3 processor 8 core id : 4 processor 9 core id : 0 processor 10 core id : 1 processor 11 core id : 2 Code that relies on the cpu_core_id, like match_smt(), for example, which builds the thread siblings masks used by the scheduler, is mislead. So, limit the fixup to pre-F17h machines. The new value for cpu_core_id for F17h and later will represent the CPUID_Fn8000001E_EBX[CoreId], which is guaranteed to be unique for each core within a socket. This way we have: NODE: 0 processor 0 core id : 0 processor 1 core id : 1 processor 2 core id : 2 processor 3 core id : 4 processor 4 core id : 5 processor 5 core id : 6 NODE: 1 processor 6 core id : 8 processor 7 core id : 9 processor 8 core id : 10 processor 9 core id : 12 processor 10 core id : 13 processor 11 core id : 14 Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> [ Heavily massaged. ] Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yazen Ghannam <Yazen.Ghannam@amd.com> Link: http://lkml.kernel.org/r/20170731085159.9455-2-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 17:37:43 +02:00
Peter Zijlstra	01651324ed	x86: Clarify/fix no-op barriers for text_poke_bp() So I was looking at text_poke_bp() today and I couldn't make sense of the barriers there. How's for something like so? Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Jiri Kosina <jkosina@suse.cz> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: masami.hiramatsu.pt@hitachi.com Link: http://lkml.kernel.org/r/20170731102154.f57cvkjtnbmtctk6@hirez.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 17:35:19 +02:00
Andy Lutomirski	e137a4d8f4	x86/switch_to/64: Rewrite FS/GS switching yet again to fix AMD CPUs Switching FS and GS is a mess, and the current code is still subtly wrong: it assumes that "Loading a nonzero value into FS sets the index and base", which is false on AMD CPUs if the value being loaded is 1, 2, or 3. (The current code came from commit `3e2b68d752` ("x86/asm, sched/x86: Rewrite the FS and GS context switch code"), which made it better but didn't fully fix it.) Rewrite it to be much simpler and more obviously correct. This should fix it fully on AMD CPUs and shouldn't adversely affect performance. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang Seok <chang.seok.bae@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 17:15:13 +02:00
Andy Lutomirski	9584d98bed	x86/fsgsbase/64: Report FSBASE and GSBASE correctly in core dumps In ELF_COPY_CORE_REGS, we're copying from the current task, so accessing thread.fsbase and thread.gsbase makes no sense. Just read the values from the CPU registers. In practice, the old code would have been correct most of the time simply because thread.fsbase and thread.gsbase usually matched the CPU registers. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang Seok <chang.seok.bae@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 17:15:13 +02:00
Andy Lutomirski	767d035d83	x86/fsgsbase/64: Fully initialize FS and GS state in start_thread_common execve used to leak FSBASE and GSBASE on AMD CPUs. Fix it. The security impact of this bug is small but not quite zero -- it could weaken ASLR when a privileged task execs a less privileged program, but only if program changed bitness across the exec, or the child binary was highly unusual or actively malicious. A child program that was compromised after the exec would not have access to the leaked base. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Chang Seok <chang.seok.bae@intel.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 17:15:13 +02:00
Vitaly Kuznetsov	10e66760fa	x86/smpboot: Unbreak CPU0 hotplug A hang on CPU0 onlining after a preceding offlining is observed. Trace shows that CPU0 is stuck in check_tsc_sync_target() waiting for source CPU to run check_tsc_sync_source() but this never happens. Source CPU, in its turn, is stuck on synchronize_sched() which is called from native_cpu_up() -> do_boot_cpu() -> unregister_nmi_handler(). So it's a classic ABBA deadlock, due to the use of synchronize_sched() in unregister_nmi_handler(). Fix the bug by moving unregister_nmi_handler() from do_boot_cpu() to native_cpu_up() after cpu onlining is done. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170803105818.9934-1-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 17:02:47 +02:00
Vitaly Kuznetsov	7415aea607	hyper-v: Globalize vp_index To support implementing remote TLB flushing on Hyper-V with a hypercall we need to make vp_index available outside of vmbus module. Rename and globalize. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20170802160921.21791-7-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 16:50:23 +02:00
Vitaly Kuznetsov	806c89273b	x86/hyper-v: Implement rep hypercalls Rep hypercalls are normal hypercalls which perform multiple actions at once. Hyper-V guarantees to return exectution to the caller in not more than 50us and the caller needs to use hypercall continuation. Touch NMI watchdog between hypercall invocations. This is going to be used for HvFlushVirtualAddressList hypercall for remote TLB flushing. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20170802160921.21791-6-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 16:50:22 +02:00
Vitaly Kuznetsov	6a8edbd0c5	x86/hyper-v: Introduce fast hypercall implementation Hyper-V supports 'fast' hypercalls when all parameters are passed through registers. Implement an inline version of a simpliest of these calls: hypercall with one 8-byte input and no output. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20170802160921.21791-4-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 16:50:22 +02:00
Vitaly Kuznetsov	fc53662f13	x86/hyper-v: Make hv_do_hypercall() inline We have only three call sites for hv_do_hypercall() and we're going to change HVCALL_SIGNAL_EVENT to doing fast hypercall so we can inline this function for optimization. Hyper-V top level functional specification states that r9-r11 registers and flags may be clobbered by the hypervisor during hypercall and with inlining this is somewhat important, add the clobbers. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20170802160921.21791-3-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 16:50:22 +02:00
Vitaly Kuznetsov	79cadff2d9	x86/hyper-v: Include hyperv/ only when CONFIG_HYPERV is set Code is arch/x86/hyperv/ is only needed when CONFIG_HYPERV is set, the 'basic' support and detection lives in arch/x86/kernel/cpu/mshyperv.c which is included when CONFIG_HYPERVISOR_GUEST is set. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jork Loeser <Jork.Loeser@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Simon Xiao <sixiao@microsoft.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: devel@linuxdriverproject.org Link: http://lkml.kernel.org/r/20170802160921.21791-2-vkuznets@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 16:50:22 +02:00
Ingo Molnar	636bf35880	Merge branch 'linus' into x86/platform, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 16:49:46 +02:00
Paolo Bonzini	eebed24389	kvm: nVMX: Add support for fast unprotection of nested guest page tables This is the same as commit `147277540b` ("kvm: svm: Add support for additional SVM NPF error codes", 2016-11-23), but for Intel processors. In this case, the exit qualification field's bit 8 says whether the EPT violation occurred while translating the guest's final physical address or rather while translating the guest page tables. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-10 16:44:04 +02:00
Brijesh Singh	64531a3b70	KVM: SVM: Limit PFERR_NESTED_GUEST_PAGE error_code check to L1 guest Commit `147277540b` ("kvm: svm: Add support for additional SVM NPF error codes", 2016-11-23) added a new error code to aid nested page fault handling. The commit unprotects (kvm_mmu_unprotect_page) the page when we get a NPF due to guest page table walk where the page was marked RO. However, if an L0->L2 shadow nested page table can also be marked read-only when a page is read only in L1's nested page table. If such a page is accessed by L2 while walking page tables it can cause a nested page fault (page table walks are write accesses). However, after kvm_mmu_unprotect_page we may get another page fault, and again in an endless stream. To cover this use case, we qualify the new error_code check with vcpu->arch.mmu_direct_map so that the error_code check would run on L1 guest, and not the L2 guest. This avoids hitting the above scenario. Fixes: `147277540b` Cc: stable@vger.kernel.org Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Lendacky <thomas.lendacky@amd.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-10 16:44:04 +02:00
Wanpeng Li	bbeac2830f	KVM: X86: Fix residual mmio emulation request to userspace Reported by syzkaller: The kvm-intel.unrestricted_guest=0 WARNING: CPU: 5 PID: 1014 at /home/kernel/data/kvm/arch/x86/kvm//x86.c:7227 kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm] CPU: 5 PID: 1014 Comm: warn_test Tainted: G W OE 4.13.0-rc3+ #8 RIP: 0010:kvm_arch_vcpu_ioctl_run+0x38b/0x1be0 [kvm] Call Trace: ? put_pid+0x3a/0x50 ? rcu_read_lock_sched_held+0x79/0x80 ? kmem_cache_free+0x2f2/0x350 kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 entry_SYSCALL_64_fastpath+0x23/0xc2 ? __this_cpu_preempt_check+0x13/0x20 The syszkaller folks reported a residual mmio emulation request to userspace due to vm86 fails to emulate inject real mode interrupt(fails to read CS) and incurs a triple fault. The vCPU returns to userspace with vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN exit reason. However, the syszkaller testcase constructs several threads to launch the same vCPU, the thread which lauch this vCPU after the thread whichs get the vcpu->mmio_needed == true and KVM_EXIT_SHUTDOWN will trigger the warning. #define _GNU_SOURCE #include <pthread.h> #include <stdio.h> #include <stdlib.h> #include <string.h> #include <sys/wait.h> #include <sys/types.h> #include <sys/stat.h> #include <sys/mman.h> #include <fcntl.h> #include <unistd.h> #include <linux/kvm.h> #include <stdio.h> int kvmcpu; struct kvm_run run; void thr(void* arg) { int res; res = ioctl(kvmcpu, KVM_RUN, 0); printf("ret1=%d exit_reason=%d suberror=%d\n", res, run->exit_reason, run->internal.suberror); return 0; } void test() { int i, kvm, kvmvm; pthread_t th[4]; kvm = open("/dev/kvm", O_RDWR); kvmvm = ioctl(kvm, KVM_CREATE_VM, 0); kvmcpu = ioctl(kvmvm, KVM_CREATE_VCPU, 0); run = (struct kvm_run*)mmap(0, 4096, PROT_READ\|PROT_WRITE, MAP_SHARED, kvmcpu, 0); srand(getpid()); for (i = 0; i < 4; i++) { pthread_create(&th[i], 0, thr, 0); usleep(rand() % 10000); } for (i = 0; i < 4; i++) pthread_join(th[i], 0); } int main() { for (;;) { int pid = fork(); if (pid < 0) exit(1); if (pid == 0) { test(); exit(0); } int status; while (waitpid(pid, &status, __WALL) != pid) {} } return 0; } This patch fixes it by resetting the vcpu->mmio_needed once we receive the triple fault to avoid the residue. Reported-by: Dmitry Vyukov <dvyukov@google.com> Tested-by: Dmitry Vyukov <dvyukov@google.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-10 16:43:55 +02:00
Masami Hiramatsu	d9f5f32a7d	kprobes/x86: Do not jump-optimize kprobes on irq entry code Since the kernel segment registers are not prepared at the entry of irq-entry code, if a kprobe on such code is jump-optimized, accessing per-CPU variables may cause a kernel panic. However, if the kprobe is not optimized, it triggers an int3 exception and sets segment registers correctly. With this patch we check the probe-address and if it is in the irq-entry code, it prohibits optimizing such kprobes. This means we can continue probing such interrupt handlers by kprobes but it is not optimized anymore. Reported-by: Francis Deslauriers <francis.deslauriers@efficios.com> Tested-by: Francis Deslauriers <francis.deslauriers@efficios.com> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: David S . Miller <davem@davemloft.net> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: linux-arch@vger.kernel.org Cc: linux-cris-kernel@axis.com Cc: mathieu.desnoyers@efficios.com Link: http://lkml.kernel.org/r/150172795654.27216.9824039077047777477.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 16:28:53 +02:00
Masami Hiramatsu	229a718605	irq: Make the irqentry text section unconditional Generate irqentry and softirqentry text sections without any Kconfig dependencies. This will add extra sections, but there should be no performace impact. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Chris Zankel <chris@zankel.net> Cc: David S . Miller <davem@davemloft.net> Cc: Francis Deslauriers <francis.deslauriers@efficios.com> Cc: Jesper Nilsson <jesper.nilsson@axis.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Mikael Starvik <starvik@axis.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: linux-arch@vger.kernel.org Cc: linux-cris-kernel@axis.com Cc: mathieu.desnoyers@efficios.com Link: http://lkml.kernel.org/r/150172789110.27216.3955739126693102122.stgit@devbox Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 16:28:53 +02:00
Bhumika Goyal	276c870547	x86/platform/intel-mid: Make 'bt_sfi_data' const Make this structure const as it is only used during copy operation. Signed-off-by: Bhumika Goyal <bhumirks@gmail.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: julia.lawall@lip6.fr Link: http://lkml.kernel.org/r/1502039720-4471-1-git-send-email-bhumirks@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 14:24:32 +02:00
Cao jin	e20f7e5e72	x86/build: Drop unused mflags-y Subarchitecture support (mflags-y) was removed from x86 in this commit: `6bda2c8b32` ("x86: remove subarchitecture support") So drop the mflags-y usage from the Makefile. Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1502105384-23214-1-git-send-email-caoj.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 14:20:15 +02:00
Josh Poimboeuf	af79ded44b	x86/asm: Fix UNWIND_HINT_REGS macro for older binutils Apparently the binutils 2.20 assembler can't handle the '&&' operator in the UNWIND_HINT_REGS macro. Rearrange the macro to do without it. This fixes the following error: arch/x86/entry/entry_64.S: Assembler messages: arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement arch/x86/entry/entry_64.S:521: Error: non-constant expression in ".if" statement Reported-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Fixes: `39358a033b` ("objtool, x86: Add facility for asm code to provide unwind hints") Link: http://lkml.kernel.org/r/e2ad97c1ae49a484644b4aaa4dd3faa4d6d969b2.1502116651.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 14:16:19 +02:00
Andy Lutomirski	603e492e86	x86/asm/32: Fix regs_get_register() on segment registers The segment register high words on x86_32 may contain garbage. Teach regs_get_register() to read them as u16 instead of unsigned long. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/0b76f6dbe477b7b1a81938fddcc3c483d48f0ff2.1502314765.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 13:14:58 +02:00
Andy Lutomirski	8a9949bc71	x86/xen/64: Rearrange the SYSCALL entries Xen's raw SYSCALL entries are much less weird than native. Rather than fudging them to look like native entries, use the Xen-provided stack frame directly. This lets us eliminate entry_SYSCALL_64_after_swapgs and two uses of the SWAPGS_UNSAFE_STACK paravirt hook. The SYSENTER code would benefit from similar treatment. This makes one change to the native code path: the compat instruction that clears the high 32 bits of %rax is moved slightly later. I'd be surprised if this affects performance at all. Tested-by: Juergen Gross <jgross@suse.com> Signed-off-by: Andy Lutomirski <luto@kernel.org> Reviewed-by: Juergen Gross <jgross@suse.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: xen-devel@lists.xenproject.org Link: http://lkml.kernel.org/r/7c88ed36805d36841ab03ec3b48b4122c4418d71.1502164668.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 13:14:32 +02:00
Ingo Molnar	1d0f49e140	Merge branch 'x86/urgent' into x86/asm, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 13:14:15 +02:00
Andy Lutomirski	e93c17301a	x86/asm/64: Clear AC on NMI entries This closes a hole in our SMAP implementation. This patch comes from grsecurity. Good catch! Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Link: http://lkml.kernel.org/r/314cc9f294e8f14ed85485727556ad4f15bb1659.1502159503.git.luto@kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 13:13:15 +02:00
Ingo Molnar	388f8e1273	Merge branch 'linus' into locking/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:20:53 +02:00
Ingo Molnar	fc33a8943e	Merge branch 'linus' into sched/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:10:19 +02:00
Janakarajan Natarajan	ab027620e9	perf/x86/amd/uncore: Get correct number of cores sharing last level cache In Family 17h, the number of cores sharing a cache level is obtained from the Cache Properties CPUID leaf (0x8000001d) by passing in the cache level in ECX. In prior families, a cache level of 2 was used to determine this information. To get the right information, irrespective of Family, iterate over the cache levels using CPUID 0x8000001d. The last level cache is the last value to return a non-zero value in EAX. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/5ab569025b39cdfaeca55b571d78c0fc800bdb69.1497452002.git.Janakarajan.Natarajan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:08:39 +02:00
Janakarajan Natarajan	910448bbed	perf/x86/amd/uncore: Rename cpufeatures macro for cache counters In Family 17h, L3 is the last level cache as opposed to L2 in previous families. Avoid this name confusion and rename X86_FEATURE_PERFCTR_L2 to X86_FEATURE_PERFCTR_LLC to indicate the performance counter on the last level of cache. Signed-off-by: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Borislav Petkov <bp@suse.de> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/016311029fdecdc3fdc13b7ed865c6cbf48b2f15.1497452002.git.Janakarajan.Natarajan@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:08:38 +02:00
Ingo Molnar	1ccb2f4e8e	Merge branch 'perf/urgent' into perf/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:02:26 +02:00
Peter Zijlstra	bfe334924c	perf/x86: Fix RDPMC vs. mm_struct tracking Vince reported the following rdpmc() testcase failure: > Failing test case: > > fd=perf_event_open(); > addr=mmap(fd); > exec() // without closing or unmapping the event > fd=perf_event_open(); > addr=mmap(fd); > rdpmc() // GPFs due to rdpmc being disabled The problem is of course that exec() plays tricks with what is current->mm, only destroying the old mappings after having installed the new mm. Fix this confusion by passing along vma->vm_mm instead of relying on current->mm. Reported-by: Vince Weaver <vincent.weaver@maine.edu> Tested-by: Vince Weaver <vincent.weaver@maine.edu> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: stable@vger.kernel.org Fixes: `1e0fb9ec67` ("perf: Add pmu callbacks to track event mapping and unmapping") Link: http://lkml.kernel.org/r/20170802173930.cstykcqefmqt7jau@hirez.programming.kicks-ass.net [ Minor cleanups. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-08-10 12:01:08 +02:00
Daniel Borkmann	52afc51e94	bpf, x86: implement jiting of BPF_J{LT,LE,SLT,SLE} This work implements jiting of BPF_J{LT,LE,SLT,SLE} instructions with BPF_X/BPF_K variants for the x86_64 eBPF JIT. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>	2017-08-09 16:53:56 -07:00
megha.dey@linux.intel.com	8861249c74	crypto: x86/sha1 - Fix reads beyond the number of blocks passed It was reported that the sha1 AVX2 function(sha1_transform_avx2) is reading ahead beyond its intended data, and causing a crash if the next block is beyond page boundary: http://marc.info/?l=linux-crypto-vger&m=149373371023377 This patch makes sure that there is no overflow for any buffer length. It passes the tests written by Jan Stancek that revealed this problem: https://github.com/jstancek/sha1-avx2-crash I have re-enabled sha1-avx2 by reverting commit `b82ce24426` Cc: <stable@vger.kernel.org> Fixes: `b82ce24426` ("crypto: sha1-ssse3 - Disable avx2") Originally-by: Ilya Albrekht <ilya.albrekht@intel.com> Tested-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Megha Dey <megha.dey@linux.intel.com> Reported-by: Jan Stancek <jstancek@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2017-08-09 20:01:37 +08:00
Longpeng(Mike)	de63ad4cf4	KVM: X86: implement the logic for spinlock optimization get_cpl requires vcpu_load, so we must cache the result (whether the vcpu was preempted when its cpl=0) in kvm_vcpu_arch. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-08 10:57:43 +02:00
Longpeng(Mike)	199b5763d3	KVM: add spinlock optimization framework If a vcpu exits due to request a user mode spinlock, then the spinlock-holder may be preempted in user mode or kernel mode. (Note that not all architectures trap spin loops in user mode, only AMD x86 and ARM/ARM64 currently do). But if a vcpu exits in kernel mode, then the holder must be preempted in kernel mode, so we should choose a vcpu in kernel mode as a more likely candidate for the lock holder. This introduces kvm_arch_vcpu_in_kernel() to decide whether the vcpu is in kernel-mode when it's preempted. kvm_vcpu_on_spin's new argument says the same of the spinning VCPU. Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-08 10:57:43 +02:00
Radim Krčmář	1b4d56b86a	KVM: x86: use general helpers for some cpuid manipulation Add guest_cpuid_clear() and use it instead of kvm_find_cpuid_entry(). Also replace some uses of kvm_find_cpuid_entry() with guest_cpuid_has(). Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-07 16:16:30 +02:00
Radim Krčmář	d6321d4933	KVM: x86: generalize guest_cpuid_has_ helpers This patch turns guest_cpuid_has_XYZ(cpuid) into guest_cpuid_has(cpuid, X86_FEATURE_XYZ), which gets rid of many very similar helpers. When seeing a X86_FEATURE_*, we can know which cpuid it belongs to, but this information isn't in common code, so we recreate it for KVM. Add some BUILD_BUG_ONs to make sure that it runs nicely. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-07 16:11:50 +02:00
Radim Krčmář	c6bd18011f	KVM: x86: X86_FEATURE_NRIPS is not scattered anymore bit(X86_FEATURE_NRIPS) is 3 since `2ccd71f1b2` ("x86/cpufeature: Move some of the scattered feature bits to x86_capability"), so we can simplify the code. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-07 16:09:38 +02:00
Bandan Das	41ab937274	KVM: nVMX: Emulate EPTP switching for the L1 hypervisor When L2 uses vmfunc, L0 utilizes the associated vmexit to emulate a switching of the ept pointer by reloading the guest MMU. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-07 15:29:22 +02:00
Bandan Das	27c42a1bb8	KVM: nVMX: Enable VMFUNC for the L1 hypervisor Expose VMFUNC in MSRs and VMCS fields. No actual VMFUNCs are enabled. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-07 15:29:21 +02:00
Bandan Das	2a499e49c2	KVM: vmx: Enable VMFUNCs Enable VMFUNC in the secondary execution controls. This simplifies the changes necessary to expose it to nested hypervisors. VMFUNCs still cause #UD when invoked. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Bandan Das <bsd@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-07 15:29:20 +02:00
David Hildenbrand	53a70daf3c	KVM: nVMX: get rid of nested_release_page* Let's also just use the underlying functions directly here. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> [Rebased on top of `9f744c5974` ("KVM: nVMX: do not pin the VMCS12")] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-07 15:29:20 +02:00
David Hildenbrand	5e2f30b756	KVM: nVMX: get rid of nested_get_page() nested_get_page() just sounds confusing. All we want is a page from G1. This is even unrelated to nested. Let's introduce kvm_vcpu_gpa_to_page() so we don't get too lengthy lines. Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com> [Squash pasto fix from Wanpeng Li. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-07 15:27:00 +02:00
Paolo Bonzini	90a2db6d86	KVM: nVMX: INVPCID support Expose the "Enable INVPCID" secondary execution control to the guest and properly reflect the exit reason. In addition, before this patch the guest was always running with INVPCID enabled, causing pcid.flat's "Test on INVPCID when disabled" test to fail. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-07 15:26:06 +02:00
Ladi Prosek	72c139bacf	KVM: hyperv: support HV_X64_MSR_TSC_FREQUENCY and HV_X64_MSR_APIC_FREQUENCY It has been experimentally confirmed that supporting these two MSRs is one of the necessary conditions for nested Hyper-V to use the TSC page. Modern Windows guests are noticeably slower when they fall back to reading timestamps from the HV_X64_MSR_TIME_REF_COUNT MSR instead of using the TSC page. The newly supported MSRs are advertised with the AccessFrequencyRegs partition privilege flag and CPUID.40000003H:EDX[8] "Support for determining timer frequencies is available" (both outside of the scope of this KVM patch). Reviewed-by: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-07 15:26:06 +02:00
Linus Torvalds	6999507416	KVM fixes for v4.13-rc4 ARM: - Yet another race with VM destruction plugged - A set of small vgic fixes x86: - Preserve pending INIT - RCU fixes in paravirtual async pf, VM teardown, and VMXOFF emulation - nVMX interrupt injection and dirty tracking fixes - initialize to make UBSAN happy -----BEGIN PGP SIGNATURE----- iQEcBAABCAAGBQJZhNZ2AAoJEED/6hsPKofoKDEH/iIw1pcgdEW2NP/kFtKXSMCK josdFwGPQMjBGzx6No4tfMCNDOjW2FKYXapN6CASAqMJo5H2krj8VHMVwm0h3lUl 4RdbbkFTdfl/Znp8M39efFheWrjX+L37AltKV7xAgA7n8cO39KV4RReimzSc7aVq 5dDt4k0dbF9/zXHxkGiKEhwaSSbZEEznQQ/09annSoOVe6om5esUrUtnUF5P99uz IhAsmJbZxE5VmowjT5MjaR1mXSLLNL55HWKvkf3B3ZGnyxQU+3Vz7IGf2Ma2j+jV IrdXA11NHDY1anDYgDhFlr3rTCPu9CBmTv4O8zsDRlX9TGpr8bBX2dvjRKl7uOo= =KM80 -----END PGP SIGNATURE----- Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm Pull KVM fixes from Radim Krčmář: "ARM: - Yet another race with VM destruction plugged - A set of small vgic fixes x86: - Preserve pending INIT - RCU fixes in paravirtual async pf, VM teardown, and VMXOFF emulation - nVMX interrupt injection and dirty tracking fixes - initialize to make UBSAN happy" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: arm/arm64: vgic: Use READ_ONCE fo cmpxchg KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit" KVM: nVMX: mark vmcs12 pages dirty on L2 exit kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown KVM: nVMX: do not pin the VMCS12 KVM: avoid using rcu_dereference_protected KVM: X86: init irq->level in kvm_pv_kick_cpu_op KVM: X86: Fix loss of pending INIT due to race KVM: async_pf: make rcu irq exit if not triggered from idle task KVM: nVMX: fixes to nested virt interrupt injection KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12 KVM: arm/arm64: Handle hva aging while destroying the vm KVM: arm/arm64: PMU: Fix overflow interrupt injection KVM: arm/arm64: Fix bug in advertising KVM_CAP_MSI_DEVID capability	2017-08-04 15:18:27 -07:00
Linus Torvalds	0d5b994407	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fix from Thomas Gleixner: "The recent irq core changes unearthed API abuse in the HPET code, which manifested itself in a suspend/resume regression. The fix replaces the cruft with the proper function calls and cures the regression" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/hpet: Cure interface abuse in the resume path	2017-08-04 15:16:09 -07:00
Ard Biesheuvel	45fe93dff2	crypto: algapi - make crypto_xor() take separate dst and src arguments There are quite a number of occurrences in the kernel of the pattern if (dst != src) memcpy(dst, src, walk.total % AES_BLOCK_SIZE); crypto_xor(dst, final, walk.total % AES_BLOCK_SIZE); or crypto_xor(keystream, src, nbytes); memcpy(dst, keystream, nbytes); where crypto_xor() is preceded or followed by a memcpy() invocation that is only there because crypto_xor() uses its output parameter as one of the inputs. To avoid having to add new instances of this pattern in the arm64 code, which will be refactored to implement non-SIMD fallbacks, add an alternative implementation called crypto_xor_cpy(), taking separate input and output arguments. This removes the need for the separate memcpy(). Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>	2017-08-04 09:27:15 +08:00
Lukas Wunner	630b3aff8a	treewide: Consolidate Apple DMI checks We're about to amend ACPI bus scan with DMI checks whether we're running on a Mac to support Apple device properties in AML. The DMI checks are performed for every single device, adding overhead for everything x86 that isn't Apple, which is the majority. Rafael and Andy therefore request to perform the DMI match only once and cache the result. Outside of ACPI various other Apple DMI checks exist and it seems reasonable to use the cached value there as well. Rafael, Andy and Darren suggest performing the DMI check in arch code and making it available with a header in include/linux/platform_data/x86/. To this end, add early_platform_quirks() to arch/x86/kernel/quirks.c to perform the DMI check and invoke it from setup_arch(). Switch over all existing Apple DMI checks, thereby fixing two deficiencies: * They are now #defined to false on non-x86 arches and can thus be optimized away if they're located in cross-arch code. * Some of them only match "Apple Inc." but not "Apple Computer, Inc.", which is used by BIOSes released between January 2006 (when the first x86 Macs started shipping) and January 2007 (when the company name changed upon introduction of the iPhone). Suggested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Suggested-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Suggested-by: Darren Hart <dvhart@infradead.org> Signed-off-by: Lukas Wunner <lukas@wunner.de> Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-08-03 23:26:22 +02:00
Rafael J. Wysocki	8a05c3115d	Merge branches 'pm-cpufreq-x86', 'pm-cpufreq-docs' and 'intel_pstate' * pm-cpufreq-x86: cpufreq: x86: Make scaling_cur_freq behave more as expected * pm-cpufreq-docs: cpufreq: docs: Add missing cpuinfo_cur_freq description * intel_pstate: cpufreq: intel_pstate: Drop ->get from intel_pstate structure	2017-08-03 20:29:24 +02:00
Wanpeng Li	6550c4df7e	KVM: nVMX: Fix interrupt window request with "Acknowledge interrupt on exit" ------------[ cut here ]------------ WARNING: CPU: 5 PID: 2288 at arch/x86/kvm/vmx.c:11124 nested_vmx_vmexit+0xd64/0xd70 [kvm_intel] CPU: 5 PID: 2288 Comm: qemu-system-x86 Not tainted 4.13.0-rc2+ #7 RIP: 0010:nested_vmx_vmexit+0xd64/0xd70 [kvm_intel] Call Trace: vmx_check_nested_events+0x131/0x1f0 [kvm_intel] ? vmx_check_nested_events+0x131/0x1f0 [kvm_intel] kvm_arch_vcpu_ioctl_run+0x5dd/0x1be0 [kvm] ? vmx_vcpu_load+0x1be/0x220 [kvm_intel] ? kvm_arch_vcpu_load+0x62/0x230 [kvm] kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x8f/0x750 ? trace_hardirqs_on_thunk+0x1a/0x1c entry_SYSCALL64_slow_path+0x25/0x25 This can be reproduced by booting L1 guest w/ 'noapic' grub parameter, which means that tells the kernel to not make use of any IOAPICs that may be present in the system. Actually external_intr variable in nested_vmx_vmexit() is the req_int_win variable passed from vcpu_enter_guest() which means that the L0's userspace requests an irq window. I observed the scenario (!kvm_cpu_has_interrupt(vcpu) && L0's userspace reqeusts an irq window) is true, so there is no interrupt which L1 requires to inject to L2, we should not attempt to emualte "Acknowledge interrupt on exit" for the irq window requirement in this scenario. This patch fixes it by not attempt to emulate "Acknowledge interrupt on exit" if there is no L1 requirement to inject an interrupt to L2. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Added code comment to make it obvious that the behavior is not correct. We should do a userspace exit with open interrupt window instead of the nested VM exit. This patch still improves the behavior, so it was accepted as a (temporary) workaround.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-03 15:38:11 +02:00
David Matlack	c9f04407f2	KVM: nVMX: mark vmcs12 pages dirty on L2 exit The host physical addresses of L1's Virtual APIC Page and Posted Interrupt descriptor are loaded into the VMCS02. The CPU may write to these pages via their host physical address while L2 is running, bypassing address-translation-based dirty tracking (e.g. EPT write protection). Mark them dirty on every exit from L2 to prevent them from getting out of sync with dirty tracking. Also mark the virtual APIC page and the posted interrupt descriptor dirty when KVM is virtualizing posted interrupt processing. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-02 22:41:04 +02:00
David Matlack	8ca44e88c3	kvm: nVMX: don't flush VMCS12 during VMXOFF or VCPU teardown According to the Intel SDM, software cannot rely on the current VMCS to be coherent after a VMXOFF or shutdown. So this is a valid way to handle VMCS12 flushes. 24.11.1 Software Use of Virtual-Machine Control Structures ... If a logical processor leaves VMX operation, any VMCSs active on that logical processor may be corrupted (see below). To prevent such corruption of a VMCS that may be used either after a return to VMX operation or on another logical processor, software should execute VMCLEAR for that VMCS before executing the VMXOFF instruction or removing power from the processor (e.g., as part of a transition to the S3 and S4 power states). ... This fixes a "suspicious rcu_dereference_check() usage!" warning during kvm_vm_release() because nested_release_vmcs12() calls kvm_vcpu_write_guest_page() without holding kvm->srcu. Signed-off-by: David Matlack <dmatlack@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-02 22:41:03 +02:00
Paolo Bonzini	9f744c5974	KVM: nVMX: do not pin the VMCS12 Since the current implementation of VMCS12 does a memcpy in and out of guest memory, we do not need current_vmcs12 and current_vmcs12_page anymore. current_vmptr is enough to read and write the VMCS12. And David Matlack noted: This patch also fixes dirty tracking (memslot->dirty_bitmap) of the VMCS12 page by using kvm_write_guest. nested_release_page() only marks the struct page dirty. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> [Added David Matlack's note and nested_release_page_clean() fix.] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-02 22:41:03 +02:00
Longpeng(Mike)	ebd28fcb55	KVM: X86: init irq->level in kvm_pv_kick_cpu_op 'lapic_irq' is a local variable and its 'level' field isn't initialized, so 'level' is random, it doesn't matter but makes UBSAN unhappy: UBSAN: Undefined behaviour in .../lapic.c:... load of value 10 is not a valid value for type '_Bool' ... Call Trace: [<ffffffff81f030b6>] dump_stack+0x1e/0x20 [<ffffffff81f03173>] ubsan_epilogue+0x12/0x55 [<ffffffff81f03b96>] __ubsan_handle_load_invalid_value+0x118/0x162 [<ffffffffa1575173>] kvm_apic_set_irq+0xc3/0xf0 [kvm] [<ffffffffa1575b20>] kvm_irq_delivery_to_apic_fast+0x450/0x910 [kvm] [<ffffffffa15858ea>] kvm_irq_delivery_to_apic+0xfa/0x7a0 [kvm] [<ffffffffa1517f4e>] kvm_emulate_hypercall+0x62e/0x760 [kvm] [<ffffffffa113141a>] handle_vmcall+0x1a/0x30 [kvm_intel] [<ffffffffa114e592>] vmx_handle_exit+0x7a2/0x1fa0 [kvm_intel] ... Signed-off-by: Longpeng(Mike) <longpeng2@huawei.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-02 22:41:01 +02:00
Wanpeng Li	f4ef191086	KVM: X86: Fix loss of pending INIT due to race When SMP VM start, AP may lost INIT because of receiving INIT between kvm_vcpu_ioctl_x86_get/set_vcpu_events. vcpu 0 vcpu 1 kvm_vcpu_ioctl_x86_get_vcpu_events events->smi.latched_init = 0 send INIT to vcpu1 set vcpu1's pending_events kvm_vcpu_ioctl_x86_set_vcpu_events if (events->smi.latched_init == 0) clear INIT in pending_events This patch fixes it by just update SMM related flags if we are in SMM. Thanks Peng Hao for the report and original commit message. Reported-by: Peng Hao <peng.hao2@zte.com.cn> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-02 22:41:01 +02:00
Fenghua Yu	0dd2d7494c	x86/intel_rdt: Show bitmask of shareable resource with other executing units CPUID.(EAX=0x10, ECX=res#):EBX[31:0] reports a bit mask for a resource. Each set bit within the length of the CBM indicates the corresponding unit of the resource allocation may be used by other entities in the platform (e.g. an integrated graphics engine or hardware units outside the processor core and have direct access to the resource). Each cleared bit within the length of the CBM indicates the corresponding allocation unit can be configured to implement a priority-based allocation scheme without interference with other hardware agents in the system. Bits outside the length of the CBM are reserved. More details on the bit mask are described in x86 Software Developer's Manual. The bitmask is shown in "info" directory for each resource. It's up to user to decide how to use the bitmask within a CBM in a partition to share or isolate a resource with other executing units. Suggested-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fenghua Yu <fenghua.yu@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: vikas.shivappa@linux.intel.com Link: http://lkml.kernel.org/r/20170725223904.12996-1-tony.luck@intel.com	2017-08-01 22:41:30 +02:00
Vikas Shivappa	e33026831b	x86/intel_rdt/mbm: Handle counter overflow Set up a delayed work queue for each domain that will read all the MBM counters of active RMIDs once per second to make sure that they don't wrap around between reads from users. [Tony: Added the initializations for the work structure and completed the patch] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-29-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:29 +02:00
Vikas Shivappa	a4de1dfdd7	x86/intel_rdt/mbm: Add mbm counter initialization MBM counters are monotonically increasing counts representing the total memory bytes at a particular time. In order to calculate total_bytes for an rdtgroup, we store the value of the counter when we create an rdtgroup or when a new domain comes online. When the total_bytes(all memory controller bytes) or local_bytes(local memory controller bytes) file in "mon_data" is read it shows the total bytes for that rdtgroup since its creation. User can snapshot this at different time intervals to obtain bytes/second. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-28-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:29 +02:00
Tony Luck	9f52425ba3	x86/intel_rdt/mbm: Basic counting of MBM events (total and local) Check CPUID bits for whether each of the MBM events is supported. Allocate space for each RMID for each counter in each domain to save previous MSR counter value and running total of data. Create files in each of the monitor directories. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-27-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:28 +02:00
Vikas Shivappa	895c663ece	x86/intel_rdt/cqm: Add CPU hotplug support Resource groups have a per domain directory under "mon_data". Add or remove these directories as and when domains come online and go offline. Also update the per cpu rmids and cache upon onlining and offlining cpus. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-26-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:28 +02:00
Vikas Shivappa	748b6b881c	x86/intel_rdt/cqm: Add sched_in support OS associates an RMID/CLOSid to a task by writing the per CPU IA32_PQR_ASSOC MSR when a task is scheduled in. The sched_in code will stay as no-op unless we are running on Intel SKU which supports either resource control or monitoring and we also enable them by mounting the resctrl fs. The per cpu CLOSid/RMID values are cached and the write is performed only when a task with a different CLOSid/RMID is scheduled in. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-25-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:28 +02:00
Vikas Shivappa	4be6c07842	x86/intel_rdt: Introduce rdt_enable_key for scheduling Introduce the usage of rdt_enable_key in sched_in code as a preparation to add RDT monitoring support for sched_in. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-24-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:27 +02:00
Vikas Shivappa	4af4a88e0c	x86/intel_rdt/cqm: Add mount,umount support Add monitoring support during mount and unmount. Since root directory is a "ctrl_mon" directory which can control and monitor resources create the "mon_groups" directory which can hold monitor groups and a "mon_data" directory which would hold all monitoring data like the rest of resource groups. The mount succeeds if either of monitoring or control/allocation is enabled. If only monitoring is enabled user can still create monitor groups under the "/sys/fs/resctrl/mon_groups/" and any mkdir under root would fail. If only control/allocation is enabled all of the monitoring related directories/files would not exist and resctrl would work in legacy mode. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-23-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:27 +02:00
Vikas Shivappa	f3cbeacaa0	x86/intel_rdt/cqm: Add rmdir support Resource groups (ctrl_mon and monitor groups) are represented by directories in resctrl fs. Add support to remove the directories. When a ctrl_mon directory is removed all the cpus and tasks are assigned back to the root rdtgroup. When a monitor group is removed the cpus and tasks are returned to the parent control group. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-22-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:27 +02:00
Vikas Shivappa	f9049547f7	x86/intel_rdt: Separate the ctrl bits from rmdir Re-factor the code to separate the ctrl group removal from the rmdir to prepare to add RDT monitoring group removal. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-21-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:26 +02:00
Vikas Shivappa	d89b737901	x86/intel_rdt/cqm: Add mon_data Add a mon_data directory for the root rdtgroup and all other rdtgroups. The directory holds all of the monitored data for all domains and events of all resources being monitored. The mon_data itself has a list of directories in the format mon_<domain_name>_<domain_id>. Each of these subdirectories contain one file per event in the mode "0444". Reading the file displays a snapshot of the monitored data for the event the file represents. For ex, on a 2 socket Broadwell with llc_occupancy being monitored the mon_data contents look as below: $ ls /sys/fs/resctrl/p1/mon_data/ mon_L3_00 mon_L3_01 Each domain directory has one file per event: $ ls /sys/fs/resctrl/p1/mon_data/mon_L3_00/ llc_occupancy To read current llc_occupancy of ctrl_mon group p1 $ cat /sys/fs/resctrl/p1/mon_data/mon_L3_00/llc_occupancy 33789096 [This patch idea is based on Tony's sample patches to organise data in a per domain directory and have one file per event (and use the fp->priv to store mon data bits)] Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-20-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:26 +02:00
Vikas Shivappa	90c403e831	x86/intel_rdt: Prepare for RDT monitor data support Rename the intel_rdt_schemata file to intel_rdt_ctrlmondata as we now want to add support for RDT monitoring data for the events that are supported in later patches. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-19-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:25 +02:00
Vikas Shivappa	a9fcf8627d	x86/intel_rdt/cqm: Add cpus file support The cpus file is extended to support resource monitoring. This is used to over-ride the RMID of the default group when running on specific CPUs. It works similar to the resource control. The "cpus" and "cpus_list" file is present in default group, ctrl_mon groups and monitor groups. Each "cpus" file or cpu_list file reads a cpumask or list showing which CPUs belong to the resource group. By default all online cpus belong to the default root group. A CPU can be present in one "ctrl_mon" and one "monitor" group simultaneously. They can be added to a resource group by writing the CPU to the file. When a CPU is added to a ctrl_mon group it is automatically removed from the previous ctrl_mon group. A CPU can be added to a monitor group only if it is present in the parent ctrl_mon group and when a CPU is added to a monitor group, it is automatically removed from the previous monitor group. When CPUs go offline, they are automatically removed from the ctrl_mon and monitor groups. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-18-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:25 +02:00
Vikas Shivappa	b09d981b3f	x86/intel_rdt: Prepare to add RDT monitor cpus file support Separate the ctrl cpus file handling from the generic cpus file handling and convert the per cpu closid from u32 to a struct which will be used later to add rmid to the same struct. Also cleanup some name space. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-17-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:25 +02:00
Vikas Shivappa	d6aaba615a	x86/intel_rdt/cqm: Add tasks file support The root directory, ctrl_mon and monitor groups are populated with a read/write file named "tasks". When read, it shows all the task IDs assigned to the resource group. Tasks can be added to groups by writing the PID to the file. A task can be present in one "ctrl_mon" group "and" one "monitor" group. IOW a PID_x can be seen in a ctrl_mon group and a monitor group at the same time. When a task is added to a ctrl_mon group, it is automatically removed from the previous ctrl_mon group where it belonged. Similarly if a task is moved to a monitor group it is removed from the previous monitor group . Also since the monitor groups can only have subset of tasks of parent ctrl_mon group, a task can be moved to a monitor group only if its already present in the parent ctrl_mon group. Task membership is indicated by a new field in the task_struct "u32 rmid" which holds the RMID for the task. RMID=0 is reserved for the default root group where the tasks belong to at mount. [tony: zero the rmid if rdtgroup was deleted when task was being moved] Signed-off-by: Tony Luck <tony.luck@linux.intel.com> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-16-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:24 +02:00
Vikas Shivappa	0734ded1ab	x86/intel_rdt: Change closid type from int to u32 OS associates a CLOSid(Class of service id) to a task by writing the high 32 bits of per CPU IA32_PQR_ASSOC MSR when a task is scheduled in. CPUID.(EAX=10H, ECX=1):EDX[15:0] enumerates the max CLOSID supported and it is zero indexed. Hence change the type to u32 from int. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-15-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:24 +02:00
Vikas Shivappa	c7d9aac613	x86/intel_rdt/cqm: Add mkdir support for RDT monitoring Resource control groups can be created using mkdir in resctrl fs(rdtgroup). In order to extend the resctrl interface to support monitoring the control groups, extend the current mkdir to support resource monitoring also. This allows the rdtgroup created under the root directory to be able to both control and monitor resources (ctrl_mon group). The ctrl_mon groups are associated with one CLOSID like the legacy rdtgroups and one RMID(Resource monitoring ID) as well. Hardware uses RMID to track the resource usage. Once either of the CLOSID or RMID are exhausted, the mkdir fails with -ENOSPC. If there are RMIDs in limbo list but not free an -EBUSY is returned. User can also monitor a subset of the ctrl_mon rdtgroup's tasks/cpus using the monitor groups. The monitor groups are created using mkdir under the "mon_groups" directory in every ctrl_mon group. [Merged Tony's code: Removed a lot of common mkdir code, a fix to handling of the list of the child rdtgroups and some cleanups in list traversal. Also the changes to have similar alloc and free for CLOS/RMID and return -EBUSY when RMIDs are in limbo and not free] Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-14-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:23 +02:00
Vikas Shivappa	65b4f40305	x86/intel_rdt: Prepare for RDT monitoring mkdir support Separate the ctrl mkdir code from the rest in order to prepare for adding support for RDT monitoring mkdir support as well. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-13-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:23 +02:00
Vikas Shivappa	d4ab332010	x86/intel_rdt/cqm: Add info files for RDT monitoring Add info directory files specific to RDT monitoring. num_rmids: The number of RMIDs which are valid for the resource. mon_features: Lists the monitoring events if monitoring is enabled for the resource. max_threshold_occupancy: This is specific to llc_occupancy monitoring and is used to determine if an RMID can be reused. Provides an upper bound on the threshold and is shown to the user in bytes though the internal value will be rounded to the scaling factor supported by the h/w. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-12-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:22 +02:00
Tony luck	5dc1d5c6ba	x86/intel_rdt: Simplify info and base file lists The info directory files and base files need to be different for each resource like cache and Memory bandwidth. With in each resource, the files would be further different for monitoring and ctrl. This leads to a lot of different static array declarations given that we are adding resctrl monitoring. Simplify this to one common list of files and then declare a set of flags to choose the files based on the resource, whether it is info or base and if it is control type file. This is as a preparation to include monitoring based info and base files. No functional change. [Vikas: Extended the flags to have few bits per category like resource, info/base etc] Signed-off-by: Tony luck <tony.luck@intel.com> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-11-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:22 +02:00
Vikas Shivappa	edf6fa1c4a	x86/intel_rdt/cqm: Add RMID (Resource monitoring ID) management Hardware uses RMID(Resource monitoring ID) to keep track of each of the RDT events associated with tasks. The number of RMIDs is dependent on the SKU and is enumerated via CPUID. We add support to manage the RMIDs which include managing the RMID allocation and reading LLC occupancy for an RMID. RMID allocation is managed by keeping a free list which is initialized to all available RMIDs except for RMID 0 which is always reserved for root group. RMIDs goto a limbo list once they are freed since the RMIDs are still tagged to cache lines of the tasks which were using them - thereby still having some occupancy. They continue to be in limbo list until the occupancy < threshold_occupancy. The threshold_occupancy is a user configurable value. OS uses IA32_QM_CTR MSR to read the occupancy associated with an RMID after programming the IA32_EVENTSEL MSR with the RMID. [Tony: Improved limbo search] Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-10-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:21 +02:00
Vikas Shivappa	6a445edce6	x86/intel_rdt/cqm: Add RDT monitoring initialization Add common data structures for RDT resource monitoring and perform RDT monitoring related data structure initializations which include setting up the RMID(Resource monitoring ID) lists and event list which the resource supports. [ tony: some cleanup to make adding MBM easier later, remove "cqm" from some names, make some data structure local to intel_rdt_monitor.c static. Add copyright header] [ tglx: Made it readable ] Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-9-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:21 +02:00
Vikas Shivappa	dd131853f3	x86/intel_rdt: Make rdt_resources_all more readable Change the format of the global rdt_resources_all. This holds all the RDT resource structure initialization values. Make this more readable by using the format: rdt_resources_all[] = { [<resource_index>] = {... } ... } Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-8-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:20 +02:00
Vikas Shivappa	1b5c0b7583	x86/intel_rdt: Cleanup namespace to support RDT monitoring Few of the data-structures have generic names although they are RDT allocation specific. Rename them to be allocation specific to accommodate RDT monitoring. E.g. s/enabled/alloc_enabled/ No functional change. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-7-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:20 +02:00
Reinette Chatre	cb2200e967	x86/intel_rdt: Mark rdt_root and closid_alloc as static Sparse reports that both of these can be static. Make it so. Signed-off-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Link: http://lkml.kernel.org/r/1501017287-28083-6-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:20 +02:00
Vikas Shivappa	0583020456	x86/intel_rdt: Change file names to accommodate RDT monitor code Because the "perf cqm" and resctrl code were separately added and indivdually configurable, there seem to be separate context switch code and also things on global .h which are not really needed. Move only the scheduling specific code and definitions to <asm/intel_rdt_sched.h> and the put all the other declarations to a local intel_rdt.h. h/t to Reinette Chatre for pointing out that we should separate the public interfaces used by other parts of the kernel from private objects shared between the various files comprising RDT. No functional change. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-5-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:19 +02:00
Vikas Shivappa	f01d7d51f5	x86/intel_rdt: Introduce a common compile option for RDT We currently have a CONFIG_RDT_A which is for RDT(Resource directory technology) allocation based resctrl filesystem interface. As a preparation to add support for RDT monitoring as well into the same resctrl filesystem, change the config option to be CONFIG_RDT which would include both RDT allocation and monitoring code. No functional change. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-4-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:19 +02:00
Vikas Shivappa	c39a0e2c88	x86/perf/cqm: Wipe out perf based cqm 'perf cqm' never worked due to the incompatibility between perf infrastructure and cqm hardware support. The hardware uses RMIDs to track the llc occupancy of tasks and these RMIDs are per package. This makes monitoring a hierarchy like cgroup along with monitoring of tasks separately difficult and several patches sent to lkml to fix them were NACKed. Further more, the following issues in the current perf cqm make it almost unusable: 1. No support to monitor the same group of tasks for which we do allocation using resctrl. 2. It gives random and inaccurate data (mostly 0s) once we run out of RMIDs due to issues in Recycling. 3. Recycling results in inaccuracy of data because we cannot guarantee that the RMID was stolen from a task when it was not pulling data into cache or even when it pulled the least data. Also for monitoring llc_occupancy, if we stop using an RMID_x and then start using an RMID_y after we reclaim an RMID from an other event, we miss accounting all the occupancy that was tagged to RMID_x at a later perf_count. 2. Recycling code makes the monitoring code complex including scheduling because the event can lose RMID any time. Since MBM counters count bandwidth for a period of time by taking snap shot of total bytes at two different times, recycling complicates the way we count MBM in a hierarchy. Also we need a spin lock while we do the processing to account for MBM counter overflow. We also currently use a spin lock in scheduling to prevent the RMID from being taken away. 4. Lack of support when we run different kind of event like task, system-wide and cgroup events together. Data mostly prints 0s. This is also because we can have only one RMID tied to a cpu as defined by the cqm hardware but a perf can at the same time tie multiple events during one sched_in. 5. No support of monitoring a group of tasks. There is partial support for cgroup but it does not work once there is a hierarchy of cgroups or if we want to monitor a task in a cgroup and the cgroup itself. 6. No support for monitoring tasks for the lifetime without perf overhead. 7. It reported the aggregate cache occupancy or memory bandwidth over all sockets. But most cloud and VMM based use cases want to know the individual per-socket usage. Signed-off-by: Vikas Shivappa <vikas.shivappa@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Cc: ravi.v.shankar@intel.com Cc: tony.luck@intel.com Cc: fenghua.yu@intel.com Cc: peterz@infradead.org Cc: eranian@google.com Cc: vikas.shivappa@intel.com Cc: ak@linux.intel.com Cc: davidcc@google.com Cc: reinette.chatre@intel.com Link: http://lkml.kernel.org/r/1501017287-28083-2-git-send-email-vikas.shivappa@linux.intel.com	2017-08-01 22:41:18 +02:00
Wanpeng Li	337c017ccd	KVM: async_pf: make rcu irq exit if not triggered from idle task WARNING: CPU: 5 PID: 1242 at kernel/rcu/tree_plugin.h:323 rcu_note_context_switch+0x207/0x6b0 CPU: 5 PID: 1242 Comm: unity-settings- Not tainted 4.13.0-rc2+ #1 RIP: 0010:rcu_note_context_switch+0x207/0x6b0 Call Trace: __schedule+0xda/0xba0 ? kvm_async_pf_task_wait+0x1b2/0x270 schedule+0x40/0x90 kvm_async_pf_task_wait+0x1cc/0x270 ? prepare_to_swait+0x22/0x70 do_async_page_fault+0x77/0xb0 ? do_async_page_fault+0x77/0xb0 async_page_fault+0x28/0x30 RIP: 0010:__d_lookup_rcu+0x90/0x1e0 I encounter this when trying to stress the async page fault in L1 guest w/ L2 guests running. Commit `9b132fbe54` (Add rcu user eqs exception hooks for async page fault) adds rcu_irq_enter/exit() to kvm_async_pf_task_wait() to exit cpu idle eqs when needed, to protect the code that needs use rcu. However, we need to call the pair even if the function calls schedule(), as seen from the above backtrace. This patch fixes it by informing the RCU subsystem exit/enter the irq towards/away from idle for both n.halted and !n.halted. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: stable@vger.kernel.org Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-01 22:24:18 +02:00
Paolo Bonzini	b96fb43977	KVM: nVMX: fixes to nested virt interrupt injection There are three issues in nested_vmx_check_exception: 1) it is not taking PFEC_MATCH/PFEC_MASK into account, as reported by Wanpeng Li; 2) it should rebuild the interruption info and exit qualification fields from scratch, as reported by Jim Mattson, because the values from the L2->L0 vmexit may be invalid (e.g. if an emulated instruction causes a page fault, the EPT misconfig's exit qualification is incorrect). 3) CR2 and DR6 should not be written for exception intercept vmexits (CR2 only for AMD). This patch fixes the first two and adds a comment about the last, outlining the fix. Cc: Jim Mattson <jmattson@google.com> Cc: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-08-01 22:24:17 +02:00
Paolo Bonzini	7313c69805	KVM: nVMX: do not fill vm_exit_intr_error_code in prepare_vmcs12 Do this in the caller of nested_vmx_vmexit instead. nested_vmx_check_exception was doing a vmwrite to the vmcs02's VM_EXIT_INTR_ERROR_CODE field, so that prepare_vmcs12 would move the field to vmcs12->vm_exit_intr_error_code. However that isn't possible on pre-Haswell machines. Moving the vmcs12 write to the callers fixes it. Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> [Changed nested_vmx_reflect_vmexit() return type to (int)1 from (bool)1, thanks to fengguang.wu@intel.com] Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>	2017-08-01 22:23:25 +02:00
Thomas Gleixner	bb68cfe2f5	x86/hpet: Cure interface abuse in the resume path The HPET resume path abuses irq_domain_[de]activate_irq() to restore the MSI message in the HPET chip for the boot CPU on resume and it relies on an implementation detail of the interrupt core code, which magically makes the HPET unmask call invoked via a irq_disable/enable pair. This worked as long as the irq code did unconditionally invoke the unmask() callback. With the recent changes which keep track of the masked state to avoid expensive hardware access, this does not longer work. As a consequence the HPET timer interrupts are not unmasked which breaks resume as the boot CPU waits forever that a timer interrupt arrives. Make the restore of the MSI message explicit and invoke the unmask() function directly. While at it get rid of the pointless affinity setting as nothing can change the affinity of the interrupt and the vector across suspend/resume. The restore of the MSI message reestablishes the previous affinity setting which is the correct one. Fixes: `bf22ff45be` ("genirq: Avoid unnecessary low level irq function calls") Reported-and-tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com> Reported-by: Martin Peres <martin.peres@linux.intel.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Cc: jeffy.chen@rock-chips.com Cc: Peter Zijlstra <peterz@infradead.org> Cc: Marc Zyngier <marc.zyngier@arm.com> Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1707312158590.2287@nanos	2017-08-01 13:02:37 +02:00
Linus Torvalds	f137e0b0c5	Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: "A small set of x86 fixes: - prevent the kernel from using the EFI reboot method when EFI is disabled. - two patches addressing clang issues" * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot: Disable the address-of-packed-member compiler warning x86/efi: Fix reboot_mode when EFI runtime services are disabled x86/boot: #undef memcpy() et al in string.c	2017-07-30 12:19:35 -07:00
Linus Torvalds	dbc52a8030	Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Thomas Gleixner: "A couple of fixes for performance counters and kprobes: - a series of small patches which make the uncore performance counters on Skylake server systems work correctly - add a missing instruction slot release to the failure path of kprobes" * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: kprobes/x86: Release insn_slot in failure path perf/x86/intel/uncore: Fix missing marker for skx_uncore_cha_extra_regs perf/x86/intel/uncore: Fix SKX CHA event extra regs perf/x86/intel/uncore: Remove invalid Skylake server CHA filter field perf/x86/intel/uncore: Fix Skylake server CHA LLC_LOOKUP event umask perf/x86/intel/uncore: Fix Skylake server PCU PMU event format perf/x86/intel/uncore: Fix Skylake UPI PMU event masks	2017-07-30 11:52:15 -07:00
Rafael J. Wysocki	4815d3c56d	cpufreq: x86: Make scaling_cur_freq behave more as expected After commit `f8475cef90` "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF" the scaling_cur_freq policy attribute in sysfs only behaves as expected on x86 with APERF/MPERF registers available when it is read from at least twice in a row. The value returned by the first read may not be meaningful, because the computations in there use cached values from the previous iteration of aperfmperf_snapshot_khz() which may be stale. To prevent that from happening, modify arch_freq_get_on_cpu() to call aperfmperf_snapshot_khz() twice, with a short delay between these calls, if the previous invocation of aperfmperf_snapshot_khz() was too far back in the past (specifically, more that 1s ago). Also, as pointed out by Doug Smythies, aperf_delta is limited now and the multiplication of it by cpu_khz won't overflow, so simplify the s->khz computations too. Fixes: `f8475cef90` "x86: use common aperfmperf_khz_on_cpu() to calculate KHz using APERF/MPERF" Reported-by: Doug Smythies <dsmythies@telus.net> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2017-07-30 14:26:51 +02:00
Tom Lendacky	57bd1905b2	acpi, x86/mm: Remove encryption mask from ACPI page protection type The arch_apei_get_mem_attributes() function is used to set the page protection type for ACPI physical addresses. When SME is active, the associated protection type cannot have the encryption mask set since the ACPI tables live in un-encrypted memory - the kernel will see corrupted data. To fix this, create a new protection type, PAGE_KERNEL_NOENC, that is a 'no encryption' version of PAGE_KERNEL, and return that from arch_apei_get_mem_attributes(). Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e1cb9395b2f061cd96f1e59c3cbbe5ff5d4ec26e.1501186516.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-30 12:09:12 +02:00
Tom Lendacky	4e237903f9	x86/mm, kexec: Fix memory corruption with SME on successive kexecs After issuing successive kexecs it was found that the SHA hash failed verification when booting the kexec'd kernel. When SME is enabled, the change from using pages that were marked encrypted to now being marked as not encrypted (through new identify mapped page tables) results in memory corruption if there are any cache entries for the previously encrypted pages. This is because separate cache entries can exist for the same physical location but tagged both with and without the encryption bit. To prevent this, issue a wbinvd if SME is active before copying the pages from the source location to the destination location to clear any possible cache entry conflicts. Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Cc: <kexec@lists.infradead.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Dave Young <dyoung@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/e7fb8610af3a93e8f8ae6f214cd9249adc0df2b4.1501186516.git.thomas.lendacky@amd.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-30 12:09:12 +02:00
Andy Lutomirski	99504819fc	x86/asm/32: Remove a bunch of '& 0xffff' from pt_regs segment reads Now that pt_regs properly defines segment fields as 16-bit on 32-bit CPUs, there's no need to mask off the high word. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-30 12:04:41 +02:00
Andy Lutomirski	630c1863bc	x86/traps: Don't clear segment high bits in early_idt_handler_common() Now that pt_regs defines the segment fields as 16-bit, there's no need to sanitize the values. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-30 12:04:40 +02:00
Andy Lutomirski	385eca8f27	x86/asm/32: Make pt_regs's segment registers be 16 bits Many 32-bit x86 CPUs do 16-bit writes when storing segment registers to memory. This can cause the high word of regs->[cdefgs]s to occasionally contain garbage. Rather than making the entry code more complicated to fix up the garbage, just change pt_regs to reflect reality. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-30 12:04:40 +02:00
Ingo Molnar	f5db340f19	Merge branch 'perf/urgent' into perf/core, to pick up latest fixes and refresh the tree Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-30 11:15:13 +02:00
Matthias Kaehlcke	20c6c18904	x86/boot: Disable the address-of-packed-member compiler warning The clang warning 'address-of-packed-member' is disabled for the general kernel code, also disable it for the x86 boot code. This suppresses a bunch of warnings like this when building with clang: ./arch/x86/include/asm/processor.h:535:30: warning: taking address of packed member 'sp0' of class or structure 'x86_hw_tss' may result in an unaligned pointer value [-Waddress-of-packed-member] return this_cpu_read_stable(cpu_tss.x86_tss.sp0); ^~~~~~~~~~~~~~~~~~~ ./arch/x86/include/asm/percpu.h:391:59: note: expanded from macro 'this_cpu_read_stable' #define this_cpu_read_stable(var) percpu_stable_op("mov", var) ^~~ ./arch/x86/include/asm/percpu.h:228:16: note: expanded from macro 'percpu_stable_op' : "p" (&(var))); ^~~ Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Cc: Doug Anderson <dianders@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170725215053.135586-1-mka@chromium.org Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-28 08:39:08 +02:00
Dou Liyang	dbe04493ed	x86/topology: Remove the unused parent_node() macro Commit: `a7be6e5a7f` ("mm: drop useless local parameters of __register_one_node()") ... removed the last user of parent_node(), so remove the macro. Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/1501076076-1974-11-git-send-email-douly.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-27 10:53:00 +02:00
Andy Lutomirski	a632375764	x86/ldt/64: Refresh DS and ES when modify_ldt changes an entry On x86_32, modify_ldt() implicitly refreshes the cached DS and ES segments because they are refreshed on return to usermode. On x86_64, they're not refreshed on return to usermode. To improve determinism and match x86_32's behavior, refresh them when we update the LDT. This avoids a situation in which the DS points to a descriptor that is changed but the old cached segment persists until the next reschedule. If this happens, then the user-visible state will change nondeterministically some time after modify_ldt() returns, which is unfortunate. Signed-off-by: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bpetkov@suse.de> Cc: Chang Seok <chang.seok.bae@intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-27 09:12:57 +02:00
Dave Airlie	0eb2c0ae57	Linux 4.13-rc2 -----BEGIN PGP SIGNATURE----- iQEcBAABAgAGBQJZdS4PAAoJEHm+PkMAQRiGbEYH/2mukTPOUAfNoWaVjO2YHxuL 5yI3n1838tKIJm967IUmGdckN/RYGPjJxvZ+muXN2/rv23+9j3LVq9vQcsYqRQop vrWP+hvGGJvOGJ2NYBDB+4AUrPPdeX9stolwyAcYvyCZ8AilPIovm4s2poA+fuQX D78c8JSfpse32oc93dy4bUz3mRFKTeufstrWEuzqXI691mthF2G9EpA0R3hlbqv+ GiUnNcZVOnOuCt/47GnpWVKsyv91l3CkGq3bV1GSUi8a/1PnyFxHQxQI/qgbkLXs NuswRupSeLDQKRgiDLgWF/BpdHEp4dpFFWXm00KWlgxeGSQnKat9bpW/d5OgnhA= =mv3H -----END PGP SIGNATURE----- Backmerge tag 'v4.13-rc2' into drm-next Linux 4.13-rc2 This is required for drm-misc fixing.	2017-07-27 08:15:43 +10:00
Wanpeng Li	1d518c6820	KVM: LAPIC: Fix reentrancy issues with preempt notifiers Preempt can occur in the preemption timer expiration handler: CPU0 CPU1 preemption timer vmexit handle_preemption_timer(vCPU0) kvm_lapic_expired_hv_timer hv_timer_is_use == true sched_out sched_in kvm_arch_vcpu_load kvm_lapic_restart_hv_timer restart_apic_timer start_hv_timer already-expired timer or sw timer triggerd in the window start_sw_timer cancel_hv_timer /* back in kvm_lapic_expired_hv_timer */ cancel_hv_timer WARN_ON(!apic->lapic_timer.hv_timer_in_use); ==> Oops This can be reproduced if CONFIG_PREEMPT is enabled. ------------[ cut here ]------------ WARNING: CPU: 4 PID: 2972 at /home/kernel/linux/arch/x86/kvm//lapic.c:1563 kvm_lapic_expired_hv_timer+0x9e/0xb0 [kvm] CPU: 4 PID: 2972 Comm: qemu-system-x86 Tainted: G OE 4.13.0-rc2+ #16 RIP: 0010:kvm_lapic_expired_hv_timer+0x9e/0xb0 [kvm] Call Trace: handle_preemption_timer+0xe/0x20 [kvm_intel] vmx_handle_exit+0xb8/0xd70 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xdd1/0x1be0 [kvm] ? kvm_arch_vcpu_load+0x47/0x230 [kvm] ? kvm_arch_vcpu_load+0x62/0x230 [kvm] kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x81/0x220 entry_SYSCALL64_slow_path+0x25/0x25 ------------[ cut here ]------------ WARNING: CPU: 4 PID: 2972 at /home/kernel/linux/arch/x86/kvm//lapic.c:1498 cancel_hv_timer.isra.40+0x4f/0x60 [kvm] CPU: 4 PID: 2972 Comm: qemu-system-x86 Tainted: G W OE 4.13.0-rc2+ #16 RIP: 0010:cancel_hv_timer.isra.40+0x4f/0x60 [kvm] Call Trace: kvm_lapic_expired_hv_timer+0x3e/0xb0 [kvm] handle_preemption_timer+0xe/0x20 [kvm_intel] vmx_handle_exit+0xb8/0xd70 [kvm_intel] kvm_arch_vcpu_ioctl_run+0xdd1/0x1be0 [kvm] ? kvm_arch_vcpu_load+0x47/0x230 [kvm] ? kvm_arch_vcpu_load+0x62/0x230 [kvm] kvm_vcpu_ioctl+0x340/0x700 [kvm] ? kvm_vcpu_ioctl+0x340/0x700 [kvm] ? __fget+0xfc/0x210 do_vfs_ioctl+0xa4/0x6a0 ? __fget+0x11d/0x210 SyS_ioctl+0x79/0x90 do_syscall_64+0x81/0x220 entry_SYSCALL64_slow_path+0x25/0x25 This patch fixes it by making the caller of cancel_hv_timer, start_hv_timer and start_sw_timer be in preemption-disabled regions, which trivially avoid any reentrancy issue with preempt notifier. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> [Add more WARNs. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-07-26 19:04:53 +02:00
Wanpeng Li	2d6144e366	KVM: nVMX: Fix loss of L2's NMI blocking state Run kvm-unit-tests/eventinj.flat in L1 w/ ept=0 on both L0 and L1: Before NMI IRET test Sending NMI to self NMI isr running stack 0x461000 Sending nested NMI to self After nested NMI to self Nested NMI isr running rip=40038e After iret After NMI to self FAIL: NMI Commit `4c4a6f790e` (KVM: nVMX: track NMI blocking state separately for each VMCS) tracks NMI blocking state separately for vmcs01 and vmcs02. However it is not enough: - The L2 (kvm-unit-tests/eventinj.flat) generates NMI that will fault on IRET, so the L2 can generate #PF which can be intercepted by L0. - L0 walks L1's guest page table and sees the mapping is invalid, it resumes the L1 guest and injects the #PF into L1. At this point the vmcs02 has nmi_known_unmasked=true. - L1 sets set bit 3 (blocking by NMI) in the interruptibility-state field of vmcs12 (and fixes the shadow page table) before resuming L2 guest. - L1 executes VMRESUME to resume L2, causing a vmexit to L0 - during VMRESUME emulation, prepare_vmcs02 sets bit 3 in the interruptibility-state field of vmcs02, but nmi_known_unmasked is still true. - L2 immediately exits to L0 with another page fault, because L0 still has not updated the NGVA->HPA page tables. However, nmi_known_unmasked is true so vmx_recover_nmi_blocking does not do anything. The fix is to update nmi_known_unmasked when preparing vmcs02 from vmcs12. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-07-26 18:57:46 +02:00
Wincy Van	06a5524f09	KVM: nVMX: Fix posted intr delivery when vcpu is in guest mode The PI vector for L0 and L1 must be different. If dest vcpu0 is in guest mode while vcpu1 is delivering a non-nested PI to vcpu0, there wont't be any vmexit so that the non-nested interrupt will be delayed. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-07-26 18:57:46 +02:00
Wincy Van	210f84b0ca	x86: irq: Define a global vector for nested posted interrupts We are using the same vector for nested/non-nested posted interrupts delivery, this may cause interrupts latency in L1 since we can't kick the L2 vcpu out of vmx-nonroot mode. This patch introduces a new vector which is only for nested posted interrupts to solve the problems above. Signed-off-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-07-26 18:57:45 +02:00
Paolo Bonzini	a512177ef3	KVM: x86: do mask out upper bits of PAE CR3 This reverts the change of commit `f85c758dbe`, as the behavior it modified was intended. The VM is running in 32-bit PAE mode, and Table 4-7 of the Intel manual says: Table 4-7. Use of CR3 with PAE Paging Bit Position(s) Contents 4:0 Ignored 31:5 Physical address of the 32-Byte aligned page-directory-pointer table used for linear-address translation 63:32 Ignored (these bits exist only on processors supporting the Intel-64 architecture) To placate the static checker, write the mask explicitly as an unsigned long constant instead of using a 32-bit unsigned constant. Cc: Dan Carpenter <dan.carpenter@oracle.com> Fixes: `f85c758dbe` Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>	2017-07-26 18:57:45 +02:00
Josh Poimboeuf	81d3871900	x86/kconfig: Consolidate unwinders into multiple choice selection There are three mutually exclusive unwinders. Make that more obvious by combining them into a multiple-choice selection: CONFIG_FRAME_POINTER_UNWINDER CONFIG_ORC_UNWINDER CONFIG_GUESS_UNWINDER (if CONFIG_EXPERT=y) Frame pointers are still the default (for now). The old CONFIG_FRAME_POINTER option is still used in some arch-independent places, so keep it around, but make it invisible to the user on x86 - it's now selected by CONFIG_FRAME_POINTER_UNWINDER=y. Suggested-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/20170725135424.zukjmgpz3plf5pmt@treble Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-26 14:05:36 +02:00
Josh Poimboeuf	a34a766ff9	x86/kconfig: Make it easier to switch to the new ORC unwinder A couple of Kconfig changes which make it much easier to switch to the new CONFIG_ORC_UNWINDER: 1) Remove x86 dependencies on CONFIG_FRAME_POINTER for lockdep, latencytop, and fault injection. x86 has a 'guess' unwinder which just scans the stack for kernel text addresses. It's not 100% accurate but in many cases it's good enough. This allows those users who don't want the text overhead of the frame pointer or ORC unwinders to still use these features. More importantly, this also makes it much more straightforward to disable frame pointers. 2) Make CONFIG_ORC_UNWINDER depend on !CONFIG_FRAME_POINTER. While it would be possible to have both enabled, it doesn't really make sense to do so. So enforce a sane configuration to prevent the user from making a dumb mistake. With these changes, when you disable CONFIG_FRAME_POINTER, "make oldconfig" will ask if you want to enable CONFIG_ORC_UNWINDER. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/9985fb91ce5005fe33ea5cc2a20f14bd33c61d03.1500938583.git.jpoimboe@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-26 13:18:20 +02:00
Josh Poimboeuf	ee9f8fce99	x86/unwind: Add the ORC unwinder Add the new ORC unwinder which is enabled by CONFIG_ORC_UNWINDER=y. It plugs into the existing x86 unwinder framework. It relies on objtool to generate the needed .orc_unwind and .orc_unwind_ip sections. For more details on why ORC is used instead of DWARF, see Documentation/x86/orc-unwinder.txt - but the short version is that it's a simplified, fundamentally more robust debugninfo data structure, which also allows up to two orders of magnitude faster lookups than the DWARF unwinder - which matters to profiling workloads like perf. Thanks to Andy Lutomirski for the performance improvement ideas: splitting the ORC unwind table into two parallel arrays and creating a fast lookup table to search a subset of the unwind table. Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Slaby <jslaby@suse.cz> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mike Galbraith <efault@gmx.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: live-patching@vger.kernel.org Link: http://lkml.kernel.org/r/0a6cbfb40f8da99b7a45a1a8302dc6aef16ec812.1500938583.git.jpoimboe@redhat.com [ Extended the changelog. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-26 13:18:20 +02:00
Yazen Ghannam	9662d43f52	x86/mce/AMD: Allow any CPU to initialize the smca_banks array Current SMCA implementations have the same banks on each CPU with the non-core banks only visible to a "master thread" on each die. Practically, this means the smca_banks array, which describes the banks, only needs to be populated once by a single master thread. CPU 0 seemed like a good candidate to do the populating. However, it's possible that CPU 0 is not enabled in which case the smca_banks array won't be populated. Rather than try to figure out another master thread to do the populating, we should just allow any CPU to populate the array. Drop the CPU 0 check and return early if the bank was already initialized. Also, drop the WARNing about an already initialized bank, since this will be a common, expected occurrence. The smca_banks array is only populated at boot time and CPUs are brought online sequentially. So there's no need for locking around the array. If the first CPU up is a master thread, then it will populate the array with all banks, core and non-core. Every CPU afterwards will return early. If the first CPU up is not a master thread, then it will populate the array with all core banks. The first CPU afterwards that is a master thread will skip populating the core banks and continue populating the non-core banks. Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Jack Miller <jack@codezen.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: linux-edac <linux-edac@vger.kernel.org> Link: http://lkml.kernel.org/r/20170724101228.17326-4-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-25 15:50:53 +02:00
Ingo Molnar	25547b6b08	Merge branch 'WIP.locking/atomics' into locking/core Merge two uncontroversial cleanups from this branch while the rest is being reworked. Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-25 15:48:58 +02:00
Stefan Assmann	4ecf7191fd	x86/efi: Fix reboot_mode when EFI runtime services are disabled When EFI runtime services are disabled, for example by the "noefi" kernel cmdline parameter, the reboot_type could still be set to BOOT_EFI causing reboot to fail. Fix this by checking if EFI runtime services are enabled. Signed-off-by: Stefan Assmann <sassmann@kpanic.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170724122248.24006-1-sassmann@kpanic.de [ Fixed 'not disabled' double negation. ] Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-25 11:30:45 +02:00
Shu Wang	a99f03428f	x86/microcode/AMD: Free unneeded patch before exit from update_cache() verify_and_add_patch() allocates memory for a microcode patch and hands it down to be added to the cache of patches. However, if the cache already has the latest patch, the newly allocated one needs to be freed before returning. Do that. This issue has been found by kmemleak: unreferenced object 0xffff88010e780b40 (size 32): comm "bash", pid 860, jiffies 4294690939 (age 29.297s) backtrace: kmemleak_alloc kmem_cache_alloc_trace load_microcode_amd.isra.0 request_microcode_amd reload_store dev_attr_store sysfs_kf_write kernfs_fop_write __vfs_write vfs_write SyS_write do_syscall_64 return_from_SYSCALL_64 0xffffffffffffffff (gdb) list 0xffffffff81050d60 0xffffffff81050d60 is in load_microcode_amd (arch/x86/kernel/cpu/microcode/amd.c:616). which is this: patch = kzalloc(sizeof(patch), GFP_KERNEL); --> if (!patch) { pr_err("Patch allocation failure.\n"); return -EINVAL; } Signed-off-by: Shu Wang <shuwang@redhat.com> [ Rewrite commit message. ] Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: chuhu@redhat.com Cc: liwang@redhat.com Link: http://lkml.kernel.org/r/20170724101228.17326-2-bp@alien8.de Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-25 11:26:19 +02:00
Andrey Ryabinin	04b67022fb	x86/mm/dump_pagetables: Speed up page tables dump for CONFIG_KASAN=y KASAN fills kernel page tables with repeated values to map several TBs of the virtual memory to the single kasan_zero_page: kasan_zero_p4d -> kasan_zero_pud -> kasan_zero_pmd-> kasan_zero_pte-> kasan_zero_page Walking the whole KASAN shadow range takes a lot of time, especially with 5-level page tables. Since we already know that all kasan page tables eventually point to the kasan_zero_page we could call note_page() right and avoid walking lower levels of the page tables. This will not affect the output of the kernel_page_tables file, but let us avoid spending time in page table walkers: Before: $ time cat /sys/kernel/debug/kernel_page_tables > /dev/null real 0m55.855s user 0m0.000s sys 0m55.840s After: $ time cat /sys/kernel/debug/kernel_page_tables > /dev/null real 0m0.054s user 0m0.000s sys 0m0.054s Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170724152558.24689-1-aryabinin@virtuozzo.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-25 11:22:09 +02:00
Andy Shevchenko	5b395e2be6	x86/platform/intel-mid: Make IRQ allocation a bit more flexible In the future we would use dynamic allocation for IRQ which brings non-1:1 mapping for IOAPIC domain. Thus, we need to respect return value of mp_map_gsi_to_irq() and assign it back to the device structure. Besides that we need to read GSI from interrupt pin register to avoid cases when some drivers will try to initialize PCI device twice in a row which will call pcibios_enable_irq() twice as well. serial 0000:00:04.1: Mapped GSI28 to IRQ5 serial 0000:00:04.2: Mapped GSI29 to IRQ5 serial 0000:00:04.3: Mapped GSI54 to IRQ5 8250_mid 0000:00:04.1: Mapped GSI28 to IRQ5 8250_mid 0000:00:04.2: Mapped GSI29 to IRQ6 8250_mid 0000:00:04.3: Mapped GSI54 to IRQ7 Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Bjorn Helgaas <bhelgaas@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: linux-pci@vger.kernel.org Link: http://lkml.kernel.org/r/20170724173402.12939-1-andriy.shevchenko@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-25 11:21:03 +02:00
Andy Shevchenko	b0ee9effc1	x86/platform/intel-mid: Group timers callbacks together Group timers callback initializers together in x86_intel_mid_early_setup() for easy to find and maintain. No functional change intended. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20170724173309.12878-1-andriy.shevchenko@linux.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>	2017-07-25 11:21:02 +02:00

... 2 3 4 5 6 ...

27706 Commits