The value passed in to addr_referenced is of type void __user *, so update
the addr_referenced parameter in trace_mpx_bounds_register_exception to match.
Also update the addr_referenced paramater in TP_STRUCT__entry as it again
holdes the same value.
I don't know why this was missed earlier but sparse was complaining when
testing test branch so fix this now.
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Replace user_single_step_siginfo with user_single_step_report
that allocates siginfo structure on the stack and sends it.
This allows tracehook_report_syscall_exit to become a simple
if statement that calls user_single_step_report or ptrace_report_syscall
depending on the value of step.
Update the default helper function now called user_single_step_report
to explicitly set si_code to SI_USER and to set si_uid and si_pid to 0.
The default helper has always been doing this (using memset) but it
was far from obvious.
The powerpc helper can now just call force_sig_fault.
The x86 helper can now just call send_sigtrap.
Unfortunately the default implementation of user_single_step_report
can not use force_sig_fault as it does not use a SIGTRAP si_code.
So it has to carefully setup the siginfo and use use force_sig_info.
The net result is code that is easier to understand and simpler
to maintain.
Ref: 85ec7fd9f8 ("ptrace: introduce user_single_step_siginfo() helper")
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
kvmclock defines few static variables which are shared with the
hypervisor during the kvmclock initialization.
When SEV is active, memory is encrypted with a guest-specific key, and
if the guest OS wants to share the memory region with the hypervisor
then it must clear the C-bit before sharing it.
Currently, we use kernel_physical_mapping_init() to split large pages
before clearing the C-bit on shared pages. But it fails when called from
the kvmclock initialization (mainly because the memblock allocator is
not ready that early during boot).
Add a __bss_decrypted section attribute which can be used when defining
such shared variable. The so-defined variables will be placed in the
.bss..decrypted section. This section will be mapped with C=0 early
during boot.
The .bss..decrypted section has a big chunk of memory that may be unused
when memory encryption is not active, free it when memory encryption is
not active.
Suggested-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Radim Krčmář<rkrcmar@redhat.com>
Cc: kvm@vger.kernel.org
Link: https://lkml.kernel.org/r/1536932759-12905-2-git-send-email-brijesh.singh@amd.com
This reverts commit 1f40a46cf4.
It turned out that this patch is not sufficient to enable PTI on 32 bit
systems with legacy 2-level page-tables. In this paging mode the huge-page
PTEs are in the top-level page-table directory, where also the mirroring to
the user-space page-table happens. So every huge PTE exits twice, in the
kernel and in the user page-table.
That means that accessed/dirty bits need to be fetched from two PTEs in
this mode to be safe, but this is not trivial to implement because it needs
changes to generic code just for the sake of enabling PTI with 32-bit
legacy paging. As all systems that need PTI should support PAE anyway,
remove support for PTI when 32-bit legacy paging is used.
Fixes: 7757d607c6 ('x86/pti: Allow CONFIG_PAGE_TABLE_ISOLATION for x86_32')
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Link: https://lkml.kernel.org/r/1536922754-31379-1-git-send-email-joro@8bytes.org
The SYSCALL64 trampoline has a couple of nice properties:
- The usual sequence of SWAPGS followed by two GS-relative accesses to
set up RSP is somewhat slow because the GS-relative accesses need
to wait for SWAPGS to finish. The trampoline approach allows
RIP-relative accesses to set up RSP, which avoids the stall.
- The trampoline avoids any percpu access before CR3 is set up,
which means that no percpu memory needs to be mapped in the user
page tables. This prevents using Meltdown to read any percpu memory
outside the cpu_entry_area and prevents using timing leaks
to directly locate the percpu areas.
The downsides of using a trampoline may outweigh the upsides, however.
It adds an extra non-contiguous I$ cache line to system calls, and it
forces an indirect jump to transfer control back to the normal kernel
text after CR3 is set up. The latter is because x86 lacks a 64-bit
direct jump instruction that could jump from the trampoline to the entry
text. With retpolines enabled, the indirect jump is extremely slow.
Change the code to map the percpu TSS into the user page tables to allow
the non-trampoline SYSCALL64 path to work under PTI. This does not add a
new direct information leak, since the TSS is readable by Meltdown from the
cpu_entry_area alias regardless. It does allow a timing attack to locate
the percpu area, but KASLR is more or less a lost cause against local
attack on CPUs vulnerable to Meltdown regardless. As far as I'm concerned,
on current hardware, KASLR is only useful to mitigate remote attacks that
try to attack the kernel without first gaining RCE against a vulnerable
user process.
On Skylake, with CONFIG_RETPOLINE=y and KPTI on, this reduces syscall
overhead from ~237ns to ~228ns.
There is a possible alternative approach: Move the trampoline within 2G of
the entry text and make a separate copy for each CPU. This would allow a
direct jump to rejoin the normal entry path. There are pro's and con's for
this approach:
+ It avoids a pipeline stall
- It executes from an extra page and read from another extra page during
the syscall. The latter is because it needs to use a relative
addressing mode to find sp1 -- it's the same *cacheline*, but accessed
using an alias, so it's an extra TLB entry.
- Slightly more memory. This would be one page per CPU for a simple
implementation and 64-ish bytes per CPU or one page per node for a more
complex implementation.
- More code complexity.
The current approach is chosen for simplicity and because the alternative
does not provide a significant benefit, which makes it worth.
[ tglx: Added the alternative discussion to the changelog ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8c7c6e483612c3e4e10ca89495dc160b1aa66878.1536015544.git.luto@kernel.org
I use memcpy_flushcache() in my persistent memory driver for metadata
updates, there are many 8-byte and 16-byte updates and it turns out that
the overhead of memcpy_flushcache causes 2% performance degradation
compared to "movnti" instruction explicitly coded using inline assembler.
The tests were done on a Skylake processor with persistent memory emulated
using the "memmap" kernel parameter. dd was used to copy data to the
dm-writecache target.
This patch recognizes memcpy_flushcache calls with constant short length
and turns them into inline assembler - so that I don't have to use inline
assembler in the driver.
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Snitzer <snitzer@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: device-mapper development <dm-devel@redhat.com>
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1808081720460.24747@file01.intranet.prod.int.rdu2.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for x86:
- Prevent multiplication result truncation on 32bit. Introduced with
the early timestamp reworrk.
- Ensure microcode revision storage to be consistent under all
circumstances
- Prevent write tearing of PTEs
- Prevent confusion of user and kernel reegisters when dumping fatal
signals verbosely
- Make an error return value in a failure path of the vector
allocation negative. Returning EINVAL might the caller assume
success and causes further wreckage.
- A trivial kernel doc warning fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Use WRITE_ONCE() when setting PTEs
x86/apic/vector: Make error return value negative
x86/process: Don't mix user/kernel regs in 64bit __show_regs()
x86/tsc: Prevent result truncation on 32bit
x86: Fix kernel-doc atomic.h warnings
x86/microcode: Update the new microcode revision unconditionally
x86/microcode: Make sure boot_cpu_data.microcode is up-to-date
When page-table entries are set, the compiler might optimize their
assignment by using multiple instructions to set the PTE. This might
turn into a security hazard if the user somehow manages to use the
interim PTE. L1TF does not make our lives easier, making even an interim
non-present PTE a security hazard.
Using WRITE_ONCE() to set PTEs and friends should prevent this potential
security hazard.
I skimmed the differences in the binary with and without this patch. The
differences are (obviously) greater when CONFIG_PARAVIRT=n as more
code optimizations are possible. For better and worse, the impact on the
binary with this patch is pretty small. Skimming the code did not cause
anything to jump out as a security hazard, but it seems that at least
move_soft_dirty_pte() caused set_pte_at() to use multiple writes.
Signed-off-by: Nadav Amit <namit@vmware.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180902181451.80520-1-namit@vmware.com
In the non-trampoline SYSCALL64 path, a percpu variable is used to
temporarily store the user RSP value.
Instead of a separate variable, use the otherwise unused sp2 slot in the
TSS. This will improve cache locality, as the sp1 slot is already used in
the same code to find the kernel stack. It will also simplify a future
change to make the non-trampoline path work in PTI mode.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/08e769a0023dbad4bac6f34f3631dbaf8ad59f4f.1536015544.git.luto@kernel.org
Dan Carpenter reported that the untrusted data returns from kvm_register_read()
results in the following static checker warning:
arch/x86/kvm/lapic.c:576 kvm_pv_send_ipi()
error: buffer underflow 'map->phys_map' 's32min-s32max'
KVM guest can easily trigger this by executing the following assembly sequence
in Ring0:
mov $10, %rax
mov $0xFFFFFFFF, %rbx
mov $0xFFFFFFFF, %rdx
mov $0, %rsi
vmcall
As this will cause KVM to execute the following code-path:
vmx_handle_exit() -> handle_vmcall() -> kvm_emulate_hypercall() -> kvm_pv_send_ipi()
which will reach out-of-bounds access.
This patch fixes it by adding a check to kvm_pv_send_ipi() against map->max_apic_id,
ignoring destinations that are not present and delivering the rest. We also check
whether or not map->phys_map[min + i] is NULL since the max_apic_id is set to the
max apic id, some phys_map maybe NULL when apic id is sparse, especially kvm
unconditionally set max_apic_id to 255 to reserve enough space for any xAPIC ID.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Liran Alon <liran.alon@oracle.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Liran Alon <liran.alon@oracle.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
[Add second "if (min > map->max_apic_id)" to complete the fix. -Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
- Fix a VFP corruption in 32-bit guest
- Add missing cache invalidation for CoW pages
- Two small cleanups
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJbkngmAAoJEEtpOizt6ddyeaoH/15bbGHlwWf23tGjSoDzhyD4
zAXfy+SJdm4cR8K7jEkVrNffkEMAby7Zl28hTHKB9jsY1K8DD+EuCE3Nd4kkVAsc
iHJwV4aiHil/zC5SyE0MqMzELeS8UhsxESYebG6yNF0ElQDQ0SG+QAFr47/OBN9S
u4I7x0rhyJP6Kg8z9U4KtEX0hM6C7VVunGWu44/xZSAecTaMuJnItCIM4UMdEkSs
xpAoI59lwM6BWrXLvEunekAkxEXoR7AVpQER2PDINoLK2I0i0oavhPim9Xdt2ZXs
rqQqfmwmPOVvYbexDp97JtfWo3/psGLqvgoK1tq9bzF3u6Y3ylnUK5IspyVYwuQ=
=TK8A
-----END PGP SIGNATURE-----
Merge tag 'kvm-arm-fixes-for-v4.19-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm
Fixes for KVM/ARM for Linux v4.19 v2:
- Fix a VFP corruption in 32-bit guest
- Add missing cache invalidation for CoW pages
- Two small cleanups
kvm_unmap_hva is long gone, and we only have kvm_unmap_hva_range to
deal with. Drop the now obsolete code.
Fixes: fb1522e099 ("KVM: update to new mmu_notifier semantic v2")
Cc: James Hogan <jhogan@kernel.org>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Christoffer Dall <christoffer.dall@arm.com>
The PARAVIRT_XXL changes introduced a redefinition of SAVE_FLAGS under
certain configurations. Cure it
Fixes: 6da63eb241 ("x86/paravirt: Move the pv_irq_ops under the PARAVIRT_XXL umbrella").
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: hpa@zytor.com
Link: https://lkml.kernel.org/r/20180905053720.13710-1-jgross@suse.com
When the kernel.print-fatal-signals sysctl has been enabled, a simple
userspace crash will cause the kernel to write a crash dump that contains,
among other things, the kernel gsbase into dmesg.
As suggested by Andy, limit output to pt_regs, FS_BASE and KERNEL_GS_BASE
in this case.
This also moves the bitness-specific logic from show_regs() into
process_{32,64}.c.
Fixes: 45807a1df9 ("vdso: print fatal signals")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180831194151.123586-1-jannh@google.com
This is preparation for looking at trap number and fault address in the
handlers for uaccess errors. No functional change.
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: linux-kernel@vger.kernel.org
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-6-jannh@google.com
Currently, most fixups for attempting to access userspace memory are
handled using _ASM_EXTABLE, which is also used for various other types of
fixups (e.g. safe MSR access, IRET failures, and a bunch of other things).
In order to make it possible to add special safety checks to uaccess fixups
(in particular, checking whether the fault address is actually in
userspace), introduce a new exception table handler ex_handler_uaccess()
and wire it up to all the user access fixups (excluding ones that
already use _ASM_EXTABLE_EX).
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kees Cook <keescook@chromium.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: kernel-hardening@lists.openwall.com
Cc: dvyukov@google.com
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/20180828201421.157735-5-jannh@google.com
Fix kernel-doc warnings in arch/x86/include/asm/atomic.h that are caused by
having a #define macro between the kernel-doc notation and the function
name. Fixed by moving the #define macro to after the function
implementation.
Make the same change for atomic64_{32,64}.h for consistency even though
there were no kernel-doc warnings found in these header files, but there
would be if they were used in generation of documentation.
Fixes these kernel-doc warnings:
../arch/x86/include/asm/atomic.h:84: warning: Excess function parameter 'i' description in 'arch_atomic_sub_and_test'
../arch/x86/include/asm/atomic.h:84: warning: Excess function parameter 'v' description in 'arch_atomic_sub_and_test'
../arch/x86/include/asm/atomic.h:96: warning: Excess function parameter 'v' description in 'arch_atomic_inc'
../arch/x86/include/asm/atomic.h:109: warning: Excess function parameter 'v' description in 'arch_atomic_dec'
../arch/x86/include/asm/atomic.h:124: warning: Excess function parameter 'v' description in 'arch_atomic_dec_and_test'
../arch/x86/include/asm/atomic.h:138: warning: Excess function parameter 'v' description in 'arch_atomic_inc_and_test'
../arch/x86/include/asm/atomic.h:153: warning: Excess function parameter 'i' description in 'arch_atomic_add_negative'
../arch/x86/include/asm/atomic.h:153: warning: Excess function parameter 'v' description in 'arch_atomic_add_negative'
Fixes: 18cc1814d4 ("atomics/treewide: Make test ops optional")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Mark Rutland <mark.rutland@arm.com>
Link: https://lkml.kernel.org/r/0a1e678d-c8c5-b32c-2640-ed4e94d399d2@infradead.org
Pull x86 fixes from Thomas Gleixner:
"Speculation:
- Make the microcode check more robust
- Make the L1TF memory limit depend on the internal cache physical
address space and not on the CPUID advertised physical address
space, which might be significantly smaller. This avoids disabling
L1TF on machines which utilize the full physical address space.
- Fix the GDT mapping for EFI calls on 32bit PTI
- Fix the MCE nospec implementation to prevent #GP
Fixes and robustness:
- Use the proper operand order for LSL in the VDSO
- Prevent NMI uaccess race against CR3 switching
- Add a lockdep check to verify that text_mutex is held in
text_poke() functions
- Repair the fallout of giving native_restore_fl() a prototype
- Prevent kernel memory dumps based on usermode RIP
- Wipe KASAN shadow stack before rewinding the stack to prevent false
positives
- Move the AMS GOTO enforcement to the actual build stage to allow
user API header extraction without a compiler
- Fix a section mismatch introduced by the on demand VDSO mapping
change
Miscellaneous:
- Trivial typo, GCC quirk removal and CC_SET/OUT() cleanups"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pti: Fix section mismatch warning/error
x86/vdso: Fix lsl operand order
x86/mce: Fix set_mce_nospec() to avoid #GP fault
x86/efi: Load fixmap GDT in efi_call_phys_epilog()
x86/nmi: Fix NMI uaccess race against CR3 switching
x86: Allow generating user-space headers without a compiler
x86/dumpstack: Don't dump kernel memory based on usermode RIP
x86/asm: Use CC_SET()/CC_OUT() in __gen_sigismember()
x86/alternatives: Lockdep-enforce text_mutex in text_poke*()
x86/entry/64: Wipe KASAN stack shadow before rewind_stack_do_exit()
x86/irqflags: Mark native_restore_fl extern inline
x86/build: Remove jump label quirk for GCC older than 4.5.2
x86/Kconfig: Fix trivial typo
x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+
x86/spectre: Add missing family 6 check to microcode check
In the __getcpu function, lsl is using the wrong target and destination
registers. Luckily, the compiler tends to choose %eax for both variables,
so it has been working so far.
Fixes: a582c540ac ("x86/vdso: Use RDPID in preference to LSL when available")
Signed-off-by: Samuel Neves <sneves@dei.uc.pt>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180901201452.27828-1-sneves@dei.uc.pt
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW4lM6AAKCRCAXGG7T9hj
vs8AAQDysFccg97UdopW3B7yklIaRqkfEIAsxe65f191MXsH2AEAp5SKxZqRPqBP
a9WHDj8ShB3BhZ/IxpdO9Y59U3Jo4wA=
=Gt4c
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.19b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
- minor cleanup avoiding a warning when building with new gcc
- a patch to add a new sysfs node for Xen frontend/backend drivers to
make it easier to obtain the state of a pv device
- two fixes for 32-bit pv-guests to avoid intermediate L1TF vulnerable
PTEs
* tag 'for-linus-4.19b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: remove redundant variable save_pud
xen: export device state to sysfs
x86/pae: use 64 bit atomic xchg function in native_ptep_get_and_clear
x86/xen: don't write ptes directly in 32-bit PV guests
A NMI can hit in the middle of context switching or in the middle of
switch_mm_irqs_off(). In either case, CR3 might not match current->mm,
which could cause copy_from_user_nmi() and friends to read the wrong
memory.
Fix it by adding a new nmi_uaccess_okay() helper and checking it in
copy_from_user_nmi() and in __copy_from_user_nmi()'s callers.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Rik van Riel <riel@surriel.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jann Horn <jannh@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/dd956eba16646fd0b15c3c0741269dfd84452dac.1535557289.git.luto@kernel.org
show_opcodes() is used both for dumping kernel instructions and for dumping
user instructions. If userspace causes #PF by jumping to a kernel address,
show_opcodes() can be reached with regs->ip controlled by the user,
pointing to kernel code. Make sure that userspace can't trick us into
dumping kernel memory into dmesg.
Fixes: 7cccf0725c ("x86/dumpstack: Add a show_ip() function")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: security@kernel.org
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180828154901.112726-1-jannh@google.com
Allowing x86_emulate_instruction() to be called directly has led to
subtle bugs being introduced, e.g. not setting EMULTYPE_NO_REEXECUTE
in the emulation type. While most of the blame lies on re-execute
being opt-out, exporting x86_emulate_instruction() also exposes its
cr2 parameter, which may have contributed to commit d391f12070
("x86/kvm/vmx: do not use vm-exit instruction length for fast MMIO
when running nested") using x86_emulate_instruction() instead of
emulate_instruction() because "hey, I have a cr2!", which in turn
introduced its EMULTYPE_NO_REEXECUTE bug.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Lack of the kvm_ prefix gives the impression that it's a VMX or SVM
specific function, and there's no conflict that prevents adding the
kvm_ prefix.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
retry_instruction() and reexecute_instruction() are a package deal,
i.e. there is no scenario where one is allowed and the other is not.
Merge their controlling emulation type flags to enforce this in code.
Name the combined flag EMULTYPE_ALLOW_RETRY to make it abundantly
clear that we are allowing re{try,execute} to occur, as opposed to
explicitly requesting retry of a previously failed instruction.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Re-execution of an instruction after emulation decode failure is
intended to be used only when emulating shadow page accesses. Invert
the flag to make allowing re-execution opt-in since that behavior is
by far in the minority.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Re-execution after an emulation decode failure is only intended to
handle a case where two or vCPUs race to write a shadowed page, i.e.
we should never re-execute an instruction as part of RSM emulation.
Add a new helper, kvm_emulate_instruction_from_buffer(), to support
emulating from a pre-defined buffer. This eliminates the last direct
call to x86_emulate_instruction() outside of kvm_mmu_page_fault(),
which means x86_emulate_instruction() can be unexported in a future
patch.
Fixes: 7607b71744 ("KVM: SVM: install RSM intercept")
Cc: Brijesh Singh <brijesh.singh@amd.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
This should have been marked extern inline in order to pick up the out
of line definition in arch/x86/kernel/irqflags.S.
Fixes: 208cbb3255 ("x86/irqflags: Provide a declaration for native_save_fl")
Reported-by: Ben Hutchings <ben.hutchings@codethink.co.uk>
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180827214011.55428-1-ndesaulniers@google.com
After changing over to 64-bit time_t syscalls, many architectures will
want compat_sys_utimensat() but not respective handlers for utime(),
utimes() and futimesat(). This adds a new __ARCH_WANT_SYS_UTIME32 to
complement __ARCH_WANT_SYS_UTIME. For now, all 64-bit architectures that
support CONFIG_COMPAT set it, but future 64-bit architectures will not
(tile would not have needed it either, but got removed).
As older 32-bit architectures get converted to using CONFIG_64BIT_TIME,
they will have to use __ARCH_WANT_SYS_UTIME32 instead of
__ARCH_WANT_SYS_UTIME. Architectures using the generic syscall ABI don't
need either of them as they never had a utime syscall.
Since the compat_utimbuf structure is now required outside of
CONFIG_COMPAT, I'm moving it into compat_time.h.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
changed from last version:
- renamed __ARCH_WANT_COMPAT_SYS_UTIME to __ARCH_WANT_SYS_UTIME32
The sys_llseek sytem call is needed on all 32-bit architectures and
none of the 64-bit ones, so we can remove the __ARCH_WANT_SYS_LLSEEK guard
and simplify the include/asm-generic/unistd.h header further.
Since 32-bit tasks can run either natively or in compat mode on 64-bit
architectures, we have to check for both !CONFIG_64BIT and CONFIG_COMPAT.
There are a few 64-bit architectures that also reference sys_llseek
in their 64-bit ABI (e.g. sparc), but I verified that those all
select CONFIG_COMPAT, so the #if check is still correct here. It's
a bit odd to include it in the syscall table though, as it's the
same as sys_lseek() on 64-bit, but with strange calling conventions.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
While converting compat system call handlers to work on 32-bit
architectures, I found a number of types used in those handlers
that are identical between all architectures.
Let's move all the identical ones into asm-generic/compat.h to avoid
having to add even more identical definitions of those types.
For unknown reasons, mips defines __compat_gid32_t, __compat_uid32_t
and compat_caddr_t as signed, while all others have them unsigned.
This seems to be a mistake, but I'm leaving it alone here. The other
types all differ by size or alignment on at least on architecture.
compat_aio_context_t is currently defined in linux/compat.h but
also needed for compat_sys_io_getevents(), so let's move it into
the same place.
While we still have not decided whether the 32-bit time handling
will always use the compat syscalls, or in which form, I think this
is a useful cleanup that we can merge regardless.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
We have four generations of stat() syscalls:
- the oldstat syscalls that are only used on the older architectures
- the newstat family that is used on all 64-bit architectures but
lacked support for large files on 32-bit architectures.
- the stat64 family that is used mostly on 32-bit architectures to
replace newstat
- statx() to replace all of the above, adding 64-bit timestamps among
other things.
We already compile stat64 only on those architectures that need it,
but newstat is always built, including on those that don't reference
it. This adds a new __ARCH_WANT_NEW_STAT symbol along the lines of
__ARCH_WANT_OLD_STAT and __ARCH_WANT_STAT64 to control compilation of
newstat. All architectures that need it use an explict define, the
others now get a little bit smaller, and future architecture (including
64-bit targets) won't ever see it.
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Using only 32-bit writes for the pte will result in an intermediate
L1TF vulnerable PTE. When running as a Xen PV guest this will at once
switch the guest to shadow mode resulting in a loss of performance.
Use arch_atomic64_xchg() instead which will perform the requested
operation atomically with all 64 bits.
Some performance considerations according to:
https://software.intel.com/sites/default/files/managed/ad/dc/Intel-Xeon-Scalable-Processor-throughput-latency.pdf
The main number should be the latency, as there is no tight loop around
native_ptep_get_and_clear().
"lock cmpxchg8b" has a latency of 20 cycles, while "lock xchg" (with a
memory operand) isn't mentioned in that document. "lock xadd" (with xadd
having 3 cycles less latency than xchg) has a latency of 11, so we can
assume a latency of 14 for "lock xchg".
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Tested-by: Jason Andryuk <jandryuk@gmail.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
On Nehalem and newer core CPUs the CPU cache internally uses 44 bits
physical address space. The L1TF workaround is limited by this internal
cache address width, and needs to have one bit free there for the
mitigation to work.
Older client systems report only 36bit physical address space so the range
check decides that L1TF is not mitigated for a 36bit phys/32GB system with
some memory holes.
But since these actually have the larger internal cache width this warning
is bogus because it would only really be needed if the system had more than
43bits of memory.
Add a new internal x86_cache_bits field. Normally it is the same as the
physical bits field reported by CPUID, but for Nehalem and newerforce it to
be at least 44bits.
Change the L1TF memory size warning to use the new cache_bits field to
avoid bogus warnings and remove the bogus comment about memory size.
Fixes: 17dbca1193 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Reported-by: George Anchev <studio@anchev.net>
Reported-by: Christopher Snowhill <kode54@gmail.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Michael Hocko <mhocko@suse.com>
Cc: vbabka@suse.cz
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180824170351.34874-1-andi@firstfloor.org
Pull x86 fixes from Thomas Gleixner:
- Correct the L1TF fallout on 32bit and the off by one in the 'too much
RAM for protection' calculation.
- Add a helpful kernel message for the 'too much RAM' case
- Unbreak the VDSO in case that the compiler desides to use indirect
jumps/calls and emits retpolines which cannot be resolved because the
kernel uses its own thunks, which does not work for the VDSO. Make it
use the builtin thunks.
- Re-export start_thread() which was unexported when the 32/64bit
implementation was unified. start_thread() is required by modular
binfmt handlers.
- Trivial cleanups
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/speculation/l1tf: Suggest what to do on systems with too much RAM
x86/speculation/l1tf: Fix off-by-one error when warning that system has too much RAM
x86/kvm/vmx: Remove duplicate l1d flush definitions
x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
x86/process: Re-export start_thread()
x86/mce: Add notifier_block forward declaration
x86/vdso: Fix vDSO build if a retpoline is emitted
* memory_failure() gets confused by dev_pagemap backed mappings. The
recovery code has specific enabling for several possible page states
that needs new enabling to handle poison in dax mappings. Teach
memory_failure() about ZONE_DEVICE pages.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAlt9ui8ACgkQYGjFFmlT
OEpNRw//XGj9s7sezfJFeol4psJlRUd935yii/gmJRgi/yPf2VxxQG9qyM6SMBUc
75jASfOL6FSsfxHz0kplyWzMDNdrTkNNAD+9rv80FmY7GqWgcas9DaJX7jZ994vI
5SRO7pfvNZcXlo7IhqZippDw3yxkIU9Ufi0YQKaEUm7GFieptvCZ0p9x3VYfdvwM
BExrxQe0X1XUF4xErp5P78+WUbKxP47DLcucRDig8Q7dmHELUdyNzo3E1SVoc7m+
3CmvyTj6XuFQgOZw7ZKun1BJYfx/eD5ZlRJLZbx6wJHRtTXv/Uea8mZ8mJ31ykN9
F7QVd0Pmlyxys8lcXfK+nvpL09QBE0/PhwWKjmZBoU8AdgP/ZvBXLDL/D6YuMTg6
T4wwtPNJorfV4lVD06OliFkVI4qbKbmNsfRq43Ns7PCaLueu4U/eMaSwSH99UMaZ
MGbO140XW2RZsHiU9yTRUmZq73AplePEjxtzR8oHmnjo45nPDPy8mucWPlkT9kXA
oUFMhgiviK7dOo19H4eaPJGqLmHM93+x5tpYxGqTr0dUOXUadKWxMsTnkID+8Yi7
/kzQWCFvySz3VhiEHGuWkW08GZT6aCcpkREDomnRh4MEnETlZI8bblcuXYOCLs6c
nNf1SIMtLdlsl7U1fEX89PNeQQ2y237vEDhFQZftaalPeu/JJV0=
=Ftop
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm memory-failure update from Dave Jiang:
"As it stands, memory_failure() gets thoroughly confused by dev_pagemap
backed mappings. The recovery code has specific enabling for several
possible page states and needs new enabling to handle poison in dax
mappings.
In order to support reliable reverse mapping of user space addresses:
1/ Add new locking in the memory_failure() rmap path to prevent races
that would typically be handled by the page lock.
2/ Since dev_pagemap pages are hidden from the page allocator and the
"compound page" accounting machinery, add a mechanism to determine
the size of the mapping that encompasses a given poisoned pfn.
3/ Given pmem errors can be repaired, change the speculatively
accessed poison protection, mce_unmap_kpfn(), to be reversible and
otherwise allow ongoing access from the kernel.
A side effect of this enabling is that MADV_HWPOISON becomes usable
for dax mappings, however the primary motivation is to allow the
system to survive userspace consumption of hardware-poison via dax.
Specifically the current behavior is:
mce: Uncorrected hardware memory error in user-access at af34214200
{1}[Hardware Error]: It has been corrected by h/w and requires no further action
mce: [Hardware Error]: Machine check events logged
{1}[Hardware Error]: event severity: corrected
Memory failure: 0xaf34214: reserved kernel page still referenced by 1 users
[..]
Memory failure: 0xaf34214: recovery action for reserved kernel page: Failed
mce: Memory error not recovered
<reboot>
...and with these changes:
Injecting memory failure for pfn 0x20cb00 at process virtual address 0x7f763dd00000
Memory failure: 0x20cb00: Killing dax-pmd:5421 due to hardware memory corruption
Memory failure: 0x20cb00: recovery action for dax page: Recovered
Given all the cross dependencies I propose taking this through
nvdimm.git with acks from Naoya, x86/core, x86/RAS, and of course dax
folks"
* tag 'libnvdimm-for-4.19_dax-memory-failure' of gitolite.kernel.org:pub/scm/linux/kernel/git/nvdimm/nvdimm:
libnvdimm, pmem: Restore page attributes when clearing errors
x86/memory_failure: Introduce {set, clear}_mce_nospec()
x86/mm/pat: Prepare {reserve, free}_memtype() for "decoy" addresses
mm, memory_failure: Teach memory_failure() about dev_pagemap pages
filesystem-dax: Introduce dax_lock_mapping_entry()
mm, memory_failure: Collect mapping size in collect_procs()
mm, madvise_inject_error: Let memory_failure() optionally take a page reference
mm, dev_pagemap: Do not clear ->mapping on final put
mm, madvise_inject_error: Disable MADV_SOFT_OFFLINE for ZONE_DEVICE pages
filesystem-dax: Set page->index
device-dax: Set page->index
device-dax: Enable page_mapping()
device-dax: Convert to vmf_insert_mixed and vm_fault_t
Including:
- PASID table handling updates for the Intel VT-d driver. It
implements a global PASID space now so that applications
usings multiple devices will just have one PASID.
- A new config option to make iommu passthroug mode the default.
- New sysfs attribute for iommu groups to export the type of the
default domain.
- A debugfs interface (for debug only) usable by IOMMU drivers
to export internals to user-space.
- R-Car Gen3 SoCs support for the ipmmu-vmsa driver
- The ARM-SMMU now aborts transactions from unknown devices and
devices not attached to any domain.
- Various cleanups and smaller fixes all over the place.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABAgAGBQJbf/9wAAoJECvwRC2XARrjcuYP/3dIsOFN7Xb4sTOB5wxk4wmD
2Rm5o/18cFekEy4M8fwIBCYkzH/McohgKbOFcH6XiCxIwJ5RdXzITLAwmp4PbvIO
KtwppXSp+MQtboip/bp6NDNBhABErgUtgdXawwENCCrFivXDsB8W4wnXESAOkLv9
4fLXrUgDFCAquLZpLqQobXHhajtGAkSekaasphlhejXFulFyF1YcEUcliU7eXZ0R
rZjL4Zqcyyi5kv6d3WhL+tvmmhr7wfMsMPaW18eRf9tXvMpWRM2GOAj65coI2AWs
1T1kW/jvvrxnewOsmo1nYlw7R07uiRkUfHmJ9tY65xW4120HJFhdFLPUQZXfrX/b
wcGbheYIh6cwAaZBtPJ35bPeW6pREkDOShohbzt45T62Q837cBkr3zyHhNsoOXHS
13YVtTd2vtPa4iLdu2qmEOC1OuhQnMvqHqX0iN8U74QbDxEYYvMfAdx0JL3hmPp/
uynY3QmXIKCeZg+vH2qcWHm07nfaAr5y8WSPA0crnqeznD5zJ4kvJf5dFGmDyTKr
pyTkhidkifm6ZejrJsDZveoZdLpHrOatrqKaoLFh2crMUG3d807NYqQ3JmA3NDjg
zPbYyU4joFGNVjd3XkSnRTGxR6YvLIwNbkQ3b/K/B5AqWJ6VrTbbTCOa4GSms6rF
Qm8wRrmYaycKxkcMqtls
=TeYQ
-----END PGP SIGNATURE-----
Merge tag 'iommu-updates-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
- PASID table handling updates for the Intel VT-d driver. It implements
a global PASID space now so that applications usings multiple devices
will just have one PASID.
- A new config option to make iommu passthroug mode the default.
- New sysfs attribute for iommu groups to export the type of the
default domain.
- A debugfs interface (for debug only) usable by IOMMU drivers to
export internals to user-space.
- R-Car Gen3 SoCs support for the ipmmu-vmsa driver
- The ARM-SMMU now aborts transactions from unknown devices and devices
not attached to any domain.
- Various cleanups and smaller fixes all over the place.
* tag 'iommu-updates-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (42 commits)
iommu/omap: Fix cache flushes on L2 table entries
iommu: Remove the ->map_sg indirection
iommu/arm-smmu-v3: Abort all transactions if SMMU is enabled in kdump kernel
iommu/arm-smmu-v3: Prevent any devices access to memory without registration
iommu/ipmmu-vmsa: Don't register as BUS IOMMU if machine doesn't have IPMMU-VMSA
iommu/ipmmu-vmsa: Clarify supported platforms
iommu/ipmmu-vmsa: Fix allocation in atomic context
iommu: Add config option to set passthrough as default
iommu: Add sysfs attribyte for domain type
iommu/arm-smmu-v3: sync the OVACKFLG to PRIQ consumer register
iommu/arm-smmu: Error out only if not enough context interrupts
iommu/io-pgtable-arm-v7s: Abort allocation when table address overflows the PTE
iommu/io-pgtable-arm: Fix pgtable allocation in selftest
iommu/vt-d: Remove the obsolete per iommu pasid tables
iommu/vt-d: Apply per pci device pasid table in SVA
iommu/vt-d: Allocate and free pasid table
iommu/vt-d: Per PCI device pasid table interfaces
iommu/vt-d: Add for_each_device_domain() helper
iommu/vt-d: Move device_domain_info to header
iommu/vt-d: Apply global PASID in SVA
...
Two users have reported [1] that they have an "extremely unlikely" system
with more than MAX_PA/2 memory and L1TF mitigation is not effective. In
fact it's a CPU with 36bits phys limit (64GB) and 32GB memory, but due to
holes in the e820 map, the main region is almost 500MB over the 32GB limit:
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000081effffff] usable
Suggestions to use 'mem=32G' to enable the L1TF mitigation while losing the
500MB revealed, that there's an off-by-one error in the check in
l1tf_select_mitigation().
l1tf_pfn_limit() returns the last usable pfn (inclusive) and the range
check in the mitigation path does not take this into account.
Instead of amending the range check, make l1tf_pfn_limit() return the first
PFN which is over the limit which is less error prone. Adjust the other
users accordingly.
[1] https://bugzilla.suse.com/show_bug.cgi?id=1105536
Fixes: 17dbca1193 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Reported-by: George Anchev <studio@anchev.net>
Reported-by: Christopher Snowhill <kode54@gmail.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180823134418.17008-1-vbabka@suse.cz
Merge fixes for missing TLB shootdowns.
This fixes a couple of cases that involved us possibly freeing page
table structures before the required TLB shootdown had been done.
There are a few cleanup patches to make the code easier to follow, and
to avoid some of the more problematic cases entirely when not necessary.
To make this easier for backports, it undoes the recent lazy TLB
patches, because the cleanups and fixes are more important, and Rik is
ok with re-doing them later when things have calmed down.
The missing TLB flush was only delayed, and the wrong ordering only
happened under memory pressure (and in theory under a couple of other
fairly theoretical situations), so this may have been all very unlikely
to have hit people in practice.
But getting the TLB shootdown wrong is _so_ hard to debug and see that I
consider this a crticial fix.
Many thanks to Jann Horn for having debugged this.
* tlb-fixes:
x86/mm: Only use tlb_remove_table() for paravirt
mm: mmu_notifier fix for tlb_end_vma
mm/tlb, x86/mm: Support invalidating TLB caches for RCU_TABLE_FREE
mm/tlb: Remove tlb_remove_table() non-concurrent condition
mm: move tlb_table_flush to tlb_flush_mmu_free
x86/mm/tlb: Revert the recent lazy TLB patches
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW36rRgAKCRCAXGG7T9hj
vkrcAQC8F+ljGO5PtYUkKcMy17vqvcq/BdetJuUVfk+G1WmLxQEAiaNiqqJGsOyJ
Msa0HHDT31uBYGg/iq7yAWk23tcTZwE=
=Px4D
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.19b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes and cleanups from Juergen Gross:
"Some cleanups, some minor fixes and a fix for a bug introduced in this
merge window hitting 32-bit PV guests"
* tag 'for-linus-4.19b-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
x86/xen: enable early use of set_fixmap in 32-bit Xen PV guest
xen: remove unused hypercall functions
x86/xen: remove unused function xen_auto_xlated_memory_setup()
xen/ACPI: don't upload Px/Cx data for disabled processors
x86/Xen: further refine add_preferred_console() invocations
xen/mcelog: eliminate redundant setting of interface version
x86/Xen: mark xen_setup_gdt() __init
If we don't use paravirt; don't play unnecessary and complicated games
to free page-tables.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Revert commits:
95b0e6357d x86/mm/tlb: Always use lazy TLB mode
64482aafe5 x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
ac03158969 x86/mm/tlb: Make lazy TLB mode lazier
61d0beb579 x86/mm/tlb: Restructure switch_mm_irqs_off()
2ff6ddf19c x86/mm/tlb: Leave lazy TLB mode at page table free time
In order to simplify the TLB invalidate fixes for x86 and unify the
parts that need backporting. We'll try again later.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
optimizations for ARMv8.4 systems, Userspace interface for RAS, Fault
path optimization, Emulated physical timer fixes, Random cleanups
x86: fixes for L1TF, a new test case, non-support for SGX (inject the
right exception in the guest), a lockdep false positive
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJbfXfZAAoJEL/70l94x66DL2QH/RnQZW4OaqVdE3pNvRvaNJGQ
41yk9aErbqPcK25aIKnhs9e3S+e32BhArA1YBwdHXwwuanANYv5W+o3HNTL0UFj7
UG6APKm5DR6kJeUZ3vCfyeZ/ZKxDW0uqf5DXQyHUiAhwLGw2wWYJ9Ttv0m0Q4Fxl
x9HEnK/s+komG93QT+2hIXtZdPiB026yBBqDDPyYiWrweyBagYUHz65p6qaPiOEY
HqOyLYKsgrqCv9U0NLTD9U54IWGFIaxMGgjyRdZTMCIQeGj6dAH7vyfURGOeDHvw
C0OZeEKRbMsHLwzXRBDEZp279pYgS7zafe/hMkr/znaac+j6xNwxpWwqg5Sm0UE=
=5yTH
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull second set of KVM updates from Paolo Bonzini:
"ARM:
- Support for Group0 interrupts in guests
- Cache management optimizations for ARMv8.4 systems
- Userspace interface for RAS
- Fault path optimization
- Emulated physical timer fixes
- Random cleanups
x86:
- fixes for L1TF
- a new test case
- non-support for SGX (inject the right exception in the guest)
- fix lockdep false positive"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits)
KVM: VMX: fixes for vmentry_l1d_flush module parameter
kvm: selftest: add dirty logging test
kvm: selftest: pass in extra memory when create vm
kvm: selftest: include the tools headers
kvm: selftest: unify the guest port macros
tools: introduce test_and_clear_bit
KVM: x86: SVM: Call x86_spec_ctrl_set_guest/host() with interrupts disabled
KVM: vmx: Inject #UD for SGX ENCLS instruction in guest
KVM: vmx: Add defines for SGX ENCLS exiting
x86/kvm/vmx: Fix coding style in vmx_setup_l1d_flush()
x86: kvm: avoid unused variable warning
KVM: Documentation: rename the capability of KVM_CAP_ARM_SET_SERROR_ESR
KVM: arm/arm64: Skip updating PTE entry if no change
KVM: arm/arm64: Skip updating PMD entry if no change
KVM: arm: Use true and false for boolean values
KVM: arm/arm64: vgic: Do not use spin_lock_irqsave/restore with irq disabled
KVM: arm/arm64: vgic: Move DEBUG_SPINLOCK_BUG_ON to vgic.h
KVM: arm: vgic-v3: Add support for ICC_SGI0R and ICC_ASGI1R accesses
KVM: arm64: vgic-v3: Add support for ICC_SGI0R_EL1 and ICC_ASGI1R_EL1 accesses
KVM: arm/arm64: vgic-v3: Add core support for Group0 SGIs
...
An ordinary arm64 defconfig build has ~64 KB worth of __ksymtab entries,
each consisting of two 64-bit fields containing absolute references, to
the symbol itself and to a char array containing its name, respectively.
When we build the same configuration with KASLR enabled, we end up with an
additional ~192 KB of relocations in the .init section, i.e., one 24 byte
entry for each absolute reference, which all need to be processed at boot
time.
Given how the struct kernel_symbol that describes each entry is completely
local to module.c (except for the references emitted by EXPORT_SYMBOL()
itself), we can easily modify it to contain two 32-bit relative references
instead. This reduces the size of the __ksymtab section by 50% for all
64-bit architectures, and gets rid of the runtime relocations entirely for
architectures implementing KASLR, either via standard PIE linking (arm64)
or using custom host tools (x86).
Note that the binary search involving __ksymtab contents relies on each
section being sorted by symbol name. This is implemented based on the
input section names, not the names in the ksymtab entries, so this patch
does not interfere with that.
Given that the use of place-relative relocations requires support both in
the toolchain and in the module loader, we cannot enable this feature for
all architectures. So make it dependent on whether
CONFIG_HAVE_ARCH_PREL32_RELOCATIONS is defined.
Link: http://lkml.kernel.org/r/20180704083651.24360-4-ard.biesheuvel@linaro.org
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Jessica Yu <jeyu@kernel.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: James Morris <james.morris@microsoft.com>
Cc: James Morris <jmorris@namei.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Nicolas Pitre <nico@linaro.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Garnier <thgarnie@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hardware support for basic SGX virtualization adds a new execution
control (ENCLS_EXITING), VMCS field (ENCLS_EXITING_BITMAP) and exit
reason (ENCLS), that enables a VMM to intercept specific ENCLS leaf
functions, e.g. to inject faults when the VMM isn't exposing SGX to
a VM. When ENCLS_EXITING is enabled, the VMM can set/clear bits in
the bitmap to intercept/allow ENCLS leaf functions in non-root, e.g.
setting bit 2 in the ENCLS_EXITING_BITMAP will cause ENCLS[EINIT]
to VMExit(ENCLS).
Note: EXIT_REASON_ENCLS was previously added by commit 1f51999270
("KVM: VMX: add missing exit reasons").
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Message-Id: <20180814163334.25724-2-sean.j.christopherson@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove Xen hypercall functions which are used nowhere in the kernel.
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Currently memory_failure() returns zero if the error was handled. On
that result mce_unmap_kpfn() is called to zap the page out of the kernel
linear mapping to prevent speculative fetches of potentially poisoned
memory. However, in the case of dax mapped devmap pages the page may be
in active permanent use by the device driver, so it cannot be unmapped
from the kernel.
Instead of marking the page not present, marking the page UC should
be sufficient for preventing poison from being pre-fetched into the
cache. Convert mce_unmap_pfn() to set_mce_nospec() remapping the page as
UC, to hide it from speculative accesses.
Given that that persistent memory errors can be cleared by the driver,
include a facility to restore the page to cacheable operation,
clear_mce_nospec().
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: <linux-edac@vger.kernel.org>
Cc: <x86@kernel.org>
Acked-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Ingo Molnar <mingo@redhat.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
On 32bit PAE kernels on 64bit hardware with enough physical bits,
l1tf_pfn_limit() will overflow unsigned long. This in turn affects
max_swapfile_size() and can lead to swapon returning -EINVAL. This has been
observed in a 32bit guest with 42 bits physical address size, where
max_swapfile_size() overflows exactly to 1 << 32, thus zero, and produces
the following warning to dmesg:
[ 6.396845] Truncating oversized swap area, only using 0k out of 2047996k
Fix this by using unsigned long long instead.
Fixes: 17dbca1193 ("x86/speculation/l1tf: Add sysfs reporting for l1tf")
Fixes: 377eeaa8e1 ("x86/speculation/l1tf: Limit swap file size to MAX_PA/2")
Reported-by: Dominique Leuenberger <dimstar@suse.de>
Reported-by: Adrian Schroeter <adrian@suse.de>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180820095835.5298-1-vbabka@suse.cz
Without linux/irq.h, there is no declaration of notifier_block, leading to
a build warning:
In file included from arch/x86/kernel/cpu/mcheck/threshold.c:10:
arch/x86/include/asm/mce.h:151:46: error: 'struct notifier_block' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
It's sufficient to declare the struct tag here, which avoids pulling in
more header files.
Fixes: 447ae31667 ("x86: Don't include linux/irq.h from asm/hardirq.h")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Nicolai Stange <nstange@suse.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Borislav Petkov <bp@suse.de>
Link: https://lkml.kernel.org/r/20180817100156.3009043-1-arnd@arndb.de
For x86 this brings in PCID emulation and CR3 caching for shadow page
tables, nested VMX live migration, nested VMCS shadowing, an optimized
IPI hypercall, and some optimizations.
ARM will come next week.
There is a semantic conflict because tip also added an .init_platform
callback to kvm.c. Please keep the initializer from this branch,
and add a call to kvmclock_init (added by tip) inside kvm_init_platform
(added here).
Also, there is a backmerge from 4.18-rc6. This is because of a
refactoring that conflicted with a relatively late bugfix and
resulted in a particularly hellish conflict. Because the conflict
was only due to unfortunate timing of the bugfix, I backmerged and
rebased the refactoring rather than force the resolution on you.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJbdwNFAAoJEL/70l94x66DiPEH/1cAGZWGd85Y3yRu1dmTmqiz
kZy0V+WTQ5kyJF4ZsZKKOp+xK7Qxh5e9kLdTo70uPZCHwLu9IaGKN9+dL9Jar3DR
yLPX5bMsL8UUed9g9mlhdaNOquWi7d7BseCOnIyRTolb+cqnM5h3sle0gqXloVrS
UQb4QogDz8+86czqR8tNfazjQRKW/D2HEGD5NDNVY1qtpY+leCDAn9/u6hUT5c6z
EtufgyDh35UN+UQH0e2605gt3nN3nw3FiQJFwFF1bKeQ7k5ByWkuGQI68XtFVhs+
2WfqL3ftERkKzUOy/WoSJX/C9owvhMcpAuHDGOIlFwguNGroZivOMVnACG1AI3I=
=9Mgw
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull first set of KVM updates from Paolo Bonzini:
"PPC:
- minor code cleanups
x86:
- PCID emulation and CR3 caching for shadow page tables
- nested VMX live migration
- nested VMCS shadowing
- optimized IPI hypercall
- some optimizations
ARM will come next week"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (85 commits)
kvm: x86: Set highest physical address bits in non-present/reserved SPTEs
KVM/x86: Use CC_SET()/CC_OUT in arch/x86/kvm/vmx.c
KVM: X86: Implement PV IPIs in linux guest
KVM: X86: Add kvm hypervisor init time platform setup callback
KVM: X86: Implement "send IPI" hypercall
KVM/x86: Move X86_CR4_OSXSAVE check into kvm_valid_sregs()
KVM: x86: Skip pae_root shadow allocation if tdp enabled
KVM/MMU: Combine flushing remote tlb in mmu_set_spte()
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_BASE when possible
KVM: vmx: skip VMWRITE of HOST_{FS,GS}_SEL when possible
KVM: vmx: always initialize HOST_{FS,GS}_BASE to zero during setup
KVM: vmx: move struct host_state usage to struct loaded_vmcs
KVM: vmx: compute need to reload FS/GS/LDT on demand
KVM: nVMX: remove a misleading comment regarding vmcs02 fields
KVM: vmx: rename __vmx_load_host_state() and vmx_save_host_state()
KVM: vmx: add dedicated utility to access guest's kernel_gs_base
KVM: vmx: track host_state.loaded using a loaded_vmcs pointer
KVM: vmx: refactor segmentation code in vmx_save_host_state()
kvm: nVMX: Fix fault priority for VMX operations
kvm: nVMX: Fix fault vector for VMX operation at CPL > 0
...
Here is the bit set of char/misc drivers for 4.19-rc1
There is a lot here, much more than normal, seems like everyone is
writing new driver subsystems these days... Anyway, major things here
are:
- new FSI driver subsystem, yet-another-powerpc low-level
hardware bus
- gnss, finally an in-kernel GPS subsystem to try to tame all of
the crazy out-of-tree drivers that have been floating around
for years, combined with some really hacky userspace
implementations. This is only for GNSS receivers, but you
have to start somewhere, and this is great to see.
Other than that, there are new slimbus drivers, new coresight drivers,
new fpga drivers, and loads of DT bindings for all of these and existing
drivers.
Full details of everything is in the shortlog.
All of these have been in linux-next for a while with no reported
issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCW3g7ew8cZ3JlZ0Brcm9h
aC5jb20ACgkQMUfUDdst+ykfBgCeOG0RkSI92XVZe0hs/QYFW9kk8JYAnRBf3Qpm
cvW7a+McOoKz/MGmEKsi
=TNfn
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the bit set of char/misc drivers for 4.19-rc1
There is a lot here, much more than normal, seems like everyone is
writing new driver subsystems these days... Anyway, major things here
are:
- new FSI driver subsystem, yet-another-powerpc low-level hardware
bus
- gnss, finally an in-kernel GPS subsystem to try to tame all of the
crazy out-of-tree drivers that have been floating around for years,
combined with some really hacky userspace implementations. This is
only for GNSS receivers, but you have to start somewhere, and this
is great to see.
Other than that, there are new slimbus drivers, new coresight drivers,
new fpga drivers, and loads of DT bindings for all of these and
existing drivers.
All of these have been in linux-next for a while with no reported
issues"
* tag 'char-misc-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (255 commits)
android: binder: Rate-limit debug and userspace triggered err msgs
fsi: sbefifo: Bump max command length
fsi: scom: Fix NULL dereference
misc: mic: SCIF Fix scif_get_new_port() error handling
misc: cxl: changed asterisk position
genwqe: card_base: Use true and false for boolean values
misc: eeprom: assignment outside the if statement
uio: potential double frees if __uio_register_device() fails
eeprom: idt_89hpesx: clean up an error pointer vs NULL inconsistency
misc: ti-st: Fix memory leak in the error path of probe()
android: binder: Show extra_buffers_size in trace
firmware: vpd: Fix section enabled flag on vpd_section_destroy
platform: goldfish: Retire pdev_bus
goldfish: Use dedicated macros instead of manual bit shifting
goldfish: Add missing includes to goldfish.h
mux: adgs1408: new driver for Analog Devices ADGS1408/1409 mux
dt-bindings: mux: add adi,adgs1408
Drivers: hv: vmbus: Cleanup synic memory free path
Drivers: hv: vmbus: Remove use of slow_virt_to_phys()
Drivers: hv: vmbus: Reset the channel callback in vmbus_onoffer_rescind()
...
It turns out that we should *not* invert all not-present mappings,
because the all zeroes case is obviously special.
clear_page() does not undergo the XOR logic to invert the address bits,
i.e. PTE, PMD and PUD entries that have not been individually written
will have val=0 and so will trigger __pte_needs_invert(). As a result,
{pte,pmd,pud}_pfn() will return the wrong PFN value, i.e. all ones
(adjusted by the max PFN mask) instead of zero. A zeroed entry is ok
because the page at physical address 0 is reserved early in boot
specifically to mitigate L1TF, so explicitly exempt them from the
inversion when reading the PFN.
Manifested as an unexpected mprotect(..., PROT_NONE) failure when called
on a VMA that has VM_PFNMAP and was mmap'd to as something other than
PROT_NONE but never used. mprotect() sends the PROT_NONE request down
prot_none_walk(), which walks the PTEs to check the PFNs.
prot_none_pte_entry() gets the bogus PFN from pte_pfn() and returns
-EACCES because it thinks mprotect() is trying to adjust a high MMIO
address.
[ This is a very modified version of Sean's original patch, but all
credit goes to Sean for doing this and also pointing out that
sometimes the __pte_needs_invert() function only gets the protection
bits, not the full eventual pte. But zero remains special even in
just protection bits, so that's ok. - Linus ]
Fixes: f22cc87f6c ("x86/speculation/l1tf: Invert all not present mappings")
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Acked-by: Andi Kleen <ak@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQJIBAABCgAyFiEEgMe7l+5h9hnxdsnuWYigwDrT+vwFAlt1f9AUHGJoZWxnYWFz
QGdvb2dsZS5jb20ACgkQWYigwDrT+vxbdhAArnhRvkwOk4m4/LCuKF6HpmlxbBNC
TjnBCenNf+lFXzWskfDFGFl/Wif4UzGbRTSCNQrwMzj3Ww3f/6R2QIq9rEJvyNC4
VdxQnaBEZSUgN87q5UGqgdjMTo3zFvlFH6fpb5XDiQ5IX/QZeXeYqoB64w+HvKPU
M+IsoOvnA5gb7pMcpchrGUnSfS1e6AqQbbTt6tZflore6YCEA4cH5OnpGx8qiZIp
ut+CMBvQjQB01fHeBc/wGrVte4NwXdONrXqpUb4sHF7HqRNfEh0QVyPhvebBi+k1
kquqoBQfPFTqgcab31VOcQhg70dEx+1qGm5/YBAwmhCpHR/g2gioFXoROsr+iUOe
BtF6LZr+Y8cySuhJnkCrJBqWvvBaKbJLg0KMbI+7p4o9MZpod2u7LS5LFrlRDyKW
3nz3o+b1+v3tCCKVKIhKo0ljolgkweQtR1f6KIHvq93wBODHVQnAOt9NlPfHVyks
ryGBnOhMjoU5hvfexgIWFk9Ph9MEVQSffkI+TeFPO/tyGBfGfQyGtESiXuEaMQaH
FGdZHX2RLkY3pWHOtWeMzRHzOnr2XjpDFcAqL3HBGPdJ30K3Umv3WOgoFe2SaocG
0gaddPjKSwwM4Sa/VP+O5cjGuzi7QnczSDdpYjxIGZzBav32hqx4/rsnLw7bHH8y
XkEme7cYJc8MGsA=
=2Dmn
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull pci updates from Bjorn Helgaas:
- Decode AER errors with names similar to "lspci" (Tyler Baicar)
- Expose AER statistics in sysfs (Rajat Jain)
- Clear AER status bits selectively based on the type of recovery (Oza
Pawandeep)
- Honor "pcie_ports=native" even if HEST sets FIRMWARE_FIRST (Alexandru
Gagniuc)
- Don't clear AER status bits if we're using the "Firmware-First"
strategy where firmware owns the registers (Alexandru Gagniuc)
- Use sysfs_match_string() to simplify ASPM sysfs parsing (Andy
Shevchenko)
- Remove unnecessary includes of <linux/pci-aspm.h> (Bjorn Helgaas)
- Defer DPC event handling to work queue (Keith Busch)
- Use threaded IRQ for DPC bottom half (Keith Busch)
- Print AER status while handling DPC events (Keith Busch)
- Work around IDT switch ACS Source Validation erratum (James
Puthukattukaran)
- Emit diagnostics for all cases of PCIe Link downtraining (Links
operating slower than they're capable of) (Alexandru Gagniuc)
- Skip VFs when configuring Max Payload Size (Myron Stowe)
- Reduce Root Port Max Payload Size if necessary when hot-adding a
device below it (Myron Stowe)
- Simplify SHPC existence/permission checks (Bjorn Helgaas)
- Remove hotplug sample skeleton driver (Lukas Wunner)
- Convert pciehp to threaded IRQ handling (Lukas Wunner)
- Improve pciehp tolerance of missed events and initially unstable
links (Lukas Wunner)
- Clear spurious pciehp events on resume (Lukas Wunner)
- Add pciehp runtime PM support, including for Thunderbolt controllers
(Lukas Wunner)
- Support interrupts from pciehp bridges in D3hot (Lukas Wunner)
- Mark fall-through switch cases before enabling -Wimplicit-fallthrough
(Gustavo A. R. Silva)
- Move DMA-debug PCI init from arch code to PCI core (Christoph
Hellwig)
- Fix pci_request_irq() usage of IRQF_ONESHOT when no handler is
supplied (Heiner Kallweit)
- Unify PCI and DMA direction #defines (Shunyong Yang)
- Add PCI_DEVICE_DATA() macro (Andy Shevchenko)
- Check for VPD completion before checking for timeout (Bert Kenward)
- Limit Netronome NFP5000 config space size to work around erratum
(Jakub Kicinski)
- Set IRQCHIP_ONESHOT_SAFE for PCI MSI irqchips (Heiner Kallweit)
- Document ACPI description of PCI host bridges (Bjorn Helgaas)
- Add "pci=disable_acs_redir=" parameter to disable ACS redirection for
peer-to-peer DMA support (we don't have the peer-to-peer support yet;
this is just one piece) (Logan Gunthorpe)
- Clean up devm_of_pci_get_host_bridge_resources() resource allocation
(Jan Kiszka)
- Fixup resizable BARs after suspend/resume (Christian König)
- Make "pci=earlydump" generic (Sinan Kaya)
- Fix ROM BAR access routines to stay in bounds and check for signature
correctly (Rex Zhu)
- Add DMA alias quirk for Microsemi Switchtec NTB (Doug Meyer)
- Expand documentation for pci_add_dma_alias() (Logan Gunthorpe)
- To avoid bus errors, enable PASID only if entire path supports
End-End TLP prefixes (Sinan Kaya)
- Unify slot and bus reset functions and remove hotplug knowledge from
callers (Sinan Kaya)
- Add Function-Level Reset quirks for Intel and Samsung NVMe devices to
fix guest reboot issues (Alex Williamson)
- Add function 1 DMA alias quirk for Marvell 88SS9183 PCIe SSD
Controller (Bjorn Helgaas)
- Remove Xilinx AXI-PCIe host bridge arch dependency (Palmer Dabbelt)
- Remove Aardvark outbound window configuration (Evan Wang)
- Fix Aardvark bridge window sizing issue (Zachary Zhang)
- Convert Aardvark to use pci_host_probe() to reduce code duplication
(Thomas Petazzoni)
- Correct the Cadence cdns_pcie_writel() signature (Alan Douglas)
- Add Cadence support for optional generic PHYs (Alan Douglas)
- Add Cadence power management ops (Alan Douglas)
- Remove redundant variable from Cadence driver (Colin Ian King)
- Add Kirin MSI support (Xiaowei Song)
- Drop unnecessary root_bus_nr setting from exynos, imx6, keystone,
armada8k, artpec6, designware-plat, histb, qcom, spear13xx (Shawn
Guo)
- Move link notification settings from DesignWare core to individual
drivers (Gustavo Pimentel)
- Add endpoint library MSI-X interfaces (Gustavo Pimentel)
- Correct signature of endpoint library IRQ interfaces (Gustavo
Pimentel)
- Add DesignWare endpoint library MSI-X callbacks (Gustavo Pimentel)
- Add endpoint library MSI-X test support (Gustavo Pimentel)
- Remove unnecessary GFP_ATOMIC from Hyper-V "new child" allocation
(Jia-Ju Bai)
- Add more devices to Broadcom PAXC quirk (Ray Jui)
- Work around corrupted Broadcom PAXC config space to enable SMMU and
GICv3 ITS (Ray Jui)
- Disable MSI parsing to work around broken Broadcom PAXC logic in some
devices (Ray Jui)
- Hide unconfigured functions to work around a Broadcom PAXC defect
(Ray Jui)
- Lower iproc log level to reduce console output during boot (Ray Jui)
- Fix mobiveil iomem/phys_addr_t type usage (Lorenzo Pieralisi)
- Fix mobiveil missing include file (Lorenzo Pieralisi)
- Add mobiveil Kconfig/Makefile support (Lorenzo Pieralisi)
- Fix mvebu I/O space remapping issues (Thomas Petazzoni)
- Use generic pci_host_bridge in mvebu instead of ARM-specific API
(Thomas Petazzoni)
- Whitelist VMD devices with fast interrupt handlers to avoid sharing
vectors with slow handlers (Keith Busch)
* tag 'pci-v4.19-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (153 commits)
PCI/AER: Don't clear AER bits if error handling is Firmware-First
PCI: Limit config space size for Netronome NFP5000
PCI/MSI: Set IRQCHIP_ONESHOT_SAFE for PCI-MSI irqchips
PCI/VPD: Check for VPD access completion before checking for timeout
PCI: Add PCI_DEVICE_DATA() macro to fully describe device ID entry
PCI: Match Root Port's MPS to endpoint's MPSS as necessary
PCI: Skip MPS logic for Virtual Functions (VFs)
PCI: Add function 1 DMA alias quirk for Marvell 88SS9183
PCI: Check for PCIe Link downtraining
PCI: Add ACS Redirect disable quirk for Intel Sunrise Point
PCI: Add device-specific ACS Redirect disable infrastructure
PCI: Convert device-specific ACS quirks from NULL termination to ARRAY_SIZE
PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support
PCI: Allow specifying devices using a base bus and path of devfns
PCI: Make specifying PCI devices in kernel parameters reusable
PCI: Hide ACS quirk declarations inside PCI core
PCI: Delay after FLR of Intel DC P3700 NVMe
PCI: Disable Samsung SM961/PM961 NVMe before FLR
PCI: Export pcie_has_flr()
PCI: mvebu: Drop bogus comment above mvebu_pcie_map_registers()
...
i8259.h uses inb/outb and thus needs to include asm/io.h to avoid the
following build error, as seen with x86_64:defconfig and CONFIG_SMP=n.
In file included from drivers/rtc/rtc-cmos.c:45:0:
arch/x86/include/asm/i8259.h: In function 'inb_pic':
arch/x86/include/asm/i8259.h:32:24: error:
implicit declaration of function 'inb'
arch/x86/include/asm/i8259.h: In function 'outb_pic':
arch/x86/include/asm/i8259.h:45:2: error:
implicit declaration of function 'outb'
Reported-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Suggested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Fixes: 447ae31667 ("x86: Don't include linux/irq.h from asm/hardirq.h")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQRTLbB6QfY48x44uB6AXGG7T9hjvgUCW3LkCgAKCRCAXGG7T9hj
vtyfAQDTMUqfBlpz9XqFyTBTFRkP3aVtnEeE7BijYec+RXPOxwEAsiXwZPsmW/AN
up+NEHqPvMOcZC8zJZ9THCiBgOxligY=
=F51X
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from Juergen Gross:
- add dma-buf functionality to Xen grant table handling
- fix for booting the kernel as Xen PVH dom0
- fix for booting the kernel as a Xen PV guest with
CONFIG_DEBUG_VIRTUAL enabled
- other minor performance and style fixes
* tag 'for-linus-4.19-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/balloon: fix balloon initialization for PVH Dom0
xen: don't use privcmd_call() from xen_mc_flush()
xen/pv: Call get_cpu_address_sizes to set x86_virt/phys_bits
xen/biomerge: Use true and false for boolean values
xen/gntdev: don't dereference a null gntdev_dmabuf on allocation failure
xen/spinlock: Don't use pvqspinlock if only 1 vCPU
xen/gntdev: Implement dma-buf import functionality
xen/gntdev: Implement dma-buf export functionality
xen/gntdev: Add initial support for dma-buf UAPI
xen/gntdev: Make private routines/structures accessible
xen/gntdev: Allow mappings for DMA buffers
xen/grant-table: Allow allocating buffers suitable for DMA
xen/balloon: Share common memory reservation routines
xen/grant-table: Make set/clear page private code shared
Merge L1 Terminal Fault fixes from Thomas Gleixner:
"L1TF, aka L1 Terminal Fault, is yet another speculative hardware
engineering trainwreck. It's a hardware vulnerability which allows
unprivileged speculative access to data which is available in the
Level 1 Data Cache when the page table entry controlling the virtual
address, which is used for the access, has the Present bit cleared or
other reserved bits set.
If an instruction accesses a virtual address for which the relevant
page table entry (PTE) has the Present bit cleared or other reserved
bits set, then speculative execution ignores the invalid PTE and loads
the referenced data if it is present in the Level 1 Data Cache, as if
the page referenced by the address bits in the PTE was still present
and accessible.
While this is a purely speculative mechanism and the instruction will
raise a page fault when it is retired eventually, the pure act of
loading the data and making it available to other speculative
instructions opens up the opportunity for side channel attacks to
unprivileged malicious code, similar to the Meltdown attack.
While Meltdown breaks the user space to kernel space protection, L1TF
allows to attack any physical memory address in the system and the
attack works across all protection domains. It allows an attack of SGX
and also works from inside virtual machines because the speculation
bypasses the extended page table (EPT) protection mechanism.
The assoicated CVEs are: CVE-2018-3615, CVE-2018-3620, CVE-2018-3646
The mitigations provided by this pull request include:
- Host side protection by inverting the upper address bits of a non
present page table entry so the entry points to uncacheable memory.
- Hypervisor protection by flushing L1 Data Cache on VMENTER.
- SMT (HyperThreading) control knobs, which allow to 'turn off' SMT
by offlining the sibling CPU threads. The knobs are available on
the kernel command line and at runtime via sysfs
- Control knobs for the hypervisor mitigation, related to L1D flush
and SMT control. The knobs are available on the kernel command line
and at runtime via sysfs
- Extensive documentation about L1TF including various degrees of
mitigations.
Thanks to all people who have contributed to this in various ways -
patches, review, testing, backporting - and the fruitful, sometimes
heated, but at the end constructive discussions.
There is work in progress to provide other forms of mitigations, which
might be less horrible performance wise for a particular kind of
workloads, but this is not yet ready for consumption due to their
complexity and limitations"
* 'l1tf-final' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (75 commits)
x86/microcode: Allow late microcode loading with SMT disabled
tools headers: Synchronise x86 cpufeatures.h for L1TF additions
x86/mm/kmmio: Make the tracer robust against L1TF
x86/mm/pat: Make set_memory_np() L1TF safe
x86/speculation/l1tf: Make pmd/pud_mknotpresent() invert
x86/speculation/l1tf: Invert all not present mappings
cpu/hotplug: Fix SMT supported evaluation
KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry
x86/speculation: Use ARCH_CAPABILITIES to skip L1D flush on vmentry
x86/speculation: Simplify sysfs report of VMX L1TF vulnerability
Documentation/l1tf: Remove Yonah processors from not vulnerable list
x86/KVM/VMX: Don't set l1tf_flush_l1d from vmx_handle_external_intr()
x86/irq: Let interrupt handlers set kvm_cpu_l1tf_flush_l1d
x86: Don't include linux/irq.h from asm/hardirq.h
x86/KVM/VMX: Introduce per-host-cpu analogue of l1tf_flush_l1d
x86/irq: Demote irq_cpustat_t::__softirq_pending to u16
x86/KVM/VMX: Move the l1tf_flush_l1d test to vmx_l1d_flush()
x86/KVM/VMX: Replace 'vmx_l1d_flush_always' with 'vmx_l1d_flush_cond'
x86/KVM/VMX: Don't set l1tf_flush_l1d to true from vmx_l1d_flush()
cpu/hotplug: detect SMT disabled by BIOS
...
Pull x86 timer updates from Thomas Gleixner:
"Early TSC based time stamping to allow better boot time analysis.
This comes with a general cleanup of the TSC calibration code which
grew warts and duct taping over the years and removes 250 lines of
code. Initiated and mostly implemented by Pavel with help from various
folks"
* 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
x86/kvmclock: Mark kvm_get_preset_lpj() as __init
x86/tsc: Consolidate init code
sched/clock: Disable interrupts when calling generic_sched_clock_init()
timekeeping: Prevent false warning when persistent clock is not available
sched/clock: Close a hole in sched_clock_init()
x86/tsc: Make use of tsc_calibrate_cpu_early()
x86/tsc: Split native_calibrate_cpu() into early and late parts
sched/clock: Use static key for sched_clock_running
sched/clock: Enable sched clock early
sched/clock: Move sched clock initialization and merge with generic clock
x86/tsc: Use TSC as sched clock early
x86/tsc: Initialize cyc2ns when tsc frequency is determined
x86/tsc: Calibrate tsc only once
ARM/time: Remove read_boot_clock64()
s390/time: Remove read_boot_clock64()
timekeeping: Default boot time offset to local_clock()
timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
s390/time: Add read_persistent_wall_and_boot_offset()
x86/xen/time: Output xen sched_clock time from 0
x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
...
Pull x86 PTI updates from Thomas Gleixner:
"The Speck brigade sadly provides yet another large set of patches
destroying the perfomance which we carefully built and preserved
- PTI support for 32bit PAE. The missing counter part to the 64bit
PTI code implemented by Joerg.
- A set of fixes for the Global Bit mechanics for non PCID CPUs which
were setting the Global Bit too widely and therefore possibly
exposing interesting memory needlessly.
- Protection against userspace-userspace SpectreRSB
- Support for the upcoming Enhanced IBRS mode, which is preferred
over IBRS. Unfortunately we dont know the performance impact of
this, but it's expected to be less horrible than the IBRS
hammering.
- Cleanups and simplifications"
* 'x86/pti' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
x86/mm/pti: Move user W+X check into pti_finalize()
x86/relocs: Add __end_rodata_aligned to S_REL
x86/mm/pti: Clone kernel-image on PTE level for 32 bit
x86/mm/pti: Don't clear permissions in pti_clone_pmd()
x86/mm/pti: Fix 32 bit PCID check
x86/mm/init: Remove freed kernel image areas from alias mapping
x86/mm/init: Add helper for freeing kernel image pages
x86/mm/init: Pass unconverted symbol addresses to free_init_pages()
mm: Allow non-direct-map arguments to free_reserved_area()
x86/mm/pti: Clear Global bit more aggressively
x86/speculation: Support Enhanced IBRS on future CPUs
x86/speculation: Protect against userspace-userspace spectreRSB
x86/kexec: Allocate 8k PGDs for PTI
Revert "perf/core: Make sure the ring-buffer is mapped in all page-tables"
x86/mm: Remove in_nmi() warning from vmalloc_fault()
x86/entry/32: Check for VM86 mode in slow-path check
perf/core: Make sure the ring-buffer is mapped in all page-tables
x86/pti: Check the return value of pti_user_pagetable_walk_pmd()
x86/pti: Check the return value of pti_user_pagetable_walk_p4d()
x86/entry/32: Add debug code to check entry/exit CR3
...
Pull misc x86 fixes from Thomas Gleixner:
"Two fixes for x86:
- Provide a declaration for native_save_fl() which unbreaks the
wreckage caused by making it 'extern inline'.
- Fix the failing paravirt patching which is supposed to replace
indirect with direct calls. The wreckage is caused by an incorrect
clobber test"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/paravirt: Fix spectre-v2 mitigations for paravirt guests
x86/irqflags: Provide a declaration for native_save_fl
Pull x86 mm updates from Thomas Gleixner:
- Make lazy TLB mode even lazier to avoid pointless switch_mm()
operations, which reduces CPU load by 1-2% for memcache workloads
- Small cleanups and improvements all over the place
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Remove redundant check for kmem_cache_create()
arm/asm/tlb.h: Fix build error implicit func declaration
x86/mm/tlb: Make clear_asid_other() static
x86/mm/tlb: Skip atomic operations for 'init_mm' in switch_mm_irqs_off()
x86/mm/tlb: Always use lazy TLB mode
x86/mm/tlb: Only send page table free TLB flush to lazy TLB CPUs
x86/mm/tlb: Make lazy TLB mode lazier
x86/mm/tlb: Restructure switch_mm_irqs_off()
x86/mm/tlb: Leave lazy TLB mode at page table free time
mm: Allocate the mm_cpumask (mm->cpu_bitmap[]) dynamically based on nr_cpu_ids
x86/mm: Add TLB purge to free pmd/pte page interfaces
ioremap: Update pgtable free interfaces with addr
x86/mm: Disable ioremap free page handling on x86-PAE
Pull x86/hyper-v update from Thomas Gleixner:
"Add fast hypercall support for guest running on the Microsoft HyperV(isor)"
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/hyper-v: Fix wrong merge conflict resolution
x86/hyper-v: Check for VP_INVAL in hyperv_flush_tlb_others()
x86/hyper-v: Check cpumask_to_vpset() return value in hyperv_flush_tlb_others_ex()
x86/hyper-v: Trace PV IPI send
x86/hyper-v: Use cheaper HVCALL_SEND_IPI hypercall when possible
x86/hyper-v: Use 'fast' hypercall for HVCALL_SEND_IPI
x86/hyper-v: Implement hv_do_fast_hypercall16
x86/hyper-v: Use cheaper HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} hypercalls when possible
Pull x86 cpu updates from Thomas Gleixner:
"Two small updates for the CPU code:
- Improve NUMA emulation
- Add the EPT_AD CPU feature bit"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpufeatures: Add EPT_AD feature bit
x86/numa_emulation: Introduce uniform split capability
x86/numa_emulation: Fix emulated-to-physical node mapping
Pull x86 asm updates from Thomas Gleixner:
"The lowlevel and ASM code updates for x86:
- Make stack trace unwinding more reliable
- ASM instruction updates for better code generation
- Various cleanups"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry/64: Add two more instruction suffixes
x86/asm/64: Use 32-bit XOR to zero registers
x86/build/vdso: Simplify 'cmd_vdso2c'
x86/build/vdso: Remove unused vdso-syms.lds
x86/stacktrace: Enable HAVE_RELIABLE_STACKTRACE for the ORC unwinder
x86/unwind/orc: Detect the end of the stack
x86/stacktrace: Do not fail for ORC with regs on stack
x86/stacktrace: Clarify the reliable success paths
x86/stacktrace: Remove STACKTRACE_DUMP_ONCE
x86/stacktrace: Do not unwind after user regs
x86/asm: Use CC_SET/CC_OUT in percpu_cmpxchg8b_double() to micro-optimize code generation
Pull perf update from Thomas Gleixner:
"The perf crowd presents:
Kernel updates:
- Removal of jprobes
- Cleanup and consolidatation the handling of kprobes
- Cleanup and consolidation of hardware breakpoints
- The usual pile of fixes and updates to PMUs and event descriptors
Tooling updates:
- Updates and improvements all over the place. Nothing outstanding,
just the (good) boring incremental grump work"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (103 commits)
perf trace: Do not require --no-syscalls to suppress strace like output
perf bpf: Include uapi/linux/bpf.h from the 'perf trace' script's bpf.h
perf tools: Allow overriding MAX_NR_CPUS at compile time
perf bpf: Show better message when failing to load an object
perf list: Unify metric group description format with PMU event description
perf vendor events arm64: Update ThunderX2 implementation defined pmu core events
perf cs-etm: Generate branch sample for CS_ETM_TRACE_ON packet
perf cs-etm: Generate branch sample when receiving a CS_ETM_TRACE_ON packet
perf cs-etm: Support dummy address value for CS_ETM_TRACE_ON packet
perf cs-etm: Fix start tracing packet handling
perf build: Fix installation directory for eBPF
perf c2c report: Fix crash for empty browser
perf tests: Fix indexing when invoking subtests
perf trace: Beautify the AF_INET & AF_INET6 'socket' syscall 'protocol' args
perf trace beauty: Add beautifiers for 'socket''s 'protocol' arg
perf trace beauty: Do not print NULL strarray entries
perf beauty: Add a generator for IPPROTO_ socket's protocol constants
tools include uapi: Grab a copy of linux/in.h
perf tests: Fix complex event name parsing
perf evlist: Fix error out while applying initial delay and LBR
...
Pull locking/atomics update from Thomas Gleixner:
"The locking, atomics and memory model brains delivered:
- A larger update to the atomics code which reworks the ordering
barriers, consolidates the atomic primitives, provides the new
atomic64_fetch_add_unless() primitive and cleans up the include
hell.
- Simplify cmpxchg() instrumentation and add instrumentation for
xchg() and cmpxchg_double().
- Updates to the memory model and documentation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
locking/atomics: Rework ordering barriers
locking/atomics: Instrument cmpxchg_double*()
locking/atomics: Instrument xchg()
locking/atomics: Simplify cmpxchg() instrumentation
locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
tools/memory-model: Rename litmus tests to comply to norm7
tools/memory-model/Documentation: Fix typo, smb->smp
sched/Documentation: Update wake_up() & co. memory-barrier guarantees
locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
sched/core: Use smp_mb() in wake_woken_function()
tools/memory-model: Add informal LKMM documentation to MAINTAINERS
locking/atomics/Documentation: Describe atomic_set() as a write operation
tools/memory-model: Make scripts executable
tools/memory-model: Remove ACCESS_ONCE() from model
tools/memory-model: Remove ACCESS_ONCE() from recipes
locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
tools/memory-model: Add litmus test for full multicopy atomicity
locking/refcount: Always allow checked forms
...
The user page-table gets the updated kernel mappings in pti_finalize(),
which runs after the RO+X permissions got applied to the kernel page-table
in mark_readonly().
But with CONFIG_DEBUG_WX enabled, the user page-table is already checked in
mark_readonly() for insecure mappings. This causes false-positive
warnings, because the user page-table did not get the updated mappings yet.
Move the W+X check for the user page-table into pti_finalize() after it
updated all required mappings.
[ tglx: Folded !NX supported fix ]
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1533727000-9172-1-git-send-email-joro@8bytes.org
Some cases in THP like:
- MADV_FREE
- mprotect
- split
mark the PMD non present for temporarily to prevent races. The window for
an L1TF attack in these contexts is very small, but it wants to be fixed
for correctness sake.
Use the proper low level functions for pmd/pud_mknotpresent() to address
this.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
For kernel mappings PAGE_PROTNONE is not necessarily set for a non present
mapping, but the inversion logic explicitely checks for !PRESENT and
PROT_NONE.
Remove the PROT_NONE check and make the inversion unconditional for all not
present mappings.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Using privcmd_call() for a singleton multicall seems to be wrong, as
privcmd_call() is using stac()/clac() to enable hypervisor access to
Linux user space.
Even if currently not a problem (pv domains can't use SMAP while HVM
and PVH domains can't use multicalls) things might change when
PVH dom0 support is added to the kernel.
Reported-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The kernel image is mapped into two places in the virtual address space
(addresses without KASLR, of course):
1. The kernel direct map (0xffff880000000000)
2. The "high kernel map" (0xffffffff81000000)
We actually execute out of #2. If we get the address of a kernel symbol,
it points to #2, but almost all physical-to-virtual translations point to
Parts of the "high kernel map" alias are mapped in the userspace page
tables with the Global bit for performance reasons. The parts that we map
to userspace do not (er, should not) have secrets. When PTI is enabled then
the global bit is usually not set in the high mapping and just used to
compensate for poor performance on systems which lack PCID.
This is fine, except that some areas in the kernel image that are adjacent
to the non-secret-containing areas are unused holes. We free these holes
back into the normal page allocator and reuse them as normal kernel memory.
The memory will, of course, get *used* via the normal map, but the alias
mapping is kept.
This otherwise unused alias mapping of the holes will, by default keep the
Global bit, be mapped out to userspace, and be vulnerable to Meltdown.
Remove the alias mapping of these pages entirely. This is likely to
fracture the 2M page mapping the kernel image near these areas, but this
should affect a minority of the area.
The pageattr code changes *all* aliases mapping the physical pages that it
operates on (by default). We only want to modify a single alias, so we
need to tweak its behavior.
This unmapping behavior is currently dependent on PTI being in place.
Going forward, we should at least consider doing this for all
configurations. Having an extra read-write alias for memory is not exactly
ideal for debugging things like random memory corruption and this does
undercut features like DEBUG_PAGEALLOC or future work like eXclusive Page
Frame Ownership (XPFO).
Before this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte
current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd
current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
After this patch:
current_kernel:---[ High Kernel Mapping ]---
current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K pte
current_kernel-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
current_kernel-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
current_kernel-0xffffffff82488000-0xffffffff82600000 1504K pte
current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
current_kernel-0xffffffff82c00000-0xffffffff82c0d000 52K RW NX pte
current_kernel-0xffffffff82c0d000-0xffffffff82dc0000 1740K pte
current_user:---[ High Kernel Mapping ]---
current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
current_user-0xffffffff81e11000-0xffffffff82000000 1980K pte
current_user-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
current_user-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
current_user-0xffffffff82488000-0xffffffff82600000 1504K pte
current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
[ tglx: Do not unmap on 32bit as there is only one mapping ]
Fixes: 0f561fce4d ("x86/pti: Enable global pages for shared areas")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Link: https://lkml.kernel.org/r/20180802225831.5F6A2BFC@viggo.jf.intel.com
Implement paravirtual apic hooks to enable PV IPIs for KVM if the "send IPI"
hypercall is available. The hypercall lets a guest send IPIs, with
at most 128 destinations per hypercall in 64-bit mode and 64 vCPUs per
hypercall in 32-bit mode.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Wanpeng Li <wanpengli@tencent.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to provide a way for platforms to register hv tlb remote
flush callback and this helps to optimize operation of tlb flush
among vcpus for nested virtualization case.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch is to add hyperv_nested_flush_guest_mapping support to trace
hvFlushGuestPhysicalAddressSpace hypercall.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Hyper-V supports a pv hypercall HvFlushGuestPhysicalAddressSpace to
flush nested VM address space mapping in l1 hypervisor and it's to
reduce overhead of flushing ept tlb among vcpus. This patch is to
implement it.
Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Acked-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It is a duplicate of X86_CR3_PCID_NOFLUSH. So just use that instead.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Adds support for storing multiple previous CR3/root_hpa pairs maintained
as an LRU cache, so that the lockless CR3 switch path can be used when
switching back to any of them.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This needs a minor bug fix. The updated patch is as follows.
Thanks,
Junaid
------------------------------------------------------------------------------
kvm_mmu_invlpg() and kvm_mmu_invpcid_gva() only need to flush the TLB
entries for the specific guest virtual address, instead of flushing all
TLB entries associated with the VM.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_mmu_free_roots() now takes a mask specifying which roots to free, so
that either one of the roots (active/previous) can be individually freed
when needed.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This allows invlpg() to be called using either the active root_hpa
or the prev_root_hpa.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When PCIDs are enabled, the MSb of the source operand for a MOV-to-CR3
instruction indicates that the TLB doesn't need to be flushed.
This change enables this optimization for MOV-to-CR3s in the guest
that have been intercepted by KVM for shadow paging and are handled
within the fast CR3 switch path.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Implement support for INVPCID in shadow paging mode as well.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The KVM_REQ_LOAD_CR3 request loads the hardware CR3 using the
current root_hpa.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using shadow paging, a CR3 switch in the guest results in a VM Exit.
In the common case, that VM exit doesn't require much processing by KVM.
However, it does acquire the MMU lock, which can start showing signs of
contention under some workloads even on a 2 VCPU VM when the guest is
using KPTI. Therefore, we add a fast path that avoids acquiring the MMU
lock in the most common cases e.g. when switching back and forth between
the kernel and user mode CR3s used by KPTI with no guest page table
changes in between.
For now, this fast path is implemented only for 64-bit guests and hosts
to avoid the handling of PDPTEs, but it can be extended later to 32-bit
guests and/or hosts as well.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
For nested virtualization L0 KVM is managing a bit of state for L2 guests,
this state can not be captured through the currently available IOCTLs. In
fact the state captured through all of these IOCTLs is usually a mix of L1
and L2 state. It is also dependent on whether the L2 guest was running at
the moment when the process was interrupted to save its state.
With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
that is in VMX operation.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: x86@kernel.org
Cc: kvm@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Jim Mattson <jmattson@google.com>
[karahmed@ - rename structs and functions and make them ready for AMD and
address previous comments.
- handle nested.smm state.
- rebase & a bit of refactoring.
- Merge 7/8 and 8/8 into one patch. ]
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the vCPU enters system management mode while running a nested guest,
RSM starts processing the vmentry while still in SMM. In that case,
however, the pages pointed to by the vmcs12 might be incorrectly
loaded from SMRAM. To avoid this, delay the handling of the pages
until just before the next vmentry. This is done with a new request
and a new entry in kvm_x86_ops, which we will be able to reuse for
nested VMX state migration.
Extracted from a patch by Jim Mattson and KarimAllah Ahmed.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When chunks of the kernel image are freed, free_init_pages() is used
directly. Consolidate the three sites that do this. Also update the
string to give an incrementally better description of that memory versus
what was there before.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@google.com
Cc: aarcange@redhat.com
Cc: jgross@suse.com
Cc: jpoimboe@redhat.com
Cc: gregkh@linuxfoundation.org
Cc: peterz@infradead.org
Cc: hughd@google.com
Cc: torvalds@linux-foundation.org
Cc: bp@alien8.de
Cc: luto@kernel.org
Cc: ak@linux.intel.com
Cc: Kees Cook <keescook@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Link: https://lkml.kernel.org/r/20180802225829.FE0E32EA@viggo.jf.intel.com
When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry. Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.
Add the relevant Documentation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Bit 3 of ARCH_CAPABILITIES tells a hypervisor that L1D flush on vmentry is
not needed. Add a new value to enum vmx_l1d_flush_state, which is used
either if there is no L1TF bug at all, or if bit 3 is set in ARCH_CAPABILITIES.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The last missing piece to having vmx_l1d_flush() take interrupts after
VMEXIT into account is to set the kvm_cpu_l1tf_flush_l1d per-cpu flag on
irq entry.
Issue calls to kvm_set_cpu_l1tf_flush_l1d() from entering_irq(),
ipi_entering_ack_irq(), smp_reschedule_interrupt() and
uv_bau_message_interrupt().
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The next patch in this series will have to make the definition of
irq_cpustat_t available to entering_irq().
Inclusion of asm/hardirq.h into asm/apic.h would cause circular header
dependencies like
asm/smp.h
asm/apic.h
asm/hardirq.h
linux/irq.h
linux/topology.h
linux/smp.h
asm/smp.h
or
linux/gfp.h
linux/mmzone.h
asm/mmzone.h
asm/mmzone_64.h
asm/smp.h
asm/apic.h
asm/hardirq.h
linux/irq.h
linux/irqdesc.h
linux/kobject.h
linux/sysfs.h
linux/kernfs.h
linux/idr.h
linux/gfp.h
and others.
This causes compilation errors because of the header guards becoming
effective in the second inclusion: symbols/macros that had been defined
before wouldn't be available to intermediate headers in the #include chain
anymore.
A possible workaround would be to move the definition of irq_cpustat_t
into its own header and include that from both, asm/hardirq.h and
asm/apic.h.
However, this wouldn't solve the real problem, namely asm/harirq.h
unnecessarily pulling in all the linux/irq.h cruft: nothing in
asm/hardirq.h itself requires it. Also, note that there are some other
archs, like e.g. arm64, which don't have that #include in their
asm/hardirq.h.
Remove the linux/irq.h #include from x86' asm/hardirq.h.
Fix resulting compilation errors by adding appropriate #includes to *.c
files as needed.
Note that some of these *.c files could be cleaned up a bit wrt. to their
set of #includes, but that should better be done from separate patches, if
at all.
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Part of the L1TF mitigation for vmx includes flushing the L1D cache upon
VMENTRY.
L1D flushes are costly and two modes of operations are provided to users:
"always" and the more selective "conditional" mode.
If operating in the latter, the cache would get flushed only if a host side
code path considered unconfined had been traversed. "Unconfined" in this
context means that it might have pulled in sensitive data like user data
or kernel crypto keys.
The need for L1D flushes is tracked by means of the per-vcpu flag
l1tf_flush_l1d. KVM exit handlers considered unconfined set it. A
vmx_l1d_flush() subsequently invoked before the next VMENTER will conduct a
L1d flush based on its value and reset that flag again.
Currently, interrupts delivered "normally" while in root operation between
VMEXIT and VMENTER are not taken into account. Part of the reason is that
these don't leave any traces and thus, the vmx code is unable to tell if
any such has happened.
As proposed by Paolo Bonzini, prepare for tracking all interrupts by
introducing a new per-cpu flag, "kvm_cpu_l1tf_flush_l1d". It will be in
strong analogy to the per-vcpu ->l1tf_flush_l1d.
A later patch will make interrupt handlers set it.
For the sake of cache locality, group kvm_cpu_l1tf_flush_l1d into x86'
per-cpu irq_cpustat_t as suggested by Peter Zijlstra.
Provide the helpers kvm_set_cpu_l1tf_flush_l1d(),
kvm_clear_cpu_l1tf_flush_l1d() and kvm_get_cpu_l1tf_flush_l1d(). Make them
trivial resp. non-existent for !CONFIG_KVM_INTEL as appropriate.
Let vmx_l1d_flush() handle kvm_cpu_l1tf_flush_l1d in the same way as
l1tf_flush_l1d.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
An upcoming patch will extend KVM's L1TF mitigation in conditional mode
to also cover interrupts after VMEXITs. For tracking those, stores to a
new per-cpu flag from interrupt handlers will become necessary.
In order to improve cache locality, this new flag will be added to x86's
irq_cpustat_t.
Make some space available there by shrinking the ->softirq_pending bitfield
from 32 to 16 bits: the number of bits actually used is only NR_SOFTIRQS,
i.e. 10.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Nicolai Stange <nstange@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Future Intel processors will support "Enhanced IBRS" which is an "always
on" mode i.e. IBRS bit in SPEC_CTRL MSR is enabled once and never
disabled.
From the specification [1]:
"With enhanced IBRS, the predicted targets of indirect branches
executed cannot be controlled by software that was executed in a less
privileged predictor mode or on another logical processor. As a
result, software operating on a processor with enhanced IBRS need not
use WRMSR to set IA32_SPEC_CTRL.IBRS after every transition to a more
privileged predictor mode. Software can isolate predictor modes
effectively simply by setting the bit once. Software need not disable
enhanced IBRS prior to entering a sleep state such as MWAIT or HLT."
If Enhanced IBRS is supported by the processor then use it as the
preferred spectre v2 mitigation mechanism instead of Retpoline. Intel's
Retpoline white paper [2] states:
"Retpoline is known to be an effective branch target injection (Spectre
variant 2) mitigation on Intel processors belonging to family 6
(enumerated by the CPUID instruction) that do not have support for
enhanced IBRS. On processors that support enhanced IBRS, it should be
used for mitigation instead of retpoline."
The reason why Enhanced IBRS is the recommended mitigation on processors
which support it is that these processors also support CET which
provides a defense against ROP attacks. Retpoline is very similar to ROP
techniques and might trigger false positives in the CET defense.
If Enhanced IBRS is selected as the mitigation technique for spectre v2,
the IBRS bit in SPEC_CTRL MSR is set once at boot time and never
cleared. Kernel also has to make sure that IBRS bit remains set after
VMEXIT because the guest might have cleared the bit. This is already
covered by the existing x86_spec_ctrl_set_guest() and
x86_spec_ctrl_restore_host() speculation control functions.
Enhanced IBRS still requires IBPB for full mitigation.
[1] Speculative-Execution-Side-Channel-Mitigations.pdf
[2] Retpoline-A-Branch-Target-Injection-Mitigation.pdf
Both documents are available at:
https://bugzilla.kernel.org/show_bug.cgi?id=199511
Originally-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim C Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Link: https://lkml.kernel.org/r/1533148945-24095-1-git-send-email-sai.praneeth.prakhya@intel.com
Some Intel processors have an EPT feature whereby the accessed & dirty bits
in EPT entries can be updated by HW. MSR IA32_VMX_EPT_VPID_CAP exposes the
presence of this capability.
There is no point in trying to use that new feature bit in the VMX code as
VMX needs to read the MSR anyway to access other bits, but having the
feature bit for EPT_AD in place helps virtualization management as it
exposes "ept_ad" in /proc/cpuinfo/$proc/flags if the feature is present.
[ tglx: Amended changelog ]
Signed-off-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: Peter Shier <pshier@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jim Mattson <jmattson@google.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20180801180657.138051-1-pshier@google.com
Get rid of ISA specific code from vmus_drv.c which is common code.
Fixes: 81b18bce48 ("Drivers: HV: Send one page worth of kmsg dump over Hyper-V during panic")
Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The pebs_drain() need to support fixed counters. The DS Save Area now
include "counter reset value" fields for each fixed counters.
Extend the related variables (e.g. mask, counters, error) to support
fixed counters. There is no extended PEBS in PEBS v2 and earlier PEBS
format. Only need to change the code for PEBS v3 and later PEBS format.
Extend the pebs_event_reset[] logic to support new "counter reset value" fields.
Increase the reserve space for fixed counters.
Based-on-code-from: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/20180309021542.11374-3-kan.liang@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The LOCK_PREFIX macro should be used in the __raw_callee_save___pv_queued_spin_unlock()
assembly code, so that the lock prefix can be patched out on UP systems.
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Joe Mario <jmario@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1531858560-21547-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 pti fixes from Ingo Molnar:
"An APM fix, and a BTS hardware-tracing fix related to PTI changes"
* 'x86-pti-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apm: Don't access __preempt_count with zeroed fs
x86/events/intel/ds: Fix bts_interrupt_threshold alignment
This adds the needed special case for PAE to get the LDT mapped into the
user page-table when PTI is enabled. The big difference to the other paging
modes is that on PAE there is no full top-level PGD entry available for the
LDT, but only a PMD entry.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-37-git-send-email-joro@8bytes.org
It marks the end of the address-space range reserved for the LDT. The
LDT-code will use it when unmapping the LDT for user-space.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-35-git-send-email-joro@8bytes.org
Introduce a new function to finalize the kernel mappings for the userspace
page-table after all ro/nx protections have been applied to the kernel
mappings.
Also move the call to pti_clone_kernel_text() to that function so that it
will run on 32 bit kernels too.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-30-git-send-email-joro@8bytes.org
The pti_clone_kernel_text() function references __end_rodata_hpage_align,
which is only present on x86-64. This makes sense as the end of the rodata
section is not huge-page aligned on 32 bit.
Nevertheless a symbol is required for the function that points at the right
address for both 32 and 64 bit. Introduce __end_rodata_aligned for that
purpose and use it in pti_clone_kernel_text().
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-28-git-send-email-joro@8bytes.org
Generic page-table code populates all non-leaf entries with _KERNPG_TABLE
bits set. This is fine for all paging modes except PAE.
In PAE mode only a subset of the bits is allowed to be set. Make sure to
only set allowed bits by masking out the reserved bits.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-22-git-send-email-joro@8bytes.org
These two functions are required for PTI on 32 bit:
* pgdp_maps_userspace()
* pgd_large()
Also re-implement pgdp_maps_userspace() so that it will work on 64 and 32
bit kernels.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-21-git-send-email-joro@8bytes.org
The way page-table folding is implemented on 32 bit, these functions are
not only setting, but also PUDs and even PMDs. Give the function a more
generic name to reflect that.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-16-git-send-email-joro@8bytes.org
The function does not update sp0 anymore but updates makes the task-stack
visible for entry code. This is by either writing it to sp1 or by doing a
hypercall. Rename the function to get rid of the misleading name.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-15-git-send-email-joro@8bytes.org
Use the entry-stack as a trampoline to enter the kernel. The entry-stack is
already in the cpu_entry_area and will be mapped to userspace when PTI is
enabled.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Pavel Machek <pavel@ucw.cz>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: linux-mm@kvack.org
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Waiman Long <llong@redhat.com>
Cc: "David H . Gutteridge" <dhgutteridge@sympatico.ca>
Cc: joro@8bytes.org
Link: https://lkml.kernel.org/r/1531906876-13451-8-git-send-email-joro@8bytes.org
It supposed to be safe to modify static branches after jump_label_init().
But, because static key modifying code eventually calls text_poke() it can
end up accessing a struct page which has not been initialized yet.
Here is how to quickly reproduce the problem. Insert code like this
into init/main.c:
| +static DEFINE_STATIC_KEY_FALSE(__test);
| asmlinkage __visible void __init start_kernel(void)
| {
| char *command_line;
|@@ -587,6 +609,10 @@ asmlinkage __visible void __init start_kernel(void)
| vfs_caches_init_early();
| sort_main_extable();
| trap_init();
|+ {
|+ static_branch_enable(&__test);
|+ WARN_ON(!static_branch_likely(&__test));
|+ }
| mm_init();
The following warnings show-up:
WARNING: CPU: 0 PID: 0 at arch/x86/kernel/alternative.c:701 text_poke+0x20d/0x230
RIP: 0010:text_poke+0x20d/0x230
Call Trace:
? text_poke_bp+0x50/0xda
? arch_jump_label_transform+0x89/0xe0
? __jump_label_update+0x78/0xb0
? static_key_enable_cpuslocked+0x4d/0x80
? static_key_enable+0x11/0x20
? start_kernel+0x23e/0x4c8
? secondary_startup_64+0xa5/0xb0
---[ end trace abdc99c031b8a90a ]---
If the code above is moved after mm_init(), no warning is shown, as struct
pages are initialized during handover from memblock.
Use text_poke_early() in static branching until early boot IRQs are enabled
and from there switch to text_poke. Also, ensure text_poke() is never
invoked when unitialized memory access may happen by using adding a
!after_bootmem assertion.
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: douly.fnst@cn.fujitsu.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-9-pasha.tatashin@oracle.com
Now that CPUs in lazy TLB mode no longer receive TLB shootdown IPIs, except
at page table freeing time, and idle CPUs will no longer get shootdown IPIs
for things like mprotect and madvise, we can always use lazy TLB mode.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-7-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy discovered that speculative memory accesses while in lazy
TLB mode can crash a system, when a CPU tries to dereference a
speculative access using memory contents that used to be valid
page table memory, but have since been reused for something else
and point into la-la land.
The latter problem can be prevented in two ways. The first is to
always send a TLB shootdown IPI to CPUs in lazy TLB mode, while
the second one is to only send the TLB shootdown at page table
freeing time.
The second should result in fewer IPIs, since operationgs like
mprotect and madvise are very common with some workloads, but
do not involve page table freeing. Also, on munmap, batching
of page table freeing covers much larger ranges of virtual
memory than the batching of unmapped user pages.
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: efault@gmx.de
Cc: kernel-team@fb.com
Cc: luto@kernel.org
Link: http://lkml.kernel.org/r/20180716190337.26133-3-riel@surriel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All copy_to_user() implementations need to be prepared to handle faults
accessing userspace. The __memcpy_mcsafe() implementation handles both
mmu-faults on the user destination and machine-check-exceptions on the
source buffer. However, the memcpy_mcsafe() wrapper may silently
fallback to memcpy() depending on build options and cpu-capabilities.
Force copy_to_user_mcsafe() to always use __memcpy_mcsafe() when
available, and otherwise disable all of the copy_to_user_mcsafe()
infrastructure when __memcpy_mcsafe() is not available, i.e.
CONFIG_X86_MCE=n.
This fixes crashes of the form:
run fstests generic/323 at 2018-07-02 12:46:23
BUG: unable to handle kernel paging request at 00007f0d50001000
RIP: 0010:__memcpy+0x12/0x20
[..]
Call Trace:
copyout_mcsafe+0x3a/0x50
_copy_to_iter_mcsafe+0xa1/0x4a0
? dax_alive+0x30/0x50
dax_iomap_actor+0x1f9/0x280
? dax_iomap_rw+0x100/0x100
iomap_apply+0xba/0x130
? dax_iomap_rw+0x100/0x100
dax_iomap_rw+0x95/0x100
? dax_iomap_rw+0x100/0x100
xfs_file_dax_read+0x7b/0x1d0 [xfs]
xfs_file_read_iter+0xa7/0xc0 [xfs]
aio_read+0x11c/0x1a0
Reported-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Fixes: 8780356ef6 ("x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()")
Link: http://lkml.kernel.org/r/153108277790.37979.1486841789275803399.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.
The possible values are:
full
Provides all available mitigations for the L1TF vulnerability. Disables
SMT and enables all mitigations in the hypervisors. SMT control via
/sys/devices/system/cpu/smt/control is still possible after boot.
Hypervisors will issue a warning when the first VM is started in
a potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
full,force
Same as 'full', but disables SMT control. Implies the 'nosmt=force'
command line option. sysfs control of SMT and the hypervisor flush
control is disabled.
flush
Leaves SMT enabled and enables the conditional hypervisor mitigation.
Hypervisors will issue a warning when the first VM is started in a
potentially insecure configuration, i.e. SMT enabled or L1D flush
disabled.
flush,nosmt
Disables SMT and enables the conditional hypervisor mitigation. SMT
control via /sys/devices/system/cpu/smt/control is still possible
after boot. If SMT is reenabled or flushing disabled at runtime
hypervisors will issue a warning.
flush,nowarn
Same as 'flush', but hypervisors will not warn when
a VM is started in a potentially insecure configuration.
off
Disables hypervisor mitigations and doesn't emit any warnings.
Default is 'flush'.
Let KVM adhere to these semantics, which means:
- 'lt1f=full,force' : Performe L1D flushes. No runtime control
possible.
- 'l1tf=full'
- 'l1tf-flush'
- 'l1tf=flush,nosmt' : Perform L1D flushes and warn on VM start if
SMT has been runtime enabled or L1D flushing
has been run-time enabled
- 'l1tf=flush,nowarn' : Perform L1D flushes and no warnings are emitted.
- 'l1tf=off' : L1D flushes are not performed and no warnings
are emitted.
KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when lt1f=full,force is set.
This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.
Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de
If Extended Page Tables (EPT) are disabled or not supported, no L1D
flushing is required. The setup function can just avoid setting up the L1D
flush for the EPT=n case.
Invoke it after the hardware setup has be done and enable_ept has the
correct state and expose the EPT disabled state in the mitigation status as
well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.612160168@linutronix.de
Store the effective mitigation of VMX in a status variable and use it to
report the VMX state in the l1tf sysfs file.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142322.433098358@linutronix.de
In the VM mode on Hyper-V, currently, when the kernel panics, an error
code and few register values are populated in an MSR and the Hypervisor
notified. This information is collected on the host. The amount of
information currently collected is found to be limited and not very
actionable. To gather more actionable data, such as stack trace, the
proposal is to write one page worth of kmsg data on an allocated page
and the Hypervisor notified of the page address through the MSR.
- Sysctl option to control the behavior, with ON by default.
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The enum is currently defined in Intel-specific DMAR header file,
but it is also used by APIC common code. Therefore, move it to
a more appropriate interrupt-remapping common header file.
This will also be used by subsequent patches.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Add the logic for flushing L1D on VMENTER. The flush depends on the static
key being enabled and the new l1tf_flush_l1d flag being set.
The flags is set:
- Always, if the flush module parameter is 'always'
- Conditionally at:
- Entry to vcpu_run(), i.e. after executing user space
- From the sched_in notifier, i.e. when switching to a vCPU thread.
- From vmexit handlers which are considered unsafe, i.e. where
sensitive data can be brought into L1D:
- The emulator, which could be a good target for other speculative
execution-based threats,
- The MMU, which can bring host page tables in the L1 cache.
- External interrupts
- Nested operations that require the MMU (see above). That is
vmptrld, vmptrst, vmclear,vmwrite,vmread.
- When handling invept,invvpid
[ tglx: Split out from combo patch and reduced to a single flag ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD aka 0x10B) which has similar write-only semantics to other
MSRs defined in the document.
The semantics of this MSR is to allow "finer granularity invalidation of
caching structures than existing mechanisms like WBINVD. It will writeback
and invalidate the L1 data cache, including all cachelines brought in by
preceding instructions, without invalidating all caches (eg. L2 or
LLC). Some processors may also invalidate the first level level instruction
cache on a L1D_FLUSH command. The L1 data and instruction caches may be
shared across the logical processors of a core."
Use it instead of the loop based L1 flush algorithm.
A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199511
[ tglx: Avoid allocating pages when the MSR is available ]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The Hyper-V feature and hint flags in hyperv-tlfs.h are all defined
with the string "X64" in the name. Some of these flags are indeed
x86/x64 specific, but others are not. For the ones that are used
in architecture independent Hyper-V driver code, or will be used in
the upcoming support for Hyper-V for ARM64, this patch removes the
"X64" from the name.
This patch changes the flags that are currently known to be
used on multiple architectures. Hyper-V for ARM64 is still a
work-in-progress and the Top Level Functional Spec (TLFS) has not
been separated into x86/x64 and ARM64 areas. So additional flags
may need to be updated later.
This patch only changes symbol names. There are no functional
changes.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
After custom TSC calibration gone, there is no more reason to have
custom platform code for each of Intel MID.
Thus, remove it for good.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Link: https://lkml.kernel.org/r/20180629193113.84425-7-andriy.shevchenko@linux.intel.com
Since the commit
7da7c15613 ("x86, tsc: Add static (MSR) TSC calibration on Intel Atom SoCs")
introduced a common way for all Intel MID chips to get their TSC frequency
via MSRs, there is no need to keep a duplication in each of Intel MID
platform code.
Thus, remove the custom calibration code for good.
Note, there is slight difference in how to get frequency for (reserved?)
values in MSRs, i.e. legacy code enforces some defaults while new code just
uses 0 in that cases.
Suggested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Bin Gao <bin.gao@intel.com>
Link: https://lkml.kernel.org/r/20180629193113.84425-6-andriy.shevchenko@linux.intel.com
These macros are often used by drivers and there exists already a lot of
duplication as ICPU() macro across the drivers.
Provide a generic x86 macro for users.
Note, as Ingo Molnar pointed out this has a hidden issue when a driver
needs to preserve const qualifier. Though, it would be addressed
separately at some point.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Link: https://lkml.kernel.org/r/20180629193113.84425-2-andriy.shevchenko@linux.intel.com
In architecture independent code for manipulating Hyper-V synthetic timers
and synthetic interrupts, pass in an ordinal number identifying the timer
or interrupt, rather than an actual MSR register address. Then in
x86/x64 specific code, map the ordinal number to the appropriate MSR.
This change facilitates the introduction of an ARM64 version of Hyper-V,
which uses the same synthetic timers and interrupts, but a different
mechanism for accessing them.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Implement 'Fast' hypercall with two 64-bit input parameter. This is
going to be used for HvCallSendSyntheticClusterIpi hypercall.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Tianyu Lan <Tianyu.Lan@microsoft.com>
Cc: "Michael Kelley (EOSG)" <Michael.H.Kelley@microsoft.com>
Link: https://lkml.kernel.org/r/20180622170625.30688-2-vkuznets@redhat.com
Dave Hansen reported, that it's outright dangerous to keep SMT siblings
disabled completely so they are stuck in the BIOS and wait for SIPI.
The reason is that Machine Check Exceptions are broadcasted to siblings and
the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
reboots the machine. The MCE chapter in the SDM contains the following
blurb:
Because the logical processors within a physical package are tightly
coupled with respect to shared hardware resources, both logical
processors are notified of machine check errors that occur within a
given physical processor. If machine-check exceptions are enabled when
a fatal error is reported, all the logical processors within a physical
package are dispatched to the machine-check exception handler. If
machine-check exceptions are disabled, the logical processors enter the
shutdown state and assert the IERR# signal. When enabling machine-check
exceptions, the MCE flag in control register CR4 should be set for each
logical processor.
Reverting the commit which ignores siblings at enumeration time solves only
half of the problem. The core cpuhotplug logic needs to be adjusted as
well.
This thoughtful engineered mechanism also turns the boot process on all
Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU
before the secondary CPUs are brought up. Depending on the number of
physical cores the window in which this situation can happen is smaller or
larger. On a HSW-EX it's about 750ms:
MCE is enabled on the boot CPU:
[ 0.244017] mce: CPU supports 22 MCE banks
The corresponding sibling #72 boots:
[ 1.008005] .... node #0, CPUs: #72
That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
between these two points the machine is going to shutdown. At least it's a
known safe state.
It's obvious that the early boot can be hit by an MCE as well and then runs
into the same situation because MCEs are not yet enabled on the boot CPU.
But after enabling them on the boot CPU, it does not make any sense to
prevent the kernel from recovering.
Adjust the nosmt kernel parameter documentation as well.
Reverts: 2207def700 ("x86/apic: Ignore secondary threads if nosmt=force")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>
Move early dump functionality into common code so that it is available for
all architectures. No need to carry arch-specific reads around as the read
hooks are already initialized by the time pci_setup_device() is getting
called during scan.
Tested-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Jan has noticed that pte_pfn and co. resp. pfn_pte are incorrect for
CONFIG_PAE because phys_addr_t is wider than unsigned long and so the
pte_val reps. shift left would get truncated. Fix this up by using proper
types.
Fixes: 6b28baca9b ("x86/speculation/l1tf: Protect PROT_NONE PTEs against speculation")
Reported-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
The PAE 3-level paging code currently doesn't mitigate L1TF by flipping the
offset bits, and uses the high PTE word, thus bits 32-36 for type, 37-63 for
offset. The lower word is zeroed, thus systems with less than 4GB memory are
safe. With 4GB to 128GB the swap type selects the memory locations vulnerable
to L1TF; with even more memory, also the swap offfset influences the address.
This might be a problem with 32bit PAE guests running on large 64bit hosts.
By continuing to keep the whole swap entry in either high or low 32bit word of
PTE we would limit the swap size too much. Thus this patch uses the whole PAE
PTE with the same layout as the 64bit version does. The macros just become a
bit tricky since they assume the arch-dependent swp_entry_t to be 32bit.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Michal Hocko <mhocko@suse.com>
This reverts the following commits:
1ea66554d3 ("x86/mm: Mark p4d_offset() __always_inline")
046c0dbec0 ("x86: Mark native_set_p4d() as __always_inline")
p4d_offset(), native_set_p4d() and native_p4d_clear() were marked
__always_inline in attempt to move __pgtable_l5_enabled into __initdata
section.
It was required as KASAN initialization code is a user of
USE_EARLY_PGTABLE_L5, so all pgtable_l5_enabled() translated to
__pgtable_l5_enabled there. This includes pgtable_l5_enabled() called
from inline p4d helpers.
If compiler would decided to not inline these p4d helpers, but leave
them standalone, we end up with section mismatch.
We don't need __always_inline here anymore. __pgtable_l5_enabled moved
back to be __ro_after_init. See the following commit:
51be133515 ("Revert "x86/mm: Mark __pgtable_l5_enabled __initdata"")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180626100341.49910-1-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the P4D page table layer is folded at runtime, the p4d_free()
should do nothing, the same as in <asm-generic/pgtable-nop4d.h>.
It seems this bug should cause double-free in efi_call_phys_epilog(),
but I don't know how to trigger that code path, so I can't confirm that
by testing.
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org # 4.17
Fixes: 98219dda2a ("x86/mm: Fold p4d page table layer at runtime")
Link: http://lkml.kernel.org/r/20180625102427.15015-1-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All architectures have implemented it, we can now remove the poor weak
version.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-11-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate to the new API in order to remove arch_validate_hwbkpt_settings()
that clumsily mixes up architecture validation and commit.
Original-patch-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-4-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We can't pass the breakpoint directly on arch_check_bp_in_kernelspace()
anymore because its architecture internal datas (struct arch_hw_breakpoint)
are not yet filled by the time we call the function, and most
implementation need this backend to be up to date. So arrange the
function to take the probing struct instead.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Chris Zankel <chris@zankel.net>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Joel Fernandes <joel.opensrc@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1529981939-8231-3-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 pti fixes from Thomas Gleixner:
"Two small updates for the speculative distractions:
- Make it more clear to the compiler that array_index_mask_nospec()
is not subject for optimizations. It's not perfect, but ...
- Don't report XEN PV guests as vulnerable because their mitigation
state depends on the hypervisor. Report unknown and refer to the
hypervisor requirement"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/spectre_v1: Disable compiler optimizations over array_index_mask_nospec()
x86/pti: Don't report XenPV as vulnerable
This patch extends the checks done prior to a nested VM entry.
Specifically, it extends the check_vmentry_prereqs function with checks
for fields relevant to the VM-entry event injection information, as
described in the Intel SDM, volume 3.
This patch is motivated by a syzkaller bug, where a bad VM-entry
interruption information field is generated in the VMCS02, which causes
the nested VM launch to fail. Then, KVM fails to resume L1.
While KVM should be improved to correctly resume L1 execution after a
failed nested launch, this change is justified because the existing code
to resume L1 is flaky/ad-hoc and the test coverage for resuming L1 is
sparse.
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Marc Orr <marcorr@google.com>
[Removed comment whose parts were describing previous revisions and the
rest was obvious from function/variable naming. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Debloat <linux/refcount.h>'s dependencies:
- <linux/kernel.h> is not needed, but <linux/compiler.h> is.
- <linux/mutex.h> is not needed, only a forward declaration of "struct mutex".
- <linux/spinlock.h> is not needed, <linux/spinlock_types.h> is enough.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: https://lkml.kernel.org/lkml/20180331220036.GA7676@avx2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
336996-Speculative-Execution-Side-Channel-Mitigations.pdf defines a new MSR
(IA32_FLUSH_CMD) which is detected by CPUID.7.EDX[28]=1 bit being set.
This new MSR "gives software a way to invalidate structures with finer
granularity than other architectual methods like WBINVD."
A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199511
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The existing UNWIND_HINT_EMPTY annotations happen to be good indicators
of where entry code calls into C code for the first time. So also use
them to mark the end of the stack for the ORC unwinder.
Use that information to set unwind->error if the ORC unwinder doesn't
unwind all the way to the end. This will be needed for enabling
HAVE_RELIABLE_STACKTRACE for the ORC unwinder so we can use it with the
livepatch consistency model.
Thanks to Jiri Slaby for teaching the ORCs about the unwind hints.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/lkml/20180518064713.26440-5-jslaby@suse.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mark Rutland noticed that GCC optimization passes have the potential to elide
necessary invocations of the array_index_mask_nospec() instruction sequence,
so mark the asm() volatile.
Mark explains:
"The volatile will inhibit *some* cases where the compiler could lift the
array_index_nospec() call out of a branch, e.g. where there are multiple
invocations of array_index_nospec() with the same arguments:
if (idx < foo) {
idx1 = array_idx_nospec(idx, foo)
do_something(idx1);
}
< some other code >
if (idx < foo) {
idx2 = array_idx_nospec(idx, foo);
do_something_else(idx2);
}
... since the compiler can determine that the two invocations yield the same
result, and reuse the first result (likely the same register as idx was in
originally) for the second branch, effectively re-writing the above as:
if (idx < foo) {
idx = array_idx_nospec(idx, foo);
do_something(idx);
}
< some other code >
if (idx < foo) {
do_something_else(idx);
}
... if we don't take the first branch, then speculatively take the second, we
lose the nospec protection.
There's more info on volatile asm in the GCC docs:
https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html#Volatile
"
Reported-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: babdde2698 ("x86: Implement array_index_mask_nospec")
Link: https://lkml.kernel.org/lkml/152838798950.14521.4893346294059739135.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use CC_SET(z)/CC_OUT(z) instead of explicit SETZ instruction.
Using these two defines, the compiler that supports generation of
condition code outputs from inline assembly flags generates e.g.:
cmpxchg8b %fs:(%esi)
jne 172255 <__kmalloc+0x65>
instead of:
cmpxchg8b %fs:(%esi)
sete %al
test %al,%al
je 172255 <__kmalloc+0x65>
Note that older compilers now generate:
cmpxchg8b %fs:(%esi)
sete %cl
test %cl,%cl
je 173a85 <__kmalloc+0x65>
since we have to mark that cmpxchg8b instruction outputs to %eax
register and this way clobbers the value in the register.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180605163910.13015-1-ubizjak@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The conditional inc/dec ops differ for atomic_t and atomic64_t:
- atomic_inc_unless_positive() is optional for atomic_t, and doesn't exist for atomic64_t.
- atomic_dec_unless_negative() is optional for atomic_t, and doesn't exist for atomic64_t.
- atomic_dec_if_positive is optional for atomic_t, and is mandatory for atomic64_t.
Let's make these consistently optional for both. At the same time, let's
clean up the existing fallbacks to use atomic_try_cmpxchg().
The instrumented atomics are updated accordingly.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-18-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Many of the inc/dec ops are mandatory, but for most architectures inc/dec are
simply trivial wrappers around their corresponding add/sub ops.
Let's make all the inc/dec ops optional, so that we can get rid of these
boilerplate wrappers.
The instrumented atomics are updated accordingly.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-17-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some of the atomics return the result of a test applied after the atomic
operation, and almost all architectures implement these as trivial
wrappers around the underlying atomic. Specifically:
* <atomic>_inc_and_test(v) is (<atomic>_inc_return(v) == 0)
* <atomic>_dec_and_test(v) is (<atomic>_dec_return(v) == 0)
* <atomic>_sub_and_test(i, v) is (<atomic>_sub_return(i, v) == 0)
* <atomic>_add_negative(i, v) is (<atomic>_add_return(i, v) < 0)
Rather than have these definitions duplicated in all architectures, with
minor inconsistencies in formatting and documentation, let's make these
operations optional, with default fallbacks as above. Implementations
must now provide a preprocessor symbol.
The instrumented atomics are updated accordingly.
Both x86 and m68k have custom implementations, which are left as-is,
given preprocessor symbols to avoid being overridden.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-16-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Architectures with atomic64_fetch_add_unless() provide a preprocessor
symbol if they do so, and all other architectures have trivial C
implementations of atomic64_add_unless() which are near-identical.
Let's unify the trivial definitions of atomic64_fetch_add_unless() in
<linux/atomic.h>, so that we always have both
atomic64_fetch_add_unless() and atomic64_add_unless() with less
boilerplate code.
This means that atomic64_add_unless() is always implemented in core
code, and the instrumented atomics are updated accordingly.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-15-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Several architectures these have a near-identical implementation based
on atomic_read() and atomic_cmpxchg() which we can instead define in
<linux/atomic.h>, so let's do so, using something close to the existing
x86 implementation with try_cmpxchg().
Where an architecture provides its own atomic_fetch_add_unless(), it
must define a preprocessor symbol for it. The instrumented atomics are
updated accordingly.
Note that arch/arc's existing atomic_fetch_add_unless() had redundant
barriers, as these are already present in its atomic_cmpxchg()
implementation.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vineet Gupta <vgupta@synopsys.com>
Link: https://lore.kernel.org/lkml/20180621121321.4761-7-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We define a trivial fallback for atomic_inc_not_zero(), but don't do
the same for atomic64_inc_not_zero(), leading most architectures to
define the same boilerplate.
Let's add a fallback in <linux/atomic.h>, and remove the redundant
implementations. Note that atomic64_add_unless() is always defined in
<linux/atomic.h>, and promotes its arguments to the requisite types, so
we need not do this explicitly.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-6-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While __atomic_add_unless() was originally intended as a building-block
for atomic_add_unless(), it's now used in a number of places around the
kernel. It's the only common atomic operation named __atomic*(), rather
than atomic_*(), and for consistency it would be better named
atomic_fetch_add_unless().
This lack of consistency is slightly confusing, and gets in the way of
scripting atomics. Given that, let's clean things up and promote it to
an official part of the atomics API, in the form of
atomic_fetch_add_unless().
This patch converts definitions and invocations over to the new name,
including the instrumented version, using the following script:
----
git grep -w __atomic_add_unless | while read line; do
sed -i '{s/\<__atomic_add_unless\>/atomic_fetch_add_unless/}' "${line%%:*}";
done
git grep -w __arch_atomic_add_unless | while read line; do
sed -i '{s/\<__arch_atomic_add_unless\>/arch_atomic_fetch_add_unless/}' "${line%%:*}";
done
----
Note that we do not have atomic{64,_long}_fetch_add_unless(), which will
be introduced by later patches.
There should be no functional change as a result of this patch.
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Reviewed-by: Will Deacon <will.deacon@arm.com>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Palmer Dabbelt <palmer@sifive.com>
Cc: Boqun Feng <boqun.feng@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/lkml/20180621121321.4761-2-mark.rutland@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
nosmt on the kernel command line merely prevents the onlining of the
secondary SMT siblings.
nosmt=force makes the APIC detection code ignore the secondary SMT siblings
completely, so they even do not show up as possible CPUs. That reduces the
amount of memory allocations for per cpu variables and saves other
resources from being allocated too large.
This is not fully equivalent to disabling SMT in the BIOS because the low
level SMT enabling in the BIOS can result in partitioning of resources
between the siblings, which is not undone by just ignoring them. Some CPUs
can use the full resources when their sibling is not onlined, but this is
depending on the CPU family and model and it's not well documented whether
this applies to all partitioned resources. That means depending on the
workload disabling SMT in the BIOS might result in better performance.
Linus analysis of the Intel manual:
The intel optimization manual is not very clear on what the partitioning
rules are.
I find:
"In general, the buffers for staging instructions between major pipe
stages are partitioned. These buffers include µop queues after the
execution trace cache, the queues after the register rename stage, the
reorder buffer which stages instructions for retirement, and the load
and store buffers.
In the case of load and store buffers, partitioning also provided an
easier implementation to maintain memory ordering for each logical
processor and detect memory ordering violations"
but some of that partitioning may be relaxed if the HT thread is "not
active":
"In Intel microarchitecture code name Sandy Bridge, the micro-op queue
is statically partitioned to provide 28 entries for each logical
processor, irrespective of software executing in single thread or
multiple threads. If one logical processor is not active in Intel
microarchitecture code name Ivy Bridge, then a single thread executing
on that processor core can use the 56 entries in the micro-op queue"
but I do not know what "not active" means, and how dynamic it is. Some of
that partitioning may be entirely static and depend on the early BIOS
disabling of HT, and even if we park the cores, the resources will just be
wasted.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Provide information whether SMT is supoorted by the CPUs. Preparatory patch
for SMT control mechanism.
Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Ingo Molnar <mingo@kernel.org>
If the CPU is supporting SMT then the primary thread can be found by
checking the lower APIC ID bits for zero. smp_num_siblings is used to build
the mask for the APIC ID bits which need to be taken into account.
This uses the MPTABLE or ACPI/MADT supplied APIC ID, which can be different
than the initial APIC ID in CPUID. But according to AMD the lower bits have
to be consistent. Intel gave a tentative confirmation as well.
Preparatory patch to support disabling SMT at boot/runtime.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Don't call the ->break_handler() and remove break_handler
related code from x86 since that was only used by jprobe
which got removed.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ananth N Mavinakayanahalli <ananth@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: linux-arch@vger.kernel.org
Link: https://lore.kernel.org/lkml/152942465549.15209.15889693025972771135.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For L1TF PROT_NONE mappings are protected by inverting the PFN in the page
table entry. This sets the high bits in the CPU's address space, thus
making sure to point to not point an unmapped entry to valid cached memory.
Some server system BIOSes put the MMIO mappings high up in the physical
address space. If such an high mapping was mapped to unprivileged users
they could attack low memory by setting such a mapping to PROT_NONE. This
could happen through a special device driver which is not access
protected. Normal /dev/mem is of course access protected.
To avoid this forbid PROT_NONE mappings or mprotect for high MMIO mappings.
Valid page mappings are allowed because the system is then unsafe anyways.
It's not expected that users commonly use PROT_NONE on MMIO. But to
minimize any impact this is only enforced if the mapping actually refers to
a high MMIO address (defined as the MAX_PA-1 bit being set), and also skip
the check for root.
For mmaps this is straight forward and can be handled in vm_insert_pfn and
in remap_pfn_range().
For mprotect it's a bit trickier. At the point where the actual PTEs are
accessed a lot of state has been changed and it would be difficult to undo
on an error. Since this is a uncommon case use a separate early page talk
walk pass for MMIO PROT_NONE mappings that checks for this condition
early. For non MMIO and non PROT_NONE there are no changes.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
L1TF core kernel workarounds are cheap and normally always enabled, However
they still should be reported in sysfs if the system is vulnerable or
mitigated. Add the necessary CPU feature/bug bits.
- Extend the existing checks for Meltdowns to determine if the system is
vulnerable. All CPUs which are not vulnerable to Meltdown are also not
vulnerable to L1TF
- Check for 32bit non PAE and emit a warning as there is no practical way
for mitigation due to the limited physical address bits
- If the system has more than MAX_PA/2 physical memory the invert page
workarounds don't protect the system against the L1TF attack anymore,
because an inverted physical address will also point to valid
memory. Print a warning in this case and report that the system is
vulnerable.
Add a function which returns the PFN limit for the L1TF mitigation, which
will be used in follow up patches for sanity and range checks.
[ tglx: Renamed the CPU feature bit to L1TF_PTEINV ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
When PTEs are set to PROT_NONE the kernel just clears the Present bit and
preserves the PFN, which creates attack surface for L1TF speculation
speculation attacks.
This is important inside guests, because L1TF speculation bypasses physical
page remapping. While the host has its own migitations preventing leaking
data from other VMs into the guest, this would still risk leaking the wrong
page inside the current guest.
This uses the same technique as Linus' swap entry patch: while an entry is
is in PROTNONE state invert the complete PFN part part of it. This ensures
that the the highest bit will point to non existing memory.
The invert is done by pte/pmd_modify and pfn/pmd/pud_pte for PROTNONE and
pte/pmd/pud_pfn undo it.
This assume that no code path touches the PFN part of a PTE directly
without using these primitives.
This doesn't handle the case that MMIO is on the top of the CPU physical
memory. If such an MMIO region was exposed by an unpriviledged driver for
mmap it would be possible to attack some real memory. However this
situation is all rather unlikely.
For 32bit non PAE the inversion is not done because there are really not
enough bits to protect anything.
Q: Why does the guest need to be protected when the HyperVisor already has
L1TF mitigations?
A: Here's an example:
Physical pages 1 2 get mapped into a guest as
GPA 1 -> PA 2
GPA 2 -> PA 1
through EPT.
The L1TF speculation ignores the EPT remapping.
Now the guest kernel maps GPA 1 to process A and GPA 2 to process B, and
they belong to different users and should be isolated.
A sets the GPA 1 PA 2 PTE to PROT_NONE to bypass the EPT remapping and
gets read access to the underlying physical page. Which in this case
points to PA 2, so it can read process B's data, if it happened to be in
L1, so isolation inside the guest is broken.
There's nothing the hypervisor can do about this. This mitigation has to
be done in the guest itself.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
With L1 terminal fault the CPU speculates into unmapped PTEs, and resulting
side effects allow to read the memory the PTE is pointing too, if its
values are still in the L1 cache.
For swapped out pages Linux uses unmapped PTEs and stores a swap entry into
them.
To protect against L1TF it must be ensured that the swap entry is not
pointing to valid memory, which requires setting higher bits (between bit
36 and bit 45) that are inside the CPUs physical address space, but outside
any real memory.
To do this invert the offset to make sure the higher bits are always set,
as long as the swap file is not too big.
Note there is no workaround for 32bit !PAE, or on systems which have more
than MAX_PA/2 worth of memory. The later case is very unlikely to happen on
real systems.
[AK: updated description and minor tweaks by. Split out from the original
patch ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
If pages are swapped out, the swap entry is stored in the corresponding
PTE, which has the Present bit cleared. CPUs vulnerable to L1TF speculate
on PTE entries which have the present bit set and would treat the swap
entry as phsyical address (PFN). To mitigate that the upper bits of the PTE
must be set so the PTE points to non existent memory.
The swap entry stores the type and the offset of a swapped out page in the
PTE. type is stored in bit 9-13 and offset in bit 14-63. The hardware
ignores the bits beyond the phsyical address space limit, so to make the
mitigation effective its required to start 'offset' at the lowest possible
bit so that even large swap offsets do not reach into the physical address
space limit bits.
Move offset to bit 9-58 and type to bit 59-63 which are the bits that
hardware generally doesn't care about.
That, in turn, means that if you on desktop chip with only 40 bits of
physical addressing, now that the offset starts at bit 9, there needs to be
30 bits of offset actually *in use* until bit 39 ends up being set, which
means when inverted it will again point into existing memory.
So that's 4 terabyte of swap space (because the offset is counted in pages,
so 30 bits of offset is 42 bits of actual coverage). With bigger physical
addressing, that obviously grows further, until the limit of the offset is
hit (at 50 bits of offset - 62 bits of actual swap file coverage).
This is a preparatory change for the actual swap entry inversion to protect
against L1TF.
[ AK: Updated description and minor tweaks. Split into two parts ]
[ tglx: Massaged changelog ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Andi Kleen <ak@linux.intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Dave Hansen <dave.hansen@intel.com>
L1 Terminal Fault (L1TF) is a speculation related vulnerability. The CPU
speculates on PTE entries which do not have the PRESENT bit set, if the
content of the resulting physical address is available in the L1D cache.
The OS side mitigation makes sure that a !PRESENT PTE entry points to a
physical address outside the actually existing and cachable memory
space. This is achieved by inverting the upper bits of the PTE. Due to the
address space limitations this only works for 64bit and 32bit PAE kernels,
but not for 32bit non PAE.
This mitigation applies to both host and guest kernels, but in case of a
64bit host (hypervisor) and a 32bit PAE guest, inverting the upper bits of
the PAE address space (44bit) is not enough if the host has more than 43
bits of populated memory address space, because the speculation treats the
PTE content as a physical host address bypassing EPT.
The host (hypervisor) protects itself against the guest by flushing L1D as
needed, but pages inside the guest are not protected against attacks from
other processes inside the same guest.
For the guest the inverted PTE mask has to match the host to provide the
full protection for all pages the host could possibly map into the
guest. The hosts populated address space is not known to the guest, so the
mask must cover the possible maximal host address space, i.e. 52 bit.
On 32bit PAE the maximum PTE mask is currently set to 44 bit because that
is the limit imposed by 32bit unsigned long PFNs in the VMs. This limits
the mask to be below what the host could possible use for physical pages.
The L1TF PROT_NONE protection code uses the PTE masks to determine which
bits to invert to make sure the higher bits are set for unmapped entries to
prevent L1TF speculation attacks against EPT inside guests.
In order to invert all bits that could be used by the host, increase
__PHYSICAL_PAGE_SHIFT to 52 to match 64bit.
The real limit for a 32bit PAE kernel is still 44 bits because all Linux
PTEs are created from unsigned long PFNs, so they cannot be higher than 44
bits on a 32bit kernel. So these extra PFN bits should be never set. The
only users of this macro are using it to look at PTEs, so it's safe.
[ tglx: Massaged changelog ]
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
The changes to automatically test for working stack protector compiler
support in the Kconfig files removed the special STACKPROTECTOR_AUTO
option that picked the strongest stack protector that the compiler
supported.
That was all a nice cleanup - it makes no sense to have the AUTO case
now that the Kconfig phase can just determine the compiler support
directly.
HOWEVER.
It also meant that doing "make oldconfig" would now _disable_ the strong
stackprotector if you had AUTO enabled, because in a legacy config file,
the sane stack protector configuration would look like
CONFIG_HAVE_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_CC_STACKPROTECTOR_AUTO=y
and when you ran this through "make oldconfig" with the Kbuild changes,
it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
used to be disabled (because it was really enabled by AUTO), and would
disable it in the new config, resulting in:
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_STRONG is not set
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
That's dangerously subtle - people could suddenly find themselves with
the weaker stack protector setup without even realizing.
The solution here is to just rename not just the old RECULAR stack
protector option, but also the strong one. This does that by just
removing the CC_ prefix entirely for the user choices, because it really
is not about the compiler support (the compiler support now instead
automatially impacts _visibility_ of the options to users).
This results in "make oldconfig" actually asking the user for their
choice, so that we don't have any silent subtle security model changes.
The end result would generally look like this:
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
CONFIG_STACKPROTECTOR=y
CONFIG_STACKPROTECTOR_STRONG=y
CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
where the "CC_" versions really are about internal compiler
infrastructure, not the user selections.
Acked-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* ARM: lazy context-switching of FPSIMD registers on arm64, "split"
regions for vGIC redistributor
* s390: cleanups for nested, clock handling, crypto, storage keys and
control register bits
* x86: many bugfixes, implement more Hyper-V super powers,
implement lapic_timer_advance_ns even when the LAPIC timer
is emulated using the processor's VMX preemption timer. Two
security-related bugfixes at the top of the branch.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJbH8Z/AAoJEL/70l94x66DF+UIAJeOuTp6LGasT/9uAb2OovaN
+5kGmOPGFwkTcmg8BQHI2fXT4vhxMXWPFcQnyig9eXJVxhuwluXDOH4P9IMay0yw
VDCBsWRdMvZDQad2hn6Z5zR4Jx01XrSaG/KqvXbbDKDCy96mWG7SYAY2m3ZwmeQi
3Pa3O3BTijr7hBYnMhdXGkSn4ZyU8uPaAgIJ8795YKeOJ2JmioGYk6fj6y2WCxA3
ztJymBjTmIoZ/F8bjuVouIyP64xH4q9roAyw4rpu7vnbWGqx1fjPYJoB8yddluWF
JqCPsPzhKDO7mjZJy+lfaxIlzz2BN7tKBNCm88s5GefGXgZwk3ByAq/0GQ2M3rk=
=H5zI
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"Small update for KVM:
ARM:
- lazy context-switching of FPSIMD registers on arm64
- "split" regions for vGIC redistributor
s390:
- cleanups for nested
- clock handling
- crypto
- storage keys
- control register bits
x86:
- many bugfixes
- implement more Hyper-V super powers
- implement lapic_timer_advance_ns even when the LAPIC timer is
emulated using the processor's VMX preemption timer.
- two security-related bugfixes at the top of the branch"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (79 commits)
kvm: fix typo in flag name
kvm: x86: use correct privilege level for sgdt/sidt/fxsave/fxrstor access
KVM: x86: pass kvm_vcpu to kvm_read_guest_virt and kvm_write_guest_virt_system
KVM: x86: introduce linear_{read,write}_system
kvm: nVMX: Enforce cpl=0 for VMX instructions
kvm: nVMX: Add support for "VMWRITE to any supported field"
kvm: nVMX: Restrict VMX capability MSR changes
KVM: VMX: Optimize tscdeadline timer latency
KVM: docs: nVMX: Remove known limitations as they do not exist now
KVM: docs: mmu: KVM support exposing SLAT to guests
kvm: no need to check return value of debugfs_create functions
kvm: Make VM ioctl do valloc for some archs
kvm: Change return type to vm_fault_t
KVM: docs: mmu: Fix link to NPT presentation from KVM Forum 2008
kvm: x86: Amend the KVM_GET_SUPPORTED_CPUID API documentation
KVM: x86: hyperv: declare KVM_CAP_HYPERV_TLBFLUSH capability
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE}_EX implementation
KVM: x86: hyperv: simplistic HVCALL_FLUSH_VIRTUAL_ADDRESS_{LIST,SPACE} implementation
KVM: introduce kvm_make_vcpus_request_mask() API
KVM: x86: hyperv: do rep check for each hypercall separately
...
The functions that were used in the emulation of fxrstor, fxsave, sgdt and
sidt were originally meant for task switching, and as such they did not
check privilege levels. This is very bad when the same functions are used
in the emulation of unprivileged instructions. This is CVE-2018-10853.
The obvious fix is to add a new argument to ops->read_std and ops->write_std,
which decides whether the access is a "system" access or should use the
processor's CPL.
Fixes: 129a72a0d3 ("KVM: x86: Introduce segmented_write_std", 2017-01-12)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull x86 updates and fixes from Thomas Gleixner:
- Fix the (late) fallout from the vector management rework causing
hlist corruption and irq descriptor reference leaks caused by a
missing sanity check.
The straight forward fix triggered another long standing issue to
surface. The pre rework code hid the issue due to being way slower,
but now the chance that user space sees an EBUSY error return when
updating irq affinities is way higher, though quite a bunch of
userspace tools do not handle it properly despite the fact that EBUSY
could be returned for at least 10 years.
It turned out that the EBUSY return can be avoided completely by
utilizing the existing delayed affinity update mechanism for irq
remapped scenarios as well. That's a bit more error handling in the
kernel, but avoids fruitless fingerpointing discussions with tool
developers.
- Decouple PHYSICAL_MASK from AMD SME as its going to be required for
the upcoming Intel memory encryption support as well.
- Handle legacy device ACPI detection properly for newer platforms
- Fix the wrong argument ordering in the vector allocation tracepoint
- Simplify the IDT setup code for the APIC=n case
- Use the proper string helpers in the MTRR code
- Remove a stale unused VDSO source file
- Convert the microcode update lock to a raw spinlock as its used in
atomic context.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/intel_rdt: Enable CMT and MBM on new Skylake stepping
x86/apic/vector: Print APIC control bits in debugfs
genirq/affinity: Defer affinity setting if irq chip is busy
x86/platform/uv: Use apic_ack_irq()
x86/ioapic: Use apic_ack_irq()
irq_remapping: Use apic_ack_irq()
x86/apic: Provide apic_ack_irq()
genirq/migration: Avoid out of line call if pending is not set
genirq/generic_pending: Do not lose pending affinity update
x86/apic/vector: Prevent hlist corruption and leaks
x86/vector: Fix the args of vector_alloc tracepoint
x86/idt: Simplify the idt_setup_apic_and_irq_gates()
x86/platform/uv: Remove extra parentheses
x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME
x86: Mark native_set_p4d() as __always_inline
x86/microcode: Make the late update update_lock a raw lock for RT
x86/mtrr: Convert to use strncpy_from_user() helper
x86/mtrr: Convert to use match_string() helper
x86/vdso: Remove unused file
x86/i8237: Register device based on FADT legacy boot flag
Pull x86 pti updates from Thomas Gleixner:
"Three small commits updating the SSB mitigation to take the updated
AMD mitigation variants into account"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Switch the selection of mitigation from CPU vendor to CPU features
x86/bugs: Add AMD's SPEC_CTRL MSR usage
x86/bugs: Add AMD's variant of SSB_NO
* DAX broke a fundamental assumption of truncate of file mapped pages.
The truncate path assumed that it is safe to disconnect a pinned page
from a file and let the filesystem reclaim the physical block. With DAX
the page is equivalent to the filesystem block. Introduce
dax_layout_busy_page() to enable filesystems to wait for pinned DAX
pages to be released. Without this wait a filesystem could allocate
blocks under active device-DMA to a new file.
* DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
* Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they are not
necessary. The table indicates whether CPU caches are power-fail
protected. Clarify that a deep flush is always performed on
REQ_{FUA,PREFLUSH} requests.
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJbGxI7AAoJEB7SkWpmfYgCDjsP/2Lcibu9Kf4tKIzuInsle6iE
6qP29qlkpHVTpDKbhvIxTYTYL9sMU0DNUrpPCJR/EYdeyztLWDFC5EAT1wF240vf
maV37s/uP331jSC/2VJnKWzBs2ztQxmKLEIQCxh6aT0qs9cbaOvJgB/WlVu+qtsl
aGJFLmb6vdQacp31noU5plKrMgMA1pADyF5qx9I9K2HwowHE7T368ZEFS/3S//c3
LXmpx/Nfq52sGu/qbRbu6B1CTJhIGhmarObyQnvBYoKntK1Ov4e8DS95wD3EhNDe
FuRkOCUKhjl6cFy7QVWh1ct1bFm84ny+b4/AtbpOmv9l/+0mveJ7e+5mu8HQTifT
wYiEe2xzXJ+OG/xntv8SvlZKMpjP3BqI0jYsTutsjT4oHrciiXdXM186cyS+BiGp
KtFmWyncQJgfiTq6+Hj5XpP9BapNS+OYdYgUagw9ZwzdzptuGFYUMSVOBrYrn6c/
fwqtxjubykJoW0P3pkIoT91arFSea7nxOKnGwft06imQ7TwR4ARsI308feQ9itJq
2P2e7/20nYMsw2aRaUDDA70Yu+Lagn1m8WL87IybUGeUDLb1BAkjphAlWa6COJ+u
PhvAD2tvyM9m0c7O5Mytvz7iWKG6SVgatoAyOPkaeplQK8khZ+wEpuK58sO6C1w8
4GBvt9ri9i/Ww/A+ppWs
=4bfw
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm updates from Dan Williams:
"This adds a user for the new 'bytes-remaining' updates to
memcpy_mcsafe() that you already received through Ingo via the
x86-dax- for-linus pull.
Not included here, but still targeting this cycle, is support for
handling memory media errors (poison) consumed via userspace dax
mappings.
Summary:
- DAX broke a fundamental assumption of truncate of file mapped
pages. The truncate path assumed that it is safe to disconnect a
pinned page from a file and let the filesystem reclaim the physical
block. With DAX the page is equivalent to the filesystem block.
Introduce dax_layout_busy_page() to enable filesystems to wait for
pinned DAX pages to be released. Without this wait a filesystem
could allocate blocks under active device-DMA to a new file.
- DAX arranges for the block layer to be bypassed and uses
dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
However, the memcpy_mcsafe() facility is available through the pmem
block driver. In order to safely handle media errors, via the DAX
block-layer bypass, introduce copy_to_iter_mcsafe().
- Fix cache management policy relative to the ACPI NFIT Platform
Capabilities Structure to properly elide cache flushes when they
are not necessary. The table indicates whether CPU caches are
power-fail protected. Clarify that a deep flush is always performed
on REQ_{FUA,PREFLUSH} requests"
* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
dax: Use dax_write_cache* helpers
libnvdimm, pmem: Do not flush power-fail protected CPU caches
libnvdimm, pmem: Unconditionally deep flush on *sync
libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
acpi, nfit: Remove ecc_unit_size
dax: dax_insert_mapping_entry always succeeds
libnvdimm, e820: Register all pmem resources
libnvdimm: Debug probe times
linvdimm, pmem: Preserve read-only setting for pmem devices
x86, nfit_test: Add unit test for memcpy_mcsafe()
pmem: Switch to copy_to_iter_mcsafe()
dax: Report bytes remaining in dax_iomap_actor()
dax: Introduce a ->copy_to_iter dax operation
uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
xfs, dax: introduce xfs_break_dax_layouts()
xfs: prepare xfs_break_layouts() for another layout type
xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
mm, fs, dax: handle layout changes to pinned dax mappings
mm: fix __gup_device_huge vs unmap
mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
...
Currently the PTE special supports is turned on in per architecture
header files. Most of the time, it is defined in
arch/*/include/asm/pgtable.h depending or not on some other per
architecture static definition.
This patch introduce a new configuration variable to manage this
directly in the Kconfig files. It would later replace
__HAVE_ARCH_PTE_SPECIAL.
Here notes for some architecture where the definition of
__HAVE_ARCH_PTE_SPECIAL is not obvious:
arm
__HAVE_ARCH_PTE_SPECIAL which is currently defined in
arch/arm/include/asm/pgtable-3level.h which is included by
arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.
powerpc
__HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
- arch/powerpc/include/asm/book3s/64/pgtable.h
- arch/powerpc/include/asm/pte-common.h
The first one is included if (PPC_BOOK3S & PPC64) while the second is
included in all the other cases.
So select ARCH_HAS_PTE_SPECIAL all the time.
sparc:
__HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
defined(__arch64__) which are defined through the compiler in
sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
So select ARCH_HAS_PTE_SPECIAL if SPARC64
There is no functional change introduced by this patch.
Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Suggested-by: Jerome Glisse <jglisse@redhat.com>
Reviewed-by: Jerome Glisse <jglisse@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Palmer Dabbelt <palmer@sifive.com>
Cc: Albert Ou <albert@sifive.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Christophe LEROY <christophe.leroy@c-s.fr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support live
patching again.
- Add support for patching barrier_nospec in copy_from_user() and syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu Malaterre.
- Several series optimising various parts of the 32-bit code from Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K" ("GEFanuc,C2K"),
which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by Steve, and
a fix to our powernv cpuidle driver. Then there's a series touching mm, x86 and
fs/proc/task_mmu.c, which cleans up some details around pkey support. It was
ack'ed/reviewed by Ingo & Dave and has been in next for several weeks.
Thanks to:
Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al Viro, Andrew
Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd Bergmann, Balbir Singh,
Cédric Le Goater, Christophe Leroy, Christophe Lombard, Colin Ian King, Dave
Hansen, Fabio Estevam, Finn Thain, Frederic Barrat, Gautham R. Shenoy, Haren
Myneni, Hari Bathini, Ingo Molnar, Jonathan Neuschäfer, Josh Poimboeuf,
Kamalesh Babulal, Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu
Malaterre, Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica Gupta, Ravi
Bangoria, Russell Currey, Sam Bobroff, Samuel Mendoza-Jonas, Segher
Boessenkool, Shilpasri G Bhat, Simon Guo, Souptick Joarder, Stewart Smith,
Thiago Jung Bauermann, Torsten Duwe, Vaibhav Jain, Wei Yongjun, Wolfram Sang,
Yisheng Xie, YueHaibing.
-----BEGIN PGP SIGNATURE-----
iQIwBAABCAAaBQJbGQKBExxtcGVAZWxsZXJtYW4uaWQuYXUACgkQUevqPMjhpYBq
TRAAioK7rz5xYMkxaM3Ng3ybobEeNAwQqOolz98xvmnB9SfDWNuc99vf8cGu0/fQ
zc8AKZ5RcnwipOjyGlxW9oa1ZhVq0xtYnQPiYLEKMdLQmh5D+C7+KpvAd1UElweg
ub40/xDySWfMujfuMSF9JDCWPIXyojt4Xg5nJKIVRrAm/3YMe/+i5Am7NWHuMCEb
aQmZtlYW5Mz81XY0968hjpUO6eKFRmsaM7yFAhGTXx6+oLRpGj1PZB4AwdRIKS2L
Ak7q/VgxtE4W+s3a0GK2s+eXIhGKeFuX9AVnx3nti+8/K1OqrqhDcLMUC/9JpCpv
EvOtO7dxPnZujHjdu4Eai/xNoo4h6zRy7bWqve9LoBM40CP5jljKzu1lwqqb5yO0
jC7/aXhgiSIxxcRJLjoI/TYpZPu40MifrkydmczykdPyPCnMIWEJDcj4KsRL/9Y8
9SSbJzRNC/SgQNTbUYPZFFi6G0QaMmlcbCb628k8QT+Gn3Xkdf/ZtxzqEyoF4Irq
46kFBsiSSK4Bu0rVlcUtJQLgdqytWULO6NKEYnD67laxYcgQd8pGFQ8SjZhRZLgU
q5LA3HIWhoAI4M0wZhOnKXO6JfiQ1UbO8gUJLsWsfF0Fk5KAcdm+4kb4jbI1H4Qk
Vol9WNRZwEllyaiqScZN9RuVVuH0GPOZeEH1dtWK+uWi0lM=
=ZlBf
-----END PGP SIGNATURE-----
Merge tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
"Notable changes:
- Support for split PMD page table lock on 64-bit Book3S (Power8/9).
- Add support for HAVE_RELIABLE_STACKTRACE, so we properly support
live patching again.
- Add support for patching barrier_nospec in copy_from_user() and
syscall entry.
- A couple of fixes for our data breakpoints on Book3S.
- A series from Nick optimising TLB/mm handling with the Radix MMU.
- Numerous small cleanups to squash sparse/gcc warnings from Mathieu
Malaterre.
- Several series optimising various parts of the 32-bit code from
Christophe Leroy.
- Removal of support for two old machines, "SBC834xE" and "C2K"
("GEFanuc,C2K"), which is why the diffstat has so many deletions.
And many other small improvements & fixes.
There's a few out-of-area changes. Some minor ftrace changes OK'ed by
Steve, and a fix to our powernv cpuidle driver. Then there's a series
touching mm, x86 and fs/proc/task_mmu.c, which cleans up some details
around pkey support. It was ack'ed/reviewed by Ingo & Dave and has
been in next for several weeks.
Thanks to: Akshay Adiga, Alastair D'Silva, Alexey Kardashevskiy, Al
Viro, Andrew Donnellan, Aneesh Kumar K.V, Anju T Sudhakar, Arnd
Bergmann, Balbir Singh, Cédric Le Goater, Christophe Leroy, Christophe
Lombard, Colin Ian King, Dave Hansen, Fabio Estevam, Finn Thain,
Frederic Barrat, Gautham R. Shenoy, Haren Myneni, Hari Bathini, Ingo
Molnar, Jonathan Neuschäfer, Josh Poimboeuf, Kamalesh Babulal,
Madhavan Srinivasan, Mahesh Salgaonkar, Mark Greer, Mathieu Malaterre,
Matthew Wilcox, Michael Neuling, Michal Suchanek, Naveen N. Rao,
Nicholas Piggin, Nicolai Stange, Olof Johansson, Paul Gortmaker, Paul
Mackerras, Peter Rosin, Pridhiviraj Paidipeddi, Ram Pai, Rashmica
Gupta, Ravi Bangoria, Russell Currey, Sam Bobroff, Samuel
Mendoza-Jonas, Segher Boessenkool, Shilpasri G Bhat, Simon Guo,
Souptick Joarder, Stewart Smith, Thiago Jung Bauermann, Torsten Duwe,
Vaibhav Jain, Wei Yongjun, Wolfram Sang, Yisheng Xie, YueHaibing"
* tag 'powerpc-4.18-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (251 commits)
powerpc/64s/radix: Fix missing ptesync in flush_cache_vmap
cpuidle: powernv: Fix promotion from snooze if next state disabled
powerpc: fix build failure by disabling attribute-alias warning in pci_32
ocxl: Fix missing unlock on error in afu_ioctl_enable_p9_wait()
powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"
powerpc: fix spelling mistake: "Usupported" -> "Unsupported"
powerpc/pkeys: Detach execute_only key on !PROT_EXEC
powerpc/powernv: copy/paste - Mask SO bit in CR
powerpc: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove core support for Marvell mv64x60 hostbridges
powerpc/boot: Remove support for Marvell mv64x60 i2c controller
powerpc/boot: Remove support for Marvell MPSC serial controller
powerpc/embedded6xx: Remove C2K board support
powerpc/lib: optimise PPC32 memcmp
powerpc/lib: optimise 32 bits __clear_user()
powerpc/time: inline arch_vtime_task_switch()
powerpc/Makefile: set -mcpu=860 flag for the 8xx
powerpc: Implement csum_ipv6_magic in assembly
powerpc/32: Optimise __csum_partial()
powerpc/lib: Adjust .balign inside string functions for PPC32
...
Pull networking updates from David Miller:
1) Add Maglev hashing scheduler to IPVS, from Inju Song.
2) Lots of new TC subsystem tests from Roman Mashak.
3) Add TCP zero copy receive and fix delayed acks and autotuning with
SO_RCVLOWAT, from Eric Dumazet.
4) Add XDP_REDIRECT support to mlx5 driver, from Jesper Dangaard
Brouer.
5) Add ttl inherit support to vxlan, from Hangbin Liu.
6) Properly separate ipv6 routes into their logically independant
components. fib6_info for the routing table, and fib6_nh for sets of
nexthops, which thus can be shared. From David Ahern.
7) Add bpf_xdp_adjust_tail helper, which can be used to generate ICMP
messages from XDP programs. From Nikita V. Shirokov.
8) Lots of long overdue cleanups to the r8169 driver, from Heiner
Kallweit.
9) Add BTF ("BPF Type Format"), from Martin KaFai Lau.
10) Add traffic condition monitoring to iwlwifi, from Luca Coelho.
11) Plumb extack down into fib_rules, from Roopa Prabhu.
12) Add Flower classifier offload support to igb, from Vinicius Costa
Gomes.
13) Add UDP GSO support, from Willem de Bruijn.
14) Add documentation for eBPF helpers, from Quentin Monnet.
15) Add TLS tx offload to mlx5, from Ilya Lesokhin.
16) Allow applications to be given the number of bytes available to read
on a socket via a control message returned from recvmsg(), from
Soheil Hassas Yeganeh.
17) Add x86_32 eBPF JIT compiler, from Wang YanQing.
18) Add AF_XDP sockets, with zerocopy support infrastructure as well.
From Björn Töpel.
19) Remove indirect load support from all of the BPF JITs and handle
these operations in the verifier by translating them into native BPF
instead. From Daniel Borkmann.
20) Add GRO support to ipv6 gre tunnels, from Eran Ben Elisha.
21) Allow XDP programs to do lookups in the main kernel routing tables
for forwarding. From David Ahern.
22) Allow drivers to store hardware state into an ELF section of kernel
dump vmcore files, and use it in cxgb4. From Rahul Lakkireddy.
23) Various RACK and loss detection improvements in TCP, from Yuchung
Cheng.
24) Add TCP SACK compression, from Eric Dumazet.
25) Add User Mode Helper support and basic bpfilter infrastructure, from
Alexei Starovoitov.
26) Support ports and protocol values in RTM_GETROUTE, from Roopa
Prabhu.
27) Support bulking in ->ndo_xdp_xmit() API, from Jesper Dangaard
Brouer.
28) Add lots of forwarding selftests, from Petr Machata.
29) Add generic network device failover driver, from Sridhar Samudrala.
* ra.kernel.org:/pub/scm/linux/kernel/git/davem/net-next: (1959 commits)
strparser: Add __strp_unpause and use it in ktls.
rxrpc: Fix terminal retransmission connection ID to include the channel
net: hns3: Optimize PF CMDQ interrupt switching process
net: hns3: Fix for VF mailbox receiving unknown message
net: hns3: Fix for VF mailbox cannot receiving PF response
bnx2x: use the right constant
Revert "net: sched: cls: Fix offloading when ingress dev is vxlan"
net: dsa: b53: Fix for brcm tag issue in Cygnus SoC
enic: fix UDP rss bits
netdev-FAQ: clarify DaveM's position for stable backports
rtnetlink: validate attributes in do_setlink()
mlxsw: Add extack messages for port_{un, }split failures
netdevsim: Add extack error message for devlink reload
devlink: Add extack to reload and port_{un, }split operations
net: metrics: add proper netlink validation
ipmr: fix error path when ipmr_new_table fails
ip6mr: only set ip6mr_table from setsockopt when ip6mr_new_table succeeds
net: hns3: remove unused hclgevf_cfg_func_mta_filter
netfilter: provide udp*_lib_lookup for nf_tproxy
qed*: Utilize FW 8.37.2.0
...
apic_ack_edge() is explicitely for handling interrupt affinity cleanup when
interrupt remapping is not available or disable.
Remapped interrupts and also some of the platform specific special
interrupts, e.g. UV, invoke ack_APIC_irq() directly.
To address the issue of failing an affinity update with -EBUSY the delayed
affinity mechanism can be reused, but ack_APIC_irq() does not handle
that. Adding this to ack_APIC_irq() is not possible, because that function
is also used for exceptions and directly handled interrupts like IPIs.
Create a new function, which just contains the conditional invocation of
irq_move_irq() and the final ack_APIC_irq().
Reuse the new function in apic_ack_edge().
Preparatory change for the real fix.
Fixes: dccfe3147b ("x86/vector: Simplify vector move cleanup")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Song Liu <liu.song.a23@gmail.com>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: stable@vger.kernel.org
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tariq Toukan <tariqt@mellanox.com>
Link: https://lkml.kernel.org/r/20180604162224.471925894@linutronix.de
The AMD document outlining the SSBD handling
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
mentions that if CPUID 8000_0008.EBX[24] is set we should be using
the SPEC_CTRL MSR (0x48) over the VIRT SPEC_CTRL MSR (0xC001_011f)
for speculative store bypass disable.
This in effect means we should clear the X86_FEATURE_VIRT_SSBD
flag so that we would prefer the SPEC_CTRL MSR.
See the document titled:
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
A copy of this document is available at
https://bugzilla.kernel.org/show_bug.cgi?id=199889
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: KarimAllah Ahmed <karahmed@amazon.de>
Cc: andrew.cooper3@citrix.com
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20180601145921.9500-3-konrad.wilk@oracle.com
The AMD document outlining the SSBD handling
124441_AMD64_SpeculativeStoreBypassDisable_Whitepaper_final.pdf
mentions that the CPUID 8000_0008.EBX[26] will mean that the
speculative store bypass disable is no longer needed.
A copy of this document is available at:
https://bugzilla.kernel.org/show_bug.cgi?id=199889
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
Cc: kvm@vger.kernel.org
Cc: andrew.cooper3@citrix.com
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20180601145921.9500-2-konrad.wilk@oracle.com
AMD SME claims one bit from physical address to indicate whether the
page is encrypted or not. To achieve that we clear out the bit from
__PHYSICAL_MASK.
The capability to adjust __PHYSICAL_MASK is required beyond AMD SME.
For instance for upcoming Intel Multi-Key Total Memory Encryption.
Factor it out into a separate feature with own Kconfig handle.
It also helps with overhead of AMD SME. It saves more than 3k in .text
on defconfig + AMD_MEM_ENCRYPT:
add/remove: 3/2 grow/shrink: 5/110 up/down: 189/-3753 (-3564)
We would need to return to this once we have infrastructure to patch
constants in code. That's good candidate for it.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: linux-mm@kvack.org
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: https://lkml.kernel.org/r/20180518113028.79825-1-kirill.shutemov@linux.intel.com
When CONFIG_OPTIMIZE_INLINING is enabled, the function native_set_p4d()
may not be fully inlined into the caller, resulting in a false-positive
warning about an access to the __pgtable_l5_enabled variable from a
non-__init function, despite the original caller being an __init function:
WARNING: vmlinux.o(.text.unlikely+0x1429): Section mismatch in reference from the function native_set_p4d() to the variable .init.data:__pgtable_l5_enabled
WARNING: vmlinux.o(.text.unlikely+0x1429): Section mismatch in reference from the function native_p4d_clear() to the variable .init.data:__pgtable_l5_enabled
The function native_set_p4d() references the variable __initdata
__pgtable_l5_enabled. This is often because native_set_p4d lacks a
__initdata annotation or the annotation of __pgtable_l5_enabled is wrong.
Marking the native_set_p4d function and its caller native_p4d_clear()
avoids this problem.
I did not bisect the original cause, but I assume this is related to the
recent rework that turned pgtable_l5_enabled() into an inline function,
which in turn caused the compiler to make different inlining decisions.
Fixes: ad3fe525b9 ("x86/mm: Unify pgtable_l5_enabled usage in early boot code")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Link: https://lkml.kernel.org/r/20180605113715.1133726-1-arnd@arndb.de
These include a significant update of the generic power domains (genpd)
and Operating Performance Points (OPP) frameworks, mostly related to
the introduction of power domain performance levels, cpufreq updates
(new driver for Qualcomm Kryo processors, updates of the existing
drivers, some core fixes, schedutil governor improvements), PCI power
management fixes, ACPI workaround for EC-based wakeup events handling
on resume from suspend-to-idle, and major updates of the turbostat
and pm-graph utilities.
Specifics:
- Introduce power domain performance levels into the the generic
power domains (genpd) and Operating Performance Points (OPP)
frameworks (Viresh Kumar, Rajendra Nayak, Dan Carpenter).
- Fix two issues in the runtime PM framework related to the
initialization and removal of devices using device links (Ulf
Hansson).
- Clean up the initialization of drivers for devices in PM domains
(Ulf Hansson, Geert Uytterhoeven).
- Fix a cpufreq core issue related to the policy sysfs interface
causing CPU online to fail for CPUs sharing one cpufreq policy in
some situations (Tao Wang).
- Make it possible to use platform-specific suspend/resume hooks
in the cpufreq-dt driver and make the Armada 37xx DVFS use that
feature (Viresh Kumar, Miquel Raynal).
- Optimize policy transition notifications in cpufreq (Viresh Kumar).
- Improve the iowait boost mechanism in the schedutil cpufreq
governor (Patrick Bellasi).
- Improve the handling of deferred frequency updates in the
schedutil cpufreq governor (Joel Fernandes, Dietmar Eggemann,
Rafael Wysocki, Viresh Kumar).
- Add a new cpufreq driver for Qualcomm Kryo (Ilia Lin).
- Fix and clean up some cpufreq drivers (Colin Ian King, Dmitry
Osipenko, Doug Smythies, Luc Van Oostenryck, Simon Horman,
Viresh Kumar).
- Fix the handling of PCI devices with the DPM_SMART_SUSPEND flag
set and update stale comments in the PCI core PM code (Rafael
Wysocki).
- Work around an issue related to the handling of EC-based wakeup
events in the ACPI PM core during resume from suspend-to-idle if
the EC has been put into the low-power mode (Rafael Wysocki).
- Improve the handling of wakeup source objects in the PM core (Doug
Berger, Mahendran Ganesh, Rafael Wysocki).
- Update the driver core to prevent deferred probe from breaking
suspend/resume ordering (Feng Kan).
- Clean up the PM core somewhat (Bjorn Helgaas, Ulf Hansson, Rafael
Wysocki).
- Make the core suspend/resume code and cpufreq support the RT patch
(Sebastian Andrzej Siewior, Thomas Gleixner).
- Consolidate the PM QoS handling in cpuidle governors (Rafael
Wysocki).
- Fix a possible crash in the hibernation core (Tetsuo Handa).
- Update the rockchip-io Adaptive Voltage Scaling (AVS) driver
(David Wu).
- Update the turbostat utility (fixes, cleanups, new CPU IDs, new
command line options, built-in "Low Power Idle" counters support,
new POLL and POLL% columns) and add an entry for it to MAINTAINERS
(Len Brown, Artem Bityutskiy, Chen Yu, Laura Abbott, Matt Turner,
Prarit Bhargava, Srinivas Pandruvada).
- Update the pm-graph to version 5.1 (Todd Brandt).
- Update the intel_pstate_tracer utility (Doug Smythies).
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQIcBAABCAAGBQJbFRzjAAoJEILEb/54YlRxREQQAKD7IjnLA86ZDkmwiwzFa9Cz
OJ0qlKAcMZGjeWH6LYq7lqWtaJ5PcFkBwNB4sRyKFdGPQOX3Ph8ZzILm2j8hhma4
Azn9632P6CoYHABa8Vof+A1BZ/j0aWtvtJEfqXhtF6rAYyWQlF0UmOIRsMs+54a+
Z/w4WuLaX8qYq3JlR60TogNtTIbdUjkjfvxMGrE9OSQ8n4oEhqoF/v0WoTHYLpWw
fu81M378axOu0Sgq1ZQ8GPUdblUqIO97iWwF7k2YUl7D9n5dm4wOhXDz3CLI8Cdb
RkoFFdp8bJIthbc5desKY2XFU1ClY8lxEVMXewFzTGwWMw0OyWgQP0/ZiG+Mujq3
CSbstg8GGpbwQoWU+VrluYa0FtqofV2UaGk1gOuPaojMqaIchRU4Nmbd2U6naNwp
XN7A1DzrOVGEt0ny8ztKH2Oqmj+NOCcRsChlYzdhLQ1wlqG54iCGwAML2ZJF9/Nw
0Sx8hm6eyWLzjSa0L384Msb+v5oqCoac66gPHCl2x7W+3F+jmqx1KbmkI2SRNUAL
7CS9lcImpvC4uZB54Aqya104vfqHiDse7WP0GrKqOmNVucD7hYCPiq/pycLwez+b
V3zLyvly8PsuBIa4AOQGGiK45HGpaKuB4TkRqRyFO0Fb5uL1M+Ld6kJiWlacl4az
STEUjY/90SRQvX3ocGyB
=wqBV
-----END PGP SIGNATURE-----
Merge tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management updates from Rafael Wysocki:
"These include a significant update of the generic power domains
(genpd) and Operating Performance Points (OPP) frameworks, mostly
related to the introduction of power domain performance levels,
cpufreq updates (new driver for Qualcomm Kryo processors, updates of
the existing drivers, some core fixes, schedutil governor
improvements), PCI power management fixes, ACPI workaround for
EC-based wakeup events handling on resume from suspend-to-idle, and
major updates of the turbostat and pm-graph utilities.
Specifics:
- Introduce power domain performance levels into the the generic
power domains (genpd) and Operating Performance Points (OPP)
frameworks (Viresh Kumar, Rajendra Nayak, Dan Carpenter).
- Fix two issues in the runtime PM framework related to the
initialization and removal of devices using device links (Ulf
Hansson).
- Clean up the initialization of drivers for devices in PM domains
(Ulf Hansson, Geert Uytterhoeven).
- Fix a cpufreq core issue related to the policy sysfs interface
causing CPU online to fail for CPUs sharing one cpufreq policy in
some situations (Tao Wang).
- Make it possible to use platform-specific suspend/resume hooks in
the cpufreq-dt driver and make the Armada 37xx DVFS use that
feature (Viresh Kumar, Miquel Raynal).
- Optimize policy transition notifications in cpufreq (Viresh Kumar).
- Improve the iowait boost mechanism in the schedutil cpufreq
governor (Patrick Bellasi).
- Improve the handling of deferred frequency updates in the schedutil
cpufreq governor (Joel Fernandes, Dietmar Eggemann, Rafael Wysocki,
Viresh Kumar).
- Add a new cpufreq driver for Qualcomm Kryo (Ilia Lin).
- Fix and clean up some cpufreq drivers (Colin Ian King, Dmitry
Osipenko, Doug Smythies, Luc Van Oostenryck, Simon Horman, Viresh
Kumar).
- Fix the handling of PCI devices with the DPM_SMART_SUSPEND flag set
and update stale comments in the PCI core PM code (Rafael Wysocki).
- Work around an issue related to the handling of EC-based wakeup
events in the ACPI PM core during resume from suspend-to-idle if
the EC has been put into the low-power mode (Rafael Wysocki).
- Improve the handling of wakeup source objects in the PM core (Doug
Berger, Mahendran Ganesh, Rafael Wysocki).
- Update the driver core to prevent deferred probe from breaking
suspend/resume ordering (Feng Kan).
- Clean up the PM core somewhat (Bjorn Helgaas, Ulf Hansson, Rafael
Wysocki).
- Make the core suspend/resume code and cpufreq support the RT patch
(Sebastian Andrzej Siewior, Thomas Gleixner).
- Consolidate the PM QoS handling in cpuidle governors (Rafael
Wysocki).
- Fix a possible crash in the hibernation core (Tetsuo Handa).
- Update the rockchip-io Adaptive Voltage Scaling (AVS) driver (David
Wu).
- Update the turbostat utility (fixes, cleanups, new CPU IDs, new
command line options, built-in "Low Power Idle" counters support,
new POLL and POLL% columns) and add an entry for it to MAINTAINERS
(Len Brown, Artem Bityutskiy, Chen Yu, Laura Abbott, Matt Turner,
Prarit Bhargava, Srinivas Pandruvada).
- Update the pm-graph to version 5.1 (Todd Brandt).
- Update the intel_pstate_tracer utility (Doug Smythies)"
* tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (128 commits)
tools/power turbostat: update version number
tools/power turbostat: Add Node in output
tools/power turbostat: add node information into turbostat calculations
tools/power turbostat: remove num_ from cpu_topology struct
tools/power turbostat: rename num_cores_per_pkg to num_cores_per_node
tools/power turbostat: track thread ID in cpu_topology
tools/power turbostat: Calculate additional node information for a package
tools/power turbostat: Fix node and siblings lookup data
tools/power turbostat: set max_num_cpus equal to the cpumask length
tools/power turbostat: if --num_iterations, print for specific number of iterations
tools/power turbostat: Add Cannon Lake support
tools/power turbostat: delete duplicate #defines
x86: msr-index.h: Correct SNB_C1/C3_AUTO_UNDEMOTE defines
tools/power turbostat: Correct SNB_C1/C3_AUTO_UNDEMOTE defines
tools/power turbostat: add POLL and POLL% column
tools/power turbostat: Fix --hide Pk%pc10
tools/power turbostat: Build-in "Low Power Idle" counters support
tools/power turbostat: Don't make man pages executable
tools/power turbostat: remove blank lines
tools/power turbostat: a small C-states dump readability immprovement
...
Pull x86 hyperv updates from Thomas Gleixner:
"A set of commits to enable APIC enlightenment when running as a guest
on Microsoft HyperV.
This accelerates the APIC access with paravirtualization techniques,
which are called enlightenments on Hyper-V"
* 'x86-hyperv-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/Hyper-V/hv_apic: Build the Hyper-V APIC conditionally
x86/Hyper-V/hv_apic: Include asm/apic.h
X86/Hyper-V: Consolidate the allocation of the hypercall input page
X86/Hyper-V: Consolidate code for converting cpumask to vpset
X86/Hyper-V: Enhanced IPI enlightenment
X86/Hyper-V: Enable IPI enlightenments
X86/Hyper-V: Enlighten APIC access
Pull timers and timekeeping updates from Thomas Gleixner:
- Core infrastucture work for Y2038 to address the COMPAT interfaces:
+ Add a new Y2038 safe __kernel_timespec and use it in the core
code
+ Introduce config switches which allow to control the various
compat mechanisms
+ Use the new config switch in the posix timer code to control the
32bit compat syscall implementation.
- Prevent bogus selection of CPU local clocksources which causes an
endless reselection loop
- Remove the extra kthread in the clocksource code which has no value
and just adds another level of indirection
- The usual bunch of trivial updates, cleanups and fixlets all over the
place
- More SPDX conversions
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
clocksource/drivers/mxs_timer: Switch to SPDX identifier
clocksource/drivers/timer-imx-tpm: Switch to SPDX identifier
clocksource/drivers/timer-imx-gpt: Switch to SPDX identifier
clocksource/drivers/timer-imx-gpt: Remove outdated file path
clocksource/drivers/arc_timer: Add comments about locking while read GFRC
clocksource/drivers/mips-gic-timer: Add pr_fmt and reword pr_* messages
clocksource/drivers/sprd: Fix Kconfig dependency
clocksource: Move inline keyword to the beginning of function declarations
timer_list: Remove unused function pointer typedef
timers: Adjust a kernel-doc comment
tick: Prefer a lower rating device only if it's CPU local device
clocksource: Remove kthread
time: Change nanosleep to safe __kernel_* types
time: Change types to new y2038 safe __kernel_* types
time: Fix get_timespec64() for y2038 safe compat interfaces
time: Add new y2038 safe __kernel_timespec
posix-timers: Make compat syscalls depend on CONFIG_COMPAT_32BIT_TIME
time: Introduce CONFIG_COMPAT_32BIT_TIME
time: Introduce CONFIG_64BIT_TIME in architectures
compat: Enable compat_get/put_timespec64 always
...
Pull irq updates from Thomas Gleixner:
- Consolidation of softirq pending:
The softirq mask and its accessors/mutators have many implementations
scattered around many architectures. Most do the same things
consisting in a field in a per-cpu struct (often irq_cpustat_t)
accessed through per-cpu ops. We can provide instead a generic
efficient version that most of them can use. In fact s390 is the only
exception because the field is stored in lowcore.
- Support for level!?! triggered MSI (ARM)
Over the past couple of years, we've seen some SoCs coming up with
ways of signalling level interrupts using a new flavor of MSIs, where
the MSI controller uses two distinct messages: one that raises a
virtual line, and one that lowers it. The target MSI controller is in
charge of maintaining the state of the line.
This allows for a much simplified HW signal routing (no need to have
hundreds of discrete lines to signal level interrupts if you already
have a memory bus), but results in a departure from the current idea
the kernel has of MSIs.
- Support for Meson-AXG GPIO irqchip
- Large stm32 irqchip rework (suspend/resume, hierarchical domains)
- More SPDX conversions
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (36 commits)
ARM: dts: stm32: Add exti support to stm32mp157 pinctrl
ARM: dts: stm32: Add exti support for stm32mp157c
pinctrl/stm32: Add irq_eoi for stm32gpio irqchip
irqchip/stm32: Add suspend/resume support for hierarchy domain
irqchip/stm32: Add stm32mp1 support with hierarchy domain
irqchip/stm32: Prepare common functions
irqchip/stm32: Add host and driver data structures
irqchip/stm32: Add suspend support
irqchip/stm32: Add falling pending register support
irqchip/stm32: Checkpatch fix
irqchip/stm32: Optimizes and cleans up stm32-exti irq_domain
irqchip/meson-gpio: Add support for Meson-AXG SoCs
dt-bindings: interrupt-controller: New binding for Meson-AXG SoC
dt-bindings: interrupt-controller: Fix the double quotes
softirq/s390: Move default mutators of overwritten softirq mask to s390
softirq/x86: Switch to generic local_softirq_pending() implementation
softirq/sparc: Switch to generic local_softirq_pending() implementation
softirq/powerpc: Switch to generic local_softirq_pending() implementation
softirq/parisc: Switch to generic local_softirq_pending() implementation
softirq/ia64: Switch to generic local_softirq_pending() implementation
...
Pull x86 dax updates from Ingo Molnar:
"This contains x86 memcpy_mcsafe() fault handling improvements the
nvdimm tree would like to make more use of"
* 'x86-dax-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm/memcpy_mcsafe: Define copy_to_iter_mcsafe()
x86/asm/memcpy_mcsafe: Add write-protection-fault handling
x86/asm/memcpy_mcsafe: Return bytes remaining
x86/asm/memcpy_mcsafe: Add labels for __memcpy_mcsafe() write fault handling
x86/asm/memcpy_mcsafe: Remove loop unrolling
Pull x86 debug updates from Ingo Molnar:
"This contains the x86 oops code printing reorganization and cleanups
from Borislav Betkov, with a particular focus in enhancing opcode
dumping all around"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/dumpstack: Explain the reasoning for the prologue and buffer size
x86/dumpstack: Save first regs set for the executive summary
x86/dumpstack: Add a show_ip() function
x86/fault: Dump user opcode bytes on fatal faults
x86/dumpstack: Add loglevel argument to show_opcodes()
x86/dumpstack: Improve opcodes dumping in the code section
x86/dumpstack: Carve out code-dumping into a function
x86/dumpstack: Unexport oops_begin()
x86/dumpstack: Remove code_bytes
Pull x86 asm updates from Ingo Molnar:
- better support (non-atomic) 64-bit readq()/writeq() variants (Andy
Shevchenko)
- __clear_user() micro-optimization (Alexey Dobriyan)
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/io: Define readq()/writeq() to use 64-bit type
x86/asm/64: Micro-optimize __clear_user() - Use immediate constants
Pull x86 boot updates from Ingo Molnar:
- Centaur CPU updates (David Wang)
- AMD and other CPU topology enumeration improvements and fixes
(Borislav Petkov, Thomas Gleixner, Suravee Suthikulpanit)
- Continued 5-level paging work (Kirill A. Shutemov)
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Mark __pgtable_l5_enabled __initdata
x86/mm: Mark p4d_offset() __always_inline
x86/mm: Introduce the 'no5lvl' kernel parameter
x86/mm: Stop pretending pgtable_l5_enabled is a variable
x86/mm: Unify pgtable_l5_enabled usage in early boot code
x86/boot/compressed/64: Fix trampoline page table address calculation
x86/CPU: Move x86_cpuinfo::x86_max_cores assignment to detect_num_cpu_cores()
x86/Centaur: Report correct CPU/cache topology
x86/CPU: Move cpu_detect_cache_sizes() into init_intel_cacheinfo()
x86/CPU: Make intel_num_cpu_cores() generic
x86/CPU: Move cpu local function declarations to local header
x86/CPU/AMD: Derive CPU topology from CPUID function 0xB when available
x86/CPU: Modify detect_extended_topology() to return result
x86/CPU/AMD: Calculate last level cache ID from number of sharing threads
x86/CPU: Rename intel_cacheinfo.c to cacheinfo.c
perf/events/amd/uncore: Fix amd_uncore_llc ID to use pre-defined cpu_llc_id
x86/CPU/AMD: Have smp_num_siblings and cpu_llc_id always be present
x86/Centaur: Initialize supported CPU features properly
Pull locking updates from Ingo Molnar:
- Lots of tidying up changes all across the map for Linux's formal
memory/locking-model tooling, by Alan Stern, Akira Yokosawa, Andrea
Parri, Paul E. McKenney and SeongJae Park.
Notable changes beyond an overall update in the tooling itself is the
tidying up of spin_is_locked() semantics, which spills over into the
kernel proper as well.
- qspinlock improvements: the locking algorithm now guarantees forward
progress whereas the previous implementation in mainline could starve
threads indefinitely in cmpxchg() loops. Also other related cleanups
to the qspinlock code (Will Deacon)
- misc smaller improvements, cleanups and fixes all across the locking
subsystem
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
locking/rwsem: Simplify the is-owner-spinnable checks
tools/memory-model: Add reference for 'Simplifying ARM concurrency'
tools/memory-model: Update ASPLOS information
MAINTAINERS, tools/memory-model: Update e-mail address for Andrea Parri
tools/memory-model: Fix coding style in 'lock.cat'
tools/memory-model: Remove out-of-date comments and code from lock.cat
tools/memory-model: Improve mixed-access checking in lock.cat
tools/memory-model: Improve comments in lock.cat
tools/memory-model: Remove duplicated code from lock.cat
tools/memory-model: Flag "cumulativity" and "propagation" tests
tools/memory-model: Add model support for spin_is_locked()
tools/memory-model: Add scripts to test memory model
tools/memory-model: Fix coding style in 'linux-kernel.def'
tools/memory-model: Model 'smp_store_mb()'
tools/memory-order: Update the cheat-sheet to show that smp_mb__after_atomic() orders later RMW operations
tools/memory-order: Improve key for SELF and SV
tools/memory-model: Fix cheat sheet typo
tools/memory-model: Update required version of herdtools7
tools/memory-model: Redefine rb in terms of rcu-fence
tools/memory-model: Rename link and rcu-path to rcu-link and rb
...
- replaceme the force_dma flag with a dma_configure bus method.
(Nipun Gupta, although one patch is іncorrectly attributed to me
due to a git rebase bug)
- use GFP_DMA32 more agressively in dma-direct. (Takashi Iwai)
- remove PCI_DMA_BUS_IS_PHYS and rely on the dma-mapping API to do the
right thing for bounce buffering.
- move dma-debug initialization to common code, and apply a few cleanups
to the dma-debug code.
- cleanup the Kconfig mess around swiotlb selection
- swiotlb comment fixup (Yisheng Xie)
- a trivial swiotlb fix. (Dan Carpenter)
- support swiotlb on RISC-V. (based on a patch from Palmer Dabbelt)
- add a new generic dma-noncoherent dma_map_ops implementation and use
it for arc, c6x and nds32.
- improve scatterlist validity checking in dma-debug. (Robin Murphy)
- add a struct device quirk to limit the dma-mask to 32-bit due to
bridge/system issues, and switch x86 to use it instead of a local
hack for VIA bridges.
- handle devices without a dma_mask more gracefully in the dma-direct
code.
-----BEGIN PGP SIGNATURE-----
iQI/BAABCAApFiEEgdbnc3r/njty3Iq9D55TZVIEUYMFAlsU1hwLHGhjaEBsc3Qu
ZGUACgkQD55TZVIEUYPraxAAocC7JiFKW133/VugCtGA1x9uE8DPHealtsWTAeEq
KOOB3GxWMU2hKqQ4km5tcfdWoGJvvab6hmDXcitzZGi2JajO7Ae0FwIy3yvxSIKm
iH/ON7c4sJt8gKrXYsLVylmwDaimNs4a6xfODoCRgnWuovI2QrrZzupnlzPNsiOC
lv8ezzcW+Ay/gvDD/r72psO+w3QELETif/OzR/qTOtvLrVabM06eHmPQ8Wb98smu
/UPMMv6/3XwQnxpxpdyqN+p/gUdneXithzT261wTeZ+8gDXmcWBwHGcMBCimcoBi
FklW52moazIPIsTysqoNlVFsLGJTeS4p2D3BLAp5NwWYsLv+zHUVZsI1JY/8u5Ox
mM11LIfvu9JtUzaqD9SvxlxIeLhhYZZGnUoV3bQAkpHSQhN/xp2YXd5NWSo5ac2O
dch83+laZkZgd6ryw6USpt/YTPM/UHBYy7IeGGHX/PbmAke0ZlvA6Rae7kA5DG59
7GaLdwQyrHp8uGFgwze8P+R4POSk1ly73HHLBT/pFKnDD7niWCPAnBzuuEQGJs00
0zuyWLQyzOj1l6HCAcMNyGnYSsMp8Fx0fvEmKR/EYs8O83eJKXi6L9aizMZx4v1J
0wTolUWH6SIIdz474YmewhG5YOLY7mfe9E8aNr8zJFdwRZqwaALKoteRGUxa3f6e
zUE=
=6Acj
-----END PGP SIGNATURE-----
Merge tag 'dma-mapping-4.18' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- replace the force_dma flag with a dma_configure bus method. (Nipun
Gupta, although one patch is іncorrectly attributed to me due to a
git rebase bug)
- use GFP_DMA32 more agressively in dma-direct. (Takashi Iwai)
- remove PCI_DMA_BUS_IS_PHYS and rely on the dma-mapping API to do the
right thing for bounce buffering.
- move dma-debug initialization to common code, and apply a few
cleanups to the dma-debug code.
- cleanup the Kconfig mess around swiotlb selection
- swiotlb comment fixup (Yisheng Xie)
- a trivial swiotlb fix. (Dan Carpenter)
- support swiotlb on RISC-V. (based on a patch from Palmer Dabbelt)
- add a new generic dma-noncoherent dma_map_ops implementation and use
it for arc, c6x and nds32.
- improve scatterlist validity checking in dma-debug. (Robin Murphy)
- add a struct device quirk to limit the dma-mask to 32-bit due to
bridge/system issues, and switch x86 to use it instead of a local
hack for VIA bridges.
- handle devices without a dma_mask more gracefully in the dma-direct
code.
* tag 'dma-mapping-4.18' of git://git.infradead.org/users/hch/dma-mapping: (48 commits)
dma-direct: don't crash on device without dma_mask
nds32: use generic dma_noncoherent_ops
nds32: implement the unmap_sg DMA operation
nds32: consolidate DMA cache maintainance routines
x86/pci-dma: switch the VIA 32-bit DMA quirk to use the struct device flag
x86/pci-dma: remove the explicit nodac and allowdac option
x86/pci-dma: remove the experimental forcesac boot option
Documentation/x86: remove a stray reference to pci-nommu.c
core, dma-direct: add a flag 32-bit dma limits
dma-mapping: remove unused gfp_t parameter to arch_dma_alloc_attrs
dma-debug: check scatterlist segments
c6x: use generic dma_noncoherent_ops
arc: use generic dma_noncoherent_ops
arc: fix arc_dma_{map,unmap}_page
arc: fix arc_dma_sync_sg_for_{cpu,device}
arc: simplify arc_dma_sync_single_for_{cpu,device}
dma-mapping: provide a generic dma-noncoherent implementation
dma-mapping: simplify Kconfig dependencies
riscv: add swiotlb support
riscv: only enable ZONE_DMA32 for 64-bit
...
According to the Intel Software Developers' Manual, Vol. 4, Order No.
335592, these macros have been reversed since they were added in the
initial turbostat commit. The reversed definitions were presumably
copied from turbostat.c to this file.
Fixes: 9c63a650bb ("tools/power/x86/turbostat: share kernel MSR #defines")
Signed-off-by: Matt Turner <mattst88@gmail.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Len Brown <len.brown@intel.com>
Instead of globally disabling > 32bit DMA using the arch_dma_supported
hook walk the PCI bus under the actually affected bridge and mark every
device with the dma_32bit_limit flag. This also gets rid of the
arch_dma_supported hook entirely.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Implement HvFlushVirtualAddress{List,Space} hypercalls in a simplistic way:
do full TLB flush with KVM_REQ_TLB_FLUSH and kick vCPUs which are currently
IN_GUEST_MODE.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Hyper-V TLB flush hypercalls definitions will be required for KVM so move
them hyperv-tlfs.h. Structures also need to be renamed as '_pcpu' suffix is
irrelevant for a general-purpose definition.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Changing the VMCS12 layout will break save/restore compatibility with
older kvm releases once the KVM_{GET,SET}_NESTED_STATE ioctls are
accepted upstream. Google has already been using these ioctls for some
time, and we implore the community not to disturb the existing layout.
Move the four most recently added fields to preserve the offsets of
the previously defined fields and reserve locations for the vmread and
vmwrite bitmaps, which will be used in the virtualization of VMCS
shadowing (to improve the performance of double-nesting).
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
[Kept the SDM order in vmcs_field_to_offset_table. - Radim]
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Given the fact that the ACPI "EINJ" (error injection) facility is not
universally available, implement software infrastructure to validate the
memcpy_mcsafe() exception handling implementation.
For each potential read exception point in memcpy_mcsafe(), inject a
emulated exception point at the address identified by 'mcsafe_inject'
variable. With this infrastructure implement a test to validate that the
'bytes remaining' calculation is correct for a range of various source
buffer alignments.
This code is compiled out by default. The CONFIG_MCSAFE_DEBUG
configuration symbol needs to be manually enabled by editing
Kconfig.debug. I.e. this functionality can not be accidentally enabled
by a user / distro, it's only for development.
Cc: <x86@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
S390 bpf_jit.S is removed in net-next and had changes in 'net',
since that code isn't used any more take the removal.
TLS data structures split the TX and RX components in 'net-next',
put the new struct members from the bug fix in 'net' into the RX
part.
The 'net-next' tree had some reworking of how the ERSPAN code works in
the GRE tunneling code, overlapping with a one-line headroom
calculation fix in 'net'.
Overlapping changes in __sock_map_ctx_update_elem(), keep the bits
that read the prog members via READ_ONCE() into local variables
before using them.
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge speculative store buffer bypass fixes from Thomas Gleixner:
- rework of the SPEC_CTRL MSR management to accomodate the new fancy
SSBD (Speculative Store Bypass Disable) bit handling.
- the CPU bug and sysfs infrastructure for the exciting new Speculative
Store Bypass 'feature'.
- support for disabling SSB via LS_CFG MSR on AMD CPUs including
Hyperthread synchronization on ZEN.
- PRCTL support for dynamic runtime control of SSB
- SECCOMP integration to automatically disable SSB for sandboxed
processes with a filter flag for opt-out.
- KVM integration to allow guests fiddling with SSBD including the new
software MSR VIRT_SPEC_CTRL to handle the LS_CFG based oddities on
AMD.
- BPF protection against SSB
.. this is just the core and x86 side, other architecture support will
come separately.
* 'speck-v20' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
bpf: Prevent memory disambiguation attack
x86/bugs: Rename SSBD_NO to SSB_NO
KVM: SVM: Implement VIRT_SPEC_CTRL support for SSBD
x86/speculation, KVM: Implement support for VIRT_SPEC_CTRL/LS_CFG
x86/bugs: Rework spec_ctrl base and mask logic
x86/bugs: Remove x86_spec_ctrl_set()
x86/bugs: Expose x86_spec_ctrl_base directly
x86/bugs: Unify x86_spec_ctrl_{set_guest,restore_host}
x86/speculation: Rework speculative_store_bypass_update()
x86/speculation: Add virtualized speculative store bypass disable support
x86/bugs, KVM: Extend speculation control for VIRT_SPEC_CTRL
x86/speculation: Handle HT correctly on AMD
x86/cpufeatures: Add FEATURE_ZEN
x86/cpufeatures: Disentangle SSBD enumeration
x86/cpufeatures: Disentangle MSR_SPEC_CTRL enumeration from IBRS
x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP
KVM: SVM: Move spec control call after restore of GS
x86/cpu: Make alternative_msr_write work for 32-bit code
x86/bugs: Fix the parameters alignment and missing void
x86/bugs: Make cpu_show_common() static
...
Pull x86 fixes from Thomas Gleixner:
"An unfortunately larger set of fixes, but a large portion is
selftests:
- Fix the missing clusterid initializaiton for x2apic cluster
management which caused boot failures due to IPIs being sent to the
wrong cluster
- Drop TX_COMPAT when a 64bit executable is exec()'ed from a compat
task
- Wrap access to __supported_pte_mask in __startup_64() where clang
compile fails due to a non PC relative access being generated.
- Two fixes for 5 level paging fallout in the decompressor:
- Handle GOT correctly for paging_prepare() and
cleanup_trampoline()
- Fix the page table handling in cleanup_trampoline() to avoid
page table corruption.
- Stop special casing protection key 0 as this is inconsistent with
the manpage and also inconsistent with the allocation map handling.
- Override the protection key wen moving away from PROT_EXEC to
prevent inaccessible memory.
- Fix and update the protection key selftests to address breakage and
to cover the above issue
- Add a MOV SS self test"
[ Part of the x86 fixes were in the earlier core pull due to dependencies ]
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/mm: Drop TS_COMPAT on 64-bit exec() syscall
x86/apic/x2apic: Initialize cluster ID properly
x86/boot/compressed/64: Fix moving page table out of trampoline memory
x86/boot/compressed/64: Set up GOT for paging_prepare() and cleanup_trampoline()
x86/pkeys: Do not special case protection key 0
x86/pkeys/selftests: Add a test for pkey 0
x86/pkeys/selftests: Save off 'prot' for allocations
x86/pkeys/selftests: Fix pointer math
x86/pkeys: Override pkey when moving away from PROT_EXEC
x86/pkeys/selftests: Fix pkey exhaustion test off-by-one
x86/pkeys/selftests: Add PROT_EXEC test
x86/pkeys/selftests: Factor out "instruction page"
x86/pkeys/selftests: Allow faults on unknown keys
x86/pkeys/selftests: Avoid printf-in-signal deadlocks
x86/pkeys/selftests: Remove dead debugging code, fix dprint_in_signal
x86/pkeys/selftests: Stop using assert()
x86/pkeys/selftests: Give better unexpected fault error messages
x86/selftests: Add mov_to_ss test
x86/mpx/selftests: Adjust the self-test to fresh distros that export the MPX ABI
x86/pkeys/selftests: Adjust the self-test to fresh distros that export the pkeys ABI
...
Pull core fixes from Thomas Gleixner:
- Unbreak the BPF compilation which got broken by the unconditional
requirement of asm-goto, which is not supported by clang.
- Prevent probing on exception masking instructions in uprobes and
kprobes to avoid the issues of the delayed exceptions instead of
having an ugly workaround.
- Prevent a double free_page() in the error path of do_kexec_load()
- A set of objtool updates addressing various issues mostly related to
switch tables and the noreturn detection for recursive sibling calls
- Header sync for tools.
* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Detect RIP-relative switch table references, part 2
objtool: Detect RIP-relative switch table references
objtool: Support GCC 8 switch tables
objtool: Support GCC 8's cold subfunctions
objtool: Fix "noreturn" detection for recursive sibling calls
objtool, kprobes/x86: Sync the latest <asm/insn.h> header with tools/objtool/arch/x86/include/asm/insn.h
x86/cpufeature: Guard asm_volatile_goto usage for BPF compilation
uprobes/x86: Prohibit probing on MOV SS instruction
kprobes/x86: Prohibit probing on exception masking instructions
x86/kexec: Avoid double free_page() upon do_kexec_load() failure
The Hyper-V APIC code is built when CONFIG_HYPERV is enabled but the actual
code in that file is guarded with CONFIG_X86_64. There is no point in doing
this. Neither is there a point in having the CONFIG_HYPERV guard in there
because the containing directory is not built when CONFIG_HYPERV=n.
Further for the hv_init_apic() function a stub is provided only for
CONFIG_HYPERV=n, which is pointless as the callsite is not compiled at
all. But for X86_32 the stub is missing and the build fails.
Clean that up:
- Compile hv_apic.c only when CONFIG_X86_64=y
- Make the stub for hv_init_apic() available when CONFG_X86_64=n
Fixes: 6b48cb5f83 ("X86/Hyper-V: Enlighten APIC access")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Michael Kelley <mikelley@microsoft.com>
The x86 platform operations are fairly isolated, so it's easy to change
them from using timespec to timespec64. It has been checked that all the
users and callers are safe, and there is only one critical function that is
broken beyond 2106:
pvclock_read_wallclock() uses a 32-bit number of seconds since the epoch
to communicate the boot time between host and guest in a virtual
environment. This will work until 2106, but fixing this is outside the
scope of this change, Add a comment at least.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Acked-by: Radim Krčmář <rkrcmar@redhat.com>
Acked-by: Jan Kiszka <jan.kiszka@siemens.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: jailhouse-dev@googlegroups.com
Cc: Borislav Petkov <bp@suse.de>
Cc: kvm@vger.kernel.org
Cc: y2038@lists.linaro.org
Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
Cc: xen-devel@lists.xenproject.org
Cc: John Stultz <john.stultz@linaro.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Joao Martins <joao.m.martins@oracle.com>
Link: https://lkml.kernel.org/r/20180427201435.3194219-1-arnd@arndb.de
__pgtable_l5_enabled shouldn't be needed after system has booted, we can
mark it as __initdata, but it requires preparation.
KASAN initialization code is a user of USE_EARLY_PGTABLE_L5, so all
pgtable_l5_enabled() translated to __pgtable_l5_enabled there, including
the one in p4d_offset().
It may lead to section mismatch, if a compiler would not inline
p4d_offset(), but leave it as a standalone function: p4d_offset() is not
marked as __init.
Marking p4d_offset() as __always_inline fixes the issue.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-7-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
pgtable_l5_enabled is defined using cpu_feature_enabled() but we refer
to it as a variable. This is misleading.
Make pgtable_l5_enabled() a function.
We cannot literally define it as a function due to circular dependencies
between header files. Function-alike macros is close enough.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-4-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Usually pgtable_l5_enabled is defined using cpu_feature_enabled().
cpu_feature_enabled() is not available in early boot code. We use
several different preprocessor tricks to get around it. It's messy.
Unify them all.
If cpu_feature_enabled() is not yet available, USE_EARLY_PGTABLE_L5 can
be defined before all includes. It makes pgtable_l5_enabled rely on
__pgtable_l5_enabled variable instead. This approach fits all early
users.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20180518103528.59260-3-kirill.shutemov@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The "336996 Speculative Execution Side Channel Mitigations" from
May defines this as SSB_NO, hence lets sync-up.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Since non atomic readq() and writeq() were added some of the drivers
would like to use it in a manner of:
#include <io-64-nonatomic-lo-hi.h>
...
pr_debug("Debug value of some register: %016llx\n", readq(addr));
However, lo_hi_readq() always returns __u64 data, while readq()
on x86_64 defines it as unsigned long. and thus compiler warns
about type mismatch, although they are both 64-bit on x86_64.
Convert readq() and writeq() on x86 to operate on deterministic
64-bit type. The most of architectures in the kernel already are using
either unsigned long long, or u64 type for readq() / writeq().
This change propagates consistency in that sense.
While this is not an issue per se, though if someone wants to address it,
the anchor could be the commit:
797a796a13 ("asm-generic: architecture independent readq/writeq for 32bit environment")
where non-atomic variants had been introduced.
Note, there are only few users of above pattern and they will not be
affected because they do cast returned value. The actual warning has
been issued on not-yet-upstreamed code.
Potentially we might get a new warnings if some 64-bit only code
assigns returned value to unsigned long type of variable. This is
assumed to be addressed on case-by-case basis.
Reported-by: lkp <lkp@intel.com>
Tested-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20180515115211.55050-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
KVM_HINTS_DEDICATED seems to be somewhat confusing:
Guest doesn't really care whether it's the only task running on a host
CPU as long as it's not preempted.
And there are more reasons for Guest to be preempted than host CPU
sharing, for example, with memory overcommit it can get preempted on a
memory access, post copy migration can cause preemption, etc.
Let's call it KVM_HINTS_REALTIME which seems to better
match what guests expect.
Also, the flag most be set on all vCPUs - current guests assume this.
Note so in the documentation.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Expose the new virtualized architectural mechanism, VIRT_SSBD, for using
speculative store bypass disable (SSBD) under SVM. This will allow guests
to use SSBD on hardware that uses non-architectural mechanisms for enabling
SSBD.
[ tglx: Folded the migration fixup from Paolo Bonzini ]
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add the necessary logic for supporting the emulated VIRT_SPEC_CTRL MSR to
x86_virt_spec_ctrl(). If either X86_FEATURE_LS_CFG_SSBD or
X86_FEATURE_VIRT_SPEC_CTRL is set then use the new guest_virt_spec_ctrl
argument to check whether the state must be modified on the host. The
update reuses speculative_store_bypass_update() so the ZEN-specific sibling
coordination can be reused.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
x86_spec_ctrl_set() is only used in bugs.c and the extra mask checks there
provide no real value as both call sites can just write x86_spec_ctrl_base
to MSR_SPEC_CTRL. x86_spec_ctrl_base is valid and does not need any extra
masking or checking.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
x86_spec_ctrl_base is the system wide default value for the SPEC_CTRL MSR.
x86_spec_ctrl_get_default() returns x86_spec_ctrl_base and was intended to
prevent modification to that variable. Though the variable is read only
after init and globaly visible already.
Remove the function and export the variable instead.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Function bodies are very similar and are going to grow more almost
identical code. Add a bool arg to determine whether SPEC_CTRL is being set
for the guest or restored to the host.
No functional changes.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The upcoming support for the virtual SPEC_CTRL MSR on AMD needs to reuse
speculative_store_bypass_update() to avoid code duplication. Add an
argument for supplying a thread info (TIF) value and create a wrapper
speculative_store_bypass_update_current() which is used at the existing
call site.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Some AMD processors only support a non-architectural means of enabling
speculative store bypass disable (SSBD). To allow a simplified view of
this to a guest, an architectural definition has been created through a new
CPUID bit, 0x80000008_EBX[25], and a new MSR, 0xc001011f. With this, a
hypervisor can virtualize the existence of this definition and provide an
architectural method for using SSBD to a guest.
Add the new CPUID feature, the new MSR and update the existing SSBD
support to use this MSR when present.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
AMD is proposing a VIRT_SPEC_CTRL MSR to handle the Speculative Store
Bypass Disable via MSR_AMD64_LS_CFG so that guests do not have to care
about the bit position of the SSBD bit and thus facilitate migration.
Also, the sibling coordination on Family 17H CPUs can only be done on
the host.
Extend x86_spec_ctrl_set_guest() and x86_spec_ctrl_restore_host() with an
extra argument for the VIRT_SPEC_CTRL MSR.
Hand in 0 from VMX and in SVM add a new virt_spec_ctrl member to the CPU
data structure which is going to be used in later patches for the actual
implementation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The AMD64_LS_CFG MSR is a per core MSR on Family 17H CPUs. That means when
hyperthreading is enabled the SSBD bit toggle needs to take both cores into
account. Otherwise the following situation can happen:
CPU0 CPU1
disable SSB
disable SSB
enable SSB <- Enables it for the Core, i.e. for CPU0 as well
So after the SSB enable on CPU1 the task on CPU0 runs with SSB enabled
again.
On Intel the SSBD control is per core as well, but the synchronization
logic is implemented behind the per thread SPEC_CTRL MSR. It works like
this:
CORE_SPEC_CTRL = THREAD0_SPEC_CTRL | THREAD1_SPEC_CTRL
i.e. if one of the threads enables a mitigation then this affects both and
the mitigation is only disabled in the core when both threads disabled it.
Add the necessary synchronization logic for AMD family 17H. Unfortunately
that requires a spinlock to serialize the access to the MSR, but the locks
are only shared between siblings.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Add a ZEN feature bit so family-dependent static_cpu_has() optimizations
can be built for ZEN.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The SSBD enumeration is similarly to the other bits magically shared
between Intel and AMD though the mechanisms are different.
Make X86_FEATURE_SSBD synthetic and set it depending on the vendor specific
features or family dependent setup.
Change the Intel bit to X86_FEATURE_SPEC_CTRL_SSBD to denote that SSBD is
controlled via MSR_SPEC_CTRL and fix up the usage sites.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
The availability of the SPEC_CTRL MSR is enumerated by a CPUID bit on
Intel and implied by IBRS or STIBP support on AMD. That's just confusing
and in case an AMD CPU has IBRS not supported because the underlying
problem has been fixed but has another bit valid in the SPEC_CTRL MSR,
the thing falls apart.
Add a synthetic feature bit X86_FEATURE_MSR_SPEC_CTRL to denote the
availability on both Intel and AMD.
While at it replace the boot_cpu_has() checks with static_cpu_has() where
possible. This prevents late microcode loading from exposing SPEC_CTRL, but
late loading is already very limited as it does not reevaluate the
mitigation options and other bits and pieces. Having static_cpu_has() is
the simplest and least fragile solution.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Intel and AMD have different CPUID bits hence for those use synthetic bits
which get set on the respective vendor's in init_speculation_control(). So
that debacles like what the commit message of
c65732e4f7 ("x86/cpu: Restore CPUID_8000_0008_EBX reload")
talks about don't happen anymore.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Tested-by: Jörg Otte <jrg.otte@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: https://lkml.kernel.org/r/20180504161815.GG9257@pd.tnic
Use the updated memcpy_mcsafe() implementation to define
copy_user_mcsafe() and copy_to_iter_mcsafe(). The most significant
difference from typical copy_to_iter() is that the ITER_KVEC and
ITER_BVEC iterator types can fail to complete a full transfer.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: hch@lst.de
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/152539239150.31796.9189779163576449784.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for using memcpy_mcsafe() to handle user copies it needs
to be to handle write-protection faults while writing user pages. Add
MMU-fault handlers alongside the machine-check exception handlers.
Note that the machine check fault exception handling makes assumptions
about source buffer alignment and poison alignment. In the write fault
case, given the destination buffer is arbitrarily aligned, it needs a
separate / additional fault handling approach. The mcsafe_handle_tail()
helper is reused. The @limit argument is set to @len since there is no
safety concern about retriggering an MMU fault, and this simplifies the
assembly.
Co-developed-by: Tony Luck <tony.luck@intel.com>
Reported-by: Mika Penttilä <mika.penttila@nextfour.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: hch@lst.de
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/152539238635.31796.14056325365122961778.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Machine check safe memory copies are currently deployed in the pmem
driver whenever reading from persistent memory media, so that -EIO is
returned rather than triggering a kernel panic. While this protects most
pmem accesses, it is not complete in the filesystem-dax case. When
filesystem-dax is enabled reads may bypass the block layer and the
driver via dax_iomap_actor() and its usage of copy_to_iter().
In preparation for creating a copy_to_iter() variant that can handle
machine checks, teach memcpy_mcsafe() to return the number of bytes
remaining rather than -EFAULT when an exception occurs.
Co-developed-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: hch@lst.de
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/152539238119.31796.14318473522414462886.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation for teaching memcpy_mcsafe() to return 'bytes remaining'
rather than pass / fail, simplify the implementation to remove loop
unrolling. The unrolling complicates the fault handling for negligible
benefit given modern CPUs perform loop stream detection.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: hch@lst.de
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-nvdimm@lists.01.org
Link: http://lkml.kernel.org/r/152539237092.31796.9115692316555638048.stgit@dwillia2-desk3.amr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make the RETPOLINE_{RA,ED}X_BPF_JIT() a bit more readable by
cleaning up the macro, aligning comments and spacing.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
L1 and L2 need to have disjoint mappings, so that L1's APIC access
page (under VMX) can be omitted from L2's mappings.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Previously, we toggled between SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE
and SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES, depending on whether or
not the EXTD bit was set in MSR_IA32_APICBASE. However, if the local
APIC is disabled, we should not set either of these APIC
virtualization control bits.
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Enlightened MSR-Bitmap is a natural extension of Enlightened VMCS:
Hyper-V Top Level Functional Specification states:
"The L1 hypervisor may collaborate with the L0 hypervisor to make MSR
accesses more efficient. It can enable enlightened MSR bitmaps by setting
the corresponding field in the enlightened VMCS to 1. When enabled, the L0
hypervisor does not monitor the MSR bitmaps for changes. Instead, the L1
hypervisor must invalidate the corresponding clean field after making
changes to one of the MSR bitmaps."
I reached out to Hyper-V team for additional details and I got the
following information:
"Current Hyper-V implementation works as following:
If the enlightened MSR bitmap is not enabled:
- All MSR accesses of L2 guests cause physical VM-Exits
If the enlightened MSR bitmap is enabled:
- Physical VM-Exits for L2 accesses to certain MSRs (currently FS_BASE,
GS_BASE and KERNEL_GS_BASE) are avoided, thus making these MSR accesses
faster."
I tested my series with a tight rdmsrl loop in L2, for KERNEL_GS_BASE the
results are:
Without Enlightened MSR-Bitmap: 1300 cycles/read
With Enlightened MSR-Bitmap: 120 cycles/read
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Tested-by: Lan Tianyu <Tianyu.Lan@microsoft.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Extract the logic to free a root page in a separate function to avoid code
duplication in mmu_free_roots(). Also, change it to an exported function
i.e. kvm_mmu_free_roots().
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Remove the ad-hoc implementation, the generic code now allows us not to
reinvent the wheel.
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David S. Miller <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Link: http://lkml.kernel.org/r/1525786706-22846-11-git-send-email-frederic@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
mm_pkey_is_allocated() treats pkey 0 as unallocated. That is
inconsistent with the manpages, and also inconsistent with
mm->context.pkey_allocation_map. Stop special casing it and only
disallow values that are actually bad (< 0).
The end-user visible effect of this is that you can now use
mprotect_pkey() to set pkey=0.
This is a bit nicer than what Ram proposed[1] because it is simpler
and removes special-casing for pkey 0. On the other hand, it does
allow applications to pkey_free() pkey-0, but that's just a silly
thing to do, so we are not going to protect against it.
The scenario that could happen is similar to what happens if you free
any other pkey that is in use: it might get reallocated later and used
to protect some other data. The most likely scenario is that pkey-0
comes back from pkey_alloc(), an access-disable or write-disable bit
is set in PKRU for it, and the next stack access will SIGSEGV. It's
not horribly different from if you mprotect()'d your stack or heap to
be unreadable or unwritable, which is generally very foolish, but also
not explicitly prevented by the kernel.
1. http://lkml.kernel.org/r/1522112702-27853-1-git-send-email-linuxram@us.ibm.com
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>p
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Fixes: 58ab9a088d ("x86/pkeys: Check against max pkey to avoid overflows")
Link: http://lkml.kernel.org/r/20180509171358.47FD785E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I got a bug report that the following code (roughly) was
causing a SIGSEGV:
mprotect(ptr, size, PROT_EXEC);
mprotect(ptr, size, PROT_NONE);
mprotect(ptr, size, PROT_READ);
*ptr = 100;
The problem is hit when the mprotect(PROT_EXEC)
is implicitly assigned a protection key to the VMA, and made
that key ACCESS_DENY|WRITE_DENY. The PROT_NONE mprotect()
failed to remove the protection key, and the PROT_NONE->
PROT_READ left the PTE usable, but the pkey still in place
and left the memory inaccessible.
To fix this, we ensure that we always "override" the pkee
at mprotect() if the VMA does not have execute-only
permissions, but the VMA has the execute-only pkey.
We had a check for PROT_READ/WRITE, but it did not work
for PROT_NONE. This entirely removes the PROT_* checks,
which ensures that PROT_NONE now works.
Reported-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellermen <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ram Pai <linuxram@us.ibm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Cc: stable@vger.kernel.org
Fixes: 62b5f7d013 ("mm/core, x86/mm/pkeys: Add execute-only protection keys support")
Link: http://lkml.kernel.org/r/20180509171351.084C5A71@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cast val and (val >> 32) to (u32), so that they fit in a
general-purpose register in both 32-bit and 64-bit code.
[ tglx: Made it u32 instead of uintptr_t ]
Fixes: c65732e4f7 ("x86/cpu: Restore CPUID_8000_0008_EBX reload")
Signed-off-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Since MOV SS and POP SS instructions will delay the exceptions until the
next instruction is executed, single-stepping on it by kprobes must be
prohibited.
However, kprobes usually executes those instructions directly on trampoline
buffer (a.k.a. kprobe-booster), except for the kprobes which has
post_handler. Thus if kprobe user probes MOV SS with post_handler, it will
do single-stepping on the MOV SS.
This means it is safe that if it is used via ftrace or perf/bpf since those
don't use the post_handler.
Anyway, since the stack switching is a rare case, it is safer just
rejecting kprobes on such instructions.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Cc: Francis Deslauriers <francis.deslauriers@efficios.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: "H . Peter Anvin" <hpa@zytor.com>
Cc: Yonghong Song <yhs@fb.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "David S . Miller" <davem@davemloft.net>
Link: https://lkml.kernel.org/r/152587069574.17316.3311695234863248641.stgit@devbox
No point in exposing all these functions globaly as they are strict local
to the cpu management code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Intel collateral will reference the SSB mitigation bit in IA32_SPEC_CTL[2]
as SSBD (Speculative Store Bypass Disable).
Hence changing it.
It is unclear yet what the MSR_IA32_ARCH_CAPABILITIES (0x10a) Bit(4) name
is going to be. Following the rename it would be SSBD_NO but that rolls out
to Speculative Store Bypass Disable No.
Also fixed the missing space in X86_FEATURE_AMD_SSBD.
[ tglx: Fixup x86_amd_rds_enable() and rds_tif_to_amd_ls_cfg() as well ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This will be used in future patches to check for arch support for
pkeys in generic code.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Move the last remaining pkey helper, vma_pkey() into asm/pkeys.h
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Consolidate the pkey handling by providing a common empty definition
of vma_pkey() in pkeys.h when CONFIG_ARCH_HAS_PKEYS=n.
This also removes another entanglement of pkeys.h and
asm/mmu_context.h.
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Ram Pai <linuxram@us.ibm.com>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Minor conflict, a CHECK was placed into an if() statement
in net-next, whilst a newline was added to that CHECK
call in 'net'. Thanks to Daniel for the merge resolution.
Signed-off-by: David S. Miller <davem@davemloft.net>
This is a follow-up to Deepa's work on the timekeeping system calls,
providing a y2038-safe syscall API for SYSVIPC. It uses a combination
of two strategies:
For sys_msgctl, sys_semctl and sys_shmctl, I do not introduce a completely
new set of replacement system calls, but instead extend the existing
ones to return data in the reserved fields of the normal data structure.
This should be completely transparent to any existing user space, and
only after the 32-bit time_t wraps, it will make a difference in the
returned data.
libc implementations will consequently have to provide their own data
structures when they move to 64-bit time_t, and convert the structures
in user space from the ones returned by the kernel.
In contrast, mq_timedsend, mq_timedreceive and and semtimedop all do
need to change because having a libc redefine the timespec type
breaks the ABI, so with this series there will be two separate entry
points for 32-bit architectures.
There are three cases here:
- little-endian architectures (except powerpc and mips) can use
the normal layout and just cast the data structure to the user space
type that contains 64-bit numbers.
- parisc and sparc can do the same thing with big-endian user space
- little-endian powerpc and most big-endian architectures have
to flip the upper and lower 32-bit halves of the time_t value in memory,
but can otherwise keep using the normal layout
- mips and big-endian xtensa need to be more careful because
they are not consistent in their definitions, and they have to provide
custom libc implementations for the system calls to use 64-bit time_t.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJa4HbTAAoJEGCrR//JCVIniiUP/0mXR18lDCROYoVHgGwDHUas
9CjdGk+GvFFzRYvcoOgBjf8RhiUcYITyn2t9Kv52fZRv6RRaxD+qHWMXs8rnpmnm
59v/GWi4qdbrliCgxrUU6LMsRom4mjTXuZLqCoOrs7F2pGurQ3bq75m4IM7wsx/+
cucSSxLb+qmCeT5HF/7LbvVLm2X10RGW6iI+UeU267sitymUaGmuJGcF6WxioXB2
0u6mwlj62nlc07vSBJQQgSOuw+U095q6hS62uaNr7ZMByckbiPbVV9M4H5OFflqI
Y70UohSue2LIYvJOhu70wQWs832W7sYb+Ia3fnMaX1AEIErtmGoBiIJ1lio1vYpb
jVCPPsR0jWMg2MxGEGAEmEXQ7MZLme6yRmd08IFNJmRzuuzzuwXyz5hwz53ZJtIX
dw0BZnw49b1Hy2oW03w6tbrnW7MlEkMMHM0wOSYGd0K8zJQUOSlW6p1m/UTTpsJQ
G5CSaFWPtEfNPiS+E+w+C8TUtTs6SEZXn8/pIrXSnUjEu9QJvsCmxOroEW7D8pdD
d4+13U5VzIXNlzf+/K1YZ2PIMmorDXkr2otMyi44naksWqc/p4NaikoINgq8QVm2
aoZ0ddlIbyTmiXvWfL7AVjmi7w2ACML8OoapITdWCr1Bfs+DUFNdOrFbdFA7psq7
L98FqbtoHwFLM7b3veF5
=7FX4
-----END PGP SIGNATURE-----
Merge tag 'y2038-ipc' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground into timers/2038
Pull 'y2038: IPC system call conversion' from Arnd Bergmann:
"This is a follow-up to Deepa's work on the timekeeping system calls,
providing a y2038-safe syscall API for SYSVIPC. It uses a combination
of two strategies:
For sys_msgctl, sys_semctl and sys_shmctl, I do not introduce a completely
new set of replacement system calls, but instead extend the existing
ones to return data in the reserved fields of the normal data structure.
This should be completely transparent to any existing user space, and
only after the 32-bit time_t wraps, it will make a difference in the
returned data.
libc implementations will consequently have to provide their own data
structures when they move to 64-bit time_t, and convert the structures
in user space from the ones returned by the kernel.
In contrast, mq_timedsend, mq_timedreceive and and semtimedop all do
need to change because having a libc redefine the timespec type
breaks the ABI, so with this series there will be two separate entry
points for 32-bit architectures.
There are three cases here:
- little-endian architectures (except powerpc and mips) can use
the normal layout and just cast the data structure to the user space
type that contains 64-bit numbers.
- parisc and sparc can do the same thing with big-endian user space
- little-endian powerpc and most big-endian architectures have
to flip the upper and lower 32-bit halves of the time_t value in memory,
but can otherwise keep using the normal layout
- mips and big-endian xtensa need to be more careful because
they are not consistent in their definitions, and they have to provide
custom libc implementations for the system calls to use 64-bit time_t."
This was used by the ide, scsi and networking code in the past to
determine if they should bounce payloads. Now that the dma mapping
always have to support dma to all physical memory (thanks to swiotlb
for non-iommu systems) there is no need to this crude hack any more.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Palmer Dabbelt <palmer@sifive.com> (for riscv)
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Current implementation does not communicate whether it can successfully
detect CPUID function 0xB information. Therefore, modify the function to
return success or error codes. This will be used by subsequent patches.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Link: http://lkml.kernel.org/r/1524865681-112110-2-git-send-email-suravee.suthikulpanit@amd.com
Last Level Cache ID can be calculated from the number of threads sharing
the cache, which is available from CPUID Fn0x8000001D (Cache Properties).
This is used to left-shift the APIC ID to derive LLC ID.
Therefore, default to this method unless the APIC ID enumeration does not
follow the scheme.
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1524864877-111962-5-git-send-email-suravee.suthikulpanit@amd.com
Move smp_num_siblings and cpu_llc_id to cpu/common.c so that they're
always present as symbols and not only in the CONFIG_SMP case. Then,
other code using them doesn't need ugly ifdeffery anymore. Get rid of
some ifdeffery.
Signed-off-by: Borislav Petkov <bpetkov@suse.de>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1524864877-111962-2-git-send-email-suravee.suthikulpanit@amd.com
Unless explicitly opted out of, anything running under seccomp will have
SSB mitigations enabled. Choosing the "prctl" mode will disable this.
[ tglx: Adjusted it to the new arch_seccomp_spec_mitigate() mechanism ]
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The JIT compiler emits ia32 bit instructions. Currently, It supports eBPF
only. Classic BPF is supported because of the conversion by BPF core.
Almost all instructions from eBPF ISA supported except the following:
BPF_ALU64 | BPF_DIV | BPF_K
BPF_ALU64 | BPF_DIV | BPF_X
BPF_ALU64 | BPF_MOD | BPF_K
BPF_ALU64 | BPF_MOD | BPF_X
BPF_STX | BPF_XADD | BPF_W
BPF_STX | BPF_XADD | BPF_DW
It doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL at the moment.
IA32 has few general purpose registers, EAX|EDX|ECX|EBX|ESI|EDI. I use
EAX|EDX|ECX|EBX as temporary registers to simulate instructions in eBPF
ISA, and allocate ESI|EDI to BPF_REG_AX for constant blinding, all others
eBPF registers, R0-R10, are simulated through scratch space on stack.
The reasons behind the hardware registers allocation policy are:
1:MUL need EAX:EDX, shift operation need ECX, so they aren't fit
for general eBPF 64bit register simulation.
2:We need at least 4 registers to simulate most eBPF ISA operations
on registers operands instead of on register&memory operands.
3:We need to put BPF_REG_AX on hardware registers, or constant blinding
will degrade jit performance heavily.
Tested on PC (Intel(R) Core(TM) i5-5200U CPU).
Testing results on i5-5200U:
1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
2) test_progs: Summary: 83 PASSED, 0 FAILED.
3) test_lpm: OK
4) test_lru_map: OK
5) test_verifier: Summary: 828 PASSED, 0 FAILED.
Above tests are all done in following two conditions separately:
1:bpf_jit_enable=1 and bpf_jit_harden=0
2:bpf_jit_enable=1 and bpf_jit_harden=2
Below are some numbers for this jit implementation:
Note:
I run test_progs in kselftest 100 times continuously for every condition,
the numbers are in format: total/times=avg.
The numbers that test_bpf reports show almost the same relation.
a:jit_enable=0 and jit_harden=0 b:jit_enable=1 and jit_harden=0
test_pkt_access:PASS:ipv4:15622/100=156 test_pkt_access:PASS:ipv4:10674/100=106
test_pkt_access:PASS:ipv6:9130/100=91 test_pkt_access:PASS:ipv6:4855/100=48
test_xdp:PASS:ipv4:240198/100=2401 test_xdp:PASS:ipv4:138912/100=1389
test_xdp:PASS:ipv6:137326/100=1373 test_xdp:PASS:ipv6:68542/100=685
test_l4lb:PASS:ipv4:61100/100=611 test_l4lb:PASS:ipv4:37302/100=373
test_l4lb:PASS:ipv6:101000/100=1010 test_l4lb:PASS:ipv6:55030/100=550
c:jit_enable=1 and jit_harden=2
test_pkt_access:PASS:ipv4:10558/100=105
test_pkt_access:PASS:ipv6:5092/100=50
test_xdp:PASS:ipv4:131902/100=1319
test_xdp:PASS:ipv6:77932/100=779
test_l4lb:PASS:ipv4:38924/100=389
test_l4lb:PASS:ipv6:57520/100=575
The numbers show we get 30%~50% improvement.
See Documentation/networking/filter.txt for more information.
Changelog:
Changes v5-v6:
1:Add do {} while (0) to RETPOLINE_RAX_BPF_JIT for
consistence reason.
2:Clean up non-standard comments, reported by Daniel Borkmann.
3:Fix a memory leak issue, repoted by Daniel Borkmann.
Changes v4-v5:
1:Delete is_on_stack, BPF_REG_AX is the only one
on real hardware registers, so just check with
it.
2:Apply commit 1612a981b7 ("bpf, x64: fix JIT emission
for dead code"), suggested by Daniel Borkmann.
Changes v3-v4:
1:Fix changelog in commit.
I install llvm-6.0, then test_progs willn't report errors.
I submit another patch:
"bpf: fix misaligned access for BPF_PROG_TYPE_PERF_EVENT program type on x86_32 platform"
to fix another problem, after that patch, test_verifier willn't report errors too.
2:Fix clear r0[1] twice unnecessarily in *BPF_IND|BPF_ABS* simulation.
Changes v2-v3:
1:Move BPF_REG_AX to real hardware registers for performance reason.
3:Using bpf_load_pointer instead of bpf_jit32.S, suggested by Daniel Borkmann.
4:Delete partial codes in 1c2a088a66, suggested by Daniel Borkmann.
5:Some bug fixes and comments improvement.
Changes v1-v2:
1:Fix bug in emit_ia32_neg64.
2:Fix bug in emit_ia32_arsh_r64.
3:Delete filename in top level comment, suggested by Thomas Gleixner.
4:Delete unnecessary boiler plate text, suggested by Thomas Gleixner.
5:Rewrite some words in changelog.
6:CodingSytle improvement and a little more comments.
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>