linux/arch/x86
Andy Lutomirski b956575bed x86/mm: Flush more aggressively in lazy TLB mode
Since commit:

  94b1b03b51 ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")

x86's lazy TLB mode has been all the way lazy: when running a kernel thread
(including the idle thread), the kernel keeps using the last user mm's
page tables without attempting to maintain user TLB coherence at all.

From a pure semantic perspective, this is fine -- kernel threads won't
attempt to access user pages, so having stale TLB entries doesn't matter.

Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
we also allow any paging-structure caches that may exist on the CPU
to become incoherent.  This means that we can have a
paging-structure cache entry that references a freed page table, and
the CPU is within its rights to do a speculative page walk starting
at the freed page table.

I can imagine this causing two different problems:

 - A speculative page walk starting from a bogus page table could read
   IO addresses.  I haven't seen any reports of this causing problems.

 - A speculative page walk that involves a bogus page table can install
   garbage in the TLB.  Such garbage would always be at a user VA, but
   some AMD CPUs have logic that triggers a machine check when it notices
   these bogus entries.  I've seen a couple reports of this.

Boris further explains the failure mode:

> It is actually more of an optimization which assumes that paging-structure
> entries are in WB DRAM:
>
> "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
> performance optimization that assumes PML4, PDP, PDE, and PTE entries
> are in cacheable WB-DRAM; memory type checks may be bypassed, and
> addresses outside of WB-DRAM may result in undefined behavior or NB
> protocol errors. 1=Disables performance optimization and allows PML4,
> PDP, PDE and PTE entries to be in any memory type. Operating systems
> that maintain page tables in memory types other than WB- DRAM must set
> TlbCacheDis to insure proper operation."
>
> The MCE generated is an NB protocol error to signal that
>
> "Link: A specific coherent-only packet from a CPU was issued to an
> IO link. This may be caused by software which addresses page table
> structures in a memory type other than cacheable WB-DRAM without
> properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
> example, when page table structure addresses are above top of memory. In
> such cases, the NB will generate an MCE if it sees a mismatch between
> the memory operation generated by the core and the link type."
>
> I'm assuming coherent-only packets don't go out on IO links, thus the
> error.

To fix this, reinstate TLB coherence in lazy mode.  With this patch
applied, we do it in one of two ways:

 - If we have PCID, we simply switch back to init_mm's page tables
   when we enter a kernel thread -- this seems to be quite cheap
   except for the cost of serializing the CPU.

 - If we don't have PCID, then we set a flag and switch to init_mm
   the first time we would otherwise need to flush the TLB.

The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
to override the default mode for benchmarking.

In theory, we could optimize this better by only flushing the TLB in
lazy CPUs when a page table is freed.  Doing that would require
auditing the mm code to make sure that all page table freeing goes
through tlb_remove_page() as well as reworking some data structures
to implement the improved flush logic.

Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
Reported-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Nadav Amit <nadav.amit@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Roman Kagan <rkagan@virtuozzo.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 94b1b03b51 ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2017-10-14 09:21:24 +02:00
..
boot Merge branch 'efi-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-09-07 09:42:35 -07:00
configs Merge branch 'x86/urgent' into x86/asm, to pick up fixes 2017-08-10 13:14:15 +02:00
crypto crypto: x86/twofish - Fix RBP usage 2017-09-20 17:42:38 +08:00
entry x86/unwind: Use MSB for frame pointer encoding on 32-bit 2017-10-10 12:49:48 +02:00
events perf/x86/intel/uncore: Correct num_boxes for IIO and IRP 2017-09-25 12:43:56 +02:00
hyperv x86/hyperv: Fix hypercalls with extended CPU ranges for TLB flushing 2017-10-10 12:54:56 +02:00
ia32 x86/fpu: Rename fpu::fpstate_active to fpu::initialized 2017-09-26 09:43:36 +02:00
include x86/mm: Flush more aggressively in lazy TLB mode 2017-10-14 09:21:24 +02:00
kernel x86/apic: Update TSC_DEADLINE quirk with additional SKX stepping 2017-10-12 17:10:11 +02:00
kvm Mixed bugfixes. Perhaps the most interesting one is a latent bug 2017-09-29 12:18:55 -07:00
lib x86/boot: Add early cmdline parsing for options with arguments 2017-07-18 11:38:06 +02:00
math-emu x86/fpu: Rename fpu__activate_curr() to fpu__initialize() 2017-09-26 09:43:44 +02:00
mm x86/mm: Flush more aggressively in lazy TLB mode 2017-10-14 09:21:24 +02:00
net x86: bpf_jit: small optimization in emit_bpf_tail_call() 2017-08-31 11:57:37 -07:00
oprofile
pci dmi: Mark all struct dmi_system_id instances const 2017-09-14 11:59:30 +02:00
platform Merge branch 'i2c/for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux 2017-09-09 14:18:40 -07:00
power dmi: Mark all struct dmi_system_id instances const 2017-09-14 11:59:30 +02:00
purgatory kasan: do not sanitize kexec purgatory 2017-03-31 17:13:30 -07:00
ras x86/mce: Merge mce_amd_inj into mce-inject 2017-06-14 07:32:07 +02:00
realmode x86/boot/realmode: Check for memory encryption on the APs 2017-07-18 11:38:04 +02:00
tools
um um: remove a stray tab 2017-09-13 22:36:27 +02:00
video
xen xen: fixes for 4.14-rc3 2017-09-29 12:24:28 -07:00
.gitignore
Kbuild Merge branch 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2017-09-07 09:25:15 -07:00
Kconfig libnvdimm for 4.14 2017-09-11 13:10:57 -07:00
Kconfig.cpu
Kconfig.debug Merge branch 'x86/asm' into locking/core 2017-08-18 10:29:54 +02:00
Makefile x86/build: Use cc-option to validate stack alignment parameter 2017-08-21 09:53:15 +02:00
Makefile_32.cpu kbuild: remove cc-option-align 2017-06-25 12:43:00 +09:00
Makefile.um