For example, when two APF's for page ready happen after one exit and
the first one becomes pending, the second one will result in #DF.
Instead, just handle the second page fault synchronously.
Reported-by: Ross Zwisler <zwisler@gmail.com>
Message-ID: <CAOxpaSUBf8QoOZQ1p4KfUp0jq76OKfGY4Uxs-Gg8ngReD99xww@mail.gmail.com>
Reported-by: Alec Blayne <ab@tevsa.net>
Signed-off-by: Haozhong Zhang <haozhong.zhang@intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The alternatives code checks only the first byte whether it is a NOP, but
with NOPs in front of the payload and having actual instructions after it
breaks the "optimized' test.
Make sure to scan all bytes before deciding to optimize the NOPs in there.
Reported-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/20180110112815.mgciyf5acwacphkq@pd.tnic
Construct the init thread stack in the linker script rather than doing it
by means of a union so that ia64's init_task.c can be got rid of.
The following symbols are then made available from INIT_TASK_DATA() linker
script macro:
init_thread_union
init_stack
INIT_TASK_DATA() also expands the region to THREAD_SIZE to accommodate the
size of the init stack. init_thread_union is given its own section so that
it can be placed into the stack space in the right order. I'm assuming
that the ia64 ordering is correct and that the task_struct is first and the
thread_info second.
Signed-off-by: David Howells <dhowells@redhat.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Tested-by: Will Deacon <will.deacon@arm.com> (arm64)
Tested-by: Palmer Dabbelt <palmer@sifive.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
With LFENCE now a serializing instruction, use LFENCE_RDTSC in preference
to MFENCE_RDTSC. However, since the kernel could be running under a
hypervisor that does not support writing that MSR, read the MSR back and
verify that the bit has been set successfully. If the MSR can be read
and the bit is set, then set the LFENCE_RDTSC feature, otherwise set the
MFENCE_RDTSC feature.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/20180108220932.12580.52458.stgit@tlendack-t1.amdoffice.net
To aid in speculation control, make LFENCE a serializing instruction
since it has less overhead than MFENCE. This is done by setting bit 1
of MSR 0xc0011029 (DE_CFG). Some families that support LFENCE do not
have this MSR. For these families, the LFENCE instruction is already
serializing.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Paul Turner <pjt@google.com>
Link: https://lkml.kernel.org/r/20180108220921.12580.71694.stgit@tlendack-t1.amdoffice.net
ARCH_HAS_REFCOUNT is no longer marked as broken ('if BROKEN'), so remove
the stale comment regarding it being broken.
Signed-off-by: Eric Biggers <ebiggers@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171229195303.17781-1-ebiggers3@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So one of the constification patches unearthed a type casting fragility
of the underlying code:
276c870547 ("x86/platform/intel-mid: Make 'bt_sfi_data' const")
converted the struct to be const while it is also used as a temporary
container for important data that is used to fill 'parent' and 'name'
fields in struct platform_device_info.
The compiler doesn't notice this due to an explicit type cast that loses
the const - which fragility will be fixed separately.
This type cast turned a seemingly trivial const propagation patch into a
hard to debug data corruptor and crasher bug.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Bhumika Goyal <bhumirks@gmail.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: julia.lawall@lip6.fr
Cc: platform-driver-x86@vger.kernel.org
Link: http://lkml.kernel.org/r/20171228122523.21802-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following code contains dead logic:
162 if (pgd_none(*pgd)) {
163 unsigned long new_p4d_page = __get_free_page(gfp);
164 if (!new_p4d_page)
165 return NULL;
166
167 if (pgd_none(*pgd)) {
168 set_pgd(pgd, __pgd(_KERNPG_TABLE | __pa(new_p4d_page)));
169 new_p4d_page = 0;
170 }
171 if (new_p4d_page)
172 free_page(new_p4d_page);
173 }
There can't be any difference between two pgd_none(*pgd) at L162 and L167,
so it's always false at L171.
Dave Hansen explained:
Yes, the double-test was part of an optimization where we attempted to
avoid using a global spinlock in the fork() path. We would check for
unallocated mid-level page tables without the lock. The lock was only
taken when we needed to *make* an entry to avoid collisions.
Now that it is all single-threaded, there is no chance of a collision,
no need for a lock, and no need for the re-check.
As all these functions are only called during init, mark them __init as
well.
Fixes: 03f4424f34 ("x86/mm/pti: Add functions to clone kernel PMDs")
Signed-off-by: Jike Song <albcamus@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Jiri Koshina <jikos@kernel.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Kees Cook <keescook@google.com>
Cc: Andi Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg KH <gregkh@linux-foundation.org>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Paul Turner <pjt@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180108160341.3461-1-albcamus@gmail.com
This is another case similar to what EFI does: create a new set of
page tables, map some code at a low address, and jump to it. PTI
mistakes this low address for userspace and mistakenly marks it
non-executable in an effort to make it unusable for userspace.
Undo the poison to allow execution.
Fixes: 385ce0ea4c ("x86/mm/pti: Add Kconfig")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Jon Masters <jcm@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jeff Law <law@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: David" <dwmw@amazon.co.uk>
Cc: Nick Clifton <nickc@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180108102805.GK25546@redhat.com
Variable Length Arrays In Structs (VLAIS) is not supported by Clang, and
frowned upon by others.
https://lkml.org/lkml/2013/9/23/500
Here, the VLAIS was used because the size of the bitmap returned from
xen_mc_entry() depended on possibly (based on kernel configuration)
runtime sized data. Rather than declaring args as a VLAIS then calling
sizeof on *args, we calculate the appropriate sizeof args manually.
Further, we can get rid of the #ifdef's and rely on num_possible_cpus()
(thanks to a helpful checkpatch warning from an earlier version of this
patch).
Suggested-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
The header declares this function as __init but is defined in __ref
section.
Signed-off-by: Nick Desaulniers <nick.desaulniers@gmail.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Implement the CPU vulnerabilty show functions for meltdown, spectre_v1 and
spectre_v2.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Link: https://lkml.kernel.org/r/20180107214913.177414879@linutronix.de
s390:
* Two fixes for potential bitmap overruns in the cmma migration code
x86:
* Clear guest provided GPRs to defeat the Project Zero PoC for CVE
2017-5715
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJaUTJ4AAoJEED/6hsPKofohk0IAJAFlMG66u5MxC0kSM61U4Zf
1vkzRwAkBbcN82LpGQKbqabVyTq0F3aLipyOn6WO5SN0K5m+OI2OV/aAroPyX8bI
F7nWIqTXLhJ9X6KXINFvyavHMprvWl8PA72tR/B/7GhhfShrZ2wGgqhl0vv/kCUK
/8q+5e693yJqw8ceemin9a6kPJrLpmjeH+Oy24KIlGbvJWV4UrIE86pRHnAnBtg8
L7Vbxn5+ezKmakvBh+zF8NKcD1zHDcmQZHoYFPsQT0vX5GPoYqT2bcO6gsh1Grmp
8ti6KkrnP+j2A/OEna4LBWfwKI/1xHXneB22BYrAxvNjHt+R4JrjaPpx82SEB4Y=
=URMR
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"s390:
- Two fixes for potential bitmap overruns in the cmma migration code
x86:
- Clear guest provided GPRs to defeat the Project Zero PoC for CVE
2017-5715"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: vmx: Scrub hardware GPRs at VM-exit
KVM: s390: prevent buffer overrun on memory hotplug during migration
KVM: s390: fix cmma migration for multiple memory slots
Add the bug bits for spectre v1/2 and force them unconditionally for all
cpus.
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: Paul Turner <pjt@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/1515239374-23361-2-git-send-email-dwmw@amazon.co.uk
EFI_OLD_MEMMAP's efi_call_phys_prolog() calls set_pgd() with swapper PGD that
has PAGE_USER set, which makes PTI set NX on it, and therefore EFI can't
execute it's code.
Fix that by forcefully clearing _PAGE_NX from the PGD (this can't be done
by the pgprot API).
_PAGE_NX will be automatically reintroduced in efi_call_phys_epilog(), as
_set_pgd() will again notice that this is _PAGE_USER, and set _PAGE_NX on
it.
Tested-by: Dimitri Sivanich <sivanich@hpe.com>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/nycvar.YFH.7.76.1801052215460.11852@cbobk.fhfr.pm
Instead of blacklisting all model 79 CPUs when attempting a late
microcode loading, limit that only to CPUs with microcode revisions <
0x0b000021 because only on those late loading may cause a system hang.
For such processors either:
a) a BIOS update which might contain a newer microcode revision
or
b) the early microcode loading method
should be considered.
Processors with revisions 0x0b000021 or higher will not experience such
hangs.
For more details, see erratum BDF90 in document #334165 (Intel Xeon
Processor E7-8800/4800 v4 Product Family Specification Update) from
September 2017.
[ bp: Heavily massage commit message and pr_* statements. ]
Fixes: 723f2828a9 ("x86/microcode/intel: Disable late loading on model 79")
Signed-off-by: Jia Zhang <qianyue.zj@alibaba-inc.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Cc: <stable@vger.kernel.org> # v4.14
Link: http://lkml.kernel.org/r/1514772287-92959-1-git-send-email-qianyue.zj@alibaba-inc.com
Recent changes made a bit of an inconsistent mess out of arch/x86/events/msr.c,
fix it:
- re-align the initialization tables to be vertically aligned and readable again
- harmonize comment style in terms of punctuation, capitalization and spelling
- use curly braces for multi-condition branches
- remove extra newlines
- simplify the code a bit
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1515169132-3980-1-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds support for the Digital Readout provided by the
IA32_THERM_STATUS MSR (0x19C) on Intel X86 processors. The readout
shows the number of degrees Celcius to the TCC (critical temperature)
supported by the processor. Thus, the larger, the better.
The perf_event support is provided via the msr PMU. The new
logical event is called cpu_thermal_margin. It comes with a unit and
snapshot files. The event shows the current temprature distance (margin).
It is not an accumulating event. The unit is degrees C. The event
is provided per logical CPU to make things simpler but it is the
same for both hyper-threads sharing a physical core.
$ perf stat -I 1000 -a -A -e msr/cpu_thermal_margin/
This will print the temperature for all logical CPUs.
time CPU counts unit events
1.000123741 CPU0 38 C msr/cpu_thermal_margin/
1.000161837 CPU1 37 C msr/cpu_thermal_margin/
1.000187906 CPU2 36 C msr/cpu_thermal_margin/
1.000189046 CPU3 39 C msr/cpu_thermal_margin/
1.000283044 CPU4 40 C msr/cpu_thermal_margin/
1.000344297 CPU5 40 C msr/cpu_thermal_margin/
1.000365832 CPU6 39 C msr/cpu_thermal_margin/
...
In case the temperature margin cannot be read, the reported value would be -1.
Works on all processors supporting the Digital Readout (dtherm in cpuinfo)
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1515169132-3980-1-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull more x86 pti fixes from Thomas Gleixner:
"Another small stash of fixes for fallout from the PTI work:
- Fix the modules vs. KASAN breakage which was caused by making
MODULES_END depend of the fixmap size. That was done when the cpu
entry area moved into the fixmap, but now that we have a separate
map space for that this is causing more issues than it solves.
- Use the proper cache flush methods for the debugstore buffers as
they are mapped/unmapped during runtime and not statically mapped
at boot time like the rest of the cpu entry area.
- Make the map layout of the cpu_entry_area consistent for 4 and 5
level paging and fix the KASLR vaddr_end wreckage.
- Use PER_CPU_EXPORT for per cpu variable and while at it unbreak
nvidia gfx drivers by dropping the GPL export. The subject line of
the commit tells it the other way around, but I noticed that too
late.
- Fix the ASM alternative macros so they can be used in the middle of
an inline asm block.
- Rename the BUG_CPU_INSECURE flag to BUG_CPU_MELTDOWN so the attack
vector is properly identified. The Spectre mitigations will come
with their own bug bits later"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/pti: Rename BUG_CPU_INSECURE to BUG_CPU_MELTDOWN
x86/alternatives: Add missing '\n' at end of ALTERNATIVE inline asm
x86/tlb: Drop the _GPL from the cpu_tlbstate export
x86/events/intel/ds: Use the proper cache flush method for mapping ds buffers
x86/kaslr: Fix the vaddr_end mess
x86/mm: Map cpu_entry_area at the same place on 4/5 level
x86/mm: Set MODULES_END to 0xffffffffff000000
Pull EFI updates from Thomas Gleixner:
- A fix for a add_efi_memmap parameter regression which ensures that
the parameter is parsed before it is used.
- Reinstate the virtual capsule mapping as the cached copy turned out
to break Quark and other things
- Remove Matt Fleming as EFI co-maintainer. He stepped back a few days
ago. Thanks Matt for all your great work!
* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Remove Matt Fleming as EFI co-maintainer
efi/capsule-loader: Reinstate virtual capsule mapping
x86/efi: Fix kernel param add_efi_memmap regression
Guest GPR values are live in the hardware GPRs at VM-exit. Do not
leave any guest values in hardware GPRs after the guest GPR values are
saved to the vcpu_vmx structure.
This is a partial mitigation for CVE 2017-5715 and CVE 2017-5753.
Specifically, it defeats the Project Zero PoC for CVE 2017-5715.
Suggested-by: Eric Northup <digitaleric@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Eric Northup <digitaleric@google.com>
Reviewed-by: Benjamin Serebrin <serebrin@google.com>
Reviewed-by: Andrew Honig <ahonig@google.com>
[Paolo: Add AMD bits, Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use the name associated with the particular attack which needs page table
isolation for mitigation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
Cc: Jiri Koshina <jikos@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Lutomirski <luto@amacapital.net>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: Greg KH <gregkh@linux-foundation.org>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801051525300.1724@nanos
acpi_get_override_irq() followed by acpi_register_gsi() returns negative
error code on failure.
Propagate it from acpi_gsi_to_irq() to callers.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
[ rjw : Subject/changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Where an ALTERNATIVE is used in the middle of an inline asm block, this
would otherwise lead to the following instruction being appended directly
to the trailing ".popsection", and a failed compile.
Fixes: 9cebed423c ("x86, alternative: Use .pushsection/.popsection")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: Rik van Riel <riel@redhat.com>
Cc: ak@linux.intel.com
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Turner <pjt@google.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linux-foundation.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180104143710.8961-8-dwmw@amazon.co.uk
The recent changes for PTI touch cpu_tlbstate from various tlb_flush
inlines. cpu_tlbstate is exported as GPL symbol, so this causes a
regression when building out of tree drivers for certain graphics cards.
Aside of that the export was wrong since it was introduced as it should
have been EXPORT_PER_CPU_SYMBOL_GPL().
Use the correct PER_CPU export and drop the _GPL to restore the previous
state which allows users to utilize the cards they payed for.
As always I'm really thrilled to make this kind of change to support the
#friends (or however the hot hashtag of today is spelled) from that closet
sauce graphics corp.
Fixes: 1e02ce4ccc ("x86: Store a per-cpu shadow copy of CR4")
Fixes: 6fd166aae7 ("x86/mm: Use/Fix PCID to optimize user/kernel switches")
Reported-by: Kees Cook <keescook@google.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Thomas reported the following warning:
BUG: using smp_processor_id() in preemptible [00000000] code: ovsdb-server/4498
caller is native_flush_tlb_single+0x57/0xc0
native_flush_tlb_single+0x57/0xc0
__set_pte_vaddr+0x2d/0x40
set_pte_vaddr+0x2f/0x40
cea_set_pte+0x30/0x40
ds_update_cea.constprop.4+0x4d/0x70
reserve_ds_buffers+0x159/0x410
x86_reserve_hardware+0x150/0x160
x86_pmu_event_init+0x3e/0x1f0
perf_try_init_event+0x69/0x80
perf_event_alloc+0x652/0x740
SyS_perf_event_open+0x3f6/0xd60
do_syscall_64+0x5c/0x190
set_pte_vaddr is used to map the ds buffers into the cpu entry area, but
there are two problems with that:
1) The resulting flush is not supposed to be called in preemptible context
2) The cpu entry area is supposed to be per CPU, but the debug store
buffers are mapped for all CPUs so these mappings need to be flushed
globally.
Add the necessary preemption protection across the mapping code and flush
TLBs globally.
Fixes: c1961a4631 ("x86/events/intel/ds: Map debug buffers in cpu_entry_area")
Reported-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Thomas Zeitlhofer <thomas.zeitlhofer+lkml@ze-it.at>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180104170712.GB3040@hirez.programming.kicks-ass.net
vaddr_end for KASLR is only documented in the KASLR code itself and is
adjusted depending on config options. So it's not surprising that a change
of the memory layout causes KASLR to have the wrong vaddr_end. This can map
arbitrary stuff into other areas causing hard to understand problems.
Remove the whole ifdef magic and define the start of the cpu_entry_area to
be the end of the KASLR vaddr range.
Add documentation to that effect.
Fixes: 92a0f81d89 ("x86/cpu_entry_area: Move it out of the fixmap")
Reported-by: Benjamin Gilbert <benjamin.gilbert@coreos.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Benjamin Gilbert <benjamin.gilbert@coreos.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable <stable@vger.kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>,
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801041320360.1771@nanos
There is no reason for 4 and 5 level pagetables to have a different
layout. It just makes determining vaddr_end for KASLR harder than
necessary.
Fixes: 92a0f81d89 ("x86/cpu_entry_area: Move it out of the fixmap")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Benjamin Gilbert <benjamin.gilbert@coreos.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: stable <stable@vger.kernel.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Garnier <thgarnie@google.com>,
Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801041320360.1771@nanos
Since f06bdd4001 ("x86/mm: Adapt MODULES_END based on fixmap section size")
kasan_mem_to_shadow(MODULES_END) could be not aligned to a page boundary.
So passing page unaligned address to kasan_populate_zero_shadow() have two
possible effects:
1) It may leave one page hole in supposed to be populated area. After commit
21506525fb ("x86/kasan/64: Teach KASAN about the cpu_entry_area") that
hole happens to be in the shadow covering fixmap area and leads to crash:
BUG: unable to handle kernel paging request at fffffbffffe8ee04
RIP: 0010:check_memory_region+0x5c/0x190
Call Trace:
<NMI>
memcpy+0x1f/0x50
ghes_copy_tofrom_phys+0xab/0x180
ghes_read_estatus+0xfb/0x280
ghes_notify_nmi+0x2b2/0x410
nmi_handle+0x115/0x2c0
default_do_nmi+0x57/0x110
do_nmi+0xf8/0x150
end_repeat_nmi+0x1a/0x1e
Note, the crash likely disappeared after commit 92a0f81d89, which
changed kasan_populate_zero_shadow() call the way it was before
commit 21506525fb.
2) Attempt to load module near MODULES_END will fail, because
__vmalloc_node_range() called from kasan_module_alloc() will hit the
WARN_ON(!pte_none(*pte)) in the vmap_pte_range() and bail out with error.
To fix this we need to make kasan_mem_to_shadow(MODULES_END) page aligned
which means that MODULES_END should be 8*PAGE_SIZE aligned.
The whole point of commit f06bdd4001 was to move MODULES_END down if
NR_CPUS is big, so the cpu_entry_area takes a lot of space.
But since 92a0f81d89 ("x86/cpu_entry_area: Move it out of the fixmap")
the cpu_entry_area is no longer in fixmap, so we could just set
MODULES_END to a fixed 8*PAGE_SIZE aligned address.
Fixes: f06bdd4001 ("x86/mm: Adapt MODULES_END based on fixmap section size")
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Thomas Garnier <thgarnie@google.com>
Link: https://lkml.kernel.org/r/20171228160620.23818-1-aryabinin@virtuozzo.com
Pull x86 page table isolation fixes from Thomas Gleixner:
"A couple of urgent fixes for PTI:
- Fix a PTE mismatch between user and kernel visible mapping of the
cpu entry area (differs vs. the GLB bit) and causes a TLB mismatch
MCE on older AMD K8 machines
- Fix the misplaced CR3 switch in the SYSCALL compat entry code which
causes access to unmapped kernel memory resulting in double faults.
- Fix the section mismatch of the cpu_tss_rw percpu storage caused by
using a different mechanism for declaration and definition.
- Two fixes for dumpstack which help to decode entry stack issues
better
- Enable PTI by default in Kconfig. We should have done that earlier,
but it slipped through the cracks.
- Exclude AMD from the PTI enforcement. Not necessarily a fix, but if
AMD is so confident that they are not affected, then we should not
burden users with the overhead"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/process: Define cpu_tss_rw in same section as declaration
x86/pti: Switch to kernel CR3 at early in entry_SYSCALL_compat()
x86/dumpstack: Print registers for first stack frame
x86/dumpstack: Fix partial register dumps
x86/pti: Make sure the user/kernel PTEs match
x86/cpu, x86/pti: Do not enable PTI on AMD processors
x86/pti: Enable PTI by default
The preparation for PTI which added CR3 switching to the entry code
misplaced the CR3 switch in entry_SYSCALL_compat().
With PTI enabled the entry code tries to access a per cpu variable after
switching to kernel GS. This fails because that variable is not mapped to
user space. This results in a double fault and in the worst case a kernel
crash.
Move the switch ahead of the access and clobber RSP which has been saved
already.
Fixes: 8a09317b89 ("x86/mm/pti: Prepare the x86/entry assembly code for entry/exit CR3 switching")
Reported-by: Lars Wendler <wendler.lars@web.de>
Reported-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Betkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>,
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
Cc: Peter Zijlstra <peterz@infradead.org>,
Cc: Greg KH <gregkh@linuxfoundation.org>, ,
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>,
Cc: Juergen Gross <jgross@suse.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801031949200.1957@nanos
In the stack dump code, if the frame after the starting pt_regs is also
a regs frame, the registers don't get printed. Fix that.
Reported-by: Andy Lutomirski <luto@amacapital.net>
Tested-by: Alexander Tsoy <alexander@tsoy.me>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toralf Förster <toralf.foerster@gmx.de>
Cc: stable@vger.kernel.org
Fixes: 3b3fa11bc7 ("x86/dumpstack: Print any pt_regs found on the stack")
Link: http://lkml.kernel.org/r/396f84491d2f0ef64eda4217a2165f5712f6a115.1514736742.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The show_regs_safe() logic is wrong. When there's an iret stack frame,
it prints the entire pt_regs -- most of which is random stack data --
instead of just the five registers at the end.
show_regs_safe() is also poorly named: the on_stack() checks aren't for
safety. Rename the function to show_regs_if_on_stack() and add a
comment to explain why the checks are needed.
These issues were introduced with the "partial register dump" feature of
the following commit:
b02fcf9ba1 ("x86/unwinder: Handle stack overflows more gracefully")
That patch had gone through a few iterations of development, and the
above issues were artifacts from a previous iteration of the patch where
'regs' pointed directly to the iret frame rather than to the (partially
empty) pt_regs.
Tested-by: Alexander Tsoy <alexander@tsoy.me>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Toralf Förster <toralf.foerster@gmx.de>
Cc: stable@vger.kernel.org
Fixes: b02fcf9ba1 ("x86/unwinder: Handle stack overflows more gracefully")
Link: http://lkml.kernel.org/r/5b05b8b344f59db2d3d50dbdeba92d60f2304c54.1514736742.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Meelis reported that his K8 Athlon64 emits MCE warnings when PTI is
enabled:
[Hardware Error]: Error Addr: 0x0000ffff81e000e0
[Hardware Error]: MC1 Error: L1 TLB multimatch.
[Hardware Error]: cache level: L1, tx: INSN
The address is in the entry area, which is mapped into kernel _AND_ user
space. That's special because we switch CR3 while we are executing
there.
User mapping:
0xffffffff81e00000-0xffffffff82000000 2M ro PSE GLB x pmd
Kernel mapping:
0xffffffff81000000-0xffffffff82000000 16M ro PSE x pmd
So the K8 is complaining that the TLB entries differ. They differ in the
GLB bit.
Drop the GLB bit when installing the user shared mapping.
Fixes: 6dc72c3cbc ("x86/mm/pti: Share entry text PMD")
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Meelis Roos <mroos@linux.ee>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1801031407180.1957@nanos
AMD processors are not subject to the types of attacks that the kernel
page table isolation feature protects against. The AMD microarchitecture
does not allow memory references, including speculative references, that
access higher privileged data when running in a lesser privileged mode
when that access would result in a page fault.
Disable page table isolation by default on AMD processors by not setting
the X86_BUG_CPU_INSECURE feature, which controls whether X86_FEATURE_PTI
is set.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20171227054354.20369.94587.stgit@tlendack-t1.amdoffice.net
Commit:
82c3768b8d ("efi/capsule-loader: Use a cached copy of the capsule header")
... refactored the capsule loading code that maps the capsule header,
to avoid having to map it several times.
However, as it turns out, the vmap() call we ended up removing did not
just map the header, but the entire capsule image, and dropping this
virtual mapping breaks capsules that are processed by the firmware
immediately (i.e., without a reboot).
Unfortunately, that change was part of a larger refactor that allowed
a quirk to be implemented for Quark, which has a non-standard memory
layout for capsules, and we have slightly painted ourselves into a
corner by allowing quirk code to mangle the capsule header and memory
layout.
So we need to fix this without breaking Quark. Fortunately, Quark does
not appear to care about the virtual mapping, and so we can simply
do a partial revert of commit:
2a457fb31d ("efi/capsule-loader: Use page addresses rather than struct page pointers")
... and create a vmap() mapping of the entire capsule (including header)
based on the reinstated struct page array, unless running on Quark, in
which case we pass the capsule header copy as before.
Reported-by: Ge Song <ge.song@hxt-semitech.com>
Tested-by: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Tested-by: Ge Song <ge.song@hxt-semitech.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: <stable@vger.kernel.org>
Cc: Dave Young <dyoung@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Fixes: 82c3768b8d ("efi/capsule-loader: Use a cached copy of the capsule header")
Link: http://lkml.kernel.org/r/20180102172110.17018-3-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'add_efi_memmap' is an early param, but do_add_efi_memmap() has no
chance to run because the code path is before parse_early_param().
I believe it worked when the param was introduced but probably later
some other changes caused the wrong order and nobody noticed it.
Move efi_memblock_x86_reserve_range() after parse_early_param()
to fix it.
Signed-off-by: Dave Young <dyoung@redhat.com>
Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Bryan O'Donoghue <pure.logic@nexus-software.ie>
Cc: Ge Song <ge.song@hxt-semitech.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20180102172110.17018-2-ard.biesheuvel@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Thomas Gleixner:
"A couple of fixlets for x86:
- Fix the ESPFIX double fault handling for 5-level pagetables
- Fix the commandline parsing for 'apic=' on 32bit systems and update
documentation
- Make zombie stack traces reliable
- Fix kexec with stack canary
- Fix the delivery mode for APICs which was missed when the x86
vector management was converted to single target delivery. Caused a
regression due to the broken hardware which ignores affinity
settings in lowest prio delivery mode.
- Unbreak modules when AMD memory encryption is enabled
- Remove an unused parameter of prepare_switch_to"
* 'x86/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Switch all APICs to Fixed delivery mode
x86/apic: Update the 'apic=' description of setting APIC driver
x86/apic: Avoid wrong warning when parsing 'apic=' in X86-32 case
x86-32: Fix kexec with stack canary (CONFIG_CC_STACKPROTECTOR)
x86: Remove unused parameter of prepare_switch_to
x86/stacktrace: Make zombie stack traces reliable
x86/mm: Unbreak modules that use the DMA API
x86/build: Make isoimage work on Debian
x86/espfix/64: Fix espfix double-fault handling on 5-level systems
Pull x86 page table isolation fixes from Thomas Gleixner:
"Four patches addressing the PTI fallout as discussed and debugged
yesterday:
- Remove stale and pointless TLB flush invocations from the hotplug
code
- Remove stale preempt_disable/enable from __native_flush_tlb()
- Plug the memory leak in the write_ldt() error path"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ldt: Make LDT pgtable free conditional
x86/ldt: Plug memory leak in error path
x86/mm: Remove preempt_disable/enable() from __native_flush_tlb()
x86/smpboot: Remove stale TLB flush invocations
Pull perf fixes from Thomas Gleixner:
- plug a memory leak in the intel pmu init code
- clang fixes
- tooling fix to avoid including kernel headers
- a fix for jvmti to generate correct debug information for inlined
code
- replace backtick with a regular shell function
- fix the build in hardened environments
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Plug memory leak in intel_pmu_init()
x86/asm: Allow again using asm.h when building for the 'bpf' clang target
tools arch s390: Do not include header files from the kernel sources
perf jvmti: Generate correct debug information for inlined code
perf tools: Fix up build in hardened environments
perf tools: Use shell function for perl cflags retrieval
Pull irq fixes from Thomas Gleixner:
"A rather large update after the kaisered maintainer finally found time
to handle regression reports.
- The larger part addresses a regression caused by the x86 vector
management rework.
The reservation based model does not work reliably for MSI
interrupts, if they cannot be masked (yes, yet another hw
engineering trainwreck). The reason is that the reservation mode
assigns a dummy vector when the interrupt is allocated and switches
to a real vector when the interrupt is requested.
If the MSI entry cannot be masked then the initialization might
raise an interrupt before the interrupt is requested, which ends up
as spurious interrupt and causes device malfunction and worse. The
fix is to exclude MSI interrupts which do not support masking from
reservation mode and assign a real vector right away.
- Extend the extra lockdep class setup for nested interrupts with a
class for the recently added irq_desc::request_mutex so lockdep can
differeniate and does not emit false positive warnings.
- A ratelimit guard for the bad irq printout so in case a bad irq
comes back immediately the system does not drown in dmesg spam"
* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq/msi, x86/vector: Prevent reservation mode for non maskable MSI
genirq/irqdomain: Rename early argument of irq_domain_activate_irq()
x86/vector: Use IRQD_CAN_RESERVE flag
genirq: Introduce IRQD_CAN_RESERVE flag
genirq/msi: Handle reactivation only on success
gpio: brcmstb: Make really use of the new lockdep class
genirq: Guard handle_bad_irq log messages
kernel/irq: Extend lockdep class for request mutex
Andy prefers to be paranoid about the pagetable free in the error path of
write_ldt(). Make it conditional and warn whenever the installment of a
secondary LDT fails.
Requested-by: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The error path in write_ldt() tries to free 'old_ldt' instead of the newly
allocated 'new_ldt', resulting in a memory leak. It also misses to clean up a
half populated LDT pagetable, which is not a leak as it gets cleaned up
when the process exits.
Free both the potentially half populated LDT pagetable and the newly
allocated LDT struct. This can be done unconditionally because once an LDT
is mapped subsequent maps will succeed, because the PTE page is already
populated and the two LDTs fit into that single page.
Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: f55f0501cb ("x86/pti: Put the LDT in its own PGD if PTI is on")
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1712311121340.1899@nanos
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The preempt_disable/enable() pair in __native_flush_tlb() was added in
commit:
5cf0791da5 ("x86/mm: Disable preemption during CR3 read+write")
... to protect the UP variant of flush_tlb_mm_range().
That preempt_disable/enable() pair should have been added to the UP variant
of flush_tlb_mm_range() instead.
The UP variant was removed with commit:
ce4a4e565f ("x86/mm: Remove the UP asm/tlbflush.h code, always use the (formerly) SMP code")
... but the preempt_disable/enable() pair stayed around.
The latest change to __native_flush_tlb() in commit:
6fd166aae7 ("x86/mm: Use/Fix PCID to optimize user/kernel switches")
... added an access to a per CPU variable outside the preempt disabled
regions, which makes no sense at all. __native_flush_tlb() must always
be called with at least preemption disabled.
Remove the preempt_disable/enable() pair and add a WARN_ON_ONCE() to catch
bad callers independent of the smp_processor_id() debugging.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171230211829.679325424@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
smpboot_setup_warm_reset_vector() and smpboot_restore_warm_reset_vector()
invoke local_flush_tlb() for no obvious reason.
Digging in history revealed that the original code in the 2.1 era added
those because the code manipulated a swapper_pg_dir pagetable entry. The
pagetable manipulation was removed long ago in the 2.3 timeframe, but the
TLB flush invocations stayed around forever.
Remove them along with the pointless pr_debug()s which come from the same 2.1
change.
Reported-by: Dominik Brodowski <linux@dominikbrodowski.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20171230211829.586548655@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 page table isolation updates from Thomas Gleixner:
"This is the final set of enabling page table isolation on x86:
- Infrastructure patches for handling the extra page tables.
- Patches which map the various bits and pieces which are required to
get in and out of user space into the user space visible page
tables.
- The required changes to have CR3 switching in the entry/exit code.
- Optimizations for the CR3 switching along with documentation how
the ASID/PCID mechanism works.
- Updates to dump pagetables to cover the user space page tables for
W+X scans and extra debugfs files to analyze both the kernel and
the user space visible page tables
The whole functionality is compile time controlled via a config switch
and can be turned on/off on the command line as well"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (32 commits)
x86/ldt: Make the LDT mapping RO
x86/mm/dump_pagetables: Allow dumping current pagetables
x86/mm/dump_pagetables: Check user space page table for WX pages
x86/mm/dump_pagetables: Add page table directory to the debugfs VFS hierarchy
x86/mm/pti: Add Kconfig
x86/dumpstack: Indicate in Oops whether PTI is configured and enabled
x86/mm: Clarify the whole ASID/kernel PCID/user PCID naming
x86/mm: Use INVPCID for __native_flush_tlb_single()
x86/mm: Optimize RESTORE_CR3
x86/mm: Use/Fix PCID to optimize user/kernel switches
x86/mm: Abstract switching CR3
x86/mm: Allow flushing for future ASID switches
x86/pti: Map the vsyscall page if needed
x86/pti: Put the LDT in its own PGD if PTI is on
x86/mm/64: Make a full PGD-entry size hole in the memory map
x86/events/intel/ds: Map debug buffers in cpu_entry_area
x86/cpu_entry_area: Add debugstore entries to cpu_entry_area
x86/mm/pti: Map ESPFIX into user space
x86/mm/pti: Share entry text PMD
x86/entry: Align entry text section to PMD boundary
...
The new reservation mode for interrupts assigns a dummy vector when the
interrupt is allocated and assigns a real vector when the interrupt is
requested. The reservation mode prevents vector pressure when devices with
a large amount of queues/interrupts are initialized, but only a minimal
subset of those queues/interrupts is actually used.
This mode has an issue with MSI interrupts which cannot be masked. If the
driver is not careful or the hardware emits an interrupt before the device
irq is requestd by the driver then the interrupt ends up on the dummy
vector as a spurious interrupt which can cause malfunction of the device or
in the worst case a lockup of the machine.
Change the logic for the reservation mode so that the early activation of
MSI interrupts checks whether:
- the device is a PCI/MSI device
- the reservation mode of the underlying irqdomain is activated
- PCI/MSI masking is globally enabled
- the PCI/MSI device uses either MSI-X, which supports masking, or
MSI with the maskbit supported.
If one of those conditions is false, then clear the reservation mode flag
in the irq data of the interrupt and invoke irq_domain_activate_irq() with
the reserve argument cleared. In the x86 vector code, clear the can_reserve
flag in the vector allocation data so a subsequent free_irq() won't create
the same situation again. The interrupt stays assigned to a real vector
until pci_disable_msi() is invoked and all allocations are undone.
Fixes: 4900be8360 ("x86/vector/msi: Switch to global reservation mode")
Reported-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291406420.1899@nanos
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712291409460.1899@nanos
The 'early' argument of irq_domain_activate_irq() is actually used to
denote reservation mode. To avoid confusion, rename it before abuse
happens.
No functional change.
Fixes: 7249164346 ("genirq/irqdomain: Update irq_domain_ops.activate() signature")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexandru Chirvasitu <achirvasub@gmail.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
Set the new CAN_RESERVE flag when the initial reservation for an interrupt
happens. The flag is used in a subsequent patch to disable reservation mode
for a certain class of MSI devices.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dou Liyang <douly.fnst@cn.fujitsu.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Maciej W. Rozycki <macro@linux-mips.org>
Cc: Mikael Pettersson <mikpelinux@gmail.com>
Cc: Josh Poulson <jopoulso@microsoft.com>
Cc: Mihai Costache <v-micos@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: linux-pci@vger.kernel.org
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Simon Xiao <sixiao@microsoft.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Cc: Jork Loeser <Jork.Loeser@microsoft.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: devel@linuxdriverproject.org
Cc: KY Srinivasan <kys@microsoft.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Sakari Ailus <sakari.ailus@intel.com>,
Cc: linux-media@vger.kernel.org
Some of the APIC incarnations are operating in lowest priority delivery
mode. This worked as long as the vector management code allocated the same
vector on all possible CPUs for each interrupt.
Lowest priority delivery mode does not necessarily respect the affinity
setting and may redirect to some other online CPU. This was documented
somewhere in the old code and the conversion to single target delivery
missed to update the delivery mode of the affected APIC drivers which
results in spurious interrupts on some of the affected CPU/Chipset
combinations.
Switch the APIC drivers over to Fixed delivery mode and remove all
leftovers of lowest priority delivery mode.
Switching to Fixed delivery mode is not a problem on these CPUs because the
kernel already uses Fixed delivery mode for IPIs. The reason for this is
that th SDM explicitely forbids lowest prio mode for IPIs. The reason is
obvious: If the irq routing does not honor destination targets in lowest
prio mode then an IPI targeted at CPU1 might end up on CPU0, which would be
a fatal problem in many cases.
As a consequence of this change, the apic::irq_delivery_mode field is now
pointless, but this needs to be cleaned up in a separate patch.
Fixes: fdba46ffb4 ("x86/apic: Get rid of multi CPU affinity")
Reported-by: vcaputo@pengaru.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: vcaputo@pengaru.com
Cc: Pavel Machek <pavel@ucw.cz>
Link: https://lkml.kernel.org/r/alpine.DEB.2.20.1712281140440.1688@nanos
As per note in 5.2.9 Fixed ACPI Description Table (FADT) chapter of ACPI
specification, on HW-reduced platforma OSPM should ignore fields related
to the ACPI HW register interface, one of which is SCI_INT.
Follow the spec and ignore any configuration done for interrupt line
defined by SCI_INT if FADT specifies a HW-reduced platform.
HW-reduced platforms will still be able to use SCI in case it provides
an override record in MADT table.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
0 is valid hardware interrupt which might be in some cases overridden.
Due to this, switch to INVALID_ACPI_IRQ to mark SCI override not set.
While here, change the type of the variable from int to u32 to match
the GSI type used in the rest of the code.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Commit 49e4b84333 (ACPI: Use correct IRQ when uninstalling ACPI
interrupt handler) brings a new definition for invalid ACPI IRQ,
i.e. INVALID_ACPI_IRQ, which is defined to 0xffffffff (or -1 for
unsigned value).
Get rid of a former one, which was brought in by commit 2c0a6894df
(x86, ACPI, irq: Enhance error handling in function acpi_register_gsi()),
in favour of latter.
To clarify the rationale of changing from INT_MIN to ((unsigned)-1)
definition consider the following:
- IRQ 0 is valid one in hardware, so, better not to use it everywhere
(Linux uses 0 as NO IRQ, though it's another story)
- INT_MIN splits the range into two, while 0xffffffff reserves only the
last item
- when type casting is done in most cases 0xff, 0xffff is naturally used
as a marker of invalid HW IRQ: for example PCI INT line 0xff means
no IRQ assigned by BIOS
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
For better readability compare input to something considered settled
down. Additionally move it to one line (while it's slightly longer
80 characters it makes readability better).
No functional change intended.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
There are two consumers of apic=:
apic_set_verbosity() for setting the APIC debug level;
parse_apic() for registering APIC driver by hand.
X86-32 supports both of them, but sometimes, kernel issues a weird warning.
eg: when kernel was booted up with 'apic=bigsmp' in command line,
early_param would warn like that:
...
[ 0.000000] APIC Verbosity level bigsmp not recognised use apic=verbose or apic=debug
[ 0.000000] Malformed early option 'apic'
...
Wrap the warning code in CONFIG_X86_64 case to avoid this.
Signed-off-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: peterz@infradead.org
Cc: rdunlap@infradead.org
Cc: corbet@lwn.net
Link: https://lkml.kernel.org/r/20171204040313.24824-1-douly.fnst@cn.fujitsu.com
Commit e802a51ede ("x86/idt: Consolidate IDT invalidation") cleaned up
and unified the IDT invalidation that existed in a couple of places. It
changed no actual real code.
Despite not changing any actual real code, it _did_ change code generation:
by implementing the common idt_invalidate() function in
archx86/kernel/idt.c, it made the use of the function in
arch/x86/kernel/machine_kexec_32.c be a real function call rather than an
(accidental) inlining of the function.
That, in turn, exposed two issues:
- in load_segments(), we had incorrectly reset all the segment
registers, which then made the stack canary load (which gcc does
using offset of %gs) cause a trap. Instead of %gs pointing to the
stack canary, it will be the normal zero-based kernel segment, and
the stack canary load will take a page fault at address 0x14.
- to make this even harder to debug, we had invalidated the GDT just
before calling idt_invalidate(), which meant that the fault happened
with an invalid GDT, which in turn causes a triple fault and
immediate reboot.
Fix this by
(a) not reloading the special segments in load_segments(). We currently
don't do any percpu accesses (which would require %fs on x86-32) in
this area, but there's no reason to think that we might not want to
do them, and like %gs, it's pointless to break it.
(b) doing idt_invalidate() before invalidating the GDT, to keep things
at least _slightly_ more debuggable for a bit longer. Without a
IDT, traps will not work. Without a GDT, traps also will not work,
but neither will any segment loads etc. So in a very real sense,
the GDT is even more core than the IDT.
Fixes: e802a51ede ("x86/idt: Consolidate IDT invalidation")
Reported-and-tested-by: Alexandru Chirvasitu <achirvasub@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/alpine.LFD.2.21.1712271143180.8572@i7.lan
A recent commit introduced an extra merge_attr() call in the skylake
branch, which causes a memory leak.
Store the pointer to the extra allocated memory and free it at the end of
the function.
Fixes: a5df70c354 ("perf/x86: Only show format attributes when supported")
Reported-by: Tommi Rantala <tommi.t.rantala@nokia.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Now that the LDT mapping is in a known area when PAGE_TABLE_ISOLATION is
enabled its a primary target for attacks, if a user space interface fails
to validate a write address correctly. That can never happen, right?
The SDM states:
If the segment descriptors in the GDT or an LDT are placed in ROM, the
processor can enter an indefinite loop if software or the processor
attempts to update (write to) the ROM-based segment descriptors. To
prevent this problem, set the accessed bits for all segment descriptors
placed in a ROM. Also, remove operating-system or executive code that
attempts to modify segment descriptors located in ROM.
So its a valid approach to set the ACCESS bit when setting up the LDT entry
and to map the table RO. Fixup the selftest so it can handle that new mode.
Remove the manual ACCESS bit setter in set_tls_desc() as this is now
pointless. Folded the patch from Peter Ziljstra.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add two debugfs files which allow to dump the pagetable of the current
task.
current_kernel dumps the regular page table. This is the page table which
is normally shared between kernel and user space. If kernel page table
isolation is enabled this is the kernel space mapping.
If kernel page table isolation is enabled the second file, current_user,
dumps the user space page table.
These files allow to verify the resulting page tables for page table
isolation, but even in the normal case its useful to be able to inspect
user space page tables of current for debugging purposes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ptdump_walk_pgd_level_checkwx() checks the kernel page table for WX pages,
but does not check the PAGE_TABLE_ISOLATION user space page table.
Restructure the code so that dmesg output is selected by an explicit
argument and not implicit via checking the pgd argument for !NULL.
Add the check for the user space page table.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The upcoming support for dumping the kernel and the user space page tables
of the current process would create more random files in the top level
debugfs directory.
Add a page table directory and move the existing file to it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CONFIG_PAGE_TABLE_ISOLATION is relatively new and intrusive feature that may
still have some corner cases which could take some time to manifest and be
fixed. It would be useful to have Oops messages indicate whether it was
enabled for building the kernel, and whether it was disabled during boot.
Example of fully enabled:
Oops: 0001 [#1] SMP PTI
Example of enabled during build, but disabled during boot:
Oops: 0001 [#1] SMP NOPTI
We can decide to remove this after the feature has been tested in the field
long enough.
[ tglx: Made it use boot_cpu_has() as requested by Borislav ]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: bpetkov@suse.de
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: jkosina@suse.cz
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This uses INVPCID to shoot down individual lines of the user mapping
instead of marking the entire user map as invalid. This
could/might/possibly be faster.
This for sure needs tlb_single_page_flush_ceiling to be redetermined;
esp. since INVPCID is _slow_.
A detailed performance analysis is available here:
https://lkml.kernel.org/r/3062e486-3539-8a1f-5724-16199420be71@intel.com
[ Peterz: Split out from big combo patch ]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most NMI/paranoid exceptions will not in fact change pagetables and would
thus not require TLB flushing, however RESTORE_CR3 uses flushing CR3
writes.
Restores to kernel PCIDs can be NOFLUSH, because we explicitly flush the
kernel mappings and now that we track which user PCIDs need flushing we can
avoid those too when possible.
This does mean RESTORE_CR3 needs an additional scratch_reg, luckily both
sites have plenty available.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We can use PCID to retain the TLBs across CR3 switches; including those now
part of the user/kernel switch. This increases performance of kernel
entry/exit at the cost of more expensive/complicated TLB flushing.
Now that we have two address spaces, one for kernel and one for user space,
we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID
(just like we use the PFN LSB for the PGD). Since we do TLB invalidation
from kernel space, the existing code will only invalidate the kernel PCID,
we augment that by marking the corresponding user PCID invalid, and upon
switching back to userspace, use a flushing CR3 write for the switch.
In order to access the user_pcid_flush_mask we use PER_CPU storage, which
means the previously established SWAPGS vs CR3 ordering is now mandatory
and required.
Having to do this memory access does require additional registers, most
sites have a functioning stack and we can spill one (RAX), sites without
functional stack need to otherwise provide the second scratch register.
Note: PCID is generally available on Intel Sandybridge and later CPUs.
Note: Up until this point TLB flushing was broken in this series.
Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If changing the page tables in such a way that an invalidation of all
contexts (aka. PCIDs / ASIDs) is required, they can be actively invalidated
by:
1. INVPCID for each PCID (works for single pages too).
2. Load CR3 with each PCID without the NOFLUSH bit set
3. Load CR3 with the NOFLUSH bit set for each and do INVLPG for each address.
But, none of these are really feasible since there are ~6 ASIDs (12 with
PAGE_TABLE_ISOLATION) at the time that invalidation is required.
Instead of actively invalidating them, invalidate the *current* context and
also mark the cpu_tlbstate _quickly_ to indicate future invalidation to be
required.
At the next context-switch, look for this indicator
('invalidate_other' being set) invalidate all of the
cpu_tlbstate.ctxs[] entries.
This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.
[ tglx: Folded more fixups from Peter ]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make VSYSCALLs work fully in PTI mode by mapping them properly to the user
space visible page tables.
[ tglx: Hide unused functions (Patch by Arnd Bergmann) ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With PTI enabled, the LDT must be mapped in the usermode tables somewhere.
The LDT is per process, i.e. per mm.
An earlier approach mapped the LDT on context switch into a fixmap area,
but that's a big overhead and exhausted the fixmap space when NR_CPUS got
big.
Take advantage of the fact that there is an address space hole which
provides a completely unused pgd. Use this pgd to manage per-mm LDT
mappings.
This has a down side: the LDT isn't (currently) randomized, and an attack
that can write the LDT is instant root due to call gates (thanks, AMD, for
leaving call gates in AMD64 but designing them wrong so they're only useful
for exploits). This can be mitigated by making the LDT read-only or
randomizing the mapping, either of which is strightforward on top of this
patch.
This will significantly slow down LDT users, but that shouldn't matter for
important workloads -- the LDT is only used by DOSEMU(2), Wine, and very
old libc implementations.
[ tglx: Cleaned it up. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Shrink vmalloc space from 16384TiB to 12800TiB to enlarge the hole starting
at 0xff90000000000000 to be a full PGD entry.
A subsequent patch will use this hole for the pagetable isolation LDT
alias.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The BTS and PEBS buffers both have their virtual addresses programmed into
the hardware. This means that any access to them is performed via the page
tables. The times that the hardware accesses these are entirely dependent
on how the performance monitoring hardware events are set up. In other
words, there is no way for the kernel to tell when the hardware might
access these buffers.
To avoid perf crashes, place 'debug_store' allocate pages and map them into
the cpu_entry_area.
The PEBS fixup buffer does not need this treatment.
[ tglx: Got rid of the kaiser_add_mapping() complication ]
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The Intel PEBS/BTS debug store is a design trainwreck as it expects virtual
addresses which must be visible in any execution context.
So it is required to make these mappings visible to user space when kernel
page table isolation is active.
Provide enough room for the buffer mappings in the cpu_entry_area so the
buffers are available in the user space visible page tables.
At the point where the kernel side entry area is populated there is no
buffer available yet, but the kernel PMD must be populated. To achieve this
set the entries for these buffers to non present.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Map the ESPFIX pages into user space when PTI is enabled.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Share the entry text PMD of the kernel mapping with the user space
mapping. If large pages are enabled this is a single PMD entry and at the
point where it is copied into the user page table the RW bit has not been
cleared yet. Clear it right away so the user space visible map becomes RX.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The (irq)entry text must be visible in the user space page tables. To allow
simple PMD based sharing, make the entry text PMD aligned.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Share the cpu entry area so the user space and kernel space page tables
have the same P4D page.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Force the entry through the trampoline only when PTI is active. Otherwise
go through the normal entry code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Provide infrastructure to:
- find a kernel PMD for a mapping which must be visible to user space for
the entry/exit code to work.
- walk an address range and share the kernel PMD with it.
This reuses a small part of the original KAISER patches to populate the
user space page table.
[ tglx: Made it universally usable so it can be used for any kind of shared
mapping. Add a mechanism to clear specific bits in the user space
visible PMD entry. Folded Andys simplifactions ]
Originally-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In clone_pgd_range() copy the init user PGDs which cover the kernel half of
the address space, so a process has all the required kernel mappings
visible.
[ tglx: Split out from the big kaiser dump and folded Andys simplification ]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Kernel page table isolation requires to have two PGDs. One for the kernel,
which contains the full kernel mapping plus the user space mapping and one
for user space which contains the user space mappings and the minimal set
of kernel mappings which are required by the architecture to be able to
transition from and to user space.
Add the necessary preliminaries.
[ tglx: Split out from the big kaiser dump. EFI fixup from Kirill ]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With PAGE_TABLE_ISOLATION the user portion of the kernel page tables is
poisoned with the NX bit so if the entry code exits with the kernel page
tables selected in CR3, userspace crashes.
But doing so trips the p4d/pgd_bad() checks. Make sure it does not do
that.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the pagetable helper functions do manage the separate user space page
tables.
[ tglx: Split out from the big combo kaiser patch. Folded Andys
simplification and made it out of line as Boris suggested ]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PAGE_TABLE_ISOLATION needs to switch to a different CR3 value when it
enters the kernel and switch back when it exits. This essentially needs to
be done before leaving assembly code.
This is extra challenging because the switching context is tricky: the
registers that can be clobbered can vary. It is also hard to store things
on the stack because there is an established ABI (ptregs) or the stack is
entirely unsafe to use.
Establish a set of macros that allow changing to the user and kernel CR3
values.
Interactions with SWAPGS:
Previous versions of the PAGE_TABLE_ISOLATION code relied on having
per-CPU scratch space to save/restore a register that can be used for the
CR3 MOV. The %GS register is used to index into our per-CPU space, so
SWAPGS *had* to be done before the CR3 switch. That scratch space is gone
now, but the semantic that SWAPGS must be done before the CR3 MOV is
retained. This is good to keep because it is not that hard to do and it
allows to do things like add per-CPU debugging information.
What this does in the NMI code is worth pointing out. NMIs can interrupt
*any* context and they can also be nested with NMIs interrupting other
NMIs. The comments below ".Lnmi_from_kernel" explain the format of the
stack during this situation. Changing the format of this stack is hard.
Instead of storing the old CR3 value on the stack, this depends on the
*regular* register save/restore mechanism and then uses %r14 to keep CR3
during the NMI. It is callee-saved and will not be clobbered by the C NMI
handlers that get called.
[ PeterZ: ESPFIX optimization ]
Based-on-code-from: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Global pages stay in the TLB across context switches. Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.
But, even having these entries in the TLB opens up something that an
attacker can use, such as the double-page-fault attack:
http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
That means that even when PAGE_TABLE_ISOLATION switches page tables
on return to user space the global pages would stay in the TLB cache.
Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.
Suppress global pages via the __supported_pte_mask. The user space
mappings set PAGE_GLOBAL for the minimal kernel mappings which are
required for entry/exit. These mappings are set up manually so the
filtering does not take place.
[ The __supported_pte_mask simplification was written by Thomas Gleixner. ]
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Many x86 CPUs leak information to user space due to missing isolation of
user space and kernel space page tables. There are many well documented
ways to exploit that.
The upcoming software migitation of isolating the user and kernel space
page tables needs a misfeature flag so code can be made runtime
conditional.
Add the BUG bits which indicates that the CPU is affected and add a feature
bit which indicates that the software migitation is enabled.
Assume for now that _ALL_ x86 CPUs are affected by this. Exceptions can be
made later.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 PTI preparatory patches from Thomas Gleixner:
"Todays Advent calendar window contains twentyfour easy to digest
patches. The original plan was to have twenty three matching the date,
but a late fixup made that moot.
- Move the cpu_entry_area mapping out of the fixmap into a separate
address space. That's necessary because the fixmap becomes too big
with NRCPUS=8192 and this caused already subtle and hard to
diagnose failures.
The top most patch is fresh from today and cures a brain slip of
that tall grumpy german greybeard, who ignored the intricacies of
32bit wraparounds.
- Limit the number of CPUs on 32bit to 64. That's insane big already,
but at least it's small enough to prevent address space issues with
the cpu_entry_area map, which have been observed and debugged with
the fixmap code
- A few TLB flush fixes in various places plus documentation which of
the TLB functions should be used for what.
- Rename the SYSENTER stack to CPU_ENTRY_AREA stack as it is used for
more than sysenter now and keeping the name makes backtraces
confusing.
- Prevent LDT inheritance on exec() by moving it to arch_dup_mmap(),
which is only invoked on fork().
- Make vysycall more robust.
- A few fixes and cleanups of the debug_pagetables code. Check
PAGE_PRESENT instead of checking the PTE for 0 and a cleanup of the
C89 initialization of the address hint array which already was out
of sync with the index enums.
- Move the ESPFIX init to a different place to prepare for PTI.
- Several code moves with no functional change to make PTI
integration simpler and header files less convoluted.
- Documentation fixes and clarifications"
* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
x86/cpu_entry_area: Prevent wraparound in setup_cpu_entry_area_ptes() on 32bit
init: Invoke init_espfix_bsp() from mm_init()
x86/cpu_entry_area: Move it out of the fixmap
x86/cpu_entry_area: Move it to a separate unit
x86/mm: Create asm/invpcid.h
x86/mm: Put MMU to hardware ASID translation in one place
x86/mm: Remove hard-coded ASID limit checks
x86/mm: Move the CR3 construction functions to tlbflush.h
x86/mm: Add comments to clarify which TLB-flush functions are supposed to flush what
x86/mm: Remove superfluous barriers
x86/mm: Use __flush_tlb_one() for kernel memory
x86/microcode: Dont abuse the TLB-flush interface
x86/uv: Use the right TLB-flush API
x86/entry: Rename SYSENTER_stack to CPU_ENTRY_AREA_entry_stack
x86/doc: Remove obvious weirdnesses from the x86 MM layout documentation
x86/mm/64: Improve the memory map documentation
x86/ldt: Prevent LDT inheritance on exec
x86/ldt: Rework locking
arch, mm: Allow arch_dup_mmap() to fail
x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
...
The loop which populates the CPU entry area PMDs can wrap around on 32bit
machines when the number of CPUs is small.
It worked wonderful for NR_CPUS=64 for whatever reason and the moron who
wrote that code did not bother to test it with !SMP.
Check for the wraparound to fix it.
Fixes: 92a0f81d89 ("x86/cpu_entry_area: Move it out of the fixmap")
Reported-by: kernel test robot <fengguang.wu@intel.com>
Signed-off-by: Thomas "Feels stupid" Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@alien8.de>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJaPKq7AAoJELDendYovxMvQOEH/2iuLSDI7b5vjPuBCvFjituP
floACKQl3Zp1Xk//DQLwTis02/9cIAOUGM11PmrkEq1lehpXPxIPzyfpx3wbEezd
A9hP71AMojdOIUCxucAGg94kxryv9OgXT6/qggzLlpmEpo7x12dVSPV+LxfcbkqL
zeTi1WEzz9jacfFI5CRvJx68tacIxvxCdKfauq2Yz2AB3BKd2xtMR7j77lycAeSw
KTFaIikKnZ3Aonn/yRUhD89oOp/Kt7XJib3glsAAKgA1GMuqmJsk1yB4Wm3qkpGD
bFSzf51HLl2PRyV5PxlJOfHtyTUKRj1Jf80YQgI2x9jR2LT3pBSI+NZt7Paw4Wc=
=QB74
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"This contains two fixes for running under Xen:
- a fix avoiding resource conflicts between adding mmio areas and
memory hotplug
- a fix setting NX bits in page table entries copied from Xen when
running a PV guest"
* tag 'for-linus-4.15-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/balloon: Mark unallocated host memory as UNUSABLE
x86-64/Xen: eliminate W+X mappings
init_espfix_bsp() needs to be invoked before the page table isolation
initialization. Move it into mm_init() which is the place where pti_init()
will be added.
While at it get rid of the #ifdeffery and provide proper stub functions.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Put the cpu_entry_area into a separate P4D entry. The fixmap gets too big
and 0-day already hit a case where the fixmap PTEs were cleared by
cleanup_highmap().
Aside of that the fixmap API is a pain as it's all backwards.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Separate the cpu_entry_area code out of cpu/common.c and the fixmap.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are effectively two ASID types:
1. The one stored in the mmu_context that goes from 0..5
2. The one programmed into the hardware that goes from 1..6
This consolidates the locations where converting between the two (by doing
a +1) to a single place which gives us a nice place to comment.
PAGE_TABLE_ISOLATION will also need to, given an ASID, know which hardware
ASID to flush for the userspace mapping.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
First, it's nice to remove the magic numbers.
Second, PAGE_TABLE_ISOLATION is going to consume half of the available ASID
space. The space is currently unused, but add a comment to spell out this
new restriction.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For flushing the TLB, the ASID which has been programmed into the hardware
must be known. That differs from what is in 'cpu_tlbstate'.
Add functions to transform the 'cpu_tlbstate' values into to the one
programmed into the hardware (CR3).
It's not easy to include mmu_context.h into tlbflush.h, so just move the
CR3 building over to tlbflush.h.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit:
ec400ddeff ("x86/microcode_intel_early.c: Early update ucode on Intel's CPU")
... grubbed into tlbflush internals without coherent explanation.
Since it says its a precaution and the SDM doesn't mention anything like
this, take it out back.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: fenghua.yu@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since uv_flush_tlb_others() implements flush_tlb_others() which is
about flushing user mappings, we should use __flush_tlb_single(),
which too is about flushing user mappings.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andrew Banman <abanman@hpe.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the kernel oopses while on the trampoline stack, it will print
"<SYSENTER>" even if SYSENTER is not involved. That is rather confusing.
The "SYSENTER" stack is used for a lot more than SYSENTER now. Give it a
better string to display in stack dumps, and rename the kernel code to
match.
Also move the 32-bit code over to the new naming even though it still uses
the entry stack only for SYSENTER.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The LDT is inherited across fork() or exec(), but that makes no sense
at all because exec() is supposed to start the process clean.
The reason why this happens is that init_new_context_ldt() is called from
init_new_context() which obviously needs to be called for both fork() and
exec().
It would be surprising if anything relies on that behaviour, so it seems to
be safe to remove that misfeature.
Split the context initialization into two parts. Clear the LDT pointer and
initialize the mutex from the general context init and move the LDT
duplication to arch_dup_mmap() which is only called on fork().
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The LDT is duplicated on fork() and on exec(), which is wrong as exec()
should start from a clean state, i.e. without LDT. To fix this the LDT
duplication code will be moved into arch_dup_mmap() which is only called
for fork().
This introduces a locking problem. arch_dup_mmap() holds mmap_sem of the
parent process, but the LDT duplication code needs to acquire
mm->context.lock to access the LDT data safely, which is the reverse lock
order of write_ldt() where mmap_sem nests into context.lock.
Solve this by introducing a new rw semaphore which serializes the
read/write_ldt() syscall operations and use context.lock to protect the
actual installment of the LDT descriptor.
So context.lock stabilizes mm->context.ldt and can nest inside of the new
semaphore or mmap_sem.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: dan.j.williams@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: kirill.shutemov@linux.intel.com
Cc: linux-mm@kvack.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If something goes wrong with pagetable setup, vsyscall=native will
accidentally fall back to emulation. Make it warn and fail so that we
notice.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The kernel is very erratic as to which pagetables have _PAGE_USER set. The
vsyscall page gets lucky: it seems that all of the relevant pagetables are
among the apparently arbitrary ones that set _PAGE_USER. Rather than
relying on chance, just explicitly set _PAGE_USER.
This will let us clean up pagetable setup to stop setting _PAGE_USER. The
added code can also be reused by pagetable isolation to manage the
_PAGE_USER bit in the usermode tables.
[ tglx: Folded paravirt fix from Juergen Gross ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The address hints are a trainwreck. The array entry numbers have to kept
magically in sync with the actual hints, which is doomed as some of the
array members are initialized at runtime via the entry numbers.
Designated initializers have been around before this code was
implemented....
Use the entry numbers to populate the address hints array and add the
missing bits and pieces. Split 32 and 64 bit for readability sake.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The check for a present page in printk_prot():
if (!pgprot_val(prot)) {
/* Not present */
is bogus. If a PTE is set to PAGE_NONE then the pgprot_val is not zero and
the entry is decoded in bogus ways, e.g. as RX GLB. That is confusing when
analyzing mapping correctness. Check for the present bit to make an
informed decision.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The recent cpu_entry_area changes fail to compile on 32-bit when BIGSMP=y
and NR_CPUS=512, because the fixmap area becomes too big.
Limit the number of CPUs with BIGSMP to 64, which is already way to big for
32-bit, but it's at least a working limitation.
We performed a quick survey of 32-bit-only machines that might be affected
by this change negatively, but found none.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- A bug in handling of SPE state for non-vhe systems
- A fix for a crash on system shutdown
- Three timer fixes, introduced by the timer optimizations for v4.15
x86 fixes:
- fix for a WARN that was introduced in 4.15
- fix for SMM when guest uses PCID
- fixes for several bugs found by syzkaller
... and a dozen papercut fixes for the kvm_stat tool.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJaO6N9AAoJEL/70l94x66DC1wH/Rf+u0Cj6ZQil6LK6Nf8bfPd
3TqrwrxUDeXwi8GzsvK14izBr1mDzidSHIO0Q4XINFRSRdaf43h3R2im/SJqvNhP
xktCmJI2CxN96oaC7kIExgwf3YKhFdLIADfbT8oR9p3xZG/+c97dkr3b4XtmVCDb
ZXdUEOcKnoW4zwpfJN30FLlq4OwYvuYVz02AEfPivZRDfhhus/TYSnuSdxH8CLNf
75ymuKyXoo/RELbimwbMk8Cm9+ey7PjlUGOgbnbXIFtmgznXhLzAOeES2B+46J5b
sMBPlmiJrn6N//lM18CC5yOBzBLGsYOoXggtw4aU/5nM4GVcFebWedpcoD4D8Jw=
=Bt8w
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"ARM fixes:
- A bug in handling of SPE state for non-vhe systems
- A fix for a crash on system shutdown
- Three timer fixes, introduced by the timer optimizations for v4.15
x86 fixes:
- fix for a WARN that was introduced in 4.15
- fix for SMM when guest uses PCID
- fixes for several bugs found by syzkaller
... and a dozen papercut fixes for the kvm_stat tool"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
tools/kvm_stat: sort '-f help' output
kvm: x86: fix RSM when PCID is non-zero
KVM: Fix stack-out-of-bounds read in write_mmio
KVM: arm/arm64: Fix timer enable flow
KVM: arm/arm64: Properly handle arch-timer IRQs after vtimer_save_state
KVM: arm/arm64: timer: Don't set irq as forwarded if no usable GIC
KVM: arm/arm64: Fix HYP unmapping going off limits
arm64: kvm: Prevent restoring stale PMSCR_EL1 for vcpu
KVM/x86: Check input paging mode when cs.l is set
tools/kvm_stat: add line for totals
tools/kvm_stat: stop ignoring unhandled arguments
tools/kvm_stat: suppress usage information on command line errors
tools/kvm_stat: handle invalid regular expressions
tools/kvm_stat: add hint on '-f help' to man page
tools/kvm_stat: fix child trace events accounting
tools/kvm_stat: fix extra handling of 'help' with fields filter
tools/kvm_stat: fix missing field update after filter change
tools/kvm_stat: fix drilldown in events-by-guests mode
tools/kvm_stat: fix command line option '-g'
kvm: x86: fix WARN due to uninitialized guest FPU state
...
rsm_load_state_64() and rsm_enter_protected_mode() load CR3, then
CR4 & ~PCIDE, then CR0, then CR4.
However, setting CR4.PCIDE fails if CR3[11:0] != 0. It's probably easier
in the long run to replace rsm_enter_protected_mode() with an emulator
callback that sets all the special registers (like KVM_SET_SREGS would
do). For now, set the PCID field of CR3 only after CR4.PCIDE is 1.
Reported-by: Laszlo Ersek <lersek@redhat.com>
Tested-by: Laszlo Ersek <lersek@redhat.com>
Fixes: 660a5d517a
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit f5775e0b61 ("x86/xen: discard RAM regions above the maximum
reservation") left host memory not assigned to dom0 as available for
memory hotplug.
Unfortunately this also meant that those regions could be used by
others. Specifically, commit fa564ad963 ("x86/PCI: Enable a 64bit BAR
on AMD Family 15h (Models 00-1f, 30-3f, 60-7f)") may try to map those
addresses as MMIO.
To prevent this mark unallocated host memory as E820_TYPE_UNUSABLE (thus
effectively reverting f5775e0b61) and keep track of that region as
a hostmem resource that can be used for the hotplug.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
The hashing of %p was designed to restrict kernel addresses. There is
no reason to hash the userspace values seen during a segfault report,
so switch these to %px. (Some architectures already use %lx.)
Fixes: ad67b74d24 ("printk: hash addresses printed with %p")
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A few thousand such pages are usually left around due to the re-use of
L1 tables having been provided by the hypervisor (Dom0) or tool stack
(DomU). Set NX in the direct map variant, which needs to be done in L2
due to the dual use of the re-used L1s.
For x86_configure_nx() to actually do what it is supposed to do, call
get_cpu_cap() first. This was broken by commit 4763ed4d45 ("x86, mm:
Clean up and simplify NX enablement") when switching away from the
direct EFER read.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Commit:
1959a60182 ("x86/dumpstack: Pin the target stack when dumping it")
changed the behavior of stack traces for zombies. Before that commit,
/proc/<pid>/stack reported the last execution path of the zombie before
it died:
[<ffffffff8105b877>] do_exit+0x6f7/0xa80
[<ffffffff8105bc79>] do_group_exit+0x39/0xa0
[<ffffffff8105bcf0>] __wake_up_parent+0x0/0x30
[<ffffffff8152dd09>] system_call_fastpath+0x16/0x1b
[<00007fd128f9c4f9>] 0x7fd128f9c4f9
[<ffffffffffffffff>] 0xffffffffffffffff
After the commit, it just reports an empty stack trace.
The new behavior is actually probably more correct. If the stack
refcount has gone down to zero, then the task has already gone through
do_exit() and isn't going to run anymore. The stack could be freed at
any time and is basically gone, so reporting an empty stack makes sense.
However, save_stack_trace_tsk_reliable() treats such a missing stack
condition as an error. That can cause livepatch transition stalls if
there are any unreaped zombies. Instead, just treat it as a reliable,
empty stack.
Reported-and-tested-by: Miroslav Benes <mbenes@suse.cz>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: live-patching@vger.kernel.org
Fixes: af085d9084 ("stacktrace/x86: add function for detecting reliable stack traces")
Link: http://lkml.kernel.org/r/e4b09e630e99d0c1080528f0821fc9d9dbaeea82.1513631620.git.jpoimboe@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 syscall entry code changes for PTI from Ingo Molnar:
"The main changes here are Andy Lutomirski's changes to switch the
x86-64 entry code to use the 'per CPU entry trampoline stack'. This,
besides helping fix KASLR leaks (the pending Page Table Isolation
(PTI) work), also robustifies the x86 entry code"
* 'WIP.x86-pti.entry-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
x86/cpufeatures: Make CPU bugs sticky
x86/paravirt: Provide a way to check for hypervisors
x86/paravirt: Dont patch flush_tlb_single
x86/entry/64: Make cpu_entry_area.tss read-only
x86/entry: Clean up the SYSENTER_stack code
x86/entry/64: Remove the SYSENTER stack canary
x86/entry/64: Move the IST stacks into struct cpu_entry_area
x86/entry/64: Create a per-CPU SYSCALL entry trampoline
x86/entry/64: Return to userspace from the trampoline stack
x86/entry/64: Use a per-CPU trampoline stack for IDT entries
x86/espfix/64: Stop assuming that pt_regs is on the entry stack
x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
x86/entry: Remap the TSS into the CPU entry area
x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
x86/dumpstack: Handle stack overflow on all stacks
x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
x86/kasan/64: Teach KASAN about the cpu_entry_area
x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce struct cpu_entry_area
x86/entry/gdt: Put per-CPU GDT remaps in ascending order
x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
...
Up to f5caf621ee ("x86/asm: Fix inline asm call constraints for Clang")
we were able to use x86 headers to build to the 'bpf' clang target, as
done by the BPF code in tools/perf/.
With that commit, we ended up with following failure for 'perf test LLVM', this
is because "clang ... -target bpf ..." fails since 4.0 does not have bpf inline
asm support and 6.0 does not recognize the register 'esp', fix it by guarding
that part with an #ifndef __BPF__, that is defined by clang when building to
the "bpf" target.
# perf test -v LLVM
37: LLVM search and compile :
37.1: Basic BPF llvm compile :
--- start ---
test child forked, pid 25526
Kernel build dir is set to /lib/modules/4.14.0+/build
set env: KBUILD_DIR=/lib/modules/4.14.0+/build
unset env: KBUILD_OPTS
include option is set to -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x40e00
set env: CLANG_EXEC=/usr/local/bin/clang
set env: CLANG_OPTIONS=-xc
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: WORKING_DIR=/lib/modules/4.14.0+/build
set env: CLANG_SOURCE=-
llvm compiling command template: echo '/*
* bpf-script-example.c
* Test basic LLVM building
*/
#ifndef LINUX_VERSION_CODE
# error Need LINUX_VERSION_CODE
# error Example: for 4.2 kernel, put 'clang-opt="-DLINUX_VERSION_CODE=0x40200" into llvm section of ~/.perfconfig'
#endif
#define BPF_ANY 0
#define BPF_MAP_TYPE_ARRAY 2
#define BPF_FUNC_map_lookup_elem 1
#define BPF_FUNC_map_update_elem 2
static void *(*bpf_map_lookup_elem)(void *map, void *key) =
(void *) BPF_FUNC_map_lookup_elem;
static void *(*bpf_map_update_elem)(void *map, void *key, void *value, int flags) =
(void *) BPF_FUNC_map_update_elem;
struct bpf_map_def {
unsigned int type;
unsigned int key_size;
unsigned int value_size;
unsigned int max_entries;
};
#define SEC(NAME) __attribute__((section(NAME), used))
struct bpf_map_def SEC("maps") flip_table = {
.type = BPF_MAP_TYPE_ARRAY,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 1,
};
SEC("func=SyS_epoll_wait")
int bpf_func__SyS_epoll_wait(void *ctx)
{
int ind =0;
int *flag = bpf_map_lookup_elem(&flip_table, &ind);
int new_flag;
if (!flag)
return 0;
/* flip flag and store back */
new_flag = !*flag;
bpf_map_update_elem(&flip_table, &ind, &new_flag, BPF_ANY);
return new_flag;
}
char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
' | $CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf -O2 -o -
test child finished with 0
---- end ----
LLVM search and compile subtest 0: Ok
37.2: kbuild searching :
--- start ---
test child forked, pid 25950
Kernel build dir is set to /lib/modules/4.14.0+/build
set env: KBUILD_DIR=/lib/modules/4.14.0+/build
unset env: KBUILD_OPTS
include option is set to -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: NR_CPUS=4
set env: LINUX_VERSION_CODE=0x40e00
set env: CLANG_EXEC=/usr/local/bin/clang
set env: CLANG_OPTIONS=-xc
set env: KERNEL_INC_OPTIONS= -nostdinc -isystem /usr/lib/gcc/x86_64-redhat-linux/7/include -I/home/acme/git/linux/arch/x86/include -I./arch/x86/include/generated -I/home/acme/git/linux/include -I./include -I/home/acme/git/linux/arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I/home/acme/git/linux/include/uapi -I./include/generated/uapi -include /home/acme/git/linux/include/linux/kconfig.h
set env: WORKING_DIR=/lib/modules/4.14.0+/build
set env: CLANG_SOURCE=-
llvm compiling command template: echo '/*
* bpf-script-test-kbuild.c
* Test include from kernel header
*/
#ifndef LINUX_VERSION_CODE
# error Need LINUX_VERSION_CODE
# error Example: for 4.2 kernel, put 'clang-opt="-DLINUX_VERSION_CODE=0x40200" into llvm section of ~/.perfconfig'
#endif
#define SEC(NAME) __attribute__((section(NAME), used))
#include <uapi/linux/fs.h>
#include <uapi/asm/ptrace.h>
SEC("func=vfs_llseek")
int bpf_func__vfs_llseek(void *ctx)
{
return 0;
}
char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
' | $CLANG_EXEC -D__KERNEL__ -D__NR_CPUS__=$NR_CPUS -DLINUX_VERSION_CODE=$LINUX_VERSION_CODE $CLANG_OPTIONS $KERNEL_INC_OPTIONS -Wno-unused-value -Wno-pointer-sign -working-directory $WORKING_DIR -c "$CLANG_SOURCE" -target bpf -O2 -o -
In file included from <stdin>:12:
In file included from /home/acme/git/linux/arch/x86/include/uapi/asm/ptrace.h:5:
In file included from /home/acme/git/linux/include/linux/compiler.h:242:
In file included from /home/acme/git/linux/arch/x86/include/asm/barrier.h:5:
In file included from /home/acme/git/linux/arch/x86/include/asm/alternative.h:10:
/home/acme/git/linux/arch/x86/include/asm/asm.h:145:50: error: unknown register name 'esp' in asm
register unsigned long current_stack_pointer asm(_ASM_SP);
^
/home/acme/git/linux/arch/x86/include/asm/asm.h:44:18: note: expanded from macro '_ASM_SP'
#define _ASM_SP __ASM_REG(sp)
^
/home/acme/git/linux/arch/x86/include/asm/asm.h:27:32: note: expanded from macro '__ASM_REG'
#define __ASM_REG(reg) __ASM_SEL_RAW(e##reg, r##reg)
^
/home/acme/git/linux/arch/x86/include/asm/asm.h:18:29: note: expanded from macro '__ASM_SEL_RAW'
# define __ASM_SEL_RAW(a,b) __ASM_FORM_RAW(a)
^
/home/acme/git/linux/arch/x86/include/asm/asm.h:11:32: note: expanded from macro '__ASM_FORM_RAW'
# define __ASM_FORM_RAW(x) #x
^
<scratch space>:4:1: note: expanded from here
"esp"
^
1 error generated.
ERROR: unable to compile -
Hint: Check error message shown above.
Hint: You can also pre-compile it into .o using:
clang -target bpf -O2 -c -
with proper -I and -D options.
Failed to compile test case: 'kbuild searching'
test child finished with -1
---- end ----
LLVM search and compile subtest 1: FAILED!
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Ahern <dsahern@gmail.com>
Cc: Dmitriy Vyukov <dvyukov@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthias Kaehlcke <mka@chromium.org>
Cc: Miguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wang Nan <wangnan0@huawei.com>
Cc: Yonghong Song <yhs@fb.com>
Link: https://lkml.kernel.org/r/20171128175948.GL3298@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Commit d8aa7eea78 ("x86/mm: Add Secure Encrypted Virtualization (SEV)
support") changed sme_active() from an inline function that referenced
sme_me_mask to a non-inlined function in order to make the sev_enabled
variable a static variable. This function was marked EXPORT_SYMBOL_GPL
because at the time the patch was submitted, sme_me_mask was marked
EXPORT_SYMBOL_GPL.
Commit 87df26175e ("x86/mm: Unbreak modules that rely on external
PAGE_KERNEL availability") changed sme_me_mask variable from
EXPORT_SYMBOL_GPL to EXPORT_SYMBOL, allowing external modules the ability
to build with CONFIG_AMD_MEM_ENCRYPT=y. Now, however, with sev_active()
no longer an inline function and marked as EXPORT_SYMBOL_GPL, external
modules that use the DMA API are once again broken in 4.15. Since the DMA
API is meant to be used by external modules, this needs to be changed.
Change the sme_active() and sev_active() functions from EXPORT_SYMBOL_GPL
to EXPORT_SYMBOL.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brijesh Singh <brijesh.singh@amd.com>
Link: https://lkml.kernel.org/r/20171215162011.14125.7113.stgit@tlendack-t1.amdoffice.net
AMD systems may log Deferred errors. These are errors that are uncorrected
but which do not need immediate action. The MCA_STATUS[UC] bit may not be
set for Deferred errors.
Flag the error as not correctable when MCA_STATUS[Deferred] is set and
do not feed it into the Correctable Errors Collector.
[ bp: Massage commit message. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171212165143.27475-1-Yazen.Ghannam@amd.com
The MCA_STATUS[ErrorCodeExt] field is very bank type specific.
We currently check if the ErrorCodeExt value is 0x0 or 0x8 in
mce_is_memory_error(), but we don't check the bank number. This means
that we could flag non-memory errors as memory errors.
We know that we want to flag DRAM ECC errors as memory errors, so let's do
those cases first. We can add more cases later when needed.
Define a wrapper function in mce_amd.c so we can use SMCA enums.
[ bp: Remove brackets around return statements. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171207203955.118171-2-Yazen.Ghannam@amd.com
Scalable MCA systems have various types of banks. The bank's type
can determine how we handle errors from it. For example, if a bank
represents a UMC (Unified Memory Controller) then we will need to
convert its address from a normalized address to a system physical
address before handling the error.
[ bp: Verify m->bank is within range and use bank pointer. ]
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20171207203955.118171-1-Yazen.Ghannam@amd.com
Reported by syzkaller:
BUG: KASAN: stack-out-of-bounds in write_mmio+0x11e/0x270 [kvm]
Read of size 8 at addr ffff8803259df7f8 by task syz-executor/32298
CPU: 6 PID: 32298 Comm: syz-executor Tainted: G OE 4.15.0-rc2+ #18
Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
Call Trace:
dump_stack+0xab/0xe1
print_address_description+0x6b/0x290
kasan_report+0x28a/0x370
write_mmio+0x11e/0x270 [kvm]
emulator_read_write_onepage+0x311/0x600 [kvm]
emulator_read_write+0xef/0x240 [kvm]
emulator_fix_hypercall+0x105/0x150 [kvm]
em_hypercall+0x2b/0x80 [kvm]
x86_emulate_insn+0x2b1/0x1640 [kvm]
x86_emulate_instruction+0x39a/0xb90 [kvm]
handle_exception+0x1b4/0x4d0 [kvm_intel]
vcpu_enter_guest+0x15a0/0x2640 [kvm]
kvm_arch_vcpu_ioctl_run+0x549/0x7d0 [kvm]
kvm_vcpu_ioctl+0x479/0x880 [kvm]
do_vfs_ioctl+0x142/0x9a0
SyS_ioctl+0x74/0x80
entry_SYSCALL_64_fastpath+0x23/0x9a
The path of patched vmmcall will patch 3 bytes opcode 0F 01 C1(vmcall)
to the guest memory, however, write_mmio tracepoint always prints 8 bytes
through *(u64 *)val since kvm splits the mmio access into 8 bytes. This
leaks 5 bytes from the kernel stack (CVE-2017-17741). This patch fixes
it by just accessing the bytes which we operate on.
Before patch:
syz-executor-5567 [007] .... 51370.561696: kvm_mmio: mmio write len 3 gpa 0x10 val 0x1ffff10077c1010f
After patch:
syz-executor-13416 [002] .... 51302.299573: kvm_mmio: mmio write len 3 gpa 0x10 val 0xc1010f
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reviewed-by: Darren Kenny <darren.kenny@oracle.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Tested-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Christoffer Dall <christoffer.dall@linaro.org>
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is currently no way to force CPU bug bits like CPU feature bits. That
makes it impossible to set a bug bit once at boot and have it stick for all
upcoming CPUs.
Extend the force set/clear arrays to handle bug bits as well.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.992156574@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no generic way to test whether a kernel is running on a specific
hypervisor. But that's required to prevent the upcoming user address space
separation feature in certain guest modes.
Make the hypervisor type enum unconditionally available and provide a
helper function which allows to test for a specific type.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.912938129@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
native_flush_tlb_single() will be changed with the upcoming
PAGE_TABLE_ISOLATION feature. This requires to have more code in
there than INVLPG.
Remove the paravirt patching for it.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171204150606.828111617@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The TSS is a fairly juicy target for exploits, and, now that the TSS
is in the cpu_entry_area, it's no longer protected by kASLR. Make it
read-only on x86_64.
On x86_32, it can't be RO because it's written by the CPU during task
switches, and we use a task gate for double faults. I'd also be
nervous about errata if we tried to make it RO even on configurations
without double fault handling.
[ tglx: AMD confirmed that there is no problem on 64-bit with TSS RO. So
it's probably safe to assume that it's a non issue, though Intel
might have been creative in that area. Still waiting for
confirmation. ]
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.733700132@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The existing code was a mess, mainly because C arrays are nasty. Turn
SYSENTER_stack into a struct, add a helper to find it, and do all the
obvious cleanups this enables.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.653244723@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the SYSENTER stack has a guard page, there's no need for a canary
to detect overflow after the fact.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.572577316@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The IST stacks are needed when an IST exception occurs and are accessed
before any kernel code at all runs. Move them into struct cpu_entry_area.
The IST stacks are unlike the rest of cpu_entry_area: they're used even for
entries from kernel mode. This means that they should be set up before we
load the final IDT. Move cpu_entry_area setup to trap_init() for the boot
CPU and set it up for all possible CPUs at once in native_smp_prepare_cpus().
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.480598743@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live. It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer. The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.
With PAGE_TABLE_ISOLATION-like pagetable switching, this is
problematic. Without a scratch register, switching CR3 is impossible, so
%gs-based percpu memory would need to be mapped in the user pagetables.
Doing that without information leaks is difficult or impossible.
Instead, use a different sneaky trick. Map a copy of the first part
of the SYSCALL asm at a different address for each CPU. Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory. By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.
A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.
The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15. This patch therefore doesn't touch SYSCALL32
at all.
This patch actually seems to be a small speedup. With this patch,
SYSCALL touches an extra cache line and an extra virtual page, but
the pipeline no longer stalls waiting for SWAPGS. It seems that, at
least in a tight loop, the latter outweights the former.
Thanks to David Laight for an optimization tip.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.403607157@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
By itself, this is useless. It gives us the ability to run some final code
before exit that cannnot run on the kernel stack. This could include a CR3
switch a la PAGE_TABLE_ISOLATION or some kernel stack erasing, for
example. (Or even weird things like *changing* which kernel stack gets
used as an ASLR-strengthening mechanism.)
The SYSRET32 path is not covered yet. It could be in the future or
we could just ignore it and force the slow path if needed.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.306546484@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack. Rearrange it so that we enter on
a per-CPU trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.
The asm isn't exactly beautiful, but I think that fully refactoring
it can wait.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.225330557@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack. Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0. This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.
While we're at it, improve the comments to explain what's actually
going on.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.130778051@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On 64-bit kernels, we used to assume that TSS.sp0 was the current
top of stack. With the addition of an entry trampoline, this will
no longer be the case. Store the current top of stack in TSS.sp1,
which is otherwise unused but shares the same cacheline.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150606.050864668@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout. A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.962042855@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SYSENTER_stack should have reliable overflow detection, which
means that it needs to be at the bottom of a page, not the top.
Move it to the beginning of struct tss_struct and page-align it.
Also add an assertion to make sure that the fixed hardware TSS
doesn't cross a page boundary.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.881827433@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We currently special-case stack overflow on the task stack. We're
going to start putting special stacks in the fixmap with a custom
layout, so they'll have guard pages, too. Teach the unwinder to be
able to unwind an overflow of any of the stacks.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.802057305@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A future patch will move SYSENTER_stack to the beginning of cpu_tss
to help detect overflow. Before this can happen, fix several code
paths that hardcode assumptions about the old layout.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.722425540@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
fixmap. Generalize it to be an array of a new 'struct cpu_entry_area'
so that we can cleanly add new things to it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.563271721@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We currently have CPU 0's GDT at the top of the GDT range and
higher-numbered CPUs at lower addresses. This happens because the
fixmap is upside down (index 0 is the top of the fixmap).
Flip it so that GDTs are in ascending order by virtual address.
This will simplify a future patch that will generalize the GDT
remap to contain multiple pages.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.471561421@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
get_stack_info() doesn't currently know about the SYSENTER stack, so
unwinding will fail if we entered the kernel on the SYSENTER stack
and haven't fully switched off. Teach get_stack_info() about the
SYSENTER stack.
With future patches applied that run part of the entry code on the
SYSENTER stack and introduce an intentional BUG(), I would get:
PANIC: double fault, error_code: 0x0
...
RIP: 0010:do_error_trap+0x33/0x1c0
...
Call Trace:
Code: ...
With this patch, I get:
PANIC: double fault, error_code: 0x0
...
Call Trace:
<SYSENTER>
? async_page_fault+0x36/0x60
? invalid_op+0x22/0x40
? async_page_fault+0x36/0x60
? sync_regs+0x3c/0x40
? sync_regs+0x2e/0x40
? error_entry+0x6c/0xd0
? async_page_fault+0x36/0x60
</SYSENTER>
Code: ...
which is a lot more informative.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.392711508@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This will simplify future changes that want scratch variables early in
the SYSENTER handler -- they'll be able to spill registers to the
stack. It also lets us get rid of a SWAPGS_UNSAFE_STACK user.
This does not depend on CONFIG_IA32_EMULATION=y because we'll want the
stack space even without IA32 emulation.
As far as I can tell, the reason that this wasn't done from day 1 is
that we use IST for #DB and #BP, which is IMO rather nasty and causes
a lot more problems than it solves. But, since #DB uses IST, we don't
actually need a real stack for SYSENTER (because SYSENTER with TF set
will invoke #DB on the IST stack rather than the SYSENTER stack).
I want to remove IST usage from these vectors some day, and this patch
is a prerequisite for that as well.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.312726423@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are at least two unwinder bugs hindering the debugging of
stack-overflow crashes:
- It doesn't deal gracefully with the case where the stack overflows and
the stack pointer itself isn't on a valid stack but the
to-be-dereferenced data *is*.
- The ORC oops dump code doesn't know how to print partial pt_regs, for the
case where if we get an interrupt/exception in *early* entry code
before the full pt_regs have been saved.
Fix both issues.
http://lkml.kernel.org/r/20171126024031.uxi4numpbjm5rlbr@treble
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150605.071425003@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the stack overflows into a guard page and the ORC unwinder should work
well: by construction, there can't be any meaningful data in the guard page
because no writes to the guard page will have succeeded.
But there is a bug that prevents unwinding from working correctly: if the
starting register state has RSP pointing into a stack guard page, the ORC
unwinder bails out immediately.
Instead of bailing out immediately check whether the next page up is a
valid check page and if so analyze that. As a result the ORC unwinder will
start the unwind.
Tested by intentionally overflowing the task stack. The result is an
accurate call trace instead of a trace consisting purely of '?' entries.
There are a few other bugs that are triggered if the unwinder encounters a
stack overflow after the first step, but they are outside the scope of this
fix.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Link: https://lkml.kernel.org/r/20171204150604.991389777@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 1d3e53e862 ("x86/entry/64: Refactor IRQ stacks and make them
NMI-safe") added DEBUG_ENTRY_ASSERT_IRQS_OFF macro that acceses eflags
using 'pushfq' instruction when testing for IF bit. On PV Xen guests
looking at IF flag directly will always see it set, resulting in 'ud2'.
Introduce SAVE_FLAGS() macro that will use appropriate save_fl pv op when
running paravirt.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Juergen Gross <jgross@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: xen-devel@lists.xenproject.org
Link: https://lkml.kernel.org/r/20171204150604.899457242@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ Note, this is a Git cherry-pick of the following commit:
d17a1d97dc: ("x86/mm/kasan: don't use vmemmap_populate() to initialize shadow")
... for easier x86 PTI code testing and back-porting. ]
The KASAN shadow is currently mapped using vmemmap_populate() since that
provides a semi-convenient way to map pages into init_top_pgt. However,
since that no longer zeroes the mapped pages, it is not suitable for
KASAN, which requires zeroed shadow memory.
Add kasan_populate_shadow() interface and use it instead of
vmemmap_populate(). Besides, this allows us to take advantage of
gigantic pages and use them to populate the shadow, which should save us
some memory wasted on page tables and reduce TLB pressure.
Link: http://lkml.kernel.org/r/20171103185147.2688-2-pasha.tatashin@oracle.com
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Steven Sistare <steven.sistare@oracle.com>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Bob Picco <bob.picco@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ Note, this is a Git cherry-pick of the following commit:
506458efaf ("locking/barriers: Convert users of lockless_dereference() to READ_ONCE()")
... for easier x86 PTI code testing and back-porting. ]
READ_ONCE() now has an implicit smp_read_barrier_depends() call, so it
can be used instead of lockless_dereference() without any change in
semantics.
Signed-off-by: Will Deacon <will.deacon@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1508840570-22169-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ Note, this is a Git cherry-pick of the following commit:
a47ba4d77e ("perf/x86: Enable free running PEBS for REGS_USER/INTR")
... for easier x86 PTI code testing and back-porting. ]
Currently free running PEBS is disabled when user or interrupt
registers are requested. Most of the registers are actually
available in the PEBS record and can be supported.
So we just need to check for the supported registers and then
allow it: it is all except for the segment register.
For user registers this only works when the counter is limited
to ring 3 only, so this also needs to be checked.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20170831214630.21892-1-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ Note, this is a Git cherry-pick of the following commit:
2b67799bdf25 ("x86: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD")
... for easier x86 PTI code testing and back-porting. ]
The latest AMD AMD64 Architecture Programmer's Manual
adds a CPUID feature XSaveErPtr (CPUID_Fn80000008_EBX[2]).
If this feature is set, the FXSAVE, XSAVE, FXSAVEOPT, XSAVEC, XSAVES
/ FXRSTOR, XRSTOR, XRSTORS always save/restore error pointers,
thus making the X86_BUG_FXSAVE_LEAK workaround obsolete on such CPUs.
Signed-Off-By: Rudolf Marek <r.marek@assembler.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: https://lkml.kernel.org/r/bdcebe90-62c5-1f05-083c-eba7f08b2540@assembler.cz
Signed-off-by: Ingo Molnar <mingo@kernel.org>
[ Note, this is a Git cherry-pick of the following commit: (limited to the cpufeatures.h file)
3522c2a6a4 ("x86/cpufeature: Add User-Mode Instruction Prevention definitions")
... for easier x86 PTI code testing and back-porting. ]
User-Mode Instruction Prevention is a security feature present in new
Intel processors that, when set, prevents the execution of a subset of
instructions if such instructions are executed in user mode (CPL > 0).
Attempting to execute such instructions causes a general protection
exception.
The subset of instructions comprises:
* SGDT - Store Global Descriptor Table
* SIDT - Store Interrupt Descriptor Table
* SLDT - Store Local Descriptor Table
* SMSW - Store Machine Status Word
* STR - Store Task Register
This feature is also added to the list of disabled-features to allow
a cleaner handling of build-time configuration.
Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Chris Metcalf <cmetcalf@mellanox.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: Jiri Slaby <jslaby@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi V. Shankar <ravi.v.shankar@intel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: ricardo.neri@intel.com
Link: http://lkml.kernel.org/r/1509935277-22138-7-git-send-email-ricardo.neri-calderon@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Debian does not ship a 'mkisofs' symlink to genisoimage. All modern
distros ship genisoimage, so just use that directly. That requires
renaming the 'genisoimage' function. Also neaten up the 'for' loop
while I'm in here.
Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
Cc: Changbin Du <changbin.du@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- fix the s2ram regression related to confusion around segment
register restoration, plus related cleanups that make the code more
robust
- a guess-unwinder Kconfig dependency fix
- an isoimage build target fix for certain tool chain combinations
- instruction decoder opcode map fixes+updates, and the syncing of
the kernel decoder headers to the objtool headers
- a kmmio tracing fix
- two 5-level paging related fixes
- a topology enumeration fix on certain SMP systems"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Resync objtool's instruction decoder source code copy with the kernel's latest version
x86/decoder: Fix and update the opcodes map
x86/power: Make restore_processor_context() sane
x86/power/32: Move SYSENTER MSR restoration to fix_processor_context()
x86/power/64: Use struct desc_ptr for the IDT in struct saved_context
x86/unwinder/guess: Prevent using CONFIG_UNWINDER_GUESS=y with CONFIG_STACKDEPOT=y
x86/build: Don't verify mtools configuration file for isoimage
x86/mm/kmmio: Fix mmiotrace for page unaligned addresses
x86/boot/compressed/64: Print error if 5-level paging is not supported
x86/boot/compressed/64: Detect and handle 5-level paging at boot-time
x86/smpboot: Do not use smp_num_siblings in __max_logical_packages calculation
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iQEcBAABAgAGBQJaM17CAAoJELDendYovxMvZs0H/3YIAN3wXEqcaZuAoMKdTU3C
gSipBUJKwcwo/1qJCtE5CzQnl7U6qC4BPnwNfed1AJUULFBYtpEMw6KFMhK2KELn
ADWoJzVbBzouZnj0GHuky4dQnsC5lPL3RjlWqKkzlEA3KebrxLsO7inw5Pru9zPZ
y1Isl/nLxV04f8SYITjgThCCdo9pnQOdx275YAcbQprqsOK45j+4d1Mn0fYb1Rjk
7Fk4i9UAZlY/VYycpgyC/m3PKFdawl0uqHXwu4yTKXQJGZ3BwED7ifTvHknaSOhA
DVxktnEOTg5IrrqDrr4EdZIgXQcqQF19oIOtEWCKCLvW8j/sBlSAQYkgtn5OpxY=
=gqWy
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two minor fixes for running as Xen dom0:
- when built as 32 bit kernel on large machines the Xen LAPIC
emulation should report a rather modern LAPIC in order to support
enough APIC-Ids
- The Xen LAPIC emulation is needed for dom0 only, so build it only
for kernels supporting to run as Xen dom0"
* tag 'for-linus-4.15-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: XEN_ACPI_PROCESSOR is Dom0-only
x86/Xen: don't report ancient LAPIC version
Using PGDIR_SHIFT to identify espfix64 addresses on 5-level systems
was wrong, and it resulted in panics due to unhandled double faults.
Use P4D_SHIFT instead, which is correct on 4-level and 5-level
machines.
This fixes a panic when running x86 selftests on 5-level machines.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Fixes: 1d33b21956 ("x86/espfix: Add support for 5-level paging")
Link: http://lkml.kernel.org/r/24c898b4f44fdf8c22d93703850fb384ef87cfdc.1513035461.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update x86-opcode-map.txt based on the October 2017 Intel SDM publication.
Fix INVPID to INVVPID.
Add UD0 and UD1 instruction opcodes.
Also sync the objtool and perf tooling copies of this file.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <masami.hiramatsu@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/aac062d7-c0f6-96e3-5c92-ed299e2bd3da@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
My previous attempt to fix a couple of bugs in __restore_processor_context():
5b06bbcfc2 ("x86/power: Fix some ordering bugs in __restore_processor_context()")
... introduced yet another bug, breaking suspend-resume.
Rather than trying to come up with a minimal fix, let's try to clean it up
for real. This patch fixes quite a few things:
- The old code saved a nonsensical subset of segment registers.
The only registers that need to be saved are those that contain
userspace state or those that can't be trivially restored without
percpu access working. (On x86_32, we can restore percpu access
by writing __KERNEL_PERCPU to %fs. On x86_64, it's easier to
save and restore the kernel's GSBASE.) With this patch, we
restore hardcoded values to the kernel state where applicable and
explicitly restore the user state after fixing all the descriptor
tables.
- We used to use an unholy mix of inline asm and C helpers for
segment register access. Let's get rid of the inline asm.
This fixes the reported s2ram hangs and make the code all around
more logical.
Analyzed-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Reported-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Tested-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Zhang Rui <rui.zhang@intel.com>
Fixes: 5b06bbcfc2 ("x86/power: Fix some ordering bugs in __restore_processor_context()")
Link: http://lkml.kernel.org/r/398ee68e5c0f766425a7b746becfc810840770ff.1513286253.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x86_64 restores system call MSRs in fix_processor_context(), and
x86_32 restored them along with segment registers. The 64-bit
variant makes more sense, so move the 32-bit code to match the
64-bit code.
No side effects are expected to runtime behavior.
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: http://lkml.kernel.org/r/65158f8d7ee64dd6bbc6c1c83b3b34aaa854e3ae.1513286253.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x86_64's saved_context nonsensically used separate idt_limit and
idt_base fields and then cast &idt_limit to struct desc_ptr *.
This was correct (with -fno-strict-aliasing), but it's confusing,
served no purpose, and required #ifdeffery. Simplify this by
using struct desc_ptr directly.
No change in functionality.
Tested-by: Jarkko Nikula <jarkko.nikula@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Zhang Rui <rui.zhang@intel.com>
Link: http://lkml.kernel.org/r/967909ce38d341b01d45eff53e278e2728a3a93a.1513286253.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJaMwhcAAoJEFmIoMA60/r88ygP/3fuGrYrolx1yOM/XKjiG1Za
B+FF5IK0xcV4Xrdg4K9TrqCkVBlSmTkF2wvD1raFQhvPvzNnTMI0JvI5iwUogSlA
EI8rKvKRT0QRtT27URhGWCZOrPMgylE7Vtq7ZGLin2o19A4JQw3TC8mz9QirSbii
C5H+WZ6iAQ3ISjOuiFUREqQDCE8FzbYQC69uh/xMRshZFG1WPsX96ZRSvVfPQUeC
XwvKqiO4s9I7zyxhS/WSRnnEHO7LNSRRiq8kzP0VKKMAfTo8gLYPoeIPakvXSsQZ
j7N1ycWub9DSn7wjh9Zd922wp1EYfxkCW6jilVOd+BlmOxLjyPPGgahbNIrIojwa
qiTY5SNdty3h27IT3Hep4yDBoqhXXPORXgh7QRlzizu8Te2SjHY7n9qi1S8Uobn+
s98PoNYU+oelivi9MtKLKcqCaNL0RE8aCQfuA3TzCU/dzR4Trx8obV50f67vsL6e
eK67S/KpOVUjHQmGec8iqnrJfHPZkxgaRTOVVYCXKidtJxTf/LyjDKYh8A2Q27A5
wXdRaSxLw25r0alAWzUEIVj/iUPd4nYKY9RTeRPWiXU+wsEUcBO+js6HE1xZAAfk
ULDqZKnkKsgW9aWlQU6DzzLsCjxPTRcC8SfK79K47Otua2laABzL47mzo2sSKKFs
mNq3i4OiNFq8XCsN18lo
=o3Rn
-----END PGP SIGNATURE-----
Merge tag 'pci-v4.15-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- add a pci_get_domain_bus_and_slot() stub for the CONFIG_PCI=n case to
avoid build breakage in the v4.16 merge window if a
pci_get_bus_and_slot() -> pci_get_domain_bus_and_slot() patch gets
merged before the PCI tree (Randy Dunlap)
- fix an AMD boot regression in the 64bit BAR support added in v4.15
(Christian König)
- fix an R-Car use-after-free that causes a crash if no PCIe card is
present (Geert Uytterhoeven)
* tag 'pci-v4.15-fixes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: rcar: Fix use-after-free in probe error path
x86/PCI: Only enable a 64bit BAR on single-socket AMD Family 15h
x86/PCI: Fix infinite loop in search for 64bit BAR placement
PCI: Add pci_get_domain_bus_and_slot() stub
------------[ cut here ]------------
Bad FPU state detected at kvm_put_guest_fpu+0xd8/0x2d0 [kvm], reinitializing FPU registers.
WARNING: CPU: 1 PID: 4594 at arch/x86/mm/extable.c:103 ex_handler_fprestore+0x88/0x90
CPU: 1 PID: 4594 Comm: qemu-system-x86 Tainted: G B OE 4.15.0-rc2+ #10
RIP: 0010:ex_handler_fprestore+0x88/0x90
Call Trace:
fixup_exception+0x4e/0x60
do_general_protection+0xff/0x270
general_protection+0x22/0x30
RIP: 0010:kvm_put_guest_fpu+0xd8/0x2d0 [kvm]
RSP: 0018:ffff8803d5627810 EFLAGS: 00010246
kvm_vcpu_reset+0x3b4/0x3c0 [kvm]
kvm_apic_accept_events+0x1c0/0x240 [kvm]
kvm_arch_vcpu_ioctl_run+0x1658/0x2fb0 [kvm]
kvm_vcpu_ioctl+0x479/0x880 [kvm]
do_vfs_ioctl+0x142/0x9a0
SyS_ioctl+0x74/0x80
do_syscall_64+0x15f/0x600
where kvm_put_guest_fpu is called without a prior kvm_load_guest_fpu.
To fix it, move kvm_load_guest_fpu to the very beginning of
kvm_arch_vcpu_ioctl_run.
Cc: stable@vger.kernel.org
Fixes: f775b13eed
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The below test case can cause infinite loop in kvm when ept=0.
#include <unistd.h>
#include <sys/syscall.h>
#include <string.h>
#include <stdint.h>
#include <linux/kvm.h>
#include <fcntl.h>
#include <sys/ioctl.h>
long r[5];
int main()
{
r[2] = open("/dev/kvm", O_RDONLY);
r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
r[4] = ioctl(r[3], KVM_CREATE_VCPU, 7);
ioctl(r[4], KVM_RUN, 0);
}
It doesn't setup the memory regions, mmu_alloc_shadow/direct_roots() in
kvm return 1 when kvm fails to allocate root page table which can result
in beblow infinite loop:
vcpu_run() {
for (;;) {
r = vcpu_enter_guest()::kvm_mmu_reload() returns 1
if (r <= 0)
break;
if (need_resched())
cond_resched();
}
}
This patch fixes it by returning -ENOSPC when there is no available kvm mmu
page for root page table.
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Radim Krčmář <rkrcmar@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 26eeb53cf0 (KVM: MMU: Bail out immediately if there is no available mmu page)
Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Unconditionally reporting a value seen on the P4 or older invokes
functionality like io_apic_get_unique_id() on 32-bit builds, resulting
in a panic() with sufficiently many CPUs and/or IO-APICs. Doing what
that function does would be the hypervisor's responsibility anyway, so
makes no sense to be used when running on Xen. Uniformly report a more
modern version; this shouldn't matter much as both LAPIC and IO-APIC are
being managed entirely / mostly by the hypervisor.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Standardize warning, error, and success printout format
of insn_decode_test so that user can easily understand
which test tool caused the messages.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/151153628279.22827.4869104298276788693.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename distill.awk to objdump_reformat.awk because it more
clearly expresses its purpose of re-formatting the output
of objdump so that insn_decoder_test can read it.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/151153625409.22827.10470603625519700259.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename test_get_len test command to insn_decoder_test
as it a more meaningful name. This also changes some
comments in related files.
Note that this also removes the paragraph about
writing to the Free Software Foundation's mailing
address.
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/151153622537.22827.14928774603980883278.stgit@devbox
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull crypto fixes from Herbert Xu:
"This push fixes the following issues:
- buffer overread in RSA
- potential use after free in algif_aead.
- error path null pointer dereference in af_alg
- forbid combinations such as hmac(hmac(sha3)) which may crash
- crash in salsa20 due to incorrect API usage"
* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: salsa20 - fix blkcipher_walk API usage
crypto: hmac - require that the underlying hash algorithm is unkeyed
crypto: af_alg - fix NULL pointer dereference in
crypto: algif_aead - fix reference counting of null skcipher
crypto: rsa - fix buffer overread when stripping leading zeroes
If mtools.conf is not generated before, 'make isoimage' could complain:
Kernel: arch/x86/boot/bzImage is ready (#597)
GENIMAGE arch/x86/boot/image.iso
*** Missing file: arch/x86/boot/mtools.conf
arch/x86/boot/Makefile:144: recipe for target 'isoimage' failed
mtools.conf is not used for isoimage generation, so do not check it.
Signed-off-by: Changbin Du <changbin.du@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 4366d57af1 ("x86/build: Factor out fdimage/isoimage generation commands to standalone script")
Link: http://lkml.kernel.org/r/1512053480-8083-1-git-send-email-changbin.du@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Uprobe is a tracing mechanism for userspace programs.
Typical uprobe will incur overhead of two traps.
First trap is caused by replaced trap insn, and
the second trap is to execute the original displaced
insn in user space.
To reduce the overhead, kernel provides hooks
for architectures to emulate the original insn
and skip the second trap. In x86, emulation
is done for certain branch insns.
This patch extends the emulation to "push <reg>"
insns. These insns are typical in the beginning
of the function. For example, bcc
in https://github.com/iovisor/bcc repo provides
tools to measure funclantency, detect memleak, etc.
The tools will place uprobes in the beginning of
function and possibly uretprobes at the end of function.
This patch is able to reduce the trap overhead for
uprobe from 2 to 1.
Without this patch, uretprobe will typically incur
three traps. With this patch, if the function starts
with "push" insn, the number of traps can be
reduced from 3 to 2.
An experiment was conducted on two local VMs,
fedora 26 64-bit VM and 32-bit VM, both 4 processors
and 4GB memory, booted with latest tip repo (and this patch).
The host is MacBook with intel i7 processor.
The test program looks like:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
static void test() __attribute__((noinline));
void test() {}
int main() {
struct timeval start, end;
gettimeofday(&start, NULL);
for (int i = 0; i < 1000000; i++) {
test();
}
gettimeofday(&end, NULL);
printf("%ld\n", ((end.tv_sec * 1000000 + end.tv_usec)
- (start.tv_sec * 1000000 + start.tv_usec)));
return 0;
}
The program is compiled without optimization, and
the first insn for function "test" is "push %rbp".
The host is relatively idle.
Before the test run, the uprobe is inserted as below for uprobe:
echo 'p <binary>:<test_func_offset>' > /sys/kernel/debug/tracing/uprobe_events
echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
and for uretprobe:
echo 'r <binary>:<test_func_offset>' > /sys/kernel/debug/tracing/uprobe_events
echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
Unit: microsecond(usec) per loop iteration
x86_64 W/ this patch W/O this patch
uprobe 1.55 3.1
uretprobe 2.0 3.6
x86_32 W/ this patch W/O this patch
uprobe 1.41 3.5
uretprobe 1.75 4.0
You can see that this patch significantly reduced the overhead,
50% for uprobe and 44% for uretprobe on x86_64, and even more
on x86_32.
Signed-off-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kernel-team@fb.com
Link: http://lkml.kernel.org/r/20171201001202.3706564-1-yhs@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If something calls ioremap() with an address not aligned to PAGE_SIZE, the
returned address might be not aligned as well. This led to a probe
registered on exactly the returned address, but the entire page was armed
for mmiotracing.
On calling iounmap() the address passed to unregister_kmmio_probe() was
PAGE_SIZE aligned by the caller leading to a complete freeze of the
machine.
We should always page align addresses while (un)registerung mappings,
because the mmiotracer works on top of pages, not mappings. We still keep
track of the probes based on their real addresses and lengths though,
because the mmiotrace still needs to know what are mapped memory regions.
Also move the call to mmiotrace_iounmap() prior page aligning the address,
so that all probes are unregistered properly, otherwise the kernel ends up
failing memory allocations randomly after disabling the mmiotracer.
Tested-by: Lyude <lyude@redhat.com>
Signed-off-by: Karol Herbst <kherbst@redhat.com>
Acked-by: Pekka Paalanen <ppaalanen@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: nouveau@lists.freedesktop.org
Link: http://lkml.kernel.org/r/20171127075139.4928-1-kherbst@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ARM:
* A number of issues in the vgic discovered using SMATCH
* A bit one-off calculation in out stage base address mask (32-bit and
64-bit)
* Fixes to single-step debugging instructions that trap for other
reasons such as MMMIO aborts
* Printing unavailable hyp mode as error
* Potential spinlock deadlock in the vgic
* Avoid calling vgic vcpu free more than once
* Broken bit calculation for big endian systems
s390:
* SPDX tags
* Fence storage key accesses from problem state
* Make sure that irq_state.flags is not used in the future
x86:
* Intercept port 0x80 accesses to prevent host instability (CVE)
* Use userspace FPU context for guest FPU (mainly an optimization that
fixes a double use of kernel FPU)
* Do not leak one page per module load
* Flush APIC page address cache from MMU invalidation notifiers
-----BEGIN PGP SIGNATURE-----
iQEcBAABCAAGBQJaLA93AAoJEED/6hsPKofo9msH/2DrqT2FOKfLuxNR2FeUGWr3
lqFoBRUXrVDMINGStnWrV36h/xYzlgJl9jtSDS8dr3VxLqtrNLlDg9NmGeogoZ+k
/xewr/jFYoSRfffsvrbkzORUfvu6zqvJwufiwBEJwAfcswiLqPizdFXcxtUL4eZE
9s9sIweo5zp2Xjg5yLOEkyanePKMEht/81zPkHyM+g0ZMoaPam3qZHA0lLzdyRgd
G9LpSyiMFHguYYgbwipaVue3zgMY1EdmKQ8C2hEPmZd8nVau26YDwRnAwwLrmVkW
sFhGO1Xi18TzQPokzALC25c9v0fqgxL5+fNyFNgWwTc2n9PSwO+IHcy699UH+3A=
=Qcqd
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Radim Krčmář:
"ARM:
- A number of issues in the vgic discovered using SMATCH
- A bit one-off calculation in out stage base address mask (32-bit
and 64-bit)
- Fixes to single-step debugging instructions that trap for other
reasons such as MMMIO aborts
- Printing unavailable hyp mode as error
- Potential spinlock deadlock in the vgic
- Avoid calling vgic vcpu free more than once
- Broken bit calculation for big endian systems
s390:
- SPDX tags
- Fence storage key accesses from problem state
- Make sure that irq_state.flags is not used in the future
x86:
- Intercept port 0x80 accesses to prevent host instability (CVE)
- Use userspace FPU context for guest FPU (mainly an optimization
that fixes a double use of kernel FPU)
- Do not leak one page per module load
- Flush APIC page address cache from MMU invalidation notifiers"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits)
KVM: x86: fix APIC page invalidation
KVM: s390: Fix skey emulation permission check
KVM: s390: mark irq_state.flags as non-usable
KVM: s390: Remove redundant license text
KVM: s390: add SPDX identifiers to the remaining files
KVM: VMX: fix page leak in hardware_setup()
KVM: VMX: remove I/O port 0x80 bypass on Intel hosts
x86,kvm: remove KVM emulator get_fpu / put_fpu
x86,kvm: move qemu/guest FPU switching out to vcpu_run
KVM: arm/arm64: Fix broken GICH_ELRSR big endian conversion
KVM: arm/arm64: kvm_arch_destroy_vm cleanups
KVM: arm/arm64: Fix spinlock acquisition in vgic_set_owner
kvm: arm: don't treat unavailable HYP mode as an error
KVM: arm/arm64: Avoid attempting to load timer vgic state without a vgic
kvm: arm64: handle single-step of hyp emulated mmio instructions
kvm: arm64: handle single-step during SError exceptions
kvm: arm64: handle single-step of userspace mmio instructions
kvm: arm64: handle single-stepping trapped instructions
KVM: arm/arm64: debug: Introduce helper for single-step
arm: KVM: Fix VTTBR_BADDR_MASK BUG_ON off-by-one
...
Commit 4675ff05de ("kmemcheck: rip it out") has removed the code but
for some reason SPDX header stayed in place. This looks like a rebase
mistake in the mmotm tree or the merge mistake. Let's drop those
leftovers as well.
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking fixes from David Miller:
1) CAN fixes from Martin Kelly (cancel URBs properly in all the CAN usb
drivers).
2) Revert returning -EEXIST from __dev_alloc_name() as this propagates
to userspace and broke some apps. From Johannes Berg.
3) Fix conn memory leaks and crashes in TIPC, from Jon Malloc and Cong
Wang.
4) Gianfar MAC can't do EEE so don't advertise it by default, from
Claudiu Manoil.
5) Relax strict netlink attribute validation, but emit a warning. From
David Ahern.
6) Fix regression in checksum offload of thunderx driver, from Florian
Westphal.
7) Fix UAPI bpf issues on s390, from Hendrik Brueckner.
8) New card support in iwlwifi, from Ihab Zhaika.
9) BBR congestion control bug fixes from Neal Cardwell.
10) Fix port stats in nfp driver, from Pieter Jansen van Vuuren.
11) Fix leaks in qualcomm rmnet, from Subash Abhinov Kasiviswanathan.
12) Fix DMA API handling in sh_eth driver, from Thomas Petazzoni.
13) Fix spurious netpoll warnings in bnxt_en, from Calvin Owens.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (67 commits)
net: mvpp2: fix the RSS table entry offset
tcp: evaluate packet losses upon RTT change
tcp: fix off-by-one bug in RACK
tcp: always evaluate losses in RACK upon undo
tcp: correctly test congestion state in RACK
bnxt_en: Fix sources of spurious netpoll warnings
tcp_bbr: reset long-term bandwidth sampling on loss recovery undo
tcp_bbr: reset full pipe detection on loss recovery undo
tcp_bbr: record "full bw reached" decision in new full_bw_reached bit
sfc: pass valid pointers from efx_enqueue_unwind
gianfar: Disable EEE autoneg by default
tcp: invalidate rate samples during SACK reneging
can: peak/pcie_fd: fix potential bug in restarting tx queue
can: usb_8dev: cancel urb on -EPIPE and -EPROTO
can: kvaser_usb: cancel urb on -EPIPE and -EPROTO
can: esd_usb2: cancel urb on -EPIPE and -EPROTO
can: ems_usb: cancel urb on -EPIPE and -EPROTO
can: mcba_usb: cancel urb on -EPROTO
usbnet: fix alignment for frames with no ethernet header
tcp: use current time in tcp_rcv_space_adjust()
...
If the machine does not support the paging mode for which the kernel was
compiled, the boot process cannot continue.
It's not possible to let the kernel detect the mismatch as it does not even
reach the point where cpu features can be evaluted due to a triple fault in
the KASLR setup.
Instead of instantaneous silent reboot, emit an error message which gives
the user the information why the boot fails.
Fixes: 77ef56e4f0 ("x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y")
Reported-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Borislav Petkov <bp@suse.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: stable@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: linux-mm@kvack.org
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20171204124059.63515-3-kirill.shutemov@linux.intel.com
Prerequisite for fixing the current problem of instantaneous reboots when a
5-level paging kernel is booted on 4-level paging hardware.
At the same time this change prepares the decompression code to boot-time
switching between 4- and 5-level paging.
[ tglx: Folded the GCC < 5 fix. ]
Fixes: 77ef56e4f0 ("x86: Enable 5-level paging support via CONFIG_X86_5LEVEL=y")
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: stable@vger.kernel.org
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: linux-mm@kvack.org
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20171204124059.63515-2-kirill.shutemov@linux.intel.com
Documentation/x86/topology.txt defines smp_num_siblings as "The number of
threads in a core". Since commit bbb65d2d36 ("x86: use cpuid vector 0xb
when available for detecting cpu topology") smp_num_siblings is the
maximum number of threads in a core. If Simultaneous MultiThreading
(SMT) is disabled on a system, smp_num_siblings is 2 and not 1 as
expected.
Use topology_max_smt_threads(), which contains the active numer of threads,
in the __max_logical_packages calculation.
On a single socket, single core, single thread system __max_smt_threads has
not been updated when the __max_logical_packages calculation happens, so its
zero which makes the package estimate fail. Initialize it to one, which is
the minimum number of threads on a core.
[ tglx: Folded the __max_smt_threads fix in ]
Fixes: b4c0a7326f ("x86/smpboot: Fix __max_logical_packages estimate")
Reported-by: Jakub Kicinski <kubakici@wp.pl>
Signed-off-by: Prarit Bhargava <prarit@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jakub Kicinski <kubakici@wp.pl>
Cc: netdev@vger.kernel.org
Cc: "netdev@vger.kernel.org"
Cc: Clark Williams <williams@redhat.com>
Link: https://lkml.kernel.org/r/20171204164521.17870-1-prarit@redhat.com
Pull misc x86 fixes from Ingo Molnar:
- make CR4 handling irq-safe, which bug vmware guests ran into
- don't crash on early IRQs in Xen guests
- don't crash secondary CPU bringup if #UD assisted WARN()ings are
triggered
- make X86_BUG_FXSAVE_LEAK optional on newer AMD CPUs that have the fix
- fix AMD Fam17h microcode loading
- fix broadcom_postcore_init() if ACPI is disabled
- fix resume regression in __restore_processor_context()
- fix Sparse warnings
- fix a GCC-8 warning
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vdso: Change time() prototype to match __vdso_time()
x86: Fix Sparse warnings about non-static functions
x86/power: Fix some ordering bugs in __restore_processor_context()
x86/PCI: Make broadcom_postcore_init() check acpi_disabled
x86/microcode/AMD: Add support for fam17h microcode loading
x86/cpufeatures: Make X86_BUG_FXSAVE_LEAK detectable in CPUID on AMD
x86/idt: Load idt early in start_secondary
x86/xen: Support early interrupts in xen pv guests
x86/tlb: Disable interrupts when changing CR4
x86/tlb: Refactor CR4 setting and shadow write
When we have a multi-socket system, each CPU core needs the same setup.
Since this is tricky to do in the fixup code, don't enable a 64bit BAR on
multi-socket systems for now.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Break the loop if we can't find some address space for a 64bit BAR.
Signed-off-by: Christian König <christian.koenig@amd.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
gcc-8 warns that time() is an alias for __vdso_time() but the two
have different prototypes:
arch/x86/entry/vdso/vclock_gettime.c:327:5: error: 'time' alias between functions of incompatible types 'int(time_t *)' {aka 'int(long int *)'} and 'time_t(time_t *)' {aka 'long int(long int *)'} [-Werror=attribute-alias]
int time(time_t *t)
^~~~
arch/x86/entry/vdso/vclock_gettime.c:318:16: note: aliased declaration here
I could not figure out whether this is intentional, but I see that
changing it to return time_t avoids the warning.
Returning 'int' from time() is also a bit questionable, as it causes an
overflow in y2038 even on 64-bit architectures that use a 64-bit time_t
type. On 32-bit architecture with 64-bit time_t, time() should always
be implement by the C library by calling a (to be added) clock_gettime()
variant that takes a sufficiently wide argument.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: http://lkml.kernel.org/r/20171204150203.852959-1-arnd@arndb.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>