These need to be migrated together, as the compat case used to
jump into the middle of the 64-bit exit code.
Remove the old assembly code.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/d4d1d70de08ac3640badf50048a9e8f18fe2497f.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In 539f511365 ("x86/asm/entry/64: Disentangle error_entry/exit
gsbase/ebx/usermode code"), I arranged the code slightly wrong
-- IRET faults would skip the code path that was intended to
execute on all error entries from user mode. Fix it up.
While we're at it, make all the labels in error_entry local.
This does not fix a bug, but we'll need it, and it slightly
shrinks the code.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/91e17891e49fa3d61357eadc451529ad48143ee1.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current x86 entry and exit code, written in a mixture of assembly and
C code, is incomprehensible due to being open-coded in a lot of places
without coherent documentation.
It appears to work primary by luck and duct tape: i.e. obvious runtime
failures were fixed on-demand, without re-thinking the design.
Due to those reasons our confidence level in that code is low, and it is
very difficult to incrementally improve.
Add new code written in C, in preparation for simply deleting the old
entry code.
prepare_exit_to_usermode() is a new function that will handle all
slow path exits to user mode. It is called with IRQs disabled
and it leaves us in a state in which it is safe to immediately
return to user mode. IRQs must not be re-enabled at any point
after prepare_exit_to_usermode() returns and user mode is actually
entered. (We can, of course, fail to enter user mode and treat
that failure as a fresh entry to kernel mode.)
All callers of do_notify_resume() will be migrated to call
prepare_exit_to_usermode() instead; prepare_exit_to_usermode() needs
to do everything that do_notify_resume() does today, but it also
takes care of scheduling and context tracking. Unlike
do_notify_resume(), it does not need to be called in a loop.
syscall_return_slowpath() is exactly what it sounds like: it will
be called on any syscall exit slow path. It will replace
syscall_trace_leave() and it calls prepare_exit_to_usermode() on the
way out.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/c57c8b87661a4152801d7d3786eac2d1a2f209dd.1435952415.git.luto@kernel.org
[ Improved the changelog a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Changing the x86 context tracking hooks is dangerous because
there are no good checks that we track our context correctly.
Add a helper to check that we're actually in CONTEXT_USER when
we enter from user mode and wire it up for syscall entries.
Subsequent patches will wire this up for all non-NMI entries as
well. NMIs are their own special beast and cannot currently
switch overall context tracking state. Instead, they have their
own special RCU hooks.
This is a tiny speedup if !CONFIG_CONTEXT_TRACKING (removes a
branch) and a tiny slowdown if CONFIG_CONTEXT_TRACING (adds a
layer of indirection). Eventually, we should fix up the core
context tracking code to supply a function that does what we
want (and can be much simpler than user_exit), which will enable
us to get rid of the extra call.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/853b42420066ec3fb856779cdc223a6dcb5d355b.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Other than the super-atomic exception entries, all exception
entries are supposed to switch our context tracking state to
CONTEXT_KERNEL. Assert that they do. These assertions appear
trivial at this point, as exception_enter() is the function
responsible for switching context, but I'm planning on reworking
x86's exception context tracking, and these assertions will help
make sure that all of this code keeps working.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/20fa1ee2d943233a184aaf96ff75394d3b34dfba.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The entry and exit C helpers were confusingly scattered between
ptrace.c and signal.c, even though they aren't specific to
ptrace or signal handling. Move them together in a new file.
This change just moves code around. It doesn't change anything.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/324d686821266544d8572423cc281f961da445f4.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If user code does SYSCALL32 or SYSENTER without a valid stack,
then our attempt to determine the syscall args will result in a
failed uaccess fault. Previously, we would try to recover by
jumping to the syscall exit code, but we'd run the syscall exit
work even though we never made it to the syscall entry work.
Clean it up by treating the failure path as a non-syscall entry
and exit pair.
This fixes strace's output when running the syscall_arg_fault
test. Without this fix, strace would get out of sync and would
fail to associate syscall entries with syscall exits.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/903010762c07a3d67df914fea2da84b52b0f8f1d.1435952415.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is useful for reporting various addresses or other values
while debugging early boot, for example, the recent kernel image
size vs kernel run size. For example, when
CONFIG_X86_VERBOSE_BOOTUP is set, this is now visible at boot
time:
early console in setup code
early console in decompress_kernel
input_data: 0x0000000001e1526e
input_len: 0x0000000000732236
output: 0x0000000001000000
output_len: 0x0000000001535640
run_size: 0x00000000021fb000
KASLR using RDTSC...
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@suse.de>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Joe Perches <joe@perches.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Junjie Mao <eternal.n08@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20150706230620.GA17501@www.outflux.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is specific driver for Intel Atom SoCs like BayTrail and
Braswell. Let's move it to dedicated folder and alleviate a
arch/x86/kernel burden.
There is no functional change.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Kumar P Mahesh <mahesh.kumar.p@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1436192944-56496-6-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The patch adds CHT PMC interface. This exposes all the South IP
device power states and S0ix states for CHT. The bit map of
FUNC_DIS and D3_STS_0 registers for SoCs are consistent. The
D3_STS_1 and FUNC_DIS_2 registers, however, are not aligned.
This is fixed by splitting a common mapping on per register basis.
(Originally based on code from Kumar P Mahesh.)
Originally-from: Kumar P Mahesh <mahesh.kumar.p@intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1436192944-56496-5-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The patch converts the functions to use the register mappings
provided by PMC object. It would help in case of mappings on
different platforms.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Kumar P Mahesh <mahesh.kumar.p@intel.com>
Link: http://lkml.kernel.org/r/1436192944-56496-4-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The register mapping may change from one platform to another.
Thus, indices might be not the same on different platforms. The
patch makes the code to print the device index dynamically at
run time.
The patch also changes the for loop to iterate over the map
until a terminator is found.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Kumar P Mahesh <mahesh.kumar.p@intel.com>
Link: http://lkml.kernel.org/r/1436192944-56496-3-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Export the pmc_atom_read() and pmc_atom_write() accessors to the PMC
registers. On early initcall stages the functions will return
-ENODEV, and caller has to wait when it will be available.
Additionally make absence of debugfs a non-fatal error.
The patch will be useful for the upcoming fixes regarding to the
LPSS block found on Intel BayTrail-T and Braswell.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aubrey Li <aubrey.li@linux.intel.com>
Cc: Kumar P Mahesh <mahesh.kumar.p@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1436192944-56496-2-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When I enable early_printk on a kernel, I cut and paste the
console= input and add to earlyprintk parameter. But I notice
recently that ktest has not been detecting triple faults. The
way it detects it, is by seeing the kernel banner "Linux version
.." with a different kernel version pop up. Then I noticed that
early printk was no longer working on my console, which was why
ktest was not seeing it.
I bisected it down and it was added to 4.0 with this commit:
ea9e9d8029 ("Specify PCI based UART for earlyprintk")
because it converted the simple_strtoul() that converts the baud
number into a kstrtoul(). The problem with this is, I had as my
baud rate, 115200n8 (acceptable for console=ttyS0), but because
of the "n8", the kstrtoul() doesn't parse the baud rate and
returns an error, which sets the baud rate to the default 9600.
This explains the garbage on my screen.
Now, earlyprintk= kernel parameter does not say it accepts that
format. Thus, one answer would simply be me changing my kernel
parameters to remove the "n8" since it isn't parsed anyway. But
I wonder if other people run into this, and it seems strange
that the two consoles for serial accepts different input.
I could also extend this to have earlyprintk do something with
that "n8" or whatever it has and have it match the console
parsing (which, BTW, still uses simple_strtoul(), as I guess it
has to).
This patch just makes my old kernel parameter parsing work like
it use to.
Although, simple_strtoull() is considered obsolete, it is the
only standard string parsing function that parses a number that
is attached to text. Ironically, commit ea9e9d8029 also added
several calls to simple_strtoul()!
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stuart R. Anderson <stuart.r.anderson@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150706101434.5f6a351b@gandalf.local.home
[ Cleaned it up a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The x32 ABI is now independent of the ia32 compat ABI. Common
code is now conditional on CONFIG_COMPAT, but unshared code like
syscall entry, signal handling, and the VDSO are under separate
config options.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-13-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x32 does not need CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-11-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Including sys_ia32.h is not needed in signal.c.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-10-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_callchain_user32() is not needed for x32.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-9-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change this to CONFIG_COMPAT so both 32-bit compat and x32 will
do the check.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-8-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Build the 32-bit vdso only for native 32-bit or 32-bit compat is
enabled. x32 should not force it to build.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-7-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the ia32-specific code in compat_arch_ptrace() into its
own function.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-6-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This function is shared between the 32-bit compat and x32 ABIs.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-5-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
ia32.h should only contain the code for 32-bit compatability.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-4-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
TIF_ADDR32 is set for both ia32 and x32 tasks, so change from
CONFIG_IA32_EMULATION to CONFIG_COMPAT. Use config_enabled()
to make the function more readable.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-3-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
copy_siginfo_to_user32() and copy_siginfo_from_user32() are used
by both the 32-bit compat and x32 ABIs. Move them to
signal_compat.c.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434974121-32575-2-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before, the code to do RDTSC looked like:
rdtsc
shl $0x20, %rdx
mov %eax, %eax
or %rdx, %rax
The "mov %eax, %eax" is required to clear the high 32 bits of RAX.
By declaring low and high as 64-bit variables, the code is
simplified to:
rdtsc
shl $0x20,%rdx
or %rdx,%rax
Yes, it's a 2-byte instruction that's not on a critical path,
but there are principles to be upheld.
Every user of EAX_EDX_RET has been checked. I tried to check
users of EAX_EDX_ARGS, but there weren't any, so I deleted it to
be safe.
( There's no benefit to making "high" 64 bits, but it was the
simplest way to proceed. )
Signed-off-by: George Spelvin <linux@horizon.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: jacob.jun.pan@linux.intel.com
Link: http://lkml.kernel.org/r/20150618075906.4615.qmail@ns.horizon.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All callers have been converted to rdtsc_ordered().
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Weinberger <richard@nod.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/9baa4ae9a1e7c7c282f9cb2f15bb6bf5c2004032.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__pvclock_read_cycles() used to have two barriers, one of which was unnecessary,
which got removed after an initial version of this patch was sent.
But the barrier is still open-coded unnecessarily - get rid of
that barrier and clean up the code by just using rdtsc_ordered().
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Radim Krcmar <rkrcmar@redhat.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/678981cc4761fb38a793c217c9cac42503cf3719.1434501121.git.luto@kernel.org
[ Ported it to v4.2-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two logical changes here. First, this removes a check
for cpu_has_tsc. That check is unnecessary, as we don't
register the TSC as a clocksource on systems that have no TSC.
Second, it adds a barrier, thus preventing observable
non-monotonicity.
I suspect that the missing barrier was never a problem in
practice because system calls themselves were heavy enough
barriers to prevent user code from observing time warps due to
speculation. (Without the corresponding barrier in the vDSO,
however, non-monotonicity is easy to detect.)
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/c6ff621a053127a65b70f175443578db7a0711be.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Using get_cycles was unnecessary: check_tsc_warp() is not called
on TSC-less systems. Replace rdtsc_barrier(); get_cycles() with
rdtsc_ordered().
While we're at it, make the somewhat more dangerous change of
removing barrier_before_rdtsc after RDTSC in the TSC warp check
code. This should be okay, though -- the vDSO TSC code doesn't
have that barrier, so, if removing the barrier from the warp
check would cause us to detect a warp that we otherwise wouldn't
detect, then we have a genuine bug.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/387c4c3a75f875bcde6cd68cee013273a744f364.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
rdtsc_barrier(); rdtsc() is an unnecessary mouthful and requires
more thought than should be necessary. Add an rdtsc_ordered()
helper and replace the trivial call sites with it.
This should not change generated code. The duplication of the
fence asm is temporary.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/dddbf98a2af53312e9aa73a5a2b1622fe5d6f52b.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there is no paravirt TSC, the "native" is
inappropriate. The function does RDTSC, so give it the obvious
name: rdtsc().
Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/fd43e16281991f096c1e4d21574d9e1402c62d39.1434501121.git.luto@kernel.org
[ Ported it to v4.2-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It has no more callers, and it was never a very sensible
interface to begin with. Users of the TSC should either read all
64 bits or explicitly throw out the high bits.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/250105f7cee519be9d7fc4464b5784caafc8f4fe.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This code is timing 100k indirect calls, so the added overhead
of counting the number of cycles elapsed as a 64-bit number
should be insignificant. Drop the optimization of using a
32-bit count.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/d58f339a9c0dd8352b50d2f7a216f67ec2844f20.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As a very minor optimization, delay_tsc() was only using the low
32 bits of the TSC. It's a delay function, so just use the whole
thing.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/bd1a277c71321b67c4794970cb5ace05efe21ab6.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
They have no users. Leave native_read_tscp() which seems
potentially useful despite also having no callers.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/6abfa3ef80534b5d73898a48c4d25e069303cbe5.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the ->read_tsc() paravirt hook is gone, rdtscll() is
just a wrapper around native_read_tsc(). Unwrap it.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/d2449ae62c1b1fb90195bcfb19ef4a35883a04dc.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We've had ->read_tsc() and ->read_tscp() paravirt hooks since
the very beginning of paravirt, i.e.,
d3561b7fa0 ("[PATCH] paravirt: header and stubs for paravirtualisation").
AFAICT, the only paravirt guest implementation that ever
replaced these calls was vmware, and it's gone. Arguably even
vmware shouldn't have hooked RDTSC -- we fully support systems
that don't have a TSC at all, so there's no point for a paravirt
implementation to pretend that we have a TSC but to replace it.
I also doubt that these hooks actually worked. Calls to rdtscl()
and rdtscll(), which respected the hooks, were used seemingly
interchangeably with native_read_tsc(), which did not.
Just remove them. If anyone ever needs them again, they can try
to make a case for why they need them.
Before, on a paravirt config:
text data bss dec hex filename
12618257 1816384 1093632 15528273 ecf151 vmlinux
After:
text data bss dec hex filename
12617207 1816384 1093632 15527223 eced37 vmlinux
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Cc: virtualization@lists.linux-foundation.org
Link: http://lkml.kernel.org/r/d08a2600fb298af163681e5efd8e599d889a5b97.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The only caller was KVM's read_tsc(). The only difference
between vget_cycles() and native_read_tsc() was that
vget_cycles() returned zero instead of crashing on TSC-less
systems. KVM already checks vclock_mode() before calling that
function, so the extra check is unnecessary. Also, KVM
(host-side) requires the TSC to exist.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/20615df14ae2eb713ea7a5f5123c1dc4c7ca993d.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the following commit:
cdc7957d19 ("x86: move native_read_tsc() offline")
... native_read_tsc() was moved out of line, presumably for some
now-obsolete vDSO-related reason. Undo it.
The entire rdtsc, shl, or sequence is only 11 bytes, and calls
via rdtscl() and similar helpers were already inlined.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Huang Rui <ray.huang@amd.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kvm ML <kvm@vger.kernel.org>
Link: http://lkml.kernel.org/r/d05ffe2aaf8468ca475ebc00efad7b2fa174af19.1434501121.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As we alloc pages with GFP_KERNEL in init_espfix_ap() which is
called before we enable local irqs, so the lockdep sub-system
would (correctly) trigger a warning about the potentially
blocking API.
So we allocate them on the boot CPU side when the secondary CPU is
brought up by the boot CPU, and hand them over to the secondary
CPU.
And we use alloc_pages_node() with the secondary CPU's node, to
make sure the espfix stack is NUMA-local to the CPU that is
going to use it.
Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Cc: <bp@alien8.de>
Cc: <luto@amacapital.net>
Cc: <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c97add2670e9abebb90095369f0cfc172373ac94.1435824469.git.zhugh.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add a CPU index parameter to init_espfix_ap(), so that the
parameter could be propagated to the function for espfix
page allocation.
Signed-off-by: Zhu Guihua <zhugh.fnst@cn.fujitsu.com>
Cc: <bp@alien8.de>
Cc: <luto@amacapital.net>
Cc: <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/cde3fcf1b3211f3f03feb1a995bce3fee850f0fc.1435824469.git.zhugh.fnst@cn.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
KASAN_SHADOW_OFFSET is purely arch specific setting,
so it should be in arch's Kconfig file.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435828178-10975-7-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While populating zero shadow wrong bits in upper level page
tables used. __PAGE_KERNEL_RO that was used for pgd/pud/pmd has
_PAGE_BIT_GLOBAL set. Global bit is present only in the lowest
level of the page translation hierarchy (ptes), and it should be
zero in upper levels.
This bug seems doesn't cause any troubles on Intel cpus, while
on AMDs it cause kernel crash on boot.
Use _KERNPG_TABLE bits for pgds/puds/pmds to fix this.
Reported-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 4.0+
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435828178-10975-5-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
load_cr3() doesn't cause tlb_flush if PGE enabled.
This may cause tons of false positive reports spamming the
kernel to death.
To fix this __flush_tlb_all() should be called explicitly
after CR3 changed.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 4.0+
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435828178-10975-4-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently KASAN shadow region page tables created without
respect of physical offset (phys_base). This causes kernel halt
when phys_base is not zero.
So let's initialize KASAN shadow region page tables in
kasan_early_init() using __pa_nodebug() which considers
phys_base.
This patch also separates x86_64_start_kernel() from KASAN low
level details by moving kasan_map_early_shadow(init_level4_pgt)
into kasan_early_init().
Remove the comment before clear_bss() which stopped bringing
much profit to the code readability. Otherwise describing all
the new order dependencies would be too verbose.
Signed-off-by: Alexander Popov <alpopov@ptsecurity.com>
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 4.0+
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435828178-10975-3-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently x86_64_start_kernel() has two KASAN related
function calls. The first call maps shadow to early_level4_pgt,
the second maps shadow to init_level4_pgt.
If we move clear_page(init_level4_pgt) earlier, we could hide
KASAN low level detail from generic x86_64 initialization code.
The next patch will do it.
Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Cc: <stable@vger.kernel.org> # 4.0+
Cc: Alexander Popov <alpopov@ptsecurity.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <adech.fo@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435828178-10975-2-git-send-email-a.ryabinin@samsung.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To sync up with the naming convention used in qspinlock, all the
qrwlock functions were renamed to started with "queued" instead of
"queue".
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1434729002-57724-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 0a196848ca ("perf: Fix arch_perf_out_copy_user default"),
changes copy_from_user_nmi() to return the number of
remaining bytes so that it behave like copy_from_user().
Unfortunately, when the range is outside of the process
memory, the return value is still the number of byte
copied, eg. 0, instead of the remaining bytes.
As all users of copy_from_user_nmi() were modified as
part of commit 0a196848ca, the function should be
fixed to return the total number of bytes if range is
not correct.
Signed-off-by: Yann Droneaud <ydroneaud@opteya.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1435001923-30986-1-git-send-email-ydroneaud@opteya.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If it takes longer than 12us to read the PIT counter lsb/msb,
then the error margin will never fall below 500ppm within 50ms,
and Fast TSC calibration will always fail.
This patch detects when that will happen and fails fast. Note
the failure message is not printed in that case because:
1. it will always happen on that class of hardware
2. the absence of the message is more informative than its
presence
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/556EB717.9070607@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
A new intel_pmc_ipc driver, a symmetrical allocation and free fix in
dell-laptop, a couple minor fixes, and some updated documentation in the
dell-laptop comments.
intel_pmc_ipc:
- Add Intel Apollo Lake PMC IPC driver
tc1100-wmi:
- Delete an unnecessary check before the function call "kfree"
dell-laptop:
- Fix allocating & freeing SMI buffer page
- Show info about WiGig and UWB in debugfs
- Update information about wireless control
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVmM8aAAoJEKbMaAwKp364iUkH/jihOduWkDTzzzxRP2Dv2nEh
qyvE94Nc9A9dl87C2+II/Pi1s8h4CJOQpl70syYYPc4FdF70hpvP8TbHkgCWrY/d
F8CoS9L9keviMtGOWlbEL9hBjfSDNwTMESTrDxrwhA04TSAwjDmXhhiUOF5FjFJm
CX5+ZQ3iXEH6KsENR+Er54J9+6WKE6IuRcnnKCapnPQ8cEYeVn+WEPyzHCOy8Pg3
xzzUar3/knS2VMIb5eIVpaKFvD9P9qBsC/gQ0pk1Y+686gwQZMVURDv8lw8hfXpx
TJDOXk21P8WbSH1r+jwax5wLjLge7vJtYG2Deye6MUgvSgg+O2tSVCv9SMQR088=
=WUgr
-----END PGP SIGNATURE-----
Merge tag 'platform-drivers-x86-v4.2-2' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86
Pull late x86 platform driver updates from Darren Hart:
"The following came in a bit later and I wanted them to bake in next a
few more days before submitting, thus the second pull.
A new intel_pmc_ipc driver, a symmetrical allocation and free fix in
dell-laptop, a couple minor fixes, and some updated documentation in
the dell-laptop comments.
intel_pmc_ipc:
- Add Intel Apollo Lake PMC IPC driver
tc1100-wmi:
- Delete an unnecessary check before the function call "kfree"
dell-laptop:
- Fix allocating & freeing SMI buffer page
- Show info about WiGig and UWB in debugfs
- Update information about wireless control"
* tag 'platform-drivers-x86-v4.2-2' of git://git.infradead.org/users/dvhart/linux-platform-drivers-x86:
intel_pmc_ipc: Add Intel Apollo Lake PMC IPC driver
tc1100-wmi: Delete an unnecessary check before the function call "kfree"
dell-laptop: Fix allocating & freeing SMI buffer page
dell-laptop: Show info about WiGig and UWB in debugfs
dell-laptop: Update information about wireless control
that could have been waited for -rc2. Sending them now since I
was taking care of Peter's patch anyway.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVmB5pAAoJEL/70l94x66DpLAH/A0p2HICsG5Qw3gnI3NxAmK4
YUvtMx0d67mFXPg0kuYRMO7C2Is6XHKtnmsX8oqkg3JTRFfn7XYqlwvrrK3Be08U
tGvhigneJTGDXwU74jyik+D6VLmyJP3CxEvXM3d9AFyy7Ro9Grxx0Ja8c9cmKGQE
esCwNAEJOcqaQMtNIix3WtXifOVFr40NZlbAawsMyxVw8LZK/K5maXyUTRDI57Qn
B1wbTN1KD847/0rLrit+8VlsGEZBorUgCFhueeYGy/7EdiY0bNkzhLWb4erlWnRq
ZlKzsLdfXmEg2CEepaHCm5jlLfIurgbLfoV1tzQ5jAuj/SHmUxq+k3lZZYTYA3w=
=vDKM
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Except for the preempt notifiers fix, these are all small bugfixes
that could have been waited for -rc2. Sending them now since I was
taking care of Peter's patch anyway"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm: add hyper-v crash msrs values
KVM: x86: remove data variable from kvm_get_msr_common
KVM: s390: virtio-ccw: don't overwrite config space values
KVM: x86: keep track of LVT0 changes under APICv
KVM: x86: properly restore LVT0
KVM: x86: make vapics_in_nmi_mode atomic
sched, preempt_notifier: separate notifier registration from static_key inc/dec
Pull x86 fixes from Ingo Molnar:
"Two FPU rewrite related fixes. This addresses all known x86
regressions at this stage. Also some other misc fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Fix boot crash in the early FPU code
x86/asm/entry/64: Update path names
x86/fpu: Fix FPU related boot regression when CPUID masking BIOS feature is enabled
x86/boot/setup: Clean up the e820_reserve_setup_data() code
x86/kaslr: Fix typo in the KASLR_FLAG documentation
Pull perf updates from Ingo Molnar:
"This tree includes an x86 PMU scheduling fix, but most changes are
late breaking tooling fixes and updates:
User visible fixes:
- Create config.detected into OUTPUT directory, fixing parallel
builds sharing the same source directory (Aaro Kiskinen)
- Allow to specify custom linker command, fixing some MIPS64 builds.
(Aaro Kiskinen)
- Fix to show proper convergence stats in 'perf bench numa' (Srikar
Dronamraju)
User visible changes:
- Validate syscall list passed via -e argument to 'perf trace'.
(Arnaldo Carvalho de Melo)
- Introduce 'perf stat --per-thread' (Jiri Olsa)
- Check access permission for --kallsyms and --vmlinux (Li Zhang)
- Move toggling event logic from 'perf top' and into hists browser,
allowing freeze/unfreeze with event lists with more than one entry
(Namhyung Kim)
- Add missing newlines when dumping PERF_RECORD_FINISHED_ROUND and
showing the Aggregated stats in 'perf report -D' (Adrian Hunter)
Infrastructure fixes:
- Add missing break for PERF_RECORD_ITRACE_START, which caused those
events samples to be parsed as well as PERF_RECORD_LOST_SAMPLES.
ITRACE_START only appears when Intel PT or BTS are present, so..
(Jiri Olsa)
- Call the perf_session destructor when bailing out in the inject,
kmem, report, kvm and mem tools (Taeung Song)
Infrastructure changes:
- Move stuff out of 'perf stat' and into the lib for further use
(Jiri Olsa)
- Reference count the cpu_map and thread_map classes (Jiri Olsa)
- Set evsel->{cpus,threads} from the evlist, if not set, allowing the
generalization of some 'perf stat' functions that previously were
accessing private static evlist variable (Jiri Olsa)
- Delete an unnecessary check before the calling free_event_desc()
(Markus Elfring)
- Allow auxtrace data alignment (Adrian Hunter)
- Allow events with dot (Andi Kleen)
- Fix failure to 'perf probe' events on arm (He Kuang)
- Add testing for Makefile.perf (Jiri Olsa)
- Add test for make install with prefix (Jiri Olsa)
- Fix single target build dependency check (Jiri Olsa)
- Access thread_map entries via accessors, prep patch to hold more
info per entry, for ongoing 'perf stat --per-thread' work (Jiri
Olsa)
- Use __weak definition from compiler.h (Sukadev Bhattiprolu)
- Split perf_pmu__new_alias() (Sukadev Bhattiprolu)"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (54 commits)
perf tools: Allow to specify custom linker command
perf tools: Create config.detected into OUTPUT directory
perf mem: Fill in the missing session freeing after an error occurs
perf kvm: Fill in the missing session freeing after an error occurs
perf report: Fill in the missing session freeing after an error occurs
perf kmem: Fill in the missing session freeing after an error occurs
perf inject: Fill in the missing session freeing after an error occurs
perf tools: Add missing break for PERF_RECORD_ITRACE_START
perf/x86: Fix 'active_events' imbalance
perf symbols: Check access permission when reading symbol files
perf stat: Introduce --per-thread option
perf stat: Introduce print_counters function
perf stat: Using init_stats instead of memset
perf stat: Rename print_interval to process_interval
perf stat: Remove perf_evsel__read_cb function
perf stat: Move perf_stat initialization counter process code
perf stat: Move zero_per_pkg into counter process code
perf stat: Separate counters reading and processing
perf stat: Introduce read_counters function
perf stat: Introduce perf_evsel__read function
...
Jan Kara and Thomas Gleixner reported boot crashes in the FPU
code:
general protection fault: 0000 [#1] SMP
RIP: 0010:[<ffffffff81048a6c>] [<ffffffff81048a6c>] mxcsr_feature_mask_init+0x1c/0x40
2b:* 0f ae 85 00 fe ff ff fxsave -0x200(%rbp)
and bisected it down to the following FPU commit:
91a8c2a5b4 ("x86/fpu: Clean up and fix MXCSR handling")
The reason is that the on-stack FPU registers state variable,
used by the FXSAVE instruction, did not have the required
minimum alignment of 16 bytes, causing the general protection
fault.
This is most likely a GCC bug in older GCC versions, but the
offending commit also added a bogus extra 32-byte alignment
(which GCC ignored too).
So fix this bug by making the variable static again, but also
mark it __initdata this time, because fpu__init_system_mxcsr()
is now an __init function.
Reported-and-bisected-by: Jan Kara <jack@suse.cz>
Reported-bisected-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150704075819.GA9201@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 609e36d372 ("KVM: x86: pass host_initiated to functions that
read MSRs") modified kvm_get_msr_common function to use msr_info->data
instead of data but missed one occurrence. Replace it and remove the
unused local variable.
Fixes: 609e36d372 ("KVM: x86: pass host_initiated to functions that
read MSRs")
Signed-off-by: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Memory-mapped LVT0 register already contains the new value when APICv
traps so we can't directly detect a change.
Memorize a bit we are interested in to enable legacy NMI watchdog.
Suggested-by: Yoshida Nobuo <yoshida.nb@ncos.nec.co.jp>
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Legacy NMI watchdog didn't work after migration/resume, because
vapics_in_nmi_mode was left at 0.
Cc: stable@vger.kernel.org
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Writes were a bit racy, but hard to turn into a bug at the same time.
(Particularly because modern Linux doesn't use this feature anymore.)
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
[Actually the next patch makes it much, much easier to trigger the race
so I'm including this one for stable@ as well. - Paolo]
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVkPNDAAoJEOvOhAQsB9HWTNwP/1xtv8s2f7dY1JOV9T3oad7K
FJYOnFRu1CbXqtOGgJQlsY5eUc3liC+UEkqMFmvX008GIoIGi/aq1alzM4ySlu45
c8QttAS9aFFHwsNQUFA8rNN2Lz1xmhKi3ovc/+BBN9stgX0W0fJHX8A7TYtBsVFa
YqfkNP/4XGH+Taz4B7Id6Mv3RJfB+9TWMlHJ4oKl1NhT+fU+Ce2888K7y5llHGIz
Y9yDt7hMUv/7ysOpiHbvSKy3XnitTNx9JbN8CDQV22krpgsU1k0nYloxOVj5K0h0
vxcjpQ1Wmjlc7RO826tciMi3ZD880GK5n8NHuI87d/N/egXRP0Tsy1iy9eGK0R7i
udXR2y4RP5gD7SPuMJCUCrBTxkfp+rxQ775Keo/R9r4v/KzpKX6e0LcEDjiLsk88
5UHUZNdPgXxw85O354QwX05jAucPIs6Eq8PR324F+R+FU8x5EI6GWtFts0K4YI7j
ebsgaQR/aqvRlr859iJBFGBwEu0YWcbkVb6kKdMSjE4x0a3YxhFe6aXXll0g+iIZ
wGR54nRpBUUvh+qqlrSFTc3VA4f1KPdhylcfEmfSH2iNjARvDR61vzkLW1Nt6u0I
aM6ZYcfbGhGHt+pycqe6LAydS3qRyWDA6QTu6+TFZid/Ay6NBEI+Ubbx+eLNf8vr
+trFtqFvEfIMuT1BvOXo
=TR34
-----END PGP SIGNATURE-----
Merge tag 'module-misc-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
Pull init.h/module.h fragility fixes from Paul Gortmaker:
"Fixup various init.h misuses that are fragile wrt code moving to
module.h
What started as a removal of no longer required include <linux/init.h>
due to the earlier __cpuinit and __devinit removal led to the
observation that some module specfic support was living in init.h
itself, thus preventing the full removal from introducing compile
regressions.
This series includes a few final fixups needed prior to the relocation
of the modular init code from <init.h> to <module.h>. These are
things that weren't easily categorized into any of the other previous
series categories already requested for pull.
That said, each fixup branch (including this one) is independent and
there are no ordering constraints. Only the final code relocation
(which is NOT in this pull) requires that all my cleanup branches be
merged first"
* tag 'module-misc-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
tile: add init.h to usb.c to avoid compile failure
arm: fix implicit #include <linux/init.h> in entry asm.
x86: replace __init_or_module with __init in non-modular vsmp_64.c
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVkO6nAAoJEOvOhAQsB9HWpHMP/Aknc+lmX2dZeIn96gdkP+UK
1qL24C5oq2sm/9yTZLdoXbyApLaaTbAJHS9O4kolaOU6uOs3JrgtXqL1697PVp1R
qV4f4DOzXmmEHaE2oO21afAri3tXIVQNqA2NQl2TmKfwz0Atu01Vj5RJPu/ZOBPl
dONXcFnE6nO2p7AEFRP/GfDZwkng4xALyZPhwL7tJDAeGaBpqG/n2hCuq+Szn9g8
wjTFACBdad/mRrYsL6YsWZ1e+LKI8vsArQbdPTam+jPaEUlK7yjFReFKCJVzL2JP
xfQoTcCgFztzTUV0JTGR9sqeYA3WH9AkJOFDxNE/eIili4xiTh789WbEpHLVECSX
1LsW025I3DkRWBPT4L+9ZP805ha71kNXDFc5N3XJkzrCYaFvD2BgsUzxi6FXj7aC
9lEVKt6xO04FFG5SwTKnO0f8PEhPemZH3BDnVvjBDWQYLjUcPSNz7bfyHUhif0G5
ulOGVB0ncJJF9iP8PyZs1RA/F8kKxXWnhYMIHzvl0f0vLUA7rAKsACnhBgq8s9ZQ
uM5YjzU91Z/4pe5C2E5MmQIZ84b79ZPsee1lF0GJdjK5W3PDvnCjIdXfQ5M/f3S8
76cssXWNhS78/P+19YqirLeb0u7Zw0jf73m9t9ywRgcByWfY5ZUDm0DFpQnWKkoR
QY/aFO/yHKTO3VHj8Ril
=KDJO
-----END PGP SIGNATURE-----
Merge tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
Pull module_init replacement part two from Paul Gortmaker:
"Replace module_init with appropriate alternate initcall in non
modules.
This series converts non-modular code that is using the module_init()
call to hook itself into the system to instead use one of our
alternate priority initcalls.
Unlike the previous series that used device_initcall and hence was a
runtime no-op, these commits change to one of the alternate initcalls,
because (a) we have them and (b) it seems like the right thing to do.
For example, it would seem logical to use arch_initcall for arch
specific setup code and fs_initcall for filesystem setup code.
This does mean however, that changes in the init ordering will be
taking place, and so there is a small risk that some kind of implicit
init ordering issue may lie uncovered. But I think it is still better
to give these ones sensible priorities than to just assign them all to
device_initcall in order to exactly preserve the old ordering.
Thad said, we have already made similar changes in core kernel code in
commit c96d6660dc ("kernel: audit/fix non-modular users of
module_init in core code") without any regressions reported, so this
type of change isn't without precedent. It has also got the same
local testing and linux-next coverage as all the other pull requests
that I'm sending for this merge window have got.
Once again, there is an unused module_exit function removal that shows
up as an outlier upon casual inspection of the diffstat"
* tag 'module_init-alternate_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
x86: perf_event_intel_pt.c: use arch_initcall to hook in enabling
x86: perf_event_intel_bts.c: use arch_initcall to hook in enabling
mm/page_owner.c: use late_initcall to hook in enabling
lib/list_sort: use late_initcall to hook in self tests
arm: use subsys_initcall in non-modular pl320 IPC code
powerpc: don't use module_init for non-modular core hugetlb code
powerpc: use subsys_initcall for Freescale Local Bus
x86: don't use module_init for non-modular core bootflag code
netfilter: don't use module_init/exit in core IPV4 code
fs/notify: don't use module_init for non-modular inotify_user code
mm: replace module_init usages with subsys_initcall in nommu.c
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVkO5XAAoJEOvOhAQsB9HWe4cQAJcsmSXIDN2O6oxvgH8Wilof
EIEMvT13uwBdsjQdYUY6A6B3iUV9wzEEgoosg/JRgpz5/b1FTDMIO4arUPD3Lcak
5bmyVO2qAT+yaLAWSgn6I8DMplXrKiEuK+TkH/mW3p9TdvElLdG3Vg6UI407hSWv
W0QbVwkNtv8XmzshV9F2YdmflT8j1PgYxIu/tEkVOWn37DNW+Fp2OVBrdTIYp3AJ
X6bYZPEcQDCrWWW/s2GmIDrNgryiebasns+CAgGY21262jAYaRcFOJmR47AsTqW7
DSZXIlLc/gJca++hfxqV15RZ4NRHxrebCypTsPtZUV7ZiYHI726eeUZzxsp/9itu
mvhmi+aQUTTUP3dDhiv05f4syAKEb4zslT6SLwcna6oi09M97HfCeQsHqhcFq/MG
KnS2JJoJQToQtJvMUXMQzp5hyHjNlOclIvCxEiL32EZU54PeJOKasy/mptNGEctk
TxACWvoXBQglRaVN+1wIjjr0BaHJSuJa9CUnIfM4WZdSHiMQMx00XLTkZcTnSM6R
12pE54vVolrXswGPJhy4W/Sf1yPSW1tkWSVBbkKLyCIrlAWJtu68rXhvwhG/nz6E
3g6QrDEQGlk6bzUH4CJCEqXLPRN1bNS0XjdkEFh60Lury3Ns5yHKZXPW5vCQ5csr
FQNUyBs595CWbJNfbn1n
=0BDx
-----END PGP SIGNATURE-----
Merge tag 'module_init-device_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
Pull module_init replacement part one from Paul Gortmaker:
"Replace module_init with equivalent device_initcall in non modules.
This series of commits converts non-modular code that is using the
module_init() call to hook itself into the system to instead use
device_initcall().
The conversion is a runtime no-op, since module_init actually becomes
__initcall in the non-modular case, and that in turn gets mapped onto
device_initcall. A couple files show a larger negative diffstat,
representing ones that had a module_exit function that we remove here
vs previously relying on the linker to dispose of it.
We make this conversion now, so that we can relocate module_init from
init.h into module.h in the future.
The files changed here are just limited to those that would otherwise
have to add module.h to obviously non-modular code, in order to avoid
a compile fail, as testing has shown"
* tag 'module_init-device_initcall-v4.1-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
MIPS: don't use module_init in non-modular cobalt/mtd.c file
drivers/leds: don't use module_init in non-modular leds-cobalt-raq.c
cris: don't use module_init for non-modular core eeprom.c code
tty/metag_da: Avoid module_init/module_exit in non-modular code
drivers/clk: don't use module_init in clk-nomadik.c which is non-modular
xtensa: don't use module_init for non-modular core network.c code
sh: don't use module_init in non-modular psw.c code
mn10300: don't use module_init in non-modular flash.c code
parisc64: don't use module_init for non-modular core perf code
parisc: don't use module_init for non-modular core pdc_cons code
cris: don't use module_init for non-modular core intmem.c code
ia64: don't use module_init in non-modular sim/simscsi.c code
ia64: don't use module_init for non-modular core kernel/mca.c code
arm: don't use module_init in non-modular mach-vexpress/spc.c code
powerpc: don't use module_init in non-modular 83xx suspend code
powerpc: use device_initcall for registering rtc devices
x86: don't use module_init in non-modular devicetree.c code
x86: don't use module_init in non-modular intel_mid_vrtc.c
Merge third patchbomb from Andrew Morton:
- the rest of MM
- scripts/gdb updates
- ipc/ updates
- lib/ updates
- MAINTAINERS updates
- various other misc things
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (67 commits)
genalloc: rename of_get_named_gen_pool() to of_gen_pool_get()
genalloc: rename dev_get_gen_pool() to gen_pool_get()
x86: opt into HAVE_COPY_THREAD_TLS, for both 32-bit and 64-bit
MAINTAINERS: add zpool
MAINTAINERS: BCACHE: Kent Overstreet has changed email address
MAINTAINERS: move Jens Osterkamp to CREDITS
MAINTAINERS: remove unused nbd.h pattern
MAINTAINERS: update brcm gpio filename pattern
MAINTAINERS: update brcm dts pattern
MAINTAINERS: update sound soc intel patterns
MAINTAINERS: remove website for paride
MAINTAINERS: update Emulex ocrdma email addresses
bcache: use kvfree() in various places
libcxgbi: use kvfree() in cxgbi_free_big_mem()
target: use kvfree() in session alloc and free
IB/ehca: use kvfree() in ipz_queue_{cd}tor()
drm/nouveau/gem: use kvfree() in u_free()
drm: use kvfree() in drm_free_large()
cxgb4: use kvfree() in t4_free_mem()
cxgb3: use kvfree() in cxgb_free_mem()
...
Pull crypto fixes from Herbert Xu:
"This fixes the aesni setkey error and removes a couple of unnecessary
NULL checks in the Intel qat driver"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: aesni - fix failing setkey for rfc4106-gcm-aesni
crypto: qat - Deletion of unnecessary checks before two function calls
- Add "make xenconfig" to assist in generating configs for Xen guests.
- Preparatory cleanups necessary for supporting 64 KiB pages in ARM
guests.
- Automatically use hvc0 as the default console in ARM guests.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJVkpoqAAoJEFxbo/MsZsTRu3IH/2AMPx2i65hoSqfHtGf3sz/z
XNfcidVmOElFVXGaW83m0tBWMemT5LpOGRfiq5sIo8xt/8xD2vozEkl/3kkf3RrX
EmZDw3E8vmstBdBTjWdovVhNenRc0m0pB5daS7wUdo9cETq1ag1L3BHTB3fEBApO
74V6qAfnhnq+snqWhRD3XAk3LKI0nWuWaV+5HsmxDtnunGhuRLGVs7mwxZGg56sM
mILA0eApGPdwyVVpuDe0SwV52V8E/iuVOWTcomGEN2+cRWffG5+QpHxQA8bOtF6O
KfqldiNXOY/idM+5+oSm9hespmdWbyzsFqmTYz0LvQvxE8eEZtHHB3gIcHkE8QU=
=danz
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.2-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen updates from David Vrabel:
"Xen features and cleanups for 4.2-rc0:
- add "make xenconfig" to assist in generating configs for Xen guests
- preparatory cleanups necessary for supporting 64 KiB pages in ARM
guests
- automatically use hvc0 as the default console in ARM guests"
* tag 'for-linus-4.2-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
block/xen-blkback: s/nr_pages/nr_segs/
block/xen-blkfront: Remove invalid comment
block/xen-blkfront: Remove unused macro MAXIMUM_OUTSTANDING_BLOCK_REQS
arm/xen: Drop duplicate define mfn_to_virt
xen/grant-table: Remove unused macro SPP
xen/xenbus: client: Fix call of virt_to_mfn in xenbus_grant_ring
xen: Include xen/page.h rather than asm/xen/page.h
kconfig: add xenconfig defconfig helper
kconfig: clarify kvmconfig is for kvm
xen/pcifront: Remove usage of struct timeval
xen/tmem: use BUILD_BUG_ON() in favor of BUG_ON()
hvc_xen: avoid uninitialized variable warning
xenbus: avoid uninitialized variable warning
xen/arm: allow console=hvc0 to be omitted for guests
arm,arm64/xen: move Xen initialization earlier
arm/xen: Correctly check if the event channel interrupt is present
Main excitement here is Peter Zijlstra's lockless rbtree optimization to
speed module address lookup. He found some abusers of the module lock
doing that too.
A little bit of parameter work here too; including Dan Streetman's breaking
up the big param mutex so writing a parameter can load another module (yeah,
really). Unfortunately that broke the usual suspects, !CONFIG_MODULES and
!CONFIG_SYSFS, so those fixes were appended too.
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVkgKHAAoJENkgDmzRrbjxQpwQAJVmBN6jF3SnwbQXv9vRixjH
58V33sb1G1RW+kXxQ3/e8jLX/4VaN479CufruXQp+IJWXsN/CH0lbC3k8m7u50d7
b1Zeqd/Yrh79rkc11b0X1698uGCSMlzz+V54Z0QOTEEX+nSu2ZZvccFS4UaHkn3z
rqDo00lb7rxQz8U25qro2OZrG6D3ub2q20TkWUB8EO4AOHkPn8KWP2r429Axrr0K
wlDWDTTt8/IsvPbuPf3T15RAhq1avkMXWn9nDXDjyWbpLfTn8NFnWmtesgY7Jl4t
GjbXC5WYekX3w2ZDB9KaT/DAMQ1a7RbMXNSz4RX4VbzDl+yYeSLmIh2G9fZb1PbB
PsIxrOgy4BquOWsJPm+zeFPSC3q9Cfu219L4AmxSjiZxC3dlosg5rIB892Mjoyv4
qxmg6oiqtc4Jxv+Gl9lRFVOqyHZrTC5IJ+xgfv1EyP6kKMUKLlDZtxZAuQxpUyxR
HZLq220RYnYSvkWauikq4M8fqFM8bdt6hLJnv7bVqllseROk9stCvjSiE3A9szH5
OgtOfYV5GhOeb8pCZqJKlGDw+RoJ21jtNCgOr6DgkNKV9CX/kL/Puwv8gnA0B0eh
dxCeB7f/gcLl7Cg3Z3gVVcGlgak6JWrLf5ITAJhBZ8Lv+AtL2DKmwEWS/iIMRmek
tLdh/a9GiCitqS0bT7GE
=tWPQ
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull module updates from Rusty Russell:
"Main excitement here is Peter Zijlstra's lockless rbtree optimization
to speed module address lookup. He found some abusers of the module
lock doing that too.
A little bit of parameter work here too; including Dan Streetman's
breaking up the big param mutex so writing a parameter can load
another module (yeah, really). Unfortunately that broke the usual
suspects, !CONFIG_MODULES and !CONFIG_SYSFS, so those fixes were
appended too"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (26 commits)
modules: only use mod->param_lock if CONFIG_MODULES
param: fix module param locks when !CONFIG_SYSFS.
rcu: merge fix for Convert ACCESS_ONCE() to READ_ONCE() and WRITE_ONCE()
module: add per-module param_lock
module: make perm const
params: suppress unused variable error, warn once just in case code changes.
modules: clarify CONFIG_MODULE_COMPRESS help, suggest 'N'.
kernel/module.c: avoid ifdefs for sig_enforce declaration
kernel/workqueue.c: remove ifdefs over wq_power_efficient
kernel/params.c: export param_ops_bool_enable_only
kernel/params.c: generalize bool_enable_only
kernel/module.c: use generic module param operaters for sig_enforce
kernel/params: constify struct kernel_param_ops uses
sysfs: tightened sysfs permission checks
module: Rework module_addr_{min,max}
module: Use __module_address() for module_address_lookup()
module: Make the mod_tree stuff conditional on PERF_EVENTS || TRACING
module: Optimize __module_address() using a latched RB-tree
rbtree: Implement generic latch_tree
seqlock: Introduce raw_read_seqcount_latch()
...
For 32-bit userspace on a 64-bit kernel, this requires modifying
stub32_clone to actually swap the appropriate arguments to match
CONFIG_CLONE_BACKWARDS, rather than just leaving the C argument for tls
broken.
Patch co-authored by Josh Triplett and Thiago Macieira.
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thiago Macieira <thiago.macieira@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Any parameter passed after '--' in the kernel command-line will not be
parsed by the kernel at all, instead it will be passed directly to init
process.
Currently the kernel appends elfcorehdr=<paddr> to the cmdline passed from
kexec load, and if this command-line is used to pass parameters to init
process this means that 'elfcorehdr' will not be parsed as a kernel
parameter at all which will be a problem for vmcore subsystem since it
will know nothing about the location of the ELF structure!
Prepending 'elfcorehdr' instead of appending it fixes this problem since
it ensures that it always comes before '--' and so it's always parsed as a
kernel command-line parameter.
Even with this patch things can still go wrong if 'CONFIG_CMDLINE' was
also used to embedd a command-line to the crash dump kernel and this
command-line contains '--' since the current behavior of the kernel is to
actually append the boot loader command-line to the embedded command-line.
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Acked-by: Vivek Goyal <vgoyal@redhat.com>
Cc: Haren Myneni <hbabu@us.ibm.com>
Cc: Eric Biederman <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Subject says it all. Other architectures may enable on a case-by-case
basis after auditing early_pfn_to_nid and testing.
Signed-off-by: Mel Gorman <mgorman@suse.de>
Tested-by: Nate Zimmer <nzimmer@sgi.com>
Tested-by: Waiman Long <waiman.long@hp.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Pekka Enberg <penberg@kernel.org>
Cc: Robin Holt <robinmholt@gmail.com>
Cc: Nate Zimmer <nzimmer@sgi.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Waiman Long <waiman.long@hp.com>
Cc: Scott Norton <scott.norton@hp.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1b7b938f18 ("perf/x86/intel: Fix PMI handling for Intel PT") conditionally
increments active_events in x86_add_exclusive() but unconditionally decrements in
x86_del_exclusive().
These extra decrements can lead to the situation where
active_events is zero and thus the PMI handler is 'disabled'
while we have active events on the PMU generating PMIs.
This leads to a truckload of:
Uhhuh. NMI received for unknown reason 21 on CPU 28.
Do you have a strange power saving mode enabled?
Dazed and confused, but trying to continue
messages and generally messes up perf.
Remove the condition on the increment, double increment balanced
by a double decrement is perfectly fine.
Restructure the code a little bit to make the unconditional inc
a bit more natural.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: alexander.shishkin@linux.intel.com
Cc: brgerst@gmail.com
Cc: dvlasenk@redhat.com
Cc: luto@amacapital.net
Cc: oleg@redhat.com
Fixes: 1b7b938f18 ("perf/x86/intel: Fix PMI handling for Intel PT")
Link: http://lkml.kernel.org/r/20150624144750.GJ18673@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Mike Galbraith reported:
" My i7-4790 box is having one hell of a time with this merge
window, dead in the water.
BIOS setting "Limit CPUID Maximum" upsets new fpu code
mightily. "
It turns out that Linux does a double workaround here, as per:
066941bd4e ("x86: unmask CPUID levels on Intel CPUs")
it undoes the BIOS workaround - but as a side effect the CPUID
state is not completely constant during early init anymore,
and the new FPU init code did not take this into account.
So what happened is that the xstate init code did not have full
CPUID available, which broke subsequent attempts to use xstate
features.
Fix this by ordering the early FPU init code to after we've
stabilized the CPUID state.
Reported-bisected-and-tested-by: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150627082514.GA10894@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This driver provides support for PMC control on Apollo Lake platforms.
The PMC is an ARC processor which defines some IPC commands for
communication with other entities in the CPU.
Signed-off-by: qipeng.zha <qipeng.zha@intel.com>
[fengguang.wu@intel.com: Fix Sparse and Cocinelle warnings]
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Darren Hart <dvhart@linux.intel.com>
4 drivers / enabling modules:
NFIT:
Instantiates an "nvdimm bus" with the core and registers memory devices
(NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware Interface
table). After registering NVDIMMs the NFIT driver then registers
"region" devices. A libnvdimm-region defines an access mode and the
boundaries of persistent memory media. A region may span multiple
NVDIMMs that are interleaved by the hardware memory controller. In
turn, a libnvdimm-region can be carved into a "namespace" device and
bound to the PMEM or BLK driver which will attach a Linux block device
(disk) interface to the memory.
PMEM:
Initially merged in v4.1 this driver for contiguous spans of persistent
memory address ranges is re-worked to drive PMEM-namespaces emitted by
the libnvdimm-core. In this update the PMEM driver, on x86, gains the
ability to assert that writes to persistent memory have been flushed all
the way through the caches and buffers in the platform to persistent
media. See memcpy_to_pmem() and wmb_pmem().
BLK:
This new driver enables access to persistent memory media through "Block
Data Windows" as defined by the NFIT. The primary difference of this
driver to PMEM is that only a small window of persistent memory is
mapped into system address space at any given point in time. Per-NVDIMM
windows are reprogrammed at run time, per-I/O, to access different
portions of the media. BLK-mode, by definition, does not support DAX.
BTT:
This is a library, optionally consumed by either PMEM or BLK, that
converts a byte-accessible namespace into a disk with atomic sector
update semantics (prevents sector tearing on crash or power loss). The
sinister aspect of sector tearing is that most applications do not know
they have a atomic sector dependency. At least today's disk's rarely
ever tear sectors and if they do one almost certainly gets a CRC error
on access. NVDIMMs will always tear and always silently. Until an
application is audited to be robust in the presence of sector-tearing
the usage of BTT is recommended.
Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
Wysocki, and Bob Moore.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVjZGBAAoJEB7SkWpmfYgC4fkP/j+k6HmSRNU/yRYPyo7CAWvj
3P5P1i6R6nMZZbjQrQArAXaIyLlFk4sEQDYsciR6dmslhhFZAkR2eFwVO5rBOyx3
QN0yxEpyjJbroRFUrV/BLaFK4cq2oyJAFFHs0u7/pLHBJ4MDMqfRKAMtlnBxEkTE
LFcqXapSlvWitSbjMdIBWKFEvncaiJ2mdsFqT4aZqclBBTj00eWQvEG9WxleJLdv
+tj7qR/vGcwOb12X5UrbQXgwtMYos7A6IzhHbqwQL8IrOcJ6YB8NopJUpLDd7ZVq
KAzX6ZYMzNueN4uvv6aDfqDRLyVL7qoxM9XIjGF5R8SV9sF2LMspm1FBpfowo1GT
h2QMr0ky1nHVT32yspBCpE9zW/mubRIDtXxEmZZ53DIc4N6Dy9jFaNVmhoWtTAqG
b9pndFnjUzzieCjX5pCvo2M5U6N0AQwsnq76/CasiWyhSa9DNKOg8MVDRg0rbxb0
UvK0v8JwOCIRcfO3qiKcx+02nKPtjCtHSPqGkFKPySRvAdb+3g6YR26CxTb3VmnF
etowLiKU7HHalLvqGFOlDoQG6viWes9Zl+ZeANBOCVa6rL2O7ZnXJtYgXf1wDQee
fzgKB78BcDjXH4jHobbp/WBANQGN/GF34lse8yHa7Ym+28uEihDvSD1wyNLnefmo
7PJBbN5M5qP5tD0aO7SZ
=VtWG
-----END PGP SIGNATURE-----
Merge tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm
Pull libnvdimm subsystem from Dan Williams:
"The libnvdimm sub-system introduces, in addition to the
libnvdimm-core, 4 drivers / enabling modules:
NFIT:
Instantiates an "nvdimm bus" with the core and registers memory
devices (NVDIMMs) enumerated by the ACPI 6.0 NFIT (NVDIMM Firmware
Interface table).
After registering NVDIMMs the NFIT driver then registers "region"
devices. A libnvdimm-region defines an access mode and the
boundaries of persistent memory media. A region may span multiple
NVDIMMs that are interleaved by the hardware memory controller. In
turn, a libnvdimm-region can be carved into a "namespace" device and
bound to the PMEM or BLK driver which will attach a Linux block
device (disk) interface to the memory.
PMEM:
Initially merged in v4.1 this driver for contiguous spans of
persistent memory address ranges is re-worked to drive
PMEM-namespaces emitted by the libnvdimm-core.
In this update the PMEM driver, on x86, gains the ability to assert
that writes to persistent memory have been flushed all the way
through the caches and buffers in the platform to persistent media.
See memcpy_to_pmem() and wmb_pmem().
BLK:
This new driver enables access to persistent memory media through
"Block Data Windows" as defined by the NFIT. The primary difference
of this driver to PMEM is that only a small window of persistent
memory is mapped into system address space at any given point in
time.
Per-NVDIMM windows are reprogrammed at run time, per-I/O, to access
different portions of the media. BLK-mode, by definition, does not
support DAX.
BTT:
This is a library, optionally consumed by either PMEM or BLK, that
converts a byte-accessible namespace into a disk with atomic sector
update semantics (prevents sector tearing on crash or power loss).
The sinister aspect of sector tearing is that most applications do
not know they have a atomic sector dependency. At least today's
disk's rarely ever tear sectors and if they do one almost certainly
gets a CRC error on access. NVDIMMs will always tear and always
silently. Until an application is audited to be robust in the
presence of sector-tearing the usage of BTT is recommended.
Thanks to: Ross Zwisler, Jeff Moyer, Vishal Verma, Christoph Hellwig,
Ingo Molnar, Neil Brown, Boaz Harrosh, Robert Elliott, Matthew Wilcox,
Andy Rudoff, Linda Knippers, Toshi Kani, Nicholas Moulin, Rafael
Wysocki, and Bob Moore"
* tag 'libnvdimm-for-4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm: (33 commits)
arch, x86: pmem api for ensuring durability of persistent memory updates
libnvdimm: Add sysfs numa_node to NVDIMM devices
libnvdimm: Set numa_node to NVDIMM devices
acpi: Add acpi_map_pxm_to_online_node()
libnvdimm, nfit: handle unarmed dimms, mark namespaces read-only
pmem: flag pmem block devices as non-rotational
libnvdimm: enable iostat
pmem: make_request cleanups
libnvdimm, pmem: fix up max_hw_sectors
libnvdimm, blk: add support for blk integrity
libnvdimm, btt: add support for blk integrity
fs/block_dev.c: skip rw_page if bdev has integrity
libnvdimm: Non-Volatile Devices
tools/testing/nvdimm: libnvdimm unit test infrastructure
libnvdimm, nfit, nd_blk: driver for BLK-mode access persistent memory
nd_btt: atomic sector updates
libnvdimm: infrastructure for btt devices
libnvdimm: write blk label set
libnvdimm: write pmem label set
libnvdimm: blk labels and namespace instantiation
...
rfc4106(gcm(aes)) uses ctr(aes) to generate hash key. ctr(aes) needs
chainiv, but the chainiv gets initialized after aesni_intel when both
are statically linked so the setkey fails.
This patch forces aesni_intel to be initialized after chainiv.
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull UML updates from Richard Weinberger:
- remove hppfs ("HonePot ProcFS")
- initial support for musl libc
- uaccess cleanup
- random cleanups and bug fixes all over the place
* 'for-linus-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml: (21 commits)
um: Don't pollute kernel namespace with uapi
um: Include sys/types.h for makedev(), major(), minor()
um: Do not use stdin and stdout identifiers for struct members
um: Do not use __ptr_t type for stack_t's .ss pointer
um: Fix mconsole dependency
um: Handle tracehook_report_syscall_entry() result
um: Remove copy&paste code from init.h
um: Stop abusing __KERNEL__
um: Catch unprotected user memory access
um: Fix warning in setup_signal_stack_si()
um: Rework uaccess code
um: Add uaccess.h to ldt.c
um: Add uaccess.h to syscalls_64.c
um: Add asm/elf.h to vma.c
um: Cleanup mem_32/64.c headers
um: Remove hppfs
um: Move syscall() declaration into os.h
um: kernel: ksyms: Export symbol syscall() for fixing modpost issue
um/os-Linux: Use char[] for syscall_stub declarations
um: Use char[] for linker script address declarations
...
Here's the tty and serial driver patches for 4.2-rc1.
A number of individual driver updates, some code cleanups, and other
minor things, full details in the shortlog.
All have been in linux-next for a while with no reported issues.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlWNoSAACgkQMUfUDdst+ymxNQCguSEmkAYNDdLyYhdcOqSxJt9u
U1gAoMThUDoomkx6CTDMU1wn53hxgMk9
=eCUS
-----END PGP SIGNATURE-----
Merge tag 'tty-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here's the tty and serial driver patches for 4.2-rc1.
A number of individual driver updates, some code cleanups, and other
minor things, full details in the shortlog.
All have been in linux-next for a while with no reported issues"
* tag 'tty-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (152 commits)
Doc: serial-rs485.txt: update RS485 driver interface
Doc: tty.txt: remove mention of the BKL
MAINTAINERS: tty: add serial docs directory
serial: sprd: check for NULL after calling devm_clk_get
serial: 8250_pci: Correct uartclk for xr17v35x expansion chips
serial: 8250_pci: Add support for 12 port Exar boards
serial: 8250_uniphier: add bindings document for UniPhier UART
serial: core: cleanup in uart_get_baud_rate()
serial: stm32-usart: Add STM32 USART Driver
tty/serial: kill off set_irq_flags usage
tty: move linux/gsmmux.h to uapi
doc: dt: add documentation for nxp,lpc1850-uart
serial: 8250: add LPC18xx/43xx UART driver
serial: 8250_uniphier: add UniPhier serial driver
serial: 8250_dw: support ACPI platforms with integrated DMA engine
serial: of_serial: check the return value of clk_prepare_enable()
serial: of_serial: use devm_clk_get() instead of clk_get()
serial: earlycon: Add support for big-endian MMIO accesses
serial: sirf: use hrtimer for data rx
serial: sirf: correct the fifo empty_bit
...
Here's the big char/misc driver pull request for 4.2-rc1.
Lots of mei, extcon, coresight, uio, mic, and other driver updates in
here. Full details in the shortlog. All of these have been in
linux-next for some time with no reported problems.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlWNn0gACgkQMUfUDdst+ykCCQCgvdF4F2+Hy9+RATdk22ak1uq1
JDMAoJTf4oyaIEdaiOKfEIWg9MasS42B
=H5wD
-----END PGP SIGNATURE-----
Merge tag 'char-misc-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here's the big char/misc driver pull request for 4.2-rc1.
Lots of mei, extcon, coresight, uio, mic, and other driver updates in
here. Full details in the shortlog. All of these have been in
linux-next for some time with no reported problems"
* tag 'char-misc-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (176 commits)
mei: me: wait for power gating exit confirmation
mei: reset flow control on the last client disconnection
MAINTAINERS: mei: add mei_cl_bus.h to maintained file list
misc: sram: sort and clean up included headers
misc: sram: move reserved block logic out of probe function
misc: sram: add private struct device and virt_base members
misc: sram: report correct SRAM pool size
misc: sram: bump error message level on unclean driver unbinding
misc: sram: fix device node reference leak on error
misc: sram: fix enabled clock leak on error path
misc: mic: Fix reported static checker warning
misc: mic: Fix randconfig build error by including errno.h
uio: pruss: Drop depends on ARCH_DAVINCI_DA850 from config
uio: pruss: Add CONFIG_HAS_IOMEM dependence
uio: pruss: Include <linux/sizes.h>
extcon: Redefine the unique id of supported external connectors without 'enum extcon' type
char:xilinx_hwicap:buffer_icap - change 1/0 to true/false for bool type variable in function buffer_icap_set_configuration().
Drivers: hv: vmbus: Allocate ring buffer memory in NUMA aware fashion
parport: check exclusive access before register
w1: use correct lock on error in w1_seq_show()
...
"monitonic raw". Also some enhancements to make the ring buffer even
faster. But the biggest and most noticeable change is the renaming of
the ftrace* files, structures and variables that have to deal with
trace events.
Over the years I've had several developers tell me about their confusion
with what ftrace is compared to events. Technically, "ftrace" is the
infrastructure to do the function hooks, which include tracing and also
helps with live kernel patching. But the trace events are a separate
entity altogether, and the files that affect the trace events should
not be named "ftrace". These include:
include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h
Also, functions that are specific for trace events have also been renamed:
ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled()-> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()
Structures have been renamed:
ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call
And a few various variables and flags have also been updated.
This has been sitting in linux-next for some time, and I have not heard
a single complaint about this rename breaking anything. Mostly because
these functions, variables and structures are mostly internal to the
tracing system and are seldom (if ever) used by anything external to that.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJViYhVAAoJEEjnJuOKh9ldcJ0IAI+mytwoMAN/CWDE8pXrTrgs
aHlcr1zorSzZ0Lq6lKsWP+V0VGVhP8KWO16vl35HaM5ZB9U+cDzWiGobI8JTHi/3
eeTAPTjQdgrr/L+ZO1ApzS1jYPhN3Xi5L7xublcYMJjKfzU+bcYXg/x8gRt0QbG3
S9QN/kBt0JIIjT7McN64m5JVk2OiU36LxXxwHgCqJvVCPHUrriAdIX7Z5KRpEv13
zxgCN4d7Jiec/FsMW8dkO0vRlVAvudZWLL7oDmdsvNhnLy8nE79UOeHos2c1qifQ
LV4DeQ+2Hlu7w9wxixHuoOgNXDUEiQPJXzPc/CuCahiTL9N/urQSGQDoOVMltR4=
=hkdz
-----END PGP SIGNATURE-----
Merge tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt:
"This patch series contains several clean ups and even a new trace
clock "monitonic raw". Also some enhancements to make the ring buffer
even faster. But the biggest and most noticeable change is the
renaming of the ftrace* files, structures and variables that have to
deal with trace events.
Over the years I've had several developers tell me about their
confusion with what ftrace is compared to events. Technically,
"ftrace" is the infrastructure to do the function hooks, which include
tracing and also helps with live kernel patching. But the trace
events are a separate entity altogether, and the files that affect the
trace events should not be named "ftrace". These include:
include/trace/ftrace.h -> include/trace/trace_events.h
include/linux/ftrace_event.h -> include/linux/trace_events.h
Also, functions that are specific for trace events have also been renamed:
ftrace_print_*() -> trace_print_*()
(un)register_ftrace_event() -> (un)register_trace_event()
ftrace_event_name() -> trace_event_name()
ftrace_trigger_soft_disabled() -> trace_trigger_soft_disabled()
ftrace_define_fields_##call() -> trace_define_fields_##call()
ftrace_get_offsets_##call() -> trace_get_offsets_##call()
Structures have been renamed:
ftrace_event_file -> trace_event_file
ftrace_event_{call,class} -> trace_event_{call,class}
ftrace_event_buffer -> trace_event_buffer
ftrace_subsystem_dir -> trace_subsystem_dir
ftrace_event_raw_##call -> trace_event_raw_##call
ftrace_event_data_offset_##call-> trace_event_data_offset_##call
ftrace_event_type_funcs_##call -> trace_event_type_funcs_##call
And a few various variables and flags have also been updated.
This has been sitting in linux-next for some time, and I have not
heard a single complaint about this rename breaking anything. Mostly
because these functions, variables and structures are mostly internal
to the tracing system and are seldom (if ever) used by anything
external to that"
* tag 'trace-v4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (33 commits)
ring_buffer: Allow to exit the ring buffer benchmark immediately
ring-buffer-benchmark: Fix the wrong type
ring-buffer-benchmark: Fix the wrong param in module_param
ring-buffer: Add enum names for the context levels
ring-buffer: Remove useless unused tracing_off_permanent()
ring-buffer: Give NMIs a chance to lock the reader_lock
ring-buffer: Add trace_recursive checks to ring_buffer_write()
ring-buffer: Allways do the trace_recursive checks
ring-buffer: Move recursive check to per_cpu descriptor
ring-buffer: Add unlikelys to make fast path the default
tracing: Rename ftrace_get_offsets_##call() to trace_event_get_offsets_##call()
tracing: Rename ftrace_define_fields_##call() to trace_event_define_fields_##call()
tracing: Rename ftrace_event_type_funcs_##call to trace_event_type_funcs_##call
tracing: Rename ftrace_data_offset_##call to trace_event_data_offset_##call
tracing: Rename ftrace_raw_##call event structures to trace_event_raw_##call
tracing: Rename ftrace_trigger_soft_disabled() to trace_trigger_soft_disabled()
tracing: Rename FTRACE_EVENT_FL_* flags to EVENT_FILE_FL_*
tracing: Rename struct ftrace_subsystem_dir to trace_subsystem_dir
tracing: Rename ftrace_event_name() to trace_event_name()
tracing: Rename FTRACE_MAX_EVENT to TRACE_EVENT_TYPE_MAX
...
Pull drm updates from Dave Airlie:
"This is the main drm pull request for v4.2.
I've one other new driver from freescale on my radar, it's been posted
and reviewed, I'd just like to get someone to give it a last look, so
maybe I'll send it or maybe I'll leave it.
There is no major nouveau changes in here, Ben was working on
something big, and we agreed it was a bit late, there wasn't anything
else he considered urgent to merge.
There might be another msm pull for some bits that are waiting on
arm-soc, I'll see how we time it.
This touches some "of" stuff, acks are in place except for the fixes
to the build in various configs,t hat I just applied.
Summary:
New drivers:
- virtio-gpu:
KMS only pieces of driver for virtio-gpu in qemu.
This is just the first part of this driver, enough to run
unaccelerated userspace on. As qemu merges more we'll start
adding the 3D features for the virgl 3d work.
- amdgpu:
a new driver from AMD to driver their newer GPUs. (VI+)
It contains a new cleaner userspace API, and is a clean
break from radeon moving forward, that AMD are going to
concentrate on. It also contains a set of register headers
auto generated from AMD internal database.
core:
- atomic modesetting API completed, enabled by default now.
- Add support for mode_id blob to atomic ioctl to complete interface.
- bunch of Displayport MST fixes
- lots of misc fixes.
panel:
- new simple panels
- fix some long-standing build issues with bridge drivers
radeon:
- VCE1 support
- add a GPU reset counter for userspace
- lots of fixes.
amdkfd:
- H/W debugger support module
- static user-mode queues
- support killing all the waves when a process terminates
- use standard DECLARE_BITMAP
i915:
- Add Broxton support
- S3, rotation support for Skylake
- RPS booting tuning
- CPT modeset sequence fixes
- ns2501 dither support
- enable cmd parser on haswell
- cdclk handling fixes
- gen8 dynamic pte allocation
- lots of atomic conversion work
exynos:
- Add atomic modesetting support
- Add iommu support
- Consolidate drm driver initialization
- and MIC, DECON and MIPI-DSI support for exynos5433
omapdrm:
- atomic modesetting support (fixes lots of things in rewrite)
tegra:
- DP aux transaction fixes
- iommu support fix
msm:
- adreno a306 support
- various dsi bits
- various 64-bit fixes
- NV12MT support
rcar-du:
- atomic and misc fixes
sti:
- fix HDMI timing complaince
tilcdc:
- use drm component API to access tda998x driver
- fix module unloading
qxl:
- stability fixes"
* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (872 commits)
drm/nouveau: Pause between setting gpu to D3hot and cutting the power
drm/dp/mst: close deadlock in connector destruction.
drm: Always enable atomic API
drm/vgem: Set unique to "vgem"
of: fix a build error to of_graph_get_endpoint_by_regs function
drm/dp/mst: take lock around looking up the branch device on hpd irq
drm/dp/mst: make sure mst_primary mstb is valid in work function
of: add EXPORT_SYMBOL for of_graph_get_endpoint_by_regs
ARM: dts: rename the clock of MIPI DSI 'pll_clk' to 'sclk_mipi'
drm/atomic: Don't set crtc_state->enable manually
drm/exynos: dsi: do not set TE GPIO direction by input
drm/exynos: dsi: add support for MIC driver as a bridge
drm/exynos: dsi: add support for Exynos5433
drm/exynos: dsi: make use of array for clock access
drm/exynos: dsi: make use of driver data for static values
drm/exynos: dsi: add macros for register access
drm/exynos: dsi: rename pll_clk to sclk_clk
drm/exynos: mic: add MIC driver
of: add helper for getting endpoint node of specific identifiers
drm/exynos: add Exynos5433 decon driver
...
Merge second patchbomb from Andrew Morton:
- most of the rest of MM
- lots of misc things
- procfs updates
- printk feature work
- updates to get_maintainer, MAINTAINERS, checkpatch
- lib/ updates
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (96 commits)
exit,stats: /* obey this comment */
coredump: add __printf attribute to cn_*printf functions
coredump: use from_kuid/kgid when formatting corename
fs/reiserfs: remove unneeded cast
NILFS2: support NFSv2 export
fs/befs/btree.c: remove unneeded initializations
fs/minix: remove unneeded cast
init/do_mounts.c: add create_dev() failure log
kasan: remove duplicate definition of the macro KASAN_FREE_PAGE
fs/efs: femove unneeded cast
checkpatch: emit "NOTE: <types>" message only once after multiple files
checkpatch: emit an error when there's a diff in a changelog
checkpatch: validate MODULE_LICENSE content
checkpatch: add multi-line handling for PREFER_ETHER_ADDR_COPY
checkpatch: suggest using eth_zero_addr() and eth_broadcast_addr()
checkpatch: fix processing of MEMSET issues
checkpatch: suggest using ether_addr_equal*()
checkpatch: avoid NOT_UNIFIED_DIFF errors on cover-letter.patch files
checkpatch: remove local from codespell path
checkpatch: add --showfile to allow input via pipe to show filenames
...
Based on an original patch by Ross Zwisler [1].
Writes to persistent memory have the potential to be posted to cpu
cache, cpu write buffers, and platform write buffers (memory controller)
before being committed to persistent media. Provide apis,
memcpy_to_pmem(), wmb_pmem(), and memremap_pmem(), to write data to
pmem and assert that it is durable in PMEM (a persistent linear address
range). A '__pmem' attribute is added so sparse can track proper usage
of pointers to pmem.
This continues the status quo of pmem being x86 only for 4.2, but
reworks to ioremap, and wider implementation of memremap() will enable
other archs in 4.3.
[1]: https://lists.01.org/pipermail/linux-nvdimm/2015-May/000932.html
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
[djbw: various reworks]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
ACPI NFIT table has System Physical Address Range Structure entries that
describe a proximity ID of each range when ACPI_NFIT_PROXIMITY_VALID is
set in the flags.
Change acpi_nfit_register_region() to map a proximity ID to its node ID,
and set it to a new numa_node field of nd_region_desc, which is then
conveyed to the nd_region device.
The device core arranges for btt and namespace devices to inherit their
node from their parent region.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
[djbw: move set_dev_node() from region.c to bus.c]
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Nobody used these hooks so they were removed from common code, and can now
be removed from the architectures.
Signed-off-by: Dominik Dingel <dingel@linux.vnet.ibm.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull asm/scatterlist.h removal from Jens Axboe:
"We don't have any specific arch scatterlist anymore, since parisc
finally switched over. Kill the include"
* 'for-4.2/sg' of git://git.kernel.dk/linux-block:
remove scatterlist.h generation from arch Kbuild files
remove <asm/scatterlist.h>
Don't include ptrace uapi stuff in arch headers, it will
pollute the kernel namespace and conflict with existing
stuff.
In this case it fixes clashes with common names like R8.
Signed-off-by: Richard Weinberger <richard@nod.at>
Merge first patchbomb from Andrew Morton:
- a few misc things
- ocfs2 udpates
- kernel/watchdog.c feature work (took ages to get right)
- most of MM. A few tricky bits are held up and probably won't make 4.2.
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (91 commits)
mm: kmemleak_alloc_percpu() should follow the gfp from per_alloc()
mm, thp: respect MPOL_PREFERRED policy with non-local node
tmpfs: truncate prealloc blocks past i_size
mm/memory hotplug: print the last vmemmap region at the end of hot add memory
mm/mmap.c: optimization of do_mmap_pgoff function
mm: kmemleak: optimise kmemleak_lock acquiring during kmemleak_scan
mm: kmemleak: avoid deadlock on the kmemleak object insertion error path
mm: kmemleak: do not acquire scan_mutex in kmemleak_do_cleanup()
mm: kmemleak: fix delete_object_*() race when called on the same memory block
mm: kmemleak: allow safe memory scanning during kmemleak disabling
memcg: convert mem_cgroup->under_oom from atomic_t to int
memcg: remove unused mem_cgroup->oom_wakeups
frontswap: allow multiple backends
x86, mirror: x86 enabling - find mirrored memory ranges
mm/memblock: allocate boot time data structures from mirrored memory
mm/memblock: add extra "flags" to memblock to allow selection of memory based on attribute
mm: do not ignore mapping_gfp_mask in page cache allocation paths
mm/cma.c: fix typos in comments
mm/oom_kill.c: print points as unsigned int
mm/hugetlb: handle races in alloc_huge_page and hugetlb_reserve_pages
...
* New APM X-Gene SoC EDAC driver (Loc Ho)
* AMD error injection module improvements (Aravind Gopalakrishnan)
* Altera Arria 10 support (Thor Thayer)
* misc fixes and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJViuInAAoJEBLB8Bhh3lVKHT8QAKkHIMreO8obo09haxNJlfdF
BaG7SNEDhvcgQ1B76RsjnjkUpsivvUt+mCYMP+BxcAqFrTA33UZCCOK5tEhGb1wr
matRdR6+aezqAl2e/0/Ti25bWOkDxcOeazh2TyezuyIXtaJjOq1oZC7OaYGmxPun
NlZY+/uY1eiHlewKsK04y8G8J5i4wGoKnuxBvOyELT90+a+fLfAOshAp0D4r0piB
Znv0ydsHlu+Wx57slg1DktlsyswmcGS9WfWwwTlELOLulKgN8wEAVYzUB5pJzNbz
ehq0J4wYz95juXADC4M4tEjErHVJNl6PbyMqwt0+XUUJ1NSgOj7Q6iqwxDoZX8km
oxiLVydQBtoIzF1LojFKAVZDFnrMKHKwK3RaDaUJjTI90+tVzEU8xsBlUf6+EgD2
Ss2RH8Gfuf52RdtwHh9++T1ur5rM9YNCAm31msq06mcOf0bEtmDbhZ+fVC5mjhqB
fIb3hxnk0r2BVg+ZCN/boxGS6RzUtYVcCXaBPDMeHcg9BEEds70KCFEcsX7TvJIg
5/SHI+033MylqkX2zrgDQLj7CQk3R0jaotHVbdhLupyOldcM7r5uF+VO84drNWGN
GfM2lpyE/swZWnzKuotgYIGR1XvFjtJAVAyNGIvwP+ajjTsqXzEnLSLClY5LWfYd
nSSSMpCCqsEmhoWftOix
=Id4f
-----END PGP SIGNATURE-----
Merge tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
Pull EDAC updates from Borislav Petkov:
- New APM X-Gene SoC EDAC driver (Loc Ho)
- AMD error injection module improvements (Aravind Gopalakrishnan)
- Altera Arria 10 support (Thor Thayer)
- misc fixes and cleanups all over the place
* tag 'edac_for_4.2_2' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp: (28 commits)
EDAC: Update Documentation/edac.txt
EDAC: Fix typos in Documentation/edac.txt
EDAC, mce_amd_inj: Set MISCV on injection
EDAC, mce_amd_inj: Move bit preparations before the injection
EDAC, mce_amd_inj: Cleanup and simplify README
EDAC, altera: Do not allow suspend when EDAC is enabled
EDAC, mce_amd_inj: Make inj_type static
arm: socfpga: dts: Add Arria10 SDRAM EDAC DTS support
EDAC, altera: Add Arria10 EDAC support
EDAC, altera: Refactor for Altera CycloneV SoC
EDAC, altera: Generalize driver to use DT Memory size
EDAC, mce_amd_inj: Add README file
EDAC, mce_amd_inj: Add individual permissions field to dfs_node
EDAC, mce_amd_inj: Modify flags attribute to use string arguments
EDAC, mce_amd_inj: Read out number of MCE banks from the hardware
EDAC, mce_amd_inj: Use MCE_INJECT_GET macro for bank node too
EDAC, xgene: Fix cpuid abuse
EDAC, mpc85xx: Extend error address to 64 bit
EDAC, mpc8xxx: Adapt for FSL SoC
EDAC, edac_stub: Drop arch-specific include
...
nd_pmem attaches to persistent memory regions and namespaces emitted by
the libnvdimm subsystem, and, same as the original pmem driver, presents
the system-physical-address range as a block device.
The existing e820-type-12 to pmem setup is converted to an nvdimm_bus
that emits an nd_namespace_io device.
Note that the X in 'pmemX' is now derived from the parent region. This
provides some stability to the pmem devices names from boot-to-boot.
The minor numbers are also more predictable by passing 0 to
alloc_disk().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
UEFI GetMemoryMap() uses a new attribute bit to mark mirrored memory
address ranges. See UEFI 2.5 spec pages 157-158:
http://www.uefi.org/sites/default/files/resources/UEFI%202_5.pdf
On EFI enabled systems scan the memory map and tell memblock about any
mirrored ranges.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some high end Intel Xeon systems report uncorrectable memory errors as a
recoverable machine check. Linux has included code for some time to
process these and just signal the affected processes (or even recover
completely if the error was in a read only page that can be replaced by
reading from disk).
But we have no recovery path for errors encountered during kernel code
execution. Except for some very specific cases were are unlikely to ever
be able to recover.
Enter memory mirroring. Actually 3rd generation of memory mirroing.
Gen1: All memory is mirrored
Pro: No s/w enabling - h/w just gets good data from other side of the
mirror
Con: Halves effective memory capacity available to OS/applications
Gen2: Partial memory mirror - just mirror memory begind some memory controllers
Pro: Keep more of the capacity
Con: Nightmare to enable. Have to choose between allocating from
mirrored memory for safety vs. NUMA local memory for performance
Gen3: Address range partial memory mirror - some mirror on each memory
controller
Pro: Can tune the amount of mirror and keep NUMA performance
Con: I have to write memory management code to implement
The current plan is just to use mirrored memory for kernel allocations.
This has been broken into two phases:
1) This patch series - find the mirrored memory, use it for boot time
allocations
2) Wade into mm/page_alloc.c and define a ZONE_MIRROR to pick up the
unused mirrored memory from mm/memblock.c and only give it out to
select kernel allocations (this is still being scoped because
page_alloc.c is scary).
This patch (of 3):
Add extra "flags" to memblock to allow selection of memory based on
attribute. No functional changes
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Hanjun Guo <guohanjun@huawei.com>
Cc: Xiexiuqi <xiexiuqi@huawei.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We have confusing functions to clear pmd, pmd_clear_* and pmd_clear. Add
_huge_ to pmdp_clear functions so that we are clear that they operate on
hugepage pte.
We don't bother about other functions like pmdp_set_wrprotect,
pmdp_clear_flush_young, because they operate on PTE bits and hence
indicate they are operating on hugepage ptes
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we have many duplicates in definitions of
hugetlb_prefault_arch_hook. In all architectures this function is empty.
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
CRIU is recreating the process memory layout by remapping the checkpointee
memory area on top of the current process (criu). This includes remapping
the vDSO to the place it has at checkpoint time.
However some architectures like powerpc are keeping a reference to the
vDSO base address to build the signal return stack frame by calling the
vDSO sigreturn service. So once the vDSO has been moved, this reference
is no more valid and the signal frame built later are not usable.
This patch serie is introducing a new mm hook framework, and a new
arch_remap hook which is called when mremap is done and the mm lock still
hold. The next patch is adding the vDSO remap and unmap tracking to the
powerpc architecture.
This patch (of 3):
This patch introduces a new set of header file to manage mm hooks:
- per architecture empty header file (arch/x/include/asm/mm-arch-hooks.h)
- a generic header (include/linux/mm-arch-hooks.h)
The architecture which need to overwrite a hook as to redefine it in its
header file, while architecture which doesn't need have nothing to do.
The default hooks are defined in the generic header and are used in the
case the architecture is not defining it.
In a next step, mm hooks defined in include/asm-generic/mm_hooks.h should
be moved here.
Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull networking updates from David Miller:
1) Add TX fast path in mac80211, from Johannes Berg.
2) Add TSO/GRO support to ibmveth, from Thomas Falcon
3) Move away from cached routes in ipv6, just like ipv4, from Martin
KaFai Lau.
4) Lots of new rhashtable tests, from Thomas Graf.
5) Run ingress qdisc lockless, from Alexei Starovoitov.
6) Allow servers to fetch TCP packet headers for SYN packets of new
connections, for fingerprinting. From Eric Dumazet.
7) Add mode parameter to pktgen, for testing receive. From Alexei
Starovoitov.
8) Cache access optimizations via simplifications of build_skb(), from
Alexander Duyck.
9) Move page frag allocator under mm/, also from Alexander.
10) Add xmit_more support to hv_netvsc, from KY Srinivasan.
11) Add a counter guard in case we try to perform endless reclassify
loops in the packet scheduler.
12) Extern flow dissector to be programmable and use it in new "Flower"
classifier. From Jiri Pirko.
13) AF_PACKET fanout rollover fixes, performance improvements, and new
statistics. From Willem de Bruijn.
14) Add netdev driver for GENEVE tunnels, from John W Linville.
15) Add ingress netfilter hooks and filtering, from Pablo Neira Ayuso.
16) Fix handling of epoll edge triggers in TCP, from Eric Dumazet.
17) Add an ECN retry fallback for the initial TCP handshake, from Daniel
Borkmann.
18) Add tail call support to BPF, from Alexei Starovoitov.
19) Add several pktgen helper scripts, from Jesper Dangaard Brouer.
20) Add zerocopy support to AF_UNIX, from Hannes Frederic Sowa.
21) Favor even port numbers for allocation to connect() requests, and
odd port numbers for bind(0), in an effort to help avoid
ip_local_port_range exhaustion. From Eric Dumazet.
22) Add Cavium ThunderX driver, from Sunil Goutham.
23) Allow bpf programs to access skb_iif and dev->ifindex SKB metadata,
from Alexei Starovoitov.
24) Add support for T6 chips in cxgb4vf driver, from Hariprasad Shenai.
25) Double TCP Small Queues default to 256K to accomodate situations
like the XEN driver and wireless aggregation. From Wei Liu.
26) Add more entropy inputs to flow dissector, from Tom Herbert.
27) Add CDG congestion control algorithm to TCP, from Kenneth Klette
Jonassen.
28) Convert ipset over to RCU locking, from Jozsef Kadlecsik.
29) Track and act upon link status of ipv4 route nexthops, from Andy
Gospodarek.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1670 commits)
bridge: vlan: flush the dynamically learned entries on port vlan delete
bridge: multicast: add a comment to br_port_state_selection about blocking state
net: inet_diag: export IPV6_V6ONLY sockopt
stmmac: troubleshoot unexpected bits in des0 & des1
net: ipv4 sysctl option to ignore routes when nexthop link is down
net: track link-status of ipv4 nexthops
net: switchdev: ignore unsupported bridge flags
net: Cavium: Fix MAC address setting in shutdown state
drivers: net: xgene: fix for ACPI support without ACPI
ip: report the original address of ICMP messages
net/mlx5e: Prefetch skb data on RX
net/mlx5e: Pop cq outside mlx5e_get_cqe
net/mlx5e: Remove mlx5e_cq.sqrq back-pointer
net/mlx5e: Remove extra spaces
net/mlx5e: Avoid TX CQE generation if more xmit packets expected
net/mlx5e: Avoid redundant dev_kfree_skb() upon NOP completion
net/mlx5e: Remove re-assignment of wq type in mlx5e_enable_rq()
net/mlx5e: Use skb_shinfo(skb)->gso_segs rather than counting them
net/mlx5e: Static mapping of netdev priv resources to/from netdev TX queues
net/mlx4_en: Use HW counters for rx/tx bytes/packets in PF device
...
for silicon that no one owns: these are really new features for
everyone.
* ARM: several features are in progress but missed the 4.2 deadline.
So here is just a smattering of bug fixes, plus enabling the VFIO
integration.
* s390: Some fixes/refactorings/optimizations, plus support for
2GB pages.
* x86: 1) host and guest support for marking kvmclock as a stable
scheduler clock. 2) support for write combining. 3) support for
system management mode, needed for secure boot in guests. 4) a bunch
of cleanups required for 2+3. 5) support for virtualized performance
counters on AMD; 6) legacy PCI device assignment is deprecated and
defaults to "n" in Kconfig; VFIO replaces it. On top of this there are
also bug fixes and eager FPU context loading for FPU-heavy guests.
* Common code: Support for multiple address spaces; for now it is
used only for x86 SMM but the s390 folks also have plans.
There are some x86 conflicts, one with the rc8 pull request and
the rest with Ingo's FPU rework.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJViYzhAAoJEL/70l94x66Dda0H/1IepMbfEy+o849d5G71fNTs
F8Y8qUP2GZuL7T53FyFUGSBw+AX7kimu9ia4gR/PmDK+QYsdosYeEjwlsolZfTBf
sHuzNtPoJhi5o1o/ur4NGameo0WjGK8f1xyzr+U8z74QDQyQv/QYCdK/4isp4BJL
ugHNHkuROX6Zng4i7jc9rfaSRg29I3GBxQUYpMkEnD3eMYMUBWGm6Rs8pHgGAMvL
vqzntgW00WNxehTqcAkmD/Wv+txxhkvIadZnjgaxH49e9JeXeBKTIR5vtb7Hns3s
SuapZUyw+c95DIipXq4EznxxaOrjbebOeFgLCJo8+XMXZum8RZf/ob24KroYad0=
=YsAR
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull first batch of KVM updates from Paolo Bonzini:
"The bulk of the changes here is for x86. And for once it's not for
silicon that no one owns: these are really new features for everyone.
Details:
- ARM:
several features are in progress but missed the 4.2 deadline.
So here is just a smattering of bug fixes, plus enabling the
VFIO integration.
- s390:
Some fixes/refactorings/optimizations, plus support for 2GB
pages.
- x86:
* host and guest support for marking kvmclock as a stable
scheduler clock.
* support for write combining.
* support for system management mode, needed for secure boot in
guests.
* a bunch of cleanups required for the above
* support for virtualized performance counters on AMD
* legacy PCI device assignment is deprecated and defaults to "n"
in Kconfig; VFIO replaces it
On top of this there are also bug fixes and eager FPU context
loading for FPU-heavy guests.
- Common code:
Support for multiple address spaces; for now it is used only for
x86 SMM but the s390 folks also have plans"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (124 commits)
KVM: s390: clear floating interrupt bitmap and parameters
KVM: x86/vPMU: Enable PMU handling for AMD PERFCTRn and EVNTSELn MSRs
KVM: x86/vPMU: Implement AMD vPMU code for KVM
KVM: x86/vPMU: Define kvm_pmu_ops to support vPMU function dispatch
KVM: x86/vPMU: introduce kvm_pmu_msr_idx_to_pmc
KVM: x86/vPMU: reorder PMU functions
KVM: x86/vPMU: whitespace and stylistic adjustments in PMU code
KVM: x86/vPMU: use the new macros to go between PMC, PMU and VCPU
KVM: x86/vPMU: introduce pmu.h header
KVM: x86/vPMU: rename a few PMU functions
KVM: MTRR: do not map huge page for non-consistent range
KVM: MTRR: simplify kvm_mtrr_get_guest_memory_type
KVM: MTRR: introduce mtrr_for_each_mem_type
KVM: MTRR: introduce fixed_mtrr_addr_* functions
KVM: MTRR: sort variable MTRRs
KVM: MTRR: introduce var_mtrr_range
KVM: MTRR: introduce fixed_mtrr_segment table
KVM: MTRR: improve kvm_mtrr_get_guest_memory_type
KVM: MTRR: do not split 64 bits MSR content
KVM: MTRR: clean up mtrr default type
...
- ACPICA update to upstream revision 20150515 including basic
support for ACPI 6 features: new ACPI tables introduced by
ACPI 6 (STAO, XENV, WPBT, NFIT, IORT), changes related to the
other tables (DTRM, FADT, LPIT, MADT), new predefined names
(_BTH, _CR3, _DSD, _LPI, _MTL, _PRR, _RDI, _RST, _TFP, _TSN),
fixes and cleanups (Bob Moore, Lv Zheng).
- ACPI device power management core code update to follow ACPI 6
which reflects the ACPI device power management implementation
in Windows (Rafael J Wysocki).
- Rework of the backlight interface selection logic to reduce the
number of kernel command line options and improve the handling
of DMI quirks that may be involved in that and to make the
code generally more straightforward (Hans de Goede).
- Fixes for the ACPI Embedded Controller (EC) driver related to
the handling of EC transactions (Lv Zheng).
- Fix for a regression related to the ACPI resources management
and resulting from a recent change of ACPI initialization code
ordering (Rafael J Wysocki).
- Fix for a system initialization regression related to ACPI
introduced during the 3.14 cycle and caused by running the
code that switches the platform over to the ACPI mode too
early in the initialization sequence (Rafael J Wysocki).
- Support for the ACPI _CCA device configuration object related
to DMA cache coherence (Suravee Suthikulpanit).
- ACPI/APEI fixes and cleanups (Jiri Kosina, Borislav Petkov).
- ACPI battery driver cleanups (Luis Henriques, Mathias Krause).
- ACPI processor driver cleanups (Hanjun Guo).
- Cleanups and documentation update related to the ACPI device
properties interface based on _DSD (Rafael J Wysocki).
- ACPI device power management fixes (Rafael J Wysocki).
- Assorted cleanups related to ACPI (Dominik Brodowski. Fabian
Frederick, Lorenzo Pieralisi, Mathias Krause, Rafael J Wysocki).
- Fix for a long-standing issue causing General Protection Faults
to be generated occasionally on return to user space after resume
from ACPI-based suspend-to-RAM on 32-bit x86 (Ingo Molnar).
- Fix to make the suspend core code return -EBUSY consistently in
all cases when system suspend is aborted due to wakeup detection
(Ruchi Kandoi).
- Support for automated device wakeup IRQ handling allowing drivers
to make their PM support more starightforward (Tony Lindgren).
- New tracepoints for suspend-to-idle tracing and rework of the
prepare/complete callbacks tracing in the PM core (Todd E Brandt,
Rafael J Wysocki).
- Wakeup sources framework enhancements (Jin Qian).
- New macro for noirq system PM callbacks (Grygorii Strashko).
- Assorted cleanups related to system suspend (Rafael J Wysocki).
- cpuidle core cleanups to make the code more efficient (Rafael J
Wysocki).
- powernv/pseries cpuidle driver update (Shilpasri G Bhat).
- cpufreq core fixes related to CPU online/offline that should
reduce the overhead of these operations quite a bit, unless the
CPU in question is physically going away (Viresh Kumar, Saravana
Kannan).
- Serialization of cpufreq governor callbacks to avoid race
conditions in some cases (Viresh Kumar).
- intel_pstate driver fixes and cleanups (Doug Smythies, Prarit
Bhargava, Joe Konno).
- cpufreq driver (arm_big_little, cpufreq-dt, qoriq) updates (Sudeep
Holla, Felipe Balbi, Tang Yuantian).
- Assorted cleanups in cpufreq drivers and core (Shailendra Verma,
Fabian Frederick, Wang Long).
- New Device Tree bindings for representing Operating Performance
Points (Viresh Kumar).
- Updates for the common clock operations support code in the PM
core (Rajendra Nayak, Geert Uytterhoeven).
- PM domains core code update (Geert Uytterhoeven).
- Intel Knights Landing support for the RAPL (Running Average Power
Limit) power capping driver (Dasaratharaman Chandramouli).
- Fixes related to the floor frequency setting on Atom SoCs in the
RAPL power capping driver (Ajay Thomas).
- Runtime PM framework documentation update (Ben Dooks).
- cpupower tool fix (Herton R Krzesinski).
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJViJdWAAoJEILEb/54YlRx/9gP/3gHoFevNRycvn0VpKqdufCI
Mxy2LBBLlfyW2uD3+NvqvA2WWSo0Cs/LgXa04eAVxPdU7k48s8w+54U23wSouzjW
gfwAmuHxzDR8v0h8X3h6BxNzmkIQHtmDcQlA/cZdHejY/UUw01yxRGNUUZDNbxlm
WXn2nmlBLmGqXTYq0fpBV+3jicUghJqHHsBCqa3VR2yQioHMJG01F4UZMqYTZunN
OIvDUghxByKz6alzdCqlLl1Y0exV6vwWUAzBsl1qHqmHu/bWFSZn3ujNNVrjqHhw
Kl7/8dC2pQkv3Zo3gEVvfQ0onotwWZxGHzPQRdvmxvRnBunQVCi/wynx90yABX/r
PPb/iBNV0mZskbF0zb0GZT3ZZWGA8Z0p3o5JQv2jV4m62qTzx8w50Y5kbn9N1WT+
5bre7AVbVAlGonWszcS9iE+6TOboRz9OD1CCwPFXHItFutlBkau+1hHfFoLM0o9n
LhpGuyszT/EUa1BHkLzuCckFqO2DpbF3N2CKmuTekw0CdgdsvRL2pRByuerk3j7R
WQhlcvBq5YH6j43AuoEZKp8r1iN8oG/iqlrMYQaYWrW9hJaoQOoU8dGJxp/e7gKN
r/qeYjETI+tIsjCbtH5WQzzxDI3gPISAYAtfqs7G34EEo+Lwp6kyRUAF4kDot2V3
ZIyuKMmTu4cdwDETr/O+
=7jTj
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management and ACPI updates from Rafael Wysocki:
"The rework of backlight interface selection API from Hans de Goede
stands out from the number of commits and the number of affected
places perspective. The cpufreq core fixes from Viresh Kumar are
quite significant too as far as the number of commits goes and because
they should reduce CPU online/offline overhead quite a bit in the
majority of cases.
From the new featues point of view, the ACPICA update (to upstream
revision 20150515) adding support for new ACPI 6 material to ACPICA is
the one that matters the most as some new significant features will be
based on it going forward. Also included is an update of the ACPI
device power management core to follow ACPI 6 (which in turn reflects
the Windows' device PM implementation), a PM core extension to support
wakeup interrupts in a more generic way and support for the ACPI _CCA
device configuration object.
The rest is mostly fixes and cleanups all over and some documentation
updates, including new DT bindings for Operating Performance Points.
There is one fix for a regression introduced in the 4.1 cycle, but it
adds quite a number of lines of code, it wasn't really ready before
Thursday and you were on vacation, so I refrained from pushing it on
the last minute for 4.1.
Specifics:
- ACPICA update to upstream revision 20150515 including basic support
for ACPI 6 features: new ACPI tables introduced by ACPI 6 (STAO,
XENV, WPBT, NFIT, IORT), changes related to the other tables (DTRM,
FADT, LPIT, MADT), new predefined names (_BTH, _CR3, _DSD, _LPI,
_MTL, _PRR, _RDI, _RST, _TFP, _TSN), fixes and cleanups (Bob Moore,
Lv Zheng).
- ACPI device power management core code update to follow ACPI 6
which reflects the ACPI device power management implementation in
Windows (Rafael J Wysocki).
- rework of the backlight interface selection logic to reduce the
number of kernel command line options and improve the handling of
DMI quirks that may be involved in that and to make the code
generally more straightforward (Hans de Goede).
- fixes for the ACPI Embedded Controller (EC) driver related to the
handling of EC transactions (Lv Zheng).
- fix for a regression related to the ACPI resources management and
resulting from a recent change of ACPI initialization code ordering
(Rafael J Wysocki).
- fix for a system initialization regression related to ACPI
introduced during the 3.14 cycle and caused by running the code
that switches the platform over to the ACPI mode too early in the
initialization sequence (Rafael J Wysocki).
- support for the ACPI _CCA device configuration object related to
DMA cache coherence (Suravee Suthikulpanit).
- ACPI/APEI fixes and cleanups (Jiri Kosina, Borislav Petkov).
- ACPI battery driver cleanups (Luis Henriques, Mathias Krause).
- ACPI processor driver cleanups (Hanjun Guo).
- cleanups and documentation update related to the ACPI device
properties interface based on _DSD (Rafael J Wysocki).
- ACPI device power management fixes (Rafael J Wysocki).
- assorted cleanups related to ACPI (Dominik Brodowski, Fabian
Frederick, Lorenzo Pieralisi, Mathias Krause, Rafael J Wysocki).
- fix for a long-standing issue causing General Protection Faults to
be generated occasionally on return to user space after resume from
ACPI-based suspend-to-RAM on 32-bit x86 (Ingo Molnar).
- fix to make the suspend core code return -EBUSY consistently in all
cases when system suspend is aborted due to wakeup detection (Ruchi
Kandoi).
- support for automated device wakeup IRQ handling allowing drivers
to make their PM support more starightforward (Tony Lindgren).
- new tracepoints for suspend-to-idle tracing and rework of the
prepare/complete callbacks tracing in the PM core (Todd E Brandt,
Rafael J Wysocki).
- wakeup sources framework enhancements (Jin Qian).
- new macro for noirq system PM callbacks (Grygorii Strashko).
- assorted cleanups related to system suspend (Rafael J Wysocki).
- cpuidle core cleanups to make the code more efficient (Rafael J
Wysocki).
- powernv/pseries cpuidle driver update (Shilpasri G Bhat).
- cpufreq core fixes related to CPU online/offline that should reduce
the overhead of these operations quite a bit, unless the CPU in
question is physically going away (Viresh Kumar, Saravana Kannan).
- serialization of cpufreq governor callbacks to avoid race
conditions in some cases (Viresh Kumar).
- intel_pstate driver fixes and cleanups (Doug Smythies, Prarit
Bhargava, Joe Konno).
- cpufreq driver (arm_big_little, cpufreq-dt, qoriq) updates (Sudeep
Holla, Felipe Balbi, Tang Yuantian).
- assorted cleanups in cpufreq drivers and core (Shailendra Verma,
Fabian Frederick, Wang Long).
- new Device Tree bindings for representing Operating Performance
Points (Viresh Kumar).
- updates for the common clock operations support code in the PM core
(Rajendra Nayak, Geert Uytterhoeven).
- PM domains core code update (Geert Uytterhoeven).
- Intel Knights Landing support for the RAPL (Running Average Power
Limit) power capping driver (Dasaratharaman Chandramouli).
- fixes related to the floor frequency setting on Atom SoCs in the
RAPL power capping driver (Ajay Thomas).
- runtime PM framework documentation update (Ben Dooks).
- cpupower tool fix (Herton R Krzesinski)"
* tag 'pm+acpi-4.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (194 commits)
cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state
x86: Load __USER_DS into DS/ES after resume
PM / OPP: Add binding for 'opp-suspend'
PM / OPP: Allow multiple OPP tables to be passed via DT
PM / OPP: Add new bindings to address shortcomings of existing bindings
ACPI: Constify ACPI device IDs in documentation
ACPI / enumeration: Document the rules regarding the PRP0001 device ID
ACPI / video: Make acpi_video_unregister_backlight() private
acpi-video-detect: Remove old API
toshiba-acpi: Port to new backlight interface selection API
thinkpad-acpi: Port to new backlight interface selection API
sony-laptop: Port to new backlight interface selection API
samsung-laptop: Port to new backlight interface selection API
msi-wmi: Port to new backlight interface selection API
msi-laptop: Port to new backlight interface selection API
intel-oaktrail: Port to new backlight interface selection API
ideapad-laptop: Port to new backlight interface selection API
fujitsu-laptop: Port to new backlight interface selection API
eeepc-laptop: Port to new backlight interface selection API
dell-wmi: Port to new backlight interface selection API
...
Pull livepatching fixes from Jiri Kosina:
- symbol lookup locking fix, from Miroslav Benes
- error handling improvements in case of failure of the module coming
notifier, from Minfei Huang
- we were too pessimistic when kASLR has been enabled on x86 and were
dropping address hints on the floor unnecessarily in such case. Fix
from Jiri Kosina
- a few other small fixes and cleanups
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
livepatch: add module locking around kallsyms calls
livepatch: annotate klp_init() with __init
livepatch: introduce patch/func-walking helpers
livepatch: make kobject in klp_object statically allocated
livepatch: Prevent patch inconsistencies if the coming module notifier fails
livepatch: match return value to function signature
x86: kaslr: fix build due to missing ALIGN definition
livepatch: x86: make kASLR logic more accurate
x86: introduce kaslr_offset()
This patch enables AMD guest VM to access (R/W) PMU related MSRs, which
include PERFCTR[0..3] and EVNTSEL[0..3].
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch replaces the empty AMD vPMU functions (in pmu_amd.c) with real
implementation.
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch defines a new function pointer struct (kvm_pmu_ops) to
support vPMU for both Intel and AMD. The functions pointers defined in
this new struct will be linked with Intel and AMD functions later. In the
meanwhile the struct that maps from event_sel bits to PERF_TYPE_HARDWARE
events is renamed and moved from Intel specific code to kvm_host.h as a
common struct.
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Pull crypto update from Herbert Xu:
"Here is the crypto update for 4.2:
API:
- Convert RNG interface to new style.
- New AEAD interface with one SG list for AD and plain/cipher text.
All external AEAD users have been converted.
- New asymmetric key interface (akcipher).
Algorithms:
- Chacha20, Poly1305 and RFC7539 support.
- New RSA implementation.
- Jitter RNG.
- DRBG is now seeded with both /dev/random and Jitter RNG. If kernel
pool isn't ready then DRBG will be reseeded when it is.
- DRBG is now the default crypto API RNG, replacing krng.
- 842 compression (previously part of powerpc nx driver).
Drivers:
- Accelerated SHA-512 for arm64.
- New Marvell CESA driver that supports DMA and more algorithms.
- Updated powerpc nx 842 support.
- Added support for SEC1 hardware to talitos"
* git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (292 commits)
crypto: marvell/cesa - remove COMPILE_TEST dependency
crypto: algif_aead - Temporarily disable all AEAD algorithms
crypto: af_alg - Forbid the use internal algorithms
crypto: echainiv - Only hold RNG during initialisation
crypto: seqiv - Add compatibility support without RNG
crypto: eseqiv - Offer normal cipher functionality without RNG
crypto: chainiv - Offer normal cipher functionality without RNG
crypto: user - Add CRYPTO_MSG_DELRNG
crypto: user - Move cryptouser.h to uapi
crypto: rng - Do not free default RNG when it becomes unused
crypto: skcipher - Allow givencrypt to be NULL
crypto: sahara - propagate the error on clk_disable_unprepare() failure
crypto: rsa - fix invalid select for AKCIPHER
crypto: picoxcell - Update to the current clk API
crypto: nx - Check for bogus firmware properties
crypto: marvell/cesa - add DT bindings documentation
crypto: marvell/cesa - add support for Kirkwood and Dove SoCs
crypto: marvell/cesa - add support for Orion SoCs
crypto: marvell/cesa - add allhwsupport module parameter
crypto: marvell/cesa - add support for all armada SoCs
...
Pull timer updates from Thomas Gleixner:
"A rather largish update for everything time and timer related:
- Cache footprint optimizations for both hrtimers and timer wheel
- Lower the NOHZ impact on systems which have NOHZ or timer migration
disabled at runtime.
- Optimize run time overhead of hrtimer interrupt by making the clock
offset updates smarter
- hrtimer cleanups and removal of restrictions to tackle some
problems in sched/perf
- Some more leap second tweaks
- Another round of changes addressing the 2038 problem
- First step to change the internals of clock event devices by
introducing the necessary infrastructure
- Allow constant folding for usecs/msecs_to_jiffies()
- The usual pile of clockevent/clocksource driver updates
The hrtimer changes contain updates to sched, perf and x86 as they
depend on them plus changes all over the tree to cleanup API changes
and redundant code, which got copied all over the place. The y2038
changes touch s390 to remove the last non 2038 safe code related to
boot/persistant clock"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
clocksource: Increase dependencies of timer-stm32 to limit build wreckage
timer: Minimize nohz off overhead
timer: Reduce timer migration overhead if disabled
timer: Stats: Simplify the flags handling
timer: Replace timer base by a cpu index
timer: Use hlist for the timer wheel hash buckets
timer: Remove FIFO "guarantee"
timers: Sanitize catchup_timer_jiffies() usage
hrtimer: Allow hrtimer::function() to free the timer
seqcount: Introduce raw_write_seqcount_barrier()
seqcount: Rename write_seqcount_barrier()
hrtimer: Fix hrtimer_is_queued() hole
hrtimer: Remove HRTIMER_STATE_MIGRATE
selftest: Timers: Avoid signal deadlock in leap-a-day
timekeeping: Copy the shadow-timekeeper over the real timekeeper last
clockevents: Check state instead of mode in suspend/resume path
selftests: timers: Add leap-second timer edge testing to leap-a-day.c
ntp: Do leapsecond adjustment in adjtimex read path
time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
ntp: Introduce and use SECS_PER_DAY macro instead of 86400
...
Pull x86 core updates from Ingo Molnar:
"There were so many changes in the x86/asm, x86/apic and x86/mm topics
in this cycle that the topical separation of -tip broke down somewhat -
so the result is a more traditional architecture pull request,
collected into the 'x86/core' topic.
The topics were still maintained separately as far as possible, so
bisectability and conceptual separation should still be pretty good -
but there were a handful of merge points to avoid excessive
dependencies (and conflicts) that would have been poorly tested in the
end.
The next cycle will hopefully be much more quiet (or at least will
have fewer dependencies).
The main changes in this cycle were:
* x86/apic changes, with related IRQ core changes: (Jiang Liu, Thomas
Gleixner)
- This is the second and most intrusive part of changes to the x86
interrupt handling - full conversion to hierarchical interrupt
domains:
[IOAPIC domain] -----
|
[MSI domain] --------[Remapping domain] ----- [ Vector domain ]
| (optional) |
[HPET MSI domain] ----- |
|
[DMAR domain] -----------------------------
|
[Legacy domain] -----------------------------
This now reflects the actual hardware and allowed us to distangle
the domain specific code from the underlying parent domain, which
can be optional in the case of interrupt remapping. It's a clear
separation of functionality and removes quite some duct tape
constructs which plugged the remap code between ioapic/msi/hpet
and the vector management.
- Intel IOMMU IRQ remapping enhancements, to allow direct interrupt
injection into guests (Feng Wu)
* x86/asm changes:
- Tons of cleanups and small speedups, micro-optimizations. This
is in preparation to move a good chunk of the low level entry
code from assembly to C code (Denys Vlasenko, Andy Lutomirski,
Brian Gerst)
- Moved all system entry related code to a new home under
arch/x86/entry/ (Ingo Molnar)
- Removal of the fragile and ugly CFI dwarf debuginfo annotations.
Conversion to C will reintroduce many of them - but meanwhile
they are only getting in the way, and the upstream kernel does
not rely on them (Ingo Molnar)
- NOP handling refinements. (Borislav Petkov)
* x86/mm changes:
- Big PAT and MTRR rework: making the code more robust and
preparing to phase out exposing direct MTRR interfaces to drivers -
in favor of using PAT driven interfaces (Toshi Kani, Luis R
Rodriguez, Borislav Petkov)
- New ioremap_wt()/set_memory_wt() interfaces to support
Write-Through cached memory mappings. This is especially
important for good performance on NVDIMM hardware (Toshi Kani)
* x86/ras changes:
- Add support for deferred errors on AMD (Aravind Gopalakrishnan)
This is an important RAS feature which adds hardware support for
poisoned data. That means roughly that the hardware marks data
which it has detected as corrupted but wasn't able to correct, as
poisoned data and raises an APIC interrupt to signal that in the
form of a deferred error. It is the OS's responsibility then to
take proper recovery action and thus prolonge system lifetime as
far as possible.
- Add support for Intel "Local MCE"s: upcoming CPUs will support
CPU-local MCE interrupts, as opposed to the traditional system-
wide broadcasted MCE interrupts (Ashok Raj)
- Misc cleanups (Borislav Petkov)
* x86/platform changes:
- Intel Atom SoC updates
... and lots of other cleanups, fixlets and other changes - see the
shortlog and the Git log for details"
* 'x86-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (222 commits)
x86/hpet: Use proper hpet device number for MSI allocation
x86/hpet: Check for irq==0 when allocating hpet MSI interrupts
x86/mm/pat, drivers/infiniband/ipath: Use arch_phys_wc_add() and require PAT disabled
x86/mm/pat, drivers/media/ivtv: Use arch_phys_wc_add() and require PAT disabled
x86/platform/intel/baytrail: Add comments about why we disabled HPET on Baytrail
genirq: Prevent crash in irq_move_irq()
genirq: Enhance irq_data_to_desc() to support hierarchy irqdomain
iommu, x86: Properly handle posted interrupts for IOMMU hotplug
iommu, x86: Provide irq_remapping_cap() interface
iommu, x86: Setup Posted-Interrupts capability for Intel iommu
iommu, x86: Add cap_pi_support() to detect VT-d PI capability
iommu, x86: Avoid migrating VT-d posted interrupts
iommu, x86: Save the mode (posted or remapped) of an IRTE
iommu, x86: Implement irq_set_vcpu_affinity for intel_ir_chip
iommu: dmar: Provide helper to copy shared irte fields
iommu: dmar: Extend struct irte for VT-d Posted-Interrupts
iommu: Add new member capability to struct irq_remap_ops
x86/asm/entry/64: Disentangle error_entry/exit gsbase/ebx/usermode code
x86/asm/entry/32: Shorten __audit_syscall_entry() args preparation
x86/asm/entry/32: Explain reloading of registers after __audit_syscall_entry()
...
Pull x86 warning fixlet from Ingo Molnar:
"A build fix for certain (rare) variants of binutils that did not make
it into v4.1"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Fix overflow warning with 32-bit binutils
Pul x86 microcode updates from Ingo Molnar:
"x86 microcode loader updates from Borislav Petkov:
- early parsing of the built-in microcode
- cleanups
- misc smaller fixes"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode: Correct CPU family related variable types
x86/microcode: Disable builtin microcode loading on 32-bit for now
x86/microcode/intel: Rename get_matching_sig()
x86/microcode/intel: Simplify get_matching_sig()
x86/microcode/intel: Simplify update_match_cpu()
x86/microcode/intel: Rename get_matching_microcode
x86/cpu/microcode: Zap changelog
x86/microcode: Parse built-in microcode early
x86/microcode/intel: Remove unused @rev arg of get_matching_sig()
x86/microcode/intel: Get rid of revision_is_newer()
Pull x86 kdump updates from Ingo Molnar:
"Three kdump robustness related improvements (Joerg Roedel)"
* 'x86-kdump-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/crash: Allocate enough low memory when crashkernel=high
x86/swiotlb: Try coherent allocations with __GFP_NOWARN
swiotlb: Warn on allocation failure in swiotlb_alloc_coherent()
Pull x86 FPU updates from Ingo Molnar:
"This tree contains two main changes:
- The big FPU code rewrite: wide reaching cleanups and reorganization
that pulls all the FPU code together into a clean base in
arch/x86/fpu/.
The resulting code is leaner and faster, and much easier to
understand. This enables future work to further simplify the FPU
code (such as removing lazy FPU restores).
By its nature these changes have a substantial regression risk: FPU
code related bugs are long lived, because races are often subtle
and bugs mask as user-space failures that are difficult to track
back to kernel side backs. I'm aware of no unfixed (or even
suspected) FPU related regression so far.
- MPX support rework/fixes. As this is still not a released CPU
feature, there were some buglets in the code - should be much more
robust now (Dave Hansen)"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (250 commits)
x86/fpu: Fix double-increment in setup_xstate_features()
x86/mpx: Allow 32-bit binaries on 64-bit kernels again
x86/mpx: Do not count MPX VMAs as neighbors when unmapping
x86/mpx: Rewrite the unmap code
x86/mpx: Support 32-bit binaries on 64-bit kernels
x86/mpx: Use 32-bit-only cmpxchg() for 32-bit apps
x86/mpx: Introduce new 'directory entry' to 'addr' helper function
x86/mpx: Add temporary variable to reduce masking
x86: Make is_64bit_mm() widely available
x86/mpx: Trace allocation of new bounds tables
x86/mpx: Trace the attempts to find bounds tables
x86/mpx: Trace entry to bounds exception paths
x86/mpx: Trace #BR exceptions
x86/mpx: Introduce a boot-time disable flag
x86/mpx: Restrict the mmap() size check to bounds tables
x86/mpx: Remove redundant MPX_BNDCFG_ADDR_MASK
x86/mpx: Clean up the code by not passing a task pointer around when unnecessary
x86/mpx: Use the new get_xsave_field_ptr()API
x86/fpu/xstate: Wrap get_xsave_addr() to make it safer
x86/fpu/xstate: Fix up bad get_xsave_addr() assumptions
...
Pull x86 EFI updates from Ingo Molnar:
"EFI changes:
- Use idiomatic negative error values in efivar_create_sysfs_entry()
instead of returning '1' to indicate error (Dan Carpenter)
- Implement new support to expose the EFI System Resource Tables in
sysfs, which provides information for performing firmware updates
(Peter Jones)
- Documentation cleanup in the EFI handover protocol section which
falsely claimed that 'cmdline_size' needed to be filled out by the
boot loader (Alex Smith)
- Align the order of SMBIOS tables in /sys/firmware/efi/systab to
match the way that we do things for ACPI and add documentation to
Documentation/ABI (Jean Delvare)"
* 'x86-efi-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Work around ia64 build problem with ESRT driver
efi: Add 'systab' information to Documentation/ABI
efi: dmi: List SMBIOS3 table before SMBIOS table
efi/esrt: Fix some compiler warnings
x86, doc: Remove cmdline_size from list of fields to be filled in for EFI handover
efi: Add esrt support
efi: efivar_create_sysfs_entry() should return negative error codes
Pull x86 CPU features from Ingo Molnar:
"Various CPU feature support related changes: in particular the
/proc/cpuinfo model name sanitization change should be monitored, it
has a chance to break stuff. (but really shouldn't and there are no
regression reports)"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/amd: Give access to the number of nodes in a physical package
x86/cpu: Trim model ID whitespace
x86/cpu: Strip any /proc/cpuinfo model name field whitespace
x86/cpu/amd: Set X86_FEATURE_EXTD_APICID for future processors
x86/gart: Check for GART support before accessing GART registers
Pull x86 cleanups from Ingo Molnar:
"Misc cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Clean up types in xlate_dev_mem_ptr() some more
x86: Deinline dma_free_attrs()
x86: Deinline dma_alloc_attrs()
x86: Remove unused TI_cpu
x86: Merge common 32-bit values in asm-offsets.c
Pull scheduler updates from Ingo Molnar:
"The main changes are:
- lockless wakeup support for futexes and IPC message queues
(Davidlohr Bueso, Peter Zijlstra)
- Replace spinlocks with atomics in thread_group_cputimer(), to
improve scalability (Jason Low)
- NUMA balancing improvements (Rik van Riel)
- SCHED_DEADLINE improvements (Wanpeng Li)
- clean up and reorganize preemption helpers (Frederic Weisbecker)
- decouple page fault disabling machinery from the preemption
counter, to improve debuggability and robustness (David
Hildenbrand)
- SCHED_DEADLINE documentation updates (Luca Abeni)
- topology CPU masks cleanups (Bartosz Golaszewski)
- /proc/sched_debug improvements (Srikar Dronamraju)"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
sched/deadline: Remove needless parameter in dl_runtime_exceeded()
sched: Remove superfluous resetting of the p->dl_throttled flag
sched/deadline: Drop duplicate init_sched_dl_class() declaration
sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
sched/deadline: Make init_sched_dl_class() __init
sched/deadline: Optimize pull_dl_task()
sched/preempt: Add static_key() to preempt_notifiers
sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
sched/debug: Properly format runnable tasks in /proc/sched_debug
sched/numa: Only consider less busy nodes as numa balancing destinations
Revert 095bebf61a ("sched/numa: Do not move past the balance point if unbalanced")
sched/fair: Prevent throttling in early pick_next_task_fair()
preempt: Reorganize the notrace definitions a bit
preempt: Use preempt_schedule_context() as the official tracing preemption point
sched: Make preempt_schedule_context() function-tracing safe
x86: Remove cpu_sibling_mask() and cpu_core_mask()
x86: Replace cpu_**_mask() with topology_**_cpumask()
...
Pull perf fixes from Ingo Molnar:
"These are the left over fixes from the v4.1 cycle"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf tools: Fix build breakage if prefix= is specified
perf/x86: Honor the architectural performance monitoring version
perf/x86/intel: Fix PMI handling for Intel PT
perf/x86/intel/bts: Fix DS area sharing with x86_pmu events
perf/x86: Add more Broadwell model numbers
perf: Fix ring_buffer_attach() RCU sync, again
Pull perf updates from Ingo Molnar:
"Kernel side changes mostly consist of work on x86 PMU drivers:
- x86 Intel PT (hardware CPU tracer) improvements (Alexander
Shishkin)
- x86 Intel CQM (cache quality monitoring) improvements (Thomas
Gleixner)
- x86 Intel PEBSv3 support (Peter Zijlstra)
- x86 Intel PEBS interrupt batching support for lower overhead
sampling (Zheng Yan, Kan Liang)
- x86 PMU scheduler fixes and improvements (Peter Zijlstra)
There's too many tooling improvements to list them all - here are a
few select highlights:
'perf bench':
- Introduce new 'perf bench futex' benchmark: 'wake-parallel', to
measure parallel waker threads generating contention for kernel
locks (hb->lock). (Davidlohr Bueso)
'perf top', 'perf report':
- Allow disabling/enabling events dynamicaly in 'perf top':
a 'perf top' session can instantly become a 'perf report'
one, i.e. going from dynamic analysis to a static one,
returning to a dynamic one is possible, to toogle the
modes, just press 'f' to 'freeze/unfreeze' the sampling. (Arnaldo Carvalho de Melo)
- Make Ctrl-C stop processing on TUI, allowing interrupting the load of big
perf.data files (Namhyung Kim)
'perf probe': (Masami Hiramatsu)
- Support glob wildcards for function name
- Support $params special probe argument: Collect all function arguments
- Make --line checks validate C-style function name.
- Add --no-inlines option to avoid searching inline functions
- Greatly speed up 'perf probe --list' by caching debuginfo.
- Improve --filter support for 'perf probe', allowing using its arguments
on other commands, as --add, --del, etc.
'perf sched':
- Add option in 'perf sched' to merge like comms to lat output (Josef Bacik)
Plus tons of infrastructure work - in particular preparation for
upcoming threaded perf report support, but also lots of other work -
and fixes and other improvements. See (much) more details in the
shortlog and in the git log"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (305 commits)
perf tools: Configurable per thread proc map processing time out
perf tools: Add time out to force stop proc map processing
perf report: Fix sort__sym_cmp to also compare end of symbol
perf hists browser: React to unassigned hotkey pressing
perf top: Tell the user how to unfreeze events after pressing 'f'
perf hists browser: Honour the help line provided by builtin-{top,report}.c
perf hists browser: Do not exit when 'f' is pressed in 'report' mode
perf top: Replace CTRL+z with 'f' as hotkey for enable/disable events
perf annotate: Rename source_line_percent to source_line_samples
perf annotate: Display total number of samples with --show-total-period
perf tools: Ensure thread-stack is flushed
perf top: Allow disabling/enabling events dynamicly
perf evlist: Add toggle_enable() method
perf trace: Fix race condition at the end of started workloads
perf probe: Speed up perf probe --list by caching debuginfo
perf probe: Show usage even if the last event is skipped
perf tools: Move libtraceevent dynamic list to separated LDFLAGS variable
perf tools: Fix a problem when opening old perf.data with different byte order
perf tools: Ignore .config-detected in .gitignore
perf probe: Fix to return error if no probe is added
...
Pull locking updates from Ingo Molnar:
"The main changes are:
- 'qspinlock' support, enabled on x86: queued spinlocks - these are
now the spinlock variant used by x86 as they outperform ticket
spinlocks in every category. (Waiman Long)
- 'pvqspinlock' support on x86: paravirtualized variant of queued
spinlocks. (Waiman Long, Peter Zijlstra)
- 'qrwlock' support, enabled on x86: queued rwlocks. Similar to
queued spinlocks, they are now the variant used by x86:
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
- various lockdep fixlets
- various locking primitives cleanups, further WRITE_ONCE()
propagation"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
locking/lockdep: Remove hard coded array size dependency
locking/qrwlock: Don't contend with readers when setting _QW_WAITING
lockdep: Do not break user-visible string
locking/arch: Rename set_mb() to smp_store_mb()
locking/arch: Add WRITE_ONCE() to set_mb()
rtmutex: Warn if trylock is called from hard/softirq context
arch: Remove __ARCH_HAVE_CMPXCHG
locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS
locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()
locking/pvqspinlock, x86: Enable PV qspinlock for Xen
locking/pvqspinlock, x86: Enable PV qspinlock for KVM
locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
locking/pvqspinlock: Implement simple paravirt support for the qspinlock
locking/qspinlock: Revert to test-and-set on hypervisors
locking/qspinlock: Use a simple write to grab the lock
locking/qspinlock: Optimize for smaller NR_CPUS
locking/qspinlock: Extract out code snippets for the next patch
locking/qspinlock: Add pending bit
...
Pull RCU updates from Ingo Molnar:
- Continued initialization/Kconfig updates: hide most Kconfig options
from unsuspecting users.
There's now a single high level configuration option:
*
* RCU Subsystem
*
Make expert-level adjustments to RCU configuration (RCU_EXPERT) [N/y/?] (NEW)
Which if answered in the negative, leaves us with a single
interactive configuration option:
Offload RCU callback processing from boot-selected CPUs (RCU_NOCB_CPU) [N/y/?] (NEW)
All the rest of the RCU options are configured automatically. Later
on we'll remove this single leftover configuration option as well.
- Remove all uses of RCU-protected array indexes: replace the
rcu_[access|dereference]_index_check() APIs with READ_ONCE() and
rcu_lockdep_assert()
- RCU CPU-hotplug cleanups
- Updates to Tiny RCU: a race fix and further code shrinkage.
- RCU torture-testing updates: fixes, speedups, cleanups and
documentation updates.
- Miscellaneous fixes
- Documentation updates
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
rcutorture: Allow repetition factors in Kconfig-fragment lists
rcutorture: Display "make oldconfig" errors
rcutorture: Update TREE_RCU-kconfig.txt
rcutorture: Make rcutorture scripts force RCU_EXPERT
rcutorture: Update configuration fragments for rcutree.rcu_fanout_exact
rcutorture: TASKS_RCU set directly, so don't explicitly set it
rcutorture: Test SRCU cleanup code path
rcutorture: Replace barriers with smp_store_release() and smp_load_acquire()
locktorture: Change longdelay_us to longdelay_ms
rcutorture: Allow negative values of nreaders to oversubscribe
rcutorture: Exchange TREE03 and TREE08 NR_CPUS, speed up CPU hotplug
rcutorture: Exchange TREE03 and TREE04 geometries
locktorture: fix deadlock in 'rw_lock_irq' type
rcu: Correctly handle non-empty Tiny RCU callback list with none ready
rcutorture: Test both RCU-sched and RCU-bh for Tiny RCU
rcu: Further shrink Tiny RCU by making empty functions static inlines
rcu: Conditionally compile RCU's eqs warnings
rcu: Remove prompt for RCU implementation
rcu: Make RCU able to tolerate undefined CONFIG_RCU_KTHREAD_PRIO
rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAF
...
Since we only support modesetting by default (disabling modesetting on
the command line prevents i915.ko from loading), having a parameter to
disable modesstting by default is superfluous, i.e. saying
CONFIG_DRM_I915_KMS=n is equivalent to CONFIG_DRM_I915=n.
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Veter <daniel.vetter@ffwll.ch>
Reviewed-by: Damien Lespiau <damien.lespiau@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Srinivas Pandruvada reported a problem with system resume from
suspend-to-RAM on 32-bit x86 systems where the DS register of
the CPU is set to __KERNEL_DS instead of __USER_DS on return
to user space which cases a General Protection Fault to occur.
The issue is that DS is set to __KERNEL_DS by the ACPI resume code
path while the SYSEXIT path never reloads DS/ES. It assumes they
are still __USER_DS set at the SYSENTER time (Brian Gerst), so if
the return to user space happens to be through SYSEXIT, it will lead
to the reported GPF.
Fix the problem by setting the DS and ES registers to __USER_DS
as expected by the SYSEXIT path.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=61781
Link: http://marc.info/?l=linux-pm&m=143406648920385&w=2
Acked-by: Pavel Machek <pavel@ucw.cz>
Tested-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
hpet_assign_irq() is called with hpet_device->num as "hardware
interrupt number", but hpet_device->num is initialized after the
interrupt has been assigned, so it's always 0. As a consequence only
the first MSI allocation succeeds, the following ones fail because the
"hardware interrupt number" already exists.
Move the initialization of dev->num and other fields before the call
to hpet_assign_irq(), which is the ordering before the offending
commit which introduced that regression.
Fixes: "3cb96f0c9733 x86/hpet: Enhance HPET IRQ to support hierarchical irqdomains"
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1506211635010.4107@nanos
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
irq == 0 is not a valid irq for a irqdomain MSI allocation, but hpet
code checks only for negative return values.
Reported-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/558447AF.30703@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This will be used for private function used by AMD- and Intel-specific
PMU implementations.
Signed-off-by: Wei Huang <wei@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Based on Intel's SDM, mapping huge page which do not have consistent
memory cache for each 4k page will cause undefined behavior
In order to avoiding this kind of undefined behavior, we force to use
4k pages under this case
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
mtrr_for_each_mem_type() is ready now, use it to simplify
kvm_mtrr_get_guest_memory_type()
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It walks all MTRRs and gets all the memory cache type setting for the
specified range also it checks if the range is fully covered by MTRRs
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Adjust for range_size->range_shift change. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Two functions are introduced:
- fixed_mtrr_addr_to_seg() translates the address to the fixed
MTRR segment
- fixed_mtrr_addr_seg_to_range_index() translates the address to
the index of kvm_mtrr.fixed_ranges[]
They will be used in the later patch
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Adjust for range_size->range_shift change. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Sort all valid variable MTRRs based on its base address, it will help us to
check a range to see if it's fully contained in variable MTRRs
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Fix list insertion sort, simplify var_mtrr_range_is_valid to just
test the V bit. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It gets the range for the specified variable MTRR
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Simplify boolean operations. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This table summarizes the information of fixed MTRRs and introduce some APIs
to abstract its operation which helps us to clean up the code and will be
used in later patches
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
[Change range_size to range_shift, in order to avoid udivdi3 errors.
- Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
- kvm_mtrr_get_guest_memory_type() only checks one page in MTRRs so
that it's unnecessary to check to see if the range is partially
covered in MTRR
- optimize the check of overlap memory type and add some comments
to explain the precedence
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Variable MTRR MSRs are 64 bits which are directly accessed with full length,
no reason to split them to two 32 bits
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop kvm_mtrr->enable, omit the decode/code workload and get rid of
all the hard code
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Only KVM_NR_VAR_MTRR variable MTRRs are available in KVM guest
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
vMTRR does not depend on any host MTRR feature and fixed MTRRs have always
been implemented, so drop this field
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MSR_MTRRcap is a MTRR msr so move the handler to the common place, also
add some comments to make the hard code more readable
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
MTRR code locates in x86.c and mmu.c so that move them to a separate file to
make the organization more clearer and it will be the place where we fully
implement vMTRR
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, CR0.CD is not checked when we virtualize memory cache type for
noncoherent_dma guests, this patch fixes it by :
- setting UC for all memory if CR0.CD = 1
- zapping all the last sptes in MMU if CR0.CD is changed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If hardware doesn't support DecodeAssist - a feature that provides
more information about the intercept in the VMCB, KVM decodes the
instruction and then updates the next_rip vmcb control field.
However, NRIP support itself depends on cpuid Fn8000_000A_EDX[NRIPS].
Since skip_emulated_instruction() doesn't verify nrip support
before accepting control.next_rip as valid, avoid writing this
field if support isn't present.
Signed-off-by: Bandan Das <bsd@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge the mvebu/drivers branch of the arm-soc tree which contains
just a single patch bfa1ce5f38 ("bus:
mvebu-mbus: add mv_mbus_dram_info_nooverlap()") that happens to be
a prerequisite of the new marvell/cesa crypto driver.
When building the kernel with 32-bit binutils built with support
only for the i386 target, we get the following warning:
arch/x86/kernel/head_32.S:66: Warning: shift count out of range (32 is not between 0 and 31)
The problem is that in that case, binutils' internal type
representation is 32-bit wide and the shift range overflows.
In order to fix this, manipulate the shift expression which
creates the 4GiB constant to not overflow the shift count.
Suggested-by: Michael Matz <matz@suse.de>
Reported-and-tested-by: Enrico Mioso <mrkiko.rs@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Architectural performance monitoring, version 1, doesn't support fixed counters.
Currently, even if a hypervisor advertises support for architectural
performance monitoring version 1, perf may still try to use the fixed
counters, as the constraints are set up based on the CPU model.
This patch ensures that perf honors the architectural performance monitoring
version returned by CPUID, and it only uses the fixed counters for version 2
and above.
(Some of the ideas in this patch came from Peter Zijlstra.)
Signed-off-by: Imre Palik <imrep@amazon.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Anthony Liguori <aliguori@amazon.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433767609-1039-1-git-send-email-imrep.amz@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Intel PT is a separate PMU and it is not using any of the x86_pmu
code paths, which means in particular that the active_events counter
remains intact when new PT events are created.
However, PT uses the generic x86_pmu PMI handler for its PMI handling needs.
The problem here is that the latter checks active_events and in case of it
being zero, exits without calling the actual x86_pmu.handle_nmi(), which
results in unknown NMI errors and massive data loss for PT.
The effect is not visible if there are other perf events in the system
at the same time that keep active_events counter non-zero, for instance
if the NMI watchdog is running, so one needs to disable it to reproduce
the problem.
At the same time, the active_events counter besides doing what the name
suggests also implicitly serves as a PMC hardware and DS area reference
counter.
This patch adds a separate reference counter for the PMC hardware, leaving
active_events for actually counting the events and makes sure it also
counts PT and BTS events.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Link: http://lkml.kernel.org/r/87k2v92t0s.fsf@ashishki-desk.ger.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the intel_bts driver relies on the DS area allocated by the x86_pmu
code in its event_init() path, which is a bug: creating a BTS event while
no x86_pmu events are present results in a NULL pointer dereference.
The same DS area is also used by PEBS sampling, which makes it quite a bit
trickier to have a separate one for intel_bts' purposes.
This patch makes intel_bts driver use the same DS allocation and reference
counting code as x86_pmu to make sure it is always present when either
intel_bts or x86_pmu need it.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Link: http://lkml.kernel.org/r/1434024837-9916-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds additional model numbers for Broadwell to perf.
Support for Broadwell with Iris Pro (Intel Core i7-57xxC)
and support for Broadwell Server Xeon.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1434055942-28253-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stash the number of nodes in a physical processor package
locally and add an accessor to be called by interested parties.
The first user is the MCE injection module which uses it to find
the node base core in a package for injecting a certain type of
errors.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
[ Rewrote the commit message, merged it with the accessor patch and unified naming. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Cc: mchehab@osg.samsung.com
Link: http://lkml.kernel.org/r/1433868317-18417-2-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This question has been asked many times, and finally I found the
official document which explains the problem of HPET on Baytrail,
that it will halt in deep idle states.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: john.stultz@linaro.org
Cc: len.brown@intel.com
Cc: matthew.lee@intel.com
Link: http://lkml.kernel.org/r/1434361201-31743-1-git-send-email-feng.tang@intel.com
[ Prettified things a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We enable _CRS on all systems from 2008 and later. On older systems, we
ignore _CRS and assume the whole physical address space (excluding RAM and
other devices) is available for PCI devices, but on systems that support
physical address spaces larger than 4GB, it's doubtful that the area above
4GB is really available for PCI.
After d56dbf5bab ("PCI: Allocate 64-bit BARs above 4G when possible"), we
try to use that space above 4GB *first*, so we're more likely to put a
device there.
On Juan's Toshiba Satellite Pro U200, BIOS left the graphics, sound, 1394,
and card reader devices unassigned (but only after Windows had been
booted). Only the sound device had a 64-bit BAR, so it was the only device
placed above 4GB, and hence the only device that didn't work.
Keep _CRS enabled even on pre-2008 systems if they support physical address
space larger than 4GB.
Fixes: d56dbf5bab ("PCI: Allocate 64-bit BARs above 4G when possible")
Reported-and-tested-by: Juan Dayer <jdayer@outlook.com>
Reported-and-tested-by: Alan Horsfield <alan@hazelgarth.co.uk>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=99221
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=907092
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: stable@vger.kernel.org # v3.14+
Config TDP is a feature that allows parts to be configured
for different thermal limits after they have left the factory.
This can have an effect on the operation of the part,
particularly in determiniing...
Max Non-turbo Ratio
Turbo Activation Ratio
Signed-off-by: Len Brown <len.brown@intel.com>
The __init_or_module is from commit 05e12e1c4c
("x86: fix 27-rc crash on vsmp due to paravirt during module load").
But as of commit 70511134f6
("Revert "x86: don't compile vsmp_64 for 32bit") this file became
obj-y and hence is now only for built-in. That makes any
"_or_module" support no longer necessary.
We need to distinguish between the two in order to do some header
reorganization between init.h and module.h and we don't want to
be including module.h in non-modular code.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This was using module_init, but the current Kconfig situation is
as follows:
In arch/x86/kernel/cpu/Makefile:
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o perf_event_intel_bts.o
and in arch/x86/Kconfig.cpu:
config CPU_SUP_INTEL
default y
bool "Support Intel processors" if PROCESSOR_SELECT
So currently, the end user can not build this code into a module.
If in the future, there is desire for this to be modular, then
it can be changed to include <linux/module.h> and use module_init.
But currently, in the non-modular case, a module_init becomes a
device_initcall. But this really isn't a device, so we should
choose a more appropriate initcall bucket to put it in.
The obvious choice here seems to be arch_initcall, but that does
make it earlier than it was currently through device_initcall.
As long as perf_pmu_register() is functional, we should be OK.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This was using module_init, but the current Kconfig situation is
as follows:
In arch/x86/kernel/cpu/Makefile:
obj-$(CONFIG_CPU_SUP_INTEL) += perf_event_intel_pt.o perf_event_intel_bts.o
and in arch/x86/Kconfig.cpu:
config CPU_SUP_INTEL
default y
bool "Support Intel processors" if PROCESSOR_SELECT
So currently, the end user can not build this code into a module.
If in the future, there is desire for this to be modular, then
it can be changed to include <linux/module.h> and use module_init.
But currently, in the non-modular case, a module_init becomes a
device_initcall. But this really isn't a device, so we should
choose a more appropriate initcall bucket to put it in.
The obvious choice here seems to be arch_initcall, but that does
make it earlier than it was currently through device_initcall.
As long as perf_pmu_register() is functional, we should be OK.
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The bootflag.o is obj-y (always built in). It will never be
modular, so using module_init as an alias for __initcall is
somewhat misleading.
Fix this up now, so that we can relocate module_init from
init.h into module.h in the future. If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.
Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups. As __initcall gets
mapped onto device_initcall, our use of arch_initcall (which
makes sense for arch code) will thus change this registration
from level 6-device to level 3-arch (i.e. slightly earlier).
However no observable impact of that small difference has
been observed during testing, or is expected.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The devicetree.o is built for "OF" -- which is bool, and hence
this code is either present or absent. It will never be modular,
so using module_init as an alias for __initcall can be somewhat
misleading.
Fix this up now, so that we can relocate module_init from
init.h into module.h in the future. If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.
Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups. As __initcall gets
mapped onto device_initcall, our use of device_initcall
directly in this change means that the runtime impact is
zero -- it will remain at level 6 in initcall ordering.
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
The X86_INTEL_MID option is bool, and hence this code is either
present or absent. It will never be modular, so using
module_init as an alias for __initcall is rather misleading.
Fix this up now, so that we can relocate module_init from
init.h into module.h in the future. If we don't do this, we'd
have to add module.h to obviously non-modular code, and that
would be a worse thing.
Note that direct use of __initcall is discouraged, vs. one
of the priority categorized subgroups. As __initcall gets
mapped onto device_initcall, our use of device_initcall
directly in this change means that the runtime impact is
zero -- it will remain at level 6 in initcall ordering.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: x86@kernel.org
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
This lets you build a kernel which can support xen dom0
or xen guests on i386, x86-64 and arm64 by just using:
make xenconfig
You can start from an allnoconfig and then switch to xenconfig.
This also splits out the options which are available currently
to be built with x86 and 'make ARCH=arm64' under a shared config.
Technically xen supports a dom0 kernel and also a guest
kernel configuration but upon review with the xen team
since we don't have many dom0 options its best to just
combine these two into one.
A few generic notes: we enable both of these:
CONFIG_INET=y
CONFIG_BINFMT_ELF=y
although technically not required given you likely will
end up with a pretty useless system otherwise.
A few architectural differences worth noting:
$ make allnoconfig; make xenconfig > /dev/null ; \
grep XEN .config > 64-bit-config
$ make ARCH=i386 allnoconfig; make ARCH=i386 xenconfig > /dev/null; \
grep XEN .config > 32-bit-config
$ make ARCH=arm64 allnoconfig; make ARCH=arm64 xenconfig > /dev/null; \
grep XEN .config > arm64-config
Since the options are already split up with a generic config and
architecture specific configs you anything on the x86 configs
are known to only work right now on x86. For instance arm64 doesn't
support MEMORY_HOTPLUG yet as such although we try to enabe it
generically arm64 doesn't have it yet, so we leave the xen
specific kconfig option XEN_BALLOON_MEMORY_HOTPLUG on x86's config
file to set expecations correctly.
Then on x86 we have differences between i386 and x86-64. The difference
between 64-bit-config and 32-bit-config is you don't get XEN_MCE_LOG as
this is only supported on 64-bit. You also do not get on i386
XEN_BALLOON_MEMORY_HOTPLUG, there does not seem to be any technical
reasons to not allow this but I gave up after a few attempts.
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Borislav Petkov <bp@suse.de>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Marek <mmarek@suse.cz>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: penberg@kernel.org
Cc: levinsasha928@gmail.com
Cc: mtosatti@redhat.com
Cc: fengguang.wu@intel.com
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Ian Campbell <Ian.Campbell@citrix.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: xen-devel@lists.xenproject.org
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
Acked-by: Julien Grall <julien.grall@linaro.org>
Acked-by: Michal Marek <mmarek@suse.cz>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
lapic.timer_mode was not properly initialized after migration, which
broke few useful things, like login, by making every sleep eternal.
Fix this by calling apic_update_lvtt in kvm_apic_post_state_restore.
There are other slowpaths that update lvtt, so this patch makes sure
something similar doesn't happen again by calling apic_update_lvtt
after every modification.
Cc: stable@vger.kernel.org
Fixes: f30ebc312c ("KVM: x86: optimize some accesses to LVTT and SPIV")
Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
The '__init aesni_init()' function calls the '__exit crypto_fpu_exit()'
function directly. Since they are in different sections, this generates
a warning.
make CONFIG_DEBUG_SECTION_MISMATCH=y
...
WARNING: arch/x86/crypto/aesni-intel.o(.init.text+0x12b): Section
mismatch in reference from the function init_module() to the function
.exit.text:crypto_fpu_exit()
The function __init init_module() references
a function __exit crypto_fpu_exit().
This is often seen when error handling in the init function
uses functionality in the exit path.
The fix is often to remove the __exit annotation of
crypto_fpu_exit() so it may be used outside an exit section.
Fix the warning by removing the __exit annotation.
Signed-off-by: Jeremiah Mahler <jmmahler@gmail.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Pull perf fixes from Ingo Molnar:
"A regression fix for a crash, and a Intel HSW uncore PMU driver fix"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "perf/x86/intel/uncore: Move uncore_box_init() out of driver initialization"
perf/x86/intel/uncore: Fix CBOX bit wide and UBOX reg on Haswell-EP
Add a new interface irq_remapping_cap() to detect whether irq
remapping supports new features, such as VT-d Posted-Interrupts.
Export the function, so that KVM code can check this and use this
mechanism properly.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Joerg Roedel <joro@8bytes.org>
Cc: iommu@lists.linux-foundation.org
Cc: dwmw2@infradead.org
Link: http://lkml.kernel.org/r/1433827237-3382-10-git-send-email-feng.wu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Interrupt chip callback to set the VCPU affinity for posted interrupts.
[ tglx: Use the helper function to copy from the remap irte instead of
open coding it. Massage the comment as well ]
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: iommu@lists.linux-foundation.org
Cc: joro@8bytes.org
Cc: dwmw2@infradead.org
Link: http://lkml.kernel.org/r/1433827237-3382-5-git-send-email-feng.wu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Add a new member 'capability' to struct irq_remap_ops for storing
information about available capabilities such as VT-d
Posted-Interrupts.
Signed-off-by: Feng Wu <feng.wu@intel.com>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Joerg Roedel <joro@8bytes.org>
Cc: iommu@lists.linux-foundation.org
Cc: dwmw2@infradead.org
Link: http://lkml.kernel.org/r/1433827237-3382-2-git-send-email-feng.wu@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
I noticed that my MPX tracepoints were producing garbage for the
lower and upper bounds:
mpx_bounds_register_exception: address referenced: 0x00007fffffffccb7 bounds: lower: 0x0 ~upper: 0xffffffffffffffff
mpx_bounds_register_exception: address referenced: 0x00007fffffffccbf bounds: lower: 0x0 ~upper: 0xffffffffffffffff
This is, of course, bogus because 0x00007fffffffccbf is *within*
the bounds. I assumed that my instruction decoder was bad and
went looking at it. But I eventually realized that I was
getting a '0' offset back from xstate_offsets[BNDREGS].
It was being skipped in the initialization, which is obviously
bogus, so remove the extra leaf++.
This also goes an initializes xstate_offsets/sizes[] to -1 so
so that bugs like this will oops instead of silently failing
in interesting ways.
This was introduced by:
39f1acd ("x86/fpu/xstate: Don't assume the first zero xfeatures zero bit means the end")
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave@sr71.net
Link: http://lkml.kernel.org/r/20150611193400.2E0B00DB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the crash kernel is loaded above 4GiB in memory, the
first kernel allocates only 72MiB of low-memory for the DMA
requirements of the second kernel. On systems with many
devices this is not enough and causes device driver
initialization errors and failed crash dumps. Testing by
SUSE and Redhat has shown that 256MiB is a good default
value for now and the discussion has lead to this value as
well. So set this default value to 256MiB to make sure there
is enough memory available for DMA.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
[ Reflow comment. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jörg Rödel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/1433500202-25531-4-git-send-email-joro@8bytes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we boot a kdump kernel in high memory, there is by
default only 72MB of low memory available. The swiotlb code
takes 64MB of it (by default) so that there are only 8MB
left to allocate from. On systems with many devices this
causes page allocator warnings from
dma_generic_alloc_coherent():
systemd-udevd: page allocation failure: order:0, mode:0x280d4
CPU: 0 PID: 197 Comm: systemd-udevd Tainted: G W
3.12.28-4-default #1 Hardware name: HP ProLiant DL980 G7, BIOS
P66 07/30/2012 ffff8800781335e0 ffffffff8150b1db 00000000000280d4 ffffffff8113af90
0000000000000000 0000000000000000 ffff88007efdbb00 0000000100000000
0000000000000000 0000000000000000 0000000000000000 0000000000000001
Call Trace:
dump_trace+0x7d/0x2d0
show_stack_log_lvl+0x94/0x170
show_stack+0x21/0x50
dump_stack+0x41/0x51
warn_alloc_failed+0xf0/0x160
__alloc_pages_slowpath+0x72f/0x796
__alloc_pages_nodemask+0x1ea/0x210
dma_generic_alloc_coherent+0x96/0x140
x86_swiotlb_alloc_coherent+0x1c/0x50
ttm_dma_pool_alloc_new_pages+0xab/0x320 [ttm]
ttm_dma_populate+0x3ce/0x640 [ttm]
ttm_tt_bind+0x36/0x60 [ttm]
ttm_bo_handle_move_mem+0x55f/0x5c0 [ttm]
ttm_bo_move_buffer+0x105/0x130 [ttm]
ttm_bo_validate+0xc1/0x130 [ttm]
ttm_bo_init+0x24b/0x400 [ttm]
radeon_bo_create+0x16c/0x200 [radeon]
radeon_ring_init+0x11e/0x2b0 [radeon]
r100_cp_init+0x123/0x5b0 [radeon]
r100_startup+0x194/0x230 [radeon]
r100_init+0x223/0x410 [radeon]
radeon_device_init+0x6af/0x830 [radeon]
radeon_driver_load_kms+0x89/0x180 [radeon]
drm_get_pci_dev+0x121/0x2f0 [drm]
local_pci_probe+0x39/0x60
pci_device_probe+0xa9/0x120
driver_probe_device+0x9d/0x3d0
__driver_attach+0x8b/0x90
bus_for_each_dev+0x5b/0x90
bus_add_driver+0x1f8/0x2c0
driver_register+0x5b/0xe0
do_one_initcall+0xf2/0x1a0
load_module+0x1207/0x1c70
SYSC_finit_module+0x75/0xa0
system_call_fastpath+0x16/0x1b
0x7fac533d2788
After these warnings the code enters a fall-back path and
allocated directly from the swiotlb aperture in the end.
So remove these warnings as this is not a fatal error.
Signed-off-by: Joerg Roedel <jroedel@suse.de>
[ Simplify, reflow comment. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jörg Rödel <joro@8bytes.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Link: http://lkml.kernel.org/r/1433500202-25531-3-git-send-email-joro@8bytes.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix this compile issue with gcc-4.4.4:
arch/x86/kvm/mmu.c: In function 'kvm_mmu_pte_write':
arch/x86/kvm/mmu.c:4256: error: unknown field 'cr0_wp' specified in initializer
arch/x86/kvm/mmu.c:4257: error: unknown field 'cr4_pae' specified in initializer
arch/x86/kvm/mmu.c:4257: warning: excess elements in union initializer
...
gcc-4.4.4 (at least) has issues when using anonymous unions in
initializers.
Fixes: edc90b7dc4 ("KVM: MMU: fix SMAP virtualization")
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The error_entry/error_exit code to handle gsbase and whether we
return to user mdoe was a mess:
- error_sti was misnamed. In particular, it did not enable interrupts.
- Error handling for gs_change was hopelessly tangled the normal
usermode path. Separate it out. This saves a branch in normal
entries from kernel mode.
- The comments were bad.
Fix it up. As a nice side effect, there's now a code path that
happens on error entries from user mode. We'll use it soon.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f1be898ab93360169fb845ab85185948832209ee.1433878454.git.luto@kernel.org
[ Prettified it, clarified comments some more. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use three MOVs to swap edx and ecx. We can use one XCHG
instead.
Expand the comments. It's difficult to keep track which arg#
every register corresponds to, so spell it out.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433876051-26604-3-git-send-email-dvlasenk@redhat.com
[ Expanded the comments some more. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Here it is not obvious why we load pt_regs->cx to %esi etc.
Lets improve comments.
Explain that here we combine two things: first, we reload
registers since some of them are clobbered by the C function we
just called; and we also convert 32-bit syscall params to 64-bit
C ABI, because we are going to jump back to syscall dispatch
code.
Move reloading of 6th argument into the macro instead of having
it after each of two macro invocations.
No actual code changes here.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433876051-26604-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I put %ebp restoration code too late. Under strace, it is not
reached and %ebp is not restored upon return to userspace.
This is the fix. Run-tested.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433876051-26604-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The Foxconn K8M890-8237A has two PCI host bridges, and we can't assign
resources correctly without the information from _CRS that tells us which
address ranges are claimed by which bridge. In the bugs mentioned below,
we incorrectly assign a sound card address (this example is from 1033299):
bus: 00 index 2 [mem 0x80000000-0xfcffffffff]
ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-7f])
pci_root PNP0A08:00: host bridge window [mem 0x80000000-0xbfefffff] (ignored)
pci_root PNP0A08:00: host bridge window [mem 0xc0000000-0xdfffffff] (ignored)
pci_root PNP0A08:00: host bridge window [mem 0xf0000000-0xfebfffff] (ignored)
ACPI: PCI Root Bridge [PCI1] (domain 0000 [bus 80-ff])
pci_root PNP0A08:01: host bridge window [mem 0xbff00000-0xbfffffff] (ignored)
pci 0000:80:01.0: [1106:3288] type 0 class 0x000403
pci 0000:80:01.0: reg 10: [mem 0xbfffc000-0xbfffffff 64bit]
pci 0000:80:01.0: address space collision: [mem 0xbfffc000-0xbfffffff 64bit] conflicts with PCI Bus #00 [mem 0x80000000-0xfcffffffff]
pci 0000:80:01.0: BAR 0: assigned [mem 0xfd00000000-0xfd00003fff 64bit]
BUG: unable to handle kernel paging request at ffffc90000378000
IP: [<ffffffffa0345f63>] azx_create+0x37c/0x822 [snd_hda_intel]
We assigned 0xfd_0000_0000, but that is not in any of the host bridge
windows, and the sound card doesn't work.
Turn on pci=use_crs automatically for this system.
Link: https://bugs.launchpad.net/ubuntu/+source/alsa-driver/+bug/931368
Link: https://bugs.launchpad.net/ubuntu/+source/alsa-driver/+bug/1033299
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CC: stable@vger.kernel.org
Now that the bugs in mixed mode MPX handling are fixed, re-allow
32-bit binaries on 64-bit kernels again.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183706.70277DAD@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment pretty much says it all.
I wrote a test program that does lots of random allocations
and forces bounds tables to be created. It came up with a
layout like this:
.... | BOUNDS DIRECTORY ENTRY COVERS | ....
| BOUNDS TABLE COVERS |
| BOUNDS TABLE | REAL ALLOC | BOUNDS TABLE |
Unmapping "REAL ALLOC" should have been able to free the
bounds table "covering" the "REAL ALLOC" because it was the
last real user. But, the neighboring VMA bounds tables were
found, considered as real neighbors, and we declined to free
the bounds table covering the area.
Doing this over and over left a small but significant number
of these orphans. Handling them is fairly straighforward.
All we have to do is walk the VMAs and skip all of the MPX
ones when looking for neighbors.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183706.A6BD90BF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The MPX code needs to clear out bounds tables for memory which
is no longer in use. We do this when a userspace mapping is
torn down (unmapped).
There are two modes:
1. An entire bounds table becomes unused, and can be freed
and its pointer removed from the bounds directory. This
happens either when a large mapping is torn down, or when
a small mapping is torn down and it is the last mapping
"covered" by a bounds table.
2. Only part of a bounds table becomes unused, in which case
we free the backing memory as if MADV_DONTNEED was called.
The old code was a spaghetti mess of "edge" bounds tables
where the edges were handled specially, even if we were
unmapping an entire one. Non-edge bounds tables are always
fully unmapped, but share a different code path from the edge
ones. The old code had a bug where it was unmapping too much
memory. I worked on fixing it for two days and gave up.
I didn't write the original code. I didn't particularly like
it, but it worked, so I left it. After my debug session, I
realized it was undebuggagle *and* buggy, so out it went.
I also wrote a new unmapping test program which uncovers bugs
pretty nicely.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183706.DCAEC67D@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Right now, the kernel can only switch between 64-bit and 32-bit
binaries at compile time. This patch adds support for 32-bit
binaries on 64-bit kernels when we support ia32 emulation.
We essentially choose which set of table sizes to use when doing
arithmetic for the bounds table calculations.
This also uses a different approach for calculating the table
indexes than before. I think the new one makes it much more
clear what is going on, and allows us to share more code between
the 32-bit and 64-bit cases.
Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183705.E01F21E2@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
user_atomic_cmpxchg_inatomic() actually looks at sizeof(*ptr) to
figure out how many bytes to copy. If we run it on a 64-bit
kernel with a 64-bit pointer, it will copy a 64-bit bounds
directory entry. That's fine, except when we have 32-bit
programs with 32-bit bounds directory entries and we only *want*
32-bits.
This patch breaks the cmpxchg() operation out in to its own
function and performs the 32-bit type swizzling in there.
Note, the "64-bit" version of this code _would_ work on a
32-bit-only kernel. The issue this patch addresses is only for
when the kernel's 'long' is mismatched from the size of the
bounds directory entry of the process we are working on.
The new helper modifies 'actual_old_val' or returns an error.
But gcc doesn't know this, so it warns about 'actual_old_val'
being unused. Shut it up with an uninitialized_var().
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183705.672B115E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, to get from a bounds directory entry to the virtual
address of a bounds table, we simply mask off a few low bits.
However, the set of bits we mask off is different for 32-bit and
64-bit binaries.
This breaks the operation out in to a helper function and also
adds a temporary variable to store the result until we are
sure we are returning one.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183704.007686CE@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When we allocate a bounds table, we call mmap(), then add a
"valid" bit to the value before storing it in to the bounds
directory.
If we fail along the way, we go and mask that valid bit
_back_ out. That seems a little silly, and this makes it
much more clear when we have a plain address versus an
actual table _entry_.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183704.3D69D5F4@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The uprobes code has a nice helper, is_64bit_mm(), that consults
both the runtime and compile-time flags for 32-bit support.
Instead of reinventing the wheel, pull it in to an x86 header so
we can use it for MPX.
I prefer passing the 'mm' around to test_thread_flag(TIF_IA32)
because it makes it explicit where the context is coming from.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183704.F0209999@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bounds tables are a significant consumer of memory. It is
important to know when they are being allocated. Add a trace
point to trace whenever an allocation occurs and also its
virtual address.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183704.EC23A93E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two different events being traced here. They are
doing similar things so share a trace "EVENT_CLASS" and are
presented together.
1. Trace when MPX is zapping pages "mpx_unmap_zap":
When MPX can not free an entire bounds table, it will
instead try to zap unused parts of a bounds table to free
the backing memory. This decreases RSS (resident set
size) without decreasing the virtual space allocated
for bounds tables.
2. Trace attempts to find bounds tables "mpx_unmap_search":
This event traces any time we go looking to unmap a
bounds table for a given virtual address range. This is
useful to ensure that the kernel actually "tried" to free
a bounds table versus times it succeeded in finding one.
It might try and fail if it realized that a table was
shared with an adjacent VMA which is not being unmapped.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183703.B9D2468B@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two basic things that can happen as the result of
a bounds exception (#BR):
1. We allocate a new bounds table
2. We pass up a bounds exception to userspace.
This patch adds a trace point for the case where we are
passing the exception up to userspace with a signal.
We are also explicit that we're printing out the inverse of
the 'upper' that we encounter. If you want to filter, for
instance, you need to ~ the value first. The reason we do
this is because of how 'upper' is stored in the bounds table.
If a pointer's range is:
0x1000 -> 0x2000
it is stored in the bounds table as (32-bits here for brevity):
lower: 0x00001000
upper: 0xffffdfff
That is so that an all 0's entry:
lower: 0x00000000
upper: 0x00000000
corresponds to the "init" bounds which store a *range* of:
0x00000000 -> 0xffffffff
That is, by far, the common case, and that lets us use the
zero page, or deduplicate the memory, etc... The 'upper'
stored in the table is gibberish to print by itself, so we
print ~upper to get the *actual*, logical, human-readable
value printed out.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183703.027BB9B0@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is the first in a series of MPX tracing patches.
I've found these extremely useful in the process of
debugging applications and the kernel code itself.
This exception hooks in to the bounds (#BR) exception
very early and allows capturing the key registers which
would influence how the exception is handled.
Note that bndcfgu/bndstatus are technically still
64-bit registers even in 32-bit mode.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183703.5FE2619A@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
MPX has the _potential_ to cause some issues. Say part of your
init system tried to protect one of its components from buffer
overflows with MPX. If there were a false positive, it's
possible that MPX could keep a system from booting.
MPX could also potentially cause performance issues since it is
present in hot paths like the unmap path.
Allow it to be disabled at boot time.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150607183702.2E8B77AB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment and code here are confusing. We do not currently
allocate the bounds directory in the kernel.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183702.222CEC2A@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
MPX_BNDCFG_ADDR_MASK is defined two times, so this patch removes
redundant one.
Signed-off-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183702.5F129376@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The MPX code can only work on the current task. You can not,
for instance, enable MPX management in another process or
thread. You can also not handle a fault for another process or
thread.
Despite this, we pass a task_struct around prolifically. This
patch removes all of the task struct passing for code paths
where the code can not deal with another task (which turns out
to be all of them).
This has no functional changes. It's just a cleanup.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/20150607183702.6A81DA2C@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The MPX registers (bndcsr/bndcfgu/bndstatus) are not directly
accessible via normal instructions. They essentially act as
if they were floating point registers and are saved/restored
along with those registers.
There are two main paths in the MPX code where we care about
the contents of these registers:
1. #BR (bounds) faults
2. the prctl() code where we are setting MPX up
Both of those paths _might_ be called without the FPU having
been used. That means that 'tsk->thread.fpu.state' might
never be allocated.
Also, fpu_save_init() is not preempt-safe. It was a bug to
call it without disabling preemption. The new
get_xsave_addr() calls unlazy_fpu() instead and properly
disables preemption.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave@sr71.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/20150607183701.BC0D37CF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The MPX code appears is calling a low-level FPU function
(copy_fpregs_to_fpstate()). This function is not able to
be called in all contexts, although it is safe to call
directly in some cases.
Although probably correct, the current code is ugly and
potentially error-prone. So, add a wrapper that calls
the (slightly) higher-level fpu__save() (which is preempt-
safe) and also ensures that we even *have* an FPU context
(in the case that this was called when in lazy FPU mode).
Ingo had this to say about the details about when we need
preemption disabled:
> it's indeed generally unsafe to access/copy FPU registers with preemption enabled,
> for two reasons:
>
> - on older systems that use FSAVE the instruction destroys FPU register
> contents, which has to be handled carefully
>
> - even on newer systems if we copy to FPU registers (which this code doesn't)
> then we don't want a context switch to occur in the middle of it, because a
> context switch will write to the fpstate, potentially overwriting our new data
> with old FPU state.
>
> But it's safe to access FPU registers with preemption enabled in a couple of
> special cases:
>
> - potentially destructively saving FPU registers: the signal handling code does
> this in copy_fpstate_to_sigframe(), because it can rely on the signal restore
> side to restore the original FPU state.
>
> - reading FPU registers on modern systems: we don't do this anywhere at the
> moment, mostly to keep symmetry with older systems where FSAVE is
> destructive.
>
> - initializing FPU registers on modern systems: fpu__clear() does this. Here
> it's safe because we don't copy from the fpstate.
>
> - directly writing FPU registers from user-space memory (!). We do this in
> fpu__restore_sig(), and it's safe because neither context switches nor
> irq-handler FPU use can corrupt the source context of the copy (which is
> user-space memory).
>
> Note that the MPX code's current use of copy_fpregs_to_fpstate() was safe I think,
> because:
>
> - MPX is predicated on eagerfpu, so the destructive F[N]SAVE instruction won't be
> used.
>
> - the code was only reading FPU registers, and was doing it only in places that
> guaranteed that an FPU state was already active (i.e. didn't do it in
> kthreads)
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave@sr71.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: bp@alien8.de
Link: http://lkml.kernel.org/r/20150607183700.AA881696@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
get_xsave_addr() assumes that if an xsave bit is present in the
hardware (pcntxt_mask) that it is present in a given xsave
buffer. Due to an bug in the xsave code on all of the systems
that have MPX (and thus all the users of this code), that has
been a true assumption.
But, the bug is getting fixed, so our assumption is not going
to hold any more.
It's quite possible (and normal) for an enabled state to be
present on 'pcntxt_mask', but *not* in 'xstate_bv'. We need
to consult 'xstate_bv'.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dave Hansen <dave@sr71.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150607183700.1E739B34@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A few bits were missed.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Brian Gerst noticed that I did a weird rename in the following commit:
b2502b418e ("x86/asm/entry: Untangle 'system_call' into two entry points: entry_SYSCALL_64 and entry_INT80_32")
which renamed __NR_ia32_syscall_max to __NR_entry_INT80_compat_max.
Now the original name was a misnomer, but the new one is a misnomer as well,
as all the 32-bit compat syscall entry points (sysenter, syscall) share the
system call table, not just the INT80 based one.
Rename it to __NR_syscall_compat_max.
Reported-by: Brian Gerst <brgerst@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I broke this recently when I changed pt_regs->r8..r11 clearing
logic in INT 80 code path.
There is a branch from SYSENTER/SYSCALL code to INT 80 code:
if we fail to retrieve arg6, we return EFAULT. Before this
patch, in this case we don't clear pt_regs->r8..r11.
This patch fixes this. The resulting code is smaller and
simpler.
While at it, remove incorrect comment about syscall dispatching
CALL insn: it does not use RIP-relative addressing form (the
comment was meant to be "TODO: make this rip-relative", and
morphed since then, dropping "TODO").
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433701470-28800-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make the 64-bit syscall entry code a bit more readable:
- use consistent assembly coding style similar to the other entry_*.S files
- remove old comments that are not true anymore
- eliminate whitespace noise
- use consistent vertical spacing
- fix various comments
- reorganize entry point generation tables to be more readable
No code changed:
# arch/x86/entry/entry_64.o:
text data bss dec hex filename
12282 0 0 12282 2ffa entry_64.o.before
12282 0 0 12282 2ffa entry_64.o.after
md5:
cbab1f2d727a2a8a87618eeb79f391b7 entry_64.o.before.asm
cbab1f2d727a2a8a87618eeb79f391b7 entry_64.o.after.asm
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This fixes up a merge issue with the amba-pl011.c driver, and we want
the fixes in this branch as well.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
pci_dma_burst_advice() was added by e24c2d963a ("[PATCH] PCI: DMA
bursting advice") but apparently never used. Remove it.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Michal Simek <monstr@monstr.eu> # microblaze
CC: David S. Miller <davem@davemloft.net>
In include/linux/pci.h, we already #include <asm/pci.h>, so we don't need
to include <asm/pci.h> directly.
Remove the unnecessary includes. All the files here already include
<linux/pci.h>.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au> # sh
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Make the 32-bit syscall entry code a bit more readable:
- use consistent assembly coding style similar to entry_64.S
- remove old comments that are not true anymore
- eliminate whitespace noise
- use consistent vertical spacing
- fix various comments
No code changed:
# arch/x86/entry/entry_32.o:
text data bss dec hex filename
6025 0 0 6025 1789 entry_32.o.before
6025 0 0 6025 1789 entry_32.o.after
md5:
f3fa16b2b0dca804f052deb6b30ba6cb entry_32.o.before.asm
f3fa16b2b0dca804f052deb6b30ba6cb entry_32.o.after.asm
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'system_call' entry points differ starkly between native 32-bit and 64-bit
kernels: on 32-bit kernels it defines the INT 0x80 entry point, while on
64-bit it's the SYSCALL entry point.
This is pretty confusing when looking at generic code, and it also obscures
the nature of the entry point at the assembly level.
So unangle this by splitting the name into its two uses:
system_call (32) -> entry_INT80_32
system_call (64) -> entry_SYSCALL_64
As per the generic naming scheme for x86 system call entry points:
entry_MNEMONIC_qualifier
where 'qualifier' is one of _32, _64 or _compat.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the SYSENTER instruction is pretty quirky and it has different behavior
depending on bitness and CPU maker.
Yet we create a false sense of coherency by naming it 'ia32_sysenter_target'
in both of the cases.
Split the name into its two uses:
ia32_sysenter_target (32) -> entry_SYSENTER_32
ia32_sysenter_target (64) -> entry_SYSENTER_compat
As per the generic naming scheme for x86 system call entry points:
entry_MNEMONIC_qualifier
where 'qualifier' is one of _32, _64 or _compat.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename the following system call entry points:
ia32_cstar_target -> entry_SYSCALL_compat
ia32_syscall -> entry_INT80_compat
The generic naming scheme for x86 system call entry points is:
entry_MNEMONIC_qualifier
where 'qualifier' is one of _32, _64 or _compat.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PEBSv3 as present on Skylake fixed the long standing issue of the
status bits. They now really reflect the events that generated the
record.
Tested-by: Andi Kleen <ak@linux.intel.com>
Tested-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After enlarging the PEBS interrupt threshold, there may be some mixed up
PEBS samples which are discarded by the kernel.
This patch makes the kernel emit a PERF_RECORD_LOST_SAMPLES record with
the number of possible discarded records when it is impossible to demux
the samples.
It makes sure the user is not left in the dark about such discards.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1431285195-14269-8-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently the PEBS buffer size is 4k, it can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as the BTS buffer).
64k memory can hold about 330 PEBS records. This will significantly
reduce the number of PMIs when batched PEBS interrupts are enabled.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-7-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Flush the PEBS buffer during context switches if PEBS interrupt threshold
is larger than one. This allows perf to supply TID for sample outputs.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-6-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.
For frequently occurring events (like cycles or branches or load/store)
this in term requires using a relatively high sampling period to avoid
overloading the system, by only processing PMIs. This in term increases
sampling error.
For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it can not
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.
So we can make PEBS work for some specific cases, basically as long as
you can do without a callgraph and can set the period you can use this
new PEBS mode.
The main benefit is the ability to support much lower sampling period
(down to -c 1000) without extensive overhead.
One use cases is for example to increase the resolution of the c2c tool.
Another is double checking when you suspect the standard sampling has
too much sampling error.
Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).
The test command for plain:
"perf record --time -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
"perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
( The only difference of test command between multi and plain is time
stamp options. Since time stamp is not supported by large PEBS
threshold, it can be used as a flag to indicate if large threshold is
enabled during the test. )
period plain(Sec) multi(Sec) Delta
10003 32.7 16.5 16.2
20003 30.2 16.2 14.0
40003 18.6 14.1 4.5
80003 16.8 14.6 2.2
100003 16.9 14.1 2.8
800003 15.4 15.7 -0.3
1000003 15.3 15.2 0.2
2000003 15.3 15.1 0.1
With periods below 100003, plain (threshold one) cause much more
overhead. With 10003 sampling period, the Elapsed Time for multi is
even 2X faster than plain.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-5-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.
Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible
scenarios to demux.
The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:
A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist
< some unspecified amount of time later -- this could cover multiple
events of interest >
B) the hardware assist is armed, any next event will trigger it
C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS
Now consider the following chain of events:
A0, B0, A1, C0
The event generated for counter 0 will include a status with counter 1
set, even though its not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.
The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close'. The
'close' can be several cycles. In some cases even the complete assist,
if the event is something that doesn't need retirement.
For instance, consider this chain of events:
A0, B0, A1, B1, C01
Where C01 is an event that triggers both hardware assists, we will
generate but a single record, but again with both counters listed in the
status field.
This time the record pertains to both events.
Note that these two cases are different but undistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.
Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.
The assumption/hope is that such discards will be rare.
Here lists some possible ways you may get high discard rate.
- when you count the same thing multiple times. But it is not a useful
configuration.
- you can be unfortunate if you measure with a userspace only PEBS
event along with either a kernel or unrestricted PEBS event. Imagine
the event triggering and setting the overflow flag right before
entering the kernel. Then all kernel side events will end up with
multiple bits set.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
[ Changelog improvements. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move code that sets up the PEBS sample data to a separate function.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-3-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
However, the reset value will be loaded by hardware assist. There is a
small delay compared to the previous non-auto-reload mechanism. The
delay time is arbitrary, but very small. The assist cost is 400-800
cycles, assuming common cases with everything cached. The minimum period
the patch currently uses is 10000. In that extreme case it can be ~10%
if cycles are used.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-2-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables support for branch sampling filter
for indirect jumps (IND_JUMP). It enables LBR IND_JMP
filtering where available. There is also software filtering
support.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@redhat.com
Cc: dsahern@gmail.com
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/1431637800-31061-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
preempt_schedule_context() is a tracing safe preemption point but it's
only used when CONFIG_CONTEXT_TRACKING=y. Other configs have tracing
recursion issues since commit:
b30f0e3ffe ("sched/preempt: Optimize preemption operations on __schedule() callers")
introduced function based preemp_count_*() ops.
Lets make it available on all configs and give it a more appropriate
name for its new position.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433432349-1021-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CBOX counters are increased to 48b on HSX.
Correct the MSR address for HSWEP_U_MSR_PMON_CTR0 and
HSWEP_U_MSR_PMON_CTL0.
See specification in:
http://www.intel.com/content/www/us/en/processors/xeon/
xeon-e5-v3-uncore-performance-monitoring.html
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1432645835-7918-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change the type of variables and function prototypes to be in
alignment with what the x86_*() / __x86_*() family/model
functions return.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433436928-31903-21-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy Shevchenko reported machine freezes when booting latest tip
on 32-bit setups. Problem is, the builtin microcode handling cannot
really work that early, when we haven't even enabled paging.
A proper fix would involve handling that case specially as every
other early 32-bit boot case in the microcode loader and would
require much more involved changes for which it is too late now,
more than a week before the upcoming merge window.
So, disable the builtin microcode loading on 32-bit for now.
Reported-and-tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433436928-31903-20-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This header containing all MSRs and respective bit definitions
got exported to userspace in conjunction with the big UAPI
shuffle.
But, it doesn't belong in the UAPI headers because userspace can
do its own MSR defines and exporting them from the kernel blocks
us from doing cleanups/renames in that header. Which is
ridiculous - it is not kernel's job to export such a header and
keep MSRs list and their names stable.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433436928-31903-19-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In talking to Aravind recently about making certain AMD topology
attributes available to the MCE injection module, it seemed like
that CONFIG_X86_HT thing is more or less superfluous. It is
def_bool y, depends on SMP and gets enabled in the majority of
.configs - distro and otherwise - out there.
So let's kill it and make code behind it depend directly on SMP.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Walter <dwalter@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433436928-31903-18-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the necessary changes to do_machine_check() to be able to
process MCEs signaled as local MCEs. Typically, only recoverable
errors (SRAR type) will be Signaled as LMCE. The architecture
does not restrict to only those errors, however.
When errors are signaled as LMCE, there is no need for the MCE
handler to perform rendezvous with other logical processors
unlike earlier processors that would broadcast machine check
errors.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1433436928-31903-17-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Initialize and prepare for handling LMCEs. Add a boot-time
option to disable LMCEs.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Simplify stuff, align statements for better readability, reflow comments; kill
unused lmce_clear(); save us an MSR write if LMCE is already enabled. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1433436928-31903-16-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add required definitions to support Local Machine Check
Exceptions.
Historically, machine check exceptions on Intel x86 processors
have been broadcast to all logical processors in the system.
Upcoming CPUs will support an opt-in mechanism to request some
machine check exceptions be delivered to a single logical
processor experiencing the fault.
See http://www.intel.com/sdm Volume 3, System Programming Guide,
chapter 15 for more information on MSRs and documentation on
Local MCE.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1433436928-31903-15-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that reserve_ram_pages_type() accepts the WT type, add
set_memory_wt(), set_memory_array_wt() and set_pages_array_wt()
in order to be able to set memory to Write-Through page cache
mode.
Also, extend ioremap_change_attr() to accept the WT type.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-13-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As set_memory_wb() calls free_ram_pages_type(), which then calls
set_page_memtype() with -1, _PGMT_DEFAULT is used for tracking
the WB type. _PGMT_WB is defined but unused. Thus, rename
_PGMT_DEFAULT to _PGMT_WB to clarify the usage, and release the
slot used by _PGMT_WB.
Furthermore, change free_ram_pages_type() to call
set_page_memtype() with _PGMT_WB, and get_page_memtype() to
return _PAGE_CACHE_MODE_WB for _PGMT_WB.
Then, define _PGMT_WT in the freed slot. This allows
set_page_memtype() to track the WT type.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-12-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add ioremap_wt() for creating Write-Through mappings on x86. It
follows the same model as ioremap_wc() for multi-arch support.
Define ARCH_HAS_IOREMAP_WT in the x86 version of io.h to
indicate that ioremap_wt() is implemented on x86.
Also update the PAT documentation file to cover ioremap_wt().
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-8-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__ioremap_caller() calls reserve_memtype() and the passed down
@new_pcm contains the actual page cache type it reserved in the
success case.
is_new_memtype_allowed() verifies if converting to the new page
cache type is allowed when @pcm (the requested type) is
different from @new_pcm.
When WT is requested, the caller expects that writes are ordered
and uncached. Therefore, enhance is_new_memtype_allowed() to
disallow the following cases:
- If the request is WT, mapping type cannot be WB
- If the request is WT, mapping type cannot be WC
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a target range is in RAM, reserve_ram_pages_type() verifies
the requested type. Change it to fail WT and WP requests with
-EINVAL since set_page_memtype() is limited to handle three
types: WB, WC and UC-.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Assign Write-Through type to the PA7 slot in the PAT MSR when
the processor is not affected by PAT errata. The PA7 slot is
chosen to improve robustness in the presence of errata that
might cause the high PAT bit to be ignored. This way a buggy PA7
slot access will hit the PA3 slot, which is UC, so at worst we
lose performance without causing a correctness issue.
The following Intel processors are affected by the PAT errata.
Errata CPUID
----------------------------------------------------
Pentium 2, A52 family 0x6, model 0x5
Pentium 3, E27 family 0x6, model 0x7, 0x8
Pentium 3 Xenon, G26 family 0x6, model 0x7, 0x8, 0xa
Pentium M, Y26 family 0x6, model 0x9
Pentium M 90nm, X9 family 0x6, model 0xd
Pentium 4, N46 family 0xf, model 0x0
Instead of making sharp boundary checks, we remain conservative
and exclude all Pentium 2, 3, M and 4 family processors. For
those, _PAGE_CACHE_MODE_WT is redirected to UC- per the default
setup in __cachemode2pte_tbl[].
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: https://lkml.kernel.org/r/1433187393-22688-2-git-send-email-toshi.kani@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we emulate a PAT table when PAT is disabled, there's no
need for those checks anymore as the PAT abstraction will handle
those cases too.
Based on a conglomerate patch from Toshi Kani.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: jgross@suse.com
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the case when PAT is disabled on the command line with
"nopat" or when virtualization doesn't support PAT (correctly) -
see
9d34cfdf47 ("x86: Don't rely on VMWare emulating PAT MSR correctly").
we emulate it using the PWT and PCD cache attribute bits. Get
rid of boot_pat_state while at it.
Based on a conglomerate patch from Toshi Kani.
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Toshi Kani <toshi.kani@hp.com>
Acked-by: Juergen Gross <jgross@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: arnd@arndb.de
Cc: hch@lst.de
Cc: hmh@hmh.eng.br
Cc: konrad.wilk@oracle.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: linux-nvdimm@lists.01.org
Cc: stefan.bader@canonical.com
Cc: yigal@plexistor.com
Link: http://lkml.kernel.org/r/1433436928-31903-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So we now have the following system entry code related
files, which define the following system call instruction
and other entry paths:
entry_32.S # 32-bit binaries on 32-bit kernels
entry_64.S # 64-bit binaries on 64-bit kernels
entry_64_compat.S # 32-bit binaries on 64-bit kernels
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- early_idt_handlers[] fix that fixes the build with bleeding edge
tooling
- build warning fix on GCC 5.1
- vm86 fix plus self-test to make it harder to break it again"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers
x86/asm/entry/32, selftests: Add a selftest for kernel entries from VM86 mode
x86/boot: Add CONFIG_PARAVIRT_SPINLOCKS quirk to arch/x86/boot/compressed/misc.h
x86/asm/entry/32: Really make user_mode() work correctly for VM86 mode
Pull perf fixes from Ingo Molnar:
"The biggest chunk of the changes are two regression fixes: a HT
workaround fix and an event-group scheduling fix. It's been verified
with 5 days of fuzzer testing.
Other fixes:
- eBPF fix
- a BIOS breakage detection fix
- PMU driver fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix a refactoring bug
perf/x86: Tweak broken BIOS rules during check_hw_exists()
perf/x86/intel/pt: Untangle pt_buffer_reset_markers()
perf: Disallow sparse AUX allocations for non-SG PMUs in overwrite mode
perf/x86: Improve HT workaround GP counter constraint
perf/x86: Fix event/group validation
perf: Fix race in BPF program unregister
Follow up to commit e194bbdf36.
Suggested-by: Bandan Das <bsd@redhat.com>
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
... and we're done. :)
Because SMBASE is usually relocated above 1M on modern chipsets, and
SMM handlers might indeed rely on 4G segment limits, we only expose it
if KVM is able to run the guest in big real mode. This includes any
of VMX+emulate_invalid_guest_state, VMX+unrestricted_guest, or SVM.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is now very simple to do. The only interesting part is a simple
trick to find the right memslot in gfn_to_rmap, retrieving the address
space from the spte role word. The same trick is used in the auditing
code.
The comment on top of union kvm_mmu_page_role has been stale forever,
so remove it. Speaking of stale code, remove pad_for_nice_hex_output
too: it was splitting the "access" bitfield across two bytes and thus
had effectively turned into pad_for_ugly_hex_output.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch has no semantic change, but it prepares for the introduction
of a second address space for system management mode.
A new function x86_set_memory_region (and the "slots_lock taken"
counterpart __x86_set_memory_region) is introduced in order to
operate on all address spaces when adding or deleting private
memory slots.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We need to hide SMRAM from guests not running in SMM. Therefore,
all uses of kvm_read_guest* and kvm_write_guest* must be changed to
check whether the VCPU is in system management mode and use a
different set of memslots. Switch from kvm_* to the newly-introduced
kvm_vcpu_*, which call into kvm_arch_vcpu_memslots_id.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This is always available (with one exception in the auditing code),
and with the same auditing exception the level was coming from
sp->role.level.
Later, the spte's role will also be used to look up the right memslots
array.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Deobfuscate the 'found' logic, it can be replaced with a simple:
if (!pa_data)
return;
and 'found' can be eliminated.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1433398729-8314-1-git-send-email-weiyang@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Really swap arguments #4 and #5 in stub32_clone instead of
"optimizing" it into a move.
Yes, tls_val is currently unused. Yes, on some CPUs XCHG is a
little bit more expensive than MOV. But a cycle or two on an
expensive syscall like clone() is way below noise floor, and
this optimization is simply not worth the obfuscation of logic.
[ There's also ongoing work on the clone() ABI by Josh Triplett
that will depend on this change later on. ]
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433339930-20880-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The reason for copying of %r8 to %rcx is quite non-obvious.
Add a comment which explains why it is done.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433339930-20880-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make the 64-bit compat 32-bit syscall entry code a bit more readable:
- eliminate whitespace noise
- use consistent vertical spacing
- use consistent assembly coding style similar to entry_64.S
- fix various comments
No code changed:
arch/x86/entry/ia32entry.o:
text data bss dec hex filename
1391 0 0 1391 56f ia32entry.o.before
1391 0 0 1391 56f ia32entry.o.after
md5:
f28501dcc366e68b557313942c6496d6 ia32entry.o.before.asm
f28501dcc366e68b557313942c6496d6 ia32entry.o.after.asm
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SYSENTER and SYSCALL 32-bit entry points differ in handling of
arg2 and arg6.
SYSENTER:
* ecx arg2
* ebp user stack
* 0(%ebp) arg6
SYSCALL:
* ebp arg2
* esp user stack
* 0(%esp) arg6
Sysenter code loads 0(%ebp) to %ebp right away.
(This destroys %ebp. It means we do not preserve it on return.
It's not causing problems since userspace VDSO code does not
depend on it, and SYSENTER insn can't be sanely used outside of
VDSO).
Syscall code loads 0(%ebp) to %r9. This allows to eliminate one
MOV insn (r9 is a register where arg6 should be for 64-bit ABI),
but on audit/ptrace code paths this requires juggling of r9 and
ebp: (1) ptrace expects arg6 to be in pt_regs->bp;
(2) r9 is callee-clobbered register and needs to be
saved/restored around calls to C functions.
This patch changes syscall code to load 0(%ebp) to %ebp, making
it more similar to sysenter code. It's a bit smaller:
text data bss dec hex filename
1407 0 0 1407 57f ia32entry.o.before
1391 0 0 1391 56f ia32entry.o
To preserve ABI compat, we restore ebp on exit.
Run-tested.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433336169-18964-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This macro is small, has only three callsites, and one of them
is slightly different using a conditional parameter.
A few saved lines aren't worth the resulting obfuscation.
Generated machine code is identical.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433271842-9139-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This macro is small, has only four callsites, and one of them is
slightly different using a conditional parameter.
A few saved lines aren't worth the resulting obfuscation.
Generated machine code is identical.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
[ Added comments. ]
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433271842-9139-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
32-bit syscall entry points do not save the complete pt_regs struct,
they leave some fields uninitialized. However, they must be
careful to not leak uninitialized data in pt_regs->r8..r11 to
ptrace users.
CLEAR_RREGS macro is used to zero these fields out when needed.
However, in the int80 code path this zeroing is unconditional.
This patch simplifies it by storing zeroes there right away,
when pt_regs is constructed on stack.
This uses shorter instructions:
text data bss dec hex filename
1423 0 0 1423 58f ia32entry.o.before
1407 0 0 1407 57f ia32entry.o
Compile-tested.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1433266510-2938-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
INTERRUPT_RETURN turns into a jmp instruction. There's no need
for extra indirection.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: <linux-kernel@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2f2318653dbad284a59311f13f08cea71298fd7c.1433449436.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The wrmsrl_safe macro performs invalid shifts if the value
argument is 32 bits. This makes it unnecessarily awkward to
write code that puts an unsigned long into an MSR.
Convert it to a real inline function.
For inspiration, see:
7c74d5b7b7 ("x86/asm/entry/64: Fix MSR_IA32_SYSENTER_CS MSR value").
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: <linux-kernel@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Applied small improvements. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The big ugly one. This patch adds support for switching in and out of
system management mode, respectively upon receiving KVM_REQ_SMI and upon
executing a RSM instruction. Both 32- and 64-bit formats are supported
for the SMM state save area.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Commit 066450be41 ("perf/x86/intel/pt: Clean up the control flow
in pt_pmu_hw_init()") changed attribute initialization so that
only the first attribute gets initialized using
sysfs_attr_init(), which upsets lockdep.
This patch fixes the glitch so that all allocated attributes are
properly initialized thus fixing the lockdep warning reported by
Tvrtko and Imre.
Reported-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Reported-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: <linux-kernel@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do not process INITs immediately while in system management mode, keep
it instead in apic->pending_events. Tell userspace if an INIT is
pending when they issue GET_VCPU_EVENTS, and similarly handle the
new field in SET_VCPU_EVENTS.
Note that the same treatment should be done while in VMX non-root mode.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds the interface between x86.c and the emulator: the
SMBASE register, a new emulator flag, the RSM instruction. It also
adds a new request bit that will be used by the KVM_SMI ioctl.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch includes changes to the external API for SMM support.
Userspace can predicate the availability of the new fields and
ioctls on a new capability, KVM_CAP_X86_SMM, which is added at the end
of the patch series.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The hflags field will contain information about system management mode
and will be useful for the emulator. Pass the entire field rather than
just the guest-mode information.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SMBASE is only readable from SMM for the VCPU, but it must be always
accessible if userspace is accessing it. Thus, all functions that
read MSRs are changed to accept a struct msr_data; the host_initiated
and index fields are pre-initialized, while the data field is filled
on return.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
We will want to filter away MSR_IA32_SMBASE from the emulated_msrs if
the host CPU does not support SMM virtualization. Introduce the
logic to do that, and also move paravirt MSRs to emulated_msrs for
simplicity and to get rid of KVM_SAVE_MSRS_BEGIN.
Reviewed-by: Radim Krčmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Malicious (or egregiously buggy) userspace can trigger it, but it
should never happen in normal operation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
VFIO has proved itself a much better option than KVM's built-in
device assignment. It is mature, provides better isolation because
it enforces ACS, and even the userspace code is being tested on
a wider variety of hardware these days than the legacy support.
Disable legacy device assignment by default.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The vsyscall code is entry code too, so move it to arch/x86/entry/vsyscall/.
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The build time generated syscall definitions are entry code related, move
them into the arch/x86/entry/ directory.
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
asm/calling.h is private to the entry code, make this more apparent
by moving it to the new arch/x86/entry/ directory.
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These are all calling x86 entry code functions, so move them close
to other entry code.
Change lib-y to obj-y: there's no real difference between the two
as we don't really drop any of them during the linking stage, and
obj-y is the more common approach for core kernel object code.
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVa7zvAAoJEHm+PkMAQRiGtfMIAILs3sxFtrC1hApgcfRLF/7z
K34bwTRqErzqUO/orTwakEr9kSIpIL0zIPSryTCOTPZLfMGkQjhHXO3KR/DSbbTV
MZ8y/BM/yelFA/Np+1LjbiYjTNRnTRvCoaQihkIH8Rn02g7ob9HyL4gIGKpuGFcZ
04GacL2cgChqsRSACdNef948jCoJXKgcuDpe39DXphDWZnBKNZ3HFuJ6bryGJf9A
1/eCI4is85BNwKPemQUYR0xx83UIzDfrghatZP2mOCDDSA2MNg8HNxLTd12LGoQD
tfgX4B7aftzW9Y7GSEDfZ0IKm2NRzgPmCVj6PjVR/iI0lIK4Aq0Z/lDJxxEq3XQ=
=AJM5
-----END PGP SIGNATURE-----
Merge tag 'v4.1-rc6' into drm-next
Linux 4.1-rc6
backmerge 4.1-rc6 as some of the later pull reqs are based on newer bases
and I'd prefer to do the fixup myself.
Create a new directory hierarchy for the low level x86 entry code:
arch/x86/entry/*
This will host all the low level glue that is currently scattered
all across arch/x86/.
Start with entry_64.S and entry_32.S.
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nothing in <asm/io.h> uses anything from <linux/vmalloc.h>, so
remove it from there and fix up the resulting build problems
triggered on x86 {64|32}-bit {def|allmod|allno}configs.
The breakages were triggering in places where x86 builds relied
on vmalloc() facilities but did not include <linux/vmalloc.h>
explicitly and relied on the implicit inclusion via <asm/io.h>.
Also add:
- <linux/init.h> to <linux/io.h>
- <asm/pgtable_types> to <asm/io.h>
... which were two other implicit header file dependencies.
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[ Tidied up the changelog. ]
Acked-by: David Miller <davem@davemloft.net>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Colin Cross <ccross@android.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: James E.J. Bottomley <JBottomley@odin.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Suma Ramars <sramars@cisco.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Peter Zijstra noticed that in arch/x86/Kconfig there are a lot
of X86_{32,64} clauses in the X86 symbol, plus there are a number
of similar selects in the X86_32 and X86_64 config definitions
as well - which all overlap in an inconsistent mess.
So:
- move all select's from X86_32 and X86_64 to the X64 config
option
- sort their names, so that duplications are easier to spot
- align their if clauses, so that they are easier to identify
at a glance - and so that weirdnesses stand out more
No change in functionality:
105 insertions(+)
105 deletions(-)
Originally-from: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150602153027.GU3644@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch converts rfc4106-gcm-aesni to the new AEAD interface.
The low-level interface remains as is for now because we can't
touch it until cryptd itself is upgraded.
In the conversion I've also removed the duplicate copy of the
context in the top-level algorithm. Now all processing is carried
out in the low-level __driver-gcm-aes-aesni algorithm.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
We did try trimming whitespace surrounding the 'model name'
field in /proc/cpuinfo since reportedly some userspace uses it
in string comparisons and there were discrepancies:
[thetango@prarit ~]# grep "^model name" /proc/cpuinfo | uniq -c | sed 's/\ /_/g'
______1_model_name :_AMD_Opteron(TM)_Processor_6272
_____63_model_name :_AMD_Opteron(TM)_Processor_6272_________________
However, there were issues with overlapping buffers, string
sizes and non-byte-sized copies in the previous proposed
solutions; see Link tags below for the whole farce.
So, instead of diddling with this more, let's simply extend what
was there originally with trimming any present trailing
whitespace. Final result is really simple and obvious.
Testing with the most insane model IDs qemu can generate, looks
good:
.model_id = " My funny model ID CPU ",
______4_model_name :_My_funny_model_ID_CPU
.model_id = "My funny model ID CPU ",
______4_model_name :_My_funny_model_ID_CPU
.model_id = " My funny model ID CPU",
______4_model_name :_My_funny_model_ID_CPU
.model_id = " ",
______4_model_name :__
.model_id = "",
______4_model_name :_15/02
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1432050210-32036-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
retint_kernel doesn't require %rcx to be pointing to thread info
(anymore?), and the code on the two alternative paths is - not
really surprisingly - identical.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/556C664F020000780007FB64@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Doing so allows adjustments by 128 bytes (occurring for
REMOVE_PT_GPREGS_FROM_STACK 8 uses) to be expressed with a
single byte immediate.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/556C660F020000780007FB60@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The early_idt_handlers asm code generates an array of entry
points spaced nine bytes apart. It's not really clear from that
code or from the places that reference it what's going on, and
the code only works in the first place because GAS never
generates two-byte JMP instructions when jumping to global
labels.
Clean up the code to generate the correct array stride (member size)
explicitly. This should be considerably more robust against
screw-ups, as GAS will warn if a .fill directive has a negative
count. Using '. =' to advance would have been even more robust
(it would generate an actual error if it tried to move
backwards), but it would pad with nulls, confusing anyone who
tries to disassemble the code. The new scheme should be much
clearer to future readers.
While we're at it, improve the comments and rename the array and
common code.
Binutils may start relaxing jumps to non-weak labels. If so,
this change will fix our build, and we may need to backport this
change.
Before, on x86_64:
0000000000000000 <early_idt_handlers>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 00 00 00 00 jmpq 9 <early_idt_handlers+0x9>
5: R_X86_64_PC32 early_idt_handler-0x4
...
48: 66 90 xchg %ax,%ax
4a: 6a 08 pushq $0x8
4c: e9 00 00 00 00 jmpq 51 <early_idt_handlers+0x51>
4d: R_X86_64_PC32 early_idt_handler-0x4
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: e9 00 00 00 00 jmpq 120 <early_idt_handler>
11c: R_X86_64_PC32 early_idt_handler-0x4
After:
0000000000000000 <early_idt_handler_array>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 14 01 00 00 jmpq 11d <early_idt_handler_common>
...
48: 6a 08 pushq $0x8
4a: e9 d1 00 00 00 jmpq 120 <early_idt_handler_common>
4f: cc int3
50: cc int3
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: eb 03 jmp 120 <early_idt_handler_common>
11d: cc int3
11e: cc int3
11f: cc int3
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Binutils <binutils@sourceware.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
instead of returning '1' to indicate error - Dan Carpenter
* New support to expose the EFI System Resource Tables in sysfs, which
provides information for performing firmware updates - Peter Jones
* Documentation cleanup in the EFI handover protocol section which
falsely claimed that 'cmdline_size' needed to be filled out by the
boot loader - Alex Smith
* Align the order of SMBIOS tables in /sys/firmware/efi/systab to match
the way that we do things for ACPI and add documentation to
Documentation/ABI - Jean Delvare
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVazVXAAoJEC84WcCNIz1VtVIP/1bwaRIw4eHBuunTY5ONZ9FP
+uP0hyvUyGajES91PArqWpCeubn6hOAENT98+Tp+w81n3BPL3ZKKZB5jIbIpVqiF
IOlpUud+MlpoHbyBleCVQHBG6+8pfE8ty3sC+gljjDhjaXnT1QJt9IdoEMpLnx7P
pS0b9RzBVHJX1Y0ILMXstJKtNjyZfsxZ031XbjEuRfw7V2DtptkjRivR8EKDBKsG
kNYcHxJJX/+DE9+pNPc3wrByBasQlBmrnZpwP3LIG12GRtoEZzbogHmFExeQZ+9k
Gp3xuyOFx2Texl7bXM0artWbtTdzQj1ai8MoT5fQexy0UzO1TtlkdfaBkYKd3mtY
AxvLPxCQpmGMV16T3QNaHEocFDAHSUvc2o85sQj+EdHhUcSkFybi4rSpDFf7HzO6
x6xkt2Fu9d7GEpZG1O7V/v1uMNsp3tOBRMiMdruRq2Ui2UV8s616DqfjtoX/pkS3
clNGrGZlUfDegKhkCuQqfUZY4jz/gioCEciY1S4auz/OX5jK0NTWUmAWzBnnWjsC
M/RHbTbRbYGh1lTUSZQIdGSe5ejW/kBGMCeNh5ZmaxsZx057TYywSqLvo4PVoxON
DTJUMwP2X/rzS2L3o3KVdjDTf3PTw7tQbieAjr5M4N7cd0I+BjRWBcQaCOnA0qN0
SQwqdWeY/ZHcZftbgCAw
=Twjn
-----END PGP SIGNATURE-----
Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into x86/efi
Pull EFI changes from Matt Fleming:
- Use idiomatic negative error values in efivar_create_sysfs_entry()
instead of returning '1' to indicate error. (Dan Carpenter)
- Implement new support to expose the EFI System Resource Tables in sysfs,
which provides information for performing firmware updates. (Peter Jones)
- Documentation cleanup in the EFI handover protocol section which
falsely claimed that 'cmdline_size' needed to be filled out by the
boot loader. (Alex Smith)
- Align the order of SMBIOS tables in /sys/firmware/efi/systab to match
the way that we do things for ACPI and add documentation to
Documentation/ABI. (Jean Delvare)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU changes from Paul E. McKenney:
- Initialization/Kconfig updates: hide most Kconfig options from unsuspecting users.
There's now a single high level configuration option:
*
* RCU Subsystem
*
Make expert-level adjustments to RCU configuration (RCU_EXPERT) [N/y/?] (NEW)
Which if answered in the negative, leaves us with a single interactive
configuration option:
Offload RCU callback processing from boot-selected CPUs (RCU_NOCB_CPU) [N/y/?] (NEW)
All the rest of the RCU options are configured automatically.
- Remove all uses of RCU-protected array indexes: replace the
rcu_[access|dereference]_index_check() APIs with READ_ONCE() and rcu_lockdep_assert().
- RCU CPU-hotplug cleanups.
- Updates to Tiny RCU: a race fix and further code shrinkage.
- RCU torture-testing updates: fixes, speedups, cleanups and
documentation updates.
- Miscellaneous fixes.
- Documentation updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the dwarf2 annotations in low level assembly code have
become an increasing hindrance: unreadable, messy macros
mixed into some of the most security sensitive code paths
of the Linux kernel.
These debug info annotations don't even buy the upstream
kernel anything: dwarf driven stack unwinding has caused
problems in the past so it's out of tree, and the upstream
kernel only uses the much more robust framepointers based
stack unwinding method.
In addition to that there's a steady, slow bitrot going
on with these annotations, requiring frequent fixups.
There's no tooling and no functionality upstream that
keeps it correct.
So burn down the sick forest, allowing new, healthier growth:
27 files changed, 350 insertions(+), 1101 deletions(-)
Someone who has the willingness and time to do this
properly can attempt to reintroduce dwarf debuginfo in x86
assembly code plus dwarf unwinding from first principles,
with the following conditions:
- it should be maximally readable, and maximally low-key to
'ordinary' code reading and maintenance.
- find a build time method to insert dwarf annotations
automatically in the most common cases, for pop/push
instructions that manipulate the stack pointer. This could
be done for example via a preprocessing step that just
looks for common patterns - plus special annotations for
the few cases where we want to depart from the default.
We have hundreds of CFI annotations, so automating most of
that makes sense.
- it should come with build tooling checks that ensure that
CFI annotations are sensible. We've seen such efforts from
the framepointer side, and there's no reason it couldn't be
done on the dwarf side.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
drivers/net/phy/amd-xgbe-phy.c
drivers/net/wireless/iwlwifi/Kconfig
include/net/mac80211.h
iwlwifi/Kconfig and mac80211.h were both trivial overlapping
changes.
The drivers/net/phy/amd-xgbe-phy.c file got removed in 'net-next' and
the bug fix that happened on the 'net' side is already integrated
into the rest of the amd-xgbe driver.
Signed-off-by: David S. Miller <davem@davemloft.net>
Currently UML is abusing __KERNEL__ to distinguish between
kernel and host code (os-Linux). It is better to use a custom
define such that existing users of __KERNEL__ don't get confused.
Signed-off-by: Richard Weinberger <richard@nod.at>
Pull turbostat tool fixes from Len Brown:
"Just one minor kernel dependency in this batch -- added a #define to
msr-index.h"
* 'turbostat' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
tools/power turbostat: update version number to 4.7
tools/power turbostat: allow running without cpu0
tools/power turbostat: correctly decode of ENERGY_PERFORMANCE_BIAS
tools/power turbostat: enable turbostat to support Knights Landing (KNL)
tools/power turbostat: correctly display more than 2 threads/core
Fixes:
arch/x86/um/signal.c: In function ‘setup_signal_stack_si’:
include/asm-generic/uaccess.h:146:27: warning: initialization from incompatible pointer type [enabled by default]
__typeof__(*(ptr)) __x = (x); \
^
arch/x86/um/signal.c:544:10: note: in expansion of macro ‘__put_user’
err |= __put_user(ksig->ka.sa.sa_restorer,
Signed-off-by: Richard Weinberger <richard@nod.at>
This fixes a bug uncovered by a recent driver core change that
modified the implementation of the ACPI_COMPANION_SET() macro to
strictly rely on its second argument to be either NULL or a valid
pointer to struct acpi_device.
As it turns out, pcibios_root_bridge_prepare() on x86 and ia64
works with the assumption that the only code path calling
pci_create_root_bus() is pci_acpi_scan_root() and therefore
the sysdata argument passed to it will always match the
expectations of pcibios_root_bridge_prepare(). That need not
be the case, however, and in particular it is not the case for
the Xen pcifront driver that passes a pointer to its own private
data strcture as sysdata to pci_scan_bus_parented() which then
passes it to pci_create_root_bus() and it ends up being used
incorrectly by pcibios_root_bridge_prepare().
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQIcBAABCAAGBQJVaP1HAAoJEILEb/54YlRxpwEQAJn+6phfcLZiBkwRCvGAmWwv
Hi5UhcLCZb8r0RX+CssrwDxmiTD+OO5Kume7yqzb9R1EJfEWcx7dOnvrtl6mfy9E
sfWuU/ILF9TW57bqEHBde921VwtRh3BQ4fOCbp4qYEEXjFx+N/UOYHRefgWzvgYP
0j9U7HvwBatIjOC6gPWDyoOI7AYZwBVh4lhjYTPfOTta7QqbjFPD1b+EaAqSyNsl
TqrTQDgtxMTL02u50fP1hLJtOibdp7gOTRlEyfCJru4JjHBTRxBlaefymGi4e2Nw
5bFyXlDv4SVAGpE2XFDIJftAkAs+lXPjfIt7aHPB+DSa01axBesyqLITQV9wDoL8
AETnGomgoe+fsymJ1Pk4JOar6bKJmezAwDibi7bDcUoZkTD91VREwZiaLR+kEvzH
msujMNV0SdVS+PIZUNIwYE6t/ffSDnHjdGjGnsAhNt2JtFFcDKyNkf21O4NYlV99
iLLx/YY+kxmTBuV+L9pjEefS/u9g60cdjslPFBzffXIZ8N9g2uKOqB6ulfg2dSm/
tB5u8gyTuR++jmm/f+rlpCCzQ4EMG9ciZnZJfce3m961Sac2kNacGARMMIAe6xMq
ijuOgcVZZibGBviG5v64ntTQki9qgmImZiPCSoOn+3UgrkCExJfUNDSVl5tQA6MP
oJEj4WRvH2LrFd/PjPlW
=ulKW
-----END PGP SIGNATURE-----
Merge tag 'acpi-pci-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull PCI / ACPI fix from Rafael Wysocki:
"This fixes a bug uncovered by a recent driver core change that
modified the implementation of the ACPI_COMPANION_SET() macro to
strictly rely on its second argument to be either NULL or a valid
pointer to struct acpi_device.
As it turns out, pcibios_root_bridge_prepare() on x86 and ia64 works
with the assumption that the only code path calling pci_create_root_bus()
is pci_acpi_scan_root() and therefore the sysdata argument passed to
it will always match the expectations of pcibios_root_bridge_prepare().
That need not be the case, however, and in particular it is not the
case for the Xen pcifront driver that passes a pointer to its own
private data strcture as sysdata to pci_scan_bus_parented() which then
passes it to pci_create_root_bus() and it ends up being used incorrectly
by pcibios_root_bridge_prepare()"
* tag 'acpi-pci-4.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PCI / ACPI: Do not set ACPI companions for host bridges with parents
Initialize kvmclock base, on kvmclock system MSR write time,
so that the guest sees kvmclock counting from zero.
This matches baremetal behaviour when kvmclock in guest
sets sched clock stable.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
[Remove unnecessary comment. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If you try to enable NOHZ_FULL on a guest today, you'll get
the following error when the guest tries to deactivate the
scheduler tick:
WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290()
NO_HZ FULL will not work with unstable sched clock
CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: events flush_to_ldisc
ffffffff8162a0c7 ffff88011f583e88 ffffffff814e6ba0 0000000000000002
ffff88011f583ed8 ffff88011f583ec8 ffffffff8104d095 ffff88011f583eb8
0000000000000000 0000000000000003 0000000000000001 0000000000000001
Call Trace:
<IRQ> [<ffffffff814e6ba0>] dump_stack+0x4f/0x7b
[<ffffffff8104d095>] warn_slowpath_common+0x85/0xc0
[<ffffffff8104d146>] warn_slowpath_fmt+0x46/0x50
[<ffffffff810bd2a9>] can_stop_full_tick+0xb9/0x290
[<ffffffff810bd9ed>] tick_nohz_irq_exit+0x8d/0xb0
[<ffffffff810511c5>] irq_exit+0xc5/0x130
[<ffffffff814f180a>] smp_apic_timer_interrupt+0x4a/0x60
[<ffffffff814eff5e>] apic_timer_interrupt+0x6e/0x80
<EOI> [<ffffffff814ee5d1>] ? _raw_spin_unlock_irqrestore+0x31/0x60
[<ffffffff8108bbc8>] __wake_up+0x48/0x60
[<ffffffff8134836c>] n_tty_receive_buf_common+0x49c/0xba0
[<ffffffff8134a6bf>] ? tty_ldisc_ref+0x1f/0x70
[<ffffffff81348a84>] n_tty_receive_buf2+0x14/0x20
[<ffffffff8134b390>] flush_to_ldisc+0xe0/0x120
[<ffffffff81064d05>] process_one_work+0x1d5/0x540
[<ffffffff81064c81>] ? process_one_work+0x151/0x540
[<ffffffff81065191>] worker_thread+0x121/0x470
[<ffffffff81065070>] ? process_one_work+0x540/0x540
[<ffffffff8106b4df>] kthread+0xef/0x110
[<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
[<ffffffff814ef4f2>] ret_from_fork+0x42/0x70
[<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
---[ end trace 06e3507544a38866 ]---
However, it turns out that kvmclock does provide a stable
sched_clock callback. So, let the scheduler know this which
in turn makes NOHZ_FULL work in the guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Setting sched clock stable for kvmclock causes the printk timestamps
to not start from zero, which is different from baremetal and
can possibly break userspace. Add a flag to indicate that
hypervisor sets clock base at zero when kvmclock is initialized.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Linus reported the following new warning on x86 allmodconfig with GCC 5.1:
> ./arch/x86/include/asm/spinlock.h: In function ‘arch_spin_lock’:
> ./arch/x86/include/asm/spinlock.h:119:3: warning: implicit declaration
> of function ‘__ticket_lock_spinning’ [-Wimplicit-function-declaration]
> __ticket_lock_spinning(lock, inc.tail);
> ^
This warning triggers because of these hacks in misc.h:
/*
* we have to be careful, because no indirections are allowed here, and
* paravirt_ops is a kind of one. As it will only run in baremetal anyway,
* we just keep it from happening
*/
#undef CONFIG_PARAVIRT
#undef CONFIG_KASAN
But these hacks were not updated when CONFIG_PARAVIRT_SPINLOCKS was added,
and eventually (with the introduction of queued paravirt spinlocks in
recent kernels) this created an invalid Kconfig combination and broke
the build.
So add a CONFIG_PARAVIRT_SPINLOCKS #undef line as well.
Also remove the _ASM_X86_DESC_H quirk: that undocumented quirk
was originally added ages ago, in:
099e137726 ("x86: use ELF format in compressed images.")
and I went back to that kernel (and fixed up the main Makefile
which didn't build anymore) and checked what failure it
avoided: it avoided an include file dependencies related
build failure related to our old x86-platforms code.
That old code is long gone, the header dependencies got cleaned
up, and the build does not fail anymore with the totality of
asm/desc.h included - so remove the quirk.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While commit efa7045103 ("x86/asm/entry: Make user_mode() work
correctly if regs came from VM86 mode") claims that "user_mode()
is now identical to user_mode_vm()", this wasn't actually the
case - no prior commit made it so.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5566EB0D020000780007E655@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So first of all, this atomic_scrub() function's naming is bad. It looks
like an atomic_t helper. Change it to edac_atomic_scrub().
The bigger problem is that this function is arch-specific and every new
arch which doesn't necessarily need that functionality still needs to
define it, otherwise EDAC doesn't compile.
So instead of doing that and including arch-specific headers, have each
arch define an EDAC_ATOMIC_SCRUB symbol which can be used in edac_mc.c
for ifdeffery. Much cleaner.
And we already are doing this with another symbol - EDAC_SUPPORT. This
is also much cleaner than having CONFIG_EDAC enumerate all the arches
which need/have EDAC support and drivers.
This way I can kill the useless edac.h header in tile too.
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Acked-by: Chris Metcalf <cmetcalf@ezchip.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Doug Thompson <dougthompson@xmission.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-edac@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: "Maciej W. Rozycki" <macro@codesourcery.com>
Cc: Markos Chandras <markos.chandras@imgtec.com>
Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Steven J. Hill" <Steven.Hill@imgtec.com>
Cc: x86@kernel.org
Signed-off-by: Borislav Petkov <bp@suse.de>
... as their only caller is.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5566EE07020000780007E683@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
arch/x86/kvm/mmu.c: In function 'kvm_mmu_pte_write':
arch/x86/kvm/mmu.c:4256: error: unknown field 'cr0_wp' specified in initializer
arch/x86/kvm/mmu.c:4257: error: unknown field 'cr4_pae' specified in initializer
arch/x86/kvm/mmu.c:4257: warning: excess elements in union initializer
...
gcc-4.4.4 (at least) has issues when using anonymous unions in
initializers.
Fixes: edc90b7dc4 ("KVM: MMU: fix SMAP virtualization")
Cc: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There is no reason to deny this feature to guests. We are emulating the
APIC timer, thus are exposing it without stops in power-saving states.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Logical x2APIC stops working if we rewrite it with zeros.
The best references are SDM April 2015: 10.12.10.1 Logical Destination
Mode in x2APIC Mode
[...], the LDR are initialized by hardware based on the value of
x2APIC ID upon x2APIC state transitions.
and SDM April 2015: 10.12.10.2 Deriving Logical x2APIC ID from the Local
x2APIC ID
The LDR initialization occurs whenever the x2APIC mode is enabled
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
SDM April 2015, 10.12.5 State Changes From xAPIC Mode to x2APIC Mode
• Any APIC ID value written to the memory-mapped local APIC ID register
is not preserved.
Fix it by sourcing vcpu_id (= initial APIC ID) instead of memory-mapped
APIC ID. Proper use of apic functions would result in two calls to
recalculate_apic_map(), so this patch makes a new helper.
Signed-off-by: Radim KrÄmář <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The periodic kvmclock sync can be an undesired source of latencies.
When running cyclictest on a guest, a latency spike is visible.
With kvmclock periodic sync disabled, the spike is gone.
Guests should use ntp which means the propagations of ntp corrections
from the host clock are unnecessary.
v2:
-> Make parameter read-only (Radim)
-> Return early on kvmclock_sync_fn (Andrew)
Reported-and-tested-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prepare for multiple address spaces this way, since a VCPU is not available
where unaccount_shadowed is called. We will get to the right kvm_memslots
struct through the role field in struct kvm_mmu_page.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The memory slot is already available from gfn_to_memslot_dirty_bitmap.
Isn't it a shame to look it up again? Plus, it makes gfn_to_page_many_atomic
agnostic of multiple VCPU address spaces.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This lets the function access the new memory slot without going through
kvm_memslots and id_to_memslot. It will simplify the code when more
than one address space will be supported.
Unfortunately, the "const"ness of the new argument must be casted
away in two places. Fixing KVM to accept const struct kvm_memory_slot
pointers would require modifications in pretty much all architectures,
and is left for later.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Most code already uses consts for the struct kernel_param_ops,
sweep the kernel for the last offending stragglers. Other than
include/linux/moduleparam.h and kernel/params.c all other changes
were generated with the following Coccinelle SmPL patch. Merge
conflicts between trees can be handled with Coccinelle.
In the future git could get Coccinelle merge support to deal with
patch --> fail --> grammar --> Coccinelle --> new patch conflicts
automatically for us on patches where the grammar is available and
the patch is of high confidence. Consider this a feature request.
Test compiled on x86_64 against:
* allnoconfig
* allmodconfig
* allyesconfig
@ const_found @
identifier ops;
@@
const struct kernel_param_ops ops = {
};
@ const_not_found depends on !const_found @
identifier ops;
@@
-struct kernel_param_ops ops = {
+const struct kernel_param_ops ops = {
};
Generated-by: Coccinelle SmPL
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Junio C Hamano <gitster@pobox.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: cocci@systeme.lip6.fr
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it "reserved" and allow it to be claimed by a persistent memory
device driver.
This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms with
NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
OEM reserved).
Note, /proc/iomem can be consulted for differentiating legacy
"Persistent Memory (legacy)" E820_PRAM vs standard "Persistent Memory"
E820_PMEM.
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Commit 97badf873a (device property: Make it possible to use
secondary firmware nodes) uncovered a bug in the x86 (and ia64) PCI
host bridge initialization code that assumes bridge->bus->sysdata
to always point to a struct pci_sysdata object which need not be
the case (in particular, the Xen PCI frontend driver sets it to point
to a different data type). If it is not the case, an incorrect
pointer (or a piece of data that is not a pointer at all) will be
passed to ACPI_COMPANION_SET() and that may cause interesting
breakage to happen going forward.
To work around this problem use the observation that the ACPI
host bridge initialization always passes NULL as parent to
pci_create_root_bus(), so if pcibios_root_bridge_prepare() sees
a non-NULL parent of the bridge, it should not attempt to set
an ACPI companion for it, because that means that
pci_create_root_bus() has been called by someone else.
Fixes: 97badf873a (device property: Make it possible to use secondary firmware nodes)
Reported-and-tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Changes mainly to account for minor differences in Knights Landing(KNL):
1. KNL supports C1 and C6 core states.
2. KNL supports PC2, PC3 and PC6 package states.
3. KNL has a different encoding of the TURBO_RATIO_LIMIT MSR
Signed-off-by: Dasaratharaman Chandramouli <dasaratharaman.chandramouli@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Pull networking fixes from David Miller:
1) Don't use MMIO on certain iwlwifi devices otherwise we get a
firmware crash.
2) Don't corrupt the GRO lists of mac80211 contexts by doing sends via
timer interrupt, from Johannes Berg.
3) SKB tailroom is miscalculated in AP_VLAN crypto code, from Michal
Kazior.
4) Fix fw_status memory leak in iwlwifi, from Haim Dreyfuss.
5) Fix use after free in iwl_mvm_d0i3_enable_tx(), from Eliad Peller.
6) JIT'ing of large BPF programs is broken on x86, from Alexei
Starovoitov.
7) EMAC driver ethtool register dump size is miscalculated, from Ivan
Mikhaylov.
8) Fix PHY initial link mode when autonegotiation is disabled in
amd-xgbe, from Tom Lendacky.
9) Fix NULL deref on SOCK_DEAD socket in AF_UNIX and CAIF protocols,
from Mark Salyzyn.
10) credit_bytes not initialized properly in xen-netback, from Ross
Lagerwall.
11) Fallback from MSI-X to INTx interrupts not handled properly in mlx4
driver, fix from Benjamin Poirier.
12) Perform ->attach() after binding dev->qdisc in packet scheduler,
otherwise we can crash. From Cong WANG.
13) Don't clobber data in sctp_v4_map_v6(). From Jason Gunthorpe.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (30 commits)
sctp: Fix mangled IPv4 addresses on a IPv6 listening socket
net_sched: invoke ->attach() after setting dev->qdisc
xen-netfront: properly destroy queues when removing device
mlx4_core: Fix fallback from MSI-X to INTx
xen/netback: Properly initialize credit_bytes
net: netxen: correct sysfs bin attribute return code
tools: bpf_jit_disasm: fix segfault on disabled debugging log output
unix/caif: sk_socket can disappear when state is unlocked
amd-xgbe-phy: Fix initial mode when autoneg is disabled
net: dp83640: fix improper double spin locking.
net: dp83640: reinforce locking rules.
net: dp83640: fix broken calibration routine.
net: stmmac: create one debugfs dir per net-device
net/ibm/emac: fix size of emac dump memory areas
x86: bpf_jit: fix compilation of large bpf programs
net: phy: bcm7xxx: Fix 7425 PHY ID and flags
iwlwifi: mvm: avoid use-after-free on iwl_mvm_d0i3_enable_tx()
iwlwifi: mvm: clean net-detect info if device was reset during suspend
iwlwifi: mvm: take the UCODE_DOWN reference when resuming
iwlwifi: mvm: BT Coex - duplicate the command if sent ASYNC
...
Because mce is arch-specific x86 code, there is little or no
performance benefit of using rcu_dereference_index_check() over using
smp_load_acquire(). It also turns out that mce is the only place that
array-index-based RCU is used, and it would be convenient to drop
this portion of the RCU API.
This patch therefore changes rcu_dereference_index_check() uses to
smp_load_acquire(), but keeping the lockdep diagnostics, and also
changes rcu_access_index() uses to READ_ONCE().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: linux-edac@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
Pull x86 fixes from Ingo Molnar:
"This tree includes:
- a fix that disables the compacted FPU XSAVE format by disabling
XSAVES support: the fixes are too complex and the breakages
ABI-affecting, so we want this to be quirked off in a robust way
and backported, to make sure no broken kernel is exposed to the new
hardware (which exposure is still very limited).
- an MCE printk message fix
- a documentation fix"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Disable XSAVES* support for now
x86/Documentation: Update the contact email for L3 cache index disable functionality
x86/mce: Fix MCE severity messages
These functions are arch-specific and duplicate the
functionality of macros defined in linux/include/topology.h.
Remove them as all the callers in x86 have now switched to using
the topology_**_cpumask() family.
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benoit Cousson <bcousson@baylibre.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/1432645896-12588-10-git-send-email-bgolaszewski@baylibre.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The former duplicate the functionalities of the latter but are
neither documented nor arch-independent.
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benoit Cousson <bcousson@baylibre.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/1432645896-12588-9-git-send-email-bgolaszewski@baylibre.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Two Linux device drivers cannot work with PAT and the work
required to make them work is significant. There is not enough
motivation to convert these drivers over to use PAT properly,
the compromise reached is to let drivers that cannot be ported
to PAT check if PAT was enabled and if so fail on probe with a
recommendation to boot with the "nopat" kernel parameter.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Walls <awalls@md.metrocast.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430425520-22275-4-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-14-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use pat_enabled in x86-specific code to see if PAT is enabled
or not but we're granting full access to it even though readers
do not need to set it. If, for instance, we granted access to it
to modules later they then could override the variable
setting... no bueno.
This renames pat_enabled to a new static variable __pat_enabled.
Folks are redirected to use pat_enabled() now.
Code that sets this can only be internal to pat.c. Apart from
the early kernel parameter "nopat" to disable PAT, we also have
a few cases that disable it later and make use of a helper
pat_disable(). It is wrapped under an ifdef but since that code
cannot run unless PAT was enabled its not required to wrap it
with ifdefs, unwrap that. Likewise, since "nopat" doesn't really
change non-PAT systems just remove that ifdef as well.
Although we could add and use an early_param_off(), these
helpers don't use __read_mostly but we want to keep
__read_mostly for __pat_enabled as this is a hot path -- upon
boot, for instance, a simple guest may see ~4k accesses to
pat_enabled(). Since __read_mostly early boot params are not
that common we don't add a helper for them just yet.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Walls <awalls@md.metrocast.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kyle McMartin <kyle@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430425520-22275-3-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-13-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is possible to enable CONFIG_MTRR and CONFIG_X86_PAT and end
up with a system with MTRR functionality disabled but PAT
functionality enabled. This can happen, for instance, when the
Xen hypervisor is used where MTRRs are not supported but PAT is.
This can happen on Linux as of commit
47591df505 ("xen: Support Xen pv-domains using PAT")
by Juergen, introduced in v3.19.
Technically, we should assume the proper CPU bits would be set
to disable MTRRs but we can't always rely on this. At least on
the Xen Hypervisor, for instance, only X86_FEATURE_MTRR was
disabled as of Xen 4.4 through Xen commit 586ab6a [0], but not
X86_FEATURE_K6_MTRR, X86_FEATURE_CENTAUR_MCR, or
X86_FEATURE_CYRIX_ARR for instance.
Roger Pau Monné has clarified though that although this is
technically true we will never support PVH on these CPU types so
Xen has no need to disable these bits on those systems. As per
Roger, AMD K6, Centaur and VIA chips don't have the necessary
hardware extensions to allow running PVH guests [1].
As per Toshi it is also possible for the BIOS to disable MTRR
support, in such cases get_mtrr_state() would update the MTRR
state as per the BIOS, we need to propagate this information as
well.
x86 MTRR code relies on quite a bit of checks for mtrr_if being
set to check to see if MTRRs did get set up. Instead, lets
provide a generic getter for that. This also adds a few checks
where they were not before which could potentially safeguard
ourselves against incorrect usage of MTRR where this was not
desirable.
Where possible match error codes as if MTRRs were disabled on
arch/x86/include/asm/mtrr.h.
Lastly, since disabling MTRRs can happen at run time and we
could end up with PAT enabled, best record now in our logs when
MTRRs are disabled.
[0] ~/devel/xen (git::stable-4.5)$ git describe --contains 586ab6a 4.4.0-rc1~18
[1] http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg03460.html
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Ville Syrjälä <syrjala@sci.fi>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: bhelgaas@google.com
Cc: david.vrabel@citrix.com
Cc: jbeulich@suse.com
Cc: konrad.wilk@oracle.com
Cc: venkatesh.pallipadi@intel.com
Cc: ville.syrjala@linux.intel.com
Cc: xen-devel@lists.xensource.com
Link: http://lkml.kernel.org/r/1426893517-2511-3-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-12-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is only one user but since we're going to bury MTRR next
out of access to drivers, expose this last piece of API to
drivers in a general fashion only needing io.h for access to
helpers.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Abhilash Kesavan <a.kesavan@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Cristian Stoica <cristian.stoica@freescale.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Ville Syrjälä <syrjala@sci.fi>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1429722736-4473-1-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-11-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As part of the effort to phase out MTRR use document
write-combining MTRR effects on pages with different non-PAT
page attributes flags and different PAT entry values. Extend
arch_phys_wc_add() documentation to clarify power of two sizes /
boundary requirements as we phase out mtrr_add() use.
Lastly hint towards ioremap_uc() for corner cases on device
drivers working with devices with mixed regions where MTRR size
requirements would otherwise not enable write-combining
effective memory types.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Ville Syrjälä <syrjala@sci.fi>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-fbdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1430343851-967-3-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-10-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use pr_info() instead of the old printk to prefix the component
where things are coming from. With this readers will know
exactly where the message is coming from. We use pr_* helpers
but define pr_fmt to the empty string for easier grepping for
those error messages.
We leave the users of dprintk() in place, this will print only
when the debugpat kernel parameter is enabled. We want to leave
those enabled as a debug feature, but also make them use the
same prefix.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
[ Kill pr_fmt. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Walls <awalls@md.metrocast.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cocci@systeme.lip6.fr
Cc: plagnioj@jcrosoft.com
Cc: tomi.valkeinen@ti.com
Link: http://lkml.kernel.org/r/1430425520-22275-2-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-9-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds the argument 'uniform' to mtrr_type_lookup(),
which gets set to 1 when a given range is covered uniformly by
MTRRs, i.e. the range is fully covered by a single MTRR entry or
the default type.
Change pud_set_huge() and pmd_set_huge() to honor the 'uniform'
flag to see if it is safe to create a huge page mapping in the
range.
This allows them to create a huge page mapping in a range
covered by a single MTRR entry of any memory type. It also
detects a non-optimal request properly. They continue to check
with the WB type since it does not effectively change the
uniform mapping even if a request spans multiple MTRR entries.
pmd_set_huge() logs a warning message to a non-optimal request
so that driver writers will be aware of such a case. Drivers
should make a mapping request aligned to a single MTRR entry
when the range is covered by MTRRs.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
[ Realign, flesh out comments, improve warning message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-7-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-8-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
MTRRs contain fixed and variable entries. mtrr_type_lookup() may
repeatedly call __mtrr_type_lookup() to handle a request that
overlaps with variable entries.
However, __mtrr_type_lookup() also handles the fixed entries,
which do not have to be repeated. Therefore, this patch creates
separate functions, mtrr_type_lookup_fixed() and
mtrr_type_lookup_variable(), to handle the fixed and variable
ranges respectively.
The patch also updates the function headers to clarify the
return values and output argument. It updates comments to
clarify that the repeating is necessary to handle overlaps with
the default type, since overlaps with multiple entries alone can
be handled without such repeating.
There is no functional change in this patch.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-6-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
mtrr_type_lookup() returns verbatim 0xFF when MTRRs are
disabled. This patch defines MTRR_TYPE_INVALID to clarify the
meaning of this value, and documents its usage.
Document the return values of the kernel virtual address mapping
helpers pud_set_huge(), pmd_set_huge, pud_clear_huge() and
pmd_clear_huge().
There is no functional change in this patch.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-5-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'mtrr_state.enabled' contains the FE (fixed MTRRs enabled)
and E (MTRRs enabled) flags in MSR_MTRRdefType. Intel SDM,
section 11.11.2.1, defines these flags as follows:
- All MTRRs are disabled when the E flag is clear.
The FE flag has no affect when the E flag is clear.
- The default type is enabled when the E flag is set.
- MTRR variable ranges are enabled when the E flag is set.
- MTRR fixed ranges are enabled when both E and FE flags
are set.
MTRR state checks in __mtrr_type_lookup() do not match with SDM.
Hence, this patch makes the following changes:
- The current code detects MTRRs disabled when both E and
FE flags are clear in mtrr_state.enabled. Fix to detect
MTRRs disabled when the E flag is clear.
- The current code does not check if the FE bit is set in
mtrr_state.enabled when looking at the fixed entries.
Fix to check the FE flag.
- The current code returns the default type when the E flag
is clear in mtrr_state.enabled. However, the default type
is UC when the E flag is clear. Remove the code as this
case is handled as MTRR disabled with the 1st change.
In addition, this patch defines the E and FE flags in
mtrr_state.enabled as follows.
- FE flag: MTRR_STATE_MTRR_FIXED_ENABLED
- E flag: MTRR_STATE_MTRR_ENABLED
print_mtrr_state() and x86_get_mtrr_mem_range() are also updated
accordingly.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-4-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an MTRR entry is inclusive to a requested range, i.e. the
start and end of the request are not within the MTRR entry range
but the range contains the MTRR entry entirely:
range_start ... [mtrr_start ... mtrr_end] ... range_end
__mtrr_type_lookup() ignores such a case because both
start_state and end_state are set to zero.
This bug can cause the following issues:
1) reserve_memtype() tracks an effective memory type in case
a request type is WB (ex. /dev/mem blindly uses WB). Missing
to track with its effective type causes a subsequent request
to map the same range with the effective type to fail.
2) pud_set_huge() and pmd_set_huge() check if a requested range
has any overlap with MTRRs. Missing to detect an overlap may
cause a performance penalty or undefined behavior.
This patch fixes the bug by adding a new flag, 'inclusive',
to detect the inclusive case. This case is then handled in
the same way as end_state:1 since the first region is the same.
With this fix, __mtrr_type_lookup() handles the inclusive case
properly.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-3-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVYnloAAoJEHm+PkMAQRiGCgkH/j3r2djOOm4h83FXrShaHORY
p8TBI3FNj4fzLk2PfzqbmiDw2T2CwygB+pxb2Ac9CE99epw8qPk2SRvPXBpdKR7t
lolhhwfzApLJMZbhzNLVywUCDUhFoiEWRhmPqIfA3WXFcIW3t5VNXAoIFjV5HFr6
sYUlaxSI1XiQ5tldVv8D6YSFHms41pisziBIZmzhIUg10P6Vv3D0FbE74fjAJwx0
+08zj3EO7yQMv7Aeeq8F8AJ3628142rcZf0NWF5ohlKLRK3gt0cl9jO5U4Co2dDt
29v03LP5EI6jDKkIbuWlqRMq9YxJz7N3wnkzV0EJiqXucoqPLFDqzbxB4gnS1pI=
=7vbA
-----END PGP SIGNATURE-----
Merge tag 'v4.1-rc5' into x86/mm, to refresh the tree before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Using "mce=1,10000000" on the kernel cmdline to change the
monarch timeout does not work. The cause is that get_option()
does parse a subsequent comma in the option string and signals
that with a return value. So we don't need to check for a second
comma ourselves.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1432120943-25028-1-git-send-email-xiexiuqi@huawei.com
Link: http://lkml.kernel.org/r/1432628901-18044-19-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When comparing the 'model name' field of each core in
/proc/cpuinfo it was noticed that there is a whitespace
difference between the cores' model names.
After some quick investigation it was noticed that the model
name fields were actually different -- processor 0's model name
field had trailing whitespace removed, while the other
processors did not.
Another way of seeing this behaviour is to convert spaces into
underscores in the output of /proc/cpuinfo,
[thetango@prarit ~]# grep "^model name" /proc/cpuinfo | uniq -c | sed 's/\ /_/g'
______1_model_name :_AMD_Opteron(TM)_Processor_6272
_____63_model_name :_AMD_Opteron(TM)_Processor_6272_________________
which shows the discrepancy.
This occurs because the kernel calls strim() on cpu 0's
x86_model_id field to output a pretty message to the console in
print_cpu_info(), and as a result strips the whitespace at the
end of the ->x86_model_id field.
But, the ->x86_model_id field should be the same for the all
identical CPUs in the box. Thus, we need to remove both leading
and trailing whitespace.
As a result, the print_cpu_info() output looks like
smpboot: CPU0: AMD Opteron(TM) Processor 6272 (fam: 15, model: 01, stepping: 02)
and the x86_model_id field is correct on all processors on AMD
platforms:
_____64_model_name :_AMD_Opteron(TM)_Processor_6272
Output is still correct on an Intel box:
____144_model_name :_Intel(R)_Xeon(R)_CPU_E7-8890_v3_@_2.50GHz
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1432050210-32036-1-git-send-email-prarit@redhat.com
Link: http://lkml.kernel.org/r/1432628901-18044-15-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make sure the WARN_ON_FPU() macro consumes the macro argument,
to avoid 'unused variable' build warnings if the only use of
a variable is in debugging code.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
copy_kernel_to_xregs_booting() has a second parameter that is the mask
of xfeatures that should be copied - but this parameter is always -1.
Simplify the call site of this function, this also makes it more
similar to the function call signature of other copy_kernel_to*regs()
functions.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bring the __copy_fpstate_to_fpregs() and copy_fpstate_to_fpregs() functions
in line with the parameter passing convention of other kernel-to-FPU-registers
copying functions: pass around an in-memory FPU register state pointer,
instead of struct fpu *.
NOTE: This patch also changes the assembly constraint of the FXSAVE-leak
workaround from 'fpu->fpregs_active' to 'fpstate' - but that is fine,
as we only need a valid memory address there for the FILDL instruction.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
None of the copy_kernel_to_*regs() FPU register copying functions are
supposed to fail, and all of them have debugging checks that enforce
this.
Remove their return values and simplify their call sites, which have
redundant error checks and error handling code paths.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bring the __copy_fpstate_to_fpregs() and copy_fpstate_to_fpregs() functions
in line with the naming of other kernel-to-FPU-registers copying functions.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Copying from in-kernel FPU context buffers to FPU registers are
never supposed to fault.
Add debugging checks to copy_kernel_to_fxregs() and copy_kernel_to_fregs()
to double check this assumption.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The copy_fpstate_to_fpregs() function is never supposed to fail,
so add a debugging check to its call site in fpu__restore().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__activate_fpstate_write() is used before ptrace writes to the fpstate
context. Because it expects the modified registers to be reloaded on the
nexts context switch, it's only valid to call this function for stopped
child tasks.
- add a debugging check for this assumption
- remove code that only runs if the current task's FPU state needs
to be saved, which cannot occur here
- update comments to match the implementation
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remaining users of fpu__activate_fpstate() are all places that want to modify
FPU registers, rename the function to fpu__activate_fpstate_write() according
to this usage.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__activate_fpstate_read() is used before FPU registers are
read from the fpstate by ptrace and core dumping.
It's not necessary to unlazy non-current child tasks in this case,
since the reading of registers is non-destructive.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently fpu__activate_fpstate() is used for two distinct purposes:
- read access by ptrace and core dumping, where in the core dumping
case the current task's FPU state may be examined as well.
- write access by ptrace, which modifies FPU registers and expects
the modified registers to be reloaded on the next context switch.
Split out the reading side into fpu__activate_fpstate_read().
( Note that this is just a pure duplication of fpu__activate_fpstate()
for the time being, we'll optimize the new function in the next patch. )
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bobby Powers reported the following FPU warning during ELF coredumping:
WARNING: CPU: 0 PID: 27452 at arch/x86/kernel/fpu/core.c:324 fpu__activate_stopped+0x8a/0xa0()
This warning unearthed an invalid assumption about fpu__activate_stopped()
that I added in:
67e97fc2ec ("x86/fpu: Rename init_fpu() to fpu__unlazy_stopped() and add debugging check")
the old init_fpu() function had an (intentional but obscure) side effect:
when FPU registers are accessed for the current task, for reading, then
it synchronized live in-register FPU state with the fpstate by saving it.
So fix this bug by saving the FPU if we are the current task. We'll
still warn in fpu__save() if this is called for not yet stopped
child tasks, so the debugging check is still preserved.
Also rename the function to fpu__activate_fpstate(), because it's not
exclusively used for stopped tasks, but for the current task as well.
( Note that this bug calls for a cleaner separation of access-for-read
and access-for-modification FPU methods, but we'll do that in separate
patches. )
Reported-by: Bobby Powers <bobbypowers@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is a 'pt' variable in the outer scope of pt_event_stop() with the same
type, we don't really need another one in the inner scope.
This patch removes the redundant variable declaration.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1432308626-18845-8-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Initially, we were trying to guard against scenarios where somebody
attaches to the system with a hardware debugger while PT is enabled
from software and pt_is_running() tries to make sure we handle this
better, but the truth is, there is still a race window no matter what
and people with hardware debuggers should really know what they are
doing anyway.
In other words, there is no point in keeping this one around, and
it's one RDMSR instructions fewer in the fast path.
The case when PT is enabled by the BIOS at boot time is handled
in the driver initialization path and doesn't use pt_is_running().
This patch gets rid of it.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1429622177-22843-6-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the description of pt_buffer_reset_offsets() lacks information
about its calling constraints and ordering with regards to other buffer
management functions.
Add a clarification about when this function has to be called.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1429622177-22843-5-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comments in the driver don't make it absolutely clear as to what
exactly is the calling order and other possible constraints of buffer
management functions.
Document constraints and calling order for the buffer configuration
functions. While at it, replace a redundant check in
pt_buffer_reset_markers() with an explanation why it is not needed.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1429622177-22843-4-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, there's a set-but-not-used variable in setup_topa_index();
this patch gets rid of it. And while at it, fixes a style issue with
brackets around a one-line block.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1429622177-22843-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Don't bother with taking locks if we're not actually going to do
anything. Also, drop the _irqsave(), this is very much only called
from IRQ-disabled context.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For some obscure reason intel_{start,stop}_scheduling() copy the HT
state to an intermediate array. This would make sense if we ever were
to make changes to it which we'd have to discard.
Except we don't. By the time we call intel_commit_scheduling() we're;
as the name implies; committed to them. We'll never back out.
A further hint its pointless is that stop_scheduling() unconditionally
publishes the state.
So the intermediate array is pointless, modify the state in place and
kill the extra array.
And remove the pointless array initialization: INTEL_EXCL_UNUSED == 0.
Note; all is serialized by intel_excl_cntr::lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both intel_commit_scheduling() and intel_get_excl_contraints() test
for cntr < 0.
The only way that can happen (aside from a bug) is through
validate_event(), however that is already captured by the
cpuc->is_fake test.
So remove these test and simplify the code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the code of intel_commit_scheduling() to the right place, which is
in between start() and stop().
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The intel_commit_scheduling() callback is pointlessly different from
the start and stop scheduling callback.
Furthermore, the constraint should never be NULL, so remove that test.
Even though we'll never get called (because we NULL the callbacks)
when !is_ht_workaround_enabled() put that test in.
Collapse the (pointless) WARN_ON_ONCE() and bail on !cpuc->excl_cntrs --
this is doubly pointless, because its the same condition as
is_ht_workaround_enabled() which was already pointless because the
whole method won't ever be called.
Furthremore, make all the !excl_cntrs test WARN_ON_ONCE(); they're all
pointless, because the above, either the function
({get,put}_excl_constraint) are already predicated on it existing or
the is_ht_workaround_enabled() thing is the same test.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have two 'struct event_constraint' local variables in
intel_get_excl_constraints(): 'cx' and 'c'.
Instead of using 'cx' after the dynamic allocation, put all 'cx' inside
the dynamic allocation block and use 'c' outside of it.
Also use direct assignment to copy the structure; let the compiler
figure it out.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lockdep is very good at finding incorrect IRQ state while locking and
is far better at telling us if we hold a lock than the _is_locked()
API. It also generates less code for !DEBUG kernels.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For some obscure reason the current code accounts the current SMT
thread's state on the remote thread and reads the remote's state on
the local SMT thread.
While internally consistent, and 'correct' its pointless confusion we
can do without.
Flip them the right way around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we write RMID values to MSRs the correct type to use is 'u32'
because that clearly articulates we're writing a hardware register
value.
Fix up all uses of RMID in this code to consistently use the correct data
type.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/1432285182-17180-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'closid' (CLass Of Service ID) is used for the Class based Cache
Allocation Technology (CAT). Add explicit storage to the per cpu cache
for it, so it can be used later with the CAT support (requires to move
the per cpu data).
While at it:
- Rename the structure to intel_pqr_state which reflects the actual
purpose of the struct: cache values which go into the PQR MSR
- Rename 'cnt' to rmid_usecnt which reflects the actual purpose of
the counter.
- Document the structure and the struct members.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235150.240899319@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the usage counter is non-zero there is no point to update the rmid
in the PQR MSR.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235150.080844281@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'struct intel_cqm_state' is a strict per CPU cache of the rmid and the
usage counter. It can never be modified from a remote CPU.
The three functions which modify the content: intel_cqm_event[start|stop|del]
(del maps to stop) are called from the perf core with interrupts disabled
which is enough protection for the per CPU state values.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235150.001006529@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'int' is really not a proper data type for an MSR. Use u32 to make it
clear that we are dealing with a 32-bit unsigned hardware value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235149.919350144@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The CQM code acts like it owns the PQR MSR completely. That's not true
because only the lower 10 bits are used for CQM. The upper 32 bits are
used for the 'CLass Of Service ID' (CLOSID). Document the abuse. Will be
fixed in a later patch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235149.823214798@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I stumbled upon an AMD box that had the BIOS using a hardware performance
counter. Instead of printing out a warning and continuing, it failed and
blocked further perf counter usage.
Looking through the history, I found this commit:
a5ebe0ba3d ("perf/x86: Check all MSRs before passing hw check")
which tweaked the rules for a Xen guest on an almost identical box and now
changed the behaviour.
Unfortunately the rules were tweaked incorrectly and will always lead to
MSR failures even though the MSRs are completely fine.
What happens now is in arch/x86/kernel/cpu/perf_event.c::check_hw_exists():
<snip>
for (i = 0; i < x86_pmu.num_counters; i++) {
reg = x86_pmu_config_addr(i);
ret = rdmsrl_safe(reg, &val);
if (ret)
goto msr_fail;
if (val & ARCH_PERFMON_EVENTSEL_ENABLE) {
bios_fail = 1;
val_fail = val;
reg_fail = reg;
}
}
<snip>
/*
* Read the current value, change it and read it back to see if it
* matches, this is needed to detect certain hardware emulators
* (qemu/kvm) that don't trap on the MSR access and always return 0s.
*/
reg = x86_pmu_event_addr(0);
^^^^
if the first perf counter is enabled, then this routine will always fail
because the counter is running. :-(
if (rdmsrl_safe(reg, &val))
goto msr_fail;
val ^= 0xffffUL;
ret = wrmsrl_safe(reg, val);
ret |= rdmsrl_safe(reg, &val_new);
if (ret || val != val_new)
goto msr_fail;
The above bios_fail used to be a 'goto' which is why it worked in the past.
Further, most vendors have migrated to using fixed counters to hide their
evilness hence this problem rarely shows up now days except on a few old boxes.
I fixed my problem and kept the spirit of the original Xen fix, by recording a
safe non-enable register to be used safely for the reading/writing check.
Because it is not enabled, this passes on bare metal boxes (like metal), but
should continue to throw an msr_fail on Xen guests because the register isn't
emulated yet.
Now I get a proper bios_fail error message and Xen should still see their
msr_fail message (untested).
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: george.dunlap@eu.citrix.com
Cc: konrad.wilk@oracle.com
Link: http://lkml.kernel.org/r/1431976608-56970-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, pt_buffer_reset_markers() is a difficult to read knot of
arithmetics with a redundant check for multiple-entry TOPA capability,
a commented out wakeup marker placement and a logical error wrt to
stop marker placement. The latter happens when write head is not page
aligned and results in stop marker being placed one page earlier than
it actually should.
All these problems only affect PT implementations that support
multiple-entry TOPA tables (read: proper scatter-gather).
For single-entry TOPA implementations, there is no functional impact.
This patch deals with all of the above. Tested on both single-entry
and multiple-entry TOPA PT implementations.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1432308626-18845-4-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The (SNB/IVB/HSW) HT bug only affects events that can be programmed
onto GP counters, therefore we should only limit the number of GP
counters that can be used per cpu -- iow we should not constrain the
FP counters.
Furthermore, we should only enfore such a limit when there are in fact
exclusive events being scheduled on either sibling.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Fixed build fail for the !CONFIG_CPU_SUP_INTEL case. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 43b4578071 ("perf/x86: Reduce stack usage of
x86_schedule_events()") violated the rule that 'fake' scheduling; as
used for event/group validation; should not change the event state.
This went mostly un-noticed because repeated calls of
x86_pmu::get_event_constraints() would give the same result. And
x86_pmu::put_event_constraints() would mostly not do anything.
Commit e979121b1b ("perf/x86/intel: Implement cross-HT corruption
bug workaround") made the situation much worse by actually setting the
event->hw.constraint value to NULL, so when validation and actual
scheduling interact we get NULL ptr derefs.
Fix it by removing the constraint pointer from the event and move it
back to an array, this time in cpuc instead of on the stack.
validate_group()
x86_schedule_events()
event->hw.constraint = c; # store
<context switch>
perf_task_event_sched_in()
...
x86_schedule_events();
event->hw.constraint = c2; # store
...
put_event_constraints(event); # assume failure to schedule
intel_put_event_constraints()
event->hw.constraint = NULL;
<context switch end>
c = event->hw.constraint; # read -> NULL
if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref
This in particular is possible when the event in question is a
cpu-wide event and group-leader, where the validate_group() tries to
add an event to the group.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 43b4578071 ("perf/x86: Reduce stack usage of x86_schedule_events()")
Fixes: e979121b1b ("perf/x86/intel: Implement cross-HT corruption bug workaround")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Architecture-specific helpers are not supposed to muck with
struct kvm_userspace_memory_region contents. Add const to
enforce this.
In order to eliminate the only write in __kvm_set_memory_region,
the cleaning of deleted slots is pulled up from update_memslots
to __kvm_set_memory_region.
Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp>
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
kvm_memslots provides lockdep checking. Use it consistently instead of
explicit dereferencing of kvm->memslots.
Reviewed-by: Radim Krcmar <rkrcmar@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The copy_xregs_to_kernel*() and copy_kernel_to_xregs*() functions are used
to copy FPU registers to kernel memory and vice versa.
They are never expected to fail, yet they have a return code, mostly because
that way they can share the assembly macros with the copy*user*() functions.
This error code is then silently ignored by the context switching
and other code - which made the bug in:
b8c1b8ea7b ("x86/fpu: Fix FPU state save area alignment bug")
harder to fix than necessary.
So remove the return values and check for no faults when FPU debugging
is enabled in the .config.
This improves the eagerfpu context switching fast path by a couple of
instructions, when FPU debugging is disabled:
ffffffff810407fa: 89 c2 mov %eax,%edx
ffffffff810407fc: 48 0f ae 2f xrstor64 (%rdi)
ffffffff81040800: 31 c0 xor %eax,%eax
-ffffffff81040802: eb 0a jmp ffffffff8104080e <__switch_to+0x321>
+ffffffff81040802: eb 16 jmp ffffffff8104081a <__switch_to+0x32d>
ffffffff81040804: 31 c0 xor %eax,%eax
ffffffff81040806: 48 0f ae 8b c0 05 00 fxrstor64 0x5c0(%rbx)
ffffffff8104080d: 00
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's a confusing aspect of how xstate_fault() constraints are
handled by the FPU register/memory copying functions in
fpu/internal.h: they use "0" (0) to signal that the asm code
will not always set 'err' to a valid value.
But 'err' is already initialized to 0 in C code, which is duplicated
by the asm() constraint. Should the initialization value ever be
changed, it might become subtly inconsistent with the not too clear
asm() constraint.
Use 'err' as the value of the input variable instead, to clarify
this all.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two problems with xstate_fault handling:
- The xstate_fault() macro takes an argument, but that's
propagated into the assembly named label as well. This
is technically correct currently but might result in
failures if anytime a more complex argument is used.
So use a separate '_err' name instead for the label.
- All the xstate_fault() using functions have an error
variable named 'err', which is an output variable to
the asm() they are using. The problem is, it's not always
set by the asm(), in which case the compiler might
optimize out its initialization, so that the C variable
'err' might become corrupted after the asm() - confusing
anyone who tries to take advantage of this variable
after the asm(). Mark it an input variable as well.
This is a latent bug currently, but an upcoming debug
patch will make use of 'err'.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the xstate code was probably first copied from the fxregs code,
hence it carried over the 'fx' naming for the state pointer variable.
But this is slightly confusing, as we usually on call the (legacy)
MMX/SSE state 'fx', both in data structures and in the functions
build around FXSAVE/FXRSTOR.
So rename it to 'xstate' to make it more apparent what it is related to.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove obsolete comment about __init limitations: in the new code there aren't any.
Also standardize the comment style in the function while at it.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All the other register<-> memory copying functions are defined
in fpu/internal.h, so move the xstate variants there too.
Beyond being more consistent, this also allows FPU debugging
checks to be added to them. (Because they can now use the
macros defined in fpu/internal.h.)
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On most configs task-struct is cache line aligned, which makes
the XSAVE area's 64-byte required alignment work out fine.
But on some .config's task_struct is aligned only to 16 bytes
(enforced by ARCH_MIN_TASKALIGN), which makes things like
fpu__copy() (that XSAVEOPT uses) not work so well.
I broke this in:
7366ed771f ("x86/fpu: Simplify FPU handling by embedding the fpstate in task_struct (again)")
which embedded the fpstate in the task_struct.
The alignment requirements of the FPU code were originally present
in ARCH_MIN_TASKALIGN, which still has a value of 16, which was the
alignment requirement of the FPU state area prior XSAVE. But this
link was not documented (and not required) and the link got lost
when the FPU state area was made dynamic years ago.
With XSAVEOPT the minimum alignment requirment went up to 64 bytes,
and the embedding of the FPU state area in task_struct exposed it
again - and '16' was not increased to '64'.
So fix this bug, but also try to address the underlying lost link
of information that made it easier to happen:
- document ARCH_MIN_TASKALIGN a bit better
- use alignof() to recover the current alignment requirements.
This would work in the future as well, should the alignment
requirements go up to 128 bytes with things like AVX512.
( We should probably also use the vSMP alignment rules for all
of x86, but that's for another patch. )
Reported-by: Peter Zijlstra <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x86 has variable length encoding. x86 JIT compiler is trying
to pick the shortest encoding for given bpf instruction.
While doing so the jump targets are changing, so JIT is doing
multiple passes over the program. Typical program needs 3 passes.
Some very short programs converge with 2 passes. Large programs
may need 4 or 5. But specially crafted bpf programs may hit the
pass limit and if the program converges on the last iteration
the JIT compiler will be producing an image full of 'int 3' insns.
Fix this corner case by doing final iteration over bpf program.
Fixes: 0a14842f5a ("net: filter: Just In Time compiler for x86-64")
Reported-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Tested-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch splits CONFIG_I8K compile option to SENSORS_DELL_SMM and CONFIG_I8K.
Option SENSORS_DELL_SMM is now used to enable compilation of dell-smm-hwmon
driver and old CONFIG_I8K option to enable /proc/i8k interface in driver.
So this change allows to compile dell-smm-hwmon driver without legacy /proc/i8k
interface which is needed only for old Dell Inspirion models or for userspace
i8kutils package.
For backward compatibility when CONFIG_I8K is enabled then also SENSORS_DELL_SMM
is enabled and so driver dell-smm-hwmon (with /proc/i8k) is compiled.
Signed-off-by: Pali Rohár <pali.rohar@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
The early_idt_handlers asm code generates an array of entry
points spaced nine bytes apart. It's not really clear from that
code or from the places that reference it what's going on, and
the code only works in the first place because GAS never
generates two-byte JMP instructions when jumping to global
labels.
Clean up the code to generate the correct array stride (member size)
explicitly. This should be considerably more robust against
screw-ups, as GAS will warn if a .fill directive has a negative
count. Using '. =' to advance would have been even more robust
(it would generate an actual error if it tried to move
backwards), but it would pad with nulls, confusing anyone who
tries to disassemble the code. The new scheme should be much
clearer to future readers.
While we're at it, improve the comments and rename the array and
common code.
Binutils may start relaxing jumps to non-weak labels. If so,
this change will fix our build, and we may need to backport this
change.
Before, on x86_64:
0000000000000000 <early_idt_handlers>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 00 00 00 00 jmpq 9 <early_idt_handlers+0x9>
5: R_X86_64_PC32 early_idt_handler-0x4
...
48: 66 90 xchg %ax,%ax
4a: 6a 08 pushq $0x8
4c: e9 00 00 00 00 jmpq 51 <early_idt_handlers+0x51>
4d: R_X86_64_PC32 early_idt_handler-0x4
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: e9 00 00 00 00 jmpq 120 <early_idt_handler>
11c: R_X86_64_PC32 early_idt_handler-0x4
After:
0000000000000000 <early_idt_handler_array>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 14 01 00 00 jmpq 11d <early_idt_handler_common>
...
48: 6a 08 pushq $0x8
4a: e9 d1 00 00 00 jmpq 120 <early_idt_handler_common>
4f: cc int3
50: cc int3
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: eb 03 jmp 120 <early_idt_handler_common>
11d: cc int3
11e: cc int3
11f: cc int3
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Binutils <binutils@sourceware.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
drivers/net/ethernet/cadence/macb.c
drivers/net/phy/phy.c
include/linux/skbuff.h
net/ipv4/tcp.c
net/switchdev/switchdev.c
Switchdev was a case of RTNH_H_{EXTERNAL --> OFFLOAD}
renaming overlapping with net-next changes of various
sorts.
phy.c was a case of two changes, one adding a local
variable to a function whilst the second was removing
one.
tcp.c overlapped a deadlock fix with the addition of new tcp_info
statistic values.
macb.c involved the addition of two zyncq device entries.
skbuff.h involved adding back ipv4_daddr to nf_bridge_info
whilst net-next changes put two other existing members of
that struct into a union.
Signed-off-by: David S. Miller <davem@davemloft.net>
For some CPU models I broke the AVX2 feature detection in:
7bc371faa9 ("x86/fpu, crypto x86/camellia_aesni_avx2: Simplify the camellia_aesni_init() xfeature checks")
534ff06e39 ("x86/fpu, crypto x86/serpent_avx2: Simplify the init() xfeature checks")
... because I did not realize that it's possible for a CPU to support
the xstate necessary for AVX2 execution (XSTATE_YMM), but not have
the AVX2 instructions themselves.
Restore the necessary CPUID checks as well.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
and on x86. The rest is fixes for bugs with newer Intel
processors.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVXGLvAAoJEL/70l94x66DxOAH/270hZu3Rt0Tt04LYs0uy1B3
6a91Hs4YsYALe0j6IVZUQ2ngO+N4DPsw/Lusutd82jWX13UG221w1rbUtUpNF46r
bPf7Eh4AdGhNehGtkllRKrBmZEDkZVngZWsftFvzA+rmbV/HVzFU5SfuPdhzYAL5
WpQTzou0w63c3Gh6hymLsq/x/zUScMRoFdyjIEJTRN+AOnnro9I/nj4O83OEF8uv
Hp4VZ7TDG55xTloiC5WSimTCWPIZFDMiuim1iFo/OOOIGjfjdM8IBKwer4zIXa/S
VD71lYu267yxIabYpbEOjd+dcZ5myJhy4ePWmWHZczsOeklbvMouWMD7/1U2Gpg=
=x0LU
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"This includes a fix for two oopses, one on PPC and on x86.
The rest is fixes for bugs with newer Intel processors"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
kvm/fpu: Enable eager restore kvm FPU for MPX
Revert "KVM: x86: drop fpu_activate hook"
kvm: fix crash in kvm_vcpu_reload_apic_access_page
KVM: MMU: fix SMAP virtualization
KVM: MMU: fix CR4.SMEP=1, CR0.WP=0 with shadow pages
KVM: MMU: fix smap permission check
KVM: PPC: Book3S HV: Fix list traversal in error case
bpf_tail_call() arguments:
ctx - context pointer
jmp_table - one of BPF_MAP_TYPE_PROG_ARRAY maps used as the jump table
index - index in the jump table
In this implementation x64 JIT bypasses stack unwind and jumps into the
callee program after prologue, so the callee program reuses the same stack.
The logic can be roughly expressed in C like:
u32 tail_call_cnt;
void *jumptable[2] = { &&label1, &&label2 };
int bpf_prog1(void *ctx)
{
label1:
...
}
int bpf_prog2(void *ctx)
{
label2:
...
}
int bpf_prog1(void *ctx)
{
...
if (tail_call_cnt++ < MAX_TAIL_CALL_CNT)
goto *jumptable[index]; ... and pass my 'ctx' to callee ...
... fall through if no entry in jumptable ...
}
Note that 'skip current program epilogue and next program prologue' is
an optimization. Other JITs don't have to do it the same way.
>From safety point of view it's valid as well, since programs always
initialize the stack before use, so any residue in the stack left by
the current program is not going be read. The same verifier checks are
done for the calls from the kernel into all bpf programs.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The MPX feature requires eager KVM FPU restore support. We have verified
that MPX cannot work correctly with the current lazy KVM FPU restore
mechanism. Eager KVM FPU restore should be enabled if the MPX feature is
exposed to VM.
Signed-off-by: Yang Zhang <yang.z.zhang@intel.com>
Signed-off-by: Liang Li <liang.z.li@intel.com>
[Also activate the FPU on AMD processors. - Paolo]
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Conflicts:
arch/x86/kernel/i387.c
This commit is conflicting:
e88221c50c ("x86/fpu: Disable XSAVES* support for now")
These functions changed a lot, move the quirk to arch/x86/kernel/fpu/init.c's
fpu__init_system_xstate_size_legacy().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The kernel's handling of 'compacted' xsave state layout is buggy:
http://marc.info/?l=linux-kernel&m=142967852317199
I don't have such a system, and the description there is vague, but
from extrapolation I guess that there were two kinds of bugs
observed:
- boot crashes, due to size calculations being wrong and the dynamic
allocation allocating a too small xstate area. (This is now fixed
in the new FPU code - but still present in stable kernels.)
- FPU state corruption and ABI breakage: if signal handlers try to
change the FPU state in standard format, which then the kernel
tries to restore in the compacted format.
These breakages are scary, but they only occur on a small number of
systems that have XSAVES* CPU support. Yet we have had XSAVES support
in the upstream kernel for a large number of stable kernel releases,
and the fixes are involved and unproven.
So do the safe resolution first: disable XSAVES* support and only
use the standard xstate format. This makes the code work and is
easy to backport.
On top of this we can work on enabling (and testing!) proper
compacted format support, without backporting pressure, on top of the
new, cleaned up FPU code.
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This makes the functions kvm_guest_cpu_init and kvm_init_debugfs
static now due to having no external callers outside their
declarations in the file, kvm.c.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explain the functions and also standardize their style
and naming.
No change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We had a number of FPU init related boot option handlers
in arch/x86/kernel/cpu/common.c - move them over into
arch/x86/kernel/fpu/init.c to have them all in a
single place.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While looking at xstate.h it took me some time to realize that
'xstate_fault' uses 'err' as a silent parameter. This is not
obvious at the call site, at all.
Make it an explicit macro argument, so that the syntactic
connection is easier to see. Also explain xstate_fault()
a bit.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVWh3TAAoJEHm+PkMAQRiG/kwH/2c9irodp2+M9OUnX2bfsBb6
LnChiDpvkF5BB8jhP6d/XmvPp4NJzAbTxByhjdfb2E2HkorCUHCOIn2tI1TE2pUs
2qjkOVH+XCzoV0goGtQjzK1ht8f2IrtlDiEjyRekK5cJHzhggb22QPtWL4npyd0O
reDmG2jsRaF9POr9uLSFEv4CEnkksmRLUU0vuQX0TZeCJ41O7TXrkN/wKrLZ5mj4
IWpqXQaSlrffq/T5HnVbXBxk3/T8QmhrIoppiMpV1mUVj0uTqlFRNi5qwT2Nit1h
FVljWI4+WgOk3bf7fUlp+ahopjkTgu+GuXkiRP/pdgWNQO0cxCWSAzSndAlIIAE=
=uOoJ
-----END PGP SIGNATURE-----
Backmerge v4.1-rc4 into into drm-next
We picked up a silent conflict in amdkfd with drm-fixes and drm-next,
backmerge v4.1-rc5 and fix the conflicts
Signed-off-by: Dave Airlie <airlied@redhat.com>
Conflicts:
drivers/gpu/drm/drm_irq.c
gfn_to_pfn_async is used in just one place, and because of x86-specific
treatment that place will need to look at the memory slot. Hence inline
it into try_async_pf and export __gfn_to_pfn_memslot.
The patch also switches the subsequent call to gfn_to_pfn_prot to use
__gfn_to_pfn_memslot. This is a small optimization. Finally, remove
the now-unused async argument of __gfn_to_pfn.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
CR0.CD and CR0.NW are not used by shadow page table so that need
not adjust mmu if these two bit are changed
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, whenever guest MTRR registers are changed
kvm_mmu_reset_context is called to switch to the new root shadow page
table, however, it's useless since:
1) the cache type is not cached into shadow page's attribute so that
the original root shadow page will be reused
2) the cache type is set on the last spte, that means we should sync
the last sptes when MTRR is changed
This patch fixs this issue by drop all the spte in the gfn range which
is being updated by MTRR
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are some bugs in current get_mtrr_type();
1: bit 1 of mtrr_state->enabled is corresponding bit 11 of
IA32_MTRR_DEF_TYPE MSR which completely control MTRR's enablement
that means other bits are ignored if it is cleared
2: the fixed MTRR ranges are controlled by bit 0 of
mtrr_state->enabled (bit 10 of IA32_MTRR_DEF_TYPE)
3: if MTRR is disabled, UC is applied to all of physical memory rather
than mtrr_state->def_type
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Reviewed-by: Wanpeng Li <wanpeng.li@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Split kvm_unmap_rmapp and introduce kvm_zap_rmapp which will be used in the
later patch
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
slot_handle_level and its helper functions are ready now, use them to
clean up the code
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
There are several places walking all rmaps for the memslot so that
introduce common functions to cleanup the code
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's used to abstract the code from kvm_handle_hva_range and it will be
used by later patch
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
It's used to walk all the sptes on the rmap to clean up the
code
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This reverts commit ff7bbb9c6a.
Sasha Levin is seeing odd jump in time values during boot of a KVM guest:
[...]
[ 0.000000] tsc: Detected 2260.998 MHz processor
[3376355.247558] Calibrating delay loop (skipped) preset value..
[...]
and bisected them to this commit.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM may turn a user page to a kernel page when kernel writes a readonly
user page if CR0.WP = 1. This shadow page entry will be reused after
SMAP is enabled so that kernel is allowed to access this user page
Fix it by setting SMAP && !CR0.WP into shadow page's role and reset mmu
once CR4.SMAP is updated
Signed-off-by: Xiao Guangrong <guangrong.xiao@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When a REP-string is executed in 64-bit mode with an address-size prefix,
ECX/EDI/ESI are used as counter and pointers. When ECX is initially zero, Intel
CPUs clear the high 32-bits of RCX, and recent Intel CPUs update the high bits
of the pointers in MOVS/STOS. This behavior is specific to Intel according to
few experiments.
As one may guess, this is an undocumented behavior. Yet, it is observable in
the guest, since at least VMX traps REP-INS/OUTS even when ECX=0. Note that
VMware appears to get it right. The behavior can be observed using the
following code:
#include <stdio.h>
#define LOW_MASK (0xffffffff00000000ull)
#define ALL_MASK (0xffffffffffffffffull)
#define TEST(opcode) \
do { \
asm volatile(".byte 0xf2 \n\t .byte 0x67 \n\t .byte " opcode "\n\t" \
: "=S"(s), "=c"(c), "=D"(d) \
: "S"(ALL_MASK), "c"(LOW_MASK), "D"(ALL_MASK)); \
printf("opcode %s rcx=%llx rsi=%llx rdi=%llx\n", \
opcode, c, s, d); \
} while(0)
void main()
{
unsigned long long s, d, c;
iopl(3);
TEST("0x6c");
TEST("0x6d");
TEST("0x6e");
TEST("0x6f");
TEST("0xa4");
TEST("0xa5");
TEST("0xa6");
TEST("0xa7");
TEST("0xaa");
TEST("0xab");
TEST("0xae");
TEST("0xaf");
}
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When REP-string instruction is preceded with an address-size prefix,
ECX/EDI/ESI are used as the operation counter and pointers. When they are
updated, the high 32-bits of RCX/RDI/RSI are cleared, similarly to the way they
are updated on every 32-bit register operation. Fix it.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the host sets hardware breakpoints to debug the guest, and a task-switch
occurs in the guest, the architectural DR7 will not be updated. The effective
DR7 would be updated instead.
This fix puts the DR7 update during task-switch emulation, so it now uses the
standard DR setting mechanism instead of the one that was previously used. As a
bonus, the update of DR7 will now be effective for AMD as well.
Signed-off-by: Nadav Amit <namit@cs.technion.ac.il>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, we use a global vector as the Posted-Interrupts
Notification Event for all the vCPUs in the system. We need
to introduce another global vector for VT-d Posted-Interrtups,
which will be used to wakeup the sleep vCPU when an external
interrupt from a direct-assigned device happens for that vCPU.
[ tglx: Removed a gazillion of extra newlines ]
Signed-off-by: Feng Wu <feng.wu@intel.com>
Cc: jiang.liu@linux.intel.com
Link: http://lkml.kernel.org/r/1432026437-16560-4-git-send-email-feng.wu@intel.com
Suggested-by: Yang Zhang <yang.z.zhang@intel.com>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
fpu/internal.h has grown organically, with not much high level structure,
which hurts its readability.
Organize the various definitions into 5 sections:
- high level FPU state functions
- FPU/CPU feature flag helpers
- fpstate handling functions
- FPU context switching helpers
- misc helper functions
Other related changes:
- Move MXCSR_DEFAULT to fpu/types.h.
- drop the unused X87_FSW_ES define
No change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are various internal FPU state debugging checks that never
trigger in practice, but which are useful for FPU code development.
Separate these out into CONFIG_X86_DEBUG_FPU=y, and also add a
couple of new ones.
The size difference is about 0.5K of code on defconfig:
text data bss filename
15028906 2578816 1638400 vmlinux
15029430 2578816 1638400 vmlinux
( Keep this enabled by default until the new FPU code is debugged. )
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I tried to simulate an ancient CPU via this option, and
found that it still has fxsr_opt enabled, confusing the
FPU code.
Make the 'nofxsr' option also clear FXSR_OPT flag.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This cleans up the call sites and the function a bit,
and also makes it more symmetric with the other high
level FPU state handling functions.
It's still only valid for the current task, as we copy
to the FPU registers of the current CPU.
No change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that all the FPU init function call dependencies are
cleaned up we can propagate __init annotations deeper.
This shrinks the runtime size of the kernel a bit, and
also addresses a few section warnings.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So call setup_xstate_comp() from the xstate init code, not
from the generic fpu__init_system() code.
This allows us to remove the protytype from xstate.h as well.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Put MPX support into its separate high level structure, and
also replace the fixed YMM, LWP and MPX structures in
xregs_state with just reservations - their exact offsets
in the structure will depend on the CPU and no code actually
relies on those fields.
No change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current xstate code in setup_xstate_features() assumes that
the first zero bit means the end of xfeatures - but that is not
so, the SDM clearly states that an arbitrary set of xfeatures
might be enabled - and it is also clear from the description
of the compaction feature that holes are possible:
"13-6 Vol. 1MANAGING STATE USING THE XSAVE FEATURE SET
[...]
Compacted format. Each state component i (i ≥ 2) is located at a byte
offset from the base address of the XSAVE area based on the XCOMP_BV
field in the XSAVE header:
— If XCOMP_BV[i] = 0, state component i is not in the XSAVE area.
— If XCOMP_BV[i] = 1, the following items apply:
• If XCOMP_BV[j] = 0 for every j, 2 ≤ j < i, state component i is
located at a byte offset 576 from the base address of the XSAVE
area. (This item applies if i is the first bit set in bits 62:2 of
the XCOMP_BV; it implies that state component i is located at the
beginning of the extended region.)
• Otherwise, let j, 2 ≤ j < i, be the greatest value such that
XCOMP_BV[j] = 1. Then state component i is located at a byte offset
X from the location of state component j, where X is the number of
bytes required for state component j as enumerated in
CPUID.(EAX=0DH,ECX=j):EAX. (This item implies that state component i
immediately follows the preceding state component whose bit is set
in XCOMP_BV.)"
So don't assume that the first zero xfeatures bit means the end of
all xfeatures - iterate through all of them.
I'm not aware of hardware that triggers this currently.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kernel_fpu_begin() is __kernel_fpu_begin() with a preempt_disable().
Move the kernel_fpu_begin() debugging check into __kernel_fpu_begin(),
so that users of __kernel_fpu_begin() may benefit from it as well.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Document all the structures that make up 'struct fpu'.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Improve the memory layout of 'struct fpu':
- change ->fpregs_active from 'int' to 'char' - it's just a single flag
and modern x86 CPUs can do efficient byte accesses.
- pack related fields closer to each other: often 'fpu->state' will not be
touched, while the other fields will - so pack them into a group.
Also add comments to each field, describing their purpose, and add
some background information about lazy restores.
Also fix an obsolete, lazy switching related comment in fpu_copy()'s description.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So much of fpu/core.c is the regset code, but it just obscures the generic
FPU state machine logic. Factor out the regset code into fpu/regset.c, where
it can be read in isolation.
This affects one API: fpu__activate_stopped() has to be made available
from the core to fpu/regset.c.
No change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu/xstate.c has a lot of generic FPU signal frame handling routines,
move them into a separate file: fpu/signal.c.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move restore_init_xstate() next to its sole caller.
Also rename it to copy_init_fpstate_to_fpregs() and add
some comments about what it does.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the handling of init_xstate_ctx has a layering violation: both
'struct xsave_struct' and 'union thread_xstate' have a
'struct i387_fxsave_struct' member:
xsave_struct::i387
thread_xstate::fxsave
The handling of init_xstate_ctx is generic, it is used on all
CPUs, with or without XSAVE instruction. So it's confusing how
the generic code passes around and handles an XSAVE specific
format.
What we really want is for init_xstate_ctx to be a proper
fpstate and we use its ::fxsave and ::xsave members, as
appropriate.
Since the xsave_struct::i387 and thread_xstate::fxsave aliases
each other this is not a functional problem.
So implement this, and move init_xstate_ctx to the generic FPU
code in the process.
Also, since init_xstate_ctx is not XSAVE specific anymore,
rename it to init_fpstate, and mark it __read_mostly,
because it's only modified once during bootup, and used
as a reference fpstate later on.
There's no change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpstate_init() only uses fpu->state, so pass that in to it.
This enables the cleanup we will do in the next patch.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Factor out the FPU error code handling code from traps.c and fpu/internal.h
and move them close to each other.
Also convert the helper functions to 'struct fpu *', which further simplifies
them.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove various boot quirks that came from the old code.
The new code is cleanly split up into per-system and per-cpu
init sequences, and system init functions are only called once.
Remove the run-once quirks.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Only a few places use the regset definitions, so factor them out.
Also fix related header dependency assumptions.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most of the FPU does not use them, so split it out and include
them in signal.c and ia32_signal.c
Also fix header file dependency assumption in fpu/core.c.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move them to their only user. This makes the code easier to read,
the header is less cluttered, and it also speeds up the build a bit.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With recent cleanups and fixes the fpu__reset() and fpu__clear()
functions have become almost identical in functionality: the only
difference is that fpu__reset() assumed that the fpstate
was already active in the eagerfpu case, while fpu__clear()
activated it if it was inactive.
This distinction almost never matters, the only case where such
fpstate activation happens if if the init thread (PID 1) gets exec()-ed
for the first time.
So keep fpu__clear() and change all fpu__reset() uses to
fpu__clear() to simpify the logic.
( In a later patch we'll further simplify fpu__clear() by making
sure that all contexts it is called on are already active. )
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Consolidate more signal frame related functions:
text data bss dec filename
14108070 2575280 1634304 18317654 vmlinux.before
14107944 2575344 1634304 18317592 vmlinux.after
Also, while moving it, rename alloc_mathframe() to fpu__alloc_mathframe().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
restore_xstate_sig() is a misnomer: it's not limited to 'xstate' at all,
it is the high level 'restore FPU state from a signal frame' function
that works with all legacy FPU formats as well.
Rename it (and its helper) accordingly, and also move it to the
fpu__*() namespace.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do it like all other high level FPU state handling functions: they
only know about struct fpu, not about the task.
(Also remove a dead prototype while at it.)
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The fpu__*() methods are closely related, but they are defined
in scattered places within the FPU code.
Concentrate them, and also uninline fpu__save(), fpu__drop()
and fpu__reset() to save about 5K of kernel text on 64-bit kernels:
text data bss dec filename
14113063 2575280 1634304 18322647 vmlinux.before
14108070 2575280 1634304 18317654 vmlinux.after
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu_restore_checking() is a helper function of restore_fpu_checking(),
but this is not apparent from the naming.
Both copy fpstate contents to fpregs, while the fuller variant does
a full copy without leaking information.
So rename them to:
copy_fpstate_to_fpregs()
__copy_fpstate_to_fpregs()
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
drop_fpu() and fpu_reset_state() are similar in functionality
and in scope, yet this is not apparent from their names.
drop_fpu() deactivates FPU contents (both the fpregs and the fpstate),
but leaves register contents intact in the eager-FPU case, mostly as an
optimization. It disables fpregs in the lazy FPU case. The drop_fpu()
method can be used to destroy FPU state in an optimized way, when we
know that a new state will be loaded before user-space might see
any remains of the old FPU state:
- such as in sys_exit()'s exit_thread() where we know this task
won't execute any user-space instructions anymore and the
next context switch cleans up the FPU. The old FPU state
might still be around in the eagerfpu case but won't be
saved.
- in __restore_xstate_sig(), where we use drop_fpu() before
copying a new state into the fpstate and activating that one.
No user-pace instructions can execute between those steps.
- in sys_execve()'s fpu__clear(): there we use drop_fpu() in
the !eagerfpu case, where it's equivalent to a full reinit.
fpu_reset_state() is a stronger version of drop_fpu(): both in
the eagerfpu and the lazy-FPU case it guarantees that fpregs
are reinitialized to init state. This method is used in cases
where we need a full reset:
- handle_signal() uses fpu_reset_state() to reset the FPU state
to init before executing a user-space signal handler. While we
have already saved the original FPU state at this point, and
always restore the original state, the signal handling code
still has to do this reinit, because signals may interrupt
any user-space instruction, and the FPU might be in various
intermediate states (such as an unbalanced x87 stack) that is
not immediately usable for general C signal handler code.
- __restore_xstate_sig() uses fpu_reset_state() when the signal
frame has no FP context. Since the signal handler may have
modified the FPU state, it gets reset back to init state.
- in another branch __restore_xstate_sig() uses fpu_reset_state()
to handle a restoration error: when restore_user_xstate() fails
to restore FPU state and we might have inconsistent FPU data,
fpu_reset_state() is used to reset it back to a known good
state.
- __kernel_fpu_end() uses fpu_reset_state() in an error branch.
This is in a 'must not trigger' error branch, so on bug-free
kernels this never triggers.
- fpu__restore() uses fpu_reset_state() in an error path
as well: if the fpstate was set up with invalid FPU state
(via ptrace or via a signal handler), then it's reset back
to init state.
- likewise, the scheduler's switch_fpu_finish() uses it in a
restoration error path too.
Move both drop_fpu() and fpu_reset_state() to the fpu__*() namespace
and harmonize their naming with their function:
fpu__drop()
fpu__reset()
This clearly shows that both methods operate on the full state of the
FPU, just like fpu__restore().
Also add comments to explain what each function does.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We'd like to use xsave_state() earlier, but its SYSTEM_BOOTING check
is too imprecise.
The real condition that xsave_state() would like to check is whether
alternative XSAVE instructions were patched into the kernel image
already.
Add such a (read-mostly) debug flag and use it in xsave_state().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
FPU fpregs do not get initialized during bootup on secondary CPUs,
on non-xsave capable CPUs.
For example on one of my systems, the secondary CPU has this FPU
state on bootup:
x86: Booting SMP configuration:
.... node #0, CPUs: #1
x86/fpu ######################
x86/fpu # FPU register dump on CPU#1:
x86/fpu # ... CWD: ffff0040
x86/fpu # ... SWD: ffff0000
x86/fpu # ... TWD: ffff555a
x86/fpu # ... FIP: 00000000
x86/fpu # ... FCS: 00000000
x86/fpu # ... FOO: 00000000
x86/fpu # ... FOS: ffff0000
x86/fpu # ... FP0: 02 57 00 00 00 00 00 00 ff ff
x86/fpu # ... FP1: 1b e2 00 00 00 00 00 00 ff ff
x86/fpu # ... FP2: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP3: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP4: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP5: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP6: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP7: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... SW: dadadada
x86/fpu ######################
Note how CWD and TWD are off their usual init state (0x037f and 0xffff),
and how FP0 and FP1 has non-zero content.
This is normally not a problem, because any user-space FPU state
is initalized properly - but it can complicate the use of FPU
instructions in kernel code via kernel_fpu_begin()/end(): if
the FPU using code does not initialize registers itself, it
might generate spurious exceptions depending on which CPU it
executes on.
Fix this by initializing the x87 state via the FNINIT instruction.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename this function in line with the new FPU nomenclature.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So this function still had ancient language about 'saving current
math information' - but we haven't been doing lazy FPU saves for
quite some time, we are doing lazy FPU restores.
Also remove IRQ13 related comment, which we don't support anymore
either.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the naming in line with existing names, so that we now have:
copy_fpregs_to_fpstate()
copy_fpstate_to_sigframe()
copy_fpregs_to_sigframe()
... where each function does what its name suggests.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Standardize the naming of save_xstate_sig() by renaming it to
copy_fpstate_to_sigframe(): this tells us at a glance that
the function copies an FPU fpstate to a signal frame.
This naming also follows the naming of copy_fpregs_to_fpstate().
Don't put 'xstate' into the name: since this is a generic name,
it's expected that the function is able to handle xstate frames
as well, beyond legacy frames.
xstate used to be the odd case in the x86 FPU code - now it's the
common case.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently fpstate_sanitize_xstate() has a task_struct input parameter,
but it only uses the fpu structure from it - so pass in a 'struct fpu'
pointer only and update all call sites.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the extra layer of __fpstate_sanitize_xstate():
if (!use_xsaveopt())
return;
__fpstate_sanitize_xstate(tsk);
and move the check for use_xsaveopt() into fpstate_sanitize_xstate().
In general we optimize for the presence of CPU features, not for
the absence of them. Furthermore there's little point in this inlining,
as the call sites are not super hot code paths.
Doing this uninlining shrinks the code a bit:
text data bss dec hex filename
14108751 2573624 1634304 18316679 1177d87 vmlinux.before
14108627 2573624 1634304 18316555 1177d0b vmlinux.after
Also remove a pointless '!fx' check from fpstate_sanitize_xstate().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the sanitize_i387_state() function has the following purpose:
on CPUs that support optimized xstate saving instructions, an
FPU fpstate might end up having partially uninitialized data.
This function initializes that data.
Note that the function name is a misnomer and confusing on two levels,
not only is it not i387 specific at all, but it is the exact opposite:
it only matters on xstate CPUs.
So rename sanitize_i387_state() and __sanitize_i387_state() to
fpstate_sanitize_xstate() and __fpstate_sanitize_xstate(),
to clearly express the purpose and usage of the function.
We'll further clean up this function in the next patch.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that all FPU internals using drivers are converted to public APIs,
move xcr.h's definitions into fpu/internal.h and remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This file only uses the public FPU APIs, so remove the xcr.h, fpu/xstate.h
and fpu/internal.h headers and add the fpu/api.h include.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit.
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the new 'cpu_has_xfeatures()' function to query AVX CPU support.
This has the following advantages to the driver:
- Decouples the driver from FPU internals: it's now only using <asm/fpu/api.h>.
- Removes detection complexity from the driver, no more raw XGETBV instruction
- Shrinks the code a bit:
text data bss dec hex filename
2128 2896 0 5024 13a0 camellia_aesni_avx_glue.o.before
2067 2896 0 4963 1363 camellia_aesni_avx_glue.o.after
- Standardizes feature name error message printouts across drivers
There are also advantages to the x86 FPU code: once all drivers
are decoupled from internals we can move them out of common
headers and we'll also be able to remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So xsave.h is an internal header that FPU using drivers commonly include,
to get access to the xstate feature names, amongst other things.
Move these type definitions to fpu/fpu.h to allow simplification
of FPU using driver code.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Transform the xstate masks into an enumerated list of xfeature bits.
This removes the hard coding of XFEATURES_NR_MAX.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We do a boot time printout of xfeatures in print_xstate_features(),
simplify this code to make use of the recently introduced cpu_has_xfeature()
method.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A lot of FPU using driver code is querying complex CPU features to be
able to figure out whether a given set of xstate features is supported
by the CPU or not.
Introduce a simplified API function that can be used on any CPU type
to get this information. Also add an error string return pointer,
so that the driver can print a meaningful error message with a
standardized feature name.
Also mark xfeatures_mask as __read_only.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'xsave' is an x86 instruction name to most people - but xsave.h is
about a lot more than just the XSAVE instruction: it includes
definitions and support, both internal and external, related to
xstate and xfeatures support.
As a first step in cleaning up the various xstate uses rename this
header to 'fpu/xstate.h' to better reflect what this header file
is about.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current fpu_copy() code on lazy switching CPUs always saves
into the current fpstate and then copies it over into the child
context:
preempt_disable();
if (!copy_fpregs_to_fpstate(src_fpu))
fpregs_deactivate(src_fpu);
preempt_enable();
memcpy(&dst_fpu->state, &src_fpu->state, xstate_size);
That memcpy() can be avoided on all lazy switching setups except
really old FNSAVE-only systems: change fpu_copy() to directly save
into the child context, for both the lazy and the eager context
switching case.
Note that we still have to do a memcpy() back into the parent
context in the FNSAVE case, but this won't be executed on the
majority of x86 systems that got built in the last 10 years or so.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Optimize fpu_copy() a bit by expanding the ->fpstate_active == 1
portion of fpu__save() into it.
( The main purpose of this change is to enable another, larger
optimization that will be done in the next patch. )
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So fpu__save() does this currently:
copy_fpregs_to_fpstate(fpu);
if (!use_eager_fpu())
fpregs_deactivate(fpu);
... which deactivates the FPU on lazy switching systems unconditionally.
Both usecases of fpu__save() use this function to save the
FPU state into a fpstate: fork()/clone() and math error signal handling.
The unconditional disabling of FPU registers in the lazy switching
case is probably a mistaken conversion of old FNSAVE code (that had
to disable FPU registers).
So speed up this code by only disabling FPU registers when absolutely
necessary: when indicated by the copy_fpregs_to_fpstate() return
code:
if (!copy_fpregs_to_fpstate(fpu))
fpregs_deactivate(fpu);
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current implementation of __save_fpu():
if (use_xsave()) {
xsave_state(&fpu->state.xsave);
} else {
fpu_fxsave(fpu);
}
Is actually a simplified version of copy_fpregs_to_fpstate(),
if use_eager_fpu() is true.
But all call sites of __save_fpu() call it only it when use_eager_fpu()
is true.
So we can eliminate __save_fpu() altogether and use the standard
copy_fpregs_to_fpstate() function. This cleans up the code
by making it use fewer variants of FPU register saving.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__save_fpu() has this pattern:
if (unlikely(system_state == SYSTEM_BOOTING))
xsave_state_booting(&fpu->state.xsave);
else
xsave_state(&fpu->state.xsave);
... but it does not actually get called during system bootup.
So remove the complication and always call xsave_state().
To make sure this assumption is correct, add a WARN_ONCE()
debug check to xsave_state().
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have repeat patterns of:
if (!use_eager_fpu())
clts();
... to activate FPU registers, and:
if (!use_eager_fpu())
stts();
... to deactivate them.
Encapsulate these in:
__fpregs_activate_hw();
__fpregs_activate_hw();
and use them accordingly.
Doing this synchronizes the idiom with the fpu->fpregs_active
software-flag's handling functions, creating clear patterns of:
__fpregs_activate_hw();
__fpregs_activate(fpu);
etc., which improves readability.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In line with the fpstate_activate() change, name
fpu__unlazy_stopped() in a similar fashion as well: its purpose
is to make the fpstate of a stopped task the current and active FPU
context, which may require unlazying and initialization.
The unlazying is just part of the job, the main concept is to make
the fpstate active.
Also clarify the function's description to clarify its exact
usage and the background behind it all.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that fpstate_init_curr() is not doing implicit allocations
anymore, almost all uses of it involve a very simple pattern:
if (!fpu->fpstate_active)
fpstate_init_curr(fpu);
which is basically activating the FPU fpstate if it was not active
before.
So propagate the check into the function itself, and rename the
function according to its new purpose:
fpu__activate_curr(fpu);
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that fpstate_init() cannot fail the error return of fx_init()
has lost its purpose. Eliminate the error return and propagate this
change to all callers.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that FPU contexts are always allocated, fpu__unlazy_stopped()
cannot fail. Remove its error return and propagate the changes to
the callers.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there are no FPU context allocations, rename fpstate_alloc_init()
to fpstate_init_curr(), to signal that it initializes the fpstate and
marks it active, for the current task.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the failure code and propagate this down to callers.
Note that this function still has an 'init' aspect, which must be
called.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we always allocate the FPU context as part of task_struct there's
no need for separate allocations - remove them and their primary failure
handling code.
( Note that there's still secondary error codes that have become superfluous,
those will be removed in separate patches. )
Move the somewhat misplaced setup_xstate_comp() call to the core.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So 6 years ago we made the FPU fpstate dynamically allocated:
aa283f4927 ("x86, fpu: lazy allocation of FPU area - v5")
61c4628b53 ("x86, fpu: split FPU state from task struct - v5")
In hindsight this was a mistake:
- it complicated context allocation failure handling, such as:
/* kthread execs. TODO: cleanup this horror. */
if (WARN_ON(fpstate_alloc_init(fpu)))
force_sig(SIGKILL, tsk);
- it caused us to enable irqs in fpu__restore():
local_irq_enable();
/*
* does a slab alloc which can sleep
*/
if (fpstate_alloc_init(fpu)) {
/*
* ran out of memory!
*/
do_group_exit(SIGKILL);
return;
}
local_irq_disable();
- it (slightly) slowed down task creation/destruction by adding
slab allocation/free pattens.
- it made access to context contents (slightly) slower by adding
one more pointer dereference.
The motivation for the dynamic allocation was two-fold:
- reduce memory consumption by non-FPU tasks
- allocate and handle only the necessary amount of context for
various XSAVE processors that have varying hardware frame
sizes.
These days, with glibc using SSE memcpy by default and GCC optimizing
for SSE/AVX by default, the scope of FPU using apps on an x86 system is
much larger than it was 6 years ago.
For example on a freshly installed Fedora 21 desktop system, with a
recent kernel, all non-kthread tasks have used the FPU shortly after
bootup.
Also, even modern embedded x86 CPUs try to support the latest vector
instruction set - so they'll too often use the larger xstate frame
sizes.
So remove the dynamic allocation complication by embedding the FPU
fpstate in task_struct again. This should make the FPU a lot more
accessible to all sorts of atomic contexts.
We could still optimize for the xstate frame size in the future,
by moving the state structure to the last element of task_struct,
and allocating only a part of that.
This change is kept minimal by still keeping the ctx_alloc()/free()
routines (that now do nothing substantial) - we'll remove them in
the following patches.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So we have the following ancient code in copy_fpregs_to_fpstate():
if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES)) {
asm volatile("fnclex");
goto drop_fpregs;
}
which clears pending FPU exceptions and then drops registers, which
causes the next FP instruction of the saved context to re-load the
saved FPU state, with all pending exceptions marked properly, and
will re-start the exception handling mechanism in the hardware.
Since FPU exceptions are always issued on instruction boundaries,
in particular on the next FP instruction following the exception
generating instruction, there's no fear of getting an FP exception
asynchronously.
They were truly asynchronous back in the IRQ13 days, when the FPU was
a weird and expensive co-processor that did its own processing, and we
had to synchronize with them, but that code is not working anymore:
we don't have IRQ13 mapped in the IDT anymore.
With the introduction of optimized XSAVE support there's a new
complication: if the xstate features bit indicates that a particular
state component is unused (in 'init state'), then the hardware does
not guarantee that the XSAVE (et al) instruction keeps the underlying
FPU state image in memory valid and current. In practice this means
that the hardware won't write it, and the exceptions flag in the
state might be an older version, with it still being set. This
meant that we had to check the xfeatures flag as well, adding
another memory load and branch to a critical hot path of the scheduler.
So optimize all this by removing both the old quirk and the new check,
and straight-line optimizing the most common cases with likely()
hints. Quite a bit of code gets removed this way:
arch/x86/kernel/process_64.o:
text data bss dec filename
5484 8 0 5492 process_64.o.before
5416 8 0 5424 process_64.o.after
Now there's also a chance that some weird behavior or erratum was
masked by our IRQ13 handling quirk (or that I misunderstood the
nature of the quirk), and that this change triggers some badness.
There's no real good way to protect against that possibility other
than keeping this change well isolated, well commented and well
bisectable. If you bisect a weird (or not so weird) breakage to
this commit then please let us know!
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So fpu_save_init() is a historic name that got its name when the only
way the FPU state was FNSAVE, which cleared (well, destroyed) the FPU
state after saving it.
Nowadays the name is misleading, because ever since the introduction of
FXSAVE (and more modern FPU saving instructions) the 'we need to reload
the FPU state' part is only true if there's a pending FPU exception [*],
which is almost never the case.
So rename it to copy_fpregs_to_fpstate() to make it clear what's
happening. Also add a few comments about why we cannot keep registers
in certain cases.
Also clean up the control flow a bit, to make it more apparent when
we are dropping/keeping FP registers, and to optimize the common
case (of keeping fpregs) some more.
[*] Probably not true anymore, modern instructions always leave the FPU
state intact, even if exceptions are pending: because pending FP
exceptions are posted on the next FP instruction, not asynchronously.
They were truly asynchronous back in the IRQ13 case, and we had to
synchronize with them, but that code is not working anymore: we don't
have IRQ13 mapped in the IDT anymore.
But a cleanup patch is obviously not the place to change subtle behavior.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Especially the irq_ts_save() function is pretty bloaty, generating
over a dozen instructions, so uninline them.
Even though the API is used rarely, the space savings are measurable:
text data bss dec hex filename
13331995 2572920 1634304 17539219 10ba093 vmlinux.before
13331739 2572920 1634304 17538963 10b9f93 vmlinux.after
( This also allows the removal of an include file inclusion from fpu/api.h,
speeding up the kernel build slightly. )
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a number of FPU internal function prototypes and an inline function
in fpu/api.h, mostly placed so historically as the code grew over the years.
Move them over into fpu/internal.h where they belong. (Add sched.h include
to stackprotector.h which incorrectly relied on getting it from fpu/api.h.)
fpu/api.h is now a pure file that only contains FPU APIs intended for driver
use.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both inline functions call an inline function unconditionally, so we
already pay the function call based clobbering cost. Uninline them.
This saves quite a bit of code in various performance sensitive
code paths:
text data bss dec hex filename
13321334 2569888 1634304 17525526 10b6b16 vmlinux.before
13320246 2569888 1634304 17524438 10b66d6 vmlinux.after
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's an internal method, not a driver API, so move it from fpu/api.h
to fpu/internal.h.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Extend the comments of the FPU init code, and fix old ones.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reorder init methods in order of their relationship and usage, to
form coherent blocks throughout the whole file.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To bring it in line with the other init_system*() methods.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that fpu__detect() has become an empty layer around
fpu__init_system(), eliminate it and make fpu__init_system()
the main system initialization routine.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the fpu__init_system_early_generic() call into fpu__init_system(),
which hosts all the system init calls.
Expose fpu__init_system() to other modules - this will be our main and only
system init function.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
check_fpu() currently relies on being called early in the init sequence,
when CR0::TS has not been set up yet.
Save/restore CR0::TS across this function, to make it invariant to
init ordering. This way we'll be able to move the generic FPU setup
routines earlier in the init sequence.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Create separate fpu/bugs.c code so that if we read generic FPU code
we don't have to wade through all the bugcheck related code first.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's a !FPU related sanity check in fpu__init_cpu_generic(),
which is executed on every CPU onlining - even though we should do
this only once, and during system init.
Move this check to fpu__init_system_early_generic().
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the generic bits of fpu__detect() into fpu__init_system_early_generic().
We'll move some other code here too in a followup patch.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Factor out the generic bits from fpu__init_system().
Rename mxcsr_feature_mask_init() to fpu__init_system_mxcsr()
to bring it in line with the rest of the nomenclature.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Factor out the generic bits from fpu__init_cpu(), to create
a flat sequence of per CPU initialization function calls:
fpu__init_cpu_generic();
fpu__init_cpu_xstate();
fpu__init_cpu_ctx_switch();
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After the latest round of cleanups, fpu__cpu_init() has become
a simple call to fpu__init_cpu().
Rename fpu__init_cpu() to fpu__cpu_init() and remove the
extra layer.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are now doing the fpu__init_cpu_ctx_switch() call from fpu__init_cpu(),
so there's no need to call it from fpu__init_system() anymore.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__cpu_init() is called on every CPU, so it is the wrong place
to call fpu__init_system() from. Call it from fpu__detect():
this is early CPU init code, but we already have CPU features detected,
so we can call the system-wide FPU init code from here.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__init_cpu() is currently called from fpu__init_system(),
which is the wrong place for it: call it from the proper high level
per CPU init function, fpu__init_cpu().
Note, we still keep the old call site as well, because it depends
on having proper CR0::TS setup. We'll fix this in the next patch.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The fpstate_xstate_init_size() function sets up a basic xstate_size, called
during fpu__detect() currently.
Its real dependency is to be called before fpu__init_system_xstate().
So move the function call site into fpu__init_system(), to right before the
fpu__init_system_xstate() call.
Also add a once-per-boot flag to fpstate_xstate_init_size(), we'll remove
this quirk later once we've cleaned up the init dependencies.
This moves the two related functions closer to each other and makes them
both part of the _init_system() functionality.
Currently we do the fpstate_xstate_init_size()
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
mxcsr_feature_mask_init() depends on TS being cleared, as it executes
an FXSAVE instruction.
After later changes we will move the TS setup into fpu__init_cpu(),
which will interact with this - so clear the TS flag explicitly.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So fpu__ctx_switch_init() has two aspects: a once per bootup functionality
that sets up a capability flag, and a per CPU functionality that sets CR0::TS.
Split the function.
Note that at this stage we still have duplicate calls into these methods, as
both the _system() and the _cpu() methods are run on all CPUs, with lower
level on_boot_cpu flags filtering out the duplicates where needed. So add
TS flag clearing as well, to handle the aftermath of early CPU init sequences
that might call in without having eager-fpu set - don't assume the TS flag
is cleared.
Calling each from its respective init level will happen later on.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's not an xsave specific function anymore, so rename it accordingly
and also clean it up a bit:
- remove the obsolete __init_refok, as the code paths are not
mixed anymore
- rename it from eager_fpu_init() to fpu__ctx_switch_init()
- remove stray 'return;'
- make it static to its only user
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The FPU context switch type (lazy or eager) setup code is split into
two places currently - move it all to eager_fpu_init().
Note that the code we move will now be executed on non-xstate CPUs
as well, but this should be safe: both xfeatures_mask and
cpu_has_xsaveopt is 0 there.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's a pure xstate method now, no need for this duplicate call.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The legacy FPU init image is used on older CPUs who don't run xstate init.
But the init code is called within setup_init_fpu_buf(), an xstate method.
Move this legacy init out of the xstate code and put it into fpu/init.c.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Expand fpu__init_system_xstate() and fpu__init_cpu_xstate() calls
into xsave_init() calls.
(This will allow us to call the proper versions in higher level FPU init code
later on.)
No change in functionality.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linearize the call sequence in xsave_init():
fpu__init_system_xstate();
fpu__init_cpu_xstate();
We do this by propagating the boot-once quirk into
fpu__init_system_xstate(). fpu__init_cpu_xstate() is
safe to be called multiple time.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that legacy code can execute fpu__init_cpu_xstate() in
xsave_init(), we can move the once per boot legacy check into
fpu__init_system_xstate(), where it belongs.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__init_system_xstate() does an FPU capability check that is better
done in fpu__init_cpu_xstate(). This will allow us to call
fpu__init_cpu_xstate() directly on legacy CPUs as well.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename existing xstate init functions along the system/cpu init principles:
fpu__init_system_xstate(): called once per system bootup
fpu__init_cpu_xstate(): called per CPU onlining
Also make the fpu__init_cpu_xstate() early code invariant:
if xfeatures_mask is not set yet then don't crash but return.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two kinds of FPU initialization sequences necessary to bring FPU
functionality up: once per system bootup activities, such as detection,
feature initialization, etc. of attributes that are shared by all CPUs
in the system - and per cpu initialization sequences run when a CPU is
brought online (either during bootup or during CPU hotplug onlining),
such as CR0/CR4 register setting, etc.
The FPU code is mixing these roles together, with no clear distinction.
Start sorting this out by splitting the main FPU detection routine
(fpu__cpu_init()) into two parts: fpu__init_system() for
one per system init activities, and fpu__init_cpu() for the
per CPU onlining init activities.
Note that xstate_init() is called from both variants for the time being,
because it has a dual nature as well. We'll fix that in upcoming patches.
Just do the split and call it as we used to before, don't introduce any
change in initialization behavior yet, beyond duplicate (and harmless)
fpu__init_cpu() and xstate_init() calls - which we'll fix in later
patches.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make init_xstate_buf allocated statically at build time.
This structure's maximum size is around 1KB - and it's allocated even on
most modern embedded x86 CPUs which strive for FPU instruction set parity
with desktop and server CPUs, so it's not like we can save much on smaller
systems.
This removes the last bootmem allocation from the FPU init path, allowing
it to be called earlier in the boot sequence.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the dependency on the init_xstate_buf == NULL check to
implement once-per-bootup logic in eager_fpu_init(), by making
setup_init_fpu_buf() run once per bootup explicitly.
This is in preparation to make init_xstate_buf statically
allocated.
The various boot-once quirks in the FPU init code will be removed
in a later cleanup stage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>