Currently the PEBS buffer size is 4k, it can only hold about 21
PEBS records. This patch enlarges the PEBS buffer size to 64k
(the same as the BTS buffer).
64k memory can hold about 330 PEBS records. This will significantly
reduce the number of PMIs when batched PEBS interrupts are enabled.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-7-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Flush the PEBS buffer during context switches if PEBS interrupt threshold
is larger than one. This allows perf to supply TID for sample outputs.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-6-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PEBS always had the capability to log samples to its buffers without
an interrupt. Traditionally perf has not used this but always set the
PEBS threshold to one.
For frequently occurring events (like cycles or branches or load/store)
this in term requires using a relatively high sampling period to avoid
overloading the system, by only processing PMIs. This in term increases
sampling error.
For the common cases we still need to use the PMI because the PEBS
hardware has various limitations. The biggest one is that it can not
supply a callgraph. It also requires setting a fixed period, as the
hardware does not support adaptive period. Another issue is that it
cannot supply a time stamp and some other options. To supply a TID it
requires flushing on context switch. It can however supply the IP, the
load/store address, TSX information, registers, and some other things.
So we can make PEBS work for some specific cases, basically as long as
you can do without a callgraph and can set the period you can use this
new PEBS mode.
The main benefit is the ability to support much lower sampling period
(down to -c 1000) without extensive overhead.
One use cases is for example to increase the resolution of the c2c tool.
Another is double checking when you suspect the standard sampling has
too much sampling error.
Some numbers on the overhead, using cycle soak, comparing the elapsed
time from "kernbench -M -H" between plain (threshold set to one) and
multi (large threshold).
The test command for plain:
"perf record --time -e cycles:p -c $period -- kernbench -M -H"
The test command for multi:
"perf record --no-time -e cycles:p -c $period -- kernbench -M -H"
( The only difference of test command between multi and plain is time
stamp options. Since time stamp is not supported by large PEBS
threshold, it can be used as a flag to indicate if large threshold is
enabled during the test. )
period plain(Sec) multi(Sec) Delta
10003 32.7 16.5 16.2
20003 30.2 16.2 14.0
40003 18.6 14.1 4.5
80003 16.8 14.6 2.2
100003 16.9 14.1 2.8
800003 15.4 15.7 -0.3
1000003 15.3 15.2 0.2
2000003 15.3 15.1 0.1
With periods below 100003, plain (threshold one) cause much more
overhead. With 10003 sampling period, the Elapsed Time for multi is
even 2X faster than plain.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-5-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the PEBS interrupt threshold is larger than one record and the
machine supports multiple PEBS events, the records of these events are
mixed up and we need to demultiplex them.
Demuxing the records is hard because the hardware is deficient. The
hardware has two issues that, when combined, create impossible
scenarios to demux.
The first issue is that the 'status' field of the PEBS record is a copy
of the GLOBAL_STATUS MSR at PEBS assist time. To see why this is a
problem let us first describe the regular PEBS cycle:
A) the CTRn value reaches 0:
- the corresponding bit in GLOBAL_STATUS gets set
- we start arming the hardware assist
< some unspecified amount of time later -- this could cover multiple
events of interest >
B) the hardware assist is armed, any next event will trigger it
C) a matching event happens:
- the hardware assist triggers and generates a PEBS record
this includes a copy of GLOBAL_STATUS at this moment
- if we auto-reload we (re)set CTRn
- we clear the relevant bit in GLOBAL_STATUS
Now consider the following chain of events:
A0, B0, A1, C0
The event generated for counter 0 will include a status with counter 1
set, even though its not at all related to the record. A similar thing
can happen with a !PEBS event if it just happens to overflow at the
right moment.
The second issue is that the hardware will only emit one record for two
or more counters if the event that triggers the assist is 'close'. The
'close' can be several cycles. In some cases even the complete assist,
if the event is something that doesn't need retirement.
For instance, consider this chain of events:
A0, B0, A1, B1, C01
Where C01 is an event that triggers both hardware assists, we will
generate but a single record, but again with both counters listed in the
status field.
This time the record pertains to both events.
Note that these two cases are different but undistinguishable with the
data as generated. Therefore demuxing records with multiple PEBS bits
(we can safely ignore status bits for !PEBS counters) is impossible.
Furthermore we cannot emit the record to both events because that might
cause a data leak -- the events might not have the same privileges -- so
what this patch does is discard such events.
The assumption/hope is that such discards will be rare.
Here lists some possible ways you may get high discard rate.
- when you count the same thing multiple times. But it is not a useful
configuration.
- you can be unfortunate if you measure with a userspace only PEBS
event along with either a kernel or unrestricted PEBS event. Imagine
the event triggering and setting the overflow flag right before
entering the kernel. Then all kernel side events will end up with
multiple bits set.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
[ Changelog improvements. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-4-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move code that sets up the PEBS sample data to a separate function.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-3-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When a fixed period is specified, this patch makes perf use the PEBS
auto reload mechanism. This makes normal profiling faster, because
it avoids one costly MSR write in the PMI handler.
However, the reset value will be loaded by hardware assist. There is a
small delay compared to the previous non-auto-reload mechanism. The
delay time is arbitrary, but very small. The assist cost is 400-800
cycles, assuming common cases with everything cached. The minimum period
the patch currently uses is 10000. In that extreme case it can be ~10%
if cycles are used.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1430940834-8964-2-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables support for branch sampling filter
for indirect jumps (IND_JUMP). It enables LBR IND_JMP
filtering where available. There is also software filtering
support.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@redhat.com
Cc: dsahern@gmail.com
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Cc: namhyung@kernel.org
Link: http://lkml.kernel.org/r/1431637800-31061-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
preempt_schedule_context() is a tracing safe preemption point but it's
only used when CONFIG_CONTEXT_TRACKING=y. Other configs have tracing
recursion issues since commit:
b30f0e3ffe ("sched/preempt: Optimize preemption operations on __schedule() callers")
introduced function based preemp_count_*() ops.
Lets make it available on all configs and give it a more appropriate
name for its new position.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433432349-1021-3-git-send-email-fweisbec@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CBOX counters are increased to 48b on HSX.
Correct the MSR address for HSWEP_U_MSR_PMON_CTR0 and
HSWEP_U_MSR_PMON_CTL0.
See specification in:
http://www.intel.com/content/www/us/en/processors/xeon/
xeon-e5-v3-uncore-performance-monitoring.html
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1432645835-7918-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change the type of variables and function prototypes to be in
alignment with what the x86_*() / __x86_*() family/model
functions return.
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433436928-31903-21-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Andy Shevchenko reported machine freezes when booting latest tip
on 32-bit setups. Problem is, the builtin microcode handling cannot
really work that early, when we haven't even enabled paging.
A proper fix would involve handling that case specially as every
other early 32-bit boot case in the microcode loader and would
require much more involved changes for which it is too late now,
more than a week before the upcoming merge window.
So, disable the builtin microcode loading on 32-bit for now.
Reported-and-tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433436928-31903-20-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In talking to Aravind recently about making certain AMD topology
attributes available to the MCE injection module, it seemed like
that CONFIG_X86_HT thing is more or less superfluous. It is
def_bool y, depends on SMP and gets enabled in the majority of
.configs - distro and otherwise - out there.
So let's kill it and make code behind it depend directly on SMP.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Daniel Walter <dwalter@google.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1433436928-31903-18-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the necessary changes to do_machine_check() to be able to
process MCEs signaled as local MCEs. Typically, only recoverable
errors (SRAR type) will be Signaled as LMCE. The architecture
does not restrict to only those errors, however.
When errors are signaled as LMCE, there is no need for the MCE
handler to perform rendezvous with other logical processors
unlike earlier processors that would broadcast machine check
errors.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1433436928-31903-17-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Initialize and prepare for handling LMCEs. Add a boot-time
option to disable LMCEs.
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Simplify stuff, align statements for better readability, reflow comments; kill
unused lmce_clear(); save us an MSR write if LMCE is already enabled. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1433436928-31903-16-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- early_idt_handlers[] fix that fixes the build with bleeding edge
tooling
- build warning fix on GCC 5.1
- vm86 fix plus self-test to make it harder to break it again"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asm/irq: Stop relying on magic JMP behavior for early_idt_handlers
x86/asm/entry/32, selftests: Add a selftest for kernel entries from VM86 mode
x86/boot: Add CONFIG_PARAVIRT_SPINLOCKS quirk to arch/x86/boot/compressed/misc.h
x86/asm/entry/32: Really make user_mode() work correctly for VM86 mode
Pull perf fixes from Ingo Molnar:
"The biggest chunk of the changes are two regression fixes: a HT
workaround fix and an event-group scheduling fix. It's been verified
with 5 days of fuzzer testing.
Other fixes:
- eBPF fix
- a BIOS breakage detection fix
- PMU driver fixes"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix a refactoring bug
perf/x86: Tweak broken BIOS rules during check_hw_exists()
perf/x86/intel/pt: Untangle pt_buffer_reset_markers()
perf: Disallow sparse AUX allocations for non-SG PMUs in overwrite mode
perf/x86: Improve HT workaround GP counter constraint
perf/x86: Fix event/group validation
perf: Fix race in BPF program unregister
Deobfuscate the 'found' logic, it can be replaced with a simple:
if (!pa_data)
return;
and 'found' can be eliminated.
Signed-off-by: Wei Yang <weiyang@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1433398729-8314-1-git-send-email-weiyang@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 066450be41 ("perf/x86/intel/pt: Clean up the control flow
in pt_pmu_hw_init()") changed attribute initialization so that
only the first attribute gets initialized using
sysfs_attr_init(), which upsets lockdep.
This patch fixes the glitch so that all allocated attributes are
properly initialized thus fixing the lockdep warning reported by
Tvrtko and Imre.
Reported-by: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Reported-by: Imre Deak <imre.deak@intel.com>
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: <linux-kernel@vger.kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The vsyscall code is entry code too, so move it to arch/x86/entry/vsyscall/.
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVa7zvAAoJEHm+PkMAQRiGtfMIAILs3sxFtrC1hApgcfRLF/7z
K34bwTRqErzqUO/orTwakEr9kSIpIL0zIPSryTCOTPZLfMGkQjhHXO3KR/DSbbTV
MZ8y/BM/yelFA/Np+1LjbiYjTNRnTRvCoaQihkIH8Rn02g7ob9HyL4gIGKpuGFcZ
04GacL2cgChqsRSACdNef948jCoJXKgcuDpe39DXphDWZnBKNZ3HFuJ6bryGJf9A
1/eCI4is85BNwKPemQUYR0xx83UIzDfrghatZP2mOCDDSA2MNg8HNxLTd12LGoQD
tfgX4B7aftzW9Y7GSEDfZ0IKm2NRzgPmCVj6PjVR/iI0lIK4Aq0Z/lDJxxEq3XQ=
=AJM5
-----END PGP SIGNATURE-----
Merge tag 'v4.1-rc6' into drm-next
Linux 4.1-rc6
backmerge 4.1-rc6 as some of the later pull reqs are based on newer bases
and I'd prefer to do the fixup myself.
Create a new directory hierarchy for the low level x86 entry code:
arch/x86/entry/*
This will host all the low level glue that is currently scattered
all across arch/x86/.
Start with entry_64.S and entry_32.S.
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Nothing in <asm/io.h> uses anything from <linux/vmalloc.h>, so
remove it from there and fix up the resulting build problems
triggered on x86 {64|32}-bit {def|allmod|allno}configs.
The breakages were triggering in places where x86 builds relied
on vmalloc() facilities but did not include <linux/vmalloc.h>
explicitly and relied on the implicit inclusion via <asm/io.h>.
Also add:
- <linux/init.h> to <linux/io.h>
- <asm/pgtable_types> to <asm/io.h>
... which were two other implicit header file dependencies.
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
[ Tidied up the changelog. ]
Acked-by: David Miller <davem@davemloft.net>
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Acked-by: Vinod Koul <vinod.koul@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Anton Vorontsov <anton@enomsg.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Colin Cross <ccross@android.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: James E.J. Bottomley <JBottomley@odin.com>
Cc: Jaroslav Kysela <perex@perex.cz>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Kristen Carlson Accardi <kristen@linux.intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Suma Ramars <sramars@cisco.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We did try trimming whitespace surrounding the 'model name'
field in /proc/cpuinfo since reportedly some userspace uses it
in string comparisons and there were discrepancies:
[thetango@prarit ~]# grep "^model name" /proc/cpuinfo | uniq -c | sed 's/\ /_/g'
______1_model_name :_AMD_Opteron(TM)_Processor_6272
_____63_model_name :_AMD_Opteron(TM)_Processor_6272_________________
However, there were issues with overlapping buffers, string
sizes and non-byte-sized copies in the previous proposed
solutions; see Link tags below for the whole farce.
So, instead of diddling with this more, let's simply extend what
was there originally with trimming any present trailing
whitespace. Final result is really simple and obvious.
Testing with the most insane model IDs qemu can generate, looks
good:
.model_id = " My funny model ID CPU ",
______4_model_name :_My_funny_model_ID_CPU
.model_id = "My funny model ID CPU ",
______4_model_name :_My_funny_model_ID_CPU
.model_id = " My funny model ID CPU",
______4_model_name :_My_funny_model_ID_CPU
.model_id = " ",
______4_model_name :__
.model_id = "",
______4_model_name :_15/02
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1432050210-32036-1-git-send-email-prarit@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
retint_kernel doesn't require %rcx to be pointing to thread info
(anymore?), and the code on the two alternative paths is - not
really surprisingly - identical.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/556C664F020000780007FB64@mail.emea.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The early_idt_handlers asm code generates an array of entry
points spaced nine bytes apart. It's not really clear from that
code or from the places that reference it what's going on, and
the code only works in the first place because GAS never
generates two-byte JMP instructions when jumping to global
labels.
Clean up the code to generate the correct array stride (member size)
explicitly. This should be considerably more robust against
screw-ups, as GAS will warn if a .fill directive has a negative
count. Using '. =' to advance would have been even more robust
(it would generate an actual error if it tried to move
backwards), but it would pad with nulls, confusing anyone who
tries to disassemble the code. The new scheme should be much
clearer to future readers.
While we're at it, improve the comments and rename the array and
common code.
Binutils may start relaxing jumps to non-weak labels. If so,
this change will fix our build, and we may need to backport this
change.
Before, on x86_64:
0000000000000000 <early_idt_handlers>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 00 00 00 00 jmpq 9 <early_idt_handlers+0x9>
5: R_X86_64_PC32 early_idt_handler-0x4
...
48: 66 90 xchg %ax,%ax
4a: 6a 08 pushq $0x8
4c: e9 00 00 00 00 jmpq 51 <early_idt_handlers+0x51>
4d: R_X86_64_PC32 early_idt_handler-0x4
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: e9 00 00 00 00 jmpq 120 <early_idt_handler>
11c: R_X86_64_PC32 early_idt_handler-0x4
After:
0000000000000000 <early_idt_handler_array>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 14 01 00 00 jmpq 11d <early_idt_handler_common>
...
48: 6a 08 pushq $0x8
4a: e9 d1 00 00 00 jmpq 120 <early_idt_handler_common>
4f: cc int3
50: cc int3
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: eb 03 jmp 120 <early_idt_handler_common>
11d: cc int3
11e: cc int3
11f: cc int3
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Binutils <binutils@sourceware.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU changes from Paul E. McKenney:
- Initialization/Kconfig updates: hide most Kconfig options from unsuspecting users.
There's now a single high level configuration option:
*
* RCU Subsystem
*
Make expert-level adjustments to RCU configuration (RCU_EXPERT) [N/y/?] (NEW)
Which if answered in the negative, leaves us with a single interactive
configuration option:
Offload RCU callback processing from boot-selected CPUs (RCU_NOCB_CPU) [N/y/?] (NEW)
All the rest of the RCU options are configured automatically.
- Remove all uses of RCU-protected array indexes: replace the
rcu_[access|dereference]_index_check() APIs with READ_ONCE() and rcu_lockdep_assert().
- RCU CPU-hotplug cleanups.
- Updates to Tiny RCU: a race fix and further code shrinkage.
- RCU torture-testing updates: fixes, speedups, cleanups and
documentation updates.
- Miscellaneous fixes.
- Documentation updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the dwarf2 annotations in low level assembly code have
become an increasing hindrance: unreadable, messy macros
mixed into some of the most security sensitive code paths
of the Linux kernel.
These debug info annotations don't even buy the upstream
kernel anything: dwarf driven stack unwinding has caused
problems in the past so it's out of tree, and the upstream
kernel only uses the much more robust framepointers based
stack unwinding method.
In addition to that there's a steady, slow bitrot going
on with these annotations, requiring frequent fixups.
There's no tooling and no functionality upstream that
keeps it correct.
So burn down the sick forest, allowing new, healthier growth:
27 files changed, 350 insertions(+), 1101 deletions(-)
Someone who has the willingness and time to do this
properly can attempt to reintroduce dwarf debuginfo in x86
assembly code plus dwarf unwinding from first principles,
with the following conditions:
- it should be maximally readable, and maximally low-key to
'ordinary' code reading and maintenance.
- find a build time method to insert dwarf annotations
automatically in the most common cases, for pop/push
instructions that manipulate the stack pointer. This could
be done for example via a preprocessing step that just
looks for common patterns - plus special annotations for
the few cases where we want to depart from the default.
We have hundreds of CFI annotations, so automating most of
that makes sense.
- it should come with build tooling checks that ensure that
CFI annotations are sensible. We've seen such efforts from
the framepointer side, and there's no reason it couldn't be
done on the dwarf side.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frédéric Weisbecker <fweisbec@gmail.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If you try to enable NOHZ_FULL on a guest today, you'll get
the following error when the guest tries to deactivate the
scheduler tick:
WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290()
NO_HZ FULL will not work with unstable sched clock
CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: events flush_to_ldisc
ffffffff8162a0c7 ffff88011f583e88 ffffffff814e6ba0 0000000000000002
ffff88011f583ed8 ffff88011f583ec8 ffffffff8104d095 ffff88011f583eb8
0000000000000000 0000000000000003 0000000000000001 0000000000000001
Call Trace:
<IRQ> [<ffffffff814e6ba0>] dump_stack+0x4f/0x7b
[<ffffffff8104d095>] warn_slowpath_common+0x85/0xc0
[<ffffffff8104d146>] warn_slowpath_fmt+0x46/0x50
[<ffffffff810bd2a9>] can_stop_full_tick+0xb9/0x290
[<ffffffff810bd9ed>] tick_nohz_irq_exit+0x8d/0xb0
[<ffffffff810511c5>] irq_exit+0xc5/0x130
[<ffffffff814f180a>] smp_apic_timer_interrupt+0x4a/0x60
[<ffffffff814eff5e>] apic_timer_interrupt+0x6e/0x80
<EOI> [<ffffffff814ee5d1>] ? _raw_spin_unlock_irqrestore+0x31/0x60
[<ffffffff8108bbc8>] __wake_up+0x48/0x60
[<ffffffff8134836c>] n_tty_receive_buf_common+0x49c/0xba0
[<ffffffff8134a6bf>] ? tty_ldisc_ref+0x1f/0x70
[<ffffffff81348a84>] n_tty_receive_buf2+0x14/0x20
[<ffffffff8134b390>] flush_to_ldisc+0xe0/0x120
[<ffffffff81064d05>] process_one_work+0x1d5/0x540
[<ffffffff81064c81>] ? process_one_work+0x151/0x540
[<ffffffff81065191>] worker_thread+0x121/0x470
[<ffffffff81065070>] ? process_one_work+0x540/0x540
[<ffffffff8106b4df>] kthread+0xef/0x110
[<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
[<ffffffff814ef4f2>] ret_from_fork+0x42/0x70
[<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
---[ end trace 06e3507544a38866 ]---
However, it turns out that kvmclock does provide a stable
sched_clock callback. So, let the scheduler know this which
in turn makes NOHZ_FULL work in the guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ACPI 6.0 formalizes e820-type-7 and efi-type-14 as persistent memory.
Mark it "reserved" and allow it to be claimed by a persistent memory
device driver.
This definition is in addition to the Linux kernel's existing type-12
definition that was recently added in support of shipping platforms with
NVDIMM support that predate ACPI 6.0 (which now classifies type-12 as
OEM reserved).
Note, /proc/iomem can be consulted for differentiating legacy
"Persistent Memory (legacy)" E820_PRAM vs standard "Persistent Memory"
E820_PMEM.
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Jeff Moyer <jmoyer@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Because mce is arch-specific x86 code, there is little or no
performance benefit of using rcu_dereference_index_check() over using
smp_load_acquire(). It also turns out that mce is the only place that
array-index-based RCU is used, and it would be convenient to drop
this portion of the RCU API.
This patch therefore changes rcu_dereference_index_check() uses to
smp_load_acquire(), but keeping the lockdep diagnostics, and also
changes rcu_access_index() uses to READ_ONCE().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: linux-edac@vger.kernel.org
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@suse.de>
The former duplicate the functionalities of the latter but are
neither documented nor arch-independent.
Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Benoit Cousson <bcousson@baylibre.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Jean Delvare <jdelvare@suse.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Drokin <oleg.drokin@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: http://lkml.kernel.org/r/1432645896-12588-9-git-send-email-bgolaszewski@baylibre.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use pat_enabled in x86-specific code to see if PAT is enabled
or not but we're granting full access to it even though readers
do not need to set it. If, for instance, we granted access to it
to modules later they then could override the variable
setting... no bueno.
This renames pat_enabled to a new static variable __pat_enabled.
Folks are redirected to use pat_enabled() now.
Code that sets this can only be internal to pat.c. Apart from
the early kernel parameter "nopat" to disable PAT, we also have
a few cases that disable it later and make use of a helper
pat_disable(). It is wrapped under an ifdef but since that code
cannot run unless PAT was enabled its not required to wrap it
with ifdefs, unwrap that. Likewise, since "nopat" doesn't really
change non-PAT systems just remove that ifdef as well.
Although we could add and use an early_param_off(), these
helpers don't use __read_mostly but we want to keep
__read_mostly for __pat_enabled as this is a hot path -- upon
boot, for instance, a simple guest may see ~4k accesses to
pat_enabled(). Since __read_mostly early boot params are not
that common we don't add a helper for them just yet.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Walls <awalls@md.metrocast.net>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Doug Ledford <dledford@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kyle McMartin <kyle@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1430425520-22275-3-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-13-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is possible to enable CONFIG_MTRR and CONFIG_X86_PAT and end
up with a system with MTRR functionality disabled but PAT
functionality enabled. This can happen, for instance, when the
Xen hypervisor is used where MTRRs are not supported but PAT is.
This can happen on Linux as of commit
47591df505 ("xen: Support Xen pv-domains using PAT")
by Juergen, introduced in v3.19.
Technically, we should assume the proper CPU bits would be set
to disable MTRRs but we can't always rely on this. At least on
the Xen Hypervisor, for instance, only X86_FEATURE_MTRR was
disabled as of Xen 4.4 through Xen commit 586ab6a [0], but not
X86_FEATURE_K6_MTRR, X86_FEATURE_CENTAUR_MCR, or
X86_FEATURE_CYRIX_ARR for instance.
Roger Pau Monné has clarified though that although this is
technically true we will never support PVH on these CPU types so
Xen has no need to disable these bits on those systems. As per
Roger, AMD K6, Centaur and VIA chips don't have the necessary
hardware extensions to allow running PVH guests [1].
As per Toshi it is also possible for the BIOS to disable MTRR
support, in such cases get_mtrr_state() would update the MTRR
state as per the BIOS, we need to propagate this information as
well.
x86 MTRR code relies on quite a bit of checks for mtrr_if being
set to check to see if MTRRs did get set up. Instead, lets
provide a generic getter for that. This also adds a few checks
where they were not before which could potentially safeguard
ourselves against incorrect usage of MTRR where this was not
desirable.
Where possible match error codes as if MTRRs were disabled on
arch/x86/include/asm/mtrr.h.
Lastly, since disabling MTRRs can happen at run time and we
could end up with PAT enabled, best record now in our logs when
MTRRs are disabled.
[0] ~/devel/xen (git::stable-4.5)$ git describe --contains 586ab6a 4.4.0-rc1~18
[1] http://lists.xenproject.org/archives/html/xen-devel/2015-03/msg03460.html
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Roger Pau Monné <roger.pau@citrix.com>
Cc: Stefan Bader <stefan.bader@canonical.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Ville Syrjälä <syrjala@sci.fi>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: bhelgaas@google.com
Cc: david.vrabel@citrix.com
Cc: jbeulich@suse.com
Cc: konrad.wilk@oracle.com
Cc: venkatesh.pallipadi@intel.com
Cc: ville.syrjala@linux.intel.com
Cc: xen-devel@lists.xensource.com
Link: http://lkml.kernel.org/r/1426893517-2511-3-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-12-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is only one user but since we're going to bury MTRR next
out of access to drivers, expose this last piece of API to
drivers in a general fashion only needing io.h for access to
helpers.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Abhilash Kesavan <a.kesavan@samsung.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Cristian Stoica <cristian.stoica@freescale.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thierry Reding <treding@nvidia.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Toshi Kani <toshi.kani@hp.com>
Cc: Ville Syrjälä <syrjala@sci.fi>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will.deacon@arm.com>
Cc: dri-devel@lists.freedesktop.org
Link: http://lkml.kernel.org/r/1429722736-4473-1-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-11-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As part of the effort to phase out MTRR use document
write-combining MTRR effects on pages with different non-PAT
page attributes flags and different PAT entry values. Extend
arch_phys_wc_add() documentation to clarify power of two sizes /
boundary requirements as we phase out mtrr_add() use.
Lastly hint towards ioremap_uc() for corner cases on device
drivers working with devices with mixed regions where MTRR size
requirements would otherwise not enable write-combining
effective memory types.
Signed-off-by: Luis R. Rodriguez <mcgrof@suse.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Antonino Daplas <adaplas@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dave Airlie <airlied@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jean-Christophe Plagniol-Villard <plagnioj@jcrosoft.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tomi Valkeinen <tomi.valkeinen@ti.com>
Cc: Ville Syrjälä <syrjala@sci.fi>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: linux-fbdev@vger.kernel.org
Link: http://lkml.kernel.org/r/1430343851-967-3-git-send-email-mcgrof@do-not-panic.com
Link: http://lkml.kernel.org/r/1432628901-18044-10-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds the argument 'uniform' to mtrr_type_lookup(),
which gets set to 1 when a given range is covered uniformly by
MTRRs, i.e. the range is fully covered by a single MTRR entry or
the default type.
Change pud_set_huge() and pmd_set_huge() to honor the 'uniform'
flag to see if it is safe to create a huge page mapping in the
range.
This allows them to create a huge page mapping in a range
covered by a single MTRR entry of any memory type. It also
detects a non-optimal request properly. They continue to check
with the WB type since it does not effectively change the
uniform mapping even if a request spans multiple MTRR entries.
pmd_set_huge() logs a warning message to a non-optimal request
so that driver writers will be aware of such a case. Drivers
should make a mapping request aligned to a single MTRR entry
when the range is covered by MTRRs.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
[ Realign, flesh out comments, improve warning message. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-7-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-8-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
MTRRs contain fixed and variable entries. mtrr_type_lookup() may
repeatedly call __mtrr_type_lookup() to handle a request that
overlaps with variable entries.
However, __mtrr_type_lookup() also handles the fixed entries,
which do not have to be repeated. Therefore, this patch creates
separate functions, mtrr_type_lookup_fixed() and
mtrr_type_lookup_variable(), to handle the fixed and variable
ranges respectively.
The patch also updates the function headers to clarify the
return values and output argument. It updates comments to
clarify that the repeating is necessary to handle overlaps with
the default type, since overlaps with multiple entries alone can
be handled without such repeating.
There is no functional change in this patch.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-6-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
mtrr_type_lookup() returns verbatim 0xFF when MTRRs are
disabled. This patch defines MTRR_TYPE_INVALID to clarify the
meaning of this value, and documents its usage.
Document the return values of the kernel virtual address mapping
helpers pud_set_huge(), pmd_set_huge, pud_clear_huge() and
pmd_clear_huge().
There is no functional change in this patch.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-5-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'mtrr_state.enabled' contains the FE (fixed MTRRs enabled)
and E (MTRRs enabled) flags in MSR_MTRRdefType. Intel SDM,
section 11.11.2.1, defines these flags as follows:
- All MTRRs are disabled when the E flag is clear.
The FE flag has no affect when the E flag is clear.
- The default type is enabled when the E flag is set.
- MTRR variable ranges are enabled when the E flag is set.
- MTRR fixed ranges are enabled when both E and FE flags
are set.
MTRR state checks in __mtrr_type_lookup() do not match with SDM.
Hence, this patch makes the following changes:
- The current code detects MTRRs disabled when both E and
FE flags are clear in mtrr_state.enabled. Fix to detect
MTRRs disabled when the E flag is clear.
- The current code does not check if the FE bit is set in
mtrr_state.enabled when looking at the fixed entries.
Fix to check the FE flag.
- The current code returns the default type when the E flag
is clear in mtrr_state.enabled. However, the default type
is UC when the E flag is clear. Remove the code as this
case is handled as MTRR disabled with the 1st change.
In addition, this patch defines the E and FE flags in
mtrr_state.enabled as follows.
- FE flag: MTRR_STATE_MTRR_FIXED_ENABLED
- E flag: MTRR_STATE_MTRR_ENABLED
print_mtrr_state() and x86_get_mtrr_mem_range() are also updated
accordingly.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-4-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When an MTRR entry is inclusive to a requested range, i.e. the
start and end of the request are not within the MTRR entry range
but the range contains the MTRR entry entirely:
range_start ... [mtrr_start ... mtrr_end] ... range_end
__mtrr_type_lookup() ignores such a case because both
start_state and end_state are set to zero.
This bug can cause the following issues:
1) reserve_memtype() tracks an effective memory type in case
a request type is WB (ex. /dev/mem blindly uses WB). Missing
to track with its effective type causes a subsequent request
to map the same range with the effective type to fail.
2) pud_set_huge() and pmd_set_huge() check if a requested range
has any overlap with MTRRs. Missing to detect an overlap may
cause a performance penalty or undefined behavior.
This patch fixes the bug by adding a new flag, 'inclusive',
to detect the inclusive case. This case is then handled in
the same way as end_state:1 since the first region is the same.
With this fix, __mtrr_type_lookup() handles the inclusive case
properly.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1431714237-880-3-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1432628901-18044-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVYnloAAoJEHm+PkMAQRiGCgkH/j3r2djOOm4h83FXrShaHORY
p8TBI3FNj4fzLk2PfzqbmiDw2T2CwygB+pxb2Ac9CE99epw8qPk2SRvPXBpdKR7t
lolhhwfzApLJMZbhzNLVywUCDUhFoiEWRhmPqIfA3WXFcIW3t5VNXAoIFjV5HFr6
sYUlaxSI1XiQ5tldVv8D6YSFHms41pisziBIZmzhIUg10P6Vv3D0FbE74fjAJwx0
+08zj3EO7yQMv7Aeeq8F8AJ3628142rcZf0NWF5ohlKLRK3gt0cl9jO5U4Co2dDt
29v03LP5EI6jDKkIbuWlqRMq9YxJz7N3wnkzV0EJiqXucoqPLFDqzbxB4gnS1pI=
=7vbA
-----END PGP SIGNATURE-----
Merge tag 'v4.1-rc5' into x86/mm, to refresh the tree before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Using "mce=1,10000000" on the kernel cmdline to change the
monarch timeout does not work. The cause is that get_option()
does parse a subsequent comma in the option string and signals
that with a return value. So we don't need to check for a second
comma ourselves.
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1432120943-25028-1-git-send-email-xiexiuqi@huawei.com
Link: http://lkml.kernel.org/r/1432628901-18044-19-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When comparing the 'model name' field of each core in
/proc/cpuinfo it was noticed that there is a whitespace
difference between the cores' model names.
After some quick investigation it was noticed that the model
name fields were actually different -- processor 0's model name
field had trailing whitespace removed, while the other
processors did not.
Another way of seeing this behaviour is to convert spaces into
underscores in the output of /proc/cpuinfo,
[thetango@prarit ~]# grep "^model name" /proc/cpuinfo | uniq -c | sed 's/\ /_/g'
______1_model_name :_AMD_Opteron(TM)_Processor_6272
_____63_model_name :_AMD_Opteron(TM)_Processor_6272_________________
which shows the discrepancy.
This occurs because the kernel calls strim() on cpu 0's
x86_model_id field to output a pretty message to the console in
print_cpu_info(), and as a result strips the whitespace at the
end of the ->x86_model_id field.
But, the ->x86_model_id field should be the same for the all
identical CPUs in the box. Thus, we need to remove both leading
and trailing whitespace.
As a result, the print_cpu_info() output looks like
smpboot: CPU0: AMD Opteron(TM) Processor 6272 (fam: 15, model: 01, stepping: 02)
and the x86_model_id field is correct on all processors on AMD
platforms:
_____64_model_name :_AMD_Opteron(TM)_Processor_6272
Output is still correct on an Intel box:
____144_model_name :_Intel(R)_Xeon(R)_CPU_E7-8890_v3_@_2.50GHz
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1432050210-32036-1-git-send-email-prarit@redhat.com
Link: http://lkml.kernel.org/r/1432628901-18044-15-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
copy_kernel_to_xregs_booting() has a second parameter that is the mask
of xfeatures that should be copied - but this parameter is always -1.
Simplify the call site of this function, this also makes it more
similar to the function call signature of other copy_kernel_to*regs()
functions.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bring the __copy_fpstate_to_fpregs() and copy_fpstate_to_fpregs() functions
in line with the parameter passing convention of other kernel-to-FPU-registers
copying functions: pass around an in-memory FPU register state pointer,
instead of struct fpu *.
NOTE: This patch also changes the assembly constraint of the FXSAVE-leak
workaround from 'fpu->fpregs_active' to 'fpstate' - but that is fine,
as we only need a valid memory address there for the FILDL instruction.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
None of the copy_kernel_to_*regs() FPU register copying functions are
supposed to fail, and all of them have debugging checks that enforce
this.
Remove their return values and simplify their call sites, which have
redundant error checks and error handling code paths.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bring the __copy_fpstate_to_fpregs() and copy_fpstate_to_fpregs() functions
in line with the naming of other kernel-to-FPU-registers copying functions.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The copy_fpstate_to_fpregs() function is never supposed to fail,
so add a debugging check to its call site in fpu__restore().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__activate_fpstate_write() is used before ptrace writes to the fpstate
context. Because it expects the modified registers to be reloaded on the
nexts context switch, it's only valid to call this function for stopped
child tasks.
- add a debugging check for this assumption
- remove code that only runs if the current task's FPU state needs
to be saved, which cannot occur here
- update comments to match the implementation
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remaining users of fpu__activate_fpstate() are all places that want to modify
FPU registers, rename the function to fpu__activate_fpstate_write() according
to this usage.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__activate_fpstate_read() is used before FPU registers are
read from the fpstate by ptrace and core dumping.
It's not necessary to unlazy non-current child tasks in this case,
since the reading of registers is non-destructive.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently fpu__activate_fpstate() is used for two distinct purposes:
- read access by ptrace and core dumping, where in the core dumping
case the current task's FPU state may be examined as well.
- write access by ptrace, which modifies FPU registers and expects
the modified registers to be reloaded on the next context switch.
Split out the reading side into fpu__activate_fpstate_read().
( Note that this is just a pure duplication of fpu__activate_fpstate()
for the time being, we'll optimize the new function in the next patch. )
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Bobby Powers <bobbypowers@gmail.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Bobby Powers reported the following FPU warning during ELF coredumping:
WARNING: CPU: 0 PID: 27452 at arch/x86/kernel/fpu/core.c:324 fpu__activate_stopped+0x8a/0xa0()
This warning unearthed an invalid assumption about fpu__activate_stopped()
that I added in:
67e97fc2ec ("x86/fpu: Rename init_fpu() to fpu__unlazy_stopped() and add debugging check")
the old init_fpu() function had an (intentional but obscure) side effect:
when FPU registers are accessed for the current task, for reading, then
it synchronized live in-register FPU state with the fpstate by saving it.
So fix this bug by saving the FPU if we are the current task. We'll
still warn in fpu__save() if this is called for not yet stopped
child tasks, so the debugging check is still preserved.
Also rename the function to fpu__activate_fpstate(), because it's not
exclusively used for stopped tasks, but for the current task as well.
( Note that this bug calls for a cleaner separation of access-for-read
and access-for-modification FPU methods, but we'll do that in separate
patches. )
Reported-by: Bobby Powers <bobbypowers@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is a 'pt' variable in the outer scope of pt_event_stop() with the same
type, we don't really need another one in the inner scope.
This patch removes the redundant variable declaration.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1432308626-18845-8-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Initially, we were trying to guard against scenarios where somebody
attaches to the system with a hardware debugger while PT is enabled
from software and pt_is_running() tries to make sure we handle this
better, but the truth is, there is still a race window no matter what
and people with hardware debuggers should really know what they are
doing anyway.
In other words, there is no point in keeping this one around, and
it's one RDMSR instructions fewer in the fast path.
The case when PT is enabled by the BIOS at boot time is handled
in the driver initialization path and doesn't use pt_is_running().
This patch gets rid of it.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1429622177-22843-6-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, the description of pt_buffer_reset_offsets() lacks information
about its calling constraints and ordering with regards to other buffer
management functions.
Add a clarification about when this function has to be called.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1429622177-22843-5-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comments in the driver don't make it absolutely clear as to what
exactly is the calling order and other possible constraints of buffer
management functions.
Document constraints and calling order for the buffer configuration
functions. While at it, replace a redundant check in
pt_buffer_reset_markers() with an explanation why it is not needed.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1429622177-22843-4-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, there's a set-but-not-used variable in setup_topa_index();
this patch gets rid of it. And while at it, fixes a style issue with
brackets around a one-line block.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1429622177-22843-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Don't bother with taking locks if we're not actually going to do
anything. Also, drop the _irqsave(), this is very much only called
from IRQ-disabled context.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For some obscure reason intel_{start,stop}_scheduling() copy the HT
state to an intermediate array. This would make sense if we ever were
to make changes to it which we'd have to discard.
Except we don't. By the time we call intel_commit_scheduling() we're;
as the name implies; committed to them. We'll never back out.
A further hint its pointless is that stop_scheduling() unconditionally
publishes the state.
So the intermediate array is pointless, modify the state in place and
kill the extra array.
And remove the pointless array initialization: INTEL_EXCL_UNUSED == 0.
Note; all is serialized by intel_excl_cntr::lock.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both intel_commit_scheduling() and intel_get_excl_contraints() test
for cntr < 0.
The only way that can happen (aside from a bug) is through
validate_event(), however that is already captured by the
cpuc->is_fake test.
So remove these test and simplify the code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the code of intel_commit_scheduling() to the right place, which is
in between start() and stop().
No change in functionality.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The intel_commit_scheduling() callback is pointlessly different from
the start and stop scheduling callback.
Furthermore, the constraint should never be NULL, so remove that test.
Even though we'll never get called (because we NULL the callbacks)
when !is_ht_workaround_enabled() put that test in.
Collapse the (pointless) WARN_ON_ONCE() and bail on !cpuc->excl_cntrs --
this is doubly pointless, because its the same condition as
is_ht_workaround_enabled() which was already pointless because the
whole method won't ever be called.
Furthremore, make all the !excl_cntrs test WARN_ON_ONCE(); they're all
pointless, because the above, either the function
({get,put}_excl_constraint) are already predicated on it existing or
the is_ht_workaround_enabled() thing is the same test.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have two 'struct event_constraint' local variables in
intel_get_excl_constraints(): 'cx' and 'c'.
Instead of using 'cx' after the dynamic allocation, put all 'cx' inside
the dynamic allocation block and use 'c' outside of it.
Also use direct assignment to copy the structure; let the compiler
figure it out.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Lockdep is very good at finding incorrect IRQ state while locking and
is far better at telling us if we hold a lock than the _is_locked()
API. It also generates less code for !DEBUG kernels.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For some obscure reason the current code accounts the current SMT
thread's state on the remote thread and reads the remote's state on
the local SMT thread.
While internally consistent, and 'correct' its pointless confusion we
can do without.
Flip them the right way around.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since we write RMID values to MSRs the correct type to use is 'u32'
because that clearly articulates we're writing a hardware register
value.
Fix up all uses of RMID in this code to consistently use the correct data
type.
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/1432285182-17180-1-git-send-email-matt@codeblueprint.co.uk
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'closid' (CLass Of Service ID) is used for the Class based Cache
Allocation Technology (CAT). Add explicit storage to the per cpu cache
for it, so it can be used later with the CAT support (requires to move
the per cpu data).
While at it:
- Rename the structure to intel_pqr_state which reflects the actual
purpose of the struct: cache values which go into the PQR MSR
- Rename 'cnt' to rmid_usecnt which reflects the actual purpose of
the counter.
- Document the structure and the struct members.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235150.240899319@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
If the usage counter is non-zero there is no point to update the rmid
in the PQR MSR.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235150.080844281@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'struct intel_cqm_state' is a strict per CPU cache of the rmid and the
usage counter. It can never be modified from a remote CPU.
The three functions which modify the content: intel_cqm_event[start|stop|del]
(del maps to stop) are called from the perf core with interrupts disabled
which is enough protection for the per CPU state values.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235150.001006529@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'int' is really not a proper data type for an MSR. Use u32 to make it
clear that we are dealing with a 32-bit unsigned hardware value.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235149.919350144@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The CQM code acts like it owns the PQR MSR completely. That's not true
because only the lower 10 bits are used for CQM. The upper 32 bits are
used for the 'CLass Of Service ID' (CLOSID). Document the abuse. Will be
fixed in a later patch.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Matt Fleming <matt.fleming@intel.com>
Cc: Kanaka Juvva <kanaka.d.juvva@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Cc: Will Auld <will.auld@intel.com>
Link: http://lkml.kernel.org/r/20150518235149.823214798@linutronix.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I stumbled upon an AMD box that had the BIOS using a hardware performance
counter. Instead of printing out a warning and continuing, it failed and
blocked further perf counter usage.
Looking through the history, I found this commit:
a5ebe0ba3d ("perf/x86: Check all MSRs before passing hw check")
which tweaked the rules for a Xen guest on an almost identical box and now
changed the behaviour.
Unfortunately the rules were tweaked incorrectly and will always lead to
MSR failures even though the MSRs are completely fine.
What happens now is in arch/x86/kernel/cpu/perf_event.c::check_hw_exists():
<snip>
for (i = 0; i < x86_pmu.num_counters; i++) {
reg = x86_pmu_config_addr(i);
ret = rdmsrl_safe(reg, &val);
if (ret)
goto msr_fail;
if (val & ARCH_PERFMON_EVENTSEL_ENABLE) {
bios_fail = 1;
val_fail = val;
reg_fail = reg;
}
}
<snip>
/*
* Read the current value, change it and read it back to see if it
* matches, this is needed to detect certain hardware emulators
* (qemu/kvm) that don't trap on the MSR access and always return 0s.
*/
reg = x86_pmu_event_addr(0);
^^^^
if the first perf counter is enabled, then this routine will always fail
because the counter is running. :-(
if (rdmsrl_safe(reg, &val))
goto msr_fail;
val ^= 0xffffUL;
ret = wrmsrl_safe(reg, val);
ret |= rdmsrl_safe(reg, &val_new);
if (ret || val != val_new)
goto msr_fail;
The above bios_fail used to be a 'goto' which is why it worked in the past.
Further, most vendors have migrated to using fixed counters to hide their
evilness hence this problem rarely shows up now days except on a few old boxes.
I fixed my problem and kept the spirit of the original Xen fix, by recording a
safe non-enable register to be used safely for the reading/writing check.
Because it is not enabled, this passes on bare metal boxes (like metal), but
should continue to throw an msr_fail on Xen guests because the register isn't
emulated yet.
Now I get a proper bios_fail error message and Xen should still see their
msr_fail message (untested).
Signed-off-by: Don Zickus <dzickus@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: george.dunlap@eu.citrix.com
Cc: konrad.wilk@oracle.com
Link: http://lkml.kernel.org/r/1431976608-56970-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently, pt_buffer_reset_markers() is a difficult to read knot of
arithmetics with a redundant check for multiple-entry TOPA capability,
a commented out wakeup marker placement and a logical error wrt to
stop marker placement. The latter happens when write head is not page
aligned and results in stop marker being placed one page earlier than
it actually should.
All these problems only affect PT implementations that support
multiple-entry TOPA tables (read: proper scatter-gather).
For single-entry TOPA implementations, there is no functional impact.
This patch deals with all of the above. Tested on both single-entry
and multiple-entry TOPA PT implementations.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1432308626-18845-4-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The (SNB/IVB/HSW) HT bug only affects events that can be programmed
onto GP counters, therefore we should only limit the number of GP
counters that can be used per cpu -- iow we should not constrain the
FP counters.
Furthermore, we should only enfore such a limit when there are in fact
exclusive events being scheduled on either sibling.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Fixed build fail for the !CONFIG_CPU_SUP_INTEL case. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 43b4578071 ("perf/x86: Reduce stack usage of
x86_schedule_events()") violated the rule that 'fake' scheduling; as
used for event/group validation; should not change the event state.
This went mostly un-noticed because repeated calls of
x86_pmu::get_event_constraints() would give the same result. And
x86_pmu::put_event_constraints() would mostly not do anything.
Commit e979121b1b ("perf/x86/intel: Implement cross-HT corruption
bug workaround") made the situation much worse by actually setting the
event->hw.constraint value to NULL, so when validation and actual
scheduling interact we get NULL ptr derefs.
Fix it by removing the constraint pointer from the event and move it
back to an array, this time in cpuc instead of on the stack.
validate_group()
x86_schedule_events()
event->hw.constraint = c; # store
<context switch>
perf_task_event_sched_in()
...
x86_schedule_events();
event->hw.constraint = c2; # store
...
put_event_constraints(event); # assume failure to schedule
intel_put_event_constraints()
event->hw.constraint = NULL;
<context switch end>
c = event->hw.constraint; # read -> NULL
if (!test_bit(hwc->idx, c->idxmsk)) # <- *BOOM* NULL deref
This in particular is possible when the event in question is a
cpu-wide event and group-leader, where the validate_group() tries to
add an event to the group.
Reported-by: Vince Weaver <vincent.weaver@maine.edu>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Hunter <ahh@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 43b4578071 ("perf/x86: Reduce stack usage of x86_schedule_events()")
Fixes: e979121b1b ("perf/x86/intel: Implement cross-HT corruption bug workaround")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove obsolete comment about __init limitations: in the new code there aren't any.
Also standardize the comment style in the function while at it.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The early_idt_handlers asm code generates an array of entry
points spaced nine bytes apart. It's not really clear from that
code or from the places that reference it what's going on, and
the code only works in the first place because GAS never
generates two-byte JMP instructions when jumping to global
labels.
Clean up the code to generate the correct array stride (member size)
explicitly. This should be considerably more robust against
screw-ups, as GAS will warn if a .fill directive has a negative
count. Using '. =' to advance would have been even more robust
(it would generate an actual error if it tried to move
backwards), but it would pad with nulls, confusing anyone who
tries to disassemble the code. The new scheme should be much
clearer to future readers.
While we're at it, improve the comments and rename the array and
common code.
Binutils may start relaxing jumps to non-weak labels. If so,
this change will fix our build, and we may need to backport this
change.
Before, on x86_64:
0000000000000000 <early_idt_handlers>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 00 00 00 00 jmpq 9 <early_idt_handlers+0x9>
5: R_X86_64_PC32 early_idt_handler-0x4
...
48: 66 90 xchg %ax,%ax
4a: 6a 08 pushq $0x8
4c: e9 00 00 00 00 jmpq 51 <early_idt_handlers+0x51>
4d: R_X86_64_PC32 early_idt_handler-0x4
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: e9 00 00 00 00 jmpq 120 <early_idt_handler>
11c: R_X86_64_PC32 early_idt_handler-0x4
After:
0000000000000000 <early_idt_handler_array>:
0: 6a 00 pushq $0x0
2: 6a 00 pushq $0x0
4: e9 14 01 00 00 jmpq 11d <early_idt_handler_common>
...
48: 6a 08 pushq $0x8
4a: e9 d1 00 00 00 jmpq 120 <early_idt_handler_common>
4f: cc int3
50: cc int3
...
117: 6a 00 pushq $0x0
119: 6a 1f pushq $0x1f
11b: eb 03 jmp 120 <early_idt_handler_common>
11d: cc int3
11e: cc int3
11f: cc int3
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Binutils <binutils@sourceware.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/ac027962af343b0c599cbfcf50b945ad2ef3d7a8.1432336324.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Conflicts:
arch/x86/kernel/i387.c
This commit is conflicting:
e88221c50c ("x86/fpu: Disable XSAVES* support for now")
These functions changed a lot, move the quirk to arch/x86/kernel/fpu/init.c's
fpu__init_system_xstate_size_legacy().
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The kernel's handling of 'compacted' xsave state layout is buggy:
http://marc.info/?l=linux-kernel&m=142967852317199
I don't have such a system, and the description there is vague, but
from extrapolation I guess that there were two kinds of bugs
observed:
- boot crashes, due to size calculations being wrong and the dynamic
allocation allocating a too small xstate area. (This is now fixed
in the new FPU code - but still present in stable kernels.)
- FPU state corruption and ABI breakage: if signal handlers try to
change the FPU state in standard format, which then the kernel
tries to restore in the compacted format.
These breakages are scary, but they only occur on a small number of
systems that have XSAVES* CPU support. Yet we have had XSAVES support
in the upstream kernel for a large number of stable kernel releases,
and the fixes are involved and unproven.
So do the safe resolution first: disable XSAVES* support and only
use the standard xstate format. This makes the code work and is
easy to backport.
On top of this we can work on enabling (and testing!) proper
compacted format support, without backporting pressure, on top of the
new, cleaned up FPU code.
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This makes the functions kvm_guest_cpu_init and kvm_init_debugfs
static now due to having no external callers outside their
declarations in the file, kvm.c.
Signed-off-by: Nicholas Krause <xerofoify@gmail.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Explain the functions and also standardize their style
and naming.
No change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We had a number of FPU init related boot option handlers
in arch/x86/kernel/cpu/common.c - move them over into
arch/x86/kernel/fpu/init.c to have them all in a
single place.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJVWh3TAAoJEHm+PkMAQRiG/kwH/2c9irodp2+M9OUnX2bfsBb6
LnChiDpvkF5BB8jhP6d/XmvPp4NJzAbTxByhjdfb2E2HkorCUHCOIn2tI1TE2pUs
2qjkOVH+XCzoV0goGtQjzK1ht8f2IrtlDiEjyRekK5cJHzhggb22QPtWL4npyd0O
reDmG2jsRaF9POr9uLSFEv4CEnkksmRLUU0vuQX0TZeCJ41O7TXrkN/wKrLZ5mj4
IWpqXQaSlrffq/T5HnVbXBxk3/T8QmhrIoppiMpV1mUVj0uTqlFRNi5qwT2Nit1h
FVljWI4+WgOk3bf7fUlp+ahopjkTgu+GuXkiRP/pdgWNQO0cxCWSAzSndAlIIAE=
=uOoJ
-----END PGP SIGNATURE-----
Backmerge v4.1-rc4 into into drm-next
We picked up a silent conflict in amdkfd with drm-fixes and drm-next,
backmerge v4.1-rc5 and fix the conflicts
Signed-off-by: Dave Airlie <airlied@redhat.com>
Conflicts:
drivers/gpu/drm/drm_irq.c
This reverts commit ff7bbb9c6a.
Sasha Levin is seeing odd jump in time values during boot of a KVM guest:
[...]
[ 0.000000] tsc: Detected 2260.998 MHz processor
[3376355.247558] Calibrating delay loop (skipped) preset value..
[...]
and bisected them to this commit.
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Currently, we use a global vector as the Posted-Interrupts
Notification Event for all the vCPUs in the system. We need
to introduce another global vector for VT-d Posted-Interrtups,
which will be used to wakeup the sleep vCPU when an external
interrupt from a direct-assigned device happens for that vCPU.
[ tglx: Removed a gazillion of extra newlines ]
Signed-off-by: Feng Wu <feng.wu@intel.com>
Cc: jiang.liu@linux.intel.com
Link: http://lkml.kernel.org/r/1432026437-16560-4-git-send-email-feng.wu@intel.com
Suggested-by: Yang Zhang <yang.z.zhang@intel.com>
Acked-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There are various internal FPU state debugging checks that never
trigger in practice, but which are useful for FPU code development.
Separate these out into CONFIG_X86_DEBUG_FPU=y, and also add a
couple of new ones.
The size difference is about 0.5K of code on defconfig:
text data bss filename
15028906 2578816 1638400 vmlinux
15029430 2578816 1638400 vmlinux
( Keep this enabled by default until the new FPU code is debugged. )
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I tried to simulate an ancient CPU via this option, and
found that it still has fxsr_opt enabled, confusing the
FPU code.
Make the 'nofxsr' option also clear FXSR_OPT flag.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This cleans up the call sites and the function a bit,
and also makes it more symmetric with the other high
level FPU state handling functions.
It's still only valid for the current task, as we copy
to the FPU registers of the current CPU.
No change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that all the FPU init function call dependencies are
cleaned up we can propagate __init annotations deeper.
This shrinks the runtime size of the kernel a bit, and
also addresses a few section warnings.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So call setup_xstate_comp() from the xstate init code, not
from the generic fpu__init_system() code.
This allows us to remove the protytype from xstate.h as well.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current xstate code in setup_xstate_features() assumes that
the first zero bit means the end of xfeatures - but that is not
so, the SDM clearly states that an arbitrary set of xfeatures
might be enabled - and it is also clear from the description
of the compaction feature that holes are possible:
"13-6 Vol. 1MANAGING STATE USING THE XSAVE FEATURE SET
[...]
Compacted format. Each state component i (i ≥ 2) is located at a byte
offset from the base address of the XSAVE area based on the XCOMP_BV
field in the XSAVE header:
— If XCOMP_BV[i] = 0, state component i is not in the XSAVE area.
— If XCOMP_BV[i] = 1, the following items apply:
• If XCOMP_BV[j] = 0 for every j, 2 ≤ j < i, state component i is
located at a byte offset 576 from the base address of the XSAVE
area. (This item applies if i is the first bit set in bits 62:2 of
the XCOMP_BV; it implies that state component i is located at the
beginning of the extended region.)
• Otherwise, let j, 2 ≤ j < i, be the greatest value such that
XCOMP_BV[j] = 1. Then state component i is located at a byte offset
X from the location of state component j, where X is the number of
bytes required for state component j as enumerated in
CPUID.(EAX=0DH,ECX=j):EAX. (This item implies that state component i
immediately follows the preceding state component whose bit is set
in XCOMP_BV.)"
So don't assume that the first zero xfeatures bit means the end of
all xfeatures - iterate through all of them.
I'm not aware of hardware that triggers this currently.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
kernel_fpu_begin() is __kernel_fpu_begin() with a preempt_disable().
Move the kernel_fpu_begin() debugging check into __kernel_fpu_begin(),
so that users of __kernel_fpu_begin() may benefit from it as well.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Improve the memory layout of 'struct fpu':
- change ->fpregs_active from 'int' to 'char' - it's just a single flag
and modern x86 CPUs can do efficient byte accesses.
- pack related fields closer to each other: often 'fpu->state' will not be
touched, while the other fields will - so pack them into a group.
Also add comments to each field, describing their purpose, and add
some background information about lazy restores.
Also fix an obsolete, lazy switching related comment in fpu_copy()'s description.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So much of fpu/core.c is the regset code, but it just obscures the generic
FPU state machine logic. Factor out the regset code into fpu/regset.c, where
it can be read in isolation.
This affects one API: fpu__activate_stopped() has to be made available
from the core to fpu/regset.c.
No change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu/xstate.c has a lot of generic FPU signal frame handling routines,
move them into a separate file: fpu/signal.c.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move restore_init_xstate() next to its sole caller.
Also rename it to copy_init_fpstate_to_fpregs() and add
some comments about what it does.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the handling of init_xstate_ctx has a layering violation: both
'struct xsave_struct' and 'union thread_xstate' have a
'struct i387_fxsave_struct' member:
xsave_struct::i387
thread_xstate::fxsave
The handling of init_xstate_ctx is generic, it is used on all
CPUs, with or without XSAVE instruction. So it's confusing how
the generic code passes around and handles an XSAVE specific
format.
What we really want is for init_xstate_ctx to be a proper
fpstate and we use its ::fxsave and ::xsave members, as
appropriate.
Since the xsave_struct::i387 and thread_xstate::fxsave aliases
each other this is not a functional problem.
So implement this, and move init_xstate_ctx to the generic FPU
code in the process.
Also, since init_xstate_ctx is not XSAVE specific anymore,
rename it to init_fpstate, and mark it __read_mostly,
because it's only modified once during bootup, and used
as a reference fpstate later on.
There's no change in functionality.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpstate_init() only uses fpu->state, so pass that in to it.
This enables the cleanup we will do in the next patch.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Factor out the FPU error code handling code from traps.c and fpu/internal.h
and move them close to each other.
Also convert the helper functions to 'struct fpu *', which further simplifies
them.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove various boot quirks that came from the old code.
The new code is cleanly split up into per-system and per-cpu
init sequences, and system init functions are only called once.
Remove the run-once quirks.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Only a few places use the regset definitions, so factor them out.
Also fix related header dependency assumptions.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most of the FPU does not use them, so split it out and include
them in signal.c and ia32_signal.c
Also fix header file dependency assumption in fpu/core.c.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move them to their only user. This makes the code easier to read,
the header is less cluttered, and it also speeds up the build a bit.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With recent cleanups and fixes the fpu__reset() and fpu__clear()
functions have become almost identical in functionality: the only
difference is that fpu__reset() assumed that the fpstate
was already active in the eagerfpu case, while fpu__clear()
activated it if it was inactive.
This distinction almost never matters, the only case where such
fpstate activation happens if if the init thread (PID 1) gets exec()-ed
for the first time.
So keep fpu__clear() and change all fpu__reset() uses to
fpu__clear() to simpify the logic.
( In a later patch we'll further simplify fpu__clear() by making
sure that all contexts it is called on are already active. )
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Consolidate more signal frame related functions:
text data bss dec filename
14108070 2575280 1634304 18317654 vmlinux.before
14107944 2575344 1634304 18317592 vmlinux.after
Also, while moving it, rename alloc_mathframe() to fpu__alloc_mathframe().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
restore_xstate_sig() is a misnomer: it's not limited to 'xstate' at all,
it is the high level 'restore FPU state from a signal frame' function
that works with all legacy FPU formats as well.
Rename it (and its helper) accordingly, and also move it to the
fpu__*() namespace.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Do it like all other high level FPU state handling functions: they
only know about struct fpu, not about the task.
(Also remove a dead prototype while at it.)
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The fpu__*() methods are closely related, but they are defined
in scattered places within the FPU code.
Concentrate them, and also uninline fpu__save(), fpu__drop()
and fpu__reset() to save about 5K of kernel text on 64-bit kernels:
text data bss dec filename
14113063 2575280 1634304 18322647 vmlinux.before
14108070 2575280 1634304 18317654 vmlinux.after
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu_restore_checking() is a helper function of restore_fpu_checking(),
but this is not apparent from the naming.
Both copy fpstate contents to fpregs, while the fuller variant does
a full copy without leaking information.
So rename them to:
copy_fpstate_to_fpregs()
__copy_fpstate_to_fpregs()
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
drop_fpu() and fpu_reset_state() are similar in functionality
and in scope, yet this is not apparent from their names.
drop_fpu() deactivates FPU contents (both the fpregs and the fpstate),
but leaves register contents intact in the eager-FPU case, mostly as an
optimization. It disables fpregs in the lazy FPU case. The drop_fpu()
method can be used to destroy FPU state in an optimized way, when we
know that a new state will be loaded before user-space might see
any remains of the old FPU state:
- such as in sys_exit()'s exit_thread() where we know this task
won't execute any user-space instructions anymore and the
next context switch cleans up the FPU. The old FPU state
might still be around in the eagerfpu case but won't be
saved.
- in __restore_xstate_sig(), where we use drop_fpu() before
copying a new state into the fpstate and activating that one.
No user-pace instructions can execute between those steps.
- in sys_execve()'s fpu__clear(): there we use drop_fpu() in
the !eagerfpu case, where it's equivalent to a full reinit.
fpu_reset_state() is a stronger version of drop_fpu(): both in
the eagerfpu and the lazy-FPU case it guarantees that fpregs
are reinitialized to init state. This method is used in cases
where we need a full reset:
- handle_signal() uses fpu_reset_state() to reset the FPU state
to init before executing a user-space signal handler. While we
have already saved the original FPU state at this point, and
always restore the original state, the signal handling code
still has to do this reinit, because signals may interrupt
any user-space instruction, and the FPU might be in various
intermediate states (such as an unbalanced x87 stack) that is
not immediately usable for general C signal handler code.
- __restore_xstate_sig() uses fpu_reset_state() when the signal
frame has no FP context. Since the signal handler may have
modified the FPU state, it gets reset back to init state.
- in another branch __restore_xstate_sig() uses fpu_reset_state()
to handle a restoration error: when restore_user_xstate() fails
to restore FPU state and we might have inconsistent FPU data,
fpu_reset_state() is used to reset it back to a known good
state.
- __kernel_fpu_end() uses fpu_reset_state() in an error branch.
This is in a 'must not trigger' error branch, so on bug-free
kernels this never triggers.
- fpu__restore() uses fpu_reset_state() in an error path
as well: if the fpstate was set up with invalid FPU state
(via ptrace or via a signal handler), then it's reset back
to init state.
- likewise, the scheduler's switch_fpu_finish() uses it in a
restoration error path too.
Move both drop_fpu() and fpu_reset_state() to the fpu__*() namespace
and harmonize their naming with their function:
fpu__drop()
fpu__reset()
This clearly shows that both methods operate on the full state of the
FPU, just like fpu__restore().
Also add comments to explain what each function does.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We'd like to use xsave_state() earlier, but its SYSTEM_BOOTING check
is too imprecise.
The real condition that xsave_state() would like to check is whether
alternative XSAVE instructions were patched into the kernel image
already.
Add such a (read-mostly) debug flag and use it in xsave_state().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
FPU fpregs do not get initialized during bootup on secondary CPUs,
on non-xsave capable CPUs.
For example on one of my systems, the secondary CPU has this FPU
state on bootup:
x86: Booting SMP configuration:
.... node #0, CPUs: #1
x86/fpu ######################
x86/fpu # FPU register dump on CPU#1:
x86/fpu # ... CWD: ffff0040
x86/fpu # ... SWD: ffff0000
x86/fpu # ... TWD: ffff555a
x86/fpu # ... FIP: 00000000
x86/fpu # ... FCS: 00000000
x86/fpu # ... FOO: 00000000
x86/fpu # ... FOS: ffff0000
x86/fpu # ... FP0: 02 57 00 00 00 00 00 00 ff ff
x86/fpu # ... FP1: 1b e2 00 00 00 00 00 00 ff ff
x86/fpu # ... FP2: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP3: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP4: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP5: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP6: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... FP7: 00 00 00 00 00 00 00 00 00 00
x86/fpu # ... SW: dadadada
x86/fpu ######################
Note how CWD and TWD are off their usual init state (0x037f and 0xffff),
and how FP0 and FP1 has non-zero content.
This is normally not a problem, because any user-space FPU state
is initalized properly - but it can complicate the use of FPU
instructions in kernel code via kernel_fpu_begin()/end(): if
the FPU using code does not initialize registers itself, it
might generate spurious exceptions depending on which CPU it
executes on.
Fix this by initializing the x87 state via the FNINIT instruction.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename this function in line with the new FPU nomenclature.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So this function still had ancient language about 'saving current
math information' - but we haven't been doing lazy FPU saves for
quite some time, we are doing lazy FPU restores.
Also remove IRQ13 related comment, which we don't support anymore
either.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the naming in line with existing names, so that we now have:
copy_fpregs_to_fpstate()
copy_fpstate_to_sigframe()
copy_fpregs_to_sigframe()
... where each function does what its name suggests.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Standardize the naming of save_xstate_sig() by renaming it to
copy_fpstate_to_sigframe(): this tells us at a glance that
the function copies an FPU fpstate to a signal frame.
This naming also follows the naming of copy_fpregs_to_fpstate().
Don't put 'xstate' into the name: since this is a generic name,
it's expected that the function is able to handle xstate frames
as well, beyond legacy frames.
xstate used to be the odd case in the x86 FPU code - now it's the
common case.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently fpstate_sanitize_xstate() has a task_struct input parameter,
but it only uses the fpu structure from it - so pass in a 'struct fpu'
pointer only and update all call sites.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the extra layer of __fpstate_sanitize_xstate():
if (!use_xsaveopt())
return;
__fpstate_sanitize_xstate(tsk);
and move the check for use_xsaveopt() into fpstate_sanitize_xstate().
In general we optimize for the presence of CPU features, not for
the absence of them. Furthermore there's little point in this inlining,
as the call sites are not super hot code paths.
Doing this uninlining shrinks the code a bit:
text data bss dec hex filename
14108751 2573624 1634304 18316679 1177d87 vmlinux.before
14108627 2573624 1634304 18316555 1177d0b vmlinux.after
Also remove a pointless '!fx' check from fpstate_sanitize_xstate().
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the sanitize_i387_state() function has the following purpose:
on CPUs that support optimized xstate saving instructions, an
FPU fpstate might end up having partially uninitialized data.
This function initializes that data.
Note that the function name is a misnomer and confusing on two levels,
not only is it not i387 specific at all, but it is the exact opposite:
it only matters on xstate CPUs.
So rename sanitize_i387_state() and __sanitize_i387_state() to
fpstate_sanitize_xstate() and __fpstate_sanitize_xstate(),
to clearly express the purpose and usage of the function.
We'll further clean up this function in the next patch.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that all FPU internals using drivers are converted to public APIs,
move xcr.h's definitions into fpu/internal.h and remove xcr.h.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We do a boot time printout of xfeatures in print_xstate_features(),
simplify this code to make use of the recently introduced cpu_has_xfeature()
method.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A lot of FPU using driver code is querying complex CPU features to be
able to figure out whether a given set of xstate features is supported
by the CPU or not.
Introduce a simplified API function that can be used on any CPU type
to get this information. Also add an error string return pointer,
so that the driver can print a meaningful error message with a
standardized feature name.
Also mark xfeatures_mask as __read_only.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current fpu_copy() code on lazy switching CPUs always saves
into the current fpstate and then copies it over into the child
context:
preempt_disable();
if (!copy_fpregs_to_fpstate(src_fpu))
fpregs_deactivate(src_fpu);
preempt_enable();
memcpy(&dst_fpu->state, &src_fpu->state, xstate_size);
That memcpy() can be avoided on all lazy switching setups except
really old FNSAVE-only systems: change fpu_copy() to directly save
into the child context, for both the lazy and the eager context
switching case.
Note that we still have to do a memcpy() back into the parent
context in the FNSAVE case, but this won't be executed on the
majority of x86 systems that got built in the last 10 years or so.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Optimize fpu_copy() a bit by expanding the ->fpstate_active == 1
portion of fpu__save() into it.
( The main purpose of this change is to enable another, larger
optimization that will be done in the next patch. )
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So fpu__save() does this currently:
copy_fpregs_to_fpstate(fpu);
if (!use_eager_fpu())
fpregs_deactivate(fpu);
... which deactivates the FPU on lazy switching systems unconditionally.
Both usecases of fpu__save() use this function to save the
FPU state into a fpstate: fork()/clone() and math error signal handling.
The unconditional disabling of FPU registers in the lazy switching
case is probably a mistaken conversion of old FNSAVE code (that had
to disable FPU registers).
So speed up this code by only disabling FPU registers when absolutely
necessary: when indicated by the copy_fpregs_to_fpstate() return
code:
if (!copy_fpregs_to_fpstate(fpu))
fpregs_deactivate(fpu);
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current implementation of __save_fpu():
if (use_xsave()) {
xsave_state(&fpu->state.xsave);
} else {
fpu_fxsave(fpu);
}
Is actually a simplified version of copy_fpregs_to_fpstate(),
if use_eager_fpu() is true.
But all call sites of __save_fpu() call it only it when use_eager_fpu()
is true.
So we can eliminate __save_fpu() altogether and use the standard
copy_fpregs_to_fpstate() function. This cleans up the code
by making it use fewer variants of FPU register saving.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__save_fpu() has this pattern:
if (unlikely(system_state == SYSTEM_BOOTING))
xsave_state_booting(&fpu->state.xsave);
else
xsave_state(&fpu->state.xsave);
... but it does not actually get called during system bootup.
So remove the complication and always call xsave_state().
To make sure this assumption is correct, add a WARN_ONCE()
debug check to xsave_state().
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We have repeat patterns of:
if (!use_eager_fpu())
clts();
... to activate FPU registers, and:
if (!use_eager_fpu())
stts();
... to deactivate them.
Encapsulate these in:
__fpregs_activate_hw();
__fpregs_activate_hw();
and use them accordingly.
Doing this synchronizes the idiom with the fpu->fpregs_active
software-flag's handling functions, creating clear patterns of:
__fpregs_activate_hw();
__fpregs_activate(fpu);
etc., which improves readability.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In line with the fpstate_activate() change, name
fpu__unlazy_stopped() in a similar fashion as well: its purpose
is to make the fpstate of a stopped task the current and active FPU
context, which may require unlazying and initialization.
The unlazying is just part of the job, the main concept is to make
the fpstate active.
Also clarify the function's description to clarify its exact
usage and the background behind it all.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that fpstate_init_curr() is not doing implicit allocations
anymore, almost all uses of it involve a very simple pattern:
if (!fpu->fpstate_active)
fpstate_init_curr(fpu);
which is basically activating the FPU fpstate if it was not active
before.
So propagate the check into the function itself, and rename the
function according to its new purpose:
fpu__activate_curr(fpu);
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that FPU contexts are always allocated, fpu__unlazy_stopped()
cannot fail. Remove its error return and propagate the changes to
the callers.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there are no FPU context allocations, rename fpstate_alloc_init()
to fpstate_init_curr(), to signal that it initializes the fpstate and
marks it active, for the current task.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the failure code and propagate this down to callers.
Note that this function still has an 'init' aspect, which must be
called.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that we always allocate the FPU context as part of task_struct there's
no need for separate allocations - remove them and their primary failure
handling code.
( Note that there's still secondary error codes that have become superfluous,
those will be removed in separate patches. )
Move the somewhat misplaced setup_xstate_comp() call to the core.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So 6 years ago we made the FPU fpstate dynamically allocated:
aa283f4927 ("x86, fpu: lazy allocation of FPU area - v5")
61c4628b53 ("x86, fpu: split FPU state from task struct - v5")
In hindsight this was a mistake:
- it complicated context allocation failure handling, such as:
/* kthread execs. TODO: cleanup this horror. */
if (WARN_ON(fpstate_alloc_init(fpu)))
force_sig(SIGKILL, tsk);
- it caused us to enable irqs in fpu__restore():
local_irq_enable();
/*
* does a slab alloc which can sleep
*/
if (fpstate_alloc_init(fpu)) {
/*
* ran out of memory!
*/
do_group_exit(SIGKILL);
return;
}
local_irq_disable();
- it (slightly) slowed down task creation/destruction by adding
slab allocation/free pattens.
- it made access to context contents (slightly) slower by adding
one more pointer dereference.
The motivation for the dynamic allocation was two-fold:
- reduce memory consumption by non-FPU tasks
- allocate and handle only the necessary amount of context for
various XSAVE processors that have varying hardware frame
sizes.
These days, with glibc using SSE memcpy by default and GCC optimizing
for SSE/AVX by default, the scope of FPU using apps on an x86 system is
much larger than it was 6 years ago.
For example on a freshly installed Fedora 21 desktop system, with a
recent kernel, all non-kthread tasks have used the FPU shortly after
bootup.
Also, even modern embedded x86 CPUs try to support the latest vector
instruction set - so they'll too often use the larger xstate frame
sizes.
So remove the dynamic allocation complication by embedding the FPU
fpstate in task_struct again. This should make the FPU a lot more
accessible to all sorts of atomic contexts.
We could still optimize for the xstate frame size in the future,
by moving the state structure to the last element of task_struct,
and allocating only a part of that.
This change is kept minimal by still keeping the ctx_alloc()/free()
routines (that now do nothing substantial) - we'll remove them in
the following patches.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So fpu_save_init() is a historic name that got its name when the only
way the FPU state was FNSAVE, which cleared (well, destroyed) the FPU
state after saving it.
Nowadays the name is misleading, because ever since the introduction of
FXSAVE (and more modern FPU saving instructions) the 'we need to reload
the FPU state' part is only true if there's a pending FPU exception [*],
which is almost never the case.
So rename it to copy_fpregs_to_fpstate() to make it clear what's
happening. Also add a few comments about why we cannot keep registers
in certain cases.
Also clean up the control flow a bit, to make it more apparent when
we are dropping/keeping FP registers, and to optimize the common
case (of keeping fpregs) some more.
[*] Probably not true anymore, modern instructions always leave the FPU
state intact, even if exceptions are pending: because pending FP
exceptions are posted on the next FP instruction, not asynchronously.
They were truly asynchronous back in the IRQ13 case, and we had to
synchronize with them, but that code is not working anymore: we don't
have IRQ13 mapped in the IDT anymore.
But a cleanup patch is obviously not the place to change subtle behavior.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Especially the irq_ts_save() function is pretty bloaty, generating
over a dozen instructions, so uninline them.
Even though the API is used rarely, the space savings are measurable:
text data bss dec hex filename
13331995 2572920 1634304 17539219 10ba093 vmlinux.before
13331739 2572920 1634304 17538963 10b9f93 vmlinux.after
( This also allows the removal of an include file inclusion from fpu/api.h,
speeding up the kernel build slightly. )
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are a number of FPU internal function prototypes and an inline function
in fpu/api.h, mostly placed so historically as the code grew over the years.
Move them over into fpu/internal.h where they belong. (Add sched.h include
to stackprotector.h which incorrectly relied on getting it from fpu/api.h.)
fpu/api.h is now a pure file that only contains FPU APIs intended for driver
use.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both inline functions call an inline function unconditionally, so we
already pay the function call based clobbering cost. Uninline them.
This saves quite a bit of code in various performance sensitive
code paths:
text data bss dec hex filename
13321334 2569888 1634304 17525526 10b6b16 vmlinux.before
13320246 2569888 1634304 17524438 10b66d6 vmlinux.after
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Extend the comments of the FPU init code, and fix old ones.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reorder init methods in order of their relationship and usage, to
form coherent blocks throughout the whole file.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To bring it in line with the other init_system*() methods.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that fpu__detect() has become an empty layer around
fpu__init_system(), eliminate it and make fpu__init_system()
the main system initialization routine.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the fpu__init_system_early_generic() call into fpu__init_system(),
which hosts all the system init calls.
Expose fpu__init_system() to other modules - this will be our main and only
system init function.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
check_fpu() currently relies on being called early in the init sequence,
when CR0::TS has not been set up yet.
Save/restore CR0::TS across this function, to make it invariant to
init ordering. This way we'll be able to move the generic FPU setup
routines earlier in the init sequence.
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Create separate fpu/bugs.c code so that if we read generic FPU code
we don't have to wade through all the bugcheck related code first.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's a !FPU related sanity check in fpu__init_cpu_generic(),
which is executed on every CPU onlining - even though we should do
this only once, and during system init.
Move this check to fpu__init_system_early_generic().
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the generic bits of fpu__detect() into fpu__init_system_early_generic().
We'll move some other code here too in a followup patch.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Factor out the generic bits from fpu__init_system().
Rename mxcsr_feature_mask_init() to fpu__init_system_mxcsr()
to bring it in line with the rest of the nomenclature.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Factor out the generic bits from fpu__init_cpu(), to create
a flat sequence of per CPU initialization function calls:
fpu__init_cpu_generic();
fpu__init_cpu_xstate();
fpu__init_cpu_ctx_switch();
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After the latest round of cleanups, fpu__cpu_init() has become
a simple call to fpu__init_cpu().
Rename fpu__init_cpu() to fpu__cpu_init() and remove the
extra layer.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are now doing the fpu__init_cpu_ctx_switch() call from fpu__init_cpu(),
so there's no need to call it from fpu__init_system() anymore.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__cpu_init() is called on every CPU, so it is the wrong place
to call fpu__init_system() from. Call it from fpu__detect():
this is early CPU init code, but we already have CPU features detected,
so we can call the system-wide FPU init code from here.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__init_cpu() is currently called from fpu__init_system(),
which is the wrong place for it: call it from the proper high level
per CPU init function, fpu__init_cpu().
Note, we still keep the old call site as well, because it depends
on having proper CR0::TS setup. We'll fix this in the next patch.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The fpstate_xstate_init_size() function sets up a basic xstate_size, called
during fpu__detect() currently.
Its real dependency is to be called before fpu__init_system_xstate().
So move the function call site into fpu__init_system(), to right before the
fpu__init_system_xstate() call.
Also add a once-per-boot flag to fpstate_xstate_init_size(), we'll remove
this quirk later once we've cleaned up the init dependencies.
This moves the two related functions closer to each other and makes them
both part of the _init_system() functionality.
Currently we do the fpstate_xstate_init_size()
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
mxcsr_feature_mask_init() depends on TS being cleared, as it executes
an FXSAVE instruction.
After later changes we will move the TS setup into fpu__init_cpu(),
which will interact with this - so clear the TS flag explicitly.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So fpu__ctx_switch_init() has two aspects: a once per bootup functionality
that sets up a capability flag, and a per CPU functionality that sets CR0::TS.
Split the function.
Note that at this stage we still have duplicate calls into these methods, as
both the _system() and the _cpu() methods are run on all CPUs, with lower
level on_boot_cpu flags filtering out the duplicates where needed. So add
TS flag clearing as well, to handle the aftermath of early CPU init sequences
that might call in without having eager-fpu set - don't assume the TS flag
is cleared.
Calling each from its respective init level will happen later on.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's not an xsave specific function anymore, so rename it accordingly
and also clean it up a bit:
- remove the obsolete __init_refok, as the code paths are not
mixed anymore
- rename it from eager_fpu_init() to fpu__ctx_switch_init()
- remove stray 'return;'
- make it static to its only user
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The FPU context switch type (lazy or eager) setup code is split into
two places currently - move it all to eager_fpu_init().
Note that the code we move will now be executed on non-xstate CPUs
as well, but this should be safe: both xfeatures_mask and
cpu_has_xsaveopt is 0 there.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's a pure xstate method now, no need for this duplicate call.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The legacy FPU init image is used on older CPUs who don't run xstate init.
But the init code is called within setup_init_fpu_buf(), an xstate method.
Move this legacy init out of the xstate code and put it into fpu/init.c.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Expand fpu__init_system_xstate() and fpu__init_cpu_xstate() calls
into xsave_init() calls.
(This will allow us to call the proper versions in higher level FPU init code
later on.)
No change in functionality.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Linearize the call sequence in xsave_init():
fpu__init_system_xstate();
fpu__init_cpu_xstate();
We do this by propagating the boot-once quirk into
fpu__init_system_xstate(). fpu__init_cpu_xstate() is
safe to be called multiple time.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that legacy code can execute fpu__init_cpu_xstate() in
xsave_init(), we can move the once per boot legacy check into
fpu__init_system_xstate(), where it belongs.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu__init_system_xstate() does an FPU capability check that is better
done in fpu__init_cpu_xstate(). This will allow us to call
fpu__init_cpu_xstate() directly on legacy CPUs as well.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename existing xstate init functions along the system/cpu init principles:
fpu__init_system_xstate(): called once per system bootup
fpu__init_cpu_xstate(): called per CPU onlining
Also make the fpu__init_cpu_xstate() early code invariant:
if xfeatures_mask is not set yet then don't crash but return.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are two kinds of FPU initialization sequences necessary to bring FPU
functionality up: once per system bootup activities, such as detection,
feature initialization, etc. of attributes that are shared by all CPUs
in the system - and per cpu initialization sequences run when a CPU is
brought online (either during bootup or during CPU hotplug onlining),
such as CR0/CR4 register setting, etc.
The FPU code is mixing these roles together, with no clear distinction.
Start sorting this out by splitting the main FPU detection routine
(fpu__cpu_init()) into two parts: fpu__init_system() for
one per system init activities, and fpu__init_cpu() for the
per CPU onlining init activities.
Note that xstate_init() is called from both variants for the time being,
because it has a dual nature as well. We'll fix that in upcoming patches.
Just do the split and call it as we used to before, don't introduce any
change in initialization behavior yet, beyond duplicate (and harmless)
fpu__init_cpu() and xstate_init() calls - which we'll fix in later
patches.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make init_xstate_buf allocated statically at build time.
This structure's maximum size is around 1KB - and it's allocated even on
most modern embedded x86 CPUs which strive for FPU instruction set parity
with desktop and server CPUs, so it's not like we can save much on smaller
systems.
This removes the last bootmem allocation from the FPU init path, allowing
it to be called earlier in the boot sequence.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the dependency on the init_xstate_buf == NULL check to
implement once-per-bootup logic in eager_fpu_init(), by making
setup_init_fpu_buf() run once per bootup explicitly.
This is in preparation to make init_xstate_buf statically
allocated.
The various boot-once quirks in the FPU init code will be removed
in a later cleanup stage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's only 8 xstate bits at the moment, and it's not like we
can support unknown bits - so put xstate_offsets[] and
xstate_sizes[] into static allocation.
This is in preparation to be able to call the FPU init code
earlier, when there's no bootmem available yet.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpstate_xstate_init_size() is called in fpu__cpu_init(), which is
run on every CPU, every time they are brought online.
But we want to call fpstate_xstate_init_size() only once. Move it to
fpu__detect(), which only runs once, on the boot CPU.
Also clean up the flow of fpstate_xstate_init_size() a bit, by
removing a 'return' from the middle of the function.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Propagate the 'fpu->fpregs_active' naming to the high level function that
clears it.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Propagate the 'fpu->fpregs_active' naming to the high level
function that sets it.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the current code uses fpu->has_cpu to determine whether a given
user FPU context is actively loaded into the FPU's registers [*] and
that those registers represent the task's current FPU state.
But this term is not unambiguous: especially the distinction between
fpu->has_fpu, PF_USED_MATH and fpu_fpregs_owner_ctx is not clear.
Increase clarity by unambigously signalling that it's about
hardware registers being active right now, by renaming it to
fpu->fpregs_active.
( In later patches we'll use more of the 'fpregs' naming, which will
make it easier to grep for as well. )
[*] There's the kernel_fpu_begin()/end() primitive that also
activates FPU hw registers as well and uses them, without
touching the fpu->fpregs_active flag.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Improve the comments and add new ones, as this code isn't very obvious.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename regset accessors to prefix them with 'regset_', because we
want to start using the 'fpregs_active' name elsewhere.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The code has the following problems:
- it uses a single global 'fx_scratch' area that multiple CPUs could
write into simultaneously, in theory.
- it wastes 512 bytes of .data for something that is only rarely used.
Fix this by moving the state buffer to the stack. Note that while
this is 512 bytes, we don't ever call this function in very deep
callchains, so its stack usage should not be a problem.
Also add comments to explain the magic 0x0000ffbf default value.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
'xsave.header::xstate_bv' is a misnomer - what does 'bv' stand for?
It probably comes from the 'XGETBV' instruction name, but I could
not find in the Intel documentation where that abbreviation comes
from. It could mean 'bit vector' - or something else?
But how about - instead of guessing about a weird name - we named
the field in an obvious and descriptive way that tells us exactly
what it does?
So rename it to 'xfeatures', which is a bitmask of the
xfeatures that are fpstate_active in that context structure.
Eyesore like:
fpu->state->xsave.xsave_hdr.xstate_bv |= XSTATE_FP;
is now much more readable:
fpu->state->xsave.header.xfeatures |= XSTATE_FP;
Which form is not just infinitely more readable, but is also
shorter as well.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Code like:
fpu->state->xsave.xsave_hdr.xstate_bv |= XSTATE_FP;
is an eyesore, because not only is the words 'xsave' and 'state'
are repeated twice times (!), but also because of the 'hdr' and 'bv'
abbreviations that are pretty meaningless at a first glance.
Start cleaning this up by renaming 'xsave_hdr' to 'header'.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Clean up various regset handlers: use the 'fpu' pointer which
is available in most cases.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The suspend code accesses FPU state internals, add a helper for
it and isolate it.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The name 'xstate_features' does not tell us whether it's a bitmap
or any other value. That it's a count of features is only obvious
if you read the code that calculates it.
Rename it to the more descriptive 'xfeatures_nr' name.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So the 'pcntxt_mask' is a misnomer, it's essentially meaningless to anyone
who doesn't know what it does exactly.
Name it more descriptively as 'xfeatures_mask'.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Inform the user/admin about which xstate features the kernel supports.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Standardize the various boot time messages printed during FPU detection:
- Use a common 'x86/fpu: ' prefix for consistency and to make it easy
to grep boot logs for FPU related messages
- Correct speling errors
- Add printout for the legacy FPU case as well
- Clarify messages
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So this code surprised me - and being surprised when reading FPU code
does not help maintainability of an already overly complex subsystem.
Remove the obfuscation and just don't use __init annotation for now.
Anyone who wants to free these ~600 bytes of xstate_enable_boot_cpu()
should implement it cleanly.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This unifies all the FPU related header files under a unified, hiearchical
naming scheme:
- asm/fpu/types.h: FPU related data types, needed for 'struct task_struct',
widely included in almost all kernel code, and hence kept
as small as possible.
- asm/fpu/api.h: FPU related 'public' methods exported to other subsystems.
- asm/fpu/internal.h: FPU subsystem internal methods
- asm/fpu/xsave.h: XSAVE support internal methods
(Also standardize the header guard in asm/fpu/internal.h.)
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We already have fpu/types.h, move i387.h to fpu/api.h.
The file name has become a misnomer anyway: it offers generic FPU APIs,
but is not limited to i387 functionality.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The primary purpose of this function is to clear the current task's
FPU before an exec(), to not leak information from the previous task,
and to allow the new task to start with freshly initialized FPU
registers.
Rename the function to reflect this primary purpose.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This helper function is only used in fpu/core.c, move it there.
This slightly speeds up compilation.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Replace task_disable_lazy_fpu_restore() with easier to read
open-coded uses: we already update the fpu->last_cpu field
explicitly in other cases.
(This also removes yet another task_struct using FPU method.)
Better explain the fpu::last_cpu field in the structure definition.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce a simple fpu->fpstate_active flag in the fpu context data structure
and use that instead of PF_USED_MATH in task->flags.
Testing for this flag byte should be slightly more efficient than
testing a bit in a bitmask, but the main advantage is that most
FPU functions can now be performed on a 'struct fpu' alone, they
don't need access to 'struct task_struct' anymore.
There's a slight linecount increase, mostly due to the 'fpu' local
variables and due to extra comments. The local variables will go away
once we move most of the FPU methods to pure 'struct fpu' parameters.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Explain its usage and also document a TODO item.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PF_USED_MATH is used directly, but also in a handful of helper inlines.
To ease the elimination of PF_USED_MATH, convert all inline helpers
to open-coded PF_USED_MATH usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Migrate this function to pure 'struct fpu' usage.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Track the FPU owner context instead of the owner task: this change,
together with other changes, will allow in subsequent patches the
elimination of 'struct task_struct' usage in various FPU code:
we'll be able to use 'struct fpu' only.
There's no change in code size:
text data bss dec hex filename
13066467 2545248 1626112 17237827 1070743 vmlinux.before
13066467 2545248 1626112 17237827 1070743 vmlinux.after
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move it closer to other per-cpu FPU data structures.
This also unifies the 32-bit and 64-bit code.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Start migrating FPU methods towards using 'struct fpu *fpu'
directly. __thread_has_fpu() is just a trivial wrapper around
fpu->has_fpu, eliminate it.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ever since the kernel started defaulting to eager FPU switches on modern Intel
CPUs it's not been obvious whether a given system is using the lazy or the eager
FPU context switching logic.
So generate a boot message about which mode the FPU code is in:
x86/fpu: Using 'lazy' FPU context switches.
or:
x86/fpu: Using 'eager' FPU context switches.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Also add a bit of documentation.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move fpu_copy() where its only user is.
Beyond readability this also speeds up compilation, as fpu-internal.h
is included in over a dozen .c files.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__save_init_fpu() is just a trivial wrapper around fpu_save_init().
Remove the extra layer of obfuscation.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Instead of open-coded in_kernel_fpu access, Use kernel_fpu_disabled() in
interrupted_kernel_fpu_idle(), matching the other kernel_fpu_*() methods.
Also add some documentation for in_kernel_fpu.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We are not supposed to call kernel_fpu_disable() if we have not
previously enabled it.
Also use kernel_fpu_disable()/enable() in the __kernel_fpu_begin/end()
primitives, instead of writing to in_kernel_fpu directly,
so that we get the debugging checks.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This allows the compiler to inline them and to eliminate them:
arch/x86/kernel/fpu/core.o:
text data bss dec hex filename
6741 4 8 6753 1a61 core.o.before
6716 4 8 6728 1a48 core.o.after
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's now local to fpu/core.c, make it static.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Introduce fpu__copy() and use it in arch_dup_task_struct(),
thus moving another chunk of FPU logic to fpu/core.c.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This code was historically in process.c, now we have FPU core internals in
fpu/core.c instead - move it there.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
These functions (xsave_state() and xsave_state_booting()) have a 'mask'
argument that is always -1.
Propagate this into the functions instead and eliminate the extra argument.
Does not change the generated code, because these were inlined functions.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the boot-time FPU bug detection code to the other FPU boot time
init code in fpu/init.c.
No change in code size:
text data bss dec hex filename
13044568 1884440 1130496 16059504 f50c70 vmlinux.before
13044568 1884440 1130496 16059504 f50c70 vmlinux.after
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's another piece of FPU internals that is better off close to
the other FPU internals.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
flush_thread() open codes a lot of FPU internals - create a separate
function for it in fpu/core.c.
Turns out that this does not hurt performance:
text data bss dec hex filename
11843039 1884440 1130496 14857975 e2b6f7 vmlinux.before
11843039 1884440 1130496 14857975 e2b6f7 vmlinux.after
and since this is a slowpath clarity comes first anyway.
We can reconsider inlining decisions after the FPU code has been cleaned up.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use fpstate_free() directly to manage FPU state.
Only process.c was using this method, so this is a speedup as well,
as it removes the extra function call and related clobbers.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both no_387() and fpu__detect() run at boot time, so they belong
into init.c.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu/core.c includes a lot of files for mostly historic reasons.
It only needs fpu-internal.h, which already includes all
the required headers.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move boot time FPU initialization code into init.c, to better
isolate it into its own domain.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix a minor header file dependency bug in asm/fpu-internal.h: it
relies on i387.h but does not include it. All users of fpu-internal.h
included it explicitly.
Also remove unnecessary includes, to reduce compilation time.
This also makes it easier to use it as a standalone header file
for FPU internals, such as an upcoming C module in arch/x86/kernel/fpu/.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Create a new subdirectory for the FPU support code in arch/x86/kernel/fpu/.
Rename 'i387.c' to 'core.c' - as this really collects the core FPU support
code, nothing i387 specific.
We'll better organize this directory in later patches.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This field is kept separate from the main FPU state structure for
no good reason.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So init_thread_xstate() is a misnomer in that it's not really related to a specific
thread - it determines, once during initial bootup, the size of the xstate context.
Also improve the comments.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fpu_init() is a bit of a misnomer in that it (falsely) creates the
impression that it's related to the (old) fpu_finit() function,
which initializes FPU ctx state.
Rename it to fpu__cpu_init() to make its boot time initialization
clear, and to move it to the fpu__*() namespace.
Also fix and extend its comment block to point out that it's
called not only on the boot CPU, but on secondary CPUs as well.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make it clear that we are initializing the in-memory FPU context area,
no the FPU registers.
Also move it to the fpu__*() namespace.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the fpu__*() namespace for fpstate_alloc() as well.
Also add a comment about FPU state alignment.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is not a small function, and it's used in several places,
one of them a popular module (KVM).
Move the function out of line. This saves a bit of text,
even with the symbol export overhead:
text data bss dec hex filename
12566052 1619504 1089536 15275092 e91454 vmlinux.before
12566046 1619504 1089536 15275086 e9144e vmlinux.after
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Open code the PF_USED_MATH logic, to make the logic more obvious.
(We'll slowly convert the other users of *_used_math() methods as well.)
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This function is only called for stopped child tasks, so the
fpu__save() branch will never get called - remove it.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This function name is a misnomer now that we've split out all the
other users from it. Rename it accordingly: it's used to save
the FPU state of (ptrace-)stopped child tasks.
Add debugging check to double check this intended usage: that this
function is only called for non-current, stopped child tasks.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that the allocation users have been split off into a separate
function, init_fpu() has become local to i387.c: make it static.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Most init_fpu() users don't want the register-saving aspect of the
function, they are calling it for 'current' and when FPU registers
are not allocated and initialized yet.
Split out a simplified API that does just that (and add debug-checks
for these conditions): fpstate_alloc_init().
Use it where appropriate.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use the fpu__*() namespace to organize FPU ops better.
Also document fpu__detect() a bit.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Document the function a bit more and add debugging check that we are only
running this with the current task.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add an explanation to fpu__save() and also don't export it to
random modules - we don't want them to futz around with deep kernel
internals.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This function is a misnomer on two levels:
1) it doesn't really manipulate TS on modern CPUs anymore, its
primary purpose is to save FPU state, used:
- when executing fork()/clone(): to copy current FPU state
to the child's FPU state.
- when handling math exceptions: to generate the math error
si_code in the signal frame.
2) even on legacy CPUs it doesn't actually 'unlazy', if then
it lazies the FPU state: as a side effect of the old FNSAVE
instruction which clears (destroys) FPU state it's necessary
to set CR0::TS.
So rename it to fpu__save() to better reflect its purpose.
Reviewed-by: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This routine has been around for over a decade, but with EISA
being dead and abandoned for about twice that long, the name can
be kind of confusing. The function is going at the PIC Edge/Level
Configuration Registers (ELCR), so rename it as such and mentally
decouple it from the long since dead EISA bus.
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
Acked-by: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Len Brown <len.brown@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1431217657-934-1-git-send-email-paul.gortmaker@windriver.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
So while testing kernels using tools/kvm/ (kvmtool) I noticed that it
booted super slow:
[ 0.142991] Performance Events: no PMU driver, software events only.
[ 0.149265] x86: Booting SMP configuration:
[ 0.149765] .... node #0, CPUs: #1
[ 0.148304] kvm-clock: cpu 1, msr 2:1bfe9041, secondary cpu clock
[ 10.158813] KVM setup async PF for cpu 1
[ 10.159000] #2
[ 10.159000] kvm-stealtime: cpu 1, msr 211a4d400
[ 10.158829] kvm-clock: cpu 2, msr 2:1bfe9081, secondary cpu clock
[ 20.167805] KVM setup async PF for cpu 2
[ 20.168000] #3
[ 20.168000] kvm-stealtime: cpu 2, msr 211a8d400
[ 20.167818] kvm-clock: cpu 3, msr 2:1bfe90c1, secondary cpu clock
[ 30.176902] KVM setup async PF for cpu 3
[ 30.177000] #4
[ 30.177000] kvm-stealtime: cpu 3, msr 211acd400
One CPU booted up per 10 seconds. With 120 CPUs that takes a while.
Bisection pinpointed this commit:
853b160aaa ("Revert f5d6a52f51 ("x86/smpboot: Skip delays during SMP initialization similar to Xen")")
But that commit just restores previous behavior, so it cannot cause the
problem. After some head scratching it turns out that these two commits:
1a744cb356 ("x86/smp/boot: Remove 10ms delay from cpu_up() on modern processors")
d68921f9bd ("x86/smp/boot: Add cmdline "cpu_init_udelay=N" to specify cpu_up() delay")
added the following code to smpboot.c:
- mdelay(10);
+ mdelay(init_udelay);
Note the mismatch in the units: the delay is called 'udelay' and is set
to microseconds - while the function used here is actually 'mdelay',
which counts in milliseconds ...
So the delay for legacy systems is off by a factor of 1,000, so instead
of 10 msecs we waited for 10 seconds ...
The reason bisection pointed to 853b160aaa was that 853b160aaa removed
a (broken) boot-time speedup patch, which masked the factor 1,000 bug.
Fix it by using udelay(). This fixes my bootup problems.
Cc: Len Brown <len.brown@intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Derek noticed that a critical MCE gets reported with the wrong
error type description:
[Hardware Error]: CPU 34: Machine Check Exception: 5 Bank 9: f200003f000100b0
[Hardware Error]: RIP !INEXACT! 10:<ffffffff812e14c1> {intel_idle+0xb1/0x170}
[Hardware Error]: TSC 49587b8e321cb
[Hardware Error]: PROCESSOR 0:306e4 TIME 1431561296 SOCKET 1 APIC 29
[Hardware Error]: Some CPUs didn't answer in synchronization
[Hardware Error]: Machine check: Invalid
^^^^^^^
The last line with 'Invalid' should have printed the high level
MCE error type description we get from mce_severity, i.e.
something like:
[Hardware Error]: Machine check: Action required: data load error in a user process
this happens due to the fact that mce_no_way_out() iterates over
all MCA banks and possibly overwrites the @msg argument which is
used in the panic printing later.
Change behavior to take the message of only and the (last)
critical MCE it detects.
Reported-by: Derek <denc716@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1431936437-25286-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
... to find_matching_signature() which is exactly what it does.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431860101-14847-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Unclutter function, make it a bit more readable, drop local
variables.
No functionality change.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431860101-14847-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Drop unreadable macro, deconstruct compound conditional
statement into single ones and return early if they match. Add
comments.
There should be no functionality change resulting from this
patch.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431860101-14847-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
... to has_newer_microcode() as it does exactly that: checks
whether binary data @mc has newer microcode patch than the
applied one. Move @mc to be the first function arg too.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431860101-14847-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The "movw %ds,%cx" instruction needs a 0x66 prefix, while
"movl %ds,%ecx" does not.
The difference is that latter form (on 64-bit CPUs)
overwrites the entire %ecx, not only its lower half.
But subsequent code doesn't depend on the value of upper
half of %ecx, so we can safely use the shorter instruction.
The new code is also faster than the old one - now we don't
depend on the old value of %ecx, but this code fragment is
not performance-critical so it does not matter much.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1431722346-26585-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
smp.c and irq_work.c implement the same inline helper. Move it to
apic.h and use it everywhere.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
On NUMA systems, an IO device may be associated with a NUMA node.
It may improve IO performance to allocate resources, such as memory
and interrupts, from device local node.
This patch introduces a mechanism to support CPU vector allocation
policies. It tries to allocate CPU vectors from CPUs on device local
node first, and then fallback to all online(global) CPUs.
This mechanism may be used to support NumaConnect systems to allocate
CPU vectors from device local node.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Daniel J Blueman <daniel@numascale.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1430967244-28905-1-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix the following oops:
hpet_msi_get_hwirq+0x1f/0x27
msi_domain_alloc+0x35/0xfe
? trace_hardirqs_on_caller+0x16c/0x188
irq_domain_alloc_irqs_recursive+0x51/0x95
__irq_domain_alloc_irqs+0x151/0x223
hpet_assign_irq+0x5d/0x68
hpet_msi_capability_lookup+0x121/0x1cb
? hpet_enable+0x2b4/0x2b4
hpet_late_init+0x5f/0xf2
? hpet_enable+0x2b4/0x2b4
do_one_initcall+0x184/0x199
kernel_init_freeable+0x1af/0x237
? rest_init+0x13a/0x13a
kernel_init+0xe/0xd4
ret_from_fork+0x3f/0x70
? rest_init+0x13a/0x13a
Since 3cb96f0c97 ('x86/hpet: Enhance HPET IRQ to support
hierarchical irqdomains') hpet_msi_capability_lookup() uses
hpet_assign_irq(). The latter initializes irq_alloc_info on stack, but
passes a NULL pointer to irq_domain_alloc_irqs(), which causes a NULL
pointer dereference later in hpet_msi_get_hwirq().
Pass the pointer to the irq_alloc_info irq_domain_alloc_irqs().
Fixes: 3cb96f0c97 'x86/hpet: Enhance HPET IRQ to support hierarchical irqdomains'
Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Link: http://lkml.kernel.org/r/20150512041444.GA1094@swordfish
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Huang Ying reported x86 boot hangs due to this commit.
Turns out that the change, despite its changelog, does more
than just change timeouts: it also changes the way we
assert/deassert INIT via the APIC_DM_INIT IPI, in the x2apic
case it skips the deassert step.
This is historically fragile code and the patch did not
improve it, so revert these changes.
This commit:
1a744cb356 ("x86/smp/boot: Remove 10ms delay from cpu_up() on modern processors")
independently removes the worst of the delays (the 10 msec delay).
The remaining delays can be addressed one by one, combined
with careful testing.
Reported-by: Huang Ying <ying.huang@intel.com>
Cc: Anthony Liguori <aliguori@amazon.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Gang Wei <gang.wei@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Deegan <tim@xen.org>
Link: http://lkml.kernel.org/r/1430732554-7294-1-git-send-email-jschoenh@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Modern processor familes do not require the 10ms delay
in cpu_up() to de-assert INIT. This speeds up boot
and resume by 10ms per (application) processor.
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/021ce30c88f216ad39686646421194dc25671e55.1431379433.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
No change to default behavior.
Replace the hard-coded mdelay(10) in cpu_up() with a variable
udelay, that is set to a defined default -- rather than a magic
number.
Add a boot-time override, "cpu_init_udelay=N"
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/2fe8e6c798e8def271122f62df9bbf58dc283e2a.1431379433.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables the uncore Memory Controller (IMC) PMU
support for Intel Broadwell-U (Model 61) mobile processors.
The IMC PMU enables measuring memory bandwidth.
To use with perf:
$ perf stat -a -I 1000 -e
uncore_imc/data_reads/,uncore_imc/data_writes/ sleep 10
Tested-by: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kan.liang@intel.com
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/20150423065642.GA4890@thinkpad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__mtrr_type_lookup() checks MTRR fixed ranges when mtrr_state.have_fixed
is set and start is less than 0x100000.
However, the 'else if (start < 0x1000000)' in the code checks with an
incorrect address as it has an extra-zero in the address.
The code still runs correctly as this check is meaningless, though.
This patch replaces the incorrect address check with 'else' with no
condition.
Signed-off-by: Toshi Kani <toshi.kani@hp.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Elliott@hp.com
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Luis R. Rodriguez <mcgrof@suse.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@intel.com
Cc: linux-mm <linux-mm@kvack.org>
Cc: pebolle@tiscali.nl
Link: http://lkml.kernel.org/r/1427234921-19737-4-git-send-email-toshi.kani@hp.com
Link: http://lkml.kernel.org/r/1431332153-18566-8-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is useless at best and git history has it all detailed
anyway. Update copyright while at it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431332153-18566-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is an important RAS feature which adds hardware support for
poisoned data. That means roughly that the hardware marks data which it
has detected as corrupted but wasn't able to correct, as poisoned data
and raises an APIC interrupt to signal that in the form of a deferred
error. It is the OS's responsibility then to take proper recovery action
and thus prolonge system lifetime as far as possible.
Misc cleanups ontop. (Borislav Petkov)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVUF2sAAoJEBLB8Bhh3lVKXZ4QAJ3UdVM1/TuqAsZ7+jLkb7BZ
BRyWgv31CcX5fM1D0vV+6K+4GPPsLAtNVYy2G+LauFX1bfE1f9ExWKlMzp45h1sS
xaNLDhIIP+aE4kD1J7mlNc0WlF0ghlfX+iaGc7lI+j3o2Ydlxm15Pt6Te9hDI7en
C1NOWrkJ0+BJv48bPeJ835CLu+DZ6xktWdJ1In88PNUA9YiTj12/nhMKkaGbh3zv
Ep3FCFD/tHcecRK/rVmSTE3cG50SLKtndh/Kl7s1wYhgw6ERyg3x/t8QefZkuU0Q
6fbetgYS9VvpewViAuNemoCHY5qxBNHPLsn6vwhluzlelW1CcgINU8LHcGZiaLmd
DYVM9bHfSrKrHhH0M55XPn9RQSZpA+cTep3IyQzCK+jmLBiqrH3bMIRHjNQRUOLy
DsGLm51tQqaMmnhDma8mMjF7LN+iBqNxXeqvkxQxQBE5NVLXHoaajOgUuj/N59WE
FEFa65rmTrmsmgjAn9BPBk0zeoyQaYFKCLhENB19Vlt/4YoY/vHvzFYJNEcQT5ZU
kuM8/hSEqeYZH4ZjJ8i9zKVado7z6pRQqV/lwRJ27tuXy9+9y6pV+ewmk8gCjQe4
gvySlHbIlfO5geF59GYenp4ll5CdZFvIJuwhybDBZhk3C7M2M7X/xgHnJnprza6j
YVzOp7Jj2aeHGImqGL49
=3e5/
-----END PGP SIGNATURE-----
Merge tag 'ras_for_4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras into x86/ras
Pull RAS updates from Borislav Petkov:
- RAS: Add support for deferred errors on AMD (Aravind Gopalakrishnan)
This is an important RAS feature which adds hardware support for
poisoned data. That means roughly that the hardware marks data which it
has detected as corrupted but wasn't able to correct, as poisoned data
and raises an APIC interrupt to signal that in the form of a deferred
error. It is the OS's responsibility then to take proper recovery action
and thus prolonge system lifetime as far as possible.
- Misc cleanups ontop. (Borislav Petkov)"
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Valentin Rothberg reported that we use CONFIG_QUEUED_SPINLOCKS
in arch/x86/kernel/paravirt_patch_32.c, while the symbol is
called CONFIG_QUEUED_SPINLOCK. (Note the extra 'S')
But the typo was natural: the proper English term for such
a generic object would be 'queued spinlocks' - so rename
this and related symbols accordingly to the plural form.
Reported-by: Valentin Rothberg <valentinrothberg@gmail.com>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Waiman Long <Waiman.Long@hp.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since the ISA irqs are in a single block, use
ISA_IRQ_VECTOR(irq) instead of individual macros.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431185813-15413-5-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use IA32_SYSCALL_VECTOR for both compat and native.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1431185813-15413-4-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
32-bit code has PER_CPU_VAR(cpu_current_top_of_stack).
64-bit code uses somewhat more obscure: PER_CPU_VAR(cpu_tss + TSS_sp0).
Define the 'cpu_current_top_of_stack' macro on CONFIG_X86_64
as well so that the PER_CPU_VAR(cpu_current_top_of_stack)
expression can be used in both 32-bit and 64-bit code.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1429889495-27850-3-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PER_CPU_VAR(kernel_stack) is redundant:
- On the 64-bit build, we can use PER_CPU_VAR(cpu_tss + TSS_sp0).
- On the 32-bit build, we can use PER_CPU_VAR(cpu_current_top_of_stack).
PER_CPU_VAR(kernel_stack) will be deleted by a separate change.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1429889495-27850-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
drm-intel-next-2015-04-23:
- dither support for ns2501 dvo (Thomas Richter)
- some polish for the gtt code and fixes to finally enable the cmd parser on hsw
- first pile of bxt stage 1 enabling (too many different people to list ...)
- more psr fixes from Rodrigo
- skl rotation support from Chandra
- more atomic work from Ander and Matt
- pile of cleanups and micro-ops for execlist from Chris
drm-intel-next-2015-04-10:
- cdclk handling cleanup and fixes from Ville
- more prep patches for olr removal from John Harrison
- gmbus pin naming rework from Jani (prep for bxt)
- remove ->new_config from Ander (more atomic conversion work)
- rps (boost) tuning and unification with byt/bsw from Chris
- cmd parser batch bool tuning from Chris
- gen8 dynamic pte allocation (Michel Thierry, based on work from Ben Widawsky)
- execlist tuning (not yet all of it) from Chris
- add drm_plane_from_index (Chandra)
- various small things all over
* tag 'drm-intel-next-2015-04-23-fixed' of git://anongit.freedesktop.org/drm-intel: (204 commits)
drm/i915/gtt: Allocate va range only if vma is not bound
drm/i915: Enable cmd parser to do secure batch promotion for aliasing ppgtt
drm/i915: fix intel_prepare_ddi
drm/i915: factor out ddi_get_encoder_port
drm/i915/hdmi: check port in ibx_infoframe_enabled
drm/i915/hdmi: fix vlv infoframe port check
drm/i915: Silence compiler warning in dvo
drm/i915: Update DRIVER_DATE to 20150423
drm/i915: Enable dithering on NatSemi DVO2501 for Fujitsu S6010
rm/i915: Move i915_get_ggtt_vma_pages into ggtt_bind_vma
drm/i915: Don't try to outsmart gcc in i915_gem_gtt.c
drm/i915: Unduplicate i915_ggtt_unbind/bind_vma
drm/i915: Move ppgtt_bind/unbind around
drm/i915: move i915_gem_restore_gtt_mappings around
drm/i915: Fix up the vma aliasing ppgtt binding
drm/i915: Remove misleading comment around bind_to_vm
drm/i915: Don't use atomics for pg_dirty_rings
drm/i915: Don't look at pg_dirty_rings for aliasing ppgtt
drm/i915/skl: Support Y tiling in MMIO flips
drm/i915: Fixup kerneldoc for struct intel_context
...
Conflicts:
drivers/gpu/drm/i915/i915_drv.c
This patch adds the necessary KVM specific code to allow KVM to
support the CPU halting and kicking operations needed by the queue
spinlock PV code.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel J Blueman <daniel@numascale.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1429901803-29771-11-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We use the regular paravirt call patching to switch between:
native_queued_spin_lock_slowpath() __pv_queued_spin_lock_slowpath()
native_queued_spin_unlock() __pv_queued_spin_unlock()
We use a callee saved call for the unlock function which reduces the
i-cache footprint and allows 'inlining' of SPIN_UNLOCK functions
again.
We further optimize the unlock path by patching the direct call with a
"movb $0,%arg1" if we are indeed using the native unlock code. This
makes the unlock code almost as fast as the !PARAVIRT case.
This significantly lowers the overhead of having
CONFIG_PARAVIRT_SPINLOCKS enabled, even for native code.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel J Blueman <daniel@numascale.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Douglas Hatch <doug.hatch@hp.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Paolo Bonzini <paolo.bonzini@gmail.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: virtualization@lists.linux-foundation.org
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1429901803-29771-10-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
iTLB-load-misses and LLC-load-misses count incorrectly on SLM.
There is no ITLB.MISSES support on SLM. Event PAGE_WALKS.I_SIDE_WALK
should be used to count iTLB-load-misses. This event counts when an
instruction (I) page walk is completed or started. Since a page walk
implies a TLB miss, the number of TLB misses can be counted by counting
the number of pagewalks.
DMND_DATA_RD counts both demand and DCU prefetch data reads. However,
LLC-load-misses should only count demand reads. There is no way to not
include prefetches with a single counter on SLM. So the LLC-load-misses
support should be removed on SLM.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1429608881-5055-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is useless and git history has it all detailed anyway. Update
copyright while at it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Those were leftovers of the x86 merge, see
081f75bbdc ("traps: x86: make traps_32.c and traps_64.c equal")
for example and are not needed now.
Signed-off-by: Borislav Petkov <bp@suse.de>
If you try to enable NOHZ_FULL on a guest today, you'll get
the following error when the guest tries to deactivate the
scheduler tick:
WARNING: CPU: 3 PID: 2182 at kernel/time/tick-sched.c:192 can_stop_full_tick+0xb9/0x290()
NO_HZ FULL will not work with unstable sched clock
CPU: 3 PID: 2182 Comm: kworker/3:1 Not tainted 4.0.0-10545-gb9bb6fb #204
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: events flush_to_ldisc
ffffffff8162a0c7 ffff88011f583e88 ffffffff814e6ba0 0000000000000002
ffff88011f583ed8 ffff88011f583ec8 ffffffff8104d095 ffff88011f583eb8
0000000000000000 0000000000000003 0000000000000001 0000000000000001
Call Trace:
<IRQ> [<ffffffff814e6ba0>] dump_stack+0x4f/0x7b
[<ffffffff8104d095>] warn_slowpath_common+0x85/0xc0
[<ffffffff8104d146>] warn_slowpath_fmt+0x46/0x50
[<ffffffff810bd2a9>] can_stop_full_tick+0xb9/0x290
[<ffffffff810bd9ed>] tick_nohz_irq_exit+0x8d/0xb0
[<ffffffff810511c5>] irq_exit+0xc5/0x130
[<ffffffff814f180a>] smp_apic_timer_interrupt+0x4a/0x60
[<ffffffff814eff5e>] apic_timer_interrupt+0x6e/0x80
<EOI> [<ffffffff814ee5d1>] ? _raw_spin_unlock_irqrestore+0x31/0x60
[<ffffffff8108bbc8>] __wake_up+0x48/0x60
[<ffffffff8134836c>] n_tty_receive_buf_common+0x49c/0xba0
[<ffffffff8134a6bf>] ? tty_ldisc_ref+0x1f/0x70
[<ffffffff81348a84>] n_tty_receive_buf2+0x14/0x20
[<ffffffff8134b390>] flush_to_ldisc+0xe0/0x120
[<ffffffff81064d05>] process_one_work+0x1d5/0x540
[<ffffffff81064c81>] ? process_one_work+0x151/0x540
[<ffffffff81065191>] worker_thread+0x121/0x470
[<ffffffff81065070>] ? process_one_work+0x540/0x540
[<ffffffff8106b4df>] kthread+0xef/0x110
[<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
[<ffffffff814ef4f2>] ret_from_fork+0x42/0x70
[<ffffffff8106b3f0>] ? __kthread_parkme+0xa0/0xa0
---[ end trace 06e3507544a38866 ]---
However, it turns out that kvmclock does provide a stable
sched_clock callback. So, let the scheduler know this which
in turn makes NOHZ_FULL work in the guest.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
'setup_APIC_mce' doesn't give us an indication of why we are
going to program LVT. Make that explicit by renaming it to
setup_APIC_mce_threshold so we know.
No functional change is introduced.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1430913538-1415-7-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Deferred errors indicate error conditions that were not corrected, but
require no action from S/W (or action is optional).These errors provide
info about a latent UC MCE that can occur when a poisoned data is
consumed by the processor.
Processors that report these errors can be configured to generate APIC
interrupts to notify OS about the error.
Provide an interrupt handler in this patch so that OS can catch these
errors as and when they happen. Currently, we simply log the errors and
exit the handler as S/W action is not mandated.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1430913538-1415-5-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
- Fix blkback regression if using persistent grants.
- Fix various event channel related suspend/resume bugs.
- Fix AMD x86 regression with X86_BUG_SYSRET_SS_ATTRS.
- SWIOTLB on ARM now uses frames <4 GiB (if available) so device only
capable of 32-bit DMA work.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJVSiC1AAoJEFxbo/MsZsTRojgH/1zWPD0r5WMAEPb6DFdb7Ga1
SqBbyHFu43axNwZ7EvUzSqI8BKDPbTnScQ3+zC6Zy1SIEfS+40+vn7kY/uASmWtK
LYaYu8nd49OZP8ykH0HEvsJ2LXKnAwqAwvVbEigG7KJA7h8wXo7aDwdwxtZmHlFP
18xRTfHcrnINtAJpjVRmIGZsCMXhXQz4bm0HwsXTTX0qUcRWtxydKDlMPTVFyWR8
wQ2m5+76fQ8KlFsoJEB0M9ygFdheZBF4FxBGHRrWXBUOhHrQITnH+cf1aMVxTkvy
NDwiEebwXUDHacv21onszoOkNjReLsx+DWp9eHknlT/fgPo6tweMM2yazFGm+JQ=
=W683
-----END PGP SIGNATURE-----
Merge tag 'for-linus-4.1b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen bug fixes from David Vrabel:
- fix blkback regression if using persistent grants
- fix various event channel related suspend/resume bugs
- fix AMD x86 regression with X86_BUG_SYSRET_SS_ATTRS
- SWIOTLB on ARM now uses frames <4 GiB (if available) so device only
capable of 32-bit DMA work.
* tag 'for-linus-4.1b-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: Add __GFP_DMA flag when xen_swiotlb_init gets free pages on ARM
hypervisor/x86/xen: Unset X86_BUG_SYSRET_SS_ATTRS on Xen PV guests
xen/events: Set irq_info->evtchn before binding the channel to CPU in __startup_pirq()
xen/console: Update console event channel on resume
xen/xenbus: Update xenbus event channel on resume
xen/events: Clear cpu_evtchn_mask before resuming
xen-pciback: Add name prefix to global 'permissive' variable
xen: Suspend ticks on all CPUs during suspend
xen/grant: introduce func gnttab_unmap_refs_sync()
xen/blkback: safely unmap purge persistent grants
Deferred errors indicate error conditions that were not corrected, but
those errors have not been consumed yet. They require no action from
S/W (or action is optional). These errors provide info about a latent
uncorrectable MCE that can occur when a poisoned data is consumed by the
processor.
Newer AMD processors can generate deferred errors and can be configured
to generate APIC interrupts on such events.
SUCCOR stands for S/W UnCorrectable error COntainment and Recovery.
It indicates support for data poisoning in HW and deferred error
interrupts.
Add new bitfield to mce_vendor_flags for this. We use this to verify
presence of deferred error interrupts before we enable them in mce_amd.c
While at it, clarify comments in mce_vendor_flags to provide an
indication of usages of the bitfields.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1430913538-1415-4-git-send-email-Aravind.Gopalakrishnan@amd.com
[ beef up commit message, do CPUID(8000_0007) only once. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Pull x86 fixes from Ingo Molnar:
"EFI fixes, and FPU fix, a ticket spinlock boundary condition fix and
two build fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/fpu: Always restore_xinit_state() when use_eager_cpu()
x86: Make cpu_tss available to external modules
efi: Fix error handling in add_sysfs_runtime_map_entry()
x86/spinlocks: Fix regression in spinlock contention detection
x86/mm: Clean up types in xlate_dev_mem_ptr()
x86/efi: Store upper bits of command line buffer address in ext_cmd_line_ptr
efivarfs: Ensure VariableName is NUL-terminated
amd_decode_mce() needs value in m->addr so it can report the error
address correctly. This should be setup in __log_error() before we call
mce_log(). We do this because the error address is an important bit of
information which should be conveyed to userspace.
The correct output then reports proper address, like this:
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (15:60:0) MC0_STATUS [-|CE|-|-|AddrV|-|-|CECC]: 0x840041000028017b
[Hardware Error]: MC0 Error Address: 0x00001f808f0ff040
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1430913538-1415-3-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Refactor the code here to setup struct mce and call mce_log() to log
the error. We're going to reuse this in a later patch as part of the
deferred error interrupt enablement.
No functional change is introduced.
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86-ml <x86@kernel.org>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1430913538-1415-2-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Pull perf fixes from Ingo Molnar:
"Mostly tooling fixes, but also an uncore PMU driver fix and an uncore
PMU driver hardware-enablement addition"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf probe: Fix segfault if passed with ''.
perf report: Fix -T/--threads option to work again
perf bench numa: Fix immediate meeting of convergence condition
perf bench numa: Fixes of --quiet argument
perf bench futex: Fix hung wakeup tasks after requeueing
perf probe: Fix bug with global variables handling
perf top: Fix a segfault when kernel map is restricted.
tools lib traceevent: Fix build failure on 32-bit arch
perf kmem: Fix compiles on RHEL6/OL6
tools lib api: Undefine _FORTIFY_SOURCE before setting it
perf kmem: Consistently use PRIu64 for printing u64 values
perf trace: Disable events and drain events when forked workload ends
perf trace: Enable events when doing system wide tracing and starting a workload
perf/x86/intel/uncore: Move PCI IDs for IMC to uncore driver
perf/x86/intel/uncore: Add support for Intel Haswell ULT (lower power Mobile Processor) IMC uncore PMUs
perf/x86/intel: Add cpu_(prepare|starting|dying) for core_pmu
Apparently, people do build microcode into the kernel image, i.e.
CONFIG_FIRMWARE_IN_KERNEL=y.
Make that work in the early loader which is where microcode should be
preferably loaded anyway.
Note that you need to specify the microcode filename with the path
relative to the toplevel firmware directory (the same like the late
loading method) in CONFIG_EXTRA_FIRMWARE=y so that early loader can
find it.
I.e., something like this (Intel variant):
CONFIG_FIRMWARE_IN_KERNEL=y
CONFIG_EXTRA_FIRMWARE="intel-ucode/06-3a-09"
CONFIG_EXTRA_FIRMWARE_DIR="/lib/firmware/"
While at it, add me to the loader copyright boilerplate.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Daniel J Blueman <daniel@numascale.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
@rev wasn't used in get_matching_sig(), drop it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It is a one-liner for checking microcode header revisions. On top of
that, it can be used wrong as it was the case in _save_mc(). Get rid of
it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following commit:
f893959b08 ("x86/fpu: Don't abuse drop_init_fpu() in flush_thread()")
removed drop_init_fpu() usage from flush_thread(). This seems to break
things for me - the Go 1.4 test suite fails all over the place with
floating point comparision errors (offending commit found through
bisection).
The functional change was that flush_thread() after this commit
only calls restore_init_xstate() when both use_eager_fpu() and
!used_math() are true. drop_init_fpu() (now fpu_reset_state()) calls
restore_init_xstate() regardless of whether current used_math() - apply
the same logic here.
Switch used_math() -> tsk_used_math(tsk) to consistently use the grabbed
tsk instead of current, like in the rest of flush_thread().
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Bobby Powers <bobbypowers@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Riikonen <priikone@iki.fi>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: f893959b ("x86/fpu: Don't abuse drop_init_fpu() in flush_thread()")
Link: http://lkml.kernel.org/r/1430147441-9820-1-git-send-email-bobbypowers@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Decision to use a 4-bit mask or 8-bit mask in default_get_apic_id()
is controlled by setting capability bit X86_FEATURE_EXTD_APICID.
Currently, we detect extended APIC ID support by accessing Link
Transaction Control register D18F0x68 in PCI config space.
But, not even that is needed as we can safely postulate that future
AMD processors will support 8-bit APIC IDs and we can simply set that
feature bit on them, without the PCI access.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jacob Shin <jacob.w.shin@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dave.hansen@linux.intel.com
Cc: hecmargi@upv.es
Cc: mgorman@suse.de
Link: http://lkml.kernel.org/r/1430148351-9013-1-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
GART registers are not present in newer AMD processors (Fam15h, Model
10h and later). So, avoid accessing those in PCI config space by
returning early in early_gart_iommu_check() and gart_iommu_hole_init()
if GART is not available.
Current code doesn't break on existing processors but there are some
side effects:
We get bogus AGP aperture messages which are simply noise on
GART-less processors:
AGP: Node 0: aperture [bus addr 0x00000000-0x01ffffff] (32MB)
AGP: Your BIOS doesn't leave aperture memory hole
AGP: Please enable the IOMMU option in the BIOS setup
AGP: This costs you 64MB of RAM
AGP: Mapping aperture over RAM [mem 0xd4000000-0xd7ffffff]
We can avoid calling allocate_aperture() and would not have to
wastefully reserve 64MB of RAM with memblock_reserve(). Also, we can
avoid having to loop through all PCI buses and devices twice, searching
for a non-existent AGP bridge if we bail out early.
Refactor the family check used in amd_nb.c into an inline function so we
can use it here as well as in amd_nb.c
Fix some typos while at it.
Tested the patch on Fam10h and Fam15h Model 00h-fh and this code runs
fine. On Fam15h Model 60h-6fh and on Fam16h, we bail early as they don't
have GART.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Joerg Rodel <joro@8bytes.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1428443197-3834-1-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the per-CPU delays during SMP initialization, which seems
to be possible on newer architectures with an x2APIC.
Xen does this since 2011. In fact, this commit is basically a
combination of the following Xen commits. The first removes the
delays, the second fixes an issue with the removal:
commit 68fce206f6dba9981e8322269db49692c95ce250
Author: Tim Deegan <Tim.Deegan@citrix.com>
Date: Tue Jul 19 14:13:01 2011 +0100
x86: Remove timeouts from INIT-SIPI-SIPI sequence when using x2apic.
Some of the timeouts are pointless since they're waiting for the ICR
to ack the IPI delivery and that doesn't happen on x2apic.
The others should be benign (and are suggested in the SDM) but
removing them makes AP bringup much more reliable on some test boxes.
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>
commit f12ee533150761df5a7099c83f2a5fa6c07d1187
Author: Gang Wei <gang.wei@intel.com>
Date: Thu Dec 29 10:07:54 2011 +0000
X86: Add a delay between INIT & SIPIs for tboot AP bring-up in X2APIC case
Without this delay, Xen could not bring APs up while working with
TXT/tboot, because tboot needs some time in APs to handle INIT before
becoming ready for receiving SIPIs (this delay was removed as part of
c/s 23724 by Tim Deegan).
Signed-off-by: Gang Wei <gang.wei@intel.com>
Acked-by: Keir Fraser <keir@xen.org>
Acked-by: Tim Deegan <tim@xen.org>
Committed-by: Tim Deegan <tim@xen.org>
Signed-off-by: Jan H. Schönherr <jschoenh@amazon.de>
Cc: Anthony Liguori <aliguori@amazon.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Gang Wei <gang.wei@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Deegan <tim@xen.org>
Link: http://lkml.kernel.org/r/1430732554-7294-1-git-send-email-jschoenh@amazon.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge common values for 32-bit native and compat.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/1428844486-6638-1-git-send-email-brgerst@gmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit 75182b1632 ("x86/asm/entry: Switch all C consumers of
kernel_stack to this_cpu_sp0()") changed current_thread_info
to use this_cpu_sp0, and indirectly made it rely on init_tss
which was exported with EXPORT_PER_CPU_SYMBOL_GPL.
As a result some macros and inline functions such as set/get_fs,
test_thread_flag and variants have been made unusable for
external modules.
Make cpu_tss exported with EXPORT_PER_CPU_SYMBOL so that these
functions are accessible again, as they were previously.
Signed-off-by: Marc Dionne <marc.dionne@your-file-system.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/1430763404-21221-1-git-send-email-marc.dionne@your-file-system.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Commit 61f01dd941 ("x86_64, asm: Work around AMD SYSRET SS descriptor
attribute issue") makes AMD processors set SS to __KERNEL_DS in
__switch_to() to deal with cases when SS is NULL.
This breaks Xen PV guests who do not want to load SS with__KERNEL_DS.
Since the problem that the commit is trying to address would have to be
fixed in the hypervisor (if it in fact exists under Xen) there is no
reason to set X86_BUG_SYSRET_SS_ATTRS flag for PV VPCUs here.
This can be easily achieved by adding x86_hyper_xen_hvm.set_cpu_features
op which will clear this flag. (And since this structure is no longer
HVM-specific we should do some renaming).
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
If ACPI is disabled, acpi_gbl_FADT is not available, and and the build
breaks.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
Link: http://lkml.kernel.org/r/55479708.2000104@siemens.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Nothing changes those ops. Make the initializers readable while at it.
Reported-by: Krzysztof Kozlowski <k.kozlowski.k@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Offset that has been chosen for kaslr during kernel decompression can be
easily computed as a difference between _text and __START_KERNEL. We are
already making use of this in dump_kernel_offset() notifier and in
arch_crash_save_vmcoreinfo().
Introduce kaslr_offset() that makes this computation instead of hard-coding
it, so that other kernel code (such as live patching) can make use of it.
Also convert existing users to make use of it.
This patch is equivalent transofrmation without any effects on the resulting
code:
$ diff -u vmlinux.old.asm vmlinux.new.asm
--- vmlinux.old.asm 2015-04-28 17:55:19.520983368 +0200
+++ vmlinux.new.asm 2015-04-28 17:55:24.141206072 +0200
@@ -1,5 +1,5 @@
-vmlinux.old: file format elf64-x86-64
+vmlinux.new: file format elf64-x86-64
Disassembly of section .text:
$
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
This reverts commits 0a4e6be9ca
and 80f7fdb1c7.
The task migration notifier was originally introduced in order to support
the pvclock vsyscall with non-synchronized TSC, but KVM only supports it
with synchronized TSC. Hence, on KVM the race condition is only needed
due to a bad implementation on the host side, and even then it's so rare
that it's mostly theoretical.
As far as KVM is concerned it's possible to fix the host, avoiding the
additional complexity in the vDSO and the (re)introduction of the task
migration notifier.
Xen, on the other hand, hasn't yet implemented vsyscall support at
all, so we do not care about its plans for non-synchronized TSC.
Reported-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
AMD CPUs don't reinitialize the SS descriptor on SYSRET, so SYSRET with
SS == 0 results in an invalid usermode state in which SS is apparently
equal to __USER_DS but causes #SS if used.
Work around the issue by setting SS to __KERNEL_DS __switch_to, thus
ensuring that SYSRET never happens with SS set to NULL.
This was exposed by a recent vDSO cleanup.
Fixes: e7d6eefaaa x86/vdso32/syscall.S: Do not load __USER32_DS to %ss
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Peter Anvin <hpa@zytor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This series introduces preliminary ACPI 5.1 support to the arm64 kernel
using the "hardware reduced" profile. We don't support any peripherals
yet, so it's fairly limited in scope:
- Memory init (UEFI)
- ACPI discovery (RSDP via UEFI)
- CPU init (FADT)
- GIC init (MADT)
- SMP boot (MADT + PSCI)
- ACPI Kconfig options (dependent on EXPERT)
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABCgAGBQJVNOC2AAoJELescNyEwWM08dIH/1Pn5xa04wwNDn0MOpbuQMk2
kHM7hx69fbXflTJpnZRVyFBjRxxr5qilA7rljAFLnFeF8Fcll/s5VNy7ElHKLISq
CB0ywgUfOd/sFJH57rcc67pC1b/XuqTbE1u1NFwvp2R3j1kGAEJWNA6SyxIP4bbc
NO5jScx0lQOJ3rrPAXBW8qlGkeUk7TPOQJtMrpftNXlFLFrR63rPaEmMZ9dWepBF
aRE4GXPvyUhpyv5o9RvlN5l8bQttiRJ3f9QjyG7NYhX0PXH3DyvGUzYlk2IoZtID
v3ssCQH3uRsAZHIBhaTyNqFnUIaDR825bvGqyG/tj2Dt3kQZiF+QrfnU5D9TuMw=
=zLJn
-----END PGP SIGNATURE-----
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull initial ACPI support for arm64 from Will Deacon:
"This series introduces preliminary ACPI 5.1 support to the arm64
kernel using the "hardware reduced" profile. We don't support any
peripherals yet, so it's fairly limited in scope:
- MEMORY init (UEFI)
- ACPI discovery (RSDP via UEFI)
- CPU init (FADT)
- GIC init (MADT)
- SMP boot (MADT + PSCI)
- ACPI Kconfig options (dependent on EXPERT)
ACPI for arm64 has been in development for a while now and hardware
has been available that can boot with either FDT or ACPI tables. This
has been made possible by both changes to the ACPI spec to cater for
ARM-based machines (known as "hardware-reduced" in ACPI parlance) but
also a Linaro-driven effort to get this supported on top of the Linux
kernel. This pull request is the result of that work.
These changes allow us to initialise the CPUs, interrupt controller,
and timers via ACPI tables, with memory information and cmdline coming
from EFI. We don't support a hybrid ACPI/FDT scheme. Of course,
there is still plenty of work to do (a serial console would be nice!)
but I expect that to happen on a per-driver basis after this core
series has been merged.
Anyway, the diff stat here is fairly horrible, but splitting this up
and merging it via all the different subsystems would have been
extremely painful. Instead, we've got all the relevant Acks in place
and I've not seen anything other than trivial (Kconfig) conflicts in
-next (for completeness, I've included my resolution below). Nearly
half of the insertions fall under Documentation/.
So, we'll see how this goes. Right now, it all depends on EXPERT and
I fully expect people to use FDT by default for the immediate future"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (31 commits)
ARM64 / ACPI: make acpi_map_gic_cpu_interface() as void function
ARM64 / ACPI: Ignore the return error value of acpi_map_gic_cpu_interface()
ARM64 / ACPI: fix usage of acpi_map_gic_cpu_interface
ARM64: kernel: acpi: honour acpi=force command line parameter
ARM64: kernel: acpi: refactor ACPI tables init and checks
ARM64: kernel: psci: let ACPI probe PSCI version
ARM64: kernel: psci: factor out probe function
ACPI: move arm64 GSI IRQ model to generic GSI IRQ layer
ARM64 / ACPI: Don't unflatten device tree if acpi=force is passed
ARM64 / ACPI: additions of ACPI documentation for arm64
Documentation: ACPI for ARM64
ARM64 / ACPI: Enable ARM64 in Kconfig
XEN / ACPI: Make XEN ACPI depend on X86
ARM64 / ACPI: Select ACPI_REDUCED_HARDWARE_ONLY if ACPI is enabled on ARM64
clocksource / arch_timer: Parse GTDT to initialize arch timer
irqchip: Add GICv2 specific ACPI boot support
ARM64 / ACPI: Introduce ACPI_IRQ_MODEL_GIC and register device's gsi
ACPI / processor: Make it possible to get CPU hardware ID via GICC
ACPI / processor: Introduce phys_cpuid_t for CPU hardware ID
ARM64 / ACPI: Parse MADT for SMP initialization
...
Function __assign_irq_vector() is protected by vector_lock, so use
a global temporary cpu_mask to avoid allocating/freeing cpu_mask.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428978610-28986-34-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now we have dedicated asm/irqdomain.h, so move irqdomain specific
code into it.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/1428978610-28986-33-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We have 3 identical copies of the ioapic domain ops for acpi, mpparse,
and sfi. Have a global one in the io_apic code and be done with it.
To avoid include hell in io_apic.h, create a private irqdomain header
and include the generic irqdomain header from there.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: sfi-devel@simplefirmware.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <robh@kernel.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1428978610-28986-32-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
These functions are full of pointless indentations, useless comments
and even more useless printks.
Clean them up.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-31-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: x86@kernel.org
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
While looking at the printout issue, I stumbled more than once over
the various 0/1 assignments which are either commented in strange ways
or force to lookup the meaning.
Use proper constants and fix the misleading comments. While at it
remove pointless 0 assignments in native_disable_io_apic() which have
no value for understanding the code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/1428978610-28986-30-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Function mp_register_gsi() is only called once, so fold it into caller
acpi_register_gsi_ioapic(). Do the same for mp_unregister_gsi().
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Link: http://lkml.kernel.org/r/1428978610-28986-29-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Several fields in struct irq_cfg are private to vector.c, so move it
into dedicated data structure. This helps to hide implementation
details.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428978610-28986-27-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Link: http://lkml.kernel.org/r/1416901802-24211-35-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Joerg Roedel <jroedel@suse.de>
Now there's no user of apic_set_affinity(), so remove it. Also rename
vector_set_affinity() to apic_set_affinity() for consistency.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428978610-28986-25-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Function {assign|clear}_irq_vector() and apic_retrigger_irq() are only
used in vector.c, so make them static.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428978610-28986-24-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The SiS apic bug workaround is now obsolete as we cache the register
values for performance reasons.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-22-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
io_apic_ops.init() is either NULL, if IO-APIC support is disabled at
compile time or native_io_apic_init_mappings(). No point to have that
as we can achieve the same thing with an empty inline.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
x86_io_apic_ops.write is always set to native_io_apic_write(),
and nobody overrides it. So get rid of the indirection by changing
native_io_apic_write() as io_apic_write() and removing
x86_io_apic_ops.write.
Do the same for x86_io_apic_ops.modify and native_io_apic_modify().
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-19-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now there's no user of struct io_apic_irq_attr anymore, so remove it.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-18-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
There's no user of irq_alloc_hwirqs(), irq_alloc_hwirq(),
irq_free_hwirqs() and irq_free_hwirq() in x86 anymore, so remove
GENERIC_IRQ_LEGACY_ALLOC_HWIRQ and related code.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428978610-28986-8-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now there is no user of x86_io_apic_ops.eoi_ioapic_pin anymore, so remove
it.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: iommu@lists.linux-foundation.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-7-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now there is no user of x86_io_apic_ops.set_affinity anymore, so remove
it.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: iommu@lists.linux-foundation.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-6-git-send-email-jiang.liu@linux.intel.com
Now there is no user of x86_io_apic_ops.setup_entry anymore, so remove it.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: iommu@lists.linux-foundation.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-5-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now there is no user of x86_io_apic_ops.print_entries anymore, so remove
it.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: iommu@lists.linux-foundation.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Yijing Wang <wangyijing@huawei.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-4-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now nobody makes use of struct mp_pin_info, so remove it.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-3-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now we have converted to hierarchical irqdomain, so remove unused old
IOAPIC interfaces and code.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428978610-28986-2-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Convert IOAPIC driver to support and use hierarchical irqdomain
interfaces. It's a little big, but would break bisecting if we split
it into multiple patches.
Fold in a patch from Andy Shevchenko <andriy.shevchenko@linux.intel.com>
to make it bisectable.
http://lkml.org/lkml/2014/12/10/622
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: sfi-devel@simplefirmware.org
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Rob Herring <robh@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Link: http://lkml.kernel.org/r/1428905519-23704-38-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Introduce several helper functions, which will be used to enable
hierarchical irqdomain for IOAPIC.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428905519-23704-37-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Simplify the way to print IOAPIC entry content, so we can remove
native_io_apic_print_entries(), intel_ir_io_apic_print_entries()
and x86_io_apic_ops.print_entries() later.
Folded a patch from Thomas to fix errors in printed pin attributes,
http://www.spinics.net/lists/linux-tip-commits/msg26108.html
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428905519-23704-36-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
To support legacy ISA IRQs, we need to preallocate irq_cfg structures
for legacy ISA IRQs. Refine the way to allocate irq_cfg for legacy ISA
IRQs, so it's more friendly for the hierarchical irqdomain
implementation.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428905519-23704-35-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Implement required callbacks to prepare for enabling hierarchical
irqdomains on IOAPICs. After the conversion we can remove quite some
code from the old implementation.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428905519-23704-34-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Introduce helper functions to manipulate struct irq_alloc_info for
IOAPIC. Also add an extra parameter to IOAPIC interfaces to prepare
for hierarchical irqdomain. Function mp_set_gsi_attr() will be removed
once we have switched to hierarchical irqdomains.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Joerg Roedel <jroedel@suse.de>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Grant Likely <grant.likely@linaro.org>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Link: http://lkml.kernel.org/r/1428905519-23704-33-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now there's no user of pre_init_apic_IRQ0(), so remove it.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428905519-23704-32-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
MID has no PIC, but depending on the platform it requires the
abt_timer, which is connected to irq0. The timer is set up at
late_time_init().
But, looking at the MID code it seems, that there is no reason to do
so. The only code which might need the timer working is the TSC
calibration code, but thats a non issue on MID as that is using its
own empty calibration function. And check_timer() is not invoked
either because MID has no PIC and therefor no legacy irqs.
So if you look at intel_mid_time_init() then you'll see that in the
ARAT case the timer setup is skipped already. So until the point where
x86_init.timers.setup_percpu_clockev() is called for the boot cpu
nothing really needs a timer on MID.
According to the MID code the apbt horror is only used for moorestown.
Medfield and later use the local apic timer without the apbt nonsense.
The best thing we can do is to drop moorestown support and get rid of
that apbt nonsense alltogether.
I don't think anyone deeply cares about it not being supported from
3.18 on. The number of devices which sport a moorestown should be
pretty limited and the only relevant use case of those is to act as a
pocket heater with short battery life time. Its pretty pointless to
update kernels on pocket heaters except for bragging reasons.
If someone at Intel really thinks that we need to keep moorestown
alive for other than documentary and sentimental reasons, then we can
move the apbt setup to x86_init.timers.setup_percpu_clockev(). At that
point the IOAPIC is setup already, so it should just work.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se>
Link: http://lkml.kernel.org/r/1428905519-23704-30-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use common MSI interfaces instead of private implementations of the
same functionality to simplify DMAR/HPET driver implementation.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-28-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Implement irq_chip.irq_write_msi_msg for MSI/DMAR/HPET irq_chips, they
will be used to replace duplicated code.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-27-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Function irq_chip_compose_msi_msg() can achieve the same goal as
msi_update_msg(), so remove msi_update_msg().
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-26-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Simplify the way to deal with remapped MSI interrupts, so we can remove
irq_chip.irq_print_chip later. We simply change the name when the
setup detects that the parent domain is remapping.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-25-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Some irq_chip names use underscore, others use hyphen. So normalize them
to use hyphen as separator.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-24-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We have slightly changed the architecture interfaces to support htirq
PCI driver. It's safe because currently Hypertransport interrupt is
only enabled on x86 platforms.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-22-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Enhance DMAR code to support hierarchical irqdomain, it helps to make
the architecture more clear.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-21-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Refine the interfaces to create IRQ for DMAR unit. It's a preparation
for converting DMAR IRQ to hierarchical irqdomain on x86.
It also moves dmar_alloc_hwirq()/dmar_free_hwirq() from irq_remapping.h
to dmar.h. They are not irq_remapping specific.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: iommu@lists.linux-foundation.org
Cc: Vinod Koul <vinod.koul@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Link: http://lkml.kernel.org/r/1428905519-23704-20-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now MSI interrupt has been converted to new hierarchical irqdomain
interfaces, so remove legacy MSI related code and interfaces.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Yijing Wang <wangyijing@huawei.com>
Link: http://lkml.kernel.org/r/1428905519-23704-19-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Now MSI interrupt has been converted to new hierarchical irqdomain
interfaces, so remove legacy MSI related code and interfaces.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: iommu@lists.linux-foundation.org
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Yijing Wang <wangyijing@huawei.com>
Link: http://lkml.kernel.org/r/1428905519-23704-18-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
DMAR interrupt won't be remapped by interrupt remapping hardware,
so directly call native_compose_msi_msg() for DMAR IRQ to compose MSI
message data. This will help to simplify MSI code later.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-15-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Enhance MSI code to support hierarchical irqdomains, it helps to make
the architecture more clear.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Link: http://lkml.kernel.org/r/1428905519-23704-14-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use new irqdomain interfaces to allocate/free IRQ for DMAR and interrupt
remapping, so we can remove GENERIC_IRQ_LEGACY_ALLOC_HWIRQ later.
The private definitions of irq_alloc_hwirqs()/irq_free_hwirqs() are a
temporary solution, they will be removed once we have converted the
interrupt remapping driver to use irqdomain framework.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: iommu@lists.linux-foundation.org
Cc: Joerg Roedel <jroedel@suse.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Joerg Roedel <joro@8bytes.org>
Link: http://lkml.kernel.org/r/1428905519-23704-8-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use new irqdomain interfaces to allocate/free IRQ for HTIRQ, so we can
remove GENERIC_IRQ_LEGACY_ALLOC_HWIRQ later.
This patch changes the interfaces between arch independent PCI driver
and arch specific code. Currently HT_IRQ is only enabled on x86, so it
does not affect other architectures.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-7-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use new irqdomain interfaces to allocate/free IRQ for PCI MSI, so we
can remove GENERIC_IRQ_LEGACY_ALLOC_HWIRQ later.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1416894816-23245-5-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Use new irqdomain interfaces to allocate/free IRQ for HPET, so we can
remove GENERIC_IRQ_LEGACY_ALLOC_HWIRQ later.
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/1416894816-23245-4-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Abstract CPU local APIC as an interrupt controller and create an
irqdomain for it to manage CPU interrupt vectors. It's the base to
enable hierarchical irqdomains on x86 systems.
The final irqdomain hierarchy will look like this:
IOAPIC domain ----|
MSI/MSI-x domain ----> [Interrupt Remapping domain] -> CPU vector domain
HPET_IRQ domain ----| ^
|
DMAR domain ----------------------------------------------|
HT_IRQ domain ----------------------------------------------|
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Cc: Grant Likely <grant.likely@linaro.org>
Link: http://lkml.kernel.org/r/1428905519-23704-3-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cache destination CPU APIC ID into struct irq_cfg when assigning vector
for interrupt. Upper layer just needs to read the cached APIC ID instead
of calling apic->cpu_mask_to_apicid_and(), it helps to hide APIC driver
details from IOAPIC/HPET/MSI drivers..
Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Cohen <david.a.cohen@linux.intel.com>
Cc: Sander Eikelenboom <linux@eikelenboom.it>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/1428905519-23704-2-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
hrtimer_start() does not longer defer already expired timers to the
softirq. Get rid of the __hrtimer_start_range_ns() invocation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20150414203502.360555157@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
hrtimer_start() does not longer defer already expired timers to the
softirq. Get rid of the __hrtimer_start_range_ns() invocation.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: x86@kernel.org
Link: http://lkml.kernel.org/r/20150414203502.260487331@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This change makes the check exact (no more false positives
on "negative" addresses).
Andy explains:
"Canonical addresses either start with 17 zeros or 17 ones.
In the old code, we checked that the top (64-47) = 17 bits were all
zero. We did this by shifting right by 47 bits and making sure that
nothing was left.
In the new code, we're shifting left by (64 - 48) = 16 bits and then
signed shifting right by the same amount, this propagating the 17th
highest bit to all positions to its left. If we get the same value we
started with, then we're good to go."
While it isn't really important to be fully correct here -
almost all addresses we'll ever see will be userspace ones,
but OTOH it looks to be cheap enough: the new code uses
two more ALU ops but preserves %rcx, allowing to not
reload it from pt_regs->cx again.
On disassembly level, the changes are:
cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
shr $0x2f,%rcx -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
mov 0x58(%rsp),%rcx -> (eliminated)
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1429633649-20169-1-git-send-email-dvlasenk@redhat.com
[ Changelog massage. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This keeps all the related PCI IDs together in the driver where
they are used.
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1429644791-25724-1-git-send-email-sonnyrao@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This uncore is the same as the Haswell desktop part but uses a
different PCI ID.
Signed-off-by: Sonny Rao <sonnyrao@chromium.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1429569247-16697-1-git-send-email-sonnyrao@chromium.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The core_pmu does not define cpu_* callbacks, which handles
allocation of 'struct cpu_hw_events::shared_regs' data,
initialization of debug store and PMU_FL_EXCL_CNTRS counters.
While this probably won't happen on bare metal, virtual CPU can
define x86_pmu.extra_regs together with PMU version 1 and thus
be using core_pmu -> using shared_regs data without it being
allocated. That could could leave to following panic:
BUG: unable to handle kernel NULL pointer dereference at (null)
IP: [<ffffffff8152cd4f>] _spin_lock_irqsave+0x1f/0x40
SNIP
[<ffffffff81024bd9>] __intel_shared_reg_get_constraints+0x69/0x1e0
[<ffffffff81024deb>] intel_get_event_constraints+0x9b/0x180
[<ffffffff8101e815>] x86_schedule_events+0x75/0x1d0
[<ffffffff810586dc>] ? check_preempt_curr+0x7c/0x90
[<ffffffff810649fe>] ? try_to_wake_up+0x24e/0x3e0
[<ffffffff81064ba2>] ? default_wake_function+0x12/0x20
[<ffffffff8109eb16>] ? autoremove_wake_function+0x16/0x40
[<ffffffff810577e9>] ? __wake_up_common+0x59/0x90
[<ffffffff811a9517>] ? __d_lookup+0xa7/0x150
[<ffffffff8119db5f>] ? do_lookup+0x9f/0x230
[<ffffffff811a993a>] ? dput+0x9a/0x150
[<ffffffff8119c8f5>] ? path_to_nameidata+0x25/0x60
[<ffffffff8119e90a>] ? __link_path_walk+0x7da/0x1000
[<ffffffff8101d8f9>] ? x86_pmu_add+0xb9/0x170
[<ffffffff8101d7a7>] x86_pmu_commit_txn+0x67/0xc0
[<ffffffff811b07b0>] ? mntput_no_expire+0x30/0x110
[<ffffffff8119c731>] ? path_put+0x31/0x40
[<ffffffff8107c297>] ? current_fs_time+0x27/0x30
[<ffffffff8117d170>] ? mem_cgroup_get_reclaim_stat_from_page+0x20/0x70
[<ffffffff8111b7aa>] group_sched_in+0x13a/0x170
[<ffffffff81014a29>] ? sched_clock+0x9/0x10
[<ffffffff8111bac8>] ctx_sched_in+0x2e8/0x330
[<ffffffff8111bb7b>] perf_event_sched_in+0x6b/0xb0
[<ffffffff8111bc36>] perf_event_context_sched_in+0x76/0xc0
[<ffffffff8111eb3b>] perf_event_comm+0x1bb/0x2e0
[<ffffffff81195ee9>] set_task_comm+0x69/0x80
[<ffffffff81195fe1>] setup_new_exec+0xe1/0x2e0
[<ffffffff811ea68e>] load_elf_binary+0x3ce/0x1ab0
Adding cpu_(prepare|starting|dying) for core_pmu to have
shared_regs data allocated for core_pmu. AFAICS there's no harm
to initialize debug store and PMU_FL_EXCL_CNTRS either for
core_pmu.
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/20150421152623.GC13169@krava.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We don't use irq_enable_sysexit on 64-bit kernels any more.
Remove all the paravirt and Xen machinery to support it on
64-bit kernels.
Tested-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Denys Vlasenko <vda.linux@googlemail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/8a03355698fe5b94194e9e7360f19f91c1b2cf1f.1428100853.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Here's the big tty/serial driver update for 4.1-rc1.
It was delayed for a bit due to some questions surrounding some of the
console command line parsing changes that are in here. There's still
one tiny regression for people who were previously putting multiple
console command lines and expecting them all to be ignored for some odd
reason, but Peter is working on fixing that. If not, I'll send a revert
for the offending patch, but I have faith that Peter can address it.
Other than the console work here, there's the usual serial driver
updates and changes, and a buch of 8250 reworks to try to make that
driver easier to maintain over time, and have it support more devices in
the future.
All of these have been in linux-next for a while.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2
iEYEABECAAYFAlU2IcUACgkQMUfUDdst+ylFqACcC8LPhFEZg9aHn0hNUoqGK3rE
5dUAnR4b8r/NYqjVoE9FJZgZfB/TqVi1
=lyN/
-----END PGP SIGNATURE-----
Merge tag 'tty-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial updates from Greg KH:
"Here's the big tty/serial driver update for 4.1-rc1.
It was delayed for a bit due to some questions surrounding some of the
console command line parsing changes that are in here. There's still
one tiny regression for people who were previously putting multiple
console command lines and expecting them all to be ignored for some
odd reason, but Peter is working on fixing that. If not, I'll send a
revert for the offending patch, but I have faith that Peter can
address it.
Other than the console work here, there's the usual serial driver
updates and changes, and a buch of 8250 reworks to try to make that
driver easier to maintain over time, and have it support more devices
in the future.
All of these have been in linux-next for a while"
* tag 'tty-4.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (119 commits)
n_gsm: Drop unneeded cast on netdev_priv
sc16is7xx: expose RTS inversion in RS-485 mode
serial: 8250_pci: port failed after wakeup from S3
earlycon: 8250: Document kernel command line options
earlycon: 8250: Fix command line regression
earlycon: Fix __earlycon_table stride
tty: clean up the tty time logic a bit
serial: 8250_dw: only get the clock rate in one place
serial: 8250_dw: remove useless ACPI ID check
dmaengine: hsu: move memory allocation to GFP_NOWAIT
dmaengine: hsu: remove redundant pieces of code
serial: 8250_pci: add Intel Tangier support
dmaengine: hsu: add Intel Tangier PCI ID
serial: 8250_pci: replace switch-case by formula for Intel MID
serial: 8250_pci: replace switch-case by formula
tty: cpm_uart: replace CONFIG_8xx by CONFIG_CPM1
serial: jsm: some off by one bugs
serial: xuartps: Fix check in console_setup().
serial: xuartps: Get rid of register access macros.
serial: xuartps: Fix iobase use.
...
functions, prompted by their mis-use in staging.
With these function removed, all cpu functions should only iterate to
nr_cpu_ids, so we finally only allocate that many bits when cpumasks
are allocated offstack.
Thanks,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVNPMuAAoJENkgDmzRrbjx7ZIP/j65e6xs1jEyXR3WOYSdTU1x
bMo6JcII6O1oEZLgyKXgx9KiBg6uIIDta1NG/H/XIe354dwfHVsHvj5HHHQR5Xof
iRrjLOaHj4XglI3hvsk0eEEl3/OBBLgyo9bUwDvMF1fmr/9tW4caMs3Op6n7Evzm
YIvoAyeJ0A8BfEtOU5lXhcVIGmnHtSw0x6mdGXpXIBmWYQPCtdQP868s4lnl44w0
bSNpAYdzEqg64Ph3SK0prgWPrn5+5EiaAhV7HZzENZ5+o0DAdIXWq/W7uHyCWPKH
536cJDojec+nSUQkPYngngGprxrKO02aBcMw/3JGJ0tdCDj8yw3XAyVAFzz4hmMb
Lkmyv4QHHIILLvJ4ZRH5KHWCjjVBg41LNCs2H3HnoxFACdm0lZYKHsUAh2ucBVtU
Wb/eHmLxOG43UIkpX4yrhy3SfE1ZdnOVzEzOzPXtr51t8ojqk+bLFe/hJ6EkzrQX
X+90qHfBq+PMJlAnc3zdXHjxoJrL6KPWVwVvFrNeibgEKtVvy/BiwZkS6QceC1Ea
TatOYA5r6awFVHHQCooN1DGAxN5Juvu2SmdnTUA9ymsCNDghj1YUoAKRNP81u8Sa
pe3hco/63iCuPna+vlwNDU6SgsaMk9m0p+1n1BiDIfVJIkWYCNeG+u2gQkzbDKlQ
AJuKKQv1QuZiF0ylZ0wq
=VAgA
-----END PGP SIGNATURE-----
Merge tag 'cpumask-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull final removal of deprecated cpus_* cpumask functions from Rusty Russell:
"This is the final removal (after several years!) of the obsolete
cpus_* functions, prompted by their mis-use in staging.
With these function removed, all cpu functions should only iterate to
nr_cpu_ids, so we finally only allocate that many bits when cpumasks
are allocated offstack"
* tag 'cpumask-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux: (25 commits)
cpumask: remove __first_cpu / __next_cpu
cpumask: resurrect CPU_MASK_CPU0
linux/cpumask.h: add typechecking to cpumask_test_cpu
cpumask: only allocate nr_cpumask_bits.
Fix weird uses of num_online_cpus().
cpumask: remove deprecated functions.
mips: fix obsolete cpumask_of_cpu usage.
x86: fix more deprecated cpu function usage.
ia64: remove deprecated cpus_ usage.
powerpc: fix deprecated CPU_MASK_CPU0 usage.
CPU_MASK_ALL/CPU_MASK_NONE: remove from deprecated region.
staging/lustre/o2iblnd: Don't use cpus_weight
staging/lustre/libcfs: replace deprecated cpus_ calls with cpumask_
staging/lustre/ptlrpc: Do not use deprecated cpus_* functions
blackfin: fix up obsolete cpu function usage.
parisc: fix up obsolete cpu function usage.
tile: fix up obsolete cpu function usage.
arm64: fix up obsolete cpu function usage.
mips: fix up obsolete cpu function usage.
x86: fix up obsolete cpu function usage.
...
Pull PMEM driver from Ingo Molnar:
"This is the initial support for the pmem block device driver:
persistent non-volatile memory space mapped into the system's physical
memory space as large physical memory regions.
The driver is based on Intel code, written by Ross Zwisler, with fixes
by Boaz Harrosh, integrated with x86 e820 memory resource management
and tidied up by Christoph Hellwig.
Note that there were two other separate pmem driver submissions to
lkml: but apparently all parties (Ross Zwisler, Boaz Harrosh) are
reasonably happy with this initial version.
This version enables minimal support that enables persistent memory
devices out in the wild to work as block devices, identified through a
magic (non-standard) e820 flag and auto-discovered if
CONFIG_X86_PMEM_LEGACY=y, or added explicitly through manipulating the
memory maps via the "memmap=..." boot option with the new, special '!'
modifier character.
Limitations: this is a regular block device, and since the pmem areas
are not struct page backed, they are invisible to the rest of the
system (other than the block IO device), so direct IO to/from pmem
areas, direct mmap() or XIP is not possible yet. The page cache will
also shadow and double buffer pmem contents, etc.
Initial support is for x86"
* 'x86-pmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
drivers/block/pmem: Fix 32-bit build warning in pmem_alloc()
drivers/block/pmem: Add a driver for persistent memory
x86/mm: Add support for the non-standard protected e820 type
Pull x86 fixes from Ingo Molnar:
"This tree includes:
- an FPU related crash fix
- a ptrace fix (with matching testcase in tools/testing/selftests/)
- an x86 Kconfig DMA-config defaults tweak to better avoid
non-working drivers"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
config: Enable NEED_DMA_MAP_STATE by default when SWIOTLB is selected
x86/fpu: Load xsave pointer *after* initialization
x86/ptrace: Fix the TIF_FORCED_TF logic in handle_signal()
x86, selftests: Add single_step_syscall test
Pull perf updates from Ingo Molnar:
"This update has mostly fixes, but also other bits:
- perf tooling fixes
- PMU driver fixes
- Intel Broadwell PMU driver HW-enablement for LBR callstacks
- a late coming 'perf kmem' tool update that enables it to also
analyze page allocation data. Note, this comes with MM tracepoint
changes that we believe to not break anything: because it changes
the formerly opaque 'struct page *' field that uniquely identifies
pages to 'pfn' which identifies pages uniquely too, but isn't as
opaque and can be used for other purposes as well"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/pt: Fix and clean up error handling in pt_event_add()
perf/x86/intel: Add Broadwell support for the LBR callstack
perf/x86/intel/rapl: Fix energy counter measurements but supporing per domain energy units
perf/x86/intel: Fix Core2,Atom,NHM,WSM cycles:pp events
perf/x86: Fix hw_perf_event::flags collision
perf probe: Fix segfault when probe with lazy_line to file
perf probe: Find compilation directory path for lazy matching
perf probe: Set retprobe flag when probe in address-based alternative mode
perf kmem: Analyze page allocator events also
tracing, mm: Record pfn instead of pointer to struct page
Dan Carpenter reported that pt_event_add() has buggy
error handling logic: it returns 0 instead of -EBUSY when
it fails to start a newly added event.
Furthermore, the control flow in this function is messy,
with cleanup labels mixed with direct returns.
Fix the bug and clean up the code by converting it to
a straight fast path for the regular non-failing case,
plus a clear sequence of cascading goto labels to do
all cleanup.
NOTE: I materially changed the existing clean up logic in the
pt_event_start() failure case to use the direct
perf_aux_output_end() path, not pt_event_del(), because
perf_aux_output_end() is enough here.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20150416103830.GB7847@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
So I was playing with gdb today and did this simple thing:
gdb /bin/ls
...
(gdb) run
Box exploded with this splat:
BUG: unable to handle kernel NULL pointer dereference at 00000000000001d0
IP: [<ffffffff8100fe5a>] xstateregs_get+0x7a/0x120
[...]
Call Trace:
ptrace_regset
ptrace_request
? wait_task_inactive
? preempt_count_sub
arch_ptrace
? ptrace_get_task_struct
SyS_ptrace
system_call_fastpath
... because we do cache &target->thread.fpu.state->xsave into the
local variable xsave but that pointer is NULL at that time and
it gets initialized later, in init_fpu(), see:
e7f180dcd8 ("x86/fpu: Change xstateregs_get()/set() to use ->xsave.i387 rather than ->fxsave")
The fix is simple: load xsave *after* init_fpu() has run.
Also do the same in xstateregs_set(), as suggested by Oleg Nesterov.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1429209697-5902-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Same as Haswell, Broadwell also support the LBR callstack.
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1427962377-40955-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
RAPL energy hardware unit can vary within a single CPU package, e.g.
HSW server DRAM has a fixed energy unit of 15.3 uJ (2^-16) whereas
the unit on other domains can be enumerated from power unit MSR.
There might be other variations in the future, this patch adds
per cpu model quirk to allow special handling of certain cpus.
hw_unit is also removed from per cpu data since it is not per cpu
and the sampling rate for energy counter is typically not high.
Without this patch, DRAM domain on HSW servers will be counted
4x higher than the real energy counter.
Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Stephane Eranian <eranian@google.com>
Cc: Andi Kleen <andi.kleen@intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1427405325-780-1-git-send-email-jacob.jun.pan@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Ingo reported that cycles:pp didn't work for him on some machines.
It turns out that in this commit:
af4bdcf675 perf/x86/intel: Disallow flags for most Core2/Atom/Nehalem/Westmere events
Andi forgot to explicitly allow that event when he
disabled event flags for PEBS on those uarchs.
Reported-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Fixes: af4bdcf675 ("perf/x86/intel: Disallow flags for most Core2/Atom/Nehalem/Westmere events")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Somehow we ended up with overlapping flags when merging the
RDPMC control flag - this is bad, fix it.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When the TIF_SINGLESTEP tracee dequeues a signal,
handle_signal() clears TIF_FORCED_TF and X86_EFLAGS_TF but
leaves TIF_SINGLESTEP set.
If the tracer does PTRACE_SINGLESTEP again, enable_single_step()
sets X86_EFLAGS_TF but not TIF_FORCED_TF. This means that the
subsequent PTRACE_CONT doesn't not clear X86_EFLAGS_TF, and the
tracee gets the wrong SIGTRAP.
Test-case (needs -O2 to avoid prologue insns in signal handler):
#include <unistd.h>
#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/user.h>
#include <assert.h>
#include <stddef.h>
void handler(int n)
{
asm("nop");
}
int child(void)
{
assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
signal(SIGALRM, handler);
kill(getpid(), SIGALRM);
return 0x23;
}
void *getip(int pid)
{
return (void*)ptrace(PTRACE_PEEKUSER, pid,
offsetof(struct user, regs.rip), 0);
}
int main(void)
{
int pid, status;
pid = fork();
if (!pid)
return child();
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGALRM);
assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
assert((getip(pid) - (void*)handler) == 0);
assert(ptrace(PTRACE_SINGLESTEP, pid, 0, SIGALRM) == 0);
assert(wait(&status) == pid);
assert(WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP);
assert((getip(pid) - (void*)handler) == 1);
assert(ptrace(PTRACE_CONT, pid, 0,0) == 0);
assert(wait(&status) == pid);
assert(WIFEXITED(status) && WEXITSTATUS(status) == 0x23);
return 0;
}
The last assert() fails because PTRACE_CONT wrongly triggers
another single-step and X86_EFLAGS_TF can't be cleared by
debugger until the tracee does sys_rt_sigreturn().
Change handle_signal() to do user_disable_single_step() if
stepping, we do not need to preserve TIF_SINGLESTEP because we
are going to do ptrace_notify(), and it is simply wrong to leak
this bit.
While at it, change the comment to explain why we also need to
clear TF unconditionally after setup_rt_frame().
Note: in the longer term we should probably change
setup_sigcontext() to use get_flags() and then just remove this
user_disable_single_step(). And, the state of TIF_FORCED_TF can
be wrong after restore_sigcontext() which can set/clear TF, this
needs another fix.
This fix fixes the 'single_step_syscall_32' testcase in
the x86 testsuite:
Before:
~/linux/tools/testing/selftests/x86> ./single_step_syscall_32
[RUN] Set TF and check nop
[OK] Survived with TF set and 9 traps
[RUN] Set TF and check int80
[OK] Survived with TF set and 9 traps
[RUN] Set TF and check a fast syscall
[WARN] Hit 10000 SIGTRAPs with si_addr 0xf7789cc0, ip 0xf7789cc0
Trace/breakpoint trap (core dumped)
After:
~/linux/linux/tools/testing/selftests/x86> ./single_step_syscall_32
[RUN] Set TF and check nop
[OK] Survived with TF set and 9 traps
[RUN] Set TF and check int80
[OK] Survived with TF set and 9 traps
[RUN] Set TF and check a fast syscall
[OK] Survived with TF set and 39 traps
[RUN] Fast syscall with TF cleared
[OK] Nothing unexpected happened
Reported-by: Evan Teran <eteran@alum.rit.edu>
Reported-by: Pedro Alves <palves@redhat.com>
Tested-by: Andres Freund <andres@anarazel.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
[ Added x86 self-test info. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge second patchbomb from Andrew Morton:
- the rest of MM
- various misc bits
- add ability to run /sbin/reboot at reboot time
- printk/vsprintf changes
- fiddle with seq_printf() return value
* akpm: (114 commits)
parisc: remove use of seq_printf return value
lru_cache: remove use of seq_printf return value
tracing: remove use of seq_printf return value
cgroup: remove use of seq_printf return value
proc: remove use of seq_printf return value
s390: remove use of seq_printf return value
cris fasttimer: remove use of seq_printf return value
cris: remove use of seq_printf return value
openrisc: remove use of seq_printf return value
ARM: plat-pxa: remove use of seq_printf return value
nios2: cpuinfo: remove use of seq_printf return value
microblaze: mb: remove use of seq_printf return value
ipc: remove use of seq_printf return value
rtc: remove use of seq_printf return value
power: wakeup: remove use of seq_printf return value
x86: mtrr: if: remove use of seq_printf return value
linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
MAINTAINERS: CREDITS: remove Stefano Brivio from B43
.mailmap: add Ricardo Ribalda
CREDITS: add Ricardo Ribalda Delgado
...
The seq_printf return value, because it's frequently misused,
will eventually be converted to void.
See: commit 1f33c41c03 ("seq_file: Rename seq_overflow() to
seq_has_overflowed() and make public")
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull exec domain removal from Richard Weinberger:
"This series removes execution domain support from Linux.
The idea behind exec domains was to support different ABIs. The
feature was never complete nor stable. Let's rip it out and make the
kernel signal handling code less complicated"
* 'exec_domain_rip_v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/misc: (27 commits)
arm64: Removed unused variable
sparc: Fix execution domain removal
Remove rest of exec domains.
arch: Remove exec_domain from remaining archs
arc: Remove signal translation and exec_domain
xtensa: Remove signal translation and exec_domain
xtensa: Autogenerate offsets in struct thread_info
x86: Remove signal translation and exec_domain
unicore32: Remove signal translation and exec_domain
um: Remove signal translation and exec_domain
tile: Remove signal translation and exec_domain
sparc: Remove signal translation and exec_domain
sh: Remove signal translation and exec_domain
s390: Remove signal translation and exec_domain
mn10300: Remove signal translation and exec_domain
microblaze: Remove signal translation and exec_domain
m68k: Remove signal translation and exec_domain
m32r: Remove signal translation and exec_domain
m32r: Autogenerate offsets in struct thread_info
frv: Remove signal translation and exec_domain
...
Merge common values for 32-bit native and compat.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Link: http://lkml.kernel.org/r/1428844486-6638-1-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge first patchbomb from Andrew Morton:
- arch/sh updates
- ocfs2 updates
- kernel/watchdog feature
- about half of mm/
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (122 commits)
Documentation: update arch list in the 'memtest' entry
Kconfig: memtest: update number of test patterns up to 17
arm: add support for memtest
arm64: add support for memtest
memtest: use phys_addr_t for physical addresses
mm: move memtest under mm
mm, hugetlb: abort __get_user_pages if current has been oom killed
mm, mempool: do not allow atomic resizing
memcg: print cgroup information when system panics due to panic_on_oom
mm: numa: remove migrate_ratelimited
mm: fold arch_randomize_brk into ARCH_HAS_ELF_RANDOMIZE
mm: split ET_DYN ASLR from mmap ASLR
s390: redefine randomize_et_dyn for ELF_ET_DYN_BASE
mm: expose arch_mmap_rnd when available
s390: standardize mmap_rnd() usage
powerpc: standardize mmap_rnd() usage
mips: extract logic for mmap_rnd()
arm64: standardize mmap_rnd() usage
x86: standardize mmap_rnd() usage
arm: factor out mmap ASLR into mmap_rnd
...
We would want to use number of page table level to define mm_struct.
Let's expose it as CONFIG_PGTABLE_LEVELS.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Have kvm_guest_init() use hardlockup_detector_disable() instead of
watchdog_enable_hardlockup_detector(false).
Remove the watchdog_hardlockup_detector_is_enabled() and the
watchdog_enable_hardlockup_detector() function which are no longer needed.
Signed-off-by: Ulrich Obergfell <uobergfe@redhat.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull perf changes from Ingo Molnar:
"Core kernel changes:
- One of the more interesting features in this cycle is the ability
to attach eBPF programs (user-defined, sandboxed bytecode executed
by the kernel) to kprobes.
This allows user-defined instrumentation on a live kernel image
that can never crash, hang or interfere with the kernel negatively.
(Right now it's limited to root-only, but in the future we might
allow unprivileged use as well.)
(Alexei Starovoitov)
- Another non-trivial feature is per event clockid support: this
allows, amongst other things, the selection of different clock
sources for event timestamps traced via perf.
This feature is sought by people who'd like to merge perf generated
events with external events that were measured with different
clocks:
- cluster wide profiling
- for system wide tracing with user-space events,
- JIT profiling events
etc. Matching perf tooling support is added as well, available via
the -k, --clockid <clockid> parameter to perf record et al.
(Peter Zijlstra)
Hardware enablement kernel changes:
- x86 Intel Processor Trace (PT) support: which is a hardware tracer
on steroids, available on Broadwell CPUs.
The hardware trace stream is directly output into the user-space
ring-buffer, using the 'AUX' data format extension that was added
to the perf core to support hardware constraints such as the
necessity to have the tracing buffer physically contiguous.
This patch-set was developed for two years and this is the result.
A simple way to make use of this is to use BTS tracing, the PT
driver emulates BTS output - available via the 'intel_bts' PMU.
More explicit PT specific tooling support is in the works as well -
will probably be ready by 4.2.
(Alexander Shishkin, Peter Zijlstra)
- x86 Intel Cache QoS Monitoring (CQM) support: this is a hardware
feature of Intel Xeon CPUs that allows the measurement and
allocation/partitioning of caches to individual workloads.
These kernel changes expose the measurement side as a new PMU
driver, which exposes various QoS related PMU events. (The
partitioning change is work in progress and is planned to be merged
as a cgroup extension.)
(Matt Fleming, Peter Zijlstra; CPU feature detection by Peter P
Waskiewicz Jr)
- x86 Intel Haswell LBR call stack support: this is a new Haswell
feature that allows the hardware recording of call chains, plus
tooling support. To activate this feature you have to enable it
via the new 'lbr' call-graph recording option:
perf record --call-graph lbr
perf report
or:
perf top --call-graph lbr
This hardware feature is a lot faster than stack walk or dwarf
based unwinding, but has some limitations:
- It reuses the current LBR facility, so LBR call stack and
branch record can not be enabled at the same time.
- It is only available for user-space callchains.
(Yan, Zheng)
- x86 Intel Broadwell CPU support and various event constraints and
event table fixes for earlier models.
(Andi Kleen)
- x86 Intel HT CPUs event scheduling workarounds. This is a complex
CPU bug affecting the SNB,IVB,HSW families that results in counter
value corruption. The mitigation code is automatically enabled and
is transparent.
(Maria Dimakopoulou, Stephane Eranian)
The perf tooling side had a ton of changes in this cycle as well, so
I'm only able to list the user visible changes here, in addition to
the tooling changes outlined above:
User visible changes affecting all tools:
- Improve support of compressed kernel modules (Jiri Olsa)
- Save DSO loading errno to better report errors (Arnaldo Carvalho de Melo)
- Bash completion for subcommands (Yunlong Song)
- Add 'I' event modifier for perf_event_attr.exclude_idle bit (Jiri Olsa)
- Support missing -f to override perf.data file ownership. (Yunlong Song)
- Show the first event with an invalid filter (David Ahern, Arnaldo Carvalho de Melo)
User visible changes in individual tools:
'perf data':
New tool for converting perf.data to other formats, initially
for the CTF (Common Trace Format) from LTTng (Jiri Olsa,
Sebastian Siewior)
'perf diff':
Add --kallsyms option (David Ahern)
'perf list':
Allow listing events with 'tracepoint' prefix (Yunlong Song)
Sort the output of the command (Yunlong Song)
'perf kmem':
Respect -i option (Jiri Olsa)
Print big numbers using thousands' group (Namhyung Kim)
Allow -v option (Namhyung Kim)
Fix alignment of slab result table (Namhyung Kim)
'perf probe':
Support multiple probes on different binaries on the same command line (Masami Hiramatsu)
Support unnamed union/structure members data collection. (Masami Hiramatsu)
Check kprobes blacklist when adding new events. (Masami Hiramatsu)
'perf record':
Teach 'perf record' about perf_event_attr.clockid (Peter Zijlstra)
Support recording running/enabled time (Andi Kleen)
'perf sched':
Improve the performance of 'perf sched replay' on high CPU core count machines (Yunlong Song)
'perf report' and 'perf top':
Allow annotating entries in callchains in the hists browser (Arnaldo Carvalho de Melo)
Indicate which callchain entries are annotated in the
TUI hists browser (Arnaldo Carvalho de Melo)
Add pid/tid filtering to 'report' and 'script' commands (David Ahern)
Consider PERF_RECORD_ events with cpumode == 0 in 'perf top', removing one
cause of long term memory usage buildup, i.e. not processing PERF_RECORD_EXIT
events (Arnaldo Carvalho de Melo)
'perf stat':
Report unsupported events properly (Suzuki K. Poulose)
Output running time and run/enabled ratio in CSV mode (Andi Kleen)
'perf trace':
Handle legacy syscalls tracepoints (David Ahern, Arnaldo Carvalho de Melo)
Only insert blank duration bracket when tracing syscalls (Arnaldo Carvalho de Melo)
Filter out the trace pid when no threads are specified (Arnaldo Carvalho de Melo)
Dump stack on segfaults (Arnaldo Carvalho de Melo)
No need to explicitely enable evsels for workload started from perf, let it
be enabled via perf_event_attr.enable_on_exec, removing some events that take
place in the 'perf trace' before a workload is really started by it.
(Arnaldo Carvalho de Melo)
Allow mixing with tracepoints and suppressing plain syscalls. (Arnaldo Carvalho de Melo)
There's also been a ton of infrastructure work done, such as the
split-out of perf's build system into tools/build/ and other changes -
see the shortlog and changelog for details"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (358 commits)
perf/x86/intel/pt: Clean up the control flow in pt_pmu_hw_init()
perf evlist: Fix type for references to data_head/tail
perf probe: Check the orphaned -x option
perf probe: Support multiple probes on different binaries
perf buildid-list: Fix segfault when show DSOs with hits
perf tools: Fix cross-endian analysis
perf tools: Fix error path to do closedir() when synthesizing threads
perf tools: Fix synthesizing fork_event.ppid for non-main thread
perf tools: Add 'I' event modifier for exclude_idle bit
perf report: Don't call map__kmap if map is NULL.
perf tests: Fix attr tests
perf probe: Fix ARM 32 building error
perf tools: Merge all perf_event_attr print functions
perf record: Add clockid parameter
perf sched replay: Use replay_repeat to calculate the runavg of cpu usage instead of the default value 10
perf sched replay: Support using -f to override perf.data file ownership
perf sched replay: Fix the EMFILE error caused by the limitation of the maximum open files
perf sched replay: Handle the dead halt of sem_wait when create_tasks() fails for any task
perf sched replay: Fix the segmentation fault problem caused by pr_err in threads
perf sched replay: Realloc the memory of pid_to_task stepwise to adapt to the different pid_max configurations
...
Pull NOHZ changes from Ingo Molnar:
"This tree adds full dynticks support to KVM guests (support the
disabling of the timer tick on the guest). The main missing piece was
the recognition of guest execution as RCU extended quiescent state and
related changes"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kvm,rcu,nohz: use RCU extended quiescent state when running KVM guest
context_tracking: Export context_tracking_user_enter/exit
context_tracking: Run vtime_user_enter/exit only when state == CONTEXT_USER
context_tracking: Add stub context_tracking_is_enabled
context_tracking: Generalize context tracking APIs to support user and guest
context_tracking: Rename context symbols to prepare for transition state
ppc: Remove unused cpp symbols in kvm headers
Pull RCU changes from Ingo Molnar:
"The main changes in this cycle were:
- changes permitting use of call_rcu() and friends very early in
boot, for example, before rcu_init() is invoked.
- add in-kernel API to enable and disable expediting of normal RCU
grace periods.
- improve RCU's handling of (hotplug-) outgoing CPUs.
- NO_HZ_FULL_SYSIDLE fixes.
- tiny-RCU updates to make it more tiny.
- documentation updates.
- miscellaneous fixes"
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
cpu: Provide smpboot_thread_init() on !CONFIG_SMP kernels as well
cpu: Defer smpboot kthread unparking until CPU known to scheduler
rcu: Associate quiescent-state reports with grace period
rcu: Yet another fix for preemption and CPU hotplug
rcu: Add diagnostics to grace-period cleanup
rcutorture: Default to grace-period-initialization delays
rcu: Handle outgoing CPUs on exit from idle loop
cpu: Make CPU-offline idle-loop transition point more precise
rcu: Eliminate ->onoff_mutex from rcu_node structure
rcu: Process offlining and onlining only at grace-period start
rcu: Move rcu_report_unblock_qs_rnp() to common code
rcu: Rework preemptible expedited bitmask handling
rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs
rcutorture: Enable slow grace-period initializations
rcu: Provide diagnostic option to slow down grace-period initialization
rcu: Detect stalls caused by failure to propagate up rcu_node tree
rcu: Eliminate empty HOTPLUG_CPU ifdef
rcu: Simplify sync_rcu_preempt_exp_init()
rcu: Put all orphan-callback-related code under same comment
rcu: Consolidate offline-CPU callback initialization
...
Pull trivial tree from Jiri Kosina:
"Usual trivial tree updates. Nothing outstanding -- mostly printk()
and comment fixes and unused identifier removals"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
goldfish: goldfish_tty_probe() is not using 'i' any more
powerpc: Fix comment in smu.h
qla2xxx: Fix printks in ql_log message
lib: correct link to the original source for div64_u64
si2168, tda10071, m88ds3103: Fix firmware wording
usb: storage: Fix printk in isd200_log_config()
qla2xxx: Fix printk in qla25xx_setup_mode
init/main: fix reset_device comment
ipwireless: missing assignment
goldfish: remove unreachable line of code
coredump: Fix do_coredump() comment
stacktrace.h: remove duplicate declaration task_struct
smpboot.h: Remove unused function prototype
treewide: Fix typo in printk messages
treewide: Fix typo in printk messages
mod_devicetable: fix comment for match_flags
Pull x86 RAS changes from Ingo Molnar:
"The main changes in this cycle were:
- Simplify the CMCI storm logic on Intel CPUs after yet another
report about a race in the code (Borislav Petkov)
- Enable the MCE threshold irq on AMD CPUs by default (Aravind
Gopalakrishnan)
- Add AMD-specific MCE-severity grading function. Further error
recovery actions will be based on its output (Aravind Gopalakrishnan)
- Documentation updates (Borislav Petkov)
- ... assorted fixes and cleanups"
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce/severity: Fix warning about indented braces
x86/mce: Define mce_severity function pointer
x86/mce: Add an AMD severities-grading function
x86/mce: Reindent __mcheck_cpu_apply_quirks() properly
x86/mce: Use safe MSR accesses for AMD quirk
x86/MCE/AMD: Enable thresholding interrupts by default if supported
x86/MCE: Make mce_panic() fatal machine check msg in the same pattern
x86/MCE/intel: Cleanup CMCI storm logic
Documentation/acpi/einj: Correct and streamline text
x86/MCE/AMD: Drop bogus const modifier from AMD's bank4_names()
Pull x86 mm changes from Ingo Molnar:
"The main changes in this cycle were:
- reduce the x86/32 PAE per task PGD allocation overhead from 4K to
0.032k (Fenghua Yu)
- early_ioremap/memunmap() usage cleanups (Juergen Gross)
- gbpages support cleanups (Luis R Rodriguez)
- improve AMD Bulldozer (family 0x15) ASLR I$ aliasing workaround to
increase randomization by 3 bits (per bootup) (Hector
Marco-Gisbert)
- misc fixlets"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Improve AMD Bulldozer ASLR workaround
x86/mm/pat: Initialize __cachemode2pte_tbl[] and __pte2cachemode_tbl[] in a bit more readable fashion
init.h: Clean up the __setup()/early_param() macros
x86/mm: Simplify probe_page_size_mask()
x86/mm: Further simplify 1 GB kernel linear mappings handling
x86/mm: Use early_param_on_off() for direct_gbpages
init.h: Add early_param_on_off()
x86/mm: Simplify enabling direct_gbpages
x86/mm: Use IS_ENABLED() for direct_gbpages
x86/mm: Unexport set_memory_ro() and set_memory_rw()
x86/mm, efi: Use early_ioremap() in arch/x86/platform/efi/efi-bgrt.c
x86/mm: Use early_memunmap() instead of early_iounmap()
x86/mm/pat: Ensure different messages in STRICT_DEVMEM and PAT cases
x86/mm: Reduce PAE-mode per task pgd allocation overhead from 4K to 32 bytes
Pull x86 microcode changes from Ingo Molnar:
"Microcode driver updates: mostly cleanups but also some fixes
(Borislav Petkov)"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/microcode/amd: Drop the pci_ids.h dependency
x86/microcode/intel: Fix printing of microcode blobs in show_saved_mc()
x86/microcode/intel: Check scan_microcode()'s retval
x86/microcode/intel: Sanitize microcode_pointer()
x86/microcode/intel: Move mc arg last in get_matching_{microcode|sig}
x86/microcode/intel: Simplify generic_load_microcode_early()
x86/microcode: Consolidate family,model, ... code
x86/microcode/intel: Rename update_match_revision()
x86/microcode/intel: Sanitize _save_mc()
x86/microcode/intel: Make _save_mc() return the updated saved count
x86/microcode/intel: Simplify load_ucode_intel_bsp()
x86/microcode/intel: Get rid of last arg to load_ucode_intel_bsp()
x86/microcode/intel: Do the mc_saved_src NULL check first
x86/microcode/intel: Check if microcode was found before applying
x86/microcode/intel: Fix out of bounds memory access to the extended header
Pull x86 fpu changes from Ingo Molnar:
"Various x86 FPU handling cleanups, refactorings and fixes (Borislav
Petkov, Oleg Nesterov, Rik van Riel)"
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
x86/fpu: Kill eager_fpu_init_bp()
x86/fpu: Don't allocate fpu->state for swapper/0
x86/fpu: Rename drop_init_fpu() to fpu_reset_state()
x86/fpu: Fold __drop_fpu() into its sole user
x86/fpu: Don't abuse drop_init_fpu() in flush_thread()
x86/fpu: Use restore_init_xstate() instead of math_state_restore() on kthread exec
x86/fpu: Introduce restore_init_xstate()
x86/fpu: Document user_fpu_begin()
x86/fpu: Factor out memset(xstate, 0) in fpu_finit() paths
x86/fpu: Change xstateregs_get()/set() to use ->xsave.i387 rather than ->fxsave
x86/fpu: Don't abuse FPU in kernel threads if use_eager_fpu()
x86/fpu: Always allow FPU in interrupt if use_eager_fpu()
x86/fpu: __kernel_fpu_begin() should clear fpu_owner_task even if use_eager_fpu()
x86/fpu: Also check fpu_lazy_restore() when use_eager_fpu()
x86/fpu: Use task_disable_lazy_fpu_restore() helper
x86/fpu: Use an explicit if/else in switch_fpu_prepare()
x86/fpu: Introduce task_disable_lazy_fpu_restore() helper
x86/fpu: Move lazy restore functions up a few lines
x86/fpu: Change math_error() to use unlazy_fpu(), kill (now) unused save_init_fpu()
x86/fpu: Don't do __thread_fpu_end() if use_eager_fpu()
...
Pull x86 debug changes from Ingo Molnar:
"Stack printing fixlets"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kernel: Use kstack_end() in dumpstack_64.c
x86/kernel: Fix output of show_stack_log_lvl()
Pull x86 cacheinfo sysfs changes from Ingo Molnar:
"This tree converts the x86 cacheinfo sysfs code to use the generic
code in drivers/base/cacheinfo.c.
It's not intended to change the sysfs ABI:
'This patch neither alters any existing sysfs entries nor their
formating, however since the generic cacheinfo has switched to
use the device attributes instead of the traditional raw
kobjects, a directory named 'power' along with its standard
attributes are added similar to any other device'"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu/cacheinfo: Fix cache_get_priv_group() for Intel processors
x86/cacheinfo: Move cacheinfo sysfs code to generic infrastructure
Pull x86 cleanups from Ingo Molnar:
"Various cleanups"
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/iommu: Fix header comments regarding standard and _FINISH macros
x86/earlyprintk: Put CONFIG_PCI-only functions under the #ifdef
x86: Fix up obsolete __cpu_set() function usage
Pull x86 boot changes from Ingo Molnar:
"A number of cleanups"
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Standardize strcmp()
x86/boot/64: Remove pointless early_printk() message
x86/boot/video: Move the 'video_segment' variable to video.c
Pull x86 asm changes from Ingo Molnar:
"There were lots of changes in this development cycle:
- over 100 separate cleanups, restructuring changes, speedups and
fixes in the x86 system call, irq, trap and other entry code, part
of a heroic effort to deobfuscate a decade old spaghetti asm code
and its C code dependencies (Denys Vlasenko, Andy Lutomirski)
- alternatives code fixes and enhancements (Borislav Petkov)
- simplifications and cleanups to the compat code (Brian Gerst)
- signal handling fixes and new x86 testcases (Andy Lutomirski)
- various other fixes and cleanups
By their nature many of these changes are risky - we tried to test
them well on many different x86 systems (there are no known
regressions), and they are split up finely to help bisection - but
there's still a fair bit of residual risk left so caveat emptor"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (148 commits)
perf/x86/64: Report regs_user->ax too in get_regs_user()
perf/x86/64: Simplify regs_user->abi setting code in get_regs_user()
perf/x86/64: Do report user_regs->cx while we are in syscall, in get_regs_user()
perf/x86/64: Do not guess user_regs->cs, ss, sp in get_regs_user()
x86/asm/entry/32: Tidy up JNZ instructions after TESTs
x86/asm/entry/64: Reduce padding in execve stubs
x86/asm/entry/64: Remove GET_THREAD_INFO() in ret_from_fork
x86/asm/entry/64: Simplify jumps in ret_from_fork
x86/asm/entry/64: Remove a redundant jump
x86/asm/entry/64: Optimize [v]fork/clone stubs
x86/asm/entry: Zero EXTRA_REGS for stub32_execve() too
x86/asm/entry/64: Move stub_x32_execvecloser() to stub_execveat()
x86/asm/entry/64: Use common code for rt_sigreturn() epilogue
x86/asm/entry/64: Add forgotten CFI annotation
x86/asm/entry/irq: Simplify interrupt dispatch table (IDT) layout
x86/asm/entry/64: Move opportunistic sysret code to syscall code path
x86, selftests: Add sigreturn selftest
x86/alternatives: Guard NOPs optimization
x86/asm/entry: Clear EXTRA_REGS for all executable formats
x86/signal: Remove pax argument from restore_sigcontext
...
Pull timer updates from Ingo Molnar:
"The main changes in this cycle were:
- clockevents state machine cleanups and enhancements (Viresh Kumar)
- clockevents broadcast notifier horror to state machine conversion
and related cleanups (Thomas Gleixner, Rafael J Wysocki)
- clocksource and timekeeping core updates (John Stultz)
- clocksource driver updates and fixes (Ben Dooks, Dmitry Osipenko,
Hans de Goede, Laurent Pinchart, Maxime Ripard, Xunlei Pang)
- y2038 fixes (Xunlei Pang, John Stultz)
- NMI-safe ktime_get_raw_fast() and general refactoring of the clock
code, in preparation to perf's per event clock ID support (Peter
Zijlstra)
- generic sched/clock fixes, optimizations and cleanups (Daniel
Thompson)
- clockevents cpu_down() race fix (Preeti U Murthy)"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (94 commits)
timers/PM: Drop unnecessary braces from tick_freeze()
timers/PM: Fix up tick_unfreeze()
timekeeping: Get rid of stale comment
clockevents: Cleanup dead cpu explicitely
clockevents: Make tick handover explicit
clockevents: Remove broadcast oneshot control leftovers
sched/idle: Use explicit broadcast oneshot control function
ARM: Tegra: Use explicit broadcast oneshot control function
ARM: OMAP: Use explicit broadcast oneshot control function
intel_idle: Use explicit broadcast oneshot control function
ACPI/idle: Use explicit broadcast control function
ACPI/PAD: Use explicit broadcast oneshot control function
x86/amd/idle, clockevents: Use explicit broadcast oneshot control functions
clockevents: Provide explicit broadcast oneshot control functions
clockevents: Remove the broadcast control leftovers
ARM: OMAP: Use explicit broadcast control function
intel_idle: Use explicit broadcast control function
cpuidle: Use explicit broadcast control function
ACPI/processor: Use explicit broadcast control function
ACPI/PAD: Use explicit broadcast control function
...
Pull scheduler changes from Ingo Molnar:
"Major changes:
- Reworked CPU capacity code, for better SMP load balancing on
systems with assymetric CPUs. (Vincent Guittot, Morten Rasmussen)
- Reworked RT task SMP balancing to be push based instead of pull
based, to reduce latencies on large CPU count systems. (Steven
Rostedt)
- SCHED_DEADLINE support updates and fixes. (Juri Lelli)
- SCHED_DEADLINE task migration support during CPU hotplug. (Wanpeng Li)
- x86 mwait-idle optimizations and fixes. (Mike Galbraith, Len Brown)
- sched/numa improvements. (Rik van Riel)
- various cleanups"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (28 commits)
sched/core: Drop debugging leftover trace_printk call
sched/deadline: Support DL task migration during CPU hotplug
sched/core: Check for available DL bandwidth in cpuset_cpu_inactive()
sched/deadline: Always enqueue on previous rq when dl_task_timer() fires
sched/core: Remove unused argument from init_[rt|dl]_rq()
sched/deadline: Fix rt runtime corruption when dl fails its global constraints
sched/deadline: Avoid a superfluous check
sched: Improve load balancing in the presence of idle CPUs
sched: Optimize freq invariant accounting
sched: Move CFS tasks to CPUs with higher capacity
sched: Add SD_PREFER_SIBLING for SMT level
sched: Remove unused struct sched_group_capacity::capacity_orig
sched: Replace capacity_factor by usage
sched: Calculate CPU's usage statistic and put it into struct sg_lb_stats::group_usage
sched: Add struct rq::cpu_capacity_orig
sched: Make scale_rt invariant with frequency
sched: Make sched entity usage tracking scale-invariant
sched: Remove frequency scaling from cpu_capacity
sched: Track group sched_entity usage contributions
sched: Add sched_avg::utilization_avg_contrib
...
ARM/ARM64: fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390: interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS: FPU and MIPS SIMD Architecture (MSA) support. Includes some patches
from Ralf Baechle's MIPS tree.
x86: bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iQEcBAABAgAGBQJVJ9vmAAoJEL/70l94x66DoMEH/R3rh8IMf4jTiWRkcqohOMPX
k1+NaSY/lCKayaSgggJ2hcQenMbQoXEOdslvaA/H0oC+VfJGK+lmU6E63eMyyhjQ
Y+Px6L85NENIzDzaVu/TIWWuhil5PvIRr3VO8cvntExRoCjuekTUmNdOgCvN2ObW
wswN2qRdPIeEj2kkulbnye+9IV4G0Ne9bvsmUdOdfSSdi6ZcV43JcvrpOZT++mKj
RrKB+3gTMZYGJXMMLBwMkdl8mK1ozriD+q0mbomT04LUyGlPwYLl4pVRDBqyksD7
KsSSybaK2E4i5R80WEljgDMkNqrCgNfg6VZe4n9Y+CfAAOToNnkMJaFEi+yuqbs=
=yu2b
-----END PGP SIGNATURE-----
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.1
The most interesting bit here is irqfd/ioeventfd support for ARM and
ARM64.
Summary:
ARM/ARM64:
fixes for live migration, irqfd and ioeventfd support (enabling
vhost, too), page aging
s390:
interrupt handling rework, allowing to inject all local interrupts
via new ioctl and to get/set the full local irq state for migration
and introspection. New ioctls to access memory by virtual address,
and to get/set the guest storage keys. SIMD support.
MIPS:
FPU and MIPS SIMD Architecture (MSA) support. Includes some
patches from Ralf Baechle's MIPS tree.
x86:
bugfixes (notably for pvclock, the others are small) and cleanups.
Another small latency improvement for the TSC deadline timer"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
KVM: use slowpath for cross page cached accesses
kvm: mmu: lazy collapse small sptes into large sptes
KVM: x86: Clear CR2 on VCPU reset
KVM: x86: DR0-DR3 are not clear on reset
KVM: x86: BSP in MSR_IA32_APICBASE is writable
KVM: x86: simplify kvm_apic_map
KVM: x86: avoid logical_map when it is invalid
KVM: x86: fix mixed APIC mode broadcast
KVM: x86: use MDA for interrupt matching
kvm/ppc/mpic: drop unused IRQ_testbit
KVM: nVMX: remove unnecessary double caching of MAXPHYADDR
KVM: nVMX: checks for address bits beyond MAXPHYADDR on VM-entry
KVM: x86: cache maxphyaddr CPUID leaf in struct kvm_vcpu
KVM: vmx: pass error code with internal error #2
x86: vdso: fix pvclock races with task migration
KVM: remove kvm_read_hva and kvm_read_hva_atomic
KVM: x86: optimize delivery of TSC deadline timer interrupt
KVM: x86: extract blocking logic from __vcpu_run
kvm: x86: fix x86 eflags fixed bit
KVM: s390: migrate vcpu interrupt state
...
As execution domain support is gone we can remove
signal translation from the signal code and remove
exec_domain from thread_info.
Signed-off-by: Richard Weinberger <richard@nod.at>
Dan Carpenter pointed out that the control flow in pt_pmu_hw_init()
is a bit messy: for example the kfree(de_attrs) is entirely
superfluous.
Another problem is the inconsistent mixing of label based and
direct return error handling.
Add modern, label based error handling instead and clean up the code
a bit as well.
Note that we'll still do a kfree(NULL) in the normal case - this does
not matter as this is an init path and kfree() returns early if it
sees a NULL.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20150409090805.GG17605@mwanda
Signed-off-by: Ingo Molnar <mingo@kernel.org>
I don't see why we report e.g. orix_ax, which is not always
meaningful, but don't report ax, which is meaningful.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1428671219-29341-4-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
user_64bit_mode(regs) basically checks regs->cs to point to a
64-bit segment. This check used to be unreliable here because
regs->cs was not always correct in syscalls.
Now regs->cs is always correct: in syscalls, in interrupts, in
exceptions. No need to emply heuristics here.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1428671219-29341-3-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Yes, it is true that cx contains return address.
It's not clear why we trash it.
Stop doing that.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1428671219-29341-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
After recent changes to syscall entry points,
user_regs->{cs,ss,sp} are always correct. (They used to be
undefined while in syscalls).
We can report them reliably, without guessing.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1428671219-29341-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Update the check for UV2000/3000. Note when the HUB is not recognized.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Hedi Berriche <hedi@sgi.com>
Acked-by: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/20150409182629.267239403@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Fix a bug in the OEM check function that determines if the
system is a UV system and the BIOS is compatible with the
kernel's UV apic driver. This prevents some possibly obscure
panics and guards the system against being started on SGI
hardware that does not have the required kernel support.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Hedi Berriche <hedi@sgi.com>
Acked-by: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/20150409182629.112998930@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Optimize the first "SGI" OEM check to return faster if the
system is not an SGI or UV system.
Signed-off-by: Mike Travis <travis@sgi.com>
Acked-by: Hedi Berriche <hedi@sgi.com>
Acked-by: Dimitri Sivanich <sivanich@sgi.com>
Link: http://lkml.kernel.org/r/20150409182628.952357922@asylum.americas.sgi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
v2: Rebase on top of the early-quirks rework from Ville.
Signed-off-by: Damien Lespiau <damien.lespiau@intel.com> (v1)
Reviewed-by: Sivakumar Thulasimani <sivakumar.thulasimani@intel.com>
Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
It used to be used to check for _TIF_IA32, but the check has
been removed.
Remove GET_THREAD_INFO() too.
Run-tested.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1428439424-7258-7-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Jumping to the very next instruction is not very useful:
jmp label
label:
Removing the jump.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1428439424-7258-5-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a preparatory patch for moving stub32_execve[at]() to this
file. It makes sense to have all execve stubs in one place, so
that they can reuse code.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1428439424-7258-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Interrupt entry points are handled with the following code,
each 32-byte code block contains seven entry points:
...
[push][jump 22] // 4 bytes
[push][jump 18] // 4 bytes
[push][jump 14] // 4 bytes
[push][jump 10] // 4 bytes
[push][jump 6] // 4 bytes
[push][jump 2] // 4 bytes
[push][jump common_interrupt][padding] // 8 bytes
[push][jump]
[push][jump]
[push][jump]
[push][jump]
[push][jump]
[push][jump]
[push][jump common_interrupt][padding]
[padding_2]
common_interrupt:
And there is a table which holds pointers to every entry point,
IOW: to every push.
In cold cache, two jumps are still costlier than one, even
though we get the benefit of them residing in the same
cacheline.
This change replaces short jumps with near ones to
'common_interrupt', and pads every push+jump pair to 8 bytes. This
way, each interrupt takes only one jump.
This change replaces ".p2align CONFIG_X86_L1_CACHE_SHIFT" before
dispatch table with ".align 8" - we do not need anything
stronger than that.
The table of entry addresses (the interrupt[] array) is no
longer necessary, the address of entries can be easily
calculated as (irq_entries_start + i*8).
text data bss dec hex filename
12546 0 0 12546 3102 entry_64.o.before
11626 0 0 11626 2d6a entry_64.o
The size decrease is because 1656 bytes of .init.rodata are
gone. That's initdata, though. The resident size does go up a
bit.
Run-tested (32 and 64 bits).
Acked-and-Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1428090553-7283-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This change does two things:
Copy-pastes "retint_swapgs:" code into syscall handling code,
the copy is under "syscall_return:" label. The code is unchanged
apart from some label renames.
Removes "opportunistic sysret" code from "retint_swapgs:" code
block, since now it won't be reached by syscall return. This in
fact removes most of the code in question.
text data bss dec hex filename
12530 0 0 12530 30f2 entry_64.o.before
12562 0 0 12562 3112 entry_64.o
Run-tested.
Acked-and-Tested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427993219-7291-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Take a look at the first instruction byte before optimizing the NOP -
there might be something else there already, like the ALTERNATIVE_2()
in rdtsc_barrier() which NOPs out on AMD even though we just
patched in an MFENCE.
This happens because the alternatives sees X86_FEATURE_MFENCE_RDTSC,
AMD CPUs set it, we patch in the MFENCE and right afterwards it sees
X86_FEATURE_LFENCE_RDTSC which AMD CPUs don't set and we blindly
optimize the NOP.
Checking whether at least the first byte is 0x90 prevents that.
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1428181662-18020-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On failure, sys_execve() does not clobber EXTRA_REGS, so we can
just return to userpsace without saving/restoring them.
On success, ELF_PLAT_INIT() in sys_execve() clears all these
registers.
On other executable formats:
- binfmt_flat.c has similar FLAT_PLAT_INIT, but x86 (and everyone
else except sh) doesn't define it.
- binfmt_elf_fdpic.c has ELF_FDPIC_PLAT_INIT, but x86 (and most
others) doesn't define it.
- There are no such hooks in binfmt_aout.c et al. We inherit
EXTRA_REGS from the prior executable.
This inconsistency was not intended.
This change removes SAVE/RESTORE_EXTRA_REGS in stub_execve,
removes register clearing in ELF_PLAT_INIT(),
and instead simply clears them on success in stub_execve.
Run-tested.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1428173719-7637-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The 'pax' argument is unnecesary. Instead, store the RAX value
directly in regs.
This pattern goes all the way back to 2.1.106pre1, when restore_sigcontext()
was changed to return an error code instead of EAX directly:
https://git.kernel.org/cgit/linux/kernel/git/history/history.git/diff/arch/i386/kernel/signal.c?id=9a8f8b7ca3f319bd668298d447bdf32730e51174
In 2007 sigaltstack syscall support was added, where the return
value of restore_sigcontext() was changed to carry the memory-copying
failure code.
But instead of putting 'ax' into regs->ax directly, it was carried
in via a pointer and then returned, where the generic syscall return
code copied it to regs->ax.
So there was never any deeper reason for this suboptimal pattern, it
was simply never noticed after being introduced.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1428152303-17154-1-git-send-email-brgerst@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Quentin caught a corner case with the generation of instruction
padding in the ALTERNATIVE_2 macro: if len(orig_insn) <
len(alt1) < len(alt2), then not enough padding gets added and
that is not good(tm) as we could overwrite the beginning of the
next instruction.
Luckily, at the time of this writing, we don't have
ALTERNATIVE_2() invocations which have that problem and even if
we did, a simple fix would be to prepend the instructions with
enough prefixes so that that corner case doesn't happen.
However, best it would be if we fixed it properly. See below for
a simple, abstracted example of what we're doing.
So what we ended up doing is, we compute the
max(len(alt1), len(alt2)) - len(orig_insn)
and feed that value to the .skip gas directive. The max() cannot
have conditionals due to gas limitations, thus the fancy integer
math.
With this patch, all ALTERNATIVE_2 sites get padded correctly;
generating obscure test cases pass too:
#define alt_max_short(a, b) ((a) ^ (((a) ^ (b)) & -(-((a) < (b)))))
#define gen_skip(orig, alt1, alt2, marker) \
.skip -((alt_max_short(alt1, alt2) - (orig)) > 0) * \
(alt_max_short(alt1, alt2) - (orig)),marker
.pushsection .text, "ax"
.globl main
main:
gen_skip(1, 2, 4, 0x09)
gen_skip(4, 1, 2, 0x10)
...
.popsection
Thanks to Quentin for catching it and double-checking the fix!
Reported-by: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150404133443.GE21152@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar:
"Misc fixes: a SYSRET single-stepping fix, a dmi-scan robustization
fix, a reboot quirk and a kgdb fixlet"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
kgdb/x86: Fix reporting of 'si' in kgdb on x86_64
x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set
x86/reboot: Add ASRock Q1900DC-ITX mainboard reboot quirk
MAINTAINERS: Change the x86 microcode loader maintainer
firmware: dmi_scan: Prevent dmi_num integer overflow
Commit:
e2b32e6785 ("x86, kaslr: randomize module base load address")
made module base address randomization unconditional and didn't regard
disabled KKASLR due to CONFIG_HIBERNATION and command line option
"nokaslr". For more info see (now reverted) commit:
f47233c2d3 ("x86/mm/ASLR: Propagate base load address calculation")
In order to propagate KASLR status to kernel proper, we need a single bit
in boot_params.hdr.loadflags and we've chosen bit 1 thus leaving the
top-down allocated bits for bits supposed to be used by the bootloader.
Originally-From: Jiri Kosina <jkosina@suse.cz>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Kees Cook <keescook@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Two static functions are only used if CONFIG_PCI is defined, so only
build them if this is the case. Fixes the build warnings:
arch/x86/kernel/early_printk.c:98:13: warning: ‘mem32_serial_out’ defined but not used [-Wunused-function]
static void mem32_serial_out(unsigned long addr, int offset, int value)
^
arch/x86/kernel/early_printk.c:105:21: warning: ‘mem32_serial_in’ defined but not used [-Wunused-function]
static unsigned int mem32_serial_in(unsigned long addr, int offset)
^
Also convert a few related instances of uintXX_t to kernel specific uXX
defines.
Signed-off-by: Mark Einon <mark.einon@gmail.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stuart.r.anderson@intel.com
Link: http://lkml.kernel.org/r/1427923924-22653-1-git-send-email-mark.einon@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Dan reported compiler warnings about missing curly braces in
mce_severity_amd(). Reindent the catch-all "return MCE_AR_SEVERITY"
correctly to single tab.
While at it, chain ctx == IN_KERNEL check with mcgstatus check to make
it cleaner, as suggested by Boris.
No functional changes are introduced by this patch.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1427814281-18192-1-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Replace the clockevents_notify() call with an explicit function call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/8569669.lgxIty9PKW@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Replace the clockevents_notify() call with an explicit function call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1528188.S1pjqkSL1P@vostro.rjw.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We write a stack pointer to MSR_IA32_SYSENTER_ESP exactly once,
and we unnecessarily cache the value in tss.sp1. We never
read the cached value.
Remove all of the caching. It serves no purpose.
Suggested-by: Denys Vlasenko <dvlasenk@redhat.com>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/05a0163eb33ef5208363f0015496855da7cebadd.1428002830.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At Denys' request, clean up the comment describing stack padding
in the 32-bit sysenter path.
No code changes.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/41fee7bb8490ae840fe7ef2699f9c2feb932e729.1428002830.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On a 32-bit build I got:
arch/x86/kernel/cpu/perf_event_intel_pt.c:413:5: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
arch/x86/kernel/cpu/perf_event_intel_bts.c:162:24: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
Fix it. The code should probably be (re-)tested on 32-bit systems to make
sure all is fine.
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: linux-kernel@vger.kernel.org
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf with LBRs on has a tendency to rewrite the DEBUGCTL MSR with
the same value. Add a little optimization to skip the unnecessary
write.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1426871484-21285-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The perf PMI currently does unnecessary MSR accesses when
LBRs are enabled. We use LBR freezing, or when in callstack
mode force the LBRs to only filter on ring 3.
So there is no need to disable the LBRs explicitely in the
PMI handler.
Also we always unnecessarily rewrite LBR_SELECT in the LBR
handler, even though it can never change.
5) | /* write_msr: MSR_LBR_SELECT(1c8), value 0 */
5) | /* read_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
5) | /* write_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
5) | /* write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 70000000f */
5) | /* write_msr: MSR_CORE_PERF_GLOBAL_CTRL(38f), value 0 */
5) | /* write_msr: MSR_LBR_SELECT(1c8), value 0 */
5) | /* read_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
5) | /* write_msr: MSR_IA32_DEBUGCTLMSR(1d9), value 1801 */
This patch:
- Avoids disabling already frozen LBRs unnecessarily in the PMI
- Avoids changing LBR_SELECT in the PMI
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1426871484-21285-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Technically PEBS_ENABLED is only guaranteed to exist when we
detected PEBS. So add a check for this to the PMU dump function.
I don't think it can happen on a real CPU, but could in a VM.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1425059312-18217-4-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
LBRs and LBR freezing are controlled through the DEBUGCTL MSR. So
dump the state of DEBUGCTL too when dumping the PMU state.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1425059312-18217-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The PMU reset code didn't quite keep up with newer PMU features.
Improve it a bit to really reset a modern PMU:
- Clear all overflow status
- Clear LBRs and freezing state
- Disable fixed counters too
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1425059312-18217-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch disables the PMU HT bug when Hyperthreading (HT)
is disabled. We cannot do this test immediately when perf_events
is initialized. We need to wait until the topology information
is setup properly. As such, we register a later initcall, check
the topology and potentially disable the workaround. To do this,
we need to ensure there is no user of the PMU. At this point of
the boot, the only user is the NMI watchdog, thus we disable
it during the switch and re-enable it right after.
Having the workaround disabled when it is not needed provides
some benefits by limiting the overhead is time and space.
The workaround still ensures correct scheduling of the corrupting
memory events (0xd0, 0xd1, 0xd2) when HT is off. Those events
can only be measured on counters 0-3. Something else the current
kernel did not handle correctly.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: bp@alien8.de
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Cc: maria.n.dimakopoulou@gmail.com
Link: http://lkml.kernel.org/r/1416251225-17721-13-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch limits the number of counters available to each CPU when
the HT bug workaround is enabled.
This is necessary to avoid situation of counter starvation. Such can
arise from configuration where one HT thread, HT0, is using all 4 counters
with corrupting events which require exclusion the the sibling HT, HT1.
In such case, HT1 would not be able to schedule any event until HT0
is done. To mitigate this problem, this patch artificially limits
the number of counters to 2.
That way, we can gurantee that at least 2 counters are not in exclusive
mode and therefore allow the sibling thread to schedule events of the
same type (system vs. per-thread). The 2 counters are not determined
in advance. We simply set the limit to two events per HT.
This helps mitigate starvation in case of events with specific counter
constraints such a PREC_DIST.
Note that this does not elimintate the starvation is all cases. But
it is better than not having it.
(Solution suggested by Peter Zjilstra.)
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: bp@alien8.de
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Cc: maria.n.dimakopoulou@gmail.com
Link: http://lkml.kernel.org/r/1416251225-17721-11-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patches activates the HT bug workaround for the
SNB/IVB/HSW processors. This covers non-PEBS mode.
Activation is done thru the constraint tables.
Both client and server processors needs this workaround.
Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Stephane Eranian <eranian@google.com>
Cc: bp@alien8.de
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1416251225-17721-8-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch implements a software workaround for a HW erratum
on Intel SandyBridge, IvyBridge and Haswell processors
with Hyperthreading enabled. The errata are documented for
each processor in their respective specification update
documents:
- SandyBridge: BJ122
- IvyBridge: BV98
- Haswell: HSD29
The bug causes silent counter corruption across hyperthreads only
when measuring certain memory events (0xd0, 0xd1, 0xd2, 0xd3).
Counters measuring those events may leak counts to the sibling
counter. For instance, counter 0, thread 0 measuring event 0xd0,
may leak to counter 0, thread 1, regardless of the event measured
there. The size of the leak is not predictible. It all depends on
the workload and the state of each sibling hyper-thread. The
corrupting events do undercount as a consequence of the leak. The
leak is compensated automatically only when the sibling counter measures
the exact same corrupting event AND the workload is on the two threads
is the same. Given, there is no way to guarantee this, a work-around
is necessary. Furthermore, there is a serious problem if the leaked count
is added to a low-occurrence event. In that case the corruption on
the low occurrence event can be very large, e.g., orders of magnitude.
There is no HW or FW workaround for this problem.
The bug is very easy to reproduce on a loaded system.
Here is an example on a Haswell client, where CPU0, CPU4
are siblings. We load the CPUs with a simple triad app
streaming large floating-point vector. We use 0x81d0
corrupting event (MEM_UOPS_RETIRED:ALL_LOADS) and
0x20cc (ROB_MISC_EVENTS:LBR_INSERTS). Given we are not
using the LBR, the 0x20cc event should be zero.
$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -r20cc sleep 10
Performance counter stats for 'system wide':
139 277 291 r20cc
10,000969126 seconds time elapsed
In this example, 0x81d0 and r20cc ar eusing sinling counters
on CPU0 and CPU4. 0x81d0 leaks into 0x20cc and corrupts it
from 0 to 139 millions occurrences.
This patch provides a software workaround to this problem by modifying the
way events are scheduled onto counters by the kernel. The patch forces
cross-thread mutual exclusion between counters in case a corrupting event
is measured by one of the hyper-threads. If thread 0, counter 0 is measuring
event 0xd0, then nothing can be measured on counter 0, thread 1. If no corrupting
event is measured on any hyper-thread, event scheduling proceeds as before.
The same example run with the workaround enabled, yield the correct answer:
$ taskset -c 0 triad &
$ taskset -c 4 triad &
$ perf stat -a -C 0 -e r81d0 sleep 100 &
$ perf stat -a -C 4 -r20cc sleep 10
Performance counter stats for 'system wide':
0 r20cc
10,000969126 seconds time elapsed
The patch does provide correctness for all non-corrupting events. It does not
"repatriate" the leaked counts back to the leaking counter. This is planned
for a second patch series. This patch series makes this repatriation more
easy by guaranteeing the sibling counter is not measuring any useful event.
The patch introduces dynamic constraints for events. That means that events which
did not have constraints, i.e., could be measured on any counters, may now be
constrained to a subset of the counters depending on what is going on the sibling
thread. The algorithm is similar to a cache coherency protocol. We call it XSU
in reference to Exclusive, Shared, Unused, the 3 possible states of a PMU
counter.
As a consequence of the workaround, users may see an increased amount of event
multiplexing, even in situtations where there are fewer events than counters
measured on a CPU.
Patch has been tested on all three impacted processors. Note that when
HT is off, there is no corruption. However, the workaround is still enabled,
yet not costing too much. Adding a dynamic detection of HT on turned out to
be complex are requiring too much to code to be justified.
This patch addresses the issue when PEBS is not used. A subsequent patch
fixes the problem when PEBS is used.
Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
[spinlock_t -> raw_spinlock_t]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Stephane Eranian <eranian@google.com>
Cc: bp@alien8.de
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1416251225-17721-7-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds a new shared_regs style structure to the
per-cpu x86 state (cpuc). It is used to coordinate access
between counters which must be used with exclusion across
HyperThreads on Intel processors. This new struct is not
needed on each PMU, thus is is allocated on demand.
Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
[peterz: spinlock_t -> raw_spinlock_t]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Stephane Eranian <eranian@google.com>
Cc: bp@alien8.de
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1416251225-17721-6-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds an index parameter to the get_event_constraint()
x86_pmu callback. It is expected to represent the index of the
event in the cpuc->event_list[] array. When the callback is used
for fake_cpuc (evnet validation), then the index must be -1. The
motivation for passing the index is to use it to index into another
cpuc array.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: bp@alien8.de
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Cc: maria.n.dimakopoulou@gmail.com
Link: http://lkml.kernel.org/r/1416251225-17721-5-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds 3 new PMU model specific callbacks
during the event scheduling done by x86_schedule_events().
->start_scheduling(): invoked when entering the schedule routine.
->stop_scheduling(): invoked at the end of the schedule routine
->commit_scheduling(): invoked for each committed event
To be used optionally by model-specific code.
Signed-off-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Stephane Eranian <eranian@google.com>
Cc: bp@alien8.de
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1416251225-17721-4-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make the cpuc->kfree_on_online a vector to accommodate
more than one entry and add the second entry to be
used by a later patch.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Maria Dimakopoulou <maria.n.dimakopoulou@gmail.com>
Cc: bp@alien8.de
Cc: jolsa@redhat.com
Cc: kan.liang@intel.com
Link: http://lkml.kernel.org/r/1416251225-17721-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add support for Branch Trace Store (BTS) via kernel perf event infrastructure.
The difference with the existing implementation of BTS support is that this
one is a separate PMU that exports events' trace buffers to userspace by means
of AUX area of the perf buffer, which is zero-copy mapped into userspace.
The immediate benefit is that the buffer size can be much bigger, resulting in
fewer interrupts and no kernel side copying is involved and little to no trace
data loss. Also, kernel code can be traced with this driver.
The old way of collecting BTS traces still works.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1422614435-114702-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add support for Intel Processor Trace (PT) to kernel's perf events.
PT is an extension of Intel Architecture that collects information about
software execuction such as control flow, execution modes and timings and
formats it into highly compressed binary packets. Even being compressed,
these packets are generated at hundreds of megabytes per second per core,
which makes it impractical to decode them on the fly in the kernel.
This driver exports trace data by through AUX space in the perf ring
buffer, which is zero-copy mapped into userspace for faster data retrieval.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1422614392-114498-1-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Intel PT cannot be used at the same time as LBR or BTS and will cause a
general protection fault if they are used together. In order to avoid
fixing up GPs in the fast path, instead we disallow creating LBR/BTS
events when PT events are present and vice versa.
Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kaixu Xia <kaixu.xia@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Robert Richter <rric@kernel.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@infradead.org
Cc: adrian.hunter@intel.com
Cc: kan.liang@intel.com
Cc: markus.t.metzger@intel.com
Cc: mathieu.poirier@linaro.org
Link: http://lkml.kernel.org/r/1421237903-181015-12-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some of the CYCLE_ACTIVITY.* events can only be scheduled on
counter 2. Due to a typo Haswell matched those with
INTEL_EVENT_CONSTRAINT, which lead to the events never
matching as the comparison does not expect anything
in the umask too. Fix the typo.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1425925222-32361-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For supporting Intel LBR branches filtering, Intel LBR sharing logic
mechanism is introduced from commit b36817e886 ("perf/x86: Add Intel
LBR sharing logic"). It modifies __intel_shared_reg_get_constraints() to
config lbr_sel, which is finally used to set LBR_SELECT.
However, the intel_shared_regs_constraints() function is called after
intel_pebs_constraints(). The PEBS event will return immediately after
intel_pebs_constraints(). So it's impossible to filter branches for PEBS
events.
This patch moves intel_shared_regs_constraints() ahead of
intel_pebs_constraints().
We can safely do that because the intel_shared_regs_constraints() function
only returns empty constraint if its rejecting the event, otherwise it
returns NULL such that we continue calling intel_pebs_constraints() and
x86_get_event_constraint().
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1427467105-9260-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some of x86 bare-metal and Xen CPU initialization code is common
between the two and therefore can be factored out to avoid code
duplication.
As a side effect, doing so will also extend the fix provided by
commit a7fcf28d43 ("x86/asm/entry: Replace this_cpu_sp0() with
current_top_of_stack() to x86_32") to 32-bit Xen PV guests.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: David Vrabel <david.vrabel@citrix.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: konrad.wilk@oracle.com
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1427897534-5086-1-git-send-email-boris.ostrovsky@oracle.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch fixes an error in kgdb for x86_64 which would report
the value of dx when asked to give the value of si.
Signed-off-by: Steffen Liebergeld <steffen.liebergeld@kernkonzept.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When I wrote the opportunistic SYSRET code, I missed an important difference
between SYSRET and IRET.
Both instructions are capable of setting EFLAGS.TF, but they behave differently
when doing so:
- IRET will not issue a #DB trap after execution when it sets TF.
This is critical -- otherwise you'd never be able to make forward progress when
returning to userspace.
- SYSRET, on the other hand, will trap with #DB immediately after
returning to CPL3, and the next instruction will never execute.
This breaks anything that opportunistically SYSRETs to a user
context with TF set. For example, running this code with TF set
and a SIGTRAP handler loaded never gets past 'post_nop':
extern unsigned char post_nop[];
asm volatile ("pushfq\n\t"
"popq %%r11\n\t"
"nop\n\t"
"post_nop:"
: : "c" (post_nop) : "r11");
In my defense, I can't find this documented in the AMD or Intel manual.
Fix it by using IRET to restore TF.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2a23c6b8a9 ("x86_64, entry: Use sysret to return to userspace when possible")
Link: http://lkml.kernel.org/r/9472f1ca4c19a38ecda45bba9c91b7168135fcfa.1427923514.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Various recent BIOSes support NVDIMMs or ADR using a
non-standard e820 memory type, and Intel supplied reference
Linux code using this type to various vendors.
Wire this e820 table type up to export platform devices for the
pmem driver so that we can use it in Linux.
Based on earlier work from:
Dave Jiang <dave.jiang@intel.com>
Dan Williams <dan.j.williams@intel.com>
Includes fixes for NUMA regions from Boaz Harrosh.
Tested-by: Ross Zwisler <ross.zwisler@linux.intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Boaz Harrosh <boaz@plexistor.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Keith Busch <keith.busch@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matthew Wilcox <willy@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-nvdimm@ml01.01.org
Link: http://lkml.kernel.org/r/1427872339-6688-2-git-send-email-hch@lst.de
[ Minor cleanups. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ASRock Q1900DC-ITX mainboard (Baytrail-D) hangs randomly in
both BIOS and UEFI mode while rebooting unless reboot=pci is
used. Add a quirk to reboot via the pci method.
The problem is very intermittent and hard to debug, it might succeed
rebooting just fine 40 times in a row - but fails half a dozen times
the next day. It seems to be slightly less common in BIOS CSM mode
than native UEFI (with the CSM disabled), but it does happen in either
mode. Since I've started testing this patch in late january, rebooting
has been 100% reliable.
Most of the time it already hangs during POST, but occasionally it
might even make it through the bootloader and the kernel might even
start booting, but then hangs before the mode switch. The same symptoms
occur with grub-efi, gummiboot and grub-pc, just as well as (at least)
kernel 3.16-3.19 and 4.0-rc6 (I haven't tried older kernels than 3.16).
Upgrading to the most current mainboard firmware of the ASRock
Q1900DC-ITX, version 1.20, does not improve the situation.
( Searching the web seems to suggest that other Bay Trail-D mainboards
might be affected as well. )
--
Signed-off-by: Stefan Lippers-Hollmann <s.l-h@gmx.de>
Cc: <stable@vger.kernel.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/20150330224427.0fb58e42@mir
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Logically, we just want to jump around the following instruction
and its prologue/epilogue:
call *sys_call_table(,%rax,8)
if the syscall number is too big - we do not specifically target
the "int_ret_from_sys_call" label.
Use a local, numerical label for this jump, for more clarity.
This also makes the code smaller:
-ffffffff8187756b: 0f 87 0f 00 00 00 ja ffffffff81877580 <int_ret_from_sys_call>
+ffffffff8187756b: 77 0f ja ffffffff8187757c <int_ret_from_sys_call>
because jumps to global labels are never translated to short jump
instructions by GAS.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427821211-25099-9-git-send-email-dvlasenk@redhat.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no reason to use MOVQ to load a non-negative immediate
constant value into a 64-bit register. MOVL does the same, since
the upper 32 bits are zero-extended by the CPU.
This makes the code a bit smaller, while leaving functionality
unchanged.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427821211-25099-8-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At the 'exit_intr' label we test whether interrupt/exception was in
kernel. If it did, we jump to the preemption check. If preemption
does happen (IOW if we call preempt_schedule_irq()), we go back to
'exit_intr'.
But it's pointless, we already know that the test succeeded last
time, preemption doesn't change the fact that interrupt/exception
was in the kernel.
We can go back directly to checking PER_CPU_VAR(__preempt_count) instead.
This makes the 'exit_intr' label unused, drop it.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427821211-25099-5-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Get rid of #define obfuscation of retint_kernel in
CONFIG_PREEMPT case by defining retint_kernel label always, not
only for CONFIG_PREEMPT.
Strip retint_kernel of .global-ness (ENTRY macro) - it has no
users outside of this file.
This looks like cosmetics, but it is not:
"je LABEL" can be optimized to short jump by assember
only if LABEL is not global, for global labels jump is always
a near one with relocation.
Convert retint_restore_args to a local numeric label, making it
clearer that it is not used elsewhere in the file.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427821211-25099-3-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
SYSRET code path has a small irq-off block.
On this code path, TRACE_IRQS_ON can't be called right before
interrupts are enabled for real, we can't clobber registers
there. So current code does it earlier, in a safe place.
But with this, TRACE_IRQS_OFF/ON frames just two fast
instructions, which is ridiculous: now most of irq-off block is
_outside_ of the framing.
Do the same thing that we do on SYSCALL entry: do not track this
irq-off block, it is very small to ever cause noticeable irq
latency.
Be careful: make sure that "jnz int_ret_from_sys_call_irqs_off"
now does invoke TRACE_IRQS_OFF - move
int_ret_from_sys_call_irqs_off label before TRACE_IRQS_OFF.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427821211-25099-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
__verify_local_APIC() is detritus from the early APIC days.
Its return value isn't used anywhere and the information it
prints when debug is enabled is already part of APIC
initialization messages printed to syslog. Off with it!
Signed-off-by: Bandan Das <bsd@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/jpgy4mcsxsq.fsf@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
user_mode_ignore_vm86() can be used instead of user_mode(), in
places where we have already done a v8086_mode() security
check of ptregs.
But doing this check in the wrong place would be a bug that
could result in security problems, and also the naming still
isn't very clear.
Furthermore, it only affects 32-bit kernels, while most
development happens on 64-bit kernels.
If we replace them with user_mode() checks then the cost is only
a very minor increase in various slowpaths:
text data bss dec hex filename
10573391 703562 1753042 13029995 c6d26b vmlinux.o.before
10573423 703562 1753042 13030027 c6d28b vmlinux.o.after
So lets get rid of this distinction once and for all.
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Andrew Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150329090233.GA1963@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The ASLR implementation needs to special-case AMD F15h processors by
clearing out bits [14:12] of the virtual address in order to avoid I$
cross invalidations and thus performance penalty for certain workloads.
For details, see:
dfb09f9b7a ("x86, amd: Avoid cache aliasing penalties on AMD family 15h")
This special case reduces the mmapped file's entropy by 3 bits.
The following output is the run on an AMD Opteron 62xx class CPU
processor under x86_64 Linux 4.0.0:
$ for i in `seq 1 10`; do cat /proc/self/maps | grep "r-xp.*libc" ; done
b7588000-b7736000 r-xp 00000000 00:01 4924 /lib/i386-linux-gnu/libc.so.6
b7570000-b771e000 r-xp 00000000 00:01 4924 /lib/i386-linux-gnu/libc.so.6
b75d0000-b777e000 r-xp 00000000 00:01 4924 /lib/i386-linux-gnu/libc.so.6
b75b0000-b775e000 r-xp 00000000 00:01 4924 /lib/i386-linux-gnu/libc.so.6
b7578000-b7726000 r-xp 00000000 00:01 4924 /lib/i386-linux-gnu/libc.so.6
...
Bits [12:14] are always 0, i.e. the address always ends in 0x8000 or
0x0000.
32-bit systems, as in the example above, are especially sensitive
to this issue because 32-bit randomness for VA space is 8 bits (see
mmap_rnd()). With the Bulldozer special case, this diminishes to only 32
different slots of mmap virtual addresses.
This patch randomizes per boot the three affected bits rather than
setting them to zero. Since all the shared pages have the same value
at bits [12..14], there is no cache aliasing problems. This value gets
generated during system boot and it is thus not known to a potential
remote attacker. Therefore, the impact from the Bulldozer workaround
gets diminished and ASLR randomness increased.
More details at:
http://hmarco.org/bugs/AMD-Bulldozer-linux-ASLR-weakness-reducing-mmaped-files-by-eight.html
Original white paper by AMD dealing with the issue:
http://developer.amd.com/wordpress/media/2012/10/SharedL1InstructionCacheonAMD15hCPU.pdf
Mentored-by: Ismael Ripoll <iripoll@disca.upv.es>
Signed-off-by: Hector Marco-Gisbert <hecmargi@upv.es>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jan-Simon <dl9pf@gmx.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-fsdevel@vger.kernel.org
Link: http://lkml.kernel.org/r/1427456301-3764-1-git-send-email-hecmargi@upv.es
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This file doesn't use any macros from pci_ids.h anymore, drop the include.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1427635734-24786-80-git-send-email-mst@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At exit_intr, we GET_THREAD_INFO(%rcx) and then jump to
retint_kernel if saved CS was from kernel. But the code at
retint_kernel doesn't need %rcx.
Move GET_THREAD_INFO(%rcx) down, after CS check and branch.
While at it, remove "has a correct top of stack" comment.
After recent changes which eliminated FIXUP_TOP_OF_STACK,
we always have a correct pt_regs layout.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427738975-7391-5-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The "retint_kernel" code block is misplaced. Since its logical
continuation is "retint_restore_args", it is more natural to
place it above that label. This also makes two jumps "short".
This change only moves code block around, without changing
logic.
This enables the next simplification: making
"retint_restore_args" label a local numeric one.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427738975-7391-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment is ancient, it dates to the time when only AMD's
x86_64 implementation existed. AMD wasn't (and still isn't)
supporting SYSENTER, so these writes were "just in case" back
then.
This has changed: Intel's x86_64 appeared, and Intel does
support SYSENTER in long mode. "Some future 64-bit CPU" is here
already.
The code may appear "buggy" for AMD as it stands, since
MSR_IA32_SYSENTER_EIP is only 32-bit for AMD CPUs. Writing a
kernel function's address to it would drop high bits. Subsequent
use of this MSR for branch via SYSENTER seem to allow user to
transition to CPL0 while executing his code. Scary, eh?
Explain why that is not a bug: because SYSENTER insn would not
work on AMD CPU.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427453956-21931-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
While thinking on the whole clock discussion it occurred to me we have
two distinct uses of time:
1) the tracking of event/ctx/cgroup enabled/running/stopped times
which includes the self-monitoring support in struct
perf_event_mmap_page.
2) the actual timestamps visible in the data records.
And we've been conflating them.
The first is all about tracking time deltas, nobody should really care
in what time base that happens, its all relative information, as long
as its internally consistent it works.
The second however is what people are worried about when having to
merge their data with external sources. And here we have the
discussion on MONOTONIC vs MONOTONIC_RAW etc..
Where MONOTONIC is good for correlating between machines (static
offset), MONOTNIC_RAW is required for correlating against a fixed rate
hardware clock.
This means configurability; now 1) makes that hard because it needs to
be internally consistent across groups of unrelated events; which is
why we had to have a global perf_clock().
However, for 2) it doesn't really matter, perf itself doesn't care
what it writes into the buffer.
The below patch makes the distinction between these two cases by
adding perf_event_clock() which is used for the second case. It
further makes this configurable on a per-event basis, but adds a few
sanity checks such that we cannot combine events with different clocks
in confusing ways.
And since we then have per-event configurability we might as well
retain the 'legacy' behaviour as a default.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
An upcoming patch will depend on tai_ns() and NMI-safe ktime_get_raw_fast(),
so merge timers/core here in a separate topic branch until it's all cooked
and timers/core is merged upstream.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull RCU updates from Paul E. McKenney:
- Documentation updates.
- Changes permitting use of call_rcu() and friends very early in
boot, for example, before rcu_init() is invoked.
- Miscellaneous fixes.
- Add in-kernel API to enable and disable expediting of normal RCU
grace periods.
- Improve RCU's handling of (hotplug-) outgoing CPUs.
Note: ARM support is lagging a bit here, and these improved
diagnostics might generate (harmless) splats.
- NO_HZ_FULL_SYSIDLE fixes.
- Tiny RCU updates to make it more tiny.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A named label "ret_from_sys_call" implies that there are jumps
to this location from elsewhere, as happens with many other
labels in this file.
But this label is used only by the JMP a few insns above.
To make that obvious, use local numeric label instead.
Improve comments:
"and return regs->ax" isn't too informative. We always return
regs->ax.
The comment suggesting that it'd be cool to use rip relative
addressing for CALL is deleted. It's unclear why that would be
an improvement - we aren't striving to use position-independent
code here. PIC code here would require something like LEA
sys_call_table(%rip),reg + CALL *(reg,%rax*8)...
"iret frame is also incomplete" is no longer true, fix that too.
Also fix typo in comment.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427303896-24023-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
perf_pmu_disable() is called before pmu->add() and perf_pmu_enable() is called
afterwards. No need to call these inside of x86_pmu_add() as well.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1424281543-67335-1-git-send-email-dsahern@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In preparation of adding another tkr field, rename this one to
tkr_mono. Also rename tk_read_base::base_mono to tk_read_base::base,
since the structure is not specific to CLOCK_MONOTONIC and the mono
name got added to the tk_read_base instance.
Lots of trivial churn.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: John Stultz <john.stultz@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150319093400.344679419@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On Broadwell INST_RETIRED.ALL cannot be used with any period
that doesn't have the lowest 6 bits cleared. And the period
should not be smaller than 128.
This is erratum BDM11 and BDM55:
http://www.intel.com/content/dam/www/public/us/en/documents/specification-updates/5th-gen-core-family-spec-update.pdf
BDM11: When using a period < 100; we may get incorrect PEBS/PMI
interrupts and/or an invalid counter state.
BDM55: When bit0-5 of the period are !0 we may get redundant PEBS
records on overflow.
Add a new callback to enforce this, and set it for Broadwell.
How does this handle the case when an app requests a specific
period with some of the bottom bits set?
Short answer:
Any useful instruction sampling period needs to be 4-6 orders
of magnitude larger than 128, as an PMI every 128 instructions
would instantly overwhelm the system and be throttled.
So the +-64 error from this is really small compared to the
period, much smaller than normal system jitter.
Long answer (by Peterz):
IFF we guarantee perf_event_attr::sample_period >= 128.
Suppose we start out with sample_period=192; then we'll set period_left
to 192, we'll end up with left = 128 (we truncate the lower bits). We
get an interrupt, find that period_left = 64 (>0 so we return 0 and
don't get an overflow handler), up that to 128. Then we trigger again,
at n=256. Then we find period_left = -64 (<=0 so we return 1 and do get
an overflow). We increment with sample_period so we get left = 128. We
fire again, at n=384, period_left = 0 (<=0 so we return 1 and get an
overflow). And on and on.
So while the individual interrupts are 'wrong' we get then with
interval=256,128 in exactly the right ratio to average out at 192. And
this works for everything >=128.
So the num_samples*fixed_period thing is still entirely correct +- 127,
which is good enough I'd say, as you already have that error anyhow.
So no need to 'fix' the tools, al we need to do is refuse to create
INST_RETIRED:ALL events with sample_period < 128.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[ Updated comments and changelog a bit. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1424225886-18652-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add Broadwell support for Broadwell to perf.
The basic support is very similar to Haswell. We use the new cache
event list added for Haswell earlier. The only differences
are a few bits related to remote nodes. To avoid an extra,
mostly identical, table these are patched up in the initialization code.
The constraint list has one new event that needs to be handled over Haswell.
Includes code and testing from Kan Liang.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1424225886-18652-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Haswell offcore events are quite different from Sandy Bridge.
Add a new table to handle Haswell properly.
Note that the offcore bits listed in the SDM are not quite correct
(this is currently being fixed). An uptodate list of bits is
in the patch.
The basic setup is similar to Sandy Bridge. The prefetch columns
have been removed, as prefetch counting is not very reliable
on Haswell. One L1 event that is not in the event list anymore
has been also removed.
- data reads do not include code reads (comparable to earlier Sandy Bridge tables)
- data counts include speculative execution (except L1 write, dtlb, bpu)
- remote node access includes both remote memory, remote cache, remote mmio.
- prefetches are not included in the counts for consistency
(different from Sandy Bridge, which includes prefetches in the remote node)
Signed-off-by: Andi Kleen <ak@linux.intel.com>
[ Removed the HSM30 comments; we don't have them for SNB/IVB either. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1424225886-18652-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
CPU hardware ID (phys_id) is defined as u32 in structure acpi_processor,
but phys_id is used as int in acpi processor driver, so it will lead to
some inconsistence for the drivers.
Furthermore, to cater for ACPI arch ports that implement 64 bits CPU
ids a generic CPU physical id type is required.
So introduce typedef u32 phys_cpuid_t in a common file, and introduce
a macro PHYS_CPUID_INVALID as (phys_cpuid_t)(-1) if it's not defined
by other archs, this will solve the inconsistence in acpi processor driver,
and will prepare for the ACPI on ARM64 for the 64 bit CPU hardware ID
in the following patch.
CC: Rafael J Wysocki <rjw@rjwysocki.net>
Suggested-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Reviewed-by: Grant Likely <grant.likely@linaro.org>
Acked-by: Sudeep Holla <sudeep.holla@arm.com>
Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
[hj: reworked cpu physid map return codes]
Signed-off-by: Hanjun Guo <hanjun.guo@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
We currently have a race: if we're preempted during syscall
exit, we can fail to process syscall return work that is queued
up while we're preempted in ret_from_sys_call after checking
ti.flags.
Fix it by disabling interrupts before checking ti.flags.
Reported-by: Stefan Seyfried <stefan.seyfried@googlemail.com>
Reported-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Tejun Heo <tj@kernel.org>
Fixes: 96b6352c12 ("x86_64, entry: Remove the syscall exit audit")
Link: http://lkml.kernel.org/r/189320d42b4d671df78c10555976bb10af1ffc75.1427137498.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The THREAD_INFO() macro has a somewhat confusingly generic name,
defined in a generic .h C header file. It also does not make it
clear that it constructs a memory operand for use in assembly
code.
Rename it to ASM_THREAD_INFO() to make it all glaringly
obvious on first glance.
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/20150324184442.GC14760@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On CONFIG_IA32_EMULATION=y kernels we set up
MSR_IA32_SYSENTER_CS/ESP/EIP, but on !CONFIG_IA32_EMULATION
kernels we leave them unchanged.
Clear them to make sure the instruction is disabled properly.
SYSCALL is set up properly in both cases.
Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The FIXUP_TOP_OF_STACK macro is only necessary because we don't save %r11
to pt_regs->r11 on SYSCALL64 fast path, but we want ptrace to see it populated.
Bite the bullet, add a single additional PUSH instruction, and remove
the FIXUP_TOP_OF_STACK macro.
The RESTORE_TOP_OF_STACK macro is already a nop. Remove it too.
On SandyBridge CPU, it does not get slower:
measured 54.22 ns per getpid syscall before and after last two
changes on defconfig kernel.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1426785469-15125-4-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With this change, on SYSCALL64 code path we are now populating
pt_regs->cs, pt_regs->ss and pt_regs->rcx unconditionally and
therefore don't need to do that in FIXUP_TOP_OF_STACK.
We lose a number of large instructions there:
text data bss dec hex filename
13298 0 0 13298 33f2 entry_64_before.o
12978 0 0 12978 32b2 entry_64.o
What's more important, we convert two "MOVQ $imm,off(%rsp)" to
"PUSH $imm" (the ones which fill pt_regs->cs,ss).
Before this patch, placing them on fast path was slowing it down
by two cycles: this form of MOV is very large, 12 bytes, and
this probably reduces decode bandwidth to one instruction per cycle
when CPU sees them.
Therefore they were living in FIXUP_TOP_OF_STACK instead (away
from fast path).
"PUSH $imm" is a small 2-byte instruction. Moving it to fast path does
not slow it down in my measurements.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1426785469-15125-3-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
PER_CPU_VAR(kernel_stack) was set up in a way where it points
five stack slots below the top of stack.
Presumably, it was done to avoid one "sub $5*8,%rsp"
in syscall/sysenter code paths, where iret frame needs to be
created by hand.
Ironically, none of them benefits from this optimization,
since all of them need to allocate additional data on stack
(struct pt_regs), so they still have to perform subtraction.
This patch eliminates KERNEL_STACK_OFFSET.
PER_CPU_VAR(kernel_stack) now points directly to top of stack.
pt_regs allocations are adjusted to allocate iret frame as well.
Hopefully we can merge it later with 32-bit specific
PER_CPU_VAR(cpu_current_top_of_stack) variable...
Net result in generated code is that constants in several insns
are changed.
This change is necessary for changing struct pt_regs creation
in SYSCALL64 code path from MOV to PUSH instructions.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1426785469-15125-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This changes the THREAD_INFO() definition and all its callsites
so that they do not count stack position from
(top of stack - KERNEL_STACK_OFFSET), but from top of stack.
Semi-mysterious expressions THREAD_INFO(%rsp,RIP) - "why RIP??"
are now replaced by more logical THREAD_INFO(%rsp,SIZEOF_PTREGS)
- "calculate thread_info's address using information that
rsp is SIZEOF_PTREGS bytes below top of stack".
While at it, replace "(off)-THREAD_SIZE(reg)" with equivalent
"((off)-THREAD_SIZE)(reg)". The form without parentheses
falsely looks like we invoke THREAD_SIZE() macro.
Improve comment atop THREAD_INFO macro definition.
This patch does not change generated code (verified by objdump).
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1426785469-15125-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename mce_severity() to mce_severity_intel() and assign the
mce_severity function pointer to mce_severity_amd() during init on AMD.
This way, we can avoid a test to call mce_severity_amd every time we get
into mce_severity(). And it's cleaner to do it this way.
Signed-off-by: Aravind Gopalakrishnan <Aravind.Gopalakrishnan@amd.com>
Suggested-by: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1427125373-2918-3-git-send-email-Aravind.Gopalakrishnan@amd.com
Signed-off-by: Borislav Petkov <bp@suse.de>
Add a severities function that caters to AMD processors. This allows us
to do some vendor-specific work within the function if necessary.
Also, introduce a vendor flag bitfield for vendor-specific settings. The
severities code uses this to define error scope based on the prescence
of the flags field.
This is based off of work by Boris Petkov.
Testing details:
Fam10h, Model 9h (Greyhound)
Fam15h: Models 0h-0fh (Orochi), 30h-3fh (Kaveri) and 60h-6fh (Carrizo),
Fam16h Model 00h-0fh (Kabini)
Boris:
Intel SNB
AMD K8 (JH-E0)
Signed-off-by: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Chen Yucong <slaoub@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: linux-edac@vger.kernel.org
Link: http://lkml.kernel.org/r/1427125373-2918-2-git-send-email-Aravind.Gopalakrishnan@amd.com
[ Fixup build, clean up comments. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Having syscall32/sysenter32 initialization in a separate tiny
function, called from within a function that is already syscall
init specific, serves no real purpose.
Its existense also caused an unintended effect of having
wrmsrl(MSR_CSTAR) performed twice: once we set it to a dummy
function returning -ENOSYS, and immediately after
(if CONFIG_IA32_EMULATION), we set it to point to the proper
syscall32 entry point, ia32_cstar_target.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Andy Lutomirski <luto@amacapital.net>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following point:
2. per-CPU pvclock time info is updated if the
underlying CPU changes.
Is not true anymore since "KVM: x86: update pvclock area conditionally,
on cpu migration".
Add task migration notification back.
Problem noticed by Andy Lutomirski.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
CC: stable@kernel.org # 3.11+
The recent old_rsp -> rsp_scratch rename also changed this
comment, but in this case "old_rsp" was not referring to
PER_CPU(old_rsp).
Fix this.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1427115839-6397-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This allows us to remove some unnecessary ifdefs. There should
be no change to the generated code.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/f7e00f0d668e253abf0bd8bf36491ac47bd761ff.1426728647.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
user_mode_vm() and user_mode() are now the same. Change all callers
of user_mode_vm() to user_mode().
The next patch will remove the definition of user_mode_vm.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/43b1f57f3df70df5a08b0925897c660725015554.1426728647.git.luto@kernel.org
[ Merged to a more recent kernel. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
A few of the user_mode() checks in traps.c are immediately after
explicit checks for vm86 mode. Change them to user_mode_ignore_vm86().
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/0b324d5b75c3402be07f8d3c6245ed7f4995029e.1426728647.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There's no point in checking the VM bit on 64-bit, and, since
we're explicitly checking it, we can use user_mode_ignore_vm86()
after the check.
While we're at it, rearrange the #ifdef slightly to make the code
flow a bit clearer.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/dc1457a734feccd03a19bb3538a7648582f57cdd.1426728647.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The private pointer provided by the cacheinfo code is used to implement
the AMD L3 cache-specific attributes using a pointer to the northbridge
descriptor. It is needed for performing L3-specific operations and for
that we need a couple of PCI devices and other service information, all
contained in the northbridge descriptor.
This results in failure of cacheinfo setup as shown below as
cache_get_priv_group() returns the uninitialised private attributes which
are not valid for Intel processors.
------------[ cut here ]------------
WARNING: CPU: 3 PID: 1 at fs/sysfs/group.c:102
internal_create_group+0x151/0x280()
sysfs: (bin_)attrs not set by subsystem for group: index3/
Modules linked in:
CPU: 3 PID: 1 Comm: swapper/0 Not tainted 4.0.0-rc3+ #1
Hardware name: Dell Inc. Precision T3600/0PTTT9, BIOS A13 05/11/2014
...
Call Trace:
dump_stack
warn_slowpath_common
warn_slowpath_fmt
internal_create_group
sysfs_create_groups
device_add
cpu_device_create
? __kmalloc
cache_add_dev
cacheinfo_sysfs_init
? container_dev_init
do_one_initcall
kernel_init_freeable
? rest_init
kernel_init
ret_from_fork
? rest_init
This patch fixes the issue by checking if the L3 cache indices are
populated correctly (AMD-specific) before initializing the private
attributes.
Reported-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Certain MSRs are only relevant to a kernel in host mode, and kvm had
chosen not to implement these MSRs at all for guests. If a guest kernel
ever tried to access these MSRs, the result was a general protection
fault.
KVM will be separately patched to return 0 when these MSRs are read,
and this patch ensures that MSR accesses are tolerant of exceptions.
Signed-off-by: Jesse Larrew <jesse.larrew@amd.com>
[ Drop {} braces around loop ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joel Schopp <joel.schopp@amd.com>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac@vger.kernel.org
Link: http://lkml.kernel.org/r/1426262619-5016-1-git-send-email-jesse.larrew@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that eager_fpu_init_bp() does setup_init_fpu_buf() only and
nothing else, we can remove it and move this code into its "caller",
eager_fpu_init().
This avoids the confusing games with "static __refdata void (*boot_func)":
init_xstate_buf can be NULL only during boot, so it is safe to call the
__init-annotated setup_init_fpu_buf() function in eager_fpu_init(), we
just need to mark it as __init_refok.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Riikonen <priikone@iki.fi>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/20150314151334.GC13029@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that kthreads do not use FPU until they get executed, swapper/0
doesn't need to allocate fpu->state.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Riikonen <priikone@iki.fi>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/20150313182716.GB8249@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Call it what it does and in accordance with the context where it is
used: we reset the FPU state either because we were unable to restore it
from the one saved in the task or because we simply want to reset it.
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
flush_thread() -> drop_init_fpu() is suboptimal and confusing. It does
drop_fpu() or restore_init_xstate() depending on !use_eager_fpu(). But
flush_thread() too checks eagerfpu right after that, and if it is true
then restore_init_xstate() just burns CPU for no reason. We are going to
load init_xstate_buf again after we set used_math()/user_has_fpu(), until
then the FPU state can't survive after switch_to().
Remove it, and change the "if (!use_eager_fpu())" to call drop_fpu().
While at it, clean up the tsk/current usage.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Riikonen <priikone@iki.fi>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/20150313173030.GA31217@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Change flush_thread() to do user_fpu_begin() and restore_init_xstate()
instead of math_state_restore().
Note: "TODO: cleanup this horror" is still valid. We do not need
init_fpu() at all, we only need fpu_alloc() and memset(0). But this
needs other changes, in particular user_fpu_begin() should set
used_math().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Riikonen <priikone@iki.fi>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Link: http://lkml.kernel.org/r/20150311173449.GE5032@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to check whether user code is in 32-bit mode, not
whether the task is nominally 32-bit.
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brad Spengler <spender@grsecurity.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/33e5107085ce347a8303560302b15c2cadd62c4c.1426728647.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Both the execve() and sigreturn() family of syscalls have the
ability to change registers in ways that may not be compatabile
with the syscall path they were called from.
In particular, SYSRET and SYSEXIT can't handle non-default %cs and %ss,
and some bits in eflags.
These syscalls have stubs that are hardcoded to jump to the IRET path,
and not return to the original syscall path.
The following commit:
76f5df43ca ("Always allocate a complete "struct pt_regs" on the kernel stack")
recently changed this for some 32-bit compat syscalls, but introduced a bug where
execve from a 32-bit program to a 64-bit program would fail because it still returned
via SYSRETL. This caused Wine to fail when built for both 32-bit and 64-bit.
This patch sets TIF_NOTIFY_RESUME for execve() and sigreturn() so
that the IRET path is always taken on exit to userspace.
Signed-off-by: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1426978461-32089-1-git-send-email-brgerst@gmail.com
[ Improved the changelog and comments. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Make clear that the usage of PER_CPU(old_rsp) is purely temporary,
by renaming it to 'rsp_scratch'.
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tweak a few outdated comments that were obsoleted by recent changes
to syscall entry code:
- we no longer have a "partial stack frame" on
entry, ever.
- explain the syscall entry usage of old_rsp.
Partially based on a (split out of) patch from Denys Vlasenko.
Originally-from: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove all manipulations of PER_CPU(old_rsp) in C code:
- it is not used on SYSRET return anymore, and system entries
are atomic, so updating it from the fork and context switch
paths is pointless.
- Tweak a few related comments as well: we no longer have a
"partial stack frame" on entry, ever.
Based on (split out of) patch from Denys Vlasenko.
Originally-from: Denys Vlasenko <dvlasenk@redhat.com>
Tested-by: Borislav Petkov <bp@alien8.de>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1426599779-8010-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We want to use PER_CPU_VAR(old_rsp) as a simple temporary register,
to shuffle user-space RSP into (and from) when we set up the system
call stack frame. At that point we cannot shuffle values into general
purpose registers, because we have not saved them yet.
To be able to do this shuffling into a memory location, we must be
atomic and must not be preempted while we do the shuffling, otherwise
the 'temporary' register gets overwritten by some other task's
temporary register contents ...
Tested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1426600344-8254-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On x86-64, __copy_instruction() always returns 0 (error) if the
instruction uses %rip-relative addressing. This is because
kernel_insn_init() is called the second time for 'insn' instance
in such cases and sets all its fields to 0.
Because of this, trying to place a kprobe on such instruction
will fail, register_kprobe() will return -EINVAL.
This patch fixes the problem.
Signed-off-by: Eugene Shatokhin <eugene.shatokhin@rosalab.ru>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Link: http://lkml.kernel.org/r/20150317100918.28349.94654.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before the patch, the 'tss_struct::stack' field was not referenced anywhere.
It was used only to set SYSENTER's stack to point after the last byte
of tss_struct, thus the trailing field, stack[64], was used.
But grep would not know it. You can comment it out, compile,
and kernel will even run until an unlucky NMI corrupts
io_bitmap[] (which is also not easily detectable).
This patch changes code so that the purpose and usage of this
field is not mysterious anymore, and can be easily grepped for.
This does change generated code, for a subtle reason:
since tss_struct is ____cacheline_aligned, there happens to be
5 longs of padding at the end. Old code was using the padding
too; new code will strictly use it only for SYSENTER_stack[].
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425912738-559-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x86_32 and x86_64 need slightly different thread_struct::sp0 values, and
x86_32's was incorrect for init.
This never mattered -- the init thread never runs user code, so we never
used thread_struct::sp0 for anything.
Fix it and mostly unify them.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1b810c1d2e797e27bb4a7708c426101161edd1f6.1426009661.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
x86_32, unlike x86_64, pads the top of the kernel stack, because the
hardware stack frame formats are variable in size.
Document this padding and give it a name.
This should make no change whatsoever to the compiled kernel
image. It also doesn't fix any of the current bugs in this area.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/02bf2f54b8dcb76a62a142b6dfe07d4ef7fc582e.1426009661.git.luto@amacapital.net
[ Fixed small details, such as a missed magic constant in entry_32.S pointed out by Denys Vlasenko. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
As far as I can tell, these fields have been set to zero on save
and ignored on restore since Linux was imported into git.
Rename them '__pad1' and '__pad2' to avoid confusion. This may
also allow us to recycle them some day.
This also adds a comment clarifying the history of those fields.
I'm intentionally avoiding calling either of them '__pad0': the
field formerly known as '__pad0' is now 'ss'.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/844f8490e938780c03355be4c9b69eb4c494bf4e.1426193719.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The comment in the signal code says that apps can save/restore
other segments on their own. It's true that apps can *save* SS
on their own, but there's no way for apps to restore it: SYSCALL
effectively resets SS to __USER_DS, so any value that user code
tries to load into SS gets lost on entry to sigreturn.
This recycles two padding bytes in the segment selector area for SS.
While we're at it, we need a second change to make this useful.
If the signal we're delivering is caused by a bad SS value,
saving that value isn't enough. We need to remove that bad
value from the regs before we try to deliver the signal. Oddly,
the i386 code already got this right.
I suspect that 64-bit programs that try to run 16-bit code and
use signals will have a lot of trouble without this.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Borislav Petkov <bp@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/405594361340a2ec32f8e2b115c142df0e180d8e.1426193719.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull full dynticks support for virt guests from Frederic Weisbecker:
"Some measurements showed that disabling the tick on the host while the
guest is running can be interesting on some workloads. Indeed the
host tick is irrelevant while a vcpu runs, it consumes CPU time and cache
footprint for no good reasons.
Full dynticks already works in every context, but RCU prevents it to
be effective outside userspace, because the CPU needs to take part of
RCU grace period completion as long as RCU may be used on it, which is
the case in kernel context.
However guest is similar to userspace and idle in that we know RCU is
unused on such context. Therefore a CPU in guest/userspace/idle context
can let other CPUs report its own RCU quiescent state on its behalf
and shut down the tick safely, provided it isn't needed for other
reasons than RCU. This is called RCU extended quiescent state.
This was already implemented for idle and userspace. This patchset now
brings it for guest contexts through the following steps:
- Generalize the context tracking APIs to also track guest state
- Rename/sanitize a few CPP symbols accordingly
- Report guest entry/exit to RCU and define this context area as an RCU
extended quiescent state."
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This reverts commit:
f47233c2d3 ("x86/mm/ASLR: Propagate base load address calculation")
The main reason for the revert is that the new boot flag does not work
at all currently, and in order to make this work, we need non-trivial
changes to the x86 boot code which we didn't manage to get done in
time for merging.
And even if we did, they would've been too risky so instead of
rushing things and break booting 4.1 on boxes left and right, we
will be very strict and conservative and will take our time with
this to fix and test it properly.
Reported-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: H. Peter Anvin <hpa@linux.intel.com
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Junjie Mao <eternal.n08@gmail.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Matt Fleming <matt.fleming@intel.com>
Link: http://lkml.kernel.org/r/20150316100628.GD22995@pd.tnic
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To fully take advantage of MWAIT, apparently the CLFLUSH instruction needs
another quirk on certain CPUs: proper barriers around it on certain machines.
On a Q6600 SMP system, pipe-test scheduling performance, cross core,
improves significantly:
3.8.13 487.2 KHz 1.000
3.13.0-master 415.5 KHz .852
3.13.0-master+ 415.2 KHz .852 + restore mwait_idle
3.13.0-master++ 488.5 KHz 1.002 + restore mwait_idle + IPI fix
Since X86_BUG_CLFLUSH_MONITOR is already a quirk, don't create a separate
quirk for the extra smp_mb()s.
Signed-off-by: Mike Galbraith <bitbucket@online.de>
Cc: <stable@vger.kernel.org> # 3.10+
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ian Malone <ibmalone@gmail.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1390061684.5566.4.camel@marge.simpson.net
[ Ported to recent kernel, added comments about the quirk. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In Linux-3.9 we removed the mwait_idle() loop:
69fb3676df ("x86 idle: remove mwait_idle() and "idle=mwait" cmdline param")
The reasoning was that modern machines should be sufficiently
happy during the boot process using the default_idle() HALT
loop, until cpuidle loads and either acpi_idle or intel_idle
invoke the newer MWAIT-with-hints idle loop.
But two machines reported problems:
1. Certain Core2-era machines support MWAIT-C1 and HALT only.
MWAIT-C1 is preferred for optimal power and performance.
But if they support just C1, cpuidle never loads and
so they use the boot-time default idle loop forever.
2. Some laptops will boot-hang if HALT is used,
but will boot successfully if MWAIT is used.
This appears to be a hidden assumption in BIOS SMI,
that is presumably valid on the proprietary OS
where the BIOS was validated.
https://bugzilla.kernel.org/show_bug.cgi?id=60770
So here we effectively revert the patch above, restoring
the mwait_idle() loop. However, we don't bother restoring
the idle=mwait cmdline parameter, since it appears to add
no value.
Maintainer notes:
For 3.9, simply revert 69fb3676df
for 3.10, patch -F3 applies, fuzz needed due to __cpuinit use in
context For 3.11, 3.12, 3.13, this patch applies cleanly
Tested-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Mike Galbraith <bitbucket@online.de>
Cc: <stable@vger.kernel.org> # 3.9+
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ian Malone <ibmalone@gmail.com>
Cc: Josh Boyer <jwboyer@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/345254a551eb5a6a866e048d7ab570fd2193aca4.1389763084.git.len.brown@intel.com
[ Ported to recent kernels. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQEcBAABAgAGBQJU/NacAAoJEHm+PkMAQRiGdUcIAJU5dHclwd9HRc7LX5iOwYN6
mN0aCsYjMD8Pjx2VcPCgJvkIoESQO5pkwYpFFWCwILup1bVEidqXfr8EPOdThzdh
kcaT0FwUvd19K+0jcKVNCX1RjKBtlUfUKONk6sS2x4RrYZpv0Ur8Gh+yXV8iMWtf
fAusNEYlxQJvEz5+NSKw86EZTr4VVcykKLNvj+/t/JrXEuue7IG8EyoAO/nLmNd2
V/TUKKttqpE6aUVBiBDmcMQl2SUVAfp5e+KJAHmizdDpSE80nU59UC1uyV8VCYdM
qwHXgttLhhKr8jBPOkvUxl4aSXW7S0QWO8TrMpNdEOeB3ZB8AKsiIuhe1JrK0ro=
=Xkue
-----END PGP SIGNATURE-----
Merge tag 'v4.0-rc3' into x86/build, to refresh an older tree before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
math_state_restore() assumes it is called with irqs disabled,
but this is not true if the caller is __restore_xstate_sig().
This means that if ia32_fxstate == T and __copy_from_user()
fails, __restore_xstate_sig() returns with irqs disabled too.
This triggers:
BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:41
dump_stack
___might_sleep
? _raw_spin_unlock_irqrestore
__might_sleep
down_read
? _raw_spin_unlock_irqrestore
print_vma_addr
signal_fault
sys32_rt_sigreturn
Change __restore_xstate_sig() to call set_used_math()
unconditionally. This avoids enabling and disabling interrupts
in math_state_restore(). If copy_from_user() fails, we can
simply do fpu_finit() by hand.
[ Note: this is only the first step. math_state_restore() should
not check used_math(), it should set this flag. While
init_fpu() should simply die. ]
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: <stable@vger.kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Pekka Riikonen <priikone@iki.fi>
Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Suresh Siddha <sbsiddha@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20150307153844.GB25954@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
On a platform in ACPI Hardware-reduced mode, the legacy PIC and
PIT may not be initialized even though they may be present in
silicon. Touching these legacy components causes unexpected
results on the system.
On the Bay Trail-T(ASUS-T100) platform, touching these legacy
components blocks platform hardware low idle power state(S0ix)
during system suspend. So we should bypass them in ACPI hardware
reduced mode.
Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: Li Aubrey <aubrey.li@linux.intel.com>
Cc: <alan@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Link: http://lkml.kernel.org/r/54FFF81C.20703@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This commit removes the open-coded CPU-offline notification with new
common code. Among other things, this change avoids calling scheduler
code using RCU from an offline CPU that RCU is ignoring. It also allows
Xen to notice at online time that the CPU did not go offline correctly.
Note that Xen has the surviving CPU carry out some cleanup operations,
so if the surviving CPU times out, these cleanup operations might have
been carried out while the outgoing CPU was still running. It might
therefore be unwise to bring this CPU back online, and this commit
avoids doing so.
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: <x86@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: <xen-devel@lists.xenproject.org>
Prepare for the removal of 'usersp', by simplifying PER_CPU(old_rsp) usage:
- use it only as temp storage
- store the userspace stack pointer immediately in pt_regs->sp
on syscall entry, instead of using it later, on syscall exit.
- change C code to use pt_regs->sp only, instead of PER_CPU(old_rsp)
and task->thread.usersp.
FIXUP/RESTORE_TOP_OF_STACK are simplified as well.
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425926364-9526-4-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Before this patch, R11 was saved in pt_regs->r11.
Which looks natural, but requires messy shuffling to/from iret
frame whenever ptrace or e.g. sys_iopl() wants to modify flags -
because that's how this register is used by SYSCALL/SYSRET.
This patch saves R11 in pt_regs->flags, and uses that value for
the SYSRET64 instruction. Shuffling is eliminated.
FIXUP/RESTORE_TOP_OF_STACK are simplified.
stub_iopl is no longer needed: pt_regs->flags needs no fixing up.
Testing shows that syscall fast path is ~54.3 ns before
and after the patch (on 2.7 GHz Sandy Bridge CPU).
Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Alexei Starovoitov <ast@plumgrid.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Drewry <wad@chromium.org>
Link: http://lkml.kernel.org/r/1425926364-9526-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
fx_finit() has two users but only fpu_finit() needs to clear
xstate, alloc_bootmem_align() in setup_init_fpu_buf() returns
zero-filled memory.
And note that both memset()'s look confusing. Yes, offsetof() is
0 for ->fxsave or ->fsave, but it would be cleaner to turn
them into a single memset() which zeroes fpu->state.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1425967585-4725-2-git-send-email-bp@alien8.de
Link: http://lkml.kernel.org/r/20150302183257.GC23085@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This is a cosmetic change: xstateregs_get() and xstateregs_set()
abuse ->fxsave to access xsave->i387.sw_reserved.
This practice is correct, ->fxsave and xsave->i387 share the same memory,
but IMHO this looks confusing.
And we can make this code more readable if we add a
"struct xsave_struct *" local variable as well.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1425967585-4725-1-git-send-email-bp@alien8.de
Link: http://lkml.kernel.org/r/20150302183237.GB23085@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>