Commit Graph

11685 Commits

Author SHA1 Message Date
Ville Syrjälä
298a96c12b x86/dma-mapping: Fix arch_dma_alloc_attrs() oops with NULL dev
Commit 6894258eda broke drivers that pass NULL as the device pointer
to dma_alloc. The reason is that arch_dma_alloc_attrs() now calls
dma_alloc_coherent_gfp_flags() which in turn calls
dma_alloc_coherent_mask(), where the device pointer is dereferenced
unconditionally.

Fix things by moving the ISA DMA fallback device assignment before the
call to dma_alloc_coherent_gfp_flags().

Fixes: 6894258eda ("dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}")
Reported-and-tested-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/1445807503-8920-1-git-send-email-ville.syrjala@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-26 14:59:36 +09:00
Rafael J. Wysocki
343ccb040e Merge branches 'acpi-scan', 'acpi-tables', 'acpi-ec' and 'acpi-assorted'
* acpi-scan:
  ACPI / scan: use kstrdup_const() in acpi_add_id()
  ACPI / scan: constify struct acpi_hardware_id::id
  ACPI / scan: constify first argument of struct acpi_scan_handler::match

* acpi-tables:
  ACPI / tables: test the correct variable
  x86, ACPI: Handle apic/x2apic entries in MADT in correct order
  ACPI / tables: Add acpi_subtable_proc to ACPI table parsers

* acpi-ec:
  ACPI / EC: Fix a race issue in acpi_ec_guard_event()
  ACPI / EC: Fix query handler related issues

* acpi-assorted:
  ACPI: change acpi_sleep_proc_init() to return void
  ACPI: change init_acpi_device_notify() to return void
2015-10-25 22:54:46 +01:00
Minfei Huang
883a1e867e ftrace: Calculate the correct dyn_ftrace number to report to the userspace
Now, ftrace only calculate the dyn_ftrace number in the adding
breakpoint loop, not in adding update and finish update loop.

Calculate the correct dyn_ftrace, once ftrace reports the failure message
to the userspace.

Link: http://lkml.kernel.org/r/1442420382-13130-1-git-send-email-mnfhuang@gmail.com

Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2015-10-22 15:44:24 -04:00
Borislav Petkov
c595ac2bac x86/microcode/intel: Move #ifdef DEBUG inside the function
... and save us the stub.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:22:12 +02:00
Borislav Petkov
6f7fc44bf1 x86/microcode/amd: Remove maintainers from comments
We have the MAINTAINERS file for that. Also, Andreas doesn't
have the time for this work anymore.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andreas Herrmann <herrmann.der.user@googlemail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:22:12 +02:00
Borislav Petkov
6b26e1bf66 x86/microcode: Remove modularization leftovers
Remove the remaining module functionality leftovers. Make
"dis_ucode_ldr" an early_param and make it static again. Drop
module aliases, autoloading table, description, etc.

Bump version number, while at it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:22:12 +02:00
Borislav Petkov
fe055896c0 x86/microcode: Merge the early microcode loader
Merge the early loader functionality into the driver proper. The
diff is huge but logically, it is simply moving code from the
_early.c files into the main driver.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-3-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:22:12 +02:00
Borislav Petkov
9a2bc335f1 x86/microcode: Unmodularize the microcode driver
Make CONFIG_MICROCODE a bool. It was practically a bool already anyway,
since early loader was forcing it to =y.

Regardless, there's no real reason to have something be a module which
gets built-in on the majority of installations out there. And its not
like there's noticeable change in functionality - we still can load late
microcode - just the module glue disappears.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Jones <davej@codemonkey.org.uk>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1445334889-300-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:22:11 +02:00
Jan Beulich
3d45ac4b35 timers/x86/hpet: Type adjustments
Standardize on bool instead of an inconsistent mixture of u8 and plain 'int'.

Also use u32 or 'unsigned int' instead of 'unsigned long' when a 32-bit type
suffices, generating slightly better code on x86-64.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5624E3A002000078000AC49A@prv-mh.provo.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:17:32 +02:00
Andi Kleen
81ffdcdd97 x86/mce: Fix thermal throttling reporting after kexec
The per CPU thermal vector init code checks if the thermal
vector is already installed and complains and bails out if it
is.

This happens after kexec, as kernel shut down does not clear the
thermal vector APIC register.

This causes two problems:

1. So we always do not fully initialize thermal reports after
   kexec. The CPU is still likely initialized, as the previous
   kernel should have done it. But we don't set up the software
   pointer to the thermal vector, so reporting may end up with a
   unknown thermal interrupt message.

2. Also it complains for every logical CPU, even though the
   value is actually derived from BP only.

The problem is that we end up with one message per CPU, so on
larger systems it becomes very noisy and messes up the otherwise
nicely formatted CPU bootup numbers in the kernel log.

Just remove the check. I checked the code and there's no valid
code paths where the thermal init code for a CPU could be called
multiple times.

Why the kernel does not clean up this value on shutdown:

The thermal monitoring is controlled per logical CPU thread.
Normal shutdown code is just running on one CPU. To disable it
we would need a broadcast NMI to all CPUs on shut down. That's
overkill for this. So we just ignore it after kexec.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1445246268-26285-9-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:57 +02:00
Borislav Petkov
6f3760570e x86/setup/crash: Check memblock_reserve() retval
memblock_reserve() can fail but the crashkernel reservation code
doesn't check that and this can lead the user into believing
that the crashkernel region was actually reserved. Make sure we
check that return value and we exit early with a failure message
in the error case.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:56 +02:00
Borislav Petkov
f56d55781c x86/setup/crash: Cleanup some more
* Remove unused auto_set variable
* Cleanup local function variable declarations
* Reformat printk string and use pr_info()

No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:56 +02:00
Borislav Petkov
606134f77c x86/setup/crash: Remove alignment variable
Use a macro instead. No functionality change.

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-5-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:56 +02:00
Borislav Petkov
97eac21bab x86/setup: Cleanup crashkernel reservation functions
* Shorten variable names
* Realign code, space out for better readability

No code changed:

  # arch/x86/kernel/setup.o:

   text    data     bss     dec     hex filename
   4543    3096   69904   77543   12ee7 setup.o.before
   4543    3096   69904   77543   12ee7 setup.o.after

md5:
   8a1b7c6738a553ca207b56bd84a8f359  setup.o.before.asm
   8a1b7c6738a553ca207b56bd84a8f359  setup.o.after.asm

Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Young <dyoung@redhat.com>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Link: http://lkml.kernel.org/r/1445246268-26285-4-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:56 +02:00
Baoquan He
eb6db83d10 x86/setup: Do not reserve crashkernel high memory if low reservation failed
People reported that when allocating crashkernel memory using
the ",high" and ",low" syntax, there were cases where the
reservation of the high portion succeeds but the reservation of
the low portion fails.

Then kexec can load the kdump kernel successfully, but booting
the kdump kernel fails as there's no low memory.

The low memory allocation for the kdump kernel can fail on large
systems for a couple of reasons. For example, the manually
specified crashkernel low memory can be too large and thus no
adequate memblock region would be found.

Therefore, we try to reserve low memory for the crash kernel
*after* the high memory portion has been allocated. If that
fails, we free crashkernel high memory too and return. The user
can then take measures accordingly.

Tested-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Baoquan He <bhe@redhat.com>
[ Massage text. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Joerg Roedel <jroedel@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Young <dyoung@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: WANG Chao <chaowang@redhat.com>
Cc: jerry_hoemann@hp.com
Cc: yinghai@kernel.org
Link: http://lkml.kernel.org/r/1445246268-26285-2-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-21 11:10:55 +02:00
Andrey Ryabinin
f7d27c35dd x86/mm, kasan: Silence KASAN warnings in get_wchan()
get_wchan() is racy by design, it may access volatile stack
of running task, thus it may access redzone in a stack frame
and cause KASAN to warn about this.

Use READ_ONCE_NOCHECK() to silence these warnings.

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
Cc: kasan-dev <kasan-dev@googlegroups.com>
Link: http://lkml.kernel.org/r/1445243838-17763-3-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 11:04:19 +02:00
Stephane Eranian
d892819faa perf/x86: Add support for PERF_SAMPLE_BRANCH_CALL
This patch enables the suport for the PERF_SAMPLE_BRANCH_CALL
for Intel x86 processors. When the processor support LBR filtering
this the selection is done in hardware. Otherwise, the filter is
applied by software. Note that we chose to include zero length calls
because they also represent calls.

Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: khandual@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1444720151-10275-3-git-send-email-eranian@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:30:53 +02:00
Adrian Hunter
b9511cd761 perf/x86: Fix time_shift in perf_event_mmap_page
Commit:

  b20112edea ("perf/x86: Improve accuracy of perf/sched clock")

allowed the time_shift value in perf_event_mmap_page to be as much
as 32.  Unfortunately the documented algorithms for using time_shift
have it shifting an integer, whereas to work correctly with the value
32, the type must be u64.

In the case of perf tools, Intel PT decodes correctly but the timestamps
that are output (for example by perf script) have lost 32-bits of
granularity so they look like they are not changing at all.

Fix by limiting the shift to 31 and adjusting the multiplier accordingly.

Also update the documentation of perf_event_mmap_page so that new code
based on it will be more future-proof.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Fixes: b20112edea ("perf/x86: Improve accuracy of perf/sched clock")
Link: http://lkml.kernel.org/r/1445001845-13688-2-git-send-email-adrian.hunter@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-20 10:30:52 +02:00
Chuck Ebbert
558a65bc31 sched/x86: Fix typo in __switch_to() comments
Fix obvious mistake: FS/GS should be DS/ES.

Signed-off-by: Chuck Ebbert <cebbert.lkml@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20151014143119.78858eeb@r5
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 10:18:53 +02:00
Len Brown
fcafddec4e x86/smpboot: Fix CPU #1 boot timeout
The following commit:

  a9bcaa02a5 ("x86/smpboot: Remove SIPI delays from cpu_up()")

Caused some Intel Core2 processors to time-out when bringing up CPU #1,
resulting in the missing of that CPU after bootup.

That patch reduced the SIPI delays from udelay() 300, 200 to udelay() 0,
0 on modern processors.

Several Intel(R) Core(TM)2 systems failed to bring up CPU #1 10/10 times
after that change.

Increasing either of the SIPI delays to udelay(1) results in
success. So here we increase both to udelay(10).  While this may
be 20x slower than the absolute minimum, it is still 20x to 30x
faster than the original code.

Tested-by: Donald Parsons <dparsons@brightdsl.net>
Tested-by: Shane <shrybman@teksavvy.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dparsons@brightdsl.net
Cc: shrybman@teksavvy.com
Link: http://lkml.kernel.org/r/6dd554ee8945984d85aafb2ad35793174d068af0.1444968087.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 09:14:41 +02:00
Len Brown
f1ccd24931 x86/smpboot: Fix cpu_init_udelay=10000 corner case boot parameter misbehavior
For legacy machines cpu_init_udelay defaults to 10,000.
For modern machines it is set to 0.

The user should be able to set cpu_init_udelay to
any value on the cmdline, including 10,000.

Before this patch, that was seen as "unchanged from default"
and thus on a modern machine, the user request was ignored
and the delay was set to 0.

Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dparsons@brightdsl.net
Cc: shrybman@teksavvy.com
Link: http://lkml.kernel.org/r/de363cdbbcfcca1d22569683f7eb9873e0177251.1444968087.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-19 09:14:41 +02:00
Vitaly Kuznetsov
c0ff971ef9 x86/ioapic: Disable interrupts when re-routing legacy IRQs
A sporadic hang with consequent crash is observed when booting Hyper-V Gen1
guests:

 Call Trace:
  <IRQ>
  [<ffffffff810ab68d>] ? trace_hardirqs_off+0xd/0x10
  [<ffffffff8107b616>] queue_work_on+0x46/0x90
  [<ffffffff81365696>] ? add_interrupt_randomness+0x176/0x1d0
  ...
  <EOI>
  [<ffffffff81471ddb>] ? _raw_spin_unlock_irqrestore+0x3b/0x60
  [<ffffffff810c295e>] __irq_put_desc_unlock+0x1e/0x40
  [<ffffffff810c5c35>] irq_modify_status+0xb5/0xd0
  [<ffffffff8104adbb>] mp_register_handler+0x4b/0x70
  [<ffffffff8104c55a>] mp_irqdomain_alloc+0x1ea/0x2a0
  [<ffffffff810c7f10>] irq_domain_alloc_irqs_recursive+0x40/0xa0
  [<ffffffff810c860c>] __irq_domain_alloc_irqs+0x13c/0x2b0
  [<ffffffff8104b070>] alloc_isa_irq_from_domain.isra.1+0xc0/0xe0
  [<ffffffff8104bfa5>] mp_map_pin_to_irq+0x165/0x2d0
  [<ffffffff8104c157>] pin_2_irq+0x47/0x80
  [<ffffffff81744253>] setup_IO_APIC+0xfe/0x802
  ...
  [<ffffffff814631c0>] ? rest_init+0x140/0x140

The issue is easily reproducible with a simple instrumentation: if
mdelay(10) is put between mp_setup_entry() and mp_register_handler() calls
in mp_irqdomain_alloc() Hyper-V guest always fails to boot when re-routing
IRQ0. The issue seems to be caused by the fact that we don't disable
interrupts while doing IOPIC programming for legacy IRQs and IRQ0 actually
happens. 

Protect the setup sequence against concurrent interrupts.

[ tglx: Make the protection unconditional and not only for legacy
  	interrupts ]

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Link: http://lkml.kernel.org/r/1444930943-19336-1-git-send-email-vkuznets@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-16 16:31:24 +02:00
Paolo Bonzini
f5f3497cad x86/setup: Extend low identity map to cover whole kernel range
On 32-bit systems, the initial_page_table is reused by
efi_call_phys_prolog as an identity map to call
SetVirtualAddressMap.  efi_call_phys_prolog takes care of
converting the current CPU's GDT to a physical address too.

For PAE kernels the identity mapping is achieved by aliasing the
first PDPE for the kernel memory mapping into the first PDPE
of initial_page_table.  This makes the EFI stub's trick "just work".

However, for non-PAE kernels there is no guarantee that the identity
mapping in the initial_page_table extends as far as the GDT; in this
case, accesses to the GDT will cause a page fault (which quickly becomes
a triple fault).  Fix this by copying the kernel mappings from
swapper_pg_dir to initial_page_table twice, both at PAGE_OFFSET and at
identity mapping.

For some reason, this is only reproducible with QEMU's dynamic translation
mode, and not for example with KVM.  However, even under KVM one can clearly
see that the page table is bogus:

    $ qemu-system-i386 -pflash OVMF.fd -M q35 vmlinuz0 -s -S -daemonize
    $ gdb
    (gdb) target remote localhost:1234
    (gdb) hb *0x02858f6f
    Hardware assisted breakpoint 1 at 0x2858f6f
    (gdb) c
    Continuing.

    Breakpoint 1, 0x02858f6f in ?? ()
    (gdb) monitor info registers
    ...
    GDT=     0724e000 000000ff
    IDT=     fffbb000 000007ff
    CR0=0005003b CR2=ff896000 CR3=032b7000 CR4=00000690
    ...

The page directory is sane:

    (gdb) x/4wx 0x32b7000
    0x32b7000:	0x03398063	0x03399063	0x0339a063	0x0339b063
    (gdb) x/4wx 0x3398000
    0x3398000:	0x00000163	0x00001163	0x00002163	0x00003163
    (gdb) x/4wx 0x3399000
    0x3399000:	0x00400003	0x00401003	0x00402003	0x00403003

but our particular page directory entry is empty:

    (gdb) x/1wx 0x32b7000 + (0x724e000 >> 22) * 4
    0x32b7070:	0x00000000

[ It appears that you can skate past this issue if you don't receive
  any interrupts while the bogus GDT pointer is loaded, or if you avoid
  reloading the segment registers in general.

  Andy Lutomirski provides some additional insight:

   "AFAICT it's entirely permissible for the GDTR and/or LDT
    descriptor to point to unmapped memory.  Any attempt to use them
    (segment loads, interrupts, IRET, etc) will try to access that memory
    as if the access came from CPL 0 and, if the access fails, will
    generate a valid page fault with CR2 pointing into the GDT or
    LDT."

  Up until commit 23a0d4e8fa ("efi: Disable interrupts around EFI
  calls, not in the epilog/prolog calls") interrupts were disabled
  around the prolog and epilog calls, and the functional GDT was
  re-installed before interrupts were re-enabled.

  Which explains why no one has hit this issue until now. ]

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reported-by: Laszlo Ersek <lersek@redhat.com>
Cc: <stable@vger.kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
[ Updated changelog. ]
2015-10-16 10:52:29 +01:00
Lukasz Anaczkowski
d81056b527 x86, ACPI: Handle apic/x2apic entries in MADT in correct order
ACPI specifies the following rules when listing APIC IDs:
(1) Boot processor is listed first
(2) For multi-threaded processors, BIOS should list the first logical
    processor of each of the individual multi-threaded processors in MADT
    before listing any of the second logical processors.
(3) APIC IDs < 0xFF should be listed in APIC subtable, APIC IDs >= 0xFF
    should be listed in X2APIC subtable

Because of above, when there's more than 0xFF logical CPUs, BIOS
interleaves APIC/X2APIC subtables.

Assuming, there's 72 cores, 72 hyper-threads each, 288 CPUs total,
listing is like this:

APIC (0,4,8, .., 252)
X2APIC (258,260,264, .. 284)
APIC (1,5,9,...,253)
X2APIC (259,261,265,...,285)
APIC (2,6,10,...,254)
X2APIC (260,262,266,..,286)
APIC (3,7,11,...,251)
X2APIC (255,261,262,266,..,287)

Now, before this patch, due to how ACPI MADT subtables were parsed (BSP
then X2APIC then APIC), kernel enumerated CPUs in reverted order (i.e.
high APIC IDs were getting low logical IDs, and low APIC IDs were
getting high logical IDs).
This is wrong for the following reasons:
() it's hard to predict how cores and threads are enumerated
() when it's hard to predict, s/w threads cannot be properly affinitized
   causing significant performance impact due to e.g. inproper cache
   sharing
() enumeration is inconsistent with how threads are enumerated on
   other Intel Xeon processors

So, order in which MADT APIC/X2APIC handlers are passed is
reverse and both handlers are passed to be called during same MADT
table to walk to achieve correct CPU enumeration.

In scenario when someone boots kernel with options 'maxcpus=72 nox2apic',
in result less cores may be booted, since some of the CPUs the kernel
will try to use will have APIC ID >= 0xFF. In such case, one
should not pass 'nox2apic'.

Disclimer: code parsing MADT APIC/X2APIC has not been touched since 2009,
when X2APIC support was initially added. I do not know why MADT parsing
code was added in the reversed order in the first place.
I guess it didn't matter at that time since nobody cared about cores
with APIC IDs >= 0xFF, right?

This patch is based on work of "Yinghai Lu <yinghai@kernel.org>"
previously published at https://lkml.org/lkml/2013/1/21/563

Here's the explanation why parsing interface needs to be changed
and why simpler approach will not work https://lkml.org/lkml/2015/9/7/285

Signed-off-by: Lukasz Anaczkowski <lukasz.anaczkowski@intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de> (commit message)
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2015-10-15 01:31:06 +02:00
Ingo Molnar
790a2ee242 * Make the EFI System Resource Table (ESRT) driver explicitly
non-modular by ripping out the module_* code since Kconfig doesn't
    allow it to be built as a module anyway - Paul Gortmaker
 
  * Make the x86 efi=debug kernel parameter, which enables EFI debug
    code and output, generic and usable by arm64 - Leif Lindholm
 
  * Add support to the x86 EFI boot stub for 64-bit Graphics Output
    Protocol frame buffer addresses - Matt Fleming
 
  * Detect when the UEFI v2.5 EFI_PROPERTIES_TABLE feature is enabled
    in the firmware and set an efi.flags bit so the kernel knows when
    it can apply more strict runtime mapping attributes - Ard Biesheuvel
 
  * Auto-load the efi-pstore module on EFI systems, just like we
    currently do for the efivars module - Ben Hutchings
 
  * Add "efi_fake_mem" kernel parameter which allows the system's EFI
    memory map to be updated with additional attributes for specific
    memory ranges. This is useful for testing the kernel code that handles
    the EFI_MEMORY_MORE_RELIABLE memmap bit even if your firmware
    doesn't include support - Taku Izumi
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWG7OwAAoJEC84WcCNIz1VEEEP/0SsdrwJ66B4MfP5YNjqHYWm
 +OTHR6Ovv2i10kc+NjOV/GN8sWPndnkLfIfJ4EqJ9BoQ9PDEYZilV2aleSQ4DrPm
 H7uGwBXQkfd76tZKX9pMToK76mkhg6M7M2LR3Suv3OGfOEzuozAOt3Ez37lpksTN
 2ByhHr/oGbhu99jC2ki5+k0ySH8PMqDBRxqrPbBzTD+FfB7bM11vAJbSNbSMQ21R
 ZwX0acZBLqb9J2Vf7tDsW+fCfz0TFo8JHW8jdLRFm/y2dpquzxswkkBpODgA8+VM
 0F5UbiUdkaIRug75I6N/OJ8+yLwdzuxm7ul+tbS3JrXGLAlK3850+dP2Pr5zQ2Ce
 zaYGRUy+tD5xMXqOKgzpu+Ia8XnDRLhOlHabiRd5fG6ZC9nR8E9uK52g79voSN07
 pADAJnVB03CGV/HdduDOI4C4UykUKubuArbQVkqWJcecV1Jic/tYI0gjeACmU1VF
 v8FzXpBUe3U3A0jauOz8PBz8M+k5qky/GbIrnEvXreBtKdt999LN9fykTN7rBOpo
 dk/6vTR1Jyv3aYc9EXHmRluktI6KmfWCqmRBOIgQveX1VhdRM+1w2LKC0+8co3dF
 v/DBh19KDyfPI8eOvxKykhn164UeAt03EXqDa46wFGr2nVOm/JiShL/d+QuyYU4G
 8xb/rET4JrhCG4gFMUZ7
 =1Oee
 -----END PGP SIGNATURE-----

Merge tag 'efi-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mfleming/efi into core/efi

Pull v4.4 EFI updates from Matt Fleming:

  - Make the EFI System Resource Table (ESRT) driver explicitly
    non-modular by ripping out the module_* code since Kconfig doesn't
    allow it to be built as a module anyway. (Paul Gortmaker)

  - Make the x86 efi=debug kernel parameter, which enables EFI debug
    code and output, generic and usable by arm64. (Leif Lindholm)

  - Add support to the x86 EFI boot stub for 64-bit Graphics Output
    Protocol frame buffer addresses. (Matt Fleming)

  - Detect when the UEFI v2.5 EFI_PROPERTIES_TABLE feature is enabled
    in the firmware and set an efi.flags bit so the kernel knows when
    it can apply more strict runtime mapping attributes - Ard Biesheuvel

  - Auto-load the efi-pstore module on EFI systems, just like we
    currently do for the efivars module. (Ben Hutchings)

  - Add "efi_fake_mem" kernel parameter which allows the system's EFI
    memory map to be updated with additional attributes for specific
    memory ranges. This is useful for testing the kernel code that handles
    the EFI_MEMORY_MORE_RELIABLE memmap bit even if your firmware
    doesn't include support. (Taku Izumi)

Note: there is a semantic conflict between the following two commits:

  8a53554e12 ("x86/efi: Fix multiple GOP device support")
  ae2ee627dc ("efifb: Add support for 64-bit frame buffer addresses")

I fixed up the interaction in the merge commit, changing the type of
current_fb_base from u32 to u64.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-14 16:51:34 +02:00
Andy Shevchenko
3435dd0809 x86/early_printk: Set __iomem address space for IO
There are following warnings on unpatched code:

arch/x86/kernel/early_printk.c:198:32: warning: incorrect type in initializer (different address spaces)
arch/x86/kernel/early_printk.c:198:32:    expected void [noderef] <asn:2>*vaddr
arch/x86/kernel/early_printk.c:198:32:    got unsigned int [usertype] *<noident>
arch/x86/kernel/early_printk.c:205:32: warning: incorrect type in initializer (different address spaces)
arch/x86/kernel/early_printk.c:205:32:    expected void [noderef] <asn:2>*vaddr
arch/x86/kernel/early_printk.c:205:32:    got unsigned int [usertype] *<noident>

Annotate it proper.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: http://lkml.kernel.org/r/1444646837-42615-1-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-13 21:45:56 +02:00
Borislav Petkov
0399f73299 x86/microcode/amd: Do not overwrite final patch levels
A certain number of patch levels of applied microcode should not
be overwritten by the microcode loader, otherwise bad things
will happen.

Check those and abort update if the current core has one of
those final patch levels applied by the BIOS. 32-bit needs
special handling, of course.

See https://bugzilla.suse.com/show_bug.cgi?id=913996 for more
info.

Tested-by: Peter Kirchgeßner <pkirchgessner@t-online.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1444641762-9437-7-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-12 16:15:48 +02:00
Borislav Petkov
2eff73c0a1 x86/microcode/amd: Extract current patch level read to a function
Pave the way for checking the current patch level of the
microcode in a core. We want to be able to do stuff depending on
the patch level - in this case decide whether to update or not.
But that will be added in a later patch.

Drop unused local var uci assignment, while at it.

Integrate a fix for 32-bit and CONFIG_PARAVIRT from Takashi Iwai:

 Use native_rdmsr() in check_current_patch_level() because with
 CONFIG_PARAVIRT enabled and on 32-bit, where we run before
 paging has been enabled, we cannot deref pv_info yet. Or we
 could, but we'd need to access its physical address. This way of
 fixing it is simpler. See:

   https://bugzilla.suse.com/show_bug.cgi?id=943179 for the background.

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Takashi Iwai <tiwai@suse.com>:
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Link: http://lkml.kernel.org/r/1444641762-9437-6-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-12 16:15:48 +02:00
Taku Izumi
0f96a99dab efi: Add "efi_fake_mem" boot option
This patch introduces new boot option named "efi_fake_mem".
By specifying this parameter, you can add arbitrary attribute
to specific memory range.
This is useful for debugging of Address Range Mirroring feature.

For example, if "efi_fake_mem=2G@4G:0x10000,2G@0x10a0000000:0x10000"
is specified, the original (firmware provided) EFI memmap will be
updated so that the specified memory regions have
EFI_MEMORY_MORE_RELIABLE attribute (0x10000):

 <original>
   efi: mem36: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x00000020a0000000) (129536MB)

 <updated>
   efi: mem36: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x0000000100000000-0x0000000180000000) (2048MB)
   efi: mem37: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000000180000000-0x00000010a0000000) (61952MB)
   efi: mem38: [Conventional Memory|  |MR|  |  |  |   |WB|WT|WC|UC] range=[0x00000010a0000000-0x0000001120000000) (2048MB)
   efi: mem39: [Conventional Memory|  |  |  |  |  |   |WB|WT|WC|UC] range=[0x0000001120000000-0x00000020a0000000) (63488MB)

And you will find that the following message is output:

   efi: Memory: 4096M/131455M mirrored memory

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Xishi Qiu <qiuxishi@huawei.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Matt Fleming <matt.fleming@intel.com>
2015-10-12 14:20:09 +01:00
Ingo Molnar
cdbcd239e2 Merge branch 'x86/ras' into ras/core, to pick up changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-12 14:52:34 +02:00
Minfei Huang
e9c40d257f x86/kexec: Remove obsolete 'in_crash_kexec' flag
Previously, UV NMI used the 'in_crash_kexec' flag to determine whether
we are in a kdump kernel or not:

  5edd19af18 ("x86, UV: Make kdump avoid stack dumps")

But this flags was removed in the following commit:

  9c48f1c629 ("x86, nmi: Wire up NMI handlers to new routines")

Since it isn't used any more, remove it.

Signed-off-by: Minfei Huang <mnfhuang@gmail.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: cpw@sgi.com
Cc: kexec@lists.infradead.org
Cc: mhuang@redhat.com
Link: http://lkml.kernel.org/r/1444070155-17934-1-git-send-email-mhuang@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-12 09:43:11 +02:00
Andy Shevchenko
4faefda97b x86/io_apic: Make eoi_ioapic_pin() static
We have to define internally used function as static, otherwise the following
warning will be generated:

arch/x86/kernel/apic/io_apic.c:532:6: warning: no previous prototype for 'eoi_ioapic_pin' [-Wmissing-prototypes]

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Jiang Liu <jiang.liu@linux.intel.com>
Link: http://lkml.kernel.org/r/1444400685-98611-1-git-send-email-andriy.shevchenko@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-11 21:10:31 +02:00
Andy Lutomirski
487e3bf4f7 x86/asm: Remove thread_info.sysenter_return
It's no longer needed.

We could reinstate something like it as an optimization, which
would remove two cachelines from the fast syscall entry working
set.  I benchmarked it, and it makes no difference whatsoever to
the performance of cache-hot compat syscalls on Sandy Bridge.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/f08cc0cff30201afe9bb565c47134c0a6c1a96a2.1444091585.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-09 09:41:11 +02:00
Ingo Molnar
d3df65c198 Merge branch 'perf/urgent' into perf/core, before pulling new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-08 10:52:18 +02:00
Andy Lutomirski
0a6d1fa0d2 x86/vdso: Remove runtime 32-bit vDSO selection
32-bit userspace will now always see the same vDSO, which is
exactly what used to be the int80 vDSO.  Subsequent patches will
clean it up and make it support SYSENTER and SYSCALL using
alternatives.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/e7e6b3526fa442502e6125fe69486aab50813c32.1444091584.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-07 11:34:08 +02:00
Taku Izumi
712df65ccb perf/x86/intel/uncore: Fix multi-segment problem of perf_event_intel_uncore
In multi-segment system, uncore devices may belong to buses whose segment
number is other than 0:

  ....
  0000:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:7f:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:bf:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03)
  ...
  0001:ff:10.5 System peripheral: Intel Corporation Xeon E5 v3/Core i7 Scratchpad & Semaphore Registers (rev 03
  ...

In that case, relation of bus number and physical id may be broken
because "uncore_pcibus_to_physid" doesn't take account of PCI segment.
For example, bus 0000:ff and 0001:ff uses the same entry of
"uncore_pcibus_to_physid" array.

This patch fixes this problem by introducing the segment-aware pci2phy_map instead.

Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1443096621-4119-1-git-send-email-izumi.taku@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:31:51 +02:00
Kan Liang
7ce1346a68 perf/x86: Add Intel cstate PMUs support
This patch adds new PMUs to support cstate related free running
(read-only) counters. These counters may be used simultaneously by other
tools, such as turbostat. However, it still make sense to implement them
in perf. Because we can conveniently collect them together with other
events, and allow to use them from tools without special MSR access
code.

These counters include CORE_C*_RESIDENCY and PKG_C*_RESIDENCY.
According to counters' scope and category, two PMUs are registered with
the perf_event core subsystem.

 - 'cstate_core': The counter is available for each physical core. The
                  counters include CORE_C*_RESIDENCY.

 - 'cstate_pkg':  The counter is available for each physical package. The
                  counters include PKG_C*_RESIDENCY.

The events are exposed in sysfs for use by perf stat and other tools.
The files are:

  /sys/devices/cstate_core/events/c*-residency
  /sys/devices/cstate_pkg/events/c*-residency

These events only support system-wide mode counting.
The /sys/devices/cstate_*/cpumask file can be used by tools to figure
out which CPUs to monitor by default.

The PMU type (attr->type) is dynamically allocated and is available from
/sys/devices/core_misc/type and /sys/device/cstate_*/type.

Sampling is not supported.

Here is an example.

 - To caculate the fraction of time when the core is running in C6 state
   CORE_C6_time% = CORE_C6_RESIDENCY / TSC

 # perf stat -x, -e"cstate_core/c6-residency/,msr/tsc/" -C0 -- taskset -c 0 sleep 5

   11838820015,,cstate_core/c6-residency/,5175919658,100.00
   11877130740,,msr/tsc/,5175922010,100.00

 For sleep, 99.7% of time we ran in C6 state.

 # perf stat -x, -e"cstate_core/c6-residency/,msr/tsc/" -C0 -- taskset -c 0 busyloop

   1253316,,cstate_core/c6-residency/,4360969154,100.00
   10012635248,,msr/tsc/,4360972366,100.00

 For busyloop, 0.01% of time we ran in C6 state.

Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: acme@kernel.org
Cc: eranian@google.com
Link: http://lkml.kernel.org/r/1443443404-8581-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:31:51 +02:00
Peter Zijlstra
d87b7a3379 sched/core, sched/x86: Kill thread_info::saved_preempt_count
With the introduction of the context switch preempt_count invariant,
and the demise of PREEMPT_ACTIVE, its pointless to save/restore the
per-cpu preemption count, it must always be 2.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-06 17:08:18 +02:00
Linus Torvalds
2cf30826bb Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 "Fixes all around the map: W+X kernel mapping fix, WCHAN fixes, two
  build failure fixes for corner case configs, x32 header fix and a
  speling fix"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/headers/uapi: Fix __BITS_PER_LONG value for x32 builds
  x86/mm: Set NX on gap between __ex_table and rodata
  x86/kexec: Fix kexec crash in syscall kexec_file_load()
  x86/process: Unify 32bit and 64bit implementations of get_wchan()
  x86/process: Add proper bound checks in 64bit get_wchan()
  x86, efi, kasan: Fix build failure on !KASAN && KMEMCHECK=y kernels
  x86/hyperv: Fix the build in the !CONFIG_KEXEC_CORE case
  x86/cpufeatures: Correct spelling of the HWP_NOTIFY flag
2015-10-03 10:53:05 -04:00
Lee, Chun-Yi
e3c41e37b0 x86/kexec: Fix kexec crash in syscall kexec_file_load()
The original bug is a page fault crash that sometimes happens
on big machines when preparing ELF headers:

    BUG: unable to handle kernel paging request at ffffc90613fc9000
    IP: [<ffffffff8103d645>] prepare_elf64_ram_headers_callback+0x165/0x260

The bug is caused by us under-counting the number of memory ranges
and subsequently not allocating enough ELF header space for them.
The bug is typically masked on smaller systems, because the ELF header
allocation is rounded up to the next page.

This patch modifies the code in fill_up_crash_elf_data() by using
walk_system_ram_res() instead of walk_system_ram_range() to correctly
count the max number of crash memory ranges. That's because the
walk_system_ram_range() filters out small memory regions that
reside in the same page, but walk_system_ram_res() does not.

Here's how I found the bug:

After tracing prepare_elf64_headers() and prepare_elf64_ram_headers_callback(),
the code uses walk_system_ram_res() to fill-in crash memory regions information
to the program header, so it counts those small memory regions that
reside in a page area.

But, when the kernel was using walk_system_ram_range() in
fill_up_crash_elf_data() to count the number of crash memory regions,
it filters out small regions.

I printed those small memory regions, for example:

  kexec: Get nr_ram ranges. vaddr=0xffff880077592258 paddr=0x77592258, sz=0xdc0

Based on the code in walk_system_ram_range(), this memory region
will be filtered out:

  pfn = (0x77592258 + 0x1000 - 1) >> 12 = 0x77593
  end_pfn = (0x77592258 + 0xfc0 -1 + 1) >> 12 = 0x77593
  end_pfn - pfn = 0x77593 - 0x77593 = 0  <=== if (end_pfn > pfn) is FALSE

So, the max_nr_ranges that's counted by the kernel doesn't include
small memory regions - causing us to under-allocate the required space.
That causes the page fault crash that happens in a later code path
when preparing ELF headers.

This bug is not easy to reproduce on small machines that have few
CPUs, because the allocated page aligned ELF buffer has more free
space to cover those small memory regions' PT_LOAD headers.

Signed-off-by: Lee, Chun-Yi <jlee@suse.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Takashi Iwai <tiwai@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: kexec@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/1443531537-29436-1-git-send-email-jlee@suse.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-02 09:13:06 +02:00
Radim Krčmář
72c930dcfc x86: kvmclock: abolish PVCLOCK_COUNTS_FROM_ZERO
Newer KVM won't be exposing PVCLOCK_COUNTS_FROM_ZERO anymore.
The purpose of that flags was to start counting system time from 0 when
the KVM clock has been initialized.
We can achieve the same by selecting one read as the initial point.

A simple subtraction will work unless the KVM clock count overflows
earlier (has smaller width) than scheduler's cycle count.  We should be
safe till x86_128.

Because PVCLOCK_COUNTS_FROM_ZERO was enabled only on new hypervisors,
setting sched clock as stable based on PVCLOCK_TSC_STABLE_BIT might
regress on older ones.

I presume we don't need to change kvm_clock_read instead of introducing
kvm_sched_clock_read.  A problem could arise in case sched_clock is
expected to return the same value as get_cycles, but we should have
merged those clocks in that case.

Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
Acked-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2015-10-01 15:06:42 +02:00
Geliang Tang
a7e705af52 x86/irq: Drop unlikely before IS_ERR_OR_NULL
IS_ERR_OR_NULL already contain an unlikely compiler flag. Drop it.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Link: http://lkml.kernel.org/r/03d18502ed7ed417f136c091f417d2d88c147ec6.1443667610.git.geliangtang@163.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-10-01 11:08:56 +02:00
Vaishali Thakkar
c2365b9388 perf/x86/intel/uncore: Do not use macro DEFINE_PCI_DEVICE_TABLE()
The DEFINE_PCI_DEVICE_TABLE() macro is deprecated. Use
'struct pci_device_id' instead of DEFINE_PCI_DEVICE_TABLE(),
with the goal of getting rid of this macro completely.

This Coccinelle semantic patch performs this transformation:

	@@
	identifier a;
	declarer name DEFINE_PCI_DEVICE_TABLE;
	initializer i;
	@@
	- DEFINE_PCI_DEVICE_TABLE(a)
	+ const struct pci_device_id a[] = i;

Signed-off-by: Vaishali Thakkar <vthakkar1994@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20151001085201.GA16939@localhost
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-10-01 10:53:03 +02:00
Denys Vlasenko
dae0f305d6 x86/signal: Deinline get_sigframe, save 240 bytes
This function compiles to 277 bytes of machine code and has 4 callsites.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Link: http://lkml.kernel.org/r/1443443037-22077-4-git-send-email-dvlasenk@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30 21:54:40 +02:00
Denys Vlasenko
c368ef2866 x86: Deinline early_console_register, save 403 bytes
This function compiles to 60 bytes of machine code.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Link: http://lkml.kernel.org/r/1443443037-22077-3-git-send-email-dvlasenk@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30 21:54:40 +02:00
Denys Vlasenko
e6e5f84092 x86/e820: Deinline e820_type_to_string, save 126 bytes
This function compiles to 102 bytes of machine code. It has two
callsites.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Link: http://lkml.kernel.org/r/1443443037-22077-2-git-send-email-dvlasenk@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30 21:54:40 +02:00
Thomas Gleixner
7ba78053aa x86/process: Unify 32bit and 64bit implementations of get_wchan()
The stack layout and the functionality is identical. Use the 64bit
version for all of x86.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: kasan-dev <kasan-dev@googlegroups.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Sasha Levin <sasha.levin@oracle.com>
Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
Link: http://lkml.kernel.org/r/20150930083302.779694618@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30 21:51:34 +02:00
Thomas Gleixner
eddd3826a1 x86/process: Add proper bound checks in 64bit get_wchan()
Dmitry Vyukov reported the following using trinity and the memory
error detector AddressSanitizer
(https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel).

[ 124.575597] ERROR: AddressSanitizer: heap-buffer-overflow on
address ffff88002e280000
[ 124.576801] ffff88002e280000 is located 131938492886538 bytes to
the left of 28857600-byte region [ffffffff81282e0a, ffffffff82e0830a)
[ 124.578633] Accessed by thread T10915:
[ 124.579295] inlined in describe_heap_address
./arch/x86/mm/asan/report.c:164
[ 124.579295] #0 ffffffff810dd277 in asan_report_error
./arch/x86/mm/asan/report.c:278
[ 124.580137] #1 ffffffff810dc6a0 in asan_check_region
./arch/x86/mm/asan/asan.c:37
[ 124.581050] #2 ffffffff810dd423 in __tsan_read8 ??:0
[ 124.581893] #3 ffffffff8107c093 in get_wchan
./arch/x86/kernel/process_64.c:444

The address checks in the 64bit implementation of get_wchan() are
wrong in several ways:

 - The lower bound of the stack is not the start of the stack
   page. It's the start of the stack page plus sizeof (struct
   thread_info)

 - The upper bound must be:

       top_of_stack - TOP_OF_KERNEL_STACK_PADDING - 2 * sizeof(unsigned long).

   The 2 * sizeof(unsigned long) is required because the stack pointer
   points at the frame pointer. The layout on the stack is: ... IP FP
   ... IP FP. So we need to make sure that both IP and FP are in the
   bounds.

Fix the bound checks and get rid of the mix of numeric constants, u64
and unsigned long. Making all unsigned long allows us to use the same
function for 32bit as well.

Use READ_ONCE() when accessing the stack. This does not prevent a
concurrent wakeup of the task and the stack changing, but at least it
avoids TOCTOU.

Also check task state at the end of the loop. Again that does not
prevent concurrent changes, but it avoids walking for nothing.

Add proper comments while at it.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Based-on-patch-from: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@alien8.de>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Kostya Serebryany <kcc@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: kasan-dev <kasan-dev@googlegroups.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Wolfram Gloger <wmglo@dent.med.uni-muenchen.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930083302.694788319@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30 21:51:34 +02:00
Thomas Gleixner
09907ca630 Merge branch 'x86/for-kvm' into x86/apic
Pull in the apic change which is provided for kvm folks to pull into
their tree.
2015-09-30 21:20:39 +02:00
Denys Vlasenko
d786ad32c3 x86/apic: Deinline various functions
__x2apic_disable: 178 bytes, 3 calls
__x2apic_enable: 117 bytes, 3 calls
__smp_spurious_interrupt: 110 bytes, 2 calls
__smp_error_interrupt: 208 bytes, 2 calls

Reduces code size by about 850 bytes.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Link: http://lkml.kernel.org/r/1443559022-23793-1-git-send-email-dvlasenk@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-30 21:15:53 +02:00
Vitaly Kuznetsov
1e034743e9 x86/hyperv: Fix the build in the !CONFIG_KEXEC_CORE case
Recent changes in the Hyper-V driver:

  b4370df2b1 ("Drivers: hv: vmbus: add special crash handler")

broke the build when CONFIG_KEXEC_CORE is not set:

  arch/x86/built-in.o: In function `hv_machine_crash_shutdown':
  arch/x86/kernel/cpu/mshyperv.c:112: undefined reference to `native_machine_crash_shutdown'

Decorate all kexec related code with #ifdef CONFIG_KEXEC_CORE.

Reported-by: Jim Davis <jim.epost@gmail.com>
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: devel@linuxdriverproject.org
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1443002577-25370-1-git-send-email-vkuznets@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-30 07:44:15 +02:00
Ashok Raj
6e06780a98 x86/mce: Don't clear shared banks on Intel when offlining CPUs
It is not safe to clear global MCi_CTL banks during CPU offline
or suspend/resume operations. These MSRs are either
thread-scoped (meaning private to a thread), or core-scoped
(private to threads in that core only), or with a socket scope:
visible and controllable from all threads in the socket.

When we offline a single CPU, clearing those MCi_CTL bits will
stop signaling for all the shared, i.e., socket-wide resources,
such as LLC, iMC, etc.

In addition, it might be possible to compromise the integrity of
an Intel Secure Guard eXtentions (SGX) system if the attacker
has control of the host system and is able to inject errors
which would be otherwise ignored when MCi_CTL bits are cleared.

Hence on SGX enabled systems, if MCi_CTL is cleared, SGX gets
disabled.

Tested-by: Serge Ayoun <serge.ayoun@intel.com>
Signed-off-by: Ashok Raj <ashok.raj@intel.com>
[ Cleanup text. ]
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-edac <linux-edac@vger.kernel.org>
Link: http://lkml.kernel.org/r/1441391390-16985-1-git-send-email-ashok.raj@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-28 10:15:26 +02:00
Denys Vlasenko
0d44975d1e x86/kgdb: Replace bool_int_array[NR_CPUS] with bitmap
Straigntforward conversion from:

    int was_in_debug_nmi[NR_CPUS]

to:

    DECLARE_BITMAP(was_in_debug_nmi, NR_CPUS)

Saves about 2 kbytes in BSS for NR_CPUS=512.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443271638-2568-1-git-send-email-dvlasenk@redhat.com
[ Tidied up the code a bit. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-28 10:13:31 +02:00
Geliang Tang
18ab2cd3ee perf/core, perf/x86: Change needlessly global functions and a variable to static
Fixes various sparse warnings.

Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/70c14234da1bed6e3e67b9c419e2d5e376ab4f32.1443367286.git.geliangtang@163.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-28 08:09:52 +02:00
Ingo Molnar
6afc0c269c Merge branch 'linus' into perf/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-28 08:06:57 +02:00
Linus Torvalds
e3be4266d3 Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
 "Another pile of fixes for perf:

   - Plug overflows and races in the core code

   - Sanitize the flow of the perf syscall so we error out before
     handling the more complex and hard to undo setups

   - Improve and fix Broadwell and Skylake hardware support

   - Revert a fix which broke what it tried to fix in perf tools

   - A couple of smaller fixes in various places of perf tools"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tools: Fix copying of /proc/kcore
  perf intel-pt: Remove no_force_psb from documentation
  perf probe: Use existing routine to look for a kernel module by dso->short_name
  perf/x86: Change test_aperfmperf() and test_intel() to static
  tools lib traceevent: Fix string handling in heterogeneous arch environments
  perf record: Avoid infinite loop at buildid processing with no samples
  perf: Fix races in computing the header sizes
  perf: Fix u16 overflows
  perf: Restructure perf syscall point of no return
  perf/x86/intel: Fix Skylake FRONTEND MSR extrareg mask
  perf/x86/intel/pebs: Add PEBS frontend profiling for Skylake
  perf/x86/intel: Make the CYCLE_ACTIVITY.* constraint on Broadwell more specific
  perf tools: Bool functions shouldn't return -1
  tools build: Add test for presence of __get_cpuid() gcc builtin
  tools build: Add test for presence of numa_num_possible_cpus() in libnuma
  Revert "perf symbols: Fix mismatched declarations for elf_getphdrnum"
  perf stat: Fix per-pkg event reporting bug
2015-09-27 12:51:39 -04:00
Geliang Tang
7e5560a564 perf/x86: Change test_aperfmperf() and test_intel() to static
Fixes the following sparse warnings:

 arch/x86/kernel/cpu/perf_event_msr.c:13:6: warning: symbol
 'test_aperfmperf' was not declared. Should it be static?

 arch/x86/kernel/cpu/perf_event_msr.c:18:6: warning: symbol
 'test_intel' was not declared. Should it be static?

Signed-off-by: Geliang Tang <geliangtang@163.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/4588e8ab09638458f2451af572827108be3b4a36.1443123796.git.geliangtang@163.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-25 09:42:40 +02:00
Kristen Carlson Accardi
a7adb91b13 x86/cpufeatures: Correct spelling of the HWP_NOTIFY flag
Because noitification just isn't right.

Signed-off-by: Kristen Carlson Accardi <kristen@linux.intel.com>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: rjw@rjwysocki.net
Link: http://lkml.kernel.org/r/1442944296-11737-1-git-send-email-kristen@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-23 09:57:24 +02:00
Andy Lutomirski
fc57a7c680 x86/paravirt: Replace the paravirt nop with a bona fide empty function
PARAVIRT_ADJUST_EXCEPTION_FRAME generates this code (using nmi as an
example, trimmed for readability):

    ff 15 00 00 00 00       callq  *0x0(%rip)        # 2796 <nmi+0x6>
              2792: R_X86_64_PC32     pv_irq_ops+0x2c

That's a call through a function pointer to regular C function that
does nothing on native boots, but that function isn't protected
against kprobes, isn't marked notrace, and is certainly not
guaranteed to preserve any registers if the compiler is feeling
perverse.  This is bad news for a CLBR_NONE operation.

Of course, if everything works correctly, once paravirt ops are
patched, it gets nopped out, but what if we hit this code before
paravirt ops are patched in?  This can potentially cause breakage
that is very difficult to debug.

A more subtle failure is possible here, too: if _paravirt_nop uses
the stack at all (even just to push RBP), it will overwrite the "NMI
executing" variable if it's called in the NMI prologue.

The Xen case, perhaps surprisingly, is fine, because it's already
written in asm.

Fix all of the cases that default to paravirt_nop (including
adjust_exception_frame) with a big hammer: replace paravirt_nop with
an asm function that is just a ret instruction.

The Xen case may have other problems, so document them.

This is part of a fix for some random crashes that Sasha saw.

Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/8f5d2ba295f9d73751c33d97fda03e0495d9ade0.1442791737.git.luto@kernel.org
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-22 22:40:28 +02:00
Daniel J Blueman
ad03a9c25d x86/numachip: Add Numachip IPI optimisations
When sending IPIs, first check if the non-local part of the source and
destination APIC IDs match; if so, send via the local APIC for efficiency.

Secondly, since the AMD BIOS-kernel developer guide states IPI delivery
will occur invarient of prior deliver status, avoid polling the delivery
status bit for efficiency.

Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-3-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-22 22:25:33 +02:00
Daniel J Blueman
d9d4dee6ce x86/numachip: Add Numachip2 APIC support
Introduce support for Numachip2 remote interrupts via detecting the right
ACPI SRAT signature.

Access is performed via a fixed mapping in the x86 physical address space.

Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-2-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-22 22:25:33 +02:00
Daniel J Blueman
db1003a719 x86/numachip: Cleanup Numachip support
Drop unused code and includes in Numachip header files and APIC driver.

Additionally, use the 'numachip1' prefix on Numachip1-specific functions;
this prepares for adding Numachip2 support in later patches.

Signed-off-by: Daniel J Blueman <daniel@numascale.com>
Acked-by: Steffen Persvold <sp@numascale.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Link: http://lkml.kernel.org/r/1442768522-19217-1-git-send-email-daniel@numascale.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-22 22:25:32 +02:00
Linus Torvalds
fadb97b089 Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
 "This is a rather large update post rc1 due to the final steps of
  cleanups and API changes which had to wait for the preparatory patches
  to hit your tree.

   - Regression fixes for ARM GIC irqchips

   - Regression fixes and lockdep anotations for renesas irq chips

   - The leftovers of the cleanup and preparatory patches which have
     been ignored by maintainers

   - Final conversions of the newly merged users of obsolete APIs

   - Final removal of obsolete APIs

   - Final removal of ARM artifacts which had been introduced during the
     conversion of ARM to the generic interrupt code.

   - Final split of the irq_data into chip specific and common data to
     reflect the needs of hierarchical irq domains.

   - Treewide removal of the first argument of interrupt flow handlers,
     i.e. the irq number, which is not used by the majority of handlers
     and simple to retrieve from the other argument the irq descriptor.

   - A few comment updates and build warning fixes"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
  arm64: Remove ununsed set_irq_flags
  ARM: Remove ununsed set_irq_flags
  sh: Kill off set_irq_flags usage
  irqchip: Kill off set_irq_flags usage
  gpu/drm: Kill off set_irq_flags usage
  genirq: Remove irq argument from irq flow handlers
  genirq: Move field 'msi_desc' from irq_data into irq_common_data
  genirq: Move field 'affinity' from irq_data into irq_common_data
  genirq: Move field 'handler_data' from irq_data into irq_common_data
  genirq: Move field 'node' from irq_data into irq_common_data
  irqchip/gic-v3: Use IRQD_FORWARDED_TO_VCPU flag
  irqchip/gic: Use IRQD_FORWARDED_TO_VCPU flag
  genirq: Provide IRQD_FORWARDED_TO_VCPU status flag
  genirq: Simplify irq_data_to_desc()
  genirq: Remove __irq_set_handler_locked()
  pinctrl/pistachio: Use irq_set_handler_locked
  gpio: vf610: Use irq_set_handler_locked
  powerpc/mpc8xx: Use irq_set_handler_locked()
  powerpc/ipic: Use irq_set_handler_locked()
  powerpc/cpm2: Use irq_set_handler_locked()
  ...
2015-09-18 08:11:42 -07:00
Linus Torvalds
09784fb8ef Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Thomas Gleixner:
 "A single regression fix for the x86 dma allocator which got wreckaged
  in the merge window"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/pci/dma: Fix gfp flags for coherent DMA memory allocation
2015-09-18 08:06:28 -07:00
Kan Liang
96f3eda67f perf/x86/intel: Fix static checker warning in lbr enable
Commit deb27519bf ("perf/x86/intel: Fix LBR callstack issue caused
by FREEZE_LBRS_ON_PMI") leads to the following Smatch complaint:

   warn: variable dereferenced before check 'cpuc->lbr_sel' (see line 154)

Fix the warning.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: deb27519bf ("perf/x86/intel: Fix LBR callstack issue caused by FREEZE_LBRS_ON_PMI")
Link: http://lkml.kernel.org/r/1442240047-48149-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:24:57 +02:00
Ingo Molnar
02386c356a Merge branch 'perf/urgent' into perf/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:24:01 +02:00
Andi Kleen
dfe1f3cb31 perf/x86/intel: Fix Skylake FRONTEND MSR extrareg mask
Stephane pointed out that the extrareg mask was one bit too short.
The bubble width field was truncated by one bit. Fix that here.
Also add some extra comments on the reserved bits inside the event
select code.

Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1441835640-21347-3-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:20:23 +02:00
Andi Kleen
d0dc8494cd perf/x86/intel/pebs: Add PEBS frontend profiling for Skylake
Skylake has a new FRONTEND_LATENCY PEBS event to accurately profile
frontend problems (like ITLB or decoding issues).

The new event is configured through a separate MSR, which selects
a range of sub events.

Define the extra MSR as a extra reg and export support for it
through sysfs.  To avoid duplicating the existing
tables use a new function to add new entries to existing tables.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1435707205-6676-4-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:20:22 +02:00
Andi Kleen
5e176213a6 perf/x86/intel: Make the CYCLE_ACTIVITY.* constraint on Broadwell more specific
The counter constraint for CYCLE_ACTIVITY.* on Broadwell covered
all CYCLE_ACTIVITY.* sub events, and forced them on counter 2.
But actually only one sub event (umask 8) needs to be on counter 2,
all others do not have any constraint.

Only force that subevent. This fixes groups with multiple
CYCLE_ACTIVITY.* events, for example:

	% perf stat -x, -e '{cpu/event=0xa3,umask=0x6,cmask=6/,\
	cpu/event=0xa2,umask=0x8/,\
	cpu/event=0xa3,umask=0x4,cmask=4/,cpu/event=0xb1,umask=0x1,cmask=1/}' true
	122150,,cpu/event=0xa3,umask=0x6,cmask=6/,846486,100.00
	16483,,cpu/event=0xa2,umask=0x8/,846486,100.00
	252280,,cpu/event=0xa3,umask=0x4,cmask=4/,846486,100.00
	233604,,cpu/event=0xb1,umask=0x1,cmask=1/,846486,100.00
	%

Without this patch the third result would be <unsupported>

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1442267222-16464-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-18 09:20:21 +02:00
Linus Torvalds
42dc2a3048 Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Ingo Molnar:
 - misc fixes all around the map
 - block non-root vm86(old) if mmap_min_addr != 0
 - two small debuggability improvements
 - removal of obsolete paravirt op

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/platform: Fix Geode LX timekeeping in the generic x86 build
  x86/apic: Serialize LVTT and TSC_DEADLINE writes
  x86/ioapic: Force affinity setting in setup_ioapic_dest()
  x86/paravirt: Remove the unused pv_time_ops::get_tsc_khz method
  x86/ldt: Fix small LDT allocation for Xen
  x86/vm86: Fix the misleading CONFIG_VM86 Kconfig help text
  x86/cpu: Print family/model/stepping in hex
  x86/vm86: Block non-root vm86(old) if mmap_min_addr != 0
  x86/alternatives: Make optimize_nops() interrupt safe and synced
  x86/mm/srat: Print non-volatile flag in SRAT
  x86/cpufeatures: Enable cpuid for Intel SHA extensions
2015-09-17 11:01:34 -07:00
Linus Torvalds
a706797feb Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo MOlnar:
 "Mostly tooling fixes, but also two x86 PMU driver fixes"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf tests: Fix software clock events test setting maps
  perf tests: Fix task exit test setting maps
  perf evlist: Fix create_syswide_maps() not propagating maps
  perf evlist: Fix add() not propagating maps
  perf evlist: Factor out a function to propagate maps for a single evsel
  perf evlist: Make create_maps() use set_maps()
  perf evlist: Make set_maps() more resilient
  perf evsel: Add own_cpus member
  perf evlist: Fix missing thread_map__put in propagate_maps()
  perf evlist: Fix splice_list_tail() not setting evlist
  perf evlist: Add has_user_cpus member
  perf evlist: Remove redundant validation from propagate_maps()
  perf evlist: Simplify set_maps() logic
  perf evlist: Simplify propagate_maps() logic
  perf top: Fix segfault pressing -> with no hist entries
  perf header: Fixup reading of HEADER_NRCPUS feature
  perf/x86/intel: Fix constraint access
  perf/x86/intel/bts: Set event->hw.itrace_started in pmu::start to match the new logic
  perf tools: Fix use of wrong event when processing exit events
  perf tools: Fix parse_events_add_pmu caller
2015-09-17 10:37:46 -07:00
Junichi Nomura
590f07874e x86/pci/dma: Fix gfp flags for coherent DMA memory allocation
Commit 6894258eda reversed the order of gfp_flags adjustment in
dma_alloc_attrs() for x86 [arch/x86/kernel/pci-dma.c] As a result,
relevant flags set by dma_alloc_coherent_gfp_flags() are just
discarded and cause coherent DMA memory allocation failure on some
devices.

Fixes: 6894258eda ("dma-mapping: consolidate dma_{alloc,free}_{attrs,coherent}")
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Link: http://lkml.kernel.org/r/20150914073834.GA13077@xzibit.linux.bs1.fc.nec.co.jp
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-17 16:22:11 +02:00
David Woodhouse
03da3ff1cf x86/platform: Fix Geode LX timekeeping in the generic x86 build
In 2007, commit 07190a08ee ("Mark TSC on GeodeLX reliable")
bypassed verification of the TSC on Geode LX. However, this code
(now in the check_system_tsc_reliable() function in
arch/x86/kernel/tsc.c) was only present if CONFIG_MGEODE_LX was
set.

OpenWRT has recently started building its generic Geode target
for Geode GX, not LX, to include support for additional
platforms. This broke the timekeeping on LX-based devices,
because the TSC wasn't marked as reliable:
https://dev.openwrt.org/ticket/20531

By adding a runtime check on is_geode_lx(), we can also include
the fix if CONFIG_MGEODEGX1 or CONFIG_X86_GENERIC are set, thus
fixing the problem.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Cc: Andres Salomon <dilinger@queued.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Marcelo Tosatti <marcelo@kvack.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: http://lkml.kernel.org/r/1442409003.131189.87.camel@infradead.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-16 16:00:12 +02:00
Thomas Gleixner
bd0b9ac405 genirq: Remove irq argument from irq flow handlers
Most interrupt flow handlers do not use the irq argument. Those few
which use it can retrieve the irq number from the irq descriptor.

Remove the argument.

Search and replace was done with coccinelle and some extra helper
scripts around it. Thanks to Julia for her help!

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Julia Lawall <Julia.Lawall@lip6.fr>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
2015-09-16 15:47:51 +02:00
Jiang Liu
9df872faa7 genirq: Move field 'affinity' from irq_data into irq_common_data
Irq affinity mask is per-irq instead of per irqchip, so move it into
struct irq_common_data.

Signed-off-by: Jiang Liu <jiang.liu@linux.intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Kevin Cernekee <cernekee@gmail.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Link: http://lkml.kernel.org/r/1433303281-27688-1-git-send-email-jiang.liu@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-16 15:46:49 +02:00
Shaohua Li
5d7c631d92 x86/apic: Serialize LVTT and TSC_DEADLINE writes
The APIC LVTT register is MMIO mapped but the TSC_DEADLINE register is an
MSR. The write to the TSC_DEADLINE MSR is not serializing, so it's not
guaranteed that the write to LVTT has reached the APIC before the
TSC_DEADLINE MSR is written. In such a case the write to the MSR is
ignored and as a consequence the local timer interrupt never fires.

The SDM decribes this issue for xAPIC and x2APIC modes. The
serialization methods recommended by the SDM differ.

xAPIC:
 "1. Memory-mapped write to LVT Timer Register, setting bits 18:17 to 10b.
  2. WRMSR to the IA32_TSC_DEADLINE MSR a value much larger than current time-stamp counter.
  3. If RDMSR of the IA32_TSC_DEADLINE MSR returns zero, go to step 2.
  4. WRMSR to the IA32_TSC_DEADLINE MSR the desired deadline."

x2APIC:
 "To allow for efficient access to the APIC registers in x2APIC mode,
  the serializing semantics of WRMSR are relaxed when writing to the
  APIC registers. Thus, system software should not use 'WRMSR to APIC
  registers in x2APIC mode' as a serializing instruction. Read and write
  accesses to the APIC registers will occur in program order. A WRMSR to
  an APIC register may complete before all preceding stores are globally
  visible; software can prevent this by inserting a serializing
  instruction, an SFENCE, or an MFENCE before the WRMSR."

The xAPIC method is to just wait for the memory mapped write to hit
the LVTT by checking whether the MSR write has reached the hardware.
There is no reason why a proper MFENCE after the memory mapped write would
not do the same. Andi Kleen confirmed that MFENCE is sufficient for the
xAPIC case as well.

Issue MFENCE before writing to the TSC_DEADLINE MSR. This can be done
unconditionally as all CPUs which have TSC_DEADLINE also have MFENCE
support.

[ tglx: Massaged the changelog ]

Signed-off-by: Shaohua Li <shli@fb.com>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Cc: <Kernel-team@fb.com>
Cc: <lenb@kernel.org>
Cc: <fenghua.yu@intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@vger.kernel.org #v3.7+
Link: http://lkml.kernel.org/r/20150909041352.GA2059853@devbig257.prn2.facebook.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2015-09-14 18:29:59 +02:00
Thomas Gleixner
4857c91f0d x86/ioapic: Force affinity setting in setup_ioapic_dest()
The recent ioapic cleanups changed the affinity setting in
setup_ioapic_dest() from a direct write to the hardware to the delayed
affinity setup via irq_set_affinity().

That results in a warning from chained_irq_exit():
WARNING: CPU: 0 PID: 5 at kernel/irq/migration.c:32 irq_move_masked_irq
[<ffffffff810a0a88>] irq_move_masked_irq+0xb8/0xc0
[<ffffffff8103c161>] ioapic_ack_level+0x111/0x130
[<ffffffff812bbfe8>] intel_gpio_irq_handler+0x148/0x1c0

The reason is that irq_set_affinity() does not write directly to the
hardware. It marks the affinity setting as pending and executes it
from the next interrupt. The chained handler infrastructure does not
take the irq descriptor lock for performance reasons because such a
chained interrupt is not visible to any interfaces. So the delayed
affinity setting triggers the warning in irq_move_masked_irq().

Restore the old behaviour by calling the set_affinity function of the
ioapic chip in setup_ioapic_dest(). This is safe as none of the
interrupts can be on the fly at this point.

Fixes: aa5cb97f14 'x86/irq: Remove x86_io_apic_ops.set_affinity and related interfaces'
Reported-and-tested-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Jiang Liu <jiang.liu@linux.intel.com>
Cc: jarkko.nikula@linux.intel.com
2015-09-14 18:28:15 +02:00
Dave Hansen
ef78f2a4bf x86/fpu: Check CPU-provided sizes against struct declarations
We now have C structures defined for each of the XSAVE state
components that we support.  This patch adds checks during our
verification pass to ensure that the CPU-provided data
enumerated in CPUID leaves matches our C structures.

If not, we warn and dump all the XSAVE CPUID leaves.

Note: this *actually* found an inconsistency with the MPX
'bndcsr' state.  The hardware pads it out differently from
our C structures.  This patch caught it and warned.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233131.A8DB36DA@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:22:02 +02:00
Dave Hansen
e6e888f96b x86/fpu: Check to ensure increasing-offset xstate offsets
The xstate CPUID leaves enumerate where each state component is
inside the XSAVE buffer, along with the size of the entire
buffer.  Our new XSAVE sanity-checking code extrapolates an
expected _total_ buffer size by looking at the last component
that it encounters.

That method requires that the highest-numbered component also
be the one with the highest offset.  This is a pretty safe
assumption, but let's add some code to ensure it stays true.

To make this check work correctly, we also need to ensure we
only consider the offsets from enabled features because the
offset register (ebx) will return 0 on unsupported features.

This also means that we will preserve the -1's that we
initialized xstate_offsets/sizes[] with.  That will help
find bugs.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233130.0843AB15@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:22:02 +02:00
Dave Hansen
65ac2e9baa x86/fpu: Correct and check XSAVE xstate size calculations
Note: our xsaves support is currently broken and disabled.  This
patch does not fix it, but it is an incremental improvement.

This might be useful to someone backporting the entire set of
XSAVES patches at some point, but it should not be backported
alone.

Ingo said he wanted something like this (bullets 2 and 3):

  http://lkml.kernel.org/r/20150808091508.GB32641@gmail.com

There are currently two xsave buffer formats: standard and
compacted.  The standard format is waht 'XSAVE' and 'XSAVEOPT'
produce while 'XSAVES' and 'XSAVEC' produce a compacted-formet
buffer.  (The kernel never uses XSAVEC)

But, the XSAVES buffer *ALSO* contains "system state components"
which are never saved by a plain XSAVE.  So, XSAVES has two
things that might make its buffer differently-sized from an
XSAVE-produced one.

The current code assumes that an XSAVES buffer's size is simply
the sum of the sizes of the (user) states which are supported.
This seems to work in most cases, but it is not consistent with
what the SDM says, and it breaks if we 'align' a component in
the buffer.  The calculation is also unnecessary work since the
CPU *tells* us the size of the buffer directly.

This patch just reads the size of the buffer right out of the
CPUID leaf instead of trying to derive it.

But, blindly trusting the CPU like this is dangerous.  We add
a verification pass in do_extra_xstate_size_checks() to ensure
that the size we calculate matches with what we see from the
hardware.  When it comes down to it, we trust but verify the
CPU.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233130.234FE1EC@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:22:01 +02:00
Dave Hansen
1126cb4535 x86/fpu/mpx: Rework MPX 'xstate' types
MPX includes two separate "extended state components".  There is
no real need to have an 'mpx_struct' because we never really
manage the states together.

We also separate out the actual data in 'mpx_bndcsr_state' from
the padding.  We will shortly be checking the state sizes
against our structures and need them to match.  For consistency,
we also ensure to prefix these types with 'mpx_'.

Lastly, we add some comments to mirror some of the descriptions
in the Intel documents (SDM) of the various state components.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233129.384B73EB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:22:00 +02:00
Dave Hansen
633d54c47a x86/fpu: Add xfeature_enabled() helper instead of test_bit()
We currently use test_bit() in a few places to see if an
xfeature is enabled.  It ends up being a bit ugly because
'xfeatures_mask' is a u64 and test_bit wants an 'unsigned long'
so it requires a cast.  The *_bit() functions are also
techincally atomic, which we have no need for here.

So, remove the test_bit()s and replace with the new
xfeature_enabled() helper.

This also provides a central place to add a comment about the
future need to support 'system xstates'.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233129.B1534F86@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:21:59 +02:00
Dave Hansen
ee9ae257eb x86/fpu: Remove 'xfeature_nr'
xfeature_nr ended up being initialized too late for me to
use it in the "xsave size sanity check" patch which is
later in the series.  I tried to move around its initialization
but realized that it was just as easy to get rid of it.

We only have 9 XFEATURES.  Instead of dynamically calculating
and storing the last feature, just use the compile-time max:
XFEATURES_NR_MAX.  Note that even with 'xfeatures_nr' we can
had "holes" in the xfeatures_mask that we had to deal with.

We also change a 'leaf' variable to be a plain 'i'.  Although
it is used to grab a cpuid leaf in this one loop, all of the
other loops just use an 'i' and I find it much more obvious
to keep the naming consistent across all the similar loops.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233128.3F30DF5A@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:21:59 +02:00
Dave Hansen
8a93c9e0dc x86/fpu: Rework XSTATE_* macros to remove magic '2'
The 'xstate.c' code has a bunch of references to '2'.  This
is because we have a lot more work to do for the "extended"
xstates than the "legacy" ones and state component 2 is the
first "extended" state.

This patch replaces all of the instances of '2' with
FIRST_EXTENDED_XFEATURE, which clearly explains what is
going on.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233128.A8C0BF51@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:21:58 +02:00
Dave Hansen
dad8c4fe85 x86/fpu: Rename XFEATURES_NR_MAX
This is a logcal followon to the last patch.  It makes the
XFEATURE_MAX naming consistent with the other enum values.
This is what Ingo suggested.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233127.A541448F@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:21:57 +02:00
Dave Hansen
d91cab7813 x86/fpu: Rename XSAVE macros
There are two concepts that have some confusing naming:
 1. Extended State Component numbers (currently called
    XFEATURE_BIT_*)
 2. Extended State Component masks (currently called XSTATE_*)

The numbers are (currently) from 0-9.  State component 3 is the
bounds registers for MPX, for instance.

But when we want to enable "state component 3", we go set a bit
in XCR0.  The bit we set is 1<<3.  We can check to see if a
state component feature is enabled by looking at its bit.

The current 'xfeature_bit's are at best xfeature bit _numbers_.
Calling them bits is at best inconsistent with ending the enum
list with 'XFEATURES_NR_MAX'.

This patch renames the enum to be 'xfeature'.  These also
happen to be what the Intel documentation calls a "state
component".

We also want to differentiate these from the "XSTATE_*" macros.
The "XSTATE_*" macros are a mask, and we rename them to match.

These macros are reasonably widely used so this patch is a
wee bit big, but this really is just a rename.

The only non-mechanical part of this is the

	s/XSTATE_EXTEND_MASK/XFEATURE_MASK_EXTEND/

We need a better name for it, but that's another patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233126.38653250@viggo.jf.intel.com
[ Ported to v4.3-rc1. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:21:46 +02:00
Jan Beulich
f454b47886 x86/ldt: Fix small LDT allocation for Xen
While the following commit:

  37868fe113 ("x86/ldt: Make modify_ldt synchronous")

added a nice comment explaining that Xen needs page-aligned
whole page chunks for guest descriptor tables, it then
nevertheless used kzalloc() on the small size path.

As I'm unaware of guarantees for kmalloc(PAGE_SIZE, ) to return
page-aligned memory blocks, I believe this needs to be switched
back to __get_free_page() (or better get_zeroed_page()).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Vrabel <david.vrabel@citrix.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55E735D6020000780009F1E6@prv-mh.provo.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:10:50 +02:00
Dave Hansen
4109ca066b x86/fpu: Remove XSTATE_RESERVE
The original purpose of XSTATE_RESERVE was to carve out space
to store all of the possible extended state components that
get saved with the XSAVE instruction(s).

However, we are now almost entirely dynamically allocating
the buffers we use for XSAVE by placing them at the end of
the task_struct and them sizing them at boot.  The one
exception for that is the init_task.

The maximum extended state component size that we have today
is on systems with space for AVX-512 and Memory Protection
Keys: 2696 bytes.  We have reserved a PAGE_SIZE buffer in
the init_task via fpregs_state->__padding.

This check ensures that even if the component sizes or
layout were changed (which we do not expect), that we will
still not overflow the init_task's buffer.

In the case that we detect we might overflow the buffer,
we completely disable XSAVE support in the kernel and try
to boot as if we had 'legacy x87 FPU' support in place.
This is a crippled state without any of the XSAVE-enabled
features (MPX, AVX, etc...).  But, it at least let us
boot safely.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233125.D948D475@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:07:56 +02:00
Dave Hansen
0a26537502 x86/fpu: Move XSAVE-disabling code to a helper
When we want to _completely_ disable XSAVE support as far as
the kernel is concerned, we have a big set of feature flags
to clear.  We currently only do this in cases where the user
asks for it to be disabled, but we are about to expand the
places where we do it to handle errors too.

Move the code in to xstate.c, and put it in the xstate.h
header.  We will use it in the next patch too.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233124.EA9A70E5@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:07:56 +02:00
Dave Hansen
b081535959 x86/fpu: Print xfeature buffer size in decimal
This is utterly a personal taste thing, but I find it way easier
to read structure sizes in decimal than in hex.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: dave@sr71.net
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150902233124.1A8B04A8@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-14 12:07:55 +02:00
Sukadev Bhattiprolu
8f3e5684d3 perf/core: Drop PERF_EVENT_TXN
We currently use PERF_EVENT_TXN flag to determine if we are in the middle
of a transaction. If in a transaction, we defer the schedulability checks
from pmu->add() operation to the pmu->commit() operation.

Now that we have "transaction types" (PERF_PMU_TXN_ADD, PERF_PMU_TXN_READ)
we can use the type to determine if we are in a transaction and drop the
PERF_EVENT_TXN flag.

When PERF_EVENT_TXN is dropped, the cpuhw->group_flag on some architectures
becomes unused, so drop that field as well.

This is an extension of the Powerpc patch from Peter Zijlstra to s390,
Sparc and x86 architectures.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-11-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 11:27:30 +02:00
Sukadev Bhattiprolu
fbbe070115 perf/core: Add a 'flags' parameter to the PMU transactional interfaces
Currently, the PMU interface allows reading only one counter at a time.
But some PMUs like the 24x7 counters in Power, support reading several
counters at once. To leveage this functionality, extend the transaction
interface to support a "transaction type".

The first type, PERF_PMU_TXN_ADD, refers to the existing transactions,
i.e. used to _schedule_ all the events on the PMU as a group. A second
transaction type, PERF_PMU_TXN_READ, will be used in a follow-on patch,
by the 24x7 counters to read several counters at once.

Extend the transaction interfaces to the PMU to accept a 'txn_flags'
parameter and use this parameter to ignore any transactions that are
not of type PERF_PMU_TXN_ADD.

Thanks to Peter Zijlstra for his input.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
[peterz: s390 compile fix]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1441336073-22750-3-git-send-email-sukadev@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 11:27:25 +02:00
Huaitong Han
73fdeb6659 perf/x86/intel/pt: Fix KVM warning due to doing rdmsr() before the CPUID test
If KVM does not support INTEL_PT, guest MSR_IA32_RTIT_CTL reading will
produce host warning like "kvm [2469]: vcpu0 unhandled rdmsr: 0x570".

Guest can determine whether the CPU supports Intel_PT according to CPUID,
so test_cpu_cap function is added before rdmsr,and it is more in line with
the code style.

Signed-off-by: Huaitong Han <huaitong.han@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: acme@kernel.org
Link: http://lkml.kernel.org/r/1441009262-9792-1-git-send-email-huaitong.han@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 11:27:23 +02:00
Kan Liang
deb27519bf perf/x86/intel: Fix LBR callstack issue caused by FREEZE_LBRS_ON_PMI
This patch fixes an issue which introduced by commit
1a78d93750 ("perf/x86/intel: Streamline
LBR MSR handling in PMI").

The old patch not only avoids writing LBR_SELECT MSR in PMI, but also
avoids updating lbr_select variable. So in PMI, FREEZE_LBRS_ON_PMI bit
is always mistakenly set for IA32_DEBUGCTLMSR MSR, which causes
superfluous increase/decrease of LBR_TOS when collecting LBR callstack.

Reported-by: Milian Wolff <mail@milianw.de>
Signed-off-by: Kan Liang <kan.liang@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1439815051-8616-1-git-send-email-kan.liang@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 11:27:22 +02:00
Alexander Shishkin
d2878d642a perf/x86/intel/bts: Disallow use by unprivileged users on paranoid systems
BTS leaks kernel addresses even in userspace-only mode due to imprecise IP
sampling, so sometimes syscall entry points or page fault handler addresses
end up in a userspace trace.

Now, intel_bts driver exports trace data zero-copy, it does not scan through
it to filter out the kernel addresses and it's would be a O(n) job.

To work around this situation, this patch forbids the use of intel_bts
driver by unprivileged users on systems with the paranoid setting above the
(kernel's) default "1", which still allows kernel profiling. In other words,
using intel_bts driver implies kernel tracing, regardless of the
"exclude_kernel" attribute setting.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1441030168-6853-3-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 11:27:22 +02:00
Alexander Shishkin
a09d31f452 perf/x86/intel/ds: Work around BTS leaking kernel addresses
BTS leaks kernel addresses even in userspace-only mode due to imprecise IP
sampling, so sometimes syscall entry points or page fault handler addresses
end up in a userspace trace.

Since this driver uses a relatively small buffer for BTS records and it has
to iterate through them anyway, it can also take on the additional job of
filtering out the records that contain kernel addresses when kernel space
tracing is not enabled.

This patch changes the bts code to skip the offending records from perf
output. In order to request the exact amount of space on the ring buffer,
we need to do an extra pass through the records to know how many there are
of the valid ones, but considering the small size of the buffer, this extra
pass adds very little overhead to the nmi handler. This way we won't end
up with awkward IP samples with zero IPs in the perf stream.

Signed-off-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Cc: hpa@zytor.com
Link: http://lkml.kernel.org/r/1441030168-6853-2-git-send-email-alexander.shishkin@linux.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 11:27:21 +02:00
Adrian Hunter
b20112edea perf/x86: Improve accuracy of perf/sched clock
When TSC is stable perf/sched clock is based on it.
However the conversion from cycles to nanoseconds
is not as accurate as it could be.  Because
CYC2NS_SCALE_FACTOR is 10, the accuracy is +/- 1/2048

The change is to calculate the maximum shift that
results in a multiplier that is still a 32-bit number.
For example all frequencies over 1 GHz will have
a shift of 32, making the accuracy of the conversion
+/- 1/(2^33).  That is achieved by using the
'clocks_calc_mult_shift()' function.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vince Weaver <vincent.weaver@maine.edu>
Link: http://lkml.kernel.org/r/1440147918-22250-1-git-send-email-adrian.hunter@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 11:27:20 +02:00
Ingo Molnar
216dcaf290 Merge branch 'perf/urgent' into perf/core, to pick up fixes before applying new changes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 11:25:55 +02:00
Peter Zijlstra
ebfb4988f0 perf/x86/intel: Fix constraint access
Sasha reported that we can get here with .idx==-1, and
cpuc->event_constraints unallocated.

Suggested-by: Stephane Eranian <eranian@google.com>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: <stable@vger.kernel.org>
Fixes: b371b59431 ("perf/x86: Fix event/group validation")
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:37:10 +02:00
Borislav Petkov
7c5b190e11 x86/cpu: Print family/model/stepping in hex
924e101a7a ("x86/debug: Dump family, model, stepping of the
boot CPU") had its good intentions to dump the exact F/M/S as an
aid during debugging sessions but its output can be ambiguous.
Fix that:

-smpboot: CPU0: Intel Core Processor (Broadwell) (fam: 06, model: 47, stepping: 02)
+smpboot: CPU0: Intel Core Processor (Broadwell) (family: 0x6, model: 0x47, stepping: 0x2)

Also, spell out "family".

Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1441914927-32037-1-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2015-09-13 09:30:07 +02:00