Add sysfs format entries for AMD IBS PMUs:
# find /sys/bus/event_source/devices/ibs_*/format
/sys/bus/event_source/devices/ibs_fetch/format
/sys/bus/event_source/devices/ibs_fetch/format/rand_en
/sys/bus/event_source/devices/ibs_op/format
/sys/bus/event_source/devices/ibs_op/format/cnt_ctl
This allows the following IBS options to be specified:
$ perf record -e ibs_fetch/rand_en=1/GH ...
$ perf record -e ibs_op/cnt_ctl=1/GH ...
Option cnt_ctl is only enabled if the IBS_CAPS_OPCNT bit is set in IBS
cpuid feature flags (AMD family 10h RevC and above).
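As a rough illustration of how such a format entry is consumed, the sketch below parses a perf-style format string and places a field value into the raw config word. The bit positions used here (config:57 for rand_en, config:19 for cnt_ctl) are assumptions for the example; the authoritative values are whatever the files under /sys/bus/event_source/devices/ibs_*/format contain. The helper names are hypothetical.

```python
def parse_format(spec):
    """Parse a perf sysfs format string such as 'config:57' or 'config:0-7,32-35'.

    Returns (target, [(lo, hi), ...]) with inclusive bit ranges.
    """
    target, _, bits = spec.strip().partition(":")
    ranges = []
    for part in bits.split(","):
        lo, _, hi = part.partition("-")
        ranges.append((int(lo), int(hi) if hi else int(lo)))
    return target, ranges


def encode_field(spec, value):
    """Scatter the low bits of `value` into the bit ranges named by `spec`."""
    target, ranges = parse_format(spec)
    word = 0
    for lo, hi in ranges:
        width = hi - lo + 1
        word |= (value & ((1 << width) - 1)) << lo
        value >>= width
    return target, word


# Assumed contents of the two format files (hypothetical values).
ASSUMED_FORMATS = {"rand_en": "config:57", "cnt_ctl": "config:19"}

if __name__ == "__main__":
    for name, spec in ASSUMED_FORMATS.items():
        target, word = encode_field(spec, 1)
        print(f"{name}=1 -> {target} = {word:#x}")
```

With these assumed positions, ibs_fetch/rand_en=1/ would set a single bit in the event's config word, which is all the format entry needs to express.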
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1347447584-28405-1-git-send-email-robert.richter@amd.com
[ Added small readability improvements. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
"Highlights of the changes for this release include support for vfio
level triggered interrupts, improved big real mode support on older
Intels, a streamlined guest page table walker, guest APIC speedups,
PIO optimizations, better overcommit handling, and read-only memory."
* tag 'kvm-3.7-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (138 commits)
KVM: s390: Fix vcpu_load handling in interrupt code
KVM: x86: Fix guest debug across vcpu INIT reset
KVM: Add resampling irqfds for level triggered interrupts
KVM: optimize apic interrupt delivery
KVM: MMU: Eliminate pointless temporary 'ac'
KVM: MMU: Avoid access/dirty update loop if all is well
KVM: MMU: Eliminate eperm temporary
KVM: MMU: Optimize is_last_gpte()
KVM: MMU: Simplify walk_addr_generic() loop
KVM: MMU: Optimize pte permission checks
KVM: MMU: Update accessed and dirty bits after guest pagetable walk
KVM: MMU: Move gpte_access() out of paging_tmpl.h
KVM: MMU: Optimize gpte_access() slightly
KVM: MMU: Push clean gpte write protection out of gpte_access()
KVM: clarify kvmclock documentation
KVM: make processes waiting on vcpu mutex killable
KVM: SVM: Make use of asm.h
KVM: VMX: Make use of asm.h
KVM: VMX: Make lto-friendly
KVM: x86: lapic: Clean up find_highest_vector() and count_vectors()
...
Conflicts:
arch/s390/include/asm/processor.h
arch/x86/kvm/i8259.c
The following patch adds perf_event support for the Xeon-Phi
PMU, as documented in the "Intel Xeon Phi Coprocessor (codename:
Knights Corner) Performance Monitoring Units" manual.
Even though it is a co-processor, a Phi runs a full Linux
environment and can support performance counters.
This is just barebones support; it does not add support for
interesting new features such as the SPFLT instruction that
allows starting/stopping events without entering the kernel.
The PMU internally is just like that of an original Pentium, but
a "P6-like" MSR interface is provided. The interface is
different enough from a real P6 that it's not easy (or
practical) to re-use the code in perf_event_p6.c
Acked-by: Lawrence F Meadows <lawrence.f.meadows@intel.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Vince Weaver <vincent.weaver@maine.edu>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Cc: eranian@gmail.com
Cc: Lawrence F <lawrence.f.meadows@intel.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1209261405320.8398@vincent-weaver-1.um.maine.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Remove the quirk for the SBC FITPC. It seems to have been
required when the default was kbd reboot, but is no longer required
now that the default is acpi reboot. Furthermore, BIOS reboot no
longer works for this board as of 2.6.39 or any of the 3.x
kernels.
Signed-off-by: David Hooper <dave@beermex.com>
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/20121002142635.17403.59959.stgit@localhost.localdomain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Partition the header include path flags into two sets, one for kernelspace
builds and one for userspace builds.
Add the following directories to build after the ordinary include directories
so that #include will pick up the UAPI header directly if the kernel header
has been moved there.
The userspace set (represented by the USERINCLUDE make variable) contains:
-I $(srctree)/arch/$(hdr-arch)/include/uapi
-I arch/$(hdr-arch)/include/generated/uapi
-I $(srctree)/include/uapi
-I include/generated/uapi
-include $(srctree)/include/linux/kconfig.h
and the kernelspace set (represented by the LINUXINCLUDE make variable)
contains:
-I $(srctree)/arch/$(hdr-arch)/include
-I arch/$(hdr-arch)/include/generated
-I $(srctree)/include
-I include --- if not building in the source tree
plus everything in the USERINCLUDE set.
Then use USERINCLUDE in building the x86 boot code.
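The relationship between the two sets can be sketched as follows. This is only a model of the flag lists as described above, in the order given in this message, with a toy first-match lookup to show why a header that has moved to a uapi directory is still found. The function names are hypothetical and the real definitions live in the top-level Makefile.

```python
def user_include(srctree, hdr_arch):
    """Flags in the userspace set, in the order listed in the message."""
    return [
        f"-I{srctree}/arch/{hdr_arch}/include/uapi",
        f"-Iarch/{hdr_arch}/include/generated/uapi",
        f"-I{srctree}/include/uapi",
        "-Iinclude/generated/uapi",
        "-include", f"{srctree}/include/linux/kconfig.h",
    ]


def linux_include(srctree, hdr_arch, out_of_tree):
    """Kernelspace set: the ordinary include dirs, then everything in the user set."""
    flags = [
        f"-I{srctree}/arch/{hdr_arch}/include",
        f"-Iarch/{hdr_arch}/include/generated",
        f"-I{srctree}/include",
    ]
    if out_of_tree:
        flags.append("-Iinclude")  # only when not building in the source tree
    return flags + user_include(srctree, hdr_arch)


def resolve(header, flags, existing):
    """Return the first path (in flag order) at which `header` exists."""
    for flag in flags:
        if flag.startswith("-I"):
            candidate = f"{flag[2:]}/{header}"
            if candidate in existing:
                return candidate
    return None
```

Because the uapi directories come after the ordinary ones, a kernel header that still exists wins the lookup, and the uapi copy is picked up directly only once the kernel header has been moved there.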
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Dave Jones <davej@redhat.com>
This fixes two issues that could cause incompatibility between
kernel versions:
- If a tracer uses SECCOMP_RET_TRACE to select a syscall number
higher than the largest known syscall, emulate the unknown
vsyscall by returning -ENOSYS. (This is unlikely to make a
noticeable difference on x86-64 due to the way the system call
entry works.)
- On x86-64 with vsyscall=emulate, skipped vsyscalls were buggy.
This updates the documentation accordingly.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Acked-by: Will Drewry <wad@chromium.org>
Signed-off-by: James Morris <james.l.morris@oracle.com>
"ACPI: Store valid ACPI tables passed via early initrd in reserved
memblock areas" breaks the build if either CONFIG_ACPI or
CONFIG_BLK_DEV_INITRD is disabled:
arch/x86/kernel/setup.c: In function 'setup_arch':
arch/x86/kernel/setup.c:944: error: implicit declaration of function 'acpi_initrd_override'
or
arch/x86/built-in.o: In function `setup_arch':
(.init.text+0x1397): undefined reference to `initrd_start'
arch/x86/built-in.o: In function `setup_arch':
(.init.text+0x139e): undefined reference to `initrd_end'
The dummy acpi_initrd_override() function in acpi.h isn't defined without
CONFIG_ACPI and initrd_{start,end} are declared but not defined without
CONFIG_BLK_DEV_INITRD.
[ hpa: applying this as a fix, but this really should be done cleaner ]
Signed-off-by: David Rientjes <rientjes@google.com>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1210012032470.31644@chino.kir.corp.google.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Renninger <trenn@suse.de>
Cc: Len Brown <lenb@kernel.org>
Pull x86/smap support from Ingo Molnar:
"This adds support for the SMAP (Supervisor Mode Access Prevention) CPU
feature on Intel CPUs: a hardware feature that prevents unintended
user-space data access from kernel privileged code.
It's turned on automatically when possible.
This, in combination with SMEP, makes it even harder to exploit kernel
bugs such as NULL pointer dereferences."
Fix up trivial conflict in arch/x86/kernel/entry_64.S due to newly added
includes right next to each other.
* 'x86-smap-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, smep, smap: Make the switching functions one-way
x86, suspend: On wakeup always initialize cr4 and EFER
x86-32: Start out eflags and cr4 clean
x86, smap: Do not abuse the [f][x]rstor_checking() functions for user space
x86-32, smap: Add STAC/CLAC instructions to 32-bit kernel entry
x86, smap: Reduce the SMAP overhead for signal handling
x86, smap: A page fault due to SMAP is an oops
x86, smap: Turn on Supervisor Mode Access Prevention
x86, smap: Add STAC and CLAC instructions to control user space access
x86, uaccess: Merge prototypes for clear_user/__clear_user
x86, smap: Add a header file with macros for STAC/CLAC
x86, alternative: Add header guards to <asm/alternative-asm.h>
x86, alternative: Use .pushsection/.popsection
x86, smap: Add CR4 bit for SMAP
x86-32, mm: The WP test should be done on a kernel page
Pull x86/microcode changes from Ingo Molnar:
"The biggest changes are to AMD microcode patching: add code for
caching all microcode patches which belong to the current family on
which we're running, in the kernel.
We look up the patch needed for each core from the cache at
patch-application time instead of holding a single patch per-system"
* 'x86-microcode-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, microcode, AMD: Fix use after free in free_cache()
x86, microcode, AMD: Rewrite patch application procedure
x86, microcode, AMD: Add a small, per-family patches cache
x86, microcode, AMD: Add reverse equiv table search
x86, microcode: Add a refresh firmware flag to ->request_microcode_fw
x86, microcode, AMD: Read CPUID(1).EAX on the correct cpu
x86, microcode, AMD: Check before applying a patch
x86, microcode, AMD: Remove useless get_ucode_data wrapper
x86, microcode: Straighten out Kconfig text
x86, microcode: Cleanup cpu hotplug notifier callback
x86, microcode: Drop uci->mc check on resume path
x86, microcode: Save an indentation level in reload_for_cpu
Pull x86/platform changes from Ingo Molnar:
"This cleans up some Xen-induced pagetable init code uglies, by
generalizing new platform callbacks and state: x86_init.paging.*"
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Document x86_init.paging.pagetable_init()
x86: xen: Cleanup and remove x86_init.paging.pagetable_setup_done()
x86: Move paging_init() call to x86_init.paging.pagetable_init()
x86: Rename pagetable_setup_start() to pagetable_init()
x86: Remove base argument from x86_init.paging.pagetable_setup_start
Pull x86/mm changes from Ingo Molnar:
"The biggest change is new TLB partial flushing code for AMD CPUs.
(The v3.6 kernel had the Intel CPU side code, see commits
e0ba94f14f74..effee4b9b3b.)
There's also various other refinements around the TLB flush code"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Distinguish TLB shootdown interrupts from other functions call interrupts
x86/mm: Fix range check in tlbflush debugfs interface
x86, cpu: Preset default tlb_flushall_shift on AMD
x86, cpu: Add AMD TLB size detection
x86, cpu: Push TLB detection CPUID check down
x86, cpu: Fixup tlb_flushall_shift formatting
Pull x86/MCE update from Ingo Molnar:
"Various MCE robustness enhancements.
One of the changes adds CMCI (Corrected Machine Check Interrupt) poll
mode on Intel Nehalem+ CPUs, which mode is automatically entered when
the rate of messages is too high - and exited once the storm is over.
An MCE events storm will roughly look like this:
[ 5342.740616] mce: [Hardware Error]: Machine check events logged
[ 5342.746501] mce: [Hardware Error]: Machine check events logged
[ 5342.757971] CMCI storm detected: switching to poll mode
[ 5372.674957] CMCI storm subsided: switching to interrupt mode
This should make such events more survivable"
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Provide boot argument to honour bios-set CMCI threshold
x86, MCE: Remove unused defines
x86, mce: Enable MCA support by default
x86/mce: Add CMCI poll mode
x86/mce: Make cmci_discover() quiet
x86: mce: Remove the frozen cases in the hotplug code
x86: mce: Split timer init
x86: mce: Serialize mce injection
x86: mce: Disable preemption when calling raise_local()
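The log excerpt in the pull message above can be used to estimate how long poll mode lasted. The following is a minimal sketch that assumes the two storm messages appear exactly as quoted; the helper name is hypothetical and the script is not part of any kernel tooling.

```python
import re

LOG = """\
[ 5342.740616] mce: [Hardware Error]: Machine check events logged
[ 5342.746501] mce: [Hardware Error]: Machine check events logged
[ 5342.757971] CMCI storm detected: switching to poll mode
[ 5372.674957] CMCI storm subsided: switching to interrupt mode
"""

# Matches a dmesg-style line: "[ seconds.micros] message"
LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def storm_windows(text):
    """Yield (start, end) timestamps in seconds for each completed CMCI storm."""
    start = None
    for raw in text.splitlines():
        m = LINE.match(raw)
        if not m:
            continue
        stamp, msg = float(m.group(1)), m.group(2)
        if msg.startswith("CMCI storm detected"):
            start = stamp
        elif msg.startswith("CMCI storm subsided") and start is not None:
            yield start, stamp
            start = None


if __name__ == "__main__":
    for begin, end in storm_windows(LOG):
        print(f"poll mode for {end - begin:.3f} s")
```

For the excerpt above this reports a poll-mode window of just under 30 seconds.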
Pull x86/fpu update from Ingo Molnar:
"The biggest change is the addition of the non-lazy (eager) FPU saving
support model and enabling it on CPUs with optimized xsaveopt/xrstor
FPU state saving instructions.
There are also various Sparse fixes"
Fix up trivial add-add conflict in arch/x86/kernel/traps.c
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kvm: fix kvm's usage of kernel_fpu_begin/end()
x86, fpu: remove cpu_has_xmm check in the fx_finit()
x86, fpu: make eagerfpu= boot param tri-state
x86, fpu: enable eagerfpu by default for xsaveopt
x86, fpu: decouple non-lazy/eager fpu restore from xsave
x86, fpu: use non-lazy fpu restore for processors supporting xsave
lguest, x86: handle guest TS bit for lazy/non-lazy fpu host models
x86, fpu: always use kernel_fpu_begin/end() for in-kernel FPU usage
x86, kvm: use kernel_fpu_begin/end() in kvm_load/put_guest_fpu()
x86, fpu: remove unnecessary user_fpu_end() in save_xstate_sig()
x86, fpu: drop_fpu() before restoring new state from sigframe
x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels
x86, fpu: Consolidate inline asm routines for saving/restoring fpu state
x86, signal: Cleanup ifdefs and is_ia32, is_x32
Pull x86 debug update from Ingo Molnar:
"Various small enhancements"
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/debug: Dump family, model, stepping of the boot CPU
x86/iommu: Use NULL instead of plain 0 for __IOMMU_INIT
x86/iommu: Drop duplicate const in __IOMMU_INIT
x86/fpu/xsave: Keep __user annotation in casts
x86/pci/probe_roms: Add missing __iomem annotation to pci_map_biosrom()
x86/signals: ia32_signal.c: add __user casts to fix sparse warnings
x86/vdso: Add __user annotation to VDSO32_SYMBOL
x86: Fix __user annotations in asm/sys_ia32.h
Pull x86/cpu and x86/cpufeature from Ingo Molnar:
"One tiny cleanup, and prepare for SMAP (Supervisor Mode Access
Prevention) support on x86"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Remove the useless branch in c_start()
* 'x86-cpufeature-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpufeature: Add feature bit for SMAP
Pull x86/asm changes from Ingo Molnar:
"The one change that stands out is the alternatives patching change
that prevents us from ever patching back instructions from SMP to UP:
this simplifies things and speeds up CPU hotplug.
Other than that it's smaller fixes, cleanups and improvements."
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Unspaghettize do_trap()
x86_64: Work around old GAS bug
x86: Use REP BSF unconditionally
x86: Prefer TZCNT over BFS
x86/64: Adjust types of temporaries used by ffs()/fls()/fls64()
x86: Drop unnecessary kernel_eflags variable on 64-bit
x86/smp: Don't ever patch back to UP if we unplug cpus
Pull x86/apic changes from Ingo Molnar:
"Smaller fixes and cleanups"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/api: Rename mp_register_lapic in a comment
x86/irq/i8259: Fix incorrect comment
x86: dt: Use linear irq domain for ioapic(s)
Pull perf fix from Ingo Molnar:
"Leftover perf/urgent fix from the v3.6 cycle"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix typo in uncore_pmu_to_box
Pull perf update from Ingo Molnar:
"Lots of changes in this cycle as well, with hundreds of commits from
over 30 contributors. Most of the activity was on the tooling side.
Higher level changes:
- New 'perf kvm' analysis tool, from Xiao Guangrong.
- New 'perf trace' system-wide tracing tool
- uprobes fixes + cleanups from Oleg Nesterov.
- Lots of patches to make perf build on Android out of box, from
Irina Tirdea
- Extend ftrace function tracing utility to be more dynamic for its
users. It allows for data passing to the callback functions, as
well as reading regs as if a breakpoint were to trigger at function
entry.
The main goal of this patch series was to allow kprobes to use
ftrace as an optimized probe point when a probe is placed on an
ftrace nop. With lots of help from Masami Hiramatsu, and going
through lots of iterations, we finally came up with a good
solution.
- Add cpumask for uncore pmu, use it in 'stat', from Yan, Zheng.
- Various tracing updates from Steve Rostedt
- Clean up and improve 'perf sched' performance by eliminating lots
of needless calls to libtraceevent.
- Event group parsing support, from Jiri Olsa
- UI/gtk refactorings and improvements from Namhyung Kim
- Add support for non-tracepoint events in perf script python, from
Feng Tang
- Add --symbols to 'script', similar to the one in 'report', from
Feng Tang.
Infrastructure enhancements and fixes:
- Convert the trace builtins to use the growing evsel/evlist
tracepoint infrastructure, removing several open coded constructs
like switch like series of strcmp to dispatch events, etc.
Basically what had already been showcased in 'perf sched'.
- Add evsel constructor for tracepoints, that uses libtraceevent just
to parse the /format events file, use it in a new 'perf test' to
make sure the libtraceevent format parsing regressions can be more
readily caught.
- Some strange errors were happening in some builds, but not on the
next, as reported by several people. The problem was that some
parser-related files, generated during the build, didn't have
proper make deps; fix from Eric Sandeen.
- Introduce struct and cache information about the environment where
a perf.data file was captured, from Namhyung Kim.
- Fix handling of unresolved samples when --symbols is used in
'report', from Feng Tang.
- Add union member access support to 'probe', from Hyeoncheol Lee.
- Fixups to die() removal, from Namhyung Kim.
- Render fixes for the TUI, from Namhyung Kim.
- Don't enable annotation in non symbolic view, from Namhyung Kim.
- Fix pipe mode in 'report', from Namhyung Kim.
- Move related stats code from stat to util/, will be used by the
'stat' kvm tool, from Xiao Guangrong.
- Remove die()/exit() calls from several tools.
- Resolve vdso callchains, from Jiri Olsa
- Don't pass const char pointers to basename, so that we can
unconditionally use libgen.h and thus avoid ifdef BIONIC lines,
from David Ahern
- Refactor hist formatting so that it can be reused with the GTK
browser, from Namhyung Kim
- Fix build for another rbtree.c change, from Adrian Hunter.
- Make 'perf diff' command work with evsel hists, from Jiri Olsa.
- Use the only field_sep var that is set up: symbol_conf.field_sep,
fix from Jiri Olsa.
- .gitignore compiled python binaries, from Namhyung Kim.
- Get rid of die() in more libtraceevent places, from Namhyung Kim.
- Rename libtraceevent 'private' struct member to 'priv' so that it
works in C++, from Steven Rostedt
- Remove lots of exit()/die() calls from tools so that the main perf
exit routine can take place, from David Ahern
- Fix x86 build on x86-64, from David Ahern.
- {int,str,rb}list fixes from Suzuki K Poulose
- perf.data header fixes from Namhyung Kim
- Allow user to indicate objdump path, needed in cross environments,
from Maciek Borzecki
- Fix hardware cache event name generation, fix from Jiri Olsa
- Add round trip test for sw, hw and cache event names, catching the
problem Jiri fixed, after Jiri's patch, the test passes
successfully.
- Clean target should do clean for lib/traceevent too, fix from David
Ahern
- Check the right variable for allocation failure, fix from Namhyung
Kim
- Set up evsel->tp_format regardless of evsel->name being set
already, fix from Namhyung Kim
- Oprofile fixes from Robert Richter.
- Remove perf_event_attr needless version inflation, from Jiri Olsa
- Introduce libtraceevent strerror like error reporting facility,
from Namhyung Kim
- Add pmu mappings to perf.data header and use event names from cmd
line, from Robert Richter
- Fix include order for bison/flex-generated C files, from Ben
Hutchings
- Build fixes and documentation corrections from David Ahern
- Assorted cleanups from Robert Richter
- Let O= makes handle relative paths, from Steven Rostedt
- perf script python fixes, from Feng Tang.
- Initial bash completion support, from Frederic Weisbecker
- Allow building without libelf, from Namhyung Kim.
- Support DWARF CFI based unwind to have callchains when %bp based
unwinding is not possible, from Jiri Olsa.
- Symbol resolution fixes. While fixing support for PPC64 files with
an .opt ELF section was the end goal, several fixes for code that
handles all architectures and cleanups are included, from Cody
Schafer.
- Assorted fixes for Documentation and build in 32 bit, from Robert
Richter
- Cache the libtraceevent event_format associated to each evsel
early, so that we avoid relookups, i.e. calling pevent_find_event
repeatedly when processing tracepoint events.
[ This is to reduce the surface contact with libtraceevent and
make clear what it is that the perf tools need from that lib: so
far parsing the common and per event fields. ]
- Don't stop the build if the audit libraries are not installed, fix
from Namhyung Kim.
- Fix bfd.h/libbfd detection with recent binutils, from Markus
Trippelsdorf.
- Improve warning message when libunwind devel packages not present,
from Jiri Olsa"
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (282 commits)
perf trace: Add aliases for some syscalls
perf probe: Print an enum type variable in "enum variable-name" format when showing accessible variables
perf tools: Check libaudit availability for perf-trace builtin
perf hists: Add missing period_* fields when collapsing a hist entry
perf trace: New tool
perf evsel: Export the event_format constructor
perf evsel: Introduce rawptr() method
perf tools: Use perf_evsel__newtp in the event parser
perf evsel: The tracepoint constructor should store sys:name
perf evlist: Introduce set_filter() method
perf evlist: Renane set_filters method to apply_filters
perf test: Add test to check we correctly parse and match syscall open parms
perf evsel: Handle endianity in intval method
perf evsel: Know if byte swap is needed
perf tools: Allow handling a NULL cpu_map as meaning "all cpus"
perf evsel: Improve tracepoint constructor setup
tools lib traceevent: Fix error path on pevent_parse_event
perf test: Fix build failure
trace: Move trace event enable from fs_initcall to core_initcall
tracing: Add an option for disabling markers
...
no need to have the call of do_notify_resume() + checks around it
duplicated for vm86 case - a bit of rearranging of ifdefs and we'll
have a perfectly fine copy to jump back to.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
32bit wrapper is lost on that; 64bit one is *not*, since
we need to arrange for full pt_regs on stack when we call
sys_execve() and we need to load callee-saved ones from
there afterwards.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
A later patch will compare them with ACPI tables that get loaded at boot or
runtime and if criteria match, a stored one is loaded.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Link: http://lkml.kernel.org/r/1349043837-22659-4-git-send-email-trenn@suse.de
Cc: Len Brown <lenb@kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Eric Piel <eric.piel@tremplin-utc.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is needed for ACPI table overriding via initrd. Besides reserving
memblocks, x86 also requires the memory area to be flagged as E820_RESERVED
or E820_ACPI in the e820 mappings so that it can be io(re)mapped later.
Signed-off-by: Thomas Renninger <trenn@suse.de>
Link: http://lkml.kernel.org/r/1349043837-22659-3-git-send-email-trenn@suse.de
Cc: Len Brown <lenb@kernel.org>
Cc: Robert Moore <robert.moore@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Eric Piel <eric.piel@tremplin-utc.net>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
As TLB shootdown requests to other CPU cores now use function call
interrupts, the TLB shootdowns entry in /proc/interrupts is always shown as 0.
This behavior change was introduced by commit 52aec3308d ("x86/tlb:
replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR").
This patch restores the TLB shootdowns entry in /proc/interrupts so that it
counts TLB shootdowns separately from the other function call interrupts.
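One way to check the counter from userspace is to read the TLB row of /proc/interrupts. The sketch below parses text in that layout; the sample numbers are made up for illustration, and the function name is hypothetical.

```python
def tlb_shootdowns(interrupts_text):
    """Return the per-CPU counts from the 'TLB:' row of /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        label, _, rest = line.strip().partition(":")
        if label == "TLB":
            counts = []
            for token in rest.split():
                if not token.isdigit():
                    break  # reached the trailing description text
                counts.append(int(token))
            return counts
    return []


# Hypothetical sample; real values come from reading /proc/interrupts.
SAMPLE = """\
           CPU0       CPU1
CAL:       1200        980   Function call interrupts
TLB:         37         52   TLB shootdowns
"""

if __name__ == "__main__":
    per_cpu = tlb_shootdowns(SAMPLE)
    print(per_cpu, sum(per_cpu))
```

On a kernel with the regression described above, every value in the TLB row would stay at 0 while the CAL row keeps growing.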
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Link: http://lkml.kernel.org/r/20120926021128.22212.20440.stgit@hpxw
Acked-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The ACPI spec doesn't provide a way for the BIOS to pass down
recommended thresholds to the OS on a _per-bank_ basis. This patch adds
a new boot option which, if passed, tells Linux to use the CMCI thresholds
set by the BIOS.
As a fail-safe, we initialize the threshold to 1 if some banks have not
been initialized by the BIOS, and warn the user.
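The selection logic described here can be modelled roughly as below. This is a sketch only: it assumes that a value of 0 means "not initialized by the BIOS" and that the kernel's own default threshold is 1 when the option is not given. The function and parameter names are hypothetical.

```python
def effective_thresholds(bios_thresholds, honour_bios):
    """Pick a CMCI threshold per bank.

    bios_thresholds: values found in each bank at boot (0 is taken as unset).
    honour_bios: True when the boot option asks to keep the firmware values.
    Returns (thresholds, warnings).
    """
    thresholds, warnings = [], []
    for bank, value in enumerate(bios_thresholds):
        if not honour_bios:
            thresholds.append(1)  # assumed kernel default for this sketch
        elif value == 0:
            warnings.append(f"bank {bank}: threshold not set by BIOS, using 1")
            thresholds.append(1)  # fail-safe
        else:
            thresholds.append(value)
    return thresholds, warnings
```

Returning the warnings alongside the thresholds keeps the sketch free of side effects; the kernel would print a message instead.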
Signed-off-by: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
There is no fundamental reason why we should switch SMEP and SMAP on
during early cpu initialization just to switch them off again. Now
with %eflags and %cr4 forced to be initialized to a clean state, we
only need the one-way enable. Also, make the functions inline to make
them (somewhat) harder to abuse.
This does mean that SMEP and SMAP do not get initialized anywhere near
as early. Even using early_param() instead of __setup() doesn't give
us control early enough to do this during the early cpu initialization
phase. This seems reasonable to me, because SMEP and SMAP should not
matter until we have userspace to protect ourselves from, but it does
potentially make it possible for a bug involving a "leak of
permissions to userspace" to go uncaught.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We already have a flag word to indicate the existence of MISC_ENABLES,
so use the same flag word to indicate existence of cr4 and EFER, and
always restore them if they exist. That way if something passes a
nonzero value when the value *should* be zero, we will still
initialize it.
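The difference between testing the saved value and testing a presence flag can be shown with a small model. The bit assignments and register dictionary are hypothetical, and the "restore only if nonzero" variant is an assumed stand-in for the earlier behaviour rather than a transcription of the wakeup code.

```python
# Hypothetical bit assignments for the "present" flag word.
PRESENT_MISC_ENABLE = 1 << 0
PRESENT_CR4 = 1 << 1
PRESENT_EFER = 1 << 2


def restore_if_nonzero(saved, hw):
    """Value-driven logic: a saved value of zero means 'nothing to restore'."""
    for reg in ("cr4", "efer"):
        if saved[reg] != 0:
            hw[reg] = saved[reg]


def restore_if_present(saved, hw):
    """Flag-driven logic: restore whenever the register exists, even to zero."""
    if saved["present"] & PRESENT_CR4:
        hw["cr4"] = saved["cr4"]
    if saved["present"] & PRESENT_EFER:
        hw["efer"] = saved["efer"]


if __name__ == "__main__":
    saved = {"present": PRESENT_CR4 | PRESENT_EFER, "cr4": 0, "efer": 0}
    hw = {"cr4": 0x20, "efer": 0x100}  # stale bits left behind by firmware
    restore_if_nonzero(saved, hw)
    print("value-driven:", hw)
    restore_if_present(saved, hw)
    print("flag-driven: ", hw)
```

In the value-driven variant the stale bits survive because the saved value is zero; with the presence flag they are overwritten.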
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Link: http://lkml.kernel.org/r/1348529239-17943-1-git-send-email-hpa@linux.intel.com
%cr4 is supposed to reflect a set of features into which the operating
system is opting in. If the BIOS or bootloader leaks bits here, this
is not desirable. Consider a bootloader passing in %cr4.pae set to a
legacy paging kernel, for example -- it will not have any immediate
effect, but the kernel would crash when turning paging on.
A similar argument applies to %eflags, and since we have to look for
%eflags.id being settable we can use a sequence which clears %eflags
as a side effect.
Note that we already do this for x86-64.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348529239-17943-1-git-send-email-hpa@linux.intel.com
do_notify_resume() may be called on irq or exception
exit. But at that time the exception has already called
rcu_user_enter() and the irq has already called rcu_irq_exit().
Since it can use RCU read-side critical sections, we must call
rcu_user_exit() before doing anything there. Then we must call
rcu_user_enter() again after this function because we know we are
going to userspace from there.
This completes support for the userspace RCU extended quiescent state
in x86-64.
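The required ordering can be illustrated with a toy model. Nothing below is kernel API: the class and method names only mirror the calls named in this message, and the model reduces the extended quiescent state to a single flag.

```python
from contextlib import contextmanager


class ToyRcu:
    """Tiny model: RCU read-side sections are illegal in the extended quiescent state."""

    def __init__(self):
        self.in_user_eqs = True  # irq/exception exit has already entered the EQS
        self.trace = []

    def user_exit(self):
        self.in_user_eqs = False
        self.trace.append("rcu_user_exit")

    def user_enter(self):
        self.in_user_eqs = True
        self.trace.append("rcu_user_enter")

    @contextmanager
    def read_lock(self):
        if self.in_user_eqs:
            raise RuntimeError("RCU read-side section inside extended quiescent state")
        self.trace.append("rcu_read_lock")
        try:
            yield
        finally:
            self.trace.append("rcu_read_unlock")


def notify_resume(rcu, work):
    """Leave the EQS, run the work, then re-enter the EQS before returning to userspace."""
    rcu.user_exit()
    try:
        work(rcu)
    finally:
        rcu.user_enter()


def signal_work(rcu):
    """Stand-in for work that takes an RCU read-side critical section."""
    with rcu.read_lock():
        pass
```

Calling signal_work() directly on a fresh ToyRcu raises, while going through notify_resume() brackets it with the exit/enter pair.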
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
This way we can exit the RCU extended quiescent state before
we schedule a new task from irq/exception exit.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Add the necessary hooks to x86 exceptions for userspace
RCU extended quiescent state support.
This includes traps, page faults, debug exceptions, etc.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
There is some unnatural label-based layout in this function.
Convert the unnecessary gotos to readable conditional blocks.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Add syscall slow path hooks to notify syscall entry
and exit on CPUs that want to support userspace RCU
extended quiescent state.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Alessio Igor Bogani <abogani@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Avi Kivity <avi@redhat.com>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Geoff Levand <geoff@infradead.org>
Cc: Gilad Ben Yossef <gilad@benyossef.com>
Cc: Hakan Akkan <hakanakkan@gmail.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Kevin Hilman <khilman@ti.com>
Cc: Max Krasnyansky <maxk@qualcomm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Clean up the label maze in this function. Having a
separate function to first handle the traps that don't
generate a signal makes it easier to convert into
more readable conditional paths.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1348577479-2564-1-git-send-email-fweisbec@gmail.com
[ Fixed 32-bit build failure. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
GAS in binutils(2.16.91) could not parse parentheses within
macro parameters unless fully parenthesized, and this is a
workaround to make old gas work without generating below errors:
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:387: Error: too many positional arguments
arch/x86/kernel/entry_64.S:389: Error: too many positional arguments
[...]
Signed-off-by: Tao Guo <glorioustao@gmail.com>
Reluctantly-Acked-by: Jan Beulich <jbeulich@novell.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1348648102-12653-1-git-send-email-glorioustao@gmail.com
[ Jan argues that these old GAS versions are fragile - which is so, but let's give them a chance. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 31d2092eb0 ("x86: move
mp_register_lapic_address to boot.c") renamed mp_register_lapic
to acpi_register_lapic. But mp_register_lapic remains in a
comment, so this patch renames it there as well.
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Cc: Len Brown <lenb@kernel.org>
Link: http://lkml.kernel.org/r/50625239.3050403@jp.fujitsu.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With SMAP, the [f][x]rstor_checking() functions are no longer usable
for user-space pointers by applying a simple __force cast. Instead,
create new [f][x]rstor_user() functions which do the proper SMAP
magic.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1343171129-2747-3-git-send-email-suresh.b.siddha@intel.com
Switch x86_64 to using sub-ns precise vsyscall
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
To help migrate architectures over to the new update_vsyscall method,
redefine CONFIG_GENERIC_TIME_VSYSCALL as CONFIG_GENERIC_TIME_VSYSCALL_OLD
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
Since users will need to include timekeeper_internal.h, move
update_vsyscall definitions to timekeeper_internal.h.
Cc: Tony Luck <tony.luck@intel.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
CLOCK_TICK_RATE is used to accurately calculate exactly how
long a tick will be at a given HZ.
This is useful, because while we'd expect NSEC_PER_SEC/HZ,
the underlying hardware will have some granularity limit,
so we won't be able to have exactly HZ ticks per second.
This slight error can cause timekeeping quality problems
when using the jiffies or other jiffies driven clocksources.
Thus we currently use compile time CLOCK_TICK_RATE value to
generate SHIFTED_HZ and NSEC_PER_JIFFIES, which we then use
to adjust the jiffies clocksource to correct this error.
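As a rough illustration of the granularity error described above, here
is a simplified user-space model (not the kernel's own macros). It
assumes a timer whose reload value must be an integer number of input
clock cycles; the helper names latch() and actual_tick_ns() are made up
for this sketch, and 1193182 Hz is used as an example input clock (the
i8253 PIT rate).

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Integer reload value closest to the requested HZ (round to nearest). */
static uint64_t latch(uint64_t clock_tick_rate, uint64_t hz)
{
	assert(hz > 0 && clock_tick_rate >= hz);
	return (clock_tick_rate + hz / 2) / hz;
}

/* Real tick length in ns (truncated) once the reload value is quantized. */
static uint64_t actual_tick_ns(uint64_t clock_tick_rate, uint64_t hz)
{
	return latch(clock_tick_rate, hz) * NSEC_PER_SEC / clock_tick_rate;
}
```

With these inputs the model gives 999847 ns per tick at HZ=1000, about
153 ns short of the nominal 1000000 ns, which is roughly 153 us of drift
per second if left uncorrected.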
Unfortunately though, since CLOCK_TICK_RATE is a compile
time value, and the jiffies clocksource is registered very
early during boot, there are a number of cases where there
are different possible hardware timers that have different
tick rates. This causes problems in cases like ARM where
there are numerous different types of hardware, each having
their own compile-time CLOCK_TICK_RATE, making it hard to
accurately support different hardware with a single kernel.
For the most part, this doesn't matter all that much, as not
too many systems actually utilize the jiffies or jiffies driven
clocksource. Usually there are other highres clocksources
whose granularity error is negligible.
Even so, we have some complicated calculations that we do
everywhere to handle these edge cases.
This patch removes the compile time SHIFTED_HZ value, and
introduces a register_refined_jiffies() function. As a result,
the default jiffies clock is assumed to run at a perfect HZ
frequency, and architectures that care about jiffies accuracy are able
to call register_refined_jiffies() with the tick rate, specified
dynamically at boot.
This allows us, where necessary, to not have a compile time
CLOCK_TICK_RATE constant, simplifies the jiffies code, and
still provides a way to have an accurate jiffies clock.
NOTE: Since this patch does not add register_refined_jiffies()
calls for every arch, it may cause time quality regressions
in some cases. It's likely these will not be noticeable, but
if they are an issue, adding the following to the end of
setup_arch() should resolve the regression:
register_refined_jiffies(CLOCK_TICK_RATE)
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Prarit Bhargava <prarit@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: John Stultz <john.stultz@linaro.org>
If arch/x86/kernel/cpuid.c is a module, a CPU might offline or online
between the for_each_online_cpu() loop and the call to
register_hotcpu_notifier in cpuid_init or the call to
unregister_hotcpu_notifier in cpuid_exit. The potential races can
lead to leaks/duplicates, attempts to destroy non-existent devices, or
random pointer dereferences.
For example, in cpuid_exit if:
for_each_online_cpu(cpu)
cpuid_device_destroy(cpu);
class_destroy(cpuid_class);
__unregister_chrdev(CPUID_MAJOR, 0, NR_CPUS, "cpu/cpuid");
<----- CPU onlines
unregister_hotcpu_notifier(&cpuid_class_cpu_notifier);
the hotcpu notifier will attempt to create a device for the
cpuid_class, which the module already destroyed.
This fix surrounds for_each_online_cpu and register_hotcpu_notifier or
unregister_hotcpu_notifier with get_online_cpus+put_online_cpus.
Tested on a VM.
Signed-off-by: Silas Boyd-Wickizer <sbw@mit.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
If arch/x86/kernel/msr.c is a module, a CPU might offline or online
between the for_each_online_cpu(i) loop and the call to
register_hotcpu_notifier in msr_init or the call to
unregister_hotcpu_notifier in msr_exit. The potential races can lead
to leaks/duplicates, attempts to destroy non-existent devices, or
random pointer dereferences.
For example, in msr_init if:
for_each_online_cpu(i) {
err = msr_device_create(i);
if (err != 0)
goto out_class;
}
<----- CPU offlines
register_hotcpu_notifier(&msr_class_cpu_notifier);
and the CPU never onlines before msr_exit, then the module will never
call msr_device_destroy for the associated CPU.
This fix surrounds for_each_online_cpu and register_hotcpu_notifier or
unregister_hotcpu_notifier with get_online_cpus+put_online_cpus.
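The msr_init interleaving above can be replayed deterministically in a
small user-space model. This is only a sketch: struct model and the
model_*() helpers are invented here and merely mimic the idea that a CPU
cannot go offline while the hotplug hold count is raised. It is not the
kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4

struct model {
	bool online[NCPU];
	bool dev[NCPU];		/* per-CPU device exists */
	bool notifier;		/* hotplug notifier registered */
	int hold;		/* modeled get_online_cpus() depth */
	int pending;		/* CPU waiting to go offline, or -1 */
};

/* A CPU tries to go offline; it must wait while the hold count is > 0. */
static void model_cpu_offline(struct model *m, int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	if (m->hold > 0) {
		m->pending = cpu;
		return;
	}
	m->online[cpu] = false;
	if (m->notifier)
		m->dev[cpu] = false;	/* notifier destroys the device */
}

static void model_get_online_cpus(struct model *m)
{
	m->hold++;
}

static void model_put_online_cpus(struct model *m)
{
	if (--m->hold == 0 && m->pending >= 0) {
		int cpu = m->pending;

		m->pending = -1;
		model_cpu_offline(m, cpu);
	}
}

/* Runs the init sequence; returns devices left behind for offline CPUs. */
static int model_init(struct model *m, bool protect, int racing_cpu)
{
	int leaked = 0;

	for (int i = 0; i < NCPU; i++) {
		m->online[i] = true;
		m->dev[i] = false;
	}
	m->notifier = false;
	m->hold = 0;
	m->pending = -1;

	if (protect)
		model_get_online_cpus(m);
	for (int i = 0; i < NCPU; i++)
		if (m->online[i])
			m->dev[i] = true;
	model_cpu_offline(m, racing_cpu);	/* "<----- CPU offlines" */
	m->notifier = true;
	if (protect)
		model_put_online_cpus(m);

	for (int i = 0; i < NCPU; i++)
		if (m->dev[i] && !m->online[i])
			leaked++;
	return leaked;
}
```

In this model the unprotected sequence leaves one stale device behind,
while the protected one defers the offline event until the notifier is
in place, so the device is torn down normally.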
Tested on a VM.
Signed-off-by: Silas Boyd-Wickizer <sbw@mit.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reason for merge:
x86/fpu changed the structure of some of the code that x86/smap
changes; mostly fpu-internal.h but also minor changes to the
signal code.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Resolved Conflicts:
arch/x86/ia32/ia32_signal.c
arch/x86/include/asm/fpu-internal.h
arch/x86/kernel/signal.c
Preemption is disabled between kernel_fpu_begin/end() and as such
it is not a good idea to use these routines in kvm_load/put_guest_fpu()
which can be very far apart.
kvm_load/put_guest_fpu() routines are already called with
preemption disabled and KVM already uses the preempt notifier to save
the guest fpu state using kvm_put_guest_fpu().
So introduce __kernel_fpu_begin/end() routines which don't touch
preemption and use them instead of kernel_fpu_begin/end()
for KVM's use model of saving/restoring guest FPU state.
Also with this change (and with eagerFPU model), fix the host cr0.TS vm-exit
state in the case of VMX. For eagerFPU case, host cr0.TS is always clear.
So no need to worry about it. For the traditional lazyFPU restore case,
change the cr0.TS bit for the host state during vm-exit to be always clear
and cr0.TS bit is set in the __vmx_load_host_state() when the FPU
(guest FPU or the host task's FPU) state is not active. This ensures
that the host/guest FPU state is properly saved, restored
during context-switch and with interrupts (using irq_fpu_usable()) not
stomping on the active FPU state.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1348164109.26695.338.camel@sbsiddha-desk.sc.intel.com
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The changes to entry_32.S got missed in checkin:
63bcff2a x86, smap: Add STAC and CLAC instructions to control user space access
The resulting kernel was largely functional but SMAP protection could
have been bypassed.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348256595-29119-9-git-send-email-hpa@linux.intel.com
Signal handling contains a bunch of accesses to individual user space
items, which causes an excessive number of STAC and CLAC
instructions. Instead, let get/put_user_try ... get/put_user_catch()
contain the STAC and CLAC instructions.
This means that get/put_user_try no longer nests, and furthermore that
it is no longer legal to use user space access functions other than
__get/put_user_ex() inside those blocks. However, these macros are
x86-specific anyway and are only used in the signal-handling paths; a
simple reordering of moving the larger subroutine calls out of the
try...catch blocks resolves that problem.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348256595-29119-12-git-send-email-hpa@linux.intel.com
When Supervisor Mode Access Prevention (SMAP) is enabled, access to
userspace from the kernel is controlled by the AC flag. To make the
performance of manipulating that flag acceptable, there are two new
instructions, STAC and CLAC, to set and clear it.
This patch adds those instructions, via alternative(), when the SMAP
feature is enabled. It also adds X86_EFLAGS_AC unconditionally to the
SYSCALL entry mask; there is simply no reason to make that one
conditional.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/r/1348256595-29119-9-git-send-email-hpa@linux.intel.com
TIF_NOTIFY_RESUME will work in precisely the same way; all that
is achieved by TIF_IRET is appearing that there's some work to be
done, so we end up on the iret exit path. Just use NOTIFY_RESUME.
And for execve() do that in 32bit start_thread(), not sys_execve()
itself.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
I get this warning:
arch/x86/kernel/kprobes.c:544:23: warning: ‘skip_singlestep’ declared ‘static’ but never defined
on tip/auto-latest.
Put the skip_singlestep function declaration up, in
KPROBES_CAN_USE_FTRACE and drop the superfluous forward
declaration.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/1348145034-16603-1-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
list_for_each_entry_reverse() dereferences the iterator, but we already
freed it. I don't see a reason that this has to be done in reverse order
so change it to use list_for_each_entry_safe().
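The underlying pattern, shown as a user-space sketch with a hypothetical
singly linked struct node rather than the kernel's struct list_head: the
successor is cached before the current element is freed, which is the
same idea the _safe iterator implements with its extra cursor.

```c
#include <assert.h>
#include <stdlib.h>

struct node {
	int value;
	struct node *next;
};

/* Push a new node at the head; returns the old head unchanged on OOM. */
static struct node *push(struct node *head, int value)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		return head;
	n->value = value;
	n->next = head;
	return n;
}

/* Free every node and return how many were freed. */
static int free_all(struct node *head)
{
	struct node *pos, *tmp;
	int freed = 0;

	for (pos = head; pos; pos = tmp) {
		tmp = pos->next;	/* read the link before freeing pos */
		free(pos);
		freed++;
	}
	return freed;
}
```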
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
This patch updates the existing Intel IvyBridge (model 58)
support with proper PEBS event constraints. It cannot reuse
the same as SandyBridge because some events (0xd3) are
specific to IvyBridge.
Also there is no UOPS_DISPATCHED.THREAD on IVB, so do not
populate the PERF_COUNT_HW_STALLED_CYCLES_BACKEND mapping.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Link: http://lkml.kernel.org/r/20120910230701.GA5898@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When acting on a user bug report, we find ourselves constantly
asking for /proc/cpuinfo in order to know the exact family,
model, stepping of the CPU in question.
Instead of having to ask this, add this to dmesg so that it is
visible and no ambiguities can ensue from looking at the
official name string of the CPU coming from CPUID and trying
to map it to f/m/s.
Output then looks like this:
[ 0.146041] smpboot: CPU0: AMD FX(tm)-8100 Eight-Core Processor (fam: 15, model: 01, stepping: 02)
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/r/1347640666-13638-1-git-send-email-bp@amd64.org
[ tweaked it minimally to add commas. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The test should be >= ARRAY_SIZE() instead of > ARRAY_SIZE().
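A minimal stand-alone illustration of the off-by-one, using a made-up
four-entry table and a hypothetical table_lookup() helper: valid indexes
run from 0 to ARRAY_SIZE() - 1, so an index equal to ARRAY_SIZE() must
already be rejected.

```c
#include <assert.h>
#include <stddef.h>

#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static const int table[] = { 10, 20, 30, 40 };

/* Returns 0 and stores the entry, or -1 if idx is out of range. */
static int table_lookup(size_t idx, int *out)
{
	if (idx >= ARRAY_SIZE(table))	/* '>' would let idx == 4 through */
		return -1;
	*out = table[idx];
	return 0;
}
```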
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/20120905123126.GC6128@elgon.mountain
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add the "eagerfpu=auto" (that selects the default scheme in
enabling eagerfpu) which can override compiled-in boot parameters
like "eagerfpu=on/off" (that force enable/disable eagerfpu).
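The three settings can be pictured with a small user-space sketch. The
enum, parse_eagerfpu() and use_eagerfpu() are illustrative names only,
and treating "auto" as "eager when xsaveopt is available" is an
assumption about the default scheme, not something this patch states.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

enum eagerfpu_mode { EAGERFPU_AUTO, EAGERFPU_ON, EAGERFPU_OFF };

/* Parse the value after "eagerfpu="; unknown strings leave *mode alone. */
static int parse_eagerfpu(const char *arg, enum eagerfpu_mode *mode)
{
	if (!strcmp(arg, "on"))
		*mode = EAGERFPU_ON;
	else if (!strcmp(arg, "off"))
		*mode = EAGERFPU_OFF;
	else if (!strcmp(arg, "auto"))
		*mode = EAGERFPU_AUTO;
	else
		return -1;
	return 0;
}

/* Resolve the mode; "auto" falls back to the assumed default scheme. */
static bool use_eagerfpu(enum eagerfpu_mode mode, bool has_xsaveopt)
{
	switch (mode) {
	case EAGERFPU_ON:
		return true;
	case EAGERFPU_OFF:
		return false;
	default:
		return has_xsaveopt;
	}
}
```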
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-5-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
xsaveopt/xrstor support optimized state save/restore by tracking the
INIT state and MODIFIED state during context-switch.
Enable eagerfpu by default for processors supporting xsaveopt.
Can be disabled by passing "eagerfpu=off" boot parameter.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-3-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Decouple non-lazy/eager fpu restore policy from the existence of the xsave
feature. Introduce a synthetic CPUID flag to represent the eagerfpu
policy. "eagerfpu=on" boot parameter will enable the policy.
Requested-by: H. Peter Anvin <hpa@zytor.com>
Requested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1347300665-6209-2-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Fundamental model of the current Linux kernel is to lazily init and
restore FPU instead of restoring the task state during context switch.
This changes that fundamental lazy model to the non-lazy model for
the processors supporting xsave feature.
Reasons driving this model change are:
i. Newer processors support optimized state save/restore using xsaveopt and
xrstor by tracking the INIT state and MODIFIED state during context-switch.
This is faster than modifying the cr0.TS bit which has serializing semantics.
ii. Newer glibc versions use SSE for some of the optimized copy/clear routines.
With certain workloads (like boot, kernel-compilation etc), application
completes its work within the first 5 task switches, thus taking up to 5 #DNA
traps with the kernel not getting a chance to apply the above mentioned
pre-load heuristic.
iii. Some xstate features (like AMD's LWP feature) don't honor the cr0.TS bit
and thus will not work correctly in the presence of lazy restore. Non-lazy
state restore is needed for enabling such features.
Some data on a two socket SNB system:
* Saved 20K DNA exceptions during boot on a two socket SNB system.
* Saved 50K DNA exceptions during kernel-compilation workload.
* Improved throughput of the AVX based checksumming function inside the
kernel by ~15% as xsave/xrstor is faster than the serializing clts/stts
pair.
Also now kernel_fpu_begin/end() relies on the patched
alternative instructions. So move check_fpu() which uses the
kernel_fpu_begin/end() after alternative_instructions().
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-7-git-send-email-suresh.b.siddha@intel.com
Merge 32-bit boot fix from,
Link: http://lkml.kernel.org/r/1347300665-6209-4-git-send-email-suresh.b.siddha@intel.com
Cc: Jim Kukunas <james.t.kukunas@linux.intel.com>
Cc: NeilBrown <neilb@suse.de>
Cc: Avi Kivity <avi@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
A few lines below we do drop_fpu(), which is safer. Remove the
unnecessary user_fpu_end() in save_xstate_sig(), which allows
the drop_fpu() to ignore any pending exceptions from the user-space
and drop the current fpu.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-3-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
No need to save the state with unlazy_fpu(), that is about to get overwritten
by the state from the signal frame. Instead use drop_fpu() and continue
to restore the new state.
Also fold the stop_fpu_preload() into drop_fpu().
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1345842782-24175-2-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Currently for x86 and x86_32 binaries, fpstate in the user sigframe is copied
to/from the fpstate in the task struct.
And in the case of signal delivery for x86_64 binaries, if the fpstate is live
in the CPU registers, then the live state is copied directly to the user
sigframe. Otherwise fpstate in the task struct is copied to the user sigframe.
During restore, fpstate in the user sigframe is restored directly to the live
CPU registers.
Historically, different code paths led to different bugs. For example,
the x86_64 code path was not preemption safe till recently. Also, there is a
lot of code duplication for support of new features like xsave etc.
Unify signal handling code paths for x86 and x86_64 kernels.
New strategy is as follows:
Signal delivery: Both for 32/64-bit frames, align the core math frame area to
64 bytes as needed by xsave (this is where the main fpu/extended state gets copied
to and excludes the legacy compatibility fsave header for the 32-bit [f]xsave
frames). If the state is live, copy the register state directly to the user
frame. If not live, copy the state in the thread struct to the user frame. And
for 32-bit [f]xsave frames, construct the fsave header separately before
the actual [f]xsave area.
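The 64-byte alignment step amounts to reserving the frame below the
current stack pointer and rounding the result down. A minimal sketch,
where math_frame_base() and XSAVE_ALIGN are names invented for
illustration and the sizes in the usage are arbitrary:

```c
#include <assert.h>
#include <stdint.h>

#define XSAVE_ALIGN 64UL

/* Reserve 'size' bytes below 'sp', then round down to a 64-byte boundary. */
static uintptr_t math_frame_base(uintptr_t sp, uintptr_t size)
{
	assert(sp >= size);
	return (sp - size) & ~(uintptr_t)(XSAVE_ALIGN - 1);
}
```

For example, reserving 100 bytes below address 1000 gives 900, which
rounds down to 896 (14 * 64).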
Signal return: As the 32-bit frames with [f]xstate has an additional
'fsave' header, copy everything back from the user sigframe to the
fpstate in the task structure and reconstruct the fxstate from the 'fsave'
header (Also user passed pointers may not be correctly aligned for
any attempt to directly restore any partial state). At the next fpstate usage,
everything will be restored to the live CPU registers.
For all the 64-bit frames and the 32-bit fsave frame, restore the state from
the user sigframe directly to the live CPU registers. 64-bit signals always
restored the math frame directly, so we can expect the math frame pointer
to be correctly aligned. For 32-bit fsave frames, there are no alignment
requirements, so we can restore the state directly.
"lat_sig catch" microbenchmark numbers (for x86, x86_64, x86_32 binaries) are
within the noise range with this change.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1343171129-2747-4-git-send-email-suresh.b.siddha@intel.com
[ Merged in compilation fix ]
Link: http://lkml.kernel.org/r/1344544736.8326.17.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This patch adds a cpumask file to the uncore pmu sysfs directory. The
cpumask file contains one active cpu for every socket.
Signed-off-by: "Yan, Zheng" <zheng.z.yan@intel.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Cc: "Yan, Zheng" <zheng.z.yan@intel.com>
Link: http://lkml.kernel.org/r/1347263631-23175-2-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
arch_uprobe_disable_step() should also take UTASK_SSTEP_TRAPPED into
account. In this case the probed insn was not executed, we need to
clear X86_EFLAGS_TF if it was set by us and that is all.
Again, this code will look more clean when we move it into
arch_uprobe_post_xol() and arch_uprobe_abort_xol().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
arch_uprobe_disable_step() correctly preserves X86_EFLAGS_TF and
returns to user-mode. But this means the application gets SIGTRAP
only after the next insn.
This means that UPROBE_CLEAR_TF logic is not really right. _enable
should only record the state of X86_EFLAGS_TF, and _disable should
check it separately from UPROBE_FIX_SETF.
Remove arch_uprobe_task->restore_flags, add ->saved_tf instead, and
change enable/disable accordingly. This assumes that the probed insn
was not trapped, see the next patch.
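The intended enable/disable behaviour can be modeled in user space as
follows. This is a sketch only: struct step_state and the model_*()
helpers are invented for illustration, the UPROBE_FIX_SETF case is left
out, and the return value stands in for "a SIGTRAP is owed to the
application".

```c
#include <assert.h>
#include <stdbool.h>

#define X86_EFLAGS_TF 0x100UL	/* trap flag, bit 8 */

struct step_state {
	unsigned long flags;	/* stand-in for regs->flags */
	bool saved_tf;		/* TF was already set by the application */
};

/* Record whether TF was set, then force single-stepping on. */
static void model_enable_step(struct step_state *s)
{
	s->saved_tf = (s->flags & X86_EFLAGS_TF) != 0;
	s->flags |= X86_EFLAGS_TF;
}

/* Returns true when the application set TF itself and expects a trap. */
static bool model_disable_step(struct step_state *s)
{
	if (s->saved_tf)
		return true;		/* leave TF set, report the trap */
	s->flags &= ~X86_EFLAGS_TF;	/* TF was ours, so clear it */
	return false;
}
```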
arch_uprobe_skip_sstep() logic has the same problem, change it to
check X86_EFLAGS_TF and send SIGTRAP as well. We will cleanup this
all after we fold enable/disable_step into pre/post_xol hooks.
Note: send_sig(SIGTRAP) is not actually right, we need send_sigtrap().
But this needs more changes, handle_swbp() does the same and this is
equally wrong.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
user_enable/disable_single_step() was designed for ptrace, it assumes
a single user and does unnecessary and wrong things for uprobes. For
example:
- arch_uprobe_enable_step() can't trust TIF_SINGLESTEP, an
application itself can set X86_EFLAGS_TF which must be
preserved after arch_uprobe_disable_step().
- we do not want to set TIF_SINGLESTEP/TIF_FORCED_TF in
arch_uprobe_enable_step(), this only makes sense for ptrace.
- otoh we leak TIF_SINGLESTEP if arch_uprobe_disable_step()
doesn't do user_disable_single_step(), the application will
be killed after the next syscall.
- arch_uprobe_enable_step() does access_process_vm() we do
not need/want.
Change arch_uprobe_enable/disable_step() to set/clear X86_EFLAGS_TF
directly, this is much simpler and more correct. However, we need to
clear TIF_BLOCKSTEP/DEBUGCTLMSR_BTF before executing the probed insn,
add set_task_blockstep(false).
Note: with or without this patch, there is another (hopefully minor)
problem. A probed "pushf" insn can see the wrong X86_EFLAGS_TF set by
uprobes. Perhaps we should change _disable to update the stack, or
teach arch_uprobe_skip_sstep() to emulate this insn.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Afaics the usage of update_debugctlmsr() and TIF_BLOCKSTEP in
step.c was always very wrong.
1. update_debugctlmsr() was simply unneeded. The child sleeps
TASK_TRACED, __switch_to_xtra(next_p => child) should notice
TIF_BLOCKSTEP and set/clear DEBUGCTLMSR_BTF after resume if
needed.
2. It is wrong. The state of DEBUGCTLMSR_BTF bit in CPU register
should always match the state of current's TIF_BLOCKSTEP bit.
3. Even get_debugctlmsr() + update_debugctlmsr() itself does not
look right. Irq can change other bits in MSR_IA32_DEBUGCTLMSR
register or the caller can be preempted in between.
4. It is not safe to play with TIF_BLOCKSTEP if task != current.
DEBUGCTLMSR_BTF and TIF_BLOCKSTEP should always match each
other if the task is running. The tracee is stopped but it
can be SIGKILL'ed right before set/clear_tsk_thread_flag().
However, now that uprobes uses user_enable_single_step(current)
we can't simply remove update_debugctlmsr(). So this patch adds
the additional "task == current" check and disables irqs to avoid
the race with interrupts/preemption.
Unfortunately this patch doesn't solve the last problem, we need
another fix. Probably we should teach ptrace_stop() to set/clear
single/block stepping after resume.
And afaics there is yet another problem: perf can play with
MSR_IA32_DEBUGCTLMSR from nmi, this obviously means that even
__switch_to_xtra() has problems.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
No functional changes, preparation for the next fix and for uprobes
single-step fixes.
Move the code playing with TIF_BLOCKSTEP/DEBUGCTLMSR_BTF into the
new helper, set_task_blockstep().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
The arch specific implementation behaves like user_enable_single_step()
except that it does not disable single stepping if it was already
enabled by ptrace. This allows the debugger to single step over an
uprobe. The state of block stepping is not restored. It only makes sense
together with TF, and if that was enabled then the debugger is notified.
Note: this is still not correct. For example, TIF_SINGLESTEP check
is not right, the application itself can set X86_EFLAGS_TF. And otoh
we leak TIF_SINGLESTEP (set by enable) if the probed insn is "popf".
See the next patches, we need the changes in arch/x86/kernel/step.c
first.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Fix kprobes/x86 to support jprobes on ftrace-based kprobes.
Because of -mfentry support of ftrace, ftrace is now put
on the beginning of function where jprobes are put.
Originally, ftrace-based kprobes did not support jprobes because
a jprobe changes regs->ip, which ftrace did not allow, while
ftrace itself did not conflict with jprobes.
However, ftrace -mfentry support moves mcount call on the
top of functions where jprobes are put. This means that
jprobe always conflicts with ftrace-based kprobe and fails.
This patch allows ftrace-based kprobes to support jprobes
by allowing to modify regs->ip and kprobes breakpoint
handler also allows to skip singlestepping because there
is a ftrace call (not an original instruction).
Link: http://lkml.kernel.org/r/20120905143125.10329.90836.stgit@localhost.localdomain
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Allow ftrace handlers to change RIP register (regs->ip)
in handlers. This will allow handlers to call another
function instead of original function.
Link: http://lkml.kernel.org/r/20120905143118.10329.5078.stgit@localhost.localdomain
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Current kprobe_ftrace_handler expects regs->ip == ip, but it is
incorrect (originally on x86-64). Actually, ftrace handler sets
regs->ip = ip + MCOUNT_INSN_SIZE.
kprobe_ftrace_handler must take care for that.
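In other words, the handler has to subtract the call size again to get
back to the probed address. A small sketch of that arithmetic;
probe_addr_from_regs_ip() is a name made up here, and 5 is the size of
the x86 call instruction at the ftrace site:

```c
#include <assert.h>
#include <stdint.h>

#define MCOUNT_INSN_SIZE 5	/* size of the x86 call at the ftrace site */

/* Probed address, given the regs->ip value that ftrace hands over. */
static uintptr_t probe_addr_from_regs_ip(uintptr_t regs_ip)
{
	assert(regs_ip >= MCOUNT_INSN_SIZE);
	return regs_ip - MCOUNT_INSN_SIZE;
}
```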
Link: http://lkml.kernel.org/r/20120905143112.10329.72069.stgit@localhost.localdomain
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Adjust x86 regs.ip to ip + MCOUNT_INSN_SIZE, as is already done
on x86-64. This helps us consolidate code that uses
regs->ip on both x86 and x86-64.
Link: http://lkml.kernel.org/r/20120905143100.10329.60109.stgit@localhost.localdomain
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On 64 bit x86 we save the current eflags in cpu_init for use in
ret_from_fork. Strictly speaking reserved bits in EFLAGS should
be read as written but in practise it is unlikely that EFLAGS
could ever be extended in this way and the kernel already clears
any undefined flags early on.
The equivalent 32 bit code simply hard codes 0x0202 as the new
EFLAGS.
This change makes 64 bit use the same mechanism to setup the
initial EFLAGS on fork. Note that 64 bit resets EFLAGS before
calling schedule_tail() as opposed to 32 bit which calls
schedule_tail() first. Therefore the correct value for EFLAGS
has opposite IF bit.
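To spell out the two values: 0x0202 is the always-set bit 1 plus the
interrupt flag (bit 9), so the same constant with the IF bit flipped is
0x0002. The macro and function names below are illustrative, not the
kernel's.

```c
#include <assert.h>

#define EFLAGS_FIXED_BIT1 0x0002UL	/* bit 1 always reads as 1 */
#define EFLAGS_IF         0x0200UL	/* interrupt enable flag, bit 9 */

/* Initial EFLAGS for a forked task, with or without interrupts enabled. */
static unsigned long fork_eflags(int irqs_enabled)
{
	return EFLAGS_FIXED_BIT1 | (irqs_enabled ? EFLAGS_IF : 0UL);
}
```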
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andi Kleen <ak@linux.intel.com>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/20120824195847.GA31628@moon
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Current implementation simply ignores attribute flags. Thus, there is
no notification to userland of unsupported features. Check syscall's
attribute flags to let userland know if a feature is supported by the
kernel. This is also needed to distinguish between future kernels that
might support a feature.
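The general shape of such a check, as a hedged user-space sketch: the
ATTR_* bits and model_event_init() are hypothetical stand-ins, since the
concrete attribute flags are not listed in this message. Any request
carrying a flag the PMU cannot honour is rejected instead of being
silently ignored.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits standing in for event attribute options. */
#define ATTR_EXCLUDE_USER	(1u << 0)
#define ATTR_EXCLUDE_KERNEL	(1u << 1)
#define ATTR_EXCLUDE_HV		(1u << 2)

/* Options this (hypothetical) PMU cannot honour. */
#define ATTR_NOT_SUPPORTED \
	(ATTR_EXCLUDE_USER | ATTR_EXCLUDE_KERNEL | ATTR_EXCLUDE_HV)

/* Returns 0 if the event can be set up, -EINVAL for unsupported flags. */
static int model_event_init(unsigned int attr_flags)
{
	if (attr_flags & ATTR_NOT_SUPPORTED)
		return -EINVAL;
	return 0;
}
```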
Cc: <stable@vger.kernel.org> v3.5..
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120910093018.GO8285@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch exports the clockticks event and its encoding to user level.
The clockticks event was exported for Nehalem/Westmere but not for Sandy
Bridge (client). Given that it uses a special encoding, it needs to be
exported to user tools, so users can do:
# perf stat -a -C 0 -e uncore_cbox_0/clockticks/ sleep 1
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120829130122.GA32336@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
At this stage x86_init.paging.pagetable_setup_done is only used in the
XEN case. Move its content in the x86_init.paging.pagetable_init setup
function and remove the now unused x86_init.paging.pagetable_setup_done
remaining infrastructure.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Acked-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-5-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Move the paging_init() call to the platform specific pagetable_init()
function, so we can get rid of the extra pagetable_setup_done()
function pointer.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Acked-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-4-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
In preparation for unifying the pagetable_setup_start() and
pagetable_setup_done() setup functions, rename appropriately all the
infrastructure related to pagetable_setup_start().
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Acked-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-3-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
We either use swapper_pg_dir or the argument is unused. Preparatory
patch to simplify platform pagetable setup further.
Signed-off-by: Attilio Rao <attilio.rao@citrix.com>
Acked-by: <konrad.wilk@oracle.com>
Cc: <Ian.Campbell@citrix.com>
Cc: <Stefano.Stabellini@eu.citrix.com>
Cc: <xen-devel@lists.xensource.com>
Link: http://lkml.kernel.org/r/1345580561-8506-2-git-send-email-attilio.rao@citrix.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Don't remove the __user annotation of the fpstate pointer, but
drop the superfluous void * cast instead.
This fixes the following sparse warnings:
xsave.c:135:15: warning: cast removes address space of expression
xsave.c:135:15: warning: incorrect type in argument 1 (different address spaces)
xsave.c:135:15: expected void const volatile [noderef] <asn:1>*<noident>
[...]
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1346621506-30857-6-git-send-email-minipli@googlemail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch enables perf_events support for Intel Cedarview
Atom (model 54) processors. Support includes PEBS and LBR.
Tested on my Atom N2600 netbook.
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120820092421.GA11284@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The following patch makes the microcode update code path
actually invoke the perf_check_microcode() function and
thus potentially re-enabling SNB PEBS.
By default, CONFIG_MICROCODE_OLD_INTERFACE is
forced to Y in arch/x86/Kconfig. There is no
way to disable this. That means that the code
path used in arch/x86/kernel/microcode_core.c
did not include the call to perf_check_microcode().
Thus, even though the microcode was updated to a
version that fixes the SNB PEBS problem, perf_event
would still return EOPNOTSUPP when enabling precise
sampling.
This patch simply adds a call to perf_check_microcode()
in the call path used when OLD_INTERFACE=y.
Signed-off-by: Stephane Eranian <eranian@google.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: peterz@infradead.org
Cc: andi@firstfloor.org
Link: http://lkml.kernel.org/r/20120824133434.GA8014@quad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merging critical fixes from upstream required for development.
* upstream/master: (809 commits)
libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry
Revert "powerpc: Update g5_defconfig"
powerpc/perf: Use pmc_overflow() to detect rolled back events
powerpc: Fix VMX in interrupt check in POWER7 copy loops
powerpc: POWER7 copy_to_user/copy_from_user patch applied twice
powerpc: Fix personality handling in ppc64_personality()
powerpc/dma-iommu: Fix IOMMU window check
powerpc: Remove unnecessary ifdefs
powerpc/kgdb: Restore current_thread_info properly
powerpc/kgdb: Bail out of KGDB when we've been triggered
powerpc/kgdb: Do not set kgdb_single_step on ppc
powerpc/mpic_msgr: Add missing includes
powerpc: Fix null pointer deref in perf hardware breakpoints
powerpc: Fixup whitespace in xmon
powerpc: Fix xmon dl command for new printk implementation
xfs: check for possible overflow in xfs_ioc_trim
xfs: unlock the AGI buffer when looping in xfs_dialloc
xfs: fix uninitialised variable in xfs_rtbuf_get()
powerpc/fsl: fix "Failed to mount /dev: No such device" errors
powerpc/fsl: update defconfigs
...
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
If the kernel is compiled with gcc 4.6.0 which supports -mfentry,
then use that instead of mcount.
With mcount, frame pointers are forced with the -pg option and we
get something like:
<can_vma_merge_before>:
55 push %rbp
48 89 e5 mov %rsp,%rbp
53 push %rbx
41 51 push %r9
e8 fe 6a 39 00 callq ffffffff81483d00 <mcount>
31 c0 xor %eax,%eax
48 89 fb mov %rdi,%rbx
48 89 d7 mov %rdx,%rdi
48 33 73 30 xor 0x30(%rbx),%rsi
48 f7 c6 ff ff ff f7 test $0xfffffffff7ffffff,%rsi
With -mfentry, frame pointers are no longer forced and the call looks
like this:
<can_vma_merge_before>:
e8 33 af 37 00 callq ffffffff81461b40 <__fentry__>
53 push %rbx
48 89 fb mov %rdi,%rbx
31 c0 xor %eax,%eax
48 89 d7 mov %rdx,%rdi
41 51 push %r9
48 33 73 30 xor 0x30(%rbx),%rsi
48 f7 c6 ff ff ff f7 test $0xfffffffff7ffffff,%rsi
This adds the ftrace hook at the beginning of the function before a
frame is set up, and allows the function callbacks to be able to access
parameters. As kprobes now can use function tracing (at least on x86)
this speeds up the kprobe hooks that are at the beginning of the
function.
Link: http://lkml.kernel.org/r/20120807194100.130477900@goodmis.org
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
We still patch SMP instructions to UP variants if we boot with a
single CPU, but not at any other time. In particular, not if we
unplug CPUs to return to a single cpu.
Paul McKenney points out:
mean offline overhead is 6251/48=130.2 milliseconds.
If I remove the alternatives_smp_switch() from the offline
path [...] the mean offline overhead is 550/42=13.1 milliseconds
Basically, we're never going to get those 120ms back, and the
code is pretty messy.
We get rid of:
1) The "smp-alt-once" boot option. It's actually "smp-alt-boot", the
documentation is wrong. It's now the default.
2) The skip_smp_alternatives flag used by suspend.
3) arch_disable_nonboot_cpus_begin() and arch_disable_nonboot_cpus_end()
which were only used to set this one flag.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Paul McKenney <paul.mckenney@us.ibm.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87vcgwwive.fsf@rustcorp.com.au
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The distinction between CONFIG_KVM_CLOCK and CONFIG_KVM_GUEST is
not so clear anymore, as demonstrated by recent bugs caused by poor
handling of on/off combinations of these options.
Merge CONFIG_KVM_CLOCK into CONFIG_KVM_GUEST.
Reported-By: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Limit the access to userspace only on the BSP where we load the
container, verify the patches in it and put them in the patch cache.
Then, at application time, we lookup the correct patch in the cache and
use it.
When we need to reload the userspace container, we do that over the
reload interface:
echo 1 > /sys/devices/system/cpu/microcode/reload
which reloads a (possibly newer) container from userspace and then
applies the newest patches from there.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-13-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is a trivial cache which collects all ucode patches for the current
family of CPUs on the system. If a newer patch appears due to the
container file being updated in userspace, we replace our cached version
with the new one.
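A minimal sketch of such a cache, assuming patches are keyed by an
equivalence ID and that a numerically higher patch ID means "newer".
The fixed-size array and all names are hypothetical and only
illustrate the replace-if-newer behaviour described above:

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8	/* arbitrary size for the sketch */

struct ucode_patch {
	uint16_t equiv_cpu;	/* equivalence ID the patch applies to */
	uint32_t patch_id;	/* higher means newer in this sketch */
	int used;		/* slot occupied? */
};

static struct ucode_patch cache[CACHE_SLOTS];

/* Look up the cached patch for an equivalence ID, or NULL. */
struct ucode_patch *cache_find(uint16_t equiv_cpu)
{
	for (size_t i = 0; i < CACHE_SLOTS; i++)
		if (cache[i].used && cache[i].equiv_cpu == equiv_cpu)
			return &cache[i];
	return NULL;
}

/*
 * Insert a patch, or replace the cached one if the new one is newer.
 * Returns 1 if the cache changed, 0 if not, -1 if the cache is full.
 */
int cache_update(uint16_t equiv_cpu, uint32_t patch_id)
{
	struct ucode_patch *p = cache_find(equiv_cpu);

	if (p) {
		if (patch_id <= p->patch_id)
			return 0;
		p->patch_id = patch_id;
		return 1;
	}
	for (size_t i = 0; i < CACHE_SLOTS; i++) {
		if (!cache[i].used) {
			cache[i] = (struct ucode_patch){ equiv_cpu, patch_id, 1 };
			return 1;
		}
	}
	return -1;
}
```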
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-12-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We search the equivalence table using the CPUID(1) signature of the
CPU in order to get the equivalence ID of the patch which we need to
apply. Add a function which does the reverse - it will be needed in
later patches.
While at it, pull the other equiv table function up in the file so that
it can be used by other functionality without forward declarations.
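The forward and reverse lookups can be pictured roughly as follows.
This is a simplified sketch: the entry layout is reduced to two
fields, the table contents are made-up values, and the function names
are not those used in the driver:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified table entry; the real layout carries more fields. */
struct equiv_entry {
	uint32_t cpu_sig;	/* CPUID(1).EAX signature */
	uint16_t equiv_id;	/* equivalence ID referenced by patches */
};

/* Made-up contents, terminated by an all-zero entry. */
static const struct equiv_entry equiv_table[] = {
	{ 0x00000a01, 0x0a00 },
	{ 0x00000a02, 0x0a00 },	/* several signatures may share one ID */
	{ 0x00000b01, 0x0b00 },
	{ 0, 0 },
};

/* Forward lookup: CPU signature -> equivalence ID (0 if not found). */
uint16_t find_equiv_id(uint32_t cpu_sig)
{
	for (size_t i = 0; equiv_table[i].cpu_sig != 0; i++)
		if (equiv_table[i].cpu_sig == cpu_sig)
			return equiv_table[i].equiv_id;
	return 0;
}

/* Reverse lookup: equivalence ID -> first matching signature (0 if none). */
uint32_t find_cpu_sig(uint16_t equiv_id)
{
	for (size_t i = 0; equiv_table[i].cpu_sig != 0; i++)
		if (equiv_table[i].equiv_id == equiv_id)
			return equiv_table[i].cpu_sig;
	return 0;
}
```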
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-11-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This is done in preparation for teaching the ucode driver to either load
a new ucode patches container from userspace or use an already cached
version. No functionality change in this patch.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-10-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Read the CPUID(1).EAX leaf at the correct cpu and use it to search the
equivalence table for a matching microcode patch. No functionality change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-9-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Make sure we're actually applying a microcode patch to a core which
really needs it.
This brings only a very very very minor slowdown on F10:
0.032218828 sec vs 0.056010626 sec with this patch.
And small speedup on F15:
0.487089449 sec vs 0.180551162 sec (from perf output).
Also, fixup comments while at it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-8-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
get_ucode_data was a trivial memcpy wrapper. Remove it so as not to
obfuscate code unnecessarily with no obvious gain.
No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-7-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Mask out CPU_TASKS_FROZEN bit so that all _FROZEN cases can be dropped.
Also, add some more comments as to why CPU_ONLINE falls through to
CPU_DOWN_FAILED (no break), and for the CPU_DEAD case. Realign debug
printks better.
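The masking trick looks roughly like this. The constant values mirror
the hotplug notifier definitions of that era as far as I recall them
and should be treated as illustrative; the returned actions are
hypothetical placeholders for whatever the callback does:

```c
/* Illustrative values for the hotplug notifier events. */
#define CPU_ONLINE		0x0002UL
#define CPU_DOWN_FAILED		0x0006UL
#define CPU_DEAD		0x0007UL
#define CPU_TASKS_FROZEN	0x0010UL

/* Hypothetical actions a callback might take. */
enum action { ACT_NONE, ACT_START, ACT_CLEANUP };

enum action classify(unsigned long event)
{
	/* Strip the FROZEN bit so each case covers its _FROZEN twin. */
	switch (event & ~CPU_TASKS_FROZEN) {
	case CPU_ONLINE:
		/* fall through: same handling as a failed offline */
	case CPU_DOWN_FAILED:
		return ACT_START;
	case CPU_DEAD:
		return ACT_CLEANUP;
	default:
		return ACT_NONE;
	}
}
```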
Idea blatantly stolen from a tglx patch:
http://marc.info/?l=linux-kernel&m=134267779513862
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-5-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Remove the uci->mc check on the cpu resume path because the low-level
drivers do that anyway.
More importantly, though, this fixes a contrived and obscure but still
important case. Imagine the following:
* boot machine, no new microcode in /lib/firmware
* a subset of the CPUs is offlined
* in the meantime, user puts new fresh microcode container into
/lib/firmware and reloads it by doing
$ echo 1 > /sys/devices/system/cpu/microcode/reload
* offlined cores come back online and they don't get the newer microcode
applied due to this check.
Later patches take care of the issue on AMD.
While at it, cleanup code around it.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-4-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Invert the uci->valid check so that the later block can be aligned on
the first indentation level of the function. No functional change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-3-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
This issue was recently observed on an AMD C-50 CPU where a patch of
maximum size was applied.
Commit be62adb492 ("x86, microcode, AMD: Simplify ucode verification")
added current_size in get_matching_microcode(). This is calculated as
size of the ucode patch + 8 (ie. size of the header). Later this is
compared against the maximum possible ucode patch size for a CPU family.
And of course this fails if the patch already has the maximum size.
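The arithmetic can be shown with a small sketch. The maximum size used
here is only an illustrative number, the function names are
hypothetical, and the "fixed" variant is one possible way to do the
comparison, not necessarily what the patch does:

```c
#include <stdint.h>

#define SECTION_HDR_SIZE	8u	/* header size, per the text above */
#define FAMILY_MAX_SIZE		1824u	/* illustrative per-family maximum */

/* Problematic variant: the header is added before comparing. */
int size_ok_with_header(uint32_t patch_size)
{
	uint32_t current_size = patch_size + SECTION_HDR_SIZE;

	return current_size <= FAMILY_MAX_SIZE;
}

/* Compare the bare patch size against the family maximum instead. */
int size_ok_patch_only(uint32_t patch_size)
{
	return patch_size <= FAMILY_MAX_SIZE;
}
```

A patch of exactly FAMILY_MAX_SIZE bytes is rejected by the first
variant because the 8 header bytes push it over the limit, while the
second variant accepts it.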
Cc: <stable@vger.kernel.org> [3.3+]
Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344361461-10076-1-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Probably a leftover from the early days of self-patching, p6nops
are marked __initconst_or_module, which causes them to be
discarded in a non-modular kernel. If something later triggers
patching, it will overwrite kernel code with garbage.
Reported-by: Tomas Racek <tracek@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
Cc: Michael Tokarev <mjt@tls.msk.ru>
Cc: Borislav Petkov <borislav.petkov@amd.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: qemu-devel@nongnu.org
Cc: Anthony Liguori <anthony@codemonkey.ws>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
Cc: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When one CPU is going down and this CPU is the last one in irq
affinity, current code is setting cpu_all_mask as the new
affinity for that irq.
But for some systems (such as in Medfield Android mobile) the
firmware sends the interrupt to each CPU in the irq affinity
mask, averaged, and cpu_all_mask includes all potential CPUs,
i.e. offline ones as well.
So replace cpu_all_mask with cpu_online_mask.
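A simplified model of the fallback, using a plain 64-bit word as the
CPU mask. The type and function names are made up for this sketch and
the real code works on struct cpumask:

```c
#include <stdint.h>

typedef uint64_t mask_t;	/* one bit per CPU, sketch only */

/*
 * Compute the irq affinity after CPU `dying` goes offline. If no
 * online CPU is left in the affinity mask, fall back to the set of
 * online CPUs rather than to all possible CPUs.
 */
mask_t fixup_affinity(mask_t affinity, mask_t online, unsigned int dying)
{
	online &= ~((mask_t)1 << dying);

	if (affinity & online)
		return affinity;	/* still has an online target */
	return online;			/* never includes offline CPUs */
}
```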
Signed-off-by: liu chuansheng <chuansheng.liu@intel.com>
Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A137286@SHSMSX101.ccr.corp.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The former conversion to irq_domain_add_legacy() did not fully work
since we miss the irq descs for NR_IRQS_LEGACY+.
Ideally we could use irq_domain_add_simple() or the no-map variant (and
program the virq <-> line mapping directly into ioapic) but this would
require a different irq lookup in "do_IRQ()" and won't work with ACPI
without changes. So this is probably easiest for everyone.
Tested-by: Thierry Reding <thierry.reding@avionic-design.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Grant Likely <grant.likely@secretlab.ca>
Link: http://lkml.kernel.org/r/20120813202304.GA3529@breakpoint.cc
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
. Fix include order for bison/flex-generated C files, from Ben Hutchings
. Build fixes and documentation corrections from David Ahern
. Group parsing support, from Jiri Olsa
. UI/gtk refactorings and improvements from Namhyung Kim
. NULL deref fix for perf script, from Namhyung Kim
. Assorted cleanups from Robert Richter
. Let O= makes handle relative paths, from Steven Rostedt
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core
Pull perf/core improvements and fixes from Arnaldo Carvalho de Melo:
* Fix include order for bison/flex-generated C files, from Ben Hutchings
* Build fixes and documentation corrections from David Ahern
* Group parsing support, from Jiri Olsa
* UI/gtk refactorings and improvements from Namhyung Kim
* NULL deref fix for perf script, from Namhyung Kim
* Assorted cleanups from Robert Richter
* Let O= makes handle relative paths, from Steven Rostedt
* perf script python fixes, from Feng Tang.
* Improve 'perf lock' error message when the needed tracepoints
are not present, from David Ahern.
* Initial bash completion support, from Frederic Weisbecker
* Allow building without libelf, from Namhyung Kim.
* Support DWARF CFI based unwind to have callchains when %bp
based unwinding is not possible, from Jiri Olsa.
* Symbol resolution fixes: while fixing support for PPC64 files with an .opt ELF
section was the end goal, several fixes and cleanups for code that handles all
architectures are included, from Cody Schafer.
* Add a description for the JIT interface, from Andi Kleen.
* Assorted fixes for Documentation and build in 32 bit, from Robert Richter
* Add support for non-tracepoint events in perf script python, from Feng Tang
* Cache the libtraceevent event_format associated to each evsel early, so that we
avoid relookups, i.e. calling pevent_find_event repeatedly when processing
tracepoint events.
[ This is to reduce the surface contact with libtraceevent and make clear what
it is that the perf tools need from that lib: so far parsing the common and per
event fields. ]
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull ftrace updates from Steve Rostedt:
" This patch series extends ftrace function tracing utility to be
more dynamic for its users. It allows for data passing to the callback
functions, as well as reading regs as if a breakpoint were to trigger
at function entry.
The main goal of this patch series was to allow kprobes to use ftrace
as an optimized probe point when a probe is placed on an ftrace nop.
With lots of help from Masami Hiramatsu, and going through lots of
iterations, we finally came up with a good solution. "
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 fixes from Ingo Molnar.
An x32 socket ABI fix with a -stable backport tag among other fixes.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x32: Use compat shims for {g,s}etsockopt
Revert "x86-64/efi: Use EFI to deal with platform wall clock"
x86, apic: fix broken legacy interrupts in the logical apic mode
x86, build: Globally set -fno-pic
x86, avx: don't use avx instructions with "noxsave" boot param
else, host continues to update stealtime after reboot,
which can corrupt e.g. initramfs area.
Found when tracking down initramfs unpack error on initial reboot
(with qemu-kvm -smp 2, no problem with single-core).
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Recent commit 332afa656e cleaned up
a workaround that updates irq_cfg domain for legacy irq's that
are handled by the IO-APIC. This was assuming that the recent
changes in assign_irq_vector() were sufficient to remove the workaround.
But this broke a couple of AMD platforms. One of them seems to be
sending interrupts to the offline cpu's, resulting in spurious
"No irq handler for vector xx (irq -1)" messages when those cpu's come online.
And the other platform seems to always send the interrupt to the last logical
CPU (cpu-7). Recent changes had an unintended side effect of using only logical
cpu-0 in the IO-APIC RTE (during boot for the legacy interrupts), so the
legacy interrupts were no longer routed to cpu-7 on the AMD
platform, resulting in a boot hang.
For now, reintroduce the removed workaround (essentially not allowing the
vector to change for legacy irq's when io-apic starts to handle the irq, which
also addresses the unintended side effect of just specifying cpu-0 in the
IO-APIC RTE for those irq's during boot).
Reported-and-tested-by: Robert Richter <robert.richter@amd.com>
Reported-and-tested-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1344453412.29170.5.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
If PMU counter has PEBS enabled it is not enough to disable counter
on a guest entry since PEBS memory write can overshoot guest entry
and corrupt guest memory. Disabling PEBS during guest entry solves
the problem.
Tested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Gleb Natapov <gleb@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120809085234.GI3341@redhat.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
The Westmere-EX uncore is similar to the Nehalem-EX uncore. The
differences are:
- Westmere-EX uncore has 10 instances of Cbox. The MSRs for Cbox8
and Cbox9 in the Westmere-EX aren't contiguous with Cbox 0~7.
- The fvid field in the ZDP_CTL_FVC register in the Mbox is
different. It's 5 bits in the Nehalem-EX, 6 bits in the
Westmere-EX.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344229882-3907-3-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This patch includes the following fixes and updates:
- Only some events in the Sbox and Mbox can use the match/mask
registers, add code to check this.
- The format definitions for xbr_mm_cfg and xbr_match registers
in the Rbox are wrong, xbr_mm_cfg should use 32 bits, xbr_match
should use 64 bits.
- Cleanup the Rbox code. Compute the addresses of extra registers in
the enable_event function instead of the hw_config function.
This simplifies the code in nhmex_rbox_alter_er().
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1344229882-3907-2-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Fix the following section mismatch:
WARNING: arch/x86/kernel/cpu/built-in.o(.text+0x7ad9): Section mismatch in reference from the function uncore_types_exit() to the function .init.text:uncore_type_exit()
The function uncore_types_exit() references the function __init
uncore_type_exit(). This is often because uncore_types_exit lacks a
__init annotation or the annotation of uncore_type_exit is wrong.
Caused by 14371cce03 ("perf: Add generic PCI uncore PMU device
support").
Cc: Zheng Yan <zheng.z.yan@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-8-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Introducing PERF_SAMPLE_REGS_USER sample type bit to trigger the dump of
user level registers on sample. Registers we want to dump are specified
by sample_regs_user bitmask.
Only user level registers are dumped at the moment. Meaning the register
values of the user space context as it was before the user entered the
kernel for whatever reason (syscall, irq, exception, or a PMI happening
in userspace).
The layout of the sample_regs_user bitmap is described in
asm/perf_regs.h for archs that support register dump.
This is going to be useful to bring Dwarf CFI based stack unwinding on
top of samples.
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Dump registers ABI specification. ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Suggested-by: Stephane Eranian <eranian@google.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-3-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
This brings a new API to help the selective dump of registers on event
sampling, and its implementation for x86 arch.
Added HAVE_PERF_REGS config option to determine if the architecture
provides perf registers ABI.
The information about desired registers will be passed in u64 mask.
It's up to the architecture to map the registers into the mask bits.
For the x86 arch implementation, both 32 and 64 bit register bits are
defined within single enum to ensure 64 bit system can provide register
dump for compat task if needed in the future.
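To illustrate how a u64 mask can select registers, here is a small
sketch. The register enum and the dump_regs() helper are hypothetical
and far smaller than a real architecture's set. The point is only
that bit N of the mask selects register index N and that values are
emitted in ascending bit order:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register indices; each arch defines its own enum. */
enum { REG_AX, REG_BX, REG_CX, REG_DX, REG_MAX };

/*
 * Copy the registers selected by `mask` into out[], in ascending bit
 * order. Returns the number of values written. out[] must have room
 * for REG_MAX entries.
 */
size_t dump_regs(const uint64_t regs[REG_MAX], uint64_t mask, uint64_t *out)
{
	size_t n = 0;

	for (unsigned int bit = 0; bit < REG_MAX; bit++)
		if (mask & (1ULL << bit))
			out[n++] = regs[bit];
	return n;
}
```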
Original-patch-by: Frederic Weisbecker <fweisbec@gmail.com>
[ Added missing linux/errno.h include ]
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Arun Sharma <asharma@fb.com>
Cc: Benjamin Redelings <benjamin.redelings@nescent.org>
Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
Cc: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Frank Ch. Eigler <fche@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Stephane Eranian <eranian@google.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: Ulrich Drepper <drepper@gmail.com>
Link: http://lkml.kernel.org/r/1344345647-11536-2-git-send-email-jolsa@redhat.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
On Intel systems corrected machine check interrupts (CMCI) may be sent to
multiple logical processors; possibly to all processors on the affected
socket (SDM Volume 3B "15.5.1 CMCI Local APIC Interface"). This means
that a persistent error (such as a stuck bit in ECC memory) may cause
a storm of interrupts that greatly hinders or prevents forward progress
(probably on many processors).
To solve this we keep track of the rate at which each processor sees
CMCI. If we exceed a threshold, we disable CMCI delivery and switch to
polling the machine check banks. If the storm subsides (none of the
affected processors see any more errors for a complete poll interval) we
re-enable CMCI.
[Tony: Added console messages when storm begins/ends and increased storm
threshold from 5 to 15 so we have a few more logged entries before we
disable interrupts and start dropping reports]
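The storm handling can be modelled as a two-state machine. This sketch
tracks a single processor only and uses made-up names. The threshold
of 15 comes from the note above; everything else (per-interval
counting, how polling results are reported) is a simplifying
assumption:

```c
#define STORM_THRESHOLD 15	/* per the note above */

enum cmci_mode { MODE_INTERRUPT, MODE_POLL };

struct cmci_state {
	enum cmci_mode mode;
	unsigned int count;	/* CMCIs seen in the current interval */
};

/* Account for one CMCI; switch to polling once the rate is too high. */
enum cmci_mode cmci_seen(struct cmci_state *s)
{
	if (s->mode == MODE_INTERRUPT && ++s->count > STORM_THRESHOLD)
		s->mode = MODE_POLL;
	return s->mode;
}

/*
 * End of a poll interval: if polling found no errors, the storm has
 * subsided and interrupt delivery can be re-enabled.
 */
enum cmci_mode interval_end(struct cmci_state *s, unsigned int errors_polled)
{
	if (s->mode == MODE_POLL && errors_polled == 0)
		s->mode = MODE_INTERRUPT;
	s->count = 0;
	return s->mode;
}
```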
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
cmci_discover() works out which machine check banks support CMCI, and
which of those are shared by multiple logical processors. It uses this
information to ensure that exactly one cpu is designated the owner of
each bank so that when interrupts are broadcast to multiple cpus, only one
of them will look in a shared bank to log the error and clear the bank.
At boot time cmci_discover() performs this task silently. But during
certain cpu hotplug operations it prints out a set of summary lines
like this:
CPU 35 MCA banks CMCI:0 CMCI:1 CMCI:3 CMCI:5 CMCI:6 CMCI:7 CMCI:8 CMCI:9 CMCI:10 CMCI:11
CPU 1 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 39 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 38 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 32 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 37 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 36 MCA banks CMCI:0 CMCI:1 CMCI:3
CPU 34 MCA banks CMCI:0 CMCI:1 CMCI:3
The value of these messages seems very low. A user might painstakingly
cross-check against the data sheet for a processor to ensure that all
CMCI supported banks are correctly reported, but this seems improbable.
If users really wanted to do this, we should print the information at
boot time too.
Remove the messages.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Clear AVX, AVX2 features along with clearing XSAVE feature bits,
as part of the parsing "noxsave" parameter.
Fixes the kernel boot panic with "noxsave" boot parameter.
We could have checked cpu_has_osxsave along with cpu_has_avx etc, but Peter
mentioned clearing the feature bits will be better for uses like
static_cpu_has() etc.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1343755754.2041.2.camel@sbsiddha-desk.sc.intel.com
Cc: <stable@vger.kernel.org> # v3.5
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Run the mprotect.c microbenchmark on all our families >= K8 and preset
the flushall shift variable accordingly.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344272439-29080-5-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Push the max CPUID leaf check into the ->detect_tlb function and remove
general test case from the generic path.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344272439-29080-3-git-send-email-bp@amd64.org
Acked-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
The TLB characteristics appeared like this in dmesg:
[ 0.065817] Last level iTLB entries: 4KB 512, 2MB 1024, 4MB 512
[ 0.065817] Last level dTLB entries: 4KB 1024, 2MB 1024, 4MB 512
[ 0.065817] tlb_flushall_shift is 0xffffffff
where tlb_flushall_shift is actually -1 but dumped as a hex number.
However, the Kconfig option CONFIG_DEBUG_TLBFLUSH and the rest of the
code treats this as a signed decimal and states "If you set it to -1,
the code flushes the whole TLB unconditionally."
So, fix its formatting in accordance with the other references to it.
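The difference between the two format specifiers is easy to reproduce
in userspace. This is only a demonstration of the formatting issue
with a hypothetical helper, not the kernel's print statement, and it
assumes a 32-bit int:

```c
#include <stdio.h>

/* Format the shift both ways: as hex (old) and signed decimal (new). */
void format_shift(int shift, char *hex, char *dec, size_t len)
{
	snprintf(hex, len, "0x%x", (unsigned int)shift);
	snprintf(dec, len, "%d", shift);
}
```

For a shift of -1 the hex form reads 0xffffffff, while the decimal
form reads -1 and matches the Kconfig help text.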
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1344272439-29080-2-git-send-email-bp@amd64.org
Acked-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Pull ACPI and power management fixes from Len Brown:
"A 3.3 sleep regression fixed, numa bugfix, plus some minor cleanups"
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
ACPI processor: Fix tick_broadcast_mask online/offline regression
ACPI: Only count valid srat memory structures
ACPI: Untangle a return statement for better readability
ACPI / PCI: Do not try to acquire _OSC control if that is hopeless
ACPI: delete _GTS/_BFS support
ACPI/x86: revert 'x86, acpi: Call acpi_enter_sleep_state via an asmlinkage C function from assembler'
ACPI: replace strlen("string") with sizeof("string") -1
ACPI / PM: Fix build warning in sleep.c for CONFIG_ACPI_SLEEP unset
No point in having double cases if we can simply mask the FROZEN bit
out.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Split timer init function into the init and the start part, so the
start part can replace the open coded version in CPU_DOWN_FAILED.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
raise_mce() fiddles with global state, but lacks any kind of
serialization.
Add a mutex around the raise_mce() call, so concurrent writers do not
stomp on each other's toes.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
raise_mce() has a code path which does not disable preemption when the
raise_local() is called. The per cpu variable access in raise_local()
depends on preemption being disabled to be functional. So that code
path was either never tested or never tested with CONFIG_DEBUG_PREEMPT
enabled.
Add the missing preempt_disable/enable() pair around the call.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull x86 fixes from Ingo Molnar:
"Various fixes"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, kcmp: The kcmp system call can be common
arch/x86/kernel/kdebugfs.c: Ensure a consistent return value in error case
x86/mce: Add quirk for instruction recovery on Sandy Bridge processors
x86/mce: Move MCACOD defines from mce-severity.c to <asm/mce.h>
x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs
x86, nops: Missing break resulting in incorrect selection on Intel
x86: CONFIG_CC_STACKPROTECTOR=y is no longer experimental
Pull perf fixes from Ingo Molnar:
"Fix merge window fallout and fix sleep profiling (this was always
broken, so it's not a fix for the merge window - we can skip this one
from the head of the tree)."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/trace: Add ability to set a target task for events
perf/x86: Fix USER/KERNEL tagging of samples properly
perf/x86/intel/uncore: Make UNCORE_PMU_HRTIMER_INTERVAL 64-bit
Pull perf updates from Ingo Molnar:
"The biggest changes are Intel Nehalem-EX PMU uncore support, uprobes
updates/cleanups/fixes from Oleg and diverse tooling updates (mostly
fixes) now that Arnaldo is back from vacation."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (40 commits)
uprobes: __replace_page() needs munlock_vma_page()
uprobes: Rename vma_address() and make it return "unsigned long"
uprobes: Fix register_for_each_vma()->vma_address() check
uprobes: Introduce vaddr_to_offset(vma, vaddr)
uprobes: Teach build_probe_list() to consider the range
uprobes: Remove insert_vm_struct()->uprobe_mmap()
uprobes: Remove copy_vma()->uprobe_mmap()
uprobes: Fix overflow in vma_address()/find_active_uprobe()
uprobes: Suppress uprobe_munmap() from mmput()
uprobes: Uprobe_mmap/munmap needs list_for_each_entry_safe()
uprobes: Clean up and document write_opcode()->lock_page(old_page)
uprobes: Kill write_opcode()->lock_page(new_page)
uprobes: __replace_page() should not use page_address_in_vma()
uprobes: Don't recheck vma/f_mapping in write_opcode()
perf/x86: Fix missing struct before structure name
perf/x86: Fix format definition of SNB-EP uncore QPI box
perf/x86: Make bitfield unsigned
perf/x86: Fix LLC-* and node-* events on Intel SandyBridge
perf/x86: Add Intel Nehalem-EX uncore support
perf/x86: Fix typo in format definition of uncore PCU filter
...
Some PMUs don't provide a full register set for their sample,
specifically 'advanced' PMUs like AMD IBS and Intel PEBS which provide
'better' than regular interrupt accuracy.
In this case we use the interrupt regs as basis and over-write some
fields (typically IP) with different information.
The perf core however uses user_mode() to distinguish user/kernel
samples, user_mode() relies on regs->cs. If the interrupt skid pushed
us over a boundary the new IP might not be in the same domain as the
interrupt.
Commit ce5c1fe9a9 ("perf/x86: Fix USER/KERNEL tagging of samples")
tried to fix this by making the perf core use kernel_ip(). This
however is wrong (TM), as pointed out by Linus, since it doesn't allow
for VM86 and non-zero based segments in IA32 mode.
Therefore, provide a new helper to set the regs->ip field,
set_linear_ip(), which massages the regs into a suitable state
assuming the provided IP is in fact a linear address.
Also modify perf_instruction_pointer() and perf_callchain_user() to
deal with segments base offsets.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341910954.3462.102.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
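To make the user/kernel tagging problem above more concrete, here is a simplified userspace model. It is only a sketch: the struct, the selector values and all demo_ names are assumptions made for the example, and the VM86 and segment-base handling mentioned in the commit is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical selector values; only the low two bits (privilege level) matter here. */
#define DEMO_KERNEL_CS	0x10
#define DEMO_USER_CS	0x33

struct demo_regs {
	uint64_t ip;
	uint16_t cs;
};

/* Assumes a 64-bit layout where kernel addresses occupy the upper half. */
int demo_kernel_ip(uint64_t ip)
{
	return (ip >> 63) != 0;
}

/* Overwrite the IP and bring cs in line with the domain the new IP belongs to. */
void demo_set_linear_ip(struct demo_regs *regs, uint64_t ip)
{
	regs->cs = demo_kernel_ip(ip) ? DEMO_KERNEL_CS : DEMO_USER_CS;
	regs->ip = ip;
}

/* user_mode()-style check that looks only at the privilege bits in cs. */
int demo_user_mode(const struct demo_regs *regs)
{
	return (regs->cs & 3) == 3;
}

/* Self-check: interrupt taken in user mode, precise IP points into the kernel. */
void demo_linear_ip_check(void)
{
	struct demo_regs regs = { 0x400000, DEMO_USER_CS };

	assert(demo_user_mode(&regs));
	demo_set_linear_ip(&regs, 0xffffffff81000000ULL);
	assert(!demo_user_mode(&regs));
}
```

If only regs.ip were overwritten, the sample in the self-check would still be tagged as a user sample despite carrying a kernel address.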
i386 allmodconfig:
arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function 'uncore_pmu_hrtimer':
arch/x86/kernel/cpu/perf_event_intel_uncore.c:728: warning: integer overflow in expression
arch/x86/kernel/cpu/perf_event_intel_uncore.c: In function 'uncore_pmu_start_hrtimer':
arch/x86/kernel/cpu/perf_event_intel_uncore.c:735: warning: integer overflow in expression
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-h84qlqj02zrojmxxybzmy9hi@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
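The "integer overflow in expression" warnings above come from a product that no longer fits once the arithmetic is done in 32 bits. A small sketch of the effect follows; the 60-second interval and the demo_ names are assumptions for illustration, and unsigned arithmetic is used so that the wraparound is well defined in the demo:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000

/* Product computed in 32 bits, as on a target where long is 32 bits wide. */
uint32_t demo_interval_32bit(uint32_t seconds)
{
	return seconds * (uint32_t)DEMO_NSEC_PER_SEC;
}

/* Widening one operand first keeps the whole product in 64 bits. */
int64_t demo_interval_64bit(int64_t seconds)
{
	return seconds * DEMO_NSEC_PER_SEC;
}

/* Self-check: 60 s in nanoseconds does not survive 32-bit arithmetic. */
void demo_interval_check(void)
{
	assert(demo_interval_64bit(60) == 60000000000LL);
	assert((int64_t)demo_interval_32bit(60) != 60000000000LL);
}
```

Making the constant 64-bit at the point of definition avoids having to remember a cast at every use.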
Add function-tracer-based kprobe optimization support
handlers on x86. This allows kprobes to use the function
tracer for probing on the mcount call.
Link: http://lkml.kernel.org/r/20120605102838.27845.26317.stgit@localhost.localdomain
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
Cc: "Frank Ch. Eigler" <fche@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Updated to new port of ftrace save regs functions ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The graph caller is called by the mcount callers, which already do
the check against the function_trace_stop variable. No reason to
check it again.
Link: http://lkml.kernel.org/r/20120711195745.588538769@goodmis.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The final position of the stack after saving regs and setting up
the parameters for ftrace_regs_call, is the position of the pt_regs
needed for the 4th parameter. Instead of saving it into a temporary
reg and pushing the reg, simply push the stack pointer.
Link: http://lkml.kernel.org/r/1342702344.12353.16.camel@gandalf.stny.rr.com
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Commit cd74257b97 patched up GTS/BFS -- a feature we want to remove.
So revert it (by hand, due to conflict in sleep.h)
to prepare for GTS/BFS removal.
Signed-off-by: Len Brown <len.brown@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
There are two ways to create /sys/firmware/memmap/X sysfs entries:
- firmware_map_add_early
When the system starts, it is called from e820_reserve_resources()
- firmware_map_add_hotplug
When memory is hot-plugged, it is called from add_memory()
But these functions are called with inconsistent values for the end
argument:
- end argument of firmware_map_add_early() : start + size - 1
- end argument of firmware_map_add_hotplug() : start + size
The patch unifies them to "start + size". Even with the patch applied,
the content of the /sys/firmware/memmap/X/end file does not change.
[akpm@linux-foundation.org: clarify comments]
Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
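The end-argument convention described above can be sketched as follows. This is one possible way to keep the reported value stable and is not claimed to be how the kernel code is structured; the struct and function names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical map entry; end is stored exclusive, i.e. start + size. */
struct demo_map_entry {
	uint64_t start;
	uint64_t end;
};

/* Both the boot-time and the hotplug path pass "start + size" as the end. */
struct demo_map_entry demo_map_add(uint64_t start, uint64_t end)
{
	struct demo_map_entry e = { start, end };

	return e;
}

/* Value reported to userspace: the last byte that belongs to the range. */
uint64_t demo_map_show_end(const struct demo_map_entry *e)
{
	return e->end - 1;
}

/* Self-check: a 0x1000-byte region at 0x100000 reports 0x100fff as its end. */
void demo_map_check(void)
{
	struct demo_map_entry e = demo_map_add(0x100000, 0x100000 + 0x1000);

	assert(demo_map_show_end(&e) == 0x100fff);
}
```

With a single convention at the call sites, the off-by-one adjustment lives in exactly one place.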
Pull x86/mm changes from Peter Anvin:
"The big change here is the patchset by Alex Shi to use INVLPG to flush
only the affected pages when we only need to flush a small page range.
It also removes the special INVALIDATE_TLB_VECTOR interrupts (32
vectors!) and replaces them with an ordinary IPI function call."
Fix up trivial conflicts in arch/x86/include/asm/apic.h (added code next
to changed line)
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/tlb: Fix build warning and crash when building for !SMP
x86/tlb: do flush_tlb_kernel_range by 'invlpg'
x86/tlb: replace INVALIDATE_TLB_VECTOR by CALL_FUNCTION_VECTOR
x86/tlb: enable tlb flush range support for x86
mm/mmu_gather: enable tlb flush range in generic mmu_gather
x86/tlb: add tlb_flushall_shift knob into debugfs
x86/tlb: add tlb_flushall_shift for specific CPU
x86/tlb: fall back to flush all when meet a THP large page
x86/flush_tlb: try flush_tlb_single one by one in flush_tlb_range
x86/tlb_info: get last level TLB entry number of CPU
x86: Add read_mostly declaration/definition to variables from smp.h
x86: Define early read-mostly per-cpu macros
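A hedged sketch of the kind of decision the range-flush series above has to make: flush everything, or invalidate page by page. The threshold rule (TLB entries shifted right by tlb_flushall_shift) and all names are assumptions for illustration; only the "-1 flushes the whole TLB unconditionally" behaviour is taken from the Kconfig text quoted earlier.

```c
#include <assert.h>
#include <limits.h>

enum demo_flush { DEMO_FLUSH_ALL, DEMO_FLUSH_PAGES };

enum demo_flush demo_choose_flush(unsigned long nr_pages,
				  unsigned long tlb_entries,
				  int flushall_shift)
{
	unsigned long threshold;

	/* -1 means: always flush the whole TLB. */
	if (flushall_shift < 0)
		return DEMO_FLUSH_ALL;

	/* Shifting by the full width or more is undefined in C; use 0 instead. */
	if ((unsigned long)flushall_shift >= sizeof(unsigned long) * CHAR_BIT)
		threshold = 0;
	else
		threshold = tlb_entries >> flushall_shift;

	/* Small ranges are cheaper to invalidate one page at a time. */
	return nr_pages > threshold ? DEMO_FLUSH_ALL : DEMO_FLUSH_PAGES;
}

/* Self-check with 512 TLB entries and a shift of 5 (threshold of 16 pages). */
void demo_flush_check(void)
{
	assert(demo_choose_flush(4, 512, 5) == DEMO_FLUSH_PAGES);
	assert(demo_choose_flush(100, 512, 5) == DEMO_FLUSH_ALL);
	assert(demo_choose_flush(1, 512, -1) == DEMO_FLUSH_ALL);
}
```

A larger shift lowers the threshold, so the full flush is chosen sooner.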
Pull scheduler changes from Ingo Molnar:
"The biggest change is a performance improvement on SMP systems:
| 4 socket 40 core + SMT Westmere box, single 30 sec tbench
| runs, higher is better:
|
| clients      1     2     4     8    16    32    64   128
|..........................................................................
| pre         30    41   118   645  3769  6214 12233 14312
| post       299   603  1211  2418  4697  6847 11606 14557
|
| A nice increase in performance.
This speedup is particularly noticeable on heavily interacting
few-tasks workloads, so the changes should help desktop-style Xorg
workloads and interactivity as well, on multi-core CPUs.
There are also cpuset suspend behavior fixes/restructuring and various
smaller tweaks."
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix race in task_group()
sched: Improve balance_cpu() to consider other cpus in its group as target of (pinned) task
sched: Reset loop counters if all tasks are pinned and we need to redo load balance
sched: Reorder 'struct lb_env' members to reduce its size
sched: Improve scalability via 'CPU buddies', which withstand random perturbations
cpusets: Remove/update outdated comments
cpusets, hotplug: Restructure functions that are invoked during hotplug
cpusets, hotplug: Implement cpuset tree traversal in a helper function
CPU hotplug, cpusets, suspend: Don't modify cpusets during suspend/resume
sched/x86: Remove broken power estimation
Typically, the return value desired for the failure of a
function with an integer return value is a negative integer. In
these cases, the return value is sometimes a negative integer
and sometimes 0, due to a subsequent initialization of the
return variable within the loop.
A simplified version of the semantic match that finds this
problem is: (http://coccinelle.lip6.fr/)
//<smpl>
@r exists@
identifier ret;
position p;
constant C;
expression e1,e3,e4;
statement S;
@@
ret = -C
... when != ret = e3
when any
if@p (...) S
... when any
if (\(ret != 0\|ret < 0\|ret > 0\) || ...) { ... return ...; }
... when != ret = e3
when any
*if@p (...)
{
... when != ret = e4
return ret;
}
//</smpl>
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Link: http://lkml.kernel.org/r/1342284188-19176-7-git-send-email-Julia.Lawall@lip6.fr
Signed-off-by: Ingo Molnar <mingo@kernel.org>
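The pattern matched by the semantic patch above can be reduced to a small, self-contained example. The functions below are hypothetical and exist only to show the shape of the bug and of the fix:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical per-item step: succeeds (returns 0) for non-negative input. */
int demo_step(int v)
{
	return v >= 0 ? 0 : -EINVAL;
}

/* Buggy: ret is overwritten with 0 inside the loop, so a later failure
 * that relies on the initial value ends up returning 0. */
int demo_setup_buggy(const int *vals, size_t n, const void *resource)
{
	int ret = -ENOMEM;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = demo_step(vals[i]);
		if (ret < 0)
			return ret;
	}
	if (!resource)
		return ret;	/* 0 whenever the loop ran at least once */
	return 0;
}

/* Fixed: the error path sets its own return value explicitly. */
int demo_setup_fixed(const int *vals, size_t n, const void *resource)
{
	int ret;
	size_t i;

	for (i = 0; i < n; i++) {
		ret = demo_step(vals[i]);
		if (ret < 0)
			return ret;
	}
	if (!resource)
		return -ENOMEM;
	return 0;
}

/* Self-check: same failing input, different results. */
void demo_return_value_check(void)
{
	int vals[2] = { 1, 2 };

	assert(demo_setup_buggy(vals, 2, NULL) == 0);
	assert(demo_setup_fixed(vals, 2, NULL) == -ENOMEM);
}
```

With an empty input the buggy variant still returns -ENOMEM, which matches the "sometimes negative, sometimes 0" behaviour described in the commit text.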
Sandy Bridge processors follow the SDM (Vol 3B, Table 15-20) and
set both the RIPV and EIPV bits in the MCG_STATUS register to
zero for machine checks during instruction fetch. This is more
than a little counter-intuitive and means that Linux cannot
recover from these errors. Rather than insert special case code
at several places in mce.c and mce-severity.c, we pretend the
EIPV bit was set for just this case early in processing the
machine check.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/180a06f3f357cf9f78259ae443a082b14a29535b.1343078495.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
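A simplified sketch of the quirk described above: if neither RIPV nor EIPV is set and the error is an instruction fetch, pretend EIPV was set. The bit positions follow the MCG_STATUS layout (RIPV is bit 0, EIPV is bit 1); the function name and the boolean that stands in for decoding the bank status are assumptions made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MCG_STATUS_RIPV	(1ULL << 0)
#define DEMO_MCG_STATUS_EIPV	(1ULL << 1)

/* Returns the MCG_STATUS value that later severity checks should see. */
uint64_t demo_quirk_instr_fetch(uint64_t mcgstatus, bool instr_fetch_error)
{
	uint64_t ip_bits = DEMO_MCG_STATUS_RIPV | DEMO_MCG_STATUS_EIPV;

	/* Only touch the case where both bits are clear on an instruction fetch. */
	if (instr_fetch_error && !(mcgstatus & ip_bits))
		mcgstatus |= DEMO_MCG_STATUS_EIPV;
	return mcgstatus;
}

/* Self-check: the quirk fires only for the special case. */
void demo_quirk_check(void)
{
	assert(demo_quirk_instr_fetch(0, true) == DEMO_MCG_STATUS_EIPV);
	assert(demo_quirk_instr_fetch(0, false) == 0);
	assert(demo_quirk_instr_fetch(DEMO_MCG_STATUS_RIPV, true) ==
	       DEMO_MCG_STATUS_RIPV);
}
```

Applying the adjustment once, early in processing, is what lets the rest of the code stay free of special cases.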
We will need some of these values in mce.c. Move them to the
appropriate header file so they are available.
Acked-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Cc: Chen Gong <gong.chen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/0ccfb1af5fe35e537b7cd8e4d448bf7d851dbfb9.1343078495.git.tony.luck@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In the current kernel, percpu variable `vector_irq' is not always
cleared when a CPU is offlined. If the CPU that has the disabled
irqs in vector_irq is hotplugged again, __setup_vector_irq()
hits an invalid irq vector and may crash.
This bug can be reproduced as follows:
# echo 0 > /sys/devices/system/cpu/cpu7/online
# modprobe -r some_driver_using_interrupts # vector_irq@cpu7 uncleared
# echo 1 > /sys/devices/system/cpu/cpu7/online # kernel may crash
To fix this problem, this patch clears vector_irq in
__fixup_irqs() when the CPU is offlined.
This also reverts commit f6175f5bfb, which partially fixes
this bug by clearing vector in __clear_irq_vector(). But in
environments with IOMMU IRQ remapper, it could fail because
cfg->domain doesn't contain offlined CPUs. With this patch, the
fix in __clear_irq_vector() can be reverted because every
vector_irq is already cleared in __fixup_irqs() on offlined CPUs.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/20120726104732.2889.19144.stgit@kvmdev
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The event control register of SNB-EP uncore QPI box has a one bit
extension at bit position 21.
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1343097850-4348-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
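To show what a one-bit extension at bit 21 means for encoding an event, here is a sketch that packs a 9-bit event code into a control value. The assumption that the base event select occupies bits 0-7 is mine and is not stated in the commit; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_EVENT_EXT_BIT 21

/* Low 8 bits go to bits 0-7 (assumed), the ninth bit goes to bit 21. */
uint64_t demo_qpi_encode_event(unsigned int event)
{
	assert(event < 0x200);	/* 9-bit event code */
	return (uint64_t)(event & 0xff) |
	       ((uint64_t)((event >> 8) & 1) << DEMO_EVENT_EXT_BIT);
}

/* Inverse of the encoding above. */
unsigned int demo_qpi_decode_event(uint64_t config)
{
	return (unsigned int)(config & 0xff) |
	       ((unsigned int)((config >> DEMO_EVENT_EXT_BIT) & 1) << 8);
}

/* Self-check: an event code with the ninth bit set round-trips. */
void demo_qpi_check(void)
{
	assert(demo_qpi_encode_event(0x101) == 0x200001);
	assert(demo_qpi_decode_event(0x200001) == 0x101);
}
```

A format definition that covers only the low bits would silently drop the extension bit for such events.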
LLC-* and node-* events require using the OFFCORE_RESPONSE events
on SandyBridge, but the hw_cache_extra_regs is left uninitialized.
This patch adds the missing extra register configure table for
SandyBridge.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1342517275-2875-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The uncore subsystem in Nehalem-EX consists of 7 components
(U-Box, C-Box, B-Box, S-Box, R-Box, M-Box and W-Box). This
patch is large because the way to program these boxes is
diverse.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/4FF534F1.3030307@intel.com
[ Improved the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The format definition of uncore PCU filter should be filter_band*
instead of filter_brand*.
Reported-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1343024611-4692-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The Intel case falls through into the generic case which then changes
the values. For cases like the P6 it doesn't do the right thing so
this seems to be a screwup.
Signed-off-by: Alan Cox <alan@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-lww2uirad4skzjlmrm0vru8o@git.kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
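The fall-through problem described in the nops commit above is easy to reproduce in isolation. The enum values and function names below are hypothetical and stand in for the real vendor check and nop tables:

```c
#include <assert.h>

enum demo_vendor { DEMO_VENDOR_INTEL, DEMO_VENDOR_OTHER };
enum demo_nops { DEMO_NOPS_INTEL, DEMO_NOPS_GENERIC };

/* Buggy: the Intel case runs on into the generic one, which overwrites it. */
enum demo_nops demo_select_nops_buggy(enum demo_vendor vendor)
{
	enum demo_nops choice = DEMO_NOPS_GENERIC;

	switch (vendor) {
	case DEMO_VENDOR_INTEL:
		choice = DEMO_NOPS_INTEL;
		/* falls through - this is the bug, no break here */
	default:
		choice = DEMO_NOPS_GENERIC;
	}
	return choice;
}

/* Fixed: the break keeps the Intel selection intact. */
enum demo_nops demo_select_nops_fixed(enum demo_vendor vendor)
{
	enum demo_nops choice = DEMO_NOPS_GENERIC;

	switch (vendor) {
	case DEMO_VENDOR_INTEL:
		choice = DEMO_NOPS_INTEL;
		break;
	default:
		choice = DEMO_NOPS_GENERIC;
	}
	return choice;
}

/* Self-check: only the fixed variant honours the Intel case. */
void demo_nops_check(void)
{
	assert(demo_select_nops_buggy(DEMO_VENDOR_INTEL) == DEMO_NOPS_GENERIC);
	assert(demo_select_nops_fixed(DEMO_VENDOR_INTEL) == DEMO_NOPS_INTEL);
}
```

Both variants agree for every other vendor, which is why the bug only shows up on Intel.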
The most important part of these updates is the IOMMU groups code
enhancement written by Alex Williamson. It abstracts the problem that a
given hardware IOMMU can't isolate any given device from any other
device (e.g. 32 bit PCI devices can't usually be isolated). Devices that
can't be isolated are grouped together. This code is required for the
upcoming VFIO framework.
Another IOMMU-API change written by me is the introduction of domain
attributes. This makes it easier to handle GART-like IOMMUs with the
IOMMU-API because now the start-address and the size of the domain
address space can be queried.
Besides that there are a few cleanups and fixes for the NVidia Tegra
IOMMU drivers and the reworked init-code for the AMD IOMMU. The latter is
from my patch-set to support interrupt remapping. The rest of this
patch-set requires x86 changes which are not mergeable yet. So full
support for interrupt remapping with AMD IOMMUs will come in a future
merge window.
Merge tag 'iommu-updates-v3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu
Pull IOMMU updates from Joerg Roedel:
"The most important part of these updates is the IOMMU groups code
enhancement written by Alex Williamson. It abstracts the problem that
a given hardware IOMMU can't isolate any given device from any other
device (e.g. 32 bit PCI devices can't usually be isolated). Devices
that can't be isolated are grouped together. This code is required
for the upcoming VFIO framework.
Another IOMMU-API change written by me is the introduction of domain
attributes. This makes it easier to handle GART-like IOMMUs with the
IOMMU-API because now the start-address and the size of the domain
address space can be queried.
Besides that there are a few cleanups and fixes for the NVidia Tegra
IOMMU drivers and the reworked init-code for the AMD IOMMU. The
latter is from my patch-set to support interrupt remapping. The rest
of this patch-set requires x86 changes which are not mergeable yet. So
full support for interrupt remapping with AMD IOMMUs will come in a
future merge window."
* tag 'iommu-updates-v3.6-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
iommu/amd: Fix hotplug with iommu=pt
iommu/amd: Add missing spin_lock initialization
iommu/amd: Convert iommu initialization to state machine
iommu/amd: Introduce amd_iommu_init_dma routine
iommu/amd: Move unmap_flush message to amd_iommu_init_dma_ops()
iommu/amd: Split enable_iommus() routine
iommu/amd: Introduce early_amd_iommu_init routine
iommu/amd: Move informational prinks out of iommu_enable
iommu/amd: Split out PCI related parts of IOMMU initialization
iommu/amd: Use acpi_get_table instead of acpi_table_parse
iommu/amd: Fix sparse warnings
iommu/tegra: Don't call alloc_pdir with as->lock
iommu/tegra: smmu: Fix unsleepable memory allocation at alloc_pdir()
iommu/tegra: smmu: Remove unnecessary sanity check at alloc_pdir()
iommu/exynos: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/tegra: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/msm: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/omap: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/vt-d: Implement DOMAIN_ATTR_GEOMETRY attribute
iommu/amd: Implement DOMAIN_ATTR_GEOMETRY attribute
...
Host bridge hotplug
- Add MMCONFIG support for hot-added host bridges (Jiang Liu)
Device hotplug
- Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
- Call FINAL fixups for hot-added devices, too (Myron Stowe)
- Factor out generic code for P2P bridge hot-add (Yinghai Lu)
- Remove all functions in a slot, not just those with _EJx (Amos Kong)
Dynamic resource management
- Track bus number allocation (struct resource tree per domain) (Yinghai Lu)
- Make P2P bridge 1K I/O windows work with resource reassignment (Bjorn Helgaas, Yinghai Lu)
- Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
Power management
- Add PCIe runtime D3cold support (Huang Ying)
Virtualization
- Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex Williamson)
- Add quirks for devices with broken INTx masking (Jan Kiszka)
Miscellaneous
- Fix some PCI Express capability version issues (Myron Stowe)
- Factor out some arch code with a weak, generic, pcibios_setup() (Myron Stowe)
Merge tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI changes from Bjorn Helgaas:
"Host bridge hotplug:
- Add MMCONFIG support for hot-added host bridges (Jiang Liu)
Device hotplug:
- Move fixups from __init to __devinit (Sebastian Andrzej Siewior)
- Call FINAL fixups for hot-added devices, too (Myron Stowe)
- Factor out generic code for P2P bridge hot-add (Yinghai Lu)
- Remove all functions in a slot, not just those with _EJx (Amos
Kong)
Dynamic resource management:
- Track bus number allocation (struct resource tree per domain)
(Yinghai Lu)
- Make P2P bridge 1K I/O windows work with resource reassignment
(Bjorn Helgaas, Yinghai Lu)
- Disable decoding while updating 64-bit BARs (Bjorn Helgaas)
Power management:
- Add PCIe runtime D3cold support (Huang Ying)
Virtualization:
- Add VFIO infrastructure (ACS, DMA source ID quirks) (Alex
Williamson)
- Add quirks for devices with broken INTx masking (Jan Kiszka)
Miscellaneous:
- Fix some PCI Express capability version issues (Myron Stowe)
- Factor out some arch code with a weak, generic, pcibios_setup()
(Myron Stowe)"
* tag 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (122 commits)
PCI: hotplug: ensure a consistent return value in error case
PCI: fix undefined reference to 'pci_fixup_final_inited'
PCI: build resource code for M68K architecture
PCI: pciehp: remove unused pciehp_get_max_lnk_width(), pciehp_get_cur_lnk_width()
PCI: reorder __pci_assign_resource() (no change)
PCI: fix truncation of resource size to 32 bits
PCI: acpiphp: merge acpiphp_debug and debug
PCI: acpiphp: remove unused res_lock
sparc/PCI: replace pci_cfg_fake_ranges() with pci_read_bridge_bases()
PCI: call final fixups hot-added devices
PCI: move final fixups from __init to __devinit
x86/PCI: move final fixups from __init to __devinit
MIPS/PCI: move final fixups from __init to __devinit
PCI: support sizing P2P bridge I/O windows with 1K granularity
PCI: reimplement P2P bridge 1K I/O windows (Intel P64H2)
PCI: disable MEM decoding while updating 64-bit MEM BARs
PCI: leave MEM and IO decoding disabled during 64-bit BAR sizing, too
PCI: never discard enable/suspend/resume_early/resume fixups
PCI: release temporary reference in __nv_msi_ht_cap_quirk()
PCI: restructure 'pci_do_fixups()'
...
Pull trivial tree from Jiri Kosina:
"Trivial updates all over the place as usual."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (29 commits)
Fix typo in include/linux/clk.h .
pci: hotplug: Fix typo in pci
iommu: Fix typo in iommu
video: Fix typo in drivers/video
Documentation: Add newline at end-of-file to files lacking one
arm,unicore32: Remove obsolete "select MISC_DEVICES"
module.c: spelling s/postition/position/g
cpufreq: Fix typo in cpufreq driver
trivial: typo in comment in mksysmap
mach-omap2: Fix typo in debug message and comment
scsi: aha152x: Fix sparse warning and make printing pointer address more portable.
Change email address for Steve Glendinning
Btrfs: fix typo in convert_extent_bit
via: Remove bogus if check
netprio_cgroup.c: fix comment typo
backlight: fix memory leak on obscure error path
Documentation: asus-laptop.txt references an obsolete Kconfig item
Documentation: ManagementStyle: fixed typo
mm/vmscan: cleanup comment error in balance_pgdat
mm: cleanup on the comments of zone_reclaim_stat
...
* Performance improvement to lower the number of traps the hypervisor
has to do for 32-bit guests. Mainly for setting PTE entries and updating
TLS descriptors.
* MCE polling driver to collect the hypervisor MCE buffer and present it to
/dev/mcelog.
* Physical CPU online/offline support. When a privileged guest is booted
it is presented with virtual CPUs, which might have a 1:1 mapping to
physical CPUs but usually don't. This provides a mechanism to
offline/online physical CPUs.
Bug-fixes for:
* Coverity found fixes in the console and ACPI processor driver.
* PVonHVM kexec fixes along with some cleanups.
* Pages that fall within E820 gaps and non-RAM regions (and had been
released to hypervisor) would be populated back, but potentially in
non-RAM regions.
Merge tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen
Pull Xen update from Konrad Rzeszutek Wilk:
"Features:
* Performance improvement to lower the number of traps the hypervisor
has to do for 32-bit guests. Mainly for setting PTE entries and
updating TLS descriptors.
* MCE polling driver to collect the hypervisor MCE buffer and present
it to /dev/mcelog.
* Physical CPU online/offline support. When a privileged guest is
booted it is presented with virtual CPUs, which might have a 1:1
mapping to physical CPUs but usually don't. This provides a mechanism
to offline/online physical CPUs.
Bug-fixes for:
* Coverity found fixes in the console and ACPI processor driver.
* PVonHVM kexec fixes along with some cleanups.
* Pages that fall within E820 gaps and non-RAM regions (and had been
released to hypervisor) would be populated back, but potentially in
non-RAM regions."
* tag 'stable/for-linus-3.6-rc0-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
xen: populate correct number of pages when across mem boundary (v2)
xen PVonHVM: move shared_info to MMIO before kexec
xen: simplify init_hvm_pv_info
xen: remove cast from HYPERVISOR_shared_info assignment
xen: enable platform-pci only in a Xen guest
xen/pv-on-hvm kexec: shutdown watches from old kernel
xen/x86: avoid updating TLS descriptors if they haven't changed
xen/x86: add desc_equal() to compare GDT descriptors
xen/mm: zero PTEs for non-present MFNs in the initial page table
xen/mm: do direct hypercall in xen_set_pte() if batching is unavailable
xen/hvc: Fix up checks when the info is allocated.
xen/acpi: Fix potential memory leak.
xen/mce: add .poll method for mcelog device driver
xen/mce: schedule a workqueue to avoid sleep in atomic context
xen/pcpu: Xen physical cpus online/offline sys interface
xen/mce: Register native mce handler as vMCE bounce back point
x86, MCE, AMD: Adjust initcall sequence for xen
xen/mce: Add mcelog support for Xen platform
Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Avi Kivity:
"Highlights include
- full big real mode emulation on pre-Westmere Intel hosts (can be
disabled with emulate_invalid_guest_state=0)
- relatively small ppc and s390 updates
- PCID/INVPCID support in guests
- EOI avoidance (3.6 guests should perform better on 3.6 hosts on
interrupt intensive workloads)
- Lockless write faults during live migration
- EPT accessed/dirty bits support for new Intel processors"
Fix up conflicts in:
- Documentation/virtual/kvm/api.txt:
Stupid subchapter numbering, added next to each other.
- arch/powerpc/kvm/booke_interrupts.S:
PPC asm changes clashing with the KVM fixes
- arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c:
Duplicated commits through the kvm tree and the s390 tree, with
subsequent edits in the KVM tree.
* tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits)
KVM: fix race with level interrupts
x86, hyper: fix build with !CONFIG_KVM_GUEST
Revert "apic: fix kvm build on UP without IOAPIC"
KVM guest: switch to apic_set_eoi_write, apic_write
apic: add apic_set_eoi_write for PV use
KVM: VMX: Implement PCID/INVPCID for guests with EPT
KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check
KVM: PPC: Critical interrupt emulation support
KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests
KVM: PPC64: booke: Set interrupt computation mode for 64-bit host
KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt
KVM: PPC: bookehv64: Add support for std/ld emulation.
booke: Added crit/mc exception handler for e500v2
booke/bookehv: Add host crit-watchdog exception support
KVM: MMU: document mmu-lock and fast page fault
KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint
KVM: MMU: trace fast page fault
KVM: MMU: fast path of handling guest page fault
KVM: MMU: introduce SPTE_MMU_WRITEABLE bit
KVM: MMU: fold tlb flush judgement into mmu_spte_update
...
The x86 sched power implementation has been broken forever and gets in
the way of other stuff, remove it.
[ For archaeological interest, fixing this code would require dealing
with the cross-cpu calling of these functions and more importantly, we
need to filter idle time out of the a/m-perf stuff because the ratio
will go down to 0 when idle, giving a 0 capacity which is not what
we'd want. ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Link: http://lkml.kernel.org/r/1339594110.8980.38.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86/mce changes from Ingo Molnar:
"This tree improves the AMD thresholding bank code and includes a
memory fault signal handling fixlet."
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults
x86, MCE, AMD: Update copyrights and boilerplate
x86, MCE, AMD: Give proper names to the thresholding banks
x86, MCE, AMD: Make error_count read only
x86, MCE, AMD: Cleanup reading of error_count
x86, MCE, AMD: Print decimal thresholding values
x86, MCE, AMD: Move shared bank to node descriptor
x86, MCE, AMD: Remove local_allocate_... wrapper
x86, MCE, AMD: Remove shared banks sysfs linking
x86, amd_nb: Export model 0x10 and later PCI id
Pull x86/reboot changes from Ingo Molnar:
"Now that the revamped x86 real-mode trampoline code is upstream and
seems to be working well, we can extend the 64-bit reboot code to be
as capable as the 32-bit one."
* 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86-64, reboot: Be more paranoid in 64-bit reboot=bios
x86, reboot: Drop redundant write of reboot_mode
x86-64, reboot: Allow reboot=bios and reboot-cpu override on x86-64
Pull x86 platform changes from Ingo Molnar:
"This tree mostly involves various APIC driver cleanups/robustization,
and vSMP motivated platform callback improvements/cleanups"
Fix up trivial conflict due to printk cleanup right next to return value
change.
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (29 commits)
Revert "x86/early_printk: Replace obsolete simple_strtoul() usage with kstrtoint()"
x86/apic/x2apic: Use multiple cluster members for the irq destination only with the explicit affinity
x86/apic/x2apic: Limit the vector reservation to the user specified mask
x86/apic: Optimize cpu traversal in __assign_irq_vector() using domain membership
x86/vsmp: Fix vector_allocation_domain's return value
irq/apic: Use config_enabled(CONFIG_SMP) checks to clean up irq_set_affinity() for UP
x86/vsmp: Fix linker error when CONFIG_PROC_FS is not set
x86/apic/es7000: Make apicid of a cluster (not CPU) from a cpumask
x86/apic/es7000+summit: Always make valid apicid from a cpumask
x86/apic/es7000+summit: Fix compile warning in cpu_mask_to_apicid()
x86/apic: Fix ugly casting and branching in cpu_mask_to_apicid_and()
x86/apic: Eliminate cpu_mask_to_apicid() operation
x86/x2apic/cluster: Vector_allocation_domain() should return a value
x86/apic/irq_remap: Silence a bogus pr_err()
x86/vsmp: Ignore IOAPIC IRQ affinity if possible
x86/apic: Make cpu_mask_to_apicid() operations check cpu_online_mask
x86/apic: Make cpu_mask_to_apicid() operations return error code
x86/apic: Avoid useless scanning thru a cpumask in assign_irq_vector()
x86/apic: Try to spread IRQ vectors to different priority levels
x86/apic: Factor out default vector_allocation_domain() operation
...
Pull debug-for-linus git tree from Ingo Molnar.
Fix up trivial conflict in arch/x86/kernel/cpu/perf_event_intel.c due to
a printk() having changed to a pr_info() differently in the two branches.
* 'x86-debug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Move call to print_modules() out of show_regs()
x86/mm: Mark free_initrd_mem() as __init
x86/microcode: Mark microcode_id[] as __initconst
x86/nmi: Clean up register_nmi_handler() usage
x86: Save cr2 in NMI in case NMIs take a page fault (for i386)
x86: Remove cmpxchg from i386 NMI nesting code
x86: Save cr2 in NMI in case NMIs take a page fault
x86/debug: Add KERN_<LEVEL> to bare printks, convert printks to pr_<level>
Pull x86/asm changes from Ingo Molnar:
"Assorted single-commit improvements, as usual"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/mtrr: Slightly simplify print_mtrr_state()
x86/mm/mtrr: Fix alignment determination in range_to_mtrr()
x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
x86/alternatives: Use atomic_xchg() instead atomic_dec_and_test() for stop_machine_text_poke()
This reverts commit fbd24153c4.
This commit is subtly buggy: kstrto*int() can return an error but
it's not checked in every path. simple_strtoul() on the other hand
could not fail, so this patch subtly introduces new failure modes.
Signed-off-by: Shuah Khan <shuahkhan@gmail.com>
Link: http://lkml.kernel.org/r/1338424803.3569.5.camel@lorien2
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are 3 funcs which need to be _initcalled in a logical sequence:
1. xen_late_init_mcelog
2. mcheck_init_device
3. threshold_init_device
xen_late_init_mcelog must register xen_mce_chrdev_device before the
native mce_chrdev_device registration when running under Xen;
mcheck_init_device must be initialized before threshold_init_device to
initialize mce_device, otherwise a NULL pointer dereference will cause a panic.
So we use the following _initcalls:
1. device_initcall(xen_late_init_mcelog);
2. device_initcall_sync(mcheck_init_device);
3. late_initcall(threshold_init_device);
when running under xen, the initcall order is 1,2,3;
on baremetal, we skip 1 and we do only 2 and 3.
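The ordering above can be sketched as a small userspace model. This is
only an illustration in Python, not kernel code: it assumes the usual
level order from include/linux/init.h (device_initcall runs before
device_initcall_sync, which runs before late_initcall), and the helpers
run_initcalls() and initcalls_for() are hypothetical names.

```python
# Userspace model of initcall ordering; not kernel code.
# Assumed level order: device (6) < device_sync (6s) < late (7).
LEVEL_ORDER = {"device": 0, "device_sync": 1, "late": 2}


def run_initcalls(registered):
    """Return initcall names in the order their levels would run.

    `registered` is a list of (level, name) pairs in arbitrary link order.
    sorted() is stable, so calls within one level keep their link order.
    """
    ordered = sorted(registered, key=lambda item: LEVEL_ORDER[item[0]])
    return [name for _, name in ordered]


def initcalls_for(platform):
    """Hypothetical helper: the initcalls that do work on a platform."""
    calls = [
        ("late", "threshold_init_device"),
        ("device_sync", "mcheck_init_device"),
    ]
    if platform == "xen":
        # Only under Xen does xen_late_init_mcelog register its device.
        calls.append(("device", "xen_late_init_mcelog"))
    return run_initcalls(calls)
```

Whatever order the three calls are linked in, the level decides when
each one runs, which is why mce_device is always set up before
threshold_init_device uses it.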
Acked-and-tested-by: Borislav Petkov <bp@amd64.org>
Suggested-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
When an MCA error occurs, it is handled by the Xen hypervisor first,
and then the error information is sent to the initial domain for logging.
This patch gets the error information from the Xen hypervisor and converts
the Xen-format error into a Linux-format mcelog. This logic is basically
self-contained and does not touch other kernel components.
Using tools such as mcelog, users can read specific error information,
just as they would under native Linux.
To test, follow the directions outlined in Documentation/acpi/apei/einj.txt
Acked-and-tested-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Ke, Liping <liping.ke@intel.com>
Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Add saving full regs for function tracing on i386.
The saving of regs was influenced by patches sent out by
Masami Hiramatsu.
Link: http://lkml.kernel.org/r/20120711195745.379060003@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add a way to have different functions calling different trampolines.
If a ftrace_ops wants regs saved on the return, then have only the
functions with ops registered to save regs. Functions registered by
other ops would not be affected, unless the functions overlap.
If one ftrace_ops registered functions A, B and C and another ops
registered functions to save regs on A and D, then only functions
A and D would be saving regs. Function B and C would work as normal.
Although A is registered by both ops: normal and saves regs; this is fine
as saving the regs is needed to satisfy one of the ops that calls it
but the regs are ignored by the other ops function.
x86_64 implements the full regs saving, and i386 just passes a NULL
for regs to satisfy the ftrace_ops passing: an arch must supply
both the regs and ftrace_ops parameters, even if regs is just NULL.
It is OK for an arch to pass NULL regs. All function trace users that
require regs passing must add the flag FTRACE_OPS_FL_SAVE_REGS when
registering the ftrace_ops. If the arch does not support saving regs
then the ftrace_ops will fail to register. The flag
FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED may be set that will prevent the
ftrace_ops from failing to register. In this case, the handler may
either check if regs is not NULL or check if ARCH_SUPPORTS_FTRACE_SAVE_REGS.
If the arch supports passing regs it will set this macro and pass regs
for ops that request them. All other archs will just pass NULL.
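The registration rules above can be modelled in a few lines. This is a
userspace sketch in Python, not the ftrace implementation: the flag
values are placeholders, and register_ops() and functions_saving_regs()
are hypothetical names used only for illustration.

```python
# Userspace model of the per-function regs decision; not kernel code.
SAVE_REGS = 1 << 0               # stands in for FTRACE_OPS_FL_SAVE_REGS
SAVE_REGS_IF_SUPPORTED = 1 << 1  # stands in for ..._SAVE_REGS_IF_SUPPORTED


def register_ops(registry, funcs, flags, arch_saves_regs):
    """Register one ops; return False when registration must fail."""
    wants_regs = bool(flags & (SAVE_REGS | SAVE_REGS_IF_SUPPORTED))
    if wants_regs and not arch_saves_regs:
        if not flags & SAVE_REGS_IF_SUPPORTED:
            return False          # regs required, arch cannot deliver
        wants_regs = False        # fall back: the handler will see NULL regs
    registry.append((frozenset(funcs), wants_regs))
    return True


def functions_saving_regs(registry):
    """Functions that need the regs-saving trampoline."""
    saving = set()
    for funcs, wants_regs in registry:
        if wants_regs:
            saving |= funcs
    return saving
```

With one ops on A, B and C and a regs-saving ops on A and D, only A and
D end up in the result, matching the example in the text.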
Link: http://lkml.kernel.org/r/20120711195745.107705970@goodmis.org
Cc: Alexander van Heukelum <heukelum@fastmail.fm>
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Add support of passing the current ftrace_ops into the 3rd parameter
of the callback to the function tracer.
Link: http://lkml.kernel.org/r/20120612225424.942411318@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Currently the function trace callback receives only the ip and parent_ip
of the function that it traced. It would be more powerful to also return
the ops that registered the function as well. This allows the same function
to act differently depending on what ftrace_ops registered it.
Link: http://lkml.kernel.org/r/20120612225424.267254552@goodmis.org
Reviewed-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Use apic_set_eoi_write, apic_write to avoid meddling in core apic
driver data structures directly.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
KVM PV EOI optimization overrides eoi_write apic op with its own
version. Add an API for this to avoid meddling with core x86 apic driver
data structures directly.
For KVM use, we don't need any guarantees about when the switch to the
new op will take place, so it could in theory use this API after SMP init,
but it currently doesn't, and restricting callers to early init makes it
clear that it's safe as it won't race with actual APIC driver use.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Avi Kivity <avi@redhat.com>
vsyscall_seccomp introduced a dependency on __secure_computing. On
configurations with CONFIG_SECCOMP disabled, compilation will fail.
Reported-by: feng xiangjun <fengxj325@gmail.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If a seccomp filter program is installed, older static binaries and
distributions with older libc implementations (glibc 2.13 and earlier)
that rely on vsyscall use will be terminated regardless of the filter
program policy when executing time, gettimeofday, or getcpu. This is
only the case when vsyscall emulation is in use (vsyscall=emulate is the
default).
This patch emulates system call entry inside a vsyscall=emulate by
populating regs->ax and regs->orig_ax with the system call number prior
to calling into seccomp such that all seccomp-dependencies function
normally. Additionally, system call return behavior is emulated in line
with other vsyscall entrypoints for the trace/trap cases.
[ v2: fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to luto@mit.edu) ]
Reported-and-tested-by: Owen Kibel <qmewlo@gmail.com>
Signed-off-by: Will Drewry <wad@chromium.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit dad1743e59 ("x86/mce: Only restart instruction after machine
check recovery if it is safe") we fixed mce_notify_process() to force a
signal to the current process if it was not restartable (RIPV bit not
set in MCG_STATUS). But doing it here means that the process doesn't
get told the virtual address of the fault via siginfo_t->si_addr. This
would prevent application level recovery from the fault.
Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
that we will provide the right information with the signal.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: stable@kernel.org # 3.4+
While debugging I noticed that, unlike all the other hypervisor code in the
kernel, kvm does not have an entry for x86_hyper. That entry is used in
detect_hypervisor_platform(), which results in a nice printk in the
syslog. This is only really a stub function, but it
does make kvm more consistent with the other hypervisors.
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Marcelo Tostatti <mtosatti@redhat.com>
Cc: kvm@vger.kernel.org
Signed-off-by: Avi Kivity <avi@redhat.com>
high_width can be easily calculated in a single expression when
making use of __ffs64().
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/4FF71053020000780008E1B5@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the variable operated on being of "unsigned long" type,
neither ffs() nor fls() are suitable to use on them, as those
truncate their arguments to 32 bits. Using __ffs() and __fls()
respectively at once eliminates the need to subtract 1 from their
results.
Additionally, with the alignment value subsequently used as a
shift count, it must be enforced to be less than BITS_PER_LONG
(and on 64-bit there's no need for it to be any smaller).
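The truncation problem can be shown with a small model. This is a
Python sketch under stated assumptions, not the kernel helpers
themselves: ffs32()/fls32() mimic the 1-based, 32-bit ffs()/fls(), and
ffs_long()/fls_long() mimic the 0-based __ffs()/__fls() on a 64-bit
unsigned long. All four function names are made up for illustration.

```python
# Userspace model of the bit helpers; not the kernel implementations.
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def ffs32(x):
    """Like ffs(): 1-based index of the lowest set bit in the low 32 bits, 0 if none."""
    x &= MASK32
    return (x & -x).bit_length()


def fls32(x):
    """Like fls(): 1-based index of the highest set bit in the low 32 bits, 0 if none."""
    return (x & MASK32).bit_length()


def ffs_long(x):
    """Like __ffs() on 64 bits: 0-based index of the lowest set bit; undefined for 0."""
    x &= MASK64
    if x == 0:
        raise ValueError("undefined for 0")
    return (x & -x).bit_length() - 1


def fls_long(x):
    """Like __fls() on 64 bits: 0-based index of the highest set bit; undefined for 0."""
    x &= MASK64
    if x == 0:
        raise ValueError("undefined for 0")
    return x.bit_length() - 1
```

For a value such as 1 << 36 the 32-bit variant sees no set bit at all,
while the long variant reports bit 36 directly, with no "- 1" needed.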
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/4FF70D54020000780008E179@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use tabs for "intel_perfmon_event_map" formatting in
perf_event_intel.c.
Signed-off-by: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: http://lkml.kernel.org/r/1341568786-7045-1-git-send-email-penberg@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
During boot or driver load etc, interrupt destination is setup
using default target cpu's. Later the user (irqbalance etc) or
the driver (irq_set_affinity/ irq_set_affinity_hint) can request
the interrupt to be migrated to some specific set of cpu's.
In the x2apic cluster routing, for the default scenario use
single cpu as the interrupt destination and when there is an
explicit interrupt affinity request, route the interrupt to
multiple members of a x2apic cluster specified in the cpumask of
the migration request.
This will minimize the vector pressure when there are a lot of
interrupt sources and relatively few x2apic clusters (for
example a single socket server). This will allow the performance
critical interrupts to be routed to multiple cpu's in the x2apic
cluster (irqbalance for example uses the cache siblings etc
while specifying the interrupt destination) and allow
non-critical interrupts to be serviced by a single logical cpu.
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/1340656709-11423-4-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For the x2apic cluster mode, vector for an interrupt is
currently reserved on all the cpu's that are part of the x2apic
cluster. But the interrupts will be routed only to the cluster
(derived from the first cpu in the mask) members specified in
the mask. So there is no need to reserve the vector in the
unused cluster members.
Modify __assign_irq_vector() to reserve the vectors based on the
user specified irq destination mask. If the new mask is a proper
subset of the currently used mask, cleanup the vector allocation
on the unused cpu members.
Also, allow the apic driver to tune the vector domain based on
the affinity mask (which in most cases is the user-specified
mask).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/1340656709-11423-3-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently __assign_irq_vector() goes through each cpu in the
specified mask until it finds a free vector in all the cpu's
that are part of the same interrupt domain. We visit all the
interrupt domain sibling cpus to reserve the free vector. So,
when we fail to find a free vector in an interrupt domain, it is
safe to continue our search with a cpu belonging to a new
interrupt domain. No need to go through each cpu, if the domain
containing that cpu is already visited.
Use the irq_cfg's old_domain to track the visited domains and
optimize the cpu traversal while finding a free vector in the
given cpumask.
NOTE: We can also optimize the search by using for_each_cpu() and
skip the current cpu, if it is not the first cpu in the mask
returned by the vector_allocation_domain(). But re-using the
cfg->old_domain to track the visited domains will be slightly
faster.
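The traversal idea can be sketched as follows. This is a userspace
Python model, not __assign_irq_vector(): the `visited` set plays the
role that cfg->old_domain plays in the text, and the callbacks
domain_of() and domain_has_free_vector() are hypothetical stand-ins for
vector_allocation_domain() and the per-domain vector search.

```python
# Userspace model of skipping already-visited interrupt domains.
def find_cpu_with_free_vector(mask, domain_of, domain_has_free_vector):
    """Return (cpu, probes).

    cpu is the first cpu in `mask` whose domain has a free vector, or
    None; probes counts how many domains were actually searched.
    """
    visited = set()
    probes = 0
    for cpu in mask:
        if cpu in visited:
            continue              # its domain was already searched
        domain = domain_of(cpu)
        probes += 1
        if domain_has_free_vector(domain):
            return cpu, probes
        visited |= set(domain)    # remember every sibling of this domain
    return None, probes
```

With 8 cpus in two clusters of 4, a failed search costs 2 probes rather
than 8, because cpus 1-3 and 5-7 are skipped once their cluster has
been seen.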
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Acked-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Link: http://lkml.kernel.org/r/1340656709-11423-2-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds C-Box and PCU filter support for SandyBridge-EP
uncore. We can filter C-Box events by thread/core ID and filter
PCU events by frequency/voltage.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341381616-12229-5-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The CBox manages the interface between the core and the LLC, so
the number of uncore CBox instances is equal to the number of cores.
Reported-by: Andrew Cooks <acooks@gmail.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1341381616-12229-4-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Stephane Eranian suggested using 0xff as a pseudo code for the fixed
uncore event and using the umask value to determine which of the
fixed events we want to map to. So far there is at most one fixed
counter in an uncore PMU. So just change the definition of
UNCORE_FIXED_EVENT to 0xff.
Suggested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340780953-21130-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
All these are basically boolean flags, use a bitfield to save a few
bytes.
Suggested-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-vsevd5g8lhcn129n3s7trl7r@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Recent Intel microcode resolved the SNB-PEBS issues, so conditionally
enable PEBS on SNB hardware depending on the microcode revision.
Thanks to Stephane for figuring out the various microcode revisions.
Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-v3672ziwh9damwqwh1uz3krm@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It might be of interest which perfctr msr failed.
Signed-off-by: Robert Richter <robert.richter@amd.com>
[ added hunk to avoid GCC warn ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-5-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is no need for keeping separate pmu structs. We can enable
amd_{get,put}_event_constraints() functions also for family 15h event.
The advantage is that there is only a single pmu struct for all AMD
cpus. This patch introduces functions to set up the pmu to enable core
performance counters or counter constraints.
Also, cpuid checks are used instead of family checks where
possible. Thus, it enables the code independently of cpu families if
the feature flag is set.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-4-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There is some Intel specific code in the generic x86 path. Move it to
intel_pmu_init().
Since p4 and p6 pmus don't have fixed counters we may skip the check
in case such a pmu is detected.
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-3-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
There are macros that are Intel specific and not x86 generic. Rename
them into INTEL_*.
This patch removes X86_PMC_IDX_GENERIC and does:
$ sed -i -e 's/X86_PMC_MAX_/INTEL_PMC_MAX_/g' \
arch/x86/include/asm/kvm_host.h \
arch/x86/include/asm/perf_event.h \
arch/x86/kernel/cpu/perf_event.c \
arch/x86/kernel/cpu/perf_event_p4.c \
arch/x86/kvm/pmu.c
$ sed -i -e 's/X86_PMC_IDX_FIXED/INTEL_PMC_IDX_FIXED/g' \
arch/x86/include/asm/perf_event.h \
arch/x86/kernel/cpu/perf_event.c \
arch/x86/kernel/cpu/perf_event_intel.c \
arch/x86/kernel/cpu/perf_event_intel_ds.c \
arch/x86/kvm/pmu.c
$ sed -i -e 's/X86_PMC_MSK_/INTEL_PMC_MSK_/g' \
arch/x86/include/asm/perf_event.h \
arch/x86/kernel/cpu/perf_event.c
Signed-off-by: Robert Richter <robert.richter@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1340217996-2254-2-git-send-email-robert.richter@amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge this branch because we want to rely on the newer (and saner)
microcode loading and checking facilities.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Several perf interrupt handlers (PEBS,IBS,BTS) re-write regs->ip but
do not update the segment registers. So use a regs->ip based test
instead of a regs->cs/regs->flags based test.
Reported-and-tested-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Link: http://lkml.kernel.org/n/tip-xxrt0a1zronm1sm36obwc2vy@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The reload interface should be per-system so that a full system ucode
reload happens (on each core) when doing
echo 1 > /sys/devices/system/cpu/microcode/reload
Move it to the cpu subsys directory instead of it being per-cpu.
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1340280437-7718-3-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Microcode reloading in a per-core manner is a very bad idea for both
major x86 vendors. And the thing is, we have such an interface, with which
we can end up with different microcode versions applied on different
cores of an otherwise homogeneous wrt (family,model,stepping) system.
So turn off the possibility of doing that per core and allow it only
system-wide.
This is a minimal fix which we'd like to see in stable too thus the
more-or-less arbitrary decision to allow system-wide reloading only on
the BSP:
$ echo 1 > /sys/devices/system/cpu/cpu0/microcode/reload
...
and disable the interface on the other cores:
$ echo 1 > /sys/devices/system/cpu/cpu23/microcode/reload
-bash: echo: write error: Invalid argument
Also, allowing the reload only from one CPU (the BSP in
that case) doesn't allow the reload procedure to degenerate
into an O(n^2) deal when triggering reloads from all
/sys/devices/system/cpu/cpuX/microcode/reload sysfs nodes
simultaneously.
A more generic fix will follow.
Cc: Henrique de Moraes Holschuh <hmh@hmh.eng.br>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1340280437-7718-2-git-send-email-bp@amd64.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org>
Pull ACPI & Power Management patches from Len Brown.
* 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux:
acpi_pad: fix power_saving thread deadlock
ACPI video: Still use ACPI backlight control if _DOS doesn't exist
ACPI, APEI, Avoid too much error reporting in runtime
ACPI: Add a quirk for "AMILO PRO V2030" to ignore the timer overriding
ACPI: Remove one board specific WARN when ignoring timer overriding
ACPI: Make acpi_skip_timer_override cover all source_irq==0 cases
ACPI, x86: fix Dell M6600 ACPI reboot regression via DMI
ACPI sysfs.c strlen fix
According to the Intel 64 and IA-32 SDM and Optimization Reference Manual, beginning
with Ivybridge, REP string operations using MOVSB and STOSB can provide both
flexible and high-performance string operations in cases like memory copy.
Enhancement availability is indicated by CPUID.7.0.EBX[9] (Enhanced REP MOVSB/
STOSB).
If CPU erms feature is detected, patch copy_user_generic with enhanced fast
string version of copy_user_generic.
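The feature test itself is a single bit check. The following Python
sketch only models the decision; it is not the ALTERNATIVE patching
code. The caller is assumed to have obtained EBX from CPUID leaf 7,
subleaf 0, and the function names and the returned labels are
hypothetical.

```python
# Userspace model of the ERMS decision; not the kernel patching code.
ERMS_BIT = 9  # CPUID.(EAX=7, ECX=0):EBX bit 9, as stated in the text


def has_erms(cpuid7_ebx):
    """True if the Enhanced REP MOVSB/STOSB bit is set in the EBX value."""
    return bool((cpuid7_ebx >> ERMS_BIT) & 1)


def pick_copy_variant(cpuid7_ebx):
    """Illustrative labels only: which copy routine would be patched in."""
    return "enhanced_fast_string" if has_erms(cpuid7_ebx) else "default"
```

In the kernel the choice is made once, when alternatives are applied at
boot, so the copy path carries no runtime branch on the feature bit.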
A few new macros are defined to reduce duplicate code in ALTERNATIVE and
ALTERNATIVE_2.
Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
Link: http://lkml.kernel.org/r/1337908785-14015-1-git-send-email-fenghua.yu@intel.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Pull x86 fixes from Ingo Molnar.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, cpufeature: Remove stray %s, add -w to mkcapflags.pl
x86, cpufeature: Catch duplicate CPU feature strings
x86, cpufeature: Rename X86_FEATURE_DTS to X86_FEATURE_DTHERM
x86: Fix kernel-doc warnings
x86, compat: Use test_thread_flag(TIF_IA32) in compat signal delivery
There are 32 INVALIDATE_TLB_VECTOR vectors in the kernel now. That is
quite a large number of vectors in the IDT, but it is still not enough,
since modern x86 servers have more CPUs than that. This still causes
heavy lock contention in TLB flushing.
This patch uses the generic SMP call function to replace them. That
saves 32 vectors in the IDT and resolves the lock contention in TLB
flushing on large systems.
On an NHM EX machine (4P * 8 cores * HT = 64 CPUs), hackbench pthread
shows a 3% performance increase.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-9-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Testing shows that different CPU types (microarchitectures and NUMA modes)
have different balance points between flushing the whole TLB and issuing
multiple invlpg. There are also cases where the TLB flush change does not
help at all.
This patch provides an interface that lets x86 vendor developers
set a different shift for each CPU type.
On the machines I have at hand, the balance point is 16 entries on
Romely-EP, 8 entries on Bloomfield NHM-EP, and 256 on an
IVB mobile CPU, but on a model 15 Core2 Xeon using invlpg does not
help.
For untested machines, do a conservative optimization, the same as for
the NHM CPU.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-5-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
For 4KB pages, x86 CPUs have a one- or two-level TLB: the first level
consists of the data TLB and the instruction TLB, the second level is a
TLB shared by both data and instructions.
For huge pages, there is usually just one TLB level, separated into
2MB/4MB and 1GB entries.
Although the size of each TLB level is important for performance tuning,
for general, coarse optimization the entry count of the last-level TLB is
suitable. And in fact, the last-level TLB always has the largest entry
count.
This patch gets the largest TLB entry count for use in future TLB
optimizations.
Following Borislav's suggestion, everything except the tlb_ll[i/d]_*
arrays (functions and data) is released after system boot.
To be friendly to all x86 vendors, vendor-specific code was moved to
the respective vendor-specific files.
Signed-off-by: Alex Shi <alex.shi@intel.com>
Link: http://lkml.kernel.org/r/1340845344-27557-2-git-send-email-alex.shi@intel.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There was a stray %s left from testing, remove it.
Add -w to the #! line (which is parsed by Perl even if the Perl
interpreter is invoked explicitly on the command line) to catch these
kinds of errors in the future.
Reported-by: Jean Delvare <khali@linux-fr.org>
Link: http://lkml.kernel.org/r/20120626143246.0c9bf301@endymion.delvare
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
We had a case of duplicate CPU feature strings, a user space ABI
violation, for almost two years. Make it a build error so that
doesn't happen again.
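The kind of check described here can be sketched as follows. This is a
Python model for illustration, not the build script itself; the
function names and the sample strings in the usage note are
assumptions.

```python
# Userspace model of a duplicate-string check; not the kernel build script.
from collections import Counter


def duplicate_feature_strings(names):
    """Return the feature strings that occur more than once, sorted."""
    counts = Counter(names)
    return sorted(name for name, count in counts.items() if count > 1)


def check_feature_strings(names):
    """Raise ValueError (standing in for a build error) on any duplicate."""
    dups = duplicate_feature_strings(names)
    if dups:
        raise ValueError("duplicate CPU feature strings: " + ", ".join(dups))
```

For example, a list such as ["fpu", "dts", "vme", "dts"] would be
rejected, while a list of unique strings passes silently.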
Link: http://lkml.kernel.org/r/4FE34BCB.5050305@linux.intel.com
Cc: Jan Beulich <JBeulich@suse.com>
Cc: Jean Delvare <khali@linux-fr.org>
It makes sense to label "Digital Thermal Sensor" as "DTS", but
unfortunately the string "dts" was already used for "Debug Store", and
/proc/cpuinfo is a user space ABI.
Therefore, rename this to "dtherm".
This conflict went into mainline via the hwmon tree without any x86
maintainer ack, and without any kind of hint in the subject.
a4659053 x86/hwmon: fix initialization of coretemp
Reported-by: Jean Delvare <khali@linux-fr.org>
Link: http://lkml.kernel.org/r/4FE34BCB.5050305@linux.intel.com
Cc: Jan Beulich <JBeulich@suse.com>
Cc: <stable@vger.kernel.org> v2.6.36..v3.4
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The iommu=group_mf is really no longer needed with the addition of ACS
support in IOMMU drivers creating groups. Most multifunction devices
will now be grouped already. If a device has gone to the trouble of
exposing ACS, trust that it works. We can use the device specific ACS
function for fixing devices we trust individually. This largely
reverts bcb71abe.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Fix section mismatch in uncore_pci_init():
WARNING: vmlinux.o(.init.text+0x9246): Section mismatch in reference from the function uncore_pci_init() to the function .devexit.text:uncore_pci_remove()
The function __init uncore_pci_init() references
a function __devexit uncore_pci_remove().
[...]
Signed-off-by: Robert Richter <robert.richter@amd.com>
Cc: <a.p.zijlstra@chello.nl>
Cc: <zheng.z.yan@intel.com>
Link: http://lkml.kernel.org/r/20120620163927.GI5046@erda.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
We write reboot_mode to BIOS location 0x472 in
native_machine_emergency_restart() (reboot.c:542) already, there is no
need to then write it again in machine_real_restart().
This means nothing gets written there for MRR_APM, but the APM call is
a poweroff call and doesn't use this memory location.
Link: http://lkml.kernel.org/n/tip-3i0pfh44c1e3jv5lab0cf7sc@git.kernel.org
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Printing the list of loaded modules is really unrelated to what
this function is about, and is particularly unnecessary in the
context of the SysRQ key handling (gets printed so far over and
over).
It should really be the caller of the function to decide whether
this piece of information is useful (and to avoid redundantly
printing it).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4FDF21A4020000780008A67F@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
It's not being used for other than creating module aliases (i.e.
no loadable section has any reference to it).
Signed-off-by: Jan Beulich <jbeulich@suse.com>
Link: http://lkml.kernel.org/r/4FDF1EFD020000780008A65D@nat28.tlf.novell.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Implement a cleaner and easier to maintain version for the section
warning fixes implemented in commit eeaaa96a3a
("x86/nmi: Fix section mismatch warnings on 32-bit").
Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Jan Beulich <JBeulich@suse.com>
Link: http://lkml.kernel.org/r/1340049393-17771-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The uncore subsystem in Sandy Bridge-EP consists of 8 components:
Ubox, Caching Agent, Home Agent, Memory controller, Power Control,
QPI Link Layer, R2PCIe, R3QPI.
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1339741902-8449-9-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds generic support for uncore PMUs presented as
PCI devices. (These come in addition to the CPU/MSR based
uncores.)
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1339741902-8449-8-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
This patch adds the generic Intel uncore PMU support, including helper
functions that add/delete uncore events, a hrtimer that periodically
polls the counters to avoid overflow and code that places all events
for a particular socket onto a single cpu.
The code design is based on the structure of Sandy Bridge-EP's uncore
subsystem, which consists of a variety of components, each component
contains one or more "boxes".
(Tooling support follows in the next patches.)
Signed-off-by: Zheng Yan <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/1339741902-8449-6-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The RDPMC index calculation is wrong for AMD family 15h
(X86_FEATURE_PERFCTR_CORE set). This leads to a #GP when
accessing the counter:
Pid: 2237, comm: syslog-ng Not tainted 3.5.0-rc1-perf-x86_64-standard-g130ff90 #135 AMD Pike/Pike
RIP: 0010:[<ffffffff8100dc33>] [<ffffffff8100dc33>] x86_perf_event_update+0x27/0x66
While the msr address offset is (index << 1) we must use index to
select the correct rdpmc.
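A minimal sketch of the distinction, not the actual patch: the helper
names and the has_perfctr_core flag are hypothetical stand-ins for the
kernel's feature check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper: offset added to the base MSR address for counter
 * `index`. With the feature set, the offset is (index << 1). */
static int sketch_msr_addr_offset(int index, bool has_perfctr_core)
{
    assert(index >= 0);
    return has_perfctr_core ? index << 1 : index;
}

/* Hypothetical helper: counter number handed to RDPMC. It is always the
 * plain index, never the MSR address offset. */
static int sketch_rdpmc_index(int index)
{
    assert(index >= 0);
    return index;
}
```

For counter 3 the MSR offset is 6, while RDPMC must still be given 3.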
Signed-off-by: Robert Richter <robert.richter@amd.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Vince Weaver <vweaver1@eecs.utk.edu>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
With the revamped realmode trampoline code, it is trivial to extend
support for reboot=bios to x86-64. Furthermore, while we are at it,
remove the restriction that we can only override the reboot CPU
on 32 bits.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Link: http://lkml.kernel.org/n/tip-jopx7y6g6dbcx4tpal8q0jlr@git.kernel.org
Pull DMA-mapping fixes from Marek Szyprowski:
"A set of minor fixes for dma-mapping code (ARM and x86) required for
Contiguous Memory Allocator (CMA) patches merged in v3.5-rc1."
* 'fixes-for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping:
x86: dma-mapping: fix broken allocation when dma_mask has been provided
ARM: dma-mapping: fix debug messages in dmabounce code
ARM: mm: fix type of the arm_dma_limit global variable
ARM: dma-mapping: Add missing static storage class specifier
Move the ->irq_set_affinity() routines out of the #ifdef CONFIG_SMP
sections and use config_enabled(CONFIG_SMP) checks inside those
routines. Thus making those routines simple null stubs for
!CONFIG_SMP and retaining those routines with no additional
runtime overhead for CONFIG_SMP kernels.
Cleans up the ifdef CONFIG_SMP in and around routines related to
irq_set_affinity in io_apic and irq_remapping subsystems.
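The pattern can be sketched in plain C as follows. This is an
illustration only: SKETCH_CONFIG_SMP and sketch_set_affinity() are
hypothetical names, and the -1 return value for the stub is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a build-time option; the kernel derives the
 * equivalent constant from CONFIG_SMP. */
#define SKETCH_CONFIG_SMP 0

static int sketch_set_affinity(unsigned int cpu, unsigned int *target)
{
    assert(target != NULL);

    /* Constant condition: the compiler can drop the dead branch, yet
     * both branches are always parsed and type-checked. */
    if (!SKETCH_CONFIG_SMP)
        return -1;      /* null stub when the option is off */

    *target = cpu;
    return 0;
}
```

Unlike an #ifdef block, the disabled branch cannot silently rot, because
it is still compiled on every configuration.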
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: torvalds@linux-foundation.org
Cc: joerg.roedel@amd.com
Cc: Sam Ravnborg <sam@ravnborg.org>
Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
Link: http://lkml.kernel.org/r/1339723729.3475.63.camel@sbsiddha-desk.sc.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
set_vsmp_pv_ops() references no_irq_affinity which is undeclared
if CONFIG_PROC_FS isn't set. Fix this by adding an #ifdef around
this variable's access.
Reported-by: Fengguang Wu <wfg@linux.intel.com>
Signed-off-by: Ido Yariv <ido@wizery.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
Link: http://lkml.kernel.org/r/1339688588-12674-1-git-send-email-ido@wizery.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 0a2b9a6ea9 ("X86: integrate CMA with DMA-mapping subsystem")
broke memory allocation with dma_mask. This patch fixes a possible kernel
oops caused by not resetting the page variable when jumping to the 'again' label.
Reported-by: Konrad Rzeszutek Wilk <konrad@darnok.org>
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Acked-by: Michal Nazarewicz <mina86@mina86.com>
cpu_mask_to_apicid_and() always returns apicid of a single CPU,
even in case multiple CPUs were requested. This update fixes a
typo and forces apicid of a cluster to be returned.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20120614075043.GI3383@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In case of invalid parameters cpu_mask_to_apicid_and() might
return apicid value of 0 (on Summit) or an uninitialized value
(on ES7000), although it is supposed to return apicid of cpu-0
at least. Fix the operation to always return a valid apicid.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20120614075026.GH3383@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since there are only two locations where cpu_mask_to_apicid() is
called from, remove the operation and use only
cpu_mask_to_apicid_and() instead.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Suggested-and-acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20120614074935.GE3383@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Since commit 8637e38 ("x86/apic: Avoid useless scanning thru a
cpumask in assign_irq_vector()") vector_allocation_domain()
operation indicates if a cpumask is dynamic or static. This
update fixes the oversight and makes the operation return a
value.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20120614103933.GJ3383@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add "read-mostly" qualifier to the following variables in
smp.h:
- cpu_sibling_map
- cpu_core_map
- cpu_llc_shared_map
- cpu_llc_id
- cpu_number
- x86_cpu_to_apicid
- x86_bios_cpu_apicid
- x86_cpu_to_logical_apicid
Since all the variables above are only written during
initialization, this change is meant to prevent false
sharing. More specifically, on the vSMP Foundation platform
x86_cpu_to_apicid shared the same internode_cache_line with
the frequently written lapic_events.
An analysis of the first 33 of the 219 per_cpu variables
(more precisely, of the memory they describe) shows that 8 are
read-mostly (tlb_vector_offset, cpu_loops_per_jiffy, xen_debug_irq, etc.)
and 25 are frequently written (irq_stack_union, gdt_page,
exception_stacks, idt_desc, etc.).
Assuming that the spread of the rest of the per_cpu variables is
similar, identifying the read-mostly memory makes more sense
in terms of long-term code maintenance than identifying the
frequently written memory.
Signed-off-by: Vlad Zolotarov <vlad@scalemp.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
Cc: Shai Fultheim (Shai@ScaleMP.com) <Shai@scalemp.com>
Cc: ido@wizery.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1719258.EYKzE4Zbq5@vlad
Signed-off-by: Ingo Molnar <mingo@kernel.org>
stop_machine_text_poke() uses atomic_dec_and_test() to select one of
the CPUs executing that function to actually modify the code.
Since the variable is initialized to 1, subsequent CPUs will make the
variable go negative. Since going negative is uncommon/unexpected in
typical dec_and_test usage, change this user to atomic_xchg().
This was found using a patch that warns on dec_and_test going
negative.
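A user-space sketch of the election using C11 atomics rather than the
kernel's atomic_t API; sketch_fire and sketch_claim_poke() are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical election flag: 1 means "nobody has claimed the work yet". */
static atomic_int sketch_fire = 1;

/* Returns true for exactly one caller: the one that swaps the 1 out.
 * Later callers swap 0 for 0, so the value never goes negative. */
static bool sketch_claim_poke(void)
{
    int old = atomic_exchange(&sketch_fire, 0);

    assert(old == 0 || old == 1);
    return old == 1;
}
```

With a decrement-and-test, the second and later callers would push the
counter to -1, -2 and so on; with the exchange it stays at 0.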
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Acked-by: Steven Rostedt <rostedt@goodmis.org>
[ Rewrote changelog ]
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/87zk8fgsx9.fsf@devron.myhome.or.jp
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The warning below triggers on AMD MCM packages because physical package
IDs on the cores of a _physical_ socket are the same. I.e., this field
says which CPUs belong to the same physical package.
However, the same two CPUs belong to two different internal, i.e.
"logical" nodes in the same physical socket which is reflected in the
CPU-to-node map on x86 with NUMA.
Which makes this check wrong on the above topologies so circumvent it.
[ 0.444413] Booting Node 0, Processors #1#2#3#4#5 Ok.
[ 0.461388] ------------[ cut here ]------------
[ 0.465997] WARNING: at arch/x86/kernel/smpboot.c:310 topology_sane.clone.1+0x6e/0x81()
[ 0.473960] Hardware name: Dinar
[ 0.477170] sched: CPU #6's mc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.
[ 0.486860] Booting Node 1, Processors #6
[ 0.491104] Modules linked in:
[ 0.494141] Pid: 0, comm: swapper/6 Not tainted 3.4.0+ #1
[ 0.499510] Call Trace:
[ 0.501946] [<ffffffff8144bf92>] ? topology_sane.clone.1+0x6e/0x81
[ 0.508185] [<ffffffff8102f1fc>] warn_slowpath_common+0x85/0x9d
[ 0.514163] [<ffffffff8102f2b7>] warn_slowpath_fmt+0x46/0x48
[ 0.519881] [<ffffffff8144bf92>] topology_sane.clone.1+0x6e/0x81
[ 0.525943] [<ffffffff8144c234>] set_cpu_sibling_map+0x251/0x371
[ 0.532004] [<ffffffff8144c4ee>] start_secondary+0x19a/0x218
[ 0.537729] ---[ end trace 4eaa2a86a8e2da22 ]---
[ 0.628197] #7#8#9#10#11 Ok.
[ 0.807108] Booting Node 3, Processors #12#13#14#15#16#17 Ok.
[ 0.897587] Booting Node 2, Processors #18#19#20#21#22#23 Ok.
[ 0.917443] Brought up 24 CPUs
We ran a topology sanity check test we have here on it and
it all looks ok... hopefully :).
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120529135442.GE29157@aftab.osrc.amd.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The fixups are executed once the pci-device is found which is during
boot process so __init seems fine as long as the platform does not
support hotplug.
However it is possible to remove the PCI bus at run time and have it
rediscovered again via "echo 1 > /sys/bus/pci/rescan" and this will call
the fixups again.
Cc: x86@kernel.org
Signed-off-by: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
CPU offline path calls the hrtimer interrupt handler with interrupts
disabled, without touching preempt_count, triggering this warning.
Remove the warning since it is supposed to be used from hrtimer
interrupt context only.
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
This is the 2nd part of fix for kernel bugzilla 40002:
"IRQ 0 assigned to VGA"
https://bugzilla.kernel.org/show_bug.cgi?id=40002
The root cause is buggy firmware whose ACPI tables assign GSI 16
to two IRQs, 0 and 16 (VGA); VGA is the rightful owner of GSI 16.
Adding a quirk that ignores the IRQ 0 override of GSI 16 on the
FUJITSU SIEMENS AMILO PRO V2030 platform solves this issue.
Reported-and-tested-by: Szymon Kowalczyk <fazerxlo@o2.pl>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
The current WARN message is only for the ati_ixp4x0 board, while this
function is used by multiple platforms. So this board-specific warning
is no longer appropriate.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Currently, when acpi_skip_timer_override is set, it only covers the
(source_irq == 0 && global_irq == 2) case. However, there are also
platforms which need this option and whose global_irq is not 2.
This patch will extend acpi_skip_timer_override to cover all
timer overriding cases as long as the source irq is 0.
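The change in the condition can be sketched as two predicates. The
function names and the skip_opt parameter are hypothetical; they only
model the option and the two IRQ numbers mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Before: only the IRQ 0 -> GSI 2 override was skipped. */
static bool sketch_skip_override_old(bool skip_opt, int source_irq,
                                     int global_irq)
{
    assert(source_irq >= 0);
    return skip_opt && source_irq == 0 && global_irq == 2;
}

/* After: any override whose source IRQ is 0 is skipped. */
static bool sketch_skip_override_new(bool skip_opt, int source_irq,
                                     int global_irq)
{
    assert(source_irq >= 0);
    (void)global_irq;   /* no longer part of the decision */
    return skip_opt && source_irq == 0;
}
```

An IRQ 0 override that targets GSI 16 is ignored only by the new predicate.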
This is the first part of a fix to kernel bug bugzilla 40002:
"IRQ 0 assigned to VGA"
https://bugzilla.kernel.org/show_bug.cgi?id=40002
Reported-and-tested-by: Szymon Kowalczyk <fazerxlo@o2.pl>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
vSMP can route interrupts more optimally based on internal
knowledge the OS does not have. In order to support this
optimization, all CPUs must be able to handle all possible
IOAPIC interrupts.
Fix this by setting the vector allocation domain for all CPUs
and by enabling this feature in vSMP.
Signed-off-by: Ravikiran Thirumalai <kiran.thirumalai@gmail.com>
Signed-off-by: Shai Fultheim <shai@scalemp.com>
[ Rebased, simplified, and reworded the commit message. ]
Signed-off-by: Ido Yariv <ido@wizery.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Avi Kivity reported that page faults in NMIs could cause havoc if
the NMI preempted another page fault handler:
The recent changes to NMI allow exceptions to take place in NMI
handlers, but I think that a #PF (say, due to access to vmalloc space)
is still problematic. Consider the sequence
#PF (cr2 set by processor)
NMI
...
#PF (cr2 clobbered)
do_page_fault()
IRET
...
IRET
do_page_fault()
address = read_cr2()
The last line reads the overwritten cr2 value.
This is the i386 version, which has the luxury of doing the work
in C code.
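One way to picture a fix done in C is to save CR2 on NMI entry and
restore it on exit. The following is a user-space simulation under that
assumption, not the patch itself; sketch_cr2 and the function names are
hypothetical.

```c
#include <assert.h>

/* Simulated CR2; real code reads and writes the CPU register. */
static unsigned long sketch_cr2;

/* Stand-in for an NMI handler body that takes its own page fault. */
static void sketch_nmi_body_with_fault(unsigned long fault_addr)
{
    sketch_cr2 = fault_addr;    /* the nested #PF clobbers CR2 */
}

/* NMI entry wrapper: remember CR2 on entry and put it back on exit, so
 * an interrupted page fault handler still reads its own fault address. */
static void sketch_do_nmi(unsigned long nested_fault_addr)
{
    unsigned long saved_cr2 = sketch_cr2;

    sketch_nmi_body_with_fault(nested_fault_addr);
    if (sketch_cr2 != saved_cr2)
        sketch_cr2 = saved_cr2;
    assert(sketch_cr2 == saved_cr2);
}
```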
Link: http://lkml.kernel.org/r/4FBB8C40.6080304@redhat.com
Reported-by: Avi Kivity <avi@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
I've been informed by someone on LWN called 'slashdot' that
some i386 machines do not support a true cmpxchg. The cmpxchg
used by the i386 NMI nesting code must be a true cmpxchg as
disabling interrupts will not work for NMIs (which is the work
around for i386s that do not have a true cmpxchg).
This 'slashdot' character also suggested a fix to the issue.
As the state of the nesting NMIs goes as follows:
NOT_RUNNING -> EXECUTING
EXECUTING -> NOT_RUNNING
EXECUTING -> LATCHED
LATCHED -> EXECUTING
Having these states as enum values of:
NOT_RUNNING = 0
EXECUTING = 1
LATCHED = 2
Instead of a cmpxchg to make EXECUTING -> NOT_RUNNING a
dec_and_test() would work as well. If the dec_and_test brings
the state to NOT_RUNNING, that is the same as a cmpxchg
succeeding to change EXECUTING to NOT_RUNNING. If a nested NMI
were to come in and change it to LATCHED, the dec_and_test() would
convert the state to EXECUTING (what we want it to be in such a
case anyway).
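The decrement trick can be sketched as below. This is a single-threaded
illustration with a plain int, whereas the real state is per-CPU; the
enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_nmi_state {
    NMI_NOT_RUNNING = 0,
    NMI_EXECUTING   = 1,
    NMI_LATCHED     = 2
};

/* Called when an NMI handler finishes. Decrementing replaces the cmpxchg:
 *   EXECUTING (1) -> NOT_RUNNING (0): done, returns false
 *   LATCHED   (2) -> EXECUTING   (1): a nested NMI arrived, returns true
 *                                     so the caller runs the handler again */
static bool sketch_nmi_exit_needs_rerun(int *state)
{
    assert(*state == NMI_EXECUTING || *state == NMI_LATCHED);
    *state -= 1;
    return *state != NMI_NOT_RUNNING;
}
```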
I asked 'slashdot' to post this as a patch, but it never came to
be. I decided to do the work instead.
Thanks to H. Peter Anvin for suggesting to use this_cpu_dec_and_return()
instead of local_dec_and_test(&__get_cpu_var()).
Link: http://lwn.net/Articles/484932/
Cc: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull x86 fixes from Ingo Molnar.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix section mismatch warnings on 32-bit
x86/uv: Fix UV2 BAU legacy mode
x86/mm: Only add extra pages count for the first memory range during pre-allocation early page table space
x86, efi stub: Add .reloc section back into image
x86/ioapic: Fix NULL pointer dereference on CPU hotplug after disabling irqs
x86/reboot: Fix a warning message triggered by stop_other_cpus()
x86/intel/moorestown: Change intel_scu_devices_create() to __devinit
x86/numa: Set numa_nodes_parsed at acpi_numa_memory_affinity_init()
x86/gart: Fix kmemleak warning
x86: mce: Add the dropped timer interval init back
x86/mce: Fix the MCE poll timer logic
Pull perf fixes from Ingo Molnar:
"A bit larger than what I'd wish for - half of it is due to hw driver
updates to Intel Ivy-Bridge which info got recently released,
cycles:pp should work there now too, amongst other things. (but we
are generally making exceptions for hardware enablement of this type.)
There are also callchain fixes in it - responding to mostly
theoretical (but valid) concerns. The tooling side sports perf.data
endianness/portability fixes which did not make it for the merge
window - and various other fixes as well."
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (26 commits)
perf/x86: Check user address explicitly in copy_from_user_nmi()
perf/x86: Check if user fp is valid
perf: Limit callchains to 127
perf/x86: Allow multiple stacks
perf/x86: Update SNB PEBS constraints
perf/x86: Enable/Add IvyBridge hardware support
perf/x86: Implement cycles:p for SNB/IVB
perf/x86: Fix Intel shared extra MSR allocation
x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefix
perf: Remove duplicate invocation on perf_event_for_each
perf uprobes: Remove unnecessary check before strlist__delete
perf symbols: Check for valid dso before creating map
perf evsel: Fix 32 bit values endianity swap for sample_id_all header
perf session: Handle endianity swap on sample_id_all header data
perf symbols: Handle different endians properly during symbol load
perf evlist: Pass third argument to ioctl explicitly
perf tools: Update ioctl documentation for PERF_IOC_FLAG_GROUP
perf tools: Make --version show kernel version instead of pull req tag
perf tools: Check if callchain is corrupted
perf callchain: Make callchain cursors TLS
...
It was reported that compiling for 32-bit caused a bunch of
section mismatch warnings:
VDSOSYM arch/x86/vdso/vdso32-syms.lds
LD arch/x86/vdso/built-in.o
LD arch/x86/built-in.o
WARNING: arch/x86/built-in.o(.data+0x5af0): Section mismatch in
reference from the variable test_nmi_ipi_callback_na.10451 to
the function .init.text:test_nmi_ipi_callback() [...]
WARNING: arch/x86/built-in.o(.data+0x5b04): Section mismatch in
reference from the variable nmi_unk_cb_na.10399 to the function
.init.text:nmi_unk_cb() The variable nmi_unk_cb_na.10399
references the function __init nmi_unk_cb() [...]
Both of these are attributed to the internal representation of
the nmiaction struct created during register_nmi_handler. The
reason for this is that those structs are not defined in the
init section whereas the rest of the code in nmi_selftest.c is.
To resolve this, I created a new #define,
register_nmi_handler_initonly, that tags the struct as
__initdata to resolve the mismatch. This #define should only be
used in rare situations where the register/unregister is called
during init of the kernel.
Big thanks to Jan Beulich for decoding this for me as I didn't
have a clue what was going on.
Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Tested-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
Cc: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Don Zickus <dzickus@redhat.com>
Link: http://lkml.kernel.org/r/1338991542-23000-1-git-send-email-dzickus@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Currently cpu_mask_to_apicid() should not get an offline CPU in
the cpumask. Otherwise some apic drivers might try to access
non-existent per-cpu variables (e.g. x2apic). In that regard the
cpu_mask_to_apicid() and cpu_mask_to_apicid_and() operations are
inconsistent.
This fix makes the two operations not rely on their callers
and always return the apicid for online CPUs only. As a
result, the meaning and implementations of the cpu_mask_to_apicid()
and cpu_mask_to_apicid_and() operations become straightforward.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20120607131624.GG4759@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The current cpu_mask_to_apicid() and cpu_mask_to_apicid_and()
implementations have a few shortcomings:
1. A value returned by cpu_mask_to_apicid() is written to
hardware registers unconditionally. Should BAD_APICID ever be
returned, it will be written to hardware too. But the value of
BAD_APICID is not universal across all hardware in all modes and
might cause unexpected results, e.g. interrupts might get routed
to CPUs that are not configured to receive them.
2. Because the value of BAD_APICID is not universal, it is
counter-intuitive to return it for hardware where it does not
make sense (e.g. x2apic).
3. The cpu_mask_to_apicid_and() operation is thought of as a
complement to cpu_mask_to_apicid() that only applies an AND mask
on top of the cpumask being passed. Yet, as a consequence of commit
18374d8, the two operations are inconsistent in that:
cpu_mask_to_apicid() should not get an offline CPU in the cpumask
cpu_mask_to_apicid_and() should not fail and return BAD_APICID
These limitations are impossible to realize just from looking at
the operations' prototypes.
Most of these shortcomings are resolved by returning an error
code instead of BAD_APICID. As a result, faults are reported
back early instead of leaving room for unexpected
behaviour (as in case [1]).
The only exception is the setup_timer_IRQ0_pin() routine. Although
obviously contrary to this fix, its existing behaviour is
preserved so as not to break the fragile check_timer(); it would be
better addressed in a separate fix.
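The error-code convention described above can be sketched as follows.
This is a toy model, not the kernel interface: CPUs are bits in a 64-bit
mask, the APIC ID of CPU n is assumed to be n, and the function name and
the explicit online parameter are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0 and stores an APIC ID on success, or a negative error code
 * when no online CPU is left after applying both masks. */
static int sketch_mask_to_apicid_and(uint64_t cpumask, uint64_t andmask,
                                     uint64_t online, unsigned int *apicid)
{
    uint64_t candidates = cpumask & andmask & online;
    unsigned int cpu;

    assert(apicid != NULL);
    if (candidates == 0)
        return -EINVAL; /* report the fault; *apicid is left untouched */

    /* Find the lowest online CPU present in both masks. */
    for (cpu = 0; !(candidates & (UINT64_C(1) << cpu)); cpu++)
        ;
    *apicid = cpu;
    return 0;
}
```

A caller checks the return value before touching hardware, so no
sentinel value ever reaches a register.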
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20120607131559.GF4759@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In case of static vector allocation domains (i.e. flat) if all
vector numbers are exhausted, an attempt to assign a new vector
will lead to useless scans through all CPUs in the cpumask, even
though it is known that each new pass would fail. Make this
corner case less painful by letting vector_allocation_domain()
report whether the vector allocation domain depends on the passed
arguments or not, and stop scanning early.
The same could have been achieved by introducing a static flag in
the apic operations. But let's allow vector_allocation_domain()
to have more intelligence here and decide dynamically, in case we
need it in the future.
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20120607131542.GE4759@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When assigning a new vector, it is primarily done by adding 8
to the previously given out vector number. Hence, two
consecutively allocated vector numbers would likely fall into the
same priority level. Try to spread vector numbers to different
priority levels better by changing the step from 8 to 16.
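The effect of the step size can be sketched as below, taking the
priority level to be the upper four bits of the 8-bit vector number.
The helper names are hypothetical, and wrap-around and reserved vectors
are ignored.

```c
#include <assert.h>

/* Priority level of a vector: 16 consecutive vectors share one level. */
static unsigned int sketch_priority_level(unsigned int vector)
{
    assert(vector <= 0xff);
    return vector >> 4;
}

/* Next candidate vector for a given step. */
static unsigned int sketch_next_vector(unsigned int vector, unsigned int step)
{
    return vector + step;
}
```

Starting from 0x41, a step of 8 gives 0x49 (same level, 4), while a step
of 16 gives 0x51 (next level, 5).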
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Acked-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/20120607131514.GD4759@dhcp-26-207.brq.redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Rename checking_wrmsrl() to wrmsrl_safe(), to match the naming
convention used by all the other MSR access functions/macros.
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Now that all users of {rd,wr}msr_amd_safe have been fixed, deprecate their
use by making them private to amd.c and adding warnings when used on
anything other than K8.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1338562358-28182-5-git-send-email-bp@amd64.org
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
f7f286a910 ("x86/amd: Re-enable CPU topology extensions in case BIOS
has disabled it") wrongfully added code which used the AMD-specific
{rd,wr}msr variants for no real reason.
This caused boot panics on xen which wasn't initializing the
{rd,wr}msr_safe_regs pv_ops members properly.
This, in turn, caused a heated discussion leading to us reviewing all
uses of the AMD-specific variants and removing them where unneeded
(almost everywhere except an obscure K8 BIOS fix, see 6b0f43ddfa).
Finally, this patch switches to the standard {rd,wr}msr*_safe* variants
which should've been used in the first place anyway and avoided unneeded
excitement with xen.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Link: http://lkml.kernel.org/r/1338562358-28182-4-git-send-email-bp@amd64.org
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: <http://lkml.kernel.org/r/1338383402-3838-1-git-send-email-andre.przywara@amd.com>
[Boris: correct and expand commit message]
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There's no real reason why, when showing the MSRs on a CPU at boottime,
we should be using the AMD-specific variant. Simply use the generic safe
one which handles #GPs just fine.
Cc: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Link: http://lkml.kernel.org/r/1338562358-28182-3-git-send-email-bp@amd64.org
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
There were paravirt_ops hooks for the full register set variant of
{rd,wr}msr_safe which are actually not used by anyone anymore. Remove
them to make the code cleaner and avoid silent breakages when the pvops
members were uninitialized. This has been boot-tested natively and under
Xen with PVOPS enabled and disabled on one machine.
Signed-off-by: Andre Przywara <andre.przywara@amd.com>
Link: http://lkml.kernel.org/r/1338562358-28182-2-git-send-email-bp@amd64.org
Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Avi Kivity reported that page faults in NMIs could cause havoc if
the NMI preempted another page fault handler:
The recent changes to NMI allow exceptions to take place in NMI
handlers, but I think that a #PF (say, due to access to vmalloc space)
is still problematic. Consider the sequence
#PF (cr2 set by processor)
NMI
...
#PF (cr2 clobbered)
do_page_fault()
IRET
...
IRET
do_page_fault()
address = read_cr2()
The last line reads the overwritten cr2 value.
Originally I wrote a patch to solve this by saving the cr2 on the stack.
Brian Gerst suggested saving it in the r12 register instead, as both r12
and rbx are preserved across the call to do_nmi (they are callee-saved in
the x86-64 calling convention). But rbx is already used to record whether
swapgs needs to be run on exit of the NMI handler.
Link: http://lkml.kernel.org/r/4FBB8C40.6080304@redhat.com
Link: http://lkml.kernel.org/r/1337763411.13348.140.camel@gandalf.stny.rr.com
Reported-by: Avi Kivity <avi@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Suggested-by: Brian Gerst <brgerst@gmail.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Until now, writing to error count caused the code to reset the
thresholding bank to the current thresholding limit and start counting
errors from the beginning.
This is misleading and unclear, and can be accomplished by writing the
old thresholding limit into ->threshold_limit.
Make error_count read-only with the functionality to show the current
error count.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
We have rdmsr_on_cpu() now so remove locally defined solution in favor
of the generic one.
No functionality change.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
If one sets the threshold limit, say to 25:
$ echo 25 > machinecheck0/threshold_bank4/misc0/threshold_limit
and then reads it back again, it gives
$ cat machinecheck0/threshold_bank4/misc0/threshold_limit
19
which is actually 0x19 but we don't know that.
Make all output decimal.
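The difference comes down to the conversion specifier used by the show
routine. A small sketch, assuming the old code printed with a
hexadecimal format and no prefix; the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Old behaviour in this sketch: hexadecimal without a 0x prefix. */
static int sketch_show_hex(char *buf, size_t len, unsigned int limit)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%x\n", limit);
}

/* New behaviour: decimal, matching what the user wrote. */
static int sketch_show_dec(char *buf, size_t len, unsigned int limit)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%u\n", limit);
}
```

For a limit of 25 the first routine produces "19" and the second "25".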
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Well, instead of having a real bank 4 on the BSP of each node and
symlinks on the remaining cores, we push it up into the amd_northbridge
descriptor which now contains a pointer to the northbridge bank 4
because the bank is one per northbridge and, as such, belongs in the NB
descriptor anyway.
Each time we hotplug CPUs, we use the northbridge pointer to copy the
shared bank into the per-CPU array of threshold_banks pointers, or
destroy it when the last CPU on the node goes offline, or create it when
the first comes online.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
The code used to create a symlink on all non-BSP cores of a node when
the MCi_MISCj bank is present once per node. (This is generally the
case with bank 4 on AMD). However, these sysfs links cause a bunch
of problems with cpu off-/onlining testing and are, as such, a bit
overengineered. IOW, there's nothing wrong with having normal sysfs
files for the shared banks since the corresponding MSRs are replicated
across each core anyway.
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Add the F3 PCI id of F15h, model 0x10 to pci_ids.h and to the amd_nb
code which generates the list of northbridges on an AMD box. Shorten
define name while at it so that it fits into pci_ids.h.
Acked-by: Clemens Ladisch <clemens@ladisch.de>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
On Sandy Bridge in non HT mode there are 8 counters available.
Since every counter can write a PEBS record, assuming there are
4 max is incorrect. Use the reported counter number -- with an
upper limit for a static array -- instead.
Also I made the warning messages a bit more informational.
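The counter limit can be sketched as a simple clamp. The array size of 8
and both names below are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical size of the static array of PEBS events. */
#define SKETCH_MAX_PEBS_EVENTS 8

/* Number of PEBS-capable counters to scan: what the hardware reports,
 * capped so it never indexes past the static array. */
static int sketch_max_pebs_events(int reported_counters)
{
    assert(reported_counters >= 0);
    return reported_counters < SKETCH_MAX_PEBS_EVENTS
        ? reported_counters : SKETCH_MAX_PEBS_EVENTS;
}
```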
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338944211-28275-2-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The rdpmc instruction is faster than the equivalent rdmsr call,
so use it when possible in the kernel.
The perfctr kernel patches did this, after extensive testing showed
rdpmc to always be faster (One can look in etc/costs in the perfctr-2.6
package to see a historical list of the overhead).
I have done some tests on a 3.2 kernel, the kernel module I used
was included in the first posting of this patch:
                 rdmsr           rdpmc
Core2 T9900:     203.9 cycles    30.9 cycles
AMD fam0fh:       56.2 cycles     9.8 cycles
Atom 6/28/2:     129.7 cycles    50.6 cycles
The speedup of using rdpmc is large.
[ It's probably possible (and desirable) to do this without
requiring a new field in the hw_perf_event structure, but
the fixed events make this tricky. ]
Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1203011724030.26934@cl320.eecs.utk.edu
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Move the wrmsrl() debug wrapper to the common header now that all the
include games are gone. Also clean it up a bit to avoid multiple
evaluation of the argument.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-l4gkfnivwv4yi5mqxjlovymx@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Without this patch, applications with two different stack
regions (eg: native stack vs JIT stack) get truncated
callchains even when RBP chaining is present. GDB shows proper
stack traces and the frame pointer chaining is intact.
This patch disables the (fp < RSP) check, hoping that other checks
in the code save the day for us. In our limited testing, this
didn't seem to break anything.
In the long term, we could potentially have userspace advise
the kernel on the range of valid stack addresses, so we don't
spend a lot of time unwinding from bogus addresses.
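The truncation can be reproduced with a small simulation of a frame-pointer walk. This is a sketch under assumptions: frames are modelled as table entries rather than real stack memory, and struct sim_frame, find_frame() and walk_callchain() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sim_frame {
	uintptr_t addr;     /* address of this frame (its frame pointer) */
	uintptr_t next_fp;  /* saved frame pointer of the caller, 0 ends the chain */
};

/* Look up the simulated frame that lives at address fp. */
const struct sim_frame *find_frame(const struct sim_frame *frames, size_t n,
				   uintptr_t fp)
{
	for (size_t i = 0; i < n; i++)
		if (frames[i].addr == fp)
			return &frames[i];
	return NULL;
}

/*
 * Follow the frame-pointer chain and return the number of frames seen.
 * With check_below_rsp set, the walk stops at the first frame pointer
 * below the current stack pointer, which is the check being disabled.
 */
size_t walk_callchain(const struct sim_frame *frames, size_t n,
		      uintptr_t fp, uintptr_t rsp, int check_below_rsp,
		      size_t max_depth)
{
	size_t depth = 0;

	while (fp && depth < max_depth) {
		const struct sim_frame *fr;

		if (check_below_rsp && fp < rsp)
			break;
		fr = find_frame(frames, n, fp);
		if (!fr)
			break;
		depth++;
		fp = fr->next_fp;
	}
	return depth;
}
```

With two native frames above RSP that chain into two JIT frames in a lower address range, the checked walk reports 2 frames and the unchecked walk reports all 4.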
Signed-off-by: Arun Sharma <asharma@fb.com>
CC: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Stephane Eranian <eranian@google.com>
Cc: Namhyung Kim <namhyung.kim@lge.com>
Cc: Tom Zanussi <tzanussi@gmail.com>
Cc: linux-kernel@vger.kernel.org
Cc: linux-perf-users@vger.kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1334961696-19580-2-git-send-email-asharma@fb.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Afaict there's no need to (incompletely) iterate the
MEM_UOPS_RETIRED.* umask state.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Stephane Eranian <eranian@google.com>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Implement rudimentary IVB perf support. The SDM states it's identical
to SNB with the exception of the exact event tables, but a quick look
suggests they're similar enough.
Also mark SNB-EP as broken for now.
Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Now that there's finally a chip with working PEBS (IvyBridge), we can
enable the hardware and implement cycles:p for SNB/IVB.
Cc: Stephane Eranian <eranian@google.com>
Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Zheng Yan reported that event group validation can wreck event state
when Intel extra_reg allocation changes event state.
Validation shouldn't change any persistent state. Cloning events in
validate_{event,group}() isn't really pretty either, so add a few
special cases to avoid modifying the event state.
The code is restructured to minimize the special case impact.
Reported-by: Zheng Yan <zheng.z.yan@linux.intel.com>
Acked-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1338903031.28282.175.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Commit 316ad24830 ("sched/x86: Rewrite set_cpu_sibling_map()")
broke the booted_cores accounting.
The problem is that the booted_cores accounting needs all the
sibling links set up. So restore the second loop and add a comment as
to why it's needed.
On qemu booted with -smp sockets=1,cores=2,threads=2:
Before:
$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 1
cpu cores : 4
cpu cores : 3
With the patch:
$ grep cores /proc/cpuinfo
cpu cores : 2
cpu cores : 2
cpu cores : 2
cpu cores : 2
Reported-by: Prarit Bhargava <prarit@redhat.com>
Reported-by: Borislav Petkov <bp@amd64.org>
Signed-off-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20120531073738.GH7511@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In current Linux, the percpu variable `vector_irq' is not cleared on
offlined cpus while disabling devices' irqs. If the cpu that has
the disabled irqs in vector_irq is hotplugged,
__setup_vector_irq() hits an invalid irq vector and may crash.
This bug can be reproduced as follows:
# echo 0 > /sys/devices/system/cpu/cpu7/online
# modprobe -r some_driver_using_interrupts # vector_irq@cpu7 uncleared
# echo 1 > /sys/devices/system/cpu/cpu7/online # kernel may crash
This patch fixes this bug by clearing vector_irq in
__clear_irq_vector() even if the cpu is offlined.
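A simplified userspace model of the bug and the fix, not the kernel code: a per-cpu table maps vectors to irqs, and the clear routine either skips offline cpus (old behaviour) or covers them too (new behaviour). All sk_* names and table sizes are hypothetical.

```c
#include <assert.h>

#define SK_NR_CPUS     8
#define SK_NR_VECTORS  16
#define SK_UNUSED      (-1)

/* Per-cpu vector -> irq table, standing in for vector_irq. */
int sk_vector_irq[SK_NR_CPUS][SK_NR_VECTORS];

void sk_reset_tables(void)
{
	for (int cpu = 0; cpu < SK_NR_CPUS; cpu++)
		for (int v = 0; v < SK_NR_VECTORS; v++)
			sk_vector_irq[cpu][v] = SK_UNUSED;
}

/* Install irq at vector on every cpu in cpu_mask. */
void sk_assign(unsigned int cpu_mask, int vector, int irq)
{
	for (int cpu = 0; cpu < SK_NR_CPUS; cpu++)
		if (cpu_mask & (1u << cpu))
			sk_vector_irq[cpu][vector] = irq;
}

/*
 * Clear vector on the cpus in domain. With include_offline == 0 only
 * online cpus are touched, which leaves stale entries behind on cpus
 * that are currently offline.
 */
void sk_clear(unsigned int domain, unsigned int online, int vector,
	      int include_offline)
{
	unsigned int mask = include_offline ? domain : (domain & online);

	for (int cpu = 0; cpu < SK_NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			sk_vector_irq[cpu][vector] = SK_UNUSED;
}

/* Count entries a re-onlined cpu would still find in its table. */
int sk_stale_entries(int cpu)
{
	int count = 0;

	for (int v = 0; v < SK_NR_VECTORS; v++)
		if (sk_vector_irq[cpu][v] != SK_UNUSED)
			count++;
	return count;
}
```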
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: ltc-kernel@ml.yrl.intra.hitachi.co.jp
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Alexander Gordeev <agordeev@redhat.com>
Link: http://lkml.kernel.org/r/4FC340BE.7080101@hitachi.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
When rebooting our 24 CPU Westmere servers with 3.4-rc6, we
always see this warning msg:
Restarting system.
machine restart
------------[ cut here ]------------
WARNING: at arch/x86/kernel/smp.c:125
native_smp_send_reschedule+0x74/0xa7() Hardware name: X8DTN
Modules linked in: igb [last unloaded: scsi_wait_scan]
Pid: 1, comm: systemd-shutdow Not tainted 3.4.0-rc6+ #22
Call Trace:
<IRQ> [<ffffffff8102a41f>] warn_slowpath_common+0x7e/0x96
[<ffffffff8102a44c>] warn_slowpath_null+0x15/0x17
[<ffffffff81018cf7>] native_smp_send_reschedule+0x74/0xa7
[<ffffffff810561c1>] trigger_load_balance+0x279/0x2a6
[<ffffffff81050112>] scheduler_tick+0xe0/0xe9
[<ffffffff81036768>] update_process_times+0x60/0x70
[<ffffffff81062f2f>] tick_sched_timer+0x68/0x92
[<ffffffff81046e33>] __run_hrtimer+0xb3/0x13c
[<ffffffff81062ec7>] ? tick_nohz_handler+0xd0/0xd0
[<ffffffff810474f2>] hrtimer_interrupt+0xdb/0x198
[<ffffffff81019a35>] smp_apic_timer_interrupt+0x81/0x94
[<ffffffff81655187>] apic_timer_interrupt+0x67/0x70
<EOI> [<ffffffff8101a3c4>] ? default_send_IPI_mask_allbutself_phys+0xb4/0xc4
[<ffffffff8101c680>] physflat_send_IPI_allbutself+0x12/0x14
[<ffffffff81018db4>] native_nmi_stop_other_cpus+0x8a/0xd6
[<ffffffff810188ba>] native_machine_shutdown+0x50/0x67
[<ffffffff81018926>] machine_shutdown+0xa/0xc
[<ffffffff8101897e>] native_machine_restart+0x20/0x32
[<ffffffff810189b0>] machine_restart+0xa/0xc
[<ffffffff8103b196>] kernel_restart+0x47/0x4c
[<ffffffff8103b2e6>] sys_reboot+0x13e/0x17c
[<ffffffff8164e436>] ? _raw_spin_unlock_bh+0x10/0x12
[<ffffffff810fcac9>] ? bdi_queue_work+0xcf/0xd8
[<ffffffff810fe82f>] ? __bdi_start_writeback+0xae/0xb7
[<ffffffff810e0d64>] ? iterate_supers+0xa3/0xb7
[<ffffffff816547a2>] system_call_fastpath+0x16/0x1b
---[ end trace 320af5cb1cb60c5b ]---
The root cause seems to be that
default_send_IPI_mask_allbutself_phys() takes quite some time (I
measured up to several ms) to finish sending NMIs to all
the other 23 CPUs. On HZ=250/1000 systems that is long
enough for a timer interrupt to fire, which in turn
kicks load balancing towards a stopped CPU and causes this
warning in native_smp_send_reschedule().
Disabling local irqs before stop_other_cpu() fixes this
problem (tested OK across 25 reboots), and it is fine because
nobody should care about the timer interrupt at this stage of
the reboot.
The latest 3.4 kernel slightly changes this behavior by sending
REBOOT_VECTOR first and only sending NMI_VECTOR if the REBOOT_VECTOR
fails, but this patch is still needed to prevent the problem.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20120530231541.4c13433a@feng-i7
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit 82f7af09 ("x86/mce: Cleanup timer mess") dropped the
initialization of the per cpu timer interval. Duh :(
Restore the previous behaviour.
Reported-by: Chen Gong <gong.chen@linux.intel.com>
Cc: bp@amd64.org
Cc: tony.luck@intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
If the HW implements round-robin interrupt delivery, this
enables multiple cpu's (which are part of the user specified
interrupt smp_affinity mask and belong to the same x2apic
cluster) to service the interrupt.
Also if the platform supports Power Aware Interrupt Routing,
then this enables the interrupt to be routed to an idle cpu or a
busy cpu depending on the perf/power bias tunable.
We now group all the CPUs in a cluster into one vector
domain, which limits the total number of interrupt sources
handled by Linux. Previously we supported "cpu-count *
available-vectors-per-cpu" interrupt sources, but this is now
reduced to "cpu-count/16 * available-vectors-per-cpu".
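The two capacity formulas quoted above can be written out directly. This is plain arithmetic, using integer division for the cluster count; the function names are hypothetical and any concrete vector count passed in is an assumption, not a kernel constant.

```c
#include <assert.h>

/* One vector domain per cpu: cpu-count * available-vectors-per-cpu. */
unsigned long irq_sources_per_cpu_domain(unsigned long cpus,
					 unsigned long vectors_per_cpu)
{
	return cpus * vectors_per_cpu;
}

/* One vector domain per cluster: cpu-count/16 * available-vectors-per-cpu. */
unsigned long irq_sources_cluster_domain(unsigned long cpus,
					 unsigned long vectors_per_cpu)
{
	return cpus / 16 * vectors_per_cpu;
}
```

For example, with 64 cpus and an assumed 200 usable vectors per cpu, the limit drops from 12800 to 800 interrupt sources.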
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yinghai@kernel.org
Cc: gorcunov@openvz.org
Cc: agordeev@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337644682-19854-2-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Until now, the irq_cfg domain has been mostly static: either all CPUs
(used by flat mode) or the one CPU (first CPU in the irq affinity
mask) to which the irq is being migrated (this is used by the rest
of the apic modes).
Upcoming x2apic cluster mode optimization patch allows the irq
to be sent to any CPU in the x2apic cluster (if supported by the
HW). So irq_cfg domain changes on the fly (depending on which
CPU in the x2apic cluster is online).
Instead of checking for any intersection between the new irq
affinity mask and the current irq_cfg domain, check if the new
irq affinity mask is a subset of the current irq_cfg domain.
Otherwise proceed with updating the irq_cfg domain as well as
assigning vectors on all the CPUs specified in the new mask.
This also cleans up a workaround in updating irq_cfg domain for
legacy irq's that are handled by the IO-APIC.
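The difference between the old intersection test and the new subset test is easy to see on plain bitmasks. The sketch below uses a 64-bit word in place of a cpumask; all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old test: do the two masks share at least one cpu? */
bool mask_intersects(uint64_t a, uint64_t b)
{
	return (a & b) != 0;
}

/* New test: is every cpu in a also in b? */
bool mask_subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;
}

/* The domain must be updated unless the new affinity fits inside it. */
bool needs_domain_update(uint64_t new_affinity, uint64_t cur_domain)
{
	return !mask_subset(new_affinity, cur_domain);
}
```

With a new affinity of cpus {1,2} and a current domain of cpus {0,1}, the masks intersect, so the old test would have kept the domain, while the subset test correctly asks for an update.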
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: yinghai@kernel.org
Cc: gorcunov@openvz.org
Cc: agordeev@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337644682-19854-1-git-send-email-suresh.b.siddha@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Use a more current logging style:
- Bare printks should have a KERN_<LEVEL> for consistency's sake
- Add pr_fmt where appropriate
- Neaten some macro definitions
- Convert some Ok output to OK
- Use "%s: ", __func__ in pr_fmt for summit
- Convert some printks to pr_<level>
Message output is not identical in all cases.
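A userspace sketch of the "%s: ", __func__ style of pr_fmt mentioned in the list above. sk_pr_info() is a hypothetical stand-in that formats into a buffer via snprintf() instead of writing to the kernel log, and summit_report() is an invented caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Same shape as the kernel idiom: prefix each message with the caller's name. */
#define pr_fmt(fmt) "%s: " fmt, __func__

/* Stand-in for pr_info(); needs at least one argument after the format. */
#define sk_pr_info(buf, len, fmt, ...) \
	snprintf((buf), (len), pr_fmt(fmt), __VA_ARGS__)

/* Hypothetical caller; the prefix comes from __func__, not from the string. */
int summit_report(char *buf, size_t len, int node)
{
	return sk_pr_info(buf, len, "node %d OK\n", node);
}
```

Because the prefix is added by the macro, individual messages no longer need to repeat the subsystem or function name by hand.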
Signed-off-by: Joe Perches <joe@perches.com>
Cc: levinsasha928@gmail.com
Link: http://lkml.kernel.org/r/1337655007.24226.10.camel@joe2Laptop
[ merged two similar patches, tidied up the changelog ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Some subarchitectures (such as vSMP) need to slightly adjust the
underlying APIC structure. Add an APIC post-initialization callback
to 'struct x86_platform_ops' for this purpose and use it for
adjusting the APIC structure on vSMP systems.
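The callback pattern described above, as a self-contained sketch. The struct layouts, the irq_dest_mode field and all sk_* names are hypothetical; only the idea of an optional post-initialization hook in a platform ops structure is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Toy APIC description with one adjustable field. */
struct sk_apic {
	const char *name;
	int irq_dest_mode;
};

/* Platform hooks; a NULL pointer means "no adjustment needed". */
struct sk_platform_ops {
	void (*apic_post_init)(struct sk_apic *apic);
};

/* Generic setup: fill in defaults, then let the platform adjust them. */
void sk_setup_apic(struct sk_apic *apic, const struct sk_platform_ops *ops)
{
	apic->name = "generic";
	apic->irq_dest_mode = 0;
	if (ops && ops->apic_post_init)
		ops->apic_post_init(apic);
}

/* Platform-specific adjustment, run after the generic setup. */
void sk_vsmp_apic_post_init(struct sk_apic *apic)
{
	apic->irq_dest_mode = 1;
}
```

The generic code stays unaware of the subarchitecture; a platform only has to install its hook before sk_setup_apic() runs.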
Signed-off-by: Ido Yariv <ido@wizery.com>
Acked-by: Shai Fultheim <shai@scalemp.com>
Link: http://lkml.kernel.org/r/1338675095-27260-1-git-send-email-ido@wizery.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
In commit 82f7af09 (x86/mce: Cleanup timer mess), Thomas just forgot
the "/ 2" there while cleaning up.
Signed-off-by: Chen Gong <gong.chen@linux.intel.com>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull scheduler fixes from Ingo Molnar.
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Remove NULL assignment of dattr_cur
sched: Remove the last NULL entry from sched_feat_names
sched: Make sched_feat_names const
sched/rt: Fix SCHED_RR across cgroups
sched: Move nr_cpus_allowed out of 'struct sched_rt_entity'
sched: Make sure to not re-read variables after validation
sched: Fix SD_OVERLAP
sched: Don't try allocating memory from offline nodes
sched/nohz: Fix rq->cpu_load calculations some more
sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask
ipi_call_lock/unlock() lock resp. unlock call_function.lock. This lock
protects only the call_function data structure itself, but it's
completely unrelated to cpu_online_mask. The mask to which the IPIs
are sent is calculated before call_function.lock is taken in
smp_call_function_many(), so the locking around set_cpu_online() is
pointless and can be removed.
[ tglx: Massaged changelog ]
Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
Cc: ralf@linux-mips.org
Cc: sshtylyov@mvista.com
Cc: david.daney@cavium.com
Cc: nikunj@linux.vnet.ibm.com
Cc: paulmck@linux.vnet.ibm.com
Cc: axboe@kernel.dk
Cc: peterz@infradead.org
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Link: http://lkml.kernel.org/r/1338275765-3217-7-git-send-email-yong.zhang0@gmail.com
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Dell Precision M6600 is known to require PCI reboot, so add it to
the reboot blacklist in pci_reboot_dmi_table[].
https://bugzilla.kernel.org/show_bug.cgi?id=42749
cc: x86@kernel.org
Signed-off-by: Zhang Rui <rui.zhang@intel.com>
Signed-off-by: Len Brown <len.brown@intel.com>
Pull straggler x86 fixes from Peter Anvin:
"Three groups of patches:
- EFI boot stub documentation and the ability to print error messages;
- Removal for PTRACE_ARCH_PRCTL for x32 (obsolete interface which
should never have been ported, and the port is broken and
potentially dangerous.)
- ftrace stack corruption fixes. I'm not super-happy about the
technical implementation, but it is probably the least invasive in
the short term. In the future I would like a single method for
nesting the debug stack, however."
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32
x86, efi: Add EFI boot stub documentation
x86, efi; Add EFI boot stub console support
x86, efi: Only close open files in error path
ftrace/x86: Do not change stacks in DEBUG when calling lockdep
x86: Allow nesting of the debug stack IDT setting
x86: Reset the debug_stack update counter
ftrace: Use breakpoint method to update ftrace caller
ftrace: Synchronize variable setting with breakpoints
When I added x32 ptrace to the 3.4 kernel, I also included PTRACE_ARCH_PRCTL
support for x32 GDB. For ARCH_GET_FS/GS, it takes a pointer to int64. But
at user level, ARCH_GET_FS/GS takes a pointer to int32. So I had to add
x32 ptrace to glibc to handle it with a temporary int64 passed to the kernel and
copy it back to GDB as int32. Roland suggested that PTRACE_ARCH_PRCTL
is obsolete and x32 GDB should use fs_base and gs_base fields of
user_regs_struct instead.
Accordingly, remove PTRACE_ARCH_PRCTL completely from the x32 code to
avoid possible memory overrun when pointer to int32 is passed to
kernel.
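A userspace model of the size mismatch, under the assumption that the kernel side always stores 8 bytes. The sk_* names are hypothetical. The first helper shows the temporary-int64 workaround described above; the second shows how an 8-byte store aimed at a 4-byte field spills into its neighbour.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the kernel side: always stores 8 bytes at dst. */
void sk_kernel_get_fs(void *dst, uint64_t fs_base)
{
	memcpy(dst, &fs_base, sizeof(fs_base));
}

/* Workaround: let the kernel fill a temporary int64, then narrow it. */
uint32_t sk_get_fs_x32(uint64_t fs_base)
{
	uint64_t tmp;

	sk_kernel_get_fs(&tmp, fs_base);
	return (uint32_t)tmp;
}

/*
 * Overrun demo: the caller only wants 'val' (4 bytes) filled in, but
 * the 8-byte store also overwrites 'guard'. The struct is 8 bytes, so
 * the demo itself stays within bounds. Returns 1 if guard was clobbered.
 */
int sk_overrun_clobbers_neighbour(void)
{
	struct {
		uint32_t val;
		uint32_t guard;
	} s = { 0, 0xdeadbeefu };

	sk_kernel_get_fs(&s, UINT64_C(0x1111111122222222));
	return s.guard != 0xdeadbeefu;
}
```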
Link: http://lkml.kernel.org/r/CAMe9rOpDzHfS7NH7m1vmD9QRw8SSj4Sc%2BaNOgcWm_WJME2eRsQ@mail.gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Cc: <stable@vger.kernel.org> v3.4
If we end up calling do_notify_resume() with !user_mode(regs), it
does nothing (do_signal() explicitly bails out and we can't get there
with TIF_NOTIFY_RESUME in such situations). Then we jump to
resume_userspace_sig, which rechecks the same thing and bails out
to resume_kernel, thus breaking the loop.
It's easier and cheaper to check *before* calling do_notify_resume()
and bail out to resume_kernel immediately. And kill the check in
do_signal()...
Note that on amd64 we can't get there with !user_mode() at all - asm
glue takes care of that.
Acked-and-reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Does block_sigmask() + tracehook_signal_handler(); called when
sigframe has been successfully built. All architectures converted
to it; block_sigmask() itself is gone now (merged into this one).
I'm still not too happy with the signature, but that's a separate
story (IMO we need a structure that would contain signal number +
siginfo + k_sigaction, so that get_signal_to_deliver() would fill one,
signal_delivered(), handle_signal() and probably setup...frame() -
take one).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only 3 out of 63 do not. Renamed the current variant to __set_current_blocked(),
added set_current_blocked() that will exclude unblockable signals, switched
open-coded instances to it.
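A sketch of the split described above, using a 64-bit word as the signal mask. The signal numbers are the x86 Linux values for SIGKILL (9) and SIGSTOP (19); the sk_* names are hypothetical and the real helpers operate on sigset_t under the appropriate locking.

```c
#include <assert.h>
#include <stdint.h>

#define SK_SIGKILL 9
#define SK_SIGSTOP 19
#define sk_sigmask(sig) (UINT64_C(1) << ((sig) - 1))

/* Stand-in for the task's ->blocked mask. */
uint64_t sk_blocked;

/* Raw variant: installs the mask exactly as given. */
void sk_raw_set_current_blocked(uint64_t mask)
{
	sk_blocked = mask;
}

/* Public variant: strips the unblockable signals first. */
void sk_set_current_blocked(uint64_t mask)
{
	mask &= ~(sk_sigmask(SK_SIGKILL) | sk_sigmask(SK_SIGSTOP));
	sk_raw_set_current_blocked(mask);
}
```

Callers that used to open-code the mask filtering can then call the public variant and drop their own copy of it.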
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
replace boilerplate "should we use ->saved_sigmask or ->blocked?"
with calls of obvious inlined helper...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
first fruits of ..._restore_sigmask() helpers: now we can take
boilerplate "signal didn't have a handler, clear RESTORE_SIGMASK
and restore the blocked mask from ->saved_mask" into a common
helper. Open-coded instances switched...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When both DYNAMIC_FTRACE and LOCKDEP are set, the TRACE_IRQS_ON/OFF
will call into the lockdep code. The lockdep code can call lots of
functions that may be traced by ftrace. When ftrace is updating its
code and hits a breakpoint, the breakpoint handler will call into
lockdep. If lockdep happens to call a function that also has a breakpoint
attached, it will jump back into the breakpoint handler resetting
the stack to the debug stack and corrupt the contents currently on
that stack.
The 'do_sym' call that calls do_int3() is protected by modifying the
IST table to point to a different location if another breakpoint is
hit. But the TRACE_IRQS_OFF/ON are outside that protection, and if
a breakpoint is hit from those, the stack will get corrupted, and
the kernel will crash:
[ 1013.243754] BUG: unable to handle kernel NULL pointer dereference at 0000000000000002
[ 1013.272665] IP: [<ffff880145cc0000>] 0xffff880145cbffff
[ 1013.285186] PGD 1401b2067 PUD 14324c067 PMD 0
[ 1013.298832] Oops: 0010 [#1] PREEMPT SMP
[ 1013.310600] CPU 2
[ 1013.317904] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables crc32c_intel ghash_clmulni_intel microcode usb_debug serio_raw pcspkr iTCO_wdt i2c_i801 iTCO_vendor_support e1000e nfsd nfs_acl auth_rpcgss lockd sunrpc i915 video i2c_algo_bit drm_kms_helper drm i2c_core [last unloaded: scsi_wait_scan]
[ 1013.401848]
[ 1013.407399] Pid: 112, comm: kworker/2:1 Not tainted 3.4.0+ #30
[ 1013.437943] RIP: 8eb8:[<ffff88014630a000>] [<ffff88014630a000>] 0xffff880146309fff
[ 1013.459871] RSP: ffffffff8165e919:ffff88014780f408 EFLAGS: 00010046
[ 1013.477909] RAX: 0000000000000001 RBX: ffffffff81104020 RCX: 0000000000000000
[ 1013.499458] RDX: ffff880148008ea8 RSI: ffffffff8131ef40 RDI: ffffffff82203b20
[ 1013.521612] RBP: ffffffff81005751 R08: 0000000000000000 R09: 0000000000000000
[ 1013.543121] R10: ffffffff82cdc318 R11: 0000000000000000 R12: ffff880145cc0000
[ 1013.564614] R13: ffff880148008eb8 R14: 0000000000000002 R15: ffff88014780cb40
[ 1013.586108] FS: 0000000000000000(0000) GS:ffff880148000000(0000) knlGS:0000000000000000
[ 1013.609458] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1013.627420] CR2: 0000000000000002 CR3: 0000000141f10000 CR4: 00000000001407e0
[ 1013.649051] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1013.670724] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1013.692376] Process kworker/2:1 (pid: 112, threadinfo ffff88013fe0e000, task ffff88014020a6a0)
[ 1013.717028] Stack:
[ 1013.724131] ffff88014780f570 ffff880145cc0000 0000400000004000 0000000000000000
[ 1013.745918] cccccccccccccccc ffff88014780cca8 ffffffff811072bb ffffffff81651627
[ 1013.767870] ffffffff8118f8a7 ffffffff811072bb ffffffff81f2b6c5 ffffffff81f11bdb
[ 1013.790021] Call Trace:
[ 1013.800701] Code: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a <e7> d7 64 81 ff ff ff ff 01 00 00 00 00 00 00 00 65 d9 64 81 ff
[ 1013.861443] RIP [<ffff88014630a000>] 0xffff880146309fff
[ 1013.884466] RSP <ffff88014780f408>
[ 1013.901507] CR2: 0000000000000002
The solution was to reuse the NMI functions that change the IDT table to make the debug
stack keep its current stack (in kernel mode) when hitting a breakpoint:
call debug_stack_set_zero
TRACE_IRQS_ON
call debug_stack_reset
If the TRACE_IRQS_ON happens to hit a breakpoint then it will keep the current stack
and not crash the box.
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the NMI handler runs, it checks if it preempted a debug handler
and if that handler is using the debug stack. If it is, it changes the
IDT table not to update the stack, otherwise it will reset the debug
stack and corrupt the debug handler it preempted.
Now that ftrace uses breakpoints to change functions from nops to
callers, many more places may hit a breakpoint. Unfortunately this
includes some of the calls that lockdep performs, which causes issues
with the debug stack. It too needs to change the debug stack before
tracing (if called from the debug handler).
Allow the debug_stack_set_zero() and debug_stack_reset() to be nested
so that the debug handlers can take advantage of them too.
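Nesting can be sketched with a use counter: every set increments it, and only the reset that brings it back to zero restores the normal IDT. This is a single-threaded userspace model with hypothetical sk_* names; the kernel keeps the counter per cpu.

```c
#include <assert.h>

/* Nesting depth of sk_debug_stack_set_zero() calls. */
int sk_debug_stack_use_ctr;

/* 1 while the "keep the current stack" IDT is loaded, 0 otherwise. */
int sk_alt_idt_loaded;

void sk_debug_stack_set_zero(void)
{
	sk_debug_stack_use_ctr++;
	sk_alt_idt_loaded = 1;
}

void sk_debug_stack_reset(void)
{
	/* Unbalanced reset: nothing to undo. */
	if (sk_debug_stack_use_ctr <= 0)
		return;
	/* Only the outermost reset switches back to the normal IDT. */
	if (--sk_debug_stack_use_ctr == 0)
		sk_alt_idt_loaded = 0;
}
```

An inner set/reset pair, such as one issued from a debug handler while an NMI already holds the alternate IDT, therefore leaves the outer user's state intact.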
[ Used this_cpu_*() over __get_cpu_var() as suggested by H. Peter Anvin ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When an NMI goes off and it sees that it preempted the debug stack,
to keep the debug stack safe, it changes the IDT to point to one that
does not modify the stack on breakpoint (to allow breakpoints in NMIs).
But the variable that gets set to know to undo it on exit never gets
cleared on exit. Thus, once it has been set for the first time, every
later NMI will reset the IDT on exit even if it does not need to be reset.
[ Added H. Peter Anvin's suggestion to use this_cpu_read/write ]
Cc: <stable@vger.kernel.org> # v3.3
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
On boot up and module load, it is fine to modify the code directly,
without the use of breakpoints. This is because boot up modification
is done before SMP is initialized, thus the modification is serial,
and module load is done before the module executes.
But after that we must use a SMP safe method to modify running code.
Otherwise, if we are running the function tracer and update its
function (by starting off the stack tracer, or perf tracing)
the change of the function called by the ftrace trampoline is done
directly. If this is being executed on another CPU, that CPU may
take a GPF and crash the kernel.
The breakpoint method is used to change the nops at all the functions, but
the change of the ftrace callback handler itself was still using a
direct modification. If tracing was enabled and the function callback
was changed then another CPU could fault if it was currently calling
the original callback. This modification must use the breakpoint method
too.
Note, the direct method is still used for boot up and module load.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
When the function tracer starts modifying the code via breakpoints
it sets a variable (modifying_ftrace_code) to inform the breakpoint
handler to call the ftrace int3 code.
But there's no synchronization between setting this code and the
handler, thus it is possible for the handler to be called on another
CPU before it sees the variable. This will cause a kernel crash as
the int3 handler will not know what to do with it.
I originally added smp_mb()'s to force the visibility of the variable
but H. Peter Anvin suggested that I just make it atomic.
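A C11 sketch of the "make it atomic" approach, assuming the flag is used as a counter that the breakpoint handler reads. The sk_* names are hypothetical, and userspace <stdatomic.h> stands in for the kernel's atomic_t API.

```c
#include <assert.h>
#include <stdatomic.h>

/* Non-zero while code modification via breakpoints is in progress. */
atomic_int sk_modifying_code;

/* Called by the updater before it plants the first breakpoint. */
void sk_begin_modify(void)
{
	atomic_fetch_add(&sk_modifying_code, 1);
}

/* Called by the updater after the last breakpoint is removed. */
void sk_end_modify(void)
{
	atomic_fetch_sub(&sk_modifying_code, 1);
}

/* Called from the breakpoint handler, possibly on another CPU. */
int sk_int3_belongs_to_ftrace(void)
{
	return atomic_load(&sk_modifying_code) != 0;
}
```

The atomic read-modify-write operations here use sequentially consistent ordering by default, so a handler that runs after the breakpoint was planted also observes the incremented counter.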
[ Added comments as suggested by Peter Zijlstra ]
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull second pile of signal handling patches from Al Viro:
"This one is just task_work_add() series + remaining prereqs for it.
There probably will be another pull request from that tree this
cycle - at least for helpers, to get them out of the way for per-arch
fixes remaining in the tree."
Fix trivial conflict in kernel/irq/manage.c: the merge of Andrew's pile
had brought in commit 97fd75b7b8 ("kernel/irq/manage.c: use the
pr_foo() infrastructure to prefix printks") which changed one of the
pr_err() calls that this merge moves around.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
keys: kill task_struct->replacement_session_keyring
keys: kill the dummy key_replace_session_keyring()
keys: change keyctl_session_to_parent() to use task_work_add()
genirq: reimplement exit_irq_thread() hook via task_work_add()
task_work_add: generic process-context callbacks
avr32: missed _TIF_NOTIFY_RESUME on one of do_notify_resume callers
parisc: need to check NOTIFY_RESUME when exiting from syscall
move key_repace_session_keyring() into tracehook_notify_resume()
TIF_NOTIFY_RESUME is defined on all targets now
Use unsigned long for dealing with jiffies not int. Rename the
callback to something sensible. Use __this_cpu_read/write for
accessing per cpu data.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
When booting on a Sun G5+ with 4T of memory, an overflow is seen in mtrr
cleanup, as below:
*BAD*gran_size: 2G chunk_size: 2G num_reg: 10 lose cover RAM: -18014398505283592M
This is because 1<<31 is sign extended. Use an unsigned long constant to
fix it. Useful for memory sizes larger than or equal to 4T.
-v2: Use 64bit constant instead of explicit type conversion as suggested
by Yinghai. Description updated too.
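The sign extension is easy to demonstrate in isolation. The sketch below starts from INT32_MIN, which has the bit pattern 0x80000000 that 1<<31 yields as an int on x86, and widens it to 64 bits; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* What an int holding bit 31 turns into when widened to 64 bits. */
uint64_t sk_widen_int_bit31(void)
{
	int32_t v = INT32_MIN;        /* bit pattern 0x80000000 */

	return (uint64_t)(int64_t)v;  /* sign extended: upper 32 bits all set */
}

/* The fix: build the constant in a 64-bit unsigned type from the start. */
uint64_t sk_unsigned_bit31(void)
{
	return UINT64_C(1) << 31;
}
```

The first helper returns 0xffffffff80000000 and the second returns 0x80000000; only the second is usable as a 2G size in 64-bit range arithmetic.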
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: http://lkml.kernel.org/r/4FC5A77F.6060505@oracle.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
Commit 8e7fbcbc2 ("sched: Remove stale power aware scheduling
remnants and dysfunctional knobs") made a boo-boo with removing the
power aware scheduling muck from the x86 topology bits.
We should unconditionally use the llc_shared mask for multi-core.
Reported-and-tested-by: Mike Galbraith <efault@gmx.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Borislav Petkov <bp@amd64.org>
Cc: Andreas Herrmann <andreas.herrmann3@amd.com>
Link: http://lkml.kernel.org/n/tip-lsksc2kfyeveb13avh327p0d@git.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 trampoline rework from H. Peter Anvin:
"This code reworks all the "trampoline"/"realmode" code (various bits
that need to live in the first megabyte of memory, most but not all of
which runs in real mode at some point) in the kernel into a single
object. The main reason for doing this is that it eliminates the last
place in the kernel where we needed pages to be mapped RWX. This code
separates all that code into proper R/RW/RX pages."
Fix up conflicts in arch/x86/kernel/Makefile (mca removed next to reboot
code), and arch/x86/kernel/reboot.c (reboot code moved around in one
branch, modified in this one), and arch/x86/tools/relocs.c (mostly same
code came in earlier due to working around the ld bugs just before the
3.4 release).
Also remove stale x86-relocs entry from scripts/.gitignore as per Peter
Anvin.
* commit '61f5446169046c217a5479517edac3a890c3bee7': (36 commits)
x86, realmode: Move end signature into header.S
x86, relocs: When printing an error, say relative or absolute
x86, relocs: More relocations which may end up as absolute
x86, relocs: Workaround for binutils 2.22.52.0.1 section bug
xen-acpi-processor: Add missing #include <xen/xen.h>
acpi, bgrd: Add missing <linux/io.h> to drivers/acpi/bgrt.c
x86, realmode: Change EFER to a single u64 field
x86, realmode: Move kernel/realmode.c to realmode/init.c
x86, realmode: Move not-common bits out of trampoline_common.S
x86, realmode: Mask out EFER.LMA when saving trampoline EFER
x86, realmode: Fix no cache bits test in reboot_32.S
x86, realmode: Make sure all generated files are listed in targets
x86, realmode: build fix: remove duplicate build
x86, realmode: read cr4 and EFER from kernel for 64-bit trampoline
x86, realmode: fixes compilation issue in tboot.c
x86, realmode: move relocs from scripts/ to arch/x86/tools
x86, realmode: header for trampoline code
x86, realmode: flattened rm hierachy
x86, realmode: don't copy real_mode_header
x86, realmode: fix 64-bit wakeup sequence
...
Merge tag 'x86-mce-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull x86/mce merge window patches from Tony Luck:
"Including two that make error_context() checks less sucky"
* tag 'x86-mce-merge' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
x86/mce: Add instruction recovery signatures to mce-severity table
x86/mce: Fix check for processor context when machine check was taken.
MCE: Fix vm86 handling for 32bit mce handler
x86/mce Add validation check before GHES error is recorded
x86/mce: Avoid reading every machine check bank register twice.
Pull CMA and ARM DMA-mapping updates from Marek Szyprowski:
"These patches contain two major updates for DMA mapping subsystem
(mainly for ARM architecture). First one is Contiguous Memory
Allocator (CMA) which makes it possible for device drivers to allocate
big contiguous chunks of memory after the system has booted.
The main difference from similar frameworks is that CMA allows the
memory region reserved for the big chunk allocation to be transparently
reused as system memory, so no memory is wasted when no
big chunk is allocated. Once the alloc request is issued, the
framework migrates system pages to create space for the required big
chunk of physically contiguous memory.
For more information one can refer to nice LWN articles:
- 'A reworked contiguous memory allocator':
http://lwn.net/Articles/447405/
- 'CMA and ARM':
http://lwn.net/Articles/450286/
- 'A deep dive into CMA':
http://lwn.net/Articles/486301/
- and the following thread with the patches and links to all previous
versions:
https://lkml.org/lkml/2012/4/3/204
The main client for this new framework is ARM DMA-mapping subsystem.
The second part provides a complete redesign in ARM DMA-mapping
subsystem. The core implementation has been changed to use common
struct dma_map_ops based infrastructure with the recent updates for
new dma attributes merged in v3.4-rc2. This allows using more than
one implementation of dma-mapping calls and changing/selecting them on a
per struct device basis. The first client of this new infrastructure is
dmabounce implementation which has been completely cut out of the
core, common code.
The last patch of this redesign update introduces a new, experimental
implementation of dma-mapping calls on top of generic IOMMU framework.
This lets an ARM sub-platform transparently use an IOMMU for DMA-mapping
calls if it provides the required IOMMU hardware.
For more information please refer to the following thread:
http://www.spinics.net/lists/arm-kernel/msg175729.html
The last patch merges changes from both updates and provides a
resolution for the conflicts which cannot be avoided when patches have
been applied on the same files (mainly arch/arm/mm/dma-mapping.c)."
Acked by Andrew Morton <akpm@linux-foundation.org>:
"Yup, this one please. It's had much work, plenty of review and I
think even Russell is happy with it."
* 'for-linus' of git://git.linaro.org/people/mszyprowski/linux-dma-mapping: (28 commits)
ARM: dma-mapping: use PMD size for section unmap
cma: fix migration mode
ARM: integrate CMA with DMA-mapping subsystem
X86: integrate CMA with DMA-mapping subsystem
drivers: add Contiguous Memory Allocator
mm: trigger page reclaim in alloc_contig_range() to stabilise watermarks
mm: extract reclaim code from __alloc_pages_direct_reclaim()
mm: Serialize access to min_free_kbytes
mm: page_isolation: MIGRATE_CMA isolation functions added
mm: mmzone: MIGRATE_CMA migration type added
mm: page_alloc: change fallbacks array handling
mm: page_alloc: introduce alloc_contig_range()
mm: compaction: export some of the functions
mm: compaction: introduce isolate_freepages_range()
mm: compaction: introduce map_pages()
mm: compaction: introduce isolate_migratepages_range()
mm: page_alloc: remove trailing whitespace
ARM: dma-mapping: add support for IOMMU mapper
ARM: dma-mapping: use alloc, mmap, free from dma_ops
ARM: dma-mapping: remove redundant code and do the cleanup
...
Conflicts:
arch/x86/include/asm/dma-mapping.h
Pull KVM changes from Avi Kivity:
"Changes include additional instruction emulation, page-crossing MMIO,
faster dirty logging, preventing the watchdog from killing a stopped
guest, module autoload, a new MSI ABI, and some minor optimizations
and fixes. Outside x86 we have a small s390 and a very large ppc
update.
Regarding the new (for kvm) rebaseless workflow, some of the patches
that were merged before we switched trees had to be rebased, while
others are true pulls. In either case the signoffs should be correct
now."
Fix up trivial conflicts in Documentation/feature-removal-schedule.txt
arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h.
I suspect the kvm_para.h resolution ends up doing the "do I have cpuid"
check effectively twice (it was done differently in two different
commits), but better safe than sorry ;)
* 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits)
KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block
KVM: s390: onereg for timer related registers
KVM: s390: epoch difference and TOD programmable field
KVM: s390: KVM_GET/SET_ONEREG for s390
KVM: s390: add capability indicating COW support
KVM: Fix mmu_reload() clash with nested vmx event injection
KVM: MMU: Don't use RCU for lockless shadow walking
KVM: VMX: Optimize %ds, %es reload
KVM: VMX: Fix %ds/%es clobber
KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()
KVM: VMX: unlike vmcs on fail path
KVM: PPC: Emulator: clean up SPR reads and writes
KVM: PPC: Emulator: clean up instruction parsing
kvm/powerpc: Add new ioctl to retreive server MMU infos
kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM
KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler
KVM: PPC: Book3S: Enable IRQs during exit handling
KVM: PPC: Fix PR KVM on POWER7 bare metal
KVM: PPC: Fix stbux emulation
KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields
...
The interrupt chip irq_set_affinity() functions copy the affinity mask
to irq_data->affinity but return 0, i.e. IRQ_SET_MASK_OK.
IRQ_SET_MASK_OK causes the core code to do another redundant copy.
Return IRQ_SET_MASK_OK_NOCOPY to avoid this.
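A toy model of the redundant copy, written in Python rather than kernel C. The IrqData class, its copy counter and both helper functions are hypothetical; only the two return-code names and the value 0 for IRQ_SET_MASK_OK come from the text above, and the value 1 for IRQ_SET_MASK_OK_NOCOPY is an assumption of this sketch.

```python
IRQ_SET_MASK_OK = 0          # value stated in the commit message
IRQ_SET_MASK_OK_NOCOPY = 1   # assumed value for this sketch


class IrqData:
    """Hypothetical stand-in for struct irq_data."""

    def __init__(self):
        self.affinity = frozenset()
        self.copies = 0  # counts writes to .affinity so redundancy is visible

    def store_affinity(self, mask):
        self.affinity = frozenset(mask)
        self.copies += 1


def chip_set_affinity(data, mask, nocopy):
    """Chip callback: it always stores the mask itself."""
    data.store_affinity(mask)
    return IRQ_SET_MASK_OK_NOCOPY if nocopy else IRQ_SET_MASK_OK


def core_set_affinity(data, mask, nocopy):
    """Core helper: copies the mask again only on IRQ_SET_MASK_OK."""
    ret = chip_set_affinity(data, mask, nocopy)
    if ret == IRQ_SET_MASK_OK:
        data.store_affinity(mask)  # second, redundant copy
    return ret
```

With nocopy=False the mask is written twice; with nocopy=True it is written once and the end state is identical.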
Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Cc: Naga Chumbalkar <nagananda.chumbalkar@hp.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Cliff Wickman <cpw@sgi.com>
Cc: Jiang Liu <liuj97@gmail.com>
Cc: Keping Chen <chenkeping@huawei.com>
Link: http://lkml.kernel.org/r/1333120296-13563-4-git-send-email-jiang.liu@huawei.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
This reverts commit cf8ff6b6ab.
Just found this commit is a functional duplication of commit 6b617e22
"x86/platform: Add a wallclock_init func to x86_init.timers ops".
Let's revert it and sorry for the noise.
Signed-off-by: Feng Tang <feng.tang@intel.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Alan Cox <alan@linux.intel.com>
Cc: Dirk Brandewie <dirk.brandewie@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Pull timer updates from Thomas Gleixner.
Various trivial conflict fixups in arch Kconfig due to addition of
unrelated entries nearby. And one slightly more subtle one for sparc32
(new user of GENERIC_CLOCKEVENTS), fixed up as per Thomas.
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
timekeeping: Fix a few minor newline issues.
time: remove obsolete declaration
ntp: Fix a stale comment and a few stray newlines.
ntp: Correct TAI offset during leap second
timers: Fixup the Kconfig consolidation fallout
x86: Use generic time config
unicore32: Use generic time config
um: Use generic time config
tile: Use generic time config
sparc: Use: generic time config
sh: Use generic time config
score: Use generic time config
s390: Use generic time config
openrisc: Use generic time config
powerpc: Use generic time config
mn10300: Use generic time config
mips: Use generic time config
microblaze: Use generic time config
m68k: Use generic time config
m32r: Use generic time config
...
Pull user-space probe instrumentation from Ingo Molnar:
"The uprobes code originates from SystemTap and has been used for years
in Fedora and RHEL kernels. This version is much rewritten, reviews
from PeterZ, Oleg and myself shaped the end result.
This tree includes uprobes support in 'perf probe' - but SystemTap
(and other tools) can take advantage of user probe points as well.
Sample usage of uprobes via perf, for example to profile malloc()
calls without modifying user-space binaries.
First boot a new kernel with CONFIG_UPROBE_EVENT=y enabled.
If you don't know which function you want to probe you can pick one
from 'perf top' or can get a list of all functions that can be probed
within libc (binaries can be specified as well):
$ perf probe -F -x /lib/libc.so.6
To probe libc's malloc():
$ perf probe -x /lib64/libc.so.6 malloc
Added new event:
probe_libc:malloc (on 0x7eac0)
You can now use it in all perf tools, such as:
perf record -e probe_libc:malloc -aR sleep 1
Make use of it to create a call graph (as the flat profile is going to
look very boring):
$ perf record -e probe_libc:malloc -gR make
[ perf record: Woken up 173 times to write data ]
[ perf record: Captured and wrote 44.190 MB perf.data (~1930712
$ perf report | less
32.03% git libc-2.15.so [.] malloc
|
--- malloc
29.49% cc1 libc-2.15.so [.] malloc
|
--- malloc
|
|--0.95%-- 0x208eb1000000000
|
|--0.63%-- htab_traverse_noresize
11.04% as libc-2.15.so [.] malloc
|
--- malloc
|
7.15% ld libc-2.15.so [.] malloc
|
--- malloc
|
5.07% sh libc-2.15.so [.] malloc
|
--- malloc
|
4.99% python-config libc-2.15.so [.] malloc
|
--- malloc
|
4.54% make libc-2.15.so [.] malloc
|
--- malloc
|
|--7.34%-- glob
| |
| |--93.18%-- 0x41588f
| |
| --6.82%-- glob
| 0x41588f
...
Or:
$ perf report -g flat | less
# Overhead Command Shared Object Symbol
# ........ ............. ............. ..........
#
32.03% git libc-2.15.so [.] malloc
27.19%
malloc
29.49% cc1 libc-2.15.so [.] malloc
24.77%
malloc
11.04% as libc-2.15.so [.] malloc
11.02%
malloc
7.15% ld libc-2.15.so [.] malloc
6.57%
malloc
...
The core uprobes design is fairly straightforward: uprobes probe
points register themselves at (inode:offset) addresses of
libraries/binaries, after which all existing (or new) vmas that map
that address will have a software breakpoint injected at that address.
vmas are COW-ed to preserve original content. The probe points are
kept in an rbtree.
If user-space executes the probed inode:offset instruction address
then an event is generated which can be recovered from the regular
perf event channels and mmap-ed ring-buffer.
Multiple probes at the same address are supported, they create a
dynamic callback list of event consumers.
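A rough Python model of the registration scheme just described, assuming nothing about the kernel's actual data layout: the UprobeRegistry class and its method names are hypothetical, and a dict replaces the rbtree for brevity.

```python
from collections import defaultdict


class UprobeRegistry:
    """Toy model: probe points keyed by (inode, offset), each with consumers."""

    def __init__(self):
        # The kernel keeps probe points in an rbtree; a dict is used here.
        self._probes = defaultdict(list)

    def register(self, inode, offset, consumer):
        """Add a consumer; several may share one (inode, offset) address."""
        self._probes[(inode, offset)].append(consumer)

    def unregister(self, inode, offset, consumer):
        key = (inode, offset)
        self._probes[key].remove(consumer)
        if not self._probes[key]:
            del self._probes[key]  # last consumer gone: drop the probe point

    def hit(self, inode, offset):
        """Breakpoint at inode:offset fired; run every consumer in order."""
        consumers = self._probes.get((inode, offset), [])
        return [consumer(inode, offset) for consumer in consumers]
```

Keying on (inode, offset) rather than on a virtual address is what lets one registration cover every process that maps the same file.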
The basic model is further complicated by the XOL speedup: the
original instruction that is probed is copied (in an architecture
specific fashion) and executed out of line when the probe triggers.
The XOL area is a single vma per process, with a fixed number of
entries (which limits probe execution parallelism).
The API: uprobes are installed/removed via
/sys/kernel/debug/tracing/uprobe_events, the API is integrated to
align with the kprobes interface as much as possible, but is separate
to it.
Injecting a probe point is a privileged operation, which can be relaxed
by setting perf_paranoid to -1.
You can use multiple probes as well and mix them with kprobes and
regular PMU events or tracepoints, when instrumenting a task."
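For the uprobe_events file mentioned above, a definition line can be built and appended as sketched here. The 'p:GROUP/EVENT PATH:OFFSET' syntax is an assumption based on the kprobes-style interface the text refers to, and both helper functions are hypothetical; the library path, offset and event name are the ones from the perf probe example.

```python
UPROBE_EVENTS = "/sys/kernel/debug/tracing/uprobe_events"  # path from the text


def uprobe_definition(group, event, path, offset):
    """Build a 'p:GROUP/EVENT PATH:OFFSET' line (syntax assumed, see above)."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return "p:{}/{} {}:{:#x}".format(group, event, path, offset)


def install_uprobe(definition, events_file=UPROBE_EVENTS):
    """Append one definition; needs root and CONFIG_UPROBE_EVENT=y."""
    with open(events_file, "a") as f:
        f.write(definition + "\n")


if __name__ == "__main__":
    # Only prints the line; call install_uprobe() to actually register it.
    print(uprobe_definition("probe_libc", "malloc", "/lib64/libc.so.6", 0x7EAC0))
```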
Fix up trivial conflicts in mm/memory.c due to previous cleanup of
unmap_single_vma().
* 'perf-uprobes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
perf probe: Detect probe target when m/x options are absent
perf probe: Provide perf interface for uprobes
tracing: Fix kconfig warning due to a typo
tracing: Provide trace events interface for uprobes
tracing: Extract out common code for kprobes/uprobes trace events
tracing: Modify is_delete, is_return from int to bool
uprobes/core: Decrement uprobe count before the pages are unmapped
uprobes/core: Make background page replacement logic account for rss_stat counters
uprobes/core: Optimize probe hits with the help of a counter
uprobes/core: Allocate XOL slots for uprobes use
uprobes/core: Handle breakpoint and singlestep exceptions
uprobes/core: Rename bkpt to swbp
uprobes/core: Make order of function parameters consistent across functions
uprobes/core: Make macro names consistent
uprobes: Update copyright notices
uprobes/core: Move insn to arch specific structure
uprobes/core: Remove uprobe_opcode_sz
uprobes/core: Make instruction tables volatile
uprobes: Move to kernel/events/
uprobes/core: Clean up, refactor and improve the code
...
Pull first series of signal handling cleanups from Al Viro:
"This is just the first part of the queue (about a half of it);
assorted fixes all over the place in signal handling.
This one ends with all sigsuspend() implementations switched to
generic one (->saved_sigmask-based).
With this, a bunch of assorted old buglets are fixed and most of the
missing bits of NOTIFY_RESUME hookup are in place. Two more fixes sit
in arm and um trees respectively, and there's a couple of broken ones
that need obvious fixes - parisc and avr32 check TIF_NOTIFY_RESUME
only on one of two codepaths; fixes for that will happen in the next
series"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: (55 commits)
unicore32: if there's no handler we need to restore sigmask, syscall or no syscall
xtensa: add handling of TIF_NOTIFY_RESUME
microblaze: drop 'oldset' argument of do_notify_resume()
microblaze: handle TIF_NOTIFY_RESUME
score: add handling of NOTIFY_RESUME to do_notify_resume()
m68k: add TIF_NOTIFY_RESUME and handle it.
sparc: kill ancient comment in sparc_sigaction()
h8300: missing checks of __get_user()/__put_user() return values
frv: missing checks of __get_user()/__put_user() return values
cris: missing checks of __get_user()/__put_user() return values
powerpc: missing checks of __get_user()/__put_user() return values
sh: missing checks of __get_user()/__put_user() return values
sparc: missing checks of __get_user()/__put_user() return values
avr32: struct old_sigaction is never used
m32r: struct old_sigaction is never used
xtensa: xtensa_sigaction doesn't exist
alpha: tidy signal delivery up
score: don't open-code force_sigsegv()
cris: don't open-code force_sigsegv()
blackfin: don't open-code force_sigsegv()
...
Pull the MCA deletion branch from Paul Gortmaker:
"It was good that we could support MCA machines back in the day, but
realistically, nobody is using them anymore. They were mostly limited
to 386-sx 16MHz CPU and some 486 class machines and never more than
64MB of RAM. Even the enthusiast hobbyist community seems to have
dried up close to ten years ago, based on what you can find searching
various websites dedicated to the relatively short lived hardware.
So let's remove the support relating to CONFIG_MCA. There is no point
carrying this forward, wasting cycles doing routine maintenance on it;
wasting allyesconfig build time on validating it, wasting I/O on git
grep'ping over it, and so on."
Let's see if anybody screams. It generally has compiled, and James
Bottomley pointed out that there was a MCA extension from NCR that
allowed for up to 4GB of memory and PPro-class machines. So in *theory*
there may be users out there.
But even James (technically listed as a maintainer) doesn't actually
have a system, and while Alan Cox claims to have a machine in his cellar
that he offered to anybody who wants to take it off his hands, he didn't
argue for keeping MCA support either.
So we could bring it back. But somebody had better speak up and talk
about how they have actually been using said MCA hardware with modern
kernels for us to do that. And David already took the patch to delete
all the networking driver code (commit a5e371f61a: "drivers/net:
delete all code/drivers depending on CONFIG_MCA").
* 'delete-mca' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux:
MCA: delete all remaining traces of microchannel bus support.
scsi: delete the MCA specific drivers and driver code
serial: delete the MCA specific 8250 support.
arm: remove ability to select CONFIG_MCA
Instruction recovery cases are very similar to the data recovery one
we already have. Just trade out for a new MCACOD value.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Linus pointed out that there was no value in checking whether m->ip
was zero - because zero is a legitimate value. If we have a reliable
(or faked in the VM86 case) "m->cs" we can use it to tell whether we
were in user mode or kernel when the machine check hit.
Reported-by: Linus Torvalds <torvalds@linuxfoundation.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
When running on 32bit the mce handler could misinterpret
vm86 mode as ring 0. This can affect whether it does recovery
or not; it was possible to panic when recovery was actually
possible.
Fix this by always forcing vm86 to look like ring 3.
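The idea can be sketched as follows, in Python for brevity; this is not the patch itself. X86_EFLAGS_VM is bit 17 of EFLAGS, while the function names and the sample selector values are assumptions for illustration.

```python
X86_EFLAGS_VM = 1 << 17   # virtual-8086 mode flag in EFLAGS
SEGMENT_RPL_MASK = 0x3    # low two bits of a selector hold the privilege level


def effective_cs(cs, eflags):
    """Return the CS value the machine check code should reason about.

    In vm86 mode CS is a real-mode style segment, so its low bits say
    nothing about privilege; force them to 3 so it reads as user mode.
    """
    if eflags & X86_EFLAGS_VM:
        return cs | SEGMENT_RPL_MASK
    return cs


def in_user_mode(cs, eflags):
    """True when the (possibly adjusted) CS indicates ring 3."""
    return (effective_cs(cs, eflags) & SEGMENT_RPL_MASK) == 3
```

Without the adjustment, a vm86 segment such as 0x1000 has zero in its low two bits and would be mistaken for ring 0.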
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Pull perf fixes from Ingo Molnar:
- Leftover AMD PMU driver fix from the end of the v3.4
stabilization cycle.
- Late tools/perf/ changes that missed the first round:
* endianness fixes
* event parsing improvements
* libtraceevent fixes factored out from trace-cmd
* perl scripting engine fixes related to libtraceevent
* testcase improvements
* perf inject / pipe mode fixes
* plus a kernel side fix
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Update event scheduling constraints for AMD family 15h models
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Revert "sched, perf: Use a single callback into the scheduler"
perf evlist: Show event attribute details
perf tools: Bump default sample freq to 4 kHz
perf buildid-list: Work better with pipe mode
perf tools: Fix piped mode read code
perf inject: Fix broken perf inject -b
perf tools: rename HEADER_TRACE_INFO to HEADER_TRACING_DATA
perf tools: Add union u64_swap type for swapping u64 data
perf tools: Carry perf_event_attr bitfield throught different endians
perf record: Fix documentation for branch stack sampling
perf target: Add cpu flag to sample_type if target has cpu
perf tools: Always try to build libtraceevent
perf tools: Rename libparsevent to libtraceevent in Makefile
perf script: Rename struct event to struct event_format in perl engine
perf script: Explicitly handle known default print arg type
perf tools: Add hardcoded name term for pmu events
perf tools: Separate 'mem:' event scanner bits
perf tools: Use allocated list for each parsed event
perf tools: Add support for displaying event parser debug info
perf test: Move parse event automated tests to separated object
Pull x86 reboot changes from Ingo Molnar:
"The biggest change is a gentler method of rebooting/stopping via IRQs
first and then via NMIs. There are several cleanups in the tree as
well."
* 'x86-reboot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/reboot: Update nonmi_ipi parameter
x86/reboot: Use NMI to assist in shutting down if IRQ fails
Revert "x86, reboot: Use NMI instead of REBOOT_VECTOR to stop cpus"
x86/reboot: Clean up coding style
x86/reboot: Reduce to a single DMI table for reboot quirks
Pull x86 platform changes from Ingo Molnar:
"This tree includes assorted platform driver updates and a preparatory
series for a platform with custom DMA remapping semantics (sta2x11 I/O
hub)."
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/vsmp: Fix number of CPUs when vsmp is disabled
keyboard: Use BIOS Keyboard variable to set Numlock
x86/olpc/xo1/sci: Report RTC wakeup events
x86/olpc/xo1/sci: Produce wakeup events for buttons and switches
x86, platform: Initial support for sta2x11 I/O hub
x86: Introduce CONFIG_X86_DMA_REMAP
x86-32: Introduce CONFIG_X86_DEV_DMA_OPS
Pull MCE updates from Ingo Molnar:
"This tree updates/fixes MCE hardware support, it makes the APIC LVT
thresholding interrupt optional because a subset of AMD F15h models
don't support it."
* 'x86-mce-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, MCE, AMD: Disable error thresholding bank 4 on some models
x86, MCE, AMD: Hide interrupt_enable sysfs node
x86, MCE, AMD: Make APIC LVT thresholding interrupt optional
Pull fpu state cleanups from Ingo Molnar:
"This tree streamlines further aspects of FPU handling by eliminating
the prepare_to_copy() complication and moving that logic to
arch_dup_task_struct().
It also fixes the FPU dumps in threaded core dumps, removes an old
(and now invalid) assumption plus micro-optimizes the exit path by
avoiding an FPU save for dead tasks."
Fixed up trivial add-add conflict in arch/sh/kernel/process.c that came
in because we now do the FPU handling in arch_dup_task_struct() rather
than the legacy (and now gone) prepare_to_copy().
* 'x86-fpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, fpu: drop the fpu state during thread exit
x86, xsave: remove thread_has_fpu() bug check in __sanitize_i387_state()
coredump: ensure the fpu state is flushed for proper multi-threaded core dump
fork: move the real prepare_to_copy() users to arch_dup_task_struct()
Pull exception table generation updates from Ingo Molnar:
"The biggest change here is to allow the build-time sorting of the
exception table, to speed up booting. This is achieved by the
architecture enabling BUILDTIME_EXTABLE_SORT. This option is enabled
for x86 and MIPS currently.
On x86 a number of fixes and changes were needed to allow build-time
sorting of the exception table, in particular a relocation invariant
exception table format was needed. This required the abstracting out
of exception table protocol and the removal of 20 years of accumulated
assumptions about the x86 exception table format.
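To illustrate what a relocation-invariant table looks like, here is a small Python model. It assumes entries of two 32-bit fields, each storing the distance from the field's own address to its target; the function names and the addresses used are hypothetical, and the sort mirrors the general "make absolute, sort, make relative again" approach rather than any specific tool.

```python
import bisect

ENTRY_SIZE = 8  # two 32-bit fields per entry: insn offset, fixup offset


def encode(base, pairs):
    """Turn absolute (insn, fixup) pairs into self-relative table entries.

    Entry i lives at base + i * ENTRY_SIZE. Each field stores the distance
    from the field's own address to its target, so moving the whole image
    leaves the stored values unchanged.
    """
    table = []
    for i, (insn, fixup) in enumerate(pairs):
        addr = base + i * ENTRY_SIZE
        table.append((insn - addr, fixup - (addr + 4)))
    return table


def decode(base, table):
    """Recover absolute (insn, fixup) pairs from relative entries."""
    pairs = []
    for i, (rel_insn, rel_fixup) in enumerate(table):
        addr = base + i * ENTRY_SIZE
        pairs.append((addr + rel_insn, addr + 4 + rel_fixup))
    return pairs


def sort_table(base, table):
    """Build-time sort: go absolute, sort by faulting address, go relative."""
    return encode(base, sorted(decode(base, table)))


def search(base, table, fault_addr):
    """Binary-search a sorted table; return the fixup address or None."""
    pairs = decode(base, table)
    insns = [insn for insn, _ in pairs]
    i = bisect.bisect_left(insns, fault_addr)
    if i < len(insns) and insns[i] == fault_addr:
        return pairs[i][1]
    return None
```

Because an entry's stored offsets depend on its slot, entries cannot simply be swapped while sorting; converting to absolute addresses first avoids that pitfall.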
While at it, this tree also cleans up various other aspects of
exception handling, such as early(er) exception handling for
rdmsr_safe() et al.
All in all, as a result of these changes the x86 exception code is
now pretty nice and modern. As an added bonus any regressions in this
code will be early and violent crashes, so if you see any of those,
you'll know whom to blame!"
Fix up trivial conflicts in arch/{mips,x86}/Kconfig files due to nearby
modifications of other core architecture options.
* 'x86-extable-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (35 commits)
Revert "x86, extable: Disable presorted exception table for now"
scripts/sortextable: Handle relative entries, and other cleanups
x86, extable: Switch to relative exception table entries
x86, extable: Disable presorted exception table for now
x86, extable: Add _ASM_EXTABLE_EX() macro
x86, extable: Remove open-coded exception table entries in arch/x86/ia32/ia32entry.S
x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/xsave.h
x86, extable: Remove open-coded exception table entries in arch/x86/include/asm/kvm_host.h
x86, extable: Remove the now-unused __ASM_EX_SEC macros
x86, extable: Remove open-coded exception table entries in arch/x86/xen/xen-asm_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/um/checksum_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/usercopy_32.c
x86, extable: Remove open-coded exception table entries in arch/x86/lib/putuser.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/getuser.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/csum-copy_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_nocache_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/copy_user_64.S
x86, extable: Remove open-coded exception table entries in arch/x86/lib/checksum_32.S
x86, extable: Remove open-coded exception table entries in arch/x86/kernel/test_rodata.c
x86, extable: Remove open-coded exception table entries in arch/x86/kernel/entry_64.S
...
Pull x86/urgent branch from Ingo Molnar:
"These are the fixes left over from the very end of the v3.4
stabilization cycle, plus one more fix."
Ugh. Those KERN_CONT additions are just pointless. I think they came
as a reaction to some of the early (broken) printk() work - but that was
fixed before it was merged.
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, relocs: Build clean fix
x86, printk: Add missing KERN_CONT to NMI selftest
x86: Fix boot on Twinhead H12Y
Got bitten again by the BIT() macro:
arch/x86/kernel/cpu/mcheck/mce.c: In function '__mcheck_cpu_apply_quirks':
arch/x86/kernel/cpu/mcheck/mce.c:1453:6: warning: left shift count >= width of type
arch/x86/kernel/cpu/mcheck/mce.c:1454:7: warning: left shift count >= width of type
Fix it already.
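The warning can be modelled as below, assuming BIT(nr) expands to a shift of an unsigned long, which is 32 bits wide on 32-bit builds. This is a Python model of the width check only; in C an oversized shift is undefined behaviour rather than an exception, and bit_64() is a hypothetical name for an explicitly 64-bit variant, not necessarily what the patch used.

```python
def bit(nr, long_bits):
    """Model BIT(nr) == 1UL << nr for an unsigned long of long_bits bits."""
    if not 0 <= nr < long_bits:
        # This is the situation gcc reports as
        # "left shift count >= width of type".
        raise OverflowError("shift count %d >= width %d" % (nr, long_bits))
    return 1 << nr


def bit_64(nr):
    """Model an explicitly 64-bit shift, i.e. 1ULL << nr."""
    return bit(nr, 64)
```

A bit number of 32 or higher fits a 64-bit register value but not a 32-bit unsigned long, which is why the same source line is clean on 64-bit builds and warns on 32-bit ones.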
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Cc: Frank Arnold <frank.arnold@amd.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/1337684026-19740-2-git-send-email-bp@amd64.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull trivial updates from Jiri Kosina:
"As usual, it's mostly typo fixes, redundant code elimination and some
documentation updates."
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits)
edac, mips: don't change code that has been removed in edac/mips tree
xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer
lib: Change mail address of Oskar Schirmer
net: Change mail address of Oskar Schirmer
arm/m68k: Change mail address of Sebastian Hess
i2c: Change mail address of Oskar Schirmer
net: Fix tcp_build_and_update_options comment in struct tcp_sock
atomic64_32.h: fix parameter naming mismatch
Kconfig: replace "--- help ---" with "---help---"
c2port: fix bogus Kconfig "default no"
edac: Fix spelling errors.
qla1280: Remove redundant NULL check before release_firmware() call
remoteproc: remove redundant NULL check before release_firmware()
qla2xxx: Remove redundant NULL check before release_firmware() call.
aic94xx: Get rid of redundant NULL check before release_firmware() call
tehuti: delete redundant NULL check before release_firmware()
qlogic: get rid of a redundant test for NULL before call to release_firmware()
bna: remove redundant NULL test before release_firmware()
tg3: remove redundant NULL test before release_firmware() call
typhoon: get rid of redundant conditional before all to release_firmware()
...
Pull x86/apic changes from Ingo Molnar:
"Most of the changes are about helping virtualized guest kernels
achieve better performance."
Fix up trivial conflicts with the iommu updates to arch/x86/kernel/apic/io_apic.c
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/apic: Implement EIO micro-optimization
x86/apic: Add apic->eoi_write() callback
x86/apic: Use symbolic APIC_EOI_ACK
x86/apic: Fix typo EIO_ACK -> EOI_ACK and document it
x86/xen/apic: Add missing #include <xen/xen.h>
x86/apic: Only compile local function if used with !CONFIG_GENERIC_PENDING_IRQ
x86/apic: Fix UP boot crash
x86: Conditionally update time when ack-ing pending irqs
xen/apic: implement io apic read with hypercall
Revert "xen/x86: Workaround 'x86/ioapic: Add register level checks to detect bogus io-apic entries'"
xen/x86: Implement x86_apic_ops
x86/apic: Replace io_apic_ops with x86_io_apic_ops.
Pull scheduler changes from Ingo Molnar:
"The biggest change is the cleanup/simplification of the load-balancer:
instead of the current practice of architectures twiddling scheduler
internal data structures and providing the scheduler domains in
colorfully inconsistent ways, we now have generic scheduler code in
kernel/sched/core.c:sched_init_numa() that looks at the architecture's
node_distance() parameters and (while not fully trusting it) deduces a
NUMA topology from it.
This inevitably changes balancing behavior - hopefully for the better.
There are various smaller optimizations, cleanups and fixlets as well"
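One way to picture deducing a topology from node distances is the Python sketch below. It is a simplification and not the code in sched_init_numa(): the function name is hypothetical, every pair of nodes is scanned, and the distance matrix in the usage note is made up.

```python
def numa_levels(distance):
    """Derive topology levels from a square node-distance matrix.

    Each distinct distance value becomes one level. At a given level, the
    span of node i is every node whose distance from i does not exceed the
    level's distance.
    """
    n = len(distance)
    levels = sorted({distance[i][j] for i in range(n) for j in range(n)})
    topology = []
    for d in levels:
        spans = [
            frozenset(j for j in range(n) if distance[i][j] <= d)
            for i in range(n)
        ]
        topology.append((d, spans))
    return topology
```

For a four-node matrix with distances 10 (local), 20 (same pair) and 30 (remote pair), this yields three levels whose spans grow from a single node, to the pair, to all four nodes.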
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Taint kernel with TAINT_WARN after sleep-in-atomic bug
sched: Remove stale power aware scheduling remnants and dysfunctional knobs
sched/debug: Fix printing large integers on 32-bit platforms
sched/fair: Improve the ->group_imb logic
sched/nohz: Fix rq->cpu_load[] calculations
sched/numa: Don't scale the imbalance
sched/fair: Revert sched-domain iteration breakage
sched/x86: Rewrite set_cpu_sibling_map()
sched/numa: Fix the new NUMA topology bits
sched/numa: Rewrite the CONFIG_NUMA sched domain support
sched/fair: Propagate 'struct lb_env' usage into find_busiest_group
sched/fair: Add some serialization to the sched_domain load-balance walk
sched/fair: Let minimally loaded cpu balance the group
sched: Change rq->nr_running to unsigned int
x86/numa: Check for nonsensical topologies on real hw as well
x86/numa: Hard partition cpu topology masks on node boundaries
x86/numa: Allow specifying node_distance() for numa=fake
x86/sched: Make mwait_usable() heed to "idle=" kernel parameters properly
sched: Update documentation and comments
sched_rt: Avoid unnecessary dequeue and enqueue of pushable tasks in set_cpus_allowed_rt()
Pull perf changes from Ingo Molnar:
"Lots of changes:
- (much) improved assembly annotation support in perf report, with
jump visualization, searching, navigation, visual output
improvements and more.
- kernel support for AMD IBS PMU hardware features. Notably 'perf
record -e cycles:p' and 'perf top -e cycles:p' should work without
skid now, like PEBS does on the Intel side, because it takes
advantage of IBS transparently.
- the libtraceevent library: it is the first step towards unifying
tracing tooling and perf, and it also gives a tracing library for
external tools like powertop to rely on.
- infrastructure: various improvements and refactoring of the UI
modules and related code
- infrastructure: cleanup and simplification of the profiling
targets code (--uid, --pid, --tid, --cpu, --all-cpus, etc.)
- tons of robustness fixes all around
- various ftrace updates: speedups, cleanups, robustness
improvements.
- typing 'make' in tools/ will now give you a menu of projects to
build and a short help text to explain what each does.
- ... and lots of other changes I forgot to list.
The perf record make bzImage + perf report regression you reported
should be fixed."
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (166 commits)
tracing: Remove kernel_lock annotations
tracing: Fix initial buffer_size_kb state
ring-buffer: Merge separate resize loops
perf evsel: Create events initially disabled -- again
perf tools: Split term type into value type and term type
perf hists: Fix callchain ip printf format
perf target: Add uses_mmap field
ftrace: Remove selecting FRAME_POINTER with FUNCTION_TRACER
ftrace/x86: Have x86 ftrace use the ftrace_modify_all_code()
ftrace: Make ftrace_modify_all_code() global for archs to use
ftrace: Return record ip addr for ftrace_location()
ftrace: Consolidate ftrace_location() and ftrace_text_reserved()
ftrace: Speed up search by skipping pages by address
ftrace: Remove extra helper functions
ftrace: Sort all function addresses, not just per page
tracing: change CPU ring buffer state from tracing_cpumask
tracing: Check return value of tracing_dentry_percpu()
ring-buffer: Reset head page before running self test
ring-buffer: Add integrity check at end of iter read
ring-buffer: Make addition of pages in ring buffer atomic
...
Pull percpu updates from Tejun Heo:
"Contains Alex Shi's three patches to remove percpu_xxx() which overlap
with this_cpu_xxx(). There shouldn't be any functional change."
* 'for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: remove percpu_xxx() functions
x86: replace percpu_xxx funcs with this_cpu_xxx
net: replace percpu_xxx funcs with this_cpu_xxx or __this_cpu_xxx
guts of saved_sigmask-based sigsuspend/rt_sigsuspend. Takes
kernel sigset_t *.
Open-coded instances replaced with calling it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull security subsystem updates from James Morris:
"New notable features:
- The seccomp work from Will Drewry
- PR_{GET,SET}_NO_NEW_PRIVS from Andy Lutomirski
- Longer security labels for Smack from Casey Schaufler
- Additional ptrace restriction modes for Yama by Kees Cook"
Fix up trivial context conflicts in arch/x86/Kconfig and include/linux/filter.h
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security: (65 commits)
apparmor: fix long path failure due to disconnected path
apparmor: fix profile lookup for unconfined
ima: fix filename hint to reflect script interpreter name
KEYS: Don't check for NULL key pointer in key_validate()
Smack: allow for significantly longer Smack labels v4
gfp flags for security_inode_alloc()?
Smack: recursive tramsmute
Yama: replace capable() with ns_capable()
TOMOYO: Accept manager programs which do not start with / .
KEYS: Add invalidation support
KEYS: Do LRU discard in full keyrings
KEYS: Permit in-place link replacement in keyring list
KEYS: Perform RCU synchronisation on keys prior to key destruction
KEYS: Announce key type (un)registration
KEYS: Reorganise keys Makefile
KEYS: Move the key config into security/keys/Kconfig
KEYS: Use the compat keyctl() syscall wrapper on Sparc64 for Sparc32 compat
Yama: remove an unused variable
samples/seccomp: fix dependencies on arch macros
Yama: add additional ptrace scopes
...
Pull smp hotplug cleanups from Thomas Gleixner:
"This series is merely a cleanup of code copied around in arch/* and
not changing any of the real cpu hotplug horrors yet. I wish I'd had
something more substantial for 3.5, but I underestimated the lurking
horror..."
Fix up trivial conflicts in arch/{arm,sparc,x86}/Kconfig and
arch/sparc/include/asm/thread_info_32.h
* 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
um: Remove leftover declaration of alloc_task_struct_node()
task_allocator: Use config switches instead of magic defines
sparc: Use common threadinfo allocator
score: Use common threadinfo allocator
sh-use-common-threadinfo-allocator
mn10300: Use common threadinfo allocator
powerpc: Use common threadinfo allocator
mips: Use common threadinfo allocator
hexagon: Use common threadinfo allocator
m32r: Use common threadinfo allocator
frv: Use common threadinfo allocator
cris: Use common threadinfo allocator
x86: Use common threadinfo allocator
c6x: Use common threadinfo allocator
fork: Provide kmemcache based thread_info allocator
tile: Use common threadinfo allocator
fork: Provide weak arch_release_[task_struct|thread_info] functions
fork: Move thread info gfp flags to header
fork: Remove the weak insanity
sh: Remove cpu_idle_wait()
...
Pull core locking updates from Ingo Molnar:
"This update:
- extends and simplifies x86 NMI callback handling code to enhance
and fix the HP hw-watchdog driver
- simplifies the x86 NMI callback handling code to fix a kmemcheck
bug.
- enhances the hung-task debugger"
* 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/nmi: Fix the type of the nmiaction.flags field
x86/nmi: Fix page faults by nmiaction if kmemcheck is enabled
x86/nmi: Add new NMI queues to deal with IO_CHK and SERR
watchdog, hpwdt: Remove priority option for NMI callback
hung task debugging: Inject NMI when hung and going to panic
Pull iommu core changes from Ingo Molnar:
"The IOMMU changes in this cycle are mostly about factoring out
Intel-VT-d specific IRQ remapping details and introducing struct
irq_remap_ops, in preparation for AMD specific hardware."
* 'core-iommu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
iommu: Fix off by one in dmar_get_fault_reason()
irq_remap: Fix the 'sub_handle' uninitialized warning
irq_remap: Fix UP build failure
irq_remap: Fix compiler warning with CONFIG_IRQ_REMAP=y
iommu: rename intr_remapping.[ch] to irq_remapping.[ch]
iommu: rename intr_remapping references to irq_remapping
x86, iommu/vt-d: Clean up interfaces for interrupt remapping
iommu/vt-d: Convert MSI remapping setup to remap_ops
iommu/vt-d: Convert free_irte into a remap_ops callback
iommu/vt-d: Convert IR set_affinity function to remap_ops
iommu/vt-d: Convert IR ioapic-setup to use remap_ops
iommu/vt-d: Convert missing apic.c intr-remapping call to remap_ops
iommu/vt-d: Make intr-remapping initialization generic
iommu: Rename intr_remapping files to intel_intr_remapping
Fix this behaviour:
----------------
| NMI testsuite:
--------------------
remote IPI:
ok |
local IPI:
ok |
Revealed due to a new modification to printk().
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Link: http://lkml.kernel.org/r/1336492573-17530-3-git-send-email-levinsasha928@gmail.com
Signed-off-by: H. Peter Anvin <hpa@zytor.com>
This patch adds CMA support to the dma-mapping subsystem for the x86
architecture, which uses the common pci-dma/pci-nommu implementation.
This allows testing CMA on KVM/QEMU and on many common x86 boxes.
Signed-off-by: Marek Szyprowski <m.szyprowski@samsung.com>
Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
CC: Michal Nazarewicz <mina86@mina86.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Fixes for perf/core:
- Rename some perf_target methods to avoid double negation, from Namhyung Kim.
- Revert change to use per task events with inheritance, from Namhyung Kim.
- Events should start disabled till children starts running, from David Ahern.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'linus-mce-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull a machine check recovery fix from Tony Luck.
I really don't like how the MCE code does some of the things it does,
but this does seem to be an improvement.
* tag 'linus-mce-fix' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
x86/mce: Only restart instruction after machine check recovery if it is safe
Merge reason: We are going to queue up a dependent patch:
"perf tools: Move parse event automated tests to separated object"
That depends on:
commit e7c72d8
perf tools: Add 'G' and 'H' modifiers to event parsing
Conflicts:
tools/perf/builtin-stat.c
Conflicted with the recent 'perf_target' patches when checking the
result of perf_evsel open routines to see if a retry is needed to cope
with older kernels where the exclude guest/host perf_event_attr bits
were not used.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
We know both register and value for eoi beforehand,
so there's no need to check it and no need to do math
to calculate the msr. Saves instructions/branches
on each EOI when using x2apic.
I looked at the objdump output to verify that the
generated code looks right and actually is shorter.
The real improvements will be on the KVM guest side
though, those come in a later patch.
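
A minimal user-space sketch of the idea, not the actual patch: the
generic write path filters the register and computes the MSR number on
every call, while a dedicated EOI write uses constants only. The MSR
mapping (0x800 + (offset >> 4)) and the EOI offset 0xB0 are
architectural; fake_wrmsr(), x2apic_write_generic() and
x2apic_eoi_write() are hypothetical names used for illustration.

```c
#include <stdint.h>

/* x2APIC maps an xAPIC register offset to MSR 0x800 + (offset >> 4). */
#define APIC_BASE_MSR 0x800u
#define APIC_ID       0x20u
#define APIC_EOI      0xB0u
#define APIC_EOI_ACK  0x0u

/* Hypothetical stand-in for wrmsr: records the last write. */
uint32_t last_msr;
uint32_t last_val;

void fake_wrmsr(uint32_t msr, uint32_t val)
{
    last_msr = msr;
    last_val = val;
}

/* Generic write: filters the register and computes the MSR each time. */
void x2apic_write_generic(uint32_t reg, uint32_t val)
{
    if (reg == APIC_ID)     /* illustrative check only */
        return;
    fake_wrmsr(APIC_BASE_MSR + (reg >> 4), val);
}

/* Dedicated EOI write: register and value are compile-time constants. */
void x2apic_eoi_write(void)
{
    fake_wrmsr(APIC_BASE_MSR + (APIC_EOI >> 4), APIC_EOI_ACK);
}
```

Both paths end up writing the same MSR; the dedicated one simply has
no branch and no address arithmetic left at run time.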
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: gleb@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/e019d1a125316f10d3e3a4b2f6bda41473f4fb72.1337184153.git.mst@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Add eoi_write callback so that kvm can override
eoi accesses without touching the rest of the apic.
As a side-effect, this will enable a micro-optimization
for apics using msr.
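
As a rough illustration (a reduced, hypothetical ops table rather than
the kernel's real struct apic), a separate eoi_write hook lets a
paravirt backend replace only the EOI path while every other register
write keeps using the native routine. The names apic_ops,
native_write(), pv_eoi_write() and ack_irq() are assumptions made for
this sketch.

```c
#include <stdint.h>

#define APIC_EOI     0xB0u
#define APIC_EOI_ACK 0x0u

/* Reduced, hypothetical ops table with a dedicated EOI hook. */
struct apic_ops {
    void (*write)(uint32_t reg, uint32_t val);
    void (*eoi_write)(uint32_t reg, uint32_t val);
};

/* Counters standing in for real hardware accesses. */
int native_writes;
int pv_eois;

void native_write(uint32_t reg, uint32_t val)
{
    (void)reg;
    (void)val;
    native_writes++;
}

/* What a hypervisor guest might install instead of the native EOI. */
void pv_eoi_write(uint32_t reg, uint32_t val)
{
    (void)reg;
    (void)val;
    pv_eois++;
}

/* By default the EOI hook points at the ordinary write routine. */
struct apic_ops apic = {
    .write     = native_write,
    .eoi_write = native_write,
};

/* Interrupt acknowledgement always goes through the EOI hook. */
void ack_irq(void)
{
    apic.eoi_write(APIC_EOI, APIC_EOI_ACK);
}
```

Overriding is then a single assignment, apic.eoi_write = pv_eoi_write,
which leaves apic.write untouched.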
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Cc: Avi Kivity <avi@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: gleb@redhat.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/0df425d746c49ac2ecc405174df87752869629d2.1337184153.git.mst@redhat.com
[ tidied it up a bit ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Hardware with MCA bus is limited to 386 and 486 class machines
that are now 20+ years old and typically with less than 32MB
of memory. A quick search on the internet shows that even the
MCA hobbyist/enthusiast community lost interest in the early
2000s and never really moved ahead from the 2.4 kernels to
the 2.6 series.
This deletes anything remaining related to CONFIG_MCA from core
kernel code and from the x86 architecture. There is no point in
carrying this any further into the future.
One complication to watch for is inadvertently scooping up
stuff relating to machine check, since there is overlap in
the TLA name space (e.g. arch/x86/boot/mca.c).
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: James Bottomley <JBottomley@Parallels.com>
Cc: x86@kernel.org
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: H. Peter Anvin <hpa@zytor.com>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Pull perf, x86 and scheduler updates from Ingo Molnar.
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
tracing: Do not enable function event with enable
perf stat: handle ENXIO error for perf_event_open
perf: Turn off compiler warnings for flex and bison generated files
perf stat: Fix case where guest/host monitoring is not supported by kernel
perf build-id: Fix filename size calculation
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, kvm: KVM paravirt kernels don't check for CPUID being unavailable
x86: Fix section annotation of acpi_map_cpu2node()
x86/microcode: Ensure that module is only loaded on supported Intel CPUs
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Fix KVM and ia64 boot crash due to sched_groups circular linked list assumption
It's been broken forever (i.e. it's not scheduling in a power
aware fashion), as reported by Suresh and others sending
patches, and nobody cares enough to fix it properly ...
so remove it to make space free for something better.
There are various problems with the code as it stands today, first
and foremost the user interface which is bound to topology
levels and has multiple values per level. This results in a
state explosion which the administrator or distro needs to
master and almost nobody does.
Furthermore, large configuration state spaces aren't good: it
means the thing doesn't just work right, because it's either
under so many impossible-to-meet constraints, or, even if
there's an achievable state, workloads have to be aware of
it precisely and can never meet it for dynamic workloads.
So pushing this kind of decision to user-space was a bad idea
even with a single knob - it's exponentially worse with knobs
on every node of the topology.
There is a proposal to replace the user interface with a single
3 state knob:
sched_balance_policy := { performance, power, auto }
where 'auto' would be the preferred default which looks at things
like Battery/AC mode and possible cpufreq state or whatever the hw
exposes to show us power use expectations - but there's been no
progress on it in the past many months.
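
To make the proposal concrete, here is a purely hypothetical sketch of
such a three-state knob. Nothing like this exists in the tree; the enum
values, resolve_policy() and the on_battery input are assumptions, and
a real 'auto' mode would presumably consult more than one signal.

```c
#include <stdbool.h>

/* Hypothetical encoding of the proposed three-state knob. */
enum sched_balance_policy {
    SCHED_BALANCE_PERFORMANCE,
    SCHED_BALANCE_POWER,
    SCHED_BALANCE_AUTO,
};

/*
 * Resolve the knob to an effective policy. 'auto' picks power saving
 * on battery and performance on AC; explicit settings pass through.
 */
enum sched_balance_policy
resolve_policy(enum sched_balance_policy knob, bool on_battery)
{
    if (knob != SCHED_BALANCE_AUTO)
        return knob;
    return on_battery ? SCHED_BALANCE_POWER : SCHED_BALANCE_PERFORMANCE;
}
```

The point of the single knob is that the balancer only ever sees two
effective states, whatever the topology looks like.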
Aside from that, the actual implementation of the various knobs
is known to be broken. There have been sporadic attempts at
fixing things but these always stop short of reaching a mergeable
state.
Hence this wholesale removal, in the hope of spurring
people who care to come forward once again and work on a
coherent replacement.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1326104915.2442.53.camel@twins
Signed-off-by: Ingo Molnar <mingo@kernel.org>
To remove duplicate code, have the ftrace arch_ftrace_update_code()
use the generic ftrace_modify_all_code(). This requires that the
default ftrace_replace_code() becomes a weak function so that an
arch may override it.
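
The weak-symbol pattern referred to here can be sketched as follows,
assuming GCC or Clang. The sketch_* names are hypothetical and are not
the ftrace functions themselves; the generic caller invokes a weak
default, which the linker replaces if another object file supplies a
strong definition of the same symbol.

```c
#define IMPL_GENERIC 0
#define IMPL_ARCH    1

/*
 * Weak default: used unless another object file defines a strong
 * symbol named sketch_replace_code. An arch override would return
 * IMPL_ARCH instead.
 */
__attribute__((weak)) int sketch_replace_code(int enable)
{
    (void)enable;
    return IMPL_GENERIC;
}

/* Generic driver shared by all architectures. */
int sketch_modify_all_code(int enable)
{
    return sketch_replace_code(enable);
}
```

Built on its own, the generic driver reaches the weak default; linking
in an object with a non-weak sketch_replace_code() changes the callee
without touching the driver.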
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There is no need to save any active fpu state to the task structure
memory if the task is dead. Just drop the state instead.
For example, this saved some 1770 xsave's during the system boot
of a two socket Xeon system.
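
A toy model of the decision, under the assumption that "saving" can be
reduced to a counter. The struct task_sketch type and
release_fpu_state() are hypothetical and only mirror the reasoning
above, not the real x86 code.

```c
#include <stdbool.h>

/* Hypothetical, reduced view of a task's fpu bookkeeping. */
struct task_sketch {
    bool dead;       /* task is exiting and will never run again */
    bool fpu_live;   /* fpu registers currently hold this task's state */
    int  saves;      /* counts simulated xsave operations */
};

/* Give up the fpu: save the state only if someone can still use it. */
void release_fpu_state(struct task_sketch *t)
{
    if (!t->fpu_live)
        return;
    if (!t->dead)
        t->saves++;          /* stands in for an xsave to memory */
    t->fpu_live = false;     /* either way the registers are released */
}
```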
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-4-git-send-email-suresh.b.siddha@intel.com
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Code paths like fork(), exit() and signal handling flush the fpu
state explicitly to the structures in memory.
BUG_ON() in __sanitize_i387_state() is checking that the fpu state
is not live any more. But for preempt kernels, task can be scheduled
out and in at any place and the preload_fpu logic during context switch
can make the fpu registers live again.
For example, consider a 64-bit task (task-A) which uses the fpu frequently,
so that its fpu_counter is mostly non-zero. During its time slice, the kernel
used the fpu by doing kernel_fpu_begin()/kernel_fpu_end(). After this, in the
same scheduling slice, task-A got a signal to handle. Then, during the signal
setup path, we got preempted just before the sanitize_i387_state() in
arch/x86/kernel/xsave.c:save_i387_xstate(). When we come back, the fpu
registers are live again and can hit the BUG_ON().
Similarly during core dump, other threads can context-switch in and out
(because of spurious wakeups while waiting for the coredump to finish in
kernel/exit.c:exit_mm()) and the main thread dumping core can run into this
bug when it finds some other thread with its fpu registers live on some other cpu.
So remove the paranoid check for now, even though it caught a bug in the
multi-threaded core dump case (fixed in the previous patch).
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-3-git-send-email-suresh.b.siddha@intel.com
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Historically, prepare_to_copy() is mostly a no-op duplicated across the
majority of architectures, with the rest following the x86 model of flushing
extended register state, such as the fpu state, there.
Remove it and use the arch_dup_task_struct() instead.
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Link: http://lkml.kernel.org/r/1336692811-30576-1-git-send-email-suresh.b.siddha@intel.com
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Chris Zankel <chris@zankel.net>
Cc: Richard Henderson <rth@twiddle.net>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
Cc: Mike Frysinger <vapier@gentoo.org>
Cc: Mark Salter <msalter@redhat.com>
Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
Cc: Mikael Starvik <starvik@axis.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Richard Kuo <rkuo@codeaurora.org>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Michal Simek <monstr@monstr.eu>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Jonas Bonn <jonas@southpole.se>
Cc: James E.J. Bottomley <jejb@parisc-linux.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Chen Liqin <liqin.chen@sunplusct.com>
Cc: Lennox Wu <lennox.wu@gmail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: Jeff Dike <jdike@addtoit.com>
Cc: Richard Weinberger <richard@nod.at>
Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
If we've determined we can't do what the user asked, trying to do
something else isn't going to make the user's life better.
Without this the screen scrolls a bit and then you get a panic
anyway, and it's nice not to have so much scroll after the real
problem in bug reports.
Link: http://lkml.kernel.org/r/1337190206-12121-1-git-send-email-pjones@redhat.com
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>