There was a bug in the handling of SNB-EP/IVB-EP uncore PCI
fixed counters, e.g., IMC.
It would cause erratic values to be returned for the IMC
clockticks event. This was due to a bogus hwc->config value
which was then written to PCI config space.
The erratic values can be seen via:
$ perf stat -a -C 0 -e uncore_imc_0/clockticks/ -I 1000 sleep 10
The fixed counter has most fields marked as reserved with
hw reset values of 0. Yet the kernel was defaulting to a
hwc->config = ~0 and that was causing the issues.
This patch sets the hwc->config values for fixed uncore event
to 0. Now, the values of IMC clockticks is correct.
Signed-off-by: Stephane Eranian <eranian@google.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: peterz@infradead.org
Cc: zheng.z.yan@intel.com
Link: http://lkml.kernel.org/r/20130909195350.GA17643@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
The IvyBridge event CYCLE_ACTIVITY:CYCLES_LDM_PENDING can only
be measured on counters 0-3 when HT is off. When HT is on, you
only have counters 0-3.
If you program it on the eight counters for 1s on a 3GHz
IVB laptop running a noploop, you see:
2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
2 747 527 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
3 280 563 608 CYCLE_ACTIVITY:CYCLES_LDM_PENDING
Clearly the last 4 values are bogus.
Signed-off-by: Stephane Eranian <eranian@google.com>
Cc: peterz@infradead.org
Cc: ak@linux.intel.com
Cc: zheng.z.yan@intel.com
Cc: dhsharp@google.com
Link: http://lkml.kernel.org/r/20130911152222.GA28761@google.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Pull x86 spinlock changes from Ingo Molnar:
"The biggest change here are paravirtualized ticket spinlocks (PV
spinlocks), which bring a nice speedup on various benchmarks.
The KVM host side will come to you via the KVM tree"
* 'x86-spinlocks-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kvm/guest: Fix sparse warning: "symbol 'klock_waiting' was not declared as static"
kvm: Paravirtual ticketlocks support for linux guests running on KVM hypervisor
kvm guest: Add configuration support to enable debug information for KVM Guests
kvm uapi: Add KICK_CPU and PV_UNHALT definition to uapi
xen, pvticketlock: Allow interrupts to be enabled while blocking
x86, ticketlock: Add slowpath logic
jump_label: Split jumplabel ratelimit
x86, pvticketlock: When paravirtualizing ticket locks, increment by 2
x86, pvticketlock: Use callee-save for lock_spinning
xen, pvticketlocks: Add xen_nopvspin parameter to disable xen pv ticketlocks
xen, pvticketlock: Xen implementation for PV ticket locks
xen: Defer spinlock setup until boot CPU setup
x86, ticketlock: Collapse a layer of functions
x86, ticketlock: Don't inline _spin_unlock when using paravirt spinlocks
x86, spinlock: Replace pv spinlocks with pv ticketlocks
Pull x86 SMAP fixes from Ingo Molnar:
"Fixes for Intel SMAP support, to fix SIGSEGVs during bootup"
* 'x86-smap-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
Introduce [compat_]save_altstack_ex() to unbreak x86 SMAP
x86, smap: Handle csum_partial_copy_*_user()
Pull x86 RAS changes from Ingo Molnar:
"[ The reason for drivers/ updates is that Boris asked for the
drivers/edac/ changes to go via x86/ras in this cycle ]
Main changes:
- AMD CPUs:
. Add ECC event decoding support for new F15h models
. Various erratum fixes
. Fix single-channel on dual-channel-controllers bug.
- Intel CPUs:
. UC uncorrectable memory error parsing fix
. Add support for CMC (Corrected Machine Check) 'FF' (Firmware
First) flag in the APEI HEST
- Various cleanups and fixes"
* 'x86-ras-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
amd64_edac: Fix incorrect wraparounds
amd64_edac: Correct erratum 505 range
cpc925_edac: Use proper array termination
x86/mce, acpi/apei: Only disable banks listed in HEST if mce is configured
amd64_edac: Get rid of boot_cpu_data accesses
amd64_edac: Add ECC decoding support for newer F15h models
x86, amd_nb: Clarify F15h, model 30h GART and L3 support
pci_ids: Add PCI device ID functions 3 and 4 for newer F15h models.
x38_edac: Make a local function static
i3200_edac: Make a local function static
x86/mce: Pay no attention to 'F' bit in MCACOD when parsing 'UC' errors
APEI/ERST: Fix error message formatting
amd64_edac: Fix single-channel setups
EDAC: Replace strict_strtol() with kstrtol()
mce: acpi/apei: Soft-offline a page on firmware GHES notification
mce: acpi/apei: Add a boot option to disable ff mode for corrected errors
mce: acpi/apei: Honour Firmware First for MCA banks listed in APEI HEST CMC
Pull x86 platform documentation fix from Ingo Molnar.
* 'x86-platform-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/acpi: Correct out-of-date comment of __acpi_map_table()
Pull x86 paravirt changes from Ingo Molnar:
"Hypervisor signature detection cleanup and fixes - the goal is to make
KVM guests run better on MS/Hyperv and to generalize and factor out
the code a bit"
* 'x86-paravirt-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Correctly detect hypervisor
x86, kvm: Switch to use hypervisor_cpuid_base()
xen: Switch to use hypervisor_cpuid_base()
x86: Introduce hypervisor_cpuid_base()
Pull x86 mm changes from Ingo Molnar:
"Misc smaller fixes:
- a parse_setup_data() boot crash fix
- a memblock and an __early_ioremap cleanup
- turn the always-on CONFIG_ARCH_MEMORY_PROBE=y into a configurable
option and turn it off - it's an unrobust debug facility, it
shouldn't be enabled by default"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: avoid remapping data in parse_setup_data()
x86: Use memblock_set_current_limit() to set limit for memblock.
mm: Remove unused variable idx0 in __early_ioremap()
mm/hotplug, x86: Disable ARCH_MEMORY_PROBE by default
Pull x86 relocation changes from Ingo Molnar:
"This tree contains a single change, ELF relocation handling in C - one
of the kernel randomization patches that makes sense even without
randomization present upstream"
* 'x86-kaslr-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, relocs: Move ELF relocation handling to C
Pull x86 fb changes from Ingo Molnar:
"This tree includes preparatory patches for SimpleDRM driver support,
by David Herrmann. They clean up x86 framebuffer support by creating
simplefb devices wherever possible. More background can be found at
http://lwn.net/Articles/558104/"
* 'x86-fb-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
fbdev: fbcon: select VT_HW_CONSOLE_BINDING
fbdev: efifb: bind to efi-framebuffer
fbdev: vesafb: bind to platform-framebuffer device
fbdev: simplefb: add common x86 RGB formats
x86: sysfb: move EFI quirks from efifb to sysfb
x86: provide platform-devices for boot-framebuffers
fbdev: simplefb: mark as fw and allocate apertures
fbdev: simplefb: add init through platform_data
Pull x86 cpu feature fixes from Ingo Molnar:
"Two small cpufeature support updates"
* 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Fix override new_cpu_data.x86 with 486
x86, cpufeature: Use new CC_HAVE_ASM_GOTO
Pull tiny x86 boot cleanups from Ingo Molnar.
* 'x86-boot-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot: Fix a sanity check in printf.c
* 'x86-cleanups-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, boot: Fix warning due to undeclared strlen()
Pull x86/asmlinkage changes from Ingo Molnar:
"As a preparation for Andi Kleen's LTO patchset (link time
optimizations using GCC's -flto which build time optimization has
steadily increased in quality over the past few years and might
eventually be usable for the kernel too) this tree includes a handful
of preparatory patches that make function calling convention
annotations consistent again:
- Mark every function without arguments (or 64bit only) that is used
by assembly code with asmlinkage()
- Mark every function with parameters or variables that is used by
assembly code as __visible.
For the vanilla kernel this has documentation, consistency and
debuggability advantages, for the time being"
* 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/asmlinkage: Fix warning in xen asmlinkage change
x86, asmlinkage, vdso: Mark vdso variables __visible
x86, asmlinkage, power: Make various symbols used by the suspend asm code visible
x86, asmlinkage: Make dump_stack visible
x86, asmlinkage: Make 64bit checksum functions visible
x86, asmlinkage, paravirt: Add __visible/asmlinkage to xen paravirt ops
x86, asmlinkage, apm: Make APM data structure used from assembler visible
x86, asmlinkage: Make syscall tables visible
x86, asmlinkage: Make several variables used from assembler/linker script visible
x86, asmlinkage: Make kprobes code visible and fix assembler code
x86, asmlinkage: Make various syscalls asmlinkage
x86, asmlinkage: Make 32bit/64bit __switch_to visible
x86, asmlinkage: Make _*_start_kernel visible
x86, asmlinkage: Make all interrupt handlers asmlinkage / __visible
x86, asmlinkage: Change dotraplinkage into __visible on 32bit
x86: Fix sys_call_table type in asm/syscall.h
Pull x86/asm changes from Ingo Molnar:
"Main changes:
- Apply low level mutex optimization on x86-64, by Wedson Almeida
Filho.
- Change bitops to be naturally 'long', by H Peter Anvin.
- Add TSX-NI opcodes support to the x86 (instrumentation) decoder, by
Masami Hiramatsu.
- Add clang compatibility adjustments/workarounds, by Jan-Simon
Möller"
* 'x86-asm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86, doc: Update uaccess.h comment to reflect clang changes
x86, asm: Fix a compilation issue with clang
x86, asm: Extend definitions of _ASM_* with a raw format
x86, insn: Add new opcodes as of June, 2013
x86/ia32/asm: Remove unused argument in macro
x86, bitops: Change bitops to be native operand size
x86: Use asm-goto to implement mutex fast path on x86-64
Pull x86/apic changes from Ingo Molnar:
"Smaller fixes"
* 'x86-apic-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/ioapic: Check attr against the previous setting when programmed more than once
x86/ioapic/kcrash: Prevent crash_kexec() from deadlocking on ioapic_lock
x86/acpi: Fix incorrect sanity check in acpi_register_lapic()
Pull scheduler changes from Ingo Molnar:
"Various optimizations, cleanups and smaller fixes - no major changes
in scheduler behavior"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Fix the sd_parent_degenerate() code
sched/fair: Rework and comment the group_imb code
sched/fair: Optimize find_busiest_queue()
sched/fair: Make group power more consistent
sched/fair: Remove duplicate load_per_task computations
sched/fair: Shrink sg_lb_stats and play memset games
sched: Clean-up struct sd_lb_stat
sched: Factor out code to should_we_balance()
sched: Remove one division operation in find_busiest_queue()
sched/cputime: Use this_cpu_add() in task_group_account_field()
cpumask: Fix cpumask leak in partition_sched_domains()
sched/x86: Optimize switch_mm() for multi-threaded workloads
generic-ipi: Kill unnecessary variable - csd_flags
numa: Mark __node_set() as __always_inline
sched/fair: Cleanup: remove duplicate variable declaration
sched/__wake_up_sync_key(): Fix nr_exclusive tasks which lead to WF_SYNC clearing
Pull perf changes from Ingo Molnar:
"As a first remark I'd like to point out that the obsolete '-f'
(--force) option, which has not done anything for several releases,
has been removed from 'perf record' and related utilities. Everyone
please update muscle memory accordingly! :-)
Main changes on the perf kernel side:
- Performance optimizations:
. for trace events, by Steve Rostedt.
. for time values, by Peter Zijlstra
- New hardware support:
. for Intel Silvermont (22nm Atom) CPUs, by Zheng Yan
. for Intel SNB-EP uncore PMUs, by Zheng Yan
- Enhanced hardware support:
. for Intel uncore PMUs: add filter support for QPI boxes, by Zheng Yan
- Core perf events code enhancements and fixes:
. for full-nohz feature handling, by Frederic Weisbecker
. for group events, by Jiri Olsa
. for call chains, by Frederic Weisbecker
. for event stream parsing, by Adrian Hunter
- New ABI details:
. Add attr->mmap2 attribute, by Stephane Eranian
. Add PERF_EVENT_IOC_ID ioctl to return event ID, by Jiri Olsa
. Export u64 time_zero on the mmap header page to allow TSC
calculation, by Adrian Hunter
. Add dummy software event, by Adrian Hunter.
. Add a new PERF_SAMPLE_IDENTIFIER to make samples always
parseable, by Adrian Hunter.
. Make Power7 events available via sysfs, by Runzhen Wang.
- Code cleanups and refactorings:
. for nohz-full, by Frederic Weisbecker
. for group events, by Jiri Olsa
- Documentation updates:
. for perf_event_type, by Peter Zijlstra
Main changes on the perf tooling side (some of these tooling changes
utilize the above kernel side changes):
- Lots of 'perf trace' enhancements:
. Make 'perf trace' command line arguments consistent with
'perf record', by David Ahern.
. Allow specifying syscalls a la strace, by Arnaldo Carvalho de Melo.
. Add --verbose and -o/--output options, by Arnaldo Carvalho de Melo.
. Support ! in -e expressions, to filter a list of syscalls,
by Arnaldo Carvalho de Melo.
. Arg formatting improvements to allow masking arguments in
syscalls such as futex and open, where the some arguments are
ignored and thus should not be printed depending on other args,
by Arnaldo Carvalho de Melo.
. Beautify futex open, openat, open_by_handle_at, lseek and futex
syscalls, by Arnaldo Carvalho de Melo.
. Add option to analyze events in a file versus live, so that
one can do:
[root@zoo ~]# perf record -a -e raw_syscalls:* sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 25.150 MB perf.data (~1098836 samples) ]
[root@zoo ~]# perf trace -i perf.data -e futex --duration 1
17.799 ( 1.020 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, ua
113.344 (95.429 ms): 7127 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 4294967
133.778 ( 1.042 ms): 18004 futex(uaddr: 0x7fff3f6c6674, op: 393, val: 1, utime: 0x7fff3f6c6470, uaddr2: 0x7fff3f6c6648, val3: 429496
[root@zoo ~]#
By David Ahern.
. Honor target pid / tid options when analyzing a file, by David Ahern.
. Introduce better formatting of syscall arguments, including so
far beautifiers for mmap, madvise, syscall return values,
by Arnaldo Carvalho de Melo.
. Handle HUGEPAGE defines in the mmap beautifier, by David Ahern.
- 'perf report/top' enhancements:
. Do annotation using /proc/kcore and /proc/kallsyms when
available, removing the forced need for a vmlinux file kernel
assembly annotation. This also improves this use case because
vmlinux has just the initial kernel image, not what is actually
in use after various code patchings by things like alternatives.
By Adrian Hunter.
. Add --ignore-callees=<regex> option to collapse undesired parts
of call graphs, by Greg Price.
. Simplify symbol filtering by doing it at machine class level,
by Adrian Hunter.
. Add support for callchains in the gtk UI, by Namhyung Kim.
. Add --objdump option to 'perf top', by Sukadev Bhattiprolu.
- 'perf kvm' enhancements:
. Add option to print only events that exceed a specified time
duration, by David Ahern.
. Improve stack trace printing, by David Ahern.
. Update documentation of the live command, by David Ahern
. Add perf kvm stat live mode that combines aspects of 'perf kvm
stat' record and report, by David Ahern.
. Add option to analyze specific VM in perf kvm stat report, by
David Ahern.
. Do not require /lib/modules/* on a guest, by Jason Wessel.
- 'perf script' enhancements:
. Fix symbol offset computation for some dsos, by David Ahern.
. Fix named threads support, by David Ahern.
. Don't install scripting files files when perl/python support
is disabled, by Arnaldo Carvalho de Melo.
- 'perf test' enhancements:
. Add various improvements and fixes to the "vmlinux matches
kallsyms" 'perf test' entry, related to the /proc/kcore
annotation feature. By Adrian Hunter.
. Add sample parsing test, by Adrian Hunter.
. Add test for reading object code, by Adrian Hunter.
. Add attr record group sampling test, by Jiri Olsa.
. Misc testing infrastructure improvements and other details,
by Jiri Olsa.
- 'perf list' enhancements:
. Skip unsupported hardware events, by Namhyung Kim.
. List pmu events, by Andi Kleen.
- 'perf diff' enhancements:
. Add support for more than two files comparison, by Jiri Olsa.
- 'perf sched' enhancements:
. Various improvements, including removing reliance on some
scheduler tracepoints that provide the same information as the
PERF_RECORD_{FORK,EXIT} events. By David Ahern.
. Remove odd build stall by moving a large struct initialization
from a local variable to a global one, by Namhyung Kim.
- 'perf stat' enhancements:
. Add --initial-delay option to skip measuring for a defined
startup phase, by Andi Kleen.
- Generic perf tooling infrastructure/plumbing changes:
. Tidy up sample parsing validation, by Adrian Hunter.
. Fix up jobserver setup in libtraceevent Makefile.
by Arnaldo Carvalho de Melo.
. Debug improvements, by Adrian Hunter.
. Fix correlation of samples coming after PERF_RECORD_EXIT event,
by David Ahern.
. Improve robustness of the topology parsing code,
by Stephane Eranian.
. Add group leader sampling, that allows just one event in a group
to sample while the other events have just its values read,
by Jiri Olsa.
. Add support for a new modifier "D", which requests that the
event, or group of events, be pinned to the PMU.
By Michael Ellerman.
. Support callchain sorting based on addresses, by Andi Kleen
. Prep work for multi perf data file storage, by Jiri Olsa.
. libtraceevent cleanups, by Namhyung Kim.
And lots and lots of other fixes and code reorganizations that did not
make it into the list, see the shortlog, diffstat and the Git log for
details!"
[ Also merge a leftover from the 3.11 cycle ]
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Prevent race in unthrottling code
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (237 commits)
perf trace: Tell arg formatters the arg index
perf trace: Add beautifier for open's flags arg
perf trace: Add beautifier for lseek's whence arg
perf tools: Fix symbol offset computation for some dsos
perf list: Skip unsupported events
perf tests: Add 'keep tracking' test
perf tools: Add support for PERF_COUNT_SW_DUMMY
perf: Add a dummy software event to keep tracking
perf trace: Add beautifier for futex 'operation' parm
perf trace: Allow syscall arg formatters to mask args
perf: Convert kmalloc_node(...GFP_ZERO...) to kzalloc_node()
perf: Export struct perf_branch_entry to userspace
perf: Add attr->mmap2 attribute to an event
perf/x86: Add Silvermont (22nm Atom) support
perf/x86: use INTEL_UEVENT_EXTRA_REG to define MSR_OFFCORE_RSP_X
perf trace: Handle missing HUGEPAGE defines
perf trace: Honor target pid / tid options when analyzing a file
perf trace: Add option to analyze events in a file versus live
perf evlist: Add tracepoint lookup by name
perf tests: Add a sample parsing test
...
1) ACPI-based PCI hotplug (ACPIPHP) subsystem rework and introduction
of Intel Thunderbolt support on systems that use ACPI for signalling
Thunderbolt hotplug events. This also should make ACPIPHP work in
some cases in which it was known to have problems. From
Rafael J Wysocki, Mika Westerberg and Kirill A Shutemov.
2) ACPI core code cleanups and dock station support cleanups from
Jiang Liu and Rafael J Wysocki.
3) Fixes for locking problems related to ACPI device hotplug from
Rafael J Wysocki.
4) ACPICA update to version 20130725 includig fixes, cleanups, support
for more than 256 GPEs per GPE block and a change to make the ACPI
PM Timer optional (we've seen systems without the PM Timer in the
field already). One of the fixes, related to the DeRefOf operator,
is necessary to prevent some Windows 8 oriented AML from causing
problems to happen. From Bob Moore, Lv Zheng, and Jung-uk Kim.
5) Removal of the old and long deprecated /proc/acpi/event interface
and related driver changes from Thomas Renninger.
6) ACPI and Xen changes to make the reduced hardware sleep work with
the latter from Ben Guthro.
7) ACPI video driver cleanups and a blacklist of systems that should
not tell the BIOS that they are compatible with Windows 8 (or ACPI
backlight and possibly other things will not work on them). From
Felipe Contreras.
8) Assorted ACPI fixes and cleanups from Aaron Lu, Hanjun Guo,
Kuppuswamy Sathyanarayanan, Lan Tianyu, Sachin Kamat, Tang Chen,
Toshi Kani, and Wei Yongjun.
9) cpufreq ondemand governor target frequency selection change to
reduce oscillations between min and max frequencies (essentially,
it causes the governor to choose target frequencies proportional
to load) from Stratos Karafotis.
10) cpufreq fixes allowing sysfs attributes file permissions to be
preserved over suspend/resume cycles Srivatsa S Bhat.
11) Removal of Device Tree parsing for CPU device nodes from multiple
cpufreq drivers that required some changes related to
of_get_cpu_node() to be made in a few architectures and in the
driver core. From Sudeep KarkadaNagesha.
12) cpufreq core fixes and cleanups related to mutual exclusion and
driver module references from Viresh Kumar, Lukasz Majewski and
Rafael J Wysocki.
13) Assorted cpufreq fixes and cleanups from Amit Daniel Kachhap,
Bartlomiej Zolnierkiewicz, Hanjun Guo, Jingoo Han, Joseph Lo,
Julia Lawall, Li Zhong, Mark Brown, Sascha Hauer, Stephen Boyd,
Stratos Karafotis, and Viresh Kumar.
14) Fixes to prevent race conditions in coupled cpuidle from happening
from Colin Cross.
15) cpuidle core fixes and cleanups from Daniel Lezcano and
Tuukka Tikkanen.
16) Assorted cpuidle fixes and cleanups from Daniel Lezcano,
Geert Uytterhoeven, Jingoo Han, Julia Lawall, Linus Walleij,
and Sahara.
17) System sleep tracing changes from Todd E Brandt and Shuah Khan.
18) PNP subsystem conversion to using struct dev_pm_ops for power
management from Shuah Khan.
/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.19 (GNU/Linux)
iQIcBAABCAAGBQJSJcKhAAoJEKhOf7ml8uNsplIQAJSOshxhkkemvFOuHZ+0YIbh
R9aufjXeDkMDBi8YtU+tB7ERth1j+0LUSM0NTnP51U7e+7eSGobA9s5jSZQj2l7r
HFtnSOegLuKAfqwgfSLK91xa1rTFdfW0Kych9G2nuHtBIt6P0Oc59Cb5M0oy6QXs
nVtaDEuU//tmO71+EF5HnMJHabRTrpvtn/7NbDUpU7LZYpWJrHJFT9xt1rXNab7H
YRCATPm3kXGRg58Doc3EZE4G3D7DLvq74jWMaI089X/m5Pg1G6upqArypOy6oxdP
p2FEzYVrb2bi8fakXp7BBeO1gCJTAqIgAkbSSZHLpGhFaeEMmb9/DWPXdm2TjzMV
c1EEucvsqZWoprXgy12i5Hk814xN8d8nBBLg/UYiRJ44nc/hevXfyE9ZYj6bkseJ
+GNHmZIa1QYC05nnGli4+W4kHns8EZf/gmvIxnPuco1RN2yMWagrud5/G6Dr9M2B
hzJV6qauLVzgZso4oe79zv9aVxe/dPHKANLD/sg23WBiJJbJF1ocBlnj2Xlbpqze
pmMUWGiO/gUiS0fmpW/lAJauza5jFmSCjE4E8R0Gyn0j4YXjmMhdEanaU6J3VuCi
yVgEzYEth4sowq4AflMMLKYN+WmozDnK7taZRGmT0t+EKRFKLT6EgnNrkQgs1vKl
oawD9LM4fZ8E0yroOEme
=CgqW
-----END PGP SIGNATURE-----
Merge tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI and power management updates from Rafael Wysocki:
1) ACPI-based PCI hotplug (ACPIPHP) subsystem rework and introduction
of Intel Thunderbolt support on systems that use ACPI for signalling
Thunderbolt hotplug events. This also should make ACPIPHP work in
some cases in which it was known to have problems. From
Rafael J Wysocki, Mika Westerberg and Kirill A Shutemov.
2) ACPI core code cleanups and dock station support cleanups from
Jiang Liu and Rafael J Wysocki.
3) Fixes for locking problems related to ACPI device hotplug from
Rafael J Wysocki.
4) ACPICA update to version 20130725 includig fixes, cleanups, support
for more than 256 GPEs per GPE block and a change to make the ACPI
PM Timer optional (we've seen systems without the PM Timer in the
field already). One of the fixes, related to the DeRefOf operator,
is necessary to prevent some Windows 8 oriented AML from causing
problems to happen. From Bob Moore, Lv Zheng, and Jung-uk Kim.
5) Removal of the old and long deprecated /proc/acpi/event interface
and related driver changes from Thomas Renninger.
6) ACPI and Xen changes to make the reduced hardware sleep work with
the latter from Ben Guthro.
7) ACPI video driver cleanups and a blacklist of systems that should
not tell the BIOS that they are compatible with Windows 8 (or ACPI
backlight and possibly other things will not work on them). From
Felipe Contreras.
8) Assorted ACPI fixes and cleanups from Aaron Lu, Hanjun Guo,
Kuppuswamy Sathyanarayanan, Lan Tianyu, Sachin Kamat, Tang Chen,
Toshi Kani, and Wei Yongjun.
9) cpufreq ondemand governor target frequency selection change to
reduce oscillations between min and max frequencies (essentially,
it causes the governor to choose target frequencies proportional
to load) from Stratos Karafotis.
10) cpufreq fixes allowing sysfs attributes file permissions to be
preserved over suspend/resume cycles Srivatsa S Bhat.
11) Removal of Device Tree parsing for CPU device nodes from multiple
cpufreq drivers that required some changes related to
of_get_cpu_node() to be made in a few architectures and in the
driver core. From Sudeep KarkadaNagesha.
12) cpufreq core fixes and cleanups related to mutual exclusion and
driver module references from Viresh Kumar, Lukasz Majewski and
Rafael J Wysocki.
13) Assorted cpufreq fixes and cleanups from Amit Daniel Kachhap,
Bartlomiej Zolnierkiewicz, Hanjun Guo, Jingoo Han, Joseph Lo,
Julia Lawall, Li Zhong, Mark Brown, Sascha Hauer, Stephen Boyd,
Stratos Karafotis, and Viresh Kumar.
14) Fixes to prevent race conditions in coupled cpuidle from happening
from Colin Cross.
15) cpuidle core fixes and cleanups from Daniel Lezcano and
Tuukka Tikkanen.
16) Assorted cpuidle fixes and cleanups from Daniel Lezcano,
Geert Uytterhoeven, Jingoo Han, Julia Lawall, Linus Walleij,
and Sahara.
17) System sleep tracing changes from Todd E Brandt and Shuah Khan.
18) PNP subsystem conversion to using struct dev_pm_ops for power
management from Shuah Khan.
* tag 'pm+acpi-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (217 commits)
cpufreq: Don't use smp_processor_id() in preemptible context
cpuidle: coupled: fix race condition between pokes and safe state
cpuidle: coupled: abort idle if pokes are pending
cpuidle: coupled: disable interrupts after entering safe state
ACPI / hotplug: Remove containers synchronously
driver core / ACPI: Avoid device hot remove locking issues
cpufreq: governor: Fix typos in comments
cpufreq: governors: Remove duplicate check of target freq in supported range
cpufreq: Fix timer/workqueue corruption due to double queueing
ACPI / EC: Add ASUSTEK L4R to quirk list in order to validate ECDT
ACPI / thermal: Add check of "_TZD" availability and evaluating result
cpufreq: imx6q: Fix clock enable balance
ACPI: blacklist win8 OSI for buggy laptops
cpufreq: tegra: fix the wrong clock name
cpuidle: Change struct menu_device field types
cpuidle: Add a comment warning about possible overflow
cpuidle: Fix variable domains in get_typical_interval()
cpuidle: Fix menu_device->intervals type
cpuidle: CodingStyle: Break up multiple assignments on single line
cpuidle: Check called function parameter in get_typical_interval()
...
Here's the big driver core pull request for 3.12-rc1.
Lots of tiny changes here fixing up the way sysfs attributes are
created, to try to make drivers simpler, and fix a whole class race
conditions with creations of device attributes after the device was
announced to userspace.
All the various pieces are acked by the different subsystem maintainers.
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.21 (GNU/Linux)
iEYEABECAAYFAlIlIPcACgkQMUfUDdst+ynUMwCaAnITsxyDXYQ4DqEsz8EcOtMk
718AoLrgnUZs3B+70AT34DVktg4HSThk
=USl9
-----END PGP SIGNATURE-----
Merge tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core patches from Greg KH:
"Here's the big driver core pull request for 3.12-rc1.
Lots of tiny changes here fixing up the way sysfs attributes are
created, to try to make drivers simpler, and fix a whole class race
conditions with creations of device attributes after the device was
announced to userspace.
All the various pieces are acked by the different subsystem
maintainers"
* tag 'driver-core-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (119 commits)
firmware loader: fix pending_fw_head list corruption
drivers/base/memory.c: introduce help macro to_memory_block
dynamic debug: line queries failing due to uninitialized local variable
sysfs: sysfs_create_groups returns a value.
debugfs: provide debugfs_create_x64() when disabled
rbd: convert bus code to use bus_groups
firmware: dcdbas: use binary attribute groups
sysfs: add sysfs_create/remove_groups for when SYSFS is not enabled
driver core: add #include <linux/sysfs.h> to core files.
HID: convert bus code to use dev_groups
Input: serio: convert bus code to use drv_groups
Input: gameport: convert bus code to use drv_groups
driver core: firmware: use __ATTR_RW()
driver core: core: use DEVICE_ATTR_RO
driver core: bus: use DRIVER_ATTR_WO()
driver core: create write-only attribute macros for devices and drivers
sysfs: create __ATTR_WO()
driver-core: platform: convert bus code to use dev_groups
workqueue: convert bus code to use dev_groups
MEI: convert bus code to use dev_groups
...
Merge lockref infrastructure code by me and Waiman Long.
I already merged some of the preparatory patches that didn't actually do
any semantic changes earlier, but this merges the actual _reason_ for
those preparatory patches.
The "lockref" structure is a combination "spinlock and reference count"
that allows optimized reference count accesses. In particular, it
guarantees that the reference count will be updated AS IF the spinlock
was held, but using atomic accesses that cover both the reference count
and the spinlock words, we can often do the update without actually
having to take the lock.
This allows us to avoid the nastiest cases of spinlock contention on
large machines under heavy pathname lookup loads. When updating the
dentry reference counts on a large system, we'll still end up with the
cache line bouncing around, but that's much less noticeable than
actually having to spin waiting for the lock.
* lockref:
lockref: implement lockless reference count updates using cmpxchg()
lockref: uninline lockref helper functions
vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()
vfs: use lockref_get_not_zero() for optimistic lockless dget_parent()
lockref: add 'lockref_get_or_lock() helper
Instead of taking the spinlock, the lockless versions atomically check
that the lock is not taken, and do the reference count update using a
cmpxchg() loop. This is semantically identical to doing the reference
count update protected by the lock, but avoids the "wait for lock"
contention that you get when accesses to the reference count are
contended.
Note that a "lockref" is absolutely _not_ equivalent to an atomic_t.
Even when the lockref reference counts are updated atomically with
cmpxchg, the fact that they also verify the state of the spinlock means
that the lockless updates can never happen while somebody else holds the
spinlock.
So while "lockref_put_or_lock()" looks a lot like just another name for
"atomic_dec_and_lock()", and both optimize to lockless updates, they are
fundamentally different: the decrement done by atomic_dec_and_lock() is
truly independent of any lock (as long as it doesn't decrement to zero),
so a locked region can still see the count change.
The lockref structure, in contrast, really is a *locked* reference
count. If you hold the spinlock, the reference count will be stable and
you can modify the reference count without using atomics, because even
the lockless updates will see and respect the state of the lock.
In order to enable the cmpxchg lockless code, the architecture needs to
do three things:
(1) Make sure that the "arch_spinlock_t" and an "unsigned int" can fit
in an aligned u64, and have a "cmpxchg()" implementation that works
on such a u64 data type.
(2) define a helper function to test for a spinlock being unlocked
("arch_spin_value_unlocked()")
(3) select the "ARCH_USE_CMPXCHG_LOCKREF" config variable in its
Kconfig file.
This enables it for x86-64 (but not 32-bit, we'd need to make sure
cmpxchg() turns into the proper cmpxchg8b in order to enable it for
32-bit mode).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull x86 boot fix from Peter Anvin:
"A single very small boot fix for very large memory systems (> 0.5T)"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Fix boot crash with DEBUG_PAGE_ALLOC=y and more than 512G RAM
Compared to old atom, Silvermont has offcore and has more events
that support PEBS.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Reviewed-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374138144-17278-2-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Silvermont (22nm Atom) has two offcore response configuration MSRs,
unlike other Intel CPU, its event code for MSR_OFFCORE_RSP_1 is 0x02b7.
To avoid complicating intel_fixup_er(), use INTEL_UEVENT_EXTRA_REG to
define MSR_OFFCORE_RSP_X. So intel_fixup_er() can find the event code
for OFFCORE_RSP_N by x86_pmu.extra_regs[N].event.
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1374138144-17278-1-git-send-email-zheng.z.yan@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
For performance reasons, when SMAP is in use, SMAP is left open for an
entire put_user_try { ... } put_user_catch(); block, however, calling
__put_user() in the middle of that block will close SMAP as the
STAC..CLAC constructs intentionally do not nest.
Furthermore, using __put_user() rather than put_user_ex() here is bad
for performance.
Thus, introduce new [compat_]save_altstack_ex() helpers that replace
__[compat_]save_altstack() for x86, being currently the only
architecture which supports put_user_try { ... } put_user_catch().
Reported-by: H. Peter Anvin <hpa@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: <stable@vger.kernel.org> # v3.8+
Link: http://lkml.kernel.org/n/tip-es5p6y64if71k8p5u08agv9n@git.kernel.org
Add SMAP annotations to csum_partial_copy_to/from_user(). These
functions legitimately access user space and thus need to set the AC
flag.
TODO: add explicit checks that the side with the kernel space pointer
really points into kernel space.
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Link: http://lkml.kernel.org/n/tip-2aps0u00eer658fd5xyanan7@git.kernel.org
Cc: <stable@vger.kernel.org> # v3.7+
Update comment in uaccess.h to reflect the changes for clang support:
gcc only cares about the base register (most architectures don't
encode the size of the operation in the operands like x86 does, and so
it is treated effectively like a register number), whereas clang tries
to enforce the size -- but not for register pairs.
Link: http://lkml.kernel.org/r/1377803585-5913-3-git-send-email-dl9pf@gmx.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Cc: Jan-Simon Möller <dl9pf@gmx.de>
Clang does not support the "shortcut" we're taking here for gcc (see below).
The patch uses the macro _ASM_DX to do the job.
From arch/x86/include/asm/uaccess.h:
/*
* Careful: we have to cast the result to the type of the pointer
* for sign reasons.
*
* The use of %edx as the register specifier is a bit of a
* simplification, as gcc only cares about it as the starting point
* and not size: for a 64-bit value it will use %ecx:%edx on 32 bits
* (%ecx being the next register in gcc's x86 register sequence), and
* %rdx on 64 bits.
*/
[ hpa: I consider this a compatibility bug in clang as this reflects a
bit of a misunderstanding about how register strings are used by
gcc, but the workaround is straightforward and there is no
particular reason to not do it. ]
Signed-off-by: Jan-Simon Möller <dl9pf@gmx.de>
Link: http://lkml.kernel.org/r/1377803585-5913-3-git-send-email-dl9pf@gmx.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
The __ASM_* macros (e.g. __ASM_DX) are used to return the proper
register name (e.g. edx for 32bit / rdx for 64bit). We want to use
this also in arch/x86/include/asm/uaccess.h / get_user() . For this
to work, we need a raw form as both gcc and clang choke on the
whitespace in a register asm() statement, and the __ASM_FORM macro
surrounds the argument with blanks. A new macro, __ASM_FORM_RAW was
added and we change __ASM_REG to use the new RAW form.
Signed-off-by: Jan-Simon Möller <dl9pf@gmx.de>
Link: http://lkml.kernel.org/r/1377803585-5913-2-git-send-email-dl9pf@gmx.de
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
* pm-cpufreq: (60 commits)
cpufreq: pmac32-cpufreq: remove device tree parsing for cpu nodes
cpufreq: pmac64-cpufreq: remove device tree parsing for cpu nodes
cpufreq: maple-cpufreq: remove device tree parsing for cpu nodes
cpufreq: arm_big_little: remove device tree parsing for cpu nodes
cpufreq: kirkwood-cpufreq: remove device tree parsing for cpu nodes
cpufreq: spear-cpufreq: remove device tree parsing for cpu nodes
cpufreq: highbank-cpufreq: remove device tree parsing for cpu nodes
cpufreq: cpufreq-cpu0: remove device tree parsing for cpu nodes
cpufreq: imx6q-cpufreq: remove device tree parsing for cpu nodes
drivers/bus: arm-cci: avoid parsing DT for cpu device nodes
ARM: mvebu: remove device tree parsing for cpu nodes
ARM: topology: remove hwid/MPIDR dependency from cpu_capacity
of/device: add helper to get cpu device node from logical cpu index
driver/core: cpu: initialize of_node in cpu's device struture
ARM: DT/kernel: define ARM specific arch_match_cpu_phys_id
of: move of_get_cpu_node implementation to DT core library
powerpc: refactor of_get_cpu_node to support other architectures
openrisc: remove undefined of_get_cpu_node declaration
microblaze: remove undefined of_get_cpu_node declaration
cpufreq: fix bad unlock balance on !CONFIG_SMP
...
* acpi-pci-hotplug: (34 commits)
ACPI / PM: Hold acpi_scan_lock over system PM transitions
ACPI / hotplug / PCI: Fix NULL pointer dereference in cleanup_bridge()
PCI / ACPI: Use dev_dbg() instead of dev_info() in acpi_pci_set_power_state()
ACPI / hotplug / PCI: Get rid of check_sub_bridges()
ACPI / hotplug / PCI: Clean up bridge_mutex usage
ACPI / hotplug / PCI: Redefine enable_device() and disable_device()
ACPI / hotplug / PCI: Sanitize acpiphp_get_(latch)|(adapter)_status()
ACPI / hotplug / PCI: Get rid of unused constants in acpiphp.h
ACPI / hotplug / PCI: Check for new devices on enabled slots
ACPI / hotplug / PCI: Allow slots without new devices to be rescanned
ACPI / hotplug / PCI: Do not check SLOT_ENABLED in enable_device()
ACPI / hotplug / PCI: Do not exectute _PS0 and _PS3 directly
ACPI / hotplug / PCI: Do not queue up event handling work items in vain
ACPI / hotplug / PCI: Consolidate slot disabling and ejecting
ACPI / hotplug / PCI: Drop redundant checks from check_hotplug_bridge()
ACPI / hotplug / PCI: Rework namespace scanning and trimming routines
ACPI / hotplug / PCI: Store parent in functions and bus in slots
ACPI / hotplug / PCI: Drop handle field from struct acpiphp_bridge
ACPI / hotplug / PCI: Drop handle field from struct acpiphp_func
ACPI / hotplug / PCI: Embed function struct into struct acpiphp_context
...
When programming ioapic pinX more than once, current code
does not check whether the later attr (trigger & polarity) is the
same as the former or not.
This causes broken semantics which can be observed in a qemu q35
machine, where ioapic's ioredtbl[x] can never be set as low-active,
even if the hpet driver registered it.
And hpet driver may share a high-level active IRQ line with other
devices. So in qemu, when hpet-dev asserts low-level as kernel
expects, the kernel has no response.
With this patch, we can observe an ioredtbl[x] set as low-active
for hpet.
Fix it by reporting -EBUSY to the caller, when attr is different.
Signed-off-by: Liu Ping Fan <pingfank@linux.vnet.ibm.com>
Cc: Kevin Hao <haokexin@gmail.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Yinghai Lu <yinghai@kernel.org>
Link: http://lkml.kernel.org/r/1377248327-19633-1-git-send-email-pingfank@linux.vnet.ibm.com
[ Made small readability edits to both the changelog and the code. ]
Signed-off-by: Ingo Molnar <mingo@kernel.org>
commit 40b313608a ("Finally eradicate
CONFIG_HOTPLUG") removed remaining references to CONFIG_HOTPLUG, but missed
a few plain English references in the CONFIG_KEXEC help texts.
Remove them, too.
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
This is the updated version of df54d6fa54 ("x86 get_unmapped_area():
use proper mmap base for bottom-up direction") that only randomizes the
mmap base address once.
Signed-off-by: Radu Caragea <sinaelgl@gmail.com>
Reported-and-tested-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Michel Lespinasse <walken@google.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Adrian Sendroiu <molecula2788@gmail.com>
Cc: Greg KH <greg@kroah.com>
Cc: Kamal Mostafa <kamal@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit df54d6fa54.
The commit isn't necessarily wrong, but because it recalculates the
random mmap_base every time, it seems to confuse user memory allocators
that expect contiguous mmap allocations even when the mmap address isn't
specified.
In particular, the MATLAB Java runtime seems to be unhappy. See
https://bugzilla.kernel.org/show_bug.cgi?id=60774
So we'll want to apply the random offset only once, and Radu has a patch
for that. Revert this older commit in order to apply the other one.
Reported-by: Jeff Shorey <shoreyjeff@gmail.com>
Cc: Radu Caragea <sinaelgl@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Based on a patch by Jon Mason (see URL below).
All users of pcie_bus_configure_settings() pass arguments of the form
"bus, bus->self->pcie_mpss". The "mpss" argument is redundant since we
can easily look it up internally. In addition, all callers check
"bus->self" for NULL, which we can also do internally.
This patch simplifies the interface and the callers. No functional change.
Reference: http://lkml.kernel.org/r/1317048850-30728-2-git-send-email-mason@myri.com
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Current code uses asmlinkage for functions without arguments.
This adds an implicit regparm(0) which creates a warning
when assigning the function to pointers.
Use __visible for the functions without arguments.
This avoids having to add regparm(0) to function pointers.
Since they have no arguments it does not make any difference.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
Link: http://lkml.kernel.org/r/1377115662-4865-1-git-send-email-andi@firstfloor.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
- On ARM did not have balanced calls to get/put_cpu.
- Fix to make tboot + Xen + Linux correctly.
- Fix events VCPU binding issues.
- Fix a vCPU online race where IPIs are sent to not-yet-online vCPU.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.14 (GNU/Linux)
iQEcBAABAgAGBQJSFMaJAAoJEFjIrFwIi8fJ+/0H/32rLj60FpKXcPDCvID+9p8T
XDGnFNttsxyhuzEzetOAd0aLKYKGnUaTDZBHfgSNipGCxjMLYgz84phRmHAYEj8u
kai1Ag1WjhZilCmImzFvdHFiUwtvKwkeBIL/cZtKr1BetpnuuFsoVnwbH9FVjMpr
TCg6sUwFq7xRyD1azo/cTLZFeiUqq0aQLw8J72YaapdS3SztHPeDHXlPpmLUdb6+
hiSYveJMYp2V0SW8g8eLKDJxVr2QdPEfl9WpBzpLlLK8GrNw8BEU6hSOSLzxB7z/
hDATAuZ5iHiIEi1uGfVjOyDws2ngUhmBKUH5x5iVIZd2P5c/ffLh2ePDVWGO5RI=
=yMuS
-----END PGP SIGNATURE-----
Merge tag 'stable/for-linus-3.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull Xen bug-fixes from Konrad Rzeszutek Wilk:
- On ARM did not have balanced calls to get/put_cpu.
- Fix to make tboot + Xen + Linux correctly.
- Fix events VCPU binding issues.
- Fix a vCPU online race where IPIs are sent to not-yet-online vCPU.
* tag 'stable/for-linus-3.11-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/smp: initialize IPI vectors before marking CPU online
xen/events: mask events when changing their VCPU binding
xen/events: initialize local per-cpu mask for all possible events
x86/xen: do not identity map UNUSABLE regions in the machine E820
xen/arm: missing put_cpu in xen_percpu_init
An older PVHVM guest (v3.0 based) crashed during vCPU hot-plug with:
kernel BUG at drivers/xen/events.c:1328!
RCU has detected that a CPU has not entered a quiescent state within the
grace period. It needs to send the CPU a reschedule IPI if it is not
offline. rcu_implicit_offline_qs() does this check:
/*
* If the CPU is offline, it is in a quiescent state. We can
* trust its state not to change because interrupts are disabled.
*/
if (cpu_is_offline(rdp->cpu)) {
rdp->offline_fqs++;
return 1;
}
Else the CPU is online. Send it a reschedule IPI.
The CPU is in the middle of being hot-plugged and has been marked online
(!cpu_is_offline()). See start_secondary():
set_cpu_online(smp_processor_id(), true);
...
per_cpu(cpu_state, smp_processor_id()) = CPU_ONLINE;
start_secondary() then waits for the CPU bringing up the hot-plugged CPU to
mark it as active:
/*
* Wait until the cpu which brought this one up marked it
* online before enabling interrupts. If we don't do that then
* we can end up waking up the softirq thread before this cpu
* reached the active state, which makes the scheduler unhappy
* and schedule the softirq thread on the wrong cpu. This is
* only observable with forced threaded interrupts, but in
* theory it could also happen w/o them. It's just way harder
* to achieve.
*/
while (!cpumask_test_cpu(smp_processor_id(), cpu_active_mask))
cpu_relax();
/* enable local interrupts */
local_irq_enable();
The CPU being hot-plugged will be marked active after it has been fully
initialized by the CPU managing the hot-plug. In the Xen PVHVM case
xen_smp_intr_init() is called to set up the hot-plugged vCPU's
XEN_RESCHEDULE_VECTOR.
The hot-plugging CPU is marked online, not marked active and does not have
its IPI vectors set up. rcu_implicit_offline_qs() sees the hot-plugging
cpu is !cpu_is_offline() and tries to send it a reschedule IPI:
This will lead to:
kernel BUG at drivers/xen/events.c:1328!
xen_send_IPI_one()
xen_smp_send_reschedule()
rcu_implicit_offline_qs()
rcu_implicit_dynticks_qs()
force_qs_rnp()
force_quiescent_state()
__rcu_process_callbacks()
rcu_process_callbacks()
__do_softirq()
call_softirq()
do_softirq()
irq_exit()
xen_evtchn_do_upcall()
because xen_send_IPI_one() will attempt to use an uninitialized IRQ for
the XEN_RESCHEDULE_VECTOR.
There is at least one other place that has caused the same crash:
xen_smp_send_reschedule()
wake_up_idle_cpu()
add_timer_on()
clocksource_watchdog()
call_timer_fn()
run_timer_softirq()
__do_softirq()
call_softirq()
do_softirq()
irq_exit()
xen_evtchn_do_upcall()
xen_hvm_callback_vector()
clocksource_watchdog() uses cpu_online_mask to pick the next CPU to handle
a watchdog timer:
/*
* Cycle through CPUs to check if the CPUs stay synchronized
* to each other.
*/
next_cpu = cpumask_next(raw_smp_processor_id(), cpu_online_mask);
if (next_cpu >= nr_cpu_ids)
next_cpu = cpumask_first(cpu_online_mask);
watchdog_timer.expires += WATCHDOG_INTERVAL;
add_timer_on(&watchdog_timer, next_cpu);
This resulted in an attempt to send an IPI to a hot-plugging CPU that
had not initialized its reschedule vector. One option would be to make
the RCU code check to not check for CPU offline but for CPU active.
As becoming active is done after a CPU is online (in older kernels).
But Srivatsa pointed out that "the cpu_active vs cpu_online ordering has been
completely reworked - in the online path, cpu_active is set *before* cpu_online,
and also, in the cpu offline path, the cpu_active bit is reset in the CPU_DYING
notification instead of CPU_DOWN_PREPARE." Drilling in this the bring-up
path: "[brought up CPU].. send out a CPU_STARTING notification, and in response
to that, the scheduler sets the CPU in the cpu_active_mask. Again, this mask
is better left to the scheduler alone, since it has the intelligence to use it
judiciously."
The conclusion was that:
"
1. At the IPI sender side:
It is incorrect to send an IPI to an offline CPU (cpu not present in
the cpu_online_mask). There are numerous places where we check this
and warn/complain.
2. At the IPI receiver side:
It is incorrect to let the world know of our presence (by setting
ourselves in global bitmasks) until our initialization steps are complete
to such an extent that we can handle the consequences (such as
receiving interrupts without crashing the sender etc.)
" (from Srivatsa)
As the native code enables the interrupts at some point we need to be
able to service them. In other words a CPU must have valid IPI vectors
if it has been marked online.
It doesn't need to handle the IPI (interrupts may be disabled) but needs
to have valid IPI vectors because another CPU may find it in cpu_online_mask
and attempt to send it an IPI.
This patch will change the order of the Xen vCPU bring-up functions so that
Xen vectors have been set up before start_secondary() is called.
It also will not continue to bring up a Xen vCPU if xen_smp_intr_init() fails
to initialize it.
Orabug 13823853
Signed-off-by Chuck Anderson <chuck.anderson@oracle.com>
Acked-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
If there are UNUSABLE regions in the machine memory map, dom0 will
attempt to map them 1:1 which is not permitted by Xen and the kernel
will crash.
There isn't anything interesting in the UNUSABLE region that the dom0
kernel needs access to so we can avoid making the 1:1 mapping and
treat it as RAM.
We only do this for dom0, as that is where tboot case shows up.
A PV domU could have an UNUSABLE region in its pseudo-physical map
and would need to be handled in another patch.
This fixes a boot failure on hosts with tboot.
tboot marks a region in the e820 map as unusable and the dom0 kernel
would attempt to map this region and Xen does not permit unusable
regions to be mapped by guests.
(XEN) 0000000000000000 - 0000000000060000 (usable)
(XEN) 0000000000060000 - 0000000000068000 (reserved)
(XEN) 0000000000068000 - 000000000009e000 (usable)
(XEN) 0000000000100000 - 0000000000800000 (usable)
(XEN) 0000000000800000 - 0000000000972000 (unusable)
tboot marked this region as unusable.
(XEN) 0000000000972000 - 00000000cf200000 (usable)
(XEN) 00000000cf200000 - 00000000cf38f000 (reserved)
(XEN) 00000000cf38f000 - 00000000cf3ce000 (ACPI data)
(XEN) 00000000cf3ce000 - 00000000d0000000 (reserved)
(XEN) 00000000e0000000 - 00000000f0000000 (reserved)
(XEN) 00000000fe000000 - 0000000100000000 (reserved)
(XEN) 0000000100000000 - 0000000630000000 (usable)
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
[v1: Altered the patch and description with domU's with UNUSABLE regions]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Dave Hansen reported that systems between 500G and 600G RAM
crash early if DEBUG_PAGEALLOC is selected.
> [ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
> [ 0.000000] [mem 0x00000000-0x000fffff] page 4k
> [ 0.000000] BRK [0x02086000, 0x02086fff] PGTABLE
> [ 0.000000] BRK [0x02087000, 0x02087fff] PGTABLE
> [ 0.000000] BRK [0x02088000, 0x02088fff] PGTABLE
> [ 0.000000] init_memory_mapping: [mem 0xe80ee00000-0xe80effffff]
> [ 0.000000] [mem 0xe80ee00000-0xe80effffff] page 4k
> [ 0.000000] BRK [0x02089000, 0x02089fff] PGTABLE
> [ 0.000000] BRK [0x0208a000, 0x0208afff] PGTABLE
> [ 0.000000] Kernel panic - not syncing: alloc_low_page: ran out of memory
It turns out that we missed increasing needed pages in BRK to
mapping initial 2M and [0,1M) when we switched to use the #PF
handler to set memory mappings:
> commit 8170e6bed4
> Author: H. Peter Anvin <hpa@zytor.com>
> Date: Thu Jan 24 12:19:52 2013 -0800
>
> x86, 64bit: Use a #PF handler to materialize early mappings on demand
Before that, we had the maping from [0,512M) in head_64.S, and we
can spare two pages [0-1M). After that change, we can not reuse
pages anymore.
When we have more than 512M ram, we need an extra page for pgd page
with [512G, 1024g).
Increase pages in BRK for page table to solve the boot crash.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Bisected-by: Dave Hansen <dave.hansen@intel.com>
Tested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@vger.kernel.org> # v3.9 and later
Link: http://lkml.kernel.org/r/1376351004-4015-1-git-send-email-yinghai@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Prevent crash_kexec() from deadlocking on ioapic_lock. When
crash_kexec() is executed on a CPU, the CPU will take ioapic_lock
in disable_IO_APIC(). So if the cpu gets an NMI while locking
ioapic_lock, a deadlock will happen.
In this patch, ioapic_lock is zapped/initialized before disable_IO_APIC().
You can reproduce this deadlock the following way:
1. Add mdelay(1000) after raw_spin_lock_irqsave() in
native_ioapic_set_affinity()@arch/x86/kernel/apic/io_apic.c
Although the deadlock can occur without this modification, it will increase
the potential of the deadlock problem.
2. Build and install the kernel
3. Set up the OS which will run panic() and kexec when NMI is injected
# echo "kernel.unknown_nmi_panic=1" >> /etc/sysctl.conf
# vim /etc/default/grub
add "nmi_watchdog=0 crashkernel=256M" in GRUB_CMDLINE_LINUX line
# grub2-mkconfig
4. Reboot the OS
5. Run following command for each vcpu on the guest
# while true; do echo <CPU num> > /proc/irq/<IO-APIC-edge or IO-APIC-fasteoi>/smp_affinitity; done;
By running this command, cpus will get ioapic_lock for setting affinity.
6. Inject NMI (push a dump button or execute 'virsh inject-nmi <domain>' if you
use VM). After injecting NMI, panic() is called in an nmi-handler context.
Then, kexec will normally run in panic(), but the operation will be stopped
by deadlock on ioapic_lock in crash_kexec()->machine_crash_shutdown()->
native_machine_crash_shutdown()->disable_IO_APIC()->clear_IO_APIC()->
clear_IO_APIC_pin()->ioapic_read_entry().
Signed-off-by: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Joerg Roedel <joro@8bytes.org>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: yrl.pp-manager.tt@hitachi.com
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Cc: Seiji Aguchi <seiji.aguchi@hds.com>
Link: http://lkml.kernel.org/r/20130820070107.28245.83806.stgit@yunodevel
Signed-off-by: Ingo Molnar <mingo@kernel.org>